summary.money_ready: []). Tournament and lab numbers are research surfaces: do not size live capital from them until they clear forward n≥100 and policy-clean gates.
| Surface | What it measures | Today's verdict |
|---|---|---|
| /audit production | Closed picks after resolver + policy-clean filters | 0/9 money-ready; CRYPTO PF 0.92 n=368; EQUITY PF 0.33 n=52 |
| AI tournament | Paper portfolios per LLM model (Wilson WR × bootstrap PF) | Best paper: deepseek_v4 WR 57.7% PF 3.46 n=208 |
| Verified lab + paper pilot | OHLCV backtest + walk-forward OOS | etf_dual_momentum lab PF 1.60 n=104; live ETF n=3 |

| Pick / sleeve | Tier | Why we cite it | Key stats |
|---|---|---|---|
| deepseek_v4 tournament book | PAPER WATCH | Top rank on leaderboard with n≥30 gate; not wired to production sizing | WR 57.7%, PF 3.46, n=208 resolved |
| gpt4o tournament book | PAPER WATCH | #2 rank-eligible model; same caveats as above | WR 59.7%, PF 3.14, n=134 |
| CRYPTO SHORT (BTC/ETH) | PAPER WATCH | EAGLE3 (n=216 resolved CRYPTO tournament picks): SHORT 67% WR vs LONG 33% WR, while production still emits LONG; the flip in production_scanner.py is shadow, not promoted | SHORT 67% vs LONG 33% |
| ETF: EEM, IWM, GLD | PAPER WATCH | EAGLE3 symbol whitelist on tournament resolved picks; production ETF only n=3 | EEM 93%, IWM 75%, GLD 68% (EAGLE3 table) |
| EQUITY: BAC, JPM, MSFT, NVDA | PAPER WATCH | EAGLE3 LONG-only bias in tournament; production EQUITY book fails gates | Prod EQUITY WR 26.9% PF 0.33 |
| etf_verified_dual_momentum | SHADOW PILOT | Only lab Tier-2 pass in multi-class lab; WF OOS PASS; forward log until n≥30 | Lab PF 1.60 WR 53.8% n=104; OOS PF 1.21 n=32 |
| /audit Smart Picks (all classes) | DO NOT SIZE | Policy-clean money_ready empty; recency panels can diverge from headline WR | See production table below |

| Class | n | WR | PF | Verdict | Action |
|---|---|---|---|---|---|
| COMMODITY | 4 | 50.0% | 1.6758 | INSUFFICIENT_DATA | NO LIVE SIZE |
| CRYPTO | 368 | 36.1% | 0.919 | NOT_READY | NO LIVE SIZE |
| EQUITY | 52 | 26.9% | 0.3269 | NOT_READY | NO LIVE SIZE |
| ETF | 3 | 66.7% | 1.4581 | INSUFFICIENT_DATA | NO LIVE SIZE |
| FOREX | 32 | 28.1% | 0.4812 | INSUFFICIENT_DATA | NO LIVE SIZE |
| FUTURES | 13 | 15.4% | 0.5193 | INSUFFICIENT_DATA | NO LIVE SIZE |
| PENNY_STOCK | 1 | 0% | 0.0 | INSUFFICIENT_DATA | NO LIVE SIZE |
| UNKNOWN | 9 | 66.7% | 0.724 | INSUFFICIENT_DATA | NO LIVE SIZE |

Source: audit_dashboard/data/money_ready_verdict.json generated 2026-06-02T10:19:20.625546+00:00. We show n_resolved because Tier-2 requires enough closed trades; PF (profit factor) because it reflects the payoff asymmetry that WR alone hides; and verdict because it bundles the DSR/SPA/MDD gates.

| Model | n resolved | WR | PF | Tier | Action |
|---|---|---|---|---|---|
| deepseek_v4 | 208 | 57.7% | 3.457 | T1 | PAPER WATCH |
| gpt4o | 134 | 59.7% | 3.139 | T1 | PAPER WATCH |
| deepseek_r1 | 132 | 62.9% | 2.933 | T1 | PAPER WATCH |
| grok3 | 303 | 55.8% | 2.286 | T1 | PAPER WATCH |
| claude_haiku_4_5 | 74 | 66.2% | 2.709 | T1 | PAPER WATCH |

Source: audit_dashboard/data/ai_tournament_leaderboard.json. Rank score = Wilson lower-bound WR × bootstrap lower-bound PF. High PF here does not override production NOT_READY.
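As a rough illustration of the rank score described above, the following sketch multiplies a Wilson lower-bound win rate by a percentile-bootstrap lower-bound profit factor. The function names, the 95% Wilson z-value, the 5th-percentile bootstrap cutoff, and the synthetic P&L list are all assumptions made for illustration; the leaderboard's actual confidence levels and resampling scheme are not stated in this report.

```python
import math
import random


def wilson_lower(wins, n, z=1.96):
    """Lower bound of the Wilson score interval for a win rate (z=1.96 is an assumed level)."""
    if n == 0:
        return 0.0
    p = wins / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom


def profit_factor(pnls):
    """Gross profit divided by gross loss; infinite when there are no losing trades."""
    gains = sum(x for x in pnls if x > 0)
    losses = -sum(x for x in pnls if x < 0)
    return math.inf if losses == 0 else gains / losses


def bootstrap_pf_lower(pnls, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap lower bound on PF (alpha and n_boot are assumed values)."""
    rng = random.Random(seed)
    stats = sorted(
        profit_factor(rng.choices(pnls, k=len(pnls))) for _ in range(n_boot)
    )
    return stats[int(alpha * n_boot)]


# Synthetic per-pick P&L, NOT leaderboard data: 60 wins of +2.0, 40 losses of -1.0.
synthetic_pnls = [2.0] * 60 + [-1.0] * 40
wins = sum(1 for x in synthetic_pnls if x > 0)

wr_lb = wilson_lower(wins, len(synthetic_pnls))
pf_lb = bootstrap_pf_lower(synthetic_pnls)
rank_score = wr_lb * pf_lb
print(f"WR lower bound {wr_lb:.3f}, PF lower bound {pf_lb:.2f}, rank score {rank_score:.2f}")
```

Both lower bounds shrink toward pessimistic values when n is small, which is why a model with few resolved picks can post a high headline PF and still rank low.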
Full: money_ready_verdict.json lists summary.money_ready: []. Every class must pass WR, PF, n, DSR, SPA, and drawdown gates simultaneously. CRYPTO and EQUITY have enough n to fail honestly (NOT_READY); others lack n (INSUFFICIENT_DATA).
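The difference between the two failing verdicts can be sketched as a sample-size check followed by an all-gates-must-pass check. Everything numeric here is hypothetical: `MIN_N`, `MIN_WR`, and `MIN_PF` are placeholder thresholds (the real ones are not given in this report), `MIN_N = 50` was picked only because it reproduces the verdict column of the production table, and the DSR/SPA/drawdown gates are collapsed into a single assumed boolean.

```python
from dataclasses import dataclass


@dataclass
class ClassStats:
    name: str
    n: int       # resolved picks
    wr: float    # win rate as a fraction
    pf: float    # profit factor


# HYPOTHETICAL thresholds for illustration only; not taken from the real gate config.
MIN_N = 50
MIN_WR = 0.50
MIN_PF = 1.20


def verdict(stats, extra_gates_pass=False):
    """Return a verdict label; extra_gates_pass stands in for DSR, SPA and drawdown gates."""
    if stats.n < MIN_N:
        return "INSUFFICIENT_DATA"      # too few closed picks to judge either way
    if stats.wr >= MIN_WR and stats.pf >= MIN_PF and extra_gates_pass:
        return "MONEY_READY"            # hypothetical label for a class passing every gate
    return "NOT_READY"                  # enough data, but at least one gate fails


# Rows copied from the production table above.
rows = [
    ClassStats("CRYPTO", 368, 0.361, 0.919),
    ClassStats("EQUITY", 52, 0.269, 0.3269),
    ClassStats("ETF", 3, 0.667, 1.4581),
    ClassStats("FOREX", 32, 0.281, 0.4812),
]
verdicts = {r.name: verdict(r) for r in rows}
money_ready = [name for name, v in verdicts.items() if v == "MONEY_READY"]
print(verdicts, money_ready)
```

Under these assumed thresholds the `money_ready` list comes out empty, matching `summary.money_ready: []`.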
Full: Tournament picks live in a parallel DB/book with model personas and faster resolution. Production picks flow through production_scanner, battleground, gates, and resolver fixes. A model can show PF 3.4 on paper while production CRYPTO PF is 0.92.
Full: WR alone hides payoff asymmetry (FOREX historically showed an attractive WR alongside a poor PF). PF < 1 means gross losses exceed gross wins in dollar terms. We require both for Tier-2, per the hedge-fund review tier table in CLAUDE.md.
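A small worked example makes the asymmetry concrete. The trade list below is synthetic and chosen only to show a high win rate coexisting with a profit factor under 1; the helper names are illustrative, not functions from the codebase.

```python
def win_rate(pnls):
    """Fraction of picks that closed with positive P&L."""
    return sum(1 for x in pnls if x > 0) / len(pnls)


def profit_factor(pnls):
    """Gross profit divided by gross loss (assumes at least one losing pick)."""
    gains = sum(x for x in pnls if x > 0)
    losses = -sum(x for x in pnls if x < 0)
    return gains / losses


# Synthetic book: seven small winners of +1 and three large losers of -3.
pnls = [1.0] * 7 + [-3.0] * 3

wr = win_rate(pnls)        # 7 of 10 picks win
pf = profit_factor(pnls)   # 7 units gained vs 9 units lost
print(f"WR {wr:.0%}, PF {pf:.2f}, net P&L {sum(pnls):+.1f}")
```

The book wins 70% of the time and still loses money, which is the case a WR-only gate would let through.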
Full: EAGLE3 analyzed 216 resolved CRYPTO tournament picks: SHORT 67% WR vs LONG 33%. Production still emits LONG-heavy CRYPTO. EAGLE-4 admissibility can flip direction in the scanner, but that is a mutation, not proof of live edge, until the 14d/48h panels confirm.
Full: Lab harness reports PF 1.60, n=104, walk-forward OOS PF 1.21 (PASS). Live production ETF class has n=3, far below the promotion threshold. The paper pilot logs to verified_strategies/paper_pilot/etf_dual_momentum_paper_log.jsonl; the flag stays OFF until the forward criteria in updates/2026-05-31-etf-promotion-path.md are met.
Full: Scanning hundreds of symbols × models × directions without multiplicity control inflates false discoveries. EAGLE swarm used DSR/SPA framing; SPA failed on CRYPTO (spa_p 0.59). Treat symbol whitelist rows as hypotheses until pre-registered (M-107).
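To see why uncorrected scanning inflates false discoveries, the sketch below computes the chance of at least one false positive across m independent tests, plus the Bonferroni per-test threshold. This is a deliberately simpler stand-in, not the DSR/SPA procedure the EAGLE swarm used; independence of the tests and the value m=100 are assumptions for illustration.

```python
def familywise_false_positive_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests with no true edge."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(m, alpha=0.05):
    """Per-test p-value cutoff that keeps the family-wise error rate at or below alpha."""
    return alpha / m


m = 100  # hypothetical number of symbol x model x direction combinations scanned
fwer = familywise_false_positive_rate(m)
cutoff = bonferroni_threshold(m)
print(f"m={m}: P(>=1 false discovery) = {fwer:.3f}, Bonferroni cutoff = {cutoff:.4f}")
```

With 100 uncorrected tests at alpha 0.05, a spurious "winner" is nearly certain, so a whitelist row that was found by scanning needs a pre-registered forward test before it counts as evidence.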
Full: Funnel headline can disagree with policy-clean DB (duplicate signal-ts, resolver labels, concentration in one source). Always read asset_class_health + 14d/48h recency panels before sizing.
# Verdict Table
| Claim | Verdict | Evidence Block |
| --- | --- | --- |
| deepseek_v4 / gpt4o tournament models | REJECT | tournament_leaderboard |
| CRYPTO SHORT direction | WEAK | eagle3_directional |
| ETF symbols EEM, IWM, GLD | SUPPORT | eagle3_directional |
| EQUITY symbols BAC, JPM, MSFT, NVDA | REJECT | eagle3_directional |
| ETF dual momentum lab | SUPPORT | lab_sleeve |
| Production /audit | SUPPORT | money_ready |
# Stats Explained
- **WR (Win Rate)**: The percentage of resolved picks that closed as wins.
- **PF (Profit Factor)**: Gross profits divided by gross losses. PF > 1 means winners outweigh losers in dollar terms; PF < 1 means the reverse.
- **n_resolved**: The number of resolved (closed) picks counted for a model, class, or strategy.
- **verdict**: The gate outcome for a class, such as "NOT_READY" (enough n, but at least one gate fails) or "INSUFFICIENT_DATA" (too few resolved picks). It bundles the DSR/SPA/MDD gates.
# ELI5 Feedback Points
### 1. deepseek_v4 / gpt4o tournament models
* REJECT because the leaderboard reports paper-only (watch-only) tournament performance, which is not evidence of production performance.
* Evidence: "paper watch only (leaderboard)" in the claim, and the leaderboard data in `tournament_leaderboard`.
### 2. CRYPTO SHORT direction
* WEAK because the win rate of 67% is high, but the note mentions a "tournament edge" and a "flip target", which suggests that the strategy may not be reliable in production.
* Evidence: "tournament edge — flip target" in the note of the CRYPTO SHORT row in `eagle3_directional`.
### 3. ETF symbols EEM, IWM, GLD
* SUPPORT because the note mentions that these symbols have high win rates in EAGLE3 (93%, 75%, and 68% respectively).
* Evidence: "EEM 93%, IWM 75%, GLD 68% WR in EAGLE3" in the note of the ETF symbol whitelist in `eagle3_directional`.
### 4. EQUITY symbols BAC, JPM, MSFT, NVDA
* REJECT because the production EQUITY PF is 0.33, which is low, and the note mentions a "production bug" that emits LONG.
* Evidence: "production EQUITY PF 0.33 FAIL" in the claim, and the EQUITY row in `eagle3_directional`.
### 5. ETF dual momentum lab
* SUPPORT because the lab PF is 1.60 and the walk-forward OOS PF is 1.21, both above 1.0, and the OOS check is a PASS.
* Evidence: "lab_pf: 1.6" and "wf_oos_pf: 1.21" in `lab_sleeve`.
### 6. Production /audit
* SUPPORT because the money_ready count is 0, meaning no class passes the gates (each is either NOT_READY or INSUFFICIENT_DATA), and the CRYPTO PF of 0.92 is below 1.0, consistent with its NOT_READY verdict.
* Evidence: "money_ready_count: 0" in `money_ready`, and the CRYPTO row in `money_ready`.
Artifact: reports/best_picks_swarm_review_2026-06-02.json · Re-run: python3 tools/verify_best_picks_swarm.py
python3 tools/verify_best_picks_swarm.py
python3 tools/eagle_swarm_synthesis.py
python3 tools/generate_eagle_best_picks_guide.py
VERIFY_SKIP_FRED=1 python3 tools/multi_class_strategy_lab.py
python3 verified_strategies/paper_pilot/etf_dual_momentum_pilot.py --one-shot
curl -s "https://findtorontoevents.ca/audit/data/money_ready_verdict.json" | python3 -m json.tool | head -40