EAGLE Best Picks & Research Index (2026-06-02)

Goal #1 · Swarm-reviewed via LiteLLM proxy · Sources: money_ready_verdict.json, ai_tournament_leaderboard.json, EAGLE3_2026-06-02_minimax-m3-free.MD, lab report

Capital rule: Zero asset classes pass production Money Ready (summary.money_ready: []). Tournament and lab numbers are research surfaces — do not size live capital from them without forward n≥100 + policy-clean gates.

1. Three surfaces (do not mix them)

| Surface | What it measures | Today's verdict |
| --- | --- | --- |
| /audit production | Closed picks after resolver + policy-clean filters | 0/9 money-ready; CRYPTO PF 0.92 n=368; EQUITY PF 0.33 n=52 |
| AI tournament | Paper portfolios per LLM model (Wilson WR × bootstrap PF) | Best paper: deepseek_v4 WR 57.7% PF 3.46 n=208 |
| Verified lab + paper pilot | OHLCV backtest + walk-forward OOS | etf_dual_momentum lab PF 1.60 n=104; live ETF n=3 |

2. Best possible picks (evidence-bound)

| Pick / sleeve | Tier | Why we cite it | Key stats |
| --- | --- | --- | --- |
| deepseek_v4 tournament book | PAPER WATCH | Top rank on leaderboard with n≥30 gate; not wired to production sizing | WR 57.7%, PF 3.46, n=208 resolved |
| gpt4o tournament book | PAPER WATCH | #2 rank-eligible model; same caveats as above | WR 59.7%, PF 3.14, n=134 |
| CRYPTO SHORT (BTC/ETH) | PAPER WATCH | EAGLE3: production emits LONG (33% WR) while tournament SHORT is 67% WR n=216; the flip in production_scanner.py is shadow, not promoted | SHORT 67% vs LONG 33% |
| ETF: EEM, IWM, GLD | PAPER WATCH | EAGLE3 symbol whitelist on tournament resolved picks; production ETF only n=3 | EEM 93%, IWM 75%, GLD 68% (EAGLE3 table) |
| EQUITY: BAC, JPM, MSFT, NVDA | PAPER WATCH | EAGLE3 LONG-only bias in tournament; production EQUITY book fails gates | Prod EQUITY WR 26.9% PF 0.33 |
| etf_verified_dual_momentum | SHADOW PILOT | Only lab Tier-2 pass in multi-class lab; WF OOS PASS; forward log until n≥30 | Lab PF 1.60 WR 53.8% n=104; OOS PF 1.21 n=32 |
| /audit Smart Picks (all classes) | DO NOT SIZE | Policy-clean money_ready empty; recency panels can diverge from headline WR | See production table below |

3. Production /audit stats (why each number is shown)

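| Class | n | WR | PF | Verdict | Action |
| --- | --- | --- | --- | --- | --- |
| COMMODITY | 4 | 50.0% | 1.6758 | INSUFFICIENT_DATA | NO LIVE SIZE |
| CRYPTO | 368 | 36.1% | 0.919 | NOT_READY | NO LIVE SIZE |
| EQUITY | 52 | 26.9% | 0.3269 | NOT_READY | NO LIVE SIZE |
| ETF | 3 | 66.7% | 1.4581 | INSUFFICIENT_DATA | NO LIVE SIZE |
| FOREX | 32 | 28.1% | 0.4812 | INSUFFICIENT_DATA | NO LIVE SIZE |
| FUTURES | 13 | 15.4% | 0.5193 | INSUFFICIENT_DATA | NO LIVE SIZE |
| PENNY_STOCK | 1 | 0% | 0.0 | INSUFFICIENT_DATA | NO LIVE SIZE |
| UNKNOWN | 9 | 66.7% | 0.724 | INSUFFICIENT_DATA | NO LIVE SIZE |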

Source: audit_dashboard/data/money_ready_verdict.json generated 2026-06-02T10:19:20.625546+00:00. We show n_resolved because Tier-2 requires enough closed trades; PF (profit factor) because it captures tail risk better than WR alone; verdict because it bundles DSR/SPA/MDD gates.

4. Tournament leaderboard (paper only)

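| Model | n resolved | WR | PF | Tier | Action |
| --- | --- | --- | --- | --- | --- |
| deepseek_v4 | 208 | 57.7% | 3.457 | T1 | PAPER WATCH |
| gpt4o | 134 | 59.7% | 3.139 | T1 | PAPER WATCH |
| deepseek_r1 | 132 | 62.9% | 2.933 | T1 | PAPER WATCH |
| grok3 | 303 | 55.8% | 2.286 | T1 | PAPER WATCH |
| claude_haiku_4_5 | 74 | 66.2% | 2.709 | T1 | PAPER WATCH |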

Source: audit_dashboard/data/ai_tournament_leaderboard.json. Rank score = Wilson lower-bound WR × bootstrap lower-bound PF. High PF here does not override production NOT_READY.
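
The rank score described above can be sketched as follows. This is a minimal illustration, not the leaderboard's actual implementation: the z-value (1.96), the bootstrap percentile (5th), the resample count, and the per-pick P&L list are all assumptions, since the source does not state them.

```python
import math
import random


def wilson_lower_bound(wins: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a win rate (z is an assumption)."""
    if n == 0:
        return 0.0
    p = wins / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def profit_factor(pnls) -> float:
    """Gross profits divided by gross losses."""
    gross_profit = sum(x for x in pnls if x > 0)
    gross_loss = -sum(x for x in pnls if x < 0)
    if gross_loss == 0:
        return math.inf if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


def bootstrap_pf_lower_bound(pnls, n_boot: int = 2000, pct: float = 5.0, seed: int = 0) -> float:
    """Percentile-bootstrap lower bound on PF (percentile and n_boot are assumptions)."""
    rng = random.Random(seed)
    n = len(pnls)
    samples = sorted(
        profit_factor([rng.choice(pnls) for _ in range(n)]) for _ in range(n_boot)
    )
    return samples[int(n_boot * pct / 100)]


def rank_score(pnls) -> float:
    """Wilson lower-bound WR multiplied by bootstrap lower-bound PF."""
    wins = sum(1 for x in pnls if x > 0)
    return wilson_lower_bound(wins, len(pnls)) * bootstrap_pf_lower_bound(pnls)


# 120 wins out of 208 would match the 57.7% WR shown for deepseek_v4;
# the win count itself is inferred, not taken from the JSON.
print(round(wilson_lower_bound(120, 208), 3))
```

Both factors are lower bounds, so a model with a small n is penalised twice: a wider Wilson interval and a wider bootstrap spread.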

5. Feedback points — full explanation + ELI5

5.1 Zero money-ready classes

Full: money_ready_verdict.json lists summary.money_ready: []. Every class must pass WR, PF, n, DSR, SPA, and drawdown gates simultaneously. CRYPTO and EQUITY have enough n to fail honestly (NOT_READY); others lack n (INSUFFICIENT_DATA).

ELI5: The report card has no “A” grades yet — so we don’t bet the house on any single asset class from the main dashboard.
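
A minimal sketch of the capital rule as a guard, assuming only what is quoted above: the verdict JSON has a top-level `summary` object with a `money_ready` list. The function names are hypothetical and are not part of the repository.

```python
import json


def money_ready_classes(verdict: dict) -> list:
    """Return the asset classes listed under summary.money_ready (empty if absent)."""
    return list(verdict.get("summary", {}).get("money_ready", []))


def may_size_live(verdict: dict, asset_class: str) -> bool:
    """Allow live sizing only for classes that appear in summary.money_ready."""
    return asset_class in money_ready_classes(verdict)


# Inline stand-in for money_ready_verdict.json with the value quoted in this section.
verdict = json.loads('{"summary": {"money_ready": []}}')

for cls in ("CRYPTO", "EQUITY", "ETF"):
    print(cls, "live sizing allowed:", may_size_live(verdict, cls))
```

With an empty list the guard refuses every class, which is the behaviour the capital rule asks for today.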

5.2 Tournament ≠ production

Full: Tournament picks live in a parallel DB/book with model personas and faster resolution. Production picks flow through production_scanner, battleground, gates, and resolver fixes. A model can show PF 3.46 on paper (deepseek_v4) while production CRYPTO PF is 0.92.

ELI5: Practice game stats don’t automatically count in the real league — different players, different rules.

5.3 Why we cite PF and WR together

Full: WR alone hides payoff asymmetry (FOREX has historically shown an attractive WR alongside a poor PF). PF < 1 means gross losses exceed gross profits in dollar terms. We require both for Tier-2, per the hedge-fund review tier table in CLAUDE.md.

ELI5: Winning often but with tiny wins and huge losses is still losing money — PF catches that; win rate alone does not.
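
To make the asymmetry concrete, here is a small worked example with made-up P&L values (not data from any EAGLE book): eight small wins and two large losses.

```python
def win_rate(pnls) -> float:
    """Fraction of picks that closed with a positive P&L."""
    return sum(1 for x in pnls if x > 0) / len(pnls)


def profit_factor(pnls) -> float:
    """Gross profits divided by gross losses."""
    gross_profit = sum(x for x in pnls if x > 0)
    gross_loss = -sum(x for x in pnls if x < 0)
    return gross_profit / gross_loss if gross_loss else float("inf")


# Illustrative numbers only: 8 wins of +1 and 2 losses of -5.
pnls = [1.0] * 8 + [-5.0] * 2

print(f"WR = {win_rate(pnls):.0%}")       # 80% of picks win
print(f"PF = {profit_factor(pnls):.2f}")  # gross profit 8 / gross loss 10
print(f"Net P&L = {sum(pnls):+.1f}")      # the book still loses money
```

An 80% win rate with PF 0.80 is a losing book, which is why neither number is cited alone.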

5.4 CRYPTO SHORT flip (EAGLE3)

Full: EAGLE3 analyzed 216 resolved CRYPTO tournament picks: SHORT 67% WR vs LONG 33%. Production still emits LONG-heavy CRYPTO. EAGLE-4 admissibility can flip direction in scanner — that is a mutation, not proof of live edge until 14d/48h panels confirm.

ELI5: Our crypto picks were often betting “up” when the scoreboard says “down” worked better — fixing direction is step one, not permission to go all-in.

5.5 ETF dual momentum shadow sleeve

Full: The lab harness reports PF 1.60, n=104, and walk-forward OOS PF 1.21 (PASS). The live production ETF class has n=3, far below the promotion threshold. The paper pilot logs to verified_strategies/paper_pilot/etf_dual_momentum_paper_log.jsonl; the flag stays OFF until the forward criteria in updates/2026-05-31-etf-promotion-path.md are met.

ELI5: This ETF strategy passed homework (backtest) but hasn’t enough real homework days (forward trades) to join the live team.
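
A minimal sketch of a forward-n check against the paper pilot log. It assumes each non-empty line of the JSONL file is one forward trade record, which the source does not confirm, and it uses the n≥30 figure from the picks table as the default threshold. The function names are hypothetical.

```python
import json
from pathlib import Path


def count_forward_records(log_path) -> int:
    """Count non-empty JSONL lines; raises if a line is not valid JSON."""
    n = 0
    with Path(log_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # validate the record; schema is not inspected
            n += 1
    return n


def meets_forward_n(log_path, min_n: int = 30) -> bool:
    """True once the log holds at least min_n records (threshold is an assumption)."""
    return count_forward_records(log_path) >= min_n
```

Usage would look like `meets_forward_n("verified_strategies/paper_pilot/etf_dual_momentum_paper_log.jsonl")`. The count alone is only one of the forward criteria, so a True result should not be read as permission to turn the flag ON.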

5.6 Bonferroni / multiple testing

Full: Scanning hundreds of symbols × models × directions without multiplicity control inflates false discoveries. EAGLE swarm used DSR/SPA framing; SPA failed on CRYPTO (spa_p 0.59). Treat symbol whitelist rows as hypotheses until pre-registered (M-107).

ELI5: If you try enough guesses, one looks brilliant by luck — we discount “winners” found after trying many combos.
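
A minimal sketch of the Bonferroni adjustment named in this section's title. The test count (4 symbols × 5 models × 2 directions = 40) and the 0.05 significance level are hypothetical choices for illustration, not the swarm's actual configuration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def survives(p_value: float, alpha: float, n_tests: int) -> bool:
    """True if a p-value remains significant after the correction."""
    return p_value <= bonferroni_threshold(alpha, n_tests)


# Hypothetical scan: 4 symbols x 5 models x 2 directions.
n_tests = 4 * 5 * 2
print(bonferroni_threshold(0.05, n_tests))

# The CRYPTO spa_p of 0.59 quoted above fails even without any correction.
print(survives(0.59, 0.05, 1))
```

The more combinations are scanned, the smaller the per-test threshold becomes, so a whitelist row found after a wide search needs much stronger evidence than a single pre-registered hypothesis.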

5.7 Pick funnel “78% CRYPTO” dispute

Full: Funnel headline can disagree with policy-clean DB (duplicate signal-ts, resolver labels, concentration in one source). Always read asset_class_health + 14d/48h recency panels before sizing.

ELI5: The big marketing number on the funnel page can be wrong — check the detailed scoreboard before trusting it.

6. LiteLLM swarm review (automated second check)

# Verdict Table

| Claim | Verdict | Evidence Block |
| --- | --- | --- |
| deepseek_v4 / gpt4o tournament models | REJECT | tournament_leaderboard |
| CRYPTO SHORT direction | WEAK | eagle3_directional |
| ETF symbols EEM, IWM, GLD | SUPPORT | eagle3_directional |
| EQUITY symbols BAC, JPM, MSFT, NVDA | REJECT | eagle3_directional |
| ETF dual momentum lab | SUPPORT | lab_sleeve |
| Production /audit | SUPPORT | money_ready |

# Stats Explained

- **WR (Win Rate)**: The percentage of resolved picks that closed as wins.
- **PF (Profit Factor)**: The ratio of gross profits to gross losses; a value below 1 means the strategy lost money overall.
- **n_resolved**: The number of resolved picks or outcomes for a model or strategy.
- **verdict**: A summary of the model or strategy's performance, such as "NOT_READY" or "INSUFFICIENT_DATA".

# ELI5 Feedback Points

### 1. deepseek_v4 / gpt4o tournament models

*   REJECT because the leaderboard reports paper (watch-only) performance only; it is not evidence of production performance.
    *   Evidence: "paper watch only (leaderboard)" in the claim, and the leaderboard data in `tournament_leaderboard`.

### 2. CRYPTO SHORT direction

*   WEAK because the win rate of 67% is high, but the note mentions a "tournament edge" and a "flip target", which suggests that the strategy may not be reliable in production.
    *   Evidence: "tournament edge — flip target" in the note of the CRYPTO SHORT row in `eagle3_directional`.

### 3. ETF symbols EEM, IWM, GLD

*   SUPPORT because the note mentions that these symbols have high win rates in EAGLE3 (93%, 75%, and 68% respectively).
    *   Evidence: "EEM 93%, IWM 75%, GLD 68% WR in EAGLE3" in the note of the ETF symbol whitelist in `eagle3_directional`.

### 4. EQUITY symbols BAC, JPM, MSFT, NVDA

*   REJECT because the production EQUITY PF is 0.33, which is low, and the note mentions a "production bug" that emits LONG.
    *   Evidence: "production EQUITY PF 0.33 FAIL" in the claim, and the EQUITY row in `eagle3_directional`.

### 5. ETF dual momentum lab

*   SUPPORT because the lab PF is 1.6 and the WF OOS PF is 1.21, both above the break-even value of 1.
    *   Evidence: "lab_pf: 1.6" and "wf_oos_pf: 1.21" in `lab_sleeve`.

### 6. Production /audit

*   SUPPORT because the money_ready count is 0, which means no class passes the Money Ready gates (each is either NOT_READY or INSUFFICIENT_DATA), and the CRYPTO PF of 0.92 is below the break-even value of 1.
    *   Evidence: "money_ready_count: 0" in `money_ready`, and the CRYPTO row in `money_ready`.

Artifact: reports/best_picks_swarm_review_2026-06-02.json · Re-run: python3 tools/verify_best_picks_swarm.py

7. EAGLE*.MD* file index (60 files)

8. Recent planning & session docs (8 listed)

9. Reproducers

python3 tools/verify_best_picks_swarm.py
python3 tools/eagle_swarm_synthesis.py
python3 tools/generate_eagle_best_picks_guide.py
VERIFY_SKIP_FRED=1 python3 tools/multi_class_strategy_lab.py
python3 verified_strategies/paper_pilot/etf_dual_momentum_pilot.py --one-shot
curl -s "https://findtorontoevents.ca/audit/data/money_ready_verdict.json" | python3 -m json.tool | head -40