EAGLE2 — Turning Research Edge into Capital-Ready Strategies
2026-06-02 · Claude Opus 4.8 · Goal #1 (per-asset-class edge on /audit) · Full report: reports/EAGLE2_SYNTHESIS_GROUNDED_2026-06-02_claude-opus-4-8.md
One-paragraph summary. Every number here was read straight from our own canonical files (money_ready_verdict.json + pf_registry.json, generated 2026-06-02 10:19 UTC). The headline: no asset class is production-ready (money_ready = []). The real edge in the project lives on the AI tournament page, but it is paper, not live-clean. The whole problem is the gap between those two scoreboards.
ELI5: We have two report cards. The "practice" report card looks great.
The "for-real-money" report card is failing. We figured out exactly why the practice grades don't
carry over — and what to fix.
0 classes money-ready
9 asset classes scored
2 live strategies PF>1 (n≥20)
3.46 best paper PF (deepseek_v4)
1 · What we found
Finding A — Nothing passes the real-money gate yet
| Class | n | Win rate | Profit factor | Verdict |
| CRYPTO | 368 | 36.1% | 0.92 | NOT READY |
| EQUITY | 52 | 26.9% | 0.33 | NOT READY |
| FOREX | 32 | 28.1% | 0.48 | INSUFFICIENT |
| FUTURES | 13 | 15.4% | 0.52 | INSUFFICIENT |
| ETF | 3 | 66.7% | 1.46 | INSUFFICIENT |
| COMMODITY | 4 | 50.0% | 1.68 | INSUFFICIENT |
| BOND | 0 | — | — | NO DATA |
ELI5: A "profit factor" above 1.0 means a strategy makes more than it
loses. Almost all of ours are below 1.0 — they lose money after costs. ETF and Commodity look okay
but they've only made 3–4 trades, which is far too few to trust (like calling a coin "lucky" after
2 flips).
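To make the profit-factor definition concrete, here is a minimal Python sketch. The trade list is hypothetical (it is not read from pf_registry.json), the P&L values are assumed to be net of costs, and the function names are ours for illustration only.

```python
def profit_factor(pnls):
    """Gross profit divided by gross loss; None when there are no losing trades."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    if gross_loss == 0:
        return None  # undefined: nothing to divide by
    return gross_profit / gross_loss


def win_rate(pnls):
    """Fraction of trades that closed with a positive P&L."""
    if not pnls:
        return None
    return sum(1 for p in pnls if p > 0) / len(pnls)


# Hypothetical net P&L per trade (assumption, not project data).
trades = [2.0, -1.0, -1.5, 3.0, -0.5]

pf = profit_factor(trades)  # 5.0 gained / 3.0 lost
wr = win_rate(trades)       # 2 winners out of 5 trades
print(f"profit factor = {pf:.2f}, win rate = {wr:.0%}")
```

In this made-up sample only 40% of trades win, yet the profit factor is about 1.67 because the winners are larger than the losers. Win rate alone does not decide the verdict.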
Finding B — The profitable picks live on the AI Tournament page (but it's paper)
| Model sleeve | Resolved picks | Win rate | Profit factor |
| deepseek_v4 | 208 | 57.7% | 3.46 |
| gpt4o | 134 | 59.7% | 3.14 |
| deepseek_r1 | 132 | 62.9% | 2.93 |
| claude_haiku_4_5 | 74 | 66.2% | 2.71 |
ELI5: When AI models "play the market" on paper, several do really well —
deepseek_v4 has a solid track record over 208 picks. But "paper" means pretend money with friendlier
scoring. We haven't yet proven it survives the strict real-money rules.
Finding C — Why the paper edge dies in production
- Concentration: a single data source supplies 55% of crypto picks (85% of futures picks). Both of our
PF>1 strategies are 100% single-source.
- Label pollution: crypto shows a −85% tail loss with a 100% max drawdown, a fingerprint of
trades that were mislabeled or never properly closed.
- Too few trades + no multiple-testing correction: across ~73 strategy×class buckets with no correction, finding
2 "winners" is what pure luck alone would predict.
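One common way to put a number on the concentration problem is the Herfindahl-Hirschman Index (HHI): the sum of squared source shares, which equals 1.0 when every pick comes from one source. The sketch below assumes each pick carries a source label. The feed names and counts are hypothetical, chosen only to mirror a 55% dominant source.

```python
from collections import Counter


def source_hhi(pick_sources):
    """Herfindahl-Hirschman Index over the sources of a list of picks."""
    counts = Counter(pick_sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one pick")
    return sum((c / total) ** 2 for c in counts.values())


# Hypothetical source labels (assumption): one feed supplies 55% of picks.
picks = ["feed_a"] * 55 + ["feed_b"] * 25 + ["feed_c"] * 20

hhi = source_hhi(picks)                      # 0.55^2 + 0.25^2 + 0.20^2
single_source = source_hhi(["feed_a"] * 10)  # fully concentrated
print(f"mixed HHI = {hhi:.3f}, single-source HHI = {single_source:.1f}")
```

Even this three-feed mix scores about 0.405, well above the HHI < 0.20 target listed in section 4, and a 100% single-source strategy scores the maximum of 1.0.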
ELI5: If 73 people each guess a coin flip, a couple will get a "hot streak"
by chance. We've been treating those lucky streaks as skill. Real funds correct for that — we don't yet.
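The coin-flip intuition can be checked with a short binomial calculation. This is a sketch under two assumptions that the source does not state: the ~73 buckets are treated as independent, and a "winner" is taken to mean a bucket that looks significant at an uncorrected 5% level.

```python
from math import comb


def prob_at_least(k, n, alpha):
    """P(at least k false positives among n independent null tests at level alpha)."""
    below_k = sum(comb(n, i) * alpha**i * (1 - alpha) ** (n - i) for i in range(k))
    return 1.0 - below_k


n_buckets = 73  # approximate bucket count quoted in the text
alpha = 0.05    # assumed uncorrected significance level

expected_false_winners = n_buckets * alpha
p_two_or_more = prob_at_least(2, n_buckets, alpha)
print(f"expected chance winners: {expected_false_winners:.2f}")
print(f"P(at least 2 chance winners): {p_two_or_more:.3f}")
```

Under these assumptions the expected number of chance winners is 3.65, and the probability of seeing at least two is roughly 0.885. Two winners out of ~73 buckets is therefore not evidence of skill by itself.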
2 · Best picks today — the honest answer
No pick in the project is real-money-ready. Here is the detailed rationale for the
candidates people keep asking about:
NVDA — not our edge (hold as plain beta only)
Our EQUITY signals are losing (27% win rate, PF 0.33). NVDA's appeal is a
macro story — the leading AI-chip company — not a signal our system validated. As a long-term
buy-and-hold it's a reasonable market bet, but presenting it as an "/audit pick" would be dishonest.
One brainstorm model "endorsed" NVDA citing a 62.9% win rate. That number is actually
deepseek_r1's tournament win rate, not NVDA's. We caught and rejected that hallucination.
ELI5: NVDA might go up because it's a great company, but our crystal ball
didn't tell us that — so we won't pretend it did.
BTCUSD — the most defensible forward-test candidate
Our only two PF>1 live strategies (crypto_liquidity_wick_reversal PF 1.55, 60% WR;
atr_percentile_gate PF 1.10, 58.6% WR) trade crypto including BTC, and BTC is the top
crypto symbol (22% of picks). This is the closest thing we have to a real signal — but it's only ~30
trades and single-source, so it fails a strict statistical test. Action: shadow-size only
(tiny pretend allocation to watch it live), not real capital.
ELI5: This is our best "maybe." We'll keep testing it with play money before
ever betting real money.
Safe long-term pick — a broad index ETF (e.g. SPY/VOO)
The only "well-known safe" holding we can defend without a proven signal is the whole-market
index. It's a market-average bet, not an alpha claim. The real project goal is to turn this into a
backtested ETF dual-momentum sleeve (see plan).
ELI5: If you must park money somewhere "safe and boring," owning a slice of the
whole stock market is the textbook answer — and it doesn't require us to be fortune-tellers.
3 · More strategies per asset class (what to backtest next)
| Class | Strategy to backtest | First step |
| CRYPTO | Liquidity-wick reversal, de-concentrated + ATR vol gate | Re-run across ≥3 sources |
| EQUITY | Cross-sectional momentum (12-1) + 200-day MA filter | Walk-forward on S&P 500 |
| ETF | Dual-momentum (absolute + relative), monthly | 24-month walk-forward |
| FOREX | Carry + trend (drop intraday scalps) | Fix resolver first |
| COMMODITY | Time-series momentum (managed-futures style) | Long-lookback backtest |
| BOND | Yield-curve / duration timing pilot | Stand up data feed |
ELI5: Each asset (crypto, stocks, etc.) needs its own playbook. We listed the
one play per asset most likely to actually work, and the very first test to run for each.
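As an illustration of the ETF row, a dual-momentum rule can be sketched as follows: pick the asset with the best trailing return (relative momentum), and hold it only if that return also beats cash (absolute momentum). This is a generic textbook form, not the project's backtested implementation. The tickers, prices, 12-month lookback, and cash hurdle are all assumptions.

```python
def trailing_return(prices, lookback):
    """Simple return over the last `lookback` periods of a price series."""
    if len(prices) <= lookback:
        raise ValueError("need more than `lookback` prices")
    return prices[-1] / prices[-1 - lookback] - 1.0


def dual_momentum_pick(price_history, cash_return, lookback=12, safe_asset="CASH"):
    """Return the best-performing symbol, or the safe asset if it fails to beat cash."""
    returns = {sym: trailing_return(px, lookback) for sym, px in price_history.items()}
    best = max(returns, key=returns.get)            # relative momentum
    if returns[best] > cash_return:                 # absolute momentum
        return best
    return safe_asset


# Hypothetical month-end closes, 13 points each (assumption, not project data).
history = {
    "ETF_A": [100.0] * 12 + [110.0],  # +10% over 12 months
    "ETF_B": [100.0] * 12 + [105.0],  # +5% over 12 months
}

print(dual_momentum_pick(history, cash_return=0.04))  # ETF_A beats cash
print(dual_momentum_pick(history, cash_return=0.12))  # nothing beats cash
```

A monthly rebalance would rerun this pick at each month end. The walk-forward test named in the table would then score those picks only on data after each decision date.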
Where this is documented: ejaguiar1_backtests holds the backtest results plus the project's
destructive-op backup DB; live picks flow to ejaguiar1_stocks. Every idea is
pre-registered in reports/hypothesis_registry.json before any backtest (rule M-107).
4 · Are our picks statistically "real-money ready"?
The gates a signal must clear before real capital:
- Bare minimum: Profit factor ≥ 1.5 · Sharpe ≥ 1.0 (after costs) · max drawdown ≤ 20% ·
out-of-sample profit factor ≥ 80% of in-sample · Bonferroni / FDR correction · 5-bps cost model.
- Ideal: purged-embargoed walk-forward · 1000-sample bootstrap (95% confidence the profit
factor stays above 1.0) · works in ≥3 of 4 market regimes · source concentration (HHI) < 0.20 ·
shadow-sized 4–8 weeks live.
- Forward-test minimum: ≥30–50 symbols per class, run for ≥2× the lookback window, live
profit factor within ±10% of backtest for two consecutive 4-week windows before scaling.
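Two of the gates above translate directly into code: the bare-minimum thresholds and the bootstrap check on profit factor. The sketch below is one possible reading, not the project's harness. It assumes a one-sided percentile bootstrap for the "95% confidence" item, uses hypothetical trade lists, and leaves the cost model and multiple-testing correction out of scope.

```python
import random


def passes_bare_minimum(pf, sharpe, max_drawdown, pf_oos, pf_is):
    """Bare-minimum thresholds from the list above (drawdown as a fraction)."""
    return (
        pf >= 1.5
        and sharpe >= 1.0
        and max_drawdown <= 0.20
        and pf_oos >= 0.80 * pf_is
    )


def bootstrap_pf_lower_bound(pnls, n_samples=1000, confidence=0.95, seed=0):
    """One-sided lower bound on profit factor from resampling trades with replacement."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    resampled = []
    for _ in range(n_samples):
        sample = [rng.choice(pnls) for _ in pnls]
        gains = sum(p for p in sample if p > 0)
        losses = -sum(p for p in sample if p < 0)
        resampled.append(gains / losses if losses > 0 else float("inf"))
    resampled.sort()
    return resampled[int((1 - confidence) * n_samples)]


# Hypothetical 30-trade samples (assumption, not project data).
break_even = [1.0, -1.0] * 15
strong = [3.0] * 20 + [-1.0] * 10

print(passes_bare_minimum(pf=1.6, sharpe=1.2, max_drawdown=0.15, pf_oos=1.4, pf_is=1.6))
print(bootstrap_pf_lower_bound(break_even) > 1.0)
print(bootstrap_pf_lower_bound(strong) > 1.0)
```

A strategy would pass the bootstrap item only if the lower bound stays above 1.0. With around 30 trades the resampled range is wide, which is why small samples struggle here.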
The single most important gate we are missing: a
multiple-testing correction (Bonferroni / FDR) applied across all strategies
before the DSR/SPA gates. At α=0.05 over ~73 buckets, Bonferroni requires p < 0.05 / 73 ≈ 0.00068, and neither
of our two "winning" crypto sleeves clears that at n≈30. Two of three brainstorm AIs independently
flagged this same gap.
ELI5: "Bonferroni correction" = if you test lots of ideas, you must raise the
bar for calling any one a "winner," because some will look good by luck. We weren't raising the bar.
That's the #1 fix.
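For readers who want to see what the two named corrections do, here is a minimal sketch of the Bonferroni threshold and the Benjamini-Hochberg (FDR) step-up procedure. The p-values are hypothetical and the function names are ours. How the project would derive a p-value per bucket is not specified in the source and is not modeled here.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value sits under its stepped threshold.
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


threshold = bonferroni_threshold(0.05, 73)
print(f"Bonferroni threshold over 73 buckets: {threshold:.6f}")

# Hypothetical p-values for five strategies (assumption, not project data).
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print("BH keeps strategies at indices:", rejected)
```

Bonferroni is the stricter of the two. Benjamini-Hochberg tolerates a controlled share of false discoveries, so it usually keeps more candidates alive for forward testing.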
5 · The plan going forward
Short-term (0–2 weeks)
- Add the Bonferroni/FDR pre-gate to the readiness verdict (closes the false-positive hole).
- Make "single-source artifact" a hard reject at promotion.
- Resolver-hygiene pass on CRYPTO + FOREX (fix the mislabeled/never-closed tails).
- Pre-register the per-class strategy archetypes above.
Long-term (3–12 weeks)
- One promotion path only: purged-embargoed walk-forward — kill the two-scoreboard split so paper
and production use the same labels and costs.
- Shadow-size anything that clears the harness at ≤0.5% capital for 4–8 weeks.
- Build the missing data feeds (ETF dual-momentum, commodities, bonds).
- Promote to real capital only after two consecutive 4-week windows hold within ±10% of backtest.
ELI5: First, stop fooling ourselves (raise the bar, fix the data). Then test
the best ideas with play money. Only the ones that keep working for real go live.
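The final promotion rule is simple enough to write down as a check. This sketch assumes "within ±10%" is measured relative to the backtest profit factor and that the two windows must be the most recent ones. Both readings are ours, and the numbers are hypothetical.

```python
def ready_to_promote(backtest_pf, window_pfs, tolerance=0.10, required_streak=2):
    """True when the latest `required_streak` live windows all sit within tolerance."""
    if len(window_pfs) < required_streak:
        return False  # not enough live history yet
    recent = window_pfs[-required_streak:]
    return all(abs(pf - backtest_pf) <= tolerance * backtest_pf for pf in recent)


# Hypothetical live profit factors per 4-week window, oldest first (assumption).
print(ready_to_promote(1.5, [1.2, 1.45, 1.6]))  # last two windows are within 10%
print(ready_to_promote(1.5, [1.45, 1.2]))       # latest window drifted too far
print(ready_to_promote(1.5, [1.5]))             # only one window so far
```

Note that the check is two-sided: a live profit factor far above the backtest also blocks promotion, since it suggests the backtest does not describe what is happening live.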
6 · What was accomplished this session
- Synced the local repo (was 1820 commits behind → 0) without losing any peer work (saved on a backup branch).
- Verified all 3 new LiteLLM proxy modes work: ollama-cloud-large→Claude, ollama-cloud→Qwen, ollama-cloud-local→Gemma; ran a grounded brainstorm through them.
- Reconciled 30+ peer "EAGLE" documents against the real numbers and corrected a fabricated stat.
- Produced the grounded synthesis report, per-class backtest plan, and this page.
ELI5: We tidied the workshop, confirmed our new AI helpers work, fact-checked
everyone's homework, and wrote up the honest results.
7 · Consensus Quick Picks (no backtest, opinion-aggregation)
Separate from the audit edge-hunt above: a fast, stability-tilted basket built purely from a
6-model AI consensus vote (analyst ratings + 13F ownership + moat/quality knowledge). It is clearly
labeled as model opinion, not a proven signal; verify live before sizing.
| Pick | Tier | Votes | Conviction | Read |
| MSFT | Mega-cap | 6/6 | 90 | Unanimous #1 — moat + balance sheet |
| BRK.B | Mega-cap | 5/6 | 89 | Fortress / "sleep-well" equity |
| SGOV | Bond/cash | 4/6 | 95 | Highest conviction — T-bill yield anchor |
| VOO / VTI | Broad ETF | 4/6 | 92-94 | Core equity ballast |
| COST / AGG | Mega-cap / Bond | 2-3/6 | 89-92 | Quality + bond ballast |
| NVDA | Semi | 2 pick / 3 avoid | 84 | Divisive — too cyclical for a safe basket |
| INTC | Semi | 0 pick / 3 avoid | — | Consensus AVOID (fading moat) |
ELI5: We asked 6 AIs "what would you safely park money in?" They agreed: a
mix of T-bills (SGOV), the whole stock market (VOO), and a couple of rock-solid companies (Microsoft,
Berkshire). They warned against Intel, and split on Nvidia.
Source: reports/CONSENSUS_QUICK_PICKS_2026-06-02.md + raw votes cqp_vote_*.txt.
8 · Swarm-refined selection methodology
A 15-round multi-model swarm produced a concrete quick-pick + long-term playbook for
stocks · ETFs · bonds · futures · commodities — each with signal sources, numeric thresholds,
position sizing, rebalance cadence, and when-to-avoid rules. Examples:
- Stocks (quick): mkt-cap ≥ $100B, P/E ≤ 25, ROIC ≥ 15%, ≥5 analysts, top-50 fund ownership ≥ 2%; equal-weight top 10, max 5%/name, quarterly.
- ETFs: AUM ≥ $5B, expense ≤ 0.10%, tracking error ≤ 0.5%; avoid leveraged/inverse + >50% overlap.
- Bonds: duration ≤ 2yr T-bills + A-rated short IG ladder; avoid <A-, high-yield, inverted curve >6mo.
- Futures/Commodities: only when 12-month momentum > 0 (SG CTA / BCOM); DBMF as managed-futures proxy; 10% cap, 0% if momentum negative.
ELI5: A checklist of rules for picking each type of investment — and just as
important, rules for when to not buy. Written so anyone can follow it without a finance degree.
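The quick stock screen in the first bullet maps onto a plain filter. The sketch below uses the thresholds quoted above, with hypothetical tickers and fundamentals. Reading "top-50 fund ownership ≥ 2%" as a single ownership fraction per stock is an assumption, and the weighting and rebalance rules are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    ticker: str
    market_cap_usd: float        # in dollars
    pe_ratio: float
    roic: float                  # 0.15 means 15%
    analyst_count: int
    top50_fund_ownership: float  # 0.02 means 2%


def passes_quick_stock_screen(c):
    """Apply the quick-pick stock thresholds listed in the playbook summary."""
    return (
        c.market_cap_usd >= 100e9
        and c.pe_ratio <= 25
        and c.roic >= 0.15
        and c.analyst_count >= 5
        and c.top50_fund_ownership >= 0.02
    )


# Hypothetical universe (assumption, not real fundamentals).
universe = [
    Candidate("AAA", 250e9, 22.0, 0.18, 30, 0.04),
    Candidate("BBB", 80e9, 18.0, 0.20, 12, 0.03),   # fails the market-cap floor
    Candidate("CCC", 400e9, 41.0, 0.25, 40, 0.05),  # fails the P/E ceiling
]

shortlist = [c.ticker for c in universe if passes_quick_stock_screen(c)]
print(shortlist)
```

Keeping every threshold in one function makes the when-to-avoid side explicit: a name is excluded as soon as any single rule fails.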
Full playbook: reports/QUICK_PICK_METHODOLOGY_SWARM_2026-06-02.md. A 5-cycle hourly
debate / devil's-advocate swarm then stress-tested the basket and converged on a final list. Final agreed scores:
BRK.B 86 · GLD 80 · MSFT 79 · PEP 74 · VOO 70 · COST 66 · JNJ 42 (avoid INTC/NVDA/TSLA).
The debate dropped SGOV, raised GLD (regime hedge), and collapsed JNJ on litigation overhang; see
reports/HOURLY_PICKS_ENHANCEMENT_2026-06-02.md for the per-cycle trail.
Provenance: all stats from audit_dashboard/data/money_ready_verdict.json and
audit_dashboard/data/pf_registry.json (generated 2026-06-02T10:19Z). AI brainstorm outputs
are advisory only and quarantined in reports/eagle2_brainstorm_*.md; no model was permitted
to "fetch" live pages. This is analysis, not financial advice.