EAGLE2 — Turning Research Edge into Capital-Ready Strategies

2026-06-02 · Claude Opus 4.8 · Goal #1 (per-asset-class edge on /audit) · Full report: reports/EAGLE2_SYNTHESIS_GROUNDED_2026-06-02_claude-opus-4-8.md

One-paragraph summary. Every number here was read straight from our own canonical files (money_ready_verdict.json + pf_registry.json, generated 2026-06-02 10:19 UTC). The headline: no asset class is production-ready (money_ready = []). The real edge in the project lives on the AI tournament page — but it is paper, not live-clean. The whole problem is the gap between those two scoreboards.

ELI5: We have two report cards. The "practice" report card looks great. The "for-real-money" report card is failing. We figured out exactly why the practice grades don't carry over — and what to fix.

0 classes money-ready

9 asset classes scored

2 live strategies PF>1 (n≥20)

3.46 best paper PF (deepseek_v4)

1 · What we found

Finding A — Nothing passes the real-money gate yet

Class	n	Win rate	Profit factor	Verdict
CRYPTO	368	36.1%	0.92	NOT READY
EQUITY	52	26.9%	0.33	NOT READY
FOREX	32	28.1%	0.48	INSUFFICIENT
FUTURES	13	15.4%	0.52	INSUFFICIENT
ETF	3	66.7%	1.46	INSUFFICIENT
COMMODITY	4	50.0%	1.68	INSUFFICIENT
BOND	0	—	—	NO DATA

ELI5: A "profit factor" above 1.0 means a strategy makes more than it loses. Four of our six classes with trades are below 1.0, so they lose money after costs. ETF and Commodity look okay, but they've only made 3–4 trades, which is far too few to trust (like calling a coin "lucky" after 2 flips).
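The two columns in the table above can be computed from a list of per-trade profit-and-loss values. A minimal sketch, assuming each resolved trade is reduced to one signed P&L number; the function names and the three-trade sample are hypothetical, not our resolver's code or real trades:

```python
from typing import Sequence


def profit_factor(pnls: Sequence[float]) -> float:
    """Gross profit divided by gross loss.

    Returns infinity when there are no losing trades (a convention
    chosen for this sketch).
    """
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    if gross_loss == 0:
        return float("inf")
    return gross_profit / gross_loss


def win_rate(pnls: Sequence[float]) -> float:
    """Share of trades with a strictly positive P&L."""
    if not pnls:
        raise ValueError("need at least one trade")
    return sum(1 for p in pnls if p > 0) / len(pnls)


# Hypothetical sample: two wins and one loss.
sample = [1.0, 0.46, -1.0]
print(f"PF={profit_factor(sample):.2f}  WR={win_rate(sample):.1%}")
```

Three made-up trades are enough to produce a 66.7% win rate and a profit factor near 1.46, which is why a tiny n says little about real edge.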

Finding B — The profitable picks live on the AI Tournament page (but it's paper)

Model sleeve	Resolved picks	Win rate	Profit factor
deepseek_v4	208	57.7%	3.46
gpt4o	134	59.7%	3.14
deepseek_r1	132	62.9%	2.93
claude_haiku_4_5	74	66.2%	2.71

ELI5: When AI models "play the market" on paper, several do really well — deepseek_v4 has a solid track record over 208 picks. But "paper" means pretend money with friendlier scoring. We haven't yet proven it survives the strict real-money rules.

Finding C — Why the paper edge dies in production

Concentration: one data source supplies 55% of crypto picks (85% of futures). Our only two PF>1 strategies are both 100% single-source.
Label pollution: crypto shows a −85% tail loss with a 100% max drawdown — a fingerprint of trades that were mislabeled or never properly closed.
Too few trades + no multiple-testing correction: across ~73 strategy×class buckets, a 5% per-bucket false-positive rate would by itself produce about 3 to 4 lucky "winners" (73 × 0.05 ≈ 3.65), so finding 2 is no evidence of skill.

ELI5: If 73 people each guess a coin flip, a couple will get a "hot streak" by chance. We've been treating those lucky streaks as skill. Real funds correct for that — we don't yet.
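The coin-flip intuition can be put in numbers. A rough sketch, assuming the ~73 buckets are independent and each has a 5% chance of looking like a winner by luck; both are simplifying assumptions, and our readiness gate does not define a per-bucket false-positive rate this way:

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_BUCKETS = 73   # "~73 strategy x class buckets" from the text
ALPHA = 0.05     # assumed per-bucket chance of a lucky "winner"

expected_false_positives = N_BUCKETS * ALPHA
p_two_or_more = prob_at_least(2, N_BUCKETS, ALPHA)

print(f"expected lucky winners: {expected_false_positives:.2f}")
print(f"P(at least 2 lucky winners): {p_two_or_more:.3f}")
```

Under these assumptions, two or more lucky winners show up in nearly nine out of ten runs, so two PF>1 buckets is what a project with no edge at all would usually look like.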

2 · Best picks today — the honest answer

No pick in the project is real-money-ready. Here is the detailed rationale for the candidates people keep asking about:

NVDA — not our edge (hold as plain beta only)

Our EQUITY signals are losing (27% win rate, PF 0.33). NVDA's appeal is a macro story — the leading AI-chip company — not a signal our system validated. As a long-term buy-and-hold it's a reasonable market bet, but presenting it as an "/audit pick" would be dishonest. One brainstorm model "endorsed" NVDA citing a 62.9% win rate — that number is actually deepseek_r1's tournament score, not NVDA's. We caught and rejected that hallucination.

ELI5: NVDA might go up because it's a great company, but our crystal ball didn't tell us that — so we won't pretend it did.

BTCUSD — the most defensible forward-test candidate

Our only two PF>1 live strategies (crypto_liquidity_wick_reversal PF 1.55, 60% WR; atr_percentile_gate PF 1.10, 58.6% WR) trade crypto including BTC, and BTC is the top crypto symbol (22% of picks). This is the closest thing we have to a real signal — but it's only ~30 trades and single-source, so it fails a strict statistical test. Action: shadow-size only (tiny pretend allocation to watch it live), not real capital.

ELI5: This is our best "maybe." We'll keep testing it with play money before ever betting real money.
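To see why ~30 trades is not enough, here is a hedged back-of-envelope check. It assumes a simple null hypothesis of a 50% win rate and rounds the sample to 30 trades with 18 wins (60%). This is an illustration only: profit factor also depends on win and loss sizes, and our gate may use a different test.

```python
from math import comb


def binom_upper_tail(wins: int, n: int, p0: float = 0.5) -> float:
    """One-sided exact p-value: P(X >= wins) if the true win rate is p0."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(wins, n + 1))


def min_wins_to_clear(n: int, threshold: float, p0: float = 0.5) -> int:
    """Smallest win count whose one-sided p-value falls below threshold."""
    for wins in range(n + 1):
        if binom_upper_tail(wins, n, p0) < threshold:
            return wins
    return n + 1  # not achievable at this sample size


N_TRADES = 30            # "~30 trades" from the text, rounded (assumption)
WINS = 18                # 60% win rate on 30 trades
BONFERRONI = 0.05 / 73   # alpha = 0.05 spread over ~73 buckets

p_value = binom_upper_tail(WINS, N_TRADES)
print(f"p-value at {WINS}/{N_TRADES} wins: {p_value:.3f}")
print(f"Bonferroni threshold: {BONFERRONI:.5f}")
print(f"wins needed out of {N_TRADES}: {min_wins_to_clear(N_TRADES, BONFERRONI)}")
```

With these inputs the p-value is roughly 0.18, far above the corrected threshold; clearing it on win rate alone would take 25 wins out of 30. More trades, not a better story, is what moves this number.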

Safe long-term pick — a broad index ETF (e.g. SPY/VOO)

The only "well-known safe" holding we can defend without a proven signal is the whole-market index. It's a market-average bet, not an alpha claim. The real project goal is to turn this into a backtested ETF dual-momentum sleeve (see plan).

ELI5: If you must park money somewhere "safe and boring," owning a slice of the whole stock market is the textbook answer — and it doesn't require us to be fortune-tellers.

3 · More strategies per asset class (what to backtest next)

Class	Strategy to backtest	First step
CRYPTO	Liquidity-wick reversal, de-concentrated + ATR vol gate	Re-run across ≥3 sources
EQUITY	Cross-sectional momentum (12-1) + 200-day MA filter	Walk-forward on S&P 500
ETF	Dual-momentum (absolute + relative), monthly	24-month walk-forward
FOREX	Carry + trend (drop intraday scalps)	Fix resolver first
COMMODITY	Time-series momentum (managed-futures style)	Long-lookback backtest
BOND	Yield-curve / duration timing pilot	Stand up data feed

ELI5: Each asset (crypto, stocks, etc.) needs its own playbook. We listed the one play per asset most likely to actually work, and the very first test to run for each.
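As an illustration of the ETF row, a dual-momentum rule can be sketched in a few lines. This is one common formulation, not a spec we have registered: the 12-month lookback, the "beat the safe asset" absolute filter, and the ticker names are all assumptions for the example.

```python
from typing import Mapping, Sequence


def trailing_return(prices: Sequence[float], lookback: int) -> float:
    """Simple return over the last `lookback` periods of a price series."""
    if len(prices) <= lookback:
        raise ValueError("price history shorter than lookback")
    return prices[-1] / prices[-1 - lookback] - 1.0


def dual_momentum_pick(
    monthly_prices: Mapping[str, Sequence[float]],
    safe_asset: str,
    lookback: int = 12,  # assumed lookback in months
) -> str:
    """Return the symbol to hold for the coming month.

    Relative momentum: choose the risky asset with the best trailing return.
    Absolute momentum: hold it only if it also beats the safe asset;
    otherwise fall back to the safe asset.
    """
    risky = {
        sym: trailing_return(px, lookback)
        for sym, px in monthly_prices.items()
        if sym != safe_asset
    }
    best = max(risky, key=risky.get)
    safe_ret = trailing_return(monthly_prices[safe_asset], lookback)
    return best if risky[best] > safe_ret else safe_asset


# Hypothetical month-end prices (13 points = 12 monthly returns).
prices = {
    "ETF_A": [100.0] * 12 + [110.0],
    "ETF_B": [100.0] * 12 + [105.0],
    "CASH": [100.0] * 12 + [102.0],
}
print(dual_momentum_pick(prices, safe_asset="CASH"))
```

In a 24-month walk-forward this function would be called once per month on data available up to that month only, with the result pre-registered before scoring.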

Backtests are documented in ejaguiar1_backtests (backtest results + the project's destructive-op backup DB); live picks flow to ejaguiar1_stocks. Every idea is pre-registered in reports/hypothesis_registry.json before any backtest (rule M-107).

4 · Are our picks statistically "real-money ready"?

The gates a signal must clear before real capital:

Bare minimum: Profit factor ≥ 1.5 · Sharpe ≥ 1.0 (after costs) · max drawdown ≤ 20% · out-of-sample profit factor ≥ 80% of in-sample · Bonferroni / FDR correction · 5-bps cost model.
Ideal: purged-embargoed walk-forward · 1000-sample bootstrap (95% confidence the profit factor stays above 1.0) · works in ≥3 of 4 market regimes · source concentration (HHI) < 0.20 · shadow-sized 4–8 weeks live.
Forward-test minimum: ≥30–50 symbols per class, run for ≥2× the lookback window, live profit factor within ±10% of backtest for two consecutive 4-week windows before scaling.
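Two of the "ideal" gates above are easy to prototype. The sketch below assumes per-trade P&L values and per-source pick counts as inputs; the function names are hypothetical, the bootstrap is a plain resample of trades (it ignores serial dependence), and "95% confidence" is read as a one-sided 5th-percentile bound.

```python
import random
from typing import Mapping, Sequence


def profit_factor(pnls: Sequence[float]) -> float:
    """Gross profit over gross loss; infinity if there are no losses."""
    gross_profit = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    return float("inf") if gross_loss == 0 else gross_profit / gross_loss


def bootstrap_pf_lower_bound(
    pnls: Sequence[float],
    n_boot: int = 1000,
    confidence: float = 0.95,
    seed: int = 0,
) -> float:
    """One-sided lower bound on profit factor from resampled trades."""
    rng = random.Random(seed)
    n = len(pnls)
    stats = sorted(
        profit_factor([pnls[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    return stats[int((1 - confidence) * n_boot)]


def hhi(counts: Mapping[str, int]) -> float:
    """Herfindahl-Hirschman index of source shares (1.0 = single source)."""
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())


# Hypothetical inputs for illustration only.
trades = [2.0] * 80 + [-1.0] * 20
sources = {"src_a": 55, "src_b": 25, "src_c": 20}

passes_bootstrap = bootstrap_pf_lower_bound(trades) > 1.0
passes_hhi = hhi(sources) < 0.20
print(passes_bootstrap, passes_hhi, round(hhi(sources), 3))
```

A single-source strategy always scores an HHI of 1.0, so it fails the concentration gate regardless of how good its profit factor looks.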

The single most important gate we are missing: a multiple-testing correction (Bonferroni / FDR) applied across all strategies before the DSR/SPA gates. At α=0.05 over ~73 buckets, Bonferroni requires p < 0.05/73 ≈ 0.00068 per strategy, and neither of our two "winning" crypto sleeves clears that at n≈30. Two of three brainstorm AIs independently flagged this same gap.

ELI5: "Bonferroni correction" = if you test lots of ideas, you must raise the bar for calling any one a "winner," because some will look good by luck. We weren't raising the bar. That's the #1 fix.
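Both corrections named above fit in a few lines. A minimal sketch, assuming each bucket already has a p-value from some test; the function names and sample p-values are hypothetical, and the Benjamini-Hochberg step-up procedure is used here as the FDR variant (the text does not say which FDR method we would adopt):

```python
from typing import List, Sequence


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps family-wise error at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up procedure.

    Returns one flag per input p-value: True if that hypothesis is
    rejected (kept as a "winner") at false-discovery rate q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


print(f"{bonferroni_threshold(0.05, 73):.5f}")
# Hypothetical p-values for four buckets.
print(benjamini_hochberg([0.0001, 0.18, 0.40, 0.90]))
```

Bonferroni is the stricter of the two; FDR lets a few more candidates through to forward-testing at the cost of a controlled share of false positives.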

5 · The plan going forward

Short-term (0–2 weeks)

Add the Bonferroni/FDR pre-gate to the readiness verdict (closes the false-positive hole).
Make "single-source artifact" a hard reject at promotion.
Resolver-hygiene pass on CRYPTO + FOREX (fix the mislabeled/never-closed tails).
Pre-register the per-class strategy archetypes above.

Long-term (3–12 weeks)

One promotion path only: purged-embargoed walk-forward — kill the two-scoreboard split so paper and production use the same labels and costs.
Shadow-size anything that clears the harness at ≤0.5% capital for 4–8 weeks.
Build the missing data feeds (ETF dual-momentum, commodities, bonds).
Promote to real capital only after two consecutive 4-week windows hold within ±10% of backtest.

ELI5: First, stop fooling ourselves (raise the bar, fix the data). Then test the best ideas with play money. Only the ones that keep working for real go live.
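The final promotion rule is mechanical enough to write down. A sketch under two stated assumptions: "within ±10%" is read as relative to the backtest profit factor, and "two consecutive windows" means the two most recent 4-week windows. The function names are hypothetical.

```python
from typing import Sequence


def within_tolerance(live_pf: float, backtest_pf: float, tol: float = 0.10) -> bool:
    """True if live PF is within +/- tol (relative) of the backtest PF."""
    return abs(live_pf - backtest_pf) <= tol * backtest_pf


def ready_to_scale(
    live_window_pfs: Sequence[float],
    backtest_pf: float,
    tol: float = 0.10,
    required_windows: int = 2,
) -> bool:
    """Check the most recent `required_windows` 4-week windows.

    `live_window_pfs` is ordered oldest to newest, one PF per window.
    """
    if len(live_window_pfs) < required_windows:
        return False
    recent = live_window_pfs[-required_windows:]
    return all(within_tolerance(pf, backtest_pf, tol) for pf in recent)


# Hypothetical shadow-sized results against a backtest PF of 1.55.
print(ready_to_scale([1.20, 1.50, 1.60], backtest_pf=1.55))
print(ready_to_scale([1.50, 1.20], backtest_pf=1.55))
```

Note that the band is two-sided: a live profit factor far above the backtest also fails, since it suggests the backtest is not describing what is actually happening live.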

6 · What was accomplished this session

Synced the local repo (was 1820 commits behind → 0) without losing any peer work (saved on a backup branch).
Verified all 3 new LiteLLM proxy modes work: ollama-cloud-large→Claude, ollama-cloud→Qwen, ollama-cloud-local→Gemma; ran a grounded brainstorm through them.
Reconciled 30+ peer "EAGLE" documents against the real numbers and corrected a fabricated stat.
Produced the grounded synthesis report, per-class backtest plan, and this page.

ELI5: We tidied the workshop, confirmed our new AI helpers work, fact-checked everyone's homework, and wrote up the honest results.

7 · Consensus Quick Picks (no backtest, opinion-aggregation)

Separate from the audit edge-hunt above: a fast, stability-tilted basket built purely from a 6-model AI consensus vote (analyst ratings + 13F ownership + moat/quality knowledge). Clearly labeled model opinion, not a proven signal — verify live before sizing.

Pick	Tier	Votes	Conviction	Read
MSFT	Mega-cap	6/6	90	Unanimous #1 — moat + balance sheet
BRK.B	Mega-cap	5/6	89	Fortress / "sleep-well" equity
SGOV	Bond/cash	4/6	95	Highest conviction — T-bill yield anchor
VOO / VTI	Broad ETF	4/6	92-94	Core equity ballast
COST / AGG	Mega-cap / Bond	2-3/6	89-92	Quality + bond ballast
NVDA	Semi	2 pick / 3 avoid	84	Divisive — too cyclical for a safe basket
INTC	Semi	0 pick / 3 avoid	—	Consensus AVOID (fading moat)

ELI5: We asked 6 AIs "what would you safely park money in?" They agreed: a mix of T-bills (SGOV), the whole stock market (VOO), and a couple of rock-solid companies (Microsoft, Berkshire). They warned against Intel, and split on Nvidia.

Source: reports/CONSENSUS_QUICK_PICKS_2026-06-02.md + raw votes cqp_vote_*.txt.

8 · Swarm-refined selection methodology

A 15-round multi-model swarm produced a concrete quick-pick + long-term playbook for stocks · ETFs · bonds · futures · commodities — each with signal sources, numeric thresholds, position sizing, rebalance cadence, and when-to-avoid rules. Examples:

Stocks (quick): mkt-cap ≥ $100B, P/E ≤ 25, ROIC ≥ 15%, ≥5 analysts, top-50 fund ownership ≥ 2%; equal-weight top 10, max 5%/name, quarterly.
ETFs: AUM ≥ $5B, expense ≤ 0.10%, tracking error ≤ 0.5%; avoid leveraged/inverse + >50% overlap.
Bonds: duration ≤ 2yr T-bills + A-rated short IG ladder; avoid <A-, high-yield, inverted curve >6mo.
Futures/Commodities: only when 12-month momentum > 0 (SG CTA / BCOM); DBMF as managed-futures proxy; 10% cap, 0% if momentum negative.

ELI5: A checklist of rules for picking each type of investment — and just as important, rules for when to not buy. Written so anyone can follow it without a finance degree.
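The stock quick-screen above translates directly into a filter. A minimal sketch using the thresholds from the playbook line; the field names, the sample tickers, and the choice to rank survivors by ROIC are assumptions (the playbook says "top 10" without naming the ranking key here).

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Stock:
    symbol: str
    market_cap_usd_bn: float
    pe: float
    roic_pct: float
    analyst_count: int
    top50_fund_ownership_pct: float


def passes_quick_screen(s: Stock) -> bool:
    """Thresholds from the stocks (quick) rule in the playbook."""
    return (
        s.market_cap_usd_bn >= 100
        and s.pe <= 25
        and s.roic_pct >= 15
        and s.analyst_count >= 5
        and s.top50_fund_ownership_pct >= 2
    )


def build_basket(
    universe: Iterable[Stock], max_names: int = 10, max_weight: float = 0.05
) -> Dict[str, float]:
    """Equal-weight the top names that pass, capped at max_weight each."""
    passed = sorted(
        (s for s in universe if passes_quick_screen(s)),
        key=lambda s: s.roic_pct,  # assumed ranking key
        reverse=True,
    )[:max_names]
    if not passed:
        return {}
    weight = min(1.0 / len(passed), max_weight)
    return {s.symbol: weight for s in passed}


# Hypothetical universe; numbers are made up for the example.
universe = [
    Stock("AAA", 500, 20, 30, 40, 5),
    Stock("BBB", 50, 20, 30, 40, 5),   # fails market cap
    Stock("CCC", 300, 40, 30, 40, 5),  # fails P/E
]
print(build_basket(universe))
```

With a 5% cap, ten names fill only half of the sleeve; the remainder would sit in whatever the other rules (ETFs, bonds) allocate. The basket would be rebuilt quarterly per the playbook.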

Full playbook: reports/QUICK_PICK_METHODOLOGY_SWARM_2026-06-02.md. A 5-cycle hourly debate / devil's-advocate swarm then stress-tested the basket and converged on a final list. Final agreed: BRK.B 86 · GLD 80 · MSFT 79 · PEP 74 · VOO 70 · COST 66 · JNJ 42 (avoid INTC/NVDA/TSLA). The debate dropped SGOV, raised GLD (regime hedge), and collapsed JNJ on litigation overhang; see reports/HOURLY_PICKS_ENHANCEMENT_2026-06-02.md for the per-cycle trail.

Provenance: all stats from audit_dashboard/data/money_ready_verdict.json and audit_dashboard/data/pf_registry.json (generated 2026-06-02T10:19Z). AI brainstorm outputs are advisory only and quarantined in reports/eagle2_brainstorm_*.md; no model was permitted to "fetch" live pages. This is analysis, not financial advice.