nanare-sudo/llm-quant-research-loop

LLM-Augmented Quant Strategy Research Loop

A small, honest framework that demonstrates a modern, AI-assisted quant research workflow: a language model proposes trading strategies, a deterministic and leakage-controlled harness evaluates them, and the model critiques its own results against a methodology rubric — with multiple-testing deflation to stop the search from fooling itself.

The methodology (triple-barrier labeling, meta-labeling, purged cross-validation, deflated Sharpe) follows Marcos López de Prado, Advances in Financial Machine Learning. The domain knowledge is supplied to the model as skills — short, self-authored markdown summaries it loads into context.
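For readers new to the first of these techniques, here is a minimal sketch of triple-barrier labeling for a single long-side event. It is an illustration under stated assumptions, not the code in this repository: the function name `triple_barrier_label` and its parameters are hypothetical, the barriers are fixed fractional returns rather than volatility-scaled ones, and a time-barrier exit is labeled 0.

```python
def triple_barrier_label(prices, start, pt, sl, max_hold):
    """Label one long event starting at index `start`.

    Returns (label, exit_index): +1 if the profit target is touched first,
    -1 if the stop-loss is touched first, 0 if the time barrier expires.
    `pt` and `sl` are positive fractional returns (assumed fixed here).
    """
    entry = prices[start]
    # Vertical (time) barrier, clipped to the end of the series.
    end = min(start + max_hold, len(prices) - 1)
    for t in range(start + 1, end + 1):
        ret = prices[t] / entry - 1.0
        if ret >= pt:
            return 1, t   # upper barrier hit first
        if ret <= -sl:
            return -1, t  # lower barrier hit first
    return 0, end         # neither horizontal barrier was touched
```

The exit index matters as much as the label: it defines how long each label "lives", which is what purging and uniqueness weighting later rely on.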

What this project actually demonstrates. Not "an AI that finds alpha." It demonstrates how to drive an LLM as a research multiplier responsibly: the model is grounded in domain knowledge, constrained to a vetted toolbox, judged by deterministic code, and held to an honest statistical bar. The full, transparent loop — including failed attempts — is the point.

How the loop works

   ┌──────────┐   structured spec   ┌───────────────┐   metrics   ┌──────────┐
   │ PROPOSE  │ ──────────────────▶ │   EVALUATE    │ ──────────▶ │ DEFLATE  │
   │  (LLM)   │                     │ (deterministic│             │ (no LLM) │
   └──────────┘                     │  AFML harness)│             └────┬─────┘
        ▲                           └───────────────┘                  │
        │         next spec               ▲                            ▼
        │        ┌──────────┐              │                     ┌──────────┐
        └─────── │ CRITIQUE │ ◀────────────┴──────────────────── │  LEDGER  │
                 │  (LLM)   │      result + deflation             │ (record) │
                 └──────────┘                                     └──────────┘
  1. Propose — the model receives the relevant skills, the exact catalog of allowed signals and features, and the history of prior results. It returns a structured strategy spec (JSON), which is strictly validated.
  2. Evaluate — a deterministic harness runs the AFML pipeline (volatility target → CUSUM events → primary signal → triple-barrier labels → meta-labels → uniqueness-weighted samples → purged-CV random forest) and returns honest out-of-fold metrics. No LLM is involved in scoring.
  3. Deflate — the best result so far is compared against the expected-max-Sharpe under the null for the number of trials run (False Strategy Theorem). The bar rises with every iteration.
  4. Critique — the model audits its own result against the leakage_audit rubric, lists issues, and proposes the next spec. The verdict drives the loop.
  5. Ledger — every iteration, including failures and rejects, is appended to a JSONL log and rendered to a markdown report. The ledger is the artifact.
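The strict validation in step 1 is what keeps the model inside the vetted toolbox. A minimal sketch of the idea follows. The catalog contents, parameter names, ranges, and the `parse_spec` helper are all hypothetical and chosen for illustration; the actual schema lives in spec.py and may differ.

```python
import json
from dataclasses import dataclass

# Hypothetical catalog: signal name -> allowed integer parameter ranges.
CATALOG = {
    "momentum": {"lookback": (5, 252)},
    "ma_crossover": {"fast": (2, 100), "slow": (5, 300)},
}


@dataclass(frozen=True)
class StrategySpec:
    signal: str
    params: dict


def parse_spec(raw: str) -> StrategySpec:
    """Parse model output and reject anything outside the catalog."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict) or set(data) != {"signal", "params"}:
        raise ValueError("spec must have exactly the keys 'signal' and 'params'")
    signal, params = data["signal"], data["params"]
    if not isinstance(signal, str) or signal not in CATALOG:
        raise ValueError(f"unknown signal: {signal!r}")
    allowed = CATALOG[signal]
    if not isinstance(params, dict) or set(params) != set(allowed):
        raise ValueError(f"{signal} expects parameters {sorted(allowed)}")
    for name, (lo, hi) in allowed.items():
        value = params[name]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not lo <= value <= hi:
            raise ValueError(f"{name}={value!r} outside [{lo}, {hi}]")
    return StrategySpec(signal=signal, params=dict(params))
```

Rejecting unknown keys, not just unknown signals, is the design choice that matters here: the model cannot smuggle extra instructions or code into a spec that the harness would silently ignore or act on.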

Why it's credible, not "vibes"

  • The LLM composes from a vetted catalog — it never writes arbitrary trading or backtest code, so it cannot inject look-ahead or leakage-prone transforms.
  • The judge is deterministic and leakage-controlled — chronological splits, train-only scaling, purged & embargoed cross-validation, uniqueness-weighted samples.
  • Multiple testing is penalized explicitly — an automated loop that tries many strategies will find a lucky winner; the deflated Sharpe makes that visible and raises the bar with each trial. This is the single most important honesty mechanism here.
  • Negative results are recorded, not hidden — on data without real structure the loop reports ROC-AUC ≈ 0.5 and keeps saying "revise". That is the correct behavior.
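To make "purged & embargoed cross-validation" concrete, here is a minimal sketch of the split logic. It assumes each event carries a label span given as integer bar indices `t0` (start) and `t1` (end), sorted by start; the function name `purged_kfold` and its signature are hypothetical and not taken from backtest.py.

```python
import numpy as np


def purged_kfold(t0, t1, n_splits=5, embargo=0):
    """Yield (train_idx, test_idx) for events with label spans [t0[i], t1[i]].

    t0, t1: integer bar indices of each label's start and end, sorted by t0.
    embargo: number of bars after the test block that training may not touch.
    """
    t0, t1 = np.asarray(t0), np.asarray(t1)
    for test_idx in np.array_split(np.arange(len(t0)), n_splits):
        test_start = t0[test_idx].min()
        test_end = t1[test_idx].max() + embargo
        # Purge: drop every event whose span overlaps the embargoed test window.
        # Test events overlap by construction, so they never appear in training.
        overlaps = (t0 <= test_end) & (t1 >= test_start)
        train_idx = np.flatnonzero(~overlaps)
        yield train_idx, test_idx
```

Compared with a plain k-fold split, the training set shrinks around every test block. That lost data is the price of not letting a label that resolves inside the test window inform the model being tested on it.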

Runs out of the box (no API key needed)

The loop runs offline by default: with no ANTHROPIC_API_KEY, the propose and critique steps use deterministic heuristic stubs, so you can watch the full loop run immediately. Set the key and llm.offline: false in config.yaml to drive it with Claude.

uv sync
uv run researchloop run-loop          # offline demo on SPY daily data

With Claude:

export ANTHROPIC_API_KEY=sk-ant-...
# set llm.offline: false in config.yaml
uv run researchloop run-loop

Other commands:

uv run researchloop fetch-data                 # download + cache prices
uv run researchloop replay-ledger ledger/run_<id>.jsonl   # re-render report

Everything is controlled from config.yaml (ticker, iteration budget, CV settings, model). The iteration budget directly controls cost and the multiple-testing bar — more iterations means a higher deflated-Sharpe threshold.
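The link between the iteration budget and the bar can be sketched with the usual approximation from the False Strategy Theorem for the expected maximum Sharpe ratio among N unskilled trials. This is an illustrative sketch, not the contents of deflated.py; the function name `expected_max_sharpe` is hypothetical, and `sr_std` (the standard deviation of Sharpe estimates across trials) is an input you would have to estimate.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials, sr_std=1.0):
    """Approximate E[max Sharpe] over n_trials strategies with zero true Sharpe."""
    if n_trials < 2:
        return 0.0  # a single trial has no selection bias to correct
    z = NormalDist().inv_cdf
    return sr_std * (
        (1 - EULER_GAMMA) * z(1 - 1 / n_trials)
        + EULER_GAMMA * z(1 - 1 / (n_trials * math.e))
    )


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 50):
        print(n, round(expected_max_sharpe(n), 3))
```

The threshold grows with the number of trials, roughly like the square root of the logarithm of N, so doubling the budget raises the bar but by less each time.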

Example ledger (offline stub run, illustrative)

#   strategy                   ROC-AUC   Sharpe (gated)   deflated p   verdict
1   momentum(lookback=38)      0.498      0.007           0.61         revise
3   rsi_reversion(11, 29/70)   0.497      0.029           0.53         revise
5   ma_crossover(19, 48)       0.504     −0.028           0.46         revise

Note how the deflated probability falls as more strategies are tried — the same Sharpe becomes less convincing the longer you search. With real Claude and real structure in the data, the proposals and critiques become substantive instead of random.
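A minimal sketch of how such a deflated probability can be computed, following the standard deflated Sharpe ratio formula. It assumes a per-period (non-annualized) Sharpe ratio, and the function name `deflated_sharpe_prob` and its parameters are hypothetical; the repository's own calculation may differ in detail.

```python
import math
from statistics import NormalDist


def deflated_sharpe_prob(sr, sr0, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the benchmark sr0.

    sr:    observed per-period Sharpe ratio of the best strategy
    sr0:   expected maximum Sharpe under the null for the trials run so far
    n_obs: number of return observations behind sr
    skew, kurt: skewness and (non-excess) kurtosis of the returns
    """
    # Standard-error term of the Sharpe estimate, adjusted for non-normality.
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)
```

Holding `sr` fixed while `sr0` rises with each trial pushes the probability down, which is the pattern in the table above. The same formula also shows how the probability can saturate: when `sr` sits above `sr0` and `n_obs` is large, the `sqrt(n_obs - 1)` factor amplifies even a small gap.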

Verification & Known Limitations

This loop was run with real Claude (claude-sonnet-4-6) on SPY (≈7,900 daily bars) and BTC-USD (≈4,300 bars), 4 iterations each. Outcome: all 8 proposed strategies were rejected — no strategy found a real out-of-fold edge (ROC-AUC stayed near 0.5). That is the intended, honest negative-result behavior, and the model did not manufacture a spurious edge.

The critiques were substantive and methodology-aware: they reasoned about break-even hit-rate given the PT/SL asymmetry, used barrier-count diagnostics as evidence against a directional thesis, and compared the observed Sharpe to the expected-max-under-null. On one BTC iteration the model even flagged a blind spot of the harness itself — that the true number of trials is higher than the loop counts.
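The break-even hit-rate argument can be reproduced in a few lines. This sketch assumes every trade exits at either the profit-target or the stop-loss barrier (no time-barrier exits) and ignores costs; the function name `break_even_hit_rate` is hypothetical.

```python
def break_even_hit_rate(profit_target, stop_loss):
    """Win probability p at which p * PT - (1 - p) * SL equals zero.

    Both arguments are positive magnitudes in the same units
    (for example, multiples of the volatility target).
    """
    if profit_target <= 0 or stop_loss <= 0:
        raise ValueError("barrier widths must be positive")
    return stop_loss / (profit_target + stop_loss)
```

With a profit target twice as wide as the stop, a strategy needs to win only one trade in three to break even, so a hit rate near 50% is not automatically uninformative. With the asymmetry reversed, the required hit rate rises to two in three.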

Two honest limitations to read before trusting the output:

  1. The deflated Sharpe is not reliable as a standalone signal on high-per-trade-return assets. On SPY it behaved correctly (the deflated probability fell across trials). On BTC it stayed pinned near 1.0, not because there was edge, but because crypto's large per-trade returns keep the best Sharpe above the expected maximum under the null by a margin that is large relative to its standard error, so the probability saturates. Read it together with ROC-AUC and the economic/barrier reasoning, never alone.
  2. Catalog coverage of the search. An earlier version let the critique's suggestion drive the next iteration verbatim, making the search shallow (it alternated two signals and never tried ma_crossover). PROPOSE now runs every iteration with explicit untested-signal coverage, treating the critique only as a hint — so the loop actually explores the catalog.
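The coverage rule in item 2 can be sketched as a small selection function: untested signals take priority, and the critique's suggestion is honored only when it does not block exploration. This is an illustration of the idea, not the logic in propose.py; `choose_next_signal` and its arguments are hypothetical.

```python
from collections import Counter


def choose_next_signal(catalog, history, hint=None):
    """Pick the next signal family to try.

    catalog: ordered list of allowed signal names
    history: signal names already evaluated, in order
    hint:    the critique's suggested signal, if any
    """
    counts = Counter(history)
    untested = [name for name in catalog if counts[name] == 0]
    if untested:
        # Follow the hint only if it points at something not yet tried.
        return hint if hint in untested else untested[0]
    if hint in catalog:
        return hint  # full coverage reached, so the hint may drive the choice
    return min(catalog, key=lambda name: counts[name])  # fall back to least tried
```

For example, with a three-signal catalog and a history that alternated between two of them, a hint pointing back at an already-tested signal is overridden in favor of the third.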

These findings came out of running the loop on itself, which is the point: the workflow is designed to surface its own weaknesses rather than hide them.

Project Structure

llm-quant-research-loop/
├── config.yaml                 # data, loop budget, model, CV
├── main.py                     # thin CLI wrapper
├── skills/                     # domain knowledge as markdown (self-authored)
│   ├── triple_barrier.md
│   ├── meta_labeling.md
│   ├── leakage_audit.md        # the critique rubric
│   ├── deflated_sharpe.md
│   └── feature_ideas.md
├── src/researchloop/
│   ├── signals.py              # vetted primary-signal catalog
│   ├── features_catalog.py     # vetted meta-feature catalog
│   ├── spec.py                 # structured strategy spec + strict validation
│   ├── backtest.py             # deterministic AFML judge (purged CV)
│   ├── deflated.py             # False Strategy Theorem / deflated Sharpe
│   ├── skills.py               # load skills into context
│   ├── llm.py                  # Anthropic SDK client (+ offline mode)
│   ├── propose.py              # PROPOSE step (LLM + offline stub)
│   ├── critique.py             # CRITIQUE / self-correction (LLM + offline stub)
│   ├── ledger.py               # JSONL log + markdown report
│   ├── loop.py                 # the orchestration loop
│   └── cli.py                  # fetch-data / run-loop / replay-ledger
├── ledger/                     # experiment logs (the artifact)
└── data/                       # cached prices (not tracked)

Tech Stack

Python · Anthropic SDK (Claude) · scikit-learn · pandas · NumPy · SciPy · uv

Next Steps

  • Add combinatorial purged CV to the judge for a stronger overfitting estimate.
  • Let the model propose fractional-differentiation parameters as a feature transform (another vetted catalog entry).
  • Add a bet-sizing step that turns the meta-probability into position size.
  • Use real dollar/volume bars instead of daily time bars.
  • Add prompt-caching of the skills block to cut token cost on long runs.

Honesty note

This repository is an explicit demonstration of AI-assisted research. The code was written with LLM assistance — which is the entire subject. Everything here is designed to be understood and defended by its author: the methodology is standard, the skills are self-authored summaries (not copied from any book), and the statistical guardrails are the point of the project.

Reference

Marcos López de Prado, Advances in Financial Machine Learning, Wiley, 2018.

About

LLM-augmented quant research loop: Claude proposes strategies, a leakage-controlled AFML harness (triple-barrier, purged CV) judges them, and deflated-Sharpe keeps the search honest. After López de Prado.
