Releases: weich97/TreLLM
v0.2.0
v0.2.0: Frozen Benchmark Protocol And Reproduction Pack
This is the first protocol-focused TradeArena release. It freezes the v0.2 benchmark spec, separates engineering/benchmark/scientific claim boundaries, and ships a no-key external reproduction pack.
Highlights
- Frozen v0.2 benchmark spec with canonical spec hashing.
- Claim boundary badge and public claim-boundary policy.
- One-command external reproduction pack with command logs, environment metadata, artifact hashes, trajectory hash, and provenance flags.
- Expanded classical baselines and failure-autopsy tooling.
- Public notes for execution calibration priorities and known limitations.
One-command reproduction
python scripts/run_external_reproduction_pack.py --output-dir outputs/reproduction/v0_2
Expected no-key trajectory reproducibility hash:
sha256:bf3b1084aeec89f3bf0f99ab91b6c16a989dc8c8a29d9e93c8c72109548e442f
Canonical v0.2 benchmark spec hash:
sha256:a777cdfb962a07e658996c9366070d4b0ffb867659c2ccc45685a5c788bf6204
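As a rough illustration of what canonical spec hashing can look like, the sketch below serializes a spec dictionary to JSON with sorted keys and fixed separators, then hashes the bytes. The canonicalization scheme, the function name `canonical_spec_hash`, and the spec fields are all assumptions for illustration. The release notes do not describe how TradeArena derives its spec hash, so this sketch is not expected to reproduce the value above.

```python
import hashlib
import json


def canonical_spec_hash(spec: dict) -> str:
    """Hash a spec dict through a canonical JSON encoding (assumed scheme)."""
    # Sorted keys and fixed separators make the encoded bytes independent
    # of key insertion order and incidental whitespace.
    canonical = json.dumps(
        spec, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical spec fields, for illustration only.
spec_a = {"version": "0.2", "universe": ["AAA", "BBB"], "seed": 7}
spec_b = {"seed": 7, "universe": ["AAA", "BBB"], "version": "0.2"}

# Same content in a different key order yields the same hash.
assert canonical_spec_hash(spec_a) == canonical_spec_hash(spec_b)
```

The point of canonicalizing before hashing is that two semantically identical specs always produce the same digest, while any change to a field value produces a different one.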
Official package hashes
| File | SHA-256 |
|---|---|
| tradearena_benchmark-0.2.0-py3-none-any.whl | sha256:2d21b11554100a9c52fd3b934e2919976e7e5ce4f2912aa7df0ff9110eda621e |
| tradearena_benchmark-0.2.0.tar.gz | sha256:25d0fc6a58914558e3197a17d85ed64dd754e67a09d4aa176c48f7a8544a2568 |
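One way to check a downloaded package against the hashes above is to compute its SHA-256 locally with the Python standard library. This is a minimal sketch; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical and not part of the TradeArena package.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return 'sha256:<hex digest>' for the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def matches_published_hash(path, published):
    """Compare a local file's digest with a published 'sha256:...' string."""
    return sha256_of_file(path) == published.strip().lower()
```

For example, `matches_published_hash("tradearena_benchmark-0.2.0.tar.gz", "sha256:25d0fc6a58914558e3197a17d85ed64dd754e67a09d4aa176c48f7a8544a2568")` should return `True` for an unmodified copy of the source archive.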
Known limitations
- The no-key reproduction pack is an engineering reproducibility target, not a model-skill claim.
- Provider-backed model rows remain sensitive to provider routing, prompts, rate limits, cache provenance, and model-version drift.
- The default execution simulator is a stress-test simulator, not a calibrated venue-level quote/order-book/fill replay.
- Scientific claims require repeated seeds or rolling windows, non-LLM baselines, statistical intervals, failure autopsy, and independent reproduction reports.
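To make the "repeated seeds" and "statistical intervals" requirement concrete, the sketch below computes a percentile bootstrap interval for the mean of per-seed results. The function name `bootstrap_mean_interval` and the sample values are placeholders invented for illustration, not TradeArena outputs, and the release notes do not say which interval method the project uses.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Placeholder per-seed returns, made up for illustration only.
per_seed_returns = [0.012, -0.004, 0.007, 0.015, -0.001, 0.009]
low, high = bootstrap_mean_interval(per_seed_returns)
print(f"mean={statistics.fmean(per_seed_returns):.4f} interval=({low:.4f}, {high:.4f})")
```

An interval that comfortably includes zero, or that overlaps the interval of a non-LLM baseline, is a signal that a single-run difference should not be reported as a skill claim.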
v0.1.2
Full Changelog: v0.1.1...v0.1.2
v0.1.1: High-Spread Execution Stress Preset
TradeArena v0.1.1 is a small maintenance release focused on making execution
realism easier to inspect and reproduce.
Highlights
- Added an explicit `spread_bps` parameter to the realistic order simulator.
- Added a `high_spread` row to `examples/execution_realism_sweep_demo.py`.
- The high-spread preset models market orders crossing half the quoted
  bid-ask spread before market impact and volatility slippage.
- The execution sweep now emits spread configuration fields into its JSON and
  CSV artifacts.
- Added tests covering spread-driven crossing cost and the high-spread demo
  row.
Why It Matters
The preset separates spread cost from generic slippage. This makes it easier
to show that an agent can keep a high fill rate while still losing realized
performance to wide quoted markets.
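The spread cost described above can be sketched as a half-spread adjustment around the mid price, applied before any impact or slippage terms. Only the `spread_bps` parameter name comes from these notes. The function names and the exact formula are assumptions, and the simulator's actual implementation may differ.

```python
def spread_crossing_price(mid_price: float, spread_bps: float, side: str) -> float:
    """Price a market order pays or receives after crossing half the quoted spread."""
    # spread_bps is the full quoted bid-ask spread in basis points of mid.
    half_spread = mid_price * (spread_bps / 10_000.0) / 2.0
    if side == "buy":
        return mid_price + half_spread  # buyer lifts the ask
    if side == "sell":
        return mid_price - half_spread  # seller hits the bid
    raise ValueError(f"side must be 'buy' or 'sell', got {side!r}")


def crossing_cost(mid_price: float, spread_bps: float, quantity: float) -> float:
    """Cash cost of crossing half the spread on `quantity` units."""
    return abs(quantity) * mid_price * (spread_bps / 10_000.0) / 2.0


# With a 50 bps quoted spread on a 100.00 mid, a buy fills 0.25 above mid.
buy_price = spread_crossing_price(100.0, 50.0, "buy")
cost = crossing_cost(100.0, 50.0, 10)
```

Because this cost applies to every filled market order regardless of size, an agent can show a high fill rate and still give up realized performance when the quoted spread is wide.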
Reproduce
python -m pip install -e ".[dev]"
python examples/execution_realism_sweep_demo.py
python scripts/run_showcase.py --reuse-existing
python -m pytest tests -q
Related Issue
- Closes #3.
v0.1.0: Auditable benchmark release for LLM trading agents
TradeArena v0.1.0 is the first public benchmark release for evaluating LLM
trading agents as auditable decision-making systems under realistic market
constraints.
Highlights
- Quickstart showcase: run `python scripts/run_showcase.py` to generate a
  local demo portal without model keys or live market-data downloads.
- Captioned demo video: watch or regenerate a 3-minute walkthrough of the
  showcase portal, audit report, execution realism, extension walkthrough, and
  retail planning sandbox. Browser playback is available at
  https://weich97.github.io/TradeArena/demo_video.html.
- Replayable audit trajectories: every decision records observation,
  signals, intended allocation, risk-gate changes, orders, fills/rejections,
  portfolio state, memory, and reproducibility metadata.
- Execution realism: built-in simulator models fees, slippage, latency,
  liquidity constraints, partial fills, pending orders, and rejections.
- Risk lifecycle: pre-trade gates, in-trade monitors, post-trade
  attribution, suitability checks, and risk-violation logs are first-class
  artifacts.
- Hands-on extensions: examples cover custom analysts, custom risk modules,
  custom evaluators, A-share rules, AkShare CSV reuse, retail planning, and
  paper rebalance reports.
- Research-grade diagnostics: tracked artifacts show representation
  signatures, crisis-scene probes, feedback-alignment diagnostics, and 51-stock
  intraday portfolio behavior without exposing raw provider prompt/response
  caches.
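To give a feel for what a replayable audit trajectory might contain, the sketch below models one decision record with the fields named in the highlights and round-trips a trajectory through JSON Lines. The class name `DecisionRecord`, the helper functions, and the field types are hypothetical. The actual TradeArena schema is not described in these notes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DecisionRecord:
    """One audited decision step (hypothetical schema, for illustration)."""
    step: int
    observation: dict
    signals: dict
    intended_allocation: dict
    risk_gate_changes: list = field(default_factory=list)
    orders: list = field(default_factory=list)
    fills: list = field(default_factory=list)
    rejections: list = field(default_factory=list)
    portfolio_state: dict = field(default_factory=dict)
    memory: dict = field(default_factory=dict)
    reproducibility: dict = field(default_factory=dict)


def dump_trajectory(records) -> str:
    """Serialize records as JSON Lines, one decision per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def load_trajectory(text: str):
    """Rebuild records from JSON Lines so the run can be replayed or diffed."""
    return [DecisionRecord(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Made-up example values, not taken from any TradeArena run.
record = DecisionRecord(
    step=0,
    observation={"AAA": {"close": 10.0}},
    signals={"AAA": 0.3},
    intended_allocation={"AAA": 0.5},
    risk_gate_changes=[{"rule": "max_weight", "AAA": 0.4}],
    reproducibility={"seed": 7},
)
assert load_trajectory(dump_trajectory([record])) == [record]
```

Keeping intended allocation and risk-gate changes as separate fields lets a reviewer see what the agent wanted to do and what the risk layer actually allowed.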
Quick Start
python -m pip install -e ".[dev]"
python scripts/run_showcase.py
Open:
outputs/examples/showcase.html
What This Release Is Not
TradeArena is not a live trading bot and does not promise profitable trading.
It is a benchmark, simulation, and audit framework for studying whether LLM
trading agents can be reproduced, inspected, risk-gated, and evaluated under
realistic constraints.
Suggested GitHub Release Title
v0.1.0: Auditable benchmark release for LLM trading agents