Releases: weich97/TreLLM

v0.2.0

22 May 05:15

v0.2.0: Frozen Benchmark Protocol And Reproduction Pack

This is the first protocol-focused TradeArena release. It freezes the v0.2 benchmark spec, draws explicit boundaries between engineering, benchmark, and scientific claims, and ships an external reproduction pack that runs without model keys.

Highlights

  • Frozen v0.2 benchmark spec with canonical spec hashing.
  • Claim boundary badge and public claim-boundary policy.
  • One-command external reproduction pack with command logs, environment metadata, artifact hashes, trajectory hash, and provenance flags.
  • Expanded classical baselines and failure-autopsy tooling.
  • Public notes for execution calibration priorities and known limitations.

One-command reproduction

python scripts/run_external_reproduction_pack.py --output-dir outputs/reproduction/v0_2

Expected no-key trajectory reproducibility hash:

sha256:bf3b1084aeec89f3bf0f99ab91b6c16a989dc8c8a29d9e93c8c72109548e442f

Canonical v0.2 benchmark spec hash:

sha256:a777cdfb962a07e658996c9366070d4b0ffb867659c2ccc45685a5c788bf6204
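
As a rough illustration of what canonical spec hashing can look like, the sketch below serializes a spec dictionary to JSON with sorted keys and fixed separators, then hashes the UTF-8 bytes. The function name `canonical_spec_hash`, the example spec fields, and the canonicalization rules are assumptions made for illustration. TradeArena's actual canonical form may differ, so this is not expected to reproduce the hash above.

```python
import hashlib
import json


def canonical_spec_hash(spec: dict) -> str:
    """Hash a spec dict independently of key order (illustrative only)."""
    # Sorted keys and fixed separators make the serialized bytes independent
    # of dict insertion order and of whitespace choices.
    canonical = json.dumps(
        spec, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical spec fields, not taken from the real v0.2 spec.
spec_a = {"version": "0.2", "seeds": [0, 1, 2], "universe": "demo"}
spec_b = {"universe": "demo", "seeds": [0, 1, 2], "version": "0.2"}
spec_c = {"version": "0.2", "seeds": [0, 1, 3], "universe": "demo"}

# Same content in a different key order hashes identically;
# any change to a value produces a different hash.
print(canonical_spec_hash(spec_a) == canonical_spec_hash(spec_b))
print(canonical_spec_hash(spec_a) == canonical_spec_hash(spec_c))
```

The point of freezing a spec this way is that two parties can compare a single short string instead of diffing configuration files.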

Official package hashes

File                                          SHA-256
tradearena_benchmark-0.2.0-py3-none-any.whl   sha256:2d21b11554100a9c52fd3b934e2919976e7e5ce4f2912aa7df0ff9110eda621e
tradearena_benchmark-0.2.0.tar.gz             sha256:25d0fc6a58914558e3197a17d85ed64dd754e67a09d4aa176c48f7a8544a2568
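
To check downloaded packages against the hashes listed above, something like the following sketch should work. It uses only the standard library. The helper names and the `dist` download directory are assumptions, not part of the release.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Published hashes copied from the release notes above.
EXPECTED = {
    "tradearena_benchmark-0.2.0-py3-none-any.whl":
        "sha256:2d21b11554100a9c52fd3b934e2919976e7e5ce4f2912aa7df0ff9110eda621e",
    "tradearena_benchmark-0.2.0.tar.gz":
        "sha256:25d0fc6a58914558e3197a17d85ed64dd754e67a09d4aa176c48f7a8544a2568",
}


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return 'sha256:<hex>' for a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def matches_published_hash(path: Path, expected: str) -> bool:
    """Compare a local file against a published 'sha256:<hex>' string."""
    return sha256_of_file(path) == expected.strip().lower()


def check_downloads(directory: Path) -> Dict[str, bool]:
    """Check every expected file that is present in `directory`."""
    return {
        name: matches_published_hash(directory / name, expected)
        for name, expected in EXPECTED.items()
        if (directory / name).exists()
    }


# "dist" is a hypothetical download location; missing files are skipped.
for name, ok in check_downloads(Path("dist")).items():
    print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```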

Known limitations

  • The no-key reproduction pack is an engineering reproducibility target, not a model-skill claim.
  • Provider-backed model rows remain sensitive to provider routing, prompts, rate limits, cache provenance, and model-version drift.
  • The default execution simulator is a stress-test simulator, not a calibrated venue-level quote/order-book/fill replay.
  • Scientific claims require repeated seeds or rolling windows, non-LLM baselines, statistical intervals, failure autopsy, and independent reproduction reports.
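
One common way to attach a statistical interval to repeated-seed results is a percentile bootstrap over per-seed metrics. The sketch below is a generic illustration of that idea, not TradeArena's evaluation code. The function name and the per-seed return values are made up for the example.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    # Resample the per-seed metrics with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int(round((alpha / 2) * n_resamples))]
    upper = means[int(round((1 - alpha / 2) * n_resamples)) - 1]
    return lower, upper


# Hypothetical per-seed returns for one agent (illustrative numbers only).
per_seed_returns = [0.012, -0.004, 0.021, 0.007, 0.015, -0.010, 0.009, 0.003]

low, high = bootstrap_mean_interval(per_seed_returns)
print(f"mean={statistics.fmean(per_seed_returns):.4f}, "
      f"95% interval=({low:.4f}, {high:.4f})")
```

If the interval for an LLM agent overlaps heavily with that of a non-LLM baseline, the two are hard to tell apart on this metric, which is why a single-seed comparison is weak evidence.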

PyPI: https://pypi.org/project/tradearena-benchmark/0.2.0/

v0.1.2

17 May 12:57

Full Changelog: v0.1.1...v0.1.2

v0.1.1: High-Spread Execution Stress Preset

17 May 08:29

TradeArena v0.1.1 is a small maintenance release focused on making execution
realism easier to inspect and reproduce.

Highlights

  • Added an explicit spread_bps parameter to the realistic order simulator.
  • Added a high_spread row to examples/execution_realism_sweep_demo.py.
  • The high-spread preset models market orders crossing half the quoted
    bid-ask spread before market impact and volatility slippage.
  • The execution sweep now emits spread configuration fields into its JSON and
    CSV artifacts.
  • Added tests covering spread-driven crossing cost and the high-spread demo
    row.

Why It Matters

The preset separates spread cost from generic slippage. This makes it easier
to show that an agent can keep a high fill rate while still losing realized
performance to wide quoted markets.
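
To make the separation of spread cost from other slippage concrete, here is a minimal sketch of how a market order's fill price could be built up: half the quoted spread first, then any extra slippage on top. Only `spread_bps` is named in these notes. The function name, the `extra_slippage_bps` parameter, and the numbers are hypothetical, and the real simulator's formula may differ.

```python
def market_fill_price(mid_price, side, spread_bps, extra_slippage_bps=0.0):
    """Illustrative fill price for a market order that crosses half the spread."""
    if side not in ("buy", "sell"):
        raise ValueError("side must be 'buy' or 'sell'")
    direction = 1.0 if side == "buy" else -1.0
    half_spread = spread_bps / 2.0 / 10_000.0
    slippage = extra_slippage_bps / 10_000.0
    # Crossing half the spread moves a buy up to the ask and a sell down
    # to the bid; impact or volatility slippage is applied afterwards.
    touch_price = mid_price * (1.0 + direction * half_spread)
    return touch_price * (1.0 + direction * slippage)


mid = 100.0
buy_fill = market_fill_price(mid, "buy", spread_bps=40.0)
sell_fill = market_fill_price(mid, "sell", spread_bps=40.0)
round_trip_cost_bps = (buy_fill - sell_fill) / mid * 10_000.0

print(f"buy fill:  {buy_fill:.2f}")    # 100.20
print(f"sell fill: {sell_fill:.2f}")   # 99.80
print(f"round-trip cost: {round_trip_cost_bps:.1f} bps")  # 40.0 bps
```

Both orders fill completely in this toy example, yet a buy followed by a sell at an unchanged mid loses about 40 bps, which is the fill-rate-versus-realized-performance gap described above.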

Reproduce

python -m pip install -e ".[dev]"
python examples/execution_realism_sweep_demo.py
python scripts/run_showcase.py --reuse-existing
python -m pytest tests -q

Related Issue

  • Closes #3.

v0.1.0: Auditable benchmark release for LLM trading agents

17 May 02:29

TradeArena v0.1.0 is the first public benchmark release for evaluating LLM
trading agents as auditable decision-making systems under realistic market
constraints.

Highlights

  • Quickstart showcase: run python scripts/run_showcase.py to generate a
    local demo portal without model keys or live market-data downloads.
  • Captioned demo video: watch or regenerate a 3-minute walkthrough of the
    showcase portal, audit report, execution realism, extension walkthrough, and
    retail planning sandbox. Browser playback is available at
    https://weich97.github.io/TradeArena/demo_video.html.
  • Replayable audit trajectories: every decision records observation,
    signals, intended allocation, risk-gate changes, orders, fills/rejections,
    portfolio state, memory, and reproducibility metadata.
  • Execution realism: built-in simulator models fees, slippage, latency,
    liquidity constraints, partial fills, pending orders, and rejections.
  • Risk lifecycle: pre-trade gates, in-trade monitors, post-trade
    attribution, suitability checks, and risk-violation logs are first-class
    artifacts.
  • Hands-on extensions: examples cover custom analysts, custom risk modules,
    custom evaluators, A-share rules, AkShare CSV reuse, retail planning, and
    paper rebalance reports.
  • Research-grade diagnostics: tracked artifacts show representation
    signatures, crisis-scene probes, feedback-alignment diagnostics, and 51-stock
    intraday portfolio behavior without exposing raw provider prompt/response
    caches.
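
As a rough picture of what one step of a replayable audit trajectory might contain, the sketch below defines a record with the fields listed in the highlights and round-trips it through JSON Lines. The class name, field names, and storage format are assumptions for illustration and are not TradeArena's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class DecisionRecord:
    """One decision step in an audit trajectory (hypothetical schema)."""
    step: int
    observation: Dict[str, Any]
    signals: Dict[str, float]
    intended_allocation: Dict[str, float]
    risk_gate_changes: List[str] = field(default_factory=list)
    orders: List[Dict[str, Any]] = field(default_factory=list)
    fills: List[Dict[str, Any]] = field(default_factory=list)
    rejections: List[Dict[str, Any]] = field(default_factory=list)
    portfolio_state: Dict[str, float] = field(default_factory=dict)
    memory: List[str] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_jsonl(records: List[DecisionRecord]) -> str:
    """Serialize records one per line so a run can be replayed step by step."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def from_jsonl(text: str) -> List[DecisionRecord]:
    """Rebuild records from JSON Lines text."""
    return [DecisionRecord(**json.loads(line))
            for line in text.splitlines() if line.strip()]


# Illustrative values only.
record = DecisionRecord(
    step=0,
    observation={"symbol": "DEMO", "close": 101.5},
    signals={"momentum": 0.3},
    intended_allocation={"DEMO": 0.5, "cash": 0.5},
    risk_gate_changes=["capped DEMO weight from 0.5 to 0.4"],
    orders=[{"symbol": "DEMO", "side": "buy", "quantity": 10}],
    metadata={"seed": 0},
)

replayed = from_jsonl(to_jsonl([record]))
print(replayed[0] == record)
```

Keeping the intended allocation and the risk-gate changes as separate fields is what lets a reader see what the agent wanted to do versus what it was allowed to do.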

Quick Start

python -m pip install -e ".[dev]"
python scripts/run_showcase.py

Open:

outputs/examples/showcase.html

What This Release Is Not

TradeArena is not a live trading bot and does not promise profitable trading.
It is a benchmark, simulation, and audit framework for studying whether LLM
trading agents can be reproduced, inspected, risk-gated, and evaluated under
realistic constraints.
