This repository contains code and reproducibility artifacts for a controlled
LLM-assisted human-machine delegation simulation. The experiment studies
selective audit as a way to learn a residual correction to a biased base
predictor b0(x) under audit costs.
The main experiment is controlled: LLMs may be used to create realistic task contexts and base predicted quality values, but the latent reward model is synthetic and known to the experiment runner. The main experiment does not use an LLM judge.
configs/ Experiment configs
prompts/ Optional API prompt templates
scripts/ Data, experiment, plotting, and utility scripts
src/ Python package implementation
tests/ Unit tests
docs/ Reproducibility notes, test report, and result notes
artifacts/ Curated small formal figures and summaries for GitHub upload
Large generated round-level logs are intentionally not part of the curated artifact library.
python -m venv .venv
.venv/Scripts/activate # Windows PowerShell users can run: .venv\Scripts\Activate.ps1
pip install -r requirements.txt
python -m pytest
Current local test report: docs/test_report.md.
The fallback path is deterministic and does not require an API key:
python scripts/run_all_local.py --config configs/pilot.yaml
This writes fallback tasks, a controlled dataset, CSV results, and PDF plots
under outputs/.
The preserved formal run is:
formal_strict_1000_twobranch_lambda02
Key settings:
- T = 1000
- domains: billing, refund, technical, compliance
- feature normalization: x_t = phi_raw / sqrt(5)
- normalized-coordinate ridge lambda: 0.2
- raw-equivalent lambda: 1
- beta_0^emp = 0.25
- beta_h^emp = 0.35
- sigma = 0.03
- use_llm_judge = false
- b0 = clip(predicted_quality_raw, 0.2, 0.9)
- Delegate-UCB uses the two-branch audit rule from Algorithm 1 with an empirical hard cap
- final methods: Delegate-UCB, No-audit, Random-audit UCB
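The relationship between the normalized and raw-equivalent ridge lambdas, and the clipping of the base predictor, can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the repository; only the constants (the sqrt(5) scale, lambda = 0.2, and the clip range 0.2 to 0.9) come from the settings above. The reasoning is that dividing features by a scale s multiplies the matching weights by s, so a ridge penalty of lambda in normalized coordinates corresponds to lambda * s^2 in raw coordinates.

```python
import math

# Constants taken from the key settings above.
FEATURE_SCALE = math.sqrt(5)
LAMBDA_NORMALIZED = 0.2
B0_LOW, B0_HIGH = 0.2, 0.9


def normalize_features(phi_raw):
    """Hypothetical helper: x_t = phi_raw / sqrt(5), applied elementwise."""
    return [value / FEATURE_SCALE for value in phi_raw]


def raw_equivalent_lambda(lambda_normalized, scale=FEATURE_SCALE):
    """Ridge penalty in raw coordinates that matches a normalized-coordinate penalty.

    If x = phi / s, then the weights satisfy theta_x = s * theta_phi, so
    lambda * ||theta_x||^2 = (lambda * s^2) * ||theta_phi||^2.
    """
    return lambda_normalized * scale ** 2


def base_prediction(predicted_quality_raw):
    """Hypothetical helper: b0 = clip(predicted_quality_raw, 0.2, 0.9)."""
    return max(B0_LOW, min(B0_HIGH, predicted_quality_raw))


if __name__ == "__main__":
    print(raw_equivalent_lambda(LAMBDA_NORMALIZED))  # approximately 1
    print(base_prediction(0.95), base_prediction(0.05), base_prediction(0.5))
```

Under these assumptions, 0.2 * 5 recovers the raw-equivalent lambda of 1 listed above, up to floating-point rounding.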
Main paper figures are copied to:
artifacts/formal_strict_1000_twobranch_lambda02/figures/
The result summary is documented in:
docs/results/formal_strict_1000_twobranch_lambda02.md
High-level conclusions are in:
docs/results/experiment_conclusions.md
If the existing saved results are present, regenerate plots only with:
python scripts/plot_results.py \
--paper \
--config configs/formal_strict_1000_normalized.yaml \
--results results/formal_strict_1000_twobranch_lambda02/results.csv \
--dataset data/formal_strict_1000_normalized_lambda02/controlled_dataset.csv \
--outdir figures/formal_strict_1000_twobranch_lambda02 \
--main-budget 200 \
--main-methods Delegate-UCB No-audit "Random-audit UCB" \
--appendix-methods Delegate-UCB No-audit "Random-audit UCB"
This command reads existing CSVs only. It does not call APIs, rebuild datasets,
or rerun the simulation. See
docs/reproducibility/reproduce_formal_figures.md.
Full regeneration is optional and should be done only when explicitly needed. If cached LLM task and base-score files exist, the full experiment can be run offline. If they are missing, regeneration is API-dependent.
See:
docs/reproducibility/reproduce_full_formal_experiment.md
Create a local .env only when intentionally running API scripts. Never commit
.env.
cp .env.example .env
# edit .env and add OPENAI_API_KEY=...
The Qwen/OpenAI-compatible scripts read OPENAI_API_KEY from the environment
or .env, cache outputs, support dry runs, and refuse to overwrite existing
outputs unless --overwrite is passed.
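The key lookup and overwrite guard described above could look roughly like the following sketch. It is not the repository's implementation: the helper names (`read_env_file`, `load_api_key`, `check_output_path`) are hypothetical, and the choice to let the process environment take precedence over `.env` is an assumption.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_api_key(env_path=".env", environ=None):
    """Return OPENAI_API_KEY from the environment, else from the .env file.

    Assumption: the environment wins when both are set. The key is returned
    to the caller and never printed.
    """
    environ = os.environ if environ is None else environ
    key = environ.get("OPENAI_API_KEY") or read_env_file(env_path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set in the environment or in .env")
    return key


def check_output_path(path, overwrite=False):
    """Refuse to reuse an existing output path unless overwrite is requested."""
    target = Path(path)
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} exists; pass --overwrite to replace it")
    return target
```

Passing `environ` explicitly keeps the lookup easy to test without touching the real process environment.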
- Empirical UCB uses fixed exploration multipliers, not determinant-based theoretical confidence radii.
- sigma = 0.03 is used for simulated machine audit noise and human feedback noise, but not to set beta.
- Random-audit UCB uses the same UCB routing indices and ridge updates as Delegate-UCB.
- Random-audit UCB selects audit online with probability remaining_budget / remaining_rounds after routing is computed.
- Hard audit caps are empirical budgeted variants; theoretical guarantees are stated for uncapped tau-admissible audit rules.
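The online audit rule for Random-audit UCB can be sketched as follows. This is an illustration only: the function names are hypothetical, routing and ridge updates are omitted, and the sketch assumes every round is eligible for audit, which may not match the repository's simulation.

```python
import random


def audit_probability(remaining_budget, remaining_rounds):
    """Probability of auditing this round: remaining_budget / remaining_rounds."""
    if remaining_budget <= 0 or remaining_rounds <= 0:
        return 0.0
    return min(1.0, remaining_budget / remaining_rounds)


def simulate_random_audits(total_rounds, budget, seed=0):
    """Draw one audit decision per round under a hard audit budget.

    Returns a list of booleans, one per round. Routing is not modeled here;
    in the experiment the audit draw happens after routing is computed.
    """
    rng = random.Random(seed)
    remaining_budget = budget
    decisions = []
    for t in range(total_rounds):
        remaining_rounds = total_rounds - t  # counts the current round
        p = audit_probability(remaining_budget, remaining_rounds)
        audit = rng.random() < p
        if audit:
            remaining_budget -= 1
        decisions.append(audit)
    return decisions


if __name__ == "__main__":
    decisions = simulate_random_audits(total_rounds=1000, budget=200)
    print(sum(decisions))
```

Under the stated assumption, this rule never exceeds the budget, and it spends the budget exactly whenever the budget is at most the number of rounds: the probability reaches 1 once the remaining budget equals the remaining rounds and drops to 0 once the budget is exhausted.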
- .env and .env.* are ignored.
- .env.example is safe to track.
- API keys are never hard-coded.
- API scripts do not print keys or request headers.