Delegate-UCB Controlled Simulation

This repository contains code and reproducibility artifacts for a controlled LLM-assisted human-machine delegation simulation. The experiment studies selective audit as a way to learn a residual correction to a biased base predictor b0(x) under audit costs.

The main experiment is controlled: LLMs may be used to create realistic task contexts and base predicted quality values, but the latent reward model is synthetic and known to the experiment runner. The main experiment does not use an LLM judge.

Repository Structure

configs/       Experiment configs
prompts/       Optional API prompt templates
scripts/       Data, experiment, plotting, and utility scripts
src/           Python package implementation
tests/         Unit tests
docs/          Reproducibility notes, test report, and result notes
artifacts/     Curated small formal figures and summaries for GitHub upload

Large generated round-level logs are intentionally not part of the curated artifact library.

Setup

python -m venv .venv
source .venv/bin/activate  # macOS/Linux; on Windows PowerShell run: .venv\Scripts\Activate.ps1
pip install -r requirements.txt

Tests

python -m pytest

The current local test report is in docs/test_report.md.

Local Fallback Smoke Run

The fallback path is deterministic and does not require an API key:

python scripts/run_all_local.py --config configs/pilot.yaml

This writes fallback tasks, a controlled dataset, CSV results, and PDF plots under outputs/.

Formal Experiment Artifacts

The preserved formal run is:

formal_strict_1000_twobranch_lambda02

Key settings:

T = 1000
domains: billing, refund, technical, compliance
feature normalization: x_t = phi_raw / sqrt(5)
normalized-coordinate ridge lambda: 0.2
raw-equivalent lambda: 1
beta_0^emp = 0.25
beta_h^emp = 0.35
sigma = 0.03
use_llm_judge = false
b0 = clip(predicted_quality_raw, 0.2, 0.9)
Delegate-UCB uses the two-branch audit rule from Algorithm 1 with an empirical hard cap
final methods: Delegate-UCB, No-audit, Random-audit UCB
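
The normalization, raw-equivalent lambda, and b0 clipping listed above can be checked with a short sketch. This is an illustration only, not the repository's code: the function names are hypothetical, and it assumes the ridge penalty has the form lambda * ||theta||^2, so that dividing features by sqrt(5) multiplies the matching weights by sqrt(5).

```python
import math

SCALE = math.sqrt(5.0)  # x_t = phi_raw / sqrt(5), from the key settings


def normalize(phi_raw):
    """Map raw features to normalized coordinates."""
    return [v / SCALE for v in phi_raw]


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def base_prediction(predicted_quality_raw):
    """b0 = clip(predicted_quality_raw, 0.2, 0.9)."""
    return max(0.2, min(0.9, predicted_quality_raw))


def raw_equivalent_lambda(lambda_norm, scale=SCALE):
    """Ridge weight in raw coordinates that matches lambda_norm.

    Predictions are unchanged when theta_norm = scale * theta_raw, so
    lambda_norm * ||theta_norm||^2 = (lambda_norm * scale**2) * ||theta_raw||^2.
    """
    return lambda_norm * scale ** 2


if __name__ == "__main__":
    # Approximately 1, matching "raw-equivalent lambda: 1" (up to float rounding).
    print(raw_equivalent_lambda(0.2))
    # Values outside [0.2, 0.9] are clipped to the nearest bound.
    print(base_prediction(0.95), base_prediction(0.05))
```

Under that assumption, the normalized-coordinate setting lambda = 0.2 and the raw-coordinate setting lambda = 1 describe the same regularized fit.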

Main paper figures are copied to:

artifacts/formal_strict_1000_twobranch_lambda02/figures/

The result summary is documented in:

docs/results/formal_strict_1000_twobranch_lambda02.md

High-level conclusions are in:

docs/results/experiment_conclusions.md

Regenerate Formal Figures Without API Calls

If the saved results are present, regenerate only the plots with:

python scripts/plot_results.py \
  --paper \
  --config configs/formal_strict_1000_normalized.yaml \
  --results results/formal_strict_1000_twobranch_lambda02/results.csv \
  --dataset data/formal_strict_1000_normalized_lambda02/controlled_dataset.csv \
  --outdir figures/formal_strict_1000_twobranch_lambda02 \
  --main-budget 200 \
  --main-methods Delegate-UCB No-audit "Random-audit UCB" \
  --appendix-methods Delegate-UCB No-audit "Random-audit UCB"

This command reads existing CSVs only. It does not call APIs, rebuild datasets, or rerun the simulation. See docs/reproducibility/reproduce_formal_figures.md.

Optional Full Formal Regeneration

Full regeneration is optional and should be done only when explicitly needed. If cached LLM task and base-score files exist, the full experiment can be run offline. If they are missing, regeneration is API-dependent.

See:

docs/reproducibility/reproduce_full_formal_experiment.md

Optional API Generation

Create a local .env only when intentionally running API scripts. Never commit .env.

cp .env.example .env
# edit .env and add OPENAI_API_KEY=...

The Qwen/OpenAI-compatible scripts read OPENAI_API_KEY from the environment or .env, cache outputs, support dry runs, and refuse to overwrite existing outputs unless --overwrite is passed.
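
The key lookup and overwrite guard described above might look roughly like the following. This is a minimal sketch, not the repository's implementation: read_api_key and write_output are hypothetical names, and the .env parsing is a simplified stand-in that handles only plain NAME=value lines.

```python
import os
from pathlib import Path


def read_api_key(dotenv_path=".env", environ=None):
    """Return OPENAI_API_KEY from the environment, else from a .env file."""
    environ = os.environ if environ is None else environ
    key = environ.get("OPENAI_API_KEY")
    if key:
        return key
    path = Path(dotenv_path)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == "OPENAI_API_KEY":
            return value.strip().strip('"').strip("'") or None
    return None


def write_output(path, text, overwrite=False):
    """Write a cached output, refusing to replace an existing file by default."""
    path = Path(path)
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} exists; pass --overwrite to replace it")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
```

The sketch never prints or logs the key, consistent with the Security notes in this README.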

Implementation Notes

Empirical UCB uses fixed exploration multipliers, not determinant-based theoretical confidence radii.
sigma = 0.03 is used for simulated machine audit noise and human feedback noise, but not to set beta.
Random-audit UCB uses the same UCB routing indices and ridge updates as Delegate-UCB.
Random-audit UCB decides online whether to audit, with probability remaining_budget / remaining_rounds, after routing is computed.
Hard audit caps are empirical budgeted variants; theoretical guarantees are stated for uncapped tau-admissible audit rules.
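
The Random-audit UCB audit rule in the notes above can be sketched as follows. This is an illustration, not the repository's code: the function names are hypothetical, and it assumes the budget is counted in audits, with each audit consuming one unit.

```python
import random


def random_audit_decision(remaining_budget, remaining_rounds, rng):
    """Audit with probability remaining_budget / remaining_rounds."""
    if remaining_budget <= 0 or remaining_rounds <= 0:
        return False
    p = min(1.0, remaining_budget / remaining_rounds)
    return rng.random() < p


def simulate(total_rounds, budget, seed=0):
    """Count audits over one run; routing itself is omitted from this sketch."""
    rng = random.Random(seed)
    remaining_budget = budget
    audits = 0
    for t in range(total_rounds):
        remaining_rounds = total_rounds - t
        if random_audit_decision(remaining_budget, remaining_rounds, rng):
            remaining_budget -= 1  # assumption: one audit costs one budget unit
            audits += 1
    return audits


if __name__ == "__main__":
    # T = 1000 is from the key settings; 200 mirrors --main-budget and is
    # treated here as an audit count, which is an assumption.
    print(simulate(total_rounds=1000, budget=200))
```

Under these assumptions the rule spends the budget exactly whenever the budget does not exceed the number of rounds: the probability reaches 1 once the remaining budget equals the remaining rounds, and it drops to 0 once the budget is used up.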

Security

.env and .env.* are ignored.
.env.example is safe to track.
API keys are never hard-coded.
API scripts do not print keys or request headers.
