Code implementation for the paper "SVR-MAD: A Bayesian-Inspired Framework for Posterior-Guided Multi-Agent Debate".
Our implementation uses AWS Bedrock. Create an api-keys.json file under the repository root with the following content:
{
"AWS_BEDROCK_API": "<paste your Bedrock Mantle token here>"
}
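As a quick sanity check before launching any run, the file can be read back and validated. The sketch below is only an illustration: the helper name load_bedrock_token is hypothetical, and the repository's scripts may load the key differently.

```python
import json
from pathlib import Path


def load_bedrock_token(path="api-keys.json"):
    """Return the token stored under AWS_BEDROCK_API (hypothetical helper)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    token = str(data.get("AWS_BEDROCK_API", "")).strip()
    # Reject an empty value or the unedited "<paste ...>" placeholder.
    if not token or token.startswith("<"):
        raise ValueError(f"AWS_BEDROCK_API is missing or still a placeholder in {path}")
    return token
```

Calling load_bedrock_token() from the repository root raises a ValueError if the placeholder text was never replaced.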
Source datasets are under data/:
- IMO-AnswerBench (~400 problems): download https://github.com/google-deepmind/superhuman/blob/main/imobench/answerbench_v2.csv and place it at data/imobench/answerbench_v2.csv.
- HLE (~424 text-only multiple-choice non-math problems): gated on HuggingFace. Accept terms at https://huggingface.co/datasets/cais/hle, then:
  cd data && python download_hle.py
  This writes data/hle/text_mc_nomath.csv.
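To confirm both files landed where the runners expect them, a small check like the one below can help. It is a sketch under assumptions: check_datasets is a hypothetical helper, and it assumes each CSV has a header row so that csv.DictReader counts only data rows.

```python
import csv
from pathlib import Path

# Paths taken from the dataset instructions above.
EXPECTED = {
    "imobench": Path("data/imobench/answerbench_v2.csv"),
    "hle": Path("data/hle/text_mc_nomath.csv"),
}


def count_rows(path):
    """Count data rows in a CSV file, assuming the first row is a header."""
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.DictReader(f))


def check_datasets(root="."):
    """Map each dataset name to its row count, or None if the file is missing."""
    report = {}
    for name, rel in EXPECTED.items():
        p = Path(root) / rel
        report[name] = count_rows(p) if p.is_file() else None
    return report
```

Running check_datasets() from the repository root should report roughly 400 rows for imobench and roughly 424 for hle; a None entry means the file is not in place.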
All debate runners accept --only-non-unanimous to skip questions whose round-0 answers already agree across the N agents.
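The filter behind this flag can be pictured as follows. This is an illustrative sketch, not the repository's code: the function names are hypothetical, and the simple strip/lowercase normalization is an assumption (the actual runners may compare answers differently, for example after math-answer conversion).

```python
def is_unanimous(answers):
    """True when all N agents gave the same round-0 answer after normalization."""
    normalized = {a.strip().lower() for a in answers}
    return len(normalized) == 1


def select_non_unanimous(round0):
    """Keep only the question ids whose round-0 answers disagree.

    round0 maps a question id to the list of answers from the N agents.
    """
    return [qid for qid, answers in round0.items() if not is_unanimous(answers)]


round0 = {
    "q1": ["B", "B", "b ", "B", "B", "B"],  # all agree: would be skipped
    "q2": ["A", "C", "A", "A", "D", "A"],   # disagreement: would be debated
}
print(select_non_unanimous(round0))
```

Skipping unanimous questions saves debate calls on items where no agent has a dissenting answer to argue for.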
Please contact the authors for existing run logs. To regenerate them from scratch (this requires Bedrock API calls), follow the steps below in order:
python generate_predebate.py --model gpt-oss-120b --dataset imobench --n 6 --seed-base 0
python generate_predebate.py --model gpt-oss-120b --dataset hle --n 6 --seed-base 0 --concat-reasoning
python generate_predebate.py --model deepseek-v3.1 --dataset imobench --n 6 --seed-base 0 --reasoning-effort medium
python generate_predebate.py --model deepseek-v3.1 --dataset hle --n 6 --seed-base 0 --reasoning-effort medium --concat-reasoning

# gpt-oss x IMO
for algo in sid_a2a s2_mad; do
python $algo.py --predebate-dir runs/imobench/gpt-oss-120b/predebate_n6 --d 2 --convert-math-answers
done
python groupdebate.py --predebate-dir runs/imobench/gpt-oss-120b/predebate_n6 --d 2 --groups 3,3 --convert-math-answers
# gpt-oss x HLE
for algo in sid_a2a s2_mad; do
python $algo.py --predebate-dir runs/hle/gpt-oss-120b/predebate_n6 --d 2 --include-reasoning
done
python groupdebate.py --predebate-dir runs/hle/gpt-oss-120b/predebate_n6 --d 2 --groups 3,3 --include-reasoning
# DeepSeek x IMO
for algo in sid_a2a s2_mad; do
python $algo.py --predebate-dir runs/imobench/deepseek-v3.1/predebate_n6 --d 2 --peer-header math_full_strong --convert-math-answers
done
python groupdebate.py --predebate-dir runs/imobench/deepseek-v3.1/predebate_n6 --d 2 --groups 3,3 --peer-header math_full_strong --convert-math-answers
# DeepSeek x HLE
for algo in sid_a2a s2_mad; do
python $algo.py --predebate-dir runs/hle/deepseek-v3.1/predebate_n6 --d 2 --include-reasoning --peer-header mc_full_strong
done
python groupdebate.py --predebate-dir runs/hle/deepseek-v3.1/predebate_n6 --d 2 --groups 3,3 --include-reasoning --peer-header mc_full_strong

# gpt-oss x IMO
python svr_mad.py --predebate-dir runs/imobench/gpt-oss-120b/predebate_n6 \
--P 2 --prior-signal perplexity --peer-signal perplexity \
--tiebreak layered_mv --convert-math-answers
# gpt-oss x HLE
python svr_mad.py --predebate-dir runs/hle/gpt-oss-120b/predebate_n6 \
--P 2 --prior-signal perplexity --peer-signal perplexity \
--tiebreak layered_mv --include-reasoning
# DeepSeek x IMO
python svr_mad.py --predebate-dir runs/imobench/deepseek-v3.1/predebate_n6 \
--P 3 --prior-signal min_logprob --peer-signal min_logprob \
--peer-header math_full_strong --tiebreak layered_mv --convert-math-answers
# DeepSeek x HLE
python svr_mad.py --predebate-dir runs/hle/deepseek-v3.1/predebate_n6 \
--P 3 --prior-signal min_logprob --peer-signal min_logprob \
--peer-header mc_full_strong --tiebreak layered_mv --include-reasoning

python exhaustive_d1.py --predebate-dir runs/imobench/gpt-oss-120b/predebate_n6 --convert-math-answers
python exhaustive_d1.py --predebate-dir runs/hle/gpt-oss-120b/predebate_n6 --include-reasoning
python exhaustive_d1.py --predebate-dir runs/imobench/deepseek-v3.1/predebate_n6 \
--peer-header math_full_strong --convert-math-answers
python exhaustive_d1.py --predebate-dir runs/hle/deepseek-v3.1/predebate_n6 \
--peer-header mc_full_strong --include-reasoning

--reuse-debate-outcome <svr_mad_run_dir> reuses each receiver-peer probe SVR-MAD already fired, so only the remaining (i, j) pairs incur LLM calls.
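The saving from reuse can be illustrated by enumerating which pairs are still missing. The sketch below assumes the exhaustive run covers every ordered (receiver i, peer j) pair with i != j; that reading, and the helper name remaining_pairs, are assumptions rather than something taken from exhaustive_d1.py.

```python
from itertools import permutations


def remaining_pairs(n_agents, already_probed):
    """Ordered (receiver, peer) pairs with receiver != peer that still need a probe."""
    done = set(already_probed)
    return [pair for pair in permutations(range(n_agents), 2) if pair not in done]


# With N = 6 agents there are 6 * 5 = 30 ordered pairs in total.
probed = [(0, 1), (2, 0), (4, 5)]  # hypothetical probes from an earlier run
todo = remaining_pairs(6, probed)
print(len(todo))  # 30 - 3 = 27 pairs left to query
```

Under this assumption, every probe already present in the reused run directory removes one LLM call from the exhaustive pass.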
python evaluate_runs.py

Expected output:
LLM            Method             IMO NComm  IMO Tok  IMO Acc  HLE NComm  HLE Tok  HLE Acc
------------------------------------------------------------------------------------------
GPT-OSS-120B   Self Consistency        0.00    45.41   36.70%       0.00    14.12   13.86%
GPT-OSS-120B   GroupDebate            19.37   164.14   41.75%      18.40   111.01   20.22%
GPT-OSS-120B   SID-ET (sk 60/70)      17.47    85.51   38.72%      12.81    39.57   17.23%
GPT-OSS-120B   S2-MAD                 22.75   131.44   42.09%      14.04    67.74   20.97%
GPT-OSS-120B   SVR-MAD                 5.64    82.05   42.76%       7.31    39.07   21.35%
DeepSeek-V3.1  Self Consistency        0.00    20.22   31.77%       0.00     9.77   14.13%
DeepSeek-V3.1  GroupDebate            19.81   234.56   40.13%      19.58   108.72   14.86%
DeepSeek-V3.1  SID-ET (sk 50/60)      19.97   103.04   33.78%      16.63    33.74   14.13%
DeepSeek-V3.1  S2-MAD                 20.85   154.66   36.79%      18.70    71.69   16.67%
DeepSeek-V3.1  SVR-MAD                 8.93    91.66   41.47%       5.28    30.33   16.67%
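One way to read the NComm columns is as a relative reduction against a baseline. The helper below is a hypothetical convenience, not part of evaluate_runs.py; the two numbers in the usage lines are copied from the GPT-OSS-120B IMO NComm column of the table above.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline` (0.25 means 25% lower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 1.0 - value / baseline


# GPT-OSS-120B on IMO: S2-MAD uses 22.75 NComm, SVR-MAD uses 5.64.
saving = relative_reduction(22.75, 5.64)
print(f"SVR-MAD NComm reduction vs S2-MAD: {saving:.1%}")
```

The same helper applies to the Tok columns; it is undefined for the Self Consistency rows, whose NComm is 0.00.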
python plot_motivation.py   # figures/motivation.{pdf,png}
python plot_tail_acc.py     # figures/tail-acc.{pdf,png}
python ablations.py         # figures/ablation.{pdf,png}

