weifanjiang/SVR-MAD

SVR-MAD

Code implementation for the paper "SVR-MAD: A Bayesian-Inspired Framework for Posterior-Guided Multi-Agent Debate".

Setup

Our implementation uses AWS Bedrock. Create an api-keys.json file in the repository root with the following content:

{
  "AWS_BEDROCK_API": "<paste your Bedrock Mantle token here>"
}
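
As a sanity check before launching any runs, the key file can be read back with a few lines of Python. This is a minimal sketch, not the repository's own loader: the function name load_bedrock_token is hypothetical, and only the file name and the AWS_BEDROCK_API key come from the setup step above.

```python
import json
from pathlib import Path


def load_bedrock_token(path="api-keys.json"):
    """Return the token stored under "AWS_BEDROCK_API" in the key file.

    Hypothetical helper for illustration; the repository may load the
    key differently.
    """
    with Path(path).open(encoding="utf-8") as f:
        keys = json.load(f)
    token = keys.get("AWS_BEDROCK_API", "")
    # Reject a missing or empty value as well as the unedited placeholder.
    if not token or token.startswith("<"):
        raise ValueError(f"Set AWS_BEDROCK_API in {path} before running")
    return token
```

Calling load_bedrock_token() from the repository root raises a ValueError if the placeholder text was never replaced.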

Datasets

Source datasets are under data/.

All debate runners accept --only-non-unanimous to skip questions whose round-0 answers already agree across the N agents.
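
The filter behind --only-non-unanimous can be pictured roughly as follows. This is a sketch under assumptions: the function names and the dictionary layout are hypothetical, and answers are compared as exact strings, whereas the actual runners may normalize answers first (for example via --convert-math-answers).

```python
def is_unanimous(round0_answers):
    """True when all agents gave the same round-0 answer."""
    return len(set(round0_answers)) <= 1


def non_unanimous_only(questions):
    """Keep only questions whose round-0 answers disagree.

    `questions` maps a question id to the list of N round-0 answers
    (hypothetical layout, chosen for illustration).
    """
    return {qid: answers for qid, answers in questions.items()
            if not is_unanimous(answers)}
```

With N = 6, a question answered "A" by all six agents would be skipped, while one answered "A" by five agents and "B" by one would still be debated.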


1. Generate run JSONs

Please contact the authors for existing run logs. To regenerate them from scratch (this requires Bedrock API calls), follow the steps below in order:

Pre-debate (round 0)

python generate_predebate.py --model gpt-oss-120b  --dataset imobench --n 6 --seed-base 0
python generate_predebate.py --model gpt-oss-120b  --dataset hle      --n 6 --seed-base 0 --concat-reasoning
python generate_predebate.py --model deepseek-v3.1 --dataset imobench --n 6 --seed-base 0 --reasoning-effort medium
python generate_predebate.py --model deepseek-v3.1 --dataset hle      --n 6 --seed-base 0 --reasoning-effort medium --concat-reasoning

Baselines (SID, S2-MAD, GroupDebate)

# gpt-oss x IMO
for algo in sid_a2a s2_mad; do
    python $algo.py --predebate-dir runs/imobench/gpt-oss-120b/predebate_n6 --d 2 --convert-math-answers
done
python groupdebate.py --predebate-dir runs/imobench/gpt-oss-120b/predebate_n6 --d 2 --groups 3,3 --convert-math-answers

# gpt-oss x HLE
for algo in sid_a2a s2_mad; do
    python $algo.py --predebate-dir runs/hle/gpt-oss-120b/predebate_n6 --d 2 --include-reasoning
done
python groupdebate.py --predebate-dir runs/hle/gpt-oss-120b/predebate_n6 --d 2 --groups 3,3 --include-reasoning

# DeepSeek x IMO
for algo in sid_a2a s2_mad; do
    python $algo.py --predebate-dir runs/imobench/deepseek-v3.1/predebate_n6 --d 2 --peer-header math_full_strong --convert-math-answers
done
python groupdebate.py --predebate-dir runs/imobench/deepseek-v3.1/predebate_n6 --d 2 --groups 3,3 --peer-header math_full_strong --convert-math-answers

# DeepSeek x HLE
for algo in sid_a2a s2_mad; do
    python $algo.py --predebate-dir runs/hle/deepseek-v3.1/predebate_n6 --d 2 --include-reasoning --peer-header mc_full_strong
done
python groupdebate.py --predebate-dir runs/hle/deepseek-v3.1/predebate_n6 --d 2 --groups 3,3 --include-reasoning --peer-header mc_full_strong
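
One plausible reading of --groups 3,3 is that the six pre-debate agents are split into two groups of three. The sketch below illustrates that reading only; parse_groups is a hypothetical helper, and assigning agents to groups in contiguous index order is an assumption, not something the repository confirms.

```python
def parse_groups(spec, n_agents):
    """Split agent indices 0..n_agents-1 into consecutive groups.

    `spec` is a comma-separated list of group sizes such as "3,3".
    Contiguous assignment is assumed for illustration.
    """
    sizes = [int(s) for s in spec.split(",")]
    if any(s <= 0 for s in sizes) or sum(sizes) != n_agents:
        raise ValueError(f"group sizes {sizes} must be positive and sum to {n_agents}")
    groups, start = [], 0
    for size in sizes:
        groups.append(list(range(start, start + size)))
        start += size
    return groups
```

Under these assumptions, parse_groups("3,3", 6) yields [[0, 1, 2], [3, 4, 5]], and a spec whose sizes do not sum to the agent count is rejected.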

SVR-MAD

# gpt-oss x IMO
python svr_mad.py --predebate-dir runs/imobench/gpt-oss-120b/predebate_n6 \
    --P 2 --prior-signal perplexity --peer-signal perplexity \
    --tiebreak layered_mv --convert-math-answers

# gpt-oss x HLE
python svr_mad.py --predebate-dir runs/hle/gpt-oss-120b/predebate_n6 \
    --P 2 --prior-signal perplexity --peer-signal perplexity \
    --tiebreak layered_mv --include-reasoning

# DeepSeek x IMO
python svr_mad.py --predebate-dir runs/imobench/deepseek-v3.1/predebate_n6 \
    --P 3 --prior-signal min_logprob --peer-signal min_logprob \
    --peer-header math_full_strong --tiebreak layered_mv --convert-math-answers

# DeepSeek x HLE
python svr_mad.py --predebate-dir runs/hle/deepseek-v3.1/predebate_n6 \
    --P 3 --prior-signal min_logprob --peer-signal min_logprob \
    --peer-header mc_full_strong --tiebreak layered_mv --include-reasoning
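
The perplexity and min_logprob signal names suggest confidence scores derived from per-token log-probabilities. The sketch below uses the textbook definitions (perplexity as the exponential of the negative mean log-probability, and the minimum token log-probability); whether svr_mad.py computes them exactly this way, or over which tokens, is an assumption.

```python
import math


def perplexity(token_logprobs):
    """exp(-mean(log p)) over a non-empty list of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def min_logprob(token_logprobs):
    """Log-probability of the least likely token in the response."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return min(token_logprobs)
```

Lower perplexity and a higher (less negative) minimum log-probability both indicate a response the model generated with more confidence.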

Exhaustive d=1 substrate (needed by motivation + ablation figures)

python exhaustive_d1.py --predebate-dir runs/imobench/gpt-oss-120b/predebate_n6 --convert-math-answers
python exhaustive_d1.py --predebate-dir runs/hle/gpt-oss-120b/predebate_n6      --include-reasoning
python exhaustive_d1.py --predebate-dir runs/imobench/deepseek-v3.1/predebate_n6 \
    --peer-header math_full_strong --convert-math-answers
python exhaustive_d1.py --predebate-dir runs/hle/deepseek-v3.1/predebate_n6 \
    --peer-header mc_full_strong --include-reasoning

Pass --reuse-debate-outcome <svr_mad_run_dir> to reuse each receiver-peer probe that SVR-MAD already fired, so that only the remaining (i, j) pairs incur LLM calls.
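
The saving from reusing probes can be estimated by enumerating the pairs that are still missing. This sketch assumes the exhaustive d=1 run covers every ordered (receiver i, peer j) pair with i != j; remaining_pairs is a hypothetical helper and does not read the repository's run directories.

```python
from itertools import permutations


def remaining_pairs(n_agents, already_probed):
    """Ordered (receiver, peer) pairs with i != j that still need an LLM call.

    `already_probed` is an iterable of (i, j) tuples taken from an earlier
    run (hypothetical input format).
    """
    done = set(already_probed)
    return [(i, j) for i, j in permutations(range(n_agents), 2)
            if (i, j) not in done]
```

For n = 6 there are 30 ordered pairs per question under this assumption, so every probe reused from an SVR-MAD run removes one call from that total.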


2. Headline evaluation

python evaluate_runs.py

Expected output:

LLM             Method                IMO NComm   IMO Tok   IMO Acc   HLE NComm   HLE Tok   HLE Acc
------------------------------------------------------------------------------------------------------

GPT-OSS-120B    Self Consistency           0.00     45.41    36.70%        0.00     14.12    13.86%
GPT-OSS-120B    GroupDebate               19.37    164.14    41.75%       18.40    111.01    20.22%
GPT-OSS-120B    SID-ET (sk 60/70)         17.47     85.51    38.72%       12.81     39.57    17.23%
GPT-OSS-120B    S2-MAD                    22.75    131.44    42.09%       14.04     67.74    20.97%
GPT-OSS-120B    SVR-MAD                    5.64     82.05    42.76%        7.31     39.07    21.35%

DeepSeek-V3.1   Self Consistency           0.00     20.22    31.77%        0.00      9.77    14.13%
DeepSeek-V3.1   GroupDebate               19.81    234.56    40.13%       19.58    108.72    14.86%
DeepSeek-V3.1   SID-ET (sk 50/60)         19.97    103.04    33.78%       16.63     33.74    14.13%
DeepSeek-V3.1   S2-MAD                    20.85    154.66    36.79%       18.70     71.69    16.67%
DeepSeek-V3.1   SVR-MAD                    8.93     91.66    41.47%        5.28     30.33    16.67%

3. Additional analysis (figures)

python plot_motivation.py        # figures/motivation.{pdf,png}

python plot_tail_acc.py          # figures/tail-acc.{pdf,png}

python ablations.py  # figures/ablation.{pdf,png}
