AgentBeats Competition Submission: Green Agent for Multi-Agent Negotiation Assessment
This repository contains a green agent that implements the Empirical Meta-Game Analysis framework from Smithline, Mascioli, Chakraborty & Wellman (2025) for evaluating negotiation agents. The agent computes Maximum Entropy Nash Equilibrium (MENE) to rigorously assess purple agent strategies within their strategic ecosystem.
- Python 3.11 (required - the OpenSpiel binary is compiled for Python 3.11)
- uv package manager (recommended)
# Clone and setup
git clone https://github.com/gsmithline/tutorial-agent-beats-comp.git
cd tutorial-agent-beats-comp
# Install dependencies (uses Python 3.11 via .python-version)
uv sync
# Set environment variables
cp sample.env .env
# Add your API key to .env
# Run a local assessment
PYTHONPATH=scenarios/bargaining/open_spiel:$PYTHONPATH \
uv run python -m scenarios.bargaining.bargaining_green once \
--config '{"challenger_url": "https://your-purple-agent.com", "games": 10}'
Note: The PYTHONPATH must include scenarios/bargaining/open_spiel to load the pre-compiled OpenSpiel module.
# Deploy using the pre-built Docker image
gcloud run deploy bargaining-green-agent \
--image ghcr.io/gsmithline/tutorial-agent-beats-comp:latest \
--region=us-central1 \
--allow-unauthenticated \
--memory=4Gi
# Or build from source
gcloud run deploy bargaining-green-agent \
--source . \
--region=us-central1 \
--allow-unauthenticated
- Deploy your green agent (Option B above)
- Navigate to agentbeats.dev
- Register your agent with the Cloud Run URL
- Run assessments against purple agents via the platform
This green agent implements the Empirical Meta-Game Analysis methodology introduced by Li & Wellman (2024) and applied to LLM bargaining evaluation in Smithline et al. (2025).
Traditional benchmarks evaluate agents in isolation against fixed opponents. But in strategic environments, an agent's performance inherently depends on the behavior of other agents. Meta-game analysis addresses this by:
- Constructing an empirical game over the space of agent strategies
- Computing Nash equilibria to identify stable population mixtures
- Evaluating agents at equilibrium to measure how well-adapted they are to strategic competition
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Purple Agent   │    │   Green Agent    │    │  Baseline Pool  │
│  (Challenger)   │───▶│   (Evaluator)    │◀───│  soft, tough,   │
└─────────────────┘    │                  │    │  aspire, walk,  │
                       │ 1. Build Roster  │    │  nfsp, rnad     │
                       │ 2. Simulate N²   │    └─────────────────┘
                       │    Matchups      │
                       │ 3. MENE Solve    │
                       │ 4. Compute       │
                       │    Metrics       │
                       └────────┬─────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   Evaluation    │
                       │    Results      │
                       │ - MENE Regret   │
                       │ - Welfare %     │
                       │ - Fairness %    │
                       └─────────────────┘
Step 1: Agent Roster Construction
- Your purple agent joins a pool of baseline strategies
- Heuristic agents: soft (accepts any offer), tough (minimal offers), aspire (concession schedule), walk (takes BATNA)
- RL-derived policies: nfsp (Neural Fictitious Self-Play), rnad (Regularized Nash Dynamics)
Step 2: Pairwise Simulation
- For each ordered pair (i, j), simulate N games with agent i as row player and j as column player
- Uses OpenSpiel's negotiation game with:
  - T=3 item types with quantities (7, 4, 1)
  - Private valuations drawn uniformly from [1, 100]
  - Private BATNAs (outside options)
  - Discount factor γ ∈ {0.9, 0.98} per round
  - Maximum R ∈ {3, 5} rounds
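As a rough illustration of the "ordered pair" idea in Step 2, the sketch below enumerates the matchups for a roster made of the challenger plus the six baselines. It assumes that self-play pairs (i, i) are included, which is one reading of the "N² matchups" box in the diagram; the function name `ordered_matchups` is hypothetical and not part of the repository.

```python
from itertools import product

# Baseline labels as listed in Step 1; "challenger" is the default label
# for the purple agent.
BASELINES = ["soft", "tough", "aspire", "walk", "nfsp", "rnad"]


def ordered_matchups(roster):
    """Return every ordered (row, column) pair.

    Assumption: self-play pairs such as ("soft", "soft") are included.
    """
    return list(product(roster, repeat=2))


roster = ["challenger"] + BASELINES
matchups = ordered_matchups(roster)
print(len(roster), "agents ->", len(matchups), "ordered pairs")
```

Because the pairs are ordered, ("soft", "tough") and ("tough", "soft") are separate matchups, so each agent is observed in both the row and the column role.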
Game Configurations (from paper)
| Config | Discount (γ) | Rounds (R) | Description |
|---|---|---|---|
| BG4 | 0.9 | 3 | High time pressure, short horizon |
| BG5 | 0.98 | 3 | Low time pressure, short horizon |
| BG6 | 0.98 | 5 | Low time pressure, long horizon |
Pre-trained NFSP and RNAD checkpoints are provided for all three configurations.
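To make the "time pressure" column concrete, the following sketch computes how much of a deal's value survives if agreement is only reached in the final round of each configuration. It assumes that round 1 is undiscounted and that each later round multiplies the payoff by γ once; the exact convention used by the game implementation may differ, and `discount_multiplier` is a hypothetical helper.

```python
# Game configurations copied from the table above.
CONFIGS = {
    "BG4": {"discount": 0.9, "max_rounds": 3},
    "BG5": {"discount": 0.98, "max_rounds": 3},
    "BG6": {"discount": 0.98, "max_rounds": 5},
}


def discount_multiplier(discount, round_number):
    """Multiplier applied to a deal closed in `round_number` (1-based).

    Assumption: round 1 is undiscounted; every later round multiplies by gamma.
    """
    return discount ** (round_number - 1)


for name, cfg in CONFIGS.items():
    last = discount_multiplier(cfg["discount"], cfg["max_rounds"])
    print(f"{name}: final-round multiplier = {last:.4f}")
```

Under this assumption a last-round deal keeps 81% of its value in BG4 but roughly 96% in BG5, which is why BG4 is described as the high-pressure setting.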
Step 3: Payoff Matrix & MENE
- Construct the payoff matrix of the symmetric empirical game, where M[i][j] = agent i's average payoff when playing against agent j (the matrix itself need not be symmetric)
- Solve for Maximum Entropy Nash Equilibrium using MILP (CVXPY)
- Bootstrap resampling (default 100 iterations) for statistical robustness
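The bootstrap step can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes the per-game payoffs are stored per ordered pair and that each cell is resampled independently with replacement. The names `mean_payoff_matrix` and `bootstrap_matrices` are hypothetical, and the payoffs are made-up toy data.

```python
import random


def mean_payoff_matrix(samples):
    """samples[i][j] is the list of agent i's payoffs against agent j."""
    return [[sum(cell) / len(cell) for cell in row] for row in samples]


def bootstrap_matrices(samples, iterations=100, seed=0):
    """Resample each cell's games with replacement and recompute the means."""
    rng = random.Random(seed)
    matrices = []
    for _ in range(iterations):
        resampled = [
            [rng.choices(cell, k=len(cell)) for cell in row] for row in samples
        ]
        matrices.append(mean_payoff_matrix(resampled))
    return matrices


# Toy data: 2 agents, 4 games per ordered pair (made-up payoffs).
samples = [
    [[50, 60, 55, 45], [70, 80, 75, 65]],
    [[30, 20, 25, 35], [40, 50, 45, 55]],
]
point_estimate = mean_payoff_matrix(samples)
replicates = bootstrap_matrices(samples, iterations=100)
print(point_estimate)
```

Each replicate matrix would then be passed to the equilibrium solver, so the spread of the resulting metrics indicates how sensitive they are to sampling noise.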
Step 4: Metrics Computation
- Compute regret and welfare metrics weighted by the MENE mixture
The regret of a pure strategy π at Nash equilibrium σ* measures the deviation incentive:
Regret(π) = max(0, u(π, σ*) - u(σ*))
Where:
- u(π, σ*) = expected payoff for pure strategy π against the equilibrium mixture
- u(σ*) = expected payoff at equilibrium (playing the mixture)
Interpretation: Lower regret means the agent is better adapted to the equilibrium. Zero regret means there is no incentive to deviate to π: the strategy is either in the equilibrium support or earns no more than the mixture against σ*. Positive regret means π outperforms the equilibrium mixture; at an exact Nash equilibrium this cannot happen, so for a correctly computed MENE any positive value should be near zero.
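The regret formula can be checked on a toy matrix. The sketch below is illustrative only: the 2x2 payoffs are a made-up prisoner's-dilemma example (index 0 = cooperate, index 1 = defect), and the function names are hypothetical rather than taken from mene_solver.py.

```python
def payoff_vs_mixture(payoffs, strategy, mixture):
    """u(pi, sigma*): expected payoff of a pure strategy against the mixture."""
    return sum(p * payoffs[strategy][j] for j, p in enumerate(mixture))


def mixture_value(payoffs, mixture):
    """u(sigma*): expected payoff of playing the mixture against itself."""
    return sum(
        p * payoff_vs_mixture(payoffs, i, mixture) for i, p in enumerate(mixture)
    )


def regret(payoffs, strategy, mixture):
    """Regret(pi) = max(0, u(pi, sigma*) - u(sigma*))."""
    gain = payoff_vs_mixture(payoffs, strategy, mixture) - mixture_value(
        payoffs, mixture
    )
    return max(0.0, gain)


# Toy symmetric game: payoffs[i][j] is the payoff to i when facing j.
payoffs = [[3.0, 0.0], [5.0, 1.0]]

all_defect = [0.0, 1.0]     # the Nash equilibrium of this toy game
all_cooperate = [1.0, 0.0]  # not an equilibrium

print([regret(payoffs, s, all_defect) for s in range(2)])     # [0.0, 0.0]
print([regret(payoffs, s, all_cooperate) for s in range(2)])  # [0.0, 2.0]
```

At the equilibrium every pure strategy has zero regret, while against the non-equilibrium mixture the defect strategy shows a regret of 2, the size of the profitable deviation.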
| Metric | Formula | Description |
|---|---|---|
| UW (Utilitarian Welfare) | u₁ + u₂ | Total value created by both players |
| NW (Nash Welfare) | √(u₁ × u₂) | Geometric mean - balances efficiency and equity |
| NW+ (Nash Welfare Advantage) | √(max(0, u₁-b₁) × max(0, u₂-b₂)) | Surplus over BATNAs |
| EF1 (Envy-Free up to 1 item) | Boolean per game | Fairness: envy eliminable by removing one item |
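The per-game formulas in the table translate directly into code. The sketch below implements UW, NW, and NW+ as written, plus an EF1 check based on the standard textbook definition (an agent's envy disappears after removing the single most valuable item, by its own valuation, from the other agent's bundle). How the repository computes EF1 is not specified here, so treat that part as an assumption; all function names are hypothetical.

```python
import math


def utilitarian_welfare(u1, u2):
    return u1 + u2


def nash_welfare(u1, u2):
    # Assumes non-negative utilities.
    return math.sqrt(u1 * u2)


def nash_welfare_advantage(u1, u2, b1, b2):
    return math.sqrt(max(0.0, u1 - b1) * max(0.0, u2 - b2))


def bundle_value(values, bundle):
    """Value of a bundle given per-item values and item counts."""
    return sum(v * q for v, q in zip(values, bundle))


def is_ef1(values_1, values_2, bundle_1, bundle_2):
    """Standard EF1 check for two agents (assumed definition)."""

    def envy_free_up_to_one(values, own, other):
        own_value = bundle_value(values, own)
        other_value = bundle_value(values, other)
        # Most valuable single item present in the other agent's bundle.
        best_item = max((v for v, q in zip(values, other) if q > 0), default=0)
        return own_value >= other_value - best_item

    return envy_free_up_to_one(values_1, bundle_1, bundle_2) and envy_free_up_to_one(
        values_2, bundle_2, bundle_1
    )


print(utilitarian_welfare(90, 40))             # 130
print(nash_welfare(90, 40))                    # 60.0
print(nash_welfare_advantage(90, 40, 50, 30))  # 20.0
```

For example, with item quantities (7, 4, 1), `is_ef1([5, 10, 20], [10, 5, 30], [4, 2, 0], [3, 2, 1])` evaluates to True, whereas shifting one unit of the first item to the second agent makes it False for these made-up valuations.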
Per the A2A protocol, send an assessment request to the green agent:
{
"participants": {
"challenger": "https://your-purple-agent.example.com"
},
"config": {
"games": 50,
"max_rounds": 5,
"discount": 0.98,
"bootstrap": 100,
"challenger_circle": 5
}
}

| Parameter | Default | Description |
|---|---|---|
| games | 50 | Number of games per agent pair |
| max_rounds | 5 | Maximum negotiation rounds (R) |
| discount | 0.98 | Per-round discount factor (γ) |
| bootstrap | 100 | Bootstrap iterations for MENE |
| challenger_circle | 0 | Prompt sophistication level (0-6) |
| challenger_label | "challenger" | Label for your agent in results |
| remote_agents | {} | Additional remote agents {"label": "url"} |
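A request body can also be assembled programmatically. The helper below is a hypothetical convenience, not part of the repository: it merges overrides into the documented defaults, rejects unknown keys, and checks the documented 0-6 range for challenger_circle.

```python
import copy
import json

# Defaults as documented in the parameter table.
DEFAULT_CONFIG = {
    "games": 50,
    "max_rounds": 5,
    "discount": 0.98,
    "bootstrap": 100,
    "challenger_circle": 0,
    "challenger_label": "challenger",
    "remote_agents": {},
}


def build_assessment_request(challenger_url, **overrides):
    """Build an assessment request payload (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    config = copy.deepcopy(DEFAULT_CONFIG)
    config.update(overrides)
    if not 0 <= config["challenger_circle"] <= 6:
        raise ValueError("challenger_circle must be between 0 and 6")
    return {"participants": {"challenger": challenger_url}, "config": config}


request = build_assessment_request(
    "https://your-purple-agent.example.com", games=10, challenger_circle=5
)
print(json.dumps(request, indent=2))
```

Validating locally keeps obvious mistakes, such as a misspelled key, from reaching the green agent, whose own handling of unknown keys is not documented here.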
The green agent provides structured prompts to LLM-based purple agents via "circles" - a hierarchical prompting framework:
| Circle | Content |
|---|---|
| 0 | Bare rules: items, valuations, BATNA, actions |
| 1 | + Objective specification (maximize outcome) |
| 2 | + Worked numeric example of offer evaluation |
| 3 | + Step-by-step routine: assess, compare, decide |
| 4 | + Five common negotiation mistakes to avoid |
| 5 | + Quick numeric checks against those mistakes |
| 6 | + Strategic inference from opponent's offers |
Set challenger_circle to inject these prompts into observations sent to your agent.
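The "+" prefixes in the table suggest that each circle adds a layer on top of the previous ones. Under that assumption, the content available at a given level can be sketched as a simple prefix of the layer list. The strings below are the table's summaries, not the actual prompt text, and `circle_contents` is a hypothetical name.

```python
# One summary per circle, taken from the table above.
CIRCLE_LAYERS = [
    "Bare rules: items, valuations, BATNA, actions",
    "Objective specification (maximize outcome)",
    "Worked numeric example of offer evaluation",
    "Step-by-step routine: assess, compare, decide",
    "Five common negotiation mistakes to avoid",
    "Quick numeric checks against those mistakes",
    "Strategic inference from opponent's offers",
]


def circle_contents(level):
    """Return the layers included at `level`, assuming circles are cumulative."""
    if not 0 <= level < len(CIRCLE_LAYERS):
        raise ValueError("level must be between 0 and 6")
    return CIRCLE_LAYERS[: level + 1]


for layer in circle_contents(2):
    print("-", layer)
```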
Your purple agent must:
- Implement A2A protocol - Expose an A2A server endpoint
- Handle negotiation messages - Receive observations with valuations, BATNAs, and offers
- Return valid actions - Propose offers or accept/walk
The green agent sends observations like:
{
"role": "row",
"round": 2,
"valuations": [45, 72, 33],
"batna": 85,
"quantities": [7, 4, 1],
"last_offer": [3, 2, 0],
"history": [...]
}
Your agent responds with an action:
{"action": "COUNTEROFFER", "offer": [4, 2, 1]}
Or:
{"action": "ACCEPT"}
Or:
{"action": "WALK"}
From our analysis, these are the five key mistakes that LLM negotiators make:
- M1: Making an offer worse than your previous offer
- M2: Making an offer worse for you than your BATNA
- M3: Offering no items or all items (extreme divisions)
- M4: Accepting an offer worse than your BATNA
- M5: Walking away from an offer better than your BATNA
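A purple agent can screen its own actions against these mistakes before replying. The sketch below is one possible check, under several assumptions: every "share" argument is the list of item counts the acting agent itself would receive (the wire format's offer semantics are not restated here), "worse" in M1 is read as worse for the proposer, and discounting is ignored. The function names are hypothetical.

```python
def bundle_value(valuations, share):
    """Value of a share given per-item valuations and item counts."""
    return sum(v * q for v, q in zip(valuations, share))


def find_mistakes(action, valuations, batna, quantities,
                  incoming_share=None, my_share=None, previous_my_share=None):
    """Return the mistake labels (M1-M5) that `action` would trigger.

    Assumption: all *_share arguments are what the acting agent would receive.
    """
    mistakes = []
    if action == "COUNTEROFFER":
        value = bundle_value(valuations, my_share)
        if previous_my_share is not None and value < bundle_value(
            valuations, previous_my_share
        ):
            mistakes.append("M1")  # worse than own previous offer
        if value < batna:
            mistakes.append("M2")  # proposing less than the outside option
        if sum(my_share) == 0 or list(my_share) == list(quantities):
            mistakes.append("M3")  # extreme division
    elif action == "ACCEPT":
        if bundle_value(valuations, incoming_share) < batna:
            mistakes.append("M4")  # accepting below BATNA
    elif action == "WALK":
        if incoming_share is not None and bundle_value(
            valuations, incoming_share
        ) > batna:
            mistakes.append("M5")  # walking from a better-than-BATNA offer
    return mistakes


# Values from the sample observation above.
valuations, batna, quantities = [45, 72, 33], 85, [7, 4, 1]
incoming = [3, 2, 0]  # worth 3*45 + 2*72 = 279 to this agent

print(find_mistakes("WALK", valuations, batna, quantities, incoming_share=incoming))
print(find_mistakes("COUNTEROFFER", valuations, batna, quantities, my_share=[7, 4, 1]))
```

With these numbers, walking away triggers M5 because the incoming share is worth 279 against a BATNA of 85, and demanding every item triggers M3.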
# Start the A2A server
PYTHONPATH=scenarios/bargaining/open_spiel:$PYTHONPATH \
uv run python -m scenarios.bargaining.bargaining_green serve \
--host 0.0.0.0 \
--port 8080
# In another terminal, send an assessment request
curl -X POST http://localhost:8080/a2a \
-H "Content-Type: application/json" \
-d '{"type": "assessment_request", "participants": {...}, "config": {...}}'
PYTHONPATH=scenarios/bargaining/open_spiel:$PYTHONPATH \
uv run python -m scenarios.bargaining.bargaining_green once \
--config '{"challenger_url": "https://...", "games": 10}'
# Build locally
docker build -t bargaining-green-agent .
# Run locally
docker run -p 8080:8080 bargaining-green-agent
scenarios/bargaining/
├── bargaining_green.py # Main green agent implementation
├── bargaining_env/
│ ├── agents/ # Baseline negotiation agents
│ │ ├── soft.py # Always-accept agent
│ │ ├── tough.py # Minimal-offer agent
│ │ ├── aspiration.py # Concession-schedule agent
│ │ ├── walk.py # BATNA-preferring agent
│ │ ├── nfsp.py # Neural Fictitious Self-Play
│ │ └── rnad.py # Regularized Nash Dynamics
│ ├── pyspiel_integration.py # Game parameter builder
│ ├── pyspiel_runner.py # OpenSpiel game interface
│ ├── mene_solver.py # MENE computation via MILP
│ └── run_entire_matrix.py # Matrix simulation orchestrator
├── rl_agent_checkpoints/ # Pre-trained RL policies
│ ├── nfsp/ # NFSP checkpoints (bg4, bg5, bg6)
│ └── rnad/ # RNAD checkpoints (bg4, bg5, bg6)
└── open_spiel/ # Custom OpenSpiel with negotiation game
This repository includes a custom OpenSpiel build with the negotiation/bargaining game.
Important: The pre-compiled pyspiel.so in scenarios/bargaining/open_spiel/ is built for Python 3.11. The project is configured to use Python 3.11 via .python-version.
The Docker build compiles OpenSpiel from source with:
- Abseil C++ library
- pybind11 Python bindings
- Double Dummy Solver (for bridge, included in full build)
Loading the Game Correctly
Always use build_negotiation_params() from pyspiel_integration.py to ensure correct game loading:
from scenarios.bargaining.bargaining_env.pyspiel_integration import (
build_negotiation_params,
try_load_pyspiel_game
)
params = build_negotiation_params(
discount=0.98,
max_rounds=3,
num_items=3,
item_quantities=(7, 4, 1),
min_value=1,
max_value=100,
max_quantity=10,
)
game = try_load_pyspiel_game(params)
Note: The item_quantities parameter must use comma-separated values internally (e.g., "7,4,1"). The helper function handles this automatically.
The Maximum Entropy Nash Equilibrium is computed using:
- CVXPY for convex optimization
- ECOS_BB or GLPK_MI as MILP solvers
- Bootstrap resampling for robustness (following Wiedenbeck et al., 2014)
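The selection criterion itself, choosing the highest-entropy mixture among Nash equilibria, can be illustrated without a MILP solver. The sketch below is not the CVXPY formulation used by the repository: it simply filters a hand-picked list of candidate mixtures down to those with (near) zero maximum regret in a symmetric game and returns the one with the largest Shannon entropy. The payoff matrix is a made-up coordination game and all names are hypothetical.

```python
import math


def entropy(mixture):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(p * math.log(p) for p in mixture if p > 0)


def max_regret(payoffs, mixture):
    """Largest gain any pure strategy gets by deviating from the mixture."""
    n = len(mixture)
    vs_mix = [sum(mixture[j] * payoffs[i][j] for j in range(n)) for i in range(n)]
    value = sum(mixture[i] * vs_mix[i] for i in range(n))
    return max(vs_mix) - value


def pick_max_entropy_equilibrium(payoffs, candidates, tol=1e-9):
    """Among candidates that are symmetric equilibria, return the max-entropy one."""
    equilibria = [m for m in candidates if max_regret(payoffs, m) <= tol]
    if not equilibria:
        raise ValueError("no candidate is an equilibrium within tolerance")
    return max(equilibria, key=entropy)


# Toy coordination game: matching the opponent pays 1, mismatching pays 0.
payoffs = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.8, 0.2]]

print(pick_max_entropy_equilibrium(payoffs, candidates))  # [0.5, 0.5]
```

Three of the four candidates are equilibria of this toy game; the uniform mixture is selected because it has the highest entropy, while [0.8, 0.2] is rejected because its maximum regret is positive.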
Pre-trained checkpoints are available for both NFSP and RNAD agents:
| Agent | BG4 | BG5 | BG6 |
|---|---|---|---|
| NFSP | nfsp_bg4.pt | nfsp_bg5.pt | nfsp_bg6.pt |
| RNAD | rnad_bg4.pkl | rnad_bg5.pkl | rnad_bg6.pkl |
The checkpoints are automatically selected based on the game configuration (discount and max_rounds).
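The mapping from game parameters to a configuration label can be sketched as a small lookup. This is an illustration of the idea rather than the repository's selection code; `config_label` is a hypothetical name, and the behaviour for parameter combinations outside the three documented configurations is an assumption (here, an error).

```python
import math

# (discount, max_rounds) -> configuration label, from the Game Configurations table.
KNOWN_CONFIGS = {
    (0.9, 3): "bg4",
    (0.98, 3): "bg5",
    (0.98, 5): "bg6",
}


def config_label(discount, max_rounds):
    """Return the label whose parameters match, tolerating float rounding."""
    for (gamma, rounds), label in KNOWN_CONFIGS.items():
        if rounds == max_rounds and math.isclose(gamma, discount, abs_tol=1e-9):
            return label
    raise ValueError(
        f"no pre-trained checkpoint for discount={discount}, max_rounds={max_rounds}"
    )


print(config_label(0.98, 5))  # bg6
```

Comparing the discount with `math.isclose` avoids a missed lookup when the value arrives from JSON with small floating-point differences.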
- Smithline, G., Mascioli, C., Chakraborty, M., & Wellman, M. P. (2025). "Measuring Competition and Cooperation in LLM Bargaining: An Empirical Meta-Game Analysis." University of Michigan.
- Li, Z., & Wellman, M. P. (2024). "A Meta-Game Evaluation Framework for Deep Multiagent Reinforcement Learning." IJCAI.
- Wellman, M. P., Tuyls, K., & Greenwald, A. (2025). "Empirical Game-Theoretic Analysis: A Survey." JAIR.
- Lewis, M., et al. (2017). "Deal or No Deal? End-to-End Learning for Negotiation Dialogues." EMNLP.
- Lanctot, M., et al. (2019). "OpenSpiel: A Framework for Reinforcement Learning in Games." arXiv:1908.09453.
Apache 2.0
| Repository | Description |
|---|---|
| meta-game-leaderboard | Leaderboard for submitting and comparing agents |
| llm-negotiator-purple | Example Claude-powered purple agent |
This is a submission for the AgentBeats x AgentX Competition 2025.
- Agent Type: Green (Evaluator)
- Domain: Multi-agent negotiation / bargaining
- Methodology: Empirical Meta-Game Analysis with MENE
- Docker Image: ghcr.io/gsmithline/tutorial-agent-beats-comp:latest
- Python Version: 3.11 (required)
- Authors: Based on research from the University of Michigan Strategic Reasoning Group