Evaluation pipeline performance bottlenecks: BC benchmarks 3-5x slower than necessary #137

## Summary

BrowseComp (BC-EN/BC-ZH) evaluations are significantly slower than other benchmarks. A single BC-ZH task averages **48.6 minutes** (median 37.3 min), with tail tasks reaching **265 minutes (4.4 hours)**. BC-EN with 1266 tasks takes **60-70 hours** per run.

Profiling shows **LLM inference is the dominant bottleneck** (roughly 80% or more of total task time), followed by tool execution overhead. Both sides need optimization to reach the target of a **2x or greater overall speedup**.

---

## 1. LLM Inference Optimization (Biggest Impact) 🔴

LLM inference accounts for the vast majority of evaluation time. A BC task runs ~300-400 turns, each requiring a full LLM call with growing context (up to tens of thousands of tokens). **Conservative estimate: 20-60 min per task on inference alone, often much more in practice.**

### 1.1 Prefix Caching

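**Current state**: All evaluation tasks share the same system prompt (several thousand tokens of agent instructions), yet every request currently recomputes the full KV cache from scratch.

**Optimization**: Enable prefix caching (SGLang's RadixAttention, a radix-tree KV cache) on the sglang server so the system prompt KV cache is computed once and reused across all concurrent requests. Upstream SGLang keeps the radix cache on unless the server is launched with `--disable-radix-cache`, so the first step is to confirm that the evaluation server does not pass that flag and to check the cache hit rate it reports.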

**Additional benefit**: Within a single task's multi-turn conversation, each turn's context is a prefix of the next. With cache-aware routing (routing same-task requests to the same worker), KV cache from previous turns can be reused.

**Expected impact**: Significant reduction in prefill compute, especially for later turns with long context. **Estimated 30-50% inference time reduction.**
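
The cache-aware routing idea above can be sketched as sticky routing: every turn of a task is sent to the same worker, so that worker's prefix cache already holds the earlier turns. This is only an illustration under assumptions, not the routing policy of sglang or of our infra framework; `pick_worker`, the task ID format, and the worker URLs are hypothetical.

```python
import hashlib


def pick_worker(task_id: str, workers: list[str]) -> str:
    """Map a task to one fixed worker so all its turns reuse the same KV cache."""
    if not workers:
        raise ValueError("workers must not be empty")
    # hashlib is stable across processes, unlike the built-in hash() for str.
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


# Hypothetical worker endpoints, for illustration only.
workers = ["http://worker-0:1234", "http://worker-1:1234"]

# Every turn of the same task resolves to the same worker.
first_turn = pick_worker("bc-zh-0042", workers)
later_turn = pick_worker("bc-zh-0042", workers)
print(first_turn == later_turn)  # True
```

Plain hashing ignores worker load, so a production router would likely combine stickiness with load balancing.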

### 1.2 Chunked Prefill with Pipeline Parallelism

**Current state**: Basic sglang launch with `--tp 8`, no chunked prefill.

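**Optimization**: Enable `--chunked-prefill-size 4096 --enable-dynamic-chunking` to pipeline long prefills. SGLang's published benchmarks report up to **3.3x prefill throughput** and a **67.9% TTFT (time to first token) reduction** for long contexts. These are upstream figures, not measurements from our evaluation runs, and flag availability should be checked against the installed sglang version.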

**Why this matters for evaluation**: BC tasks accumulate long conversation histories (10K-60K+ tokens). Chunked prefill directly reduces the time spent on these long-context prefill operations.
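
To make the chunk size concrete, the sketch below shows how a long prompt would be cut into pieces of at most 4096 tokens, which is the general idea behind chunked prefill. It illustrates the arithmetic only and is not sglang's scheduler; `chunk_spans` is a hypothetical helper.

```python
def chunk_spans(num_tokens: int, chunk_size: int = 4096) -> list[tuple[int, int]]:
    """Split a prompt of num_tokens into [start, end) spans of at most chunk_size."""
    if num_tokens < 0 or chunk_size <= 0:
        raise ValueError("num_tokens must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, num_tokens))
        for start in range(0, num_tokens, chunk_size)
    ]


# Context sizes taken from the 10K-60K range quoted above.
for context in (10_000, 60_000):
    spans = chunk_spans(context)
    print(context, len(spans), spans[-1])
# 10000 3 (8192, 10000)
# 60000 15 (57344, 60000)
```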

### 1.3 Prefill-Decode Disaggregation (Multi-Node)

**Current state**: Single-node TP8 serving both prefill and decode.

**Optimization**: When multiple nodes are available, separate prefill-heavy and decode-heavy workloads into dedicated workers. Evaluation workloads are prefill-dominant (long contexts, relatively short generations), making this particularly beneficial.

**Expected impact**: Better GPU utilization and throughput when scaling beyond single node.

### 1.4 Current vs Optimized sglang Config

**Current** (basic):
```bash
python3 -m sglang.launch_server \
    --model-path <path> --tp 8 --host 0.0.0.0 --port 1234 \
    --trust-remote-code --enable-metrics --mem-fraction-static 0.9
```

**Optimized** (recommended):
```bash
python3 -m sglang.launch_server \
    --model-path <path> --tp 8 --host 0.0.0.0 --port 1234 \
    --trust-remote-code --enable-metrics --mem-fraction-static 0.9 \
    --chunked-prefill-size 4096 --enable-dynamic-chunking
```

Our internal infra framework already supports these advanced sglang features (prefix caching, chunked pipeline, PD disaggregation, cache-aware routing). Integrating them into the evaluation serving setup is the highest-leverage optimization.

---

## 2. Evaluation Pipeline Optimization

### 2.1 MCP Tool Server Parallel Initialization ✅ Fixed

**Location**: `libs/miroflow-tools/src/miroflow_tools/manager.py` → `get_all_tool_definitions()`

Each task initialized 3 MCP tool servers **sequentially**. Under high concurrency: avg **234s**, max **945s**.

**Fix**: `asyncio.gather()` to parallelize. Measured **13.8x speedup** (234s → 17s avg).

**Status**: ✅ Implemented in **PR #139**.
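
For reference, the difference between sequential and parallel initialization can be reproduced with a small stand-alone sketch. It is not the code from PR #139: `init_server`, the server names, and the delays are hypothetical stand-ins for the real MCP server startup.

```python
import asyncio
import time


async def init_server(name: str, delay: float) -> str:
    """Stand-in for starting one MCP tool server and listing its tools."""
    await asyncio.sleep(delay)  # simulates subprocess spawn + handshake
    return f"{name}:ready"


async def init_sequential(servers: dict[str, float]) -> list[str]:
    # Each server starts only after the previous one has finished.
    return [await init_server(name, delay) for name, delay in servers.items()]


async def init_parallel(servers: dict[str, float]) -> list[str]:
    # gather() runs all coroutines concurrently and keeps the input order.
    return await asyncio.gather(
        *(init_server(name, delay) for name, delay in servers.items())
    )


async def timed(coro):
    """Await coro and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


# Hypothetical server names and startup delays.
SERVERS = {"search": 0.05, "scrape": 0.05, "python": 0.05}

seq_result, seq_time = asyncio.run(timed(init_sequential(SERVERS)))
par_result, par_time = asyncio.run(timed(init_parallel(SERVERS)))
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

With three equal delays the parallel version takes roughly one delay instead of three. If any server can fail to start, `return_exceptions=True` keeps one failure from cancelling the view of the others.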

---

### 2.2 MCP Server Connection Not Reused Across Tool Calls 🔴

**Location**: `libs/miroflow-tools/src/miroflow_tools/manager.py` → `execute_tool_call()`

**Every single tool call** spawns a new MCP server subprocess, performs the stdio handshake, executes the call, then destroys the process:

```python
# Called ~400 times per BC task!
async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()  # Handshake every time
        tool_result = await session.call_tool(tool_name, arguments)
# Process destroyed here
```

Note: `playwright` already does connection reuse correctly (`manager.py:247-252`).

**Proposed fix**: Keep MCP server sessions alive for the lifetime of a task.

**Expected impact**: Eliminate ~400 process spawns per task. Estimated **2-5 min saved per task**.

---

### 2.3 httpx Connection Pooling ✅ Fixed

**Location**: `libs/miroflow-tools/src/miroflow_tools/dev_mcp_servers/search_and_scrape_webpage.py`

Each search call created a new TCP connection. The server now reuses a shared `httpx.AsyncClient`.

**Status**: ✅ Implemented in **PR #139**.

---

### 2.4 max_turns Reduced from 400 to 300 ✅ Fixed

Algorithm team confirmed tasks not solved within 300 turns essentially never succeed in turns 300-400. New config `mirothinker_1.7_keep5_max300.yaml` saves ~25% wasted compute on tail tasks.

**Status**: ✅ Already on main.

---

### 2.5 Concurrency Overloading (No Backpressure)

With `NUM_RUNS=2` and `MAX_CONCURRENT=60`, up to **120 processes** run simultaneously (2 × 60). Under this contention, E2B sandbox init spikes from 33s to **631s**.

**Proposed fix**: Shared semaphore across runs, or adaptive concurrency.
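
A minimal sketch of the shared-semaphore option, assuming all runs are driven from one Python process (if each run is a separate OS process, a cross-process limiter would be needed instead). `run_all` and the sleep standing in for sandbox init are hypothetical.

```python
import asyncio


async def run_all(num_runs: int, tasks_per_run: int, global_limit: int) -> int:
    """Run every task of every run under one shared limit; return peak concurrency."""
    semaphore = asyncio.Semaphore(global_limit)  # shared by all runs
    active = 0
    peak = 0

    async def one_task() -> None:
        nonlocal active, peak
        async with semaphore:  # tasks wait here once the limit is reached
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.001)  # stand-in for sandbox init + task work
            active -= 1

    async def one_run() -> None:
        await asyncio.gather(*(one_task() for _ in range(tasks_per_run)))

    await asyncio.gather(*(one_run() for _ in range(num_runs)))
    return peak


peak = asyncio.run(run_all(num_runs=2, tasks_per_run=60, global_limit=60))
print(peak)  # stays at or below 60, versus 120 with a per-run limit
```

An adaptive variant could lower `global_limit` when sandbox init latency rises, at the cost of more moving parts.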

---

## Won't Fix (confirmed with algorithm team)

- ~~`scrape_and_extract_info` optimization~~ — Jina + LLM extraction are both necessary, cannot be shortened without accuracy loss.
- ~~LLM retry parameters (base_wait=30s, max_retries=10)~~ — Required for reliability under high load.

---

## Benchmark-Specific Impact

| Benchmark | Tasks | Avg Task Time | Main Bottleneck |
|---|---|---|---|
| **BC-EN** | 1266 | ~43 min | LLM inference (long context) + scrape calls |
| **BC-ZH** | 289 | ~49 min | LLM inference + high turn count |
| HLE | 500 | ~15 min | E2B sandbox latency |
| GAIA | 103 | ~20 min | Mixed tools |

---

## Progress Tracker

| Priority | Optimization | Expected Impact | Status |
|---|---|---|---|
| **P0** | LLM inference: prefix caching + chunked prefill | **30-50% less inference time** | 🔴 To do |
| **P0** | Parallel tool server init | **13.8x** init speedup (234s → 17s) | ✅ PR #139 |
| **P0** | MCP server connection reuse | Save **2-5 min/task** | 🔴 To do |
| **P1** | httpx connection pooling | Reduce TCP overhead | ✅ PR #139 |
| **P1** | max_turns 400 → 300 | ~25% less wasted compute | ✅ On main |
| **P2** | Concurrency backpressure | Reduce E2B init spike | ⬚ To do |

---

## Environment

- Agent config: `mirothinker_1.7_keep5_max300` (previously v1.5 max400)
- Typical: 30B model, 8×GPU sglang, MAX_CONCURRENT=60, NUM_RUNS=2
- Profiled on: BC-ZH/BC-EN completed evaluations (xxg and lxx checkpoints)
