
Evaluation pipeline performance bottlenecks: BC benchmarks 3-5x slower than necessary #137

@wangbinluo

Summary

BrowseComp (BC-EN/BC-ZH) evaluations are significantly slower than other benchmarks. A single BC-ZH task averages 48.6 minutes (median 37.3 min), with tail tasks reaching 265 minutes (4.4 hours). BC-EN with 1266 tasks takes 60-70 hours per run.

Profiling shows that LLM inference is the dominant bottleneck (roughly 80% or more of total task time), followed by tool execution overhead. Both need optimization to reach the target of a 2x or better overall speedup.


1. LLM Inference Optimization (Biggest Impact) 🔴

LLM inference accounts for the vast majority of evaluation time. A BC task runs ~300-400 turns, each requiring a full LLM call with growing context (up to tens of thousands of tokens). Conservative estimate: 20-60 min per task on inference alone, often much more in practice.

1.1 Prefix Caching

Current state: All evaluation tasks share the same system prompt (several thousand tokens of agent instructions), yet every request recomputes the full KV cache from scratch.

Optimization: Enable prefix caching (SGLang's radix-tree KV cache, RadixAttention) on the sglang server so the system prompt KV cache is computed once and reused across all concurrent requests.

Additional benefit: Within a single task's multi-turn conversation, each turn's context is a prefix of the next. With cache-aware routing (routing same-task requests to the same worker), KV cache from previous turns can be reused.
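The routing idea above can be sketched as sticky hashing on the task ID. This is a deliberately simplified stand-in for cache-aware routing (a real router would typically match on the request prefix), and `pick_worker` and the worker names are hypothetical, not part of sglang or our infra framework.

```python
import hashlib


def pick_worker(task_id: str, workers: list[str]) -> str:
    """Map a task ID to a fixed worker so every turn of that task
    lands on the worker that already holds its KV cache.

    Hypothetical helper: plain sticky hashing, not a prefix-aware router.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    # A stable hash (unlike the built-in hash()) gives the same answer
    # across processes and restarts.
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


if __name__ == "__main__":
    workers = ["worker-0", "worker-1", "worker-2"]  # placeholder names
    for turn in range(3):
        # Same task ID, so the same worker is chosen on every turn.
        print(turn, pick_worker("bc-zh-task-0001", workers))
```

Note that modulo hashing reshuffles most tasks when the worker list changes; that is acceptable for a fixed evaluation cluster but would need consistent hashing otherwise.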

Expected impact: Significant reduction in prefill compute, especially for later turns with long context. Estimated 30-50% inference time reduction.
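As a rough illustration of why later turns benefit most, the sketch below counts prefill tokens for one task with and without prefix reuse. The turn count, system prompt size, and per-turn growth are illustrative assumptions rather than measured values, and the model ignores decode time and any context truncation, so it bounds prefill work only and says nothing direct about wall-clock time.

```python
def prefill_tokens(turns: int, system_tokens: int, tokens_per_turn: int) -> tuple[int, int]:
    """Return (no_reuse, full_reuse) total prefill tokens for one task.

    Assumes the context grows by a fixed amount each turn (a simplification).
    """
    no_reuse = 0
    context = system_tokens
    for _ in range(turns):
        context += tokens_per_turn  # new tool output and reply appended
        no_reuse += context         # without reuse, the whole context is prefilled again
    full_reuse = context            # with perfect reuse, each token is prefilled once
    return no_reuse, full_reuse


if __name__ == "__main__":
    # Assumed numbers: 300 turns, 3,000-token system prompt, 150 new tokens per turn.
    no_reuse, full_reuse = prefill_tokens(300, 3_000, 150)
    print(f"no reuse:   {no_reuse:,} prefill tokens")
    print(f"full reuse: {full_reuse:,} prefill tokens")
```

Under these assumptions the no-reuse total grows quadratically with the number of turns, while the full-reuse total grows linearly.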

1.2 Chunked Prefill with Pipeline Parallelism

Current state: Basic sglang launch with --tp 8, no chunked prefill.

Optimization: Enable --chunked-prefill-size 4096 --enable-dynamic-chunking to pipeline long prefills. SGLang benchmarks show up to 3.3x prefill throughput and 67.9% TTFT reduction for long contexts.

Why this matters for evaluation: BC tasks accumulate long conversation histories (10K-60K+ tokens). Chunked prefill directly reduces the time spent on these long-context prefill operations.

1.3 Prefill-Decode Disaggregation (Multi-Node)

Current state: Single-node TP8 serving both prefill and decode.

Optimization: When multiple nodes are available, separate prefill-heavy and decode-heavy workloads into dedicated workers. Evaluation workloads are prefill-dominant (long contexts, relatively short generations), making this particularly beneficial.

Expected impact: Better GPU utilization and throughput when scaling beyond single node.

1.4 Current vs Optimized sglang Config

Current (basic):

python3 -m sglang.launch_server \
    --model-path <path> --tp 8 --host 0.0.0.0 --port 1234 \
    --trust-remote-code --enable-metrics --mem-fraction-static 0.9

Optimized (recommended):

python3 -m sglang.launch_server \
    --model-path <path> --tp 8 --host 0.0.0.0 --port 1234 \
    --trust-remote-code --enable-metrics --mem-fraction-static 0.9 \
    --chunked-prefill-size 4096 --enable-dynamic-chunking

Our internal infra framework already supports these advanced sglang features (prefix caching, chunked pipeline, PD disaggregation, cache-aware routing). Integrating them into the evaluation serving setup is the highest-leverage optimization.


2. Evaluation Pipeline Optimization

2.1 MCP Tool Server Parallel Initialization ✅ Fixed

Location: libs/miroflow-tools/src/miroflow_tools/manager.py → get_all_tool_definitions()

Each task initialized 3 MCP tool servers sequentially. Under high concurrency: avg 234s, max 945s.

Fix: Use asyncio.gather() to initialize the servers concurrently. Measured 13.8x speedup (234s → 17s avg).

Status: ✅ Implemented in PR #139.
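For readers who have not seen the pattern, here is a minimal sketch of sequential versus concurrent initialization with asyncio.gather(). It is not the code from PR #139: `fetch_tool_definitions` is a hypothetical stand-in that simulates server startup with a sleep, and the server names are placeholders.

```python
import asyncio


async def fetch_tool_definitions(server_name: str, startup_delay: float) -> dict:
    """Hypothetical stand-in for connecting to one MCP server and listing its tools."""
    await asyncio.sleep(startup_delay)  # simulates subprocess spawn + handshake
    return {"name": server_name, "tools": [f"{server_name}_tool"]}


async def get_all_sequential(servers: list[tuple[str, float]]) -> list[dict]:
    # Total time is the sum of all startup delays.
    return [await fetch_tool_definitions(name, delay) for name, delay in servers]


async def get_all_parallel(servers: list[tuple[str, float]]) -> list[dict]:
    # Total time is roughly the slowest single startup; result order matches input order.
    return list(await asyncio.gather(
        *(fetch_tool_definitions(name, delay) for name, delay in servers)
    ))


if __name__ == "__main__":
    servers = [("search", 0.2), ("scrape", 0.3), ("python", 0.1)]  # placeholder names
    print(asyncio.run(get_all_parallel(servers)))
```

By default asyncio.gather() propagates the first exception raised by any coroutine, so a caller that wants to tolerate one failing server would pass return_exceptions=True and inspect the results.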


2.2 MCP Server Connection Not Reused Across Tool Calls 🔴

Location: libs/miroflow-tools/src/miroflow_tools/manager.py → execute_tool_call()

Every tool call spawns a new MCP server subprocess, performs the stdio handshake, executes the call, then tears the process down:

# Called ~400 times per BC task!
async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()  # Handshake every time
        tool_result = await session.call_tool(tool_name, arguments)
# Process destroyed here

Note: The playwright server already reuses its connection correctly (manager.py:247-252).

Proposed fix: Keep MCP server sessions alive for the lifetime of a task.

Expected impact: Eliminate ~400 process spawns per task. Estimated 2-5 min saved per task.
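One possible shape for the proposed fix is a per-task session pool built on contextlib.AsyncExitStack. This is a sketch under assumptions, not the planned implementation: `TaskSessionPool` and `open_session` are hypothetical names, and the demo uses a fake session so it runs without an MCP server. In real use, `open_session` would wrap the stdio_client + ClientSession + initialize() sequence shown above.

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager


class TaskSessionPool:
    """Opens each tool-server session once and keeps it open until the task ends."""

    def __init__(self, open_session):
        # open_session(server_name) must return an async context manager
        # that yields a ready-to-use session (hypothetical interface).
        self._open_session = open_session
        self._stack = AsyncExitStack()
        self._sessions = {}
        self._lock = asyncio.Lock()

    async def get(self, server_name: str):
        async with self._lock:  # avoid two concurrent spawns of the same server
            if server_name not in self._sessions:
                self._sessions[server_name] = await self._stack.enter_async_context(
                    self._open_session(server_name)
                )
            return self._sessions[server_name]

    async def aclose(self) -> None:
        # Closes every session in reverse order of opening.
        await self._stack.aclose()
        self._sessions.clear()


async def run_task(num_calls: int) -> int:
    """Simulate one task making num_calls tool calls; return the number of spawns."""
    spawns = []

    @asynccontextmanager
    async def fake_session(server_name: str):
        spawns.append(server_name)  # one entry per simulated subprocess spawn
        yield f"session:{server_name}"

    pool = TaskSessionPool(fake_session)
    try:
        for _ in range(num_calls):
            await pool.get("search")  # reuses the session after the first call
    finally:
        await pool.aclose()
    return len(spawns)


if __name__ == "__main__":
    print(asyncio.run(run_task(400)))
```

One caveat worth checking before adopting this: context managers built on anyio task groups generally expect to be entered and exited from the same asyncio task, so the pool should be created and closed by the coroutine that owns the task.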


2.3 httpx Connection Pooling ✅ Fixed

Location: libs/miroflow-tools/src/miroflow_tools/dev_mcp_servers/search_and_scrape_webpage.py

Each search call used to open a new TCP connection. Calls now reuse a shared httpx.AsyncClient.

Status: ✅ Implemented in PR #139.


2.4 max_turns Reduced from 400 to 300 ✅ Fixed

The algorithm team confirmed that tasks not solved within 300 turns essentially never succeed in turns 300-400. The new config mirothinker_1.7_keep5_max300.yaml saves ~25% of wasted compute on tail tasks.

Status: ✅ Already on main.


2.5 Concurrency Overloading (No Backpressure)

With NUM_RUNS=2 and MAX_CONCURRENT=60, up to 120 processes (2 × 60) run simultaneously. Under this contention, E2B sandbox init time spikes from 33s to 631s.

Proposed fix: Shared semaphore across runs, or adaptive concurrency.
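The shared-semaphore option could look roughly like the sketch below. It assumes all runs are scheduled from a single process and event loop, which may not match the current launcher; if each run is a separate OS process, an asyncio.Semaphore cannot be shared and a cross-process mechanism would be needed instead. `run_all` and its parameters are hypothetical names.

```python
import asyncio


async def run_all(num_runs: int, tasks_per_run: int, global_limit: int) -> int:
    """Run every task of every run behind one shared gate; return peak concurrency."""
    gate = asyncio.Semaphore(global_limit)  # one limit for all runs, not one per run
    active = 0
    peak = 0

    async def run_task() -> None:
        nonlocal active, peak
        async with gate:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # placeholder for sandbox init + agent loop
            active -= 1

    async def run_one(num_tasks: int) -> None:
        await asyncio.gather(*(run_task() for _ in range(num_tasks)))

    await asyncio.gather(*(run_one(tasks_per_run) for _ in range(num_runs)))
    return peak


if __name__ == "__main__":
    # Two runs of 60 tasks each, capped at 60 in flight overall instead of 120.
    print(asyncio.run(run_all(num_runs=2, tasks_per_run=60, global_limit=60)))
```

An adaptive variant would adjust the limit at runtime, for example by lowering it when sandbox init latency rises, but that needs a custom gate rather than a plain asyncio.Semaphore.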


Won't Fix (confirmed with algorithm team)

  • scrape_and_extract_info optimization — Jina + LLM extraction are both necessary, cannot be shortened without accuracy loss.
  • LLM retry parameters (base_wait=30s, max_retries=10) — Required for reliability under high load.

Benchmark-Specific Impact

| Benchmark | Tasks | Avg Task Time | Main Bottleneck |
|---|---|---|---|
| BC-EN | 1266 | ~43 min | LLM inference (long context) + scrape calls |
| BC-ZH | 289 | ~49 min | LLM inference + high turn count |
| HLE | 500 | ~15 min | E2B sandbox latency |
| GAIA | 103 | ~20 min | Mixed tools |

Progress Tracker

| Priority | Optimization | Expected Impact | Status |
|---|---|---|---|
| P0 | LLM inference: prefix caching + chunked prefill | 30-50% inference speedup | 🔴 To do |
| P0 | Parallel tool server init | 13.8x init speedup (234s → 17s) | ✅ PR #139 |
| P0 | MCP server connection reuse | Save 2-5 min/task | 🔴 To do |
| P1 | httpx connection pooling | Reduce TCP overhead | ✅ PR #139 |
| P1 | max_turns 400 → 300 | ~25% less wasted compute | ✅ On main |
| P2 | Concurrency backpressure | Reduce E2B init spike | ⬚ To do |

Environment

  • Agent config: mirothinker_1.7_keep5_max300 (previously v1.5 max400)
  • Typical: 30B model, 8×GPU sglang, MAX_CONCURRENT=60, NUM_RUNS=2
  • Profiled on: BC-ZH/BC-EN completed evaluations (xxg and lxx checkpoints)
