
[AgentGateway] feat: Multiple Linear Chains for Gateway Sessions#66

Open
zhanghuiyao wants to merge 5 commits into
verl-project:mainfrom
zhanghuiyao:gateway-multiple-chains


@zhanghuiyao zhanghuiyao commented Jun 23, 2026


Summary

Implements per-session multiple linear chains for GatewaySession. A gateway session no longer uses one global message_history plus one active trajectory as the runtime source of truth. Instead, it keeps multiple append-only linear chains so sub-agent/system splits, context compaction, repeated prompts, and best-of-N / retry-style flows do not overwrite unrelated trajectory branches.

In the current PR head, multiple chains are the default and only GatewaySession state model. Live state is stored in active_chains, closed state is stored in materialized_chains, and there is no GatewayActorConfig.enable_multiple_chains rollout flag. The PR also includes two M2 opt-in follow-ups that are implemented in this branch: enable_parallel_session_generation (default False) for same-session backend concurrency experiments, and ignore_cch_for_prefix_hash (default False) for Claude Code cch= prefix-hash compatibility. M3 branch identity and idempotency remain design-only and are not implemented here.

What's in this PR

Default multiple-chain state model

  • Adds ChainState for one active linear chain with its own message history, stable prefix hashes, tool/schema compatibility state, effective chat-template kwargs, trajectory buffer, media state, logprob-completeness state, and sequence metadata.
  • Adds MaterializedChain so early-closed chains and still-active chains can be sorted together at finalize.
  • Replaces the old single-active runtime state with active_chains and materialized_chains as the authoritative session model.

Chain selection, lifecycle, and materialization

  • Selects the longest compatible active chain; equal-length sibling matches use deterministic (updated_seq, chain_id) tie-breaks.
  • Gates reuse by tools, effective chat_template_kwargs, and canonical message-prefix hash matching.
  • Keeps prepare mutation isolated to a working buffer clone.
  • Preserves commit-on-success behavior for backend failures.
  • Rejects late commits after finalize/abort without mutating session state.
  • Closes only the selected chain on length exhaustion and assigns close-time order_seq.
  • Floors remaining response budget at 0 before applying request max_tokens.
  • Maintains response-logprob all-or-none materialization: response_logprobs is either aligned with response_ids or emitted as None.
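
The budget floor and the all-or-none logprob rule from the list above can be sketched as follows. This is a minimal illustration rather than the PR's code: `effective_max_tokens` and `materialize_logprobs` are hypothetical helper names, and using `None` to mark a missing per-token logprob is an assumption.

```python
from typing import List, Optional, Sequence


def effective_max_tokens(response_budget: int, used_tokens: int,
                         request_max_tokens: Optional[int]) -> int:
    """Floor the remaining budget at 0, then apply the request's max_tokens."""
    remaining = max(0, response_budget - used_tokens)
    if request_max_tokens is None:
        return remaining
    return min(remaining, request_max_tokens)


def materialize_logprobs(response_ids: Sequence[int],
                         logprobs: Sequence[Optional[float]]) -> Optional[List[float]]:
    """Return logprobs aligned with response_ids, or None if any are missing."""
    if len(logprobs) != len(response_ids):
        return None
    if any(lp is None for lp in logprobs):
        return None
    return list(logprobs)
```

Flooring first means an over-budget chain sends 0 to the backend instead of a negative value, which is the concern raised in the review comment on this PR.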

Prefix matching and codec compatibility

  • Adds stable cumulative message-prefix hashing with deterministic JSON and SHA-256 domain prefixes.
  • Uses MessageCodec.canonicalize_message_for_prefix_comparison() for both incoming request messages and committed assistant messages.
  • Preserves committed tool_calls[].id and tool_call_id in prefix-hash semantics, so rewriting committed tool-call ids splits instead of reusing an unsafe buffer.
  • Adds MessageCodec.effective_chat_template_kwargs() and routes full/incremental encoding through the same effective-kwargs merge rule.
  • Fixes processor-backed incremental prefix slicing so the sliced prefix comes from the same effective kwargs and encoder path.

M2 opt-in same-session backend concurrency

  • Adds enable_parallel_session_generation=False as an explicit opt-in.
  • Keeps default behavior serialized under generation_lock.
  • When enabled, keeps prepare/commit/decode serialized under request_lock, while backend generation runs without holding a session lock.
  • Allocates per-generation generation_id and sends backend request_id as session_id:generation_id, avoiding same-session backend request-id collisions.
  • Tracks _inflight_generations for snapshot/debug visibility.
  • If a selected chain advanced or disappeared while the backend was running, preserves the successful backend output as a new sibling chain instead of overwriting or dropping it.
  • Clears in-flight state on success, stale-sibling append, backend failure, decode failure, late finalize/abort rejection, and cancellation.
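
The lock discipline described in this list might look roughly like the sketch below. It is a toy model, not the PR's GatewaySession: `ToySession`, `chat`, and the `backend` callable are hypothetical, chain selection and the default generation_lock path are omitted, and only the `session_id:generation_id` request-id format, the `request_lock` name, and `_inflight_generations` are taken from the description.

```python
import asyncio
import itertools


class ToySession:
    # Hypothetical miniature of the opt-in parallel path.
    def __init__(self, session_id):
        self.session_id = session_id
        self.request_lock = asyncio.Lock()
        self._generation_ids = itertools.count(1)
        self._inflight_generations = set()
        self.committed_request_ids = []

    async def chat(self, backend):
        async with self.request_lock:            # prepare is serialized
            generation_id = next(self._generation_ids)
            self._inflight_generations.add(generation_id)
        request_id = f"{self.session_id}:{generation_id}"
        try:
            output = await backend(request_id)   # no session lock held here
            async with self.request_lock:        # commit/decode is serialized
                self.committed_request_ids.append(request_id)
            return output
        finally:
            # Runs on success, backend failure, and cancellation alike.
            self._inflight_generations.discard(generation_id)
```

Putting the cleanup in `finally` is one simple way to cover every exit path named above; the PR may clear in-flight state differently.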

M2 opt-in cch= prefix-hash compatibility

  • Adds ignore_cch_for_prefix_hash=False.
  • Normalizes only cch=[A-Za-z0-9_-]+ string leaves for prefix-hash input when explicitly enabled.
  • Keeps raw payloads, stored history, trajectories, TQ fields, and HTTP responses unchanged.
  • Fails fast on non-bool M2 flag values at framework config, actor config, and direct session construction boundaries.
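
A strict-bool check of the kind described in the last bullet could be as small as the sketch below. `require_bool` is a hypothetical helper, and raising `TypeError` is an assumption; the PR description does not say which exception is used.

```python
def require_bool(name: str, value: object) -> bool:
    """Reject anything that is not exactly a bool, including "false", 0, and 1."""
    if not isinstance(value, bool):
        raise TypeError(f"{name} must be a bool, got {type(value).__name__}")
    return value
```

The point of the strict check is that a config value such as the string "false" is truthy in Python, so silently coercing it would enable a flag the caller meant to disable.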

Framework / reward / TQ behavior

  • Keeps the Trajectory and TransferQueue schemas unchanged.
  • Materializes each chain as the existing Trajectory schema.
  • Keeps the framework reward policy unchanged: _score_trajectories() scores session_trajectories[-1] and broadcasts that score to all trajectories from the session.
  • Documents the M2 reward caveat: parallel completion order can make session_trajectories[-1] timing-dependent, so the parallel flag should only be used when concurrent sibling chains are meant to share the same broadcast reward.
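
The unchanged reward policy amounts to the following sketch. `broadcast_last_score` and `score_fn` are hypothetical names used only for illustration; the actual logic lives in _score_trajectories().

```python
def broadcast_last_score(session_trajectories, score_fn):
    """Score only the last trajectory and give every trajectory that score."""
    if not session_trajectories:
        return []
    score = score_fn(session_trajectories[-1])
    return [score] * len(session_trajectories)
```

Because only the last element is scored, any reordering of session_trajectories changes which chain determines the reward, which is why the parallel flag carries the caveat above.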

Actor / manager wiring and observability

  • _GatewayActor.create_session() forwards prompt/response budgets plus M2 flags into GatewaySession.
  • build_gateway_manager() reads actor_rollout_ref.rollout.custom.agent_framework.enable_parallel_session_generation and ignore_cch_for_prefix_hash.
  • snapshot_state() reports num_active_chains, active_chain_ids, active chain tip hashes, and in-flight generation ids for live-session inspection.
  • User docs describe default multiple chains plus the two M2 opt-in flags.

Tests (CPU-only)

  • tests/uni_agent/gateway/test_session_multiple_chains_on_cpu.py
    • M1 coverage for linear parity, subagent return-to-main, compaction split, repeated prompt siblings, distinct-sibling exact continuation by assistant echo, tools/effective-kwargs gates, backend failure, selected-chain length close ordering, surviving-chain continuation after length close, budget clamping, multimodal chain locality, unsent media on length exhaustion, media list copying, late finalize/abort rejection, prefix-hash canonicalization, tool-call echo, committed tool-call id rewrite splitting, and all-or-none response logprobs.
    • M2 coverage for same-tip stale success becoming a sibling, concurrent continuations of different chains committing in place, same-prompt concurrent siblings with unique backend request ids, terminal-state late-commit rejection, length-close racing with backend success, backend failure cleanup, decode failure cleanup, cancellation cleanup, and opt-in cch= normalization.
  • tests/uni_agent/gateway/test_gateway_config_on_cpu.py
    • Low-level strict-bool validation for M2 flags on GatewayActorConfig.
  • tests/uni_agent/gateway/test_gateway_actor_on_cpu.py
    • Actor-level coverage for session budget and M2 flag forwarding, response-budget clamping, default serialized same-session requests, opt-in parallel same-session requests, subagent return-to-main ordering, repeated same-prompt latest-sibling continuation, active-chain snapshot state, multimodal flows, and backend-failure behavior.
  • tests/uni_agent/gateway/test_gateway_manager_on_cpu.py
    • Manager/HTTP coverage proving default multi-chain behavior and finalized order across the gateway manager path.
  • tests/uni_agent/framework/test_generate_sequences_on_cpu.py
    • Gateway-manager config defaults, M2 flag wiring, strict-bool rejection from training config, framework reward/TQ ordering, and last-finalized-trajectory reward target checks.

Representative local validation recorded for this PR head:

.venv/bin/python -m pytest tests/uni_agent/gateway/test_session_multiple_chains_on_cpu.py tests/uni_agent/gateway/test_gateway_config_on_cpu.py tests/uni_agent/framework/test_generate_sequences_on_cpu.py::test_build_gateway_manager_wires_gateway_config_defaults tests/uni_agent/framework/test_generate_sequences_on_cpu.py::test_build_gateway_manager_wires_ignore_cch_for_prefix_hash tests/uni_agent/framework/test_generate_sequences_on_cpu.py::test_build_gateway_manager_wires_enable_parallel_session_generation tests/uni_agent/framework/test_generate_sequences_on_cpu.py::test_build_gateway_manager_rejects_non_bool_m2_flags -q
# 62 passed, 1 warning

git diff --check
# passed

Design

The M1 goal of this design is to make multiple append-only linear chains the sole state model of GatewaySession, while keeping the training data plane and the OpenAI-compatible HTTP envelope unchanged. The core principle is: reuse the token buffer of a chain only if the incoming history can be proven to be a safe continuation of that chain; otherwise, create a new chain and retain the old chain until finalize.

The M2 goal of this design is to allow multiple backend generations within the same gateway session to run concurrently as an opt-in mode on top of M1's multiple-chain state model, while keeping chain selection, commit, decode, and accepted-outcome ordering boundaries verifiable.

  1. Prefix matching strategy, eligibility gates, and hash strategy

    M1 uses a "longest continuable chain" selection rule. For each active chain, the session first checks active_tool_schemas == tools, then checks that the request's effective chat_template_kwargs equals the chain's effective kwargs, and finally checks whether the prefix hash of the incoming normalized message history, taken over the chain's message_history length, equals the chain tip hash. Among all matching chains, the chain with the longest message_history wins. If multiple chains have the same length, (updated_seq, chain_id) is used as the deterministic tie-break. If no active chain matches, the request starts a new chain.
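
The selection rule above can be sketched as follows. `Chain` is a hypothetical, simplified stand-in for ChainState, and `prefix_hashes[i]` is assumed to hold the cumulative hash of the first `i + 1` request messages. Preferring the larger `updated_seq` follows the "latest-updated" tie-break mentioned later in this description; preferring the larger `chain_id` after that is an assumption.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Chain:
    # Hypothetical, simplified stand-in for the PR's ChainState.
    chain_id: int
    tools: Any
    template_kwargs: dict
    tip_hash: str
    length: int          # len(message_history)
    updated_seq: int


def select_chain(chains: List[Chain], tools: Any, template_kwargs: dict,
                 prefix_hashes: List[str]) -> Optional[Chain]:
    candidates = []
    for chain in chains:
        if chain.tools != tools:                        # tool-schema gate
            continue
        if chain.template_kwargs != template_kwargs:    # effective-kwargs gate
            continue
        if chain.length == 0 or chain.length > len(prefix_hashes):
            continue
        if prefix_hashes[chain.length - 1] != chain.tip_hash:
            continue                                    # not a prefix of the request
        candidates.append(chain)
    if not candidates:
        return None                                     # caller starts a new chain
    # Longest history wins; ties fall back to (updated_seq, chain_id).
    return max(candidates, key=lambda c: (c.length, c.updated_seq, c.chain_id))
```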

    Prefix hashes describe only whether the message history is the same prefix. Tools and effective chat-template kwargs are kept as separate compatibility gates and are not mixed into the message hash. Hash input is the output of MessageCodec.normalize_request(), canonicalized by MessageCodec.canonicalize_message_for_prefix_comparison(), serialized with deterministic JSON, and hashed with SHA-256. Prefix hashes are cumulative over message hashes. Assistant messages committed after backend generation use the same canonicalization path, so a later request that echoes the committed assistant message can reattach to the same chain. M1 does not use Python's built-in hash(), repr(), or process-randomized state.
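
A cumulative prefix hash of this shape can be sketched with the standard library alone. The sketch assumes the messages have already been normalized and canonicalized; the domain prefixes `b"msg\x00"` and `b"prefix\x00"` and the function names are hypothetical, not the values used in the PR.

```python
import hashlib
import json


def message_hash(message: dict) -> str:
    # Deterministic JSON: sorted keys and fixed separators, so key order
    # in the incoming dict does not affect the hash.
    payload = json.dumps(message, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False)
    return hashlib.sha256(b"msg\x00" + payload.encode("utf-8")).hexdigest()


def prefix_hashes(messages: list) -> list:
    """Return one cumulative hash per message; entry i covers messages[:i + 1]."""
    hashes = []
    running = hashlib.sha256(b"prefix\x00root").hexdigest()
    for message in messages:
        running = hashlib.sha256(
            b"prefix\x00" + running.encode("ascii") + b"\x00"
            + message_hash(message).encode("ascii")
        ).hexdigest()
        hashes.append(running)
    return hashes
```

Because each entry depends only on the messages before it, a request that extends a chain reproduces that chain's tip hash at the matching index, and the result is stable across processes, unlike Python's built-in hash().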

  2. Backend rollout and incremental-token strategy

    When a request continues an existing chain, prepare clones that chain's TrajectoryBuffer plus chain-local media and logprob state. It does not mutate the active chain directly. The session computes incremental_messages = messages[len(chain.message_history):], encodes those messages with encode_incremental(), and after the length-budget check appends the incremental token ids to the working buffer with response_mask=0, because those tokens are continuation context rather than this turn's generated answer.
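
The slicing and masking steps above can be illustrated with a toy buffer. `WorkingBuffer`, `toy_encode`, and `prepare_continuation` are hypothetical stand-ins for the cloned TrajectoryBuffer, encode_incremental(), and the prepare step; the length-budget check and media/logprob state are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingBuffer:
    # Hypothetical stand-in for a cloned TrajectoryBuffer.
    prompt_ids: list = field(default_factory=list)
    response_ids: list = field(default_factory=list)
    response_mask: list = field(default_factory=list)

    def append(self, token_ids, mask_value):
        self.response_ids.extend(token_ids)
        self.response_mask.extend([mask_value] * len(token_ids))


def toy_encode(messages):
    # Placeholder for encode_incremental(): one fake token per character.
    return [ord(ch) for m in messages for ch in m["content"]]


def prepare_continuation(buffer, chain_history, messages, encode=toy_encode):
    # Only the messages beyond the chain's committed history are encoded.
    incremental_messages = messages[len(chain_history):]
    # Continuation context is not a generated answer, so it is masked with 0.
    buffer.append(encode(incremental_messages), mask_value=0)
    return incremental_messages
```

After a successful backend call, the generated tokens would be appended to the same buffer with `mask_value=1`.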

    The rollout backend still receives buffer.prompt_ids + buffer.response_ids as context. After the backend succeeds, the session appends backend output tokens with response_mask=1, decodes the assistant message, and commits under request_lock. New chains are allocated and appended; existing chains are replaced in place only if the selected chain is still the same live tip. Backend failure keeps commit-on-success semantics: no chain is modified and no partial trajectory is created.
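
The commit decision can be sketched as below, assuming the session remembers which chain it selected and the tip hash it saw at prepare time. The `commit` function, the dict-based chain state, and the id counter are hypothetical simplifications. On backend failure this function is simply never called, which is what keeps commit-on-success semantics.

```python
import itertools


def commit(active_chains, selected_id, expected_tip, new_state, id_counter):
    """Commit a successful generation; active_chains maps chain_id -> state dict."""
    live = active_chains.get(selected_id) if selected_id is not None else None
    if live is not None and live["tip_hash"] == expected_tip:
        # The selected chain is still the same live tip: replace it in place.
        active_chains[selected_id] = new_state
        return selected_id
    # New chain, or the selected chain advanced or disappeared meanwhile:
    # keep the output as a separate chain instead of overwriting anything.
    new_id = next(id_counter)
    active_chains[new_id] = new_state
    return new_id
```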

  3. Boundary conditions and chain preservation

    Tool, permission, or template changes must not silently reuse an old token buffer. M1 explicitly gates reuse by tool schemas and effective chat_template_kwargs; any caller-side change that affects available tools, system policy, or prompt rendering must show up in those compatibility states. If tools or effective kwargs are incompatible, a request starts a new chain even if the message history looks like a prefix.

    In subagent/system-split flows, a subagent with a different system prompt or tool context creates an independent chain. When the caller returns to the main history, the incoming history can match the main chain by prefix hash and continue it, instead of reopening main as a third trajectory. Context compaction or history rewriting that is not a prefix continuation of any active chain also starts a new chain; the previous chains remain available for finalize.

    Repeated same-prompt, retry, and best-of-N flows can create sibling chains. If different samples produce different assistant messages, a later request that faithfully echoes one assistant message can select that sibling exactly through the prefix hash. Ambiguity only remains for canonically identical tips or callers that do not provide enough history or branch identity; M1 handles those cases deterministically with the latest-updated tie-break. M1 does not implement idempotent deduplication or explicit branch selection.

    Length exhaustion closes only the selected continuation chain, does not append unsent incremental tokens or media, and leaves other active chains alive. Finalize sorts early-closed and active chains together by order_seq, preserving the invariant that the last visible session interaction is returned at session_trajectories[-1].
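
The finalize ordering can be illustrated with a short sketch. `Materialized` is a hypothetical stand-in for MaterializedChain, and the sketch assumes that still-active chains are given an order_seq reflecting their most recent commit before the merge; the PR description only states that early-closed chains receive theirs at close time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Materialized:
    # Hypothetical stand-in for MaterializedChain.
    chain_id: int
    order_seq: int


def finalize(early_closed, still_active):
    """Merge both groups and sort by order_seq so the latest interaction is last."""
    merged = list(early_closed) + list(still_active)
    return sorted(merged, key=lambda chain: chain.order_seq)
```

With this ordering, a chain that closed early on length exhaustion cannot displace a chain that was touched later from the final position.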

  4. M2 compatibility opt-in for Claude Code cch= billing-hash churn

    anthropics/claude-code#40652 describes a Claude Code behavior where historical tool-result text containing cch=<value> billing hashes can be rewritten, which changes historical message bytes and can break prompt-cache prefix matching. M1 treats any historical message-content change as semantic and therefore starts a new chain.

    M2 adds an explicit opt-in compatibility flag, ignore_cch_for_prefix_hash, defaulting to False. When enabled, prefix-hash canonicalization normalizes only string leaves matching cch=[A-Za-z0-9_-]+ to a fixed placeholder before deterministic JSON and SHA-256. This does not rewrite the raw request payload, stored message_history, trajectory token buffers, TQ fields, or HTTP responses. It only means the caller accepts cch= value churn as non-semantic for chain-selection purposes; any non-cch= content change still splits into a new chain.
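
The opt-in normalization could be sketched as a recursive copy that rewrites only string leaves. The regular expression is the one quoted above; `normalize_cch`, the placeholder text, and the choice to replace every occurrence inside a string leaf are assumptions made for illustration.

```python
import re

CCH_PATTERN = re.compile(r"cch=[A-Za-z0-9_-]+")
CCH_PLACEHOLDER = "cch=<normalized>"  # hypothetical placeholder value


def normalize_cch(value):
    """Return a copy with cch=<value> tokens replaced inside string leaves only."""
    if isinstance(value, str):
        return CCH_PATTERN.sub(CCH_PLACEHOLDER, value)
    if isinstance(value, list):
        return [normalize_cch(item) for item in value]
    if isinstance(value, dict):
        # Keys are left untouched; only values are walked.
        return {key: normalize_cch(item) for key, item in value.items()}
    return value
```

The function builds a new structure and never mutates its input, which matches the requirement that raw payloads and stored history stay unchanged; the normalized copy would be used only as hash input.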

Scope / deferrals

  • No prefix trie: this PR intentionally does not introduce message nodes, child maps, nearest-checkpoint traversal, warm-start structural nodes, or idempotent retry dedup.
  • No public OpenAI-compatible API change: branching remains message-driven through the existing chat-completion request history.
  • No per-chain reward: reward remains one-session-one-reward; the last finalized trajectory is scored and the result is broadcast to all trajectories in the session.
  • No Trajectory / TQ schema change: multiple chains materialize as multiple existing trajectories.
  • No enable_multiple_chains feature flag: multiple chains are the default and only session state model in this branch.
  • M2 parallel mode is opt-in, not default: same-session backend concurrency is available only behind enable_parallel_session_generation=True and should be used only when timing-dependent final-target reward semantics are acceptable.
  • M3 branch identity and idempotency are not implemented: there is no BranchHints, no uni_agent_parent_chain_id, no uni_agent_parent_tip_hash, and no uni_agent_request_id behavior in this PR.
  • No OpenAI response envelope change for branch metadata: branch ids, tip hashes, split metadata, and idempotency records are not added to normal chat-completion responses.
  • No trainer-visible branch metadata: parent/split/request-id metadata is not written to Trajectory.extra_fields.

Remaining future work (M3 / framework follow-up)

  • Optional gateway-aware branch identity can be designed later for direct-actor or wrapper/control-plane callers that need exact sibling selection.
  • Optional request idempotency can be designed later around a session-scoped request id plus deterministic normalized request hash.
  • If exact branch identity requires per-chain scoring, that should be a separate framework feature rather than a hidden change to the current last-interaction broadcast reward policy.

…eway sessions

- Introduced `enable_multiple_chains` configuration option in GatewayActorConfig to toggle multiple chain support.
- Updated GatewayManager and GatewayActor to handle multiple chains.
- Enhanced GatewaySession to manage active chains, including selection, materialization, and closure of chains.
- Implemented chain state management with ChainState and MaterializedChain data classes.
- Refactored message encoding and response handling to accommodate multiple chains.
- Added methods for computing message prefix hashes and managing chain compatibility.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces support for managing multiple active linear trajectory chains per gateway session, allowing for branching conversations (e.g., subagents) within a single session. The changes span the framework configuration, gateway actor, and session management, backed by a comprehensive suite of unit tests. The feedback identifies a potential issue where the calculated remaining response budget could become negative if the generated response exceeds the budget, which may cause LLM backends to crash; clamping this value to a minimum of zero is recommended.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread uni_agent/gateway/session/session.py
@zhanghuiyao zhanghuiyao marked this pull request as draft June 23, 2026 02:26
@zhanghuiyao zhanghuiyao changed the title [AgentGateway] feat: add support for multiple active linear trajectory chains in gat… [AgentGateway] feat: Multiple Linear Chains for Gateway Sessions Jun 23, 2026
…ro and update MessageCodec for dynamic system prompts
- Removed enable_multiple_chains parameter from session and gateway configurations.
- Simplified GatewaySession to handle multiple chains without conditional checks.
- Updated MessageCodec to derive system prompts from processors when available.
- Consolidated trajectory handling to streamline the materialization of active chains.
- Enhanced tests for multiple chains to ensure consistent behavior and coverage.
- Removed legacy code related to single chain handling, improving clarity and maintainability.
- Introduced `enable_parallel_session_generation` and `ignore_cch_for_prefix_hash` flags in GatewayActorConfig to control session behavior.
- Updated GatewaySession to handle parallel backend generation and manage inflight generations.
- Enhanced tests to cover new functionality, including concurrent requests and validation of boolean flags.
- Modified message prefix hash computation to optionally ignore CCH markers for prefix comparisons.
- Ensured backward compatibility while implementing new features.
@zhanghuiyao zhanghuiyao marked this pull request as ready for review June 25, 2026 02:55
