Problem or Use Case
Currently, the OpenGauss agent framework runs a strictly linear, sequential loop (`G=1`). In high-friction or ambiguous environments (e.g., repository-level software debugging), sequential reasoning is highly susceptible to "Linear Deadlocks": if the agent hits an unexpected tool error or hallucinates early, it can become trapped in a self-reinforcing feedback loop until it exhausts the maximum turn limit. The result is large, unproductive API token expenditure and inconsistent convergence on final states.
Furthermore, the infrastructure currently lacks two things needed to unlock modern Reinforcement Learning (RL) training paradigms: native support for Group Relative advantage estimation, and robust trajectory sanitization for backpropagation.
Proposed Solution
Upgrade the core infrastructure to decouple execution from linear limits:
- Tree-Search BFS Architecture (`environments/agent_loop.py`): Replace linear loops with a configurable Breadth-First Search (BFS) generation system. Allow the agent to fork into `G` parallel exploratory branches simultaneously.
- Low-Overhead Sandbox Isolation (`tools/environments/docker.py`): Implement a high-speed Unix tar-pipe cloning system to provision isolated branch sandboxes instantly (guest boot times `< 1.0s`).
- Best-of-N Reward Alignment (`environments/gauss_base_env.py`): Add standard multi-branch evaluation logic that automatically audits and selects the cleanest, most turn-efficient trajectory for commit.
- Mathematical GRPO Loss Engine (`tools/rl_training_tool.py`): Deliver a standalone `GaussGRPOEngine` to compute Group Relative advantages ($A_i = \frac{R_i - \mu}{\sigma + \epsilon}$) and reference-aligned clipped surrogate losses natively over distributed Ray clusters.
- TRACE Masking Sanitizer (`agent/trace_masking.py`): Isolate deterministic reward assignments directly to user/assistant tool blocks, pruning bulky terminal stdout noise before sequencing trajectories into memory buffers.
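To make the Best-of-N and GRPO items concrete, here is a minimal, stdlib-only sketch of the per-group math. It is not the proposed `GaussGRPOEngine`: the function names, the population standard deviation, the `eps` and `clip_eps` defaults, and the tie-break rule (highest reward, then fewest turns) are all illustrative assumptions. The reference-policy (KL) term and Ray distribution are left out entirely.

```python
import math
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (R_i - mu) / (sigma + eps) over the G rewards of one group.

    Assumption: sigma is the population standard deviation of the group.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards, mu)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, negated so that lower is better.

    Simplification: one sequence-level log-probability per branch,
    and no reference-policy penalty.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -fmean(terms)


def pick_best_branch(rewards, turns):
    """Hypothetical Best-of-N rule: highest reward, ties broken by fewest turns."""
    return max(range(len(rewards)), key=lambda i: (rewards[i], -turns[i]))


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]   # one scalar reward per branch, G = 4
    turns = [30, 12, 40, 18]         # turns consumed by each branch
    advantages = group_relative_advantages(rewards)
    print(advantages)
    print(pick_best_branch(rewards, turns))
```

Normalizing within the group means a branch is credited only for beating its siblings on the same task, so no separate value network is needed. When every branch earns the same reward, `sigma` is zero and `eps` keeps the advantages at zero rather than dividing by zero.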
Empirical Pilot Validation
A head-to-head benchmark on the `TBLite` cohort showed a 12.74% overall turn reduction and 70% API budget savings on complex diagnostic tasks (solving the task in 18 turns versus hitting the 60-turn timeout).
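As a rough consistency check on the second figure, assume API spend scales approximately linearly with turn count. The issue does not say how the budget was measured, so this is only an assumption. Under it, 18 turns against the 60-turn limit gives:

$$\frac{60 - 18}{60} = 0.70$$

This matches the reported 70% savings. In practice per-turn cost tends to grow as context accumulates, so the true saving may differ.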
Alternatives Considered
No response
Feature Type
Performance / reliability
Scope
Large (new module or significant refactor)
Contribution