[Feature]: OpenGauss 2.0 — Tree-Search BFS Agent Loop, Distributed GRPO Infrastructure & TRACE Reward Masking #450

@RUFFY-369

Description

Problem or Use Case

The OpenGauss agent framework currently runs a strictly linear, sequential loop (G=1). In high-friction or ambiguous environments (e.g., repository-level software debugging), sequential reasoning is highly susceptible to "Linear Deadlocks": if the agent hits an unexpected tool error or hallucinates early, it can become trapped in a self-reinforcing feedback loop until it exhausts the maximum turn limit. The result is large, unproductive API token spend and inconsistent convergence on a final state.
Furthermore, the infrastructure lacks native support for two things that modern Reinforcement Learning (RL) training needs: Group Relative advantage estimation and robust trajectory sanitization before backpropagation.

Proposed Solution

Upgrade the core infrastructure so that execution is no longer bound to a single linear trajectory:

  1. Tree-Search BFS Architecture (environments/agent_loop.py): Replace the linear loop with a configurable Breadth-First Search (BFS) generation system that lets the agent fork into G parallel exploratory branches.
  2. Low-Overhead Sandbox Isolation (tools/environments/docker.py): Implement a high-speed Unix tar-pipe cloning system that provisions an isolated sandbox per branch (guest boot times < 1.0s).
  3. Best-of-N Reward Alignment (environments/gauss_base_env.py): Add multi-branch evaluation logic that automatically audits the branches and selects the cleanest, most turn-efficient trajectory for commit.
  4. Mathematical GRPO Loss Engine (tools/rl_training_tool.py): Deliver a standalone GaussGRPOEngine that computes Group Relative advantages ($A_i = \frac{R_i - \mu}{\sigma + \epsilon}$, where $R_i$ is the reward of branch $i$, $\mu$ and $\sigma$ are the mean and standard deviation of the rewards within the group, and $\epsilon$ is a small constant that guards against division by zero) and reference-aligned clipped surrogate losses natively over distributed Ray clusters.
  5. TRACE Masking Sanitizer (agent/trace_masking.py): Restrict deterministic reward assignment to user/assistant tool blocks, and prune bulky terminal stdout noise before trajectories are sequenced into memory buffers.
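
To make items 3 and 4 concrete, here is a minimal, standard-library-only sketch of the group-level math. It is not the proposed GaussGRPOEngine: the function names, the sequence-level log-probabilities, the population standard deviation, the clip range of 0.2, and the "highest reward, then fewest turns" selection rule are all assumptions made for illustration. The reference-policy KL term and the Ray distribution layer are left out.

```python
import math
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Compute A_i = (R_i - mu) / (sigma + eps) over one group of G branch rewards."""
    mu = fmean(rewards)
    # Population std is an assumption; the issue does not say which estimator is used.
    sigma = pstdev(rewards, mu)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, negated so that lower is better.

    Uses one log-probability per trajectory for simplicity (hypothetical);
    a real engine would most likely work per token and add a KL penalty
    against the reference policy.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -fmean(terms)


def select_best_branch(rewards, turns):
    """Best-of-N pick: highest reward, ties broken by fewest turns (assumed rule)."""
    return max(range(len(rewards)), key=lambda i: (rewards[i], -turns[i]))
```

With G=4 branches, `group_relative_advantages` centres the rewards on the group mean, so branches that beat their siblings get positive advantages and the rest get negative ones. When every branch earns the same reward, all advantages are zero and the group contributes no gradient signal.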

Empirical Pilot Validation

A head-to-head benchmark on the TBLite cohort showed a 12.74% overall reduction in turns and 70% API budget savings on complex diagnostic tasks (reaching a solution in 18 turns versus hitting the 60-turn timeout).
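
One way to read the 70% figure, assuming API cost scales roughly linearly with turn count (an assumption; the issue does not state how the budget was measured), is as the fraction of the 60-turn budget left unspent by an 18-turn solution:

$$
1 - \frac{18}{60} = 0.70
$$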

Alternatives Considered

No response

Feature Type

Performance / reliability

Scope

Large (new module or significant refactor)

Contribution

  • I'd like to implement this myself and submit a PR

Labels

enhancement (New feature or request)
