Skip to content

Add a multi-turn coding agentic training example #24

Description

@xsuler

Background

AReno has agentic examples for shopping and tic-tac-toe, but it does not yet show how to train a coding agent in a realistic multi-turn loop. Coding agents are a natural fit for agentic RL because the policy must inspect files, search code, edit patches, run tests, interpret failures, and iterate until the task is solved.

A first-class coding agentic example would give users a concrete pattern for building tool-using training tasks beyond toy domains. It should be designed carefully around the same workflow a practical coding agent uses, while staying small enough to run as an example.


Scope

Add an agentic coding example that trains a model through multi-turn software-engineering tasks.

The example should include:

  • a small task dataset with repo-local coding tasks, expected outcomes, and test commands.
  • a multi-turn agent loop that can call one tool per turn and append tool results back into the conversation.
  • useful coding tools modeled after Codex-style workflows, such as:
    • list files / inspect tree
    • read file snippets
    • search with ripgrep-style queries
    • apply unified patches
    • run a bounded shell command or test command
    • report final answer / completion status
  • a reward function that scores task success from test results and optionally patch quality signals.
  • trajectory construction that returns explicit agentic samples without relying on proxy-side prompt matching.
  • clear safeguards for command execution, timeouts, path allowlists, and output truncation.
  • documentation showing how to run the example with areno train and how to interpret reward/log output.

Design requirements

The example should prefer realistic coding-agent mechanics over a scripted oracle:

  • Tools should expose constrained capabilities, not direct access to ground-truth answers.
  • The agent should be able to recover from failed tests by reading output, editing again, and rerunning.
  • Tool outputs should be compact and deterministic enough for stable training.
  • Dataset tasks should be small, CPU-friendly, and not require network access.
  • The example should avoid destructive filesystem operations and should isolate each task workspace.

Acceptance criteria

  • A new agentic coding example exists under examples/agentic/ or another documented examples location.
  • The agent loop performs multiple model calls for one sample and records a combined trajectory.
  • The tool set includes file inspection, search, patch application, and test execution or equivalent bounded commands.
  • The reward function can identify success/failure without starting GPU/backend-heavy work.
  • CPU tests cover tool behavior, trajectory construction, and at least one successful toy coding task.
  • README/docs mention the example and show the minimal command to run it.
  • Error messages for malformed task specs, unsafe paths, invalid patches, and failing commands are clear.

Activity

  • Design the task schema and workspace isolation model.
  • Implement coding tools with strict path and timeout controls.
  • Implement the multi-turn run_agent loop and trajectory return path.
  • Add a small deterministic dataset of coding tasks.
  • Add reward logic based on test success and final submission.
  • Add CPU tests for tools, loop behavior, and reward outcomes.
  • Document how to run the example and what success looks like.

Metadata

Metadata

Assignees

Labels

area/agenticIssues or PRs related to agentic RL, agent functions, and trajectorieskind/featureCategorizes issue or PR as related to a new feature

Type

No type
No fields configured for issues without a type.

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions