Skip to content

Hert4/trace2skill

Repository files navigation

Trace2Skill

Harness-agnostic framework for evolving agent skills from real trajectories. Reference implementation of Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills (arXiv:2603.25158).

Python License: MIT Paper Tests

Status: pre-alpha, actively developed. Core pipeline and two real harnesses (Claude Code, LangChain) are working end-to-end. Not on PyPI yet — install from source.

What it does

Given an agent with a SKILL.md file and a set of tasks with ground truth, Trace2Skill runs a 3-stage pipeline to improve the skill without touching model weights:

  1. Rollout — run the agent on N tasks in parallel, collect trajectories
  2. Analyze — N parallel analysts propose patches. Error Analyst is an agentic ReAct loop with 6 tools (inspect skill, read ground truth, try patches, diff vs ground truth, finish, drop). Quality-gate drops trajectories where the cause can't be verified.
  3. Consolidate — hierarchical merge that keeps only edits appearing ≥2 times across the patch pool, with 3 deterministic guardrails (file-exists, line-range conflict, trial-apply validation)

Output: a better SKILL.md + resources, portable across harnesses. Every patch is traceable back to its source trajectory — full audit trail.

Why this repo exists

The paper proves the method. This repo makes it plug-and-play across any agent stack. Four plugin axes:

Axis Role Shipped adapters
HarnessAdapter Run one query on an agent harness → Trajectory ClaudeCodeHarnessAdapter, LangChainHarnessAdapter, SimpleReActHarness
LLMProvider Wrap one LLM API for analyst/merger/judge AnthropicLLMProvider, OpenAICompatibleProvider (covers OpenAI / Gemini / OpenRouter / DeepSeek / Groq / Together / xAI)
SkillFormat Read/write skills on disk AnthropicSkillFormat (SKILL.md + resources/)
EvidenceAdapter Collect raw session signals for semi-online mode ClaudeCodeEvidenceAdapter (JSONL sessions), LangChainEvidenceAdapter (LangSmith runs)

API-only. No GPU required. All LLMs go through HTTPS endpoints. vLLM and Ollama are optional community-tier.

Quickstart

Instant Claude Code setup (semi-online)

If you already use the claude CLI and just want the skill to evolve from your real sessions:

pip install trace2skill
trace2skill init-claude-code          # interactive wizard

The wizard writes ~/.trace2skill/<skill>.yaml, stores your API key in ~/.trace2skill/.env (mode 600), merges a SessionEnd hook into ~/.claude/settings.json (preserving any existing hooks), and runs a dry-run validation. After that, use Claude Code normally — the skill evolves in the background once ≥5 sessions accumulate. Rollback with trace2skill rollback --config ~/.trace2skill/<skill>.yaml.

Offline batch (benchmark with ground truth)

git clone https://github.com/Hert4/trace2skill.git
cd trace2skill
pip install -e ".[dev,anthropic,langchain]"

# Set at least one provider key
export ANTHROPIC_API_KEY=sk-ant-...
# or GEMINI_API_KEY / OPENAI_API_KEY / OPENROUTER_API_KEY

# Run the Claude Code example (20 tasks, real claude CLI)
cd examples/02_claude_code_basic
trace2skill evolve --config trace2skill.yaml

You'll need the Claude Code CLI installed + authenticated for example 02. Alternative: wire LangChainHarnessAdapter with any BaseChatModel — see trace2skill/harnesses/langchain.py.

Minimal Python API

import asyncio
from pathlib import Path
from trace2skill.pipeline import Trace2SkillPipeline
from trace2skill.harnesses import LangChainHarnessAdapter
from trace2skill.llm.openai_compatible import OpenAICompatibleProvider
from trace2skill.skill_formats import AnthropicSkillFormat
from trace2skill.core.models import Task, FileGroundTruth
from langchain_anthropic import ChatAnthropic

llm = OpenAICompatibleProvider(
    model="gemini-2.5-pro",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key="...",  # your Gemini key
)
harness = LangChainHarnessAdapter(llm=ChatAnthropic(model="claude-sonnet-4-6"), tools=[...])
pipeline = Trace2SkillPipeline(
    harness=harness,
    llm=llm,
    skill_format=AnthropicSkillFormat(),
    evaluator=MyEvaluator(),
)

tasks = [Task(task_id="t1", query="...", inputs={}, ground_truth=FileGroundTruth(...))]
result = asyncio.run(pipeline.evolve(
    tasks=tasks,
    skill_dir=Path("./skill-v0"),
    workspace_dir=Path("./workspace"),
    analyst_modes={"error"},  # paper's +Error condition
))

AnthropicSkillFormat().save(result.skill, Path("./skill-evolved"))
print(f"{len(result.patches)} patches proposed, {result.patches_dropped} dropped")

Pipeline modes

  • Offline batch (trace2skill evolve) — paper-faithful 3 stages, need tasks + ground-truth + evaluator.
  • Semi-online (trace2skill evolve-online) — consume real user sessions via an EvidenceAdapter, run from filtered trajectories (skip Stage 1). SessionEnd hook example in examples/03_claude_code_semi_online/.
  • Rollback (trace2skill rollback) — atomic swap previous skill version back from timestamped backups.

Architecture in one diagram

┌──────────────────────────────────────────────────────────────┐
│                  trace2skill CORE                            │
│         (harness-agnostic, zero domain logic)                │
│                                                              │
│   Stage 1 Rollout → Stage 2 Analyze → Stage 3 Merge          │
│                    + Signal Layer                            │
└───┬──────────────┬──────────────┬──────────────┬─────────────┘
    │              │              │              │
┌───▼────┐    ┌────▼────┐    ┌────▼────┐    ┌────▼──────┐
│Harness │    │   LLM   │    │  Skill  │    │ Evidence  │
│Adapter │    │Provider │    │ Format  │    │  Adapter  │
└────────┘    └─────────┘    └─────────┘    └───────────┘

Each axis is a Protocol. Core never imports any adapter or provider SDK. Users pick-and-mix: ClaudeCodeHarnessAdapter + AnthropicLLMProvider + AnthropicSkillFormat, or LangChainHarnessAdapter + OpenAICompatibleProvider (Gemini) + AnthropicSkillFormat, or any combo.

Roadmap

See plan.md for the 16-week plan. Current progress: Phases 0/1/1.5/2/3/6 done, Phase 4 infra complete (paper-level delta deferred), Phase 5 partial (LangChain shipped, Cline/OpenCode pending), Phase 7 in progress (this README is part of it).

Development

pip install -e ".[dev]"
python -m pytest tests/unit -q    # 319 tests
ruff check .
pyright --strict trace2skill

Adapter contributions are welcome — see trace2skill/harnesses/langchain.py and trace2skill/evidence_adapters/langchain.py for reference implementations (~200-280 LOC each). Aim for <300 LOC per new adapter.

Citation

If you use Trace2Skill in research, cite the paper it implements:

@article{trace2skill2026,
  title={Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills},
  journal={arXiv preprint arXiv:2603.25158},
  year={2026}
}

License

MIT — see LICENSE. Built as open source from day one.

About

Harness-agnostic framework for evolving agent skills from real trajectories. Implementation of Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills ([arXiv:2603.25158](https://arxiv.org/abs/2603.25158)).

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages