Trace2Skill

Harness-agnostic framework for evolving agent skills from real trajectories. Reference implementation of Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills (arXiv:2603.25158).

Status: pre-alpha, actively developed. Core pipeline and two real harnesses (Claude Code, LangChain) are working end-to-end. Not on PyPI yet — install from source.

What it does

Given an agent with a SKILL.md file and a set of tasks with ground truth, Trace2Skill runs a 3-stage pipeline to improve the skill without touching model weights:

Rollout — run the agent on N tasks in parallel, collect trajectories
Analyze — N parallel analysts propose patches. Error Analyst is an agentic ReAct loop with 6 tools (inspect skill, read ground truth, try patches, diff vs ground truth, finish, drop). Quality-gate drops trajectories where the cause can't be verified.
Consolidate — hierarchical merge that keeps only edits appearing ≥2 times across the patch pool, with 3 deterministic guardrails (file-exists, line-range conflict, trial-apply validation)

Output: a better SKILL.md + resources, portable across harnesses. Every patch is traceable back to its source trajectory — full audit trail.

Why this repo exists

The paper proves the method. This repo makes it plug-and-play across any agent stack. Four plugin axes:

Axis	Role	Shipped adapters
`HarnessAdapter`	Run one query on an agent harness → `Trajectory`	`ClaudeCodeHarnessAdapter`, `LangChainHarnessAdapter`, `SimpleReActHarness`
`LLMProvider`	Wrap one LLM API for analyst/merger/judge	`AnthropicLLMProvider`, `OpenAICompatibleProvider` (covers OpenAI / Gemini / OpenRouter / DeepSeek / Groq / Together / xAI)
`SkillFormat`	Read/write skills on disk	`AnthropicSkillFormat` (SKILL.md + resources/)
`EvidenceAdapter`	Collect raw session signals for semi-online mode	`ClaudeCodeEvidenceAdapter` (JSONL sessions), `LangChainEvidenceAdapter` (LangSmith runs)

API-only. No GPU required. All LLMs go through HTTPS endpoints. vLLM and Ollama are optional community-tier.

Quickstart

Instant Claude Code setup (semi-online)

If you already use the claude CLI and just want the skill to evolve from your real sessions:

pip install trace2skill
trace2skill init-claude-code          # interactive wizard

The wizard writes ~/.trace2skill/<skill>.yaml, stores your API key in ~/.trace2skill/.env (mode 600), merges a SessionEnd hook into ~/.claude/settings.json (preserving any existing hooks), and runs a dry-run validation. After that, use Claude Code normally — the skill evolves in the background once ≥5 sessions accumulate. Rollback with trace2skill rollback --config ~/.trace2skill/<skill>.yaml.

Offline batch (benchmark with ground truth)

git clone https://github.com/Hert4/trace2skill.git
cd trace2skill
pip install -e ".[dev,anthropic,langchain]"

# Set at least one provider key
export ANTHROPIC_API_KEY=sk-ant-...
# or GEMINI_API_KEY / OPENAI_API_KEY / OPENROUTER_API_KEY

# Run the Claude Code example (20 tasks, real claude CLI)
cd examples/02_claude_code_basic
trace2skill evolve --config trace2skill.yaml

You'll need the Claude Code CLI installed + authenticated for example 02. Alternative: wire LangChainHarnessAdapter with any BaseChatModel — see trace2skill/harnesses/langchain.py.

Minimal Python API

import asyncio
from pathlib import Path
from trace2skill.pipeline import Trace2SkillPipeline
from trace2skill.harnesses import LangChainHarnessAdapter
from trace2skill.llm.openai_compatible import OpenAICompatibleProvider
from trace2skill.skill_formats import AnthropicSkillFormat
from trace2skill.core.models import Task, FileGroundTruth
from langchain_anthropic import ChatAnthropic

llm = OpenAICompatibleProvider(
    model="gemini-2.5-pro",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key="...",  # your Gemini key
)
harness = LangChainHarnessAdapter(llm=ChatAnthropic(model="claude-sonnet-4-6"), tools=[...])
pipeline = Trace2SkillPipeline(
    harness=harness,
    llm=llm,
    skill_format=AnthropicSkillFormat(),
    evaluator=MyEvaluator(),
)

tasks = [Task(task_id="t1", query="...", inputs={}, ground_truth=FileGroundTruth(...))]
result = asyncio.run(pipeline.evolve(
    tasks=tasks,
    skill_dir=Path("./skill-v0"),
    workspace_dir=Path("./workspace"),
    analyst_modes={"error"},  # paper's +Error condition
))

AnthropicSkillFormat().save(result.skill, Path("./skill-evolved"))
print(f"{len(result.patches)} patches proposed, {result.patches_dropped} dropped")

Pipeline modes

Offline batch (trace2skill evolve) — paper-faithful 3 stages, need tasks + ground-truth + evaluator.
Semi-online (trace2skill evolve-online) — consume real user sessions via an EvidenceAdapter, run from filtered trajectories (skip Stage 1). SessionEnd hook example in examples/03_claude_code_semi_online/.
Rollback (trace2skill rollback) — atomic swap previous skill version back from timestamped backups.

Architecture in one diagram

┌──────────────────────────────────────────────────────────────┐
│                  trace2skill CORE                            │
│         (harness-agnostic, zero domain logic)                │
│                                                              │
│   Stage 1 Rollout → Stage 2 Analyze → Stage 3 Merge          │
│                    + Signal Layer                            │
└───┬──────────────┬──────────────┬──────────────┬─────────────┘
    │              │              │              │
┌───▼────┐    ┌────▼────┐    ┌────▼────┐    ┌────▼──────┐
│Harness │    │   LLM   │    │  Skill  │    │ Evidence  │
│Adapter │    │Provider │    │ Format  │    │  Adapter  │
└────────┘    └─────────┘    └─────────┘    └───────────┘

Each axis is a Protocol. Core never imports any adapter or provider SDK. Users pick-and-mix: ClaudeCodeHarnessAdapter + AnthropicLLMProvider + AnthropicSkillFormat, or LangChainHarnessAdapter + OpenAICompatibleProvider (Gemini) + AnthropicSkillFormat, or any combo.

Roadmap

See plan.md for the 16-week plan. Current progress: Phases 0/1/1.5/2/3/6 done, Phase 4 infra complete (paper-level delta deferred), Phase 5 partial (LangChain shipped, Cline/OpenCode pending), Phase 7 in progress (this README is part of it).

Development

pip install -e ".[dev]"
python -m pytest tests/unit -q    # 319 tests
ruff check .
pyright --strict trace2skill

Adapter contributions are welcome — see trace2skill/harnesses/langchain.py and trace2skill/evidence_adapters/langchain.py for reference implementations (~200-280 LOC each). Aim for <300 LOC per new adapter.

Citation

If you use Trace2Skill in research, cite the paper it implements:

@article{trace2skill2026,
  title={Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills},
  journal={arXiv preprint arXiv:2603.25158},
  year={2026}
}

License

MIT — see LICENSE. Built as open source from day one.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.github		.github
benchmarks/paper_replication		benchmarks/paper_replication
docs		docs
examples		examples
rubrics		rubrics
tests		tests
trace2skill		trace2skill
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
mkdocs.yml		mkdocs.yml
plan.md		plan.md
pyproject.toml		pyproject.toml
trace2skill.md		trace2skill.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Trace2Skill

What it does

Why this repo exists

Quickstart

Instant Claude Code setup (semi-online)

Offline batch (benchmark with ground truth)

Minimal Python API

Pipeline modes

Architecture in one diagram

Roadmap

Development

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Trace2Skill

What it does

Why this repo exists

Quickstart

Instant Claude Code setup (semi-online)

Offline batch (benchmark with ground truth)

Minimal Python API

Pipeline modes

Architecture in one diagram

Roadmap

Development

Citation

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages