Harness-agnostic framework for evolving agent skills from real trajectories. Reference implementation of Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills (arXiv:2603.25158).
Status: pre-alpha, actively developed. Core pipeline and two real harnesses (Claude Code, LangChain) are working end-to-end. Not on PyPI yet — install from source.
Given an agent with a SKILL.md file and a set of tasks with ground truth, Trace2Skill runs a 3-stage pipeline to improve the skill without touching model weights:
- Rollout — run the agent on N tasks in parallel, collect trajectories
- Analyze — N parallel analysts propose patches. Error Analyst is an agentic ReAct loop with 6 tools (inspect skill, read ground truth, try patches, diff vs ground truth, finish, drop). Quality-gate drops trajectories where the cause can't be verified.
- Consolidate — hierarchical merge that keeps only edits appearing ≥2 times across the patch pool, with 3 deterministic guardrails (file-exists, line-range conflict, trial-apply validation)
Output: a better SKILL.md + resources, portable across harnesses. Every patch is traceable back to its source trajectory — full audit trail.
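The support threshold and the line-range guardrail in the Consolidate stage can be illustrated with a small sketch. This is not the repo's merge code: the `Edit` record, `keep_recurring`, and `ranges_conflict` are hypothetical names, and matching edits by exact equality is a simplifying assumption (the real merge is hierarchical and may group near-duplicates differently).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical edit record; the real patch schema may differ."""
    file: str
    start_line: int
    end_line: int
    new_text: str


def keep_recurring(patches: list[list[Edit]], min_support: int = 2) -> list[Edit]:
    """Keep edits proposed by at least `min_support` distinct patches."""
    counts: Counter[Edit] = Counter()
    for patch in patches:
        # Count each edit once per patch so one patch cannot vote twice.
        counts.update(set(patch))
    return [edit for edit, n in counts.items() if n >= min_support]


def ranges_conflict(a: Edit, b: Edit) -> bool:
    """True when two different edits touch overlapping lines of one file."""
    return (
        a != b
        and a.file == b.file
        and a.start_line <= b.end_line
        and b.start_line <= a.end_line
    )
```

With three patches where one edit appears in all of them and another in only one, `keep_recurring` returns the first and drops the second; `ranges_conflict` can then flag surviving edits that overlap before a trial apply.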
The paper proves the method. This repo makes it plug-and-play across any agent stack. Four plugin axes:
| Axis | Role | Shipped adapters |
|---|---|---|
| HarnessAdapter | Run one query on an agent harness → Trajectory | ClaudeCodeHarnessAdapter, LangChainHarnessAdapter, SimpleReActHarness |
| LLMProvider | Wrap one LLM API for analyst/merger/judge | AnthropicLLMProvider, OpenAICompatibleProvider (covers OpenAI / Gemini / OpenRouter / DeepSeek / Groq / Together / xAI) |
| SkillFormat | Read/write skills on disk | AnthropicSkillFormat (SKILL.md + resources/) |
| EvidenceAdapter | Collect raw session signals for semi-online mode | ClaudeCodeEvidenceAdapter (JSONL sessions), LangChainEvidenceAdapter (LangSmith runs) |
API-only. No GPU required. All LLMs go through HTTPS endpoints. vLLM and Ollama are optional community-tier.
If you already use the claude CLI and just want the skill to evolve from your real sessions:
pip install trace2skill
trace2skill init-claude-code # interactive wizard

The wizard writes ~/.trace2skill/<skill>.yaml, stores your API key in ~/.trace2skill/.env (mode 600), merges a SessionEnd hook into ~/.claude/settings.json (preserving any existing hooks), and runs a dry-run validation. After that, use Claude Code normally; the skill evolves in the background once ≥5 sessions accumulate. To roll back, run trace2skill rollback --config ~/.trace2skill/<skill>.yaml.
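To make "merges a SessionEnd hook, preserving any existing hooks" concrete, here is a minimal sketch of a non-destructive merge into a JSON settings file. The function name `merge_session_end_hook` is hypothetical and this is not the wizard's actual code. The nested `hooks` → `SessionEnd` → `command` layout is an assumption about the Claude Code settings format and should be checked against its documentation; the command string is a placeholder.

```python
import json
from pathlib import Path


def merge_session_end_hook(settings_path: Path, command: str) -> dict:
    """Add a SessionEnd command hook without disturbing existing entries."""
    settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
    # setdefault keeps whatever is already under "hooks" and "SessionEnd".
    entries = settings.setdefault("hooks", {}).setdefault("SessionEnd", [])

    already_present = any(
        hook.get("command") == command
        for entry in entries
        for hook in entry.get("hooks", [])
    )
    if not already_present:
        # Assumed layout: each entry holds a list of command hooks.
        entries.append({"hooks": [{"type": "command", "command": command}]})
        settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings
```

Running the function twice with the same command leaves a single entry, so re-running a setup step does not stack duplicate hooks.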
git clone https://github.com/Hert4/trace2skill.git
cd trace2skill
pip install -e ".[dev,anthropic,langchain]"
# Set at least one provider key
export ANTHROPIC_API_KEY=sk-ant-...
# or GEMINI_API_KEY / OPENAI_API_KEY / OPENROUTER_API_KEY
# Run the Claude Code example (20 tasks, real claude CLI)
cd examples/02_claude_code_basic
trace2skill evolve --config trace2skill.yaml

Example 02 requires the Claude Code CLI, installed and authenticated. Alternatively, wire LangChainHarnessAdapter with any BaseChatModel; see trace2skill/harnesses/langchain.py.
import asyncio
from pathlib import Path

from langchain_anthropic import ChatAnthropic

from trace2skill.core.models import FileGroundTruth, Task
from trace2skill.harnesses import LangChainHarnessAdapter
from trace2skill.llm.openai_compatible import OpenAICompatibleProvider
from trace2skill.pipeline import Trace2SkillPipeline
from trace2skill.skill_formats import AnthropicSkillFormat

# LLM used by the analyst/merger/judge roles (Gemini via its OpenAI-compatible endpoint).
llm = OpenAICompatibleProvider(
    model="gemini-2.5-pro",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key="...",  # your Gemini key
)

# The agent under evolution; fill in your own tools.
harness = LangChainHarnessAdapter(llm=ChatAnthropic(model="claude-sonnet-4-6"), tools=[...])

pipeline = Trace2SkillPipeline(
    harness=harness,
    llm=llm,
    skill_format=AnthropicSkillFormat(),
    evaluator=MyEvaluator(),  # MyEvaluator is your own evaluator class; it is not defined in this snippet
)

tasks = [Task(task_id="t1", query="...", inputs={}, ground_truth=FileGroundTruth(...))]

result = asyncio.run(pipeline.evolve(
    tasks=tasks,
    skill_dir=Path("./skill-v0"),
    workspace_dir=Path("./workspace"),
    analyst_modes={"error"},  # paper's +Error condition
))

AnthropicSkillFormat().save(result.skill, Path("./skill-evolved"))
print(f"{len(result.patches)} patches proposed, {result.patches_dropped} dropped")

- Offline batch (trace2skill evolve): paper-faithful 3 stages; needs tasks, ground truth, and an evaluator.
- Semi-online (trace2skill evolve-online): consumes real user sessions via an EvidenceAdapter and runs from filtered trajectories (skips Stage 1). SessionEnd hook example in examples/03_claude_code_semi_online/.
- Rollback (trace2skill rollback): atomically swaps the previous skill version back from timestamped backups.
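The rollback mode can be pictured as a rename-based directory swap. The sketch below is an illustration under stated assumptions, not the repo's implementation: `rollback_latest`, the `.staging` / `.retired` suffixes, and a backup layout of one folder per sortable timestamp are all hypothetical. It uses two renames on the same filesystem, so the window in which the live directory is missing is very short but not strictly zero.

```python
import shutil
from pathlib import Path


def rollback_latest(skill_dir: Path, backups_dir: Path) -> Path:
    """Restore the newest backup over skill_dir using renames on one filesystem."""
    # Assumes backup folder names sort chronologically, e.g. 20260101T120000.
    backups = sorted(p for p in backups_dir.iterdir() if p.is_dir())
    if not backups:
        raise FileNotFoundError(f"no backups found in {backups_dir}")
    latest = backups[-1]

    staging = skill_dir.with_name(skill_dir.name + ".staging")
    retired = skill_dir.with_name(skill_dir.name + ".retired")
    for leftover in (staging, retired):
        shutil.rmtree(leftover, ignore_errors=True)

    # Do the slow copy first, so the live directory is only touched by two renames.
    shutil.copytree(latest, staging)
    skill_dir.rename(retired)
    staging.rename(skill_dir)
    shutil.rmtree(retired)
    return latest
```

Copying into a staging directory before swapping means a failed copy leaves the live skill untouched.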
┌──────────────────────────────────────────────────────────────┐
│ trace2skill CORE │
│ (harness-agnostic, zero domain logic) │
│ │
│ Stage 1 Rollout → Stage 2 Analyze → Stage 3 Merge │
│ + Signal Layer │
└───┬──────────────┬──────────────┬──────────────┬─────────────┘
│ │ │ │
┌───▼────┐ ┌────▼────┐ ┌────▼────┐ ┌────▼──────┐
│Harness │ │ LLM │ │ Skill │ │ Evidence │
│Adapter │ │Provider │ │ Format │ │ Adapter │
└────────┘ └─────────┘ └─────────┘ └───────────┘
Each axis is a Protocol. Core never imports any adapter or provider SDK. Users pick-and-mix: ClaudeCodeHarnessAdapter + AnthropicLLMProvider + AnthropicSkillFormat, or LangChainHarnessAdapter + OpenAICompatibleProvider (Gemini) + AnthropicSkillFormat, or any combo.
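As a rough illustration of what "each axis is a Protocol" buys, the sketch below defines a structural interface with `typing.Protocol` and a toy adapter that satisfies it without inheriting from it. The method name `run`, its signature, and the `Trajectory` fields are assumptions made for the example; the actual definitions live in the repo and may differ.

```python
import asyncio
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class Trajectory:
    """Hypothetical stand-in for the trajectory type."""
    task_id: str
    steps: list[str] = field(default_factory=list)


@runtime_checkable
class HarnessAdapter(Protocol):
    """Structural interface: anything with a matching `run` method qualifies."""

    async def run(self, query: str, skill_dir: Path) -> Trajectory: ...


class EchoHarness:
    """Toy adapter; it does not inherit from HarnessAdapter."""

    async def run(self, query: str, skill_dir: Path) -> Trajectory:
        return Trajectory(task_id="demo", steps=[f"echo: {query}"])


async def rollout(
    harness: HarnessAdapter, queries: list[str], skill_dir: Path
) -> list[Trajectory]:
    # Core code depends only on the Protocol, never on a concrete adapter.
    return await asyncio.gather(*(harness.run(q, skill_dir) for q in queries))
```

Because the check is structural, an adapter can live in a separate package with its own SDK dependencies, and the core never has to import it.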
See plan.md for the 16-week plan. Current progress: Phases 0/1/1.5/2/3/6 done, Phase 4 infra complete (paper-level delta deferred), Phase 5 partial (LangChain shipped, Cline/OpenCode pending), Phase 7 in progress (this README is part of it).
pip install -e ".[dev]"
python -m pytest tests/unit -q # 319 tests
ruff check .
pyright --strict trace2skill

Adapter contributions are welcome; see trace2skill/harnesses/langchain.py and trace2skill/evidence_adapters/langchain.py for reference implementations (~200-280 LOC each). Aim for <300 LOC per new adapter.
If you use Trace2Skill in research, cite the paper it implements:
@article{trace2skill2026,
title={Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills},
journal={arXiv preprint arXiv:2603.25158},
year={2026}
}

MIT. See LICENSE. Built as open source from day one.