Evolutionary self-improvement for agent skills.
Agent Self-Evolution evolves and optimizes agent skills, tool descriptions, system prompts, and code — producing measurably better versions through reflective evolutionary search. Built on DSPy + GEPA (Genetic-Pareto Prompt Evolution), with extra safeguards on top so what ships is reliably better than the original.
No GPU training required. Everything operates via API calls — mutating text, evaluating results, and selecting the best variants. ~$2-10 per optimization run.
Works on any agent framework that emits SKILL.md markdown files. Hermes Agent skills are the original target; Claude Code skills (and any other agent's `<dir>/<skill>/SKILL.md` layout) are also supported via a pluggable skill-source abstraction.
Already running Hermes Agent? No env vars to set. If `~/.hermes/config.yaml` exists, `uv run python -m evolution.skills.evolve_skill --skill <name>` picks up your provider, model, and credentials automatically. On startup the framework runs a tiny ~$0.0001 credential probe; if anything's stale you get a Rich-formatted error panel with the exact recovery command (e.g. `hermes auth add anthropic`) instead of a Python traceback. Jump to Run with Hermes Agent, or read docs/model_resolution.md for the full provider mapping.
```mermaid
flowchart LR
    A[Read current<br/>skill/prompt/tool] --> B[Generate<br/>eval dataset]
    B --> C[GEPA<br/>Optimizer]
    C --> D[Candidate<br/>variants]
    D --> E[Evaluate]
    E -. Execution traces .-> C
    E --> F["Constraint gates<br/>(tests, size limits,<br/>benchmarks)"]
    F --> G[Best<br/>variant]
    G --> H[PR against<br/>source repo]
```
GEPA reads execution traces to understand why things fail (not just that they failed), then proposes targeted improvements. ICLR 2026 Oral, MIT licensed.
GEPA was designed against benchmarks with hundreds of validation examples per task. Skill evolution typically has 20-60 examples, which is small enough that picking the highest-scoring candidate often picks one that won by chance — there's a real risk of shipping a "winner" that just got lucky on the eval set.
This framework adds three checks on top of GEPA so the candidate that ships is one that genuinely improved the skill:
- Knee-point selection — instead of strictly the highest-scoring candidate, looks at every candidate close to the top score and prefers shorter ones. Filters out wins that came from a single lucky example.
- Held-out deploy check — before a candidate ships, it's compared against the baseline on examples it never saw during optimization. Several rules available, including a lenient one that's appropriate for compression-style refactors.
- Three-dimensional scoring — instead of pass/fail, the LLM judge rates each output on correctness, whether it followed the right procedure, and how concise it is. GEPA's reflection step uses these as feedback to guide the next mutation.
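The knee-point idea can be sketched in a few lines (an illustrative toy, not the framework's actual selection code): among all candidates within a small tolerance of the top score, take the shortest.

```python
# Illustrative knee-point selection: near-top candidates compete on length,
# which filters out long variants that won by a single lucky example.
def knee_point_select(candidates, tolerance=0.02):
    """candidates: list of (score, text) tuples."""
    best_score = max(score for score, _ in candidates)
    near_top = [(s, t) for s, t in candidates if s >= best_score - tolerance]
    return min(near_top, key=lambda st: len(st[1]))

winner = knee_point_select([
    (0.91, "long variant " * 40),   # nominal top scorer, but verbose
    (0.90, "short variant"),        # within tolerance and much shorter
    (0.72, "baseline"),
])
# winner is the 0.90 short variant.
```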
If you have hundreds of validation examples and a programmatic correctness metric (exact match, unit-test pass), raw GEPA is the right tool. The framework's extra layers earn their keep when validation is small and the metric is LLM-judged. See docs/framework_advantages.md for the deeper argument.
# Install
```bash
git clone https://github.com/jramos/agent-self-evolution.git
cd agent-self-evolution
uv sync
```

```bash
uv run python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --iterations 10
```

Whatever model + provider Hermes is using (Anthropic, OpenRouter, Nous Portal, OpenAI Codex Responses, AWS Bedrock, a local vLLM/Ollama/LM Studio, etc.) becomes the default for the optimizer, reflection, eval, and judge LMs. On Hermes setups with a single model, all four roles collapse onto it. OAuth-based setups (e.g. Nous Portal) refresh credentials via `hermes model`; API-key setups read from `~/.hermes/config.yaml`'s inline `api_key` or `~/.hermes/auth.json`'s credential pool.
For multi-model providers, override per role:
```bash
uv run python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --optimizer-model anthropic/claude-opus-4-5 \
  --reflection-model anthropic/claude-opus-4-5 \
  --eval-model anthropic/claude-haiku-4-5
```

For closed-loop validation — run the actual Hermes binary against fixture tasks and feed its scores back into GEPA — point at your Hermes checkout:
```bash
export SKILL_SOURCES_HERMES_REPO=~/.hermes/hermes-agent
uv run python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --closed-loop-during-evolution evolution/validation/suites/your_suite.jsonl \
  --closed-loop-hermes-repo ~/.hermes/hermes-agent
```

The closed-loop validator invokes `hermes -z` directly, so it uses the same provider config Hermes itself uses. Optimization and validation see the same model.
Set any standard provider env var and run — the framework falls back to env-var auto-detection in priority order (`ANTHROPIC_API_KEY` → `OPENROUTER_API_KEY` → `OPENAI_API_KEY` → others). When neither Hermes nor an env var is configured, the framework exits with an actionable message listing what was tried.
```bash
export ANTHROPIC_API_KEY=sk-ant-...
uv run python -m evolution.skills.evolve_skill \
  --skill writing-skills \
  --iterations 10
```

See docs/model_resolution.md for the full provider mapping, local-server (vLLM/Ollama/LM Studio) examples, and per-role override patterns.
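The fallback described above amounts to a priority-ordered scan of env vars. A minimal sketch (function and table names are illustrative; only the three named keys come from the text, the "others" are omitted):

```python
import os

# Provider env vars in the documented detection priority order.
PROVIDER_ENV_VARS = [
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("openrouter", "OPENROUTER_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]

def detect_provider(env=os.environ):
    tried = []
    for provider, var in PROVIDER_ENV_VARS:
        if env.get(var):
            return provider
        tried.append(var)
    # Mirrors the documented behavior: exit with a message listing what was tried.
    raise SystemExit(f"No provider configured; tried: {', '.join(tried)}")
```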
Skills are resolved by walking a list of SkillSource adapters in priority order:

- `--skill-source-dir PATH` (repeatable) — generic `<dir>/<name>/SKILL.md` layout. Use for Codex, openclaw, or any custom framework.
- Hermes Agent — set `SKILL_SOURCES_HERMES_REPO=/path/to/hermes-agent` (or have `~/.hermes/hermes-agent` exist). Layout: `<root>/skills/<category>/<name>/SKILL.md`.
- Claude Code — auto-discovered if `~/.claude/plugins/cache/` exists. No env var needed. Layout: `<vendor>/<plugin>/<version>/skills/<name>/SKILL.md`.

Sources whose roots don't exist on disk are skipped automatically.
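As a rough sketch, the adapter walk might look like this (class and function names are illustrative; only the priority order and the skip-missing-roots rule come from the text):

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass
class SkillSource:
    root: Path
    pattern: str  # glob relative to root, with a {name} placeholder

    def find(self, name):
        if not self.root.exists():          # missing roots are skipped
            return None
        matches = sorted(self.root.glob(self.pattern.format(name=name)))
        return matches[0] if matches else None

def resolve_skill(name, sources):
    for source in sources:                  # walked in priority order
        path = source.find(name)
        if path is not None:
            return path
    raise FileNotFoundError(name)

# Demo: a Hermes-style layout under a temp dir, plus a higher-priority
# source whose root doesn't exist (it is skipped, not an error).
root = Path(tempfile.mkdtemp())
skill_md = root / "skills" / "dev" / "my-skill" / "SKILL.md"
skill_md.parent.mkdir(parents=True)
skill_md.write_text("...")
found = resolve_skill("my-skill", [
    SkillSource(Path("/nonexistent-skill-root"), "{name}/SKILL.md"),
    SkillSource(root, "skills/*/{name}/SKILL.md"),
])
```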
```bash
export SKILL_SOURCES_HERMES_REPO=~/.hermes/hermes-agent
uv run python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --iterations 10 \
  --eval-source synthetic
```

The model defaults to whatever Hermes is configured for. See "Run with Hermes Agent" above.
```bash
# No env var needed if you have Claude Code installed
uv run python -m evolution.skills.evolve_skill \
  --skill writing-skills \
  --iterations 10 \
  --eval-source synthetic
```

```bash
uv run python -m evolution.skills.evolve_skill \
  --skill my-skill \
  --skill-source-dir ~/path/to/my-skills \
  --iterations 10 \
  --eval-source synthetic
```

For agents using MCP, Anthropic tool-use, OpenAI function calling, or any custom registry that can be exported to MCP's `list_tools()` JSON shape:
```bash
uv run python -m evolution.tools.evolve_tool \
  --tool search_files \
  --manifest /path/to/your/mcp-tools.json \
  --iterations 5
```

Reads the static MCP-shape manifest, evolves one tool's top-level `description` field via GEPA, and writes the result to `output/tools/<tool>/<timestamp>/`. `--apply` rewrites the source manifest in place (every non-target tool's `description`, `inputSchema`, and any `_evolution_metadata` block are preserved verbatim); `--patch` emits a unified diff to stdout instead.
At evaluation time the agent sees the full rendered manifest, so cross-tool regressions (the evolved description "stealing" selections from a confusable neighbor) surface naturally through the deploy gate.
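For reference, a minimal manifest in MCP's `list_tools()` shape (the tool content here is illustrative; only the target tool's top-level `description` is evolved):

```json
{
  "tools": [
    {
      "name": "search_files",
      "description": "Search file contents by regex across the workspace.",
      "inputSchema": {
        "type": "object",
        "properties": {
          "pattern": { "type": "string" },
          "path": { "type": "string" }
        },
        "required": ["pattern"]
      }
    }
  ]
}
```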
For agents whose tools are defined as Python `*_SCHEMA` dicts (Hermes Agent's pattern), point `--manifest` at the tools directory:

```bash
uv run python -m evolution.tools.evolve_tool \
  --tool read_file \
  --manifest /path/to/hermes-agent/tools \
  --fitness-profile balanced --iterations 5
```

The framework parses every `*_SCHEMA = {...}` and `*_SCHEMAS = [...]` declaration via AST, handles literal-string descriptions and one-hop Name references (constants like `TERMINAL_TOOL_DESCRIPTION`), and refuses to apply changes to f-string-built descriptions (rewrite the tool to a literal description first). Tools that can't be parsed statically (e.g., schemas built from function calls) appear under `dataset.dropped_tools` in `gate_decision.json` so you see what's excluded.
With `--apply`, the evolved description is spliced into the source file's bytes at the original position — comments, formatting, and unrelated tools are untouched. Multi-line parenthesized concatenations collapse to a single triple-quoted string at the same indent.
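The static scan can be pictured with Python's `ast` module. This simplified sketch handles only literal-string descriptions in single-target `*_SCHEMA` dict assignments; one-hop Name references and `*_SCHEMAS` lists are out of scope here:

```python
import ast

def extract_descriptions(source: str):
    """Map each top-level *_SCHEMA assignment to its literal description."""
    out = {}
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Assign) and len(node.targets) == 1):
            continue
        target = node.targets[0]
        if not (isinstance(target, ast.Name) and target.id.endswith("_SCHEMA")):
            continue
        if isinstance(node.value, ast.Dict):
            for key, val in zip(node.value.keys, node.value.values):
                # Keep only "description": "<literal string>" entries.
                if (isinstance(key, ast.Constant) and key.value == "description"
                        and isinstance(val, ast.Constant)
                        and isinstance(val.value, str)):
                    out[target.id] = val.value
    return out

src = 'READ_FILE_SCHEMA = {"name": "read_file", "description": "Read a file."}'
# extract_descriptions(src) → {"READ_FILE_SCHEMA": "Read a file."}
```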
For skill evolution:
```bash
uv run python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --iterations 10 \
  --eval-source sessiondb
```

Pulls real usage from Claude Code (`~/.claude/history.jsonl`), Copilot, and Hermes session logs.
For tool description evolution:
```bash
uv run python -m evolution.tools.evolve_tool \
  --tool search_files \
  --manifest /path/to/mcp-tools.json \
  --eval-source sessiondb
```

Mines Hermes session JSON (`~/.hermes/sessions/`) for (user_task, invoked_tool) pairs, then re-judges each pair against the current manifest. Misselections — where the judge picks a different tool than the agent did with high confidence — become flipped-label training examples that exercise exactly the failure mode the evolution is trying to fix. Add `--dry-run` to confirm session discovery before spending judge + GEPA budget.
Only Hermes is mined for tool data — Claude Code and Copilot session logs don't carry tool_use blocks. The eval is biased toward whatever task distribution lives in your session history, so it may underrepresent the confusable-neighbor cases the synthetic eval targets directly. Run synthetic first if you need that coverage and don't have substantial Hermes history.
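Conceptually, the mining step reduces to a filter over (task, tool) pairs. A toy sketch with assumed field names (the real pipeline's schema may differ):

```python
def mine_misselections(pairs, judge, confidence_threshold=0.8):
    """pairs: (user_task, invoked_tool) mined from sessions.
    judge(task) returns (chosen_tool, confidence).
    A confident disagreement becomes a training example labeled with the
    tool the agent actually used."""
    examples = []
    for user_task, invoked_tool in pairs:
        judged_tool, confidence = judge(user_task)
        if judged_tool != invoked_tool and confidence >= confidence_threshold:
            examples.append({"task": user_task, "label": invoked_tool,
                             "judge_picked": judged_tool})
    return examples

# With a toy judge that always picks "grep_tool" at confidence 0.9, a session
# where the agent invoked "search_files" becomes a flipped-label example.
toy = mine_misselections([("find TODOs", "search_files")],
                         judge=lambda t: ("grep_tool", 0.9))
```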
The LLM-as-judge scores agent outputs on three dimensions (correctness, procedure-following, conciseness). `--fitness-profile` selects how those dimensions are weighted in the composite:
```bash
uv run python -m evolution.skills.evolve_skill --skill X --fitness-profile <profile>
```

| Profile | Correctness | Procedure | Conciseness | Use when |
|---|---|---|---|---|
| `balanced` (default) | 0.5 | 0.3 | 0.2 | General-purpose evolution. Uses balanced-mode proposer (handles both directions without bias). |
| `compression` | 0.4 | 0.2 | 0.4 | Explicitly shrinking an over-long skill. Uses compression-mode proposer. |
| `growth` | 0.6 | 0.4 | 0.0 | The baseline is missing capabilities and needs to add them. Uses growth-mode proposer. |
The chosen profile is recorded in `gate_decision.json` so any deployed variant can be traced back to the weighting that produced it.

Each profile also selects a reflection-prompt proposer template. `compression` tells the LM to cut redundancy under a tight char budget; `growth` tells it to add only what the failure feedback explicitly identifies as missing; `balanced` (the default) is direction-agnostic — it asks the LM to fix the failures without prescribing cuts or additions, and uses a soft "stay near N characters, ±20%" budget. All three share the same anti-hallucination guardrails: every change must ground in a specific feedback phrase, and empty feedback returns the instruction unchanged.
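The composite implied by the profile table is a plain weighted sum over the three judge dimensions. A sketch (function name illustrative; weights taken from the table above):

```python
# Per-profile weights for the three judge dimensions.
PROFILES = {
    "balanced":    {"correctness": 0.5, "procedure": 0.3, "conciseness": 0.2},
    "compression": {"correctness": 0.4, "procedure": 0.2, "conciseness": 0.4},
    "growth":      {"correctness": 0.6, "procedure": 0.4, "conciseness": 0.0},
}

def composite_score(scores, profile="balanced"):
    """scores: dict of dimension -> value in [0, 1]."""
    weights = PROFILES[profile]
    return sum(weights[dim] * scores[dim] for dim in weights)

s = {"correctness": 1.0, "procedure": 0.5, "conciseness": 0.0}
# balanced: 0.5*1.0 + 0.3*0.5 + 0.2*0.0 = 0.65
# growth:   0.6*1.0 + 0.4*0.5 + 0.0*0.0 = 0.80
```

Note how `growth` zeroes out conciseness entirely, so adding material is never penalized for length.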
By default, the evolved skill lands in `output/<skill>/<timestamp>/evolved_skill.md` and stops there. Two opt-in flags automate the next step:
```bash
# Copy evolved_skill.md over the source SKILL.md in place on a deploy decision.
# No git operations; the user's workflow stays in their hands.
uv run python -m evolution.skills.evolve_skill --skill X --apply

# Emit a unified diff to stdout instead — pipe to patch, git apply, or a review tool.
uv run python -m evolution.skills.evolve_skill --skill X --patch | git apply
```

Both flags are no-ops on a reject decision (with a stderr notice). `--apply` also skips with a warning when the source path is under Claude Code's plugin cache (read-only by design).
`--max-total-cost-usd FLOAT` aborts the run cleanly when cumulative LM cost exceeds the ceiling. Useful when an accidentally-cranked `--iterations` could push a run past your expected budget. Worst-case overshoot is one LM call past the ceiling — the cost callback fires after each call returns, and the next call aborts at start.
```bash
uv run python -m evolution.skills.evolve_skill --skill X --max-total-cost-usd 5.00
```

On abort, `output/<artifact>/<ts>/gate_decision.json` carries `decision="aborted"`, `reason="cost_ceiling_exceeded"`, and the full `cost_summary` block so you see what was actually spent.
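The overshoot semantics can be sketched as a guard object (illustrative, not the framework's actual callback API): cost is recorded after each call returns, and the ceiling check runs before the next call starts, so at most one call can land past the ceiling.

```python
class CostCeiling:
    def __init__(self, max_usd):
        self.max_usd = max_usd
        self.spent = 0.0

    def before_call(self):
        # Abort before starting the next call if the ceiling is breached.
        if self.spent > self.max_usd:
            raise RuntimeError("cost_ceiling_exceeded")

    def after_call(self, cost_usd):
        # Cost is only known once the call returns.
        self.spent += cost_usd

guard = CostCeiling(max_usd=0.10)
for cost in [0.06, 0.06]:       # second call pushes spend past the ceiling
    guard.before_call()
    guard.after_call(cost)
# The next before_call() raises; worst-case overshoot was one call.
```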
`--benchmark-cmd "<shell command>"` runs your command as a deploy gate after the framework's own gate passes. Nonzero exit flips the decision to reject with `reason="benchmark_failed"`. The command receives the evolved + baseline artifact paths via env vars so it can run a pytest line, a custom benchmark, or any shell pipeline:
```bash
uv run python -m evolution.tools.evolve_tool --tool X --manifest Y \
  --benchmark-cmd 'pytest -k smoke && custom_check.sh "$EVOLVED_PATH"'
```

Env vars: `EVOLVED_PATH`, `BASELINE_PATH`, `RUN_DIR`, `TARGET_NAME`, `ARTIFACT_TYPE`. The hook runs under `/bin/sh -c` — interactive aliases are not available; invoke binaries by full name. Trust boundary: the command string is yours; do not pass strings you didn't write yourself.
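A `--benchmark-cmd` hook can be any executable. Here is a minimal Python sketch that reads two of the documented env vars and fails on an illustrative size regression; the check itself is a placeholder for a real benchmark:

```python
#!/usr/bin/env python3
"""Minimal --benchmark-cmd hook sketch using the documented env vars.
The size check is illustrative; substitute any real benchmark."""
import os
import sys
import tempfile

def check(evolved_path, baseline_path):
    # Example gate: fail if the evolved artifact more than doubled in size.
    if os.path.getsize(evolved_path) > 2 * os.path.getsize(baseline_path):
        return 1        # nonzero exit flips the decision to reject
    return 0

def main(env=os.environ):
    return check(env["EVOLVED_PATH"], env["BASELINE_PATH"])

# Demo with throwaway files: a 30-byte "evolved" vs a 10-byte "baseline".
with tempfile.NamedTemporaryFile(delete=False) as small, \
     tempfile.NamedTemporaryFile(delete=False) as big:
    small.write(b"x" * 10)
    big.write(b"y" * 30)
verdict = main({"EVOLVED_PATH": big.name, "BASELINE_PATH": small.name})

if __name__ == "__main__" and "EVOLVED_PATH" in os.environ:
    sys.exit(main())
```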
The framework's deploy gate scores evolved artifacts against an LM-judge on a synthetic eval set. That's a closed loop: an LM scoring another LM's output on tasks a third LM made up. To break the loop, point a real agent at a small task suite with the baseline and evolved artifacts and see whether real agent behavior actually shifted:
```bash
uv run python -m evolution.validation.closed_loop \
  --tool patch \
  --hermes-repo ~/.hermes/hermes-agent \
  --tasks evolution/validation/suites/patch.jsonl \
  --baseline ~/.hermes/hermes-agent/tools/file_tools.py \
  --evolved /tmp/evolved/file_tools.py
```

For each task in the suite, the harness installs baseline then evolved into the user's hermes-agent (atomically, with a `.cl_backup` for crash recovery and `fcntl.flock` to block concurrent runs), invokes `hermes -z` non-interactively, parses the resulting session JSON, and scores each run against the task's `expected_tools` and `forbidden_tools`. The report shows per-task wins/losses plus the aggregate pass-rate change. Decision rule: pass iff `evolved_pass_rate >= baseline_pass_rate` AND (no per-task loss OR wins offset losses 2:1). Exit code 0 on pass, 1 on regression — drop-in for `--benchmark-cmd`:
```bash
--benchmark-cmd 'python -m evolution.validation.closed_loop \
  --tool $TARGET_NAME \
  --hermes-repo ~/.hermes/hermes-agent \
  --tasks evolution/validation/suites/$TARGET_NAME.jsonl \
  --baseline "$BASELINE_PATH" \
  --evolved "$EVOLVED_PATH"'
```

Cost: each task is one `hermes -z` run (~$0.05–$0.50). The bundled patch.jsonl is 5 tasks × 2 phases = ~$0.50–$5 per validation.
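The decision rule quoted above is easy to state in code (a sketch; names are illustrative):

```python
def closed_loop_passes(per_task):
    """per_task: list of (baseline_passed, evolved_passed) booleans.
    Pass iff evolved pass rate >= baseline pass rate
    AND (no per-task loss OR wins offset losses 2:1)."""
    wins = sum(1 for b, e in per_task if e and not b)
    losses = sum(1 for b, e in per_task if b and not e)
    baseline_rate = sum(b for b, _ in per_task) / len(per_task)
    evolved_rate = sum(e for _, e in per_task) / len(per_task)
    return evolved_rate >= baseline_rate and (losses == 0 or wins >= 2 * losses)

# Two wins against one loss meets the 2:1 offset; one-for-one does not.
ok = closed_loop_passes([(False, True), (False, True), (True, False), (True, True)])
not_ok = closed_loop_passes([(True, False), (False, True)])
```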
| Phase | Target | Engine | Status |
|---|---|---|---|
| Phase 1 | Skill files (SKILL.md) | DSPy + GEPA | ✅ Implemented |
| Phase 2 | Tool descriptions | DSPy + GEPA | ✅ Implemented |
| Phase 3 | System prompt sections | DSPy + GEPA | 🔲 Planned |
| Phase 4 | Tool implementation code | Darwinian Evolver | 🔲 Planned |
| Phase 5 | Continuous improvement loop | Automated pipeline | 🔲 Planned |
| Engine | What It Does | License |
|---|---|---|
| DSPy + GEPA | Reflective prompt evolution — reads execution traces, proposes targeted mutations | MIT |
| Darwinian Evolver | Code evolution with Git-based organisms | AGPL v3 (external CLI only) |
Every evolved variant must pass:
- Full test suite — `pytest tests/ -q` must pass 100%
- Size limits — skills ≤15KB, tool descriptions ≤500 chars
- Caching compatibility — no mid-conversation changes
- Semantic preservation — must not drift from original purpose
- PR review — all changes go through human review, never direct commit
See PLAN.md for the complete architecture, evaluation data strategy, constraints, benchmarks integration, and phased timeline.
MIT — © 2026 jramos and Nous Research