Evolutionary self-improvement for agent skills.
Agent Self-Evolution evolves and optimizes agent skills, tool descriptions, system prompts, and code — producing measurably better versions through reflective evolutionary search. Built on DSPy + GEPA (Genetic-Pareto Prompt Evolution), with extra safeguards on top so what ships is reliably better than the original.
No GPU training required. Everything operates via API calls — mutating text, evaluating results, and selecting the best variants. ~$2-10 per optimization run.
Works on any agent framework that emits SKILL.md markdown files. Hermes Agent skills are the original target; Claude Code skills (and any other agent's `<dir>/<skill>/SKILL.md` layout) are also supported via a pluggable skill-source abstraction.
Already running Hermes Agent? No env vars to set. If `~/.hermes/config.yaml` exists, `uv run python -m evolution.skills.evolve_skill --skill <name>` picks up your provider, model, and credentials automatically. On startup the framework runs a tiny ~$0.0001 credential probe; if anything's stale you get a Rich-formatted error panel with the exact recovery command (e.g. `hermes auth add anthropic`) instead of a Python traceback. Jump to Run with Hermes Agent, or read docs/model_resolution.md for the full provider mapping.
```mermaid
flowchart LR
    A[Read current<br/>skill/prompt/tool] --> B[Generate<br/>eval dataset]
    B --> C[GEPA<br/>Optimizer]
    C --> D[Candidate<br/>variants]
    D --> E[Evaluate]
    E -. Execution traces .-> C
    E --> F["Constraint gates<br/>(tests, size limits,<br/>benchmarks)"]
    F --> G[Best<br/>variant]
    G --> H[PR against<br/>source repo]
```
GEPA reads execution traces to understand why things fail (not just that they failed), then proposes targeted improvements. ICLR 2026 Oral, MIT licensed.
GEPA was designed against benchmarks with hundreds of validation examples per task. Skill evolution typically has 20-60 examples, which is small enough that picking the highest-scoring candidate often picks one that won by chance — there's a real risk of shipping a "winner" that just got lucky on the eval set.
This framework adds three checks on top of GEPA so the candidate that ships is one that genuinely improved the skill:
- Knee-point selection — instead of strictly the highest-scoring candidate, looks at every candidate close to the top score and prefers shorter ones. Filters out wins that came from a single lucky example.
- Held-out deploy check — before a candidate ships, it's compared against the baseline on examples it never saw during optimization. Several rules available, including a lenient one that's appropriate for compression-style refactors.
- Three-dimensional scoring — instead of pass/fail, the LLM judge rates each output on correctness, whether it followed the right procedure, and how concise it is. GEPA's reflection step uses these as feedback to guide the next mutation.
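The knee-point idea can be sketched in a few lines (an illustrative toy, not the framework's actual selection code): among all candidates within a small tolerance of the top score, take the shortest.

```python
# Illustrative knee-point selection: near-top candidates compete on length,
# which filters out long variants that won by a single lucky example.
def knee_point_select(candidates, tolerance=0.02):
    """candidates: list of (score, text) tuples."""
    best_score = max(score for score, _ in candidates)
    near_top = [(s, t) for s, t in candidates if s >= best_score - tolerance]
    return min(near_top, key=lambda st: len(st[1]))

winner = knee_point_select([
    (0.91, "long variant " * 40),   # nominal top scorer, but verbose
    (0.90, "short variant"),        # within tolerance and much shorter
    (0.72, "baseline"),
])
# winner is the 0.90 short variant.
```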
If you have hundreds of validation examples and a programmatic correctness metric (exact match, unit-test pass), raw GEPA is the right tool. The framework's extra layers earn their keep when validation is small and the metric is LLM-judged. See docs/framework_advantages.md for the deeper argument.
# Install
```bash
git clone https://github.com/jramos/agent-self-evolution.git
cd agent-self-evolution
uv sync
```

```bash
uv run python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --iterations 10
```

Whatever model + provider Hermes is using (Anthropic, OpenRouter, Nous Portal, OpenAI Codex Responses, AWS Bedrock, a local vLLM/Ollama/LM Studio, etc.) becomes the default for the optimizer, reflection, eval, and judge LMs. On Hermes setups with a single model, all four roles collapse onto it. OAuth-based setups (e.g. Nous Portal) refresh credentials via `hermes model`; API-key setups read from `~/.hermes/config.yaml`'s inline `api_key` or `~/.hermes/auth.json`'s credential pool.
For multi-model providers, override per role:
```bash
uv run python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --optimizer-model anthropic/claude-opus-4-5 \
  --reflection-model anthropic/claude-opus-4-5 \
  --eval-model anthropic/claude-haiku-4-5
```

For closed-loop validation — run the actual Hermes binary against fixture tasks and feed its scores back into GEPA — point at your Hermes checkout:
```bash
export SKILL_SOURCES_HERMES_REPO=~/.hermes/hermes-agent
uv run python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --closed-loop-during-evolution evolution/validation/suites/your_suite.jsonl \
  --closed-loop-hermes-repo ~/.hermes/hermes-agent
```

The closed-loop validator invokes `hermes -z` directly, so it uses the same provider config Hermes itself uses. Optimization and validation see the same model.
Set any standard provider env var and run — the framework falls back to env-var auto-detection in priority order (`ANTHROPIC_API_KEY` → `OPENROUTER_API_KEY` → `OPENAI_API_KEY` → others). When neither Hermes nor an env var is configured, the framework exits with an actionable message listing what was tried.
```bash
export ANTHROPIC_API_KEY=sk-ant-...
uv run python -m evolution.skills.evolve_skill \
  --skill writing-skills \
  --iterations 10
```

See docs/model_resolution.md for the full provider mapping, local-server (vLLM/Ollama/LM Studio) examples, and per-role override patterns.
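The fallback described above amounts to a priority-ordered scan of env vars. A minimal sketch (function and table names are illustrative; only the three named keys come from the text, the "others" are omitted):

```python
import os

# Provider env vars in the documented detection priority order.
PROVIDER_ENV_VARS = [
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("openrouter", "OPENROUTER_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]

def detect_provider(env=os.environ):
    tried = []
    for provider, var in PROVIDER_ENV_VARS:
        if env.get(var):
            return provider
        tried.append(var)
    # Mirrors the documented behavior: exit with a message listing what was tried.
    raise SystemExit(f"No provider configured; tried: {', '.join(tried)}")
```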
Skills are resolved by walking a list of SkillSource adapters in priority order:

- `--skill-source-dir PATH` (repeatable) — generic `<dir>/<name>/SKILL.md` layout. Use for Codex, openclaw, or any custom framework.
- Hermes Agent — set `SKILL_SOURCES_HERMES_REPO=/path/to/hermes-agent` (or have `~/.hermes/hermes-agent` exist). Layout: `<root>/skills/<category>/<name>/SKILL.md`.
- Claude Code — auto-discovered if `~/.claude/plugins/cache/` exists. No env var needed. Layout: `<vendor>/<plugin>/<version>/skills/<name>/SKILL.md`.

Sources whose roots don't exist on disk are skipped automatically.
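As a rough sketch, the adapter walk might look like this (class and function names are illustrative; only the priority order and the skip-missing-roots rule come from the text):

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass
class SkillSource:
    root: Path
    pattern: str  # glob relative to root, with a {name} placeholder

    def find(self, name):
        if not self.root.exists():          # missing roots are skipped
            return None
        matches = sorted(self.root.glob(self.pattern.format(name=name)))
        return matches[0] if matches else None

def resolve_skill(name, sources):
    for source in sources:                  # walked in priority order
        path = source.find(name)
        if path is not None:
            return path
    raise FileNotFoundError(name)

# Demo: a Hermes-style layout under a temp dir, plus a higher-priority
# source whose root doesn't exist (it is skipped, not an error).
root = Path(tempfile.mkdtemp())
skill_md = root / "skills" / "dev" / "my-skill" / "SKILL.md"
skill_md.parent.mkdir(parents=True)
skill_md.write_text("...")
found = resolve_skill("my-skill", [
    SkillSource(Path("/nonexistent-skill-root"), "{name}/SKILL.md"),
    SkillSource(root, "skills/*/{name}/SKILL.md"),
])
```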
```bash
export SKILL_SOURCES_HERMES_REPO=~/.hermes/hermes-agent
uv run python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --iterations 10 \
  --eval-source synthetic
```

The model defaults to whatever Hermes is configured for. See "Run with Hermes Agent" above.
```bash
# No env var needed if you have Claude Code installed
uv run python -m evolution.skills.evolve_skill \
  --skill writing-skills \
  --iterations 10 \
  --eval-source synthetic
```

```bash
uv run python -m evolution.skills.evolve_skill \
  --skill my-skill \
  --skill-source-dir ~/path/to/my-skills \
  --iterations 10 \
  --eval-source synthetic
```

For agents using MCP, Anthropic tool-use, OpenAI function calling, or any custom registry that can be exported to MCP's `list_tools()` JSON shape:
```bash
uv run python -m evolution.tools.evolve_tool \
  --tool search_files \
  --manifest /path/to/your/mcp-tools.json \
  --iterations 5
```

Reads the static MCP-shape manifest, evolves one tool's top-level `description` field via GEPA, and writes the result to `output/tools/<tool>/<timestamp>/`. `--apply` rewrites the source manifest in place (every non-target tool's `description`, `inputSchema`, and any `_evolution_metadata` block are preserved verbatim); `--patch` emits a unified diff to stdout instead.
At evaluation time the agent sees the full rendered manifest, so cross-tool regressions (the evolved description "stealing" selections from a confusable neighbor) surface naturally through the deploy gate.
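For reference, a minimal manifest in MCP's `list_tools()` shape (the tool content here is illustrative; only the target tool's top-level `description` is evolved):

```json
{
  "tools": [
    {
      "name": "search_files",
      "description": "Search file contents by regex across the workspace.",
      "inputSchema": {
        "type": "object",
        "properties": {
          "pattern": { "type": "string" },
          "path": { "type": "string" }
        },
        "required": ["pattern"]
      }
    }
  ]
}
```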
For agents whose tools are defined as Python `*_SCHEMA` dicts (Hermes Agent's pattern), point `--manifest` at the tools directory:

```bash
uv run python -m evolution.tools.evolve_tool \
  --tool read_file \
  --manifest /path/to/hermes-agent/tools \
  --fitness-profile balanced --iterations 5
```

The framework parses every `*_SCHEMA = {...}` and `*_SCHEMAS = [...]` declaration via AST, handles literal-string descriptions and one-hop Name references (constants like `TERMINAL_TOOL_DESCRIPTION`), and refuses to apply changes to f-string-built descriptions (rewrite the tool to a literal description first). Tools that can't be parsed statically (e.g., schemas built from function calls) appear under `dataset.dropped_tools` in `gate_decision.json` so you see what's excluded.
With `--apply`, the evolved description is spliced into the source file's bytes at the original position — comments, formatting, and unrelated tools are untouched. Multi-line parenthesized concatenations collapse to a single triple-quoted string at the same indent.
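The static scan can be pictured with Python's `ast` module. This simplified sketch handles only literal-string descriptions in single-target `*_SCHEMA` dict assignments; one-hop Name references and `*_SCHEMAS` lists are out of scope here:

```python
import ast

def extract_descriptions(source: str):
    """Map each top-level *_SCHEMA assignment to its literal description."""
    out = {}
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Assign) and len(node.targets) == 1):
            continue
        target = node.targets[0]
        if not (isinstance(target, ast.Name) and target.id.endswith("_SCHEMA")):
            continue
        if isinstance(node.value, ast.Dict):
            for key, val in zip(node.value.keys, node.value.values):
                # Keep only "description": "<literal string>" entries.
                if (isinstance(key, ast.Constant) and key.value == "description"
                        and isinstance(val, ast.Constant)
                        and isinstance(val.value, str)):
                    out[target.id] = val.value
    return out

src = 'READ_FILE_SCHEMA = {"name": "read_file", "description": "Read a file."}'
# extract_descriptions(src) → {"READ_FILE_SCHEMA": "Read a file."}
```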
For skill evolution:
```bash
uv run python -m evolution.skills.evolve_skill \
  --skill github-code-review \
  --iterations 10 \
  --eval-source sessiondb
```

Pulls real usage from Claude Code (`~/.claude/history.jsonl`), Copilot, and Hermes session logs.
For tool description evolution:
```bash
uv run python -m evolution.tools.evolve_tool \
  --tool search_files \
  --manifest /path/to/mcp-tools.json \
  --eval-source sessiondb
```

Mines Hermes session JSON (`~/.hermes/sessions/`) for (user_task, invoked_tool) pairs, then re-judges each pair against the current manifest. Misselections — where the judge picks a different tool than the agent did with high confidence — become flipped-label training examples that exercise exactly the failure mode the evolution is trying to fix. Add `--dry-run` to confirm session discovery before spending judge + GEPA budget.
Only Hermes is mined for tool data — Claude Code and Copilot session logs don't carry tool_use blocks. The eval is biased toward whatever task distribution lives in your session history, so it may underrepresent the confusable-neighbor cases the synthetic eval targets directly. Run synthetic first if you need that coverage and don't have substantial Hermes history.
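Conceptually, the mining step reduces to a filter over (task, tool) pairs. A toy sketch with assumed field names (the real pipeline's schema may differ):

```python
def mine_misselections(pairs, judge, confidence_threshold=0.8):
    """pairs: (user_task, invoked_tool) mined from sessions.
    judge(task) returns (chosen_tool, confidence).
    A confident disagreement becomes a training example labeled with the
    tool the agent actually used."""
    examples = []
    for user_task, invoked_tool in pairs:
        judged_tool, confidence = judge(user_task)
        if judged_tool != invoked_tool and confidence >= confidence_threshold:
            examples.append({"task": user_task, "label": invoked_tool,
                             "judge_picked": judged_tool})
    return examples

# With a toy judge that always picks "grep_tool" at confidence 0.9, a session
# where the agent invoked "search_files" becomes a flipped-label example.
toy = mine_misselections([("find TODOs", "search_files")],
                         judge=lambda t: ("grep_tool", 0.9))
```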
The LLM-as-judge scores agent outputs on three dimensions (correctness, procedure-following, conciseness). `--fitness-profile` selects how those dimensions are weighted in the composite:
```bash
uv run python -m evolution.skills.evolve_skill --skill X --fitness-profile <profile>
```

| Profile | Correctness | Procedure | Conciseness | Use when |
|---|---|---|---|---|
| `balanced` (default) | 0.5 | 0.3 | 0.2 | General-purpose evolution. Uses balanced-mode proposer (handles both directions without bias). |
| `compression` | 0.4 | 0.2 | 0.4 | Explicitly shrinking an over-long skill. Uses compression-mode proposer. |
| `growth` | 0.6 | 0.4 | 0.0 | The baseline is missing capabilities and needs to add them. Uses growth-mode proposer. |
The chosen profile is recorded in `gate_decision.json` so any deployed variant can be traced back to the weighting that produced it.

Each profile also selects a reflection-prompt proposer template. `compression` tells the LM to cut redundancy under a tight char budget; `growth` tells it to add only what the failure feedback explicitly identifies as missing; `balanced` (the default) is direction-agnostic — it asks the LM to fix the failures without prescribing cuts or additions, and uses a soft "stay near N characters, ±20%" budget. All three share the same anti-hallucination guardrails: every change must ground in a specific feedback phrase, and empty feedback returns the instruction unchanged.
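The composite implied by the profile table is a plain weighted sum over the three judge dimensions. A sketch (function name illustrative; weights taken from the table above):

```python
# Per-profile weights for the three judge dimensions.
PROFILES = {
    "balanced":    {"correctness": 0.5, "procedure": 0.3, "conciseness": 0.2},
    "compression": {"correctness": 0.4, "procedure": 0.2, "conciseness": 0.4},
    "growth":      {"correctness": 0.6, "procedure": 0.4, "conciseness": 0.0},
}

def composite_score(scores, profile="balanced"):
    """scores: dict of dimension -> value in [0, 1]."""
    weights = PROFILES[profile]
    return sum(weights[dim] * scores[dim] for dim in weights)

s = {"correctness": 1.0, "procedure": 0.5, "conciseness": 0.0}
# balanced: 0.5*1.0 + 0.3*0.5 + 0.2*0.0 = 0.65
# growth:   0.6*1.0 + 0.4*0.5 + 0.0*0.0 = 0.80
```

Note how `growth` zeroes out conciseness entirely, so adding material is never penalized for length.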
By default, the evolved skill lands in `output/<skill>/<timestamp>/evolved_skill.md` and stops there. Two opt-in flags automate the next step:
```bash
# Copy evolved_skill.md over the source SKILL.md in place on a deploy decision.
# No git operations; the user's workflow stays in their hands.
uv run python -m evolution.skills.evolve_skill --skill X --apply

# Emit a unified diff to stdout instead — pipe to patch, git apply, or a review tool.
uv run python -m evolution.skills.evolve_skill --skill X --patch | git apply
```

Both flags are no-ops on a reject decision (with a stderr notice). `--apply` also skips with a warning when the source path is under Claude Code's plugin cache (read-only by design).
`--max-total-cost-usd FLOAT` aborts the run cleanly when cumulative LM cost exceeds the ceiling. Useful when an accidentally-cranked `--iterations` could push a run past your expected budget. Worst-case overshoot is one LM call past the ceiling — the cost callback fires after each call returns, and the next call aborts at start.
```bash
uv run python -m evolution.skills.evolve_skill --skill X --max-total-cost-usd 5.00
```

On abort, `output/<artifact>/<ts>/gate_decision.json` carries `decision="aborted"`, `reason="cost_ceiling_exceeded"`, and the full `cost_summary` block so you see what was actually spent.
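The overshoot semantics can be sketched as a guard object (illustrative, not the framework's actual callback API): cost is recorded after each call returns, and the ceiling check runs before the next call starts, so at most one call can land past the ceiling.

```python
class CostCeiling:
    def __init__(self, max_usd):
        self.max_usd = max_usd
        self.spent = 0.0

    def before_call(self):
        # Abort before starting the next call if the ceiling is breached.
        if self.spent > self.max_usd:
            raise RuntimeError("cost_ceiling_exceeded")

    def after_call(self, cost_usd):
        # Cost is only known once the call returns.
        self.spent += cost_usd

guard = CostCeiling(max_usd=0.10)
for cost in [0.06, 0.06]:       # second call pushes spend past the ceiling
    guard.before_call()
    guard.after_call(cost)
# The next before_call() raises; worst-case overshoot was one call.
```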
`--benchmark-cmd "<shell command>"` runs your command as a deploy gate after the framework's own gate passes. Nonzero exit flips the decision to reject with `reason="benchmark_failed"`. The command receives the evolved + baseline artifact paths via env vars so it can run a pytest line, a custom benchmark, or any shell pipeline:
```bash
uv run python -m evolution.tools.evolve_tool --tool X --manifest Y \
  --benchmark-cmd 'pytest -k smoke && custom_check.sh "$EVOLVED_PATH"'
```

Env vars: `EVOLVED_PATH`, `BASELINE_PATH`, `RUN_DIR`, `TARGET_NAME`, `ARTIFACT_TYPE`. The hook runs under `/bin/sh -c` — interactive aliases are not available; invoke binaries by full name. Trust boundary: the command string is yours; do not pass strings you didn't write yourself.
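A `--benchmark-cmd` hook can be any executable. Here is a minimal Python sketch that reads two of the documented env vars and fails on an illustrative size regression; the check itself is a placeholder for a real benchmark:

```python
#!/usr/bin/env python3
"""Minimal --benchmark-cmd hook sketch using the documented env vars.
The size check is illustrative; substitute any real benchmark."""
import os
import sys
import tempfile

def check(evolved_path, baseline_path):
    # Example gate: fail if the evolved artifact more than doubled in size.
    if os.path.getsize(evolved_path) > 2 * os.path.getsize(baseline_path):
        return 1        # nonzero exit flips the decision to reject
    return 0

def main(env=os.environ):
    return check(env["EVOLVED_PATH"], env["BASELINE_PATH"])

# Demo with throwaway files: a 30-byte "evolved" vs a 10-byte "baseline".
with tempfile.NamedTemporaryFile(delete=False) as small, \
     tempfile.NamedTemporaryFile(delete=False) as big:
    small.write(b"x" * 10)
    big.write(b"y" * 30)
verdict = main({"EVOLVED_PATH": big.name, "BASELINE_PATH": small.name})

if __name__ == "__main__" and "EVOLVED_PATH" in os.environ:
    sys.exit(main())
```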
The framework's deploy gate scores evolved artifacts against an LM-judge on a synthetic eval set. That's a closed loop: an LM scoring another LM's output on tasks a third LM made up. To break the loop, point a real agent at a small task suite with the baseline and evolved artifacts and see whether real agent behavior actually shifted:
```bash
uv run python -m evolution.validation.closed_loop \
  --tool patch \
  --hermes-repo ~/.hermes/hermes-agent \
  --tasks evolution/validation/suites/patch.jsonl \
  --baseline ~/.hermes/hermes-agent/tools/file_tools.py \
  --evolved /tmp/evolved/file_tools.py
```

For each task in the suite, the harness installs baseline then evolved into the user's hermes-agent (atomically, with a `.cl_backup` for crash recovery and `fcntl.flock` to block concurrent runs), invokes `hermes -z` non-interactively, parses the resulting session JSON, and scores each run against the task's `expected_tools` and `forbidden_tools`. The report shows per-task wins/losses plus the aggregate pass-rate change. Decision rule: pass iff `evolved_pass_rate >= baseline_pass_rate` AND (no per-task loss OR wins offset losses 2:1). Exit code 0 on pass, 1 on regression — drop-in for `--benchmark-cmd`:
```bash
--benchmark-cmd 'python -m evolution.validation.closed_loop \
  --tool $TARGET_NAME \
  --hermes-repo ~/.hermes/hermes-agent \
  --tasks evolution/validation/suites/$TARGET_NAME.jsonl \
  --baseline "$BASELINE_PATH" \
  --evolved "$EVOLVED_PATH"'
```

Cost: each task is one `hermes -z` run (~$0.05–$0.50). The bundled patch.jsonl is 5 tasks × 2 phases = ~$0.50–$5 per validation.
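The decision rule quoted above is easy to state in code (a sketch; names are illustrative):

```python
def closed_loop_passes(per_task):
    """per_task: list of (baseline_passed, evolved_passed) booleans.
    Pass iff evolved pass rate >= baseline pass rate
    AND (no per-task loss OR wins offset losses 2:1)."""
    wins = sum(1 for b, e in per_task if e and not b)
    losses = sum(1 for b, e in per_task if b and not e)
    baseline_rate = sum(b for b, _ in per_task) / len(per_task)
    evolved_rate = sum(e for _, e in per_task) / len(per_task)
    return evolved_rate >= baseline_rate and (losses == 0 or wins >= 2 * losses)

# Two wins against one loss meets the 2:1 offset; one-for-one does not.
ok = closed_loop_passes([(False, True), (False, True), (True, False), (True, True)])
not_ok = closed_loop_passes([(True, False), (False, True)])
```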
| Phase | Target | Engine | Status |
|---|---|---|---|
| Phase 1 | Skill files (SKILL.md) | DSPy + GEPA | ✅ Implemented |
| Phase 2 | Tool descriptions | DSPy + GEPA | ✅ Implemented |
| Phase 3 | System prompt sections | DSPy + GEPA | 🔲 Planned |
| Phase 4 | Tool implementation code | Darwinian Evolver | 🔲 Planned |
| Phase 5 | Continuous improvement loop | Automated pipeline | 🔲 Planned |
| Engine | What It Does | License |
|---|---|---|
| DSPy + GEPA | Reflective prompt evolution — reads execution traces, proposes targeted mutations | MIT |
| Darwinian Evolver | Code evolution with Git-based organisms | AGPL v3 (external CLI only) |
Every evolved variant must pass:
- Full test suite — `pytest tests/ -q` must pass 100%
- Size limits — skills ≤15KB, tool descriptions ≤500 chars
- Caching compatibility — no mid-conversation changes
- Semantic preservation — must not drift from original purpose
- PR review — all changes go through human review, never direct commit
See PLAN.md for the complete architecture, evaluation data strategy, constraints, benchmarks integration, and phased timeline.
MIT — © 2026 jramos and Nous Research