Capability-Augmented Generation through Reusable Graph. A research framework for accumulating, validating, and composing executable skills across LLM agents.
LLMs forget every solution the moment a session ends. CARG is a thin layer that turns each verified solution into a reusable, executable skill, organizes those skills into a dependency graph, and gives later tasks a runtime where the LLM can call the prior skills by name instead of redrafting them.
The result, on a 30-trial coding benchmark with claude-sonnet-4.5:
| Metric | Cold | Few-shot | CARG |
|---|---|---|---|
| Success rate | 86.7% | 80.0% | 80.0% |
| Billed tokens | 198 576 | 222 342 | 145 578 (-26.7% vs cold) |
| Paid tokens (after cache) | 198 576 | 222 342 | 0 |
| Recall hit rate | 0% | 0% | 26.7% |
The full numbers and how to reproduce them are in Benchmarks.
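The percentage in the Billed tokens row can be re-derived from the raw counts in the table. The snippet below is a minimal sketch: the function names are illustrative and not part of the carg package, and the sign convention for the success delta (CARG minus cold, in percentage points) is an assumption.

```python
def savings_pct(baseline_tokens: int, candidate_tokens: int) -> float:
    """Percentage of tokens saved relative to the baseline leg."""
    return (baseline_tokens - candidate_tokens) / baseline_tokens * 100.0


def success_delta_pp(baseline_rate: float, candidate_rate: float) -> float:
    """Success-rate difference in percentage points (candidate minus baseline)."""
    return candidate_rate - baseline_rate


# Billed-token counts copied from the results table.
cold_tokens, fewshot_tokens, carg_tokens = 198_576, 222_342, 145_578

print(f"CARG vs cold:     {savings_pct(cold_tokens, carg_tokens):.1f}% fewer billed tokens")
print(f"CARG vs few-shot: {savings_pct(fewshot_tokens, carg_tokens):.1f}% fewer billed tokens")
print(f"Success delta:    {success_delta_pp(86.7, 80.0):+.1f} pp vs cold")
```

Computing the saving against each baseline separately keeps the comparison honest: the few-shot leg spends more tokens than cold, so the saving relative to it is larger than the headline figure.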
RAG retrieves text. Toolformer calls hand-written tools. Neither grows its tool belt across
sessions, and neither maintains a lifecycle (probation, validation, deprecation) for what it has learned.
CARG argues that an agent's long-term value is in executable, validated, composable skills — not in chats, not in raw memories. That belief is testable, and this repository is the test bench.
                    ┌─────────────────────────────────┐
                    │         Foundation LLM          │
                    │   (OpenAI / Anthropic / mock)   │
                    └────────────────┬────────────────┘
                                     │
    ┌────────────────────────────────▼────────────────────────────────┐
    │                          CARG Runtime                           │
    │               recall · compose · resolve · cache                │
    └───┬─────────────┬─────────────┬───────────────┬──────────────┬──┘
        │             │             │               │              │
┌───────▼───────┐ ┌───▼────┐ ┌──────▼───────┐ ┌─────▼─────┐ ┌──────▼──────┐
│ Capability    │ │ SDK    │ │ Crystallizer │ │ Evolution │ │ Sandbox     │
│ Graph         │ │ recall │ │ miner        │ │ A/B +     │ │ simulator   │
│ (skill→skill) │ │ commit │ │              │ │ rollback  │ │ + telemetry │
└───────┬───────┘ └───┬────┘ └──────┬───────┘ └─────┬─────┘ └──────┬──────┘
        │             │             │               │              │
        └─────────────┴─────────────┼───────────────┴──────────────┘
                                    │
                          ┌─────────▼─────────┐
                          │  Kernel Registry  │
                          │     (SQLite)      │
                          └───────────────────┘
Eight subpackages, each with a single responsibility:
| Module | Responsibility |
|---|---|
| `carg.core` | Four kernel formats: Program / Policy / ReasoningGraph / LatentAdapter. All conform to `K: (S, M) → (A, M')`. |
| `carg.memory` | `KernelRegistry` (SQLite persistence), `KernelMemory` (recall/commit façade with probation→promoted→deprecated lifecycle), `CapabilityGraph` (AST-derived caller → callee graph). |
| `carg.llm` | Provider clients (mock / OpenAI / Anthropic), plus `CachedLLMClient` so identical prompts only cost once. |
| `carg.runtime` | Composer + conflict resolver (priority / agreement / traffic-light strategies). |
| `carg.crystallizer` | Mines kernels from session logs (n-gram patterns → program / policy / graph / adapter). |
| `carg.evolution` | Mutation operators (parametric / structural / crossover) + A/B harness + automatic rollback. |
| `carg.sandbox` | The synthetic SaaS app, agent personas, simulator, telemetry. |
| `carg.benchmark` | Task suites + four solvers + 2-way and 3-way harnesses. |
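
The shared signature K: (S, M) → (A, M') reads as: a kernel takes a state and a memory, and returns an action plus an updated memory. A minimal sketch of that contract follows. The `Kernel` protocol, the `CounterKernel` class, and the dict-based state and memory types are hypothetical illustrations, not the classes defined in carg.core.

```python
from typing import Any, Protocol

State = dict[str, Any]    # S: assumed shape, for illustration only
Memory = dict[str, Any]   # M: assumed shape, for illustration only
Action = Any              # A


class Kernel(Protocol):
    """Anything callable as (state, memory) -> (action, new_memory)."""

    def __call__(self, state: State, memory: Memory) -> tuple[Action, Memory]:
        ...


class CounterKernel:
    """Toy kernel: echoes the input and counts how often it has been called."""

    def __call__(self, state: State, memory: Memory) -> tuple[Action, Memory]:
        calls = memory.get("calls", 0) + 1
        action = {"echo": state.get("input"), "call_number": calls}
        # Return a new memory dict instead of mutating the caller's copy.
        return action, {**memory, "calls": calls}


def run(kernel: Kernel, states: list[State]) -> tuple[list[Action], Memory]:
    """Thread the memory through a sequence of kernel calls."""
    memory: Memory = {}
    actions = []
    for state in states:
        action, memory = kernel(state, memory)
        actions.append(action)
    return actions, memory


actions, final_memory = run(CounterKernel(), [{"input": "a"}, {"input": "b"}])
```

Returning the updated memory rather than mutating it in place is a design choice of this sketch; it makes each call easy to replay or roll back.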
pip install -e ".[dev,llm]"
This exposes the `carg` CLI.
carg demo --users 1500
This runs the bundled synthetic e-commerce app, finds defects, crystallizes kernels from the logs, applies them, re-measures, and runs one generation of the evolution engine. Everything is offline.
export ANTHROPIC_BASE_URL=... # or OPENAI_API_KEY
export ANTHROPIC_AUTH_TOKEN=...
export ANTHROPIC_MODEL=claude-sonnet-4.5
carg benchmark --suite easy --solver llm --trials 60 # easy suite
carg benchmark --suite evil --solver llm --trials 30 # adversarial spec inversions
carg compose-bench --provider anthropic # capability-graph showcase
carg bench3way --provider anthropic --suite evil # cold vs fewshot vs CARG
If no key is set, every command falls back to the deterministic mock client so the pipeline still runs end to end.
from carg.benchmark import KernelMemory, LLMSolver
from carg.benchmark.suites import default_task_stream

mem = KernelMemory(":memory:")
solver = LLMSolver()  # picks provider from env

# Solve a seeded, reproducible stream of tasks against one shared memory.
for task in default_task_stream(seed=7, length=30):
    result = solver.solve(task, memory=mem)
    print(task.task_id, result.success, result.recall_hit, result.cost_tokens)

The same `KernelMemory` interface works inside any agent: call `recall(skill, description)` before
asking the model, and call `commit(skill, description, solution, success, cost_tokens)` afterwards.
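The recall-then-commit pattern can be sketched without the real registry. `DictMemory`, `solve_with_memory`, `fake_model`, and `check_gcd` below are hypothetical stand-ins; only the argument lists of `recall` and `commit` come from the text above, and the return types (a miss signalled by `None`) are assumptions. The stand-in also skips the probation lifecycle entirely.

```python
from typing import Callable, Optional


class DictMemory:
    """Hypothetical in-memory stand-in; the real KernelMemory persists to SQLite."""

    def __init__(self) -> None:
        self._solutions: dict[str, str] = {}

    def recall(self, skill: str, description: str) -> Optional[str]:
        # Assumption: a miss is signalled by None.
        return self._solutions.get(skill)

    def commit(self, skill: str, description: str, solution: str,
               success: bool, cost_tokens: int) -> None:
        # Only verified solutions are stored.
        if success:
            self._solutions[skill] = solution


def solve_with_memory(memory: DictMemory, skill: str, description: str,
                      ask_model: Callable[[str], tuple[str, int]],
                      verify: Callable[[str], bool]) -> tuple[str, int]:
    """Recall first; on a miss, ask the model and commit the outcome."""
    cached = memory.recall(skill, description)
    if cached is not None and verify(cached):
        return cached, 0  # recall hit: no model call, no tokens spent
    solution, cost_tokens = ask_model(description)
    memory.commit(skill, description, solution, verify(solution), cost_tokens)
    return solution, cost_tokens


def fake_model(prompt: str) -> tuple[str, int]:
    # Placeholder for a provider call; returns (solution source, tokens used).
    return "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n", 120


def check_gcd(source: str) -> bool:
    namespace: dict = {}
    exec(source, namespace)  # trusted toy input only
    return namespace["gcd"](12, 18) == 6


memory = DictMemory()
first = solve_with_memory(memory, "gcd", "greatest common divisor", fake_model, check_gcd)
second = solve_with_memory(memory, "gcd", "greatest common divisor", fake_model, check_gcd)
print(first[1], second[1])  # tokens spent on the first and second solve
```

The second call returns the stored solution at zero token cost, which is the effect the recall hit rate in the benchmark is meant to capture.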
     first successful solve
────────────────────────────────► PROBATION
                                      │
                      same body verifies on a 2nd task
                                      ▼
                                  PROMOTED ◄── recall starts handing it out
                                      │
                        3 recall-misses on real tasks
                                      ▼
                                  DEPRECATED
The probation gate exists because trusting the very first success poisons the registry on
adversarial inputs (we measured this; see bench_brutal_30.log in the archive/ branch). A kernel
must validate on at least two distinct tasks before any task is allowed to recall it.
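The lifecycle rules can be written down as a small state machine. This is a sketch under stated assumptions: `KernelRecord` and its field names are hypothetical and do not reflect the registry schema, and the three recall-misses are counted cumulatively, which the text does not specify.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROBATION = "probation"
    PROMOTED = "promoted"
    DEPRECATED = "deprecated"


@dataclass
class KernelRecord:
    """Hypothetical record; field names are illustrative only."""
    body: str
    status: Status = Status.PROBATION
    verified_tasks: set[str] = field(default_factory=set)
    recall_misses: int = 0

    def record_success(self, task_id: str) -> None:
        # Promotion requires the same body to verify on two distinct tasks.
        self.verified_tasks.add(task_id)
        if self.status is Status.PROBATION and len(self.verified_tasks) >= 2:
            self.status = Status.PROMOTED

    def record_recall_miss(self) -> None:
        # Assumption: misses are cumulative; three retire a promoted kernel.
        if self.status is not Status.PROMOTED:
            return
        self.recall_misses += 1
        if self.recall_misses >= 3:
            self.status = Status.DEPRECATED

    @property
    def recallable(self) -> bool:
        # Only promoted kernels are handed out by recall.
        return self.status is Status.PROMOTED


kernel = KernelRecord(body="def gcd(a, b): ...")
kernel.record_success("task-1")        # first success: stays on probation
kernel.record_success("task-1")        # the same task does not count twice
recallable_on_probation = kernel.recallable
kernel.record_success("task-2")        # second distinct task: promoted
recallable_after_promotion = kernel.recallable
```

Tracking verified task IDs in a set is what enforces "two distinct tasks": repeating one task cannot promote a kernel.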
When a skill body is committed, its AST is parsed for calls to bare function names. If a called name
matches another stored kernel's function name, a caller → callee edge is added. The graph is
rebuilt on demand, and the compose solver uses it to inject every promoted skill as an executable helper
function in the next prompt.
sum_digital_roots ─► digital_root ─► digit_sum
lcm_list ─► lcm ─► gcd
count_primes_in_range ─► is_prime
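Edges like these can be derived with the standard `ast` module. The sketch below is one possible reading of the rule, not the code in capability_graph.py: `called_names` and `build_edges` are hypothetical helpers, the kernel bodies are made up for the example, and ignoring self-recursive calls is an assumption.

```python
import ast


def called_names(source: str) -> set[str]:
    """Names invoked as plain calls, e.g. gcd(a, b) but not math.gcd(a, b)."""
    tree = ast.parse(source)
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }


def build_edges(kernels: dict[str, str]) -> set[tuple[str, str]]:
    """Return (caller, callee) pairs where the callee is another stored kernel."""
    edges = set()
    for caller, source in kernels.items():
        for callee in called_names(source):
            # Assumption: recursion is not recorded as an edge.
            if callee in kernels and callee != caller:
                edges.add((caller, callee))
    return edges


# Illustrative kernel bodies keyed by function name.
kernels = {
    "gcd": "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n",
    "lcm": "def lcm(a, b):\n    return a // gcd(a, b) * b\n",
    "lcm_list": (
        "def lcm_list(xs):\n"
        "    result = 1\n"
        "    for x in xs:\n"
        "        result = lcm(result, x)\n"
        "    return result\n"
    ),
}

for caller, callee in sorted(build_edges(kernels)):
    print(f"{caller} -> {callee}")
```

Filtering on `ast.Name` is what restricts the match to bare names: attribute calls such as `math.gcd` have an `ast.Attribute` node as their `func` and are skipped.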
Few-shot can prepend examples. Few-shot cannot make a previously written gcd actually
runnable in the next task's evaluation context. CARG can.
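What "actually runnable" means can be shown in a few lines: a stored helper is executed into the namespace where the new solution is evaluated, so the new code can call it by name. This is a sketch only. `build_namespace` is a hypothetical helper, and it uses plain `exec` on trusted toy input rather than any sandboxing.

```python
# A previously stored skill, keyed by function name (illustrative body).
promoted = {
    "gcd": "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n",
}

# A new solution that calls gcd by name instead of re-implementing it.
candidate = "def lcm(a, b):\n    return a // gcd(a, b) * b\n"


def build_namespace(helper_sources: dict[str, str]) -> dict:
    """Execute each stored helper so its function exists in one shared namespace."""
    namespace: dict = {}
    for source in helper_sources.values():
        exec(source, namespace)  # plain exec: trusted input only, no sandbox here
    return namespace


namespace = build_namespace(promoted)
exec(candidate, namespace)        # lcm resolves gcd from the shared namespace
print(namespace["lcm"](4, 6))     # 12
```

Evaluating the same candidate in an empty namespace would raise `NameError` on the first call, which is the gap a few-shot prompt cannot close by itself.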
Three task suites, all in carg.benchmark.suites:
| Suite | Tasks | Skill repeats | What it measures |
|---|---|---|---|
| `easy` | 30 | 3 per skill | Baseline accumulation curve. |
| `evil` | 30 | 3 per skill | Adversarial spec inversions (RPN with floor div, glob with 1+ semantics, `prev_permutation`, etc.). LLMs override the spec and revert to memorized conventions. |
| `compose` | 18 | 3 layers (atoms → composites → compounds) | Capability-graph showcase. Layer-N tasks should call layer-(N-1) helpers instead of re-implementing them. |
Each suite has a deterministic seed and a single canonical solution per skill. Verification is by
unittest-style test cases inside the task definition.
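Verification against unittest-style cases might look like the following. The task definition format is not shown here, so `make_test_case`, `verify`, and the `is_prime` cases are hypothetical; the sketch only illustrates running a candidate's source against a `unittest.TestCase` and reducing the outcome to pass or fail.

```python
import unittest


def make_test_case(namespace: dict) -> type:
    """Build a TestCase bound to the functions a candidate solution defined."""

    class IsPrimeTests(unittest.TestCase):
        def test_small_primes(self):
            for n in (2, 3, 5, 7, 11):
                self.assertTrue(namespace["is_prime"](n))

        def test_non_primes(self):
            for n in (0, 1, 4, 9, 15):
                self.assertFalse(namespace["is_prime"](n))

    return IsPrimeTests


def verify(candidate_source: str) -> bool:
    """Run the tests against the candidate; True only if every test passes."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # trusted toy input only
    except Exception:
        return False
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_test_case(namespace))
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


good = (
    "def is_prime(n):\n"
    "    if n < 2:\n"
    "        return False\n"
    "    return all(n % d for d in range(2, int(n ** 0.5) + 1))\n"
)
bad = "def is_prime(n):\n    return n % 2 == 1\n"  # wrongly rejects 2, accepts 9

print(verify(good), verify(bad))
```

Collecting results in a bare `unittest.TestResult` keeps the check silent and returns a single boolean, which is the shape a `success` flag needs.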
carg bench3way --suite evil --provider anthropic --trials 30 --seed 7 --json > result.json
Three legs run against the same task stream through a shared SQLite response cache, so identical
prompts cost once. The cost_savings_pct and delta_success fields in the JSON are the bottom-line
comparison.
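A response cache of this kind can be sketched with the standard `sqlite3` and `hashlib` modules. `ResponseCache` and `fake_provider` are hypothetical and are not the `CachedLLMClient` implementation; the table layout and the choice to key on model plus prompt are assumptions.

```python
import hashlib
import sqlite3
from typing import Callable


class ResponseCache:
    """Hypothetical prompt-to-response cache backed by SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, body TEXT)"
        )

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Hash model and prompt together so different models never share entries.
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str,
                 call: Callable[[str], str]) -> tuple[str, bool]:
        """Return (response, was_cached); `call` runs only on a cache miss."""
        key = self._key(model, prompt)
        row = self._db.execute(
            "SELECT body FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0], True
        body = call(prompt)
        self._db.execute("INSERT INTO responses (key, body) VALUES (?, ?)", (key, body))
        self._db.commit()
        return body, False


calls = []


def fake_provider(prompt: str) -> str:
    calls.append(prompt)  # each entry stands for one billed request
    return f"echo: {prompt}"


cache = ResponseCache()
first = cache.complete("mock", "write gcd", fake_provider)
second = cache.complete("mock", "write gcd", fake_provider)
print(first, second, len(calls))
```

Because all three legs would share one cache, a prompt that is byte-identical across legs is billed once; any leg that changes the prompt, for example by prepending examples, misses the cache.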
The benchmark is not a claim that CARG accumulates capability the LLM lacks. On claude-sonnet-4.5
we observe:
- Cost drops by 27–67% across runs of repeated-skill streams.
- Success stays flat or moves within ±1 task on small-N suites. The model already solves these problems, so the win is in dollars, not in a higher success floor.
- On `rpn_floor` (RPN with `//` semantics) the model refuses to follow the spec and writes truncate-toward-zero instead. Memory cannot help if the model never produces a correct first solution. We treat this as a feature: the benchmark surfaces alignment ceilings.
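The difference between floor division and truncation toward zero only shows up with negative operands, which is why the two are easy to confuse. The evaluator below is a hypothetical illustration, not the benchmark task itself; the `/` token and the function names are assumptions.

```python
import math
import operator


def eval_rpn(tokens: list[str], divide) -> int:
    """Evaluate an integer RPN expression using the supplied division rule."""
    ops = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": divide}
    stack: list[int] = []
    for token in tokens:
        if token in ops:
            right = stack.pop()
            left = stack.pop()
            stack.append(ops[token](left, right))
        else:
            stack.append(int(token))
    return stack.pop()


def floor_div(a: int, b: int) -> int:
    return a // b              # rounds toward negative infinity


def trunc_div(a: int, b: int) -> int:
    return math.trunc(a / b)   # rounds toward zero


expr = ["-7", "2", "/"]
print(eval_rpn(expr, floor_div))  # -4
print(eval_rpn(expr, trunc_div))  # -3
```

With positive operands both rules agree, so a test suite needs at least one negative case to tell a spec-following solution from a memorized one.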
The path to a real success-rate win is (1) embedding-based recall instead of skill-tag recall, (2)
weaker models where cold success is < 100% (e.g. claude-haiku, gpt-4o-mini), and (3) compositional
tasks that exceed any single-shot context.
src/carg/
├── core/                     # K: (S, M) -> (A, M')
│   ├── base.py
│   ├── program.py            # RestrictedPython-sandboxed code
│   ├── policy.py             # softmax over discrete actions
│   ├── graph.py              # YAML-described action graphs
│   └── adapter.py            # prefix + logit-bias prompt-tune
├── memory/                   # persistent skill store
│   ├── registry.py           # SQLite, lifecycle, lineage
│   ├── sdk.py                # KernelMemory recall/commit
│   └── capability_graph.py   # AST-derived skill graph
├── llm/                      # provider clients + cache
│   ├── base.py
│   ├── mock.py
│   ├── openai_client.py
│   ├── anthropic_client.py
│   └── cache.py              # CachedLLMClient (SQLite)
├── runtime/                  # compose, conflict resolution
├── crystallizer/             # mine kernels from session logs
├── evolution/                # mutators, A/B engine, rollback
├── sandbox/                  # §10 synthetic app + simulator
├── benchmark/
│   ├── suites/               # easy / evil / compose
│   ├── solvers/              # simulated / llm / fewshot / compose
│   ├── harness.py            # 2-way A/B
│   └── bench3way.py          # 3-way A/B/C
└── cli.py
- Four kernel formats with executable verification.
- SQLite registry, probation lifecycle, deprecation.
- AST-derived capability graph.
- 3-way benchmark vs few-shot baseline.
- CLI: `simulate`, `benchmark`, `bench3way`, `compose-bench`, `evolve`, `crystallize`, `demo`.
- Embedding-based recall. The current skill-tag matching does not scale to real workloads.
- Weaker-model studies. Run on `claude-haiku` and `gpt-4o-mini` where cold success is well below 100%.
- Cross-model transfer. Train kernels on model A, run on model B. This is the experiment that decides whether CARG is infrastructure or novel capability.
- HumanEval+ / SWE-bench Lite. Move from synthetic suites to recognised public benchmarks.
- Production telemetry hooks. Drop-in middleware for OpenAI / Anthropic clients with a dashboard that reports kernel hit-rate and tokens saved against a customer's traffic.
@software{vladov_carg_2026,
author = {Danil Vladov},
title = {CARG: Capability-Augmented Generation through Reusable Graph},
institution = {NARE Labs},
year = {2026},
publisher = {GitHub},
howpublished = {\url{https://github.com/narelabs/carg}},
}
Apache-2.0. See LICENSE.
Built with ❤️ by NARE Labs · github.com/narelabs