CARG

Capability-Augmented Generation through Reusable Graph. A research framework for accumulating, validating, and composing executable skills across LLM agents.

LLMs forget every solution the moment a session ends. CARG is a thin layer that turns each verified solution into a reusable, executable skill, organizes those skills into a dependency graph, and gives later tasks a runtime where the LLM can call the prior skills by name instead of redrafting them.

The result, on a 30-trial coding benchmark with claude-sonnet-4.5:

Metric	Cold	Few-shot	CARG
Success rate	86.7%	80.0%	80.0%
Billed tokens	198 576	222 342	145 578 (-26.7% vs cold)
Paid tokens (after cache)	198 576	222 342	0
Recall hit rate	0%	0%	26.7%

The full numbers and how to reproduce them are in Benchmarks.
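As a quick sanity check, the percentage in the table follows directly from the billed-token counts. The snippet below only redoes that arithmetic from the figures printed above; the variable names are illustrative and are not fields of the JSON report.

```python
# Billed tokens copied from the table above (30-trial run).
cold_tokens = 198_576
carg_tokens = 145_578

# Relative saving of CARG against the cold baseline, in percent.
savings_pct = (cold_tokens - carg_tokens) / cold_tokens * 100
print(f"{savings_pct:.1f}%")  # 26.7%

# With 30 trials, one task is worth 1/30 (about 3.3 points) of success rate,
# so 86.7% and 80.0% correspond to whole task counts.
trials = 30
cold_solved = round(0.867 * trials)
carg_solved = round(0.800 * trials)
print(cold_solved, carg_solved)  # 26 24
```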

Why this exists

RAG retrieves text. Toolformer calls hand-written tools. Neither grows its tool belt across sessions, and neither maintains a lifecycle (probation, validation, deprecation) for what it has learned.

CARG argues that an agent's long-term value is in executable, validated, composable skills — not in chats, not in raw memories. That belief is testable, and this repository is the test bench.

Architecture

                                    ┌─────────────────────────────────┐
                                    │       Foundation LLM            │
                                    │   (OpenAI / Anthropic / mock)   │
                                    └──────────────┬──────────────────┘
                                                   │
                              ┌────────────────────▼──────────────────┐
                              │         CARG Runtime                   │
                              │   recall · compose · resolve · cache   │
                              └─┬──────┬──────┬──────────┬──────┬─────┘
                                │      │      │          │      │
                  ┌─────────────▼──┐ ┌─▼──┐ ┌─▼────────┐ ┌▼────┐ ┌▼─────────┐
                  │ Capability     │ │SDK │ │Crystallzr│ │Evol │ │ Sandbox  │
                  │   Graph        │ │recall│  miner   │ │ AB   │ │simulator │
                  │ (skill→skill)  │ │commit│         │ │  +    │ │+ telemetry│
                  └─────────┬──────┘ └─┬──┘ └────┬────┘ │roll-  │ └─────┬────┘
                            │          │         │      │ back  │       │
                            └──────────▼─────────▼──────▼───────▼───────┘
                                       │
                              ┌────────▼─────────┐
                              │ Kernel Registry  │
                              │   (SQLite)        │
                              └──────────────────┘

Eight concrete pieces, each in its own subpackage:

Module	Responsibility
`carg.core`	Four kernel formats: `Program` / `Policy` / `ReasoningGraph` / `LatentAdapter`. All conform to `K: (S, M) → (A, M')`.
`carg.memory`	`KernelRegistry` (SQLite persistence), `KernelMemory` (recall/commit façade with probation→promoted→deprecated lifecycle), `CapabilityGraph` (AST-derived `caller → callee` graph).
`carg.llm`	Provider clients (mock / OpenAI / Anthropic), plus `CachedLLMClient` so identical prompts only cost once.
`carg.runtime`	Composer + conflict resolver (priority / agreement / traffic-light strategies).
`carg.crystallizer`	Mines kernels from session logs (n-gram patterns → program / policy / graph / adapter).
`carg.evolution`	Mutation operators (parametric / structural / crossover) + A/B harness + automatic rollback.
`carg.sandbox`	The synthetic SaaS app, agent personas, simulator, telemetry.
`carg.benchmark`	Task suites + four solvers + 2-way and 3-way harnesses.

Install

pip install -e ".[dev,llm]"

This exposes the carg CLI.

Quickstart

1. The §10 demo (no API key needed)

carg demo --users 1500

This runs the bundled synthetic e-commerce app, finds defects, crystallizes kernels from the logs, applies them, re-measures, and runs one generation of the evolution engine. Everything is offline.

2. Anti-amnesia benchmark (real LLM)

export ANTHROPIC_BASE_URL=...        # or OPENAI_API_KEY
export ANTHROPIC_AUTH_TOKEN=...
export ANTHROPIC_MODEL=claude-sonnet-4.5

carg benchmark --suite easy  --solver llm --trials 60     # easy suite
carg benchmark --suite evil  --solver llm --trials 30     # adversarial spec inversions
carg compose-bench  --provider anthropic                   # capability-graph showcase
carg bench3way      --provider anthropic --suite evil     # cold vs fewshot vs CARG

If no key is set, every command falls back to the deterministic mock client so the pipeline still runs end to end.

3. Use as a library

from carg.benchmark import KernelMemory, LLMSolver
from carg.benchmark.suites import default_task_stream

mem = KernelMemory(":memory:")
solver = LLMSolver()                       # picks provider from env
for task in default_task_stream(seed=7, length=30):
    result = solver.solve(task, memory=mem)
    print(task.task_id, result.success, result.recall_hit, result.cost_tokens)

The same KernelMemory interface works inside any agent: call recall(skill, description) before asking the model, and call commit(skill, description, solution, success, cost_tokens) afterwards.
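The recall-before, commit-after pattern can be sketched without the library installed. In the sketch below, `ToyMemory` is a hypothetical stand-in that only mirrors the two method signatures quoted above; it has no SQLite backing and no probation lifecycle. `solve_with_memory`, `fake_model`, and `looks_like_gcd` are likewise illustrative names, not part of CARG.

```python
from typing import Callable, Optional


class ToyMemory:
    """Hypothetical stand-in for KernelMemory: a dict, no lifecycle, no SQLite."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def recall(self, skill: str, description: str) -> Optional[str]:
        return self._store.get(skill)

    def commit(self, skill: str, description: str, solution: str,
               success: bool, cost_tokens: int) -> None:
        # Only verified solutions are kept.
        if success:
            self._store[skill] = solution


def solve_with_memory(
    memory: ToyMemory,
    skill: str,
    description: str,
    ask_model: Callable[[str], tuple[str, int]],
    verify: Callable[[str], bool],
) -> tuple[str, bool, int]:
    """Return (solution, recall_hit, cost_tokens)."""
    cached = memory.recall(skill, description)
    if cached is not None:
        return cached, True, 0  # no model call, no tokens spent
    solution, cost = ask_model(description)
    memory.commit(skill, description, solution,
                  success=verify(solution), cost_tokens=cost)
    return solution, False, cost


def fake_model(description: str) -> tuple[str, int]:
    """Pretend LLM: always returns the same gcd body and a fixed token cost."""
    body = "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n"
    return body, 120


def looks_like_gcd(source: str) -> bool:
    return "def gcd" in source


mem = ToyMemory()
first = solve_with_memory(mem, "gcd", "greatest common divisor", fake_model, looks_like_gcd)
second = solve_with_memory(mem, "gcd", "gcd of two ints", fake_model, looks_like_gcd)
print(first[1:], second[1:])  # (False, 120) (True, 0)
```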

How it actually works

Lifecycle of a skill kernel

        first successful solve
   ────────────────────────────────►  PROBATION
                                          │
                       same body verifies on a 2nd task
                                          ▼
                                       PROMOTED  ◄── recall starts handing it out
                                          │
                       3 recall-misses on real tasks
                                          ▼
                                      DEPRECATED

The probation gate exists because trusting the very first success poisons the registry on adversarial inputs (we measured this; see bench_brutal_30.log in the archive/ branch). A kernel must validate on at least two distinct tasks before any task is allowed to recall it.
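The lifecycle above can be read as a small state machine. The following is a minimal sketch of that reading, not the registry's actual code: `Status` and `KernelRecord` are hypothetical names, and it assumes that recall misses are counted cumulatively (the diagram does not say whether the three misses must be consecutive).

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROBATION = "probation"
    PROMOTED = "promoted"
    DEPRECATED = "deprecated"


@dataclass
class KernelRecord:
    body: str
    status: Status = Status.PROBATION
    verified_tasks: set[str] = field(default_factory=set)
    recall_misses: int = 0

    def record_verification(self, task_id: str) -> None:
        """The body passed verification on this task."""
        self.verified_tasks.add(task_id)  # a set, so repeats of one task count once
        if self.status is Status.PROBATION and len(self.verified_tasks) >= 2:
            self.status = Status.PROMOTED

    def record_recall_miss(self) -> None:
        """The kernel was recalled but failed on a real task."""
        self.recall_misses += 1
        # Assumption: misses accumulate and are never reset.
        if self.status is Status.PROMOTED and self.recall_misses >= 3:
            self.status = Status.DEPRECATED

    @property
    def recallable(self) -> bool:
        return self.status is Status.PROMOTED


k = KernelRecord(body="def gcd(a, b): ...")
k.record_verification("task-1")
k.record_verification("task-1")      # same task again: still on probation
on_probation = k.recallable          # False
k.record_verification("task-2")      # second distinct task: promoted
promoted = k.recallable              # True
for _ in range(3):
    k.record_recall_miss()
print(on_probation, promoted, k.status)  # False True Status.DEPRECATED
```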

Capability graph

When a skill body is committed, its AST is parsed for free-name function calls. If a called name matches another stored kernel's function name, an edge caller → callee is added. The graph is rebuilt on demand, and the compose solver uses it to inject every promoted skill as an executable helper function in the next prompt.

sum_digital_roots ─► digital_root ─► digit_sum
lcm_list          ─► lcm           ─► gcd
count_primes_in_range ─► is_prime

Few-shot can prepend examples. Few-shot cannot make a previously written gcd actually runnable in the next task's evaluation context. CARG can.
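The edge-extraction step can be sketched with the standard-library `ast` module. This is an illustration of the idea, not the code in `capability_graph.py`: `called_names` and `build_graph` are hypothetical helpers, self-edges from recursion are dropped by assumption, and the final `exec` uses a plain namespace purely for demonstration, whereas the project structure indicates that real program kernels run under a RestrictedPython sandbox.

```python
import ast


def called_names(source: str) -> set[str]:
    """Names invoked as plain calls, e.g. f(x) but not obj.f(x)."""
    tree = ast.parse(source)
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }


def build_graph(kernels: dict[str, str]) -> dict[str, set[str]]:
    """Map each kernel name to the other stored kernels it calls."""
    known = set(kernels)
    return {
        name: (called_names(body) & known) - {name}  # drop builtins and self-calls
        for name, body in kernels.items()
    }


kernels = {
    "gcd": "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n",
    "lcm": "def lcm(a, b):\n    return abs(a * b) // gcd(a, b)\n",
    "lcm_list": (
        "def lcm_list(xs):\n"
        "    out = 1\n"
        "    for x in xs:\n"
        "        out = lcm(out, x)\n"
        "    return out\n"
    ),
}

graph = build_graph(kernels)
print(graph)  # {'gcd': set(), 'lcm': {'gcd'}, 'lcm_list': {'lcm'}}

# Loading every body into one namespace makes the earlier skills callable
# by name from the later ones (unsandboxed here, for illustration only).
namespace: dict = {}
for body in kernels.values():
    exec(body, namespace)
result = namespace["lcm_list"]([4, 6, 10])
print(result)  # 60
```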

Benchmarks

Three task suites, all in carg.benchmark.suites:

Suite	Tasks	Skill repeats	What it measures
`easy`	30	3 per skill	Baseline accumulation curve.
`evil`	30	3 per skill	Adversarial spec inversions (RPN with floor division, glob with 1+ semantics, prev_permutation, etc.). LLMs tend to override the spec and revert to memorized conventions.
`compose`	18	3 layers (atoms → composites → compounds)	Capability-graph showcase. Layer-N tasks should call layer-(N-1) helpers instead of re-implementing them.

Each suite has a deterministic seed and a single canonical solution per skill. Verification is by unittest-style test cases inside the task definition.
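A minimal sketch of what test-case-based verification could look like follows. The real task definitions in `carg.benchmark.suites` are not shown in this README, so `ToyTask` and `verify` are hypothetical, as is the convention that the skill name doubles as the function name. The example task mirrors the floor-division versus truncation distinction that the evil suite targets.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToyTask:
    task_id: str
    skill: str  # assumed to also be the function name the solution must define
    cases: tuple[tuple[tuple[Any, ...], Any], ...]  # ((args...), expected)


def verify(task: ToyTask, solution_source: str) -> bool:
    """Run the candidate against every case; any error counts as failure."""
    namespace: dict[str, Any] = {}
    try:
        exec(solution_source, namespace)  # unsandboxed, for illustration only
        fn = namespace[task.skill]
        return all(fn(*args) == expected for args, expected in task.cases)
    except Exception:
        return False


task = ToyTask(
    task_id="t1",
    skill="floor_div",
    cases=(((7, 2), 3), ((-7, 2), -4)),  # floor semantics: -7 // 2 == -4
)

follows_spec = "def floor_div(a, b):\n    return a // b\n"
truncates = "def floor_div(a, b):\n    return int(a / b)\n"  # gives -3 for (-7, 2)

print(verify(task, follows_spec), verify(task, truncates))  # True False
```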

Reproducing the headline number

carg bench3way --suite evil --provider anthropic --trials 30 --seed 7 --json > result.json

Three legs run against the same task stream through a shared SQLite response cache, so identical prompts cost once. The cost_savings_pct and delta_success fields in the JSON are the bottom-line comparison.
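The "identical prompts cost once" behaviour can be illustrated with a small SQLite-backed cache keyed by a hash of the prompt. This is a sketch of the general technique, not the actual `CachedLLMClient`: `ToyCachedClient`, its table layout, and the `misses` counter are all assumptions made for the example.

```python
import hashlib
import sqlite3
from typing import Callable


class ToyCachedClient:
    """Hypothetical prompt cache: one row per distinct prompt."""

    def __init__(self, complete: Callable[[str], str], path: str = ":memory:") -> None:
        self._complete = complete
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS responses ("
            "key TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )
        self.misses = 0  # number of calls that reached the underlying model

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        row = self._db.execute(
            "SELECT body FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: nothing is billed
        self.misses += 1
        body = self._complete(prompt)
        self._db.execute(
            "INSERT INTO responses (key, body) VALUES (?, ?)", (key, body)
        )
        self._db.commit()
        return body


# A stand-in "model" that just upper-cases the prompt.
client = ToyCachedClient(lambda prompt: prompt.upper())
a = client.complete("write gcd")
b = client.complete("write gcd")   # served from SQLite
c = client.complete("write lcm")
print(client.misses)  # 2
```

Sharing one such cache across the three legs means a prompt that appears verbatim in more than one leg is only sent to the provider the first time.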

Honest disclaimers

The benchmark is not a claim that CARG accumulates capability the LLM lacks. On claude-sonnet-4.5 we observe:

Cost drops by 27–67% across runs of repeated-skill streams.
Success stays flat or moves within ±1 task on small-N suites. The model already solves these problems, so the win is in dollars, not in the success floor.
On rpn_floor (RPN with // semantics) the model refuses to follow the spec and writes truncate-toward-zero division instead. Memory cannot help if the model never produces a correct first solution. We treat this as a feature: the benchmark surfaces alignment ceilings.

The path to a real success-rate win is (1) embedding-based recall instead of skill-tag recall, (2) weaker models where cold success is < 100% (e.g. claude-haiku, gpt-4o-mini), and (3) compositional tasks that exceed any single-shot context.

Project structure

src/carg/
├── core/                    # K: (S, M) -> (A, M')
│   ├── base.py
│   ├── program.py           # RestrictedPython-sandboxed code
│   ├── policy.py            # softmax over discrete actions
│   ├── graph.py             # YAML-described action graphs
│   └── adapter.py           # prefix + logit-bias prompt-tune
├── memory/                  # persistent skill store
│   ├── registry.py          # SQLite, lifecycle, lineage
│   ├── sdk.py               # KernelMemory recall/commit
│   └── capability_graph.py  # AST-derived skill graph
├── llm/                     # provider clients + cache
│   ├── base.py
│   ├── mock.py
│   ├── openai_client.py
│   ├── anthropic_client.py
│   └── cache.py             # CachedLLMClient (SQLite)
├── runtime/                 # compose, conflict resolution
├── crystallizer/            # mine kernels from session logs
├── evolution/               # mutators, A/B engine, rollback
├── sandbox/                 # §10 synthetic app + simulator
├── benchmark/
│   ├── suites/              # easy / evil / compose
│   ├── solvers/             # simulated / llm / fewshot / compose
│   ├── harness.py           # 2-way A/B
│   └── bench3way.py         # 3-way A/B/C
└── cli.py

Roadmap

Cite

@software{vladov_carg_2026,
  author       = {Danil Vladov},
  title        = {CARG: Capability-Augmented Generation through Reusable Graph},
  institution  = {NARE Labs},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/narelabs/carg}},
}

License

Apache-2.0. See LICENSE.

Built with ❤️ by NARE Labs · github.com/narelabs
