CARG

License: Apache 2.0 Python 3.10+ NARE Labs

Capability-Augmented Generation through Reusable Graph. A research framework for accumulating, validating, and composing executable skills across LLM agents.

LLMs forget every solution the moment a session ends. CARG is a thin layer that turns each verified solution into a reusable, executable skill, organizes those skills into a dependency graph, and gives later tasks a runtime where the LLM can call the prior skills by name instead of redrafting them.

The result, on a 30-trial coding benchmark with claude-sonnet-4.5:

| Metric | Cold | Few-shot | CARG |
| --- | --- | --- | --- |
| Success rate | 86.7% | 80.0% | 80.0% |
| Billed tokens | 198 576 | 222 342 | 145 578 (-26.7% vs cold) |
| Paid tokens (after cache) | 198 576 | 222 342 | 0 |
| Recall hit rate | 0% | 0% | 26.7% |

The full numbers and how to reproduce them are in Benchmarks.
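
The -26.7% figure can be checked directly from the billed-token row. The snippet below is a minimal arithmetic check, assuming savings are measured relative to the cold leg; `savings_pct` is an illustrative helper, not a function from the carg package.

```python
def savings_pct(baseline_tokens: int, candidate_tokens: int) -> float:
    """Percentage of tokens saved relative to a baseline leg (negative = more expensive)."""
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


# Billed tokens as reported in the results table.
cold, fewshot, carg = 198_576, 222_342, 145_578

carg_vs_cold = round(savings_pct(cold, carg), 1)
fewshot_vs_cold = round(savings_pct(cold, fewshot), 1)

print(carg_vs_cold)      # 26.7
print(fewshot_vs_cold)   # -12.0
```

By the same measure, the few-shot leg bills about 12% more tokens than cold on this run.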


Why this exists

RAG retrieves text. Toolformer calls hand-written tools. Neither grows its tool belt across sessions, and neither maintains a lifecycle (probation, validation, deprecation) for what it has learned.

CARG argues that an agent's long-term value is in executable, validated, composable skills — not in chats, not in raw memories. That belief is testable, and this repository is the test bench.


Architecture

                    ┌─────────────────────────────────┐
                    │         Foundation LLM          │
                    │   (OpenAI / Anthropic / mock)   │
                    └────────────────┬────────────────┘
                                     │
┌────────────────────────────────────▼────────────────────────────────────┐
│                              CARG Runtime                               │
│                   recall · compose · resolve · cache                    │
└───────┬─────────────┬──────────────┬──────────────┬──────────────┬──────┘
        │             │              │              │              │
┌───────▼───────┐ ┌───▼────┐ ┌───────▼──────┐ ┌─────▼─────┐ ┌──────▼──────┐
│ Capability    │ │ SDK    │ │ Crystallizer │ │ Evolution │ │ Sandbox     │
│ Graph         │ │ recall │ │ miner        │ │ A/B +     │ │ simulator   │
│ (skill→skill) │ │ commit │ │              │ │ rollback  │ │ + telemetry │
└───────┬───────┘ └───┬────┘ └───────┬──────┘ └─────┬─────┘ └──────┬──────┘
        └─────────────┴──────────────┼──────────────┴──────────────┘
                           ┌─────────▼─────────┐
                           │  Kernel Registry  │
                           │     (SQLite)      │
                           └───────────────────┘

Eight concrete pieces, each in its own subpackage:

| Module | Responsibility |
| --- | --- |
| carg.core | Four kernel formats: Program / Policy / ReasoningGraph / LatentAdapter. All conform to K: (S, M) → (A, M'). |
| carg.memory | KernelRegistry (SQLite persistence), KernelMemory (recall/commit façade with probation→promoted→deprecated lifecycle), CapabilityGraph (AST-derived caller → callee graph). |
| carg.llm | Provider clients (mock / OpenAI / Anthropic), plus CachedLLMClient so identical prompts only cost once. |
| carg.runtime | Composer + conflict resolver (priority / agreement / traffic-light strategies). |
| carg.crystallizer | Mines kernels from session logs (n-gram patterns → program / policy / graph / adapter). |
| carg.evolution | Mutation operators (parametric / structural / crossover) + A/B harness + automatic rollback. |
| carg.sandbox | The synthetic SaaS app, agent personas, simulator, telemetry. |
| carg.benchmark | Task suites + four solvers + 2-way and 3-way harnesses. |
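
To make the K: (S, M) → (A, M') contract concrete, here is a minimal sketch of a callable that takes a state and a memory and returns an action plus an updated memory. The `Kernel` protocol, the type aliases, and `CounterKernel` are hypothetical names chosen for illustration; the actual base class in carg.core may look different.

```python
from dataclasses import dataclass
from typing import Any, Protocol

# Hypothetical type aliases for the S, M and A in K: (S, M) -> (A, M').
State = dict[str, Any]
Memory = dict[str, Any]
Action = Any


class Kernel(Protocol):
    """Anything callable as (state, memory) -> (action, new_memory)."""

    def __call__(self, state: State, memory: Memory) -> tuple[Action, Memory]: ...


@dataclass
class CounterKernel:
    """Toy kernel: echoes the input as the action and counts invocations in memory."""

    key: str = "calls"

    def __call__(self, state: State, memory: Memory) -> tuple[Action, Memory]:
        # Return a new memory dict rather than mutating the one passed in.
        new_memory = {**memory, self.key: memory.get(self.key, 0) + 1}
        return state.get("input"), new_memory


kernel: Kernel = CounterKernel()
action, memory = kernel({"input": "ping"}, {})
action, memory = kernel({"input": "pong"}, memory)
print(action, memory)   # pong {'calls': 2}
```

Returning M' instead of mutating M in place keeps each call easy to replay, which is one plausible reason to state the contract in this form.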

Install

pip install -e ".[dev,llm]"

This exposes the carg CLI.


Quickstart

1. The §10 demo (no API key needed)

carg demo --users 1500

This runs the bundled synthetic e-commerce app, finds defects, crystallizes kernels from the logs, applies them, re-measures, and runs one generation of the evolution engine. Everything is offline.

2. Anti-amnesia benchmark (real LLM)

export ANTHROPIC_BASE_URL=...        # or OPENAI_API_KEY
export ANTHROPIC_AUTH_TOKEN=...
export ANTHROPIC_MODEL=claude-sonnet-4.5

carg benchmark --suite easy  --solver llm --trials 60     # easy suite
carg benchmark --suite evil  --solver llm --trials 30     # adversarial spec inversions
carg compose-bench  --provider anthropic                   # capability-graph showcase
carg bench3way      --provider anthropic --suite evil     # cold vs fewshot vs CARG

If no key is set, every command falls back to the deterministic mock client so the pipeline still runs end to end.

3. Use as a library

from carg.benchmark import KernelMemory, LLMSolver
from carg.benchmark.suites import default_task_stream

mem = KernelMemory(":memory:")
solver = LLMSolver()                       # picks provider from env
for task in default_task_stream(seed=7, length=30):
    result = solver.solve(task, memory=mem)
    print(task.task_id, result.success, result.recall_hit, result.cost_tokens)

The same KernelMemory interface works inside any agent: call recall(skill, description) before asking the model, and commit(skill, description, solution, success, cost_tokens) afterwards.
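
The recall-then-commit pattern can be sketched without the library. `DictMemory` below is a stand-in that only mirrors the argument order of recall and commit described above; its return values, the absence of a probation gate, and the `solve_with_memory` and `fake_model` helpers are all assumptions made for illustration.

```python
from typing import Callable, Optional


class DictMemory:
    """Stand-in store with the same call shape as the recall/commit pair (hypothetical)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def recall(self, skill: str, description: str) -> Optional[str]:
        return self._store.get(skill)

    def commit(self, skill: str, description: str, solution: str,
               success: bool, cost_tokens: int) -> None:
        # Only verified solutions are kept; this toy version has no probation step.
        if success:
            self._store[skill] = solution


def solve_with_memory(memory: DictMemory, skill: str, description: str,
                      ask_model: Callable[[str], tuple[str, int]]) -> tuple[str, int, bool]:
    """Return (solution, tokens_spent, recall_hit)."""
    cached = memory.recall(skill, description)
    if cached is not None:
        return cached, 0, True                  # reuse: no model call, no tokens
    solution, tokens = ask_model(description)
    # A real agent would run the task's tests here; success is assumed for brevity.
    memory.commit(skill, description, solution, success=True, cost_tokens=tokens)
    return solution, tokens, False


def fake_model(prompt: str) -> tuple[str, int]:
    """Pretend LLM call returning (solution_source, tokens_billed)."""
    return "def gcd(a, b):\n    return a if b == 0 else gcd(b, a % b)\n", 120


mem = DictMemory()
first = solve_with_memory(mem, "gcd", "greatest common divisor", fake_model)
second = solve_with_memory(mem, "gcd", "greatest common divisor", fake_model)
print(first[1:], second[1:])   # (120, False) (0, True)
```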


How it actually works

Lifecycle of a skill kernel

        first successful solve
   ────────────────────────────────►  PROBATION
                                          │
                       same body verifies on a 2nd task
                                          ▼
                                       PROMOTED  ◄── recall starts handing it out
                                          │
                       3 recall-misses on real tasks
                                          ▼
                                      DEPRECATED

The probation gate exists because trusting the very first success poisons the registry on adversarial inputs (we measured this; see bench_brutal_30.log in the archive/ branch). A kernel must validate on at least two distinct tasks before any task is allowed to recall it.
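
The lifecycle above maps naturally onto a small state machine. The sketch below is one way to express it; the `Status` and `KernelRecord` names are hypothetical, and counting recall-misses cumulatively (rather than consecutively) is an assumption, since the text does not say which is used.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROBATION = "probation"
    PROMOTED = "promoted"
    DEPRECATED = "deprecated"


@dataclass
class KernelRecord:
    """Toy registry row tracking one skill body through its lifecycle."""

    body: str
    status: Status = Status.PROBATION
    verified_tasks: set[str] = field(default_factory=set)
    recall_misses: int = 0

    def record_success(self, task_id: str) -> None:
        # Promotion requires the same body to verify on two *distinct* tasks.
        self.verified_tasks.add(task_id)
        if self.status is Status.PROBATION and len(self.verified_tasks) >= 2:
            self.status = Status.PROMOTED

    def record_recall_miss(self) -> None:
        # Only promoted kernels are handed out, so only they can miss.
        if self.status is not Status.PROMOTED:
            return
        self.recall_misses += 1
        if self.recall_misses >= 3:
            self.status = Status.DEPRECATED

    @property
    def recallable(self) -> bool:
        return self.status is Status.PROMOTED


rec = KernelRecord(body="def gcd(a, b): ...")
rec.record_success("task-1")
after_first = rec.status
rec.record_success("task-1")        # same task again: still one distinct task
after_repeat = rec.status
rec.record_success("task-2")
after_second = rec.status
for _ in range(3):
    rec.record_recall_miss()

print(after_first.name, after_repeat.name, after_second.name, rec.status.name)
# PROBATION PROBATION PROMOTED DEPRECATED
```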

Capability graph

When a skill body is committed, its AST is parsed for free-name function calls. If a called name matches another stored kernel's function name, an edge caller → callee is added. The graph is rebuilt on demand and used by the compose solver to inject every promoted skill as an executable helper function in the next prompt.

sum_digital_roots ─► digital_root ─► digit_sum
lcm_list          ─► lcm           ─► gcd
count_primes_in_range ─► is_prime
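
Edges like these can be derived with the standard-library `ast` module. The following is a minimal sketch of the idea, not the code in capability_graph.py; `called_names` and `build_graph` are illustrative names, and "free-name call" is read here as a call whose target is a bare name (so `gcd(a, b)` counts but `math.gcd(a, b)` does not).

```python
import ast


def called_names(source: str) -> set[str]:
    """Names invoked as plain calls, e.g. `gcd(a, b)` but not `math.gcd(a, b)`."""
    tree = ast.parse(source)
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }


def build_graph(skills: dict[str, str]) -> dict[str, set[str]]:
    """Map each skill name to the other stored skills its body calls."""
    known = set(skills)
    return {
        name: (called_names(body) & known) - {name}   # drop self-recursion
        for name, body in skills.items()
    }


skills = {
    "gcd": "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n",
    "lcm": "def lcm(a, b):\n    return a * b // gcd(a, b)\n",
    "lcm_list": (
        "def lcm_list(xs):\n"
        "    out = 1\n"
        "    for x in xs:\n"
        "        out = lcm(out, x)\n"
        "    return out\n"
    ),
}

graph = build_graph(skills)
print(graph)   # {'gcd': set(), 'lcm': {'gcd'}, 'lcm_list': {'lcm'}}
```

Calls to builtins or unknown names are ignored because only names that match a stored skill survive the intersection.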

Few-shot can prepend examples. Few-shot cannot make a previously written gcd actually runnable in the next task's evaluation context. CARG can.
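
What "actually runnable" means can be shown in a few lines: stored skill bodies are executed into a shared namespace, and a new solution is then evaluated in that same namespace so it can call them by name. This sketch uses bare `exec` on trusted strings for brevity; the project structure describes Program kernels as RestrictedPython-sandboxed, and `build_namespace` is a hypothetical helper.

```python
def build_namespace(promoted_bodies: list[str]) -> dict:
    """Execute each stored skill body into one shared namespace."""
    namespace: dict = {}
    for body in promoted_bodies:
        exec(body, namespace)   # trusted input only in this sketch; no sandboxing here
    return namespace


promoted = [
    "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n",
    "def lcm(a, b):\n    return a * b // gcd(a, b)\n",
]

# A new solution that calls stored helpers by name instead of redefining them.
candidate = (
    "def lcm_list(xs):\n"
    "    out = 1\n"
    "    for x in xs:\n"
    "        out = lcm(out, x)\n"
    "    return out\n"
)

ns = build_namespace(promoted)
exec(candidate, ns)
result = ns["lcm_list"]([4, 6, 10])
print(result)   # 60
```

The candidate never defines `lcm` or `gcd`; it works only because the earlier skills are live in the evaluation context.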


Benchmarks

Three task suites, all in carg.benchmark.suites:

| Suite | Tasks | Skill repeats | What it measures |
| --- | --- | --- | --- |
| easy | 30 | 3 per skill | Baseline accumulation curve. |
| evil | 30 | 3 per skill | Adversarial spec inversions (RPN with floor div, glob with 1+ semantics, prev_permutation, etc). LLMs override the spec and revert to memorized conventions. |
| compose | 18 | 3 layers (atoms → composites → compounds) | Capability-graph showcase. Layer-N tasks should call layer-(N-1) helpers instead of re-implementing them. |

Each suite has a deterministic seed and a single canonical solution per skill. Verification is by unittest-style test cases inside the task definition.
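
As a rough picture of what a task with embedded test cases might look like, the sketch below pairs a description with (arguments, expected) cases and checks a candidate solution against them. The `Task` fields and the `verify` function are assumptions for illustration; the real task definitions in carg.benchmark.suites may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task shape: the skill name doubles as the required function name."""

    task_id: str
    skill: str
    description: str
    cases: tuple[tuple[tuple, object], ...]   # ((args...), expected)


def verify(task: Task, solution_source: str) -> bool:
    """Run the candidate against every embedded case; any error counts as failure."""
    namespace: dict = {}
    try:
        exec(solution_source, namespace)      # trusted input only in this sketch
        fn = namespace[task.skill]
        return all(fn(*args) == expected for args, expected in task.cases)
    except Exception:
        return False


task = Task(
    task_id="t1",
    skill="digit_sum",
    description="Sum the decimal digits of a non-negative integer n.",
    cases=(((123,), 6), ((0,), 0), ((9999,), 36)),
)

good = "def digit_sum(n):\n    return sum(int(c) for c in str(n))\n"
bad = "def digit_sum(n):\n    return n % 10\n"

print(verify(task, good), verify(task, bad))   # True False
```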

Reproducing the headline number

carg bench3way --suite evil --provider anthropic --trials 30 --seed 7 --json > result.json

Three legs run against the same task stream through a shared SQLite response cache, so identical prompts cost once. The cost_savings_pct and delta_success fields in the JSON are the bottom-line comparison.
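
A response cache of this kind can be sketched with `sqlite3` and a hash of the prompt as the key. This is an illustration of the idea under stated assumptions, not the CachedLLMClient implementation: the class name, the table layout, and keying on model plus prompt are all hypothetical choices.

```python
import hashlib
import sqlite3
from typing import Callable


class CachedClient:
    """Wraps a completion function so identical (model, prompt) pairs are only paid for once."""

    def __init__(self, complete: Callable[[str], str], path: str = ":memory:") -> None:
        self._complete = complete
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )
        self.misses = 0   # number of calls that reached the underlying model

    def complete(self, model: str, prompt: str) -> str:
        key = hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()
        row = self._db.execute(
            "SELECT response FROM responses WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]                     # cache hit: no model call
        self.misses += 1
        response = self._complete(prompt)
        self._db.execute(
            "INSERT INTO responses (key, response) VALUES (?, ?)", (key, response)
        )
        self._db.commit()
        return response


# The lambda stands in for a real provider call.
client = CachedClient(lambda prompt: prompt.upper())
answers = [client.complete("mock", p) for p in ("solve gcd", "solve gcd", "solve lcm")]
print(answers, client.misses)   # ['SOLVE GCD', 'SOLVE GCD', 'SOLVE LCM'] 2
```

Pointing `path` at a file instead of `:memory:` would let several benchmark legs share the same cache across processes.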

Honest disclaimers

The benchmark is not a claim that CARG accumulates capability the LLM lacks. On claude-sonnet-4.5 we observe:

  • Cost drops by 27–67% across runs of repeated-skill streams.
  • Success stays flat or moves within ±1 task on small-N suites. The model already solves these problems, so the win is in dollars, not in a higher success floor.
  • On rpn_floor (RPN with // semantics) the model refuses to follow the spec and writes truncate-toward-zero instead. Memory cannot help if the model never produces a correct first solution. We treat this as a feature: the benchmark surfaces alignment ceilings.
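
The rpn_floor failure comes down to how integer division treats negative operands. The short comparison below shows where floor division (Python's `//`) and truncation toward zero disagree; the helper names are illustrative and not taken from the suite.

```python
import math


def floor_div(a: int, b: int) -> int:
    """Round the quotient toward negative infinity (Python's // operator)."""
    return a // b


def trunc_div(a: int, b: int) -> int:
    """Round the quotient toward zero (the convention in C-family languages)."""
    return math.trunc(a / b)


pairs = [(7, 2), (-7, 2), (7, -2), (-7, -2)]
rows = [(a, b, floor_div(a, b), trunc_div(a, b)) for a, b in pairs]
for row in rows:
    print(row)
# (7, 2, 3, 3)
# (-7, 2, -4, -3)
# (7, -2, -4, -3)
# (-7, -2, 3, 3)
```

The two conventions agree whenever the operands share a sign, so a truncating solution passes every test case that avoids mixed-sign division and fails only on the ones that include it.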

The path to a real success-rate win is (1) embedding-based recall instead of skill-tag recall, (2) weaker models where cold success is < 100% (e.g. claude-haiku, gpt-4o-mini), and (3) compositional tasks that exceed any single-shot context.


Project structure

src/carg/
├── core/                    # K: (S, M) -> (A, M')
│   ├── base.py
│   ├── program.py           # RestrictedPython-sandboxed code
│   ├── policy.py            # softmax over discrete actions
│   ├── graph.py             # YAML-described action graphs
│   └── adapter.py           # prefix + logit-bias prompt-tune
├── memory/                  # persistent skill store
│   ├── registry.py          # SQLite, lifecycle, lineage
│   ├── sdk.py               # KernelMemory recall/commit
│   └── capability_graph.py  # AST-derived skill graph
├── llm/                     # provider clients + cache
│   ├── base.py
│   ├── mock.py
│   ├── openai_client.py
│   ├── anthropic_client.py
│   └── cache.py             # CachedLLMClient (SQLite)
├── runtime/                 # compose, conflict resolution
├── crystallizer/            # mine kernels from session logs
├── evolution/               # mutators, A/B engine, rollback
├── sandbox/                 # §10 synthetic app + simulator
├── benchmark/
│   ├── suites/              # easy / evil / compose
│   ├── solvers/             # simulated / llm / fewshot / compose
│   ├── harness.py           # 2-way A/B
│   └── bench3way.py         # 3-way A/B/C
└── cli.py

Roadmap

  • [x] Four kernel formats with executable verification.
  • [x] SQLite registry, probation lifecycle, deprecation.
  • [x] AST-derived capability graph.
  • [x] 3-way benchmark vs few-shot baseline.
  • [x] CLI: simulate, benchmark, bench3way, compose-bench, evolve, crystallize, demo.
  • [ ] Embedding-based recall. The current skill-tag matching does not scale to real workloads.
  • [ ] Weaker-model studies. Run on claude-haiku and gpt-4o-mini where cold success is well below 100%.
  • [ ] Cross-model transfer. Train kernels on model A, run on model B. This is the experiment that decides whether CARG is infrastructure or novel capability.
  • [ ] HumanEval+ / SWE-bench Lite. Move from synthetic suites to recognised public benchmarks.
  • [ ] Production telemetry hooks. Drop-in middleware for OpenAI / Anthropic clients with a dashboard that reports kernel hit-rate and tokens saved against a customer's traffic.

Cite

@software{vladov_carg_2026,
  author       = {Danil Vladov},
  title        = {CARG: Capability-Augmented Generation through Reusable Graph},
  institution  = {NARE Labs},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/narelabs/carg}},
}

License

Apache-2.0. See LICENSE.


Built with ❤️ by NARE Labs · github.com/narelabs
