A research tool that makes claims, evidence, and reasoning traceable — for humans and AI agents.
Instead of scattering research across notes, bookmarks, and chat threads, Sparkle stores it as a typed graph: claims link to evidence, objections, questions, and syntheses. Every conclusion traces back to the path that produced it. Nothing gets lost.
See it in action: Does music help you code?
Most research tools are good at capture and bad at accountability. Ask yourself:
- What supports this claim? What weakens it?
- What alternative paths did I explore?
- Why did I abandon that line of thinking?
- How exactly did I arrive at this conclusion?
If you can't answer those from your notes, you have a capture tool, not a research tool. Sparkle is the graph layer that makes these questions answerable — by you, by a collaborator reviewing your work, or by an AI agent conducting research on your behalf.
Three ways to operate the graph work today, and the main one is an AI agent. In Claude Code or Claude Desktop, an agent drives the graph interactively over the MCP server, with the host's own model supplying all the thinking — no server-side model, no API key (see "Intelligent Operation" below). A human can also drive it directly from the CLI. And a fully hands-off loop drives it via sparkle run, where Claude proposes and judges while a different AI from another company attacks, debating a seed question with no human in the turn order (see "Autonomous Operation" below). That loop now runs live end-to-end (validated across three real debates on 2026-05-29) and reaches the models through the claude and codex command-line tools you already have logged in, so it too needs no API key. The evidence-gathering role searches the real web to verify its sources, and every other role is required to flag any figure it recalls from memory as unverified — so the debate no longer asserts made-up statistics as fact.
What's left:
| Phase | What | Why |
|---|---|---|
| Real citations | Structured sources, excerpts, DOI/URL, BibTeX interop | Flat strings aren't verifiable sources |
| Deeper introspection | gaps, tensions, stale, orphans as first-class commands | Tell an agent what to work on next without reading the whole graph |
| Machine-readable CLI | --format json on every command | Drive Sparkle from scripts without going through MCP |
See docs/roadmap.md for the full plan.
Requires Python 3.11+ and nothing else — the core is pure standard library. The main way to use Sparkle is to let an AI agent drive the graph over MCP; the CLI is there when you want to seed, inspect, or steer it by hand.
git clone https://github.com/albeorla/sparkle.git && cd sparkle
pip install -e '.[mcp]' # core + the MCP server
sparkle init # create a graph store
claude mcp add sparkle -- sparkle mcp # register the server with Claude Code
Now ask your agent to work the graph. "What's the weakest part of the debate right now?" pulls the live work frontier; the /challenge, /investigate, /synthesize, and /next slash commands run the adversarial moves. The host's own model does all the reasoning: no server-side model, no API key. Full details in Intelligent Operation.
pip install -e . # core only — drop the [mcp] extra
sparkle bootstrap # seed an example graph
sparkle home # dashboard with counts and next actions
# Explore any node by a short id prefix
sparkle tree <node_id_prefix>
sparkle show <node_id_prefix>
sparkle why <node_id_prefix>
Prefer not to install? Every command also runs from the source tree: swap sparkle <cmd> for PYTHONPATH=src python3 -m sparkle <cmd>.
Explore the pre-built demo graph without touching your own:
sparkle --store demo/.sparkle/graph.json home
sparkle --store demo/.sparkle/graph.json tree e036ff896cea
Sparkle ships an MCP server so an AI agent can operate the graph natively, with no shelling out and no ASCII parsing. Today this works interactively on Claude Code and Claude Desktop: you run a slash-command prompt, the host's own model reads a "where is the debate weak right now" feed, reasons about it, and calls mutation tools to write objections, evidence, and rulings back into the graph. The server supplies structure, targets, and guardrails; the host model supplies all the thinking. No server-side model and no API key.
pip install 'sparkle[mcp]' # adds the MCP SDK; the core stays dependency-free
claude mcp add sparkle -- sparkle mcp # register the server with a client
The sparkle mcp command runs the server over stdin/stdout. It is a lazy seam: every other command works without the extra installed, and a missing extra fails with a clear pip install 'sparkle[mcp]' message.
What the server exposes:
- A work frontier (sparkle://frontier, also a tool): a prioritized "what needs adversarial attention next" feed computed purely from the stored graph, covering unchallenged claims, open questions, stalled threads, thin-evidence claims, and claims ready to judge. It reports live, non-persisted aggregates (a supports-minus-contradicts tally plus node type/status), never the frozen stored confidence.
- Mutation tools that enforce debate rules the raw CLI doesn't. For example, the judge's "rule" move refuses to ratify a claim the adversary never attacked. Because the MCP server is an agent-driven front-end, its rule move carries the same integrity floors the autonomous harness uses, so an agent driving Sparkle over MCP is held to the same bar as the hands-off loop:
  - Author-distinct floor. A claim only counts as challenged if the objection comes from a different author than the claim, so a single agent cannot ratify its own claim off an objection it wrote against itself. (Honest boundary, not a bug: on the MCP path the objection's author is whatever role string the calling agent passes, because the MCP operator pattern is one agent playing every role. This floor therefore catches a same-author objection, but a single agent could still relabel itself to fake an adversary's name. Genuine adversary independence, meaning a truly different model attacking the claim, is the cross-family CLI leg, sparkle run, not the MCP path; see "Autonomous Operation" below.)
  - Verdict-rejection floor. The judge's "rule" move refuses to settle-and-ratify a claim its own verdict rejected: an agent calling the rule move with a non-affirming verdict (anything other than the harness's affirming upheld) and settle=True is refused before any write. The ruling can still be recorded; the claim just isn't ratified. This used to live only in the autonomous harness's dispatch, so via MCP an agent could rule "refuted" and still stamp the terminal ratified status on a claim its own ruling rejected; the floor now sits in the shared seam and the MCP rule move opts into it.
  - The trusted human sparkle rule CLI path stays permissive on both (a human is trusted to challenge their own claim and to use free-text verdict words like "accept" / "needs work").
- Complete run regions. When an agent adds a node with a fused link (an evidence "supports" edge) or harvests a synthesis (the "produced" edge), the MCP server now stamps the edge with the run id too, not just the node. So the full run — every node and every edge — shows up in the run review / diff / rollback surface, matching the autonomous harness. Previously those fused edges leaked out of the run region and were invisible to a rollback.
- Adversarial prompts as slash commands: /challenge (cast the model as critic), /investigate (evidence-gatherer), /synthesize (judge then synthesizer), /next (read the frontier and run the highest-leverage move).
The guardrails live in the shared operations layer (ops.py), not in the MCP wrapper, so the CLI, the MCP server, and the autonomous harness all inherit the same rules through the same error boundary. The hands-off, no-human-in-the-loop version is described next.
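The two floors described above can be sketched as a single pre-write check. This is a minimal illustration, not the actual ops.py code: `check_rule` and `Objection` are hypothetical names, and only the `require_distinct_adversary` flag, the `affirming_verdicts` set, and the ValueError boundary come from this README. It also assumes the floors gate only a settle request, since a ruling without settle can still be recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objection:
    author: str  # role string supplied by the caller (hypothetical shape)


def check_rule(claim_author, objections, verdict, settle,
               require_distinct_adversary=False, affirming_verdicts=None):
    """Return True if the ruling may ratify; raise ValueError on a refused move."""
    # Author-distinct floor: when enabled, same-author objections do not count.
    challengers = [o for o in objections
                   if not require_distinct_adversary or o.author != claim_author]
    if settle and not challengers:
        raise ValueError("claim was never challenged by a distinct author")
    # Verdict-rejection floor: when enabled, only an affirming verdict may settle.
    if settle and affirming_verdicts is not None and verdict not in affirming_verdicts:
        raise ValueError("cannot settle a claim the verdict rejected")
    return settle
```

With both options left off (the permissive human path), a self-written objection and a free-text verdict such as "accept" pass. With both on (the agent-driven paths), the same call is refused before anything would be written.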
The MCP path still needs a human to take turns with the model. The autonomous harness removes the human from the turn order entirely: one command, sparkle run "<seed question>", walks the same adversarial playbook unattended — a proposer states a claim, a critic attacks it, an evidence gatherer adds support and counter-evidence, a judge rules, and a synthesizer harvests the takeaway. This runs live end-to-end today, validated across three real debates on 2026-05-29: the full propose -> critique (the GPT critic objects) -> gather evidence -> judge rules -> synthesize sequence completes unattended, the judge correctly declines to ratify an overstated claim, and the live signal honestly tells a claim a judge has ruled on (judged) apart from one that was actually settled (ratified).
The reason this exists is an integrity weakness the interactive path can't fully close. The "was this claim challenged?" gate only checks that an objection exists; it can't tell whether the objection came from a real opponent or from the same model writing a strawman against itself. The harness makes the adversary real by pitting two different AI companies' models against each other, which it reaches through command-line tools rather than a paid API:
- Claude proposes and judges; a GPT model attacks. Each debate role is a separate AI call with its own context and a distinct author identity. Claude (Opus 4.8, reached through the claude command-line tool on your Claude Max login) plays the proposer, the judge, the evidence gatherer, and the synthesizer. The critic, the role whose whole job is to attack the claim, is OpenAI's GPT-5.5, reached through the codex command-line tool on your Codex/ChatGPT login. Real adversarial diversity comes from different model families (Claude vs GPT), not from running two tiers of the same Claude model, so every Claude-side role uses the same Opus 4.8 and the genuine opponent is the GPT critic. Each role's model family and model id are env-overridable (e.g. SPARKLE_CRITIC_BACKEND, SPARKLE_CRITIC_MODEL), and the codex side's reasoning effort can be tuned with SPARKLE_CODEX_REASONING_EFFORT.
- The evidence role verifies against the real web; everyone else stays honest about recall. Early readiness probes turned up a structural failure: with every role denied all tools, the models cited precise named-source statistics with full confidence and no way to check them, and a fabricated figure could become the load-bearing evidence in a verdict. Two changes close this. First, the evidence gatherer, and only it, runs with real web search and fetch, so it looks up actual sources and puts the URLs it retrieved into the citations instead of reciting numbers from memory. Second, every still-tool-denied role (proposer, critic, judge, synthesizer) carries an honesty mandate: any figure, statistic, study name, or citation it recalls from memory must be tagged (recalled, unverified), with no false precision, and the judge must discount unverifiable specifics rather than let an unchecked number decide a ruling. Verified live: re-running a solar-power question that had previously fabricated cost figures now returns an evidence node with real fetched URLs (iea.org, irena.org), a source-checked quote, and an honest caveat about what the data actually showed. The web access for the evidence role can be turned off (config.evidence_web_search=False), in which case that role degrades gracefully to the same "mark it unverified" honesty floor as the others.
- It runs on your subscriptions, not an API bill. The loop never imports a model SDK or reads an API key; it shells out to the claude and codex binaries you already have logged in. claude -p rides your Claude Max subscription and codex exec rides your Codex/ChatGPT subscription, so the spend comes out of those, not a per-token API bill. Each call runs in a throwaway empty directory, so the model can reason and answer but cannot read, write, or touch this repo (or anything else on the machine) while it thinks. On the Claude side the lockdown is deny-all: it disables every built-in tool (--tools "", plus an explicit deny for the one tool that survives), loads only your user-level settings so a project allow-rule can't re-open anything, and loads no MCP servers. The one exception is the evidence gatherer, which is granted exactly the web search/fetch pair and nothing else (--tools WebSearch WebFetch restricts what is even reachable to those two, and --allowedTools WebSearch WebFetch pre-approves them so they run headlessly). Nothing dangerous becomes reachable; the role just gains real source verification.
- A cross-family integrity gate. Before the judge is allowed to ratify a claim, the engine requires that at least one objection against it came from a different model family than the claim's author (Claude vs GPT), not merely a different author name. This is stronger than the shared "different author" floor in the operations layer (which the harness also turns on): a role with a distinct name but the same model family (say, a Claude-side evidence gatherer writing a counter-point against a Claude-authored claim) does not satisfy the gate, so a model can't ratify its own claim off an attack it effectively wrote. The one locked configuration rule, checked when the run is set up, is that the critic must be a different family than the proposer.
- Rewrite, re-challenge. When a claim is revised, its old objections are copied onto the new version for lineage and display, but those copied-over objections are tagged as carried-over and no longer count toward the gate. A rewritten claim therefore has to earn a fresh, genuinely cross-family objection before the judge can ratify it: you can't rewrite around an attack and then ratify the new wording on the strength of the old one. (A self-loop contradicts edge, a claim pointing its own objection at itself, is also banned outright at the edge-write boundary, since it can never be a real attack.)
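The cross-family gate and the rewrite-re-challenge rule reduce to one predicate over a claim's objections. The sketch below is an illustration under assumptions, not the harness.py implementation: `Attack`, `gate_open`, and `validate_config` are hypothetical names, and the family strings are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attack:
    author: str                 # role name, e.g. "critic"
    family: str                 # model family, e.g. "claude" or "gpt"
    carried_over: bool = False  # copied from a superseded claim version


def gate_open(claim_family, attacks):
    """True when at least one fresh objection came from a different model family."""
    return any(a.family != claim_family and not a.carried_over for a in attacks)


def validate_config(proposer_family, critic_family):
    """Setup-time rule: the critic must be a different family than the proposer."""
    if proposer_family == critic_family:
        raise ValueError("critic must be a different family than the proposer")
```

A distinct author name is not enough here: an objection from a Claude-side role against a Claude-authored claim leaves the gate closed, and so does a GPT objection that was only carried over from an earlier version of the claim.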
No API key and no pip extra — just the two command-line tools, installed and logged in: claude (https://docs.claude.com/claude-code, on your Claude Max login) and codex (the Codex/ChatGPT CLI, on your Codex/ChatGPT login).
sparkle run "Does music help you code?" # one live debate
sparkle run "..." --run-id music-q1 # tag the whole run for review / rollback
sparkle run "..." --max-moves 12 # hard runaway cap on total moves
SPARKLE_CODEX_REASONING_EFFORT=low sparkle run "..." # cheaper GPT critic (see below)
Cost and control knobs:
- --rounds N overrides the per-phase critique/gather iteration count (the built-in playbook defaults to 3 each). Only single-round runs are validated live so far; multi-round is available but unvalidated.
- --max-moves N is a hard runaway cap on the total number of moves in a run.
- --run-id <id> tags the whole run as one reviewable, rollback-able region (otherwise an id is generated).
- SPARKLE_CODEX_REASONING_EFFORT=low|medium|high|xhigh controls how hard the GPT critic thinks. Codex defaults to xhigh, which is token-heavy (roughly 20k tokens even for a trivial reply), so set it lower for cheaper runs. Per-role backend and model overrides also exist via SPARKLE_<ROLE>_BACKEND / SPARKLE_<ROLE>_MODEL.
sparkle run is a lazy seam, exactly like sparkle mcp: the autonomous code is only imported when you invoke run. Because the engine and the thinker layer are pure standard-library Python, that import always succeeds — there is no extra to miss. The two failure modes are handled differently. A missing command-line tool is fatal up front: if claude or codex isn't on your PATH, run stops with a clear "install and log in to the claude and codex CLIs" message (not a pip hint). But a flaky model call mid-run — a timeout, a non-zero exit, or an empty/garbled answer from one CLI turn — is recorded as a single failed move (an error outcome) and the loop continues to the next phase, instead of crashing the whole run with a raw traceback. The Claude side is locked down so it can answer but not act: each call runs deny-all with NO built-in tools (--tools "", plus an explicit deny for the one tool that survives), forces --permission-mode default, loads only user-level settings (--setting-sources user, so a project or local allow-rule can't re-open a tool), loads no MCP servers (--strict-mcp-config), and runs in an isolated throwaway tempdir. The single exception is the evidence gatherer, which is granted exactly the web search/fetch pair (and nothing else) so it can verify sources. Every graph mutation the loop makes goes through the same ops.py functions the CLI and MCP server use, stamped with a run_id so the whole run (the claim, the objection, the decision, the ratified version, the synthesis, and all of their edges) is visible to the run-summary / diff / rollback surface. After a run finishes, the printed summary shows the run id, how many moves the agents made (a Moves: line, the honest per-move count — the debate is a single phase walk, not several adversarial rounds), the final verdict signal, and a Cross-family: line confirming which model family backed the proposer and the critic. 
The MCP server's run tools (summary, diff, manifest, ratify-region, rollback) can then review, audit, or undo the whole run by its id.
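The two failure modes described above (a missing tool is fatal up front, a flaky call becomes one failed move) can be sketched with the standard library alone. This is a hedged illustration rather than the thinker.py code: `require_tools` and `run_turn` are hypothetical names, the outcome dictionary shape is an assumption, and the demo uses the Python interpreter as a stand-in for a model CLI.

```python
import shutil
import subprocess
import sys
import tempfile


def require_tools(names):
    """Fail fast, before any move, if a required CLI is not on PATH."""
    missing = [n for n in names if shutil.which(n) is None]
    if missing:
        raise RuntimeError(f"install and log in to: {', '.join(missing)}")


def run_turn(argv, timeout=60):
    """Run one call in a throwaway empty directory; never raise on a flaky call."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(argv, cwd=workdir, capture_output=True,
                                  text=True, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError) as exc:
            return {"outcome": "error", "detail": str(exc)}
    if proc.returncode != 0:
        return {"outcome": "error", "detail": f"exit {proc.returncode}"}
    answer = proc.stdout.strip()
    if not answer:
        return {"outcome": "error", "detail": "empty answer"}
    return {"outcome": "ok", "answer": answer}


# Demo: the Python interpreter stands in for a model CLI.
print(run_turn([sys.executable, "-c", "print('hi')"]))
```

Returning an error record instead of raising is what lets a loop move on to the next phase after a timeout, a non-zero exit, or an empty answer.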
The cross-family setup is also auditable from the saved graph, not just trusted from the live config. Each run persists the role-to-family/model roster it was configured with under a top-level runs key in the graph store — outside every hashed node/edge payload, so no content-addressed id changes. Read it back with sparkle run-manifest <run_id> (CLI), the sparkle_run_manifest MCP tool, or the sparkle://run/{run_id}/manifest resource. Honest caveat: the manifest records the configured roster — which family each role was set up to use — not a runtime probe of which model actually answered each call. That is faithful in the shipped CLI path because the run builds its thinkers from the same config the manifest is written from. The binding per-ratification cross-family check is still the judge-time gate, recorded separately in the graph as the decision nodes and contradicts edges; the manifest is the setup roster, not that gate's result.
Honest limits to keep in mind:
- A live run is token-heavy and meant to be kicked off deliberately. The GPT critic runs at its highest reasoning effort (xhigh) by default and the whole debate is several back-and-forth AI turns, so a real run can take a while and burn real subscription tokens. Turn the critic's effort down with SPARKLE_CODEX_REASONING_EFFORT for cheaper runs; don't loop it blindly.
- Source verification is partial, not total: the harness is not a full fact-checker. This gap used to be wide open: with every role tool-denied, the debate could assert fabricated statistics with confidence, and a made-up figure could carry a verdict. It is now partially closed. The evidence gatherer verifies against the real web and cites the URLs it actually retrieved, and the proposer, critic, judge, and synthesizer must tag anything they recall from memory as (recalled, unverified) while the judge discounts unverifiable specifics. But those four roles still reason from memory rather than looking things up, so the harness should be read as "the evidence is source-checked and recalled claims are flagged," not "every statement in the debate is verified."
- A judge can no longer ratify a claim its own verdict rejected (now enforced in code, not just the prompt). On the autonomous path the judge's verdict vocabulary is fixed (upheld|refuted|overstated), and the engine structurally honors a settle-and-ratify request ONLY for an upheld verdict. Any other verdict still records the ruling but never stamps the terminal ratified status, so a judge turn that asks to settle while its verdict rejects the claim has the settle dropped. The agent-driven MCP rule move enforces the same floor (it opts the shared seam into the harness's affirming verdict word), so an agent over MCP that rules "refuted" with settle requested is refused too. (The free-text human CLI rule path is unchanged: there a verdict is plain words like "accept" / "needs work", so the coupling can't live there.)
- Ratification is not gated on net-positive support. By design, the judge's reasoning is sovereign. This is a different point from the one above, and they should not be conflated. Once a genuinely cross-family objection exists and the judge issues an upheld verdict that asks to settle, the judge MAY settle even when the supporting and contradicting evidence is tied (net=0): the judge weighs the debate, it does not count votes. This is a deliberate design choice, not a missing floor. The structural floor that DOES exist blocks the other direction: a judge cannot settle a claim its own verdict rejected. (A claim with zero support AND zero objections still cannot ratify at all, because the cross-family gate requires a real challenge first.)
- The automated tests cover the engine through stubs and fakes, not live AI calls. The harness loop, the cross-family gate, the rewrite-re-challenge rule, the self-loop ban, the run-id plumbing, and the stop conditions are all proven in the stdlib test suite with a scripted stub thinker and a fake command-runner that records the exact claude / codex arguments and returns canned output: no network, no real CLI process, no subscription spend. The handful of tests that would need a live binary are skipped. No test ever invokes the real claude or codex commands; the live debate is validated by hand (three real runs on 2026-05-29), not by the suite.
Research is a graph of typed nodes connected by typed edges:
- Nodes: claim, evidence, question, objection, inference, decision, synthesis
- Edges: supports, contradicts, refines, derived_from, evaluates, produced, supersedes (the CLI validates against this set; the Python kernel accepts custom relation sets)
- Status: active, stalled, weakly_supported, promising, abandoned, harvested, ratified (terminal: a judge's binding ruling; only a settled ruling can write it)
- IDs: SHA256 content hashes. The same content always produces the same ID, so the graph is tamper-evident by construction
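Content-addressed IDs can be illustrated in a few lines. This is a minimal sketch, not the models.py implementation: the field set is cut down to three fields, and the canonical JSON serialization (sorted keys, fixed separators) is an assumption about how a stable hash input could be produced. Only the frozen dataclass, `to_payload`, `compute_id`, and SHA256 are named in this README.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Node:
    # Reduced, illustrative field set; the real Node carries more fields.
    node_type: str
    title: str
    content: str

    def to_payload(self):
        return asdict(self)

    def compute_id(self):
        # Canonical JSON so that equal content always serializes identically.
        blob = json.dumps(self.to_payload(), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()


a = Node("claim", "Music helps focus", "Instrumental music aids routine coding.")
b = Node("claim", "Music helps focus", "Instrumental music aids routine coding.")
print(a.compute_id() == b.compute_id())  # identical payloads collapse to one id
```

Because the ID is derived from the payload, editing any field yields a different ID, which is why a correction is written as a new superseding node rather than a mutation of the original.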
Recurring research moves have templates so you don't have to remember node types and relations:
| Template | Creates | Relation | Use when |
|---|---|---|---|
| support | evidence | supports | Adding evidence for a claim |
| objection | objection | contradicts | Challenging a claim |
| reframing | question | refines | Asking a better question |
| application | claim | derived_from | Drawing a practical conclusion |
sparkle add-branch \
--from <claim_id> --template support \
--title "Primary source evidence" \
--citations "https://example.com/paper"
All state lives in a single human-readable JSON file (.sparkle/graph.json). Nodes and edges are keyed by content hash. Identical payloads collapse to the same ID. The store is immutable in spirit: provenance remains stable as the graph grows.
| Command | What it does |
|---|---|
| init | Create an empty graph store |
| bootstrap | Seed with example nodes from the concept conversation |
| home | Dashboard with counts, recent nodes, next actions |
| add-node | Create a node with type, title, content, confidence (validated to 0.0-1.0), status, tags; optionally fuse a link in one call (--link-to <prefix> --relation <rel>) and register a name (--as <name>) |
| add-edge | Link two nodes with a relation (validated against the known relation set); endpoints accept ids, short prefixes, or alias names |
| add-branch | Templated node + edge creation for common research moves |
| revise | Supersede a node with a corrected version and re-home its inbound edges; model-honest editing that never mutates the frozen original (--no-rehome-edges to leave edges on the old version) |
| import | Build a graph fragment from a JSON document (file path or - for stdin); prints a nickname-to-id map. The LLM/batch on-ramp, so it is treated as untrusted: every imported status is routed through the terminal-status guard, so automated or model-generated input can't fabricate a pre-ratified (or pre-harvested/abandoned) claim. A trusted=True opt-in in ops.import_graph exists only for round-tripping a graph this tool itself exported |
| list-nodes | List nodes through a single unified filter (type, status, tag, query, limit); --ids-only prints bare handles for fast scanning before linking |
| list-edges | List all edges |
| list-templates | Show available branch templates |
| relations | Show each edge relation with a plain-English direction gloss |
| show | Inspect a node with all its incoming and outgoing edges |
| tree | ASCII tree of a node's immediate neighborhood |
| why | Trace inbound provenance chain |
| lineage | Walk all inbound ancestors (BFS) |
| export | Export a subgraph rooted at a node to markdown |
| mcp | Run the MCP server over stdin/stdout (requires the sparkle[mcp] extra) |
| run | Run the autonomous adversarial loop on a seed question: Claude (Opus 4.8) proposes and judges, OpenAI's GPT-5.5 attacks, debating unattended; runs live end-to-end (requires the claude and codex CLIs on PATH and logged in). --rounds overrides the critique/gather caps (default 3 each; only single-round runs are validated live), --max-moves sets the runaway backstop, --run-id tags the whole run as one reviewable/rollback-able region. SPARKLE_CODEX_REASONING_EFFORT=low\|medium\|high\|xhigh tunes the GPT critic's depth (codex defaults to token-heavy xhigh). The summary reports a Moves: count (the honest per-move tally, not a "rounds" label) and a Cross-family: line naming the proposer's and critic's model families |
| run-manifest | Show the role-to-model-family roster a run was configured with (sparkle run-manifest <run_id>); this is the cross-family setup audit. Records the configured roster (which family each role was set up to use), not a runtime probe of which model answered; faithful in the shipped CLI path because the run builds its thinkers from the same config. Reads the top-level runs key in the store; errors on an unknown run id or a debate from before manifests existed |
All commands accept --store <path> to use a non-default graph file.
After every add, Sparkle echoes a short 12-character handle on its own Handle: line (so you stop copying 64-character hashes) and a spoken-word link confirmation (Linked: evidence "X" --supports--> claim "Y") so you catch a backwards edge at a glance. Use explicit short prefixes or alias names when linking; there is deliberately no "last node" token, because whole-second timestamps have no tiebreaker and would silently attach an edge to the wrong node.
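Short-prefix linking depends on a resolver that refuses to guess. The sketch below is an assumption-labeled illustration, not the GraphStore.resolve_id code: it takes a plain list of ids, and the error wording and the `handle` helper are hypothetical.

```python
def resolve_id(prefix, node_ids):
    """Resolve a short prefix to exactly one full id, or raise ValueError."""
    matches = [nid for nid in node_ids if nid.startswith(prefix)]
    if not matches:
        raise ValueError(f"no node matches prefix {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous prefix {prefix!r}: {len(matches)} matches")
    return matches[0]


def handle(node_id, length=12):
    """The short handle is just the leading characters of the full hash."""
    return node_id[:length]
```

Raising on an ambiguous prefix follows the same reasoning as omitting a "last node" token: a wrong but plausible match would attach an edge to the wrong node without any visible error.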
src/sparkle/
__init__.py # package marker, re-exports the reusable graph kernel
models.py # Node, Edge — frozen dataclasses with content-addressed IDs
graph.py # GraphStore — JSON-backed storage, traversal, subgraph, lineage export
ops.py # the operations contract — every action as a function returning dicts;
# debate invariants, referee/transition engine, dedup gate, alias sidecar,
# frontier, file lock. The single surface every front-end calls.
cli.py # CLI — a thin formatter; each command makes one ops.py call and prints
mcp_server.py # MCP server — a thin FastMCP adapter over ops.py (only loaded with the extra)
harness.py # autonomous engine + per-role agents + thinker contract + config;
# drives the playbook via ops.* only — pure stdlib, never imports a model SDK;
# enforces the cross-family gate (Claude vs GPT) and the rewrite-re-challenge rule
thinker.py # the live model backends — the ONLY module that shells out to a model CLI;
# ClaudeCliThinker (claude -p) + CodexCliThinker (codex exec), pure stdlib
# (subprocess/json/shutil/tempfile/os), no anthropic/openai package, no API key
presentation.py # render_tree, render_why, export_markdown — read-only rendering over a store
templates.py # BranchTemplate — opinionated inquiry workflows
bootstrap.py # seeds example graph from concept conversation
__main__.py # python -m sparkle entrypoint
tests/
test_cli.py # CLI integration tests via unittest
test_referee.py # referee/transition engine + edge tally
test_debate_loop.py # end-to-end debate loop over ops
test_runs.py # run tagging, summary, diff, rollback, and the run manifest (role->family roster) round-trip + id-stability
test_mcp_server.py # MCP adapter (skips when the mcp extra is absent): the integrity floors through the real tool closures
test_mcp_liveproof.py # opt-in live proof: a fresh sparkle mcp server over real stdio, every floor asserted end-to-end (SPARKLE_LIVE_MCP=1 + the mcp extra; skipped in the fast suite)
test_identity_rule.py # self-loop contradicts ban + distinct-adversary ratification floor
test_run_plumbing.py # run_id threaded through add-branch and rule
test_run_completeness.py # add_node run_id channel: proposal + synthesis (node + edges) land in the run region
test_harness.py # autonomous engine driven by a deterministic stub thinker (no network, no CLI)
test_thinker_cli.py # CLI thinkers via a fake command-runner: exact claude/codex argv + output parsing, no live calls
test_family_gate.py # cross-family ratification gate (Claude vs GPT) + the rewrite-re-challenge rule, via stub backends
demo/
README.md # demo overview with mermaid graph
WALKTHROUGH.md # conversational walkthrough of building a claim graph
exported-research.md # sparkle export output
.sparkle/graph.json # the demo graph (17 nodes, 19 edges)
docs/
prd.md # product requirements
roadmap.md # what's built, what's next
The ops.py seam is the structural keystone: the CLI never touches the store directly, and neither does the MCP server or the autonomous harness. All three bottom out in the same functions behind one ValueError boundary, so the debate rules, including the distinct-adversary ratification floor and the self-loop ban, are enforced in exactly one place and the front-ends stay interchangeable. Both the author-distinct floor and the verdict-rejection floor are per-caller choices made at this seam. The two agent-driven front-ends (the autonomous harness and the MCP server) opt in: ops.rule takes a require_distinct_adversary flag (so a single agent can't ratify off a self-written objection) and an opt-in affirming_verdicts set (so a non-affirming verdict can't settle a claim). The trusted human sparkle rule CLI path leaves both off (a human is trusted to challenge their own claim honestly and to use free-text verdict words). There is exactly one optional pip extra: sparkle[mcp] adds the MCP SDK. The autonomous run needs no pip extra at all: the engine (harness.py) and the model backends (thinker.py) are pure standard-library Python, and the live debate reaches the AI models by shelling out to the already-installed claude and codex command-line tools (no anthropic / openai package, no API key). The core, the CLI, and the autonomous engine all stay pure stdlib with zero external dependencies; the MCP server is the only component that needs the extra.
Class model
classDiagram
class Node {
+node_type
+title
+content
+citations[]
+author
+created_at
+confidence
+status
+tags[]
+metadata
+to_payload()
+compute_id()
}
class Edge {
+from_id
+to_id
+relation
+note
+created_at
+metadata
+to_payload()
+compute_id()
}
class GraphStore {
+path
+node_types
+edge_relations
+init()
+read()
+add_node(node)
+add_edge(edge)
+list_nodes()
+list_edges()
+resolve_id(prefix)
+get_node(node_id)
+get_neighbor_details(node_id)
+lineage(root_id)
+subgraph(root_id)
+export_lineage(root_id)
}
class presentation {
+render_tree(store, root_id)
+render_why(store, root_id)
+export_markdown(store, root_id, output)
}
class BranchTemplate {
+name
+node_type
+relation
+default_status
+description
+prompt_prefix
+edge_note
}
GraphStore --> Node : stores
GraphStore --> Edge : stores
BranchTemplate --> Node : configures
presentation --> GraphStore : reads
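To make the to_payload() / compute_id() pair in the class model concrete, here is a minimal sketch of a frozen, content-addressed node. The field subset, the canonical-JSON encoding, the SHA-256 choice, and the add_node helper are assumptions for illustration; this README does not specify which fields feed the hash or which hash Sparkle uses.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Illustrative subset of the Node fields from the class model."""
    node_type: str
    title: str
    content: str = ""
    author: str = ""
    status: str = "active"
    tags: tuple = ()

    def to_payload(self) -> dict:
        return {
            "node_type": self.node_type,
            "title": self.title,
            "content": self.content,
            "author": self.author,
            "status": self.status,
            "tags": list(self.tags),
        }

    def compute_id(self) -> str:
        # Assumption: the id is a SHA-256 over a canonical JSON payload.
        canonical = json.dumps(self.to_payload(), sort_keys=True,
                               separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def add_node(store: dict, node: Node) -> str:
    """Idempotent write: the same content always lands on the same id."""
    node_id = node.compute_id()
    store.setdefault(node_id, node.to_payload())
    return node_id


store = {}
first = add_node(store, Node("claim", "Music helps focus"))
second = add_node(store, Node("claim", "Music helps focus"))
print(first == second, len(store))
```

Keying the store by a hash of the payload is what would make a repeated write a no-op, and a frozen dataclass makes in-place mutation fail loudly instead of silently changing a node under its id.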
Add-branch sequence
sequenceDiagram
actor User
participant CLI as sparkle.cli
participant Store as GraphStore
participant Tpl as BranchTemplate
participant JSON as graph.json
User->>CLI: add-branch --from <claim> --template support --title ...
CLI->>Store: resolve_id(from_prefix)
Store->>JSON: read nodes
JSON-->>Store: matching node id
Store-->>CLI: parent node id
CLI->>Store: get_node(parent_id)
Store->>JSON: read parent node
JSON-->>Store: parent node payload
Store-->>CLI: parent node
CLI->>Tpl: build_branch_node(parent_title, template, title, ...)
Tpl-->>CLI: branch Node + template metadata
CLI->>Store: add_node(branch_node)
Store->>JSON: write node by content hash
JSON-->>Store: persisted
Store-->>CLI: branch_id
CLI->>Store: add_edge(branch_id -> parent_id, template.relation)
Store->>JSON: write edge by content hash
JSON-->>Store: persisted
Store-->>CLI: edge_id
CLI-->>User: branch node id + branch edge id
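The sequence above can be sketched in a few lines against an in-memory store. MemoryStore, content_id, the SUPPORT template values (node type "evidence", relation "supports"), and the add_branch helper are hypothetical stand-ins; only the call order (resolve_id, get_node, add_node, add_edge) and the branch-to-parent edge direction are taken from the diagram.

```python
import hashlib
import json


def content_id(payload: dict) -> str:
    # Assumption: ids are a hash of the canonical JSON payload.
    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class MemoryStore:
    """Hypothetical in-memory stand-in for GraphStore."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}

    def resolve_id(self, prefix: str) -> str:
        matches = [nid for nid in self.nodes if nid.startswith(prefix)]
        if len(matches) != 1:
            raise ValueError(f"prefix {prefix!r} matched {len(matches)} nodes")
        return matches[0]

    def get_node(self, node_id: str) -> dict:
        return self.nodes[node_id]

    def add_node(self, payload: dict) -> str:
        node_id = content_id(payload)
        self.nodes.setdefault(node_id, payload)
        return node_id

    def add_edge(self, from_id: str, to_id: str, relation: str) -> str:
        payload = {"from_id": from_id, "to_id": to_id, "relation": relation}
        edge_id = content_id(payload)
        self.edges.setdefault(edge_id, payload)
        return edge_id


# Hypothetical template values, for illustration only.
SUPPORT = {"name": "support", "node_type": "evidence",
           "relation": "supports", "default_status": "active"}


def add_branch(store, from_prefix, template, title):
    parent_id = store.resolve_id(from_prefix)      # prefix -> full id
    parent = store.get_node(parent_id)             # read the parent node
    branch = {"node_type": template["node_type"], "title": title,
              "status": template["default_status"],
              "metadata": {"template": template["name"],
                           "parent_title": parent["title"]}}
    branch_id = store.add_node(branch)             # write node by content hash
    edge_id = store.add_edge(branch_id, parent_id, template["relation"])
    return branch_id, edge_id


store = MemoryStore()
claim_id = store.add_node({"node_type": "claim", "title": "Music helps focus"})
branch_id, edge_id = add_branch(store, claim_id[:12], SUPPORT, "Lab study")
print(store.edges[edge_id]["relation"])
```

Note that the edge runs from the new branch to its parent, matching add_edge(branch_id -> parent_id, template.relation) in the diagram.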
Node status model
stateDiagram-v2
[*] --> active
active --> promising
active --> stalled
active --> weakly_supported
weakly_supported --> promising
weakly_supported --> abandoned
stalled --> active
stalled --> abandoned
promising --> harvested
promising --> active
harvested --> active
active --> abandoned
abandoned --> active
active --> ratified : settled ruling (rule --settle)
ratified --> [*]
ratified is a terminal status reached only through a judge's settled ruling, never set freehand. Because nodes are frozen, ratification never mutates a node in place: the ruling writes a decision node plus a superseding claim version that carries status=ratified and a supersedes edge to the original. harvested and abandoned are likewise terminal in spirit, and a model-authored write may not set any of the three directly.
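The status diagram and the rules above can be read as a small transition table. The sketch below copies the arrows from the diagram; the check_transition function and its via_ruling / model_authored parameters are hypothetical names chosen for this example, and whether Sparkle enforces the table in this form is an assumption.

```python
# Allowed moves, copied arrow by arrow from the state diagram.
TRANSITIONS = {
    "active": {"promising", "stalled", "weakly_supported",
               "abandoned", "ratified"},
    "weakly_supported": {"promising", "abandoned"},
    "stalled": {"active", "abandoned"},
    "promising": {"harvested", "active"},
    "harvested": {"active"},
    "abandoned": {"active"},
    "ratified": set(),  # terminal: no outgoing arrows
}


def check_transition(current: str, new: str, via_ruling: bool = False,
                     model_authored: bool = False) -> None:
    """Raise ValueError if the move is not allowed (hypothetical helper)."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"{current} -> {new} is not in the status model")
    if new == "ratified":
        if not via_ruling:
            raise ValueError("ratified is reached only through a settled ruling")
        return
    if model_authored and new in {"harvested", "abandoned"}:
        raise ValueError(f"a model-authored write may not set {new} directly")


check_transition("active", "promising")                  # allowed
check_transition("active", "ratified", via_ruling=True)  # allowed via ruling
try:
    check_transition("active", "ratified")               # freehand: refused
except ValueError as exc:
    print("refused:", exc)
```

In the real model a successful ratification would not flip the status of the existing node; it would write the decision node and the superseding claim version described above.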
python3 -m unittest discover -s tests -v

The base suite is pure stdlib unittest with no third-party packages installed; run it from the repo root. It covers: init, bootstrap, node/edge CRUD, branch templates, show/tree/why rendering, filtered listing (type/status/tag/query/limit), list-nodes --ids-only, home dashboard, lineage, markdown export, content-addressing idempotency, custom node-type/relation graph kernels with metadata, lineage-only vs full-component export, n/a confidence rendering, dangling-edge export safety, corrupt-store handling, error handling (invalid lookups, ambiguous prefixes, unknown relations, out-of-range confidence), the operations layer (the ratified terminal status, the fused create-and-link path including the paired-or-error guard, alias registration and re-pointing on reuse, the ruling invariant that refuses to ratify an unchallenged claim, the content-fingerprint dedup gate, JSON import from stdin with invalid-JSON rejection, revise with and without re-homing inbound edges, the referee/transition engine, and run tagging/summary/diff/rollback), and the Phase 2 autonomous harness:
- Distinct-adversary floor (`test_identity_rule.py`): the self-loop `contradicts` ban at the edge-write boundary, and the `require_distinct_adversary` ratification gate, which refuses when the only objection's author equals the claim's author and allows when a different author objects.
- Run-id plumbing (`test_run_plumbing.py`): a `run_id` threaded through `add-branch` and `rule` shows up on the objection and the ruling, so a full autonomous loop is captured by the run-summary surface.
- Autonomous engine (`test_harness.py`): a deterministic stub thinker (no network, no CLI) drives the full propose -> object -> rule loop; the run summary shows the claim, objection, and decision; the judge is refused on a self-strawman and succeeds with a cross-family critic; the hard move cap and the `done`/`stop` move both end the loop; and `HarnessConfig` raises when the critic's model family equals the proposer's. A guard test confirms the engine module never imports a model SDK. The run result reports `moves_made` (the honest per-move count, never a misleading `rounds_run` key), which equals the recorded move count. The engine also persists a cross-family run manifest, and a test pins the honest limitation that the manifest reflects the configured role-to-family roster, not a runtime probe of which thinker answered.
- CLI thinkers (`test_thinker_cli.py`): an injected fake command-runner records the exact `claude`/`codex` argument list and returns canned output, so the tests pin the argv shape (the Opus-4.8 model id, JSON output mode, the deny-all no-tools lockdown flag, the read-only sandbox, the isolated working dir, the output-file flag, and the optional reasoning-effort override) and the output parsing (claude's JSON event array, codex's last-message file) without ever spawning a real process. They also pin the one deny-all exception: the evidence gatherer (`web_search=True`) gets the web search/fetch pair and only that pair (each of `WebSearch`/`WebFetch` appears twice: once to restrict availability via `--tools`, once to pre-approve via `--allowedTools`), while a default role stays deny-all with no web tools at all. Timeouts, non-zero exits, empty output, and malformed JSON all raise rather than returning a fake answer. The stub thinker is covered too. No test invokes the real CLIs.
- Cross-family gate + rewrite-re-challenge (`test_family_gate.py`): stub backends tagged with families prove the judge is refused unless an objection comes from a different model family than the claim's author, and that a carried-over objection on a revised claim no longer counts, so a rewritten claim must earn a fresh cross-family attack before it can be ratified.
- Run completeness (`test_run_completeness.py`): the `run_id` channel on `add_node` stamps both the new node and the fused link edge, so a full loop's proposal and synthesis (nodes and their edges) land in the run region and are visible to the run-summary / diff / rollback surface; an un-run human write stays byte-identical (no run stamp leaks onto any edge).
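The fake command-runner pattern used by the CLI thinker tests can be sketched as follows. CliThinker, FakeRunner, the "fake-cli --json" argv, and the "result" JSON key are all hypothetical; the real claude/codex argument lists and output formats are deliberately not reproduced here. The sketch only shows the technique: inject the runner, record the argv, return canned output, and raise on bad output instead of inventing an answer.

```python
import json
import subprocess


class CliThinker:
    """Hypothetical thinker that shells out through an injectable runner."""

    def __init__(self, argv_prefix, runner=subprocess.run):
        self.argv_prefix = list(argv_prefix)
        self.runner = runner

    def think(self, prompt: str) -> str:
        argv = self.argv_prefix + [prompt]
        # With the real runner, a timeout raises subprocess.TimeoutExpired.
        proc = self.runner(argv, capture_output=True, text=True, timeout=60)
        if proc.returncode != 0:
            raise RuntimeError(f"command exited with {proc.returncode}")
        if not proc.stdout.strip():
            raise RuntimeError("command produced no output")
        # Malformed JSON raises json.JSONDecodeError (a ValueError).
        return json.loads(proc.stdout)["result"]


class FakeRunner:
    """Records every argv and returns canned output; spawns nothing."""

    def __init__(self, stdout: str, returncode: int = 0):
        self.stdout = stdout
        self.returncode = returncode
        self.calls = []

    def __call__(self, argv, **kwargs):
        self.calls.append(list(argv))
        return subprocess.CompletedProcess(argv, self.returncode,
                                           stdout=self.stdout, stderr="")


fake = FakeRunner('{"result": "objection: sample size is tiny"}')
thinker = CliThinker(["fake-cli", "--json"], runner=fake)
print(thinker.think("attack this claim"))
print(fake.calls)
```

Because the runner is a constructor argument, a test can assert on the exact argv list without a process ever being spawned.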
The MCP adapter tests (test_mcp_server.py) skip when the sparkle[mcp] extra is absent. They prove the integrity floors fire through the real build_app() tool closures: a pre-challenge ratify is refused, a same-author (self-strawman) objection can't ratify, and a non-affirming verdict can't settle.
There is also an opt-in live end-to-end proof (test_mcp_liveproof.py) that goes one level further: it spawns a fresh sparkle mcp server as a subprocess, connects to it as a real MCP client over stdio, and drives a full debate while asserting every floor fires through the actual transport + server + ops seam — pre-challenge ratify refused, self-strawman refused, the verdict-rejection floor (a "refuted" verdict with settle requested is refused), the honest judged-vs-ratified signal, a clean cross-author "upheld" ruling that does ratify, and the evidence/harvest edges landing in the run region. It is slower (it spawns a process), so it is skipped unless SPARKLE_LIVE_MCP=1 is set and the sparkle[mcp] extra is installed; the fast suite leaves it out.
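An opt-in gate of this kind can be written with stdlib unittest alone. This is a minimal sketch, assuming the MCP SDK is importable as a module named mcp; the live_proof_enabled helper, the class name, and the placeholder test body are hypothetical, and only the SPARKLE_LIVE_MCP=1 switch comes from this README.

```python
import importlib.util
import os
import unittest


def live_proof_enabled(environ, has_mcp: bool) -> bool:
    """True only when the opt-in switch is set and the extra is installed."""
    return environ.get("SPARKLE_LIVE_MCP") == "1" and has_mcp


# Assumption: the sparkle[mcp] extra installs a module importable as "mcp".
HAS_MCP = importlib.util.find_spec("mcp") is not None


@unittest.skipUnless(live_proof_enabled(os.environ, HAS_MCP),
                     "set SPARKLE_LIVE_MCP=1 and install sparkle[mcp]")
class LiveProofTest(unittest.TestCase):
    def test_floors_fire_over_stdio(self):
        # Placeholder: the real test spawns the server and drives a debate.
        self.assertTrue(True)
```

With the switch unset, the fast suite reports the class as skipped rather than silently passing, which keeps the opt-in status visible in the test output.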
No test makes a live AI call or spawns the real claude / codex commands — the live cross-family debate runs end-to-end (validated by hand across three real debates on 2026-05-29) but is token-heavy, so it is checked manually rather than by the suite.
- The live cross-family debate runs end-to-end (validated across three real debates on 2026-05-29) but is token-heavy and meant to be kicked off deliberately. `sparkle run` shells out to the `claude` and `codex` command-line tools (you must have both installed and logged in; no API key or pip extra is needed). The GPT critic runs at high reasoning effort by default and is token-heavy on the Codex/ChatGPT side, and the debate is several AI turns, so a real run burns real subscription tokens; turn the critic's effort down with `SPARKLE_CODEX_REASONING_EFFORT` for cheaper runs. The automated suite proves the engine against a deterministic stub thinker and a fake command-runner; the live path is validated by hand, not by the test suite.
- Ratification is not gated on net-positive support, by design. Given that a cross-family objection exists, a judge may issue an `upheld` verdict and settle even on tied evidence (net=0), because the judge weighs the debate rather than counting votes. This is a deliberate design choice (the judge's reasoning is sovereign), not a missing floor. The floor that IS structural runs the other way: a judge can no longer ratify a claim its own verdict rejected, because the engine honors settle only for an `upheld` verdict (and a claim with no challenge at all still can't ratify, since the cross-family gate requires a real objection first). The cross-family setup is now auditable from the saved graph via `sparkle run-manifest <run_id>`, with the honest caveat that the manifest records the configured roster, not a runtime probe of which model answered each call. Multi-round runs (`--rounds` > 1) are available but not yet validated live.
- A deduped re-proposal of the same seed stays attributed to its first run. If a later run re-proposes a byte-identical claim, the existing (frozen) node is returned and keeps the first run's `run_id`. This is an accepted, documented gap: re-stamping it would require mutating a frozen node, which the model forbids.
- The MCP path's distinct-adversary floor is author-string level, not true independence. An agent driving Sparkle over MCP gets the same floors as the autonomous loop (a self-written objection can't ratify a claim, and a non-affirming verdict can't settle one). But because the MCP operator pattern is one agent playing every role, the objection's author is just whatever role string the agent passes, so a single agent could relabel itself to fake an adversary's name. This is an accepted boundary, not a bug: genuine adversary independence, meaning a truly different model family attacking the claim, comes from the cross-family CLI leg (`sparkle run`), where the critic is a different company's model than the proposer and the judge-time gate checks model family, not just an author name.
- One front-end at a time per graph: an advisory file lock guards each read-modify-write, but there is no multi-writer guarantee beyond that.
- Source verification on the autonomous path is partial, not total. The evidence-gathering role searches the real web and cites the URLs it retrieved, and every other role must flag any recalled figure as `(recalled, unverified)` while the judge discounts unverifiable specifics, so a fabricated statistic can no longer silently carry a verdict. But the proposer, critic, judge, and synthesizer still reason from memory, so the harness is a source-checked-evidence debate, not a full fact-checker of every statement.
- Citations are flat strings: no structured source metadata.
- No first-class introspection commands: the frontier covers "what needs attention" via MCP, but `gaps`/`tensions`/`stale`/`orphans` are not yet standalone CLI commands.
- No UI: terminal only.
- Semantic dedup is out of scope: exact-fingerprint re-proposals auto-dedup, but near-duplicates are not detected (true fuzzy matching would force a model dependency).
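For the single-writer limitation in the list above, this is roughly what an advisory lock around a read-modify-write looks like in stdlib Python. It is a sketch under stated assumptions: fcntl.flock (Unix-only) is one way to take an advisory lock, and this README does not say which mechanism Sparkle uses; the locked and update_graph helpers, the ".lock" and ".tmp" suffixes, and the empty-graph shape are hypothetical.

```python
import fcntl
import json
import os
from contextlib import contextmanager


@contextmanager
def locked(path: str):
    """Hold an exclusive advisory lock on a sidecar lock file (Unix only)."""
    with open(path + ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def update_graph(path: str, mutate) -> dict:
    """Read, mutate, and rewrite the JSON file while holding the lock."""
    with locked(path):
        if os.path.exists(path):
            with open(path) as handle:
                graph = json.load(handle)
        else:
            graph = {"nodes": {}, "edges": {}}
        mutate(graph)
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(graph, handle)
        os.replace(tmp_path, path)  # atomic rename over the old file
        return graph
```

The lock is advisory: it only protects writers that also take it, which is why the guarantee stops at one cooperating front-end at a time.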