albeorla/sparkle

✨ Sparkle

A research tool that makes claims, evidence, and reasoning traceable — for humans and AI agents.

Instead of scattering research across notes, bookmarks, and chat threads, Sparkle stores it as a typed graph: claims link to evidence, objections, questions, and syntheses. Every conclusion traces back to the path that produced it. Nothing gets lost.

See it in action: Does music help you code?

Why This Exists

Most research tools are good at capture and bad at accountability. Ask yourself:

  • What supports this claim? What weakens it?
  • What alternative paths did I explore?
  • Why did I abandon that line of thinking?
  • How exactly did I arrive at this conclusion?

If you can't answer those from your notes, you have a capture tool, not a research tool. Sparkle is the graph layer that makes these questions answerable — by you, by a collaborator reviewing your work, or by an AI agent conducting research on your behalf.

Where It's Going

There are three ways to operate the graph today, and the main one is an AI agent. In Claude Code or Claude Desktop, an agent drives the graph interactively over the MCP server, with the host's own model supplying all the thinking: no server-side model, no API key (see "Intelligent Operation" below). A human can also drive it directly from the CLI. And a fully hands-off loop drives it via sparkle run, where Claude proposes and judges while a model from a different company attacks, debating a seed question with no human in the turn order (see "Autonomous Operation" below). That loop now runs live end-to-end (validated across three real debates on 2026-05-29) and reaches the models through the claude and codex command-line tools you already have logged in, so it too needs no API key. The evidence-gathering role searches the real web to verify its sources, and every other role is required to flag any figure it recalls from memory as unverified, so the debate no longer asserts made-up statistics as fact.

What's left:

Phase | What | Why
Real citations | Structured sources, excerpts, DOI/URL, BibTeX interop | Flat strings aren't verifiable sources
Deeper introspection | gaps, tensions, stale, orphans as first-class commands | Tell an agent what to work on next without reading the whole graph
Machine-readable CLI | --format json on every command | Drive Sparkle from scripts without going through MCP

See docs/roadmap.md for the full plan.

Quick Start

Requires Python 3.11+ and nothing else — the core is pure standard library. The main way to use Sparkle is to let an AI agent drive the graph over MCP; the CLI is there when you want to seed, inspect, or steer it by hand.

Let an AI agent drive it (the primary path)

git clone https://github.com/albeorla/sparkle.git && cd sparkle
pip install -e '.[mcp]'                  # core + the MCP server

sparkle init                             # create a graph store
claude mcp add sparkle -- sparkle mcp    # register the server with Claude Code

Now ask your agent to work the graph. "What's the weakest part of the debate right now?" pulls the live work frontier; the /challenge, /investigate, /synthesize, and /next slash commands run the adversarial moves. The host's own model does all the reasoning — no server-side model, no API key. Full details in Intelligent Operation.

Drive it yourself from the CLI

pip install -e .             # core only — drop the [mcp] extra

sparkle bootstrap           # seed an example graph
sparkle home                # dashboard with counts and next actions

# Explore any node by a short id prefix
sparkle tree <node_id_prefix>
sparkle show <node_id_prefix>
sparkle why  <node_id_prefix>

Prefer not to install? Every command also runs from the source tree — swap sparkle <cmd> for PYTHONPATH=src python3 -m sparkle <cmd>.

Explore the pre-built demo graph without touching your own:

sparkle --store demo/.sparkle/graph.json home
sparkle --store demo/.sparkle/graph.json tree e036ff896cea

Intelligent Operation (MCP)

Sparkle ships an MCP server so an AI agent can operate the graph natively — no shelling out, no ASCII parsing. Today this works interactively on Claude Code and Claude Desktop: you run a slash-command prompt, the host's own model reads a "where is the debate weak right now" feed, reasons about it, and calls mutation tools to write objections, evidence, and rulings back into the graph. The server supplies structure, targets, and guardrails; the host model supplies all the thinking. No server-side model and no API key.

pip install 'sparkle[mcp]'                 # adds the MCP SDK; the core stays dependency-free
claude mcp add sparkle -- sparkle mcp       # register the server with a client

The sparkle mcp command runs the server over stdin/stdout. It is a lazy seam: every other command works without the extra installed, and a missing extra fails with a clear pip install 'sparkle[mcp]' message.

What the server exposes:

  • A work frontier (sparkle://frontier, also a tool). This is a prioritized "what needs adversarial attention next" feed computed purely from the stored graph: unchallenged claims, open questions, stalled threads, thin-evidence claims, and claims ready to judge. It reports live, non-persisted aggregates (a supports-minus-contradicts tally plus node type/status), never the frozen stored confidence.
  • Mutation tools that enforce debate rules the raw CLI doesn't — for example, the judge's "rule" move refuses to ratify a claim the adversary never attacked. Because the MCP server is an agent-driven front-end, its rule move carries the same integrity floors the autonomous harness uses, so an agent driving Sparkle over MCP is held to the same bar as the hands-off loop:
    • Author-distinct floor. A claim only counts as challenged if the objection comes from a different author than the claim, so a single agent cannot ratify its own claim off an objection it wrote against itself. (Honest boundary, not a bug: on the MCP path the objection's author is whatever role string the calling agent passes, because the MCP operator pattern is one agent playing every role — so this floor catches a same-author objection, but a single agent could still relabel itself to fake an adversary's name. Genuine adversary independence — a truly different model attacking the claim — is the cross-family CLI leg, sparkle run, not the MCP path; see "Autonomous Operation" below.)
    • Verdict-rejection floor. The judge's "rule" move refuses to settle-and-ratify a claim its own verdict rejected: an agent calling the rule move with a non-affirming verdict (anything other than the harness's affirming upheld) and settle=True is refused before any write — the ruling can still be recorded, the claim just isn't ratified. This used to live only in the autonomous harness's dispatch, so via MCP an agent could rule "refuted" and still stamp the terminal ratified status on a claim its own ruling rejected; the floor now sits in the shared seam and the MCP rule move opts into it.
    • The trusted human sparkle rule CLI path stays permissive on both (a human is trusted to challenge their own claim and to use free-text verdict words like "accept" / "needs work").
  • Complete run regions. When an agent adds a node with a fused link (an evidence "supports" edge) or harvests a synthesis (the "produced" edge), the MCP server now stamps the edge with the run id too, not just the node. So the full run — every node and every edge — shows up in the run review / diff / rollback surface, matching the autonomous harness. Previously those fused edges leaked out of the run region and were invisible to a rollback.
  • Adversarial prompts as slash commands: /challenge (cast the model as critic), /investigate (evidence-gatherer), /synthesize (judge then synthesizer), /next (read the frontier and run the highest-leverage move).
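
The live tally behind the frontier can be pictured with a small sketch. This is an illustration under stated assumptions, not Sparkle's code: the edge dictionaries with from / to / relation keys and the helper names net_support and unchallenged are hypothetical.

```python
from collections import Counter

def net_support(edges, claim_id):
    """Supports minus contradicts over a claim's inbound edges, computed on demand."""
    tally = Counter(e["relation"] for e in edges if e["to"] == claim_id)
    return tally["supports"] - tally["contradicts"]

def unchallenged(claim_ids, edges):
    """Claims with no inbound contradicts edge: candidates for the frontier."""
    attacked = {e["to"] for e in edges if e["relation"] == "contradicts"}
    return [c for c in claim_ids if c not in attacked]

# Hypothetical edge shape: evidence/objection nodes point at the claim.
edges = [
    {"from": "ev1", "to": "c1", "relation": "supports"},
    {"from": "ev2", "to": "c1", "relation": "supports"},
    {"from": "ob1", "to": "c1", "relation": "contradicts"},
]
print(net_support(edges, "c1"))
print(unchallenged(["c1", "c2"], edges))
```

Nothing here is written back to the store, which mirrors the "live, non-persisted" property described above.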

The guardrails live in the shared operations layer (ops.py), not in the MCP wrapper, so the CLI, the MCP server, and the autonomous harness all inherit the same rules through the same error boundary. The hands-off, no-human-in-the-loop version is described next.

Autonomous Operation (sparkle run)

The MCP path still needs a human to take turns with the model. The autonomous harness removes the human from the turn order entirely: one command, sparkle run "<seed question>", walks the same adversarial playbook unattended — a proposer states a claim, a critic attacks it, an evidence gatherer adds support and counter-evidence, a judge rules, and a synthesizer harvests the takeaway. This runs live end-to-end today, validated across three real debates on 2026-05-29: the full propose -> critique (the GPT critic objects) -> gather evidence -> judge rules -> synthesize sequence completes unattended, the judge correctly declines to ratify an overstated claim, and the live signal honestly tells a claim a judge has ruled on (judged) apart from one that was actually settled (ratified).

The reason this exists is an integrity weakness the interactive path can't fully close. The "was this claim challenged?" gate only checks that an objection exists; it can't tell whether the objection came from a real opponent or from the same model writing a strawman against itself. The harness makes the adversary real by pitting two different AI companies' models against each other, which it reaches through command-line tools rather than a paid API:

  • Claude proposes and judges; a GPT model attacks. Each debate role is a separate AI call with its own context and a distinct author identity. Claude (Opus 4.8, reached through the claude command-line tool on your Claude Max login) plays the proposer, the judge, the evidence gatherer, and the synthesizer. The critic — the role whose whole job is to attack the claim — is OpenAI's GPT-5.5, reached through the codex command-line tool on your Codex/ChatGPT login. Real adversarial diversity comes from different model families (Claude vs GPT), not from running two tiers of the same Claude model, so every Claude-side role uses the same Opus 4.8 and the genuine opponent is the GPT critic. Each role's model family and model id are env-overridable (e.g. SPARKLE_CRITIC_BACKEND, SPARKLE_CRITIC_MODEL), and the codex side's reasoning effort can be tuned with SPARKLE_CODEX_REASONING_EFFORT.
  • The evidence role verifies against the real web; everyone else stays honest about recall. Early readiness probes turned up a structural failure: with every role denied all tools, the models cited precise named-source statistics with full confidence and no way to check them, and a fabricated figure could become the load-bearing evidence in a verdict. Two changes close this. First, the evidence gatherer — and only it — runs with real web search and fetch, so it looks up actual sources and puts the URLs it retrieved into the citations instead of reciting numbers from memory. Second, every still-tool-denied role (proposer, critic, judge, synthesizer) carries an honesty mandate: any figure, statistic, study name, or citation it recalls from memory must be tagged (recalled, unverified), with no false precision, and the judge must discount unverifiable specifics rather than let an unchecked number decide a ruling. Verified live: re-running a solar-power question that had previously fabricated cost figures now returns an evidence node with real fetched URLs (iea.org, irena.org), a source-checked quote, and an honest caveat about what the data actually showed. The web access for the evidence role can be turned off (config.evidence_web_search=False), in which case that role degrades gracefully to the same "mark it unverified" honesty floor as the others.
  • It runs on your subscriptions, not an API bill. The loop never imports a model SDK or reads an API key — it shells out to the claude and codex binaries you already have logged in. claude -p rides your Claude Max subscription and codex exec rides your Codex/ChatGPT subscription, so the spend comes out of those, not a per-token API bill. Each call runs in a throwaway empty directory, so the model can reason and answer but cannot read, write, or touch this repo (or anything else on the machine) while it thinks. On the Claude side the lockdown is deny-all: it disables every built-in tool (--tools "", plus an explicit deny for the one tool that survives), loads only your user-level settings so a project allow-rule can't re-open anything, and loads no MCP servers. The one exception is the evidence gatherer, which is granted exactly the web search/fetch pair and nothing else (--tools WebSearch WebFetch restricts what is even reachable to those two, and --allowedTools WebSearch WebFetch pre-approves them so they run headlessly) — nothing dangerous becomes reachable, the role just gains real source verification.
  • A cross-family integrity gate. Before the judge is allowed to ratify a claim, the engine requires that at least one objection against it came from a different model family than the claim's author (Claude vs GPT) — not merely a different author name. This is stronger than the shared "different author" floor in the operations layer (which the harness also turns on): a role with a distinct name but the same model family (say, a Claude-side evidence gatherer writing a counter-point against a Claude-authored claim) does not satisfy the gate, so a model can't ratify its own claim off an attack it effectively wrote. The one locked configuration rule — checked when the run is set up — is that the critic must be a different family than the proposer.
  • Rewrite, re-challenge. When a claim is revised, its old objections are copied onto the new version for lineage and display, but those copied-over objections are tagged as carried-over and no longer count toward the gate. A rewritten claim therefore has to earn a fresh, genuinely cross-family objection before the judge can ratify it — you can't rewrite around an attack and then ratify the new wording on the strength of the old one. (A self-loop contradicts edge — a claim pointing its own objection at itself — is also banned outright at the edge-write boundary, since it can never be a real attack.)
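
The cross-family gate and the rewrite, re-challenge rule described above can be sketched together. This is a minimal illustration, not the harness's implementation: the Objection class, its field names, and the family labels "claude" / "gpt" are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Objection:
    author_family: str           # hypothetical label, e.g. "claude" or "gpt"
    carried_over: bool = False   # copied from a superseded version of the claim

def passes_cross_family_gate(claim_family, objections):
    """True only if a fresh objection came from a different model family."""
    return any(
        o.author_family != claim_family and not o.carried_over
        for o in objections
    )

# A Claude-authored claim attacked only by another Claude-side role: gate stays shut.
print(passes_cross_family_gate("claude", [Objection("claude")]))
# A fresh GPT objection opens the gate.
print(passes_cross_family_gate("claude", [Objection("gpt")]))
# After a rewrite, the copied-over GPT objection no longer counts.
print(passes_cross_family_gate("claude", [Objection("gpt", carried_over=True)]))
```

The point of keying on family rather than author name is that a renamed role on the same model cannot satisfy the check.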

No API key and no pip extra — just the two command-line tools, installed and logged in: claude (https://docs.claude.com/claude-code, on your Claude Max login) and codex (the Codex/ChatGPT CLI, on your Codex/ChatGPT login).

sparkle run "Does music help you code?"                  # one live debate
sparkle run "..." --run-id music-q1                       # tag the whole run for review / rollback
sparkle run "..." --max-moves 12                          # hard runaway cap on total moves
SPARKLE_CODEX_REASONING_EFFORT=low sparkle run "..."      # cheaper GPT critic (see below)

Cost and control knobs:

  • --rounds N overrides the per-phase critique/gather iteration count (the built-in playbook defaults to 3 each). Only single-round runs are validated live so far; multi-round is available but unvalidated.
  • --max-moves N is a hard runaway cap on the total number of moves in a run.
  • --run-id <id> tags the whole run as one reviewable, rollback-able region (otherwise an id is generated).
  • SPARKLE_CODEX_REASONING_EFFORT=low|medium|high|xhigh controls how hard the GPT critic thinks. Codex defaults to xhigh, which is token-heavy (roughly 20k tokens even for a trivial reply), so set it lower for cheaper runs. Per-role backend and model overrides also exist via SPARKLE_<ROLE>_BACKEND / SPARKLE_<ROLE>_MODEL.
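
One way to picture how the environment overrides resolve, as a sketch only: the variable names come from the list above, but the functions role_override and reasoning_effort, the fallback handling, and the validation step are assumptions, not Sparkle's actual config code.

```python
import os

VALID_EFFORTS = ("low", "medium", "high", "xhigh")

def role_override(role, default_backend, default_model, env=None):
    """Resolve a role's backend and model; an env var wins over the default."""
    env = os.environ if env is None else env
    prefix = f"SPARKLE_{role.upper()}_"
    backend = env.get(prefix + "BACKEND", default_backend)
    model = env.get(prefix + "MODEL", default_model)
    return backend, model

def reasoning_effort(env=None):
    """Read the critic's effort level; None means 'leave Codex at its own default'."""
    env = os.environ if env is None else env
    effort = env.get("SPARKLE_CODEX_REASONING_EFFORT")
    if effort is not None and effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return effort

# Placeholder defaults; the real backend and model ids are not shown in this README.
fake_env = {"SPARKLE_CRITIC_MODEL": "some-model-id",
            "SPARKLE_CODEX_REASONING_EFFORT": "low"}
print(role_override("critic", "default-backend", "default-model", fake_env))
print(reasoning_effort(fake_env))
```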

sparkle run is a lazy seam, exactly like sparkle mcp: the autonomous code is only imported when you invoke run. Because the engine and the thinker layer are pure standard-library Python, that import always succeeds, so there is no extra to miss.

The two failure modes are handled differently. A missing command-line tool is fatal up front: if claude or codex isn't on your PATH, run stops with a clear "install and log in to the claude and codex CLIs" message (not a pip hint). But a flaky model call mid-run (a timeout, a non-zero exit, or an empty/garbled answer from one CLI turn) is recorded as a single failed move (an error outcome) and the loop continues to the next phase, instead of crashing the whole run with a raw traceback.

The Claude side is locked down so it can answer but not act: each call runs deny-all with NO built-in tools (--tools "", plus an explicit deny for the one tool that survives), forces --permission-mode default, loads only user-level settings (--setting-sources user, so a project or local allow-rule can't re-open a tool), loads no MCP servers (--strict-mcp-config), and runs in an isolated throwaway tempdir. The single exception is the evidence gatherer, which is granted exactly the web search/fetch pair (and nothing else) so it can verify sources.

Every graph mutation the loop makes goes through the same ops.py functions the CLI and MCP server use, stamped with a run_id so the whole run (the claim, the objection, the decision, the ratified version, the synthesis, and all of their edges) is visible to the run-summary / diff / rollback surface. After a run finishes, the printed summary shows the run id, how many moves the agents made (a Moves: line, the honest per-move count; the debate is a single phase walk, not several adversarial rounds), the final verdict signal, and a Cross-family: line confirming which model family backed the proposer and the critic. The MCP server's run tools (summary, diff, manifest, ratify-region, rollback) can then review, audit, or undo the whole run by its id.
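
The two failure modes (fatal missing tool, survivable flaky call) can be sketched as below. This is a hedged illustration: missing_tools and run_move are hypothetical names, the runner callable stands in for whatever actually launches the CLI, and the outcome dictionary shape is an assumption.

```python
import shutil
import subprocess

def missing_tools(names=("claude", "codex")):
    """Return the CLI names that cannot be found on PATH (checked before a run starts)."""
    return [name for name in names if shutil.which(name) is None]

def run_move(runner, argv, timeout=300):
    """Run one model turn; a flaky call becomes an error outcome instead of a crash."""
    try:
        output = runner(argv, timeout)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
        return {"outcome": "error", "detail": type(exc).__name__}
    if not output.strip():
        return {"outcome": "error", "detail": "empty answer"}
    return {"outcome": "ok", "text": output}

def timing_out_runner(argv, timeout):
    # Fake runner used only for this demonstration; no process is started.
    raise subprocess.TimeoutExpired(cmd=argv, timeout=timeout)

print(run_move(timing_out_runner, ["claude", "-p", "hello"]))
print(run_move(lambda argv, timeout: "a reply", ["claude", "-p", "hello"]))
```

A caller following this shape would abort early if missing_tools() is non-empty, and otherwise keep looping over phases while collecting error outcomes.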

The cross-family setup is also auditable from the saved graph, not just trusted from the live config. Each run persists the role-to-family/model roster it was configured with under a top-level runs key in the graph store — outside every hashed node/edge payload, so no content-addressed id changes. Read it back with sparkle run-manifest <run_id> (CLI), the sparkle_run_manifest MCP tool, or the sparkle://run/{run_id}/manifest resource. Honest caveat: the manifest records the configured roster — which family each role was set up to use — not a runtime probe of which model actually answered each call. That is faithful in the shipped CLI path because the run builds its thinkers from the same config the manifest is written from. The binding per-ratification cross-family check is still the judge-time gate, recorded separately in the graph as the decision nodes and contradicts edges; the manifest is the setup roster, not that gate's result.
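
Reading the roster back from a saved graph might look like the sketch below. Only the top-level runs key is taken from the description above; the per-role layout inside it, the read_manifest helper, and the error wording are assumptions for illustration.

```python
import json

def read_manifest(store_text, run_id):
    """Look up a run's configured roster under the store's top-level 'runs' key."""
    store = json.loads(store_text)
    runs = store.get("runs", {})
    if run_id not in runs:
        raise ValueError(f"unknown run id or no manifest recorded: {run_id}")
    return runs[run_id]

# Hypothetical store contents; the real manifest fields may differ.
example_store = json.dumps({
    "nodes": {},
    "edges": {},
    "runs": {
        "music-q1": {
            "proposer": {"family": "claude"},
            "critic": {"family": "gpt"},
        }
    },
})

roster = read_manifest(example_store, "music-q1")
print(roster["proposer"]["family"], "vs", roster["critic"]["family"])
```

Because the roster sits beside the nodes and edges rather than inside them, reading or writing it leaves every content-addressed id untouched.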

Honest limits to keep in mind:

  • A live run is token-heavy and meant to be kicked off deliberately. The GPT critic runs at Codex's default xhigh reasoning effort and the whole debate is several back-and-forth AI turns, so a real run can take a while and burn real subscription tokens. Turn the critic's effort down with SPARKLE_CODEX_REASONING_EFFORT for cheaper runs; don't loop it blindly.
  • Source verification is partial, not total — the harness is not a full fact-checker. This gap used to be wide open: with every role tool-denied, the debate could assert fabricated statistics with confidence, and a made-up figure could carry a verdict. It is now partially closed. The evidence gatherer verifies against the real web and cites the URLs it actually retrieved, and the proposer, critic, judge, and synthesizer must tag anything they recall from memory as (recalled, unverified) while the judge discounts unverifiable specifics. But those four roles still reason from memory rather than looking things up, so the harness should be read as "the evidence is source-checked and recalled claims are flagged," not "every statement in the debate is verified."
  • A judge can no longer ratify a claim its own verdict rejected (now enforced in code, not just the prompt). On the autonomous path the judge's verdict vocabulary is fixed (upheld | refuted | overstated), and the engine structurally honors a settle-and-ratify request ONLY for an upheld verdict — any other verdict still records the ruling but never stamps the terminal ratified status. So a judge turn that asks to settle while its verdict rejects the claim has the settle dropped. The agent-driven MCP rule move enforces the same floor (it opts the shared seam into the harness's affirming verdict word), so an agent over MCP that rules "refuted" with settle requested is refused too. (The free-text human CLI rule path is unchanged — there a verdict is plain words like "accept" / "needs work", so the coupling can't live there.)
  • Ratification is not gated on net-positive support — by design, the judge's reasoning is sovereign. This is a different point from the one above, and they should not be conflated. Once a genuinely cross-family objection exists and the judge issues an upheld verdict that asks to settle, the judge MAY settle even when the supporting and contradicting evidence is tied (net=0): the judge weighs the debate, it does not count votes. This is a deliberate design choice, not a missing floor. The structural floor that DOES exist blocks the other direction — a judge cannot settle a claim its own verdict rejected. (A claim with zero support AND zero objections still cannot ratify at all, because the cross-family gate requires a real challenge first.)
  • The automated tests cover the engine through stubs and fakes, not live AI calls. The harness loop, the cross-family gate, the rewrite-re-challenge rule, the self-loop ban, the run-id plumbing, and the stop conditions are all proven in the stdlib test suite with a scripted stub thinker and a fake command-runner that records the exact claude / codex arguments and returns canned output — no network, no real CLI process, no subscription spend. The handful of tests that would need a live binary are skipped. No test ever invokes the real claude or codex commands; the live debate is validated by hand (three real runs on 2026-05-29), not by the suite.

How It Works

The graph model

Research is a graph of typed nodes connected by typed edges:

  • Nodes: claim, evidence, question, objection, inference, decision, synthesis
  • Edges: supports, contradicts, refines, derived_from, evaluates, produced, supersedes (the CLI validates against this set; the Python kernel accepts custom relation sets)
  • Status: active, stalled, weakly_supported, promising, abandoned, harvested, ratified (terminal — a judge's binding ruling; only a settled ruling can write it)
  • IDs: SHA-256 content hashes. The same content always produces the same ID, so the graph is tamper-evident by construction
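
Content addressing can be sketched in a few lines. The canonical JSON encoding used here (sorted keys, compact separators) is an assumption; the README does not say how Sparkle serializes a payload before hashing, and compute_id / verify are illustrative stand-ins.

```python
import hashlib
import json

def compute_id(payload):
    """SHA-256 over a canonical JSON encoding, so key order cannot change the id."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify(node_id, payload):
    """Tamper check: the stored id must match a fresh hash of the content."""
    return compute_id(payload) == node_id

claim = {"node_type": "claim", "title": "Music helps you code"}
node_id = compute_id(claim)

print(len(node_id))                       # hex digest length
print(verify(node_id, claim))             # untouched content
print(verify(node_id, {**claim, "title": "Music hurts"}))  # edited content
```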

Branch templates

Recurring research moves have templates so you don't have to remember node types and relations:

Template | Creates | Relation | Use when
support | evidence | supports | Adding evidence for a claim
objection | objection | contradicts | Challenging a claim
reframing | question | refines | Asking a better question
application | claim | derived_from | Drawing a practical conclusion

sparkle add-branch \
  --from <claim_id> --template support \
  --title "Primary source evidence" \
  --citations "https://example.com/paper"
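
The template table maps naturally onto a small lookup structure. The sketch below is illustrative: the field names follow the BranchTemplate class in the diagrams section, while the plan_branch helper and the placeholder node id are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BranchTemplate:
    name: str
    node_type: str
    relation: str

TEMPLATES = {t.name: t for t in (
    BranchTemplate("support", "evidence", "supports"),
    BranchTemplate("objection", "objection", "contradicts"),
    BranchTemplate("reframing", "question", "refines"),
    BranchTemplate("application", "claim", "derived_from"),
)}

def plan_branch(template_name, parent_id, title):
    """Describe the node and edge a template would create, as plain dicts."""
    template = TEMPLATES[template_name]
    node = {"node_type": template.node_type, "title": title}
    # The edge runs from the new branch node to the parent it was added from.
    edge = {"from": "<new node id>", "to": parent_id, "relation": template.relation}
    return node, edge

node, edge = plan_branch("support", "abc123", "Primary source evidence")
print(node["node_type"], edge["relation"])
```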

Storage

All state lives in a single human-readable JSON file (.sparkle/graph.json). Nodes and edges are keyed by content hash, so identical payloads collapse to the same ID. The store is append-only in spirit: existing nodes are never edited in place (revise supersedes them with a new version instead), so provenance remains stable as the graph grows.
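
The "identical payloads collapse" behaviour can be demonstrated with a toy store. This is a sketch under assumptions: the file layout with nodes and edges keys, the canonical encoding, and the add_node function are stand-ins, not the real GraphStore.

```python
import hashlib
import json
import os
import tempfile

def add_node(path, payload):
    """Insert a payload keyed by its content hash; re-adding it changes nothing."""
    with open(path, encoding="utf-8") as f:
        store = json.load(f)
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    node_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    store["nodes"].setdefault(node_id, payload)  # an existing entry is left untouched
    with open(path, "w", encoding="utf-8") as f:
        json.dump(store, f, indent=2, sort_keys=True)
    return node_id

def demo():
    """Add the same content twice and report (ids equal, node count)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "graph.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"nodes": {}, "edges": {}}, f)
        first = add_node(path, {"node_type": "claim", "title": "Music helps focus"})
        second = add_node(path, {"title": "Music helps focus", "node_type": "claim"})
        with open(path, encoding="utf-8") as f:
            return first == second, len(json.load(f)["nodes"])

print(demo())
```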

CLI Reference

Command What it does
init Create an empty graph store
bootstrap Seed with example nodes from the concept conversation
home Dashboard with counts, recent nodes, next actions
add-node Create a node with type, title, content, confidence (validated to 0.0-1.0), status, tags; optionally fuse a link in one call (--link-to <prefix> --relation <rel>) and register a name (--as <name>)
add-edge Link two nodes with a relation (validated against the known relation set); endpoints accept ids, short prefixes, or alias names
add-branch Templated node + edge creation for common research moves
revise Supersede a node with a corrected version and re-home its inbound edges — model-honest editing that never mutates the frozen original (--no-rehome-edges to leave edges on the old version)
import Build a graph fragment from a JSON document (file path or - for stdin); prints a nickname-to-id map. The LLM/batch on-ramp, so it is treated as untrusted: every imported status is routed through the terminal-status guard, so automated or model-generated input can't fabricate a pre-ratified (or pre-harvested/abandoned) claim. A trusted=True opt-in in ops.import_graph exists only for round-tripping a graph this tool itself exported
list-nodes List nodes through a single unified filter (type, status, tag, query, limit); --ids-only prints bare handles for fast scanning before linking
list-edges List all edges
list-templates Show available branch templates
relations Show each edge relation with a plain-English direction gloss
show Inspect a node with all its incoming and outgoing edges
tree ASCII tree of a node's immediate neighborhood
why Trace inbound provenance chain
lineage Walk all inbound ancestors (BFS)
export Export a subgraph rooted at a node to markdown
mcp Run the MCP server over stdin/stdout (requires the sparkle[mcp] extra)
run Run the autonomous adversarial loop on a seed question: Claude (Opus 4.8) proposes and judges, OpenAI's GPT-5.5 attacks, debating unattended; runs live end-to-end (requires the claude and codex CLIs on PATH and logged in). --rounds overrides the critique/gather caps (default 3 each; only single-round runs are validated live), --max-moves sets the runaway backstop, --run-id tags the whole run as one reviewable/rollback-able region. SPARKLE_CODEX_REASONING_EFFORT=low|medium|high|xhigh tunes the GPT critic's depth (codex defaults to token-heavy xhigh). The summary reports a Moves: count (the honest per-move tally, not a "rounds" label) and a Cross-family: line naming the proposer's and critic's model families
run-manifest Show the role-to-model-family roster a run was configured with (sparkle run-manifest <run_id>) — the cross-family setup audit. Records the configured roster (which family each role was set up to use), not a runtime probe of which model answered; faithful in the shipped CLI path because the run builds its thinkers from the same config. Reads the top-level runs key in the store; errors on an unknown run id or a debate from before manifests existed

All commands accept --store <path> to use a non-default graph file.
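
The breadth-first walk behind lineage can be sketched as follows. It is an illustration only: the edge dictionaries and the standalone lineage function are hypothetical, and the real command resolves prefixes and reads from the store.

```python
from collections import deque

def lineage(edges, root_id):
    """Breadth-first walk over inbound edges, returning ancestors in visit order."""
    inbound = {}
    for edge in edges:
        inbound.setdefault(edge["to"], []).append(edge["from"])

    seen = {root_id}
    order = []
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        for parent in inbound.get(current, []):
            if parent not in seen:      # guards against cycles and repeats
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

edges = [
    {"from": "a", "to": "c", "relation": "supports"},
    {"from": "b", "to": "c", "relation": "contradicts"},
    {"from": "d", "to": "a", "relation": "refines"},
]
print(lineage(edges, "c"))
```

Direct parents come out before more distant ancestors, which is what distinguishes this from a depth-first trace.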

After every add, Sparkle echoes a short 12-character handle on its own Handle: line (so you stop copying 64-character hashes) and a spoken-word link confirmation (Linked: evidence "X" --supports--> claim "Y") so you catch a backwards edge at a glance. Use explicit short prefixes or alias names when linking; there is deliberately no "last node" token, because whole-second timestamps have no tiebreaker and would silently attach an edge to the wrong node.

Architecture

Source layout

src/sparkle/
  __init__.py       # package marker, re-exports the reusable graph kernel
  models.py         # Node, Edge — frozen dataclasses with content-addressed IDs
  graph.py          # GraphStore — JSON-backed storage, traversal, subgraph, lineage export
  ops.py            # the operations contract — every action as a function returning dicts;
                    #   debate invariants, referee/transition engine, dedup gate, alias sidecar,
                    #   frontier, file lock. The single surface every front-end calls.
  cli.py            # CLI — a thin formatter; each command makes one ops.py call and prints
  mcp_server.py     # MCP server — a thin FastMCP adapter over ops.py (only loaded with the extra)
  harness.py        # autonomous engine + per-role agents + thinker contract + config;
                    #   drives the playbook via ops.* only — pure stdlib, never imports a model SDK;
                    #   enforces the cross-family gate (Claude vs GPT) and the rewrite-re-challenge rule
  thinker.py        # the live model backends — the ONLY module that shells out to a model CLI;
                    #   ClaudeCliThinker (claude -p) + CodexCliThinker (codex exec), pure stdlib
                    #   (subprocess/json/shutil/tempfile/os), no anthropic/openai package, no API key
  presentation.py   # render_tree, render_why, export_markdown — read-only rendering over a store
  templates.py      # BranchTemplate — opinionated inquiry workflows
  bootstrap.py      # seeds example graph from concept conversation
  __main__.py       # python -m sparkle entrypoint
tests/
  test_cli.py       # CLI integration tests via unittest
  test_referee.py   # referee/transition engine + edge tally
  test_debate_loop.py   # end-to-end debate loop over ops
  test_runs.py      # run tagging, summary, diff, rollback, and the run manifest (role->family roster) round-trip + id-stability
  test_mcp_server.py    # MCP adapter (skips when the mcp extra is absent): the integrity floors through the real tool closures
  test_mcp_liveproof.py # opt-in live proof: a fresh sparkle mcp server over real stdio, every floor asserted end-to-end (SPARKLE_LIVE_MCP=1 + the mcp extra; skipped in the fast suite)
  test_identity_rule.py # self-loop contradicts ban + distinct-adversary ratification floor
  test_run_plumbing.py  # run_id threaded through add-branch and rule
  test_run_completeness.py  # add_node run_id channel: proposal + synthesis (node + edges) land in the run region
  test_harness.py   # autonomous engine driven by a deterministic stub thinker (no network, no CLI)
  test_thinker_cli.py   # CLI thinkers via a fake command-runner: exact claude/codex argv + output parsing, no live calls
  test_family_gate.py   # cross-family ratification gate (Claude vs GPT) + the rewrite-re-challenge rule, via stub backends
demo/
  README.md         # demo overview with mermaid graph
  WALKTHROUGH.md    # conversational walkthrough of building a claim graph
  exported-research.md  # sparkle export output
  .sparkle/graph.json   # the demo graph (17 nodes, 19 edges)
docs/
  prd.md            # product requirements
  roadmap.md        # what's built, what's next

The ops.py seam is the structural keystone: the CLI never touches the store directly, and neither does the MCP server or the autonomous harness. All three bottom out in the same functions behind one ValueError boundary, so the debate rules (including the distinct-adversary ratification floor and the self-loop ban) are enforced in exactly one place and the front-ends stay interchangeable.

Both the author-distinct floor and the verdict-rejection floor are per-caller choices made at this seam. The two agent-driven front-ends (the autonomous harness and the MCP server) opt in: ops.rule takes a require_distinct_adversary flag (so a single agent can't ratify off a self-written objection) and an opt-in affirming_verdicts set (so a non-affirming verdict can't settle a claim). The trusted human sparkle rule CLI path leaves both off (a human is trusted to challenge their own claim honestly and to use free-text verdict words).

There is exactly one optional pip extra: sparkle[mcp] adds the MCP SDK. The autonomous run needs no pip extra at all: the engine (harness.py) and the model backends (thinker.py) are pure standard-library Python, and the live debate reaches the AI models by shelling out to the already-installed claude and codex command-line tools (no anthropic / openai package, no API key). The core, the CLI, and the autonomous engine all stay pure stdlib with zero external dependencies; the MCP server is the only component that needs the extra.
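
The per-caller opt-in can be pictured with a toy version of the rule move. The flag names require_distinct_adversary and affirming_verdicts come from the paragraph above; everything else (the other parameters, the returned status strings, the error messages, and applying the author floor only when settling) is an assumption for illustration, not the real ops.rule signature.

```python
def rule(claim_author, objection_authors, verdict, settle,
         require_distinct_adversary=False, affirming_verdicts=None):
    """Toy rule move: return the resulting status, or raise ValueError if a floor refuses."""
    if settle and require_distinct_adversary:
        if not any(author != claim_author for author in objection_authors):
            raise ValueError("claim was never challenged by a distinct author")
    if settle and affirming_verdicts is not None and verdict not in affirming_verdicts:
        raise ValueError("cannot settle a claim its own verdict rejected")
    return "ratified" if settle else "judged"

# Human CLI path: both floors off, free-text verdicts allowed.
print(rule("me", ["me"], "accept", settle=True))

# Agent-driven path: both floors on.
agent_floors = dict(require_distinct_adversary=True, affirming_verdicts={"upheld"})
print(rule("proposer", ["critic"], "upheld", settle=True, **agent_floors))
try:
    rule("proposer", ["critic"], "refuted", settle=True, **agent_floors)
except ValueError as exc:
    print("refused:", exc)
```

Keeping the floors as arguments to one shared function is what lets the three front-ends differ in strictness without duplicating the rules.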

Diagrams

Class model
classDiagram
    class Node {
      +node_type
      +title
      +content
      +citations[]
      +author
      +created_at
      +confidence
      +status
      +tags[]
      +metadata
      +to_payload()
      +compute_id()
    }

    class Edge {
      +from_id
      +to_id
      +relation
      +note
      +created_at
      +metadata
      +to_payload()
      +compute_id()
    }

    class GraphStore {
      +path
      +node_types
      +edge_relations
      +init()
      +read()
      +add_node(node)
      +add_edge(edge)
      +list_nodes()
      +list_edges()
      +resolve_id(prefix)
      +get_node(node_id)
      +get_neighbor_details(node_id)
      +lineage(root_id)
      +subgraph(root_id)
      +export_lineage(root_id)
    }

    class presentation {
      +render_tree(store, root_id)
      +render_why(store, root_id)
      +export_markdown(store, root_id, output)
    }

    class BranchTemplate {
      +name
      +node_type
      +relation
      +default_status
      +description
      +prompt_prefix
      +edge_note
    }

    GraphStore --> Node : stores
    GraphStore --> Edge : stores
    BranchTemplate --> Node : configures
    presentation --> GraphStore : reads
Add-branch sequence
sequenceDiagram
    actor User
    participant CLI as sparkle.cli
    participant Store as GraphStore
    participant Tpl as BranchTemplate
    participant JSON as graph.json

    User->>CLI: add-branch --from <claim> --template support --title ...
    CLI->>Store: resolve_id(from_prefix)
    Store->>JSON: read nodes
    JSON-->>Store: matching node id
    Store-->>CLI: parent node id
    CLI->>Store: get_node(parent_id)
    Store->>JSON: read parent node
    JSON-->>Store: parent node payload
    Store-->>CLI: parent node
    CLI->>Tpl: build_branch_node(parent_title, template, title, ...)
    Tpl-->>CLI: branch Node + template metadata
    CLI->>Store: add_node(branch_node)
    Store->>JSON: write node by content hash
    JSON-->>Store: persisted
    Store-->>CLI: branch_id
    CLI->>Store: add_edge(branch_id -> parent_id, template.relation)
    Store->>JSON: write edge by content hash
    JSON-->>Store: persisted
    Store-->>CLI: edge_id
    CLI-->>User: branch node id + branch edge id
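The add-branch flow above can be sketched in a few lines. This is an illustration under stated assumptions, not Sparkle's implementation: the hash function (SHA-256 over canonical JSON), the TinyStore class, the payload fields, and the "supports" relation name are all hypothetical. Only the method names resolve_id, add_node, add_edge, and compute_id and the write-by-content-hash idea come from the diagrams.

```python
import hashlib
import json
from typing import Dict, Tuple


def compute_id(payload: dict) -> str:
    # Assumption: the id is a SHA-256 digest of a canonical JSON payload.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TinyStore:
    """In-memory stand-in for GraphStore (hypothetical)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, dict] = {}
        self.edges: Dict[str, dict] = {}

    def resolve_id(self, prefix: str) -> str:
        matches = [node_id for node_id in self.nodes if node_id.startswith(prefix)]
        if not matches:
            raise ValueError(f"no node matches prefix {prefix!r}")
        if len(matches) > 1:
            raise ValueError(f"prefix {prefix!r} is ambiguous")
        return matches[0]

    def add_node(self, payload: dict) -> str:
        node_id = compute_id(payload)
        # Writing by content hash makes a repeated write a no-op.
        self.nodes.setdefault(node_id, payload)
        return node_id

    def add_edge(self, from_id: str, to_id: str, relation: str) -> str:
        if from_id not in self.nodes or to_id not in self.nodes:
            raise ValueError("edge endpoints must exist")
        payload = {"from_id": from_id, "to_id": to_id, "relation": relation}
        edge_id = compute_id(payload)
        self.edges.setdefault(edge_id, payload)
        return edge_id


def add_branch(
    store: TinyStore, from_prefix: str, title: str, relation: str
) -> Tuple[str, str]:
    # Mirrors the sequence: resolve parent, write node, then write edge.
    parent_id = store.resolve_id(from_prefix)
    branch_id = store.add_node({"node_type": "evidence", "title": title})
    edge_id = store.add_edge(branch_id, parent_id, relation)
    return branch_id, edge_id
```

Because ids are derived from content in this sketch, running the same add_branch call twice yields the same node id and edge id and leaves the store unchanged.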
Node status model
stateDiagram-v2
    [*] --> active
    active --> promising
    active --> stalled
    active --> weakly_supported
    weakly_supported --> promising
    weakly_supported --> abandoned
    stalled --> active
    stalled --> abandoned
    promising --> harvested
    promising --> active
    harvested --> active
    active --> abandoned
    abandoned --> active
    active --> ratified : settled ruling (rule --settle)
    ratified --> [*]

ratified is a terminal status reached only through a judge's settled ruling, never set freehand. Because nodes are frozen, ratification never mutates a node in place: the ruling writes a decision node plus a superseding claim version that carries status=ratified and a supersedes edge to the original. harvested and abandoned are likewise terminal in spirit; a model-authored write may not set any of the three directly.
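The status rules above can be sketched as a transition table plus a ratify step that never mutates the original node. This is a minimal sketch: the table is transcribed from the state diagram, while the simplified Node class, the function names, and the tuple used for the supersedes edge are hypothetical and not Sparkle's actual API.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet, Tuple

# Transitions transcribed from the state diagram.
ALLOWED: Dict[str, FrozenSet[str]] = {
    "active": frozenset(
        {"promising", "stalled", "weakly_supported", "abandoned", "ratified"}
    ),
    "weakly_supported": frozenset({"promising", "abandoned"}),
    "stalled": frozenset({"active", "abandoned"}),
    "promising": frozenset({"harvested", "active"}),
    "harvested": frozenset({"active"}),
    "abandoned": frozenset({"active"}),
    "ratified": frozenset(),  # terminal
}

# Statuses a model-authored write may not set directly.
MODEL_FORBIDDEN = frozenset({"ratified", "harvested", "abandoned"})


@dataclass(frozen=True)
class Node:  # simplified, hypothetical version of the Node class
    node_type: str
    title: str
    status: str = "active"


def can_transition(current: str, target: str) -> bool:
    return target in ALLOWED.get(current, frozenset())


def check_model_write(node: Node) -> None:
    if node.status in MODEL_FORBIDDEN:
        raise ValueError(f"a model-authored write may not set status={node.status}")


def ratify(claim: Node, ruling: str) -> Tuple[Node, Node, Tuple[Node, str, Node]]:
    """Return (decision node, superseding claim version, supersedes edge)."""
    if not can_transition(claim.status, "ratified"):
        raise ValueError(f"cannot ratify a claim with status={claim.status}")
    decision = Node(node_type="decision", title=ruling)
    # replace() builds a new frozen instance; the original stays untouched.
    ratified = replace(claim, status="ratified")
    return decision, ratified, (ratified, "supersedes", claim)
```

The original claim keeps its status after ratify returns, which is the point of realizing ratification as a new version rather than an update.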

Testing

python3 -m unittest discover -s tests -v

The base suite is pure stdlib unittest and needs no third-party packages; run it from the repo root. It covers:

  • Core graph: init, bootstrap, node/edge CRUD, branch templates, lineage, and content-addressing idempotency.
  • Rendering and listing: show/tree/why rendering, filtered listing (type/status/tag/query/limit), list-nodes --ids-only, the home dashboard, and n/a confidence rendering.
  • Export: markdown export, lineage-only vs full-component export, and dangling-edge export safety.
  • Custom kernels: custom node-type/relation graph kernels with metadata.
  • Error handling: corrupt-store handling, invalid lookups, ambiguous prefixes, unknown relations, and out-of-range confidence.
  • The operations layer: the ratified terminal status, the fused create-and-link path including the paired-or-error guard, alias registration and re-pointing on reuse, the ruling invariant that refuses to ratify an unchallenged claim, the content-fingerprint dedup gate, JSON import from stdin with invalid-JSON rejection, revise with and without re-homing inbound edges, the referee/transition engine, and run tagging/summary/diff/rollback.

The Phase 2 autonomous harness is covered by dedicated test modules:

  • Distinct-adversary floor (test_identity_rule.py): the self-loop contradicts ban at the edge-write boundary, and the require_distinct_adversary ratification gate — refused when the only objection's author equals the claim's author, allowed when a different author objects.
  • Run-id plumbing (test_run_plumbing.py): a run_id threaded through add-branch and rule shows up on the objection and the ruling, so a full autonomous loop is captured by the run-summary surface.
  • Autonomous engine (test_harness.py): a deterministic stub thinker (no network, no CLI) drives the full propose -> object -> rule loop; the run summary shows the claim, objection, and decision; the judge is refused on a self-strawman and succeeds with a cross-family critic; the hard move cap and the done/stop move both end the loop; and HarnessConfig raises when the critic's model family equals the proposer's. A guard test confirms the engine module never imports a model SDK. The run result reports moves_made (the honest per-move count, never a misleading rounds_run key) and equals the recorded move count, and the engine persists a cross-family run manifest — pinning the honest limitation that the manifest reflects the configured role-to-family roster, not a runtime probe of which thinker answered.
  • CLI thinkers (test_thinker_cli.py): an injected fake command-runner records the exact claude / codex argument list and returns canned output, so the tests pin the argv shape (the Opus-4.8 model id, JSON output mode, the deny-all no-tools lockdown flag, the read-only sandbox, the isolated working dir, the output-file flag, and the optional reasoning-effort override) and the output parsing (claude's JSON event array, codex's last-message file) without ever spawning a real process. They also pin the one deny-all exception: the evidence gatherer (web_search=True) gets the web search/fetch pair and only that pair (each of WebSearch/WebFetch appears twice — once to restrict availability via --tools, once to pre-approve via --allowedTools), while a default role stays deny-all with no web tools at all. Timeouts, non-zero exits, empty output, and malformed JSON all raise rather than returning a fake answer. The stub thinker is covered too. No test invokes the real CLIs.
  • Cross-family gate + rewrite-re-challenge (test_family_gate.py): stub backends tagged with families prove the judge is refused unless an objection comes from a different model family than the claim's author, and that a carried-over objection on a revised claim no longer counts — so a rewritten claim must earn a fresh cross-family attack before it can be ratified.
  • Run completeness (test_run_completeness.py): the run_id channel on add_node stamps both the new node and the fused link edge, so a full loop's proposal and synthesis (nodes and their edges) land in the run region and are visible to the run-summary / diff / rollback surface; an un-run human write stays byte-identical (no run stamp leaks onto any edge).
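The injected command-runner pattern described for the CLI thinker tests can be sketched as follows. This is an assumption-laden illustration: CliThinker, FakeRunner, the fake-model-cli argv, and the {"answer": ...} output shape are all hypothetical, and the real thinkers parse claude's JSON event array and codex's last-message file instead.

```python
import json
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


class CliThinker:
    """Hypothetical thinker that reaches a model through an injected runner."""

    def __init__(self, base_argv: List[str], runner: Runner) -> None:
        self.base_argv = base_argv
        self.runner = runner

    def ask(self, prompt: str) -> str:
        result = self.runner([*self.base_argv, prompt])
        # Every failure raises; the thinker never invents a fallback answer.
        if result.returncode != 0:
            raise RuntimeError(f"command exited with {result.returncode}")
        text = (result.stdout or "").strip()
        if not text:
            raise RuntimeError("command produced no output")
        try:
            payload = json.loads(text)
        except json.JSONDecodeError as exc:
            raise RuntimeError("command produced malformed JSON") from exc
        if not isinstance(payload, dict) or "answer" not in payload:
            raise RuntimeError("command output has no answer field")
        return payload["answer"]


class FakeRunner:
    """Records each argv and returns canned output; spawns no process."""

    def __init__(self, stdout: str = "", returncode: int = 0) -> None:
        self.stdout = stdout
        self.returncode = returncode
        self.calls: List[List[str]] = []

    def __call__(self, argv: Sequence[str]) -> subprocess.CompletedProcess:
        self.calls.append(list(argv))
        return subprocess.CompletedProcess(
            args=list(argv), returncode=self.returncode, stdout=self.stdout
        )
```

A test then asserts on fake.calls to pin the argv shape and on the raised errors to pin the failure handling, without touching the network or a real CLI.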

The MCP adapter tests (test_mcp_server.py) skip when the sparkle[mcp] extra is absent. They prove the integrity floors fire through the real build_app() tool closures: a pre-challenge ratify is refused, a same-author (self-strawman) objection can't ratify, and a non-affirming verdict can't settle.

There is also an opt-in live end-to-end proof (test_mcp_liveproof.py) that goes one level further: it spawns a fresh sparkle mcp server as a subprocess, connects to it as a real MCP client over stdio, and drives a full debate while asserting every floor fires through the actual transport + server + ops seam — pre-challenge ratify refused, self-strawman refused, the verdict-rejection floor (a "refuted" verdict with settle requested is refused), the honest judged-vs-ratified signal, a clean cross-author "upheld" ruling that does ratify, and the evidence/harvest edges landing in the run region. It is slower (it spawns a process), so it is skipped unless SPARKLE_LIVE_MCP=1 is set and the sparkle[mcp] extra is installed; the fast suite leaves it out.
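An opt-in gate like the one described above can be written with the standard unittest skip decorator. The sketch below is illustrative only: SPARKLE_LIVE_MCP and the sparkle[mcp] extra come from the text, while the helper name live_enabled, the class name, the placeholder test body, and the assumption that the extra is importable as the mcp package are mine.

```python
import importlib.util
import os
import unittest
from typing import Mapping


def live_enabled(environ: Mapping[str, str], have_mcp: bool) -> bool:
    # Both conditions must hold: the explicit opt-in flag and the optional SDK.
    return environ.get("SPARKLE_LIVE_MCP") == "1" and have_mcp


# Assumption: the MCP SDK installed by the extra is importable as "mcp".
HAVE_MCP = importlib.util.find_spec("mcp") is not None


@unittest.skipUnless(
    live_enabled(os.environ, HAVE_MCP),
    "set SPARKLE_LIVE_MCP=1 and install sparkle[mcp] to run the live proof",
)
class LiveProofSketch(unittest.TestCase):
    def test_placeholder(self) -> None:
        # The real test would spawn the server and drive a debate over stdio.
        self.assertTrue(True)
```

Keeping the decision in a small pure function makes the gate itself testable without setting environment variables.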

No test makes a live AI call or spawns the real claude / codex commands — the live cross-family debate runs end-to-end (validated by hand across three real debates on 2026-05-29) but is token-heavy, so it is checked manually rather than by the suite.

Current Limits

  • The live cross-family debate runs end-to-end (validated across three real debates on 2026-05-29) but is token-heavy and meant to be kicked off deliberately. sparkle run shells out to the claude and codex command-line tools (you must have both installed and logged in; no API key or pip extra is needed). The GPT critic runs at high reasoning effort by default and is token-heavy on the Codex/ChatGPT side, and the debate is several AI turns, so a real run burns real subscription tokens; turn the critic's effort down with SPARKLE_CODEX_REASONING_EFFORT for cheaper runs. The automated suite proves the engine against a deterministic stub thinker and a fake command-runner; the live path is validated by hand, not by the test suite.
  • Ratification is not gated on net-positive support, by design — given that a cross-family objection exists, a judge may issue an upheld verdict and settle even on tied evidence (net=0), because the judge weighs the debate rather than counting votes. This is a deliberate design choice (the judge's reasoning is sovereign), not a missing floor. The floor that IS structural runs the other way: a judge can no longer ratify a claim its own verdict rejected, because the engine honors settle only for an upheld verdict (and a claim with no challenge at all still can't ratify, since the cross-family gate requires a real objection first). The cross-family setup is now auditable from the saved graph via sparkle run-manifest <run_id>, with the honest caveat that the manifest records the configured roster, not a runtime probe of which model answered each call. Multi-round runs (--rounds > 1) are available but not yet validated live.
  • A deduped re-proposal of the same seed stays attributed to its first run — if a later run re-proposes a byte-identical claim, the existing (frozen) node is returned and keeps the first run's run_id. This is an accepted, documented gap: re-stamping it would require mutating a frozen node, which the model forbids.
  • The MCP path's distinct-adversary floor is author-string level, not true independence. An agent driving Sparkle over MCP gets the same floors as the autonomous loop (a self-written objection can't ratify a claim, and a non-affirming verdict can't settle one). But because the MCP operator pattern is one agent playing every role, the objection's author is just whatever role string the agent passes, so a single agent could relabel itself to fake an adversary's name. This is an accepted boundary, not a bug: genuine adversary independence — a truly different model family attacking the claim — comes from the cross-family CLI leg (sparkle run), where the critic is a different company's model than the proposer and the judge-time gate checks model family, not just an author name.
  • One front-end at a time per graph — an advisory file lock guards each read-modify-write, but there is no multi-writer guarantee beyond that.
  • Source verification on the autonomous path is partial, not total. The evidence-gathering role searches the real web and cites the URLs it retrieved, and every other role must flag any recalled figure as (recalled, unverified) while the judge discounts unverifiable specifics — so a fabricated statistic can no longer silently carry a verdict. But the proposer, critic, judge, and synthesizer still reason from memory, so the harness is a source-checked-evidence debate, not a full fact-checker of every statement.
  • Citations are flat strings — no structured source metadata
  • No first-class introspection commands — the frontier covers "what needs attention" via MCP, but gaps/tensions/stale/orphans are not yet standalone CLI commands
  • No UI — terminal only
  • Semantic dedup is out of scope — exact-fingerprint re-proposals auto-dedup, but near-duplicates are not detected (true fuzzy matching would force a model dependency)
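For the advisory file lock mentioned in the limits above, here is a minimal sketch of a guarded read-modify-write. It assumes a POSIX system and uses fcntl.flock on a sidecar lock file; the text does not say which locking mechanism Sparkle uses, and the function names, the .lock and .tmp suffixes, and the graph layout are hypothetical.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def advisory_lock(graph_path: str) -> Iterator[None]:
    # Advisory only: it protects writers that also take this lock.
    with open(graph_path + ".lock", "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def update_graph(graph_path: str, mutate: Callable[[dict], None]) -> dict:
    """Read, mutate, and rewrite the graph file while holding the lock."""
    with advisory_lock(graph_path):
        try:
            with open(graph_path, encoding="utf-8") as handle:
                graph = json.load(handle)
        except FileNotFoundError:
            graph = {"nodes": {}, "edges": {}}
        mutate(graph)
        # Write to a temp file, then swap it in with one atomic rename.
        tmp_path = graph_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            json.dump(graph, handle)
        os.replace(tmp_path, graph_path)
        return graph


def demo() -> int:
    # Two sequential updates against a throwaway graph file.
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "graph.json")
        update_graph(path, lambda g: g["nodes"].update({"a": {"title": "first"}}))
        graph = update_graph(
            path, lambda g: g["nodes"].update({"b": {"title": "second"}})
        )
        return len(graph["nodes"])
```

A lock of this kind serializes cooperating writers on one machine, which matches the stated limit: it is a guard for each read-modify-write, not a multi-writer guarantee.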

About

Local-first solo research tool that stores claims, evidence, objections, questions, and syntheses in a Merkle-style graph on content-addressed storage.
