Skip to content

monahand1023/corpus

Repository files navigation

corpus-rag

PyPI Python CI License: MIT

Your personal archive — notes, PDFs, docs — queryable in plain English, running entirely on your machine.

A personal knowledge system shouldn't require a vector database service, a SaaS subscription, or trusting your files to someone else's cloud. corpus is one Python process, one SQLite file, one MCP server. Add a corpus.toml, point it at your data, run corpus-ingest, and Claude Code can search years of notes in under 300ms.

How it works

flowchart LR
    subgraph Ingest
      Src["Notes · PDF · HTML · text"] --> Ch["Chunker"]
      Ch --> Emb["Local embeddings"]
      Emb --> DB[("SQLite<br/>vectors + BM25 FTS")]
    end
    subgraph Query
      Q["Plain-English question"] --> Hy["Hybrid search<br/>semantic + BM25 · auto-fused"]
      DB --> Hy
      Hy --> RR["Cross-encoder re-rank<br/>(local BGE)"]
      RR --> Exp["Multi-hop expand_context"]
    end
    Exp --> MCP["7-tool MCP server"]
    MCP --> CC["Claude Code"]
Loading

Point it at any directory of markdown / PDF / HTML / text files and get:

  • Semantic + BM25 hybrid search with auto-tuned fusion weights
  • Source-diversity-aware retrieval (no single doc floods top-K)
  • Multi-hop reference chasing via expand_context
  • Optional cross-encoder re-ranker (local, BGE)
  • Optional per-document Claude-Haiku summaries
  • Seven MCP tools wired into Claude Code over stdio

Stack: Python 3.12 • Voyage or Gemini embeddings • SQLite + sqlite-vec • FastMCP. No AWS, no Docker, no Terraform.

PyPI CI License


Quick start

# 1. Install
pip install corpus-rag                # base
pip install 'corpus-rag[all]'         # + reranker, summarizer, pdf, html, gemini

# 2. Interactive setup wizard — generates corpus.toml + .env
corpus-init

# 3. Paste your VOYAGE_API_KEY (free tier covers ~200M tokens) into .env
#    Sign up at https://dash.voyageai.com/  — or pick Gemini in the wizard
#    to use Google AI Studio's free tier instead.

# 4. Run the first ingest
corpus-ingest --source notes -v

# 5. Try it from the CLI
corpus-query "the question you wish you could ask your archive"

# 6. Wire it to Claude Code or Claude Desktop — see "MCP server" below

corpus-init walks you through 5 prompts (data path, format, embedder provider, etc.) and writes a working corpus.toml. No need to hand-edit anything to get started.

Configuration

Everything that varies between deployments lives in corpus.toml. The wizard generates a starter file; edit by hand from there.

[corpus]
db_path = "./corpus.db"

[embedder]
provider = "voyage"           # or "gemini"
model = "voyage-3-large"
dim = 1024                    # must match the model's output dim

[retriever]
top_k = 5
max_per_source_type = 3       # diversity cap
hybrid = true                 # vector + BM25 via RRF

[[sources]]
name = "notes"                # free-form; used as source_type everywhere
type = "markdown"             # which built-in connector to use
path = "~/Documents/notes"
glob = "**/*.md"

[[references]]
# Optional. When set, `expand_context` chases these patterns across docs and
# the BM25 weight auto-tunes higher when the user's query contains a match.
pattern = '\b[A-Z]{2,}-\d+\b'
source_type = "tickets"
description = "Jira-style ticket keys"

Schema hazard: changing embedder.dim after data has been ingested would silently corrupt retrieval. corpus validates the dim against the existing schema at startup and refuses to proceed on mismatch.

MCP server

Wire corpus into Claude Code or Claude Desktop — both use stdio and the same config format. Pass the absolute path to your corpus.toml via --config — the client spawns the MCP server from an arbitrary CWD, so a relative path won't reliably find your config.

Claude Code — add to ~/.claude.json:

{
  "mcpServers": {
    "corpus": {
      "type": "stdio",
      "command": "corpus-mcp",
      "args": ["--config", "/absolute/path/to/your/corpus.toml"],
      "env": {}
    }
  }
}

After pip install corpus-rag, corpus-mcp is on your PATH. The client spawns it on demand.

Claude Desktop — add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):

{
  "mcpServers": {
    "corpus": {
      "command": "uv",
      "args": [
        "--directory", "/absolute/path/to/your/corpus",
        "run", "corpus-mcp"
      ]
    }
  }
}

Use an absolute path to uv if it's not on the client's PATH. See docs/mcp_integration.md for more detail, including multiple-corpus setups.

Seven tools exposed:

Tool Purpose
search_knowledge Hybrid BM25+vector search with dedupe + diversity
expand_context Chase references from a chunk — siblings, cited docs, parent
get_doc Pull every chunk of a specific document
timeline Search results reordered chronologically
recent_activity Chunks updated in the last N days
get_summary Cached Claude-Haiku summary (after running corpus-summarize)
corpus_stats Health check — total chunks + per-source counts

The investigation pattern is the high-leverage flow: Claude calls search_knowledge to find entry points, then expand_context on the top result to pull in adjacent material (other chunks of the same doc, referenced doc IDs, parent links), then synthesizes from the full picture.

CLI reference

corpus-init                              # interactive setup wizard
corpus-list                              # show configured sources + chunk counts
corpus-ingest --source notes -v          # ingest one source
corpus-ingest --all                      # ingest everything in corpus.toml
corpus-query "your question" -k 10       # ad-hoc search
corpus-query "question" --source notes   # source-filtered
corpus-query "question" --rerank         # local BGE reranker (opt-in)
corpus-eval --queries my_queries.py      # recall@k against your queries
corpus-benchmark --runs 20               # latency profile
corpus-benchmark --compare voyage gemini # embed-latency A/B
corpus-summarize --source notes --dry-run    # estimate Haiku spend
corpus-summarize --source notes              # run it
corpus-reset --source notes              # drop one source's chunks
corpus-reset --all                       # delete the whole DB
corpus-mcp                               # stdio MCP server (Claude spawns it)

Built-in connectors

type Default glob Extra needed Notes
markdown **/*.md YAML frontmatter parsed (title, id, url, dates)
text **/*.txt Plain text; title from filename stem
pdf **/*.pdf pip install 'corpus-rag[pdf]' Uses pypdf. Scanned PDFs need OCR first.
html **/*.{html,htm} pip install 'corpus-rag[html]' Uses trafilatura for boilerplate-stripped main-content extraction

Adding a new source type

For Slack exports, JSON dumps, an internal API archive, EPUB books — write your own connector. See docs/adding_a_source.md for the walkthrough with a worked JSON-files example.

Eval

corpus-eval runs hand-written known-answer queries against the live corpus and reports recall@K. It's a regression signal — run it after changing chunking, switching embedders, or tweaking retrieval.

Write your queries in any Python file that defines EVAL_QUERIES, then pass --queries path/to/your_queries.py:

# my_queries.py
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EvalQuery:
    query: str
    expected_keys: list[str] = field(default_factory=list)
    source_filter: list[str] | None = None
    note: str = ""

EVAL_QUERIES = [
    EvalQuery(
        query="how does the payment flow work?",
        expected_keys=["payment-design-doc"],
        note="paraphrased to stress semantic retrieval",
    ),
    # add more...
]
corpus-eval --queries my_queries.py --top-k 5     # baseline
corpus-eval --queries my_queries.py --rerank      # with the BGE reranker
corpus-eval --queries my_queries.py --no-hybrid   # vector-only baseline

Tips: paraphrase away from doc titles to stress semantic retrieval; list multiple expected_keys when several docs are valid answers; add a few negative queries (empty expected_keys) to confirm the corpus correctly fails on absent topics.

Benchmarking

corpus-benchmark measures per-stage retrieval latency (embed / vector / FTS / fusion / dedupe) with p50/p95/p99 + throughput.

corpus-benchmark --runs 20                      # latency profile
corpus-benchmark --queries my_queries.py        # use your own query set
corpus-benchmark --compare voyage gemini        # embed-latency A/B
corpus-benchmark --json out.json

Typical profile on an M-series Mac, few-thousand-chunk corpus: embed dominates at 100–300ms (provider API round-trip), while vector_search / fts_search / fusion / dedupe are collectively under ~5ms. The optimization lever is "fewer or concurrent embed calls," not "faster SQLite." If vector_search exceeds ~50ms you've outgrown brute-force vec0 (~100K chunks) and want HNSW indexing.

--compare measures embedder-API latency only — it does not compare retrieval quality, because two providers' vectors aren't comparable against one DB. For quality, ingest each provider into its own corpus and run corpus-eval against each.

Documentation

Doc What it covers
docs/configuration.md Every corpus.toml setting + env var, including the Voyage-vs-Gemini embedder choice
docs/mcp_integration.md Claude Code + Claude Desktop wiring, all 7 tools, the investigation pattern
docs/adding_a_source.md Walkthrough for writing a custom connector
docs/troubleshooting.md Common problems and the actual fixes

Architecture overview, benchmarking, and eval methodology are covered inline in this README (sections above).

Develop locally

Want to hack on the framework, write a new connector, or run the tests? Clone and use uv:

git clone https://github.com/monahand1023/corpus.git
cd corpus
uv sync                              # creates .venv with all deps
uv run pytest tests/ -q              # run the suite
uv run ruff check src/ tests/        # lint
uv run corpus-init                   # the CLI scripts are also available via `uv run`

The repo includes examples/sample_corpus/ (synthetic markdown notes) and examples/corpus.toml.example (wired to point at it) for try-before-you-config experiments.

License

MIT — see LICENSE.

About

Ask natural-language questions over your own notes, PDFs, and docs. Hybrid semantic+keyword search, auto-tuned fusion, multi-hop reference expansion, and a 7-tool MCP server for Claude Code. Runs entirely locally on SQLite — no Docker, no SaaS. `pip install corpus-rag`

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages