mayankkohaley/highlighter

highlighter

Glow big or go home.

Status: experiment. Interfaces, scope, and behavior will change without notice.

A researcher reading a dense academic paper — or a student working through a textbook — reaches for a neon ink pen and marks the passages that answer the question they're holding in their head. They don't paraphrase on the fly; they highlight the verbatim text and come back to synthesize once the full picture is on the page.

highlighter is the digital counterpart. It reads a document with a query in mind, pulls out verbatim excerpts as it goes, and only then synthesizes an answer grounded in those highlights. Every claim in the final answer points back to a quoted span in the source — no paraphrasing, no invention.
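
The "verbatim only" rule can be pictured as a substring check: an excerpt counts as a highlight only if it appears character-for-character in the source. The sketch below is a minimal illustration, not the project's actual verification code; the `Highlight` type and `locate_verbatim` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Highlight:
    """A quoted span located by character offsets (hypothetical type)."""
    text: str
    start: int
    end: int


def locate_verbatim(source: str, excerpt: str) -> Optional[Highlight]:
    """Return the span where `excerpt` occurs verbatim in `source`, else None."""
    if not excerpt:
        return None
    start = source.find(excerpt)
    if start == -1:
        # Paraphrased or invented text never matches, so it is rejected.
        return None
    return Highlight(text=excerpt, start=start, end=start + len(excerpt))
```

An excerpt that fails this check would be dropped rather than repaired, which is what keeps the final answer tied to text that really is on the page.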

For now, we're accepting higher latency in exchange for higher accuracy. The sweet spot is deep questions over a single document, where the answer has to be assembled from many passages rather than retrieved as one. Markdown input only.

Usage

Ask a question, get cited excerpts:

uv run python -m highlighter <markdown-file> -q "your question"

Add --synthesize to also get a short grounded answer that cites the excerpts by number.
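
Citing excerpts by number implies that every number in the answer should resolve to a real excerpt. Here is a rough sketch of that check. It assumes citations look like `[1]`, `[2]`, which this README does not confirm; `cited_numbers` and `dangling_citations` are hypothetical helpers, not part of highlighter.

```python
import re


def cited_numbers(answer: str) -> list[int]:
    """Collect the distinct citation numbers found in the answer, sorted."""
    # Assumed marker format: a bare integer in square brackets, e.g. "[3]".
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})


def dangling_citations(answer: str, excerpts: list[str]) -> list[int]:
    """Return cited numbers that do not map to any excerpt (1-based)."""
    return [n for n in cited_numbers(answer) if not 1 <= n <= len(excerpts)]
```

An empty result from `dangling_citations` means every citation in the answer points at an excerpt that was actually extracted.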

| Arg / flag | Required | Default | Notes |
| --- | --- | --- | --- |
| <markdown-file> | yes | | Path to a markdown document. |
| -q, --question | yes | | The question to ask. |
| --chunk-size | no | 2000 | Tokens per chunk. |
| --chunk-overlap | no | 200 | Token overlap between consecutive chunks. |
| --synthesize | no | off | Run the final LLM synthesis step. |
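
To see how --chunk-size and --chunk-overlap interact, here is a minimal sliding-window sketch. It assumes the document has already been split into a list of tokens; the tokenizer highlighter uses is not stated here, and `chunk_tokens` is a hypothetical function rather than the project's chunker.

```python
def chunk_tokens(
    tokens: list[str],
    chunk_size: int = 2000,
    chunk_overlap: int = 200,
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults, each chunk starts 1800 tokens after the previous one, so a passage that straddles a chunk boundary still appears whole in at least one chunk as long as it is no longer than the overlap.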

Requires an API key for the configured provider (default: anthropic:claude-haiku-4-5-20251001, so ANTHROPIC_API_KEY must be set).

Evals

Two tracks measure extraction quality. Both report precision / recall / F1 via substring match against fixture-provided expected_excerpts, and both support a baseline + regression gate.

  • Chunk-level — pins one chunk per case and measures the extractor in isolation, bypassing query expansion and chunk selection.
  • Pipeline — runs the full pipeline end-to-end against a whole document, so generated sub-questions and rubric are part of the score.

# Chunk-level
uv run python -m evals                    # all cases, one run each
uv run python -m evals --case <name>      # single case by name
uv run python -m evals --runs 3           # repeat each N times
uv run python -m evals --debug            # also print prompt + raw LLM output
uv run python -m evals --write-baseline   # record current mean scores
uv run python -m evals --check-baseline   # gate: exit non-zero on regression

# Pipeline
uv run python -m evals.pipeline                 # all cases, one run each
uv run python -m evals.pipeline --case <name>
uv run python -m evals.pipeline --runs 3
uv run python -m evals.pipeline --debug         # also print generated sub-Qs + rubric
uv run python -m evals.pipeline --write-baseline
uv run python -m evals.pipeline --check-baseline
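
Both tracks score by substring match against expected_excerpts. A minimal sketch of that scoring follows. It assumes a predicted excerpt and an expected excerpt match when either one contains the other; the exact matching rule in evals may differ, and `score` is a hypothetical name.

```python
def _matches(predicted: str, expected: str) -> bool:
    """Assumed rule: a match means one string contains the other."""
    return predicted in expected or expected in predicted


def score(predicted: list[str], expected: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for one eval case."""
    # Precision: share of predicted excerpts that hit some expected excerpt.
    hits_pred = sum(any(_matches(p, e) for e in expected) for p in predicted)
    # Recall: share of expected excerpts covered by some predicted excerpt.
    hits_exp = sum(any(_matches(p, e) for p in predicted) for e in expected)

    precision = hits_pred / len(predicted) if predicted else 0.0
    recall = hits_exp / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, `score(["the cat sat on the mat", "dogs bark loudly"], ["cat sat", "birds sing"])` gives precision 0.5 and recall 0.5: one of the two predictions is supported, and one of the two expected excerpts is found.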

Baselines live in evals/baseline.json (chunk-level, tolerance 0.02) and evals/baseline-pipeline.json (pipeline, tolerance 0.05 — wider because generation variance stacks on top of extraction).
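
The regression gate can be thought of as a per-metric comparison with slack. The sketch below assumes a baseline is a flat mapping of metric name to mean score and that a regression means falling more than the tolerance below the baseline; the real layout of the baseline files is not documented here, and `regressions` is a hypothetical helper.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float,
) -> dict[str, tuple[float, float]]:
    """Map each regressed metric to its (baseline, current) scores."""
    failed: dict[str, tuple[float, float]] = {}
    for name, old in baseline.items():
        new = current.get(name, 0.0)  # a missing metric counts as a zero score
        if new < old - tolerance:
            failed[name] = (old, new)
    return failed
```

A gate built on this would exit non-zero whenever the returned mapping is non-empty. With tolerance 0.02, an F1 that drifts from 0.80 to 0.79 passes, while a recall that drops from 0.90 to 0.84 fails.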

Testing

uv run pytest                             # full suite
uv run pytest tests/test_consolidate.py   # single file
uv run pytest -k synthesize               # filter by keyword

LLM-driven tests stub the pydantic-ai agent with a FunctionModel via agent.override(...), so the suite runs offline — no API key required. Internal logic (verification, citation mapping, consolidation) is never stubbed.

Development

uv sync
uv run ruff check .
