Glow big or go home.
Status: experiment. Interfaces, scope, and behavior will change without notice.
A researcher reading a dense academic paper — or a student working through a textbook — reaches for a neon ink pen and marks the passages that answer the question they're holding in their head. They don't paraphrase on the fly; they highlight the verbatim text and come back to synthesize once the full picture is on the page.
highlighter is the digital counterpart. It reads a document with a query
in mind, pulls out verbatim excerpts as it goes, and only then synthesizes
an answer grounded in those highlights. Every claim in the final answer
points back to a quoted span in the source — no paraphrasing, no
invention.
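The "verbatim only" guarantee can be pictured as a plain substring check. The sketch below is illustrative and not highlighter's actual code: `find_span` and `keep_verbatim` are hypothetical names, and exact string matching is an assumption about how verification might work.

```python
from typing import List, Optional, Tuple


def find_span(source: str, excerpt: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of excerpt inside source, or None."""
    if not excerpt:
        return None  # an empty excerpt cannot ground anything
    start = source.find(excerpt)
    if start == -1:
        return None  # not verbatim: paraphrased or invented
    return start, start + len(excerpt)


def keep_verbatim(source: str, excerpts: List[str]) -> List[Tuple[str, Tuple[int, int]]]:
    """Keep only excerpts that occur verbatim in source, with their spans."""
    kept = []
    for excerpt in excerpts:
        span = find_span(source, excerpt)
        if span is not None:
            kept.append((excerpt, span))
    return kept
```

An excerpt that fails the check is dropped before synthesis, so nothing the model paraphrased can reach the final answer.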
For now, highlighter trades latency for accuracy. The sweet spot is deep questions over a single document, where the answer has to be assembled from many passages rather than retrieved as one. Only Markdown input is supported.
Ask a question, get cited excerpts:
```sh
uv run python -m highlighter <markdown-file> -q "your question"
```
Add --synthesize to also get a short grounded answer that cites the
excerpts by number.
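As a rough illustration of citing excerpts by number, the sketch below resolves citation markers in an answer back to the excerpt list. The `[n]` marker format, the 1-based numbering, and the `cited_excerpts` helper are all assumptions made for this example, not highlighter's documented output format.

```python
import re
from typing import Dict, List

# Assumed citation marker: a 1-based excerpt number in square brackets, e.g. "[2]".
CITATION = re.compile(r"\[(\d+)\]")


def cited_excerpts(answer: str, excerpts: List[str]) -> Dict[int, str]:
    """Map every citation number found in answer to the excerpt it points at."""
    mapping = {}
    for match in CITATION.finditer(answer):
        number = int(match.group(1))
        if not 1 <= number <= len(excerpts):
            raise ValueError(f"citation [{number}] has no matching excerpt")
        mapping[number] = excerpts[number - 1]
    return mapping
```

Raising on an out-of-range number is one possible design choice: a dangling citation is treated as an error rather than silently ignored.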
| Arg / flag | Required | Default | Notes |
|---|---|---|---|
| `<markdown-file>` | yes | none | Path to a markdown document. |
| `-q`, `--question` | yes | none | The question to ask. |
| `--chunk-size` | no | 2000 | Tokens per chunk. |
| `--chunk-overlap` | no | 200 | Token overlap between consecutive chunks. |
| `--synthesize` | no | off | Run the final LLM synthesis step. |
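To show how `--chunk-size` and `--chunk-overlap` interact, here is a minimal sliding-window sketch. It assumes the document has already been tokenized into a list; `chunk_tokens` is a hypothetical helper, and the real tokenizer and boundary handling may differ.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence,
    chunk_size: int = 2000,
    chunk_overlap: int = 200,
) -> List[list]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults, each window advances by 1800 tokens, so a passage that straddles a chunk boundary still appears whole in at least one chunk as long as it is no longer than the overlap.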
Requires an API key for the configured provider (default:
anthropic:claude-haiku-4-5-20251001, so ANTHROPIC_API_KEY must be
set).
Two tracks measure extraction quality. Both report precision / recall / F1
via substring match against fixture-provided expected_excerpts, and both
support a baseline + regression gate.
- Chunk-level — pins one chunk per case and measures the extractor in isolation, bypassing query expansion and chunk selection.
- Pipeline — runs the full pipeline end-to-end against a whole document, so generated sub-questions and rubric are part of the score.
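The scoring can be sketched as follows. This is an assumption-laden illustration: the `score` function is hypothetical, and treating a pair as a match when either string contains the other is one plausible reading of "substring match", not necessarily the rule the evals use.

```python
from typing import List, Tuple


def _matches(a: str, b: str) -> bool:
    """Assumed match rule: one non-empty string contains the other."""
    return bool(a) and bool(b) and (a in b or b in a)


def score(extracted: List[str], expected: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for extracted vs. expected excerpts."""
    # Precision: share of extracted excerpts that hit some expected excerpt.
    hits = sum(any(_matches(e, x) for x in expected) for e in extracted)
    # Recall: share of expected excerpts covered by some extracted excerpt.
    covered = sum(any(_matches(e, x) for e in extracted) for x in expected)

    precision = hits / len(extracted) if extracted else 0.0
    recall = covered / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```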
```sh
# Chunk-level
uv run python -m evals                     # all cases, one run each
uv run python -m evals --case <name>       # single case by name
uv run python -m evals --runs 3            # repeat each N times
uv run python -m evals --debug             # also print prompt + raw LLM output
uv run python -m evals --write-baseline    # record current mean scores
uv run python -m evals --check-baseline    # gate: exit non-zero on regression

# Pipeline
uv run python -m evals.pipeline            # all cases, one run each
uv run python -m evals.pipeline --case <name>
uv run python -m evals.pipeline --runs 3
uv run python -m evals.pipeline --debug    # also print generated sub-Qs + rubric
uv run python -m evals.pipeline --write-baseline
uv run python -m evals.pipeline --check-baseline
```
Baselines live in evals/baseline.json (chunk-level, tolerance 0.02)
and evals/baseline-pipeline.json (pipeline, tolerance 0.05 — wider
because generation variance stacks on top of extraction).
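A tolerance-based gate of this kind could look roughly like the sketch below. The flat metric-to-score dictionary and the `regressions` helper are hypothetical; the actual layout of the baseline files is not described here.

```python
from typing import Dict, Tuple


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float,
) -> Dict[str, Tuple[float, float]]:
    """Return metrics whose current score fell more than tolerance below baseline."""
    failed = {}
    for metric, expected in baseline.items():
        actual = current.get(metric, 0.0)  # a missing metric counts as zero
        if actual < expected - tolerance:
            failed[metric] = (expected, actual)
    return failed


def exit_code(failed: Dict[str, Tuple[float, float]]) -> int:
    """Non-zero when any metric regressed, suitable for a CI gate."""
    return 1 if failed else 0
```

A drop within the tolerance passes, which is why the pipeline track, with more run-to-run variance, gets the wider 0.05 band.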
```sh
uv run pytest                              # full suite
uv run pytest tests/test_consolidate.py    # single file
uv run pytest -k synthesize                # filter by keyword
```
LLM-driven tests stub the pydantic-ai agent with a FunctionModel via
agent.override(...), so the suite runs offline — no API key required.
Internal logic (verification, citation mapping, consolidation) is never
stubbed.
```sh
uv sync
uv run ruff check .
```