Best evaluated offline using the step-by-step Reviewer Demo Workflow or the comprehensive Portfolio & Reviewer Guide.
DeepReader is a local-first AI document intelligence and RAG workbench for turning technical documents into inspectable retrieval evidence. It demonstrates the components a reviewer expects in a serious RAG system: document ingestion, deterministic record IDs, source-preserving retrieval, summaries, processing jobs, citations, and evidence inspection.
It is intentionally not a chatbot wrapper. The dashboard exposes records, scores, retrieval methods, summaries, job steps, citations, and evidence packets so the retrieval pipeline can be inspected end to end.
Demo workbench: ingest local documents, inspect stable record IDs/ground-truth source text, track summary processing job steps, and search records using score chips and location metadata.
Extractive QA with citations/evidence: answers remain tied to cited records and inspectable evidence packets (separating used vs. available evidence).
- Text, EPUB, and PDF ingestion through a FastAPI backend.
- SQLite persistence for local, reproducible demos.
- Deterministic document records with stable IDs and source hashes.
- Source-preserving BM25 retrieval over original document text.
- Local vector-style retrieval and simple fusion for comparison.
- Deterministic local summaries with checkpointing.
- Optional standalone Paragraph Summary Service with deterministic mock summaries, explicitly enabled Gemini validation, and asynchronous batch scheduling.
- Processing jobs and job steps for summary generation.
- Summary-aware search with visible retrieval methods and component scores.
- Deterministic extractive QA with citations, evidence packets, and retrieval settings.
- React/Vite/TypeScript dashboard built for inspection rather than chat.
- Docker Compose setup for a no-secrets local demo.
- Backend tests and frontend build in GitHub Actions CI.
No API keys are required for the default workflow. The local summariser, mock paragraph provider, and QA flow are deterministic; the optional Gemini paragraph provider is disabled unless both provider selection and provider-call opt-in are set.
DeepReader is a portfolio/demo project built to be inspected rather than trusted. The fastest way to evaluate it is to clone it, run the local demo, and check that every pipeline stage is traceable.
- Ingest + records: upload examples/simple_manual.txt. Confirm stable IDs, order_index, section titles, and unchanged source text.
- Search provenance: run "what causes low flow?". Confirm each result shows retrieval method, aggregate score, component scores, record ID/stable ID, and source location (section/page/chapter). Missing fields fall back to "Not reported"; a zero-results query renders a styled empty state.
- QA evidence provenance: ask "What causes low flow?". Confirm the Evidence provenance panel separates "used in answer" from "available only" and shows each packet's retrieval method, scores, record ID, and location.
- Job lifecycle: generate summaries and open the job. Confirm completed/failed/skipped counts, error_code, attempts, and the retry button for failed or cancelled-unfinished steps.
- Skipped-step accounting and error_code exposure (v0.6-cancel-retry-hardening tag).
- Retry of failed and skipped/job_cancelled steps (content/data skips like empty_summary excluded).
- Remote-cancel partial artifact import (completed records imported; unfinished steps marked skipped/job_cancelled).
- Concurrent local cancellation guard (rollback-based finalization-overwrite prevention).
- Upload filename/extension safety, search, QA evidence packets, and answer persistence.
Run make test for the full backend suite.
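The retry rule in the list above (failed steps and skipped/job_cancelled steps are retried, content skips such as empty_summary are not) can be sketched as a small predicate. This is an illustration only: the JobStep class, the RETRYABLE_SKIP_CODES set, and the provider_error code are hypothetical names, not the project's actual models.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: skip codes that mean "work was interrupted" rather than "nothing to do".
RETRYABLE_SKIP_CODES = {"job_cancelled"}


@dataclass
class JobStep:
    """Hypothetical stand-in for one summarise_record step."""
    record_id: str
    status: str                      # "completed", "failed", or "skipped"
    error_code: Optional[str] = None
    attempts: int = 0


def is_retryable(step: JobStep) -> bool:
    """Failed steps and cancel-induced skips are retried; content skips are not."""
    if step.status == "failed":
        return True
    if step.status == "skipped":
        return step.error_code in RETRYABLE_SKIP_CODES
    return False


steps = [
    JobStep("r1", "completed"),
    JobStep("r2", "failed", "provider_error", attempts=1),  # hypothetical error code
    JobStep("r3", "skipped", "job_cancelled"),
    JobStep("r4", "skipped", "empty_summary"),
]
retry_ids = [step.record_id for step in steps if is_retryable(step)]
print(retry_ids)  # ['r2', 'r3']
```

Keeping the rule in one predicate makes it easy to test that a content skip never re-enters the queue.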
- The optional paragraph-summary-service defaults to a deterministic mock provider; the Gemini path is validation-only and disabled unless explicitly opted in.
- Local summaries are deterministic-extractive, not LLM summaries.
- Local vector-style retrieval is a lexical approximation, not embeddings/semantic search.
- QA is extractive and deterministic, not answer generation from a model.
- Provider (Gemini) and OpenStax validation remain deferred unless explicitly approved; the default demo runs fully offline.
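To make the "lexical approximation, not embeddings" caveat concrete, here is a minimal sketch of a bag-of-words cosine score. It is an assumption about what a vector-style lexical retriever might look like, not DeepReader's actual implementation; term_vector and cosine are illustrative names.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words term counts; purely lexical, no learned embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = term_vector("what causes low flow?")
lexical_match = term_vector("Low flow is usually caused by a clogged filter.")
paraphrase = term_vector("Reduced throughput often stems from blockage.")

print(cosine(query, lexical_match) > 0)  # True: shares the terms "low" and "flow"
print(cosine(query, paraphrase))         # 0.0: same meaning, no shared terms
```

The paraphrase scoring zero is exactly why this kind of retrieval should not be read as semantic search.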
- Step-by-step reviewer script with "what this demonstrates" commentary: docs/DEMO_WORKFLOW.md.
- Proven-results-only validation history: docs/validation-log.md.
- Module/data-flow overview: docs/ARCHITECTURE.md.
- Planning notes and deferred scaling: docs/project-log/feature-notes.md.
- Current tag: v0.6-cancel-retry-hardening (lifecycle hardening: skipped steps, cancel, retry, remote-cancel partial artifact import).
- Current v0.7 polish (search-demo-polish direction): QA evidence provenance surfacing (T1) and search result provenance/component-score display (T2), both complete and documented.
- backend/src/deepreader/api: FastAPI routes and response schemas.
- backend/src/deepreader/ingest: text, EPUB, and PDF parsing.
- backend/src/deepreader/storage: SQLAlchemy models and repositories.
- backend/src/deepreader/summarise: local summariser, remote service client, artifacts, and summary job runner.
- backend/src/deepreader/retrieval: BM25, local vector-style retrieval, and fusion.
- backend/src/deepreader/answer: extractive QA, evidence packets, and citations.
- frontend/src: dashboard panels for uploads, documents, records, jobs, search, and QA.
More detail lives in docs/ARCHITECTURE.md.
Install and run the backend:
cd backend
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -e ".[dev]"
cd ..
make backend-dev

In a second terminal, install and run the frontend:
cd frontend
pnpm install --frozen-lockfile
pnpm dev

Open http://127.0.0.1:5173. The dashboard defaults to the backend at http://127.0.0.1:8000. To override it, copy frontend/.env.example to frontend/.env and set VITE_API_BASE_URL.
From the repository root:
docker compose up --build

Then open http://127.0.0.1:5173.
Docker Compose runs:
- backend on http://127.0.0.1:8000
- paragraph-summary-service on http://127.0.0.1:8001
- frontend on http://127.0.0.1:5173
- SQLite in a named local volume, mounted at /app/data in the backend container
No secrets or external services are required.
The backend uses its deterministic local summariser by default even though the paragraph service is running. To exercise the mock remote path, set both DEEPREADER_SUMMARY_BACKEND=remote and DEEPREADER_ALLOW_REMOTE_SUMMARY_SERVICE=true before starting Compose.
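The two-variable opt-in described above can be sketched as follows. The variable names come from this README; the function name, the "local" default value, and the silent fallback when only one variable is set are assumptions made for illustration (the real backend may raise an error instead).

```python
from typing import Mapping


def resolve_summary_backend(env: Mapping[str, str]) -> str:
    """Return "remote" only when both opt-in variables are set; otherwise "local"."""
    wants_remote = env.get("DEEPREADER_SUMMARY_BACKEND", "local").strip().lower() == "remote"
    allows_remote = (
        env.get("DEEPREADER_ALLOW_REMOTE_SUMMARY_SERVICE", "").strip().lower() == "true"
    )
    # Assumption: a half-configured environment quietly stays on the local summariser.
    return "remote" if wants_remote and allows_remote else "local"


print(resolve_summary_backend({}))  # local
print(resolve_summary_backend({"DEEPREADER_SUMMARY_BACKEND": "remote"}))  # local
print(resolve_summary_backend({
    "DEEPREADER_SUMMARY_BACKEND": "remote",
    "DEEPREADER_ALLOW_REMOTE_SUMMARY_SERVICE": "true",
}))  # remote
```

In a real process the mapping would be os.environ; passing it in keeps the check testable without touching the environment.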
Use docs/DEMO_WORKFLOW.md for a step-by-step reviewer script. The short version:
- Start backend and frontend locally, or run Docker Compose.
- Upload examples/simple_manual.txt or a PDF.
- Select the document and inspect records, stable IDs, and source text.
- Search for "what causes low flow?".
- Generate summaries and inspect the processing job.
- Search summaries.
- Ask a QA question.
- Inspect citations and evidence packets.
- Run tests and the frontend build.
Reviewer checklist:
- Uploads accept .txt, .epub, and .pdf, and reject unsafe filenames/extensions.
- Source records remain visible and unchanged.
- Stable IDs make records traceable across retrieval, summaries, citations, and jobs.
- Search results show scores, retrieval methods, metadata, summaries, and source text.
- QA answers expose citations and evidence rather than hidden generated claims.
- Tests and frontend build pass locally.
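The traceability item in the checklist depends on stable IDs being a pure function of record content. The README does not say how DeepReader derives them, so the following is only a hypothetical sketch: the rec_ prefix, the fields that feed the hash, and both function names are assumptions.

```python
import hashlib


def source_hash(text: str) -> str:
    """SHA-256 of the exact source text, so any edit changes the hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stable_id(order_index: int, section_title: str, text: str) -> str:
    """Hypothetical stable ID that depends only on record position and content."""
    payload = "\x1f".join([str(order_index), section_title, source_hash(text)])
    return "rec_" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


first = stable_id(0, "Troubleshooting", "Low flow is often caused by a clogged filter.")
again = stable_id(0, "Troubleshooting", "Low flow is often caused by a clogged filter.")
edited = stable_id(0, "Troubleshooting", "Low flow is often caused by a blocked filter.")

print(first == again)   # True: identical content gives an identical ID
print(first == edited)  # False: changed text gives a new ID
```

Because nothing random or time-based enters the hash, the same record can be recognised again in search results, summaries, citations, and job steps.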
Dashboard uploads use the real API:
- POST /documents/ingest/text for .txt
- POST /documents/ingest/epub for .epub
- POST /documents/ingest/pdf for .pdf
The backend enforces local filename safety checks and extension allowlists. PDF uploads stream to a temporary file while hashing and do not have an application-level size cap. Duplicate ingest currently creates another document row, while deterministic record stable IDs are reused for identical content. That behavior is intentional for now and tested.
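As a rough sketch of the two upload behaviours described above, the block below pairs an extension allowlist and filename check with a chunked copy that hashes while it writes. The specific rejection rules, the SHA-256 choice, the chunk size, and the function names are assumptions, not the backend's actual code.

```python
import hashlib
import io
import os
import tempfile
from pathlib import PurePosixPath
from typing import BinaryIO, Tuple

ALLOWED_EXTENSIONS = {".txt", ".epub", ".pdf"}


def is_safe_filename(name: str) -> bool:
    """Reject path separators, NUL bytes, hidden/empty names, and unlisted extensions."""
    if not name or "/" in name or "\\" in name or "\x00" in name:
        return False
    if name.startswith("."):
        return False
    return PurePosixPath(name).suffix.lower() in ALLOWED_EXTENSIONS


def stream_to_temp(upload: BinaryIO, chunk_size: int = 64 * 1024) -> Tuple[str, str]:
    """Copy an upload to a temp file chunk by chunk, hashing as it goes."""
    digest = hashlib.sha256()
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        while chunk := upload.read(chunk_size):
            digest.update(chunk)
            tmp.write(chunk)
        return tmp.name, digest.hexdigest()


print(is_safe_filename("manual.pdf"))     # True
print(is_safe_filename("../etc/passwd"))  # False

temp_path, file_hash = stream_to_temp(io.BytesIO(b"%PDF-1.7 demo bytes"))
with open(temp_path, "rb") as handle:
    stored = handle.read()
os.remove(temp_path)  # the caller owns cleanup because delete=False
```

Hashing during the copy means the file is read only once, which matters when no size cap is enforced.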
Generating summaries for a document creates a record_summary job and one summarise_record step per record. The backend endpoint is synchronous: local extraction runs inline, while the opt-in remote path submits work to paragraph-summary-service, polls it to completion, imports the artifact once, and then returns the persisted job.
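The "polls it to completion" part of the remote path might look roughly like the loop below. The status strings, the poll budget, and the injected fetch_status callable are assumptions; the example substitutes a canned iterator for real calls to the paragraph service.

```python
import time
from typing import Callable, Dict

# Assumption: the statuses a remote job can end in.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def poll_until_done(
    fetch_status: Callable[[], Dict[str, str]],
    interval_seconds: float = 0.5,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll a remote job until it reports a terminal status or the budget runs out."""
    for _ in range(max_polls):
        job = fetch_status()
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(interval_seconds)
    raise TimeoutError("remote summary job did not finish within the poll budget")


# Stand-in for repeated status reads: two "running" polls, then done.
responses = iter([{"status": "running"}, {"status": "running"}, {"status": "completed"}])
final = poll_until_done(lambda: next(responses), sleep=lambda _seconds: None)
print(final["status"])  # completed
```

Returning only once a terminal status is seen is what lets the caller import the artifact exactly once before responding.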
Remote backend jobs persist the paragraph-service job ID and the latest remote record counts, status, and compact stats. The paragraph service exposes read-only GET /jobs, GET /jobs/{job_id}, and GET /jobs/{job_id}/artifact diagnostics; none includes source records or credentials.
Checkpointing is based on record_id, summariser_name, and source_hash. Rerunning summary generation skips unchanged records that already have a matching summary. If a record source hash changes, a new current summary is created and prior source text remains untouched.
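A minimal sketch of that checkpoint rule, assuming an in-memory dictionary in place of the SQLite tables and SHA-256 as the hash function (neither is confirmed by this README):

```python
import hashlib
from typing import Dict, Tuple

# (record_id, summariser_name, source_hash), as described above.
CheckpointKey = Tuple[str, str, str]


def source_hash(text: str) -> str:
    """Hash of the exact source text; assumed here to be SHA-256."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_summary(
    existing: Dict[CheckpointKey, str],
    record_id: str,
    summariser_name: str,
    text: str,
) -> bool:
    """True when no summary exists for this exact record, summariser, and text."""
    return (record_id, summariser_name, source_hash(text)) not in existing


text_v1 = "Low flow is often caused by a clogged filter."
existing: Dict[CheckpointKey, str] = {
    ("rec-1", "local_extractive_v1", source_hash(text_v1)): text_v1,
}

unchanged = needs_summary(existing, "rec-1", "local_extractive_v1", text_v1)
changed = needs_summary(existing, "rec-1", "local_extractive_v1", text_v1 + " Check the inlet.")
print(unchanged, changed)  # False True
```

Including summariser_name in the key means switching summarisers regenerates summaries even when the source text is unchanged.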
The local summariser is local_extractive_v1: it normalises whitespace, selects deterministic text, truncates predictably, and stores summary/source hashes. The optional Paragraph Summary Service defaults to the deterministic mock provider and returns JSON artifacts. The v0.6 Gemini provider is an explicit, capped validation path; see docs/GEMINI_PROVIDER_VALIDATION.md.
Search supports source text, summaries, local vector-style retrieval, and simple fusion. Response fields are inspection-first:
- document_id
- record_id
- stable_id
- retrieval_method
- source_text
- summary
- metadata
- score
- component_scores
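To show how score and component_scores can stay inspectable under fusion, here is a small sketch that combines two rankings with a weighted sum of max-normalised scores. The normalisation, the 0.5 weight, the "fusion" method label, and the "bm25"/"vector" component keys are assumptions; the README only says that fusion is intentionally simple.

```python
from typing import Dict, List


def _normalise(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores into [0, 1] by dividing by the maximum."""
    top = max(scores.values(), default=0.0)
    return {key: (value / top if top > 0 else 0.0) for key, value in scores.items()}


def fuse(bm25: Dict[str, float], vector: Dict[str, float], bm25_weight: float = 0.5) -> List[dict]:
    """Weighted sum of normalised scores, keeping each component visible."""
    norm_bm25, norm_vector = _normalise(bm25), _normalise(vector)
    results = []
    for record_id in set(norm_bm25) | set(norm_vector):
        components = {
            "bm25": norm_bm25.get(record_id, 0.0),
            "vector": norm_vector.get(record_id, 0.0),
        }
        score = bm25_weight * components["bm25"] + (1 - bm25_weight) * components["vector"]
        results.append({
            "record_id": record_id,
            "retrieval_method": "fusion",
            "score": score,
            "component_scores": components,
        })
    # Sort by score descending; record_id breaks ties so the order is deterministic.
    results.sort(key=lambda row: (-row["score"], row["record_id"]))
    return results


ranked = fuse({"rec-1": 4.0, "rec-2": 2.0}, {"rec-2": 0.9, "rec-3": 0.3})
print([row["record_id"] for row in ranked])  # ['rec-2', 'rec-1', 'rec-3']
```

Returning the components alongside the aggregate lets a reviewer see why rec-2 outranks rec-1 despite its lower BM25 score.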
The QA endpoint is deterministic and extractive. It returns an answer plus citations, all evidence packets, used evidence, unused evidence, and retrieval settings. It is not a chatbot and does not call an LLM.
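The used/unused evidence split can be illustrated with a toy extractive answerer that ranks packets by term overlap with the question. The overlap scoring, the max_used setting, and the response key names are assumptions for illustration, not the endpoint's real schema.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class EvidencePacket:
    """Hypothetical packet: a record ID and its unchanged source text."""
    record_id: str
    source_text: str


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ask(question: str, packets: List[EvidencePacket], max_used: int = 1) -> Dict[str, object]:
    """Answer by quoting the best-overlapping packets; never generates new text."""
    question_terms = tokens(question)
    overlap = {p.record_id: len(question_terms & tokens(p.source_text)) for p in packets}
    ranked = sorted(packets, key=lambda p: (-overlap[p.record_id], p.record_id))
    used = [p for p in ranked[:max_used] if overlap[p.record_id] > 0]
    used_ids = {p.record_id for p in used}
    unused = [p for p in packets if p.record_id not in used_ids]
    return {
        "answer": " ".join(p.source_text for p in used),
        "citations": [p.record_id for p in used],
        "evidence_packets": packets,
        "used_evidence": used,
        "unused_evidence": unused,
        "retrieval_settings": {"max_used": max_used},
    }


packets = [
    EvidencePacket("rec-1", "Low flow is often caused by a clogged filter."),
    EvidencePacket("rec-2", "Replace the filter every month."),
    EvidencePacket("rec-3", "Flow rate is printed on the label."),
]
response = ask("What causes low flow?", packets)
print(response["citations"])  # ['rec-1']
```

Every sentence in the answer is a verbatim quote from a cited packet, so the answer can always be checked against its evidence.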
- POST /documents/ingest/text
- POST /documents/ingest/epub
- POST /documents/ingest/pdf
- GET /documents
- GET /documents/{document_id}
- GET /documents/{document_id}/records
- POST /documents/{document_id}/summaries/run
- GET /documents/{document_id}/summaries
- GET /jobs
- GET /jobs/{job_id}
- GET /jobs/{job_id}/steps
- POST /jobs/{job_id}/retry-failed
- POST /search
- POST /qa/ask
- GET /answers
- GET /answers/{answer_id}
make test
make backend-dev
make frontend-dev
make frontend-build

make frontend-dev and make frontend-build use pnpm by default. Override with NPM=npm if needed.
Backend:
make test

Frontend:
cd frontend
pnpm install --frozen-lockfile
pnpm build

Docker config:
docker compose config

GitHub Actions runs backend and paragraph-service tests plus the frontend build without secrets or real provider calls.
Backend defaults live in .env.example:
DEEPREADER_DATABASE_URL=sqlite:///./data/deepreader.sqlite3
DEEPREADER_CORS_ORIGINS=http://127.0.0.1:5173,http://localhost:5173
The default CORS origins are local-only. Uploaded file content and secrets are not logged by design. A small redaction utility exists for future provider-backed configuration, but no provider keys are needed today.
- SQLite is the only configured persistence layer.
- Text, EPUB, and PDF are supported; real OCR is not implemented for scanned PDFs.
- The backend summary request remains synchronous in both modes; the paragraph service schedules its internal batches asynchronously while the backend polls.
- Paragraph-service jobs are in-memory and non-durable; Gemini mode is validation-only and disabled by default.
- The local summariser is deterministic and extractive, not an LLM summary.
- The local vector-style retriever is not embeddings and should not be treated as semantic search.
- Fusion is intentionally simple.
- QA is extractive and deterministic, not answer generation from a model.
- No auth, multi-user permissions, hosted deployment, PostgreSQL, Celery, Redis, or production observability stack.
- Add a short demo video.
- Validate optional provider-backed summaries conservatively before expanding the workflow.
- Add real embeddings and hybrid retrieval in a later milestone.
- Add richer job retry/checkpoint inspection.
- Add exportable evidence packets for reviewer handoff.
- Consider production deployment concerns only after the local portfolio workflow is stable.
This project is licensed under the MIT License. See LICENSE.

