donglinb/llm-paper-reader

📄 AI Paper Reader

v0.2.0 · MIT License · Python 3.10+

AI-powered academic paper reader with structured summarization, persistent paper library, and RAG-based Q&A.

Author: Donglin Bai & Claude Code · Email: baidonglin295332@gmail.com · WeChat: bdl332

What it does

  • Ingests papers from local PDF paths, arXiv URLs, or arXiv IDs (including legacy IDs)
  • Parses PDFs with marker, mineru, or pymupdf
  • Optionally enriches metadata via GROBID
  • Structures the paper into a normalized representation
  • Generates high-level overviews with LangGraph-driven summarization
  • Supports interactive chat and tool-based agent Q&A
  • Exports to Markdown, JSON, or BibTeX
  • Stores papers in a local library and supports comparison/export workflows
  • Uses profile-based LLM configuration via llm_config.yaml (LiteLLM-compatible)
  • Supports Chinese-language papers and Chinese output (简体中文)
  • Pipeline hooks allow injecting custom logic before/after any processing step

Installation

Requirements

  • Python 3.10+
  • pip
  • Optional: Docker (for local GROBID)
cd paper_reader
pip install -e .

Alternative:

pip install -r requirements.txt

Quick start

1) Configure non-LLM settings

cp .env.example .env

.env controls parser/GROBID/storage behavior (not model profiles).

2) Configure LLM profiles

llm_config.yaml in the repo root is ready to use and can be customized with your own profiles.

3) Run

# Parse only (no LLM summarization)
paper-reader parse 2308.13418

# Parse + summarize
paper-reader read 2308.13418

# Start interactive RAG Q&A
paper-reader chat 2308.13418

# Start tool-based agent Q&A with optional long-session memory
paper-reader agent 2308.13418

# Launch Gradio UI
paper-reader serve

4) Common workflows

# Extract parsed sections without LLM summarization
paper-reader parse 2308.13418 --output parsed.md

# Save markdown summary to file
paper-reader read 2308.13418 --output summary.md

# Force re-processing from scratch
paper-reader read 2308.13418 --force

# Export as JSON or BibTeX
paper-reader export 2308.13418 --format json --output paper.json
paper-reader export 2308.13418 --format bibtex --output paper.bib

# Batch process multiple papers
paper-reader batch 2308.13418 2401.12345 --format markdown --output-dir summaries

# Quiet mode — only warnings and errors
paper-reader read 2308.13418 --quiet

5) Example output

Running paper-reader read 2308.13418 --output summary.md produces:

summary.md:
# Nougat: Neural Optical Understanding for Academic Documents

**Authors:** Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic

## One-Line Summary
Nougat is an end-to-end visual Transformer that converts scientific PDF pages directly into structured markup, preserving text, tables, and mathematical expressions without relying on external OCR.

## Motivation
Most scientific knowledge is stored as PDFs, a format that discards semantic structure and is particularly inadequate for mathematical expressions and tables. Existing PDF processing tools and OCR pipelines fail to reliably recover this structure, limiting machine accessibility, searchability, and reuse of scientific content. A robust document-to-markup solution is therefore critical for large-scale scientific knowledge extraction.

## Key Observations
PDFs retain visual layout but lose semantic meaning; mathematical expressions are especially poorly handled by classical OCR and PDF parsers. Prior pipelines that stitch together OCR, layout analysis, and formula recognition are brittle and error-prone. Recent Transformer-based visual document understanding models suggest that text recognition and structural understanding can be learned jointly from images alone.

## Core Idea
The paper reframes scientific document conversion as a visual document understanding problem rather than a traditional OCR task. Nougat uses an end-to-end encoder–decoder Transformer that takes only rasterized page images as input and directly generates a structured markup representation, implicitly learning text, layout, and math recognition in a single model.

## Methods
Nougat follows an encoder–decoder Transformer architecture inspired by Donut. Page images rendered at 96 DPI are resized and padded to 896×672 and encoded using a Swin Transformer (base) visual encoder with pre-trained weights. A large autoregressive Transformer decoder based on mBART generates markup tokens with cross-attention to visual embeddings, using a scientific-domain tokenizer and a maximum sequence length of 4096 tokens. The base model has ~350M parameters, with a smaller 250M-parameter variant. Training uses AdamW over 3 epochs with an effective batch size of 192 and a decaying learning rate. Extensive image augmentations (noise, blur, erosion, distortion, compression) and text-level token replacement are applied to improve robustness and prevent repetition collapse. Inference uses greedy decoding with heuristic repetition detection and early stopping.

## Main Results
Evaluation on an arXiv-based test set uses normalized edit distance, BLEU, METEOR, and precision/recall/F1 across plain text, math, tables, and overall content. Embedded PDF text achieves edit distance 0.255 and F1 79.2; GROBID performs worse overall (edit distance 0.312, F1 73.0), particularly on tables and math. A LaTeX-OCR baseline shows extremely poor math performance (BLEU 0.3, F1 9.7) despite strong aggregate scores when combined with other signals. Nougat small (250M) and base (350M) achieve the best overall performance, with edit distance around 0.07 and F1 ≈93 on all content, strong gains in math (F1 ≈77) and plain text (F1 ≈95.7). The smaller model matches the base model’s accuracy. Anti-repetition training and inference heuristics reduce failed page conversions on out-of-domain documents by 32%, with repetition occurring in about 1.5% of test pages.

## Strengths
End-to-end design eliminates dependence on OCR engines or embedded PDF text. Strong performance on mathematical expressions and tables, where prior systems struggle. Large-scale dataset creation from 1.7M arXiv articles enables robust training. Competitive accuracy even with a smaller 250M-parameter model. Released models and code support reproducibility and future research.

## Limitations
Inference is significantly slower than classical systems (e.g., ~19.5s per batch of 6 pages on an NVIDIA A10G versus GROBID’s ~10.6 pages/s). The model can still collapse into repetitive loops, especially out of domain. Training data is overwhelmingly English, with poor handling of non-Latin scripts. Page-wise independent processing causes cross-page inconsistencies in section numbering and bibliographies. Dataset ground truth contains artifacts from LaTeXML preprocessing and page alignment heuristics.

## Key Figures & Tables
- Qualitative example showing a dense mathematical PDF page converted to LaTeX and re-rendered accurately
- Table comparing Nougat (small/base) against PDF text, GROBID, and LaTeX-OCR across edit distance, BLEU, METEOR, and F1 by modality

## Related Work Context
Nougat builds on advances in OCR, mathematical expression recognition, and visual document understanding. It extends Transformer-based encoder–decoder approaches such as Donut by targeting scientific documents with dense math and structure. Unlike tools such as GROBID or pdf2htmlEX, Nougat directly recovers semantic representations of equations and tables. It complements LayoutLM-style models by focusing on full document-to-markup generation rather than token-level understanding.

## Future Work
Reducing repetition collapse remains the primary open challenge. Improving document-level consistency across pages, expanding multilingual and non-Latin script support, and accelerating inference are important directions. More robust and cleaner ground truth generation, as well as better evaluation metrics for mathematically equivalent expressions, are also highlighted as future research opportunities.

CLI commands

All commands support --verbose / -v for debug logging and --quiet / -q for warnings-only output. The processing commands read, chat, agent, batch, and export also accept --force / -F to re-process from scratch.

Accepted input formats

<source> in the commands below accepts any of:

| Format | Example |
| --- | --- |
| arXiv ID | 2308.13418 |
| arXiv URL | https://arxiv.org/abs/2308.13418 |
| Legacy arXiv ID | hep-th/9905111 |
| Local PDF path | ./papers/my_paper.pdf |
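
To make the accepted formats concrete, here is a minimal sketch of how a source string could be classified. The function name classify_source and the regular expressions are illustrative assumptions, not the project's actual ingestion code.

```python
import re
from typing import Tuple

# Modern arXiv IDs look like 2308.13418, optionally with a version suffix (v2).
NEW_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Legacy IDs look like hep-th/9905111: archive, optional subject class, 7 digits.
LEGACY_ID = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")
# arXiv abstract or PDF URLs; the trailing ".pdf" is optional.
ARXIV_URL = re.compile(r"^https?://arxiv\.org/(abs|pdf)/(?P<id>.+?)(\.pdf)?$")


def classify_source(source: str) -> Tuple[str, str]:
    """Return (kind, value) for a source string (hypothetical helper)."""
    source = source.strip()
    url_match = ARXIV_URL.match(source)
    if url_match:
        # For URLs, the value is the arXiv ID embedded in the path.
        return ("arxiv_url", url_match.group("id"))
    if NEW_ID.match(source):
        return ("arxiv_id", source)
    if LEGACY_ID.match(source):
        return ("legacy_arxiv_id", source)
    if source.lower().endswith(".pdf"):
        return ("local_pdf", source)
    raise ValueError(f"Unrecognized source: {source!r}")
```

Checking URLs first matters in this sketch, because a URL ending in .pdf would otherwise be mistaken for a local file path.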

Core analysis

# Parse only (no LLM summarization)
paper-reader parse <source> \
  [-b <profile>] [-p marker|mineru|pymupdf] \
  [--vision] [--no-grobid] [--use-llm-parsing] \
  [-l auto|en|zh] [--output-language auto|en|zh] \
  [-o parsed.md] [-v] [-q]

# Parse + summarize
paper-reader read <source> \
  [-b <profile>] [-p marker|mineru|pymupdf] \
  [--vision] [--no-grobid] [--use-llm-parsing] \
  [-l auto|en|zh] [--output-language auto|en|zh] \
  [-o summary.md] [--force] [-v] [-q]

# Parse + summarize + interactive RAG Q&A
paper-reader chat <source> \
  [-b <profile>] [-p marker|mineru|pymupdf] \
  [--vision] [--no-grobid] [--use-llm-parsing] \
  [-l auto|en|zh] [--output-language auto|en|zh] \
  [--force] [-v] [-q]

# Parse + summarize + tool-based agent Q&A
paper-reader agent <source> \
  [-b <profile>] [-p marker|mineru|pymupdf] \
  [--vision] [--no-grobid] [--use-llm-parsing] \
  [-l auto|en|zh] [--output-language auto|en|zh] \
  [--no-memory] [--consolidation-threshold 20] \
  [--force] [-v] [-q]

Web UI

paper-reader serve [--host 127.0.0.1] [--port 7860] [-b <profile>] [-v] [-q]

The web UI provides three actions:

  • Analyze Paper — runs the full pipeline (parse + summarize) in one step
  • Parse Only — runs ingestion, parsing, metadata extraction, and structuring without LLM summarization
  • Summarize — generates the LLM overview from an already-parsed paper

The Chat tab supports switching between the RAG Chat and Agent (tool-based) Q&A pipelines.

Batch / export / library / compare

# Batch analysis and export
paper-reader batch <source1> <source2> ... \
  [-d summaries] [-f markdown|json|bibtex] \
  [-b <profile>] [-p marker|mineru|pymupdf] \
  [-l auto|en|zh] [--output-language auto|en|zh] \
  [--no-grobid] [--force] [-v] [-q]

# Export one paper directly
paper-reader export <source> \
  [-o paper.md] [-f markdown|json|bibtex] \
  [-b <profile>] [-p marker|mineru|pymupdf] \
  [-l auto|en|zh] [--output-language auto|en|zh] \
  [--no-grobid] [--force] [-v] [-q]

# Library management
paper-reader library list
paper-reader library search <query>
paper-reader library remove <paper_id>

# Compare papers already in the library (requires ≥ 2 IDs)
paper-reader compare <id1> <id2> [<id3> ...] \
  [-o comparison.md] [-b <profile>] \
  [--output-language auto|en|zh] [-v] [-q]
Flag reference (short → long)

| Short | Long | Used by |
| --- | --- | --- |
| -b | --backend | all commands |
| -p | --parser | parse, read, chat, agent, batch, export |
| -o | --output | parse, read, export, compare, config migrate |
| -f | --format | batch, export |
| -d | --output-dir | batch |
| -l | --language | parse, read, chat, agent, batch, export |
| (none) | --output-language | parse, read, chat, agent, batch, export, compare |
| -F | --force | read, chat, agent, batch, export |
| -v | --verbose | all commands |
| -q | --quiet | all commands |

Config migration (legacy env vars)

paper-reader config migrate [--output llm_config.yaml] [--force]

Use this when migrating from older PAPER_READER_* LLM env-var setups.

Chinese language support

Paper Reader supports Chinese-language papers and Chinese output out of the box.

Reading a Chinese paper

# Auto-detect language and produce Chinese output
paper-reader read chinese_paper.pdf

# Force Chinese output even for an English paper
paper-reader read 2308.13418 --output-language zh

# Ask questions in Chinese
paper-reader chat chinese_paper.pdf --output-language zh

Language flags

| Flag | Commands | Description |
| --- | --- | --- |
| --language, -l | parse, read, chat, agent, batch, export | Paper language (auto/en/zh). Default: auto (detected from content). |
| --output-language | parse, read, chat, agent, batch, export, compare | Output language (auto/en/zh). Default: auto (same as detected input). |

The Gradio web UI includes an Output Language dropdown in the sidebar.
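
The interaction between the two flags can be sketched in a few lines. This is an illustration of the documented defaults only; resolve_output_language is a hypothetical name and the validation behavior is an assumption.

```python
def resolve_output_language(requested: str, detected: str) -> str:
    """Resolve --output-language; 'auto' follows the detected paper language.

    Hypothetical helper: the real project may validate or resolve differently.
    """
    if requested not in {"auto", "en", "zh"}:
        raise ValueError(f"Unsupported output language: {requested!r}")
    if detected not in {"en", "zh"}:
        raise ValueError(f"Unsupported detected language: {detected!r}")
    # An explicit choice always wins; 'auto' mirrors the input language.
    return detected if requested == "auto" else requested
```

Under this sketch, an English paper read with --output-language zh yields Chinese output, while the default auto keeps the output in the paper's own language.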

How it works

  • Language detection — Uses a CJK character ratio heuristic (with optional langdetect fallback) to classify papers as Chinese or English.
  • Chinese-aware chunking — Smaller chunk sizes (600 chars) and Chinese sentence-boundary separators (。！？；，) for better retrieval.
  • Chinese heading detection — Recognises Chinese section headings (第一章, 一、, 摘要, 引言, etc.) as a fallback when Markdown headings are absent.
  • Localised prompts — All LLM prompts have Chinese variants in paper_reader/prompts/zh.py.
  • Localised output — Overview headers, comparison headers, and CLI progress messages switch to Chinese when appropriate.
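
The first two bullets can be sketched as follows. This is a simplified illustration, not the project's code: the 0.3 threshold, the restriction to the CJK Unified Ideographs block, and the function names are all assumptions, and the optional langdetect fallback is omitted.

```python
import re
from typing import List


def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def detect_language(text: str, threshold: float = 0.3) -> str:
    """Classify text as 'zh' or 'en'; the threshold is an assumed value."""
    return "zh" if cjk_ratio(text) >= threshold else "en"


# Zero-width split point right after Chinese sentence or clause punctuation.
_BOUNDARY = re.compile(r"(?<=[。！？；，])")


def chunk_chinese(text: str, max_chars: int = 600) -> List[str]:
    """Greedily pack punctuation-delimited pieces into chunks of max_chars.

    A single piece longer than max_chars is kept whole in this sketch.
    """
    pieces = [p for p in _BOUNDARY.split(text) if p]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at punctuation keeps each chunk a run of complete clauses, which is the motivation the bullet gives for better retrieval.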

Chinese-optimised LLM profiles

Three pre-configured profiles for Chinese content are included in default_llm_config.yaml:

deepseek:
  chat_model: "deepseek/deepseek-chat"
  embedding_model: "BAAI/bge-m3"
  temperature: 0.3

qwen:
  chat_model: "dashscope/qwen-plus"
  embedding_model: "dashscope/text-embedding-v3"
  vision_model: "dashscope/qwen-vl-max"
  temperature: 0.3

local-chinese:
  chat_model: "ollama/qwen2.5:7b"
  embedding_model: "BAAI/bge-m3"
  api_base: "http://localhost:11434"
  temperature: 0.3

Use them with --backend deepseek, --backend qwen, or --backend local-chinese.

Optional Chinese dependencies

pip install langdetect jieba langchain-huggingface sentence-transformers

| Package | Purpose |
| --- | --- |
| langdetect | Fallback language detection for ambiguous text |
| jieba | Chinese word segmentation (for future BM25 hybrid retrieval) |
| langchain-huggingface | Local HuggingFace embedding models (bge-m3) |
| sentence-transformers | Backend for langchain-huggingface |

All Chinese features work without these packages — they are only needed for local embedding models and enhanced detection.

Configuration

LLM profiles (llm_config.yaml)

Model and provider configuration is profile-based. Select a profile at runtime with --backend <profile>.

Example profile config:

default_profile: openai

profiles:
  local:
    chat_model: "openai/gpt-4.1"
    embedding_model: "openai/text-embedding-3-small"
    vision_model: "openai/gpt-4.1"
    api_base: "http://localhost:8000/v1"
    api_key: "not-needed"
    max_retries: 3
    timeout: 120

  openai:
    chat_model: "openai/gpt-4o"
    embedding_model: "openai/text-embedding-3-small"
    vision_model: "openai/gpt-4o"
    temperature: 0.2

  anthropic:
    chat_model: "anthropic/claude-sonnet-4-20250514"
    embedding_model: "openai/text-embedding-3-small"
    vision_model: "anthropic/claude-sonnet-4-20250514"
    temperature: 0.2

  ollama:
    chat_model: "ollama/llama3.1"
    embedding_model: "ollama/nomic-embed-text"
    api_base: "http://localhost:11434"

Config discovery order:

  1. Explicit path passed to loader
  2. PAPER_READER_LLM_CONFIG env var
  3. ./llm_config.yaml (current working directory)
  4. ~/.config/paper_reader/llm_config.yaml (user-level)
  5. Legacy PAPER_READER_* env vars (emits deprecation warning)
  6. Bundled default_llm_config.yaml shipped with the package
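
The file-based part of this order (steps 1 to 4) can be sketched as a first-match search. The function names and the choice to skip a missing explicit path rather than raise are assumptions; steps 5 and 6 are only noted in a comment.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def candidate_config_paths(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    cwd: Optional[Path] = None,
    home: Optional[Path] = None,
) -> List[Path]:
    """List candidate config files in the documented priority order."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    home = Path.home() if home is None else Path(home)
    candidates: List[Path] = []
    if explicit:                                  # 1. explicit path
        candidates.append(Path(explicit))
    if env.get("PAPER_READER_LLM_CONFIG"):        # 2. env var
        candidates.append(Path(env["PAPER_READER_LLM_CONFIG"]))
    candidates.append(cwd / "llm_config.yaml")    # 3. working directory
    candidates.append(                            # 4. user-level config
        home / ".config" / "paper_reader" / "llm_config.yaml"
    )
    return candidates


def resolve_llm_config(**kwargs) -> Optional[Path]:
    """Return the first candidate that exists, or None.

    None means the caller would fall through to steps 5 and 6
    (legacy env vars, then the bundled default config).
    """
    for path in candidate_config_paths(**kwargs):
        if path.is_file():
            return path
    return None
```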

Adding a new backend (zero code changes):

# 1. Add a profile to llm_config.yaml — e.g. groq:
#    groq:
#      chat_model: "groq/llama-3.1-70b-versatile"
#      embedding_model: "openai/text-embedding-3-small"
# 2. Set the provider's API key
export GROQ_API_KEY="gsk_..."
# 3. Use it
paper-reader read 2308.13418 --backend groq

Supported backends

Any model supported by LiteLLM works out of the box:

| Provider | Model prefix | Example |
| --- | --- | --- |
| OpenAI | openai/ | openai/gpt-4o |
| Anthropic | anthropic/ | anthropic/claude-sonnet-4-20250514 |
| Ollama | ollama/ | ollama/llama3.1 |
| Groq | groq/ | groq/llama-3.1-70b-versatile |
| Together AI | together_ai/ | together_ai/mistralai/Mixtral-8x7B |
| Local (OpenAI-compat) | openai/ | openai/gpt-4.1 (with api_base) |

Non-LLM settings (.env)

| Variable | Description | Default |
| --- | --- | --- |
| PAPER_READER_LLM_PROFILE | Default LLM profile name | local |
| PAPER_READER_PARSER_BACKEND | Parser backend (marker, mineru, pymupdf) | marker |
| PAPER_READER_MARKER_USE_LLM | Enable Marker LLM-augmented parsing | false |
| PAPER_READER_GROBID_URL | GROBID endpoint | http://localhost:8070 |
| PAPER_READER_ENABLE_VISION_PARSING | Enable page-image vision parsing | false |
| PAPER_READER_CHROMA_PERSIST_DIR | ChromaDB storage directory | ~/.paper_reader/chroma |
| PAPER_READER_CACHE_DIR | Download/cache directory | ~/.paper_reader/cache |

Caching & library

All commands that process a paper (read, chat, agent, export, batch) and the web UI automatically save results to the paper library at ~/.paper_reader/:

| Artifact | Location | Purpose |
| --- | --- | --- |
| Library index | ~/.paper_reader/library.json | Metadata + PaperOverview per paper |
| Structured papers | ~/.paper_reader/papers/{id}.json | Full StructuredPaper for reuse |
| Vector index | ~/.paper_reader/chroma/ | ChromaDB embeddings for Q&A |

Subsequent commands for the same source skip parsing and summarization. When the source is a local PDF file, caching is based on the file's content hash, so renaming or moving a file still hits the cache. Pass --force / -F to bypass the cache and re-process from scratch (the fresh results replace the cached ones).
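
Content-based cache keys can be illustrated with a short sketch. The use of SHA-256 and the name content_hash are assumptions; the README only states that the key is derived from the file's content.

```python
import hashlib
from pathlib import Path
from typing import Union


def content_hash(pdf_path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so large PDFs are not loaded at once.

    SHA-256 is an assumed choice; any stable content digest would do.
    """
    digest = hashlib.sha256()
    with open(pdf_path, "rb") as handle:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Because the key depends only on the bytes, two copies of the same PDF under different names map to the same cache entry, while any edit to the file produces a new key.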

Logging

| Flag / Env var | Level | Effect |
| --- | --- | --- |
| (default) | INFO | Normal progress messages |
| --verbose / -v | DEBUG | Everything including debug traces |
| --quiet / -q | WARNING | Only warnings and errors |
| LITELLM_LOG (env var) | ERROR (default) | Controls LiteLLM's own logging; the ERROR default suppresses chatter. Set LITELLM_LOG=DEBUG to see full LiteLLM traces. |
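
The mapping from flags to log levels could look roughly like this. It is a sketch using the standard logging module; the function names are hypothetical, and rejecting --verbose together with --quiet is an assumption about how conflicts are handled.

```python
import logging
import os
from typing import MutableMapping, Optional


def resolve_log_level(verbose: bool = False, quiet: bool = False) -> int:
    """Map the CLI flags to a standard logging level."""
    if verbose and quiet:
        # Assumption: the two flags are treated as mutually exclusive.
        raise ValueError("--verbose and --quiet cannot be combined")
    if verbose:
        return logging.DEBUG
    if quiet:
        return logging.WARNING
    return logging.INFO


def configure_logging(
    verbose: bool = False,
    quiet: bool = False,
    env: Optional[MutableMapping[str, str]] = None,
) -> int:
    """Apply the level and default LITELLM_LOG to ERROR if it is unset."""
    env = os.environ if env is None else env
    # setdefault keeps any value the user exported, e.g. LITELLM_LOG=DEBUG.
    env.setdefault("LITELLM_LOG", "ERROR")
    level = resolve_log_level(verbose, quiet)
    logging.basicConfig(level=level)
    return level
```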

Optional components

PDF parsers

| Feature | Marker (default) | MinerU | PyMuPDF4LLM |
| --- | --- | --- | --- |
| Best for | General-purpose, best accuracy | Scanned documents | Quick text extraction |
| Install | Core (marker-pdf) | Optional (pip install "mineru[all]") | Core (pymupdf4llm) |
| GPU | Recommended | Supported | Not needed |
| Layout detection | Deep-learning models | Deep-learning models | Rule-based |
| Table extraction | Yes | Yes | Yes |
| Math / equations | Yes | Yes | Limited |
| Figure extraction | Yes | Yes | No |
| LLM-augmented mode | Yes (--use-llm-parsing) | No | No |
| Fallback | Falls back to PyMuPDF4LLM on failure | Falls back to PyMuPDF4LLM on failure | Automatic fallback for all parsers |

Select a parser with --parser marker|mineru|pymupdf. Marker is the default.
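
The fallback behavior described in the table can be sketched as a try-then-fall-back loop. The parsers here are plain callables standing in for real backends, and parse_with_fallback is a hypothetical name; the project's own error handling may differ.

```python
from typing import Callable, Dict, List, Tuple


def parse_with_fallback(
    pdf_path: str,
    parser_name: str,
    parsers: Dict[str, Callable[[str], str]],
    fallback: str = "pymupdf",
) -> Tuple[str, str]:
    """Try the selected parser, then the fallback; return (used, markdown)."""
    order: List[str] = [parser_name]
    if parser_name != fallback:
        order.append(fallback)
    errors: List[str] = []
    for name in order:
        try:
            return name, parsers[name](pdf_path)
        except Exception as exc:  # a sketch: real code would be more selective
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All parsers failed: " + "; ".join(errors))
```

Returning the name of the parser that actually ran lets the caller log when a fallback happened instead of silently degrading output quality.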

GROBID

docker run --rm -p 8070:8070 lfoppiano/grobid:0.8.2

MinerU parser extras

If you plan to use the mineru parser backend, install optional dependencies:

pip install "mineru[all]"

Troubleshooting

Missing API key / auth errors

  • Symptom: errors mentioning unauthorized access, invalid key, or provider auth failure.
  • Fix:
    • For OpenAI profiles, set OPENAI_API_KEY.
    • For Anthropic profiles, set ANTHROPIC_API_KEY.
    • For local profiles, make sure your LLM server is running and api_base is reachable.
  • Verify: run paper-reader read 2308.13418 --backend <profile> --no-grobid and confirm auth errors are gone.

Wrong profile selected

  • Symptom: model/provider mismatch, unexpected endpoint, or profile not found.
  • Fix:
    • Check default_profile in llm_config.yaml.
    • Override per command with --backend <profile>.
    • Ensure the profile name exists under profiles:.
  • Verify: run with an explicit profile and confirm logs/output match the expected provider behavior.

llm_config.yaml not picked up

  • Symptom: app falls back to defaults or legacy env-var behavior.
  • Fix:
    • Run from the project root (where llm_config.yaml exists), or
    • Set PAPER_READER_LLM_CONFIG=/absolute/path/to/llm_config.yaml.
  • Verify: ls -la llm_config.yaml (or echo "$PAPER_READER_LLM_CONFIG") points to the intended file.

GROBID unavailable

  • Symptom: metadata extraction skipped/unavailable.
  • Fix:
    • Start GROBID locally:
      docker run --rm -p 8070:8070 lfoppiano/grobid:0.8.2
    • Or disable it explicitly with --no-grobid.
  • Verify: curl -sS http://localhost:8070/api/isalive returns a healthy response.

Parser dependency issues

  • Symptom: parser backend import/runtime errors.
  • Fix:
    • Prefer marker (default) if optional parser deps are missing.
    • For MinerU, install extras: pip install "mineru[all]".
    • Switch parser explicitly: --parser marker|mineru|pymupdf.
  • Verify: run paper-reader read 2308.13418 --parser <backend> --no-grobid without parser import/runtime errors.

Local/hosted backend not reachable

  • Symptom: connection refused or timeout errors.
  • Fix:
    • Verify the endpoint in your selected profile (api_base, embedding_api_base).
    • Confirm the server is running and reachable from your shell.
    • For local models, test with a smaller request first (paper-reader read <id>).
  • Verify: curl to the relevant endpoint returns JSON instead of connection/timeout errors.

Debug checklist

  1. Inspect active env vars

    env | grep -E '^PAPER_READER_|^OPENAI_API_KEY|^ANTHROPIC_API_KEY'

    Expected: relevant variables print (or empty output if intentionally unset).

  2. Confirm config file resolution inputs

    pwd
    ls -la llm_config.yaml
    echo "$PAPER_READER_LLM_CONFIG"

    Expected: llm_config.yaml exists in current working directory, or PAPER_READER_LLM_CONFIG points to a valid file.

  3. Check GROBID health (if enabled)

    curl -sS http://localhost:8070/api/isalive

    Expected: a healthy, non-error response.

  4. Check local/proxy model endpoints (if used)

    curl -sS http://localhost:8000/v1/models
    curl -sS http://localhost:8080/v1/models
    curl -sS http://localhost:11434/api/tags

    Expected: JSON payloads. Connection/timeout errors mean the service is not reachable.

  5. Run a minimal end-to-end test

    paper-reader read 2308.13418 --backend local --no-grobid

    Expected: progress stages complete and overview output is printed or saved.
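
Steps 3 and 4 of the checklist can also be scripted. The sketch below uses only the standard library and the URLs listed above; the function names are hypothetical, and it reports reachability only (any 2xx response), not whether the payload is valid.

```python
import http.client
import urllib.error
import urllib.request
from typing import Dict, Iterable

# The same endpoints probed with curl in the checklist above.
DEFAULT_URLS = (
    "http://localhost:8070/api/isalive",
    "http://localhost:8000/v1/models",
    "http://localhost:8080/v1/models",
    "http://localhost:11434/api/tags",
)


def endpoint_is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP errors, and bad URLs.
        return False


def check_endpoints(urls: Iterable[str] = DEFAULT_URLS) -> Dict[str, bool]:
    """Probe each URL and map it to its reachability."""
    return {url: endpoint_is_reachable(url) for url in urls}
```

Calling check_endpoints() returns a dict keyed by URL, so a False entry points directly at the service that needs to be started or reconfigured.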

Development

pip install -r requirements-optional.txt
pytest tests/ -v                  # unit/default test suite
pytest tests/ -v -m integration   # tests requiring a running LLM backend

Extending Paper Reader

Parsers, exporters, and Q&A backends use decorator-based registries. Add a new component without modifying existing code:

# Custom parser
from paper_reader.processing.parsing import register_parser  # parsing.py lives under processing/ (see Project structure)

@register_parser("my-parser")
class MyParser:
    name = "my-parser"
    def parse(self, pdf_path, settings, profile): ...
    def is_available(self): return True

# Custom exporter
from paper_reader.export import register_exporter

@register_exporter("html")
class HtmlExporter:
    format_name = "html"
    file_suffix = ".html"
    def export(self, paper, overview, output_path, **kw): ...

# Pipeline hooks
from paper_reader.pipeline.hooks import pipeline_hooks

@pipeline_hooks.after("summarize")
def notify_on_summary(result, **kwargs):
    print(f"Summary complete: {result.paper.title}")
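
For readers unfamiliar with the pattern, here is a self-contained sketch of how a decorator-based registry can work. It is a generic illustration, not the project's implementation: the Registry class and its methods are hypothetical, and only the HtmlExporter attributes follow the example above.

```python
from typing import Callable, Dict, List


class Registry:
    """Minimal name-to-class registry populated through a decorator."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._items: Dict[str, type] = {}

    def register(self, name: str) -> Callable[[type], type]:
        def decorator(cls: type) -> type:
            if name in self._items:
                raise ValueError(f"{self.kind} {name!r} is already registered")
            self._items[name] = cls
            return cls  # hand the class back unchanged
        return decorator

    def create(self, name: str, *args, **kwargs):
        """Instantiate the class registered under name."""
        try:
            cls = self._items[name]
        except KeyError:
            raise KeyError(f"Unknown {self.kind}: {name!r}") from None
        return cls(*args, **kwargs)

    def names(self) -> List[str]:
        return sorted(self._items)


exporters = Registry("exporter")


@exporters.register("html")
class HtmlExporter:
    format_name = "html"
    file_suffix = ".html"

    def export(self, paper, overview, output_path, **kw):
        # Toy rendering for the sketch; output_path is ignored here.
        return f"<h1>{paper}</h1><p>{overview}</p>"
```

Registration happens as a side effect of importing the module that defines the class, which is why new components can be added without touching the dispatch code.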

High-level architecture

Input (PDF / arXiv URL / arXiv ID)
    │
    ▼
Ingestion ──→ Parsing (registry: marker / mineru / pymupdf) ──→ Structuring
    │              │                                                  │
    │         GROBID metadata (optional) ─────────────────────────────┘
    │                                                                 │
    ▼                                                                 ▼
ArxivMetadata                                                  StructuredPaper
                                                                      │
                              PipelineRunner (event-yielding)         │
                              ┌───────────────────────────────────────┤
                              ▼                                       ▼
                         Cache                                Summarizer (LangGraph)
                  (Library + Disk)                            ├── Map: section summaries
                              │                               ├── Reduce: synthesis
                              │                               └── Evaluate: quality check
                              │                                       │
                              │                                       ▼
                              │                                 PaperOverview ──→ Cache
                              │                                       │
                     ┌────────┼───────────────────┬───────────────────┤
                     ▼        │                   ▼                   ▼
                 Indexing      │          Export (registry:       Library
               (ChromaDB)      │        MD / JSON / BibTeX)     (JSON DB)
                     │        │
                     ▼        │
            Chat / Agent Q&A ◄─┘  (Q&A backend registry; reuses cache)
            ├── Retrieve + Grade
            ├── Generate answer
            ├── Hallucination check
            └── Answer quality check

            Pipeline hooks: before/after any step (extensible)
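
The "event-yielding" runner and the before/after hooks in the diagram can be sketched with a generator. Event, Hooks, and run_pipeline below are simplified stand-ins with assumed fields and signatures; they are not the project's PipelineEvent, PipelineHooks, or PipelineRunner.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Iterator, List, Tuple


@dataclass
class Event:
    step: str
    status: str          # "started" or "finished"
    payload: Any = None  # the step's result once finished


class Hooks:
    """Registry of callables to run before or after a named step."""

    def __init__(self) -> None:
        self._before = defaultdict(list)
        self._after = defaultdict(list)

    def before(self, step: str) -> Callable:
        def decorator(func: Callable) -> Callable:
            self._before[step].append(func)
            return func
        return decorator

    def after(self, step: str) -> Callable:
        def decorator(func: Callable) -> Callable:
            self._after[step].append(func)
            return func
        return decorator

    def run(self, when: str, step: str, value: Any) -> None:
        table = self._before if when == "before" else self._after
        for func in table[step]:
            func(value)


def run_pipeline(
    source: Any, steps: List[Tuple[str, Callable]], hooks: Hooks
) -> Iterator[Event]:
    """Run steps in order, yielding an event as each one starts and ends."""
    value = source
    for name, func in steps:
        hooks.run("before", name, value)
        yield Event(name, "started")
        value = func(value)  # each step consumes the previous step's output
        hooks.run("after", name, value)
        yield Event(name, "finished", value)
```

Because the runner is a generator, a CLI or web UI can update its progress display as each event arrives instead of waiting for the whole pipeline to finish.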

Project structure

paper_reader/
├── pyproject.toml              # Package metadata & entry point
├── requirements.txt            # Core dependencies
├── requirements-optional.txt   # Dev / test dependencies
├── llm_config.yaml             # LLM provider profiles (user-editable)
├── .env.example                # Non-LLM settings template
├── paper_reader/
│   ├── __init__.py             # Public API surface & re-exports
│   ├── core/                   # Layer 0 — foundation types & utilities
│   │   ├── __init__.py
│   │   ├── exceptions.py       # PaperReaderError hierarchy
│   │   ├── protocols.py        # Structural Protocol definitions
│   │   ├── models.py           # Pydantic data models
│   │   ├── config.py           # pydantic-settings (parser, storage, etc.)
│   │   ├── language.py         # Language detection & output language resolution
│   │   ├── utils.py            # Pure utility functions (zero intra-project imports)
│   │   └── library.py          # Persistent paper library (JSON)
│   ├── llm/                    # Layer 1 — LLM infrastructure
│   │   ├── __init__.py
│   │   ├── config.py           # LLM profile model + YAML loader
│   │   ├── backend.py          # Thin LiteLLM facade
│   │   └── default_llm_config.yaml  # Bundled fallback config
│   ├── processing/             # Layer 2 — document processing pipeline
│   │   ├── __init__.py
│   │   ├── ingestion.py        # PDF download & arXiv resolution
│   │   ├── parsing.py          # Parser registry + PDF → Markdown backends
│   │   ├── metadata.py         # GROBID academic metadata extraction
│   │   ├── structuring.py      # Section classification & merging
│   │   ├── summarizer.py       # LangGraph map-reduce summarization
│   │   └── indexing.py         # Section-aware chunking → ChromaDB
│   ├── qa/                     # Layer 3 — Q&A & interactive features
│   │   ├── __init__.py         # Q&A backend registry re-exports
│   │   ├── _registry.py        # Q&A backend registry (rag / agent)
│   │   ├── rag.py              # Adaptive RAG Q&A pipeline
│   │   ├── agent.py            # Tool-based agent Q&A
│   │   ├── memory.py           # Persistent session memory & consolidation
│   │   ├── context.py          # System prompt builder for Q&A
│   │   └── comparison.py       # Multi-paper comparative analysis
│   ├── ui/                     # Layer 5 — presentation
│   │   ├── __init__.py
│   │   ├── app.py              # Gradio web UI (layout + event wiring)
│   │   └── handlers.py         # Gradio callback handlers
│   ├── cli/                    # Typer CLI (one file per command)
│   │   ├── __init__.py         # App object + sub-app registration
│   │   ├── __main__.py         # python -m paper_reader.cli support
│   │   ├── _common.py          # Logging, filename helpers, i18n lookup
│   │   ├── parse_cmd.py        # `parse` command
│   │   ├── read_cmd.py         # `read` command
│   │   ├── chat_cmd.py         # `chat` command
│   │   ├── agent_cmd.py        # `agent` command
│   │   ├── batch_cmd.py        # `batch` command
│   │   ├── serve_cmd.py        # `serve` command
│   │   ├── export_cmd.py       # `export` command
│   │   ├── compare_cmd.py      # `compare` command
│   │   ├── library_cmd.py      # `library` sub-commands
│   │   └── config_cmd.py       # `config` sub-commands
│   ├── pipeline/               # Pipeline orchestration
│   │   ├── __init__.py         # Re-exports public names
│   │   ├── steps.py            # step_ingest .. step_summarize
│   │   ├── cache.py            # CachedResult, load_cached, persist_to_library
│   │   ├── helpers.py          # make_settings, resolve_profile, build_metadata_md
│   │   ├── runner.py           # PipelineRunner (event-yielding generator)
│   │   ├── events.py           # PipelineEvent dataclass
│   │   └── hooks.py            # PipelineHooks registry
│   ├── export/                 # Export formats
│   │   ├── __init__.py         # export_paper() dispatch + exporter registry
│   │   ├── markdown.py         # MarkdownExporter
│   │   ├── json_export.py      # JsonExporter
│   │   └── bibtex.py           # BibtexExporter
│   ├── i18n/                   # User-facing strings (CLI + UI)
│   │   ├── __init__.py         # get_text(key, lang, **kwargs)
│   │   ├── en.py               # English strings
│   │   └── zh.py               # Chinese strings
│   └── prompts/                # LLM-facing prompt templates
│       ├── __init__.py         # Prompt dispatcher (language-aware)
│       ├── en.py               # English prompts
│       └── zh.py               # Chinese prompts
├── tests/                      # Mirrors source layout
│   ├── conftest.py
│   ├── core/                   # Tests for paper_reader.core
│   ├── llm/                    # Tests for paper_reader.llm
│   ├── processing/             # Tests for paper_reader.processing
│   ├── qa/                     # Tests for paper_reader.qa
│   ├── ui/                     # Tests for paper_reader.ui
│   ├── pipeline/               # Tests for paper_reader.pipeline
│   ├── export/                 # Tests for paper_reader.export
│   └── cli/                    # Tests for paper_reader.cli
└── notebooks/
    └── paper_reader.ipynb      # Jupyter notebook interface
