Automated pipeline for generating academic-quality survey reports from structured PDF paper collections. Powered by MinerU (PDF parsing) and Gemini (content generation).
```
PDF Papers (organized in folders)
        │  MinerU
        ▼
Markdown per paper
        │  LLM stages: Outline → Section Briefs → Section Writing → Assembly → Review & Translate
        ▼
Final Report: output/<topic>/report_final.md
```
| Stage | What happens | Output |
|---|---|---|
| 1. PDF Parsing | MinerU converts each PDF to markdown. Already-parsed PDFs are skipped. | `*.md` alongside each PDF |
| 2. Outline | The LLM reads all abstracts plus your README.md to produce a structured outline with [AuthorYear] citation keys. | `output/<topic>/outline.md` |
| 3. Section Briefs | The LLM generates a tailored writing brief for each section from the outline and your README.md, using Structured Output (JSON). | `output/<topic>/prompts/sections/*.txt` |
| 4. Section Writing | Each section is written from the full paper text of its own papers and abstracts only for other sections' papers, using its auto-generated brief. | `output/<topic>/sections/*.md` |
| 5. Assembly | The LLM assembles all sections into one coherent report with title, abstract, transitions, and a references list. | `output/<topic>/report_draft.md` |
| 6. Review & Translate | The LLM performs factual, citation, and grammar checks, then translates to FINAL_LANGUAGE if it is not English. | `output/<topic>/report_final.md` |
All intermediate stages use English. Only the final review stage translates to your configured language.
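The staged flow above can be sketched as a checkpointed loop — a minimal sketch with hypothetical stage functions, not the real orchestrator in `src/main.py`:

```python
import json
from pathlib import Path

# Hypothetical stage stubs; the real pipeline wires these to MinerU and Gemini.
def parse_pdfs(topic): return "parsed"
def make_outline(topic): return "outline"
def make_briefs(topic): return "briefs"
def write_sections(topic): return "sections"
def assemble(topic): return "draft"
def review(topic): return "final"

STAGES = [
    ("parse", parse_pdfs),
    ("outline", make_outline),
    ("briefs", make_briefs),
    ("sections", write_sections),
    ("assemble", assemble),
    ("review", review),
]

def run_pipeline(topic: str, state_path: Path) -> dict:
    """Run stages in order, skipping any stage already checkpointed."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    for name, fn in STAGES:
        if name in state:      # completed on a previous run: resume past it
            continue
        state[name] = fn(topic)
        state_path.write_text(json.dumps(state))  # checkpoint after each stage
    return state
```

Re-running with the same `state_path` is then a no-op for finished stages, which is what makes resume-from-checkpoint cheap.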
```shell
# 1. Install uv (if not installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Create a virtual environment with Python 3.12
uv venv .venv --python 3.12
source .venv/bin/activate

# 3. Install dependencies
uv pip install -e "."

# 4. Configure the environment
cp .env_example .env
# Edit .env: set GOOGLE_API_KEY and FINAL_LANGUAGE

# 5. Run
./run.sh topics/RAG
```

`run.sh` will auto-sanitize PDF filenames (replacing spaces and hyphens with underscores), auto-create the venv, and install dependencies if you skip steps 2-3.
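The filename sanitization is roughly equivalent to the following sketch (the actual logic lives in `scripts/sanitize_pdfs.sh`; function name here is illustrative):

```shell
#!/usr/bin/env sh
# Replace spaces and hyphens in PDF filenames with underscores so downstream
# tools get predictable paths. Sketch only, not the shipped script.
sanitize() {
  dir="$1"
  find "$dir" -type f -name '*.pdf' | while IFS= read -r f; do
    base=$(basename "$f")
    clean=$(printf '%s' "$base" | tr ' -' '__')
    [ "$base" != "$clean" ] && mv "$f" "$(dirname "$f")/$clean"
  done
}
```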
```shell
./run.sh <topic_dir> [options]

Options:
  --output-dir DIR       Override output directory
  --model MODEL          Override Gemini model (default: gemini-2.5-pro)
  --thinking-budget N    Override thinking budget in tokens (default: 8192)
  --reset                Discard saved progress, start from scratch
```
A topic is a folder under `topics/` containing PDFs organized into sections:

```
topics/
  YourTopic/
    README.md            # Your structural guidance (required)
    Section A/
      paper1.pdf
      paper2.pdf
    Section B/
      Subsection B1/
        paper3.pdf
      Subsection B2/
        paper4.pdf
    Section C/
      paper5.pdf
```
- Folder names = section headings in the generated report
- Nesting = heading hierarchy (subfolder becomes a subsection)
- PDF placement determines which papers belong to which section
- The same PDF can appear in multiple sections if needed
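The folder-to-section mapping performed by `src/structure.py` can be sketched like this (a hypothetical simplification, not the actual implementation):

```python
from pathlib import Path

def build_section_tree(topic_dir: Path) -> dict:
    """Recursively map a topic folder to a nested section dict:
    folder name -> heading, subfolders -> subsections, PDFs -> papers."""
    return {
        "title": topic_dir.name,
        "papers": sorted(p.name for p in topic_dir.glob("*.pdf")),
        "subsections": [
            build_section_tree(d)
            for d in sorted(topic_dir.iterdir())
            if d.is_dir()
        ],
    }
```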
This is the only file you write per topic. It tells the pipeline your intended structure and narrative focus. Example:
```
I. Introduction (origins and motivation)
II. Core Architecture & Bottlenecks
III. Evaluation (place benchmarks before advanced methods so readers have the measuring stick first)
IV. Advanced Methods: Comparative Analysis
V. Challenges, Security & Future Directions
```

You don't need to match folder names exactly. The LLM uses your README as narrative guidance and maps it to the actual folder structure. You can add notes about emphasis, ordering, what to compare, and so on.
The pipeline saves progress after every stage to `output/<topic>/state.json` via atomic writes, so an interrupted run never leaves a corrupt state file.
- Re-run the same command to resume from the last checkpoint
- PDF parsing is incremental: only new/unparsed PDFs are processed
- LLM failures retry automatically with exponential backoff (3 attempts)
- PDF parsing failures retry with exponential backoff (2 attempts)
- Use `--reset` to force a complete re-run
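The atomic save can be sketched as write-to-temp-then-rename (`os.replace` is atomic on POSIX); function name here is illustrative, the real code is in `src/state.py`:

```python
import json
import os
import tempfile
from pathlib import Path

def save_state_atomic(state: dict, path: Path) -> None:
    """Write state to a temp file in the same directory, then atomically
    replace the target, so a crash mid-write cannot corrupt state.json."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())   # ensure bytes hit disk before the rename
        os.replace(tmp, path)      # atomic swap over the old state file
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on failure
        raise
```

The temp file must live in the same directory as the target: `os.replace` is only atomic within a single filesystem.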
All options can be set in `.env` (see `.env_example`):

| Variable | Default | Description |
|---|---|---|
| `GOOGLE_API_KEY` | (required) | Gemini API key |
| `FINAL_LANGUAGE` | `English` | Language for the final report. Intermediate stages are always English. |
| `MODEL_NAME` | `gemini-2.5-pro` | Gemini model to use |
| `THINKING_BUDGET` | `8192` | Token budget for Gemini extended thinking |
| `OUTPUT_DIR` | `output` | Output base directory |
| `PROMPTS_DIR` | `prompts` | Reusable prompts directory |
CLI arguments override .env values.
```
Report-By-Agent/
├── run.sh                       # Entry point (auto-sanitizes PDF names)
├── .env_example                 # Configuration template
├── pyproject.toml               # Python dependencies (mineru>=3.0.0)
├── scripts/                     # Utility scripts
│   └── sanitize_pdfs.sh         # Cleans up PDF filenames
├── tests/                       # Unit tests
│   └── test_pdf_parser.py       # Tests for robust PDF reference truncation
├── prompts/                     # Reusable prompts (shared across all topics)
│   ├── outline.txt              # Outline generation
│   ├── gen_section_briefs.txt   # Meta-prompt for auto-generating section briefs
│   ├── section_default.txt      # Section writing instructions
│   ├── assemble.txt             # Report assembly
│   └── review.txt               # Review + translation
├── src/                         # Pipeline source code
│   ├── main.py                  # Orchestrator
│   ├── config.py                # Configuration loading
│   ├── state.py                 # Checkpoint/resume state (atomic save)
│   ├── llm_client.py            # Gemini API wrapper with retry & Structured Output
│   ├── pdf_parser.py            # MinerU integration & robust reference truncation
│   ├── structure.py             # Folder tree → section tree
│   ├── outline_generator.py     # Stage 2
│   ├── prompt_generator.py      # Stage 3 (uses JSON response_schema)
│   ├── section_generator.py     # Stage 4
│   ├── assembler.py             # Stage 5
│   └── reviewer.py              # Stage 6
├── topics/                      # Input: one folder per topic
│   └── RAG/
│       ├── README.md
│       ├── Introduction/
│       ├── Foundational RAG Architecture/
│       │   ├── Pre-Retrieval/
│       │   ├── Ranking & Hybrid Search/
│       │   └── Long Context/
│       ├── Evaluation Benchmarks & Metrics/
│       ├── Advanced Methodologies/
│       │   ├── Agentic RAG/
│       │   ├── Graph RAG/
│       │   └── Multimodel RAG/
│       ├── System Implementations & Domain Applications/
│       └── Challenges, Security & Future Work/
└── output/                      # Generated output
    └── RAG/
        ├── state.json
        ├── outline.md
        ├── prompts/sections/    # Auto-generated section briefs
        ├── sections/            # Individual section drafts
        ├── report_draft.md
        └── report_final.md      # The final report
```
The five prompts in `prompts/` are generic and topic-agnostic. You can edit them to adjust:

- Writing style or depth (`section_default.txt`)
- Outline format (`outline.txt`)
- Assembly strategy (`assemble.txt`)
- Review criteria (`review.txt`)
- Brief generation logic (`gen_section_briefs.txt`)
These edits apply to all topics. Per-topic customization is done exclusively through `topics/<name>/README.md`.