chunkr is a Rust CLI for turning an ebook library into search- and retrieval-ready text artifacts. It extracts text and metadata from Calibre-managed books, normalizes and chunks that text, generates embeddings through an HTTP provider, and inserts the resulting records into downstream systems such as Qdrant and Quickwit.
The repository is organized around a practical ingestion pipeline:
- extract: walk a Calibre library and emit plain text plus sidecar metadata.
- chunk: clean extracted text and convert it into JSONL chunk records.
- insert: embed chunk text and write the results to vector and search indexes.
- dups, dup-stats, dedup: inspect and manage duplicate books in a Calibre library.
Chunkr is not just a text splitter. The crate is structured as an end-to-end ingestion tool for personal or research-scale libraries where Calibre is the source of truth and Qdrant/Quickwit are downstream consumers.
Core capabilities:
- Extract .epub and .pdf content from a Calibre library tree.
- Preserve book-level metadata in JSON sidecars during extraction.
- Normalize text before chunking, including Unicode cleanup and whitespace collapsing.
- Produce JSONL chunk records with stable metadata fields and per-chunk offsets.
- Generate embeddings through a configurable HTTP embedding provider.
- Insert the same chunk corpus into both a vector store and a search index.
- Scan a library for duplicate files and summarize duplicate storage overhead.
- Score duplicate Calibre records by metadata completeness before removing lower-quality copies.
The binary exposes the following subcommands:
chunkr extract
chunkr chunk
chunkr insert
chunkr dups
chunkr dup-stats --input dups.json
chunkr dedup --input dups.json --dry-run
Global flags:
- -c, --config <PATH>: load a TOML config file. Defaults to config.toml.
- -h, --help: print help.
- -V, --version: print the version.
chunkr extract walks paths.calibre_root, filters by extract.extensions, and writes extracted text under paths.extract_root.
Current format support:
- EPUB extraction uses pandoc.
- PDF extraction uses a docling-based pipeline, with optional text-first classification and OCR fallback logic.
Extraction behavior includes:
- Layout templating for output and metadata paths through extract.output_layout and extract.metadata_layout.
- Optional skipping of already extracted books with extract.skip_existing.
- Optional metadata sidecar generation with extract.write_metadata.
- EPUB chapter splitting for large books.
- PDF quality classification to distinguish text PDFs, low-quality text PDFs, and scan-like PDFs.
- Oversize PDF skipping when configured.
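The three-way PDF quality classification could be sketched as follows. This is only an illustration: the thresholds, the `PdfQuality` and `classify_pdf` names, and the idea of scoring already-extracted probe text are assumptions, not the heuristics in extract.rs.

```rust
/// Hypothetical quality buckets mirroring the three cases named above.
#[derive(Debug, PartialEq)]
enum PdfQuality {
    Text,
    LowQualityText,
    ScanLike,
}

// Assumed thresholds, chosen only for illustration.
const MIN_CHARS_PER_PAGE: usize = 200;
const MIN_READABLE_RATIO: f64 = 0.8;

/// Classify a PDF from the text a quick probe extracted and its page count.
fn classify_pdf(probe_text: &str, pages: usize) -> PdfQuality {
    let non_ws = probe_text.chars().filter(|c| !c.is_whitespace()).count();
    // Almost no text per page suggests scanned images that need OCR.
    if pages == 0 || non_ws / pages < MIN_CHARS_PER_PAGE {
        return PdfQuality::ScanLike;
    }
    // Plenty of text, but mostly symbols: likely a broken text layer.
    let readable = probe_text.chars().filter(|c| c.is_alphanumeric()).count();
    if (readable as f64) / (non_ws as f64) < MIN_READABLE_RATIO {
        PdfQuality::LowQualityText
    } else {
        PdfQuality::Text
    }
}
```

Under these assumed thresholds, a document with no extractable text is routed to the scan-like bucket, which is where an OCR fallback would apply.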
Typical output under extract_root looks like:
extract_root/
  epub/
    some_book.txt
    some_book.json
  pdf/
    another_book.txt
    another_book.json
The extraction metadata sidecars include fields such as source path, format, title, authors, language, publication date, identifiers, Calibre id, and extraction timestamp.
chunkr chunk reads .txt files from paths.extract_root, normalizes them, splits them into paragraphs, merges or breaks paragraphs according to configured size thresholds, and writes JSONL chunk files under paths.chunk_root.
Chunking behavior includes:
- Unicode normalization.
- Whitespace collapsing.
- Optional header/table-of-contents stripping.
- Minimum and maximum paragraph sizing.
- Target and hard maximum chunk lengths.
- Overlap between adjacent chunks.
- Metadata projection from file-level extraction metadata into each chunk record.
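To make the overlap and offset behavior concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification: it ignores paragraph boundaries and the hard maximum length, and the `Chunk` and `chunk_text` names are hypothetical rather than taken from chunk.rs.

```rust
/// Hypothetical chunk with character offsets into the normalized text.
#[derive(Debug)]
struct Chunk {
    index: usize,
    char_start: usize,
    char_end: usize,
    text: String,
}

/// Split `text` into windows of at most `target` characters, where each
/// window after the first re-reads the last `overlap` characters of the
/// previous one.
fn chunk_text(text: &str, target: usize, overlap: usize) -> Vec<Chunk> {
    assert!(target > overlap, "target must exceed overlap");
    // Work on chars so offsets never fall inside a multi-byte code point.
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + target).min(chars.len());
        let index = chunks.len();
        chunks.push(Chunk {
            index,
            char_start: start,
            char_end: end,
            text: chars[start..end].iter().collect(),
        });
        if end == chars.len() {
            break;
        }
        start = end - overlap;
    }
    chunks
}
```

With `target = 4` and `overlap = 1`, the text `abcdefghij` yields `abcd`, `defg`, and `ghij`, starting at offsets 0, 3, and 6.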
Each emitted JSONL line contains:
- id: a generated UUID.
- text: chunk text.
- metadata: source and book metadata plus chunk-local fields such as chunk_index, char_start, and char_end.
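As a rough picture of one JSONL line, the sketch below serializes a record by hand using only the standard library. The `ChunkRecord` struct and the `source_path` key are assumptions for illustration; the real record carries more metadata, and a real implementation would more likely use a JSON library than manual escaping.

```rust
/// Hypothetical, trimmed-down chunk record.
struct ChunkRecord {
    id: String, // a UUID in the real output; any string here
    text: String,
    source_path: String, // assumed metadata key name
    chunk_index: usize,
    char_start: usize,
    char_end: usize,
}

/// Escape a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one record as a single JSONL line (no trailing newline).
fn to_jsonl_line(r: &ChunkRecord) -> String {
    format!(
        "{{\"id\":\"{}\",\"text\":\"{}\",\"metadata\":{{\"source_path\":\"{}\",\"chunk_index\":{},\"char_start\":{},\"char_end\":{}}}}}",
        json_escape(&r.id),
        json_escape(&r.text),
        json_escape(&r.source_path),
        r.chunk_index,
        r.char_start,
        r.char_end
    )
}
```

Because newlines inside the chunk text are escaped, each record stays on exactly one line, which is what makes the file safe to read line by line.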
Typical output under chunk_root looks like:
chunk_root/
  epub/
    some_book.jsonl
  pdf/
    another_book.jsonl
chunkr insert reads JSONL chunks from paths.chunk_root, batches them, requests embeddings from the configured provider, then writes results to:
- Qdrant for vector search.
- Quickwit for text search / indexing.
Insertion behavior includes:
- Parallel file ingestion.
- Batched embedding requests.
- Retry and backoff controls.
- Optional in-memory embedding cache.
- Optional Qdrant collection creation.
- Optional Quickwit final commit after all files are processed.
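Batched embedding requests combined with an in-memory cache might look roughly like this. The `embed_with_cache` function and the closure standing in for the HTTP provider are hypothetical; the sketch also assumes the provider returns exactly one vector per input text, in order.

```rust
use std::collections::HashMap;

/// Embed `texts`, calling `embed_batch` only for texts not yet cached and
/// never with more than `batch_size` texts per call.
fn embed_with_cache<F>(
    texts: &[String],
    batch_size: usize,
    cache: &mut HashMap<String, Vec<f32>>,
    mut embed_batch: F,
) -> Vec<Vec<f32>>
where
    F: FnMut(&[String]) -> Vec<Vec<f32>>,
{
    // Collect each uncached text once, preserving first-seen order.
    let mut misses: Vec<String> = Vec::new();
    for t in texts {
        if !cache.contains_key(t) && !misses.contains(t) {
            misses.push(t.clone());
        }
    }
    for batch in misses.chunks(batch_size.max(1)) {
        let vectors = embed_batch(batch);
        assert_eq!(vectors.len(), batch.len(), "provider must return one vector per text");
        for (text, vector) in batch.iter().zip(vectors) {
            cache.insert(text.clone(), vector);
        }
    }
    // Every text is now cached, so the output keeps the input order.
    texts.iter().map(|t| cache[t].clone()).collect()
}
```

A repeated text costs one provider call at most, and a second pass over the same corpus with a warm cache makes no provider calls at all.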
The current config shape suggests the intended deployment model is local services, for example:
- Qdrant at http://127.0.0.1:6333
- Quickwit at http://127.0.0.1:7280
- Ollama-compatible embeddings at http://127.0.0.1:11434
Chunkr also contains a separate duplicate-management workflow for Calibre libraries.
chunkr dups:
- Walks a library tree.
- Filters by extension and minimum size.
- Hashes candidate files in parallel.
- Emits duplicate groups in text or JSON format.
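The grouping step can be sketched by keying files on their size and a content hash. This example works on in-memory byte slices and uses the standard library's `DefaultHasher` purely as a stand-in: it is not collision-resistant, and the real scan hashes files from disk in parallel with a configurable algorithm. `content_key` and `duplicate_groups` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Key a file by (length, hash). DefaultHasher is a placeholder for a
/// real content hash.
fn content_key(bytes: &[u8]) -> (usize, u64) {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    (bytes.len(), hasher.finish())
}

/// Return groups of paths whose contents share a key, keeping only groups
/// with at least two members. Output is sorted for stable reports.
fn duplicate_groups(files: &[(&str, &[u8])]) -> Vec<Vec<String>> {
    let mut by_key: HashMap<(usize, u64), Vec<String>> = HashMap::new();
    for (path, bytes) in files {
        by_key
            .entry(content_key(bytes))
            .or_default()
            .push(path.to_string());
    }
    let mut groups: Vec<Vec<String>> = by_key
        .into_values()
        .filter(|group| group.len() > 1)
        .collect();
    for group in &mut groups {
        group.sort();
    }
    groups.sort();
    groups
}
```

Including the length in the key is cheap and means files of different sizes can never be grouped together, even if the placeholder hash collides.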
chunkr dup-stats:
- Reads the JSON report from chunkr dups.
- Summarizes group counts, file counts, and extra bytes consumed by duplicates.
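The summary arithmetic is simple enough to show directly. In this sketch, which uses hypothetical `DupGroup` and `DupSummary` types rather than the actual report schema, the extra bytes for a group are the file size times the number of copies beyond the first.

```rust
/// Hypothetical duplicate group: identical files of one size.
struct DupGroup {
    size_bytes: u64,
    paths: Vec<String>,
}

/// Hypothetical summary of a duplicate report.
#[derive(Debug, PartialEq)]
struct DupSummary {
    groups: usize,
    files: usize,
    extra_bytes: u64,
}

/// Count groups and files, and total the bytes that would be freed if
/// every group were reduced to a single copy.
fn summarize(groups: &[DupGroup]) -> DupSummary {
    let mut summary = DupSummary { groups: 0, files: 0, extra_bytes: 0 };
    for group in groups {
        let copies = group.paths.len();
        if copies < 2 {
            continue; // a single file is not a duplicate group
        }
        summary.groups += 1;
        summary.files += copies;
        summary.extra_bytes += group.size_bytes * (copies as u64 - 1);
    }
    summary
}
```

For example, three copies of a 100-byte file plus two copies of a 50-byte file give 2 groups, 5 files, and 250 extra bytes.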
chunkr dedup:
- Reads duplicate groups from chunkr dups.
- Extracts Calibre book ids from duplicate paths.
- Fetches Calibre metadata from a local library or content server target.
- Scores candidates by metadata completeness.
- Keeps the best-scoring record and removes lower-scoring duplicates.
- Supports --dry-run and should generally be run that way first.
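Extracting a Calibre book id from a path relies on Calibre's on-disk layout, where each book lives in a directory named like `Title (id)` under its author. The sketch below shows one way to parse that; `calibre_id_from_path` is a hypothetical helper, and the actual parsing in dedup.rs may differ.

```rust
use std::path::Path;

/// Parse the numeric id from a book file's parent directory, which
/// Calibre names `Title (id)`. Returns None when the pattern is absent.
fn calibre_id_from_path(path: &Path) -> Option<u64> {
    let dir = path.parent()?.file_name()?.to_str()?;
    // Use the last '(' so titles containing parentheses still work.
    let open = dir.rfind('(')?;
    let inner = dir[open + 1..].strip_suffix(')')?;
    inner.parse().ok()
}
```

Paths that do not follow the layout, or whose trailing parentheses hold something other than a number, yield `None` instead of a guessed id, which is the safer failure mode for a destructive workflow.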
Rust requirements:
- Rust edition 2024
- Cargo
External tools and services depend on which commands you use.
For extraction:
- pandoc for EPUB extraction.
- pdffonts, pdftotext, and pdfinfo for PDF inspection/text-first heuristics.
- A docling Python environment and entrypoint script for PDF conversion.
- OCR tooling if your docling setup depends on it.
For insertion:
- A running embeddings endpoint compatible with the configured provider.
- A reachable Qdrant instance.
- A reachable Quickwit instance.
For deduplication:
- Access to a Calibre library path and/or Calibre content server.
- A working Calibre CLI environment if book removal is performed live.
Build the project:
cargo build
Inspect commands:
cargo run -- --help
cargo run -- dups --help
cargo run -- dup-stats --help
cargo run -- dedup --help
Run the main pipeline:
cargo run -- extract
cargo run -- chunk
cargo run -- insert
Run duplicate analysis:
cargo run -- dups --out dups.json
cargo run -- dup-stats --input dups.json
cargo run -- dedup --input dups.json --dry-run
Configuration is TOML-based and loaded from config.toml by default. The shipped root config is a real working example and is the best starting point when adapting the project to a new machine.
Top-level sections:
- [logging]
- [paths]
- [extract]
- [extract.epub]
- [extract.pdf]
- [chunk]
- [chunk.metadata]
- [insert]
- [insert.qdrant]
- [insert.quickwit]
- [insert.embeddings]
- [calibre]
- [calibre.content_server]
- [calibre.scoring]
- [dups]
- [dup_stats]
- [dedup]
[paths] defines where the pipeline reads from and writes to:
- calibre_root: source library tree to scan for books.
- extract_root: destination for extracted plain text and extraction metadata.
- chunk_root: destination for chunk JSONL files.
- state_dir: state/work directory for supporting workflows.
- examples_cfr_dir: example text corpus used by the ignored pipeline test.
[extract] controls which formats are processed and how outputs are named.
Important fields:
- extensions: file extensions to include.
- skip_existing: avoid re-extracting existing outputs.
- write_metadata: emit .json sidecars next to extracted text.
- output_layout: path template for extracted text.
- metadata_layout: path template for metadata sidecars.
Layout placeholders currently used by the code:
- {format}
- {title_slug}
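A layout template using these two placeholders could be rendered along the following lines. This is a sketch under assumptions: the `slugify` rules (lowercase, underscores between words) are inferred from example names like some_book.txt, and both function names are illustrative rather than the helpers in util.rs.

```rust
/// Lowercase a title and join its alphanumeric runs with `_`.
/// The exact slug rules here are an assumption.
fn slugify(title: &str) -> String {
    let mut slug = String::new();
    let mut pending_sep = false;
    for ch in title.chars() {
        if ch.is_alphanumeric() {
            // Emit one separator per run of skipped characters, and never
            // at the start of the slug.
            if pending_sep && !slug.is_empty() {
                slug.push('_');
            }
            pending_sep = false;
            slug.extend(ch.to_lowercase());
        } else {
            pending_sep = true;
        }
    }
    slug
}

/// Substitute `{format}` and `{title_slug}` into a layout template.
fn render_layout(template: &str, format: &str, title: &str) -> String {
    template
        .replace("{format}", format)
        .replace("{title_slug}", &slugify(title))
}
```

With a template of `{format}/{title_slug}.txt`, the format `epub`, and the title `Some Book!`, this renders `epub/some_book.txt`.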
[extract.epub] configures the pandoc path and large-book splitting behavior.
Important fields:
- backend
- pandoc_bin
- toc_depth
- chapter_split
- max_chapter_bytes
- max_file_bytes
- join_parts
- keep_parts
[extract.pdf] is the most operationally dense part of the config. It controls:
- the docling executable and script paths,
- PDF text-quality probing,
- OCR fallback decisions,
- table extraction modes,
- batching and timeout behavior,
- size limits and split-pass extraction for large PDFs.
For anyone onboarding to the project, this is the section where environment drift is most likely to break the pipeline.
[chunk] controls text cleanup and chunk sizing.
Important fields:
- normalize_unicode
- collapse_whitespace
- strip_headers
- min_paragraph_chars
- max_paragraph_chars
- target_chunk_chars
- max_chunk_chars
- chunk_overlap_chars
- emit_jsonl
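To illustrate how whitespace collapsing and a minimum paragraph size interact, here is a small sketch. It assumes paragraphs are separated by blank lines (`\n\n`) and merges short paragraphs forward; it does not break up paragraphs that exceed a maximum, and the function names are hypothetical rather than those in chunk.rs.

```rust
/// Replace every run of whitespace with a single space and trim the ends.
fn collapse_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Split on blank lines, then merge paragraphs shorter than `min_chars`
/// into the paragraph that follows them.
fn split_paragraphs(text: &str, min_chars: usize) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    let mut carry = String::new(); // short text waiting to be merged forward
    for raw in text.split("\n\n") {
        let para = collapse_whitespace(raw);
        if para.is_empty() {
            continue;
        }
        if !carry.is_empty() {
            carry.push(' ');
        }
        carry.push_str(&para);
        if carry.chars().count() >= min_chars {
            out.push(std::mem::take(&mut carry));
        }
    }
    // A short trailing remainder joins the last paragraph if there is one.
    if !carry.is_empty() {
        match out.last_mut() {
            Some(last) => {
                last.push(' ');
                last.push_str(&carry);
            }
            None => out.push(carry),
        }
    }
    out
}
```

Merging forward keeps short fragments such as headings attached to the text they introduce instead of emitting them as tiny standalone paragraphs.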
[chunk.metadata] decides which file-level metadata fields get copied into each chunk record:
- source path
- Calibre id
- title
- authors
- published date
- language
[insert] controls ingestion throughput and retry behavior:
- batch_size
- retry_max
- retry_backoff_ms
- max_parallel_files
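One plausible reading of retry_max and retry_backoff_ms is sketched below: an operation is attempted once, then retried up to `retry_max` times with a growing delay. The exponential schedule and the `retry_with_backoff` helper are assumptions; the actual insert code may use a fixed or linear delay.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Run `op` until it succeeds or `retry_max` retries are exhausted.
/// `op` receives the zero-based attempt number.
fn retry_with_backoff<T, E, F>(retry_max: u32, retry_backoff_ms: u64, mut op: F) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller.
            Err(err) if attempt >= retry_max => return Err(err),
            Err(_) => {
                // Assumed schedule: base delay doubled on each retry,
                // with the shift capped to avoid overflow.
                let delay = retry_backoff_ms.saturating_mul(1u64 << attempt.min(16));
                sleep(Duration::from_millis(delay));
                attempt += 1;
            }
        }
    }
}
```

Under this reading, `retry_max = 3` allows four attempts in total, and a base of 250 ms would wait 250, 500, and 1000 ms between them.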
[insert.qdrant] defines the vector collection target, including url, collection, distance, vector_size, and optional api_key.
[insert.quickwit] defines the text index target, including url, index_id, and commit behavior.
[insert.embeddings] defines the embedding provider contract:
- provider type
- base URL
- model name
- request timeout
- concurrency limits
- input truncation limits
- request batch sizing
- cache size
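Input truncation is easy to get wrong with multi-byte text, so a short sketch may help. It assumes the limit is expressed in characters (the real setting could count bytes or tokens instead), and `truncate_chars` is a hypothetical helper.

```rust
/// Return at most `max_chars` characters of `text`, cutting only on a
/// character boundary so the result is always valid UTF-8.
fn truncate_chars(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        // `byte_idx` is where the first excluded character starts.
        Some((byte_idx, _)) => &text[..byte_idx],
        // Fewer than `max_chars + 1` characters: nothing to cut.
        None => text,
    }
}
```

Slicing by raw byte count instead could panic in the middle of a character such as `é`, which is why the cut position comes from `char_indices`.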
[dups] controls duplicate scanning defaults:
- output format
- allowed extensions
- symlink handling
- thread count
- minimum file size
- sidecar inclusion
- hash algorithm
[dup_stats] controls the default output mode for duplicate summaries.
[dedup] controls destructive dedup defaults:
- min_size
- dry_run
[calibre] and related sections define how duplicate cleanup talks to Calibre and how it scores record quality.
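Scoring by metadata completeness can be pictured as a weighted count of populated fields, followed by keeping the highest scorer. Everything specific in this sketch is an assumption: the `Candidate` fields, the weights, and the tie-break toward the lower Calibre id are illustrative, and the real weights live in the [calibre.scoring] config.

```rust
use std::cmp::Reverse;

/// Hypothetical view of one duplicate record's Calibre metadata.
#[derive(Default)]
struct Candidate {
    calibre_id: u64,
    title: Option<String>,
    authors: Vec<String>,
    language: Option<String>,
    published: Option<String>,
    identifiers: Vec<String>,
}

/// Sum assumed weights for each populated field.
fn completeness_score(c: &Candidate) -> u32 {
    let has = |s: &Option<String>| s.as_deref().map_or(false, |v| !v.trim().is_empty());
    let mut score = 0;
    if has(&c.title) { score += 3; }
    if !c.authors.is_empty() { score += 3; }
    if !c.identifiers.is_empty() { score += 2; }
    if has(&c.language) { score += 1; }
    if has(&c.published) { score += 1; }
    score
}

/// Pick the id to keep and the ids to remove. Ties go to the lower id.
fn plan_dedup(candidates: &[Candidate]) -> Option<(u64, Vec<u64>)> {
    let keeper = candidates
        .iter()
        .max_by_key(|c| (completeness_score(c), Reverse(c.calibre_id)))?;
    let remove = candidates
        .iter()
        .map(|c| c.calibre_id)
        .filter(|id| *id != keeper.calibre_id)
        .collect();
    Some((keeper.calibre_id, remove))
}
```

Returning a plan rather than deleting anything directly fits a dry-run mode naturally: the same plan can be printed for review or executed.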
Top-level layout:
.
├── Cargo.toml
├── Cargo.lock
├── README.md
├── LICENSE
├── config.toml
├── test.toml
├── justfile
├── docs/
├── examples/
├── src/
├── tests/
└── tmp/
Rust crate source:
- main.rs: CLI entrypoint, argument parsing, config loading, and subcommand dispatch.
- lib.rs: public module declarations for the crate.
- config.rs: TOML config model and defaults for duplicate-related sections.
- extract.rs: Calibre library walk, EPUB/PDF extraction, metadata sidecars, PDF quality handling, and output path layout logic.
- chunk.rs: text normalization, paragraph splitting, overlap handling, and JSONL chunk emission.
- insert.rs: embedding requests, batching, retries, cache, Qdrant insertion, and Quickwit insertion/commit flow.
- dups.rs: duplicate file scan, hashing, grouping, and report emission.
- dup_stats.rs: duplicate-report summarization and human/machine output.
- dedup.rs: duplicate-removal workflow driven by Calibre metadata quality scoring.
- calibre_metadata.rs: metadata normalization and scoring helpers used by dedup.
- logging.rs: tracing subscriber setup and colored log prefixes for concurrent insert operations.
- util.rs: small shared helpers such as slugification, layout rendering, and extension replacement.
tests/pipeline.rs: an ignored integration-style test for chunking and insertion. It stages example text files, resets the Qdrant/Quickwit targets, runs chunk and insert in process, then verifies the indexed results.
This test is useful, but it is not self-contained: it assumes live external services.
Example corpora used for testing and experimentation:
- examples/cfr/: a large set of plain text CFR samples used by the ignored pipeline test.
- examples/pdf/: representative PDF fixtures covering clean text, low-quality text, and scan-like cases.
Project notes and reference material:
- docs/chunkr/: implementation summaries and planning notes for this project.
- docs/reference/: broader project references, release notes, tooling notes, migration notes, and AI workflow documentation.
These docs are helpful for maintainers, but the crate does not appear to generate end-user docs from them directly.
tmp/: scratch material and local operational notes, such as Docker Compose snippets and experimental extraction scripts. This directory looks like local working support rather than a stable product surface.
The repository includes a justfile for common tasks.
Useful targets:
- just build
- just fmt
- just fmt-check
- just validate
- just clippy
- just test
- just doc
- just ci
Examples:
just build
just fmt
just ci
Basic test command:
cargo test
Important caveat:
- tests/pipeline.rs is marked #[ignore].
- It expects external services such as Qdrant, Quickwit, and an embeddings endpoint.
- It uses test.toml plus runtime overrides for temporary directories and target names.
So the automated test story is currently split between local crate tests and a service-backed integration test.
- config.toml contains machine-specific absolute paths. Treat it as an example to adapt, not a portable default.
- The PDF extraction stack has the highest environment complexity.
- insert assumes downstream schemas and endpoints are already compatible with the emitted chunk records.
- dedup can remove books from Calibre. Dry-run first and verify the scoring behavior against your library before running live.
As the code stands today, Chunkr is best understood as a focused ingestion utility for Calibre-centered libraries rather than a general document ETL framework. It already has a useful split between extraction, chunk preparation, indexing, duplicate inspection, and duplicate cleanup, and the repository layout reflects that separation clearly.