Document Ingestion & AI Search Pipeline
© 2026 Jinan Kordab. All rights reserved. DeepPipe and its source code are copyrighted works of Jinan Kordab. The engine is distributed as a licensed npm package (MIT). See Licensing.
DeepPipe is a self-contained system for ingesting documents, performing full-text search, and holding grounded, cited conversations about their contents. Upload a PDF, Office file, web page, or email; DeepPipe extracts the text with its own pure-TypeScript parsers, indexes it into a local SQLite full-text index, and answers both keyword searches and natural-language questions — every answer traceable back to the source passage.
This repository is the application layer: a zero-framework web client
(public/) and a dependency-free Node HTTP server (server/). All of the heavy
lifting — extraction, indexing, search, and retrieval — is performed by the
DeepPipe engine, which is published independently on npm.
The core engine was authored as a standalone, reusable library and is published to the npm registry. You can build your own applications on top of it without this repository:
@kordabjinan/deeppipe: the pure-TypeScript ingestion, search, and RAG engine.

    npm install @kordabjinan/deeppipe

    import { openPipeline } from '@kordabjinan/deeppipe';

    // Open the pipeline backed by a local SQLite file.
    const pipe = openPipeline({ location: './data/deeppipe.db' }).value;

    // Extract and index a document.
    await pipe.ingestFile('./contract.pdf');

    // Keyword search with highlighted snippets.
    const results = pipe.search('payment terms', { snippets: true });

    // Build a grounded context for a natural-language question.
    const context = pipe.chatContext('What are the payment terms?');

This app depends on that exact published package (see package.json), so running it here is also a live integration test of the engine.
| | Typical RAG stack | DeepPipe |
|---|---|---|
| Document parsing | External services/libs (Tika, pdf.js, mammoth) | In-house, pure-TypeScript parsers for PDF, Office (OOXML + legacy CFB), HTML, MIME/email, ZIP |
| Retrieval | Vector DB + embedding model | SQLite FTS5 + BM25 lexical ranking |
| Infrastructure | Vector DB, message queue, Python services | One Node process, one SQLite file |
| Dependencies | Heavy, often native, frequently breaking | Four small pure-JS deps |
| Determinism | Embeddings drift across model versions | Same input → same index → same results |
| Cost | Embedding API calls per document | Zero retrieval cost; LLM optional |
| Privacy | Documents often leave the machine | Everything stays local except the optional LLM call |
Highlights
- No external parsing dependencies. The PDF filter alone implements object scanning, stream decoding (FlateDecode/LZW/ASCII85/ASCIIHex with predictors), decryption (RC4, AESV2, AESV3), page-tree walking, content-stream operator interpretation, and ToUnicode CMaps — all from scratch.
- Never-throw Result<T, E> contract. A single malformed document cannot crash a batch; failures are values, not exceptions.
- Injection-safe search. Every user term is lowered into a quoted FTS5 string literal, so search syntax can never be smuggled into the engine.
- Self-contained & portable. No database server, no queue, no cloud dependency other than the optional LLM endpoint.
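
The never-throw contract from the list above can be illustrated with a minimal sketch. The field names (`ok`, `value`, `error`) and the helpers `attempt`, `parseDoc`, and `ingestBatch` are assumptions made for illustration; the engine's actual Result type may be shaped differently.

```typescript
// Illustrative Result type; the engine's real definition may differ.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

// Wrap a throwing function so that failures come back as values.
function attempt<T>(fn: () => T): Result<T, Error> {
  try {
    return { ok: true, value: fn() };
  } catch (e) {
    return { ok: false, error: e instanceof Error ? e : new Error(String(e)) };
  }
}

// Hypothetical parser that throws on malformed input.
function parseDoc(raw: string): { title: string } {
  const doc = JSON.parse(raw) as { title?: unknown };
  if (typeof doc.title !== "string") throw new Error("missing title");
  return { title: doc.title };
}

// One bad document yields a failure value; the rest of the batch continues.
function ingestBatch(raws: string[]): Result<{ title: string }, Error>[] {
  return raws.map((raw) => attempt(() => parseDoc(raw)));
}

const batchResults = ingestBatch(['{"title":"Contract"}', "{not json", '{"title":"Invoice"}']);
const succeeded = batchResults.filter((r) => r.ok).length;
console.log(`${succeeded} of ${batchResults.length} documents ingested`);
```

Because every outcome is a value, the caller decides how to report the malformed middle document instead of unwinding the whole batch.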
DeepPipe deliberately uses lexical retrieval (SQLite FTS5 with BM25 ranking) and does not use FAISS, vector embeddings, or any vector database. This is a considered engineering decision, not an oversight:
- No model dependency or drift. Embedding-based retrieval ties your index to a specific embedding model. Upgrading or changing the model means re-embedding the entire corpus, and results subtly change. BM25 over stemmed terms is deterministic and reproducible — the same document always indexes the same way.
- No extra infrastructure. FAISS / vector search needs an embedding service and a vector store (Pinecone, Weaviate, Chroma, pgvector, Qdrant, …). DeepPipe needs one SQLite file. It installs and runs anywhere Node runs.
- No per-document cost or latency. There are no embedding API calls at ingest time and no vector math at query time — retrieval is sub-millisecond on modest corpora.
- Transparency and privacy. Lexical matches are explainable (you can see exactly which terms matched and where), and documents never have to be sent to an embedding API.
The trade-off we accept: pure lexical retrieval does not capture semantic similarity between words that share no surface form (e.g. "car" vs. "automobile"). For DeepPipe's target — searching and chatting over your own document library, where the vocabulary of the query and the documents largely overlaps — BM25 is fast, cheap, private, and accurate. Semantic / vector retrieval can be layered on later if a use case demands it, but it is intentionally not a default dependency.
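
To make the determinism claim and the "car" vs. "automobile" trade-off concrete, here is a toy BM25 scorer. It is a sketch under stated assumptions, not the engine's code: the tokenizer is a plain lowercase split with no stemming, the IDF uses one common non-negative variant, and `k1 = 1.2`, `b = 0.75` are the constants documented for SQLite FTS5's built-in `bm25()`.

```typescript
// Toy BM25 over lowercase word tokens; no stemming, unlike a real FTS5 index.
const K1 = 1.2;
const B = 0.75;

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

function bm25Scores(docs: string[], query: string): number[] {
  const tokenized = docs.map(tokenize);
  const avgLen = tokenized.reduce((sum, d) => sum + d.length, 0) / docs.length;
  const terms = tokenize(query);
  return tokenized.map((doc) => {
    let score = 0;
    for (const term of terms) {
      const tf = doc.filter((t) => t === term).length; // term frequency in this doc
      if (tf === 0) continue;
      const df = tokenized.filter((d) => d.includes(term)).length; // docs containing term
      const idf = Math.log((docs.length - df + 0.5) / (df + 0.5) + 1);
      score += (idf * (tf * (K1 + 1))) / (tf + K1 * (1 - B + B * (doc.length / avgLen)));
    }
    return score;
  });
}

const corpus = [
  "The car was parked outside the office.",
  "An automobile was parked outside the office.",
];
const carScores = bm25Scores(corpus, "car");
console.log(carScores); // the second document scores 0: it shares no surface form with "car"
```

Running the function twice on the same corpus returns identical numbers, which is the reproducibility property the bullets describe; the zero for the second document is the lexical gap the paragraph above accepts.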
DeepPipe runs as a single process that serves both the REST API and the web
client, delegating all engine work to @kordabjinan/deeppipe.
┌──────────────────────────────────────────────────────────────────────────┐
│ Browser (public/) │
│ index.html · styles.css · app.js (vanilla ES module SPA) │
│ Views: Search · Upload · Chat · Library │
└───────────────┬───────────────────────────────────────────────┬──────────┘
│ REST (JSON) │ SSE (chat)
▼ ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ HTTP server (server/server.mjs) │
│ Static files · /api/* routes · upload cap · .env loader · LLM proxy │
└───────────────┬───────────────────────────────────────────────────────────┘
│ in-process calls (Result<T,E>)
▼
╔════════════════════════════════════════════════════════════════════════════╗
║ @kordabjinan/deeppipe (the engine — installed from npm) ║
║ ║
║ ┌────────────────────────────────────────────────────────────────────┐ ║
║ │ Pipeline orchestrator (façade) │ ║
║ │ ingestBytes · search · chatContext · listDocuments · removeDoc… │ ║
║ └───┬──────────────────────┬───────────────────────────┬─────────────┘ ║
║ │ extract │ index / query │ retrieve ║
║ ▼ ▼ ▼ ║
║ ┌─────────────┐ ┌──────────────────────┐ ┌──────────────────────┐ ║
║ │ Extraction │ │ Search layer │ │ Chat / RAG │ ║
║ │ filters │ │ writer · parser · │ │ context-builder │ ║
║ │ storage │ │ planner · executor │ │ (passage selection) │ ║
║ │ text·runtime│ │ index-store (SQLite)│ └──────────────────────┘ ║
║ └─────────────┘ └──────────┬───────────┘ ║
║ ▼ ║
║ ┌──────────────────────┐ ║
║ │ data/deeppipe.db │ ║
║ │ SQLite + FTS5 (WAL) │ ║
║ └──────────────────────┘ ║
╚════════════════════════════════════════════════════════════════════════════╝
| Layer | Responsibility |
|---|---|
| Extraction (filters, storage, runtime) | Detects format by magic bytes; parses PDF / OOXML / CFB / HTML / MIME / ZIP in pure TypeScript; recovers text + metadata. |
| Text processing (text) | Charset decoding (BOM + iconv-lite), language detection (franc), Unicode word-breaking and Porter stemming that matches the FTS5 tokenizer. |
| Search (search) | SHA-256 dedup, paragraph-aware chunking, recursive-descent query parser → FTS5 MATCH plan → BM25-ranked, snippet-highlighted results. |
| Chat / RAG (chat) | Retrieves the most relevant passages and assembles a grounded, cited context for any OpenAI-compatible LLM. |
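
Format detection by magic bytes, as listed for the extraction layer, can be sketched as a signature table checked against the first bytes of the input. The `Format` type and `detectFormat` function are hypothetical names; only the byte signatures themselves are standard. An OOXML file is a ZIP container, so telling it apart from a plain ZIP needs a further look inside the archive, which this sketch does not attempt.

```typescript
type Format = "pdf" | "zip" | "cfb" | "unknown";

// Well-known leading byte signatures.
const SIGNATURES: { format: Format; bytes: number[] }[] = [
  { format: "pdf", bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
  { format: "zip", bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04" local file header
  { format: "cfb", bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] }, // legacy Office container
];

// Return the first format whose signature prefixes the data.
function detectFormat(data: Uint8Array): Format {
  for (const { format, bytes } of SIGNATURES) {
    if (data.length >= bytes.length && bytes.every((b, i) => data[i] === b)) {
      return format;
    }
  }
  return "unknown";
}

const pdfHeader = Uint8Array.from([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]); // "%PDF-1.7"
console.log(detectFormat(pdfHeader)); // "pdf"
console.log(detectFormat(Uint8Array.from([0x68, 0x69]))); // "unknown"
```

Checking content rather than the file extension means a mislabeled upload is still routed to the right parser.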
Upload bytes
│
▼
┌─────────────────┐ magic bytes → structure → extension
│ Format detect │
└────────┬────────┘
▼
┌─────────────────┐ PDF · OOXML · CFB · HTML · MIME · ZIP · text
│ Format filter │ (pure-TS parser dispatch)
└────────┬────────┘
▼
┌─────────────────┐ bytes → Unicode (BOM/charset) → language detect
│ Text normalize │
└────────┬────────┘
▼
┌─────────────────┐ SHA-256 hash → dedup? → paragraph-aware chunking
│ Index writer │
└────────┬────────┘
▼
┌─────────────────┐ tokenize + Porter stem → FTS5 rows (BM25-ready)
│ SQLite + FTS5 │ persisted to data/deeppipe.db (WAL)
└─────────────────┘
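
The index-writer step in the diagram above (hash, dedup check, paragraph-aware chunking) could look roughly like the following. This is a sketch under assumptions: the function names, the in-memory `seen` set, the greedy packing strategy, and the character budget are all illustrative, and the text is assumed to be UTF-8 rather than going through charset detection.

```typescript
import { createHash } from "node:crypto";

// Hex SHA-256 of the raw bytes, used as a content fingerprint.
function fingerprint(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Greedy packing: split on blank lines, then pack whole paragraphs up to maxChars.
// A single paragraph longer than maxChars becomes its own oversized chunk.
function chunkByParagraph(text: string, maxChars: number): string[] {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    if (current && current.length + 2 + p.length > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current ? `${current}\n\n${p}` : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// In-memory stand-in for the dedup lookup a real index would do.
const seen = new Set<string>();

// Returns the chunks to index, or null when the same bytes were already ingested.
function ingest(bytes: Uint8Array, maxChars = 40): string[] | null {
  const hash = fingerprint(bytes);
  if (seen.has(hash)) return null;
  seen.add(hash);
  return chunkByParagraph(new TextDecoder().decode(bytes), maxChars);
}

const sample = new TextEncoder().encode(
  "First paragraph here.\n\nSecond one.\n\nThird paragraph is a bit longer.",
);
console.log(ingest(sample)); // two chunks; paragraphs are never split mid-way
console.log(ingest(sample)); // null: identical bytes are skipped
```

Hashing the raw bytes before any parsing keeps the duplicate check cheap, and packing whole paragraphs keeps each indexed passage readable when it is later quoted as a citation.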
Query string ─▶ parse (AST: terms, "phrases", prefix*, field:value, AND/OR/NOT)
─▶ plan (quoted FTS5 MATCH — injection-safe)
─▶ FTS5 BM25 search ─▶ hydrate rows ─▶ snippet highlight ─▶ paginate
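
The quoting part of the plan step can be sketched in a few lines. FTS5 string literals are wrapped in double quotes, with any embedded double quote escaped by doubling it. The functions `quoteTerm` and `toMatchExpression` are hypothetical names, and this sketch covers only the quoting: it treats every whitespace-separated token as a literal term, whereas the parser described above also recognizes phrases, prefixes, fields, and boolean operators.

```typescript
// Quote one user term as an FTS5 string literal:
// wrap it in double quotes and double any embedded double quote.
function quoteTerm(term: string): string {
  return `"${term.replace(/"/g, '""')}"`;
}

// Naive lowering: every whitespace-separated token becomes a quoted literal.
// FTS5 treats adjacent phrases as an implicit AND.
function toMatchExpression(userInput: string): string {
  return userInput.split(/\s+/).filter(Boolean).map(quoteTerm).join(" ");
}

const hostile = 'payment OR "terms* NEAR(';
console.log(toMatchExpression(hostile));
// "payment" "OR" """terms*" "NEAR("
```

Once quoted, `OR` and `NEAR(` are matched as ordinary text rather than interpreted as FTS5 syntax, which is the property the plan step relies on.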
Question ─▶ retrieve top passages (FTS5 / BM25)
─▶ build grounded, cited context
─▶ OpenAI-compatible LLM (SSE stream)
─▶ answer with source citations
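
The "build grounded, cited context" step can be pictured as numbering the retrieved passages and instructing the model to cite those numbers. The sketch below is an assumption-laden illustration: the `Passage` shape, the `buildContext` name, the prompt wording, and the convention that a higher `score` means a better match are all hypothetical and not taken from the engine.

```typescript
// Hypothetical passage shape; higher score is assumed to mean a better match.
interface Passage {
  docTitle: string;
  text: string;
  score: number;
}

// Number the top passages so the model can cite them as [1], [2], ...
function buildContext(question: string, passages: Passage[], maxPassages = 3): string {
  const top = [...passages].sort((a, b) => b.score - a.score).slice(0, maxPassages);
  const sources = top.map((p, i) => `[${i + 1}] (${p.docTitle}) ${p.text}`).join("\n");
  return [
    "Answer using only the numbered sources below and cite them like [1].",
    "If the sources do not contain the answer, say so.",
    "",
    sources,
    "",
    `Question: ${question}`,
  ].join("\n");
}

const prompt = buildContext("What are the payment terms?", [
  { docTitle: "contract.pdf", text: "Invoices are due within 30 days.", score: 3.2 },
  { docTitle: "faq.html", text: "Late payments incur a 2% fee.", score: 1.7 },
]);
console.log(prompt);
```

Because each source carries a stable number and document title, a citation in the model's answer can be traced back to the passage it came from.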
.
├── public/ # Web client (index.html, app.js, styles.css)
├── server/ # Node HTTP server (REST API + static files)
├── scripts/ # smoke / pdf / chat test scripts
├── quiet-dotenv.mjs # silences a transitive dotenv startup banner
├── package.json # depends on @kordabjinan/deeppipe
└── .env.example # chat (LLM) configuration template
This pulls the engine (@kordabjinan/deeppipe) from npm:

    cd GITHUB
    npm install

Search and ingestion work without any configuration. Only the chat feature needs an LLM:

    # PowerShell; on macOS or Linux use: cp .env.example .env
    Copy-Item .env.example .env
    # then edit .env to set your OpenAI-compatible endpoint, key, and model

The fastest way to confirm the published engine works end to end is the smoke test, which ingests several formats and runs queries against an in-memory index:

    npm run smoke

Expect a series of ok lines and a zero exit code. Then start the server:

    npm run start

Open http://localhost:4173, then upload a document, search it, and (if configured) chat with it. The index is persisted to data/deeppipe.db, so uploads survive restarts.
Every engine call flows through the published npm package, e.g. in server/server.mjs:

    import { openPipeline, extractiveAnswer } from '@kordabjinan/deeppipe';

When a new version of the engine is published, bump the dependency in package.json and run npm install again to test it here.
- Copyright © 2026 Jinan Kordab. DeepPipe, including this application and the @kordabjinan/deeppipe engine, is the original work of Jinan Kordab.
- The engine package is distributed under the MIT License.
- Every source file carries the copyright header "DeepPipe — Document Ingestion & AI Search Pipeline - Copyright © Jinan Kordab 2026."
Authored by Jinan Kordab.