DeepPipe

Document Ingestion & AI Search Pipeline

© 2026 Jinan Kordab. All rights reserved. DeepPipe and its source code are copyrighted works of Jinan Kordab. The engine is distributed as a licensed npm package (MIT). See Licensing.

Overview

DeepPipe is a self-contained system for ingesting documents, performing full-text search, and holding grounded, cited conversations about their contents. Upload a PDF, Office file, web page, or email; DeepPipe extracts the text with its own pure-TypeScript parsers, indexes it into a local SQLite full-text index, and answers both keyword searches and natural-language questions — every answer traceable back to the source passage.

This repository is the application layer: a zero-framework web client (public/) and a dependency-free Node HTTP server (server/). All of the heavy lifting — extraction, indexing, search, and retrieval — is performed by the DeepPipe engine, which is published independently on npm.

The engine is available on npm

The core engine was authored as a standalone, reusable library and is published to the npm registry. You can build your own applications on top of it without this repository:

@kordabjinan/deeppipe — the pure-TypeScript ingestion, search, and RAG engine.

npm install @kordabjinan/deeppipe

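import { openPipeline } from '@kordabjinan/deeppipe';

// openPipeline returns a Result; .value holds the pipeline when opening succeeded.
const pipe = openPipeline({ location: './data/deeppipe.db' }).value;
await pipe.ingestFile('./contract.pdf'); // extract the text and index it

// Keyword search with highlighted snippets.
const results = pipe.search('payment terms', { snippets: true });
// Retrieved passages assembled as grounded context for an LLM.
const context = pipe.chatContext('What are the payment terms?');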

This app depends on that exact published package — see package.json — so running it here is also a live integration test of the engine.

What makes DeepPipe different

| Aspect | Typical RAG stack | DeepPipe |
| --- | --- | --- |
| Document parsing | External services/libs (Tika, `pdf.js`, `mammoth`) | In-house, pure-TypeScript parsers for PDF, Office (OOXML + legacy CFB), HTML, MIME/email, ZIP |
| Retrieval | Vector DB + embedding model | SQLite FTS5 + BM25 lexical ranking |
| Infrastructure | Vector DB, message queue, Python services | One Node process, one SQLite file |
| Dependencies | Heavy, often native, frequently breaking | Four small pure-JS deps |
| Determinism | Embeddings drift across model versions | Same input → same index → same results |
| Cost | Embedding API calls per document | Zero retrieval cost; LLM optional |
| Privacy | Documents often leave the machine | Everything stays local except the optional LLM call |

Highlights

No external parsing dependencies. The PDF filter alone implements object scanning, stream decoding (FlateDecode/LZW/ASCII85/ASCIIHex with predictors), decryption (RC4, AESV2, AESV3), page-tree walking, content-stream operator interpretation, and ToUnicode CMaps — all from scratch.
Never-throw Result<T, E> contract. A single malformed document cannot crash a batch; failures are values, not exceptions.
Injection-safe search. Every user term is lowered into a quoted FTS5 string literal, so search syntax can never be smuggled into the engine.
Self-contained & portable. No database server, no queue, no cloud dependency other than the optional LLM endpoint.
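
The never-throw contract can be pictured as a small discriminated union. The following is a minimal sketch, not the engine's actual type definition: `Result`, `ok`, `err`, `parseDocument`, and `ingestBatch` are hypothetical names chosen for illustration.

```typescript
// Hypothetical Result type; the engine's real definition may differ.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Hypothetical parser: reports a failure as a value instead of throwing.
function parseDocument(raw: string): Result<string, string> {
  if (raw.trim().length === 0) return err("empty document");
  return ok(raw.trim());
}

// One bad document is recorded as a failure; the rest of the batch still runs.
function ingestBatch(docs: string[]): { texts: string[]; failures: string[] } {
  const texts: string[] = [];
  const failures: string[] = [];
  for (const doc of docs) {
    const result = parseDocument(doc);
    if (result.ok) texts.push(result.value);
    else failures.push(result.error);
  }
  return { texts, failures };
}
```

Because the caller must inspect `ok` before touching `value`, the failure path is visible in the types rather than hidden in a `try`/`catch`.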

Why no FAISS or vector embeddings

DeepPipe deliberately uses lexical retrieval (SQLite FTS5 with BM25 ranking) and does not use FAISS, vector embeddings, or any vector database. This is a considered engineering decision, not an oversight:

No model dependency or drift. Embedding-based retrieval ties your index to a specific embedding model. Upgrading or changing the model means re-embedding the entire corpus, and results subtly change. BM25 over stemmed terms is deterministic and reproducible — the same document always indexes the same way.
No extra infrastructure. FAISS / vector search needs an embedding service and a vector store (Pinecone, Weaviate, Chroma, pgvector, Qdrant, …). DeepPipe needs one SQLite file. It installs and runs anywhere Node runs.
No per-document cost or latency. There are no embedding API calls at ingest time and no vector math at query time — retrieval is sub-millisecond on modest corpora.
Transparency and privacy. Lexical matches are explainable (you can see exactly which terms matched and where), and documents never have to be sent to an embedding API.

The trade-off we accept: pure lexical retrieval does not capture semantic similarity between words that share no surface form (e.g. "car" vs. "automobile"). For DeepPipe's target — searching and chatting over your own document library, where the vocabulary of the query and the documents largely overlaps — BM25 is fast, cheap, private, and accurate. Semantic / vector retrieval can be layered on later if a use case demands it, but it is intentionally not a default dependency.
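
To make the trade-off concrete, here is a small, self-contained BM25 scorer. It is an illustrative sketch using the conventional `k1 = 1.2` and `b = 0.75` parameters and a naive tokenizer with no stemming; it is not the engine's code or SQLite's implementation, and `tokenize` and `bm25Scores` are hypothetical names.

```typescript
// Naive tokenizer: lowercase, split on non-alphanumerics (no stemming).
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Score every document against the query with the standard BM25 formula.
function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const totalLen = tokenized.reduce((sum, d) => sum + d.length, 0);
  const avgLen = totalLen / Math.max(docs.length, 1);
  // Distinct query terms, so a repeated query word is counted once.
  const terms = tokenize(query).filter((t, i, all) => all.indexOf(t) === i);
  return tokenized.map((doc) => {
    let score = 0;
    for (const term of terms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue; // no surface match, no contribution
      const df = tokenized.filter((d) => d.indexOf(term) !== -1).length;
      const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5));
      const lengthNorm = 1 - b + (b * doc.length) / avgLen;
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    }
    return score;
  });
}

const corpus = [
  "The car was parked outside.",
  "The automobile was parked outside.",
];
const scores = bm25Scores("car", corpus);
// scores[0] > 0 (exact term match); scores[1] === 0 ("automobile" shares no surface form)
```

The same inputs always yield the same scores, which is the determinism argument above; the zero score for the second document is the semantic gap that lexical retrieval accepts.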

Architecture

DeepPipe runs as a single process that serves both the REST API and the web client, delegating all engine work to @kordabjinan/deeppipe.

┌──────────────────────────────────────────────────────────────────────────┐
│                          Browser (public/)                                 │
│  index.html  ·  styles.css  ·  app.js   (vanilla ES module SPA)            │
│  Views: Search · Upload · Chat · Library                                   │
└───────────────┬───────────────────────────────────────────────┬──────────┘
                │ REST (JSON)                                     │ SSE (chat)
                ▼                                                 ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                      HTTP server (server/server.mjs)                       │
│  Static files · /api/* routes · upload cap · .env loader · LLM proxy       │
└───────────────┬───────────────────────────────────────────────────────────┘
                │ in-process calls (Result<T,E>)
                ▼
╔════════════════════════════════════════════════════════════════════════════╗
║              @kordabjinan/deeppipe   (the engine — installed from npm)        ║
║                                                                              ║
║  ┌────────────────────────────────────────────────────────────────────┐     ║
║  │               Pipeline orchestrator (façade)                        │     ║
║  │  ingestBytes · search · chatContext · listDocuments · removeDoc…    │     ║
║  └───┬──────────────────────┬───────────────────────────┬─────────────┘     ║
║      │ extract              │ index / query             │ retrieve          ║
║      ▼                      ▼                           ▼                    ║
║  ┌─────────────┐   ┌──────────────────────┐   ┌──────────────────────┐      ║
║  │ Extraction  │   │ Search layer         │   │ Chat / RAG           │      ║
║  │ filters     │   │  writer · parser ·   │   │  context-builder     │      ║
║  │ storage     │   │  planner · executor  │   │  (passage selection) │      ║
║  │ text·runtime│   │  index-store (SQLite)│   └──────────────────────┘      ║
║  └─────────────┘   └──────────┬───────────┘                                 ║
║                               ▼                                             ║
║                     ┌──────────────────────┐                               ║
║                     │ data/deeppipe.db     │                               ║
║                     │ SQLite + FTS5 (WAL)  │                               ║
║                     └──────────────────────┘                               ║
╚════════════════════════════════════════════════════════════════════════════╝

Engine layers

| Layer | Responsibility |
| --- | --- |
| Extraction (`filters`, `storage`, `runtime`) | Detects format by magic bytes; parses PDF / OOXML / CFB / HTML / MIME / ZIP in pure TypeScript; recovers text + metadata. |
| Text processing (`text`) | Charset decoding (BOM + `iconv-lite`), language detection (`franc`), Unicode word-breaking and Porter stemming that matches the FTS5 tokenizer. |
| Search (`search`) | SHA-256 dedup, paragraph-aware chunking, recursive-descent query parser → FTS5 `MATCH` plan → BM25-ranked, snippet-highlighted results. |
| Chat / RAG (`chat`) | Retrieves the most relevant passages and assembles a grounded, cited context for any OpenAI-compatible LLM. |
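
As an illustration of magic-byte detection in the extraction layer, the sketch below checks a buffer against three well-known file signatures. The `DetectedFormat` type and `detectFormat` function are hypothetical and do not reflect the engine's internal API. OOXML files are ZIP containers, so a real detector has to look inside the archive to tell them apart.

```typescript
type DetectedFormat = "pdf" | "zip-or-ooxml" | "cfb" | "unknown";

// Well-known leading byte signatures for each container format.
const SIGNATURES: { format: DetectedFormat; magic: number[] }[] = [
  { format: "pdf", magic: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { format: "zip-or-ooxml", magic: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04"
  { format: "cfb", magic: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] }, // legacy Office
];

// Return the first format whose signature matches the start of the buffer.
function detectFormat(bytes: Uint8Array): DetectedFormat {
  for (const sig of SIGNATURES) {
    const matches =
      bytes.length >= sig.magic.length &&
      sig.magic.every((b, i) => bytes[i] === b);
    if (matches) return sig.format;
  }
  return "unknown";
}
```

Checking content rather than the file extension means a renamed or extension-less upload is still routed to the right parser.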

Processing flow

Ingestion (upload → indexed)

 Upload bytes
     │
     ▼
 ┌─────────────────┐   magic bytes → structure → extension
 │ Format detect   │
 └────────┬────────┘
          ▼
 ┌─────────────────┐   PDF · OOXML · CFB · HTML · MIME · ZIP · text
 │ Format filter   │   (pure-TS parser dispatch)
 └────────┬────────┘
          ▼
 ┌─────────────────┐   bytes → Unicode (BOM/charset)  →  language detect
 │ Text normalize  │
 └────────┬────────┘
          ▼
 ┌─────────────────┐   SHA-256 hash → dedup?  →  paragraph-aware chunking
 │ Index writer    │
 └────────┬────────┘
          ▼
 ┌─────────────────┐   tokenize + Porter stem  →  FTS5 rows (BM25-ready)
 │ SQLite + FTS5   │   persisted to data/deeppipe.db (WAL)
 └─────────────────┘
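
The index-writer step (hash, dedup, chunk) could look roughly like the sketch below. It assumes a Node runtime for `node:crypto`. The names `contentHash`, `isDuplicate`, and `chunkParagraphs`, the in-memory `seen` set, and the `maxChars` limit are hypothetical; the engine's actual chunking rules are not documented here.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of the raw upload bytes.
function contentHash(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Hypothetical in-memory dedup set; a real index would persist the hashes.
const seen = new Set<string>();
function isDuplicate(bytes: Uint8Array): boolean {
  const hash = contentHash(bytes);
  if (seen.has(hash)) return true;
  seen.add(hash);
  return false;
}

// Pack whole paragraphs into chunks of at most maxChars characters.
// A single paragraph longer than maxChars becomes its own oversized chunk.
function chunkParagraphs(text: string, maxChars: number): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    if (current.length > 0 && current.length + 2 + p.length > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current.length > 0 ? current + "\n\n" + p : p;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Splitting only on paragraph boundaries keeps each indexed passage readable on its own, which matters later when passages are quoted back as citations.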

Query (search → ranked hits)

 Query string ─▶ parse (AST: terms, "phrases", prefix*, field:value, AND/OR/NOT)
              ─▶ plan  (quoted FTS5 MATCH — injection-safe)
              ─▶ FTS5 BM25 search ─▶ hydrate rows ─▶ snippet highlight ─▶ paginate
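
The quoting idea behind the plan step can be sketched as follows. FTS5 string literals are wrapped in double quotes, and an embedded double quote is escaped by doubling it. `quoteFts5Term` and `planMatch` are hypothetical helpers. Unlike the engine's parser, this simplified version does not recognize operators, phrases, or prefixes; it treats every whitespace-separated token as a literal term.

```typescript
// Quote one user term as an FTS5 string literal; embedded quotes are doubled.
function quoteFts5Term(term: string): string {
  return '"' + term.replace(/"/g, '""') + '"';
}

// Hypothetical planner: quote each token and require all of them (AND).
function planMatch(userInput: string): string {
  return userInput
    .split(/\s+/)
    .filter((t) => t.length > 0)
    .map(quoteFts5Term)
    .join(" AND ");
}

const plan = planMatch('foo" OR bar');
// plan is: "foo""" AND "OR" AND "bar"
```

Inside a quoted literal, `OR` is just another token to match, so user input cannot change the shape of the query. The resulting string would still be passed to SQLite as a bound parameter of `MATCH`, not concatenated into the SQL text.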

Chat (question → grounded answer)

 Question ─▶ retrieve top passages (FTS5 / BM25)
          ─▶ build grounded, cited context
          ─▶ OpenAI-compatible LLM (SSE stream)
          ─▶ answer with source citations
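
One plausible way to build a grounded, cited context is to number the retrieved passages and instruct the model to cite those numbers. The sketch below is an assumption about the general technique, not the output format of the engine's `chatContext`; the `Passage` shape and `buildGroundedPrompt` are hypothetical.

```typescript
// Hypothetical shape of a retrieved passage.
interface Passage {
  docTitle: string;
  text: string;
}

// Number each passage so the model can cite it as [n] in its answer.
function buildGroundedPrompt(question: string, passages: Passage[]): string {
  const sources = passages
    .map((p, i) => `[${i + 1}] (${p.docTitle}) ${p.text}`)
    .join("\n");
  return [
    "Answer using only the numbered sources below and cite them as [n].",
    "If the sources do not contain the answer, say so.",
    "",
    "Sources:",
    sources,
    "",
    `Question: ${question}`,
  ].join("\n");
}

const prompt = buildGroundedPrompt("What are the payment terms?", [
  { docTitle: "contract.pdf", text: "Payment is due within 30 days." },
  { docTitle: "addendum.pdf", text: "Late payments accrue interest." },
]);
```

Each `[n]` in the model's answer can then be mapped back to the passage and document it came from.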

Repository layout

.
├── public/            # Web client (index.html, app.js, styles.css)
├── server/            # Node HTTP server (REST API + static files)
├── scripts/           # smoke / pdf / chat test scripts
├── quiet-dotenv.mjs   # silences a transitive dotenv startup banner
├── package.json       # depends on @kordabjinan/deeppipe
└── .env.example       # chat (LLM) configuration template

Getting started

1. Install dependencies

This pulls the engine (@kordabjinan/deeppipe) from npm:

cd GITHUB
npm install

2. (Optional) Configure chat

Search and ingestion work without any configuration. Only the chat feature needs an LLM:

Copy-Item .env.example .env   # PowerShell; on macOS/Linux: cp .env.example .env
# then edit .env to set your OpenAI-compatible endpoint, key, and model

3. Run the headless smoke test

The fastest way to confirm the published engine works end-to-end — it ingests several formats and runs queries against an in-memory index:

npm run smoke

Expect a series of ok lines and a zero exit code.

4. Run the web app

npm run start

Open http://localhost:4173 — upload a document, search it, and (if configured) chat with it. The index is persisted to data/deeppipe.db, so uploads survive restarts.

How this app uses the engine

Every engine call flows through the published npm package, e.g. in server/server.mjs:

import { openPipeline, extractiveAnswer } from '@kordabjinan/deeppipe';

When a new version of the engine is published, bump the dependency in package.json and run npm install again to test it here.

📜 Licensing

Authored by Jinan Kordab.
