jinan-kordab/DeepPipe

DeepPipe

Document Ingestion & AI Search Pipeline

© 2026 Jinan Kordab. All rights reserved. DeepPipe and its source code are copyrighted works of Jinan Kordab. The engine is distributed as a licensed npm package (MIT). See Licensing.

Overview

DeepPipe is a self-contained system for ingesting documents, performing full-text search, and holding grounded, cited conversations about their contents. Upload a PDF, Office file, web page, or email; DeepPipe extracts the text with its own pure-TypeScript parsers, indexes it into a local SQLite full-text index, and answers both keyword searches and natural-language questions — every answer traceable back to the source passage.

This repository is the application layer: a zero-framework web client (public/) and a dependency-free Node HTTP server (server/). All of the heavy lifting — extraction, indexing, search, and retrieval — is performed by the DeepPipe engine, which is published independently on npm.

The engine is available on npm

The core engine was authored as a standalone, reusable library and is published to the npm registry. You can build your own applications on top of it without this repository:

@kordabjinan/deeppipe — the pure-TypeScript ingestion, search, and RAG engine.

npm install @kordabjinan/deeppipe

import { openPipeline } from '@kordabjinan/deeppipe';

const pipe = openPipeline({ location: './data/deeppipe.db' }).value;
await pipe.ingestFile('./contract.pdf');

const results = pipe.search('payment terms', { snippets: true });
const context = pipe.chatContext('What are the payment terms?');

This app depends on that exact published package — see package.json — so running it here is also a live integration test of the engine.


What makes DeepPipe different

| Aspect | Typical RAG stack | DeepPipe |
|---|---|---|
| Document parsing | External services/libs (Tika, pdf.js, mammoth) | In-house, pure-TypeScript parsers for PDF, Office (OOXML + legacy CFB), HTML, MIME/email, ZIP |
| Retrieval | Vector DB + embedding model | SQLite FTS5 + BM25 lexical ranking |
| Infrastructure | Vector DB, message queue, Python services | One Node process, one SQLite file |
| Dependencies | Heavy, often native, frequently breaking | Four small pure-JS deps |
| Determinism | Embeddings drift across model versions | Same input → same index → same results |
| Cost | Embedding API calls per document | Zero retrieval cost; LLM optional |
| Privacy | Documents often leave the machine | Everything stays local except the optional LLM call |

Highlights

  • No external parsing dependencies. The PDF filter alone implements object scanning, stream decoding (FlateDecode/LZW/ASCII85/ASCIIHex with predictors), decryption (RC4, AESV2, AESV3), page-tree walking, content-stream operator interpretation, and ToUnicode CMaps — all from scratch.
  • Never-throw Result<T, E> contract. A single malformed document cannot crash a batch; failures are values, not exceptions.
  • Injection-safe search. Every user term is lowered into a quoted FTS5 string literal, so search syntax can never be smuggled into the engine.
  • Self-contained & portable. No database server, no queue, no cloud dependency other than the optional LLM endpoint.
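
The never-throw contract can be illustrated with a small sketch. The `Result` shape below (an `ok` flag plus `value` or `error`) is an assumption made for illustration; the engine's real type may be structured differently, and `parseDocument` and `ingestBatch` are hypothetical stand-ins rather than engine functions.

```typescript
// Hypothetical Result shape; the engine's actual definition may differ.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

function ok<T>(value: T): Result<T, never> {
  return { ok: true, value };
}

function err<E>(error: E): Result<never, E> {
  return { ok: false, error };
}

// Stand-in parser: reports a failure as a value instead of throwing.
function parseDocument(bytes: Uint8Array): Result<{ length: number }, string> {
  if (bytes.length === 0) return err('empty document');
  return ok({ length: bytes.length });
}

// A batch keeps going when one document fails; failures are collected.
function ingestBatch(docs: Uint8Array[]): { indexed: number; failures: string[] } {
  let indexed = 0;
  const failures: string[] = [];
  for (const doc of docs) {
    const result = parseDocument(doc);
    if (result.ok) {
      indexed += 1;
    } else {
      failures.push(result.error);
    }
  }
  return { indexed, failures };
}

const summary = ingestBatch([
  new Uint8Array([1, 2, 3]),
  new Uint8Array(0), // malformed input: recorded, not thrown
  new Uint8Array([4]),
]);
console.log(summary);
```

Because the failure is an ordinary return value, the caller decides whether to log it, retry, or surface it to the user.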

Why no FAISS or vector embeddings

DeepPipe deliberately uses lexical retrieval (SQLite FTS5 with BM25 ranking) and does not use FAISS, vector embeddings, or any vector database. This is a considered engineering decision, not an oversight:

  • No model dependency or drift. Embedding-based retrieval ties your index to a specific embedding model. Upgrading or changing the model means re-embedding the entire corpus, and results subtly change. BM25 over stemmed terms is deterministic and reproducible — the same document always indexes the same way.
  • No extra infrastructure. FAISS / vector search needs an embedding service and a vector store (Pinecone, Weaviate, Chroma, pgvector, Qdrant, …). DeepPipe needs one SQLite file. It installs and runs anywhere Node runs.
  • No per-document cost or latency. There are no embedding API calls at ingest time and no vector math at query time — retrieval is sub-millisecond on modest corpora.
  • Transparency and privacy. Lexical matches are explainable (you can see exactly which terms matched and where), and documents never have to be sent to an embedding API.

The trade-off we accept: pure lexical retrieval does not capture semantic similarity between words that share no surface form (e.g. "car" vs. "automobile"). For DeepPipe's target — searching and chatting over your own document library, where the vocabulary of the query and the documents largely overlaps — BM25 is fast, cheap, private, and accurate. Semantic / vector retrieval can be layered on later if a use case demands it, but it is intentionally not a default dependency.
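
For readers unfamiliar with BM25, the scoring idea can be written out in a few lines. The sketch below follows the standard Okapi BM25 formula with k1 = 1.2 and b = 0.75, the constants SQLite documents for FTS5's built-in bm25() function. It is an illustration only, not code from the engine; the `TermStats` interface and the `bm25Score` name are made up here. SQLite's bm25() returns the score multiplied by -1 so that better matches sort first in ascending order; this sketch returns the unnegated value.

```typescript
// Per-term statistics for one document (hypothetical shape for this sketch).
interface TermStats {
  termFrequency: number; // occurrences of the term in this document
  docsWithTerm: number;  // number of documents containing the term
}

function bm25Score(
  terms: TermStats[],
  docLength: number,     // tokens in this document
  avgDocLength: number,  // average tokens per document in the corpus
  totalDocs: number,
  k1 = 1.2,
  b = 0.75,
): number {
  let score = 0;
  for (const t of terms) {
    // Rare terms get a higher inverse document frequency.
    const idf = Math.log((totalDocs - t.docsWithTerm + 0.5) / (t.docsWithTerm + 0.5));
    // Long documents are penalized through the length-normalization term.
    const norm = t.termFrequency + k1 * (1 - b + b * (docLength / avgDocLength));
    score += (idf * t.termFrequency * (k1 + 1)) / norm;
  }
  return score;
}

const rareTerm: TermStats = { termFrequency: 1, docsWithTerm: 1 };
const shortDoc = bm25Score([rareTerm], 50, 100, 10);
const longDoc = bm25Score([rareTerm], 200, 100, 10);
console.log(shortDoc > longDoc); // the same match counts for more in a shorter document
```

Every input to the formula is a count stored in the index, which is why the same corpus and query always produce the same ranking.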


Architecture

DeepPipe runs as a single process that serves both the REST API and the web client, delegating all engine work to @kordabjinan/deeppipe.

┌──────────────────────────────────────────────────────────────────────────┐
│                          Browser (public/)                                 │
│  index.html  ·  styles.css  ·  app.js   (vanilla ES module SPA)            │
│  Views: Search · Upload · Chat · Library                                   │
└───────────────┬───────────────────────────────────────────────┬──────────┘
                │ REST (JSON)                                     │ SSE (chat)
                ▼                                                 ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                      HTTP server (server/server.mjs)                       │
│  Static files · /api/* routes · upload cap · .env loader · LLM proxy       │
└───────────────┬───────────────────────────────────────────────────────────┘
                │ in-process calls (Result<T,E>)
                ▼
╔════════════════════════════════════════════════════════════════════════════╗
║              @kordabjinan/deeppipe   (the engine — installed from npm)        ║
║                                                                              ║
║  ┌────────────────────────────────────────────────────────────────────┐     ║
║  │               Pipeline orchestrator (façade)                        │     ║
║  │  ingestBytes · search · chatContext · listDocuments · removeDoc…    │     ║
║  └───┬──────────────────────┬───────────────────────────┬─────────────┘     ║
║      │ extract              │ index / query             │ retrieve          ║
║      ▼                      ▼                           ▼                    ║
║  ┌─────────────┐   ┌──────────────────────┐   ┌──────────────────────┐      ║
║  │ Extraction  │   │ Search layer         │   │ Chat / RAG           │      ║
║  │ filters     │   │  writer · parser ·   │   │  context-builder     │      ║
║  │ storage     │   │  planner · executor  │   │  (passage selection) │      ║
║  │ text·runtime│   │  index-store (SQLite)│   └──────────────────────┘      ║
║  └─────────────┘   └──────────┬───────────┘                                 ║
║                               ▼                                             ║
║                     ┌──────────────────────┐                               ║
║                     │ data/deeppipe.db     │                               ║
║                     │ SQLite + FTS5 (WAL)  │                               ║
║                     └──────────────────────┘                               ║
╚════════════════════════════════════════════════════════════════════════════╝
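
The SSE (chat) arrow in the diagram refers to Server-Sent Events, a line-oriented text format in which each event is one or more `data:` lines followed by a blank line. How server/server.mjs frames its events is not shown in this README, so the helper below is only a sketch of the wire format; the `formatSseEvent` name and the `token` event name are hypothetical.

```typescript
// Formats one Server-Sent Events frame. Multi-line payloads become
// several "data:" lines; a blank line terminates the event.
function formatSseEvent(data: string, event?: string): string {
  const lines: string[] = [];
  if (event !== undefined) {
    lines.push(`event: ${event}`);
  }
  for (const line of data.split(/\r?\n/)) {
    lines.push(`data: ${line}`);
  }
  return lines.join('\n') + '\n\n';
}

const frame = formatSseEvent('hello');
const multiLine = formatSseEvent('a\nb', 'token');
console.log(JSON.stringify(frame));
console.log(JSON.stringify(multiLine));
```

A server would write such frames to a response whose Content-Type is text/event-stream, and the browser would read them incrementally as the LLM produces tokens.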

Engine layers

| Layer | Responsibility |
|---|---|
| Extraction (filters, storage, runtime) | Detects format by magic bytes; parses PDF / OOXML / CFB / HTML / MIME / ZIP in pure TypeScript; recovers text + metadata. |
| Text processing (text) | Charset decoding (BOM + iconv-lite), language detection (franc), Unicode word-breaking and Porter stemming that matches the FTS5 tokenizer. |
| Search (search) | SHA-256 dedup, paragraph-aware chunking, recursive-descent query parser → FTS5 MATCH plan → BM25-ranked, snippet-highlighted results. |
| Chat / RAG (chat) | Retrieves the most relevant passages and assembles a grounded, cited context for any OpenAI-compatible LLM. |
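
Detection by magic bytes, the first step of the extraction layer, can be sketched as a prefix comparison. The signatures below are the well-known ones for PDF, ZIP, and CFB; the function and type names are hypothetical, and this is not the engine's detector. In particular it only checks offset 0, and it reports OOXML files as `zip` because telling them apart requires inspecting the container's structure.

```typescript
type DetectedFormat = 'pdf' | 'zip' | 'cfb' | 'unknown';

// Well-known file signatures, compared against the first bytes of the input.
const SIGNATURES: ReadonlyArray<{ format: DetectedFormat; bytes: number[] }> = [
  { format: 'pdf', bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
  { format: 'zip', bytes: [0x50, 0x4b, 0x03, 0x04] },       // "PK\x03\x04" (also OOXML containers)
  { format: 'cfb', bytes: [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1] }, // legacy Office
];

function detectByMagic(data: Uint8Array): DetectedFormat {
  for (const sig of SIGNATURES) {
    const matches =
      data.length >= sig.bytes.length &&
      sig.bytes.every((byte, i) => data[i] === byte);
    if (matches) return sig.format;
  }
  return 'unknown'; // a fuller detector would fall back to structure, then extension
}

const pdfHeader = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
console.log(detectByMagic(pdfHeader));
```

Checking content before the file extension means a renamed or mislabeled upload is still routed to the right parser.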

Processing flow

Ingestion (upload → indexed)

 Upload bytes
     │
     ▼
 ┌─────────────────┐   magic bytes → structure → extension
 │ Format detect   │
 └────────┬────────┘
          ▼
 ┌─────────────────┐   PDF · OOXML · CFB · HTML · MIME · ZIP · text
 │ Format filter   │   (pure-TS parser dispatch)
 └────────┬────────┘
          ▼
 ┌─────────────────┐   bytes → Unicode (BOM/charset)  →  language detect
 │ Text normalize  │
 └────────┬────────┘
          ▼
 ┌─────────────────┐   SHA-256 hash → dedup?  →  paragraph-aware chunking
 │ Index writer    │
 └────────┬────────┘
          ▼
 ┌─────────────────┐   tokenize + Porter stem  →  FTS5 rows (BM25-ready)
 │ SQLite + FTS5   │   persisted to data/deeppipe.db (WAL)
 └─────────────────┘
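
The index-writer step (hash, dedup check, paragraph-aware chunking) can be sketched as follows. This is a simplified illustration under stated assumptions, not the engine's writer: the function names, the in-memory `seen` set, and the character-based size limit are all hypothetical, and a paragraph longer than the limit is simply kept whole.

```typescript
import { createHash } from 'node:crypto';

// Content hash used as the dedup key.
function sha256Hex(bytes: Uint8Array): string {
  return createHash('sha256').update(bytes).digest('hex');
}

// In-memory stand-in for the index's record of already-ingested content.
const seen = new Set<string>();

function isDuplicate(bytes: Uint8Array): boolean {
  const digest = sha256Hex(bytes);
  if (seen.has(digest)) return true;
  seen.add(digest);
  return false;
}

// Packs whole paragraphs into chunks of at most maxChars characters,
// never splitting inside a paragraph.
function chunkByParagraph(text: string, maxChars: number): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const chunks: string[] = [];
  let current = '';
  for (const p of paragraphs) {
    if (current.length > 0 && current.length + 2 + p.length > maxChars) {
      chunks.push(current);
      current = p;
    } else {
      current = current.length > 0 ? current + '\n\n' + p : p;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

const chunks = chunkByParagraph('aaa\n\nbbb\n\nccc', 8);
console.log(chunks);
```

Hashing the raw bytes before any parsing means an identical re-upload can be skipped without repeating extraction.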

Query (search → ranked hits)

 Query string ─▶ parse (AST: terms, "phrases", prefix*, field:value, AND/OR/NOT)
              ─▶ plan  (quoted FTS5 MATCH — injection-safe)
              ─▶ FTS5 BM25 search ─▶ hydrate rows ─▶ snippet highlight ─▶ paginate
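
The quoting step in the plan stage can be illustrated with a deliberately simplified sketch. FTS5 string literals are wrapped in double quotes, with embedded double quotes escaped by doubling them. Unlike the engine's parser, which understands phrases, prefixes, fields, and boolean operators, this hypothetical helper treats every whitespace-separated token as a literal term, so operators typed by the user become ordinary words.

```typescript
// Wraps one term as an FTS5 string literal; embedded quotes are doubled.
function quoteFts5Term(term: string): string {
  return '"' + term.replace(/"/g, '""') + '"';
}

// Turns raw user input into a MATCH expression made only of quoted literals.
// Returns an empty string for blank input; callers should skip the query then.
function toMatchExpression(userInput: string): string {
  return userInput
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .map(quoteFts5Term)
    .join(' AND ');
}

const plain = toMatchExpression('payment terms');
const hostile = toMatchExpression('a" OR b');
console.log(plain);
console.log(hostile);
```

The resulting expression would then be passed to SQLite as a bound parameter of the MATCH clause, so neither SQL nor FTS5 query syntax from the user is ever interpreted.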

Chat (question → grounded answer)

 Question ─▶ retrieve top passages (FTS5 / BM25)
          ─▶ build grounded, cited context
          ─▶ OpenAI-compatible LLM (SSE stream)
          ─▶ answer with source citations
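
The "build grounded, cited context" step can be sketched as assembling retrieved passages into a numbered source list inside a chat prompt. The `Passage` shape, the prompt wording, and the function name are assumptions for illustration; the engine's chatContext output is not documented here and may look different. The role/content message format is the one used by OpenAI-compatible chat endpoints.

```typescript
// Hypothetical shape of a retrieved passage.
interface Passage {
  docTitle: string;
  text: string;
}

interface ChatMessage {
  role: 'system' | 'user';
  content: string;
}

// Numbers each passage so the model can cite it as [n].
function buildGroundedMessages(question: string, passages: Passage[]): ChatMessage[] {
  const sources = passages
    .map((p, i) => `[${i + 1}] ${p.docTitle}\n${p.text}`)
    .join('\n\n');
  const system =
    'Answer using only the numbered sources below. Cite sources as [n]. ' +
    'If the sources do not contain the answer, say so.\n\n' +
    sources;
  return [
    { role: 'system', content: system },
    { role: 'user', content: question },
  ];
}

// Made-up sample passage, for illustration only.
const messages = buildGroundedMessages('What are the payment terms?', [
  { docTitle: 'contract.pdf', text: 'Payment is due within 30 days.' },
]);
console.log(messages[0].content);
```

Numbering the passages is what makes each answer traceable: a citation like [1] maps directly back to a stored passage and its document.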

Repository layout

.
├── public/            # Web client (index.html, app.js, styles.css)
├── server/            # Node HTTP server (REST API + static files)
├── scripts/           # smoke / pdf / chat test scripts
├── quiet-dotenv.mjs   # silences a transitive dotenv startup banner
├── package.json       # depends on @kordabjinan/deeppipe
└── .env.example       # chat (LLM) configuration template

Getting started

1. Install dependencies

This pulls the engine (@kordabjinan/deeppipe) from npm:

cd DeepPipe    # the directory where you cloned this repository
npm install

2. (Optional) Configure chat

Search and ingestion work without any configuration. Only the chat feature needs an LLM:

Copy-Item .env.example .env    # PowerShell; on macOS/Linux use: cp .env.example .env
# then edit .env to set your OpenAI-compatible endpoint, key, and model

3. Run the headless smoke test

This is the fastest way to confirm that the published engine works end to end: it ingests several formats and runs queries against an in-memory index.

npm run smoke

Expect a series of ok lines and a zero exit code.

4. Run the web app

npm run start

Open http://localhost:4173 — upload a document, search it, and (if configured) chat with it. The index is persisted to data/deeppipe.db, so uploads survive restarts.


How this app uses the engine

Every engine call flows through the published npm package, e.g. in server/server.mjs:

import { openPipeline, extractiveAnswer } from '@kordabjinan/deeppipe';

When a new version of the engine is published, bump the dependency in package.json and run npm install again to test it here.


Licensing

  • Copyright © 2026 Jinan Kordab. DeepPipe — including this application and the @kordabjinan/deeppipe engine — is the original work of Jinan Kordab.
  • The engine package is distributed under the MIT License.
  • Every source file carries the copyright header "DeepPipe — Document Ingestion & AI Search Pipeline - Copyright © Jinan Kordab 2026."

Authored by Jinan Kordab.
