
2bdulra7manRea/RAG-DocFlow


AI-RAG Backend: Retrieval-Augmented Generation (RAG)

A compact RAG backend that combines Large Language Models (LLMs) with a Qdrant vector database to provide grounded, evidence-backed answers to user queries.

This README explains the high-level architecture, data flow, retrieval and prompting patterns, practical limitations, and extension points for engineers new to LLM systems.


🚀 High-level system diagram (text)

User → API / Controllers (controller/question.controller.js) → Inference Service (services/inference.service.js) →

  • Compute query embedding (via providers/openAIProvider.js or configured embedding provider)
  • Vector DB search (via services/qdrant.service.js)
  • Rerank & select context → Build prompt → Call LLM provider → Format & return response → Persist chat (database/schema/chatHistory.js)

Note: Document ingestion / indexing is handled by services/document.service.js (chunking, embedding, index write).


Data flow (step-by-step)

  1. Ingest documents
    • Parse, normalize, chunk text, and compute embeddings for each chunk.
    • Store chunk text + metadata + embedding in the vector DB (and optionally persist the original doc).
  2. Handle query
    • Preprocess query (normalize/clean), compute query embedding.
    • Fetch top-K vectors from vector DB using cosine similarity.
  3. Post-retrieval
    • Optionally rerank (cross-encoder) or filter candidates using metadata.
    • Select chunks greedily until the prompt token budget is reached.
    • Assemble a prompt: system instructions + selected chunks + user question.
  4. LLM inference
    • Call LLM provider for answer generation.
    • Parse output, attach citations ([doc:chunk_id]), validate format.
  5. Persist & reply
    • Save conversation + retrieval provenance and return structured response to client.
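The greedy selection in step 3 could look roughly like the sketch below. It is only an illustration: `estimateTokens`, `selectChunks`, and the `{ id, text, score }` candidate shape are hypothetical names, not taken from this repository, and the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer.

```javascript
// Rough token estimate: about 4 characters per token is a common heuristic
// for English text. This is an assumption; use your model's tokenizer for
// accurate counts.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Greedily take the highest-scoring chunks until the next one would exceed
// the token budget. `candidates` is assumed to look like { id, text, score }.
function selectChunks(candidates, tokenBudget) {
  const ranked = [...candidates].sort((a, b) => b.score - a.score);
  const selected = [];
  let used = 0;
  for (const chunk of ranked) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > tokenBudget) break; // budget reached, stop selecting
    selected.push(chunk);
    used += cost;
  }
  return { selected, used };
}
```

Stopping at the first chunk that does not fit keeps the selection in strict score order; skipping that chunk and trying smaller ones instead would pack the budget more tightly at the cost of that ordering.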

🎯 Prompting strategy

  • Use a strong system prompt to define role, tone, and guardrails (e.g., "Respond using ONLY the provided sources").
  • Provide context chunks clearly delimited and labeled with provenance.
  • Enforce reply format: short answer, explicit confidence indicator, and an ordered list of source citations.
  • Use conservative instructions: tell the model to respond "I don't know" when the answer cannot be supported by provided context.
  • Tune the prompt template for your model's behavior and your domain (examples and counter-examples help).

Example template:

System: You are an assistant that answers using ONLY the context below. If the answer cannot be supported, reply: "I don't know".
Context:
---CHUNK_START [doc:chunk_id]---
<chunk_text>
---CHUNK_END---
Question: <user_question>
Answer (brief, cite sources like [doc:chunk_id]):
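A minimal sketch of how the template above might be filled in from selected chunks. The function name `buildPrompt` and the `{ docId, chunkId, text }` chunk shape are assumptions made for illustration; the repository's inference service may assemble its prompt differently.

```javascript
// Render the example template for a list of chunks and a user question.
// The chunk shape { docId, chunkId, text } is hypothetical.
function buildPrompt(chunks, question) {
  const context = chunks
    .map(
      (c) =>
        `---CHUNK_START [${c.docId}:${c.chunkId}]---\n${c.text}\n---CHUNK_END---`
    )
    .join('\n');

  return [
    'System: You are an assistant that answers using ONLY the context below. ' +
      'If the answer cannot be supported, reply: "I don\'t know".',
    'Context:',
    context,
    `Question: ${question}`,
    'Answer (brief, cite sources like [doc:chunk_id]):',
  ].join('\n');
}
```

Labeling every chunk with its `[doc:chunk_id]` inside the delimiter gives the model an exact string to copy when it cites a source.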

Vector storage & retrieval logic

  • Embeddings
    • Keep embedding model consistent for indexing + queries. Persist model/version in vector metadata for reproducibility.
  • Chunking
    • Chunk size and overlap depend on domain (e.g., 500–1,000 tokens with 20–30% overlap is common).
    • Store chunk text, source id, chunk id, timestamps, language, and embedding.
  • Search
    • Use nearest-neighbor search (cosine or dot product depending on embedding model).
    • Apply metadata filters (e.g., source, date, language) to restrict search space.
    • Retrieve top-K (K tuned to recall vs. latency tradeoffs), then rerank if needed.
  • Reranking & selection
    • Use a cross-encoder reranker or heuristic scoring to refine candidate order.
    • Accept only candidates above a similarity/rerank threshold or select top candidates until token budget is reached.
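To make the chunk size and overlap guidance concrete, here is a small sliding-window chunker. It is a sketch under stated assumptions: whitespace-separated words stand in for tokens, and `chunkText` with its parameters is a hypothetical helper, not the implementation in services/document.service.js.

```javascript
// Split text into overlapping windows. Words approximate tokens here;
// a real tokenizer will produce different counts.
function chunkText(text, chunkSize = 500, overlapRatio = 0.2) {
  const words = text.split(/\s+/).filter(Boolean);
  const overlap = Math.floor(chunkSize * overlapRatio);
  const step = Math.max(1, chunkSize - overlap); // always move forward
  const chunks = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break; // this window hit the end
  }
  return chunks;
}
```

With `chunkSize = 4` and `overlapRatio = 0.5`, each chunk shares its last two words with the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from either side.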

RAG Pipeline (concise)

  1. Query → embedding → vector DB (top-K)
  2. Rerank / filter → token-limited context selection
  3. Build guarded prompt → LLM call
  4. Parse output, attach citations, validate, return & persist
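The four steps can be sketched as one orchestration function with injected dependencies. Everything here is hypothetical: `answerQuestion` and the `embed`, `search`, `selectContext`, and `generate` callbacks are illustrative stand-ins for the provider, the Qdrant service, and the selection logic, not the actual interfaces in this repository.

```javascript
// All dependencies are injected so the flow can be exercised with stubs.
// Their names and shapes are assumptions for illustration only.
async function answerQuestion(question, deps, topK = 5) {
  const { embed, search, selectContext, generate } = deps;

  const queryVector = await embed(question);          // 1. query -> embedding
  const candidates = await search(queryVector, topK); // 1. vector DB (top-K)
  const context = selectContext(candidates);          // 2. rerank / filter / budget

  const prompt = [
    ...context.map((c) => `[${c.id}] ${c.text}`),
    `Question: ${question}`,
  ].join('\n');

  const answer = await generate(prompt);              // 3. LLM call
  return { answer, sources: context.map((c) => c.id) }; // 4. answer + provenance
}
```

Passing the dependencies in as plain functions keeps the pipeline independent of any one provider and makes it straightforward to unit test with mocks.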

⚠️ Limitations & risks

  • Hallucination
    • LLMs can invent facts; mitigation: require citation, instruct to say "I don't know", add post-generation verification.
  • Stale / incorrect data
    • Index can become outdated; schedule reindexing and version your data.
  • Latency
    • Embedding computation, vector search, and the LLM call each add latency; mitigate with caching, batching, async prefetching, or smaller local models.
  • Cost
    • API usage (embeddings + LLM tokens) has monetary cost. Use batching and caching to reduce repeated calls.
  • Privacy & security
    • Strip or redact PII before indexing; apply retention policies and access controls.
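One cheap form of the post-generation verification mentioned under Hallucination is to check that every citation in the answer points at a chunk that was actually supplied. The sketch below assumes citations follow the `[doc:chunk_id]` pattern; `extractCitations` and `checkCitations` are hypothetical helper names.

```javascript
// Pull [doc:chunk_id] style citations out of a model answer.
function extractCitations(answer) {
  const pattern = /\[([^\[\]:\s]+):([^\[\]:\s]+)\]/g;
  return [...answer.matchAll(pattern)].map((m) => `${m[1]}:${m[2]}`);
}

// Compare cited ids against the ids of the chunks placed in the prompt.
// An answer counts as grounded only if it cites something and cites
// nothing outside the provided context.
function checkCitations(answer, providedIds) {
  const allowed = new Set(providedIds);
  const cited = [...new Set(extractCitations(answer))];
  const unknown = cited.filter((id) => !allowed.has(id));
  return { cited, unknown, grounded: cited.length > 0 && unknown.length === 0 };
}
```

This only verifies that citations exist and are valid; it does not confirm that the cited chunk supports the claim, which needs a stronger check such as an entailment model or human review.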

🔧 How to extend or customize the pipeline

  • Swap LLM or embedding provider: add an adapter in providers/ and plug it into the inference service.
  • Replace vector DB: implement a services/<db>.service.js that follows the same interface as qdrant.service.js (index, search, upsert, delete).
  • Add hybrid retrieval: combine sparse (BM25) and dense (embedding) retrieval to improve recall.
  • Add verification: a post-generation fact-checker or external trusted data fetcher.
  • Improve reranking: add a cross-encoder or supervised ranker trained on your domain.
  • Monitoring & evaluation: add synthetic QA datasets to measure hallucination rate, precision/recall, and latency.
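For hybrid retrieval, one common way to merge a sparse (BM25) ranking with a dense (embedding) ranking is reciprocal rank fusion (RRF). This is offered as one possible option, not something this repository implements; `reciprocalRankFusion` is a hypothetical helper, and `k = 60` is a frequently used default rather than a tuned value.

```javascript
// Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d)),
// where rank starts at 1. Each input list is an array of ids, best first.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      scores.set(id, (scores.get(id) || 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}
```

RRF uses only rank positions, so it avoids having to normalize BM25 scores and cosine similarities onto a shared scale.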

🧭 Operational considerations

  • Monitoring: track latency (embedding, search, LLM), token usage, error rates, and hallucination incidents.
  • Scaling: shard indexes, use read replicas for vector DB, make ingestion asynchronous, and horizontally scale inference workers.
  • Security: enforce auth, encrypt storage, and audit access to sensitive documents.
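Per-stage latency tracking (embedding, search, LLM) can start as a small wrapper like the one below. It is a sketch only: `createStageTimer` is a hypothetical helper, and a production setup would more likely export these numbers to a metrics system than keep them in memory.

```javascript
// Collect cumulative wall-clock time per pipeline stage.
// `now` is injectable so the timer can be tested with a fake clock.
function createStageTimer(now = () => Date.now()) {
  const durations = {};
  return {
    // Run `fn`, record how long it took under `name`, and pass through
    // its result (or its error).
    async time(name, fn) {
      const start = now();
      try {
        return await fn();
      } finally {
        durations[name] = (durations[name] || 0) + (now() - start);
      }
    },
    // Return a copy of the totals in milliseconds.
    report: () => ({ ...durations }),
  };
}
```

A caller would wrap each stage, for example `await timer.time('search', () => search(vector))`, and log `timer.report()` once per request.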

🗂 Repo map (where to look)

  • Controllers: controller/ (e.g., question.controller.js, document.controller.js)
  • Services: services/ (inference.service.js, document.service.js, qdrant.service.js)
  • Providers: providers/ (openAIProvider.js); add adapters here
  • DB schema: database/schema/ (chatHistory.js, document.js, chunk.js)
  • Helpers & configs: helpers/, configs/
  • Middleware & upload: middleware/, upload/

✅ Quick start

  1. Install deps
npm install
  2. Set environment variables (example)
  • OPENAI_API_KEY
  • QDRANT_URL / QDRANT_API_KEY
  • PORT
  3. Start server
npm start
  4. Ingest documents
  • Use document ingestion flow (see services/document.service.js) to chunk & index documents into Qdrant.
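Failing fast on missing environment variables saves debugging time at startup. The loader below is a hypothetical sketch, not the code in configs/: it assumes OPENAI_API_KEY and QDRANT_URL are required, that QDRANT_API_KEY is optional (for example with a local Qdrant instance), and that 3000 is an acceptable fallback port.

```javascript
// Validate and normalize the environment variables listed above.
// Which variables are required, and the fallback port, are assumptions.
function loadConfig(env = process.env) {
  const required = ['OPENAI_API_KEY', 'QDRANT_URL'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    openAiApiKey: env.OPENAI_API_KEY,
    qdrantUrl: env.QDRANT_URL,
    qdrantApiKey: env.QDRANT_API_KEY || null, // assumed optional
    port: Number(env.PORT) || 3000,           // assumed fallback port
  };
}
```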

Testing Guide

Test Structure

tests/
├── unit/                 # Unit tests for individual functions/classes
│   ├── document.controller.test.js
│   ├── question.controller.test.js
│   ├── qdrant.service.test.js
│   └── document.service.test.js
├── api/                  # API integration tests
│   └── integration.test.js
├── mocks/                # Mock factories and utilities
│   └── mockFactories.js
└── setup.js              # Jest setup configuration

Run all tests

npm test

Run tests in watch mode (re-run on file changes)

npm run test:watch

Run only unit tests

npm run test:unit

Run only API tests

npm run test:api

Run with coverage report

npm test -- --coverage

Test Coverage

The test suite covers:

Controllers

  • DocumentController: File upload handling, event emission
  • QuestionController: Question answering, error handling

Services

  • QdrantVectorDatabaseService: Vector database operations (create, insert, search)
  • DocumentService: PDF processing, chunking, embedding

API Endpoints

  • POST /new/document - Document upload
  • POST /question - Question answering

Test Examples

Unit Test Example

test('should emit file-uploaded event with correct data', () => {
  uploadDocument(req, res);
  
  expect(fileEvent.emit).toHaveBeenCalledWith('file-uploaded', {
    filename: 'test-document.pdf',
    // ... other properties
  });
});

API Test Example

test('should upload document successfully', async () => {
  const response = await request(app)
    .post('/new/document')
    .attach('document', Buffer.from('test content'), 'test.pdf');

  expect(response.status).toBe(200);
  expect(response.body).toHaveProperty('message', 'uploaded file');
});

About

The system receives documents from users and stores them in the database as Document entities. It maintains a History entity to track each processing step, including whether the document has been successfully embedded by the LLM.
