AI-RAG Backend — Retrieval-Augmented Generation (RAG)

A compact RAG backend that combines Large Language Models (LLMs) with a Qdrant vector database to provide grounded, evidence-backed answers to user queries.

This README explains the high-level architecture, data flow, retrieval and prompting patterns, practical limitations, and extension points for engineers new to LLM systems.

🚀 High-level system diagram (text)

User → API / Controllers (controller/question.controller.js) → Inference Service (services/inference.service.js), which then:

- Computes the query embedding (via providers/openAIProvider.js or the configured embedding provider)
- Searches the vector DB (via services/qdrant.service.js)
- Reranks & selects context → builds the prompt → calls the LLM provider → formats & returns the response → persists the chat (database/schema/chatHistory.js)

Note: Document ingestion / indexing is handled by services/document.service.js (chunking, embedding, index write).

🔁 Data flow (step-by-step)

Ingest documents
- Parse, normalize, chunk text, and compute embeddings for each chunk.
- Store chunk text + metadata + embedding in the vector DB (and optionally persist the original doc).
Handle query
- Preprocess query (normalize/clean), compute query embedding.
- Fetch the top-K most similar chunks from the vector DB (cosine similarity, or dot product if the embedding model calls for it).
Post-retrieval
- Optionally rerank (cross-encoder) or filter candidates using metadata.
- Select chunks greedily until the prompt token budget is reached.
- Assemble a prompt: system instructions + selected chunks + user question.
LLM inference
- Call LLM provider for answer generation.
- Parse output, attach citations ([doc:chunk_id]), validate format.
Persist & reply
- Save conversation + retrieval provenance and return structured response to client.
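
The query-side steps above can be sketched as a single orchestration function. This is a minimal sketch under stated assumptions, not the code in services/inference.service.js: `answerQuestion` and the injected `embed`, `search`, `generate`, and `saveChat` functions are hypothetical names standing in for the embedding provider, the Qdrant service, the LLM provider, and chat persistence. A character budget is used as a crude stand-in for a token budget.

```javascript
// Hypothetical orchestration of the query path:
// embed -> search -> select -> prompt -> generate -> persist.
// All dependencies are injected so the flow can run without network access.
async function answerQuestion(question, deps, options = {}) {
  const { embed, search, generate, saveChat } = deps;
  const { topK = 5, maxContextChars = 4000 } = options;

  // 1. Preprocess: trim and collapse whitespace before embedding.
  const cleaned = question.trim().replace(/\s+/g, ' ');
  const queryVector = await embed(cleaned);

  // 2. Retrieve candidates, assumed to arrive sorted by descending score.
  const candidates = await search(queryVector, topK);

  // 3. Greedy selection: stop at the first chunk that would exceed the budget.
  const selected = [];
  let used = 0;
  for (const chunk of candidates) {
    if (used + chunk.text.length > maxContextChars) break;
    selected.push(chunk);
    used += chunk.text.length;
  }

  // 4. Assemble the prompt: instructions + labeled chunks + question.
  const context = selected.map((c) => `[${c.id}] ${c.text}`).join('\n');
  const prompt = `Answer using ONLY the context below.\n${context}\nQuestion: ${cleaned}`;

  // 5. Generate, then persist the exchange together with retrieval provenance.
  const answer = await generate(prompt);
  const result = { answer, sources: selected.map((c) => c.id) };
  await saveChat({ question: cleaned, ...result });
  return result;
}
```

Injecting the dependencies keeps the flow testable with stubs and makes it straightforward to swap providers later.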

🎯 Prompting strategy

Use a strong system prompt to define role, tone, and guardrails (e.g., "Respond using ONLY the provided sources").
Provide context chunks clearly delimited and labeled with provenance.
Enforce reply format: short answer, explicit confidence indicator, and an ordered list of source citations.
Use conservative instructions: tell the model to respond "I don't know" when the answer cannot be supported by provided context.
Tune the prompt template for your model's behavior and your domain (examples and counter-examples help).

Example template:

System: You are an assistant that answers using ONLY the context below. If the answer cannot be supported, reply: "I don't know".
Context:
---CHUNK_START [doc:chunk_id]---
<chunk_text>
---CHUNK_END---
Question: <user_question>
Answer (brief, cite sources like [doc:chunk_id]):
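
The template above could be filled in by a small helper like the one below. This is a sketch only: `buildPrompt` and the `{ docId, chunkId, text }` chunk shape are hypothetical and may not match what the inference service actually uses.

```javascript
// Builds a prompt following the template above.
// `chunks` is assumed to be an array of { docId, chunkId, text } objects (hypothetical shape).
function buildPrompt(chunks, question) {
  const system =
    'System: You are an assistant that answers using ONLY the context below. ' +
    'If the answer cannot be supported, reply: "I don\'t know".';

  // Each chunk is delimited and labeled with its provenance.
  const context = chunks
    .map((c) => `---CHUNK_START [${c.docId}:${c.chunkId}]---\n${c.text}\n---CHUNK_END---`)
    .join('\n');

  return [
    system,
    'Context:',
    context,
    `Question: ${question}`,
    'Answer (brief, cite sources like [doc:chunk_id]):',
  ].join('\n');
}

const prompt = buildPrompt(
  [{ docId: 'handbook', chunkId: 3, text: 'Refunds are issued within 14 days.' }],
  'How long do refunds take?'
);
console.log(prompt);
```

Chunk text is inserted verbatim here; if sources are untrusted, consider escaping the delimiter strings so a document cannot imitate a chunk boundary.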

🔍 Vector storage & retrieval logic

Embeddings
- Use the same embedding model for indexing and queries; vectors produced by different models are not comparable. Persist model/version in vector metadata for reproducibility.
Chunking
- Chunk size and overlap depend on domain (e.g., 500–1,000 tokens with 20–30% overlap is common).
- Store chunk text, source id, chunk id, timestamps, language, and embedding.
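
As an illustration of chunking with overlap, here is a minimal sketch. It counts whitespace-separated words as a rough proxy for tokens, which is an assumption; a real tokenizer will give different counts. `chunkText` is a hypothetical helper, not necessarily how services/document.service.js splits text.

```javascript
// Splits text into overlapping chunks. Sizes are counted in whitespace-separated
// words as a rough proxy for tokens (an assumption; a real tokenizer will differ).
function chunkText(text, chunkSize = 500, overlap = 100) {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('require chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  const words = text.split(/\s+/).filter(Boolean);
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push({
      chunkId: chunks.length,
      text: words.slice(start, start + chunkSize).join(' '),
    });
    // Stop once the current window already reaches the end of the text.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}

console.log(chunkText('a b c d e f g h i j', 4, 1).map((c) => c.text));
// [ 'a b c d', 'd e f g', 'g h i j' ]
```

The defaults (500 words, 100 overlapping) correspond to a 20% overlap, at the low end of the range mentioned above.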
Search
- Use nearest-neighbor search (cosine or dot product depending on embedding model).
- Apply metadata filters (e.g., source, date, language) to restrict search space.
- Retrieve top-K (K tuned to recall vs. latency tradeoffs), then rerank if needed.
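
To make the search step concrete, the sketch below does a brute-force cosine top-K search with a metadata filter over an in-memory array. It is for illustration only: Qdrant performs this server-side with an approximate index, and the `points` shape and function names here are hypothetical.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-K search with an optional metadata predicate.
function searchTopK(queryVector, points, k, filter = () => true) {
  return points
    .filter((p) => filter(p.metadata))
    .map((p) => ({ ...p, score: cosineSimilarity(queryVector, p.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const points = [
  { id: 'a', vector: [1, 0], metadata: { language: 'en' } },
  { id: 'b', vector: [0.6, 0.8], metadata: { language: 'en' } },
  { id: 'c', vector: [1, 0.1], metadata: { language: 'de' } },
];
const hits = searchTopK([1, 0], points, 2, (m) => m.language === 'en');
console.log(hits.map((h) => h.id)); // [ 'a', 'b' ]
```

Filtering before scoring mirrors the idea of restricting the search space: point `c` is close to the query but is excluded by the language filter.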
Reranking & selection
- Use a cross-encoder reranker or heuristic scoring to refine candidate order.
- Accept only candidates above a similarity/rerank threshold or select top candidates until token budget is reached.
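
The threshold and token-budget rules can be combined as in this sketch. The token estimate (about 4 characters per token) is a rough rule of thumb for English text, not a tokenizer, and the default `minScore` and `tokenBudget` values are arbitrary placeholders to be tuned.

```javascript
// Rough token estimate (~4 characters per token is a common rule of thumb for
// English text; this is an approximation, not a tokenizer).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keeps candidates scoring at or above `minScore`, then adds them in descending
// score order until the next chunk would exceed `tokenBudget`.
function selectContext(candidates, { minScore = 0.75, tokenBudget = 2000 } = {}) {
  const ranked = candidates
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score);

  const selected = [];
  let used = 0;
  for (const c of ranked) {
    const cost = estimateTokens(c.text);
    if (used + cost > tokenBudget) break;
    selected.push(c);
    used += cost;
  }
  return { selected, tokensUsed: used };
}
```

Stopping at the first chunk that does not fit keeps the selection strictly in rank order; skipping it and trying smaller chunks further down is an alternative that trades ordering for fuller use of the budget.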

🔁 RAG Pipeline (concise)

Query → embedding → vector DB (top-K)
Rerank / filter → token-limited context selection
Build guarded prompt → LLM call
Parse output, attach citations, validate, return & persist

⚠️ Limitations & risks

Hallucination
- LLMs can invent facts; mitigation: require citation, instruct to say "I don't know", add post-generation verification.
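
One simple post-generation check is to confirm that every citation in the answer points at a chunk that was actually in the prompt. The sketch below assumes citations follow the `[doc:chunk_id]` form; `checkCitations` is a hypothetical helper, and it only detects invented citations, not unsupported claims.

```javascript
// Extracts citations of the form [doc:chunk_id] and checks each one against the
// ids of the chunks that were actually placed in the prompt.
function checkCitations(answer, allowedIds) {
  const pattern = /\[([^\[\]\s:]+:[^\[\]\s]+)\]/g;
  const cited = [...answer.matchAll(pattern)].map((m) => m[1]);
  const allowed = new Set(allowedIds);
  const unknown = cited.filter((id) => !allowed.has(id));
  return {
    cited,
    unknown,
    // Supported only if at least one citation exists and none are invented.
    // Callers should treat an explicit "I don't know" reply separately.
    supported: cited.length > 0 && unknown.length === 0,
  };
}

const report = checkCitations(
  'Refunds take 14 days [handbook:3]. See also [faq:9].',
  ['handbook:3']
);
console.log(report.unknown); // [ 'faq:9' ]
```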
Stale / incorrect data
- Index can become outdated — schedule reindexing and version data.
Latency
- Embedding computation, vector search, and the LLM call each add latency; mitigate with caching, batching, async prefetching, or smaller local models.
Cost
- API usage (embeddings + LLM tokens) has monetary cost. Use batching and caching to reduce repeated calls.
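
As one way to cut repeated embedding calls (which helps both latency and cost), an in-memory cache can wrap the embedding function. This is a sketch under assumptions: `embedFn` stands in for whatever provider call the project uses, and eviction is simple first-in-first-out rather than LRU.

```javascript
// Wraps an async embedding function with an in-memory cache keyed by trimmed text.
// `embedFn` is a stand-in for the project's provider call (hypothetical).
function withEmbeddingCache(embedFn, maxEntries = 1000) {
  const cache = new Map(); // insertion-ordered, so the first key is the oldest

  return async function cachedEmbed(text) {
    const key = text.trim();
    if (cache.has(key)) return cache.get(key);

    // Store the promise so concurrent calls for the same text share one request.
    const pending = embedFn(key).catch((err) => {
      cache.delete(key); // do not cache failures
      throw err;
    });
    cache.set(key, pending);

    // Evict the oldest entry once the cache is full (simple FIFO, not LRU).
    if (cache.size > maxEntries) {
      cache.delete(cache.keys().next().value);
    }
    return pending;
  };
}
```

A process-local cache like this is lost on restart and is not shared across workers; a shared store would be needed for that.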
Privacy & security
- Strip or redact PII before indexing; apply retention policies and access controls.
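
A very rough illustration of redaction before indexing is shown below. The two patterns are heuristics chosen for this example; they will miss many forms of PII (names, addresses, ids) and can produce false positives, so treat this as a starting point rather than a safeguard.

```javascript
// Heuristic redaction of e-mail addresses and phone-like numbers before indexing.
// These patterns are illustrative only and will miss many PII forms.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

function redactPII(text) {
  return text.replace(EMAIL, '[EMAIL]').replace(PHONE, '[PHONE]');
}

console.log(redactPII('Contact jane.doe@example.com or +1 (555) 010-9999.'));
// Contact [EMAIL] or [PHONE].
```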

🔧 How to extend or customize the pipeline

Swap LLM or embedding provider: add an adapter in providers/ and plug it into the inference service.
Replace vector DB: implement a services/<db>.service.js that follows the same interface as qdrant.service.js (index, search, upsert, delete).
Add hybrid retrieval: combine sparse (BM25) and dense (embedding) retrieval to improve recall.
Add verification: a post-generation fact-checker or external trusted data fetcher.
Improve reranking: add a cross-encoder or supervised ranker trained on your domain.
Monitoring & evaluation: add synthetic QA datasets to measure hallucination rate, precision/recall, and latency.
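
For hybrid retrieval, the sparse and dense result lists need to be merged somehow. The README does not say which method to use; one common choice is reciprocal rank fusion (RRF), sketched here as an assumption. The constant `k = 60` is the value conventionally used with RRF, and the chunk ids are made up.

```javascript
// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank),
// where rank starts at 1 for the best item in each list.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) || 0) + contribution);
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

const bm25Ranking = ['c2', 'c7', 'c1']; // hypothetical sparse results, best first
const denseRanking = ['c7', 'c3', 'c2']; // hypothetical dense results, best first
const fused = reciprocalRankFusion([bm25Ranking, denseRanking]);
console.log(fused.map((r) => r.id)); // [ 'c7', 'c2', 'c3', 'c1' ]
```

RRF uses only ranks, so it avoids having to normalize BM25 scores against cosine similarities.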

🧭 Operational considerations

Monitoring: track latency (embedding, search, LLM), token usage, error rates, and hallucination incidents.
Scaling: shard indexes, use read replicas for vector DB, make ingestion asynchronous, and horizontally scale inference workers.
Security: enforce auth, encrypt storage, and audit access to sensitive documents.
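
Per-stage latency tracking can start as simply as the sketch below. It keeps samples in memory only; a real deployment would export them to a metrics system. `createStageTimer` and the stage names are hypothetical.

```javascript
// Collects per-stage latency samples in memory. The clock is injectable so the
// helper can be tested deterministically.
function createStageTimer(now = () => Date.now()) {
  const samples = {}; // stage name -> array of durations in ms

  async function time(stage, fn) {
    const start = now();
    try {
      return await fn();
    } finally {
      // Record the duration even when the stage throws.
      if (!samples[stage]) samples[stage] = [];
      samples[stage].push(now() - start);
    }
  }

  function summary() {
    const out = {};
    for (const [stage, values] of Object.entries(samples)) {
      const total = values.reduce((a, b) => a + b, 0);
      out[stage] = {
        count: values.length,
        avgMs: total / values.length,
        maxMs: Math.max(...values),
      };
    }
    return out;
  }

  return { time, summary };
}
```

Usage would look like `await timer.time('search', () => search(vector))` around each of the embedding, search, and LLM stages, with `timer.summary()` read periodically.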

🗂 Repo map (where to look)

Controllers: controller/ (e.g., question.controller.js, document.controller.js)
Services: services/ (inference.service.js, document.service.js, qdrant.service.js)
Providers: providers/ (openAIProvider.js) — add adapters here
DB schema: database/schema/ (chatHistory.js, document.js, chunk.js)
Helpers & configs: helpers/, configs/
Middleware & upload: middleware/, upload/

✅ Quick start

Install deps

npm install

Set environment variables (example)

OPENAI_API_KEY
QDRANT_URL / QDRANT_API_KEY
PORT
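
It can help to validate these variables once at startup so that a missing key fails fast. The sketch below is an assumption about how that might look: which variables are mandatory, the fallback port, and the `loadConfig` name are all hypothetical and should be adjusted to how the server actually reads its configuration.

```javascript
// Validates environment variables at startup. Which variables are mandatory is an
// assumption here; adjust to match the project's real configuration handling.
function loadConfig(env = process.env) {
  const required = ['OPENAI_API_KEY', 'QDRANT_URL'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }

  const port = Number(env.PORT || 3000); // 3000 is an arbitrary fallback
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }

  return {
    openAiApiKey: env.OPENAI_API_KEY,
    qdrantUrl: env.QDRANT_URL,
    qdrantApiKey: env.QDRANT_API_KEY || null, // may be unset for an unauthenticated instance
    port,
  };
}
```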

Start server

npm start

Ingest documents

Use the document ingestion flow (see services/document.service.js) to chunk and index documents into Qdrant.

Testing Guide

Test Structure

tests/
├── unit/                 # Unit tests for individual functions/classes
│   ├── document.controller.test.js
│   ├── question.controller.test.js
│   ├── qdrant.service.test.js
│   └── document.service.test.js
├── api/                  # API integration tests
│   └── integration.test.js
├── mocks/                # Mock factories and utilities
│   └── mockFactories.js
└── setup.js              # Jest setup configuration

Run all tests

npm test

Run tests in watch mode (re-run on file changes)

npm run test:watch

Run only unit tests

npm run test:unit

Run only API tests

npm run test:api

Run with coverage report

npm test -- --coverage

Test Coverage

The test suite covers:

Controllers

DocumentController: File upload handling, event emission
QuestionController: Question answering, error handling

Services

QdrantVectorDatabaseService: Vector database operations (create, insert, search)
DocumentService: PDF processing, chunking, embedding

API Endpoints

POST /new/document - Document upload
POST /question - Question answering

Test Examples

Unit Test Example

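// Assumes uploadDocument, fileEvent (with emit mocked or spied on), req, and res
// come from the test file's imports and setup.
test('should emit file-uploaded event with correct data', () => {
  uploadDocument(req, res);

  expect(fileEvent.emit).toHaveBeenCalledWith('file-uploaded', {
    filename: 'test-document.pdf',
    // ... other properties
  });
});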

API Test Example

// Assumes `request` is supertest and `app` is the server's app instance.
test('should upload document successfully', async () => {
  const response = await request(app)
    .post('/new/document')
    .attach('document', Buffer.from('test content'), 'test.pdf');

  expect(response.status).toBe(200);
  expect(response.body).toHaveProperty('message', 'uploaded file');
});
