A compact RAG backend that combines Large Language Models (LLMs) with a Qdrant vector database to provide grounded, evidence-backed answers to user queries.
This README explains the high-level architecture, data flow, retrieval and prompting patterns, practical limitations, and extension points for engineers new to LLM systems.
User → API / Controllers (controller/question.controller.js) → Inference Service (services/inference.service.js) →
- Compute query embedding (via providers/openAIProvider.js or configured embedding provider)
- Vector DB search (via services/qdrant.service.js)
- Rerank & select context → Build prompt → Call LLM provider → Format & return response → Persist chat (database/schema/chatHistory.js)
Note: Document ingestion / indexing is handled by services/document.service.js (chunking, embedding, index write).
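The chunking step mentioned in the note above can be sketched roughly as follows. This is an illustration, not the code in services/document.service.js: `chunkText` is a hypothetical helper, and "tokens" are approximated by whitespace-separated words, which will not match a real tokenizer.

```javascript
// Hypothetical helper: split text into overlapping chunks.
// Assumption: tokens are approximated by whitespace-separated words.
function chunkText(text, chunkSize = 500, overlap = 100) {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const words = text.split(/\s+/).filter(Boolean);
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < words.length; start += step) {
    const slice = words.slice(start, start + chunkSize);
    chunks.push({ chunkId: chunks.length, start, text: slice.join(' ') });
    if (start + chunkSize >= words.length) break; // this window reached the end
  }
  return chunks;
}

const demoChunks = chunkText('a b c d e f g h i j', 4, 1);
console.log(demoChunks.map((c) => c.text));
// [ 'a b c d', 'd e f g', 'g h i j' ]
```

Each chunk would then be embedded and written to the vector DB together with its metadata.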
- Ingest documents
- Parse, normalize, chunk text, and compute embeddings for each chunk.
- Store chunk text + metadata + embedding in the vector DB (and optionally persist the original doc).
- Handle query
- Preprocess query (normalize/clean), compute query embedding.
- Fetch top-K vectors from vector DB using cosine similarity.
- Post-retrieval
- Optionally rerank (cross-encoder) or filter candidates using metadata.
- Select chunks greedily until the prompt token budget is reached.
- Assemble a prompt: system instructions + selected chunks + user question.
- LLM inference
- Call LLM provider for answer generation.
- Parse output, attach citations (
[doc:chunk_id]), validate format.
- Persist & reply
- Save conversation + retrieval provenance and return structured response to client.
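As a minimal sketch of the "select chunks greedily until the prompt token budget is reached" step, assuming candidates arrive already ranked best-first. `selectChunks` and `estimateTokens` are hypothetical names, and the 4-characters-per-token estimate is a rough assumption; a real tokenizer for your model should replace it.

```javascript
// Assumption: roughly 4 characters per token. Swap in a real tokenizer.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Take ranked chunks in order and stop at the first one that would
// push the running total over the budget.
function selectChunks(rankedChunks, tokenBudget, countTokens = estimateTokens) {
  const selected = [];
  let used = 0;
  for (const chunk of rankedChunks) {
    const cost = countTokens(chunk.text);
    if (used + cost > tokenBudget) break;
    selected.push(chunk);
    used += cost;
  }
  return { selected, used };
}

const ranked = [
  { id: 'a', text: 'x'.repeat(40) }, // ~10 tokens
  { id: 'b', text: 'y'.repeat(80) }, // ~20 tokens
  { id: 'c', text: 'z'.repeat(20) }, // ~5 tokens, but the budget is already full
];
const picked = selectChunks(ranked, 30);
console.log(picked.selected.map((c) => c.id), picked.used);
// [ 'a', 'b' ] 30
```

Stopping at the first chunk that does not fit keeps the ranking order intact. Skipping it and continuing would pack the budget more tightly, at the cost of including lower-ranked chunks.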
- Use a strong system prompt to define role, tone, and guardrails (e.g., "Respond using ONLY the provided sources").
- Provide context chunks clearly delimited and labeled with provenance.
- Enforce reply format: short answer, explicit confidence indicator, and an ordered list of source citations.
- Use conservative instructions: tell the model to respond "I don't know" when the answer cannot be supported by provided context.
- Tune the prompt template for your model's behavior and your domain (examples and counter-examples help).
Example template:
System: You are an assistant that answers using ONLY the context below. If the answer cannot be supported, reply: "I don't know".
Context:
---CHUNK_START [doc:chunk_id]---
<chunk_text>
---CHUNK_END---
Question: <user_question>
Answer (brief, cite sources like [doc:chunk_id]):
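One way to fill this template in code is sketched below. `buildPrompt` and the chunk field names (`docId`, `chunkId`, `text`) are assumptions for illustration; adapt them to whatever shape your retrieval layer returns.

```javascript
// Hypothetical prompt builder that follows the template above.
function buildPrompt(chunks, question) {
  const system =
    'You are an assistant that answers using ONLY the context below. ' +
    'If the answer cannot be supported, reply: "I don\'t know".';
  // Each chunk is delimited and labeled with its provenance.
  const context = chunks
    .map((c) => `---CHUNK_START [${c.docId}:${c.chunkId}]---\n${c.text}\n---CHUNK_END---`)
    .join('\n');
  return [
    `System: ${system}`,
    'Context:',
    context,
    `Question: ${question}`,
    'Answer (brief, cite sources like [doc:chunk_id]):',
  ].join('\n');
}

const examplePrompt = buildPrompt(
  [{ docId: 'handbook', chunkId: 3, text: 'Refunds are processed within 14 days.' }],
  'How long do refunds take?'
);
console.log(examplePrompt);
```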
- Embeddings
- Keep embedding model consistent for indexing + queries. Persist model/version in vector metadata for reproducibility.
- Chunking
- Chunk size and overlap depend on domain (e.g., 500–1,000 tokens with 20–30% overlap is common).
- Store chunk text, source id, chunk id, timestamps, language, and embedding.
- Search
- Use nearest-neighbor search (cosine or dot product depending on embedding model).
- Apply metadata filters (e.g., source, date, language) to restrict search space.
- Retrieve top-K (K tuned to recall vs. latency tradeoffs), then rerank if needed.
- Reranking & selection
- Use a cross-encoder reranker or heuristic scoring to refine candidate order.
- Accept only candidates above a similarity/rerank threshold or select top candidates until token budget is reached.
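To make the search and threshold steps concrete, here is a small in-memory sketch of cosine top-K with a metadata filter and a minimum score. In this project the vector DB performs the search; the brute-force loop below only illustrates the logic, and `topK`, `payload`, and the point shape are assumptions.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-K: filter by metadata, score, drop weak matches, sort, cut.
function topK(queryVector, points, k, minScore = 0, filter = () => true) {
  return points
    .filter((p) => filter(p.payload))
    .map((p) => ({ ...p, score: cosineSimilarity(queryVector, p.vector) }))
    .filter((p) => p.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const points = [
  { id: 'a', vector: [1, 0], payload: { language: 'en' } },
  { id: 'b', vector: [0, 1], payload: { language: 'en' } },
  { id: 'c', vector: [1, 1], payload: { language: 'de' } },
  { id: 'd', vector: [2, 0.5], payload: { language: 'en' } },
];
const hits = topK([1, 0], points, 2, 0.5, (payload) => payload.language === 'en');
console.log(hits.map((h) => h.id));
// [ 'a', 'd' ]
```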
- Query → embedding → vector DB (top-K)
- Rerank / filter → token-limited context selection
- Build guarded prompt → LLM call
- Parse output, attach citations, validate, return & persist
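The four steps above can be wired together roughly like this. The sketch injects every external call (`embed`, `search`, `rerank`, `generate`, `persist`) as a plain async function, so it runs with stubs; none of these names are taken from services/inference.service.js, and the character-based context limit is a simplifying assumption in place of a token budget.

```javascript
// Hypothetical orchestration of the query flow with injected dependencies.
async function answerQuestion(question, deps, { topK = 5, maxContextChars = 2000 } = {}) {
  const { embed, search, generate } = deps;
  const rerank = deps.rerank || (async (_question, hits) => hits); // optional step
  const persist = deps.persist || (async () => {}); // optional step

  const queryVector = await embed(question);
  const hits = await search(queryVector, topK);
  const ranked = await rerank(question, hits);

  // Size-limited context selection (characters stand in for tokens here).
  const context = [];
  let used = 0;
  for (const hit of ranked) {
    if (used + hit.text.length > maxContextChars) break;
    context.push(hit);
    used += hit.text.length;
  }

  const prompt =
    context.map((c) => `[${c.id}] ${c.text}`).join('\n') + `\nQuestion: ${question}`;
  const answer = await generate(prompt);
  const result = { answer, sources: context.map((c) => c.id) };
  await persist({ question, ...result });
  return result;
}

// Stub dependencies so the sketch runs without any network access.
const stubDeps = {
  embed: async () => [0.1, 0.2],
  search: async () => [
    { id: 'doc1:0', text: 'Qdrant stores vectors.' },
    { id: 'doc1:1', text: 'x'.repeat(5000) }, // too large for the context limit
  ],
  generate: async (prompt) => `stub answer based on ${prompt.length} prompt characters`,
};

answerQuestion('What does Qdrant store?', stubDeps).then((r) => console.log(r.sources));
// [ 'doc1:0' ]
```

Passing dependencies in this way also makes the flow easy to unit test, since each provider can be replaced by a mock.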
- Hallucination
- LLMs can invent facts. Mitigate by requiring citations, instructing the model to say "I don't know", and adding post-generation verification.
- Stale / incorrect data
- The index can become outdated; schedule reindexing and version your data.
- Latency
- Embedding computation, vector search, and the LLM call each add latency; mitigate with caching, batching, async prefetching, or smaller local models.
- Cost
- API usage (embeddings + LLM tokens) has monetary cost. Use batching and caching to reduce repeated calls.
- Privacy & security
- Strip or redact PII before indexing; apply retention policies and access controls.
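As one cheap form of the post-generation verification mentioned under hallucination, the answer's citations can be checked against the chunks that were actually placed in the prompt. This is a sketch under assumptions: `validateCitations` is a hypothetical helper, and it only checks that cited ids exist in the context, not that the cited text supports the claim.

```javascript
// Extract [doc:chunk_id] citations and compare them with the ids in the prompt.
function validateCitations(answer, contextIds) {
  const allowed = new Set(contextIds);
  const cited = [...answer.matchAll(/\[([^\[\]\s:]+:[^\[\]\s]+)\]/g)].map((m) => m[1]);
  const unknown = cited.filter((id) => !allowed.has(id));
  // An explicit refusal is acceptable even though it cites nothing.
  const abstained = /^i don't know/i.test(answer.trim());
  const ok = abstained || (cited.length > 0 && unknown.length === 0);
  return { cited, unknown, abstained, ok };
}

const check = validateCitations(
  'Qdrant is a vector database [guide:2]. It also does billing [blog:9].',
  ['guide:2', 'guide:3']
);
console.log(check.unknown, check.ok);
// [ 'blog:9' ] false
```

An answer that fails this check could be rejected, retried, or returned with a lowered confidence indicator.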
- Swap LLM or embedding provider: add an adapter in providers/ and plug it into the inference service.
- Replace vector DB: implement a services/<db>.service.js that follows the same interface as qdrant.service.js (index, search, upsert, delete).
- Add hybrid retrieval: combine sparse (BM25) and dense (embedding) retrieval to improve recall.
- Add verification: a post-generation fact-checker or external trusted data fetcher.
- Improve reranking: add a cross-encoder or supervised ranker trained on your domain.
- Monitoring & evaluation: add synthetic QA datasets to measure hallucination rate, precision/recall, and latency.
- Monitoring: track latency (embedding, search, LLM), token usage, error rates, and hallucination incidents.
- Scaling: shard indexes, use read replicas for vector DB, make ingestion asynchronous, and horizontally scale inference workers.
- Security: enforce auth, encrypt storage, and audit access to sensitive documents.
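For the hybrid retrieval extension point listed above, the sparse and dense result lists have to be merged somehow. The README does not prescribe a method; the sketch below uses reciprocal rank fusion (RRF), a common choice, as an assumption. `reciprocalRankFusion` is a hypothetical helper, and `k = 60` is the conventional default constant rather than a tuned value.

```javascript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per document id.
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

const sparseIds = ['a', 'b', 'c']; // e.g. BM25 order
const denseIds = ['b', 'd', 'a']; // e.g. embedding search order
console.log(reciprocalRankFusion([sparseIds, denseIds]).map((r) => r.id));
// [ 'b', 'a', 'd', 'c' ]
```

RRF only needs rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.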
- Controllers: controller/ (e.g., question.controller.js, document.controller.js)
- Services: services/ (inference.service.js, document.service.js, qdrant.service.js)
- Providers: providers/ (openAIProvider.js); add adapters here
- DB schema: database/schema/ (chatHistory.js, document.js, chunk.js)
- Helpers & configs: helpers/, configs/
- Middleware & upload: middleware/, upload/
- Install deps
  npm install
- Set environment variables (example)
  OPENAI_API_KEY
  QDRANT_URL / QDRANT_API_KEY
  PORT
- Start server
  npm start
- Ingest documents
- Use the document ingestion flow (see services/document.service.js) to chunk & index documents into Qdrant.
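A startup check for the environment variables listed above might look like the sketch below. Which variables are mandatory and the default port of 3000 are assumptions made for illustration; `loadConfig` is a hypothetical helper and is not part of configs/.

```javascript
// Hypothetical config loader: fail fast when required variables are missing.
function loadConfig(env = process.env) {
  // Assumption: the API key and Qdrant URL are required, the rest optional.
  const missing = ['OPENAI_API_KEY', 'QDRANT_URL'].filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
  const port = Number.parseInt(env.PORT || '3000', 10); // 3000 is an assumed default
  if (Number.isNaN(port)) {
    throw new Error(`PORT must be a number, got "${env.PORT}"`);
  }
  return {
    openAiApiKey: env.OPENAI_API_KEY,
    qdrantUrl: env.QDRANT_URL,
    qdrantApiKey: env.QDRANT_API_KEY || null, // may be unset for a local instance
    port,
  };
}

const exampleConfig = loadConfig({
  OPENAI_API_KEY: 'sk-example',
  QDRANT_URL: 'http://localhost:6333',
  PORT: '8080',
});
console.log(exampleConfig.port, exampleConfig.qdrantApiKey);
// 8080 null
```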
tests/
├── unit/                 # Unit tests for individual functions/classes
│   ├── document.controller.test.js
│   ├── question.controller.test.js
│   ├── qdrant.service.test.js
│   └── document.service.test.js
├── api/                  # API integration tests
│   └── integration.test.js
├── mocks/                # Mock factories and utilities
│   └── mockFactories.js
└── setup.js              # Jest setup configuration
npm test
npm run test:watch
npm run test:unit
npm run test:api
npm test -- --coverage

The test suite covers:
- DocumentController: File upload handling, event emission
- QuestionController: Question answering, error handling
- QdrantVectorDatabaseService: Vector database operations (create, insert, search)
- DocumentService: PDF processing, chunking, embedding
- POST /new/document - Document upload
- POST /question - Question answering
test('should emit file-uploaded event with correct data', () => {
uploadDocument(req, res);
expect(fileEvent.emit).toHaveBeenCalledWith('file-uploaded', {
filename: 'test-document.pdf',
// ... other properties
});
});

test('should upload document successfully', async () => {
const response = await request(app)
.post('/new/document')
.attach('document', Buffer.from('test content'), 'test.pdf');
expect(response.status).toBe(200);
expect(response.body).toHaveProperty('message', 'uploaded file');
});