perf: speed up vector chat creation pipeline #199

Open
icancodefyi wants to merge 2 commits into avishek0769:main from icancodefyi:feat/speed-up-vector-chat-creation

Conversation

@icancodefyi

Contributor

Closes #11

Summary

Optimizes the vector ingestion pipeline to reduce time from chat creation to READY.

Changes

  • Two-phase pipeline: Phase 1 scrapes all pages concurrently (configurable, default 10), Phase 2 embeds + indexes in batches
  • Parallel embedding: Promise.all fans out embedding API calls instead of a sequential for-loop
  • Batched Qdrant upserts: Accumulates points across pages and flushes every 500 points (from ~300 individual upserts down to ~6)
  • Batched DB writes: Uses prisma.documentPage.createMany() instead of per-page .create()
  • Configurable: All batch sizes and concurrency exposed as env vars (CRAWL_SCRAPE_CONCURRENCY, EMBEDDING_BATCH_SIZE, QDRANT_BATCH_SIZE) with sensible defaults
  • Progress reporting: Split into SCRAPING (0–50%) and INDEXING (50–100%) for accurate user-facing feedback
  • Vectorless path: CRAWL_VECTORLESS_BATCH_SIZE default bumped 5→10
  • Domain concurrency: CRAWL_MAX_CONCURRENCY_PER_DOMAIN default bumped 2→3
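The tuning knobs listed above could be read from the environment roughly as follows. This is a minimal sketch, not the PR's actual `getWorkerConfig`: the names `readPositiveInt` and `readIngestionConfig` are hypothetical, and falling back to the default on a missing or invalid value is an assumption. Only the variable names and default values come from the description.

```javascript
// Hypothetical helper: parse a positive integer from an env-like object,
// falling back to the default when the value is missing or invalid.
function readPositiveInt(env, name, fallback) {
    const parsed = Number.parseInt(env[name] ?? "", 10);
    return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Hypothetical stand-in for the worker config; defaults mirror the PR description.
function readIngestionConfig(env = process.env) {
    return {
        scrapeConcurrency: readPositiveInt(env, "CRAWL_SCRAPE_CONCURRENCY", 10),
        embeddingBatchSize: readPositiveInt(env, "EMBEDDING_BATCH_SIZE", 500),
        qdrantBatchSize: readPositiveInt(env, "QDRANT_BATCH_SIZE", 500),
        vectorlessBatchSize: readPositiveInt(env, "CRAWL_VECTORLESS_BATCH_SIZE", 10),
        maxConcurrencyPerDomain: readPositiveInt(env, "CRAWL_MAX_CONCURRENCY_PER_DOMAIN", 3),
    };
}
```

Passing the environment in as an argument keeps the function easy to exercise with a plain object instead of mutating `process.env`.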

Optimize ingestion pipeline for vector (and vectorless) chat creation:

- Add configurable scrape concurrency (CRAWL_SCRAPE_CONCURRENCY, default 10)
- Add configurable embedding batch size (EMBEDDING_BATCH_SIZE, default 500)
- Add configurable Qdrant batch size (QDRANT_BATCH_SIZE, default 500)
- Parallelize embedding API calls within a page using Promise.all
- Batch Qdrant upserts across pages instead of one per page
- Use prisma.documentPage.createMany() instead of per-page creates
- Split progress into SCRAPING (0-50%) and INDEXING (50-100%) stages
- Bump CRAWL_MAX_CONCURRENCY_PER_DOMAIN default from 2 to 3
- Bump CRAWL_VECTORLESS_BATCH_SIZE default from 5 to 10
- Document new env vars in .env.example
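The two-stage progress split could be computed along these lines. This is a sketch under stated assumptions: `STAGE_RANGES` and `stageProgress` are hypothetical names, and the rounding and clamping behaviour is not taken from the PR. Only the stage names and their 0-50 and 50-100 ranges come from the description.

```javascript
// Hypothetical helper: map a stage-local counter onto the overall 0-100 scale.
// SCRAPING occupies 0-50 and INDEXING occupies 50-100.
const STAGE_RANGES = {
    SCRAPING: { start: 0, end: 50 },
    INDEXING: { start: 50, end: 100 },
};

function stageProgress(status, current, total) {
    const range = STAGE_RANGES[status];
    if (!range) throw new Error(`Unknown stage: ${status}`);
    // Guard against division by zero and clamp the fraction to [0, 1].
    const fraction = total > 0 ? Math.min(Math.max(current / total, 0), 1) : 0;
    return Math.round(range.start + fraction * (range.end - range.start));
}
```

For example, `stageProgress("SCRAPING", 5, 10)` gives 25, and `stageProgress("INDEXING", 0, 40)` gives 50, which matches the point where the indexing stage begins.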
Copilot AI review requested due to automatic review settings June 11, 2026 18:43

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR tunes and refactors the ingestion pipeline used for chat knowledge creation by increasing crawl/worker concurrency and restructuring vector ingestion into separate scraping and indexing phases.

Changes:

  • Increased default crawl concurrency and worker batch sizes via new/updated env-driven config.
  • Refactored processVector into a 2-phase pipeline (scrape/split → embed/index with batching).
  • Updated progress statuses to more granular phases (SCRAPING, INDEXING) during ingestion.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| backend/utils/ragUtilities.js | Increases default per-domain crawl concurrency. |
| backend/chatWorker.js | Adds new ingestion tuning knobs; refactors vector ingestion into phased batching; updates progress statuses. |
| backend/.env.example | Documents new ingestion tuning environment variables. |

Comment thread backend/chatWorker.js
Comment on lines +160 to +169
async function flushBatch() {
    if (pendingPoints.length === 0) return;
    const points = pendingPoints.splice(0);
    const dbRecords = pendingDbRecords.splice(0);

    await qdrant.upsert(collectionName, { wait: true, points });
    await prisma.documentPage.createMany({ data: dbRecords }).catch((err) => {
        console.error("Failed to update indexed pages:", err.message);
    });
}
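The accumulate-then-flush pattern in this snippet can be tried in isolation with an in-memory stand-in for the vector store. This is a sketch only: `createBatcher`, `fakeStore`, and `demo` are hypothetical names, and the stub replaces the real `qdrant.upsert` and `prisma.documentPage.createMany` calls.

```javascript
// Sketch of accumulate-then-flush batching against a stand-in store.
function createBatcher(store, batchSize) {
    const pending = [];

    async function flush() {
        if (pending.length === 0) return;
        // splice(0) empties `pending` and returns its former contents.
        const points = pending.splice(0);
        await store.upsert(points);
    }

    async function add(points) {
        pending.push(...points);
        if (pending.length >= batchSize) await flush();
    }

    return { add, flush };
}

async function demo() {
    const calls = [];
    // Hypothetical stub: records the size of each upsert instead of calling Qdrant.
    const fakeStore = { upsert: async (points) => { calls.push(points.length); } };
    const batcher = createBatcher(fakeStore, 500);

    // 30 pages of 100 points each: a flush fires whenever 500 points have accumulated.
    for (let page = 0; page < 30; page++) {
        await batcher.add(Array.from({ length: 100 }, (_, i) => ({ id: page * 100 + i })));
    }
    await batcher.flush(); // Flush any remainder (none in this run).
    return calls;
}
```

As in the reviewed snippet, a flush sends everything that is pending, so a single batch can exceed the threshold when one page contributes many points.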
Comment thread backend/chatWorker.js
Comment on lines +200 to +202
if (pendingPoints.length >= qdrantBatchSize) {
    await flushBatch();
}
Comment thread backend/chatWorker.js
Comment on lines +165 to +168
await qdrant.upsert(collectionName, { wait: true, points });
await prisma.documentPage.createMany({ data: dbRecords }).catch((err) => {
    console.error("Failed to update indexed pages:", err.message);
});
Comment thread backend/chatWorker.js
Comment on lines +174 to +184
if (chunks.length > 0) {
    const batchPromises = [];
    for (let i = 0; i < chunks.length; i += embeddingBatchSize) {
        const chunkBatch = chunks.slice(i, i + embeddingBatchSize);
        batchPromises.push(generateVectorEmbeddings(chunkBatch));
    }
    const batchResults = await Promise.all(batchPromises);
    const allEmbeddings = batchResults.flatMap((r) =>
        Array.isArray(r) ? r : [r],
    );
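The fan-out in this snippet can be reproduced as a small self-contained sketch. `fakeEmbed` and `embedInBatches` are hypothetical names; the stub returns a one-element vector per chunk and assumes, as the snippet suggests, that the embedding call returns one embedding per input chunk in order.

```javascript
// Hypothetical stub standing in for the embedding call; it returns one
// single-number "embedding" per chunk so the result is easy to inspect.
async function fakeEmbed(chunkBatch) {
    return chunkBatch.map((text) => [text.length]);
}

// Split chunks into batches, embed all batches concurrently, and flatten the
// results. Promise.all preserves input order, so embeddings line up with chunks.
async function embedInBatches(chunks, batchSize, embed = fakeEmbed) {
    const batchPromises = [];
    for (let i = 0; i < chunks.length; i += batchSize) {
        batchPromises.push(embed(chunks.slice(i, i + batchSize)));
    }
    const batchResults = await Promise.all(batchPromises);
    return batchResults.flat(); // One level: batches of embeddings -> embeddings.
}
```

Because every batch starts immediately, the number of in-flight requests grows with the chunk count; whether that is acceptable depends on the embedding provider's rate limits.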

Comment thread backend/chatWorker.js
  let scrapedCount = 0;

- await Promise.all(allLinks.map((link) => limiter.schedule(async () => {
+ const scrapedPages = await Promise.all(allLinks.map((link) => limiter.schedule(async () => {
Comment thread backend/chatWorker.js
Comment on lines 139 to +144
})));

const validPages = scrapedPages.filter(Boolean);
if (validPages.length === 0) {
    throw new Error("No pages were successfully scraped.");
}
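The scrape-then-filter step can be illustrated without the PR's `limiter` object. In this dependency-free sketch, `mapWithConcurrency` and `scrapeAll` are hypothetical names, and turning a failed page into `null` is an assumption about how falsy entries reach the `filter(Boolean)` call.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
async function mapWithConcurrency(items, limit, worker) {
    const results = new Array(items.length);
    let next = 0;

    async function runner() {
        while (next < items.length) {
            const index = next++;
            try {
                results[index] = await worker(items[index]);
            } catch {
                // Assumption: a failed page becomes null instead of rejecting the whole run.
                results[index] = null;
            }
        }
    }

    const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
    await Promise.all(runners);
    return results;
}

async function scrapeAll(links, limit, scrape) {
    const scraped = await mapWithConcurrency(links, limit, scrape);
    const validPages = scraped.filter(Boolean); // Drop null/undefined entries.
    if (validPages.length === 0) {
        throw new Error("No pages were successfully scraped.");
    }
    return validPages;
}
```

Results are written by index, so the surviving pages keep the order of the input links even though they finish at different times.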
Comment thread backend/chatWorker.js

  await updateChatProgress(chatId, {
-     status: "PROCESSING",
+     status: "SCRAPING",
Comment thread backend/chatWorker.js
Comment on lines +150 to +155
await updateChatProgress(chatId, {
    status: "INDEXING",
    current: 0,
    total: totalIndexPages,
    progress: 50,
});
Comment thread backend/chatWorker.js
  try {
      const { maxPagesPerJob, vectorlessBatchSize } = getWorkerConfig();
-     await updateChatProgress(chatId, { status: "PROCESSING", progress: 0 });
+     await updateChatProgress(chatId, { status: "SCRAPING", progress: 0 });
Comment thread backend/chatWorker.js

  await updateChatProgress(chatId, {
-     status: "PROCESSING",
+     status: "SCRAPING",
@avishek0769 added the Hard and SSoC26 labels on Jun 12, 2026
@avishek0769

Owner

@icancodefyi Resolve the merge conflicts

Labels

Hard (This issue is hard to solve), SSoC26 (Social Summer of Code - 2026)

Development

Successfully merging this pull request may close these issues.

Speed Up Vector Chat Creation Pipeline

3 participants