perf: speed up vector chat creation pipeline by icancodefyi · Pull Request #199 · avishek0769/DocChat

icancodefyi · 2026-06-11T18:43:35Z

Closes #11

Summary

Optimizes the vector ingestion pipeline to reduce time from chat creation to READY.

Changes

Two-phase pipeline: Phase 1 scrapes all pages concurrently (configurable, default 10), Phase 2 embeds + indexes in batches
Parallel embedding: Promise.all fans out embedding API calls instead of a sequential for-loop
Batched Qdrant upserts: Accumulates points across pages and flushes once 500 are pending (from ~300 individual upserts down to ~6)
Batched DB writes: Uses prisma.documentPage.createMany() instead of per-page .create()
Configurable: All batch sizes and concurrency exposed as env vars (CRAWL_SCRAPE_CONCURRENCY, EMBEDDING_BATCH_SIZE, QDRANT_BATCH_SIZE) with sensible defaults
Progress reporting: Split into SCRAPING (0–50%) and INDEXING (50–100%) for accurate user-facing feedback
Vectorless path: CRAWL_VECTORLESS_BATCH_SIZE default bumped 5→10
Domain concurrency: CRAWL_MAX_CONCURRENCY_PER_DOMAIN default bumped 2→3

Optimize ingestion pipeline for vector (and vectorless) chat creation:
- Add configurable scrape concurrency (CRAWL_SCRAPE_CONCURRENCY, default 10)
- Add configurable embedding batch size (EMBEDDING_BATCH_SIZE, default 500)
- Add configurable Qdrant batch size (QDRANT_BATCH_SIZE, default 500)
- Parallelize embedding API calls within a page using Promise.all
- Batch Qdrant upserts across pages instead of one per page
- Use prisma.documentPage.createMany() instead of per-page creates
- Split progress into SCRAPING (0-50%) and INDEXING (50-100%) stages
- Bump CRAWL_MAX_CONCURRENCY_PER_DOMAIN default from 2 to 3
- Bump CRAWL_VECTORLESS_BATCH_SIZE default from 5 to 10
- Document new env vars in .env.example
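For readers who want to see how env-driven tuning knobs like these might be read, here is a minimal sketch. The variable names and defaults come from the PR description; the helper `readPositiveInt`, the function name `getIngestionConfig`, and the choice to group all five values in one object are assumptions for illustration, not the PR's actual code.

```javascript
// Hypothetical helper: parse a positive integer from an env-like object,
// falling back to the default when the value is missing or invalid.
function readPositiveInt(env, name, fallback) {
    const parsed = Number.parseInt(env[name] ?? "", 10);
    return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Hypothetical config reader. Defaults are the ones listed in the PR description.
function getIngestionConfig(env = process.env) {
    return {
        scrapeConcurrency: readPositiveInt(env, "CRAWL_SCRAPE_CONCURRENCY", 10),
        embeddingBatchSize: readPositiveInt(env, "EMBEDDING_BATCH_SIZE", 500),
        qdrantBatchSize: readPositiveInt(env, "QDRANT_BATCH_SIZE", 500),
        vectorlessBatchSize: readPositiveInt(env, "CRAWL_VECTORLESS_BATCH_SIZE", 10),
        maxConcurrencyPerDomain: readPositiveInt(env, "CRAWL_MAX_CONCURRENCY_PER_DOMAIN", 3),
    };
}
```

Rejecting zero, negative, and non-numeric values in one place means a typo in the environment falls back to the default instead of producing a batch size of NaN.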

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR tunes and refactors the ingestion pipeline used for chat knowledge creation by increasing crawl/worker concurrency and restructuring vector ingestion into separate scraping and indexing phases.

Changes:

Increased default crawl concurrency and worker batch sizes via new/updated env-driven config.
Refactored processVector into a 2-phase pipeline (scrape/split → embed/index with batching).
Updated progress statuses to more granular phases (SCRAPING, INDEXING) during ingestion.
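The split into SCRAPING (0–50%) and INDEXING (50–100%) amounts to mapping each stage's own progress onto one shared scale. The sketch below shows one way to do that; `STAGE_RANGES` and `overallProgress` are hypothetical names, and the PR may compute the percentage differently.

```javascript
// Hypothetical mapping of each stage onto a slice of the 0-100 scale.
const STAGE_RANGES = {
    SCRAPING: { start: 0, end: 50 },
    INDEXING: { start: 50, end: 100 },
};

// Convert "current of total within a stage" into an overall percentage.
function overallProgress(stage, current, total) {
    const range = STAGE_RANGES[stage];
    if (!range) throw new Error(`Unknown stage: ${stage}`);
    // Guard against division by zero and clamp the fraction to [0, 1].
    const fraction = total > 0 ? Math.min(Math.max(current / total, 0), 1) : 0;
    return Math.round(range.start + fraction * (range.end - range.start));
}
```

With this mapping, finishing the scrape phase and starting the index phase both report 50, so the user-facing value never moves backwards at the phase boundary.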

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.

File	Description
backend/utils/ragUtilities.js	Increases default per-domain crawl concurrency.
backend/chatWorker.js	Adds new ingestion tuning knobs; refactors vector ingestion into phased batching; updates progress statuses.
backend/.env.example	Documents new ingestion tuning environment variables.


+        async function flushBatch() {
+            if (pendingPoints.length === 0) return;
+            const points = pendingPoints.splice(0);
+            const dbRecords = pendingDbRecords.splice(0);
+
+            await qdrant.upsert(collectionName, { wait: true, points });
+            await prisma.documentPage.createMany({ data: dbRecords }).catch((err) => {
+                console.error("Failed to update indexed pages:", err.message);
+            });
+        }
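The accumulate-then-flush pattern in the hunk above can be sketched in isolation as follows. `createBatcher`, `upsertPoints`, and `saveRecords` are hypothetical names; the two callbacks stand in for the `qdrant.upsert` and `prisma.documentPage.createMany` calls so the example runs without either service.

```javascript
// Sketch: accumulate points across pages and flush them in batches.
// `upsertPoints` and `saveRecords` are hypothetical stand-ins for the
// vector-store upsert and the database write.
function createBatcher({ batchSize, upsertPoints, saveRecords }) {
    const pendingPoints = [];
    const pendingRecords = [];

    async function flush() {
        if (pendingPoints.length === 0) return;
        // splice(0) empties the array in place and returns the removed items.
        const points = pendingPoints.splice(0);
        const records = pendingRecords.splice(0);
        await upsertPoints(points);
        await saveRecords(records);
    }

    // Add one page's points and its DB record; flush once the threshold is reached.
    async function add(points, record) {
        for (const point of points) pendingPoints.push(point);
        pendingRecords.push(record);
        if (pendingPoints.length >= batchSize) await flush();
    }

    return { add, flush };
}
```

A caller would invoke `add` once per page and `flush` once after the loop to write any remainder. Because the buffers are emptied before the awaits, a rejected upsert leaves that batch out of the buffers; whether a retry is needed depends on how the surrounding job handles failures.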


+                if (pendingPoints.length >= qdrantBatchSize) {
+                    await flushBatch();
+                }


+            await qdrant.upsert(collectionName, { wait: true, points });
+            await prisma.documentPage.createMany({ data: dbRecords }).catch((err) => {
+                console.error("Failed to update indexed pages:", err.message);
+            });


+            if (chunks.length > 0) {
+                const batchPromises = [];
+                for (let i = 0; i < chunks.length; i += embeddingBatchSize) {
+                    const chunkBatch = chunks.slice(i, i + embeddingBatchSize);
+                    batchPromises.push(generateVectorEmbeddings(chunkBatch));
+                }
+                const batchResults = await Promise.all(batchPromises);
+                const allEmbeddings = batchResults.flatMap((r) =>
+                    Array.isArray(r) ? r : [r],
+                );
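The fan-out in this hunk relies on Promise.all returning results in input order, which is what keeps each embedding aligned with its chunk. A self-contained sketch of the same idea follows; `embedInBatches` and `embedBatch` are hypothetical names, and `embedBatch` is assumed to resolve to one embedding per input chunk, in order (the real `generateVectorEmbeddings` may behave differently).

```javascript
// Sketch: split chunks into batches and embed the batches concurrently.
// `embedBatch` is a hypothetical stand-in for the embedding API call.
async function embedInBatches(chunks, batchSize, embedBatch) {
    const batches = [];
    for (let i = 0; i < chunks.length; i += batchSize) {
        batches.push(chunks.slice(i, i + batchSize));
    }
    // Promise.all preserves input order regardless of which call settles first.
    const results = await Promise.all(batches.map((batch) => embedBatch(batch)));
    // Flatten one level: an array of per-batch arrays becomes one array of embeddings.
    const embeddings = results.flat();
    if (embeddings.length !== chunks.length) {
        throw new Error(`Expected ${chunks.length} embeddings, got ${embeddings.length}`);
    }
    return embeddings;
}
```

The length check is an added safeguard, not something shown in the diff: it surfaces a short or malformed API response before misaligned vectors reach the index. Note also that Promise.all starts every batch at once, so a page with many batches may need a concurrency cap to stay within provider rate limits.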
+


+        let scrapedCount = 0;

-        await Promise.all(allLinks.map((link) => limiter.schedule(async () => {
+        const scrapedPages = await Promise.all(allLinks.map((link) => limiter.schedule(async () => {


        })));
+
+        const validPages = scrapedPages.filter(Boolean);
+        if (validPages.length === 0) {
+            throw new Error("No pages were successfully scraped.");
+        }
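Phase 1 as shown here collects every scrape result, drops the falsy ones, and fails the job only if nothing survived. The sketch below reproduces that shape with a hand-rolled worker pool in place of `limiter.schedule`; `mapWithConcurrency`, `scrapeAll`, and `scrapePage` are hypothetical names, and turning a failed page into `null` is an assumption about how failures are represented.

```javascript
// Sketch: run `worker` over `items` with at most `limit` calls in flight.
// This simple pool stands in for the rate limiter used in the diff.
async function mapWithConcurrency(items, limit, worker) {
    const results = new Array(items.length);
    let next = 0;
    async function run() {
        while (next < items.length) {
            const index = next++;
            try {
                results[index] = await worker(items[index]);
            } catch {
                // Assumption: a failed page becomes null instead of rejecting the whole run.
                results[index] = null;
            }
        }
    }
    const runners = Array.from({ length: Math.min(limit, items.length) }, run);
    await Promise.all(runners);
    return results;
}

// Scrape all links, keep the successful pages, and fail only if none succeeded.
async function scrapeAll(links, limit, scrapePage) {
    const scraped = await mapWithConcurrency(links, limit, scrapePage);
    const validPages = scraped.filter(Boolean);
    if (validPages.length === 0) {
        throw new Error("No pages were successfully scraped.");
    }
    return validPages;
}
```

Holding all scraped pages in memory before indexing is what makes the two-phase split possible; for very large crawls that trade-off may be worth bounding with a page cap such as `maxPagesPerJob`.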



        await updateChatProgress(chatId, {
-            status: "PROCESSING",
+            status: "SCRAPING",


+        await updateChatProgress(chatId, {
+            status: "INDEXING",
+            current: 0,
+            total: totalIndexPages,
+            progress: 50,
+        });


    try {
        const { maxPagesPerJob, vectorlessBatchSize } = getWorkerConfig();
-        await updateChatProgress(chatId, { status: "PROCESSING", progress: 0 });
+        await updateChatProgress(chatId, { status: "SCRAPING", progress: 0 });



            await updateChatProgress(chatId, {
-                status: "PROCESSING",
+                status: "SCRAPING",


avishek0769 · 2026-06-12T13:23:48Z

@icancodefyi Resolve the merge conflicts

Copilot AI review requested due to automatic review settings June 11, 2026 18:43

Copilot AI reviewed Jun 11, 2026


merge upstream/main, resolve conflicts in chatWorker.js

5c48fc5

avishek0769 approved these changes Jun 12, 2026


avishek0769 added the Hard (this issue is hard to solve) and SSoC26 (Social Summer of Code - 2026) labels Jun 12, 2026

icancodefyi wants to merge 2 commits into avishek0769:main from icancodefyi:feat/speed-up-vector-chat-creation

3 participants
