Improve file indexing by ethanrous · Pull Request #107 · ethanrous/weblens

ethanrous · 2026-06-14T22:06:29Z

No description provided.

Copilot

Pull request overview

This PR reworks the backend “scan” pipeline into a more explicit “indexing” flow, expands and clarifies embedding behavior (text vs. visual), and adds UI and dev tooling to trigger and inspect indexing and embedding state.

Changes:

Refactor jobs from scan to index semantics (new IndexFile job, updated directory indexing, updated websocket events and job registration priorities).
Improve embedding pipeline and search relevance (new hit scoring/normalization, updated extract/embed behavior, improved OCR filtering).
Update Nuxt settings/dev UI to expose indexing/embedding actions and add debug lookups.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 7 comments.

File	Description
weblens-vue/weblens-nuxt/pages/settings/dev.vue	Reworks dev settings page layout; adds debug controls for file/media lookup and preview.
weblens-vue/weblens-nuxt/pages/settings.vue	Adds scrolling to settings content area.
weblens-vue/weblens-nuxt/components/organism/UploadProgress.vue	Adjusts positioning/borders for upload progress panel.
weblens-vue/weblens-nuxt/components/molecule/ContextMenuActions.vue	Renames folder scan action to “Re-Index Folder” and marks as danger action.
weblens-vue/weblens-nuxt/components/atom/WeblensOptions.vue	Visual tweaks; adds chevron icon; adds default option selection behavior.
weblens-vue/weblens-nuxt/components/atom/WeblensInput.vue	Reorders props for readability/consistency.
services/jobs/upload.go	Refactors upload cleanup; triggers indexing jobs for uploaded top-levels; adjusts error handling in writer loop.
services/jobs/jobs.go	Updates registered jobs and priorities to new indexing model.
services/jobs/index_file.go	Adds new `IndexFile` job implementation for per-file indexing.
services/jobs/file_parser.go	Renames directory scan to `IndexDirectory`; changes discovery/index queueing; updates completion events; adds reindex cleanup helpers.
services/jobs/embed.go	Refactors extract/embed to gate by feature flags; adds visual-embedding write path; skips text extraction for photo types.
services/jobs/embed_test.go	Adds test to ensure photo types skip text extraction.
services/embed/embed_service.go	Adds helper to determine whether a file has any embeddings (text or image).
routers/api/v1/websocket/websocket.go	Updates scan message handling to use `IndexMeta` / `IndexFileTask` for files.
routers/api/v1/file/rest_folders.go	Uses indexing meta in ScanDir; adds logic to trigger indexing when media/embeddings are missing.
routers/api/v1/file/rest_files_search.go	Uses feature flags to gate semantic search; normalizes hit scoring; filters junk text hits for photos.
modules/websocket/event.go	Renames scan-complete event constant while preserving wire value.
models/task/task.go	Redefines priority tiers (adds background and x-high).
models/task/task_test.go	Updates priority usage in tests (partially).
models/media/media_type.go	Adds `IsEmbeddable` to media type registry and expands embeddable types.
models/media/media_test.go	Adds tests for `TextEmbedEligible`.
models/media/media_model.go	Adds batch media deletion helper.
models/media/embed_extensions.go	Moves embed eligibility logic to media-type registry; adds `TextEmbedEligible`.
models/job/tasks_meta.go	Renames scan meta to `IndexMeta`; updates job naming for file indexing.
models/job/jobs.go	Renames file scan task identifier to `index_file`.
models/embedding/store.go	Adds multi-source deletion; generalizes delete-by-kind; renames idempotency counter API.
models/embedding/score.go	Adds cross-kind hit scoring/normalization and filtering.
models/embedding/score_test.go	Adds tests for new hit scoring logic.
models/embedding/embedding.go	Adds `KindAll` for deletion operations.
embed/test_extract.py	Adds tests around OCR confidence and gibberish filtering.
embed/extract.py	Introduces OCR confidence filtering and legibility heuristics; reuses OCR helper in PDF/image paths.

Comments suppressed due to low confidence (1)

services/jobs/file_parser.go:311

Non-displayable but embeddable files (txt/md/docx/etc.) return early here, so they never get an ExtractAndEmbedTask dispatched and won’t be content-indexed. Consider dispatching ExtractAndEmbedTask for text-eligible types before returning when !mt.Displayable.

	mt := media_model.ParseExtension(mf.GetPortablePath().Ext())
	if !mt.Displayable {
		return nil
	}

- upload: drop duplicate ctxservice import, consolidate to one alias
- upload: only report removal failure when cleanup actually fails, return context error on success instead of wrapping nil
- upload: log cleanup errors instead of tsk.Fail() (panics and mutates final exit status during read-only cleanup)
- rest_folders: index the listed dir, not parent (walked to nil by the breadcrumb loop), fixing a nil-deref and wrong-folder scan
- task_test: register background-job at PriorityBackground so the "defaults schedule ahead of background" test asserts correctly
- tasks_meta: IndexMeta.MetaString uses JobName() so directory metas serialize ScanDirectoryTask instead of hard-coded IndexFileTask
- dev.vue: fix getFile() catch message to say "file info"

Copilot

Pull request overview

Copilot reviewed 33 out of 33 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

services/jobs/file_parser.go:311

queueFileIndexIfNeeded returns early for non-displayable files, which prevents ExtractAndEmbedTask from being dispatched for text/doc/code types. This breaks semantic indexing for the newly-marked embeddable but non-displayable extensions (e.g. txt, pdf, docx), and contradicts the comment below about indexing non-displayables.

	mt := media_model.ParseExtension(mf.GetPortablePath().Ext())
	if !mt.Displayable {
		return nil
	}
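
The ordering problem described in this comment can be sketched as follows. This is a simplified model, not the project's queueFileIndexIfNeeded: the `mediaType` struct, the `planFileIndex` function, and the task-name strings are all hypothetical stand-ins, and the real registry entry returned by `media_model.ParseExtension` likely carries more fields.

```go
package main

import "fmt"

// mediaType is a hypothetical stand-in for a media-type registry entry;
// only the two flags discussed in the review comment are modeled.
type mediaType struct {
	Displayable       bool
	TextEmbedEligible bool
}

// planFileIndex returns the names of the tasks that would be queued for a file.
// The names are illustrative strings, not the project's task identifiers.
func planFileIndex(mt mediaType) []string {
	var tasks []string
	// Decide on text embedding first, so the displayability gate cannot skip it.
	if !mt.Displayable && mt.TextEmbedEligible {
		tasks = append(tasks, "extract_and_embed")
	}
	if !mt.Displayable {
		return tasks
	}
	// Displayable media goes through the per-file index job.
	return append(tasks, "index_file")
}

func main() {
	fmt.Println(planFileIndex(mediaType{Displayable: false, TextEmbedEligible: true}))
	fmt.Println(planFileIndex(mediaType{Displayable: true}))
	fmt.Println(planFileIndex(mediaType{}))
}
```

The only point of the sketch is the order of the two checks: the embed decision is made before the early return, so a non-displayable but text-eligible file still gets an embedding task.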

- file_parser: remove dead commented-out call in LeafMap callback
- file_parser: clearExistingIndex now also clears file-chunk embeddings (keyed by fileID) so a force re-index rebuilds the full semantic index
- media_type: mark image/heif IsEmbeddable to match HEIC
- embed/test_extract: skip OCR tests cleanly via pytest.importorskip when pytesseract/PIL are unavailable
- folder scan: add forceReindex query param wired to IndexMeta.ForceReIndex; the dev "Reindex All Files" button now triggers a real force re-index
- dev.vue: clear stale debugMedia/debugReturn on each debug submit

Address the critical and high-severity issues from code review of the file-indexing rework:

- Documents/code (non-displayable types) were never text-embedded: the embed dispatch sat behind the displayability gate. Dispatch text/OCR embedding for non-displayable embed-eligible files before the gate.
- Directly-uploaded and single-file-scanned files never got embedded. Wire embed dispatch into every entry point via a shared helper: IndexFile dispatches it for displayable media (after media+cache exist, so it can't race), and the upload/ScanDir/websocket paths dispatch it for non-displayable docs.
- ScoreHits dropped every hit when many strong matches clustered tightly (none stood 2σ above the high mean). Keep hits that are a statistical standout OR strongly above the kind's floor.
- gif/webp/raw (NEF/ARW/CR2) regressed out of image embedding: mark them IsEmbeddable and add the missing lowercase raw extensions.
- Folder listings re-dispatched a whole-directory scan on every GET when embedding was disabled/unavailable. Gate the embed-existence rescan on embed being active, and stop hard-failing the listing on a count error.
- Single-file force-reindex left orphaned embeddings and stale cache; clear them before rebuilding, matching the directory path.
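
The ScoreHits fix described in this commit message (keep a hit if it is a statistical standout OR strongly above the kind's floor) can be sketched on plain scores. This is not the code in models/embedding/score.go: `keepHits`, the `floor` and `margin` parameters, and the use of the population standard deviation are assumptions made for illustration; only the 2σ rule and the OR condition come from the commit message.

```go
package main

import (
	"fmt"
	"math"
)

// keepHits keeps a score if it stands at least 2 standard deviations above
// the mean, OR if it clears floor+margin outright. floor and margin are
// hypothetical per-kind parameters.
func keepHits(scores []float64, floor, margin float64) []float64 {
	if len(scores) == 0 {
		return nil
	}
	var sum float64
	for _, s := range scores {
		sum += s
	}
	mean := sum / float64(len(scores))

	var sq float64
	for _, s := range scores {
		sq += (s - mean) * (s - mean)
	}
	std := math.Sqrt(sq / float64(len(scores))) // population standard deviation

	var kept []float64
	for _, s := range scores {
		standout := std > 0 && s >= mean+2*std
		strong := s >= floor+margin
		if standout || strong {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// Tightly clustered strong matches: none is a 2σ standout, all are strong.
	fmt.Println(len(keepHits([]float64{0.90, 0.91, 0.92, 0.91}, 0.2, 0.3)))
	// Mostly weak matches with one outlier: only the outlier is kept.
	fmt.Println(len(keepHits([]float64{0.21, 0.22, 0.20, 0.21, 0.22, 0.20, 0.21, 0.45}, 0.2, 0.3)))
}
```

With the standout test alone, the first call would return nothing, which is the failure mode the commit message describes; the second condition rescues a cluster of uniformly strong matches.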

Copilot

Pull request overview

Copilot reviewed 37 out of 43 changed files in this pull request and generated 5 comments.

Files not reviewed (2)

api/api_folder.go: Generated file
docs/docs.go: Generated file

Copilot

Pull request overview

Copilot reviewed 44 out of 50 changed files in this pull request and generated 4 comments.

Files not reviewed (2)

api/api_folder.go: Generated file
docs/docs.go: Generated file

ethanrous added 2 commits June 12, 2026 15:57

Improve file indexing

27c93f1

feat: rescale embedding scores across kinds and filter OCR noise

f30bde1

Copilot AI review requested due to automatic review settings June 14, 2026 22:06

Copilot started reviewing on behalf of ethanrous June 14, 2026 22:06

Copilot AI reviewed Jun 14, 2026

ethanrous added 2 commits June 14, 2026 18:35

WIP

b179c50

Copilot AI review requested due to automatic review settings June 14, 2026 22:43

Copilot started reviewing on behalf of ethanrous June 14, 2026 22:43

Copilot AI reviewed Jun 14, 2026

ethanrous added 2 commits June 14, 2026 19:20

Copilot AI review requested due to automatic review settings June 15, 2026 00:43

Copilot started reviewing on behalf of ethanrous June 15, 2026 00:44

Copilot AI reviewed Jun 15, 2026

Comment thread routers/api/v1/file/rest_files_search.go

Comment thread services/jobs/file_parser.go

Comment thread services/jobs/jobs.go Outdated

Comment thread models/task/task.go Outdated

Comment thread routers/api/v1/file/rest_folders.go Outdated

ethanrous added 3 commits June 15, 2026 10:34

update architecture.md

1f3c58f

WIP

c60fe7d

WIP

d1fd4c1

Copilot AI review requested due to automatic review settings June 19, 2026 01:30

Copilot started reviewing on behalf of ethanrous June 19, 2026 01:30

Copilot AI reviewed Jun 19, 2026

Comment thread weblens-vue/weblens-nuxt/pages/settings/dev.vue

Comment thread services/jobs/upload.go

Comment thread routers/api/v1/websocket/websocket.go

Comment thread routers/api/v1/file/rest_folders.go Outdated

WIP

da7db36

ethanrous merged commit 600ab28 into main Jun 19, 2026
9 of 10 checks passed

ethanrous deleted the improve-file-indexing branch June 19, 2026 02:32

2 participants