Improve file indexing #107
Merged
Pull request overview
This PR updates the backend “scan” pipeline into a more explicit “indexing” flow, expands/clarifies embedding behavior (text vs. visual), and adds UI/dev tooling to trigger and inspect indexing/embedding state.
Changes:
- Refactor jobs from scan to index semantics (new IndexFile job, updated directory indexing, updated websocket events and job registration priorities).
- Improve embedding pipeline and search relevance (new hit scoring/normalization, updated extract/embed behavior, improved OCR filtering).
- Update Nuxt settings/dev UI to expose indexing/embedding actions and add debug lookups.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| weblens-vue/weblens-nuxt/pages/settings/dev.vue | Reworks dev settings page layout; adds debug controls for file/media lookup and preview. |
| weblens-vue/weblens-nuxt/pages/settings.vue | Adds scrolling to settings content area. |
| weblens-vue/weblens-nuxt/components/organism/UploadProgress.vue | Adjusts positioning/borders for upload progress panel. |
| weblens-vue/weblens-nuxt/components/molecule/ContextMenuActions.vue | Renames folder scan action to “Re-Index Folder” and marks as danger action. |
| weblens-vue/weblens-nuxt/components/atom/WeblensOptions.vue | Visual tweaks; adds chevron icon; adds default option selection behavior. |
| weblens-vue/weblens-nuxt/components/atom/WeblensInput.vue | Reorders props for readability/consistency. |
| services/jobs/upload.go | Refactors upload cleanup; triggers indexing jobs for uploaded top-levels; adjusts error handling in writer loop. |
| services/jobs/jobs.go | Updates registered jobs and priorities to new indexing model. |
| services/jobs/index_file.go | Adds new IndexFile job implementation for per-file indexing. |
| services/jobs/file_parser.go | Renames directory scan to IndexDirectory; changes discovery/index queueing; updates completion events; adds reindex cleanup helpers. |
| services/jobs/embed.go | Refactors extract/embed to gate by feature flags; adds visual-embedding write path; skips text extraction for photo types. |
| services/jobs/embed_test.go | Adds test to ensure photo types skip text extraction. |
| services/embed/embed_service.go | Adds helper to determine whether a file has any embeddings (text or image). |
| routers/api/v1/websocket/websocket.go | Updates scan message handling to use IndexMeta / IndexFileTask for files. |
| routers/api/v1/file/rest_folders.go | Uses indexing meta in ScanDir; adds logic to trigger indexing when media/embeddings are missing. |
| routers/api/v1/file/rest_files_search.go | Uses feature flags to gate semantic search; normalizes hit scoring; filters junk text hits for photos. |
| modules/websocket/event.go | Renames scan-complete event constant while preserving wire value. |
| models/task/task.go | Redefines priority tiers (adds background and x-high). |
| models/task/task_test.go | Updates priority usage in tests (partially). |
| models/media/media_type.go | Adds IsEmbeddable to media type registry and expands embeddable types. |
| models/media/media_test.go | Adds tests for TextEmbedEligible. |
| models/media/media_model.go | Adds batch media deletion helper. |
| models/media/embed_extensions.go | Moves embed eligibility logic to media-type registry; adds TextEmbedEligible. |
| models/job/tasks_meta.go | Renames scan meta to IndexMeta; updates job naming for file indexing. |
| models/job/jobs.go | Renames file scan task identifier to index_file. |
| models/embedding/store.go | Adds multi-source deletion; generalizes delete-by-kind; renames idempotency counter API. |
| models/embedding/score.go | Adds cross-kind hit scoring/normalization and filtering. |
| models/embedding/score_test.go | Adds tests for new hit scoring logic. |
| models/embedding/embedding.go | Adds KindAll for deletion operations. |
| embed/test_extract.py | Adds tests around OCR confidence and gibberish filtering. |
| embed/extract.py | Introduces OCR confidence filtering and legibility heuristics; reuses OCR helper in PDF/image paths. |
Comments suppressed due to low confidence (1)
services/jobs/file_parser.go:311
- Non-displayable but embeddable files (txt/md/docx/etc.) return early here, so they never get an ExtractAndEmbedTask dispatched and won’t be content-indexed. Consider dispatching ExtractAndEmbedTask for text-eligible types before returning when !mt.Displayable.
mt := media_model.ParseExtension(mf.GetPortablePath().Ext())
if !mt.Displayable {
return nil
}
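The ordering problem described in this comment can be sketched as follows. This is a minimal illustration, not the project's code: the `mediaType` struct, the `registry` table, and `planIndex` are hypothetical stand-ins, and only the task names (`ExtractAndEmbedTask`, `IndexFileTask`) come from the review text. The point it shows is that embed eligibility is checked before the early return for non-displayable types.

```go
package main

import "fmt"

// mediaType is a hypothetical stand-in for the registry entry returned by
// media_model.ParseExtension; only the two flags discussed above are modeled.
type mediaType struct {
	Displayable bool
	Embeddable  bool
}

// registry holds illustrative data only, not the project's real type table.
var registry = map[string]mediaType{
	".jpg":  {Displayable: true, Embeddable: true},
	".txt":  {Displayable: false, Embeddable: true},
	".docx": {Displayable: false, Embeddable: true},
	".bin":  {},
}

// planIndex returns the names of the tasks that would be queued for a file
// with the given extension.
func planIndex(ext string) []string {
	mt := registry[ext]
	var tasks []string

	if !mt.Displayable {
		// Decide on embedding BEFORE returning, so text-eligible documents
		// are still content-indexed even though they have no media entry.
		if mt.Embeddable {
			tasks = append(tasks, "ExtractAndEmbedTask")
		}
		return tasks
	}

	// Displayable files go through per-file indexing; per the commit notes,
	// embedding for these is dispatched from there once media exists.
	tasks = append(tasks, "IndexFileTask")
	return tasks
}

func main() {
	for _, ext := range []string{".jpg", ".txt", ".bin"} {
		fmt.Printf("%s -> %v\n", ext, planIndex(ext))
	}
}
```

With the original early return, the `.txt` case would produce no tasks at all, which is the gap the comment points out.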
- upload: drop duplicate ctxservice import, consolidate to one alias
- upload: only report removal failure when cleanup actually fails, return context error on success instead of wrapping nil
- upload: log cleanup errors instead of tsk.Fail() (panics and mutates final exit status during read-only cleanup)
- rest_folders: index the listed dir, not parent (walked to nil by the breadcrumb loop), fixing a nil-deref and wrong-folder scan
- task_test: register background-job at PriorityBackground so the "defaults schedule ahead of background" test asserts correctly
- tasks_meta: IndexMeta.MetaString uses JobName() so directory metas serialize ScanDirectoryTask instead of hard-coded IndexFileTask
- dev.vue: fix getFile() catch message to say "file info"
Pull request overview
Copilot reviewed 33 out of 33 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
services/jobs/file_parser.go:311
queueFileIndexIfNeeded returns early for non-displayable files, which prevents ExtractAndEmbedTask from being dispatched for text/doc/code types. This breaks semantic indexing for the newly-marked embeddable but non-displayable extensions (e.g. txt, pdf, docx), and contradicts the comment below about indexing non-displayables.
mt := media_model.ParseExtension(mf.GetPortablePath().Ext())
if !mt.Displayable {
return nil
}
- file_parser: remove dead commented-out call in LeafMap callback
- file_parser: clearExistingIndex now also clears file-chunk embeddings (keyed by fileID) so a force re-index rebuilds the full semantic index
- media_type: mark image/heif IsEmbeddable to match HEIC
- embed/test_extract: skip OCR tests cleanly via pytest.importorskip when pytesseract/PIL are unavailable
- folder scan: add forceReindex query param wired to IndexMeta.ForceReIndex; the dev "Reindex All Files" button now triggers a real force re-index
- dev.vue: clear stale debugMedia/debugReturn on each debug submit
Address the critical and high-severity issues from code review of the file-indexing rework:
- Documents/code (non-displayable types) were never text-embedded: the embed dispatch sat behind the displayability gate. Dispatch text/OCR embedding for non-displayable embed-eligible files before the gate.
- Directly-uploaded and single-file-scanned files never got embedded. Wire embed dispatch into every entry point via a shared helper: IndexFile dispatches it for displayable media (after media+cache exist, so it can't race), and the upload/ScanDir/websocket paths dispatch it for non-displayable docs.
- ScoreHits dropped every hit when many strong matches clustered tightly (none stood 2σ above the high mean). Keep hits that are a statistical standout OR strongly above the kind's floor.
- gif/webp/raw (NEF/ARW/CR2) regressed out of image embedding: mark them IsEmbeddable and add the missing lowercase raw extensions.
- Folder listings re-dispatched a whole-directory scan on every GET when embedding was disabled/unavailable. Gate the embed-existence rescan on embed being active, and stop hard-failing the listing on a count error.
- Single-file force-reindex left orphaned embeddings and stale cache; clear them before rebuilding, matching the directory path.
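The ScoreHits fix can be illustrated with a small sketch of the "standout OR strongly above the floor" rule. Everything here is an assumption made for illustration: the function name `keepHits`, the `floor` and `margin` parameters, and the use of a population standard deviation are not taken from models/embedding/score.go, which this conversation does not show. The sketch only demonstrates why a 2σ test alone discards a tight cluster of strong matches, and how the second condition rescues it.

```go
package main

import (
	"fmt"
	"math"
)

// keepHits keeps a score if it is a statistical standout (more than two
// standard deviations above the mean) OR if it clears floor+margin outright.
// floor and margin are hypothetical per-kind tuning values.
func keepHits(scores []float64, floor, margin float64) []float64 {
	if len(scores) == 0 {
		return nil
	}

	var sum float64
	for _, s := range scores {
		sum += s
	}
	mean := sum / float64(len(scores))

	var sq float64
	for _, s := range scores {
		d := s - mean
		sq += d * d
	}
	sigma := math.Sqrt(sq / float64(len(scores)))

	var kept []float64
	for _, s := range scores {
		standout := s > mean+2*sigma
		strong := s >= floor+margin
		if standout || strong {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// A tight cluster of strong matches: none is 2σ above the mean, but all
	// clear floor+margin, so all are kept.
	cluster := []float64{0.90, 0.91, 0.92, 0.91}
	fmt.Println(len(keepHits(cluster, 0.5, 0.2)))
}
```

With only the 2σ test, the clustered input above would return no hits, which matches the regression described in the commit message.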