
Improve file indexing #107

Merged: ethanrous merged 10 commits into main from improve-file-indexing on Jun 19, 2026
Conversation

@ethanrous

Owner

No description provided.

Copilot AI review requested due to automatic review settings June 14, 2026 22:06

Copilot AI left a comment

Contributor

Pull request overview

This PR reworks the backend “scan” pipeline into a more explicit “indexing” flow, clarifies and expands embedding behavior (text vs. visual), and adds UI/dev tooling to trigger and inspect indexing/embedding state.

Changes:

  • Refactor jobs from scan to index semantics (new IndexFile job, updated directory indexing, updated websocket events and job registration priorities).
  • Improve embedding pipeline and search relevance (new hit scoring/normalization, updated extract/embed behavior, improved OCR filtering).
  • Update Nuxt settings/dev UI to expose indexing/embedding actions and add debug lookups.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| weblens-vue/weblens-nuxt/pages/settings/dev.vue | Reworks dev settings page layout; adds debug controls for file/media lookup and preview. |
| weblens-vue/weblens-nuxt/pages/settings.vue | Adds scrolling to settings content area. |
| weblens-vue/weblens-nuxt/components/organism/UploadProgress.vue | Adjusts positioning/borders for upload progress panel. |
| weblens-vue/weblens-nuxt/components/molecule/ContextMenuActions.vue | Renames folder scan action to “Re-Index Folder” and marks as danger action. |
| weblens-vue/weblens-nuxt/components/atom/WeblensOptions.vue | Visual tweaks; adds chevron icon; adds default option selection behavior. |
| weblens-vue/weblens-nuxt/components/atom/WeblensInput.vue | Reorders props for readability/consistency. |
| services/jobs/upload.go | Refactors upload cleanup; triggers indexing jobs for uploaded top-levels; adjusts error handling in writer loop. |
| services/jobs/jobs.go | Updates registered jobs and priorities to new indexing model. |
| services/jobs/index_file.go | Adds new IndexFile job implementation for per-file indexing. |
| services/jobs/file_parser.go | Renames directory scan to IndexDirectory; changes discovery/index queueing; updates completion events; adds reindex cleanup helpers. |
| services/jobs/embed.go | Refactors extract/embed to gate by feature flags; adds visual-embedding write path; skips text extraction for photo types. |
| services/jobs/embed_test.go | Adds test to ensure photo types skip text extraction. |
| services/embed/embed_service.go | Adds helper to determine whether a file has any embeddings (text or image). |
| routers/api/v1/websocket/websocket.go | Updates scan message handling to use IndexMeta / IndexFileTask for files. |
| routers/api/v1/file/rest_folders.go | Uses indexing meta in ScanDir; adds logic to trigger indexing when media/embeddings are missing. |
| routers/api/v1/file/rest_files_search.go | Uses feature flags to gate semantic search; normalizes hit scoring; filters junk text hits for photos. |
| modules/websocket/event.go | Renames scan-complete event constant while preserving wire value. |
| models/task/task.go | Redefines priority tiers (adds background and x-high). |
| models/task/task_test.go | Updates priority usage in tests (partially). |
| models/media/media_type.go | Adds IsEmbeddable to media type registry and expands embeddable types. |
| models/media/media_test.go | Adds tests for TextEmbedEligible. |
| models/media/media_model.go | Adds batch media deletion helper. |
| models/media/embed_extensions.go | Moves embed eligibility logic to media-type registry; adds TextEmbedEligible. |
| models/job/tasks_meta.go | Renames scan meta to IndexMeta; updates job naming for file indexing. |
| models/job/jobs.go | Renames file scan task identifier to index_file. |
| models/embedding/store.go | Adds multi-source deletion; generalizes delete-by-kind; renames idempotency counter API. |
| models/embedding/score.go | Adds cross-kind hit scoring/normalization and filtering. |
| models/embedding/score_test.go | Adds tests for new hit scoring logic. |
| models/embedding/embedding.go | Adds KindAll for deletion operations. |
| embed/test_extract.py | Adds tests around OCR confidence and gibberish filtering. |
| embed/extract.py | Introduces OCR confidence filtering and legibility heuristics; reuses OCR helper in PDF/image paths. |

Comments suppressed due to low confidence (1)

services/jobs/file_parser.go:311

  • Non-displayable but embeddable files (txt/md/docx/etc.) return early here, so they never get an ExtractAndEmbedTask dispatched and won’t be content-indexed. Consider dispatching ExtractAndEmbedTask for text-eligible types before returning when !mt.Displayable.
	mt := media_model.ParseExtension(mf.GetPortablePath().Ext())
	if !mt.Displayable {
		return nil
	}
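
The ordering this comment asks for can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical and only mirrors the names used in the review: `mediaType`, `parseExtension`, and `planIndexing` are stand-ins, not the project's actual types or functions, and the extension lists are made up for the demo. The point is only that the text-embed dispatch happens before the early return for non-displayable files.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mediaType is a hypothetical stand-in for an entry in the media type registry.
type mediaType struct {
	Displayable       bool
	TextEmbedEligible bool
}

// parseExtension is a hypothetical lookup with a few example extensions.
func parseExtension(ext string) mediaType {
	switch strings.ToLower(ext) {
	case ".jpg", ".png":
		return mediaType{Displayable: true}
	case ".txt", ".md", ".docx":
		return mediaType{TextEmbedEligible: true}
	default:
		return mediaType{}
	}
}

// planIndexing returns the names of the tasks that would be queued for a file.
// Non-displayable files that are text-embed eligible get an embed task before
// the displayability gate returns, so documents are still content-indexed.
func planIndexing(path string) []string {
	mt := parseExtension(filepath.Ext(path))
	var tasks []string
	if !mt.Displayable {
		if mt.TextEmbedEligible {
			tasks = append(tasks, "ExtractAndEmbedTask")
		}
		return tasks
	}
	// Displayable media goes through per-file indexing.
	tasks = append(tasks, "IndexFileTask")
	return tasks
}

func main() {
	for _, p := range []string{"notes.txt", "photo.jpg", "archive.bin"} {
		fmt.Println(p, planIndexing(p))
	}
}
```

In this sketch a file that is neither displayable nor text-eligible queues nothing, which matches the early return in the quoted snippet.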

Comment thread services/jobs/upload.go Outdated
Comment thread services/jobs/upload.go
Comment thread services/jobs/upload.go
Comment thread routers/api/v1/file/rest_folders.go Outdated
Comment thread weblens-vue/weblens-nuxt/pages/settings/dev.vue Outdated
Comment thread models/task/task_test.go Outdated
Comment thread models/job/tasks_meta.go
- upload: drop duplicate ctxservice import, consolidate to one alias
- upload: only report removal failure when cleanup actually fails,
  return context error on success instead of wrapping nil
- upload: log cleanup errors instead of tsk.Fail() (panics and
  mutates final exit status during read-only cleanup)
- rest_folders: index the listed dir, not parent (walked to nil by
  the breadcrumb loop), fixing a nil-deref and wrong-folder scan
- task_test: register background-job at PriorityBackground so the
  "defaults schedule ahead of background" test asserts correctly
- tasks_meta: IndexMeta.MetaString uses JobName() so directory metas
  serialize ScanDirectoryTask instead of hard-coded IndexFileTask
- dev.vue: fix getFile() catch message to say "file info"
Copilot AI review requested due to automatic review settings June 14, 2026 22:43

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 33 out of 33 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

services/jobs/file_parser.go:311

  • queueFileIndexIfNeeded returns early for non-displayable files, which prevents ExtractAndEmbedTask from being dispatched for text/doc/code types. This breaks semantic indexing for the newly-marked embeddable but non-displayable extensions (e.g. txt, pdf, docx), and contradicts the comment below about indexing non-displayables.
	mt := media_model.ParseExtension(mf.GetPortablePath().Ext())
	if !mt.Displayable {
		return nil
	}

Comment thread services/jobs/file_parser.go
Comment thread services/jobs/file_parser.go Outdated
Comment thread models/media/media_type.go Outdated
Comment thread embed/test_extract.py
Comment thread weblens-vue/weblens-nuxt/pages/settings/dev.vue
Comment thread weblens-vue/weblens-nuxt/pages/settings/dev.vue
- file_parser: remove dead commented-out call in LeafMap callback
- file_parser: clearExistingIndex now also clears file-chunk embeddings
  (keyed by fileID) so a force re-index rebuilds the full semantic index
- media_type: mark image/heif IsEmbeddable to match HEIC
- embed/test_extract: skip OCR tests cleanly via pytest.importorskip when
  pytesseract/PIL are unavailable
- folder scan: add forceReindex query param wired to IndexMeta.ForceReIndex;
  the dev "Reindex All Files" button now triggers a real force re-index
- dev.vue: clear stale debugMedia/debugReturn on each debug submit
Address the critical and high-severity issues from code review of the
file-indexing rework:

- Documents/code (non-displayable types) were never text-embedded: the
  embed dispatch sat behind the displayability gate. Dispatch text/OCR
  embedding for non-displayable embed-eligible files before the gate.
- Directly-uploaded and single-file-scanned files never got embedded.
  Wire embed dispatch into every entry point via a shared helper:
  IndexFile dispatches it for displayable media (after media+cache
  exist, so it can't race), and the upload/ScanDir/websocket paths
  dispatch it for non-displayable docs.
- ScoreHits dropped every hit when many strong matches clustered
  tightly (none stood 2σ above the high mean). Keep hits that are a
  statistical standout OR strongly above the kind's floor.
- gif/webp/raw (NEF/ARW/CR2) regressed out of image embedding: mark
  them IsEmbeddable and add the missing lowercase raw extensions.
- Folder listings re-dispatched a whole-directory scan on every GET
  when embedding was disabled/unavailable. Gate the embed-existence
  rescan on embed being active, and stop hard-failing the listing on a
  count error.
- Single-file force-reindex left orphaned embeddings and stale cache;
  clear them before rebuilding, matching the directory path.
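
The ScoreHits fix described above (keep a hit if it is a statistical standout OR strongly above the kind's floor) can be sketched as follows. This is an assumption-laden illustration: `keepHits`, the population standard deviation, and the `floor`/`margin` values are hypothetical, and the real implementation in models/embedding/score.go also normalizes across kinds.

```go
package main

import (
	"fmt"
	"math"
)

// keepHits returns the scores that are either more than two standard
// deviations above the mean, or at least margin above the kind's floor.
// floor and margin are hypothetical tuning values for this sketch.
func keepHits(scores []float64, floor, margin float64) []float64 {
	if len(scores) == 0 {
		return nil
	}
	n := float64(len(scores))

	var sum float64
	for _, s := range scores {
		sum += s
	}
	mean := sum / n

	var sqDiff float64
	for _, s := range scores {
		d := s - mean
		sqDiff += d * d
	}
	std := math.Sqrt(sqDiff / n) // population standard deviation

	var kept []float64
	for _, s := range scores {
		standout := s > mean+2*std
		strong := s >= floor+margin
		if standout || strong {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// Tightly clustered strong matches: none is a 2σ standout,
	// but all clear the floor by a wide margin.
	clustered := []float64{0.90, 0.91, 0.92, 0.93}
	fmt.Println(len(keepHits(clustered, 0.3, 0.4)))
}
```

With the standout test alone, the clustered example would keep nothing, which is the regression the commit message describes; the second condition is what rescues those hits.
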
Copilot AI review requested due to automatic review settings June 15, 2026 00:43

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 37 out of 43 changed files in this pull request and generated 5 comments.

Files not reviewed (2)
  • api/api_folder.go: Generated file
  • docs/docs.go: Generated file

Comment thread routers/api/v1/file/rest_files_search.go
Comment thread services/jobs/file_parser.go
Comment thread services/jobs/jobs.go Outdated
Comment thread models/task/task.go Outdated
Comment thread routers/api/v1/file/rest_folders.go Outdated
Copilot AI review requested due to automatic review settings June 19, 2026 01:30

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 44 out of 50 changed files in this pull request and generated 4 comments.

Files not reviewed (2)
  • api/api_folder.go: Generated file
  • docs/docs.go: Generated file

Comment thread weblens-vue/weblens-nuxt/pages/settings/dev.vue
Comment thread services/jobs/upload.go
Comment thread routers/api/v1/websocket/websocket.go
Comment thread routers/api/v1/file/rest_folders.go Outdated
@ethanrous ethanrous merged commit 600ab28 into main Jun 19, 2026
9 of 10 checks passed
@ethanrous ethanrous deleted the improve-file-indexing branch June 19, 2026 02:32