perf(scraper): speed up gamelist scrape progress and writes #868

Merged
wizzomafizzo merged 3 commits into main from feat/throttle-scraper-progress-refreshes
May 30, 2026

Conversation

@wizzomafizzo (Member) commented May 30, 2026

Summary

  • optimize scraped-count/status refreshes so scrape progress does not repeatedly run expensive count queries
  • speed up mediadb scrape writes with bulk title metadata/sentinel paths plus SQL trace tooling and benchmarks
  • fix ES gamelist companion .slug matching so title-level entries mark all matching media in one completed run
  • throttle companion batch progress notifications to avoid websocket/log spam during large scrapes

Validation

  • go test ./pkg/database/scraper/gamelistxml
  • go test ./pkg/api/methods ./pkg/database/mediadb ./pkg/database/scraper/gamelistxml

Summary by CodeRabbit

  • New Features

    • Batch application of scrape results and improved companion-entry handling for more efficient and accurate scraping.
  • Performance

    • Time-based caching of scraped-media counts with short refresh behavior to reduce DB load.
    • Query/path/tag lookup optimizations and optional SQL tracing for performance analysis.
  • Refactor

    • Scraper persistence and parsing reworked for transaction-scoped caching and bulk operations.
  • Tests

    • Expanded unit tests, benchmarks, and diagnostic tests covering batch flows, progress handling, and sentinel/tag edge cases.
  • Chores

    • Build-task tag handling adjusted for conditional extra tags.

coderabbitai Bot commented May 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f333dd43-cfcb-48bb-8160-ceeb38937c76

📥 Commits

Reviewing files that changed from the base of the PR and between d356197 and d423010.

📒 Files selected for processing (4)
  • pkg/api/methods/media_scrape.go
  • pkg/database/mediadb/sql_trace_bench_test.go
  • pkg/database/scraper/gamelistxml/scraper.go
  • pkg/database/scraper/gamelistxml/scraper_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/database/mediadb/sql_trace_bench_test.go
  • pkg/api/methods/media_scrape.go
  • pkg/database/scraper/gamelistxml/scraper_test.go

📝 Walkthrough

This PR adds scraped-count caching in the API, batch scrape-write contracts and a bulk implementation in MediaDB, refactors sentinel-based scraped-media queries, introduces SQL tracing and benchmarks, and rewrites the gamelist scraper to pre-parse files and perform index-driven companion matching with batched writes and updated tests.

Changes

Scrape Write Batching and Query Optimization

| Layer | File(s) | Summary |
| --- | --- | --- |
| Batch Scrape Write Contracts | pkg/database/database.go | Adds ScrapeWriteTarget and ScrapeResultBatchApplier to enable batch application of multiple scrape targets. |
| Sentinel-Based Media Counting | pkg/database/mediadb/sql_scraper.go, pkg/database/mediadb/sql_scraper_test.go | Refactors GetScrapedMediaCount/GetTotalScrapedMediaCount/GetScrapedMediaIDs to resolve sentinel tag DBIDs first and count via helpers; adds missing-sentinel tests. |
| Batch Write Context and Infrastructure | pkg/database/mediadb/sql_scraper.go | Adds scrapeWriteTxContext caches, context-based tag/property resolution, bulk upserts for title tags/properties/sentinels, and ApplyScrapeResults bulk flow with validation and stats. |
| FindMediaTitlesWithoutSentinel Refactoring | pkg/database/mediadb/sql_scraper.go, pkg/database/mediadb/sql_scraper_test.go | Uses sentinel TagDBID IN(...) filtering with fallback when no sentinels exist; updates tests for missing types and value-specific matching. |
| API Scrape Count Caching | pkg/api/methods/media_scrape.go, pkg/api/methods/media_scrape_test.go | Adds a 5s in-memory TotalScraped cache keyed by ScraperID, with fresh/cached read helpers and safe updates; call sites choose exact vs cached queries for init/progress/done. Tests cover cache refresh and progress/done semantics. |
| SQL Trace Collection and Diagnostics | pkg/database/mediadb/sql_trace_runtime.go, pkg/database/mediadb/sql_trace_runtime_stub.go, pkg/database/mediadb/sql_plan_test.go | Adds build-tagged runtime SQL trace collector and a stub for non-trace builds; diagnostic test logs EXPLAIN plans, sqlite_stat1, and PRAGMAs for hot scraper queries. |
| MediaDB Driver and Shutdown | pkg/database/mediadb/mediadb.go | Switches sqlite driver selection to sqliteDriverName() and calls logSQLTraceSummary() on Close to flush traces. |
| Benchmarks and Trace Bench | pkg/database/mediadb/sql_scraper_bench_test.go, pkg/database/mediadb/sql_trace_bench_test.go | Adds benchmarks for ApplyScrapeResults (companion batch) and a trace-enabled benchmark that aggregates per-normalized-statement timings. |
| Comprehensive Batch Scrape Write Tests | pkg/database/mediadb/sql_scraper_test.go | Replaces single-target tests with a suite for ApplyScrapeResults covering multi-target writes, idempotency, exclusive/additive semantics, rollback, deduplication, later-target-wins behavior, and nil-write rejection. |

GameList Scraper Index-Driven Companion Optimization

| Layer | File(s) | Summary |
| --- | --- | --- |
| Parsed GameList Loading | pkg/database/scraper/gamelistxml/scraper.go | Adds parsed-gamelist containers and loadParsedGamelistSystem; LoadRecords delegates to loadRecordsFromParsed and preserves per-file provenance for records. |
| Media Indexing and Regular Write Stats | pkg/database/scraper/gamelistxml/scraper.go | Adds MediaByFilename indexing (lowercased basename), introduces scrapeWriteStats, and converts regular writes to use ScrapeWriteTarget with stats recording. |
| Companion Entry Loading and Index-Driven Matching | pkg/database/scraper/gamelistxml/scraper.go | Loads and classifies companion parents/children from parsed data and replaces DB-query child matching with index-driven matchCompanionChildMedia (slug→titles→media, resolved-path, filename key). |
| Companion Batch Write Integration | pkg/database/scraper/gamelistxml/scraper.go | Batches companion ScrapeWriteTarget application via applyCompanionWriteTargets using ScrapeResultBatchApplier with per-target fallback; implements per-title deduplication/conflict detection and comprehensive companion stats. |
| GameList Scraper Tests | pkg/database/scraper/gamelistxml/scraper_test.go | Adds assertCompanionCounts, batchMockMediaDB, expanded index builders, a parsed-vs-regular records test, and refactors companion tests to use loadRecordIndexes; adds coverage for batching, deduplication, matching modes, and already-scraped skipping. |
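
The "Companion Batch Write Integration" row mentions applying targets through ScrapeResultBatchApplier with a per-target fallback. One common way to express that in Go is an optional-interface check, sketched below. Only the names ScrapeWriteTarget, ScrapeResultBatchApplier, and ApplyScrapeResults come from this page; every field, method signature, and helper here (`singleApplier`, `applyTargets`, the in-memory fakes) is an assumption, and the fallback assumes a failed batch was rolled back atomically.

```go
package main

import (
	"errors"
	"fmt"
)

// ScrapeWriteTarget: the name appears in the PR; these fields are hypothetical.
type ScrapeWriteTarget struct {
	MediaDBID int64
	Tags      []string
}

// singleApplier is a hypothetical per-target writer every DB is assumed to have.
type singleApplier interface {
	ApplyScrapeResult(t ScrapeWriteTarget) error
}

// ScrapeResultBatchApplier is the optional bulk capability; signature assumed.
type ScrapeResultBatchApplier interface {
	ApplyScrapeResults(targets []ScrapeWriteTarget) error
}

// applyTargets prefers the batch path and falls back to per-target writes when
// the DB lacks the capability or the batch fails. It returns the number of
// targets written and any per-target errors joined together.
func applyTargets(db singleApplier, targets []ScrapeWriteTarget) (int, error) {
	if batch, ok := db.(ScrapeResultBatchApplier); ok {
		if err := batch.ApplyScrapeResults(targets); err == nil {
			return len(targets), nil
		}
		// Assumption: a failed batch wrote nothing, so retrying is safe.
	}
	written := 0
	var errs []error
	for _, t := range targets {
		if err := db.ApplyScrapeResult(t); err != nil {
			errs = append(errs, fmt.Errorf("media %d: %w", t.MediaDBID, err))
			continue
		}
		written++
	}
	return written, errors.Join(errs...)
}

// memDB is an in-memory fake that supports only per-target writes.
type memDB struct{ writes int }

func (m *memDB) ApplyScrapeResult(t ScrapeWriteTarget) error {
	if t.MediaDBID <= 0 {
		return errors.New("invalid media id")
	}
	m.writes++
	return nil
}

// batchMemDB adds the bulk path and rejects the whole batch on any bad target.
type batchMemDB struct {
	memDB
	batches int
}

func (b *batchMemDB) ApplyScrapeResults(targets []ScrapeWriteTarget) error {
	for _, t := range targets {
		if t.MediaDBID <= 0 {
			return errors.New("batch rejected: invalid media id")
		}
	}
	b.batches++
	b.writes += len(targets)
	return nil
}

func main() {
	targets := []ScrapeWriteTarget{{MediaDBID: 1}, {MediaDBID: 2}, {MediaDBID: 0}}
	db := &batchMemDB{}
	n, err := applyTargets(db, targets)
	fmt.Println("written:", n, "batches:", db.batches, "err:", err)
}
```

With this shape, one invalid target costs a retry of the whole batch but does not prevent the valid targets from being written.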

Build Configuration

| Layer | File(s) | Summary |
| --- | --- | --- |
| MiSTer Build Tags | scripts/tasks/mister.yml | Updates build-arm EXTRA_TAGS to always include embed_arcadedb and optionally append additional tags comma-separated when provided. |

🎯 4 (Complex) | ⏱️ ~75 minutes

"🐰 I hopped through code and cached a score,
Batches stitched writes and trimmed queries galore.
Parsed lists prepped roots, indices kept tight,
Companions matched swiftly, writes batched just right.
A rabbit cheers the merge — hop, test, and take flight!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 4.93% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main optimization focus of the changeset: improving scraper performance for progress updates and write operations. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
pkg/database/scraper/gamelistxml/scraper.go (1)

314-323: ⚡ Quick win

Move the new type/const declarations into the top declaration block.

Lines 314-323 and Lines 517-520 add new types/constants after function bodies, which makes the file layout inconsistent with the repo rule and harder to scan.

As per coding guidelines, "Define Go types and consts near the top of the file, before functions and methods".

Also applies to: 517-520

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/database/scraper/gamelistxml/scraper.go` around lines 314 - 323, The new
types parsedGamelistFile and parsedGamelistSystem (and the other new type/const
group added later) are declared after function bodies; move those type/const
declarations up into the file's top declaration block where other types and
consts are defined (i.e., before any functions/methods) so they follow the repo
guideline of placing Go types/consts near the top and keep the file layout
consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/database/mediadb/sql_trace_bench_test.go`:
- Around line 38-41: Replace the use of sync.Mutex for the mu field with
syncutil.Mutex (i.e., change "mu sync.Mutex" to "mu syncutil.Mutex") and update
imports to reference the syncutil package instead of the stdlib sync; keep the
rest of the struct (byStmt, stmtByHandle) unchanged so existing code using
mu.Lock/Unlock still compiles against syncutil.Mutex.

In `@pkg/database/scraper/gamelistxml/scraper_test.go`:
- Around line 2562-2572: The helper drainBufferedUpdates currently reads from ch
without checking the receive-ok flag, which causes an infinite append of
zero-value scraper.ScrapeUpdate when the channel is closed; update the loop in
drainBufferedUpdates to use the two-value receive (e.g., update, ok := <-ch) and
return the collected updates immediately if ok is false, otherwise append the
received update, preserving the existing select/default behavior to still return
when no buffered items exist.

In `@pkg/database/scraper/gamelistxml/scraper.go`:
- Around line 1858-1864: When handling .slug files in the slug-detection block,
normalize the derived slug (currently computed as slug :=
strings.TrimSuffix(filename, filepath.Ext(filename))) to the same case as the
map keys before lookup; replace lookups against indexes.AllTitlesBySlug and
indexes.TitlesBySlug with a lowercased (or otherwise normalized) slug variable
(e.g., slugKey := strings.ToLower(slug)) so filenames like "MySlug.slug"
correctly match and avoid spuriously adding entries to MissingTitleSlugs.
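
The `.slug` normalization requested in the last inline comment can be sketched as a small key-derivation helper. This is an illustration under stated assumptions: `slugKey` is a hypothetical name, the index maps are assumed to be keyed by lowercased slugs (as the comment implies), and treating the extension case-insensitively is an extra assumption not stated in the review.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// slugKey derives the index lookup key for a companion ".slug" file.
// It returns false when the file is not a .slug file at all.
// Assumption: maps such as TitlesBySlug use lowercased slugs as keys.
func slugKey(filename string) (string, bool) {
	if !strings.EqualFold(filepath.Ext(filename), ".slug") {
		return "", false
	}
	base := filepath.Base(filename)
	slug := strings.TrimSuffix(base, filepath.Ext(base))
	return strings.ToLower(slug), true
}

func main() {
	// Stand-in for indexes.TitlesBySlug; the value type is hypothetical.
	titlesBySlug := map[string][]string{"myslug": {"My Slug"}}

	if key, ok := slugKey("roms/snes/MySlug.slug"); ok {
		titles, found := titlesBySlug[key]
		fmt.Println(key, titles, found)
	}
}
```

Normalizing once, right where the slug is derived, means every later lookup uses the same key and mixed-case filenames no longer end up in a missing-slug list.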

---

Nitpick comments:
In `@pkg/database/scraper/gamelistxml/scraper.go`:
- Around line 314-323: The new types parsedGamelistFile and parsedGamelistSystem
(and the other new type/const group added later) are declared after function
bodies; move those type/const declarations up into the file's top declaration
block where other types and consts are defined (i.e., before any
functions/methods) so they follow the repo guideline of placing Go types/consts
near the top and keep the file layout consistent.
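
The `drainBufferedUpdates` fix described in the inline comments above (two-value receive inside a `select` with a `default` branch) can be sketched as follows. `ScrapeUpdate` here is a local stand-in for `scraper.ScrapeUpdate` with a hypothetical field; the real helper in `scraper_test.go` may differ in signature.

```go
package main

import "fmt"

// ScrapeUpdate is a stand-in for scraper.ScrapeUpdate; the field is hypothetical.
type ScrapeUpdate struct {
	Processed int
}

// drainBufferedUpdates collects every update currently buffered in ch without
// blocking. It returns when the buffer is empty or the channel is closed.
func drainBufferedUpdates(ch <-chan ScrapeUpdate) []ScrapeUpdate {
	var updates []ScrapeUpdate
	for {
		select {
		case update, ok := <-ch:
			if !ok {
				// Closed channel: a one-value receive would keep yielding
				// zero values here and the loop would never end.
				return updates
			}
			updates = append(updates, update)
		default:
			// Nothing buffered right now.
			return updates
		}
	}
}

func main() {
	ch := make(chan ScrapeUpdate, 4)
	ch <- ScrapeUpdate{Processed: 1}
	ch <- ScrapeUpdate{Processed: 2}
	close(ch)
	fmt.Println("drained:", len(drainBufferedUpdates(ch)))
}
```

A receive on a closed channel never blocks, so without the `ok` check the `default` branch is never reached once the channel is closed.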

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7bd55f6-0c85-4dba-8442-c455ce66b552

📥 Commits

Reviewing files that changed from the base of the PR and between 61ca5e1 and d356197.

📒 Files selected for processing (14)
  • pkg/api/methods/media_scrape.go
  • pkg/api/methods/media_scrape_test.go
  • pkg/database/database.go
  • pkg/database/mediadb/mediadb.go
  • pkg/database/mediadb/sql_plan_test.go
  • pkg/database/mediadb/sql_scraper.go
  • pkg/database/mediadb/sql_scraper_bench_test.go
  • pkg/database/mediadb/sql_scraper_test.go
  • pkg/database/mediadb/sql_trace_bench_test.go
  • pkg/database/mediadb/sql_trace_runtime.go
  • pkg/database/mediadb/sql_trace_runtime_stub.go
  • pkg/database/scraper/gamelistxml/scraper.go
  • pkg/database/scraper/gamelistxml/scraper_test.go
  • scripts/tasks/mister.yml

Comment thread pkg/database/mediadb/sql_trace_bench_test.go Outdated
Comment thread pkg/database/scraper/gamelistxml/scraper_test.go
Comment thread pkg/database/scraper/gamelistxml/scraper.go

sentry Bot commented May 30, 2026

Codecov Report

❌ Patch coverage is 75.26882% with 276 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/database/mediadb/sql_scraper.go | 72.80% | 107 Missing and 91 partials ⚠️ |
| pkg/database/scraper/gamelistxml/scraper.go | 82.91% | 39 Missing and 16 partials ⚠️ |
| pkg/api/methods/media_scrape.go | 62.29% | 19 Missing and 4 partials ⚠️ |

@wizzomafizzo merged commit d07efe9 into main on May 30, 2026 (12 checks passed)
@wizzomafizzo deleted the feat/throttle-scraper-progress-refreshes branch on May 30, 2026 at 12:17