Skip to content

spike: evaluate SQLite + sql.js as an alternative data layer #69

@DevSecNinja

Description

@DevSecNinja

Context

This is a spike/investigation issue to evaluate replacing the current YAML/JSON file tree with a single SQLite database served via GitHub Pages and read in the browser using sql.js (SQLite compiled to WebAssembly).

Raised from architecture discussion — see the alternative proposal in #(feed.json issue).

Proposed architecture

Role Current With SQLite
Served to browser public/data/*.json via Pages public/data/wazzup.db via Pages
Persisted between pipeline runs wazzup-state.zip on Releases wazzup.db on Releases (same file, dual-purpose)

Pipeline flow

  1. Download wazzup.db from the news-state Release asset
  2. Insert new briefings, run retention DELETE — no filesystem walk needed
  3. Write public/data/wazzup.db
  4. Pages deploy serves it to the browser
  5. Re-upload .db to Releases for the next run

What goes away

  • All YAML/JSON output files and the publisher file-tree logic
  • manifest.yaml, latest.json, enforce_retention, prune_empty_directories
  • validate_data.py tree walk (replaced by SQL schema checks)

What's added

  • sqlite3 stdlib usage in publisher.py (zero new runtime dependencies)
  • sql.js WASM (~1 MB) bundled or CDN-loaded in the PWA
  • Schema migration strategy for the .db file

Known trade-offs to evaluate during the spike

Impact
sql.js WASM size ~1 MB download before the PWA can render — measure real impact on Time to Interactive
Whole-DB download Browser cannot range-request a subset; full DB downloaded on every cold load. Measure DB size at 35-day retention.
No partial SW caching The service worker must treat the DB as an opaque blob; individual briefing caching is not possible
Binary diff opacity .db changes are not human-readable in CI logs; debugging requires tooling
Pipeline rewrite scope publisher.py and all related tests require full rewrite
Richer frontend queries SQL enables filtering by date, source, score, keyword without backend — assess if this is actually needed

Spike goals

  • Prototype publisher.py writing to SQLite instead of YAML/JSON
  • Measure .db file size at 35-day retention with realistic data
  • Measure PWA Time to Interactive with sql.js vs current JSON approach
  • Assess service worker strategy for a binary DB blob
  • Document a go/no-go recommendation with data

Decision criteria

Recommend adoption if:

  • DB size at max retention stays under ~2 MB
  • Time to Interactive is within 500 ms of the current JSON approach
  • The PWA gains query capabilities that are otherwise impractical to add

Otherwise, close as "won't do" and link to the feed.json issue as the preferred path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions