
Specialize SQLGraph.from_other for SQLite-to-SQLite (closes #285)#286

Open
JoOkuma wants to merge 4 commits into royerlab:main from JoOkuma:fix/sql-graph-from-other-285

Conversation

JoOkuma (Member) commented May 1, 2026

Closes #285.

BaseGraph.from_other ends with other.overlaps(), which on a SQL-rooted GraphView issues a single IN (...) clause containing every selected node id, overflowing SQLite's bound-variable limit.

This PR adds a SQL-level fast path on SQLGraph.from_other: when both ends are on-disk SQLite, the source's schema and rows are dumped straight into the destination via ATTACH DATABASE + replayed CREATE TABLE/INDEX DDL + INSERT INTO ... SELECT, then the destination is opened from the populated file. Filtered selections (GraphView) are joined against a temp table to sidestep the bound-parameter limit. Other configurations fall back to the generic path.
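
The ATTACH-based copy can be sketched with the stdlib sqlite3 module alone. File names and the single `nodes` table are illustrative, not the real SQLGraph schema, and the naive DDL rewrite is only adequate for this toy example:

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "src.db")
dst_path = os.path.join(tmp, "dst.db")

# Build a toy source database standing in for the SQLGraph file.
src = sqlite3.connect(src_path)
src.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, t INTEGER)")
src.execute("CREATE INDEX idx_nodes_t ON nodes (t)")
src.executemany("INSERT INTO nodes VALUES (?, ?)", [(i, i % 5) for i in range(10)])
src.commit()

# Fast path: attach the empty destination file to the source connection,
# replay the source's DDL there, then bulk-copy rows inside SQLite --
# no per-row Python round-trip and no bound parameters for the data.
src.execute("ATTACH DATABASE ? AS dst", (dst_path,))
ddl_rows = src.execute(
    "SELECT sql FROM sqlite_master "
    "WHERE type IN ('table', 'index') AND sql IS NOT NULL"
).fetchall()
for (ddl,) in ddl_rows:
    # Naive schema-qualifying rewrite; fine for this toy DDL.
    src.execute(
        ddl.replace("CREATE TABLE ", "CREATE TABLE dst.", 1)
           .replace("CREATE INDEX ", "CREATE INDEX dst.", 1)
    )
src.execute("INSERT INTO dst.nodes SELECT * FROM nodes")
src.commit()
src.execute("DETACH DATABASE dst")

# The populated destination can now be opened normally.
dst = sqlite3.connect(dst_path)
print(dst.execute("SELECT COUNT(*) FROM nodes").fetchone()[0])  # 10
```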

Separately, SQLGraph.overlaps(node_ids) is now chunked via the existing _chunked_sa_read helper so direct callers don't hit the same limit.

Tests cover the fast path, the issue reproducer (filter → subgraph → from_other with > 999 nodes + overlaps), full-copy id preservation, and fallbacks for :memory: and non-SQL sources.

JoOkuma added 3 commits May 1, 2026 09:59
The generic BaseGraph.from_other materializes the source graph in
Python and then calls other.overlaps(), which on a SQLGraph-backed
GraphView issues a single 'IN (...)' clause containing every selected
node id. With non-trivial selections this hits SQLite's
SQLITE_MAX_VARIABLE_NUMBER and raises 'too many SQL variables'.

Per @JoOkuma's suggestion in royerlab#285, add a specialized fast path that
bypasses the BaseGraph copy entirely when both sides are SQLite:

* Open the destination SQLGraph so it materializes its schema.
* Mirror any extra columns and metadata.
* Dispose the dst engine, ATTACH the dst database to the source's
  connection, and dump each table via INSERT INTO ... SELECT FROM.
* For GraphView sources, materialize the selected node ids in a temp
  table (chunked inserts) and JOIN against it instead of using a
  giant IN (...) clause.
* Re-open the dst engine and refresh derived caches.
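
The temp-table trick from the bullets above can be sketched as follows (table names and chunk size are illustrative, not the PR's actual identifiers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, t INTEGER)")
conn.executemany("INSERT INTO nodes VALUES (?, ?)", [(i, i % 3) for i in range(5000)])

# A filtered selection far larger than SQLITE_MAX_VARIABLE_NUMBER (999 on
# older SQLite builds), so a single "WHERE id IN (?, ..., ?)" would fail.
selected = list(range(0, 5000, 2))

# Stage the ids in a temp table using chunked multi-row inserts; each chunk
# stays well under the bound-parameter limit.
conn.execute("CREATE TEMP TABLE _sel (id INTEGER PRIMARY KEY)")
CHUNK = 500
for start in range(0, len(selected), CHUNK):
    chunk = selected[start:start + CHUNK]
    placeholders = ", ".join("(?)" for _ in chunk)
    conn.execute(f"INSERT INTO _sel VALUES {placeholders}", chunk)

# JOIN against the temp table: zero bound parameters, any selection size.
n = conn.execute("SELECT COUNT(*) FROM nodes JOIN _sel USING (id)").fetchone()[0]
print(n)  # 2500
```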

For correctness in other code paths (summary, NN solver, direct
callers), also chunk SQLGraph.overlaps(node_ids) so it never
overflows the bound-parameter limit.

New tests cover:
* The ATTACH fast path is taken (BaseGraph.from_other not called).
* The exact issue royerlab#285 scenario: filter -> subgraph -> from_other
  with a selection > 999 nodes plus overlaps.
* Node ids are preserved on full copies.
* Fallback to the generic path for ':memory:' destinations and for
  non-SQL sources.
* SQLGraph.overlaps with > 999 node ids returns the right pairs.
Instead of instantiating the destination upfront and ALTER-ing it into
shape, replay the source's own CREATE TABLE/INDEX DDL from sqlite_master
against the attached destination, copy rows (including the Metadata
table verbatim), then open the destination via cls(**kwargs). The
constructor's normal reflection path rebuilds the in-memory state, so
the manual engine dispose/reopen, cache invalidation, and
_update_max_id_per_time call are no longer needed.

Also tighten the eligibility gate in from_other: the destination file
must start empty (DDL replay would clash with an existing schema), and
overwrite=True is excluded (the dst constructor would drop the tables
we just populated).
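
The "must start empty" half of that gate can be sketched as below; dst_is_empty is a hypothetical helper (the PR's actual check lives inside SQLGraph.from_other and also inspects the overwrite flag):

```python
import os
import sqlite3
import tempfile

def dst_is_empty(path: str) -> bool:
    """Hypothetical check: True if the SQLite file at `path` has no schema yet."""
    conn = sqlite3.connect(path)  # creates an empty file if it does not exist
    try:
        (n,) = conn.execute("SELECT COUNT(*) FROM sqlite_master").fetchone()
    finally:
        conn.close()
    return n == 0

path = os.path.join(tempfile.mkdtemp(), "dst.db")
print(dst_is_empty(path))  # True: a fresh file has an empty sqlite_master

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()
conn.close()
print(dst_is_empty(path))  # False: existing DDL would clash with the replay
```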
Replace the hand-rolled chunk loop with the existing _chunked_sa_read
helper. Only the source side is constrained per chunk via IN(...); the
target side is filtered in Polars afterwards, which avoids the
quadratic bound-parameter blow-up of having two IN clauses per chunk
and lets us use the full _sql_chunk_size() budget.

Also drop the seen-set deduplication, which was a behaviour change vs.
the pre-fix code (the original query.all() preserved duplicates if any
existed in the Overlap table).
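The asymmetric chunking can be sketched as follows, with plain sqlite3 and a Python post-filter standing in for _chunked_sa_read and the Polars step; the table layout and chunk size are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE overlaps (source_id INTEGER, target_id INTEGER)")
conn.executemany(
    "INSERT INTO overlaps VALUES (?, ?)", [(i, i + 1) for i in range(3000)]
)

node_ids = list(range(2000))  # > 999: a single IN (...) would overflow
wanted = set(node_ids)

# Only the source column gets an IN (...) per chunk, so each chunk's
# parameter count equals the chunk size -- no quadratic blow-up from a
# second IN clause on the target column.
CHUNK = 900
pairs = []
for start in range(0, len(node_ids), CHUNK):
    chunk = node_ids[start:start + CHUNK]
    placeholders = ", ".join("?" for _ in chunk)
    rows = conn.execute(
        f"SELECT source_id, target_id FROM overlaps "
        f"WHERE source_id IN ({placeholders})",
        chunk,
    ).fetchall()
    # Filter the target side in Python afterwards (the PR does this in Polars).
    pairs += [(s, t) for s, t in rows if t in wanted]

print(len(pairs))  # 1999
```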

codecov-commenter commented May 1, 2026

Codecov Report

❌ Patch coverage is 90.16393% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.72%. Comparing base (6227efc) to head (74a8ca7).
⚠️ Report is 2 commits behind head on main.

Files with missing lines            | Patch %  | Lines
src/tracksdata/graph/_sql_graph.py  | 90.16%   | 3 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #286      +/-   ##
==========================================
+ Coverage   87.70%   87.72%   +0.02%     
==========================================
  Files          57       57              
  Lines        4879     4936      +57     
  Branches      858      868      +10     
==========================================
+ Hits         4279     4330      +51     
- Misses        378      381       +3     
- Partials      222      225       +3     

☔ View full report in Codecov by Sentry.


Development

Successfully merging this pull request may close these issues.

SQLGraph.from_other fails with OperationalError
