Specialize SQLGraph.from_other for SQLite-to-SQLite (closes #285)#286
Open

JoOkuma wants to merge 4 commits into
Conversation
The generic BaseGraph.from_other materializes the source graph in Python and then calls other.overlaps(), which on a SQLGraph-backed GraphView issues a single IN (...) clause containing every selected node id. With non-trivial selections this hits SQLite's SQLITE_MAX_VARIABLE_NUMBER and raises "too many SQL variables".

Per @JoOkuma's suggestion in royerlab#285, add a specialized fast path that bypasses the BaseGraph copy entirely when both sides are SQLite:

* Open the destination SQLGraph so it materializes its schema.
* Mirror any extra columns and metadata.
* Dispose the dst engine, ATTACH the dst database to the source's connection, and dump each table via INSERT INTO ... SELECT FROM.
* For GraphView sources, materialize the selected node ids in a temp table (chunked inserts) and JOIN against it instead of using a giant IN (...) clause.
* Re-open the dst engine and refresh derived caches.

For correctness in other code paths (summary, NN solver, direct callers), also chunk SQLGraph.overlaps(node_ids) so it never overflows the bound-parameter limit.

New tests cover:

* The ATTACH fast path is taken (BaseGraph.from_other is not called).
* The exact issue royerlab#285 scenario: filter -> subgraph -> from_other with a selection of > 999 nodes plus overlaps.
* Node ids are preserved on full copies.
* Fallback to the generic path for ':memory:' destinations and for non-SQL sources.
* SQLGraph.overlaps with > 999 node ids returns the right pairs.
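The temp-table approach for GraphView sources can be sketched with plain sqlite3. Everything here is illustrative, not tracksdata's actual code: the `nodes` table, column names, and the `select_via_temp_table` helper are hypothetical stand-ins, and the chunk size is just a value safely under SQLite's historical 999-variable default.

```python
import sqlite3

CHUNK = 500  # stay well below SQLite's default 999-variable limit

def select_via_temp_table(conn: sqlite3.Connection, node_ids: list[int]) -> list[tuple]:
    """Materialize the selected ids in a temp table with chunked multi-row
    inserts, then JOIN against it instead of one giant IN (...) clause."""
    conn.execute("CREATE TEMP TABLE selected_ids (node_id INTEGER PRIMARY KEY)")
    for start in range(0, len(node_ids), CHUNK):
        batch = node_ids[start:start + CHUNK]
        placeholders = ",".join(["(?)"] * len(batch))  # one variable per row
        conn.execute(f"INSERT INTO selected_ids VALUES {placeholders}", batch)
    rows = conn.execute(
        "SELECT n.id, n.t FROM nodes AS n "
        "JOIN selected_ids AS s ON s.node_id = n.id"
    ).fetchall()
    conn.execute("DROP TABLE selected_ids")
    return rows
```

Each INSERT binds at most CHUNK variables, and the final SELECT binds none, so the query works for arbitrarily large selections.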
Instead of instantiating the destination upfront and ALTER-ing it into shape, replay the source's own CREATE TABLE/INDEX DDL from sqlite_master against the attached destination, copy rows (including the Metadata table verbatim), then open the destination via cls(**kwargs). The constructor's normal reflection path rebuilds the in-memory state, so the manual engine dispose/reopen, cache invalidation, and _update_max_id_per_time call are no longer needed.

Also tighten the eligibility gate in from_other: the destination file must start empty (DDL replay would clash with an existing schema), and overwrite=True is excluded (the dst constructor would drop the tables we just populated).
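The ATTACH-plus-DDL-replay pattern described above can be sketched with stdlib sqlite3. This is a simplified analogue, not the PR's implementation: it assumes plain `CREATE TABLE ` / `CREATE INDEX ` statement prefixes (no UNIQUE indexes, views, or triggers) and an empty destination file, mirroring the eligibility gate.

```python
import sqlite3

def copy_sqlite_db(src_path: str, dst_path: str) -> None:
    """Replay the source's own CREATE TABLE/INDEX DDL from sqlite_master
    against an empty attached destination, then bulk-copy rows with
    INSERT INTO ... SELECT so rows never round-trip through Python."""
    src = sqlite3.connect(src_path)
    src.execute("ATTACH DATABASE ? AS dst", (dst_path,))
    for obj_type in ("table", "index"):  # tables first, then their indexes
        stmts = src.execute(
            "SELECT sql FROM sqlite_master WHERE type = ? "
            "AND sql IS NOT NULL AND name NOT LIKE 'sqlite_%'",
            (obj_type,),
        ).fetchall()
        for (sql,) in stmts:
            # Qualify the created object so it lands in the attached schema.
            src.execute(
                sql.replace("CREATE TABLE ", "CREATE TABLE dst.", 1)
                   .replace("CREATE INDEX ", "CREATE INDEX dst.", 1)
            )
    tables = src.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for (name,) in tables:
        src.execute(f'INSERT INTO dst."{name}" SELECT * FROM main."{name}"')
    src.commit()
    src.execute("DETACH DATABASE dst")
    src.close()
```

After this runs, opening the destination normally (in the PR, via cls(**kwargs)) lets the constructor's reflection path rebuild in-memory state from the populated file.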
Replace the hand-rolled chunk loop with the existing _chunked_sa_read helper. Only the source side is constrained per chunk via IN (...); the target side is filtered in Polars afterwards, which avoids the quadratic bound-parameter blow-up of having two IN clauses per chunk and lets us use the full _sql_chunk_size() budget.

Also drop the seen-set deduplication, which was a behaviour change vs. the pre-fix code (the original query.all() preserved duplicates if any existed in the Overlap table).
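The one-sided-IN-plus-client-side-filter idea can be shown with plain sqlite3 and in-memory filtering standing in for the Polars step; the `overlaps` table and helper name here are hypothetical, not the library's actual `_chunked_sa_read` machinery.

```python
import sqlite3

CHUNK = 999  # per-statement bound-variable budget

def overlaps_chunked(conn: sqlite3.Connection, node_ids: list[int]) -> list[tuple]:
    """Constrain only the source side per chunk with IN (...); filter the
    target side client-side afterwards. One IN clause per chunk keeps the
    variable count linear in the chunk size instead of doubling it, and no
    seen-set is kept, so duplicate rows in the table are preserved."""
    wanted = set(node_ids)
    pairs = []
    for start in range(0, len(node_ids), CHUNK):
        batch = node_ids[start:start + CHUNK]
        placeholders = ",".join("?" * len(batch))
        for src_id, tgt_id in conn.execute(
            "SELECT source_id, target_id FROM overlaps "
            f"WHERE source_id IN ({placeholders})",
            batch,
        ):
            if tgt_id in wanted:  # target filtered in memory, not in SQL
                pairs.append((src_id, tgt_id))
    return pairs
```

Filtering the target in memory trades a slightly larger result set per chunk for a strictly bounded number of SQL variables, which is the same trade the commit describes.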
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #286      +/-   ##
==========================================
+ Coverage   87.70%   87.72%   +0.02%
==========================================
  Files          57       57
  Lines        4879     4936      +57
  Branches      858      868      +10
==========================================
+ Hits         4279     4330      +51
- Misses        378      381       +3
- Partials      222      225       +3
Closes #285.
BaseGraph.from_other ends with other.overlaps(), which on a SQL-rooted GraphView issues a single IN (...) containing every selected node id and overflows SQLite's variable limit.

This PR adds a SQL-level fast path on SQLGraph.from_other: when both ends are on-disk SQLite, the source's schema and rows are dumped straight into the destination via ATTACH DATABASE + replayed CREATE TABLE/INDEX DDL + INSERT INTO ... SELECT, then the destination is opened from the populated file. Filtered selections (GraphView) are joined against a temp table to sidestep the bound-parameter limit. Other configurations fall back to the generic path.

Separately, SQLGraph.overlaps(node_ids) is now chunked via the existing _chunked_sa_read helper so direct callers don't hit the same limit.

Tests cover the fast path, the issue reproducer (filter -> subgraph -> from_other with > 999 nodes + overlaps), full-copy id preservation, and fallbacks for :memory: and non-SQL sources.