Add TopN heap and eager sort materialization for memory reduction #22
Closed
philcunliffe wants to merge 5 commits into
Add multi-level caching and reduce per-row overhead:
- parseSql: LRU cache (64 entries) avoids re-tokenizing/parsing same SQL strings
- planSql: WeakMap cache on parsed ASTs avoids re-planning identical queries
- asyncRow: attach _data field for zero-copy collection
- collect: sync fast-path skips Promise.all when all rows have pre-materialized _data
- executeProject: pre-compute static column names, fast-path for simple identifier projections with direct cell passthrough and _data propagation
- executeSql: skip table normalization when no array tables are present
- compareForTerm: use module-level Set instead of per-call array allocation
- memorySource: hoist column computation outside scan loop, use Set for validation

Fix type errors:
- Add _data to AsyncRow type definition
- Cast to DerivedColumn/IdentifierNode where type narrowing is needed
- Type _data as Record<string, SqlPrimitive>
- Fix JSDoc placement for compareForTerm

Adapt optimizations to the new QueryResults return type:
- executeSql: keep table normalization skip, use new inline plan+execute
- executeProject: move pre-computation outside rows(), keep identifier fast-path and static column names inside the rows() generator
- Add _data to AsyncRow type definition
- Fix JSDoc placement and type casts for tsc

Drop the parseSql/planSql memoization caches added in 881a031. Also rename the pre-materialized row payload from `_data` to `resolved` for clarity, and delete stale scratch files (query-parquet.mjs, repro-525.mjs).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two optimizations to reduce memory usage for ORDER BY queries:
1. TopN plan node: fuses Sort + Limit into a bounded binary heap. ORDER BY x LIMIT N now uses O(N) memory instead of O(total rows). The planner detects Limit(Sort(...)) and Limit(Project(Sort(...))) patterns and rewrites them to TopN.
2. Eager materialization in executeSort: resolves all cell values during sort buffering, replacing AsyncRow closures (which capture large decompressed parquet data) with plain value-returning functions. For tables with 10KB+ text columns, this reduces the per-row buffer cost from ~10KB (closure) to ~100B (plain value).
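To illustrate the first optimization, here is a minimal sketch of a bounded max-heap that keeps only the best N rows while streaming the input. It is not the code from this PR: the `topN` function, the `Compare` type, and the synchronous `Iterable` input are assumptions made for the example, and the real executor presumably works on async rows.

```typescript
// Hypothetical comparator type: negative if `a` sorts before `b`.
type Compare<T> = (a: T, b: T) => number;

/**
 * Return the first `limit` items of `rows` in `compare` order.
 * A max-heap of size <= limit holds the retained rows, so memory is O(limit).
 */
function topN<T>(rows: Iterable<T>, limit: number, compare: Compare<T>): T[] {
  if (limit <= 0) return [];
  // heap[0] is always the worst (last-sorting) row currently retained.
  const heap: T[] = [];

  const siftUp = (start: number): void => {
    let i = start;
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (compare(heap[i], heap[parent]) <= 0) break;
      [heap[i], heap[parent]] = [heap[parent], heap[i]];
      i = parent;
    }
  };

  const siftDown = (start: number): void => {
    let i = start;
    for (;;) {
      const left = 2 * i + 1;
      const right = left + 1;
      let largest = i;
      if (left < heap.length && compare(heap[left], heap[largest]) > 0) largest = left;
      if (right < heap.length && compare(heap[right], heap[largest]) > 0) largest = right;
      if (largest === i) break;
      [heap[i], heap[largest]] = [heap[largest], heap[i]];
      i = largest;
    }
  };

  for (const row of rows) {
    if (heap.length < limit) {
      // Heap not full yet: always keep the row.
      heap.push(row);
      siftUp(heap.length - 1);
    } else if (compare(row, heap[0]) < 0) {
      // Row beats the current worst: replace the root and restore heap order.
      heap[0] = row;
      siftDown(0);
    }
  }
  // Only the retained rows are sorted at the end.
  return heap.sort(compare);
}

// Usage: the three lowest scores, equivalent to ORDER BY score LIMIT 3.
const scores = [5, 1, 9, 3, 7, 2];
const lowest3 = topN(scores, 3, (a, b) => a - b);
console.log(lowest3); // [1, 2, 3]
```

Unlike a full stable sort followed by a limit, this sketch makes no guarantee about which of several equal-keyed rows is retained; a real implementation may need a tiebreaker such as the input position.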
Summary
Two optimizations to reduce memory usage for ORDER BY queries over large datasets:

1. TopN plan node: fuses Sort + Limit into a bounded binary heap
   - `ORDER BY x LIMIT N` now uses O(N) memory instead of O(total rows)
   - The planner rewrites `Limit(Sort(...))` and `Limit(Project(Sort(...)))` patterns to TopN
   - `SELECT * FROM (860K rows) ORDER BY score LIMIT 100`: buffers 100 rows instead of 860K
2. Eager row materialization in `executeSort`
   - `materializeRow()` replaces `AsyncRow` closures (capturing decompressed parquet data, ~10KB/row) with plain values (~100B/row)
   - `ORDER BY` without `LIMIT` over 860K rows with large text columns: ~80MB instead of ~8GB

Test plan