Add TopN heap and eager sort materialization for memory reduction by philcunliffe · Pull Request #22 · hyparam/squirreling

philcunliffe · 2026-04-12T20:27:00Z

Summary

Two optimizations to reduce memory usage for ORDER BY queries over large datasets:

1. TopN plan node — fuses Sort + Limit into a bounded binary heap

- ORDER BY x LIMIT N now uses O(N) memory instead of O(total rows)
- Planner detects Limit(Sort(...)) and Limit(Project(Sort(...))) patterns
- For SELECT * FROM (860K rows) ORDER BY score LIMIT 100: buffers 100 rows instead of 860K
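
The bounded-heap idea can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `topN`, its signature, and the sequence-number tie-break are assumptions made for the example. It keeps at most `limit` entries in a max-heap whose root is the worst row retained so far, so memory stays O(N) regardless of input size.

```javascript
// Hypothetical sketch of a bounded top-N heap (names are illustrative only).
// Keeps the `limit` smallest rows under `compare` in a max-heap of size <= limit.
function topN(rows, limit, compare) {
  if (limit <= 0) return []
  const heap = [] // heap[0] is the worst entry currently kept

  // Tie-break on arrival order so the result matches a stable sort + slice
  const cmp = (a, b) => compare(a.row, b.row) || a.seq - b.seq

  function swap(i, j) {
    const tmp = heap[i]
    heap[i] = heap[j]
    heap[j] = tmp
  }

  // Move the entry at index i up until its parent is not smaller
  function siftUp(i) {
    while (i > 0) {
      const parent = (i - 1) >> 1
      if (cmp(heap[i], heap[parent]) <= 0) break
      swap(i, parent)
      i = parent
    }
  }

  // Move the entry at index i down until both children are not larger
  function siftDown(i) {
    for (;;) {
      const left = 2 * i + 1
      const right = left + 1
      let largest = i
      if (left < heap.length && cmp(heap[left], heap[largest]) > 0) largest = left
      if (right < heap.length && cmp(heap[right], heap[largest]) > 0) largest = right
      if (largest === i) break
      swap(i, largest)
      i = largest
    }
  }

  let seq = 0
  for (const row of rows) {
    const entry = { row, seq: seq++ }
    if (heap.length < limit) {
      heap.push(entry)
      siftUp(heap.length - 1)
    } else if (cmp(entry, heap[0]) < 0) {
      // New row beats the worst kept row: replace the root and repair the heap
      heap[0] = entry
      siftDown(0)
    }
  }
  return heap.sort(cmp).map(entry => entry.row)
}
```

A call such as `topN(rows, 100, (a, b) => a.score - b.score)` would correspond to ORDER BY score LIMIT 100 over an in-memory iterable. The real executor works on async rows, which this sketch leaves out.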

2. Eager row materialization in executeSort

- Resolves all cell values during sort buffering via materializeRow()
- Replaces AsyncRow closures (capturing decompressed parquet data ~10KB/row) with plain values (~100B/row)
- For ORDER BY without LIMIT over 860K rows with large text columns: ~80MB instead of ~8GB
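
A rough sketch of what eager materialization could look like. The AsyncRow shape assumed here (`columns` plus a `cells` map of async accessor functions) and the body of `materializeRow` are assumptions for illustration, not the PR's implementation. Each cell is awaited once, and the buffered row holds only small closures over the resolved values, so the original accessors and whatever they captured can be garbage collected.

```javascript
// Hypothetical sketch. Assumed row shape (not confirmed by the PR):
//   { columns: string[], cells: Record<string, () => Promise<any>> }
async function materializeRow(row) {
  // Resolve every cell once, up front
  const values = await Promise.all(row.columns.map(col => row.cells[col]()))

  const cells = {}
  row.columns.forEach((col, i) => {
    const value = values[i]
    // The new accessor captures only the resolved value, not the source buffer
    cells[col] = () => Promise.resolve(value)
  })
  return { columns: row.columns, cells }
}
```

The trade-off is that every cell of every buffered row is read, even columns the query never returns, which is consistent with the test plan's note about updated cell access counts.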

Test plan

- All 1322 existing tests pass (with updated expectations for plan shape and cell access counts)
- TopN produces identical results to Sort+Limit for all sort directions and data types
- Verify with hyperparam2 memory benchmark on 5-shard dataset

Add multi-level caching and reduce per-row overhead:
- parseSql: LRU cache (64 entries) avoids re-tokenizing/parsing same SQL strings
- planSql: WeakMap cache on parsed ASTs avoids re-planning identical queries
- asyncRow: attach _data field for zero-copy collection
- collect: sync fast-path skips Promise.all when all rows have pre-materialized _data
- executeProject: pre-compute static column names, fast-path for simple identifier projections with direct cell passthrough and _data propagation
- executeSql: skip table normalization when no array tables are present
- compareForTerm: use module-level Set instead of per-call array allocation
- memorySource: hoist column computation outside scan loop, use Set for validation
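
The collect fast-path from this commit might look roughly like the sketch below. The function body, the helper `resolveRow`, and the assumed row shape (`columns`, `cells`, optional `_data`) are illustrative assumptions rather than the committed code. When every row already carries `_data`, the stored objects are returned directly and no per-cell promises are created.

```javascript
// Hypothetical sketch of a collect() with a pre-materialized fast path.
// Assumed row shape: { columns, cells: Record<string, () => Promise<any>>, _data? }

// Slow path: await each cell and build a plain object
async function resolveRow(row) {
  const values = await Promise.all(row.columns.map(col => row.cells[col]()))
  const out = {}
  row.columns.forEach((col, i) => { out[col] = values[i] })
  return out
}

async function collect(rows) {
  const buffered = []
  let allResolved = true
  for await (const row of rows) {
    buffered.push(row)
    if (!row._data) allResolved = false
  }
  // Fast path: hand back the existing objects without copying or awaiting cells
  if (allResolved) return buffered.map(row => row._data)
  return Promise.all(buffered.map(row => row._data ?? resolveRow(row)))
}
```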

- Add _data to AsyncRow type definition
- Cast to DerivedColumn/IdentifierNode where type narrowing is needed
- Type _data as Record<string, SqlPrimitive>
- Fix JSDoc placement for compareForTerm

Adapt optimizations to the new QueryResults return type:
- executeSql: keep table normalization skip, use new inline plan+execute
- executeProject: move pre-computation outside rows(), keep identifier fast-path and static column names inside the rows() generator
- Add _data to AsyncRow type definition
- Fix JSDoc placement and type casts for tsc

Drop the parseSql/planSql memoization caches added in 881a031. Also rename the pre-materialized row payload from `_data` to `resolved` for clarity, and delete stale scratch files (query-parquet.mjs, repro-525.mjs).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two optimizations to reduce memory usage for ORDER BY queries:

1. TopN plan node: fuses Sort + Limit into a bounded binary heap. ORDER BY x LIMIT N now uses O(N) memory instead of O(total rows). The planner detects Limit(Sort(...)) and Limit(Project(Sort(...))) patterns and rewrites them to TopN.
2. Eager materialization in executeSort: resolves all cell values during sort buffering, replacing AsyncRow closures (which capture large decompressed parquet data) with plain value-returning functions. For tables with 10KB+ text columns, this reduces the per-row buffer cost from ~10KB (closure) to ~100B (plain value).
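
The planner rewrite described in this commit could be sketched as below. The plan node shapes (`type`, `child`, `orderBy`, `limit`, `offset`) and the function name `fuseTopN` are hypothetical and chosen only for illustration. The sketch also assumes a Limit with an OFFSET is left untouched, which the PR does not state.

```javascript
// Hypothetical sketch of the Limit/Sort fusion. Node shapes are assumptions:
//   { type: 'Limit', limit, offset?, child }
//   { type: 'Sort', orderBy, child }
//   { type: 'Project', columns, child }
function fuseTopN(plan) {
  // Assumption: only a plain LIMIT without OFFSET is rewritten
  if (plan.type !== 'Limit' || plan.offset) return plan
  const { child } = plan

  // Limit(Sort(x)) => TopN(x)
  if (child.type === 'Sort') {
    return { type: 'TopN', orderBy: child.orderBy, limit: plan.limit, child: child.child }
  }

  // Limit(Project(Sort(x))) => Project(TopN(x))
  // Safe because a projection maps rows one-to-one and keeps their order
  if (child.type === 'Project' && child.child.type === 'Sort') {
    const sort = child.child
    const top = { type: 'TopN', orderBy: sort.orderBy, limit: plan.limit, child: sort.child }
    return { ...child, child: top }
  }
  return plan
}
```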

philcunliffe · 2026-04-12T20:51:45Z

Superseded by #23 (eager materialization) and #24 (TopN heap) as separate PRs.

philcunliffe and others added 5 commits April 9, 2026 16:45

Fix typecheck errors

ac13746


philcunliffe closed this Apr 12, 2026

philcunliffe wants to merge 5 commits into master from perf/topn-and-eager-sort
