feat(cubejs): smart-gen — YAML default, shorter names, smarter skips, no auto pre-aggs#53

Merged: acmeguy merged 1 commit into main from feat/cubejs-smart-gen-improvements on May 10, 2026
@acmeguy acmeguy commented May 10, 2026

Summary

Several improvements to smart-gen so it produces noticeably cleaner, more useful Cube models — fewer rollups, shorter field names, more accurate field-skip rules.

Output format

  • Default to YAML (.yml). Falls back to JS only when a cube has a FILTER_PARAMS arrow callback we cannot translate. Identity arrows (v) => v are auto-transpiled to Python lambdas (lambda v: v) so YAML works for the nested-lookup-key path too.
  • Filename resolution: explicit file_name wins → reuse existing model's extension on re-run → otherwise .yml.
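
A minimal sketch of the two decisions above, assuming hypothetical helper names (`transpileIdentityArrow`, `resolveFileName`) and argument shapes that are not taken from the smart-gen source. It only illustrates the stated rules: an identity arrow maps to a Python lambda, and the filename precedence is explicit `file_name`, then the existing model's extension, then `.yml`.

```javascript
// Hypothetical helpers: names and argument shapes are assumptions,
// not the actual smart-gen API.

// Matches an arrow function with one parameter and a bare identifier body,
// e.g. "(v) => v" or "v => v".
const SINGLE_IDENT_ARROW =
  /^(?:\(\s*([A-Za-z_]\w*)\s*\)|([A-Za-z_]\w*))\s*=>\s*([A-Za-z_]\w*)\s*$/;

// Returns the equivalent Python lambda for an identity arrow, or null for
// any other callback (the caller would then fall back to JS output).
function transpileIdentityArrow(source) {
  const match = SINGLE_IDENT_ARROW.exec(source.trim());
  if (!match) return null;
  const param = match[1] || match[2];
  const body = match[3];
  return param === body ? `lambda ${param}: ${param}` : null;
}

// Precedence: explicit file_name, then the existing model's extension,
// then the .yml default.
function resolveFileName({ fileName, cubeName, existingFile }) {
  if (fileName) return fileName;
  const ext = existingFile ? /\.[^./\\]+$/.exec(existingFile) : null;
  return `${cubeName}${ext ? ext[0] : ".yml"}`;
}
```

Returning `null` for untranslatable callbacks keeps the JS fallback decision with the caller, which matches the "only when we cannot translate" wording above.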

Cube naming from filters

  • Filtering by e.g. event = 'Stockout Ended' with no explicit name now derives stockout_ended.yml. Re-running the same filter set updates the same file.
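
As a rough illustration of the derivation, the sketch below slugifies filter values into a cube name. The helper name `deriveCubeName`, the `{ column, value }` filter shape, and joining multiple values with underscores are all assumptions; the PR only states the single-filter outcome.

```javascript
// Hypothetical helper: the filter shape and the multi-filter joining rule
// are assumptions made for illustration.
function deriveCubeName(filters) {
  const slug = filters
    .map(({ value }) =>
      String(value)
        .toLowerCase()
        .replace(/[^a-z0-9]+/g, "_") // collapse runs of non-alphanumerics
        .replace(/^_+|_+$/g, "") // trim leading and trailing underscores
    )
    .filter(Boolean) // drop values that slugify to nothing
    .join("_");
  return slug || null; // null: the caller needs another naming strategy
}
```

Because the result depends only on the filter values, re-running with the same filter set yields the same name and so targets the same file.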

Field naming — shortest-unique resolver

Each field carries an ordered candidate list from leaf → fully-qualified. A new resolver picks the shortest non-clashing candidate; basic columns hold their leaf names and longer-candidate fields advance around them. FILTER_PARAMS refs auto-rebind when a lookup-key dim is renamed.

| Before | After |
| --- | --- |
| props_color | color |
| commerce_products_id | id |
| commerce_products_type | type |
| basic lat + location.lat | lat, location_lat |
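
The resolver's behavior can be sketched as a two-pass claim over a set of taken names. This is an assumed shape, not the actual implementation: the function name `resolveNames`, the `{ path, candidates, basic }` field records, and the numeric-suffix fallback are all hypothetical.

```javascript
// Hypothetical sketch of a shortest-unique resolver. Field records are
// assumed to look like { path, candidates: [leaf, ..., fullyQualified], basic }.
function resolveNames(fields) {
  const taken = new Set();
  const resolved = new Map();

  // Pass 1: basic columns hold their leaf name (the first candidate).
  for (const field of fields.filter((f) => f.basic)) {
    const name = field.candidates[0];
    taken.add(name);
    resolved.set(field.path, name);
  }

  // Pass 2: every other field takes its shortest candidate still free.
  for (const field of fields.filter((f) => !f.basic)) {
    let name = field.candidates.find((c) => !taken.has(c));
    if (name === undefined) {
      // Assumed fallback: suffix the fully-qualified candidate.
      const base = field.candidates[field.candidates.length - 1];
      let n = 2;
      while (taken.has(`${base}_${n}`)) n += 1;
      name = `${base}_${n}`;
    }
    taken.add(name);
    resolved.set(field.path, name);
  }
  return resolved;
}
```

Claiming basic columns first is what makes `location.lat` advance to `location_lat` when a plain `lat` column exists, regardless of the order in which fields are listed.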

Value counting fix for Nested(...) parallel arrays

Profiler's arrayColumnSql switched from uniq() (counts whole-array values) to uniqArray() + arrayFilter (counts elements). value_rows now requires at least one meaningful element — string non-empty, number non-null/non-zero, other non-null — so an Array(Nullable(Bool)) of all NULL no longer reads as 100% populated just because the parallel array has length 1. Numeric arrays also emit minArray/maxArray so the all-zero skip rule works for them.
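
The counting rule can be modelled in memory as follows. This is not the profiler's SQL: it is a plain JavaScript sketch of the semantics described above, with hypothetical names (`isMeaningful`, `profileArrayColumn`), and it assumes the distinct count is taken over the same "meaningful" elements that drive value_rows.

```javascript
// In-memory model of the described semantics; the real profiler runs
// ClickHouse SQL. Function names here are hypothetical.
function isMeaningful(value) {
  if (value === null || value === undefined) return false;
  if (typeof value === "string") return value !== "";
  if (typeof value === "number") return value !== 0;
  return true; // booleans and other types: non-null is enough
}

// rows: one array per table row, e.g. the values of an Array(...) column.
function profileArrayColumn(rows) {
  const distinct = new Set();
  let valueRows = 0;
  for (const arr of rows) {
    const meaningful = arr.filter(isMeaningful);
    if (meaningful.length > 0) valueRows += 1; // row has real content
    for (const v of meaningful) distinct.add(v); // element-level cardinality
  }
  return { uniqueValues: distinct.size, valueRows };
}
```

Under this model a column whose every row is `[null]` reports zero value rows, instead of looking fully populated because each array has length 1.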

Auto-skip cascade

cubeBuilder.processColumns now skips:

  • STRING/UUID with uniqueValues===1 and lc_values[0] empty / whitespace / '0' / 0
  • NUMBER with min === max === 0
  • BOOLEAN/Int8 with min === max OR uniqueValues === 1 (catches the customer_facing Int8 always-zero case)

Same cascade mirrored into the nested-AJ children path.
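
The three rules read naturally as a single predicate. The sketch below is an assumption about shape only: `shouldSkip`, the profile record fields, and the type labels (`"BOOLEAN"`, `"INT8"`) are hypothetical stand-ins for whatever cubeBuilder.processColumns actually receives.

```javascript
// Hypothetical predicate mirroring the skip rules listed above. The column
// profile shape { type, uniqueValues, min, max, lc_values } is assumed.
function shouldSkip(col) {
  const { type, uniqueValues, min, max, lc_values: lcValues = [] } = col;

  if (type === "STRING" || type === "UUID") {
    if (uniqueValues !== 1) return false;
    const only = lcValues[0];
    if (only === 0) return true;
    // Empty, whitespace-only, or the literal string '0'.
    return typeof only === "string" && (only.trim() === "" || only === "0");
  }

  if (type === "NUMBER") return min === 0 && max === 0;

  if (type === "BOOLEAN" || type === "INT8") {
    // Constant flag: either the range collapses or only one value was seen.
    return (min !== undefined && min === max) || uniqueValues === 1;
  }

  return false;
}
```

For example, `shouldSkip({ type: "INT8", min: 0, max: 0 })` returns `true`, which is the customer_facing "always zero" case mentioned above.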

No auto pre-aggregations

buildRawCube and the nested-AJ path no longer emit daily_rollup / monthly_rollup. Heuristic rollups bloat CubeStore and surprise users with hidden refresh schedules. User-added pre-aggs in existing models are preserved through merge.
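
One way to picture the merge behavior, assuming a hypothetical `mergeCube` helper and plain-object cube definitions (the real merge logic is not shown in this PR): the generated cube never carries pre-aggregations of its own, and any that already exist in the user's model are carried over unchanged.

```javascript
// Hypothetical merge step. `pre_aggregations` is the key Cube uses in YAML
// models; everything else about this function is an assumption.
function mergeCube(existing, generated) {
  const merged = { ...generated };
  // The generator no longer emits rollups, so drop any it might carry.
  delete merged.pre_aggregations;
  // Preserve whatever the user added to the existing model.
  if (
    existing &&
    Array.isArray(existing.pre_aggregations) &&
    existing.pre_aggregations.length > 0
  ) {
    merged.pre_aggregations = existing.pre_aggregations;
  }
  return merged;
}
```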

Pre-existing test fixes (was 4 baseline failures)

  • package.json test script: --experimental-test-module-mocks so mock.module() works on Node 22.12 (provisionFraiOS suite).
  • Legacy ARRAY JOIN path: surface user-supplied alias as a dimension on the flattened cube (collision-safe).
  • buildWhereClause test: assert current "no allowlist ⇒ all tables internal" semantics.
  • profileTable emitter test: assert current step names (init / initial_profile / profiling).

Test plan

  • node --experimental-test-module-mocks --test 'src/**/__tests__/*.test.js' → 511 / 511 passing
  • Smoke test in dev: profile a table with Nested(...) columns, generate; confirm shorter field names + .yml output
  • Smoke test: filter by event = 'X'; confirm derived cube/file name
  • Smoke test: nested-array lookup-key cube; confirm lambda v: v in YAML and queries still resolve
  • Verify existing .js smart-gen models keep their .js extension on re-run

Companion PR

Frontend mirror of the skip rules in step-2 auto-select: smartdataHQ/client-v2#feat/smart-gen-field-selection

🤖 Generated with Claude Code

… no auto pre-aggs

Cube generation now produces noticeably cleaner models with fewer surprises.

Output format
- Default to YAML (.yml). Falls back to JS only when a cube contains a
  FILTER_PARAMS arrow callback we cannot translate. Identity arrows
  ((v) => v) are auto-transpiled to Python lambda (lambda v: v) so YAML
  works for the nested-lookup-key path too.
- Filename: explicit file_name wins; otherwise reuse the existing
  model's extension when re-running; otherwise .yml.

Cube naming from filters
- When generating with a flat filter like event = 'Stockout Ended' and
  no explicit cube/file name, derive the cube + file name from the
  filter values (stockout_ended.yml). Re-running the same filter set
  updates the same file.

Field naming (shortest-unique resolver)
- Each field carries an ordered candidate list from leaf to fully-
  qualified. A new resolver picks the shortest candidate that does not
  collide. Examples:
    props.color (map key)        → color (was props_color)
    commerce.products.id         → id    (was commerce_products_id)
    commerce.products.entry_type → type  (was commerce_products_type)
- Nested-AJ flattened cubes route around already-claimed names too.
- FILTER_PARAMS refs auto-rebind when a lookup-key dim is renamed.

Value counting fixes for nested / parallel-array Nested(...) columns
- profiler.arrayColumnSql now uses uniqArray + arrayFilter so distinct-
  count reflects element-level cardinality, not the count of distinct
  whole arrays. value_rows requires at least one meaningful element
  (string: non-null + non-empty; number: non-null + non-zero; other:
  non-null) so an Array(Nullable(Bool)) of all-NULL no longer reads as
  "100% populated" just because it inherits the parallel array length.
- Numeric arrays also emit minArray/maxArray for the all-zero skip rule.

Auto-skip cascade in cubeBuilder.processColumns
- STRING/UUID with uniqueValues===1 and lc_values being '', whitespace,
  '0' or 0 → skip
- NUMBER with min===max===0 → skip
- BOOLEAN/Int8 with min===max OR uniqueValues===1 → skip
  (catches the customer_facing Int8 'always 0' case)
- Mirrored into the nested-AJ children path.

No auto pre-aggregations
- buildRawCube and the nested-AJ path no longer emit daily_rollup /
  monthly_rollup. Heuristic rollups bloat CubeStore with unused
  materializations and surprise users with hidden refresh schedules.
  Users add pre-aggs explicitly when they understand query patterns.
  User-added pre-aggs in existing models are preserved through merge.

Pre-existing test cleanup
- package.json test: add --experimental-test-module-mocks so
  mock.module() works on Node 22.12 (provisionFraiOS suite).
- Legacy ARRAY JOIN path: surface the user-supplied alias as a
  dimension on the flattened cube (collision-safe).
- buildWhereClause test: assert current 'no allowlist ⇒ all tables
  internal' semantics.
- profileTable emitter test: assert current step names ('init',
  'initial_profile', 'profiling').

Tests: 511 / 511 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@acmeguy acmeguy force-pushed the feat/cubejs-smart-gen-improvements branch from e381466 to 601e32b May 10, 2026 13:17
@acmeguy acmeguy merged commit 3a8784d into main May 10, 2026
3 checks passed
@acmeguy acmeguy deleted the feat/cubejs-smart-gen-improvements branch May 10, 2026 13:18