Skip to content

fix(snowflake): recover variant schema when a BLOCK sample draws no rows#2850

Open
jswir wants to merge 2 commits into
malloydata:mainfrom
jswir:fix/snowflake-variant-block-sample-empty
Open

fix(snowflake): recover variant schema when a BLOCK sample draws no rows#2850
jswir wants to merge 2 commits into
malloydata:mainfrom
jswir:fix/snowflake-variant-block-sample-empty

Conversation

@jswir
Copy link
Copy Markdown
Contributor

@jswir jswir commented May 30, 2026

Problem

Snowflake VARIANT/ARRAY/OBJECT columns have no scalar type in metadata, so the connection samples data to infer one. For a base table above the full-scan byte threshold, the tablesample-only path samples with:

... FROM <table> TABLESAMPLE BLOCK (1) LIMIT <n>

TABLESAMPLE BLOCK (p) is block (micro-partition) sampling: each micro-partition is included independently with probability p%. On a table with relatively few micro-partitions, BLOCK (1) frequently selects zero partitions and returns no rows. The code treated that empty draw as final and fell through to opaqueVariantType(), typing the whole column as sql native variant. Because BLOCK is unseeded, this is non-deterministic — the same table resolves to a concrete type on one schema fetch and to variant on the next.

Downstream, any count(col), join on col = ..., etc. then fails to compile:

Unsupported SQL native type 'variant' not allowed in expression

…so a model compiles and runs fine in one environment and fails to compile in another (e.g. on publish) with no change to the model.

Measured on a real table (9.77M rows, uniformly INTEGER VARIANT id column)

  • SYSTEM$CLUSTERING_INFORMATION152 micro-partitions (~65,536 rows each — Snowflake's per-partition cap).
  • TABLESAMPLE BLOCK (1) returns either 0 rows or whole partitions (65,536, 131,072, …) — never a small non-zero count, so LIMIT 1000 is irrelevant to the failure.
  • Empty draws observed in 10 / 60 trials (~17%), matching the theoretical 0.99^152 ≈ 22%.
  • A plain read yields clean (col, INTEGER) evidence → would infer number. The column is perfectly typeable; only the sampling miss degrades it.

It's feast-or-famine: any non-empty draw already over-delivers evidence; the only failure mode is the empty draw.

Fix

Escalate the BLOCK percentage (1 → 10 → 100) until a draw returns rows, instead of accepting the first empty result. The logic is extracted into a pure, unit-tested sampleVariantBlocks helper, mirroring the existing pickSampleStrategy.

Two properties make this safe and correct.

1. The escalation structurally can't run away — an empty draw is a measurement, not just a miss. With M micro-partitions, P(empty) = (1 − p/100)^M, so each empty draw bounds the table:

an empty draw at proves the table is no bigger than
BLOCK(1) ~460 partitions (~7 GB)
BLOCK(10) ~45 partitions (~0.7 GB)
BLOCK(100) 0 rows — it's genuinely empty

So you only escalate on a table a prior failure has already proven is small: by the time BLOCK(100) could run, an empty BLOCK(10) has bounded the table to a few tens of partitions. A petabyte table (millions of partitions) can't produce the empty BLOCK(1) that starts the chain. Each rung is still bounded by the existing LIMIT, and because one selected partition already over-fills that limit, the extra rungs cost at most ~one partition's worth of scan. The safety has to come from here, not from LIMITLIMIT doesn't bound the scan on a large partitioned table, which is the whole premise of this path.

Terminating at 100 (not 50) matters. This path runs on tables just over the byte threshold, often a handful of partitions, where BLOCK(50) still draws empty an appreciable fraction of the time (0.5^M — e.g. ~12% at 3 partitions). Stopping at 50 only makes the bug rarer; at 100 the only empty draw is a genuinely empty table.

2. An empty draw and an error/timeout are different signals. tryBatch returns [] on a genuinely empty draw and undefined on any exception (including a 15s timeout). Only [] escalates; undefined stops. A timeout carries none of the "table is small" evidence, and higher rungs scan more — so escalating after a timeout would just buy more timeouts before the same variant.

Contract: a column degrades to variant only from a genuinely empty table (empty at BLOCK(100)) or a hard failure — never from an unlucky draw.

Testing

New unit tests in snowflake_sample_strategy.spec.ts (no DB):

  • first non-empty draw stops early (no escalation);
  • escalates past an empty draw until one returns rows;
  • stops on an errored (undefined) draw without escalating;
  • escalates past an empty draw but stops at a later error;
  • returns undefined only when every percentage comes back empty;
  • honors a custom percentage ladder, in order.

npx jest packages/malloy-db-snowflake/src/snowflake_sample_strategy.spec.ts → 25 passed.

Why escalation, rather than SAMPLE ROW / SEED

  • SAMPLE ROW (p) is per-row Bernoulli sampling — it never draws empty at meaningful sizes, but it rolls the dice per row, which forces a full-table scan. tablesample-only exists precisely to avoid that scan on a large table; switching to ROW sampling here would reintroduce the exact cost this path was built to dodge. BLOCK reads only the selected micro-partitions, and escalation preserves that cheap-I/O property for the common case.
  • SEED (n) makes BLOCK deterministic but adds no evidence — a seeded empty draw is still empty, so the column would still degrade to variant, just stably. Determinism without escalation only makes the wrong answer reproducible.

A deeper improvement would distinguish "sampled, saw nothing" from "sampled, saw conflicting types" in the inference layer so a miss never silently becomes variant; this PR fixes the empty-draw cause without changing that contract.

@mtoy-googly-moogly
Copy link
Copy Markdown
Collaborator

Thanks for this — the diagnosis is excellent: the empty draw traced to BLOCK's per-partition semantics, measured against the real partition count, and the non-determinism (unseeded BLOCK) are all dead right.

A few things to turn it into a fix we'd land:

1. Terminate the ladder at 100 ([1, 10, 100]). Stopping at 50 only makes the bug rarer — this path runs on tables just over the byte threshold, often a handful of partitions, where BLOCK(50) still draws empty ~12% of the time.

100 is safe to add because an empty draw is a measurement, not just a miss. P(empty) = (1 − p/100)^M, so each failure bounds the table:

an empty draw at proves the table is no bigger than
BLOCK(1) ~460 partitions (~7 GB)
BLOCK(10) ~45 partitions (~0.7 GB)
BLOCK(100) 0 rows — it's genuinely empty

So you only escalate on a table a prior failure has already proven is small; by the time BLOCK(100) could run, an empty BLOCK(10) has bounded it to a few tens of partitions. A petabyte table (millions of partitions) can't produce the empty BLOCK(1) that starts the chain. The escalation structurally can't run away — which matters precisely because LIMIT doesn't bound the scan on a large partitioned table (the premise of this whole path), so the safety has to come from here, not from LIMIT.

2. Escalate only on a genuinely empty draw, not on error/timeout. tryBatch returns undefined on any exception including a timeout, currently treated the same as []. A timeout carries none of the "table is small" evidence, and higher rungs scan more — so escalating after a 15s timeout just buys two more timeouts before the same variant. Split [] (→ escalate) from undefined (→ stop).

3. Lock the contract: a column resolves to variant only from a genuinely-empty table (BLOCK(100) empty) or a hard failure — never an unlucky draw. Worth a sentence in the comment.

4. The SAMPLE ROW / SEED alternatives in Notes: make the concrete case for why one's better — against BLOCK existing specifically to avoid the full scan row-sampling incurs — or drop them.

5. Bring the SampleStrategy doc-comment up to date. There's no CONTEXT.md/README in this package, so that comment is the doc for how sampling works — and it still says tablesample-only will "degrade to variant [rather] than issue a runaway query," which is the behavior you're replacing. Give it a pass so it's correct, complete, and free of filler — a tight, accurate description of the escalation a maintainer would write.

6. Trim the comments to final-PR length. The paragraph-length rationale on sampleVariantBlocks is great for this thread, but in committed code it's more than the next reader needs. Keep the load-bearing why — BLOCK decides per-partition, so a low % can draw empty; escalate because an empty draw itself proves the table is small enough to sample harder — and move the full derivation (the probabilities, the worst-case sizing) into the commit message / PR description, where it lives better. Same for any other comment that's grown past a few lines.

The helper structure and tests are right; they just need the []-vs-undefined split and the new terminal rung.

TABLESAMPLE BLOCK (p) decides per micro-partition, so a low percentage on
a table with relatively few partitions can select zero partitions and
return no rows. The tablesample-only schema path treated that empty draw
as final, degrading every VARIANT/ARRAY/OBJECT column to opaque `sql
native`. Because BLOCK is unseeded this is non-deterministic: the same
table can resolve to a concrete type on one schema fetch and to variant on
the next, which then makes count()/joins on that column fail to compile
("Unsupported SQL native type 'variant' not allowed in expression").

Escalate the BLOCK percentage (1 -> 10 -> 50) until a draw returns
evidence. Each draw is still bounded by the existing LIMIT, and one
selected partition already over-fills it, so the extra attempts cost at
most about one partition's worth of scan. Logic is extracted into a pure,
unit-tested sampleVariantBlocks helper alongside pickSampleStrategy.

Signed-off-by: James Swirhun <james@credibledata.com>
@jswir jswir force-pushed the fix/snowflake-variant-block-sample-empty branch from dd12af7 to d760039 Compare May 31, 2026 00:17
- Terminate the escalation ladder at BLOCK(100) (was 50): an empty draw
  bounds the table size, so the only empty draw at 100% is a genuinely
  empty table — 50 just made the bug rarer.
- Escalate only on a genuinely empty draw ([]); stop on error/timeout
  (undefined). A timeout carries no 'table is small' evidence and higher
  percentages only scan more, so escalating just buys more timeouts.
- State the degrade-to-variant contract in the doc-comment.
- Refresh the SampleStrategy doc-comment (was 'degrade to variant rather
  than runaway') and trim the inline rationale to the load-bearing why.

Signed-off-by: James Swirhun <james@credibledata.com>
@jswir jswir force-pushed the fix/snowflake-variant-block-sample-empty branch from d760039 to ddead2a Compare May 31, 2026 00:19
@jswir jswir marked this pull request as ready for review May 31, 2026 02:08
@jswir
Copy link
Copy Markdown
Contributor Author

jswir commented May 31, 2026

thank you for the review - makes sense! Worked through these with claude and addressed all six in the latest commit:

  1. Ladder terminates at 100 ([1, 10, 100]) instead of 50

  2. [] vs undefined split. Confirmed the mechanism in source first: _execute resolves rows ?? [] on success (snowflake_executor.ts:207, covering Snowflake's "completed with no rows" case) and only tryBatch's catch produces undefined, so an empty-but-successful draw can't be confused with a failure. So [] → escalate, undefined → stop. A timeout carries no "table is small" evidence and higher rungs only scan more, so escalating after one just buys more timeouts. New tests pin both: stops on an errored (undefined) draw without escalating, and escalates past an empty draw but stops at a later error.

  3. Added "a column degrades to variant only from a genuinely empty table (empty even at BLOCK (100)) or a hard failure — never an unlucky draw." in comments to clarify the contract.

  4. SAMPLE ROW / SEED — updated to be more specific about why to use block => more efficient than ROW by focusing random sampling on partitions.

  5. SampleStrategy doc-comment updated to describe the escalation and the contract.

  6. Comments trimmed

Also rebased onto current main and fixed a missing DCO sign-off. Tests: 25 passing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants