fix(snowflake): recover variant schema when a BLOCK sample draws no rows #2850
jswir wants to merge 2 commits into
Thanks for this. The diagnosis is excellent: the empty draw traced to

A few things to turn it into a fix we'd land:

1. Terminate the ladder at 100 ( 100 is safe to add because an empty draw is a measurement, not just a miss. So you only escalate on a table a prior failure has already proven is small; by the time
2. Escalate only on a genuinely empty draw, not on error/timeout.
3. Lock the contract: a column resolves to
4. The
5. Bring the
6. Trim the comments to final-PR length. The paragraph-length rationale on

The helper structure and tests are right; they just need the
TABLESAMPLE BLOCK (p) decides per micro-partition, so a low percentage on
a table with relatively few partitions can select zero partitions and
return no rows. The tablesample-only schema path treated that empty draw
as final, degrading every VARIANT/ARRAY/OBJECT column to opaque
`sql native`. Because BLOCK is unseeded this is non-deterministic: the same
table can resolve to a concrete type on one schema fetch and to variant on
the next, which then makes count()/joins on that column fail to compile
("Unsupported SQL native type 'variant' not allowed in expression").
Escalate the BLOCK percentage (1 -> 10 -> 50) until a draw returns
evidence. Each draw is still bounded by the existing LIMIT, and one
selected partition already over-fills it, so the extra attempts cost at
most about one partition's worth of scan. Logic is extracted into a pure,
unit-tested sampleVariantBlocks helper alongside pickSampleStrategy.
Signed-off-by: James Swirhun <james@credibledata.com>
dd12af7 to
d760039
- Terminate the escalation ladder at BLOCK(100) (was 50): an empty draw bounds the table size, so the only empty draw at 100% is a genuinely empty table; 50 just made the bug rarer.
- Escalate only on a genuinely empty draw ([]); stop on error/timeout (undefined). A timeout carries no 'table is small' evidence and higher percentages only scan more, so escalating just buys more timeouts.
- State the degrade-to-variant contract in the doc-comment.
- Refresh the SampleStrategy doc-comment (was 'degrade to variant rather than runaway') and trim the inline rationale to the load-bearing why.

Signed-off-by: James Swirhun <james@credibledata.com>
d760039 to
ddead2a
Thank you for the review, makes sense! Worked through these with Claude and addressed all six in the latest commit:
Also rebased onto current main and fixed a missing DCO sign-off. Tests: 25 passing.
Problem
Snowflake `VARIANT`/`ARRAY`/`OBJECT` columns have no scalar type in metadata, so the connection samples data to infer one. For a base table above the full-scan byte threshold, the `tablesample-only` path samples with:

`TABLESAMPLE BLOCK (p)` is block (micro-partition) sampling: each micro-partition is included independently with probability `p%`. On a table with relatively few micro-partitions, `BLOCK (1)` frequently selects zero partitions and returns no rows. The code treated that empty draw as final and fell through to `opaqueVariantType()`, typing the whole column as `sql native` variant. Because `BLOCK` is unseeded, this is non-deterministic: the same table resolves to a concrete type on one schema fetch and to `variant` on the next.

Downstream, any `count(col)`, join `on col = ...`, etc. then fails to compile:

…so a model compiles and runs fine in one environment and fails to compile in another (e.g. on publish) with no change to the model.
Measured on a real table (9.77M rows, uniformly INTEGER `VARIANT` id column)

- `SYSTEM$CLUSTERING_INFORMATION` → 152 micro-partitions (~65,536 rows each, Snowflake's per-partition cap).
- `TABLESAMPLE BLOCK (1)` returns either 0 rows or whole partitions (65,536, 131,072, …), never a small non-zero count, so `LIMIT 1000` is irrelevant to the failure.
- Probability of an empty draw at `BLOCK (1)`: `0.99^152 ≈ 22%`.
- A non-empty draw yields `(col, INTEGER)` evidence → would infer `number`. The column is perfectly typeable; only the sampling miss degrades it.

It's feast-or-famine: any non-empty draw already over-delivers evidence; the only failure mode is the empty draw.
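The 22% figure can be reproduced with a few lines of TypeScript. This is only an illustrative sketch: `probEmptyDraw` is a hypothetical helper name, not something from the PR, and it assumes each micro-partition is included independently with probability `p/100`, as described above.

```typescript
// Hypothetical helper (not part of the PR): probability that BLOCK (percent)
// selects none of `partitions` micro-partitions, assuming each partition is
// included independently with probability percent / 100.
function probEmptyDraw(percent: number, partitions: number): number {
  return Math.pow(1 - percent / 100, partitions);
}

// The measured table: 152 micro-partitions sampled with BLOCK (1).
const pEmpty = probEmptyDraw(1, 152);
console.log(`P(empty draw) = ${(pEmpty * 100).toFixed(1)}%`); // roughly 22%
```

Under that independence assumption, roughly one schema fetch in five on this table would see no rows at all.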
Fix
Escalate the `BLOCK` percentage (1 → 10 → 100) until a draw returns rows, instead of accepting the first empty result. The logic is extracted into a pure, unit-tested `sampleVariantBlocks` helper, mirroring the existing `pickSampleStrategy`.

Two properties make this safe and correct.
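As a rough illustration of the escalation loop, here is a minimal sketch. It is not the PR's `sampleVariantBlocks` implementation: the name `escalateBlockSample`, the `Row` and `Draw` types, and the synchronous draw callback are all assumptions made for brevity (a real draw would run a query asynchronously). It assumes, per the description, that a draw returns `[]` when it sees no rows and `undefined` on error or timeout.

```typescript
// Hypothetical types for illustration only.
type Row = Record<string, unknown>;
// A draw returns rows, [] for a genuinely empty draw, or undefined on error/timeout.
type Draw = (percent: number) => Row[] | undefined;

const LADDER: readonly number[] = [1, 10, 100];

// Hypothetical sketch: escalate the BLOCK percentage until a draw returns rows.
function escalateBlockSample(
  draw: Draw,
  ladder: readonly number[] = LADDER
): Row[] | undefined {
  for (const percent of ladder) {
    const rows = draw(percent);
    if (rows === undefined) return undefined; // error/timeout: stop, do not escalate
    if (rows.length > 0) return rows; // evidence found: stop
    // Empty draw: the table is provably small, so try the next rung.
  }
  return undefined; // empty even at the last rung
}

// Usage with a fake draw that is empty at 1% and returns one row from 10% up.
const attempts: number[] = [];
const sampled = escalateBlockSample(percent => {
  attempts.push(percent);
  return percent < 10 ? [] : [{id: 1}];
});
console.log(attempts, sampled?.length);
```

Passing the draw in as a callback keeps the loop free of database access, which is what makes this kind of helper testable without a connection.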
1. The escalation structurally can't run away: an empty draw is a measurement, not just a miss. With `M` micro-partitions, `P(empty) = (1 − p/100)^M`, so each empty draw bounds the table:

`BLOCK(1)`, `BLOCK(10)`, `BLOCK(100)`

So you only escalate on a table a prior failure has already proven is small: by the time `BLOCK(100)` could run, an empty `BLOCK(10)` has bounded the table to a few tens of partitions. A petabyte table (millions of partitions) can't produce the empty `BLOCK(1)` that starts the chain. Each rung is still bounded by the existing `LIMIT`, and because one selected partition already over-fills that limit, the extra rungs cost at most ~one partition's worth of scan. The safety has to come from here, not from `LIMIT`: `LIMIT` doesn't bound the scan on a large partitioned table, which is the whole premise of this path.

Terminating at 100 (not 50) matters. This path runs on tables just over the byte threshold, often a handful of partitions, where `BLOCK(50)` still draws empty an appreciable fraction of the time (`0.5^M`, e.g. ~12% at 3 partitions). Stopping at 50 only makes the bug rarer; at 100 the only empty draw is a genuinely empty table.

2. An empty draw and an error/timeout are different signals.
`tryBatch` returns `[]` on a genuinely empty draw and `undefined` on any exception (including a 15s timeout). Only `[]` escalates; `undefined` stops. A timeout carries none of the "table is small" evidence, and higher rungs scan more, so escalating after a timeout would just buy more timeouts before the same `variant`.

Contract: a column degrades to `variant` only from a genuinely empty table (empty at `BLOCK(100)`) or a hard failure, never from an unlucky draw.

Testing
New unit tests in `snowflake_sample_strategy.spec.ts` (no DB):

- stops on an error/timeout (`undefined`) draw without escalating;
- returns `undefined` only when every percentage comes back empty;

`npx jest packages/malloy-db-snowflake/src/snowflake_sample_strategy.spec.ts` → 25 passed.

Why escalation, rather than `SAMPLE ROW` / `SEED`

`SAMPLE ROW (p)` is per-row Bernoulli sampling: it never draws empty at meaningful sizes, but it rolls the dice per row, which forces a full-table scan. `tablesample-only` exists precisely to avoid that scan on a large table; switching to ROW sampling here would reintroduce the exact cost this path was built to dodge. BLOCK reads only the selected micro-partitions, and escalation preserves that cheap-I/O property for the common case.

`SEED (n)` makes BLOCK deterministic but adds no evidence: a seeded empty draw is still empty, so the column would still degrade to `variant`, just stably. Determinism without escalation only makes the wrong answer reproducible.

A deeper improvement would distinguish "sampled, saw nothing" from "sampled, saw conflicting types" in the inference layer so a miss never silently becomes `variant`; this PR fixes the empty-draw cause without changing that contract.
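The "empty draw bounds the table" argument from the Fix section can be made concrete by inverting `P(empty) = (1 − p/100)^M`. The sketch below is illustrative only: `maxPartitionsAfterEmptyDraw` is a hypothetical name, and the 5% likelihood cutoff is an arbitrary assumption chosen for the example, not a value from the PR.

```typescript
// Hypothetical helper (not part of the PR): the largest partition count M for
// which an empty BLOCK (percent) draw still has probability >= minLikelihood,
// assuming independent per-partition inclusion.
function maxPartitionsAfterEmptyDraw(
  percent: number,
  minLikelihood: number
): number {
  if (percent >= 100) return 0; // at 100% only an empty table draws empty
  // Solve (1 - percent/100)^M >= minLikelihood for M.
  return Math.floor(Math.log(minLikelihood) / Math.log(1 - percent / 100));
}

// Assumed cutoff: treat draws less likely than 5% as effectively ruled out.
for (const percent of [1, 10, 50, 100]) {
  const bound = maxPartitionsAfterEmptyDraw(percent, 0.05);
  console.log(`empty BLOCK(${percent}) => at most ~${bound} partitions`);
}
```

With that cutoff, an empty `BLOCK(10)` points to a table of a few tens of partitions, which is consistent with the claim that `BLOCK(100)` only ever runs on tables already shown to be small.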