fix(valuediff): normalise column case before adapter.quote() for Snowflake (DRC-3464)#1388
fix(valuediff): normalise column case before adapter.quote() for Snowflake (DRC-3464)#1388iamcxa wants to merge 3 commits into
Conversation
…flake
On Snowflake with default quoting config, physical columns are stored uppercase
(e.g. CUSTOMER_ID). user-supplied check params follow the dbt manifest
convention (lowercase, e.g. customer_id). Passing them directly to
adapter.quote() produces '"customer_id"' — a case-sensitive identifier that
Snowflake cannot resolve, raising:
SQL compilation error: invalid identifier '"customer_id"'
Fix: add ValueDiffMixin._build_column_case_lookup() that calls get_columns()
and builds a {lower(physical_name): physical_name} lookup. Before any
adapter.quote() call in ValueDiffTask._query_value_diff and
ValueDiffDetailTask._query_value_diff, each user-supplied identifier is mapped
through the lookup so adapter.quote() receives the physical name ("CUSTOMER_ID")
and produces a valid quoted identifier.
Pass-through for identifiers not found in the catalog preserves existing
behaviour for unknown columns rather than masking errors. Mirrors the pattern
profile_diff already uses via get_columns().
Regression tests: test_value_diff_snowflake_uppercase_columns and
test_value_diff_detail_snowflake_uppercase_columns patch get_columns() to return
uppercase names (simulating Snowflake) and assert the generated SQL contains
'"CUSTOMER_ID"' and NOT '"customer_id"'.
Closes DRC-3464.
Signed-off-by: Kent <kentchen@reccehq.com>
Signed-off-by: Kent <iamcxa@gmail.com>
…y_key (DRC-3464 cycle 1) When primary_key is a list, execute() now builds the column-case lookup and normalises keys BEFORE calling _verify_primary_key, so the composite path's adapter.quote() calls receive physical catalog names (e.g. CUSTOMER_ID on Snowflake) instead of user-supplied lowercase identifiers. Also adds regression tests test_value_diff_snowflake_composite_primary_key and test_value_diff_detail_snowflake_composite_primary_key that patch dbt_adapter.adapter.execute (the low-level adapter used by _verify_primary_key) to capture and assert the generated SQL uses uppercase quoted identifiers. PRE-FIX: both tests FAIL (sql contains lowercase quoted identifiers) POST-FIX: both tests PASS (sql contains uppercase quoted identifiers) Signed-off-by: Kent <iamcxa@gmail.com>
There was a problem hiding this comment.
Pull request overview
Fixes Snowflake value_diff failures caused by quoting lowercase identifiers (from dbt manifest conventions) against physically uppercase columns, by normalizing user-supplied column identifiers to catalog/physical case before adapter.quote() is applied.
Changes:
- Added a catalog-driven
{lower(physical): physical}lookup helper and identifier normalization inValueDiffTaskandValueDiffDetailTask. - Normalized
primary_keybefore_verify_primary_key()to fix the composite primary-key quoting path. - Added Snowflake-focused regression tests that patch
get_columns()to return uppercase columns and assert generated SQL uses uppercase quoted identifiers.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
recce/tasks/valuediff.py |
Adds catalog case lookup + normalizes PK/columns before quoting to avoid Snowflake invalid identifier errors. |
tests/tasks/test_valuediff.py |
Adds unit tests for the lookup/normalization helpers and Snowflake regression tests capturing executed SQL. |
Comments suppressed due to low confidence (1)
recce/tasks/valuediff.py:431
- As in
ValueDiffTask._query_value_diff,case_lookupis built unconditionally (twoget_columns()macro executions) even whencolumnsis None and the function already callsget_columns()again to compute common columns. Since value_diff can run frequently andget_columns()may hit INFORMATION_SCHEMA, consider avoiding redundant catalog calls by building the lookup only when normalising user-supplied identifiers, or reusing the already-fetched column lists to construct the lookup.
# Normalise user-supplied identifiers to physical catalog names before quoting.
# On Snowflake (default quoting config), physical columns are stored uppercase;
# adapter.quote("customer_id") → '"customer_id"' which Snowflake cannot resolve.
case_lookup = self._build_column_case_lookup(dbt_adapter, model)
if columns is None or len(columns) == 0:
base_columns = [column.column for column in dbt_adapter.get_columns(model, base=True)]
curr_columns = [column.column for column in dbt_adapter.get_columns(model, base=False)]
columns = [column for column in base_columns if column in curr_columns]
else:
columns = [self._normalise_identifier(c, case_lookup) for c in columns]
| # Build a {lower(physical_name): physical_name} lookup from the catalog so that | ||
| # user-supplied lowercase identifiers (e.g. "customer_id") are mapped to the | ||
| # physical name the database actually stores (e.g. "CUSTOMER_ID" on Snowflake). | ||
| # This must happen before get_columns() is called for the columns-list path so | ||
| # that the lookup covers both branches uniformly. | ||
| case_lookup = self._build_column_case_lookup(dbt_adapter, model) | ||
|
|
||
| if columns is None or len(columns) == 0: | ||
| base_columns = [column.column for column in dbt_adapter.get_columns(model, base=True)] | ||
| curr_columns = [column.column for column in dbt_adapter.get_columns(model, base=False)] | ||
| columns = [column for column in base_columns if column in curr_columns] | ||
| else: |
There was a problem hiding this comment.
Fixed in bc1fcb0: added optional case_lookup parameter to both _query_value_diff() methods; execute() now builds the lookup once and passes it down, reducing get_columns() calls from 4× to 2× per value_diff run.
| # _query_value_diff rebuilds case_lookup internally; passing the already- | ||
| # normalised primary_key here is a defensive duplicate that keeps both | ||
| # paths consistent and avoids a second get_columns() round-trip on the | ||
| # happy path (the internal rebuild still runs for the columns list). |
There was a problem hiding this comment.
Fixed in bc1fcb0: the misleading comment at line 346 has been removed. The comment claimed the internal rebuild was still running; now that case_lookup is passed through from execute(), the comment is no longer needed and was deleted.
Codecov Report❌ Patch coverage is
... and 5 files with indirect coverage changes 🚀 New features to boost your workflow:
|
…umns() (DRC-3464 PR #1388 review) Signed-off-by: Kent <iamcxa@gmail.com>
Summary
value_diffwas failing on Snowflake withSQL compilation error: invalid identifier '"customer_id"'. The check params carry lowercase column names (from the dbt manifest convention), but Snowflake folds unquoted DDL identifiers to uppercase (CUSTOMER_ID). Callingadapter.quote("customer_id")produced the case-sensitive'"customer_id"'identifier, which Snowflake cannot resolve against its physical uppercase column.Root cause
Two failure sites in
recce/tasks/valuediff.py, both invoked fromexecute():_verify_primary_keycomposite path (whenprimary_keyis a list) — builds inline Jinja that callsadapter.quote(col)over the raw user-supplied list._query_value_diff— callsadapter.quote(field)/adapter.quote(column_to_compare)on the surrogate-key and column-comparison clauses.The scalar primary_key path of
_verify_primary_keyusesadapter.dispatch('test_unique', 'dbt')which renders{{ column_name }}verbatim (unquoted) and so passes Snowflake case-insensitively. Same pattern inValueDiffDetailTask. Other check tasks (profile_diff,histogram_diff,top_k_diff,row_count_diff,query_diff,schema_diff) are unaffected —profile_diffaccidentally avoids the bug by sourcing column names fromget_columns()(uppercase on Snowflake) rather than from check params.Fix
Add a
ValueDiffMixin._build_column_case_lookup()helper that callsdbt_adapter.get_columns()(base + current relations) and returns a{lower(physical_name): physical_name}lookup._normalise_identifier()maps a user-supplied lowercase identifier to its physical catalog case, falling back to the original string when the column is unknown (preserves current behaviour for unknown columns instead of masking errors).In
ValueDiffTask.execute()andValueDiffDetailTask.execute(), the lookup is built FIRST and theprimary_key(andcolumnslist when present) is normalised BEFORE_verify_primary_key()is called — so both_verify_primary_key's composite path AND_query_value_diff's adapter.quote() calls receive the physical catalog case. Mirrors theget_columns()pattern already used byprofile_diff(recce/tasks/profile.py:222).Tests
recce/tests/tasks/test_valuediff.py:TestBuildColumnCaseLookup— 4 unit tests for the new helper (case-insensitive lookup, base+current relation merge, unknown-identifier passthrough, exception fallback).test_value_diff_snowflake_uppercase_columns+test_value_diff_detail_snowflake_uppercase_columns— scalar primary_key on Snowflake: mockget_columns()returns uppercase columns; capture executed SQL; assert"CUSTOMER_ID"present,"customer_id"absent.test_value_diff_snowflake_composite_primary_key+test_value_diff_detail_snowflake_composite_primary_key— composite primary_key on Snowflake; assert the SQL executed by_verify_primary_key's composite path contains uppercase quoted identifiers.Pre-fix rigor (rebuild verified independently):
valuediff.pyto before either fix commit → all 4 Snowflake regression tests FAIL with verbatimAssertionError: '"CUSTOMER_ID"' in 'with a_query as (select "customer_id", ...)'.Full tasks suite (
pytest tests/tasks/ -q): 116 passed, 1 pre-existing portalocker warning (unrelated).Audit trail
Investigated through DataRecce's internal
debug-flywheelworkflow as case 019. Stages: reproduce → discover → hypothesize → verify → fix-and-harden (cycle 1 after codex review caught the composite path gap on the first fix attempt).Linear: DRC-3464