Skip to content

fix(valuediff): normalise column case before adapter.quote() for Snowflake (DRC-3464)#1388

Open
iamcxa wants to merge 3 commits into
mainfrom
fix/drc-3464-value-diff-snowflake-quoting
Open

fix(valuediff): normalise column case before adapter.quote() for Snowflake (DRC-3464)#1388
iamcxa wants to merge 3 commits into
mainfrom
fix/drc-3464-value-diff-snowflake-quoting

Conversation

@iamcxa
Copy link
Copy Markdown
Contributor

@iamcxa iamcxa commented May 21, 2026

Summary

value_diff was failing on Snowflake with SQL compilation error: invalid identifier '"customer_id"'. The check params carry lowercase column names (from the dbt manifest convention), but Snowflake folds unquoted DDL identifiers to uppercase (CUSTOMER_ID). Calling adapter.quote("customer_id") produced the case-sensitive '"customer_id"' identifier, which Snowflake cannot resolve against its physical uppercase column.

Root cause

Two failure sites in recce/tasks/valuediff.py, both invoked from execute():

  1. _verify_primary_key composite path (when primary_key is a list) — builds inline Jinja that calls adapter.quote(col) over the raw user-supplied list.
  2. _query_value_diff — calls adapter.quote(field) / adapter.quote(column_to_compare) on the surrogate-key and column-comparison clauses.

The scalar primary_key path of _verify_primary_key uses adapter.dispatch('test_unique', 'dbt') which renders {{ column_name }} verbatim (unquoted) and so passes Snowflake case-insensitively. Same pattern in ValueDiffDetailTask. Other check tasks (profile_diff, histogram_diff, top_k_diff, row_count_diff, query_diff, schema_diff) are unaffected — profile_diff accidentally avoids the bug by sourcing column names from get_columns() (uppercase on Snowflake) rather than from check params.

Fix

Add a ValueDiffMixin._build_column_case_lookup() helper that calls dbt_adapter.get_columns() (base + current relations) and returns a {lower(physical_name): physical_name} lookup. _normalise_identifier() maps a user-supplied lowercase identifier to its physical catalog case, falling back to the original string when the column is unknown (preserves current behaviour for unknown columns instead of masking errors).

In ValueDiffTask.execute() and ValueDiffDetailTask.execute(), the lookup is built FIRST and the primary_key (and columns list when present) is normalised BEFORE _verify_primary_key() is called — so both _verify_primary_key's composite path AND _query_value_diff's adapter.quote() calls receive the physical catalog case. Mirrors the get_columns() pattern already used by profile_diff (recce/tasks/profile.py:222).

Tests

recce/tests/tasks/test_valuediff.py:

  • TestBuildColumnCaseLookup — 4 unit tests for the new helper (case-insensitive lookup, base+current relation merge, unknown-identifier passthrough, exception fallback).
  • test_value_diff_snowflake_uppercase_columns + test_value_diff_detail_snowflake_uppercase_columns — scalar primary_key on Snowflake: mock get_columns() returns uppercase columns; capture executed SQL; assert "CUSTOMER_ID" present, "customer_id" absent.
  • test_value_diff_snowflake_composite_primary_key + test_value_diff_detail_snowflake_composite_primary_key — composite primary_key on Snowflake; assert the SQL executed by _verify_primary_key's composite path contains uppercase quoted identifiers.

Pre-fix rigor (rebuild verified independently):

  • Revert valuediff.py to before either fix commit → all 4 Snowflake regression tests FAIL with verbatim AssertionError: '"CUSTOMER_ID"' in 'with a_query as (select "customer_id", ...)'.
  • Restore the fix → all 21 valuediff tests PASS in 4.64s.

Full tasks suite (pytest tests/tasks/ -q): 116 passed, 1 pre-existing portalocker warning (unrelated).

Audit trail

Investigated through DataRecce's internal debug-flywheel workflow as case 019. Stages: reproduce → discover → hypothesize → verify → fix-and-harden (cycle 1 after codex review caught the composite path gap on the first fix attempt).

Linear: DRC-3464

iamcxa added 2 commits May 20, 2026 14:44
…flake

On Snowflake with default quoting config, physical columns are stored uppercase
(e.g. CUSTOMER_ID).  user-supplied check params follow the dbt manifest
convention (lowercase, e.g. customer_id).  Passing them directly to
adapter.quote() produces '"customer_id"' — a case-sensitive identifier that
Snowflake cannot resolve, raising:
  SQL compilation error: invalid identifier '"customer_id"'

Fix: add ValueDiffMixin._build_column_case_lookup() that calls get_columns()
and builds a {lower(physical_name): physical_name} lookup.  Before any
adapter.quote() call in ValueDiffTask._query_value_diff and
ValueDiffDetailTask._query_value_diff, each user-supplied identifier is mapped
through the lookup so adapter.quote() receives the physical name ("CUSTOMER_ID")
and produces a valid quoted identifier.

Pass-through for identifiers not found in the catalog preserves existing
behaviour for unknown columns rather than masking errors.  Mirrors the pattern
profile_diff already uses via get_columns().

Regression tests: test_value_diff_snowflake_uppercase_columns and
test_value_diff_detail_snowflake_uppercase_columns patch get_columns() to return
uppercase names (simulating Snowflake) and assert the generated SQL contains
'"CUSTOMER_ID"' and NOT '"customer_id"'.

Closes DRC-3464.

Signed-off-by: Kent <kentchen@reccehq.com>
Signed-off-by: Kent <iamcxa@gmail.com>
…y_key (DRC-3464 cycle 1)

When primary_key is a list, execute() now builds the column-case lookup and
normalises keys BEFORE calling _verify_primary_key, so the composite path's
adapter.quote() calls receive physical catalog names (e.g. CUSTOMER_ID on
Snowflake) instead of user-supplied lowercase identifiers.

Also adds regression tests test_value_diff_snowflake_composite_primary_key and
test_value_diff_detail_snowflake_composite_primary_key that patch
dbt_adapter.adapter.execute (the low-level adapter used by _verify_primary_key)
to capture and assert the generated SQL uses uppercase quoted identifiers.

PRE-FIX: both tests FAIL (sql contains lowercase quoted identifiers)
POST-FIX: both tests PASS (sql contains uppercase quoted identifiers)
Signed-off-by: Kent <iamcxa@gmail.com>
@iamcxa iamcxa self-assigned this May 21, 2026
@iamcxa iamcxa requested a review from Copilot May 21, 2026 09:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes Snowflake value_diff failures caused by quoting lowercase identifiers (from dbt manifest conventions) against physically uppercase columns, by normalizing user-supplied column identifiers to catalog/physical case before adapter.quote() is applied.

Changes:

  • Added a catalog-driven {lower(physical): physical} lookup helper and identifier normalization in ValueDiffTask and ValueDiffDetailTask.
  • Normalized primary_key before _verify_primary_key() to fix the composite primary-key quoting path.
  • Added Snowflake-focused regression tests that patch get_columns() to return uppercase columns and assert generated SQL uses uppercase quoted identifiers.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
recce/tasks/valuediff.py Adds catalog case lookup + normalizes PK/columns before quoting to avoid Snowflake invalid identifier errors.
tests/tasks/test_valuediff.py Adds unit tests for the lookup/normalization helpers and Snowflake regression tests capturing executed SQL.
Comments suppressed due to low confidence (1)

recce/tasks/valuediff.py:431

  • As in ValueDiffTask._query_value_diff, case_lookup is built unconditionally (two get_columns() macro executions) even when columns is None and the function already calls get_columns() again to compute common columns. Since value_diff can run frequently and get_columns() may hit INFORMATION_SCHEMA, consider avoiding redundant catalog calls by building the lookup only when normalising user-supplied identifiers, or reusing the already-fetched column lists to construct the lookup.
        # Normalise user-supplied identifiers to physical catalog names before quoting.
        # On Snowflake (default quoting config), physical columns are stored uppercase;
        # adapter.quote("customer_id") → '"customer_id"' which Snowflake cannot resolve.
        case_lookup = self._build_column_case_lookup(dbt_adapter, model)

        if columns is None or len(columns) == 0:
            base_columns = [column.column for column in dbt_adapter.get_columns(model, base=True)]
            curr_columns = [column.column for column in dbt_adapter.get_columns(model, base=False)]
            columns = [column for column in base_columns if column in curr_columns]
        else:
            columns = [self._normalise_identifier(c, case_lookup) for c in columns]

Comment thread recce/tasks/valuediff.py Outdated
Comment on lines +140 to +151
# Build a {lower(physical_name): physical_name} lookup from the catalog so that
# user-supplied lowercase identifiers (e.g. "customer_id") are mapped to the
# physical name the database actually stores (e.g. "CUSTOMER_ID" on Snowflake).
# This must happen before get_columns() is called for the columns-list path so
# that the lookup covers both branches uniformly.
case_lookup = self._build_column_case_lookup(dbt_adapter, model)

if columns is None or len(columns) == 0:
base_columns = [column.column for column in dbt_adapter.get_columns(model, base=True)]
curr_columns = [column.column for column in dbt_adapter.get_columns(model, base=False)]
columns = [column for column in base_columns if column in curr_columns]
else:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bc1fcb0: added optional case_lookup parameter to both _query_value_diff() methods; execute() now builds the lookup once and passes it down, reducing get_columns() calls from 4× to 2× per value_diff run.

Comment thread recce/tasks/valuediff.py Outdated
Comment on lines +346 to +349
# _query_value_diff rebuilds case_lookup internally; passing the already-
# normalised primary_key here is a defensive duplicate that keeps both
# paths consistent and avoids a second get_columns() round-trip on the
# happy path (the internal rebuild still runs for the columns list).
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bc1fcb0: the misleading comment at line 346 has been removed. The comment claimed the internal rebuild was still running; now that case_lookup is passed through from execute(), the comment is no longer needed and was deleted.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 98.51852% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
recce/tasks/valuediff.py 93.93% 2 Missing ⚠️
Files with missing lines Coverage Δ
tests/tasks/test_valuediff.py 99.64% <100.00%> (+0.19%) ⬆️
recce/tasks/valuediff.py 81.93% <93.93%> (+1.73%) ⬆️

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…umns() (DRC-3464 PR #1388 review)

Signed-off-by: Kent <iamcxa@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants