Skip to content

workload-replay: anonymize subsource child keys, broaden verify#36797

Closed
jasonhernandez wants to merge 1 commit into
workload-anonymize-testsfrom
workload-anonymize-subsource-keys
Closed

workload-replay: anonymize subsource child keys, broaden verify#36797
jasonhernandez wants to merge 1 commit into
workload-anonymize-testsfrom
workload-anonymize-subsource-keys

Conversation

@jasonhernandez
Copy link
Copy Markdown
Contributor

Fourth in the stack — base workload-anonymize-tests (#36749). Found by actually running the tool against a real Materialize Cloud capture.

The bug

Captures with Postgres/MySQL CDC sources store each source's subsources (children) in a dict keyed by the child's fully-qualified database.schema.name. That key was built during the mapping pass from child["database"]/child["schema"], which are only remapped later (pass 2) — so the key kept the original database and schema names, even though the child's own fields and every other position were correctly anonymized.

Real example from the capture, anonymized output before this fix:

children:
  qa_canary_environment.public_loadgen.child_25:   # <-- original db + schema leaked in the key
    name: child_25
    database: db_5      # correctly anonymized
    schema: schema_18   # correctly anonymized

After: db_5.schema_18.child_25.

8 distinct original database/schema names leaked this way in a 5-minute prod capture.

Why verify didn't catch it

verify_anonymized only scanned create_sql/sql values, never structural dict keys — so it reported all-clear. This PR broadens it to also scan dict keys, with two guards to avoid false positives (both learned the hard way while writing it):

  1. Reserved format keys (RESERVED_FORMAT_KEYS): the workload file's own field names (sql, cluster, transaction_id, tables, …). Without this, a user column named transaction_id (which maps to column_N) makes verify flag every query record's own transaction_id field. Their values are still checked — only the structural key name is exempt.
  2. Scalar values are not identifier-scanned. A kept literal — e.g. a column default 'secret note' under --no-literals — can contain a word matching a renamed column (note) without being a leak. SQL text is still scanned (as before).

Validation

  • Re-ran against the real prod capture: non-keyword identifier survivors dropped 8 → 0 (the one remaining match is the legitimate transaction_id format key, not data). --verify passes and writes; before the fix it correctly refused to write.
  • 4 new regression tests (19 total): end-to-end child-key anonymization, verify catching a leaked child key, and the two false-positive guards.
  • bin/fmt, ruff clean.

🤖 Generated with Claude Code

Found while anonymizing a real capture with Postgres/MySQL CDC sources: a
source's children (subsources) are stored in a dict keyed by the child's
fully-qualified `database.schema.name`. That key was built in the mapping
pass from child["database"]/child["schema"], which are only remapped later,
so the key retained the original database and schema names — leaking them
even though every other position was anonymized. Build the key from the
mapped names instead.

The verify pass missed this because it only scanned create_sql/sql values,
not structural dict keys. Extend it to also check identifier survival in dict
keys, while:
- skipping the workload format's own reserved keys (RESERVED_FORMAT_KEYS), so
  a user object sharing a name with a format field (e.g. a column named
  `transaction_id`) does not produce a false positive on the query record's
  own field name; and
- NOT scanning arbitrary scalar values, because a kept literal (e.g. a column
  default 'secret note' under --no-literals) can contain a word matching a
  renamed identifier without being a leak. SQL text is still scanned.

Add regression tests: end-to-end child-key anonymization, verify catching a
leaked child key, and the two false-positive guards above.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jasonhernandez
Copy link
Copy Markdown
Contributor Author

Superseded by #36803, which squashes this stack into a single PR against main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant