workload-replay: anonymize subsource child keys, broaden verify by jasonhernandez · Pull Request #36797 · MaterializeInc/materialize

jasonhernandez · 2026-05-29T19:27:49Z

Fourth in the stack — base workload-anonymize-tests (#36749). Found by actually running the tool against a real Materialize Cloud capture.

The bug

Captures with Postgres/MySQL CDC sources store each source's subsources (children) in a dict keyed by the child's fully-qualified database.schema.name. That key was built during the mapping pass from child["database"]/child["schema"], which are only remapped later (pass 2) — so the key kept the original database and schema names, even though the child's own fields and every other position were correctly anonymized.

Real example from the capture, anonymized output before this fix:

children:
  qa_canary_environment.public_loadgen.child_25:   # <-- original db + schema leaked in the key
    name: child_25
    database: db_5      # correctly anonymized
    schema: schema_18   # correctly anonymized

After: db_5.schema_18.child_25.

8 distinct original database/schema names leaked this way in a 5-minute prod capture.

Why verify didn't catch it

verify_anonymized only scanned create_sql/sql values, never structural dict keys — so it reported all-clear. This PR broadens it to also scan dict keys, with two guards to avoid false positives (both learned the hard way while writing it):

Reserved format keys (RESERVED_FORMAT_KEYS): the workload file's own field names (sql, cluster, transaction_id, tables, …). Without this, a user column named transaction_id (which maps to column_N) makes verify flag every query record's own transaction_id field. Their values are still checked — only the structural key name is exempt.
Scalar values are not identifier-scanned. A kept literal — e.g. a column default 'secret note' under --no-literals — can contain a word matching a renamed column (note) without being a leak. SQL text is still scanned (as before).

Validation

Re-ran against the real prod capture: non-keyword identifier survivors dropped 8 → 0 (the one remaining match is the legitimate transaction_id format key, not data). --verify passes and writes; before the fix it correctly refused to write.
4 new regression tests (19 total): end-to-end child-key anonymization, verify catching a leaked child key, and the two false-positive guards.
bin/fmt, ruff clean.

🤖 Generated with Claude Code

Found while anonymizing a real capture with Postgres/MySQL CDC sources: a source's children (subsources) are stored in a dict keyed by the child's fully-qualified `database.schema.name`. That key was built in the mapping pass from child["database"]/child["schema"], which are only remapped later, so the key retained the original database and schema names — leaking them even though every other position was anonymized. Build the key from the mapped names instead. The verify pass missed this because it only scanned create_sql/sql values, not structural dict keys. Extend it to also check identifier survival in dict keys, while: - skipping the workload format's own reserved keys (RESERVED_FORMAT_KEYS), so a user object sharing a name with a format field (e.g. a column named `transaction_id`) does not produce a false positive on the query record's own field name; and - NOT scanning arbitrary scalar values, because a kept literal (e.g. a column default 'secret note' under --no-literals) can contain a word matching a renamed identifier without being a leak. SQL text is still scanned. Add regression tests: end-to-end child-key anonymization, verify catching a leaked child key, and the two false-positive guards above. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jasonhernandez · 2026-05-29T22:28:29Z

Superseded by #36803, which squashes this stack into a single PR against main.

This was referenced May 29, 2026

workload-replay: require the parser by default for query literals #36799

Closed

workload-replay: harden and AST-anonymize captured workloads #36803

Draft

jasonhernandez closed this May 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

workload-replay: anonymize subsource child keys, broaden verify#36797

workload-replay: anonymize subsource child keys, broaden verify#36797
jasonhernandez wants to merge 1 commit into
workload-anonymize-testsfrom
workload-anonymize-subsource-keys

jasonhernandez commented May 29, 2026

Uh oh!

jasonhernandez commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jasonhernandez commented May 29, 2026

The bug

Why verify didn't catch it

Validation

Uh oh!

jasonhernandez commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant