workload-replay: anonymize subsource child keys, broaden verify #36797
Closed
jasonhernandez wants to merge 1 commit into
Found while anonymizing a real capture with Postgres/MySQL CDC sources: a source's children (subsources) are stored in a dict keyed by the child's fully-qualified `database.schema.name`. That key was built in the mapping pass from `child["database"]`/`child["schema"]`, which are only remapped later, so the key retained the original database and schema names, leaking them even though every other position was anonymized. Build the key from the mapped names instead.

The verify pass missed this because it only scanned `create_sql`/`sql` values, not structural dict keys. Extend it to also check identifier survival in dict keys, while:

- skipping the workload format's own reserved keys (`RESERVED_FORMAT_KEYS`), so a user object sharing a name with a format field (e.g. a column named `transaction_id`) does not produce a false positive on the query record's own field name; and
- NOT scanning arbitrary scalar values, because a kept literal (e.g. a column default `'secret note'` under `--no-literals`) can contain a word matching a renamed identifier without being a leak.

SQL text is still scanned.

Add regression tests: end-to-end child-key anonymization, verify catching a leaked child key, and the two false-positive guards above.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 29, 2026
Superseded by #36803, which squashes this stack into a single PR against main.
Fourth in the stack; base `workload-anonymize-tests` (#36749). Found by actually running the tool against a real Materialize Cloud capture.

The bug

Captures with Postgres/MySQL CDC sources store each source's subsources (`children`) in a dict keyed by the child's fully-qualified `database.schema.name`. That key was built during the mapping pass from `child["database"]`/`child["schema"]`, which are only remapped later (pass 2), so the key kept the original database and schema names, even though the child's own fields and every other position were correctly anonymized.

Real example from the capture, anonymized output before this fix:

After: `db_5.schema_18.child_25`.

8 distinct original database/schema names leaked this way in a 5-minute prod capture.
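The fix described above can be illustrated with a minimal sketch. This is not the tool's actual code: the function name `anonymize_children` and the three mapping dicts are hypothetical, and the sketch collapses the two passes into one loop purely to show the key being built from mapped names rather than from the child's not-yet-remapped fields.

```python
from typing import Dict


def anonymize_children(
    children: Dict[str, dict],
    db_map: Dict[str, str],
    schema_map: Dict[str, str],
    name_map: Dict[str, str],
) -> Dict[str, dict]:
    """Re-key subsources by their fully-qualified *mapped* name.

    Hypothetical helper: all names here are assumptions for illustration.
    """
    out: Dict[str, dict] = {}
    for child in children.values():
        # Look up the mapped names first ...
        db = db_map[child["database"]]
        schema = schema_map[child["schema"]]
        name = name_map[child["name"]]
        # ... and build the dict key from them, never from the
        # original child["database"] / child["schema"] values.
        key = f"{db}.{schema}.{name}"
        out[key] = {**child, "database": db, "schema": schema, "name": name}
    return out
```

Building the key and the child's fields from the same mapped values keeps the two from drifting apart, which is the failure mode described in this section.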
Why verify didn't catch it

`verify_anonymized` only scanned `create_sql`/`sql` values, never structural dict keys, so it reported all-clear. This PR broadens it to also scan dict keys, with two guards to avoid false positives (both learned the hard way while writing it):

- Reserved format keys (`RESERVED_FORMAT_KEYS`): the workload file's own field names (`sql`, `cluster`, `transaction_id`, `tables`, …). Without this, a user column named `transaction_id` (which maps to `column_N`) makes verify flag every query record's own `transaction_id` field. Their values are still checked; only the structural key name is exempt.
- Arbitrary scalar values are not scanned: a kept literal, e.g. a column default `'secret note'` under `--no-literals`, can contain a word matching a renamed column (`note`) without being a leak. SQL text is still scanned (as before).

Validation

- … `transaction_id` format key, not data).
- `--verify` passes and writes; before the fix it correctly refused to write.
- `bin/fmt`, `ruff` clean.

🤖 Generated with Claude Code
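The broadened verify pass and its two guards can be sketched roughly as follows. This is an illustration under stated assumptions, not the real `verify_anonymized`: the function names `find_leaks` and `_hits` are hypothetical, the contents of `RESERVED_FORMAT_KEYS` are only the subset named in this PR (whether `create_sql` belongs to it is an assumption), and identifier matching is simplified to whole-word comparison.

```python
import re
from typing import Any, List, Set, Tuple

# Illustrative subset only; the real set lives in the tool.
RESERVED_FORMAT_KEYS = {"sql", "create_sql", "cluster", "transaction_id", "tables"}
# Keys whose string values are SQL text and are therefore scanned.
SQL_KEYS = {"sql", "create_sql"}


def _hits(text: str, originals: Set[str]) -> List[str]:
    """Return original identifiers that appear as whole words in text."""
    words = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))
    return sorted(words & originals)


def find_leaks(node: Any, originals: Set[str], path: str = "$") -> List[Tuple[str, str]]:
    """Collect (location, identifier) pairs for surviving original names."""
    leaks: List[Tuple[str, str]] = []
    if isinstance(node, dict):
        for key, value in node.items():
            # Guard 1: the format's own field names are exempt as keys.
            if key not in RESERVED_FORMAT_KEYS:
                leaks += [(f"{path}.{key} (key)", i) for i in _hits(str(key), originals)]
            if key in SQL_KEYS and isinstance(value, str):
                # SQL text is still scanned, as before.
                leaks += [(f"{path}.{key}", i) for i in _hits(value, originals)]
            else:
                leaks += find_leaks(value, originals, f"{path}.{key}")
    elif isinstance(node, list):
        for idx, item in enumerate(node):
            leaks += find_leaks(item, originals, f"{path}[{idx}]")
    # Guard 2: other scalars (kept literals, numbers) are not scanned.
    return leaks
```

With this shape, a child key that still embeds an original database or schema name is reported, while a query record's own `transaction_id` field and a kept literal such as `'secret note'` are not.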