workload-replay: harden and AST-anonymize captured workloads #36803
Draft
jasonhernandez wants to merge 1 commit into
`bin/mz-workload-anonymize` scrubs identifiers and literals from captured
workloads so they can be shared. It relied on text-regex substitution, which
leaked sensitive data and corrupted SQL. This reworks it to rewrite SQL on
Materialize's own parsed AST, and hardens the surrounding tool.
New crate `src/sql-anonymize` (`mz-sql-anonymize`): reads
{mapping, rename_identifiers, redact_literals, statements} and rewrites each
statement on the AST via VisitMut —
- renames identifiers as whole tokens, reaching object/cluster/type references
(the `Raw` AstInfo associated types, whose generic visitors are no-ops) by
overriding visit_item_name_mut and friends. No substring corruption, no
in-string rewrites, no word-boundary or case guesswork (Mz identifiers are
case-sensitive, so exact matching is correct);
- redacts query literals — strings, numbers, hex strings, intervals — to
'<REDACTED>';
- preserves config literals (CREATE CLUSTER / CLUSTER REPLICA / ALTER CLUSTER
and SET / RESET / SET TRANSACTION / ALTER SYSTEM), e.g. sizes and timeouts,
which replay needs and which are not sensitive.
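To make the difference between substring substitution and whole-token renaming concrete, here is a minimal Python sketch. It is not the `mz-sql-anonymize` implementation, which works on the parsed AST in Rust; it uses a toy tokenizer as a stand-in, and the `MAPPING` contents, `naive_rename`, and `token_rename` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical mapping from original identifiers to anonymized names.
MAPPING = {"orders": "table_1", "id": "column_1"}

# Toy tokenizer (an assumption, far simpler than a real SQL parser):
# string literals, quoted identifiers, bare words, then any single character.
TOKEN = re.compile(
    r"""'(?:[^']|'')*'
      | "(?:[^"]|"")*"
      | [A-Za-z_][A-Za-z_0-9]*
      | .
    """,
    re.VERBOSE | re.DOTALL,
)


def naive_rename(sql, mapping):
    """Substring substitution: rewrites inside other words and inside strings."""
    for old, new in mapping.items():
        sql = sql.replace(old, new)
    return sql


def token_rename(sql, mapping):
    """Rename only tokens that match a mapping key exactly (case-sensitive)."""
    out = []
    for tok in TOKEN.findall(sql):
        if tok.startswith("'"):
            # String literal: never rewritten by the rename step.
            out.append(tok)
        elif tok.startswith('"'):
            # Quoted identifier: compare the unquoted name.
            name = tok[1:-1].replace('""', '"')
            out.append('"' + mapping[name] + '"' if name in mapping else tok)
        else:
            out.append(mapping.get(tok, tok))
    return "".join(out)


sql = "SELECT id, paid FROM orders WHERE note = 'orders'"
print(naive_rename(sql, MAPPING))
print(token_rename(sql, MAPPING))
```

The naive version turns `paid` into `pacolumn_1` and rewrites the text inside the string literal, while the token version touches only the `id` and `orders` identifiers. A real AST visitor additionally knows which tokens are identifiers and which are keywords, something this sketch does not attempt.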
The anonymizer (Python) sends all cluster/DDL/query SQL through the helper and:
- still scrubs DDL create_sql literals with a blanket regex, because option
strings (broker addresses, hosts) are typed AST fields neither the visitor
nor the engine's redacted Display treats as literals;
- applies the structural identifier mapping (column types, child schema/db,
query routing fields) directly, since those are not SQL;
- rebuilds source-child dict keys from the mapped database/schema names, which
previously leaked the originals;
- requires the parser by default (--require-parser); falls back to the regex,
with a warning, only when the binary is unavailable or a statement does not
parse.
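The parser gate and regex fallback described in the last bullet could look roughly like the sketch below. It rests on one reading of the description, which is an assumption: with `require_parser` set, a missing helper is an error, while a single statement that fails to parse falls back to the regex with a warning. The `anonymize` and `regex_fallback` functions and the `helper` callable (returning `None` on parse failure) are hypothetical, not the tool's actual interface.

```python
import logging
import re

log = logging.getLogger("anonymize_sketch")

# Simplified stand-in for the blanket literal regex (assumption).
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")


def regex_fallback(sql):
    """Redact every single-quoted string literal without parsing."""
    return STRING_LITERAL.sub("'<REDACTED>'", sql)


def anonymize(statements, helper=None, require_parser=True):
    """Rewrite statements with the parser helper, falling back per the gate.

    helper: callable taking SQL and returning rewritten SQL, or None when the
            statement does not parse; helper=None models a missing binary.
    """
    if helper is None:
        if require_parser:
            raise RuntimeError("parser helper unavailable and require_parser is set")
        log.warning("parser helper unavailable; using regex fallback")
        return [regex_fallback(s) for s in statements]

    out = []
    for sql in statements:
        rewritten = helper(sql)
        if rewritten is None:
            log.warning("statement did not parse; using regex fallback")
            rewritten = regex_fallback(sql)
        out.append(rewritten)
    return out
```

Keeping the fallback per statement means one unparseable query does not force the whole capture onto the weaker regex path.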
A verify pass re-scans the output for surviving original identifiers (in any
string, including structural keys) and non-placeholder literals, exempting
reserved format keys and preserved config statements, and refuses to write if
anything leaks. Output now requires an explicit -o/--in-place rather than
silently overwriting the input.
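A verify pass of this kind can be sketched as a walk over the anonymized output that collects anything suspicious and refuses to proceed if the list is non-empty. The sketch below is an illustration under stated assumptions, not the tool's code: the output is assumed to be nested dicts and lists, identifiers are matched as whole words, only single-quoted string literals are checked, and `find_leaks`, `verify_or_raise`, and the `exempt` set are hypothetical names.

```python
import re

PLACEHOLDER = "'<REDACTED>'"
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")


def iter_strings(node):
    """Yield every string in a nested dict/list structure, dict keys included."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(key)
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_leaks(output, originals, exempt=frozenset()):
    """Return (kind, offender, containing_string) for each suspected leak."""
    leaks = []
    for text in iter_strings(output):
        if text in exempt:
            # e.g. reserved format keys or preserved config statements
            continue
        for name in originals:
            pattern = rf"(?<![A-Za-z0-9_]){re.escape(name)}(?![A-Za-z0-9_])"
            if re.search(pattern, text):
                leaks.append(("identifier", name, text))
        for literal in STRING_LITERAL.findall(text):
            if literal != PLACEHOLDER:
                leaks.append(("literal", literal, text))
    return leaks


def verify_or_raise(output, originals, exempt=frozenset()):
    """Refuse to continue (and therefore to write) if anything leaked."""
    leaks = find_leaks(output, originals, exempt)
    if leaks:
        raise ValueError(f"refusing to write: {len(leaks)} leak(s) found")
    return output
```

Scanning dict keys as well as values is what would catch a case like the source-child keys mentioned above, where the original names survived in a structural field rather than in SQL text.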
Validated against a production capture: 0 of 623 anonymized queries fail to
re-parse (a regex prototype corrupted 65 of 565), 0 identifier leaks, query
numbers redacted, cluster/SET config preserved.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced May 29, 2026
Squash of the `workload-anonymize-*` stack (#36745, #36746, #36749, #36797, #36799, #36801) into one reviewable change against `main`. Those drafts are superseded by this PR.

`bin/mz-workload-anonymize` scrubs identifiers and literals from captured workloads so they can be shared. It relied on text-regex substitution, which both leaked sensitive data and corrupted SQL. This reworks the SQL rewriting onto Materialize's own parsed AST and hardens the surrounding tool.

New crate `src/sql-anonymize` (`mz-sql-anonymize`)

Reads `{mapping, rename_identifiers, redact_literals, statements}` and rewrites each statement on the AST via `VisitMut`:
- Renames identifiers as whole tokens, reaching object/cluster/type references (the `Raw` `AstInfo` associated types, whose generic visitors are no-ops) by overriding `visit_item_name_mut` & friends. No substring corruption, no in-string rewrites, no word-boundary/case guesswork (Mz identifiers are case-sensitive, so exact match is correct).
- Redacts query literals (strings, numbers, hex strings, intervals) to `'<REDACTED>'`.
- Preserves config literals (`CREATE CLUSTER` / `CLUSTER REPLICA` / `ALTER CLUSTER`, `SET` / `RESET` / `SET TRANSACTION` / `ALTER SYSTEM`), such as sizes and timeouts, that replay needs and that aren't sensitive.

Anonymizer (Python)

Sends all cluster/DDL/query SQL through the helper, and:
- still scrubs DDL `create_sql` literals with a blanket regex, because option strings (broker addresses, hosts) are typed AST fields neither the visitor nor the engine's redacted Display treats as literals;
- applies the structural identifier mapping (column types, child schema/db, query routing fields) directly, since those are not SQL;
- rebuilds source-child dict keys from the mapped database/schema names, which previously leaked the originals;
- requires the parser by default (`--require-parser`); falls back to regex, with a warning, only when the binary is unavailable or a statement doesn't parse.

Safety net

A verify pass re-scans output for surviving original identifiers (in any string, including structural keys) and non-placeholder literals, exempting reserved format keys and preserved config statements, and refuses to write if anything leaks. Output now requires an explicit `-o`/`--in-place` instead of silently overwriting the input.

Validation (production capture)

`transaction_id` format key)

Tests

8 Rust unit tests (token-level rename, qualified refs, no substring/in-string corruption, query-literal redaction incl. numbers, cluster/SET preservation, parse-failure → null) + 22 Python tests (end-to-end anonymization, leak regressions, subsource-key fix, verify guards, require-parser gate, regex fallback). The DDL-option-string limitation (broker addresses handled by the regex, not the AST) is documented in the README and code.
🤖 Generated with Claude Code