workload-replay: harden and AST-anonymize captured workloads by jasonhernandez · Pull Request #36803 · MaterializeInc/materialize

jasonhernandez · 2026-05-29T22:28:14Z

Squash of the workload-anonymize-* stack (#36745, #36746, #36749, #36797, #36799, #36801) into one reviewable change against main. Those drafts are superseded by this PR.

`bin/mz-workload-anonymize` scrubs identifiers and literals from captured workloads so they can be shared. It previously relied on text-regex substitution, which both leaked sensitive data and corrupted SQL. This PR moves the SQL rewriting onto Materialize's own parsed AST and hardens the surrounding tool.

New crate `src/sql-anonymize` (`mz-sql-anonymize`)

Reads {mapping, rename_identifiers, redact_literals, statements} and rewrites each statement on the AST via VisitMut:

- Renames identifiers as whole tokens. Object, cluster, and type references are the Raw AstInfo associated types, whose generic visitors are no-ops, so the visitor reaches them by overriding visit_item_name_mut & friends. No substring corruption, no in-string rewrites, no word-boundary/case guesswork (Mz identifiers are case-sensitive, so exact match is correct).
- Redacts query literals (strings, numbers, hex, intervals) to '<REDACTED>'.
- Preserves config literals (CREATE CLUSTER/CLUSTER REPLICA/ALTER CLUSTER, SET/RESET/SET TRANSACTION/ALTER SYSTEM), such as sizes and timeouts, which replay needs and which aren't sensitive.
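To illustrate why whole-token renaming avoids the corruption that substring substitution causes, here is a minimal Python sketch. It is not the crate's implementation (that works on Materialize's parsed AST in Rust); the toy tokenizer, `naive_rename`, and `anonymize` are hypothetical names used only for this illustration, and the flags merely mirror the `rename_identifiers` / `redact_literals` request fields.

```python
import re

# Toy tokenizer (an assumption for illustration, far simpler than a real SQL parser):
# string literals, quoted identifiers, numbers, bare words, and everything else.
TOKEN = re.compile("|".join([
    r"(?P<string>'(?:[^']|'')*')",
    r'(?P<quoted>"(?:[^"]|"")*")',
    r"(?P<number>\b\d+(?:\.\d+)?\b)",
    r"(?P<word>[A-Za-z_][A-Za-z0-9_$]*)",
    r"(?P<other>.)",
]), re.DOTALL)


def naive_rename(sql, mapping):
    """Substring substitution: rewrites inside longer names and inside strings."""
    for old, new in mapping.items():
        sql = sql.replace(old, new)
    return sql


def anonymize(sql, mapping, rename_identifiers=True, redact_literals=True):
    """Rename identifiers only as whole tokens (exact, case-sensitive match)
    and replace string/number literals with a placeholder."""
    out = []
    for m in TOKEN.finditer(sql):
        kind, text = m.lastgroup, m.group()
        if rename_identifiers and kind == "word" and text in mapping:
            out.append(mapping[text])
        elif rename_identifiers and kind == "quoted":
            inner = text[1:-1].replace('""', '"')
            renamed = mapping.get(inner, inner)
            out.append('"' + renamed.replace('"', '""') + '"')
        elif redact_literals and kind in ("string", "number"):
            out.append("'<REDACTED>'")
        else:
            out.append(text)
    return "".join(out)


sql = "SELECT id, order_id FROM orders WHERE note = 'id 7' LIMIT 10"
mapping = {"id": "col_1", "orders": "tbl_1"}
print(naive_rename(sql, mapping))  # corrupts order_id and rewrites inside the string
print(anonymize(sql, mapping))     # renames only the whole tokens id and orders
```

The token-level version leaves `order_id` alone because it is a different token from `id`, and it never looks inside the string literal; it replaces the literal as a unit instead.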

Anonymizer (Python)

Sends all cluster/DDL/query SQL through the helper, and:

- Still scrubs DDL create_sql literals with a blanket regex, because option strings (broker addresses, hosts) are typed AST fields that neither the visitor nor the engine's redacted Display treats as literals.
- Applies the structural identifier mapping (column types, child schema/db, query routing) directly, since those fields are not SQL.
- Rebuilds source-child dict keys from mapped database/schema names (these previously leaked the originals).
- Requires the parser by default (--require-parser); falls back to regex, with a warning, only when the binary is unavailable or a statement doesn't parse.
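The parser gate in the last item can be sketched roughly as follows. This is one plausible reading, not the tool's actual code: `anonymize_statement`, `regex_fallback`, `ParserRequired`, and the `ast_rewrite` callable are hypothetical names, and the sketch assumes the helper signals "binary unavailable" or "statement did not parse" by returning `None`.

```python
import re
import warnings


class ParserRequired(RuntimeError):
    """Raised when the AST helper is required but produced no result."""


def regex_fallback(sql, mapping):
    # Best-effort word-boundary rename; less precise than the AST path.
    for old, new in mapping.items():
        sql = re.sub(rf"\b{re.escape(old)}\b", lambda _m, new=new: new, sql)
    return sql


def anonymize_statement(sql, mapping, ast_rewrite, require_parser=True):
    """ast_rewrite(sql, mapping) returns rewritten SQL, or None when the
    helper is unavailable or the statement does not parse (assumption)."""
    rewritten = ast_rewrite(sql, mapping)
    if rewritten is not None:
        return rewritten
    if require_parser:
        raise ParserRequired(f"AST helper produced no result for: {sql!r}")
    warnings.warn("falling back to regex anonymization", RuntimeWarning)
    return regex_fallback(sql, mapping)
```

Making the strict path the default means a missing helper binary fails loudly rather than silently degrading to the less reliable regex path.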

Safety net

A verify pass re-scans the output for surviving original identifiers (in any string, including structural keys) and non-placeholder literals, exempting reserved format keys and preserved config statements, and refuses to write if anything leaks. Output now requires an explicit -o/--in-place instead of silently overwriting the input.
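The identifier half of such a verify pass might look like the sketch below. It assumes the anonymized workload is a nested dict/list structure; `iter_strings`, `find_leaks`, and `write_if_clean` are hypothetical names, and the literal check and the config-statement exemption are left out for brevity.

```python
import re


def iter_strings(node):
    """Yield every string in a nested dict/list structure, including dict keys."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(key)
            yield from iter_strings(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from iter_strings(item)


def find_leaks(output, original_identifiers, exempt=frozenset()):
    """Return the original identifiers that still appear as whole words."""
    to_check = [i for i in original_identifiers if i not in exempt]
    leaks = set()
    for text in iter_strings(output):
        for ident in to_check:
            pattern = rf"(?<![A-Za-z0-9_]){re.escape(ident)}(?![A-Za-z0-9_])"
            if re.search(pattern, text):
                leaks.add(ident)
    return sorted(leaks)


def write_if_clean(output, original_identifiers, write, exempt=frozenset()):
    """Call write(output) only when no original identifier survives."""
    leaks = find_leaks(output, original_identifiers, exempt)
    if leaks:
        raise ValueError(f"refusing to write, leaked identifiers: {leaks}")
    write(output)
```

Scanning dict keys as well as values matters here, since the source-child keys were one of the places the originals previously leaked. A reserved format key such as `transaction_id` would be passed in `exempt`.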

Validation (production capture)

| Metric | Regex prototype | This PR (AST) |
| --- | --- | --- |
| Anonymized queries that fail to re-parse | 65 / 565 (12%) | 0 / 623 |
| Identifier leaks | - | 0 (only the `transaction_id` format key) |
| Query numeric literals remaining | leaked | 0 |
| Cluster SIZE / SET timeout | - | preserved |

Tests

8 Rust unit tests (token-level rename, qualified refs, no substring/in-string corruption, query-literal redaction incl. numbers, cluster/SET preservation, parse-failure → null) + 22 Python tests (end-to-end anonymization, leak regressions, subsource-key fix, verify guards, require-parser gate, regex fallback). The DDL-option-string limitation (broker addresses handled by the regex, not the AST) is documented in the README and code.

🤖 Generated with Claude Code

`bin/mz-workload-anonymize` scrubs identifiers and literals from captured workloads so they can be shared. It relied on text-regex substitution, which leaked sensitive data and corrupted SQL. This reworks it to rewrite SQL on Materialize's own parsed AST, and hardens the surrounding tool.

New crate `src/sql-anonymize` (`mz-sql-anonymize`): reads {mapping, rename_identifiers, redact_literals, statements} and rewrites each statement on the AST via VisitMut:

- renames identifiers as whole tokens, reaching object/cluster/type references (the `Raw` AstInfo associated types, whose generic visitors are no-ops) by overriding visit_item_name_mut and friends. No substring corruption, no in-string rewrites, no word-boundary or case guesswork (Mz identifiers are case-sensitive, so exact matching is correct);
- redacts query literals (strings, numbers, hex strings, intervals) to '<REDACTED>';
- preserves config literals (CREATE CLUSTER / CLUSTER REPLICA / ALTER CLUSTER and SET / RESET / SET TRANSACTION / ALTER SYSTEM), e.g. sizes and timeouts, which replay needs and which are not sensitive.

The anonymizer (Python) sends all cluster/DDL/query SQL through the helper and:

- still scrubs DDL create_sql literals with a blanket regex, because option strings (broker addresses, hosts) are typed AST fields neither the visitor nor the engine's redacted Display treats as literals;
- applies the structural identifier mapping (column types, child schema/db, query routing fields) directly, since those are not SQL;
- rebuilds source-child dict keys from the mapped database/schema names, which previously leaked the originals;
- requires the parser by default (--require-parser); falls back to the regex, with a warning, only when the binary is unavailable or a statement does not parse.

A verify pass re-scans the output for surviving original identifiers (in any string, including structural keys) and non-placeholder literals, exempting reserved format keys and preserved config statements, and refuses to write if anything leaks. Output now requires an explicit -o/--in-place rather than silently overwriting the input.

Validated against a production capture: 0 of 623 anonymized queries fail to re-parse (a regex prototype corrupted 65 of 565), 0 identifier leaks, query numbers redacted, cluster/SET config preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jasonhernandez wants to merge 1 commit into main from workload-anonymize
