workload-replay: harden and AST-anonymize captured workloads #36803

Draft
jasonhernandez wants to merge 1 commit into main from workload-anonymize

@jasonhernandez
Contributor

Squash of the workload-anonymize-* stack (#36745, #36746, #36749, #36797, #36799, #36801) into one reviewable change against main. Those drafts are superseded by this PR.

bin/mz-workload-anonymize scrubs identifiers and literals from captured workloads so they can be shared. It relied on text-regex substitution, which both leaked sensitive data and corrupted SQL. This reworks the SQL rewriting onto Materialize's own parsed AST and hardens the surrounding tool.

New crate src/sql-anonymize (mz-sql-anonymize)

Reads {mapping, rename_identifiers, redact_literals, statements} and rewrites each statement on the AST via VisitMut:

  • Renames identifiers as whole tokens, reaching object/cluster/type references — the Raw AstInfo associated types, whose generic visitors are no-ops — by overriding visit_item_name_mut & friends. No substring corruption, no in-string rewrites, no word-boundary/case guesswork (Mz identifiers are case-sensitive, so exact match is correct).
  • Redacts query literals (strings, numbers, hex, intervals) → '<REDACTED>'.
  • Preserves config literals (CREATE CLUSTER/CLUSTER REPLICA/ALTER CLUSTER, SET/RESET/SET TRANSACTION/ALTER SYSTEM) — sizes, timeouts — that replay needs and aren't sensitive.
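
To illustrate why whole-token renaming avoids the corruption that substring substitution causes, here is a minimal Python sketch. It is not the crate's implementation, which rewrites Materialize's parsed AST in Rust via VisitMut. A small regex tokenizer stands in for the parser, and `MAPPING`, `rename_substring`, and `rename_tokens` are hypothetical names chosen for this example.

```python
import re

# Hypothetical identifier mapping (original name -> anonymized name).
MAPPING = {"orders": "table_1", "id": "column_1"}

# One token is a single-quoted string, a double-quoted identifier, or a bare word.
# This is a stand-in for a real SQL parser and cannot tell keywords from names.
TOKEN = re.compile(r"'(?:[^']|'')*'|\"(?:[^\"]|\"\")*\"|[A-Za-z_][A-Za-z_0-9]*")


def rename_substring(sql, mapping):
    """Naive text replacement: rewrites matches inside longer names and strings."""
    for old, new in mapping.items():
        sql = sql.replace(old, new)
    return sql


def rename_tokens(sql, mapping):
    """Rename only tokens that match a mapping key exactly (case-sensitive)."""

    def swap(match):
        tok = match.group(0)
        if tok.startswith("'"):
            return tok  # string literal: never renamed here
        if tok.startswith('"'):
            inner = tok[1:-1].replace('""', '"')
            return '"' + mapping[inner] + '"' if inner in mapping else tok
        return mapping.get(tok, tok)

    return TOKEN.sub(swap, sql)


sql = "SELECT id, order_identifier FROM orders WHERE note = 'orders id'"
print(rename_substring(sql, MAPPING))
print(rename_tokens(sql, MAPPING))
```

The substring version turns `order_identifier` into `order_column_1entifier` and rewrites the string literal, while the token version leaves both untouched. Literal redaction is a separate step and is not modeled in this sketch.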

Anonymizer (Python)

Sends all cluster/DDL/query SQL through the helper, and:

  • still scrubs DDL create_sql literals with a blanket regex, because option strings (broker addresses, hosts) are typed AST fields that neither the visitor nor the engine's redacted Display treats as literals;
  • applies the structural identifier mapping (column types, child schema/db, query routing) directly, since those fields are not SQL;
  • rebuilds source-child dict keys from mapped database/schema names (these previously leaked the originals);
  • requires the parser by default (--require-parser); falls back to regex, with a warning, only when the binary is unavailable or a statement doesn't parse.
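
The hand-off between the Python anonymizer and the helper could look roughly like the sketch below. The request keys come from the description above. Everything else is an assumption made for illustration: the value types (a dict for `mapping`, booleans for the two flags), a response that is a list with `None` for a statement that failed to parse, and the function names `build_request`, `regex_fallback`, and `merge_results`. How `--require-parser` gates this path is not modeled.

```python
import json
import re
import warnings


def build_request(statements, mapping):
    """Serialize a helper request; field types here are assumed, not confirmed."""
    return json.dumps(
        {
            "mapping": mapping,
            "rename_identifiers": True,
            "redact_literals": True,
            "statements": statements,
        }
    )


def regex_fallback(sql):
    """Crude stand-in for the regex path: blank out single-quoted strings."""
    return re.sub(r"'(?:[^']|'')*'", "'<REDACTED>'", sql)


def merge_results(statements, rewritten):
    """Use the helper's output, falling back per statement when it is None."""
    merged = []
    for original, new in zip(statements, rewritten):
        if new is None:
            warnings.warn("statement did not parse; using regex fallback")
            new = regex_fallback(original)
        merged.append(new)
    return merged


statements = ["SELECT 'a' FROM t", "SELEC 'b'"]
request = build_request(statements, {"t": "table_1"})
# Pretend the helper rewrote the first statement and returned null for the second.
print(merge_results(statements, ["SELECT '<REDACTED>' FROM table_1", None]))
```

Falling back per statement keeps one unparseable query from forcing the whole workload onto the regex path.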

Safety net

A verify pass re-scans output for surviving original identifiers (in any string, including structural keys) and non-placeholder literals — exempting reserved format keys and preserved config statements — and refuses to write if anything leaks. Output now requires an explicit -o/--in-place instead of silently overwriting the input.
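
A minimal sketch of the identifier half of such a verify pass follows, assuming the workload is a nested structure of dicts, lists, and scalars. The names `RESERVED_KEYS`, `iter_strings`, `find_leaks`, and `check_before_write` are hypothetical, matching on whole words is a simplification, and the non-placeholder literal check and the config-statement exemption are left out.

```python
import re

# Format keys that belong to the file format rather than to user data (assumed set).
RESERVED_KEYS = {"transaction_id"}

WORD = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def iter_strings(node, reserved=RESERVED_KEYS):
    """Yield every string in the structure, including dict keys that are not reserved."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key not in reserved:
                yield key
            yield from iter_strings(value, reserved)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item, reserved)
    elif isinstance(node, str):
        yield node


def find_leaks(workload, originals, reserved=RESERVED_KEYS):
    """Return the original identifiers that still appear as whole words."""
    leaks = set()
    for text in iter_strings(workload, reserved):
        leaks.update(word for word in WORD.findall(text) if word in originals)
    return sorted(leaks)


def check_before_write(workload, originals):
    """Refuse to continue if any original identifier survived."""
    leaks = find_leaks(workload, originals)
    if leaks:
        raise ValueError(f"refusing to write, leaked identifiers: {leaks}")


originals = {"orders", "transaction_id"}
workload = {
    "queries": [{"sql": "SELECT column_1 FROM table_1", "transaction_id": 7}],
    "db.orders": {},  # a structural key that still carries an original name
}
print(find_leaks(workload, originals))
```

Scanning dict keys as well as values is what catches the `db.orders` key here, which a scan of SQL text alone would miss.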

Validation (production capture)

| Metric | regex prototype | this PR (AST) |
|---|---|---|
| Anonymized queries that fail to re-parse | 65 / 565 (12%) | 0 / 623 |
| Identifier leaks | | 0 (only the transaction_id format key) |
| Query numeric literals remaining | leaked | 0 |
| Cluster SIZE / SET timeout | | preserved |

Tests

8 Rust unit tests (token-level rename, qualified refs, no substring/in-string corruption, query-literal redaction incl. numbers, cluster/SET preservation, parse-failure → null) + 22 Python tests (end-to-end anonymization, leak regressions, subsource-key fix, verify guards, require-parser gate, regex fallback). The DDL-option-string limitation (broker addresses handled by the regex, not the AST) is documented in the README and code.

🤖 Generated with Claude Code

`bin/mz-workload-anonymize` scrubs identifiers and literals from captured
workloads so they can be shared. It relied on text-regex substitution, which
leaked sensitive data and corrupted SQL. This reworks it to rewrite SQL on
Materialize's own parsed AST, and hardens the surrounding tool.

New crate `src/sql-anonymize` (`mz-sql-anonymize`): reads
{mapping, rename_identifiers, redact_literals, statements} and rewrites each
statement on the AST via VisitMut —

- renames identifiers as whole tokens, reaching object/cluster/type references
  (the `Raw` AstInfo associated types, whose generic visitors are no-ops) by
  overriding visit_item_name_mut and friends. No substring corruption, no
  in-string rewrites, no word-boundary or case guesswork (Mz identifiers are
  case-sensitive, so exact matching is correct);
- redacts query literals — strings, numbers, hex strings, intervals — to
  '<REDACTED>';
- preserves config literals (CREATE CLUSTER / CLUSTER REPLICA / ALTER CLUSTER
  and SET / RESET / SET TRANSACTION / ALTER SYSTEM), e.g. sizes and timeouts,
  which replay needs and which are not sensitive.

The anonymizer (Python) sends all cluster/DDL/query SQL through the helper and:

- still scrubs DDL create_sql literals with a blanket regex, because option
  strings (broker addresses, hosts) are typed AST fields neither the visitor
  nor the engine's redacted Display treats as literals;
- applies the structural identifier mapping (column types, child schema/db,
  query routing fields) directly, since those are not SQL;
- rebuilds source-child dict keys from the mapped database/schema names, which
  previously leaked the originals;
- requires the parser by default (--require-parser); falls back to the regex,
  with a warning, only when the binary is unavailable or a statement does not
  parse.

A verify pass re-scans the output for surviving original identifiers (in any
string, including structural keys) and non-placeholder literals, exempting
reserved format keys and preserved config statements, and refuses to write if
anything leaks. Output now requires an explicit -o/--in-place rather than
silently overwriting the input.

Validated against a production capture: 0 of 623 anonymized queries fail to
re-parse (a regex prototype corrupted 65 of 565), 0 identifier leaks, query
numbers redacted, cluster/SET config preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>