
workload-replay: add tests for the workload anonymizer#36749

Draft
jasonhernandez wants to merge 2 commits into workload-anonymize-parser from workload-anonymize-tests


@jasonhernandez
Contributor

Third in the stack. The base is workload-anonymize-parser (#36746), which is itself stacked on #36745; merge those first.

Motivation

The anonymizer (mz_workload_anonymize.py) had zero automated tests despite being a privacy tool with a heuristic core — the worst combination. Everything in #36745 and #36746 was verified by hand. This codifies those checks as regression tests.

What's covered

A new misc/python/materialize/cli/mz_workload_anonymize_test.py (15 tests), colocated so the existing pytest --doctest-modules misc/python CI step picks it up with no pipeline changes:

  • End-to-end anonymization of a structurally complete workload — identifiers scrubbed, anonymized names present.
  • Leak regressions (the fixes from #36745, "workload-replay: harden workload anonymization"): connection host/user, sink topic, and column-default literals are scrubbed.
  • Query string-literal redaction.
  • Cluster SIZE is preserved (non-sensitive config that replay needs).
  • The no-output-target error and --in-place overwrite.
  • --no-literals keeps literals while still anonymizing identifiers.
  • verify_anonymized: catches surviving identifiers and literals, accepts both '<REDACTED>' and 'literal_N' placeholders, and exempts cluster literals.
  • redact_literals_via_parser returns None (fallback signal) without the binary; the regex fallback warns.
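
To make the placeholder rule in the verify_anonymized bullet concrete, here is a minimal sketch of such a check. It is not the code from mz_workload_anonymize.py: the helper name `surviving_literals`, both regexes, and the reading of `literal_N` as `literal_` followed by digits are assumptions for illustration, and the cluster-literal exemption is not modeled.

```python
import re

# Assumption: the two accepted placeholder shapes are '<REDACTED>' and
# 'literal_N', with N read here as one or more digits.
_PLACEHOLDER = re.compile(r"<REDACTED>|literal_\d+")

# Simplified single-quoted SQL string literal ('' is an escaped quote).
_STRING_LITERAL = re.compile(r"'((?:[^']|'')*)'")


def surviving_literals(sql: str) -> list[str]:
    """Return string literals in `sql` that are not recognized placeholders.

    Hypothetical helper: an empty result means every literal was redacted.
    """
    return [
        m.group(1)
        for m in _STRING_LITERAL.finditer(sql)
        if not _PLACEHOLDER.fullmatch(m.group(1))
    ]
```

A regression test in this style would assert that the list is empty for anonymized output and non-empty for a statement that still carries a real literal.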

Most tests force the regex fallback (monkeypatch _locate_redactor) so they're deterministic regardless of whether the mz-sql-anonymize helper is built. One test exercises the parser path (numeric-literal redaction) and is skipif'd when the binary is absent — so CI (no binary) runs 14 and skips 1; locally all 15 run.
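
The fallback-forcing pattern can be sketched roughly as follows. This is illustrative only: `anonymizer` is a stand-in namespace rather than the real module, `redact_literals` and the helper path are hypothetical, and `unittest.mock.patch.object` is used so the sketch runs without pytest (the actual tests use pytest's `monkeypatch` fixture, per the description above).

```python
import re
import types
import warnings
from unittest import mock

# Stand-in for the real module; only the name _locate_redactor comes from
# the PR text, everything else here is hypothetical.
anonymizer = types.SimpleNamespace()
anonymizer._locate_redactor = lambda: "/path/to/mz-sql-anonymize"


def redact_literals(sql):
    """Use the parser helper if located, else warn and fall back to a regex."""
    if anonymizer._locate_redactor() is None:
        warnings.warn("mz-sql-anonymize not found; using regex fallback")
        return re.sub(r"'(?:[^']|'')*'", "'<REDACTED>'", sql)
    raise NotImplementedError("parser path is not modeled in this sketch")


def check_fallback_redacts_and_warns():
    # Force the fallback so the result does not depend on the binary being built.
    with mock.patch.object(anonymizer, "_locate_redactor", return_value=None):
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            out = redact_literals("SELECT 'secret'")
    assert out == "SELECT '<REDACTED>'"
    assert len(caught) == 1  # the fallback warned exactly once
    return out
```

Patching the locator rather than the environment keeps the test independent of what happens to be installed on the machine running it.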

Testing

  • All 15 pass locally (pytest); bin/fmt and ruff check clean.
  • Confirmed the file is collected by pytest's default *_test.py discovery.

🤖 Generated with Claude Code

The anonymizer had no automated coverage despite being a privacy tool with a
heuristic core. Add a pytest module, colocated as
`mz_workload_anonymize_test.py` so the existing `pytest --doctest-modules
misc/python` CI step collects it.

Coverage:
- end-to-end anonymization of a structurally complete workload: identifiers
  scrubbed, anonymized names present;
- regression tests for the connection/sink/source DDL literal leak (hosts,
  users, topics) and column defaults;
- query string-literal redaction;
- cluster SIZE preserved (non-sensitive config);
- the no-output-target error and --in-place overwrite;
- --no-literals keeps literals while still anonymizing identifiers;
- verify_anonymized catches surviving identifiers and literals, accepts both
  the '<REDACTED>' and 'literal_N' placeholders, and exempts cluster literals;
- redact_literals_via_parser returns None (fallback signal) without the binary,
  and the regex fallback warns.

Most tests force the regex fallback so they are deterministic regardless of
whether the mz-sql-anonymize helper is built. One test exercises the
parser-backed path (numeric literal redaction) and is skipped when the binary
is absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jasonhernandez force-pushed the workload-anonymize-tests branch from 24cff59 to 6bde0d6 on May 28, 2026 05:58
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>