feat: c1z sanitizer v0.1 — library + CLI by btipling · Pull Request #875 · ConductorOne/baton-sdk

btipling · 2026-05-25T05:11:16Z

Summary

New pkg/c1zsanitize package + cmd/baton-c1z-sanitize CLI that transform a .c1z into an identity-stripped copy via connectorstore.Reader/Writer. Per-c1z HMAC-SHA256 secret drives every transform.
Cross-references stay coherent because the transform is deterministic within a c1z — every place an id appears (Grant.principal, Grant.entitlement, Grant.sources map keys, Entitlement.resource, Resource.parent_resource_id) maps through the same sanitize_id.
Annotation dispatch is a whitelist of Any type URLs. v0.1 handles UserTrait, GroupTrait, AppTrait, RoleTrait, SecretTrait, LicenseProfileTrait, ScopeBindingTrait. Unknown annotations are dropped by default with a log line; -allow-unknown-annotations flips to pass-through.
Timestamps use anchor-and-shift (newest source timestamp → anchor, single Δ applied uniformly) so relative deltas survive. Asset payloads get a content-type-matched placeholder while the ref chain is preserved.

Implementation follows the design in §6.2 of the investigation document. The sanitizer code never imports c1.storage.v3; it works entirely through connectorstore.Reader/Writer and the connector-v2 wire types as the investigation prescribed.

Output format choice

v2 (sqlite-zstd). The investigation's §7 question 5 punted on v2 vs v3 with the proviso "v0.1 should write v3 by default if PRs #870/#871/#872 have landed; otherwise v2." At the time of this PR, the storage-engine-v4 stack (#867–#872) is all still open on main, so v0.1 writes v2 and v0.2 swaps to v3 once the writer adapter ships.

Open questions / choices for ambiguous items

Resource type IDs preserved. §2.3.1 of the investigation flagged that some connectors might construct resource_type_id with tenant data. v0.1 preserves them; the §7 question 1 audit hasn't run yet.
Asset placeholder content. Image content types get a 1x1 PNG; everything else gets a single zero byte. PutAsset silently drops empty data, so the single byte is the minimum that keeps the cross-reference alive. Document as known-lossy.
Sync ID minting. The destination Writer's StartNewSync mints a fresh KSUID rather than accepting a deterministic transform of the source sync id. Parent linkage is preserved via an in-memory srcSyncID → dstSyncID map maintained for the call. The v2 connectorstore.Writer interface doesn't expose SetSyncID, so the deterministic-KSUID approach from the investigation §6.4 is deferred.
All sync runs copied. No -max-sync-runs flag yet (investigation §7 question 6). Add when needed.
Reversibility. Operator-supplied secret via -secret-file, or the CLI generates one and writes it next to -out with mode 0600. Archive or shred — the sanitizer doesn't choose.
Role scope conditions. §3.7's expression strings are HMACed wholesale. Lossy by design (see §7 question 3). Connector-side structured alternatives are the v0.2 path.

Out of scope for v0.1 (per §6.3)

S3 fetch / upload, Union Station / Temporal integration, content-store registration, opt-in tracking, encryption at rest, reversibility tooling, selective re-sanitization.

Test plan

go vet ./..., gofmt -l, golangci-lint run clean on new code
go test ./pkg/c1zsanitize/ passes — all unit + invariant tests green
go test ./... passes — no existing tests broken
Hand-crafted fixture exercises every annotation handler in v0.1's whitelist (UserTrait/GroupTrait/AppTrait/RoleTrait/SecretTrait/LicenseProfileTrait/ScopeBindingTrait)
Cross-reference integrity invariant: for every source id that appears N times, sanitize_id(id) appears exactly N times in dst
Graph integrity invariants: Grant.principal/Grant.entitlement/Entitlement.resource/Resource.parent_resource_id all resolve in dst
Unknown annotation type dropped by default with DropUnknownAnnotations=true
CLI end-to-end smoke test: build a small fixture, run baton-c1z-sanitize -in src -out dst, assert exit 0 and dst exists

github-actions · 2026-05-25T05:13:52Z

+		Secret:                 secret,
+		TimestampAnchor:        anchor,
+		DropUnknownAnnotations: !*allowUnknown,
+	}


🟠 Bug: dst.Close(ctx) is deferred so its error is silently dropped. For a write-mode C1File that must flush and compress the sqlite-zstd output, a failed close means the CLI reports success but the output .c1z is corrupt. Explicitly close dst on the success path and check the error, keeping the defer only as a safety net for early-return error paths.

github-actions · 2026-05-25T05:13:55Z

+		if err != nil {
+			// Asset referenced from an annotation but missing from
+			// the asset table. Skip — we don't fabricate placeholder
+			// rows because the cross-reference invariant treats it
+			// as a known dangling pointer in the source.
+			s.log.Debug("c1zsanitize: asset ref not found in source", zap.String("asset_id", srcID), zap.Error(err))
+			continue
+		}
+		if _, err := io.Copy(io.Discard, r); err != nil && !errors.Is(err, io.EOF) {
+			return fmt.Errorf("drain source asset %s: %w", srcID, err)


🟡 Suggestion: The reader r from GetAsset is drained but never closed. If the concrete implementation returns an io.ReadCloser behind the io.Reader interface, this leaks the underlying resource. Consider adding a defer-close with an io.Closer type assertion after the nil-error check.

github-actions · 2026-05-25T05:14:16Z

General PR Review: feat: c1z sanitizer v0.1 — library + CLI

Blocking Issues: 0 | Suggestions: 0 | Threads Resolved: 0
Review mode: incremental since fb68164
View review run

Review Summary

The new commits add a clarifying comment on the explicit dst.Close() success-path call and improve the error message from a terse "close dst c1z" to the more descriptive "failed to finalize output c1z". The change is cosmetic and correct. Full PR scan of the c1zsanitize package confirms proper use of crypto/rand for secret generation, crypto/hmac+SHA-256 for deterministic pseudonymization, safe default-drop posture for unknown annotations, and thorough error propagation throughout. No new issues found.

Security Issues

None found.

Correctness Issues

None found.

Suggestions

None.

github-actions

Blocking issues found — see review comments.

github-actions

No blocking issues found.

kans · 2026-05-26T15:12:52Z

+}
+
+func run() error {
+	fs := flag.NewFlagSet("baton-c1z-sanitize", flag.ContinueOnError)


This could go in the baton utility.

kans · 2026-05-26T15:22:56Z

+		}
+		handler, ok := s.handlers[a.GetTypeUrl()]
+		if !ok {
+			if s.dropUnknownAnnotations {


This should probably be true by default?

kans · 2026-05-26T15:25:53Z

+		return ""
+	}
+	s.idHmac.Reset()
+	s.idHmac.Write([]byte(input))


This transformation isn't application aware. For a grant, we can split the parts and hmac the substrings then reconstruct the string so that the references align.

Adds a new pkg/c1zsanitize package and cmd/baton-c1z-sanitize CLI that copies a .c1z file through connectorstore.Reader/Writer, transforming identifiers, names, free text, emails, and timestamps under a per-c1z HMAC-SHA256 secret while preserving graph topology, cardinalities, and annotation structure. Cross-references stay coherent because every transform is deterministic within a c1z: the same input id always maps to the same sanitized id, so GrantRecord.principal / .entitlement / .sources keys, EntitlementRecord .resource, and ResourceRecord.parent_resource_id resolve in the output. Different per-c1z secrets keep distinct c1zs uncorrelatable. Annotation dispatch is a whitelist keyed by Any type URL. v0.1 ships handlers for UserTrait, GroupTrait, AppTrait, RoleTrait, SecretTrait, LicenseProfileTrait, and ScopeBindingTrait. Unknown annotation types are dropped by default with a log line naming the type URL; an operator flag flips behavior to pass-through. Timestamps use anchor-and-shift: a single delta is computed from the newest source sync timestamp and applied uniformly so relative deltas survive. AssetRecord.data is replaced with a content-type-matched placeholder while the asset ref chain is preserved.

The CLI deferred dst.Close and threw the error away. For a write-mode C1File the Close call is where sqlite-zstd finalizes and compresses the output; losing that error meant a corrupt sanitized .c1z could be reported as success. Explicit Close on the success path with a guarded defer for the early-return paths. The sanitizer's hot-loop id() built a fresh hmac.Hash on every call, redoing the SHA-256 key schedule each time. Stash one hmac.Hash on the sanitizer and Reset() between calls — single-threaded run, no locking needed. SanitizeID stays as the allocation-y reference for external callers (sanitizeEmail, tests). copyAssets drained the GetAsset reader with io.Copy but never asked whether the underlying type was an io.Closer; at least one impl is *os.File-backed and was leaking a fd per asset. drainAndClose does the type assertion and the drain in one spot. Drop a dead range loop in the xref-integrity test.

Make the Options zero value fail closed. A caller that does not set the flag now gets unknown-type annotations dropped instead of passed through, so a newly-added annotation type carrying customer data can never leave the sanitizer unsanitized. Pass-through is still available for development via -allow-unknown-annotations / AllowUnknownAnnotations. Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>

github-actions

No blocking issues found.

Transform grant and entitlement IDs component-wise instead of hashing the whole opaque string. A baton entitlement ID is resourceType:resourceID:permission and a grant ID is entitlementID:principalType:principalID; the connector-defined type tokens are preserved and the identifier components are HMAC'd. This keeps a sanitized grant's embedded entitlement and principal references equal to the separately-sanitized entitlement row, resource rows, and grant-sources keys, so cross-references still resolve. IDs that don't parse to the expected component count are connector-custom and not decomposable, so they fall back to a whole-string HMAC, preserving the invariant that equal source IDs map to equal sanitized IDs everywhere they appear. All ID sites are updated together: grant id, entitlement id, grant-sources map keys, and license-profile entitlement ids. Move the HMAC-secret load/generate helper out of the CLI main into the c1zsanitize package, which already owns the secret-length contract, so it is reusable and unit-testable. Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>

github-actions

No blocking issues found.

Drop the named returns from LoadOrGenerateSecret (nonamedreturns) and return explicit values. Discard the hmac.Hash.Write results in the two ID hashers; that method never returns a non-nil error, so the discard is intentional and silences the unhandled-error check. Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>

github-actions

No blocking issues found.

Transform a composite identifier one ':'-delimited component at a time and preserve a component in cleartext only when it is a declared resource-type token. Every other component is HMAC'd, including an opaque id that sits where the canonical grammar expects a type. The previous transform trusted positional grammar — grant "entitlementID:principalType:principalID", entitlement "resourceType:resourceID:permission" — and kept the type-slot field verbatim. Connectors emit non-canonical IDs: a Microsoft Entra grant id carries a tenant group UUID in a type slot, so positional trust let that UUID survive un-hashed. Keying on the set of declared resource types instead means a UUID in any position is hashed. HMAC is deterministic, so a component that is also a resource or principal id still hashes to the value used for the separately- sanitized structured field, so cross-references stay coherent and the canonical alignment holds. The resource-type set is gathered as types are copied, which precedes entitlements and grants in every sync. Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>

github-actions

No blocking issues found.

The sanitizer's destination c1z is a net-new intermediate discarded on any failure, so it pays for durability it does not need. Open it with journal_mode=OFF and synchronous=OFF — the same throwaway-output rationale the compactor uses — to drop the per-commit journal and fsync, and give SQLite a 64MB page cache to cut index-maintenance misses once the index working set outgrows the small default cache on multi-million-grant syncs. Raise the source read page from 1000 to 10000. The page also bounds how many rows each dst Put batches into one transaction, so larger pages mean far fewer commits at scale; the dst writer still sub-chunks each INSERT statement, keeping the per-statement parameter count under SQLite's ceiling. These change throughput only: the sanitized output is byte-identical on success (verified by digest over a multi-page population). Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>

github-actions

No blocking issues found.

kans asked for the sanitize command to live in the baton utility rather than as a standalone binary. Add it as a `baton sanitize` subcommand alongside the other c1z tools, taking the source from the shared --file flag and keeping the dst-build pragmas and secret handling. Remove the standalone cmd/baton-c1z-sanitize. The pkg/c1zsanitize library is unchanged, so the sanitized output is byte-identical. Also document at the unknown-annotation branch why drop is the default: an annotation type with no handler has not been inspected, so passing it through could leak un-sanitized data. Pass-through stays opt-in via AllowUnknownAnnotations for development. Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>

github-actions

No blocking issues found.

Close on the sanitize success path flushes and zstd-compresses the sqlite output, so a failed close means the .c1z is incomplete or corrupt. Report it as a finalize failure instead of a generic close error so a broken output is never mistaken for success. The explicit checked close on the success path and the deferred safety-net close for error paths were already in place; this only clarifies the failure. Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>

github-actions

No blocking issues found.

github-actions Bot reviewed May 25, 2026

View reviewed changes

github-actions Bot requested changes May 25, 2026

View reviewed changes

github-actions Bot reviewed May 25, 2026

View reviewed changes

btipling marked this pull request as ready for review May 25, 2026 09:05

btipling requested a review from a team May 25, 2026 09:06

kans reviewed May 26, 2026

View reviewed changes

btipling and others added 3 commits June 1, 2026 20:55

btipling force-pushed the bt/c1zsanitize-v0.1 branch from ad4f939 to 9a5885a Compare June 1, 2026 21:02

github-actions Bot reviewed Jun 1, 2026

View reviewed changes

btipling requested a review from kans June 2, 2026 04:06

github-actions Bot reviewed Jun 2, 2026

View reviewed changes

Conversation

btipling commented May 25, 2026

Summary

Output format choice

Open questions / choices for ambiguous items

Out of scope for v0.1 (per §6.3)

Test plan

Uh oh!

github-actions Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

General PR Review: feat: c1z sanitizer v0.1 — library + CLI

Review Summary

Security Issues

Correctness Issues

Suggestions

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

kans May 26, 2026

Choose a reason for hiding this comment

Uh oh!

kans May 26, 2026

Choose a reason for hiding this comment

Uh oh!

kans May 26, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented May 25, 2026 •

edited

Loading