Skip to content

feat: c1z sanitizer v0.1 — library + CLI#875

Open
btipling wants to merge 9 commits into
mainfrom
bt/c1zsanitize-v0.1
Open

feat: c1z sanitizer v0.1 — library + CLI#875
btipling wants to merge 9 commits into
mainfrom
bt/c1zsanitize-v0.1

Conversation

@btipling
Copy link
Copy Markdown
Contributor

Summary

  • New pkg/c1zsanitize package + cmd/baton-c1z-sanitize CLI that transform a .c1z into an identity-stripped copy via connectorstore.Reader/Writer. Per-c1z HMAC-SHA256 secret drives every transform.
  • Cross-references stay coherent because the transform is deterministic within a c1z — every place an id appears (Grant.principal, Grant.entitlement, Grant.sources map keys, Entitlement.resource, Resource.parent_resource_id) maps through the same sanitize_id.
  • Annotation dispatch is a whitelist of Any type URLs. v0.1 handles UserTrait, GroupTrait, AppTrait, RoleTrait, SecretTrait, LicenseProfileTrait, ScopeBindingTrait. Unknown annotations are dropped by default with a log line; -allow-unknown-annotations flips to pass-through.
  • Timestamps use anchor-and-shift (newest source timestamp → anchor, single Δ applied uniformly) so relative deltas survive. Asset payloads get a content-type-matched placeholder while the ref chain is preserved.

Implementation follows the design in §6.2 of the investigation document. The sanitizer code never imports c1.storage.v3; it works entirely through connectorstore.Reader/Writer and the connector-v2 wire types as the investigation prescribed.

Output format choice

v2 (sqlite-zstd). The investigation's §7 question 5 punted on v2 vs v3 with the proviso "v0.1 should write v3 by default if PRs #870/#871/#872 have landed; otherwise v2." At the time of this PR, the storage-engine-v4 stack (#867#872) is all still open on main, so v0.1 writes v2 and v0.2 swaps to v3 once the writer adapter ships.

Open questions / choices for ambiguous items

  • Resource type IDs preserved. §2.3.1 of the investigation flagged that some connectors might construct resource_type_id with tenant data. v0.1 preserves them; the §7 question 1 audit hasn't run yet.
  • Asset placeholder content. Image content types get a 1x1 PNG; everything else gets a single zero byte. PutAsset silently drops empty data, so the single byte is the minimum that keeps the cross-reference alive. Document as known-lossy.
  • Sync ID minting. The destination Writer's StartNewSync mints a fresh KSUID rather than accepting a deterministic transform of the source sync id. Parent linkage is preserved via an in-memory srcSyncID → dstSyncID map maintained for the call. The v2 connectorstore.Writer interface doesn't expose SetSyncID, so the deterministic-KSUID approach from the investigation §6.4 is deferred.
  • All sync runs copied. No -max-sync-runs flag yet (investigation §7 question 6). Add when needed.
  • Reversibility. Operator-supplied secret via -secret-file, or the CLI generates one and writes it next to -out with mode 0600. Archive or shred — the sanitizer doesn't choose.
  • Role scope conditions. §3.7's expression strings are HMACed wholesale. Lossy by design (see §7 question 3). Connector-side structured alternatives are the v0.2 path.

Out of scope for v0.1 (per §6.3)

  • S3 fetch / upload, Union Station / Temporal integration, content-store registration, opt-in tracking, encryption at rest, reversibility tooling, selective re-sanitization.

Test plan

  • go vet ./..., gofmt -l, golangci-lint run clean on new code
  • go test ./pkg/c1zsanitize/ passes — all unit + invariant tests green
  • go test ./... passes — no existing tests broken
  • Hand-crafted fixture exercises every annotation handler in v0.1's whitelist (UserTrait/GroupTrait/AppTrait/RoleTrait/SecretTrait/LicenseProfileTrait/ScopeBindingTrait)
  • Cross-reference integrity invariant: for every source id that appears N times, sanitize_id(id) appears exactly N times in dst
  • Graph integrity invariants: Grant.principal/Grant.entitlement/Entitlement.resource/Resource.parent_resource_id all resolve in dst
  • Unknown annotation type dropped by default with DropUnknownAnnotations=true
  • CLI end-to-end smoke test: build a small fixture, run baton-c1z-sanitize -in src -out dst, assert exit 0 and dst exists

Comment thread cmd/baton-c1z-sanitize/main.go Outdated
Secret: secret,
TimestampAnchor: anchor,
DropUnknownAnnotations: !*allowUnknown,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Bug: dst.Close(ctx) is deferred so its error is silently dropped. For a write-mode C1File that must flush and compress the sqlite-zstd output, a failed close means the CLI reports success but the output .c1z is corrupt. Explicitly close dst on the success path and check the error, keeping the defer only as a safety net for early-return error paths.

Comment thread pkg/c1zsanitize/assets.go
Comment on lines +99 to +108
if err != nil {
// Asset referenced from an annotation but missing from
// the asset table. Skip — we don't fabricate placeholder
// rows because the cross-reference invariant treats it
// as a known dangling pointer in the source.
s.log.Debug("c1zsanitize: asset ref not found in source", zap.String("asset_id", srcID), zap.Error(err))
continue
}
if _, err := io.Copy(io.Discard, r); err != nil && !errors.Is(err, io.EOF) {
return fmt.Errorf("drain source asset %s: %w", srcID, err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: The reader r from GetAsset is drained but never closed. If the concrete implementation returns an io.ReadCloser behind the io.Reader interface, this leaks the underlying resource. Consider adding a defer-close with an io.Closer type assertion after the nil-error check.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 25, 2026

General PR Review: feat: c1z sanitizer v0.1 — library + CLI

Blocking Issues: 0 | Suggestions: 0 | Threads Resolved: 0
Review mode: incremental since fb68164
View review run

Review Summary

The new commits add a clarifying comment on the explicit dst.Close() success-path call and improve the error message from a terse "close dst c1z" to the more descriptive "failed to finalize output c1z". The change is cosmetic and correct. Full PR scan of the c1zsanitize package confirms proper use of crypto/rand for secret generation, crypto/hmac+SHA-256 for deterministic pseudonymization, safe default-drop posture for unknown annotations, and thorough error propagation throughout. No new issues found.

Security Issues

None found.

Correctness Issues

None found.

Suggestions

None.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see review comments.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

@btipling btipling marked this pull request as ready for review May 25, 2026 09:05
@btipling btipling requested a review from a team May 25, 2026 09:06
Comment thread cmd/baton-c1z-sanitize/main.go Outdated
}

func run() error {
fs := flag.NewFlagSet("baton-c1z-sanitize", flag.ContinueOnError)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could go in the baton utility.

}
handler, ok := s.handlers[a.GetTypeUrl()]
if !ok {
if s.dropUnknownAnnotations {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be true by default?

Comment thread pkg/c1zsanitize/sanitize.go Outdated
return ""
}
s.idHmac.Reset()
s.idHmac.Write([]byte(input))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This transformation isn't application aware. For a grant, we can split the parts and hmac the substrings then reconstruct the string so that the references align.

btipling and others added 3 commits June 1, 2026 20:55
Adds a new pkg/c1zsanitize package and cmd/baton-c1z-sanitize CLI that
copies a .c1z file through connectorstore.Reader/Writer, transforming
identifiers, names, free text, emails, and timestamps under a per-c1z
HMAC-SHA256 secret while preserving graph topology, cardinalities, and
annotation structure.

Cross-references stay coherent because every transform is deterministic
within a c1z: the same input id always maps to the same sanitized id, so
GrantRecord.principal / .entitlement / .sources keys, EntitlementRecord
.resource, and ResourceRecord.parent_resource_id resolve in the output.
Different per-c1z secrets keep distinct c1zs uncorrelatable.

Annotation dispatch is a whitelist keyed by Any type URL. v0.1 ships
handlers for UserTrait, GroupTrait, AppTrait, RoleTrait, SecretTrait,
LicenseProfileTrait, and ScopeBindingTrait. Unknown annotation types
are dropped by default with a log line naming the type URL; an
operator flag flips behavior to pass-through.

Timestamps use anchor-and-shift: a single delta is computed from the
newest source sync timestamp and applied uniformly so relative deltas
survive. AssetRecord.data is replaced with a content-type-matched
placeholder while the asset ref chain is preserved.
The CLI deferred dst.Close and threw the error away. For a write-mode
C1File the Close call is where sqlite-zstd finalizes and compresses
the output; losing that error meant a corrupt sanitized .c1z could
be reported as success. Explicit Close on the success path with a
guarded defer for the early-return paths.

The sanitizer's hot-loop id() built a fresh hmac.Hash on every call,
redoing the SHA-256 key schedule each time. Stash one hmac.Hash on
the sanitizer and Reset() between calls — single-threaded run, no
locking needed. SanitizeID stays as the allocation-y reference for
external callers (sanitizeEmail, tests).

copyAssets drained the GetAsset reader with io.Copy but never asked
whether the underlying type was an io.Closer; at least one impl is
*os.File-backed and was leaking a fd per asset. drainAndClose does
the type assertion and the drain in one spot.

Drop a dead range loop in the xref-integrity test.
Make the Options zero value fail closed. A caller that does not set
the flag now gets unknown-type annotations dropped instead of passed
through, so a newly-added annotation type carrying customer data can
never leave the sanitizer unsanitized. Pass-through is still available
for development via -allow-unknown-annotations / AllowUnknownAnnotations.

Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
@btipling btipling force-pushed the bt/c1zsanitize-v0.1 branch from ad4f939 to 9a5885a Compare June 1, 2026 21:02
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

Transform grant and entitlement IDs component-wise instead of hashing
the whole opaque string. A baton entitlement ID is
resourceType:resourceID:permission and a grant ID is
entitlementID:principalType:principalID; the connector-defined type
tokens are preserved and the identifier components are HMAC'd. This
keeps a sanitized grant's embedded entitlement and principal
references equal to the separately-sanitized entitlement row, resource
rows, and grant-sources keys, so cross-references still resolve. IDs
that don't parse to the expected component count are connector-custom
and not decomposable, so they fall back to a whole-string HMAC,
preserving the invariant that equal source IDs map to equal sanitized
IDs everywhere they appear. All ID sites are updated together: grant
id, entitlement id, grant-sources map keys, and license-profile
entitlement ids.

Move the HMAC-secret load/generate helper out of the CLI main into the
c1zsanitize package, which already owns the secret-length contract, so
it is reusable and unit-testable.

Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

Drop the named returns from LoadOrGenerateSecret (nonamedreturns) and
return explicit values. Discard the hmac.Hash.Write results in the two
ID hashers; that method never returns a non-nil error, so the discard
is intentional and silences the unhandled-error check.

Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

Transform a composite identifier one ':'-delimited component at a time
and preserve a component in cleartext only when it is a declared
resource-type token. Every other component is HMAC'd, including an
opaque id that sits where the canonical grammar expects a type.

The previous transform trusted positional grammar — grant
"entitlementID:principalType:principalID", entitlement
"resourceType:resourceID:permission" — and kept the type-slot field
verbatim. Connectors emit non-canonical IDs: a Microsoft Entra grant
id carries a tenant group UUID in a type slot, so positional trust let
that UUID survive un-hashed. Keying on the set of declared resource
types instead means a UUID in any position is hashed.

HMAC is deterministic, so a component that is also a resource or
principal id still hashes to the value used for the separately-
sanitized structured field, so cross-references stay coherent and the
canonical alignment holds. The resource-type set is gathered as types
are copied, which precedes entitlements and grants in every sync.

Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

The sanitizer's destination c1z is a net-new intermediate discarded on
any failure, so it pays for durability it does not need. Open it with
journal_mode=OFF and synchronous=OFF — the same throwaway-output
rationale the compactor uses — to drop the per-commit journal and
fsync, and give SQLite a 64MB page cache to cut index-maintenance
misses once the index working set outgrows the small default cache on
multi-million-grant syncs.

Raise the source read page from 1000 to 10000. The page also bounds
how many rows each dst Put batches into one transaction, so larger
pages mean far fewer commits at scale; the dst writer still sub-chunks
each INSERT statement, keeping the per-statement parameter count under
SQLite's ceiling.

These change throughput only: the sanitized output is byte-identical
on success (verified by digest over a multi-page population).

Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

kans asked for the sanitize command to live in the baton utility
rather than as a standalone binary. Add it as a `baton sanitize`
subcommand alongside the other c1z tools, taking the source from the
shared --file flag and keeping the dst-build pragmas and secret
handling. Remove the standalone cmd/baton-c1z-sanitize. The
pkg/c1zsanitize library is unchanged, so the sanitized output is
byte-identical.

Also document at the unknown-annotation branch why drop is the default:
an annotation type with no handler has not been inspected, so passing
it through could leak un-sanitized data. Pass-through stays opt-in via
AllowUnknownAnnotations for development.

Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

@btipling btipling requested a review from kans June 2, 2026 04:06
Close on the sanitize success path flushes and zstd-compresses the
sqlite output, so a failed close means the .c1z is incomplete or
corrupt. Report it as a finalize failure instead of a generic close
error so a broken output is never mistaken for success. The explicit
checked close on the success path and the deferred safety-net close for
error paths were already in place; this only clarifies the failure.

Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants