feat: c1z sanitizer v0.1 — library + CLI#875
        Secret: secret,
        TimestampAnchor: anchor,
        DropUnknownAnnotations: !*allowUnknown,
    }
🟠 Bug: dst.Close(ctx) is deferred so its error is silently dropped. For a write-mode C1File that must flush and compress the sqlite-zstd output, a failed close means the CLI reports success but the output .c1z is corrupt. Explicitly close dst on the success path and check the error, keeping the defer only as a safety net for early-return error paths.
    if err != nil {
        // Asset referenced from an annotation but missing from
        // the asset table. Skip: we don't fabricate placeholder
        // rows because the cross-reference invariant treats it
        // as a known dangling pointer in the source.
        s.log.Debug("c1zsanitize: asset ref not found in source", zap.String("asset_id", srcID), zap.Error(err))
        continue
    }
    if _, err := io.Copy(io.Discard, r); err != nil && !errors.Is(err, io.EOF) {
        return fmt.Errorf("drain source asset %s: %w", srcID, err)
🟡 Suggestion: The reader r from GetAsset is drained but never closed. If the concrete implementation returns an io.ReadCloser behind the io.Reader interface, this leaks the underlying resource. Consider adding a defer-close with an io.Closer type assertion after the nil-error check.
General PR Review: feat: c1z sanitizer v0.1 — library + CLI
Blocking Issues: 0 | Suggestions: 0 | Threads Resolved: 0
Review Summary
The new commits add a clarifying comment on the explicit
Security Issues
None found.
Correctness Issues
None found.
Suggestions
None.
    }

    func run() error {
        fs := flag.NewFlagSet("baton-c1z-sanitize", flag.ContinueOnError)
This could go in the baton utility.
    }
    handler, ok := s.handlers[a.GetTypeUrl()]
    if !ok {
        if s.dropUnknownAnnotations {
This should probably be true by default?
        return ""
    }
    s.idHmac.Reset()
    s.idHmac.Write([]byte(input))
This transformation isn't application aware. For a grant, we can split the parts and hmac the substrings then reconstruct the string so that the references align.
Adds a new pkg/c1zsanitize package and cmd/baton-c1z-sanitize CLI that copies a .c1z file through connectorstore.Reader/Writer, transforming identifiers, names, free text, emails, and timestamps under a per-c1z HMAC-SHA256 secret while preserving graph topology, cardinalities, and annotation structure.

Cross-references stay coherent because every transform is deterministic within a c1z: the same input id always maps to the same sanitized id, so GrantRecord.principal / .entitlement / .sources keys, EntitlementRecord .resource, and ResourceRecord.parent_resource_id resolve in the output. Different per-c1z secrets keep distinct c1zs uncorrelatable.

Annotation dispatch is a whitelist keyed by Any type URL. v0.1 ships handlers for UserTrait, GroupTrait, AppTrait, RoleTrait, SecretTrait, LicenseProfileTrait, and ScopeBindingTrait. Unknown annotation types are dropped by default with a log line naming the type URL; an operator flag flips behavior to pass-through.

Timestamps use anchor-and-shift: a single delta is computed from the newest source sync timestamp and applied uniformly so relative deltas survive. AssetRecord.data is replaced with a content-type-matched placeholder while the asset ref chain is preserved.
The CLI deferred dst.Close and threw the error away. For a write-mode C1File the Close call is where sqlite-zstd finalizes and compresses the output; losing that error meant a corrupt sanitized .c1z could be reported as success. Close explicitly on the success path, with a guarded defer for the early-return paths.

The sanitizer's hot-loop id() built a fresh hmac.Hash on every call, redoing the SHA-256 key schedule each time. Stash one hmac.Hash on the sanitizer and Reset() between calls; the run is single-threaded, so no locking is needed. SanitizeID stays as the allocation-y reference for external callers (sanitizeEmail, tests).

copyAssets drained the GetAsset reader with io.Copy but never asked whether the underlying type was an io.Closer; at least one impl is *os.File-backed and was leaking a fd per asset. drainAndClose does the type assertion and the drain in one spot.

Drop a dead range loop in the xref-integrity test.
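The hmac.Hash reuse described in this commit can be sketched as follows. The `idHasher` type, the hex encoding, and `freshID` are assumptions made for illustration; only the Reset-between-calls idea comes from the commit message:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash"
)

// idHasher keeps one keyed HMAC and reuses it across calls.
// It is not safe for concurrent use.
type idHasher struct {
	mac hash.Hash
}

func newIDHasher(secret []byte) *idHasher {
	return &idHasher{mac: hmac.New(sha256.New, secret)}
}

// id returns a hex digest of input; empty input stays empty.
func (h *idHasher) id(input string) string {
	if input == "" {
		return ""
	}
	h.mac.Reset() // back to the keyed initial state
	_, _ = h.mac.Write([]byte(input)) // hash.Hash.Write never returns an error
	return hex.EncodeToString(h.mac.Sum(nil))
}

// freshID allocates a new HMAC per call; it serves as the reference.
func freshID(secret []byte, input string) string {
	m := hmac.New(sha256.New, secret)
	_, _ = m.Write([]byte(input))
	return hex.EncodeToString(m.Sum(nil))
}

func main() {
	secret := []byte("example-secret-not-for-production")
	h := newIDHasher(secret)
	a := h.id("user-1")
	_ = h.id("user-2") // an intervening call must not affect later results
	fmt.Println(a == h.id("user-1"), a == freshID(secret, "user-1"))
}
```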
Make the Options zero value fail closed. A caller that does not set the flag now gets unknown-type annotations dropped instead of passed through, so a newly-added annotation type carrying customer data can never leave the sanitizer unsanitized. Pass-through is still available for development via -allow-unknown-annotations / AllowUnknownAnnotations. Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
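A small sketch of what "fail closed" means for the zero value. The `AllowUnknownAnnotations` field name comes from the commit message; the `handlers` map, the type URLs, and `sanitizeAnnotation` are hypothetical and do not mirror the package's real dispatch code:

```go
package main

import "fmt"

// Options uses an opt-in flag, so the zero value drops unknown types.
type Options struct {
	AllowUnknownAnnotations bool
}

// handlers is a hypothetical whitelist keyed by type URL.
var handlers = map[string]func(string) string{
	"type.example/KnownTrait": func(_ string) string { return "sanitized" },
}

// sanitizeAnnotation returns the output value and whether to keep it.
func sanitizeAnnotation(o Options, typeURL, value string) (string, bool) {
	if h, ok := handlers[typeURL]; ok {
		return h(value), true
	}
	if o.AllowUnknownAnnotations {
		return value, true // development-only pass-through
	}
	return "", false // default: an uninspected type never leaves unsanitized
}

func main() {
	_, kept := sanitizeAnnotation(Options{}, "type.example/NewTrait", "raw customer data")
	fmt.Println("kept with zero-value Options:", kept)
}
```

Naming the field for the unsafe behavior means a caller has to write `true` somewhere to get pass-through; forgetting to set anything stays safe.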
ad4f939 to 9a5885a
Transform grant and entitlement IDs component-wise instead of hashing the whole opaque string. A baton entitlement ID is resourceType:resourceID:permission and a grant ID is entitlementID:principalType:principalID; the connector-defined type tokens are preserved and the identifier components are HMAC'd. This keeps a sanitized grant's embedded entitlement and principal references equal to the separately-sanitized entitlement row, resource rows, and grant-sources keys, so cross-references still resolve. IDs that don't parse to the expected component count are connector-custom and not decomposable, so they fall back to a whole-string HMAC, preserving the invariant that equal source IDs map to equal sanitized IDs everywhere they appear. All ID sites are updated together: grant id, entitlement id, grant-sources map keys, and license-profile entitlement ids. Move the HMAC-secret load/generate helper out of the CLI main into the c1zsanitize package, which already owns the secret-length contract, so it is reusable and unit-testable. Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
Drop the named returns from LoadOrGenerateSecret (nonamedreturns) and return explicit values. Discard the hmac.Hash.Write results in the two ID hashers; that method never returns a non-nil error, so the discard is intentional and silences the unhandled-error check. Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
Transform a composite identifier one ':'-delimited component at a time and preserve a component in cleartext only when it is a declared resource-type token. Every other component is HMAC'd, including an opaque id that sits where the canonical grammar expects a type.

The previous transform trusted positional grammar (grant "entitlementID:principalType:principalID", entitlement "resourceType:resourceID:permission") and kept the type-slot field verbatim. Connectors emit non-canonical IDs: a Microsoft Entra grant id carries a tenant group UUID in a type slot, so positional trust let that UUID survive un-hashed. Keying on the set of declared resource types instead means a UUID in any position is hashed.

HMAC is deterministic, so a component that is also a resource or principal id still hashes to the value used for the separately-sanitized structured field; cross-references stay coherent and the canonical alignment holds. The resource-type set is gathered as types are copied, which precedes entitlements and grants in every sync.

Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
The sanitizer's destination c1z is a net-new intermediate discarded on any failure, so it pays for durability it does not need. Open it with journal_mode=OFF and synchronous=OFF (the same throwaway-output rationale the compactor uses) to drop the per-commit journal and fsync, and give SQLite a 64MB page cache to cut index-maintenance misses once the index working set outgrows the small default cache on multi-million-grant syncs.

Raise the source read page from 1000 to 10000. The page also bounds how many rows each dst Put batches into one transaction, so larger pages mean far fewer commits at scale; the dst writer still sub-chunks each INSERT statement, keeping the per-statement parameter count under SQLite's ceiling.

These change throughput only: the sanitized output is byte-identical on success (verified by digest over a multi-page population).

Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
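To make the sub-chunking arithmetic concrete, here is a sketch of splitting one page of rows into INSERT-sized ranges. The 10000-row page comes from the commit message; the 8 columns per row, the 999-parameter limit, and `chunkRows` are illustrative assumptions, not the writer's real values:

```go
package main

import "fmt"

// chunkRows splits n rows into [start, end) ranges so that each range
// binds at most maxParams parameters at colsPerRow parameters per row.
func chunkRows(n, colsPerRow, maxParams int) [][2]int {
	if n <= 0 || colsPerRow <= 0 || maxParams < colsPerRow {
		return nil
	}
	perChunk := maxParams / colsPerRow // whole rows that fit in one statement
	var out [][2]int
	for start := 0; start < n; start += perChunk {
		end := start + perChunk
		if end > n {
			end = n
		}
		out = append(out, [2]int{start, end})
	}
	return out
}

func main() {
	chunks := chunkRows(10000, 8, 999)
	fmt.Println(len(chunks), chunks[0], chunks[len(chunks)-1])
}
```

All chunks of a page can still run inside one transaction, which is why a larger page reduces commits without raising the per-statement parameter count.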
kans asked for the sanitize command to live in the baton utility rather than as a standalone binary. Add it as a `baton sanitize` subcommand alongside the other c1z tools, taking the source from the shared --file flag and keeping the dst-build pragmas and secret handling. Remove the standalone cmd/baton-c1z-sanitize. The pkg/c1zsanitize library is unchanged, so the sanitized output is byte-identical. Also document at the unknown-annotation branch why drop is the default: an annotation type with no handler has not been inspected, so passing it through could leak un-sanitized data. Pass-through stays opt-in via AllowUnknownAnnotations for development. Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
Close on the sanitize success path flushes and zstd-compresses the sqlite output, so a failed close means the .c1z is incomplete or corrupt. Report it as a finalize failure instead of a generic close error so a broken output is never mistaken for success. The explicit checked close on the success path and the deferred safety-net close for error paths were already in place; this only clarifies the failure. Co-authored-by: c1-squire-dev[bot] <c1-squire-dev[bot]@users.noreply.github.com>
Summary

- `pkg/c1zsanitize` package + `cmd/baton-c1z-sanitize` CLI that transform a `.c1z` into an identity-stripped copy via `connectorstore.Reader`/`Writer`. Per-c1z HMAC-SHA256 secret drives every transform.
- […] (`Grant.principal`, `Grant.entitlement`, `Grant.sources` map keys, `Entitlement.resource`, `Resource.parent_resource_id`) maps through the same `sanitize_id`.
- […] `UserTrait`, `GroupTrait`, `AppTrait`, `RoleTrait`, `SecretTrait`, `LicenseProfileTrait`, `ScopeBindingTrait`. Unknown annotations are dropped by default with a log line; `-allow-unknown-annotations` flips to pass-through.

Implementation follows the design in §6.2 of the investigation document. The sanitizer code never imports `c1.storage.v3`; it works entirely through `connectorstore.Reader`/`Writer` and the connector-v2 wire types as the investigation prescribed.

Output format choice

v2 (sqlite-zstd). The investigation's §7 question 5 punted on v2 vs v3 with the proviso "v0.1 should write v3 by default if PRs #870/#871/#872 have landed; otherwise v2." At the time of this PR, the storage-engine-v4 stack (#867–#872) is all still open on `main`, so v0.1 writes v2 and v0.2 swaps to v3 once the writer adapter ships.

Open questions / choices for ambiguous items

- […] `resource_type_id` with tenant data. v0.1 preserves them; the §7 question 1 audit hasn't run yet.
- […] `PutAsset` silently drops empty data, so the single byte is the minimum that keeps the cross-reference alive. Document as known-lossy.
- […] `StartNewSync` mints a fresh KSUID rather than accepting a deterministic transform of the source sync id. Parent linkage is preserved via an in-memory `srcSyncID → dstSyncID` map maintained for the call. The v2 `connectorstore.Writer` interface doesn't expose `SetSyncID`, so the deterministic-KSUID approach from the investigation §6.4 is deferred.
- […] `-max-sync-runs` flag yet (investigation §7 question 6). Add when needed.
- […] `-secret-file`, or the CLI generates one and writes it next to `-out` with mode 0600. Archive or shred; the sanitizer doesn't choose.

Out of scope for v0.1 (per §6.3)

Test plan

- `go vet ./...`, `gofmt -l`, `golangci-lint run` clean on new code
- `go test ./pkg/c1zsanitize/` passes: all unit + invariant tests green
- `go test ./...` passes: no existing tests broken
- […] `sanitize_id(id)` appears exactly N times in dst
- […] `Grant.principal` / `Grant.entitlement` / `Entitlement.resource` / `Resource.parent_resource_id` all resolve in dst
- […] `DropUnknownAnnotations=true`
- […] `baton-c1z-sanitize -in src -out dst`, assert exit 0 and dst exists
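The "all resolve in dst" item of the test plan amounts to a dangling-reference check. A minimal sketch, assuming simplified in-memory records: the `grant` struct and `dangling` helper are hypothetical and far smaller than the real connector-v2 types:

```go
package main

import "fmt"

// grant is a hypothetical, flattened view of a grant row.
type grant struct {
	Principal   string
	Entitlement string
}

// dangling lists every grant reference that does not resolve to a known
// resource or entitlement id in the destination.
func dangling(resources, entitlements map[string]bool, grants []grant) []string {
	var out []string
	for _, g := range grants {
		if !resources[g.Principal] {
			out = append(out, "principal:"+g.Principal)
		}
		if !entitlements[g.Entitlement] {
			out = append(out, "entitlement:"+g.Entitlement)
		}
	}
	return out
}

func main() {
	resources := map[string]bool{"r1": true}
	entitlements := map[string]bool{"e1": true}
	grants := []grant{{Principal: "r1", Entitlement: "e1"}, {Principal: "r2", Entitlement: "e1"}}
	fmt.Println(dangling(resources, entitlements, grants))
}
```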