Skip to content

Add a function to serialize MDBMinimalShard with a preset timestamp#796

Open
seanses wants to merge 1 commit into
mainfrom
di/serialize-shard-without-timestamp
Open

Add a function to serialize MDBMinimalShard with a preset timestamp#796
seanses wants to merge 1 commit into
mainfrom
di/serialize-shard-without-timestamp

Conversation

@seanses

@seanses seanses commented Apr 13, 2026

Copy link
Copy Markdown
Collaborator

To be used in https://github.com/huggingface-internal/xetcas/pull/1015


Note

Medium Risk
Touches shard serialization/footer metadata and changes serialize_impl call signatures, which could affect on-disk shard determinism and any callers relying on the previous timestamp behavior.

Overview
Adds support for deterministic shard serialization by allowing MDBMinimalShard to be serialized with a caller-provided creation timestamp.

serialize_impl now accepts an optional timestamp and uses it to populate MDBShardFileFooter.shard_creation_timestamp, while existing serialization paths continue to default to current_timestamp; a new serialize_with_timestamp API is added and existing serialize_* helpers are updated to pass the new parameter.

Reviewed by Cursor Bugbot for commit 687a9bc. Bugbot is set up for automated code reviews on this repo. Configure here.

@seanses seanses marked this pull request as ready for review April 14, 2026 00:30

@hoytak hoytak left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rajatarya rajatarya left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice small addition — the deterministic-timestamp entry point is a clean extension of the existing serialize_impl plumbing, and the default path is preserved exactly. A few thoughts that are mostly follow-ups rather than blockers (the two existing approvals look right to me):

  1. No test for the new serialize_with_timestamp — the existing test_serialize_xorb_subset_with_expiry_footer is a nice template for round-tripping the preset value through the footer. See inline comment on the new function.
  2. The SystemTimeu64 unix-seconds conversion is now duplicated twice in serialize_impl. Small local helper would tighten this up. Inline comment on the new .map(...) block.
  3. One question about the public API shape — serialize and serialize_with_timestamp are now almost-identical with the only difference being None vs Some(t). Would a single method taking Option<SystemTime> (or impl Into<Option<SystemTime>>) be preferable, or do you explicitly want the two call sites to read differently at the callee? Inline on the new function.
  4. Minor: serialize_impl now takes two consecutive Option<SystemTime> params (timestamp, expiry). Easy to swap at a call site. Inline comment on the signature with a tiny suggestion.

For context on what I skipped: @sirahd and @hoytak both approved with no written feedback, so none of the above has been raised yet.

with_file_section: bool,
with_verification: bool,
timestamp: Option<SystemTime>,
expiry: Option<SystemTime>,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: two Option<SystemTime> params back-to-back is easy to mis-order at a call site (and a compiler-accepted swap would silently corrupt the footer). Consider a tiny FooterOverrides { timestamp: Option<SystemTime>, expiry: Option<SystemTime> } struct, or flip the order of the named fields at the call site to put a non-Option between them. Not worth blocking on, but worth a moment's thought since both fields end up in the same MDBShardFileFooter.

shard_creation_timestamp: current_timestamp(),
shard_creation_timestamp: timestamp
.map(|t| t.duration_since(std::time::UNIX_EPOCH).unwrap_or_default().as_secs())
.unwrap_or_else(current_timestamp),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the t.duration_since(UNIX_EPOCH).unwrap_or_default().as_secs() pattern now appears here and on the very next line for expiry. A small private helper like

fn system_time_to_unix_secs(t: SystemTime) -> u64 {
    t.duration_since(std::time::UNIX_EPOCH).unwrap_or_default().as_secs()
}

would make both sites read as timestamp.map(system_time_to_unix_secs).unwrap_or_else(current_timestamp) and expiry.map_or(0, system_time_to_unix_secs), which IMO is easier to scan. Up to you.

with_verification: bool,
timestamp: SystemTime,
) -> Result<usize> {
self.serialize_impl(writer, true, with_verification, Some(timestamp), None, |_| true)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions on the public shape here (not a nit — these are the user-facing names):

  1. serialize_with_timestamp takes timestamp: SystemTime (required) while serialize_impl threads it as Option<SystemTime>. Any reason not to collapse this into a single serialize(&self, writer, with_verification, timestamp: Option<SystemTime>) and drop the new method? It keeps the surface smaller and the None case still reads naturally at the call site.
  2. The doc comment says what this does but not why a caller would reach for it over serialize. A one-liner explaining the intended use case (e.g. "for deterministic shard hashing / GC" — I'm guessing based on the linked xetcas PR) would make this much more discoverable for future readers who hit this in autocomplete.

Also: no test coverage for this new entry point. The test_serialize_xorb_subset_with_expiry_footer test below is a great template — a similar test that passes a preset timestamp, re-reads the footer, and asserts shard_creation_timestamp == preset_secs would lock in the contract and cheap-insurance against anyone "helpfully" refactoring the .map(...).unwrap_or_else(current_timestamp) chain later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants