Add a function to serialize MDBMinimalShard with a preset timestamp #796
seanses wants to merge 1 commit into
rajatarya left a comment
Nice small addition — the deterministic-timestamp entry point is a clean extension of the existing serialize_impl plumbing, and the default path is preserved exactly. A few thoughts that are mostly follow-ups rather than blockers (the two existing approvals look right to me):
- No test for the new serialize_with_timestamp. The existing test_serialize_xorb_subset_with_expiry_footer is a nice template for round-tripping the preset value through the footer. See inline comment on the new function.
- The SystemTime → u64 unix-seconds conversion now appears twice in serialize_impl. A small local helper would tighten this up. Inline comment on the new .map(...) block.
- One question about the public API shape: serialize and serialize_with_timestamp are now almost identical, the only difference being None vs Some(t). Would a single method taking Option<SystemTime> (or impl Into<Option<SystemTime>>) be preferable, or do you explicitly want the two entry points to read differently at the call site? Inline on the new function.
- Minor: serialize_impl now takes two consecutive Option<SystemTime> params (timestamp, expiry). Easy to swap at a call site. Inline comment on the signature with a tiny suggestion.
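To make the impl Into<Option<SystemTime>> option from the third bullet concrete, here is a minimal sketch. It is not the crate's code: creation_secs is a hypothetical stand-in for the timestamp handling in serialize_impl, and now_secs is passed in (instead of calling current_timestamp) only to keep the example deterministic.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical stand-in for the timestamp handling inside serialize_impl.
/// Accepts a bare SystemTime, Some(t), or None through one parameter and
/// returns the creation timestamp in unix seconds.
/// `now_secs` replaces current_timestamp() so the example is deterministic.
fn creation_secs(timestamp: impl Into<Option<SystemTime>>, now_secs: u64) -> u64 {
    timestamp
        .into()
        // Times before the epoch make duration_since fail; they collapse to 0.
        .map(|t| t.duration_since(UNIX_EPOCH).unwrap_or_default().as_secs())
        .unwrap_or(now_secs)
}
```

With this shape a caller can write creation_secs(t, now), creation_secs(Some(t), now), or creation_secs(None, now). The trade-off is a generic parameter on a public method, which some teams find harder to read in docs than a plain Option<SystemTime>.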
For context on what I skipped: @sirahd and @hoytak both approved with no written feedback, so none of the above has been raised yet.
    with_file_section: bool,
    with_verification: bool,
    timestamp: Option<SystemTime>,
    expiry: Option<SystemTime>,
nit: two Option<SystemTime> params back-to-back is easy to mis-order at a call site (and a compiler-accepted swap would silently corrupt the footer). Consider a tiny FooterOverrides { timestamp: Option<SystemTime>, expiry: Option<SystemTime> } struct, or flip the order of the named fields at the call site to put a non-Option between them. Not worth blocking on, but worth a moment's thought since both fields end up in the same MDBShardFileFooter.
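A rough sketch of the FooterOverrides idea, to show how named fields remove the ordering hazard. Everything except the two field names is an assumption for illustration: footer_fields is a hypothetical stand-in for the footer-filling part of serialize_impl, it returns the two values as plain unix seconds, and now_secs stands in for current_timestamp().

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical struct from the review suggestion: both optional footer
/// values are named, so a call site cannot silently swap them.
#[derive(Debug, Default, Clone, Copy)]
struct FooterOverrides {
    timestamp: Option<SystemTime>,
    expiry: Option<SystemTime>,
}

/// Stand-in for the footer-filling step; returns
/// (creation timestamp, expiry) in unix seconds.
/// `now_secs` replaces current_timestamp() to keep the example deterministic.
fn footer_fields(overrides: FooterOverrides, now_secs: u64) -> (u64, u64) {
    let to_secs = |t: SystemTime| t.duration_since(UNIX_EPOCH).unwrap_or_default().as_secs();
    (
        // No preset timestamp: fall back to "now".
        overrides.timestamp.map(to_secs).unwrap_or(now_secs),
        // No expiry: 0, matching the map_or(0, ...) form discussed in this review.
        overrides.expiry.map_or(0, to_secs),
    )
}
```

A call site then reads as FooterOverrides { timestamp: Some(t), ..Default::default() }, which stays unambiguous even if more optional footer values are added later.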
    - shard_creation_timestamp: current_timestamp(),
    + shard_creation_timestamp: timestamp
    +     .map(|t| t.duration_since(std::time::UNIX_EPOCH).unwrap_or_default().as_secs())
    +     .unwrap_or_else(current_timestamp),
nit: the t.duration_since(UNIX_EPOCH).unwrap_or_default().as_secs() pattern now appears here and on the very next line for expiry. A small private helper like

    fn system_time_to_unix_secs(t: SystemTime) -> u64 {
        t.duration_since(std::time::UNIX_EPOCH).unwrap_or_default().as_secs()
    }

would make both sites read as timestamp.map(system_time_to_unix_secs).unwrap_or_else(current_timestamp) and expiry.map_or(0, system_time_to_unix_secs), which IMO is easier to scan. Up to you.
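For reference, the helper and both call sites can be exercised in isolation. This is a sketch under assumptions: current_timestamp here is a fixed stand-in (the real one presumably reads the clock), and footer_secs is a hypothetical wrapper that only exists to show the two expressions side by side.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// The helper proposed above: whole seconds since the unix epoch,
/// with pre-epoch times collapsing to 0 via unwrap_or_default().
fn system_time_to_unix_secs(t: SystemTime) -> u64 {
    t.duration_since(UNIX_EPOCH).unwrap_or_default().as_secs()
}

/// Fixed stand-in for the crate's current_timestamp(), an assumption
/// made so the example is deterministic.
fn current_timestamp() -> u64 {
    1_000
}

/// Hypothetical wrapper showing both call sites after the refactor;
/// returns (creation timestamp, expiry) in unix seconds.
fn footer_secs(timestamp: Option<SystemTime>, expiry: Option<SystemTime>) -> (u64, u64) {
    (
        timestamp.map(system_time_to_unix_secs).unwrap_or_else(current_timestamp),
        expiry.map_or(0, system_time_to_unix_secs),
    )
}
```

One thing the helper makes easier to see: a pre-epoch SystemTime becomes 0, which is the same value the expiry site uses for None. That is probably fine for this use, but it is the kind of edge a single named helper lets you document once.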
        with_verification: bool,
        timestamp: SystemTime,
    ) -> Result<usize> {
        self.serialize_impl(writer, true, with_verification, Some(timestamp), None, |_| true)
Two questions on the public shape here (not a nit, since these are the user-facing names):
- serialize_with_timestamp takes timestamp: SystemTime (required) while serialize_impl threads it as Option<SystemTime>. Any reason not to collapse this into a single serialize(&self, writer, with_verification, timestamp: Option<SystemTime>) and drop the new method? It keeps the surface smaller and the None case still reads naturally at the call site.
- The doc comment says what this does but not why a caller would reach for it over serialize. A one-liner explaining the intended use case (e.g. "for deterministic shard hashing / GC"; I'm guessing based on the linked xetcas PR) would make this much more discoverable for future readers who hit this in autocomplete.
Also: no test coverage for this new entry point. The test_serialize_xorb_subset_with_expiry_footer test below is a great template: a similar test that passes a preset timestamp, re-reads the footer, and asserts shard_creation_timestamp == preset_secs would lock in the contract and be cheap insurance against anyone "helpfully" refactoring the .map(...).unwrap_or_else(current_timestamp) chain later.
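The shape of such a test, sketched with stand-in types since I have not checked the real footer-reading API. Footer, write_footer_with_timestamp, read_footer, and roundtrip are all hypothetical names; the actual test would call serialize_with_timestamp on an MDBMinimalShard and parse a real MDBShardFileFooter.

```rust
use std::io::{Cursor, Read, Write};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Stand-in footer with only the field under test (hypothetical;
/// the real MDBShardFileFooter carries more).
#[derive(Debug, PartialEq)]
struct Footer {
    shard_creation_timestamp: u64,
}

/// Stand-in for the new entry point: writes the preset timestamp
/// as little-endian unix seconds and returns the byte count.
fn write_footer_with_timestamp<W: Write>(writer: &mut W, timestamp: SystemTime) -> std::io::Result<usize> {
    let secs = timestamp.duration_since(UNIX_EPOCH).unwrap_or_default().as_secs();
    let bytes = secs.to_le_bytes();
    writer.write_all(&bytes)?;
    Ok(bytes.len())
}

/// Stand-in for re-reading the footer from serialized bytes.
fn read_footer<R: Read>(reader: &mut R) -> std::io::Result<Footer> {
    let mut buf = [0u8; 8];
    reader.read_exact(&mut buf)?;
    Ok(Footer { shard_creation_timestamp: u64::from_le_bytes(buf) })
}

/// Serialize with a preset timestamp, then parse the bytes back.
fn roundtrip(preset_secs: u64) -> std::io::Result<(Vec<u8>, Footer)> {
    let mut out = Vec::new();
    write_footer_with_timestamp(&mut out, UNIX_EPOCH + Duration::from_secs(preset_secs))?;
    let footer = read_footer(&mut Cursor::new(out.as_slice()))?;
    Ok((out, footer))
}
```

The two assertions worth having are that the re-read shard_creation_timestamp equals preset_secs, and that two serializations with the same preset produce identical bytes, since byte-for-byte determinism appears to be the point of the new method.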
To be used in https://github.com/huggingface-internal/xetcas/pull/1015
Note
Medium Risk
Touches shard serialization/footer metadata and changes serialize_impl call signatures, which could affect on-disk shard determinism and any callers relying on the previous timestamp behavior.
Overview
Adds support for deterministic shard serialization by allowing MDBMinimalShard to be serialized with a caller-provided creation timestamp.
serialize_impl now accepts an optional timestamp and uses it to populate MDBShardFileFooter.shard_creation_timestamp, while existing serialization paths continue to default to current_timestamp; a new serialize_with_timestamp API is added and existing serialize_* helpers are updated to pass the new parameter.
Reviewed by Cursor Bugbot for commit 687a9bc. Bugbot is set up for automated code reviews on this repo.