Skip to content

fix(spec): don't panic on overwrite-truncate when previous totals exceed i32::MAX or are non-numeric#2511

Merged
blackmwk merged 1 commit into
apache:mainfrom
SreeramGarlapati:fix/snapshot-summary-overwrite-panic
May 28, 2026
Merged

fix(spec): don't panic on overwrite-truncate when previous totals exceed i32::MAX or are non-numeric#2511
blackmwk merged 1 commit into
apache:mainfrom
SreeramGarlapati:fix/snapshot-summary-overwrite-panic

Conversation

@SreeramGarlapati
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

What changes are included in this PR?

update_snapshot_summaries on the overwrite-with-truncate path was doing truncate_table_summary(...).map_err(...).unwrap(). The unwrap() turned a recoverable parse error into a process panic, and there are two ways to trip it from normal API usage.

The first is a sneaky one: get_prop parsed the previous snapshot's totals as i32, even though every other counter path in the same file already uses u64. So once a table accumulated more than ~2.15B data files / records / bytes (i.e. crossed i32::MAX), the next overwrite-with-truncate would parse-fail, hit the unwrap, and crash the writer. Quietly tableable too — nothing in the public API hints that i32 is the secret ceiling.

The second is corruption / cross-implementation interop: any non-numeric value in a previous total-* property (a foreign implementation, a manual edit, a half-corrupt metadata file) produces a parse error and the same panic.

The fix is small. Widen get_prop to u64 so it matches the rest of the file and lifts the artificial ceiling. Replace the .unwrap() with ?, keeping the existing \"Failed to truncate table summary.\" wrapper as caller context so the error chain still tells you what was being attempted when the inner parse blew up. get_prop's error message now includes the offending property name and value, which is what you actually need when triaging a malformed metadata file.

Scope-wise this only touches the overwrite-truncate path. update_totals has a structurally similar unwrap() antipattern that is a separate (lower-severity) issue and not included here.

Are these changes tested?

Yes — two unit tests in snapshot_summary.rs, both red-state proven against the unfixed code:

The first asserts that a previous summary with total-data-files = i32::MAX + 1 (and total-records similarly) survives the truncation and lands in deleted-data-files / deleted-records intact. Against the unfixed code this panics at snapshot_summary.rs:356:18 with \"number too large to fit in target type\" — that's the i32-ceiling bug demonstrated directly.

The second feeds \"not_a_number\" as a previous total and asserts the function returns an Err whose message carries the \"truncate table summary\" wrapper. Against the unfixed code this panics at the same site with \"invalid digit found in string\" — the malformed-input bug.

I ran each test against the unfixed code, observed the panic, applied the fix, and watched them go green. Existing test_truncate_table_summary and test_update_snapshot_summaries_append continue to pass, and cargo fmt --check / cargo clippy --all-features --tests -- -D warnings are clean.

@SreeramGarlapati
Copy link
Copy Markdown
Contributor Author

Hi @blackmwk — when you have a moment, appreciate a look. Short fix in spec/snapshot_summary.rs, scoped to the overwrite-truncate panic; context in #2510.

Copy link
Copy Markdown
Contributor

@blackmwk blackmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SreeramGarlapati for this pr, just one minor point!

Comment thread crates/iceberg/src/spec/snapshot_summary.rs Outdated
@SreeramGarlapati
Copy link
Copy Markdown
Contributor Author

@blackmwk - given the follow up here addresses the comment - could u complete merging the current PR - which is a bit critical?

blackmwk pushed a commit that referenced this pull request May 28, 2026
…2514)

## Which issue does this PR close?

Follow-up to review feedback on #2511
([thread](#2511 (comment))).

## What changes are included in this PR?

Removes all four `#[allow(dead_code)]` attributes in
`crates/iceberg/src/spec/snapshot_summary.rs`. They were added in #1085
when the module was first introduced, before any caller existed. The
chain is now production-live:

- `update_snapshot_summaries` is called from `transaction/snapshot.rs`
- `truncate_table_summary` is called from `update_snapshot_summaries`
- `get_prop` is called 6 times from `truncate_table_summary`
- `update_totals` is called 6 times from `update_snapshot_summaries`

None of these need the suppression any more — the dead_code lint is
silent without them.

Kept out of #2511 to keep that PR's scope focused on the
overwrite-truncate panic fix.

## Are these changes tested?

`cargo check -p iceberg` and `cargo clippy -p iceberg --all-features
--tests -- -D warnings` both stay clean after the removals — i.e. no
dead_code warning re-emerges.
…e, widen totals to u64

`update_snapshot_summaries` on the overwrite-with-truncate path called
`truncate_table_summary(...).map_err(...).unwrap()`, turning a recoverable
parse error into a process-level panic. Two ways to hit it from normal API
usage:

  1. `get_prop` parsed previous totals as `i32` (max ~2.15B). The rest of
     this file already parses these counters as `u64`, so `get_prop` was
     the odd one out and any table that legitimately accumulated more than
     `i32::MAX` data files / records / file-size bytes panicked the writer
     on the next overwrite-with-truncate.
  2. Any non-numeric previous-total value (corruption, manual edits, a
     foreign implementation) produced a parse `Err` and the same panic.

Switch `get_prop` to `u64` to match the rest of the file and give a more
specific error message including the offending property name and value.
Replace the `.unwrap()` on the truncation result with `?`, preserving the
existing "Failed to truncate table summary." wrapper as caller context.

Two unit tests cover the two failure modes, each pinned via the `expect`/
`expect_err` API so a future regression to the panic path fails the test
loudly rather than silently re-introducing the crash. Verified that both
tests panic at `snapshot_summary.rs:356:18` against the unfixed code
(red-state proof) and pass cleanly against the fix.

Closes apache#2510.

Co-authored-by: Aman Rawat <rawataaryan09@gmail.com>
@SreeramGarlapati SreeramGarlapati force-pushed the fix/snapshot-summary-overwrite-panic branch from 1632fca to aa82b91 Compare May 28, 2026 06:20
@blackmwk blackmwk merged commit 496f061 into apache:main May 28, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(spec): update_snapshot_summaries panics on overwrite when previous totals exceed i32::MAX or are non-numeric

2 participants