fix(transaction): detect duplicate file paths within a single FastAppend batch#2509
Open
SreeramGarlapati wants to merge 1 commit into
Open
fix(transaction): detect duplicate file paths within a single FastAppend batch#2509SreeramGarlapati wants to merge 1 commit into
SreeramGarlapati wants to merge 1 commit into
Conversation
…end batch
`SnapshotProducer::validate_duplicate_files` collected `added_data_files`
straight into a `HashSet<&str>` before checking against existing manifests.
That collect step silently dedupes the batch, so two `DataFile` entries
sharing the same `file_path` in one `add_data_files(...)` call were written
into the manifest unchecked and committed without error - producing a
snapshot whose `added_files_count` and read-side row count both double-count
the offending file.
Walk the added files explicitly: insert each path into the seen set and
track every distinct path that collides. If any collisions are observed,
return `ErrorKind::DataInvalid` with the sorted, deduped list of duplicated
paths. The existing cross-snapshot check continues to operate on the same
`new_files` set, so its behaviour is unchanged.
Adds two unit tests:
- rejection path, including the dedup-in-message guarantee when a path
appears three or more times;
- `with_check_duplicate(false)` opt-out still accepts batch duplicates,
matching the opt-out semantics already documented for the
cross-snapshot check.
Closes apache#2507.
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
What changes are included in this PR?
validate_duplicate_fileswas meant to refuse any append that would point at a file the table already references. It's been doing exactly half of that. The other half — refusing the same file appearing twice inside one batch — has been broken since the function was written, because the very first thing it does is collect the batch into aHashSet. The duplicates are gone before the check looks for them.Concretely: hand
FastAppendActiontwoDataFiles with identicalfile_paths, callcommit()with the defaultcheck_duplicate=true, and you getOk. The resulting manifest holds two entries pointing at the same Parquet file, the snapshot summary'sadded_files_countis wrong, and every scan double-counts the rows. Nothing in the commit path notices.The fix is small: walk the added files instead of dumping them into a set, track every distinct path that collides, and return
DataInvalidwith the sorted list of offenders if any did. The existing cross-snapshot check reuses the same set and is otherwise unchanged, andwith_check_duplicate(false)still disables both halves of the check together — the opt-out stays symmetric.This is a sibling of #1394, which fixed the cross-snapshot half of the same function. Same shape of bug, opposite side of the comparison.
Are these changes tested?
Yes — two unit tests in
crates/iceberg/src/transaction/append.rs.The first throws three identical paths into a single batch and asserts the commit fails with
DataInvalid, with the offending path appearing exactly once in the error message (so a future refactor can't quietly regress to repeatinga, a, a).The second asserts that
with_check_duplicate(false)still accepts batch duplicates — same opt-out behaviour the cross-snapshot check already documents.I also verified the repro by reverting just the
snapshot.rschange and rerunning the first test against the unfixed code — it fails exactly where the panic-on-Okbranch lives, which is the bug demonstrated as a failing unit test.