fix(transaction): detect duplicate file paths within a single FastAppend batch#2509

Open
SreeramGarlapati wants to merge 1 commit into apache:main from SreeramGarlapati:main

@SreeramGarlapati

Which issue does this PR close?

Closes #2507.

What changes are included in this PR?

validate_duplicate_files was meant to refuse any append that would point at a file the table already references. It's been doing exactly half of that. The other half — refusing the same file appearing twice inside one batch — has been broken since the function was written, because the very first thing it does is collect the batch into a HashSet. The duplicates are gone before the check looks for them.
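
To make the failure mode concrete, here is a minimal standalone sketch of why collecting into a set hides in-batch duplicates. It uses only the standard library; `distinct_path_count` is a hypothetical helper for illustration, not code from the crate.

```rust
use std::collections::HashSet;

/// Counts the paths that survive a `collect` into a `HashSet`.
/// Hypothetical helper: it only illustrates the dedup-on-collect behaviour.
fn distinct_path_count(paths: &[&str]) -> usize {
    // Collecting keeps one copy of each path, so repeated entries
    // vanish here and no later check can observe them.
    let unique: HashSet<&str> = paths.iter().copied().collect();
    unique.len()
}
```

A batch of three entries in which two share a path collapses to two elements, so any check that runs on the set alone has no way to tell that the batch contained a repeat.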

Concretely: hand FastAppendAction two DataFiles with identical file_paths, call commit() with the default check_duplicate=true, and you get Ok. The resulting manifest holds two entries pointing at the same Parquet file, the snapshot summary's added_files_count counts that file twice, and every scan double-counts its rows. Nothing in the commit path notices.

The fix is small: walk the added files instead of dumping them into a set, track every distinct path that collides, and return DataInvalid with the sorted list of offenders if any did. The existing cross-snapshot check reuses the same set and is otherwise unchanged, and with_check_duplicate(false) still disables both halves of the check together — the opt-out stays symmetric.
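
The walk described above can be sketched roughly as follows. This is a simplified illustration over plain string slices, not the patch itself: `duplicated_paths` is a hypothetical name, and the real function works on `DataFile` entries and returns an `ErrorKind::DataInvalid` error rather than a list.

```rust
use std::collections::{BTreeSet, HashSet};

/// Returns every path that occurs more than once in `paths`,
/// each reported once, in sorted order.
/// Hypothetical helper; the name and signature are assumptions.
fn duplicated_paths<'a>(paths: &[&'a str]) -> Vec<&'a str> {
    let mut seen: HashSet<&'a str> = HashSet::new();
    // A BTreeSet keeps the offenders both distinct and sorted.
    let mut duplicates: BTreeSet<&'a str> = BTreeSet::new();
    for &path in paths {
        // `insert` returns false when the path was already present.
        if !seen.insert(path) {
            duplicates.insert(path);
        }
    }
    duplicates.into_iter().collect()
}
```

Because `insert` reports whether the value was new, the collision is caught at the moment it happens, and the `seen` set is still available afterwards for a cross-snapshot comparison.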

This is a sibling of #1394, which fixed the cross-snapshot half of the same function. Same shape of bug, opposite side of the comparison.

Are these changes tested?

Yes — two unit tests in crates/iceberg/src/transaction/append.rs.

The first throws three identical paths into a single batch and asserts the commit fails with DataInvalid, with the offending path appearing exactly once in the error message (so a future refactor can't quietly regress to repeating a, a, a).

The second asserts that with_check_duplicate(false) still accepts batch duplicates — same opt-out behaviour the cross-snapshot check already documents.
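
The behaviour the two tests pin down can be modelled in a few lines. The sketch below is an assumption-laden stand-in: `validate_batch`, its parameters, and the error strings are all hypothetical, and a plain `String` error takes the place of the crate's `DataInvalid` error.

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical model of the validation step, not the crate's real signature.
/// `existing` stands in for the paths the table already references.
fn validate_batch<'a>(
    added: &[&'a str],
    existing: &HashSet<&'a str>,
    check_duplicate: bool,
) -> Result<(), String> {
    // The opt-out disables both halves of the check together.
    if !check_duplicate {
        return Ok(());
    }
    let mut new_files: HashSet<&'a str> = HashSet::new();
    let mut in_batch: BTreeSet<&'a str> = BTreeSet::new();
    for &path in added {
        if !new_files.insert(path) {
            in_batch.insert(path);
        }
    }
    if !in_batch.is_empty() {
        // Each offending path is listed once, however often it repeated.
        let list: Vec<&str> = in_batch.into_iter().collect();
        return Err(format!("duplicate file paths in batch: {}", list.join(", ")));
    }
    // The cross-snapshot half reuses the same set of new paths.
    let mut clashes: Vec<&str> = new_files.intersection(existing).copied().collect();
    clashes.sort_unstable();
    if !clashes.is_empty() {
        return Err(format!("files already referenced: {}", clashes.join(", ")));
    }
    Ok(())
}
```

With this model, a batch of three identical paths yields an error that names the path exactly once, while the same batch passes when `check_duplicate` is `false`.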

I also verified the repro by reverting only the snapshot.rs change and rerunning the first test against the unfixed code: it fails at the branch that panics on Ok, so the test demonstrates the bug directly.

…end batch

`SnapshotProducer::validate_duplicate_files` collected `added_data_files`
straight into a `HashSet<&str>` before checking against existing manifests.
That collect step silently dedupes the batch, so two `DataFile` entries
sharing the same `file_path` in one `add_data_files(...)` call were written
into the manifest unchecked and committed without error - producing a
snapshot whose `added_files_count` and read-side row count both double-count
the offending file.

Walk the added files explicitly: insert each path into the seen set and
track every distinct path that collides. If any collisions are observed,
return `ErrorKind::DataInvalid` with the sorted, deduped list of duplicated
paths. The existing cross-snapshot check continues to operate on the same
`new_files` set, so its behaviour is unchanged.

Adds two unit tests:
  - rejection path, including the dedup-in-message guarantee when a path
    appears three or more times;
  - `with_check_duplicate(false)` opt-out still accepts batch duplicates,
    matching the opt-out semantics already documented for the
    cross-snapshot check.

Closes apache#2507.
@SreeramGarlapati
Author

Hi @blackmwk — appreciate a quick review on this when you have a moment. Short, well-scoped fix (one function plus two tests); context in the linked #2507.

Development

Successfully merging this pull request may close these issues.

bug(transaction): FastAppendAction silently accepts duplicate file paths within a single batch
