Spark: Add streaming-overwrite-mode option for handling OVERWRITE snapshots #15152
sergiomartinswhg wants to merge 6 commits into
I’ve been waiting for this feature since Iceberg 1.6. Thank you.
Hi everyone! This PR has been open for a couple of weeks without reviews, so I'm reaching out to folks who have relevant expertise in this area. I'd greatly appreciate any feedback you can provide.
TL;DR: Adds
Tagging @huaxingao @wypoon @singhpk234 @aokolnychyi as you've been active reviewers of recent Spark streaming PRs and have experience with the
This addresses the feature request from #2788 and builds on ideas from #7295.
Happy to answer any questions! Thanks! 🙏
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
Tagging recent Spark PR reviewers: @huaxingao @RussellSpitzer @singhpk234 @szehon-ho
Tagging recent Spark PR reviewers: @huaxingao @singhpk234 @nastra @amogh-jahagirdar
Context
This PR addresses a long-standing feature request for handling OVERWRITE snapshots in Spark Structured Streaming.
Related issues and PRs:
Issue #2788 - Original feature request by @SreeramGarlapati
PR #2944 - Format version-aware approach by @tprelle
PR #7295 - Enum-based approach by @karim-ramadan
This implementation builds on the ideas from both previous PRs, adopting the enum-based design from #7295 while maintaining backward compatibility with the existing streaming-skip-overwrite-snapshots option.
Summary
This PR adds a new streaming-overwrite-mode option that provides more flexibility for handling OVERWRITE snapshots during Spark Structured Streaming reads. While users today typically use streaming-skip-overwrite-snapshots=true to skip these snapshots entirely, this PR introduces an added-files-only mode that allows processing the added files from OVERWRITE snapshots instead of skipping them.
Motivation
Tables frequently undergo operations that produce OVERWRITE snapshots:
INSERT OVERWRITE to specific partitions
MERGE INTO / UPDATE / DELETE operations
Today, users handle this by setting streaming-skip-overwrite-snapshots=true, which skips these snapshots entirely. However, this means any new data added during these operations is missed by the stream.
This PR gives users a third option: process only the added files from OVERWRITE snapshots, allowing streams to capture new data from these operations.
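To make the difference between the options concrete, here is a toy Python model of how a streaming read could treat a snapshot history under each mode. It is only a sketch: the `Snapshot` class and `files_to_stream` function are hypothetical illustrations, not Iceberg classes or the code in this PR, and the mode names are taken from the Changes section of this description.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Snapshot:
    """Toy stand-in for a table snapshot (hypothetical, not Iceberg's Snapshot)."""
    operation: str  # "append" or "overwrite"
    added_files: List[str] = field(default_factory=list)
    removed_files: List[str] = field(default_factory=list)


def files_to_stream(snapshots: Iterable[Snapshot], overwrite_mode: str) -> Iterator[str]:
    """Yield the data files a streaming read would emit under the given mode."""
    for snap in snapshots:
        if snap.operation == "overwrite":
            if overwrite_mode == "fail":
                raise ValueError("cannot process overwrite snapshot")
            if overwrite_mode == "skip":
                continue  # any new data in this snapshot is never seen
            if overwrite_mode != "added-files-only":
                raise ValueError(f"unknown overwrite mode: {overwrite_mode}")
            # added-files-only: fall through and emit the added files;
            # removed files are ignored, which is where duplicates can come from
        yield from snap.added_files


history = [
    Snapshot("append", ["a.parquet"]),
    Snapshot("overwrite", added_files=["b.parquet"], removed_files=["a.parquet"]),
    Snapshot("append", ["c.parquet"]),
]
print(list(files_to_stream(history, "skip")))
print(list(files_to_stream(history, "added-files-only")))
```

In this model, `skip` never emits `b.parquet`, while `added-files-only` emits it alongside the files from the append snapshots.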
Changes
New option: streaming-overwrite-mode with three modes:
fail
skip
added-files-only
Backward compatibility: streaming-skip-overwrite-snapshots=true maps to streaming-overwrite-mode=skip
Usage
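A minimal PySpark sketch of how the option might be set on a streaming read, assuming it is passed as a read option in the same way as the existing streaming-skip-overwrite-snapshots option. The helper name `read_iceberg_stream` and the table name `db.events` are hypothetical and used for illustration only.

```python
VALID_OVERWRITE_MODES = ("fail", "skip", "added-files-only")


def read_iceberg_stream(spark, table: str, overwrite_mode: str):
    """Open a streaming read on an Iceberg table with the given overwrite mode.

    `spark` is expected to be a pyspark.sql.SparkSession with the Iceberg
    runtime on its classpath. The helper itself is a hypothetical wrapper.
    """
    if overwrite_mode not in VALID_OVERWRITE_MODES:
        raise ValueError(f"unknown overwrite mode: {overwrite_mode}")
    return (
        spark.readStream
        .format("iceberg")
        # Option name as proposed in this PR.
        .option("streaming-overwrite-mode", overwrite_mode)
        .load(table)
    )
```

With a live session this would be called as `df = read_iceberg_stream(spark, "db.events", "added-files-only")`. Validating the mode on the client side is a convenience of this sketch, not something the PR is stated to require.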
Warning for added-files-only mode
This mode may produce duplicate records when overwrites rewrite existing data (e.g., MERGE, UPDATE, DELETE). Downstream processing must handle duplicates (e.g., idempotent writes, deduplication).
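As a rough illustration of what "idempotent writes" can mean downstream, the sketch below upserts each micro-batch into a keyed sink so that replayed or rewritten rows overwrite their earlier copies instead of accumulating. It assumes every record carries a unique key (here a hypothetical `id` field); the `upsert_batch` helper and the in-memory dict sink are illustrative stand-ins, not part of this PR.

```python
from typing import Dict, Hashable, Iterable, Mapping


def upsert_batch(sink: Dict[Hashable, Mapping], rows: Iterable[Mapping], key: str) -> None:
    """Idempotent write: the last row seen for each key wins."""
    for row in rows:
        sink[row[key]] = row


sink: Dict[Hashable, Mapping] = {}

# First micro-batch: rows from an ordinary append.
upsert_batch(sink, [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}], key="id")

# Second micro-batch: a MERGE rewrote the file, so id=2 is emitted again
# (unchanged) together with the genuinely new id=3.
upsert_batch(sink, [{"id": 2, "v": "b"}, {"id": 3, "v": "c"}], key="id")

print(sorted(sink))
```

Within Spark itself, `DataFrame.dropDuplicates` on the key columns is one built-in way to deduplicate; whether that or a keyed upsert fits better depends on the sink.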
Testing