Spark: Add streaming-overwrite-mode option for handling OVERWRITE snapshots #15152

Open
sergiomartinswhg wants to merge 6 commits into apache:main from sergiomartinswhg:streaming-overwrite-mode

@sergiomartinswhg sergiomartinswhg commented Jan 27, 2026

Context

This PR addresses a long-standing feature request for handling OVERWRITE snapshots in Spark Structured Streaming.

Related issues and PRs:

Issue #2788 - Original feature request by @SreeramGarlapati
PR #2944 - Format version-aware approach by @tprelle
PR #7295 - Enum-based approach by @karim-ramadan

This implementation builds on the ideas from both previous PRs, adopting the enum-based design from #7295 while maintaining backward compatibility with the existing streaming-skip-overwrite-snapshots option.

Summary

This PR adds a new streaming-overwrite-mode option that provides more flexibility for handling OVERWRITE snapshots during Spark Structured Streaming reads. While users today typically use streaming-skip-overwrite-snapshots=true to skip these snapshots entirely, this PR introduces an added-files-only mode that allows processing the added files from OVERWRITE snapshots instead of skipping them.

Motivation

Tables frequently undergo operations that produce OVERWRITE snapshots:

  • INSERT OVERWRITE to specific partitions
  • MERGE INTO / UPDATE / DELETE operations

Today, users handle this by setting streaming-skip-overwrite-snapshots=true, which skips these snapshots entirely. However, this means any new data added during these operations is missed by the stream.

This PR gives users a third option: process only the added files from OVERWRITE snapshots, allowing streams to capture new data from these operations.

Changes

New option: streaming-overwrite-mode with three modes:

| Mode | Behavior |
|------|----------|
| fail | Throws exception on OVERWRITE snapshots (default) |
| skip | Ignores OVERWRITE snapshots entirely |
| added-files-only | Processes only added files from OVERWRITE snapshots |
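
To make the difference between the three modes concrete, here is a minimal, self-contained sketch. It is not code from this PR and does not use Iceberg classes: `OverwriteModeDemo`, `Snapshot`, and `plan` are hypothetical names, and a snapshot is reduced to an operation string plus the files it added.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model for illustration only; these are not Iceberg classes.
final class OverwriteModeDemo {
  enum Mode { FAIL, SKIP, ADDED_FILES_ONLY }

  // A snapshot reduced to its operation name and the data files it added.
  static final class Snapshot {
    final String operation;
    final List<String> addedFiles;

    Snapshot(String operation, String... addedFiles) {
      this.operation = operation;
      this.addedFiles = Arrays.asList(addedFiles);
    }
  }

  // Returns the files a streaming read would emit, in snapshot order.
  // Simplification: every non-overwrite snapshot is treated as an append.
  static List<String> plan(List<Snapshot> history, Mode mode) {
    List<String> emitted = new ArrayList<>();
    for (Snapshot snapshot : history) {
      boolean isOverwrite = "overwrite".equals(snapshot.operation);
      if (isOverwrite && mode == Mode.FAIL) {
        throw new IllegalStateException("Cannot process overwrite snapshot (mode: fail)");
      }
      if (isOverwrite && mode == Mode.SKIP) {
        continue; // the whole snapshot is ignored, including the files it added
      }
      emitted.addAll(snapshot.addedFiles); // append, or overwrite in added-files-only mode
    }
    return emitted;
  }

  public static void main(String[] args) {
    List<Snapshot> history = Arrays.asList(
        new Snapshot("append", "a.parquet"),
        new Snapshot("overwrite", "b.parquet"),
        new Snapshot("append", "c.parquet"));
    System.out.println("skip: " + plan(history, Mode.SKIP));
    System.out.println("added-files-only: " + plan(history, Mode.ADDED_FILES_ONLY));
  }
}
```

In this model, `skip` drops `b.parquet` while `added-files-only` emits it along with the appended files; `fail` stops at the first OVERWRITE snapshot.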

Backward compatibility:

  • streaming-skip-overwrite-snapshots=true maps to streaming-overwrite-mode=skip
  • New option takes precedence when both are specified
  • Deprecation warning logged when legacy option is used
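
The precedence rules above can be sketched as follows. This is an illustrative stand-in, not the PR's implementation: the class `OverwriteModeOptions`, the methods `fromName` and `resolve`, the case-insensitive parsing, and the use of `System.err` in place of a real logger are all assumptions. Only the two option names and the three mode values come from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of option resolution; not the code in this PR.
final class OverwriteModeOptions {
  static final String MODE_OPTION = "streaming-overwrite-mode";
  static final String LEGACY_OPTION = "streaming-skip-overwrite-snapshots";

  enum Mode {
    FAIL("fail"),
    SKIP("skip"),
    ADDED_FILES_ONLY("added-files-only");

    final String optionValue;

    Mode(String optionValue) {
      this.optionValue = optionValue;
    }

    // Assumption: parsing ignores case and surrounding whitespace.
    static Mode fromName(String name) {
      for (Mode mode : values()) {
        if (mode.optionValue.equalsIgnoreCase(name.trim())) {
          return mode;
        }
      }
      throw new IllegalArgumentException("Unknown " + MODE_OPTION + " value: " + name);
    }
  }

  // New option wins; otherwise legacy "true" maps to SKIP; otherwise the default FAIL.
  static Mode resolve(Map<String, String> options) {
    String explicit = options.get(MODE_OPTION);
    String legacy = options.get(LEGACY_OPTION);
    if (legacy != null) {
      // Stand-in for a logger call.
      System.err.println("WARN: " + LEGACY_OPTION + " is deprecated; use " + MODE_OPTION);
    }
    if (explicit != null) {
      return Mode.fromName(explicit);
    }
    return Boolean.parseBoolean(legacy) ? Mode.SKIP : Mode.FAIL;
  }

  public static void main(String[] args) {
    Map<String, String> options = new HashMap<>();
    options.put(LEGACY_OPTION, "true");
    options.put(MODE_OPTION, "added-files-only");
    System.out.println(resolve(options).optionValue);
  }
}
```

Resolving to a single enum value up front keeps the rest of the read path free of checks against two separate options.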

Usage

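// Java: stream the table, reading only the added files of OVERWRITE snapshots
Dataset<Row> stream = spark.readStream()
    .format("iceberg")
    .option("streaming-overwrite-mode", "added-files-only")
    .load("catalog.db.table");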

Warning for added-files-only mode

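This mode may produce duplicate records when an overwrite rewrites existing data (e.g., MERGE, UPDATE, DELETE): the rewritten files are reported as added files, so any rows they carry over unchanged are emitted again. Downstream processing must handle duplicates (e.g., idempotent writes, deduplication).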
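
As a rough illustration of the idempotent-write approach, the sketch below uses a hypothetical in-memory sink keyed by a primary key. `IdempotentSink` and the `id` key are assumptions made for the example; a real pipeline would typically rely on an upsert or MERGE into the target store instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical downstream sink: keyed upserts make re-delivered rows harmless.
final class IdempotentSink {
  private final Map<Long, String> rowsById = new LinkedHashMap<>();

  // Upsert by primary key: a replayed row replaces itself instead of being stored twice.
  void write(long id, String payload) {
    rowsById.put(id, payload);
  }

  int size() {
    return rowsById.size();
  }

  String get(long id) {
    return rowsById.get(id);
  }

  public static void main(String[] args) {
    IdempotentSink sink = new IdempotentSink();
    // First micro-batch: rows from an appended file.
    sink.write(1L, "alice");
    sink.write(2L, "bob");
    // Later micro-batch: a rewritten file re-delivers row 1 unchanged and row 2 updated.
    sink.write(1L, "alice");
    sink.write(2L, "bob-updated");
    System.out.println(sink.size() + " rows, id 2 = " + sink.get(2L));
  }
}
```

The same idea applies to deduplication inside the stream: as long as each record carries a stable key, replays from rewritten files collapse onto the existing row.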

Testing

  • Unit tests for StreamingOverwriteMode enum parsing
  • Integration tests for all three modes across Spark 3.5, 4.0, and 4.1
  • Tests verify backward compatibility with legacy option

@ndcuong869

ndcuong869 commented Feb 5, 2026

I’ve been waiting for this feature since Iceberg 1.6. Thank you.

@sergiomartinswhg
Author

Hi everyone! This PR has been open for a couple of weeks without reviews, so I'm reaching out to folks who have relevant expertise in this area. I'd greatly appreciate any feedback you can provide.

TL;DR: Adds streaming-overwrite-mode option with an added-files-only mode to process new data from OVERWRITE snapshots, while maintaining backward compatibility.

Tagging @huaxingao @wypoon @singhpk234 @aokolnychyi as you've been active reviewers of recent Spark streaming PRs and have experience with the streaming-skip-overwrite functionality and Spark configuration patterns.

This addresses the feature request from #2788 and builds on ideas from #7295. Happy to answer any questions!

Thanks! 🙏

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Mar 20, 2026
@sergiomartinswhg
Author

Tagging recent Spark PR reviewers: @huaxingao @RussellSpitzer @singhpk234 @szehon-ho
Could you please take a look at this one? Any feedback is welcome!

@github-actions github-actions Bot removed the stale label Mar 26, 2026
@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Apr 25, 2026
@sergiomartinswhg
Author

Tagging recent Spark PR reviewers: @huaxingao @singhpk234 @nastra @amogh-jahagirdar
Could you please take a look at this one? Any feedback is welcome!

@github-actions github-actions Bot removed the stale label Apr 29, 2026

4 participants