Extend commit metadata with deduplication / precombine statistics (e.g. numDuplicates, numPrecombined) #18976

@prashantwason

Problem statement

Every Hudi write produces commit metadata that records per-file and per-partition write statistics — numInserts, numUpdates, numWrites, numDeletes, and related counters. These stats are the primary source of truth that operators, pipelines, and reconciliation tooling use to answer the question: "How many records did my write actually produce?"

However, when deduplication (hoodie.combine.before.insert) or precombine (during upsert) is enabled, multiple input records that share the same record key are collapsed into a single output record before anything is written. The commit metadata reports only the final written count. It does not report how many input records were collapsed along the way, or why the count shrank.

This creates an observability gap: a discrepancy between input record count and written record count cannot be attributed to a cause.

Concrete example

Suppose an input RDD/Dataset contains 5 records that all share the same record key:

key=A, ts=1
key=A, ts=2
key=A, ts=3
key=A, ts=4
key=A, ts=5

With dedup/precombine enabled, Hudi keeps one record (say ts=5) and writes it. The commit metadata reports:

numInserts = 1

From this number alone, an operator cannot tell the difference between two very different scenarios:

  1. Expected behavior: 4 records were legitimate duplicates, correctly collapsed by precombine. Data is fully intact. ✅
  2. A bug / data loss: records were silently dropped somewhere in the pipeline (a partitioning bug, a faulty merge, an index issue, etc.), and the "4 missing" records were not actually duplicates. ❌

Both scenarios look identical in commit metadata: 5 in -> 1 out. There is no field that says "4 of these were dropped as duplicates."
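
To make the example concrete, here is a minimal sketch in plain Python of a "keep the latest ts per key" precombine that also counts what it collapsed. It is not Hudi's implementation: the `Record` type, the `precombine` function, and the `num_duplicates` counter are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Hypothetical record shape for illustration: a record key plus an ordering field.
Record = namedtuple("Record", ["key", "ts"])

def precombine(records):
    """Keep the record with the largest ts per key and count the collapsed ones."""
    latest = {}
    for rec in records:
        kept = latest.get(rec.key)
        if kept is None or rec.ts > kept.ts:
            latest[rec.key] = rec
    survivors = list(latest.values())
    # Every input record that did not survive was collapsed as a duplicate.
    num_duplicates = len(records) - len(survivors)
    return survivors, num_duplicates

records = [Record("A", ts) for ts in range(1, 6)]
survivors, num_duplicates = precombine(records)
print(len(survivors), num_duplicates)  # prints "1 4"
```

Today only the first number (1) is visible in commit metadata. The second number (4) is computed implicitly during the write and then discarded.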

Why this matters

  • Data integrity / auditing: Pipelines that reconcile source-vs-sink counts hit a dead end. A drop from 5 to 1 is unexplained, so it can neither be signed off as correct nor flagged as a real loss.
  • Debugging: When a genuine data-loss bug occurs, there is no metadata signal distinguishing it from normal dedup behavior, making root-cause analysis much harder.
  • Trust: Without dedup attribution, every count discrepancy requires manual, expensive investigation.

Scope

This applies to both write paths:

  • Insert dedup — duplicates dropped before insert when combine-before-insert is on.
  • Upsert precombine — multiple incoming records for the same key combined down to one (and combined against the existing record on disk).

Proposed solution

Extend Hudi commit metadata (HoodieWriteStat and the aggregated commit-level stats) with additional counters that make dedup/precombine explicit, for example:

  • numDuplicates / numRecordsDeduplicated — input records dropped because they shared a key with another input record.
  • numPrecombined — records eliminated by the precombine step specifically.

With these stats, the invariant becomes verifiable:

numInputRecords == numWrites + numDeletes + numDuplicates (+ numErrors)

When this equation balances, a count drop is provably explained by deduplication. When it does not balance, the gap points at a real bug — turning a silent ambiguity into an actionable signal.
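
As a sketch of how reconciliation tooling could use the proposed counters, the check below computes the number of input records that the invariant does not account for. It assumes the counters have already been read out of commit metadata as plain integers. The function name `unexplained_records` and its parameter names are hypothetical, and `num_duplicates` stands in for whichever counter name is finally adopted.

```python
def unexplained_records(num_input, num_writes, num_deletes, num_duplicates, num_errors=0):
    """Return how many input records the invariant fails to account for.

    0 means the count drop is fully explained; any other value is a gap
    worth investigating.
    """
    accounted = num_writes + num_deletes + num_duplicates + num_errors
    return num_input - accounted

# Scenario 1: 4 legitimate duplicates were collapsed, so the equation balances.
print(unexplained_records(num_input=5, num_writes=1, num_deletes=0, num_duplicates=4))  # prints 0

# Scenario 2: nothing was reported as a duplicate, so 4 records are unaccounted for.
print(unexplained_records(num_input=5, num_writes=1, num_deletes=0, num_duplicates=0))  # prints 4
```

With the current metadata, both scenarios report the same numbers, so this check cannot be written at all.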
