Problem statement
Every Hudi write produces commit metadata that records per-file and per-partition write statistics — numInserts, numUpdates, numWrites, numDeletes, and related counters. These stats are the primary source of truth that operators, pipelines, and reconciliation tooling use to answer the question: "How many records did my write actually produce?"
However, when deduplication (hoodie.combine.before.insert) or precombine (during upsert) is enabled, multiple input records that share the same record key are collapsed into a single output record before anything is written. The commit metadata reports only the final written count — it does not report how many input records were collapsed along the way, or why the count shrank.
This creates an observability gap: a discrepancy between input record count and written record count cannot be attributed to a cause.
Concrete example
Suppose an input RDD/Dataset contains 5 records that all share the same record key:
key=A, ts=1
key=A, ts=2
key=A, ts=3
key=A, ts=4
key=A, ts=5
With dedup/precombine enabled, Hudi keeps one record (say ts=5) and writes it. The commit metadata reports only a written count of 1.
From this number alone, an operator cannot tell the difference between two very different scenarios:
- Expected behavior: 4 records were legitimate duplicates, correctly collapsed by precombine. Data is fully intact. ✅
- A bug / data loss: records were silently dropped somewhere in the pipeline (a partitioning bug, a faulty merge, an index issue, etc.), and the "4 missing" records were not actually duplicates. ❌
Both scenarios look identical in commit metadata: 5 in -> 1 out. There is no field that says "4 of these were dropped as duplicates."
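The ambiguity can be reproduced with a small simulation. This is a minimal sketch, not Hudi code: `precombine` is a simplified stand-in that keeps the highest `ts` per key, and `lossy_write` is a hypothetical bug invented purely for illustration.

```python
def precombine(records):
    """Keep the record with the highest ts per key (simplified stand-in for precombine)."""
    latest = {}
    for rec in records:
        key = rec["key"]
        if key not in latest or rec["ts"] > latest[key]["ts"]:
            latest[key] = rec
    return list(latest.values())


def lossy_write(records):
    """Hypothetical bug: everything after the first record is silently dropped."""
    return records[:1]


# Scenario 1: five genuine duplicates of key A.
duplicates = [{"key": "A", "ts": ts} for ts in range(1, 6)]
# Scenario 2: five distinct keys, so nothing should be collapsed.
distinct = [{"key": k, "ts": 1} for k in "ABCDE"]

written_ok = precombine(duplicates)
written_bug = lossy_write(distinct)

# The only numbers an operator can see today: input count and written count.
visible_ok = {"input": len(duplicates), "written": len(written_ok)}
visible_bug = {"input": len(distinct), "written": len(written_bug)}
print(visible_ok, visible_bug)
```

Both dictionaries come out the same, even though the first write is correct and the second has lost four distinct records.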
Why this matters
- Data integrity / auditing: Pipelines that reconcile source-vs-sink counts hit a dead end. A drop from 5 to 1 is unexplained, so it can be neither safely signed off as correct nor flagged as a real loss.
- Debugging: When a genuine data-loss bug occurs, there is no metadata signal distinguishing it from normal dedup behavior, making root-cause analysis much harder.
- Trust: Without dedup attribution, every count discrepancy requires manual, expensive investigation.
Scope
This applies to both write paths:
- Insert dedup — duplicates dropped before insert when combine-before-insert is on.
- Upsert precombine — multiple incoming records for the same key combined down to one (and combined against the existing record on disk).
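For reference, the two paths correspond roughly to the writer options below. This is a sketch under assumptions: the table name and field names are placeholders, and option names and defaults should be checked against the configuration reference for the Hudi version in use.

```python
# Options shared by both paths. Table and field names are placeholders.
common_opts = {
    "hoodie.table.name": "example_table",
    "hoodie.datasource.write.recordkey.field": "key",
    "hoodie.datasource.write.precombine.field": "ts",
}

# Insert dedup: duplicates within the incoming batch are dropped before insert.
insert_opts = {
    **common_opts,
    "hoodie.datasource.write.operation": "insert",
    "hoodie.combine.before.insert": "true",
}

# Upsert precombine: incoming records with the same key are combined down to one.
upsert_opts = {
    **common_opts,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.combine.before.upsert": "true",
}
```

With Spark, a dictionary like this would typically be passed as `df.write.format("hudi").options(**insert_opts)`. In either case the resulting commit metadata carries no counter for the records that were collapsed.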
Proposed solution
Extend Hudi commit metadata (HoodieWriteStat and the aggregated commit-level stats) with additional counters that make dedup/precombine explicit, for example:
- numDuplicates / numRecordsDeduplicated: input records dropped because they shared a key with another input record.
- numPrecombined: records eliminated by the precombine step specifically.
With these stats, the invariant becomes verifiable:
numInputRecords == numWrites + numDeletes + numDuplicates (+ numErrors)
When this equation balances, a count drop is provably explained by deduplication. When it does not balance, the gap points at a real bug — turning a silent ambiguity into an actionable signal.
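A reconciliation job could check the invariant along these lines. This is a minimal sketch: `numInputRecords` and `numDuplicates` are the counter names proposed in this issue, not fields that exist in commit metadata today, and `reconcile` is a hypothetical helper.

```python
def reconcile(stats):
    """Return how many input records the proposed invariant leaves unaccounted for."""
    accounted = (
        stats["numWrites"]
        + stats["numDeletes"]
        + stats["numDuplicates"]      # proposed counter, does not exist yet
        + stats.get("numErrors", 0)   # optional term in the invariant
    )
    return stats["numInputRecords"] - accounted


# The 5-duplicate example: 1 written, 4 attributed to dedup.
explained = {"numInputRecords": 5, "numWrites": 1, "numDeletes": 0, "numDuplicates": 4}
# Same visible counts, but nothing was attributed to dedup.
unexplained = {"numInputRecords": 5, "numWrites": 1, "numDeletes": 0, "numDuplicates": 0}

print(reconcile(explained))    # 0
print(reconcile(unexplained))  # 4
```

A result of 0 means the drop is fully attributed to dedup. Any other value is the number of records that still need an explanation.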