Extend commit metadata with deduplication / precombine statistics (e.g. numDuplicates, numPrecombined)

## Problem statement

Every Hudi write produces commit metadata that records per-file and per-partition write statistics: `numInserts`, `numUpdateWrites`, `numWrites`, `numDeletes`, and related counters. These stats are the primary source of truth that operators, pipelines, and reconciliation tooling use to answer the question: *"How many records did my write actually produce?"*

However, when **deduplication** (`hoodie.combine.before.insert`) or **precombine** (during upsert) is enabled, multiple input records that share the same record key are collapsed into a single output record before anything is written. The commit metadata reports only the **final written count** — it does not report how many input records were collapsed along the way, or *why* the count shrank.

This creates an **observability gap**: a discrepancy between input record count and written record count cannot be attributed to a cause.

### Concrete example

Suppose an input RDD/Dataset contains 5 records that all share the same record key:

```
key=A, ts=1
key=A, ts=2
key=A, ts=3
key=A, ts=4
key=A, ts=5
```

With dedup/precombine enabled, Hudi keeps one record (typically the one with the highest ordering value, here `ts=5`) and writes it. The commit metadata reports:

```
numInserts = 1
```

From this number alone, an operator **cannot tell the difference** between two very different scenarios:

1. **Expected behavior:** 4 records were legitimate duplicates, correctly collapsed by precombine. Data is fully intact. :white_check_mark:
2. **A bug / data loss:** records were silently dropped somewhere in the pipeline (a partitioning bug, a faulty merge, an index issue, etc.), and the "4 missing" records were *not* actually duplicates. :x:

Both scenarios look identical in commit metadata: `5 in -> 1 out`. There is no field that says "4 of these were dropped as duplicates."
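
The collapse in this example can be reproduced with a minimal sketch. It is not Hudi code: `precombine` is a hypothetical helper that keeps the record with the highest `ts` per key, which is an assumption about the configured merge behavior. It shows that the number of collapsed records is cheap to compute at the point where the collapse happens, yet today it never reaches the commit metadata.

```python
def precombine(records, key_field="key", ordering_field="ts"):
    """Hypothetical stand-in for dedup/precombine: keep the record with the
    highest ordering value per key.

    Returns (survivors, num_duplicates), where num_duplicates is the number
    of input records collapsed into another record with the same key.
    """
    latest = {}
    for rec in records:
        key = rec[key_field]
        current = latest.get(key)
        # On ties, the later record in the input wins (an arbitrary choice here).
        if current is None or rec[ordering_field] >= current[ordering_field]:
            latest[key] = rec
    survivors = list(latest.values())
    return survivors, len(records) - len(survivors)


# The five input records from the example above.
records = [{"key": "A", "ts": ts} for ts in range(1, 6)]
survivors, num_duplicates = precombine(records)

print(survivors)       # [{'key': 'A', 'ts': 5}]
print(num_duplicates)  # 4
```

Only `len(survivors)` is visible to an operator through `numInserts`; the `4` is the value the proposed counters would expose.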

### Why this matters

- **Data integrity / auditing:** Pipelines that reconcile source-vs-sink counts hit a dead end. A drop from 5 to 1 is unexplained, so it can neither be safely signed off as correct nor flagged as a real loss.
- **Debugging:** When a genuine data-loss bug occurs, there is no metadata signal distinguishing it from normal dedup behavior, making root-cause analysis much harder.
- **Trust:** Without dedup attribution, every count discrepancy requires manual, expensive investigation.

### Scope

This applies to **both** write paths:

- **Insert dedup** — duplicates dropped before insert when combine-before-insert is on.
- **Upsert precombine** — multiple incoming records for the same key combined down to one (and combined against the existing record on disk).
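
For reference, the two paths roughly correspond to writer configurations like the ones sketched below. The option keys are the names used by Hudi's Spark datasource as far as I know and should be checked against the Hudi version in use; the field names `key` and `ts` are taken from the example above and are otherwise assumptions.

```python
# Insert path: drop records that share a key within the incoming batch.
insert_dedup_options = {
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.recordkey.field": "key",    # assumed key column
    "hoodie.datasource.write.precombine.field": "ts",    # assumed ordering column
    "hoodie.combine.before.insert": "true",
}

# Upsert path: combine incoming records per key before merging with the table.
upsert_precombine_options = {
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "key",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.combine.before.upsert": "true",
}
```

With a Spark DataFrame these would typically be passed as `df.write.format("hudi").options(**insert_dedup_options)`, together with the table name and path options.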

## Proposed solution

Extend Hudi commit metadata (`HoodieWriteStat` and the aggregated commit-level stats) with additional counters that make dedup/precombine explicit, for example:

- `numDuplicates` / `numRecordsDeduplicated` — input records dropped because they shared a key with another input record.
- `numPrecombined` — records eliminated by the precombine step specifically.

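With these stats, the invariant becomes verifiable. Here `numInputRecords` is the number of records handed to the writer, and `numDuplicates` counts every input record collapsed by dedup or precombine:

```
numInputRecords == numInserts + numUpdateWrites + numDeletes + numDuplicates (+ totalWriteErrors)
```

The written term is `numInserts + numUpdateWrites` rather than `numWrites`, because for updates `numWrites` counts every record in the rewritten file, not only the records that came from the input.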

When this equation balances, a count drop is provably explained by deduplication. When it does **not** balance, the gap points at a real bug — turning a silent ambiguity into an actionable signal.
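
A reconciliation job could apply the check along these lines. This is a sketch under assumptions: `WriteCounts` and `unexplained_gap` are hypothetical names, the written term is represented as inserts plus updates, and `num_duplicates` stands for the proposed counter, which does not exist in commit metadata today.

```python
from dataclasses import dataclass


@dataclass
class WriteCounts:
    """Counters for one commit; all field names are illustrative."""
    num_input_records: int      # records handed to the writer
    num_inserts: int = 0
    num_update_writes: int = 0
    num_deletes: int = 0
    num_duplicates: int = 0     # proposed counter, hypothetical
    num_errors: int = 0


def unexplained_gap(c: WriteCounts) -> int:
    """Return the number of input records not accounted for by any counter.

    Zero means the invariant balances; a positive value is a count of
    records that disappeared without attribution.
    """
    accounted = (
        c.num_inserts
        + c.num_update_writes
        + c.num_deletes
        + c.num_duplicates
        + c.num_errors
    )
    return c.num_input_records - accounted


# Scenario 1: four duplicates were collapsed and reported as such.
explained = WriteCounts(num_input_records=5, num_inserts=1, num_duplicates=4)
# Scenario 2: four records vanished and no counter accounts for them.
suspicious = WriteCounts(num_input_records=5, num_inserts=1)

print(unexplained_gap(explained))   # 0
print(unexplained_gap(suspicious))  # 4
```

The two scenarios from the concrete example, indistinguishable today, produce different results once the duplicate count is recorded.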

