fix(schema): permit opt-in timestamp-precision evolution#19028
Conversation
Adds a write config `hoodie.write.schema.allow.timestamp.precision.evolution` (default false) that, when true, lets the internal-schema reconcile path relabel a column between timestamp-millis and timestamp-micros (and between the local-timestamp variants), and attach a missing local-timestamp logical type on top of a bare long. Default false preserves the existing strict rejection so no caller sees a behavior change. The non-reconcile write path was already lenient via Avro reader/writer compatibility (both logical types share the same Avro long primitive). The internal-schema reconcile path, triggered when `hoodie.write.set.null.for.missing.columns=true`, instead rejected the relabel through `SchemaChangeUtils.isTypeUpdateAllow`. This closes the parity gap and enables forward-fixing tables that earlier versions persisted with a timestamp-micros label but timestamp-millis values, or that dropped the local-timestamp logical type entirely and stored the column as bare long. Threaded through SchemaChangeUtils -> TableChanges.ColumnUpdateChange -> AvroSchemaEvolutionUtils.reconcileSchema, with HoodieSchemaUtils, BaseHoodieWriteClient, HoodieMergeHelper, and FileGroupReaderBasedMergeHandle reading the config from the write properties. Tests: - TestAvroSchemaEvolutionUtils.testReconcileSchemaTimestampPrecisionEvolution covers default-strict reject and opt-in permit for all three shapes (timestamp precision swap, local-timestamp precision swap, long -> local-timestamp logical-type attach). - testCOWLogicalRepair / testMORLogicalRepair parameterize on both setNullForMissingColumns and allowTimestampPrecisionEvolution; positive variants exercise the gated repair path on v6/v8/CURRENT fixtures; a negative variant asserts SchemaCompatibilityException when the reconcile path is on with the gate closed.
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The PR adds an opt-in write config to permit timestamp-precision evolution through the internal-schema reconcile path, with reasonable test coverage on both positive variants and a negative variant for the closed gate. One small inconsistency worth taking a look at in the inline comments around the API delegate promise. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small readability nits below.
| * @return whether to allow the column type to be updated. | ||
| */ | ||
| public static boolean isTypeUpdateAllow(Type src, Type dst) { | ||
| /** |
There was a problem hiding this comment.
🤖 The PR description notes "Pre-existing overloads kept as delegates," which is followed for reconcileSchema and ColumnUpdateChange.get, but here the 2-arg isTypeUpdateAllow(Type, Type) is replaced rather than kept as a delegate. Could you preserve the original 2-arg signature as a thin delegate (return isTypeUpdateAllow(src, dst, false);)? This class is a public utility in internal.schema.utils and could plausibly be called by downstream tooling.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| .defaultValue(false) | ||
| .markAdvanced() | ||
| .sinceVersion("1.3.0") | ||
| .withDocumentation("Controls whether schema evolution may change a column between timestamp-millis and " |
There was a problem hiding this comment.
🤖 Should the docs make it explicit that this only relabels the logical type — it does not rewrite stored values? Since old base files keep their original encoding, enabling the gate when actual values don't match the new precision can lead to misinterpretation on read. A one-liner like "Stored values are not rewritten; only enable when the actual values already match the target precision" would steer users away from misuse.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| val setNullForMissingColumns = opts.getOrElse(HoodieCommonConfig.SET_NULL_FOR_MISSING_COLUMNS.key(), | ||
| HoodieCommonConfig.SET_NULL_FOR_MISSING_COLUMNS.defaultValue()).toBoolean | ||
| val mergedInternalSchema = AvroSchemaEvolutionUtils.reconcileSchema(canonicalizedSourceSchema.toAvroSchema(), internalSchema, setNullForMissingColumns) | ||
| val allowTimestampPrecisionEvolution = opts.getOrElse(HoodieCommonConfig.ALLOW_TIMESTAMP_PRECISION_EVOLUTION.key, |
There was a problem hiding this comment.
🤖 nit: allowTimestampPrecisionEvolution is already declared from opts at the outer scope a few dozen lines above — could you just reference that val here instead of re-reading the same key with the same default?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
|
Superseded by #19029 (same diff, renamed branch to drop the internal ticket prefix). Apologies for the noise. |
Change Logs
Background. Earlier Hudi versions mishandled long-backed timestamp logical types in
AvroInternalSchemaConverter:timestamp-millisandtimestamp-microsboth collapsed into a single internalTimestampTypeand were always re-emitted astimestampMicros()on serialize. A source schema declaringtimestamp-millisgot persisted in the table with the wrongtimestamp-microslogical type, while the underlyinglongvalues written to parquet remained epoch-millis. Pure logical-type drift.local-timestamp-millisandlocal-timestamp-microshad no branch at all. They fell through to the bareLongType, and the logical type was dropped from the table schema entirely. Logical-type loss.In both cases the parquet values are correct, only the logical type on the field is wrong. Current converters recognize all four logical types as distinct, so the writer schema now declares the correct logical type. On every subsequent write the reconcile path compares writer schema against the persisted table schema, finds the logical-type mismatch, and rejects it.
Why writes get blocked. When
hoodie.write.set.null.for.missing.columns=true,HoodieSchemaUtils.deduceWriterSchemacallsAvroSchemaEvolutionUtils.reconcileSchema, which goes throughTableChanges.ColumnUpdateChange.updateColumnTypeandSchemaChangeUtils.isTypeUpdateAllow. That switch had no case forTIMESTAMPorTIMESTAMP_MILLIS, so any precision/logical-type change fell intodefault: return falseand threwSchemaCompatibilityException. Streams that need null-fill on missing columns cannot write to any table previously persisted with long-backed timestamp columns mishandled this way.The non-reconcile path (
set.null=false) is unaffected: it skipsreconcileSchemaand letsAvroSchemaCompatibility.checkReaderWriterCompatibilityvalidate, which is logical-type-blind (both timestamps arelongunderneath) and accepts the correction. So the reconcile path was strictly stricter than the non-reconcile path for the same scenario; only theset.null=truepath was broken.Fix. New write config
hoodie.write.schema.allow.timestamp.precision.evolution(defaultfalse) that, whentrue, letsSchemaChangeUtils.isTypeUpdateAllowpermit:timestamp-millis ↔ timestamp-micros(logical-type drift case, both directions)local-timestamp-millis ↔ local-timestamp-micros(precision swap among the recognized variants)long → local-timestamp-millis/long → local-timestamp-micros(logical-type loss case, attach the missing logical type)Default
falsepreserves the existing strict rejection. Opt-in is per-write by setting the config totrue. The flag is threaded intoAvroSchemaEvolutionUtils.reconcileSchemaandTableChanges.ColumnUpdateChange, and read from the write properties byHoodieSchemaUtils.scala,BaseHoodieWriteClient,HoodieMergeHelper, andFileGroupReaderBasedMergeHandle.Why not unconditional. Default writes should not silently change column logical types; the gate keeps the opt-in explicit and per-write. Older readers that don't recognize the new logical-type values degrade gracefully:
AvroInternalSchemaConverterfalls through to the bare primitive for unrecognized logical types, so the corrected logical type is simply ignored on the reader side.Complements the read-side repair from #14161, which handles parquet files carrying the wrong logical type transparently until the table schema is corrected. This PR closes the write-side gap so the table schema itself can be brought into agreement with the writer schema once and for all.
Impact
AvroSchemaEvolutionUtils.reconcileSchema, one new factory onTableChanges.ColumnUpdateChange.get. Pre-existing overloads kept as delegates.HoodieSchemaUtils,BaseHoodieWriteClient,HoodieMergeHelper,FileGroupReaderBasedMergeHandle.Risk level (write none, low medium or high below)
low
Opt-in gate with default-false means existing writers are unchanged. Positive variants exercise the gated repair path on the v6/v8/CURRENT logical-repair fixtures from #14161. Negative variant asserts
SchemaCompatibilityExceptionwhen the reconcile path is on with the gate closed, locking in the default behavior.Documentation Update
New config documented inline on
HoodieCommonConfig.ALLOW_TIMESTAMP_PRECISION_EVOLUTIONwithsinceVersion("1.3.0").Contributor's checklist