Skip to content

Spec: Document standard encryption key metadata #16527

Open
xanderbailey wants to merge 5 commits into
apache:mainfrom
xanderbailey:xb/spec-encryption-key-metadata
Open

Spec: Document standard encryption key metadata #16527
xanderbailey wants to merge 5 commits into
apache:mainfrom
xanderbailey:xb/spec-encryption-key-metadata

Conversation

@xanderbailey
Copy link
Copy Markdown

Before this PR

The key_metadata field in manifest entries and the encrypted-key-metadata field in table metadata are described as "implementation-specific" in the spec. While implementing table encryption in iceberg-rust, we found that the actual binary formats used by the Java implementation are required for cross-implementation interop but aren't documented anywhere - you have to reverse-engineer them from the Java source.

See mailing list thread: [DISCUSS] Specifying the encryption key metadata formats for cross-implementation interop

After this PR

The spec now documents:

  • The version-prefixed Avro binary format for per-file key_metadata (DEK, AAD prefix, file length)
  • The two-tier key hierarchy (KEKs wrapped by KMS vs manifest list keys encrypted by KEKs)

Another implementation can now correctly read/write encrypted tables without referencing the Java source.

@xanderbailey
Copy link
Copy Markdown
Author

xanderbailey commented May 22, 2026

cc: @ggershinsky in case you have any thoughts here

@github-actions github-actions Bot added the Specification Issues that may introduce spec changes. label May 22, 2026
@xanderbailey xanderbailey force-pushed the xb/spec-encryption-key-metadata branch from 57ab53b to d67101b Compare May 22, 2026 12:07
…ementation interop

Specify the binary wire format for the key_metadata field in manifest
entries and the key hierarchy in table metadata encryption-keys, which
were previously described only as "implementation-specific" but are
required for implementations to interoperate.
@xanderbailey xanderbailey force-pushed the xb/spec-encryption-key-metadata branch from d67101b to ae07984 Compare May 22, 2026 12:09
Copy link
Copy Markdown
Contributor

@szlta szlta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just a minor nits.

Comment thread format/spec.md Outdated

The standard encryption scheme uses a two-tier key hierarchy tracked in the table metadata `encryption-keys` list:

1. **Key Encryption Keys (KEKs):** Entries where `encrypted-by-id` equals the table's encryption key ID (configured via `encryption.key-id`). The `encrypted-key-metadata` contains the KEK wrapped by the KMS and is opaque to Iceberg — its format is determined by the KMS provider.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps mention here that a KEY_TIMESTAMP property is expected to be present for KEKs - AFAIK without it decryption flow will error out.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good shout. 5f38182

Comment thread format/spec.md Outdated
| Field name | Avro type | Required | Description |
|---|---|---|---|
| **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). |
| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) block authentication. |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) block authentication. |
| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) integrity protection. |

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggershinsky
Copy link
Copy Markdown
Contributor

I think this makes a lot of sense, thank you. I'll have a look at the details.

- Clarify that aad_prefix is for integrity protection, not authentication
- Document that KEY_TIMESTAMP property is required on KEK entries
Comment thread format/spec.md
| _optional_ | _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] |
| _optional_ | _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] |
| _optional_ | _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption |
| _optional_ | _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Per-file encryption key metadata. See [Standard Key Metadata](#standard-key-metadata) for the interoperable format used by the standard encryption scheme. |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is also a key_metadata field in the Manifest File struct (field id 519)

Comment thread format/spec.md Outdated
| Field name | Avro type | Required | Description |
|---|---|---|---|
| **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). |
| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) integrity protection. |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AAD prefix is used not only in AES GCM Stream files, but also in encrypted Parquet files (https://parquet.apache.org/docs/file-format/data-pages/encryption/ or https://github.com/apache/parquet-format/blob/master/Encryption.md)

Comment thread format/spec.md Outdated
| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) integrity protection. |
| **`file_length`** | `long` | _optional_ | The plaintext file length before encryption. Used to detect truncation attacks (see [AES GCM Stream file length](gcm-stream-spec.md#file-length)). |

The AAD prefix is combined with a 4-byte little-endian block index to form the AAD for each AES GCM Stream cipher block, as described in the [AES GCM Stream AAD section](gcm-stream-spec.md#additional-authenticated-data).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in Parquet encryption, this works differently, https://parquet.apache.org/docs/file-format/data-pages/encryption/

Comment thread format/spec.md Outdated
|---|---|---|---|
| **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). |
| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) integrity protection. |
| **`file_length`** | `long` | _optional_ | The plaintext file length before encryption. Used to detect truncation attacks (see [AES GCM Stream file length](gcm-stream-spec.md#file-length)). |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This keeps file length after encryption, https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java#L45

Only for AES GCM Stream files. Not set/used for encrypted Parquet data files.

@xanderbailey xanderbailey force-pushed the xb/spec-encryption-key-metadata branch from b58d594 to 7d60b4c Compare May 26, 2026 08:48
Clarify that file_length stores the encrypted (ciphertext) length, not
plaintext. Broaden aad_prefix and file_length descriptions to cover both
AES GCM Stream and Parquet modular encryption formats.
@xanderbailey xanderbailey force-pushed the xb/spec-encryption-key-metadata branch from 4690c89 to 4f12e28 Compare May 26, 2026 08:54
@xanderbailey
Copy link
Copy Markdown
Author

Have hopefully addressed comments! Thanks folks. What's the process for getting spec changes merged? Do people typically vote or is just a regular PR approval?

Comment thread format/spec.md Outdated
| Field name | Avro type | Required | Description |
|---|---|---|---|
| **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). |
| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for encryption integrity protection. For [AES GCM Stream](gcm-stream-spec.md) files, the prefix is combined with a block index to form the per-block AAD. For [Parquet modular encryption](https://parquet.apache.org/docs/file-format/data-pages/encryption/), the prefix is passed as the `aad_file_unique` component. |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Parquet, it also passed as an AAD Prefix parameter; it then combined with AAD Suffixes.
aad_file_unique is something else (a Parquet-internal parameter)

Comment thread format/spec.md Outdated

The usage of the `encryption_key` and `aad_prefix` fields depends on the file format:

* **AES GCM Stream files** (manifest lists, manifests, and non-Parquet data files): The `encryption_key` is used directly as the AES-GCM key. The `aad_prefix` is combined with a 4-byte little-endian block index to form the AAD for each cipher block, as described in the [AES GCM Stream AAD section](gcm-stream-spec.md#additional-authenticated-data). The `file_length` field stores the encrypted file length for truncation detection.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iceberg table encryption is not supported yet for ORC data format, so "manifest lists, manifests, and non-Parquet data files" should be something like "manifest lists, manifests, and Avro data files"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually delete vectors (Puffin files) are also AES GCM encrypted for encrypted tables. I think the list is long enough to worth showing it as a proper list in the spec rather than in just brackets?

@xanderbailey xanderbailey force-pushed the xb/spec-encryption-key-metadata branch from 4fce858 to bcaf862 Compare May 26, 2026 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants