Spec: Document standard encryption key metadata #16527
Conversation
|
cc: @ggershinsky in case you have any thoughts here |
57ab53b to
d67101b
Compare
…ementation interop Specify the binary wire format for the key_metadata field in manifest entries and the key hierarchy in table metadata encryption-keys, which were previously described only as "implementation-specific" but are required for implementations to interoperate.
d67101b to
ae07984
Compare
szlta
left a comment
There was a problem hiding this comment.
Looks good to me, just a minor nits.
|
|
||
| The standard encryption scheme uses a two-tier key hierarchy tracked in the table metadata `encryption-keys` list: | ||
|
|
||
| 1. **Key Encryption Keys (KEKs):** Entries where `encrypted-by-id` equals the table's encryption key ID (configured via `encryption.key-id`). The `encrypted-key-metadata` contains the KEK wrapped by the KMS and is opaque to Iceberg — its format is determined by the KMS provider. |
There was a problem hiding this comment.
Perhaps mention here that a KEY_TIMESTAMP property is expected to be present for KEKs - AFAIK without it decryption flow will error out.
| | Field name | Avro type | Required | Description | | ||
| |---|---|---|---| | ||
| | **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). | | ||
| | **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) block authentication. | |
There was a problem hiding this comment.
| | **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) block authentication. | | |
| | **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) integrity protection. | |
|
I think this makes a lot of sense, thank you. I'll have a look at the details. |
- Clarify that aad_prefix is for integrity protection, not authentication - Document that KEY_TIMESTAMP property is required on KEK entries
| | _optional_ | _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] | | ||
| | _optional_ | _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] | | ||
| | _optional_ | _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | | ||
| | _optional_ | _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Per-file encryption key metadata. See [Standard Key Metadata](#standard-key-metadata) for the interoperable format used by the standard encryption scheme. | |
There was a problem hiding this comment.
there is also a key_metadata field in the Manifest File struct (field id 519)
| | Field name | Avro type | Required | Description | | ||
| |---|---|---|---| | ||
| | **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). | | ||
| | **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) integrity protection. | |
There was a problem hiding this comment.
AAD prefix is used not only in AES GCM Stream files, but also in encrypted Parquet files (https://parquet.apache.org/docs/file-format/data-pages/encryption/ or https://github.com/apache/parquet-format/blob/master/Encryption.md)
| | **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) integrity protection. | | ||
| | **`file_length`** | `long` | _optional_ | The plaintext file length before encryption. Used to detect truncation attacks (see [AES GCM Stream file length](gcm-stream-spec.md#file-length)). | | ||
|
|
||
| The AAD prefix is combined with a 4-byte little-endian block index to form the AAD for each AES GCM Stream cipher block, as described in the [AES GCM Stream AAD section](gcm-stream-spec.md#additional-authenticated-data). |
There was a problem hiding this comment.
in Parquet encryption, this works differently, https://parquet.apache.org/docs/file-format/data-pages/encryption/
| |---|---|---|---| | ||
| | **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). | | ||
| | **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) integrity protection. | | ||
| | **`file_length`** | `long` | _optional_ | The plaintext file length before encryption. Used to detect truncation attacks (see [AES GCM Stream file length](gcm-stream-spec.md#file-length)). | |
There was a problem hiding this comment.
This keeps file length after encryption, https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java#L45
Only for AES GCM Stream files. Not set/used for encrypted Parquet data files.
b58d594 to
7d60b4c
Compare
Clarify that file_length stores the encrypted (ciphertext) length, not plaintext. Broaden aad_prefix and file_length descriptions to cover both AES GCM Stream and Parquet modular encryption formats.
4690c89 to
4f12e28
Compare
|
Have hopefully addressed comments! Thanks folks. What's the process for getting spec changes merged? Do people typically vote or is just a regular PR approval? |
| | Field name | Avro type | Required | Description | | ||
| |---|---|---|---| | ||
| | **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). | | ||
| | **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for encryption integrity protection. For [AES GCM Stream](gcm-stream-spec.md) files, the prefix is combined with a block index to form the per-block AAD. For [Parquet modular encryption](https://parquet.apache.org/docs/file-format/data-pages/encryption/), the prefix is passed as the `aad_file_unique` component. | |
There was a problem hiding this comment.
For Parquet, it also passed as an AAD Prefix parameter; it then combined with AAD Suffixes.
aad_file_unique is something else (a Parquet-internal parameter)
|
|
||
| The usage of the `encryption_key` and `aad_prefix` fields depends on the file format: | ||
|
|
||
| * **AES GCM Stream files** (manifest lists, manifests, and non-Parquet data files): The `encryption_key` is used directly as the AES-GCM key. The `aad_prefix` is combined with a 4-byte little-endian block index to form the AAD for each cipher block, as described in the [AES GCM Stream AAD section](gcm-stream-spec.md#additional-authenticated-data). The `file_length` field stores the encrypted file length for truncation detection. |
There was a problem hiding this comment.
Iceberg table encryption is not supported yet for ORC data format, so "manifest lists, manifests, and non-Parquet data files" should be something like "manifest lists, manifests, and Avro data files"
There was a problem hiding this comment.
Actually delete vectors (Puffin files) are also AES GCM encrypted for encrypted tables. I think the list is long enough to worth showing it as a proper list in the spec rather than in just brackets?
4fce858 to
bcaf862
Compare
Before this PR
The
key_metadatafield in manifest entries and theencrypted-key-metadatafield in table metadata are described as "implementation-specific" in the spec. While implementing table encryption in iceberg-rust, we found that the actual binary formats used by the Java implementation are required for cross-implementation interop but aren't documented anywhere - you have to reverse-engineer them from the Java source.See mailing list thread: [DISCUSS] Specifying the encryption key metadata formats for cross-implementation interop
After this PR
The spec now documents:
key_metadata(DEK, AAD prefix, file length)Another implementation can now correctly read/write encrypted tables without referencing the Java source.