CQ shared store v2 file format #16194

The CQ shared message store has always been slow at compaction and deletion. Earlier work improved this, but the problem was never fully solved. Doing better requires a new file format: the existing format has no header or other marker that would rule out collisions (even if rare), so it cannot be changed in place. I came up with this plan:

## 1. Goals

The current message store ("v1") has three significant limitations that v2
addresses:

1. **Hole representation is implicit.** Removed messages leave gaps in segment
   files. To prevent stray bytes from being misinterpreted as messages,
   compaction zero-fills every gap. The cost of that zero-fill is proportional
   to the gap size and is the dominant I/O cost of compaction on large files.

2. **Records are not self-describing.** Every record is a message. There is no
   way to introduce new record kinds (e.g. explicit free-space markers,
   checksummed messages) without breaking compatibility with existing files.

3. **Index entries are kept in memory too long.** If index entries are removed
   eagerly, removed messages can no longer be recognised during a scan and we
   have to parse one byte at a time. If index entries are kept, we have to
   clean them up on segment deletion.

v2 addresses all three by introducing **typed records** with explicit, cheap
**HOLE markers** that the scanner skips in O(1) per hole regardless of hole
size, and by leaving a clear extension path (additional record types).

## 2. Strategy: format identity by extension + per-store boundary

### 2.1 Decided model

- **Format identity is encoded in the file extension.**
  - `.rdq` -> v1 segment.
  - `.sqs` -> v2 segment.
- **Each store carries a single boundary integer**, `last_v1_file`:
  - `File =< last_v1_file` -> v1.
  - `File >  last_v1_file` -> v2.
  - `last_v1_file = none` -> the store has no v1 files at all.
- **Writes always produce v2** from release N onward. v1 files are only ever
  read, never created.
- **Conversion happens by attrition.** v1 files disappear naturally as their
  messages are removed (compaction reduces them, then `delete_file_if_empty`
  unlinks them).
- **Forced migration** of any v1 tail happens in a later release N+X.

### 2.2 Phased rollout

| Release | Behaviour |
|---|---|
| **N**     | Ship v2 encoder/decoder/scanner, the `last_v1_file` boundary, and dual-format reads. New files are always v2. Existing v1 files keep being read in place. No bulk migration. |
| **N+X**   | On startup, if any `*.rdq` files remain, migrate each to a fresh v2 segment (next integer, swap index entries, delete the old `*.rdq`). Set `last_v1_file = none`. |
| **N+Y**   | Remove all v1 code. Refuse to start if `*.rdq` is present? Operator must boot through N+X first. |

The volume to migrate at N+X is whatever survived natural drain across X
releases, which is typically small for normal workloads.

## 3. File format

### 3.1 File header (fixed 64 bytes at offset 0)

Every v2 segment begins with a 64-byte header, modelled on the
`rabbit_classic_queue_store_v2` segment header.

```text
+---------+---------+----------------------------+
| MAGIC   | VERSION | reserved (zero-filled)     |
| 32 bit  |  8 bit  | 59 bytes                   |
+---------+---------+----------------------------+
total: 64 bytes
```

- `MAGIC`   = TBD 4-byte ASCII constant (e.g. `"RCQV"` / chosen at
  implementation time; must differ from `"RCQS"` and `"RCQI"`).
- `VERSION` = `2`.
- The remaining bytes are reserved for future use and zero-filled.

There is no per-file `FromSeqId`/`ToSeqId` (the shared store does not have a
segment-relative seq id concept). The header is **written but not validated
on the hot path**; readers unconditionally start at offset `64`. Dirty recovery
scanning and salvage tools may validate the header for forensic purposes.

The first record of a freshly created v2 segment lives at offset `64`. The
`current_file_offset` of a new v2 segment starts at `64` instead of `0`. All
absolute offsets stored in the index, all `pread` calls, and all gap
arithmetic in compaction continue to work unchanged because they are all
expressed in terms of absolute file offsets.
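
As an illustration of the layout above, here is a minimal sketch of building and (optionally) checking the header. The module name, the function names and the `"RCQV"` magic are placeholders: the proposal leaves the real constant to be chosen at implementation time.

```erlang
-module(sqs_header_sketch).
-export([encode/0, check/1]).

%% Placeholder magic; the real 4-byte constant is still TBD.
-define(MAGIC, "RCQV").
-define(VERSION, 2).
%% 59 reserved bytes, expressed in bits for the bit syntax.
-define(RESERVED_BITS, (59 * 8)).

%% Build the 64-byte header: 4-byte magic, 1-byte version, zero-filled rest.
encode() ->
    <<?MAGIC, ?VERSION:8, 0:?RESERVED_BITS>>.

%% Optional validation, intended for recovery or salvage tooling only.
%% The hot read path skips this and starts reading at offset 64.
check(<<?MAGIC, ?VERSION:8, _Reserved:59/binary, _Records/binary>>) ->
    ok;
check(_) ->
    {error, bad_header}.
```

A writer would emit `encode()` once when it creates a segment and then start appending records at offset `64`.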

### 3.2 Records

All multi-byte integers are big-endian. The first byte of every record is its
`Type`. The remaining fields are type-specific.

```text
Type = 0 : RESERVED (must not appear; surfaces zero-filled regions during scan)

Type = 1 : SMALL_HOLE
  +------+
  | 1:8  |    1 byte total, no length field
  +------+
  Used only when a contiguous gap is < 5 bytes.
  A gap of 1..4 bytes is encoded as 1..4 consecutive SMALL_HOLE bytes.

Type = 2 : HOLE
  +------+----------+--------------------+
  | 2:8  | Size:32  | inner:(Size-5)     |    Size = total record length
  +------+----------+--------------------+
  Min 5 bytes. Used for any gap >= 5 bytes.
  Inner bytes are NOT interpreted (left as-is, no zero-fill required).

Type = 3 : MESSAGE
  +------+----------+-----------+----------------+
  | 3:8  | Size:32  | MsgId:128 | Body:(Size-21) |    Size = total record length
  +------+----------+-----------+----------------+
  Min 21 bytes (empty body permitted).
  Body = term_to_binary(Msg) (same encoding as v1).

Type = 4 : MESSAGE_WITH_CHECKSUM (reserved for the future)
  e.g.
  +------+----------+-----------+--------+----------------+
  | 4:8  | Size:32  | MsgId:128 | CRC:32 | Body:(Size-25) |
  +------+----------+-----------+--------+----------------+

Type = 5..255 : RESERVED
```
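
The MESSAGE layout maps directly onto Erlang's bit syntax. The following is a sketch only: the module and function names are hypothetical, and the decoder returns the raw body so that the caller decides when to call `binary_to_term/1`.

```erlang
-module(sqs_message_sketch).
-export([encode/2, decode/1]).

-define(MESSAGE, 3).
%% 1 (Type) + 4 (Size) + 16 (MsgId) bytes of framing per message.
-define(OVERHEAD, 21).

%% Encode one MESSAGE record. Size is the total record length.
encode(MsgId, Msg) when is_binary(MsgId), byte_size(MsgId) =:= 16 ->
    Body = term_to_binary(Msg),
    Size = ?OVERHEAD + byte_size(Body),
    <<?MESSAGE:8, Size:32, MsgId:16/binary, Body/binary>>.

%% Decode one MESSAGE record from the front of a binary.
%% Returns the raw body and whatever bytes follow the record.
decode(<<?MESSAGE:8, Size:32, MsgId:16/binary, Tail/binary>>)
  when Size >= ?OVERHEAD, byte_size(Tail) >= Size - ?OVERHEAD ->
    BodySize = Size - ?OVERHEAD,
    <<Body:BodySize/binary, Rest/binary>> = Tail,
    {ok, MsgId, Body, Rest};
decode(_) ->
    {error, not_a_complete_message}.
```

Because `Size` covers the whole record, a reader that knows a record's offset can skip it without looking at the body.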

### 3.3 Per-message overhead vs v1

| Body | v1 record | v2 MESSAGE | delta |
|---|---|---|---|
| 0 B   | 25 B   | 21 B   | -4 |
| 64 B  | 89 B   | 85 B   | -4 |
| 1 KiB | 1049 B | 1045 B | -4 |
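| 4 MiB | 4194329 B | 4194325 B | -4 |

v2 is 4 bytes smaller per message. v1 spends 25 bytes on framing (8-byte size
field + 16-byte MsgId + 1-byte trailer), while v2 spends 21 (1-byte Type +
4-byte Size + 16-byte MsgId). The narrower size field (32 bit instead of
64 bit) saves 4 bytes, and the added Type byte is offset by the removed
trailer byte.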

### 3.4 No record trailer

v1 records end with a `255` byte. In v2 there is no trailer.

The trailer in v1 was a 1-in-256 false-positive guard for the salvage scanner
and a weak torn-write detector. v2 obtains equivalent robustness from
**structural validation** during scan:

- `Type` is a known value.
- `Size` is within `[min_for_type, FileSize - Offset]`.
- For MESSAGE, `MsgId` is checked against the "already seen" map (carried
  over from v1's scanner).

A torn last record at EOF still fails `Offset + Size =< FileSize` and is
discarded, just as in v1. Bit-flip detection is out of scope for both v1 and
v2; if needed it must be added via `MESSAGE_WITH_CHECKSUM`.
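
The structural checks above can be sketched as a scanner over a segment held in memory. This is an illustration under assumptions, not the proposed implementation: the module name and return shapes are invented here, and the sketch only reports the offset of the first invalid record, leaving the reaction (crash, discard, truncate) to the caller.

```erlang
-module(sqs_scan_sketch).
-export([scan/1]).

-define(HEADER_SIZE, 64).

%% Scan a whole segment. Returns {ok, [{MsgId, Offset, Size}]} or
%% {invalid, Offset, MessagesFoundSoFar} at the first record that fails
%% structural validation.
scan(<<_Header:?HEADER_SIZE/binary, Data/binary>>) ->
    scan(Data, ?HEADER_SIZE, #{}, []);
scan(_) ->
    {error, short_header}.

scan(<<>>, _Offset, _Seen, Acc) ->
    {ok, lists:reverse(Acc)};
%% SMALL_HOLE: a single byte, no length field.
scan(<<1:8, Rest/binary>>, Offset, Seen, Acc) ->
    scan(Rest, Offset + 1, Seen, Acc);
%% HOLE: skip Size bytes without looking at the inner bytes.
scan(<<2:8, Size:32, _/binary>> = Data, Offset, Seen, Acc)
  when Size >= 5, Size =< byte_size(Data) ->
    <<_:Size/binary, Rest/binary>> = Data,
    scan(Rest, Offset + Size, Seen, Acc);
%% MESSAGE: Size must fit in the file and MsgId must not have been seen.
scan(<<3:8, Size:32, MsgId:16/binary, _/binary>> = Data, Offset, Seen, Acc)
  when Size >= 21, Size =< byte_size(Data), not is_map_key(MsgId, Seen) ->
    <<_:Size/binary, Rest/binary>> = Data,
    scan(Rest, Offset + Size, Seen#{MsgId => true},
         [{MsgId, Offset, Size} | Acc]);
%% Unknown Type (including 0), bad Size, torn record or duplicate MsgId.
scan(_Invalid, Offset, _Seen, Acc) ->
    {invalid, Offset, lists:reverse(Acc)}.
```

Each hole costs one clause match and one binary split regardless of its size, which is the O(1) skip the format is designed around.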

### 3.5 Hole encoding rule (writers / compaction)

```text
gap_size == 0  -> nothing
gap_size <  5  -> gap_size x <<1>>          (1..4 SMALL_HOLE bytes)
gap_size >= 5  -> one HOLE record, Size = gap_size
```

Every non-zero gap is representable. Compaction never has to refuse moving a
message into a hole because of an awkward remainder.
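
A minimal sketch of the rule, assuming a hypothetical helper `hole_prefix/1` that returns the bytes a writer or compactor would place at the start of a gap:

```erlang
-module(sqs_hole_sketch).
-export([hole_prefix/1]).

%% Bytes to write at the start of a gap of GapSize bytes.
%% For a HOLE only the 5-byte prefix is written; the remaining
%% GapSize - 5 bytes are left untouched on disk.
hole_prefix(0) ->
    <<>>;
hole_prefix(GapSize) when GapSize > 0, GapSize < 5 ->
    %% 1..4 consecutive SMALL_HOLE bytes.
    binary:copy(<<1>>, GapSize);
hole_prefix(GapSize) when GapSize >= 5, GapSize =< 16#FFFFFFFF ->
    %% One HOLE record whose Size covers the whole gap.
    <<2:8, GapSize:32>>.
```

For example, `hole_prefix(3)` yields `<<1,1,1>>`, while `hole_prefix(1048576)` yields the five bytes `<<2,0,16,0,0>>`: marking a 1 MiB gap costs the same write as marking a 5-byte one.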

### 3.6 Unknown types and why `Type = 0` is reserved

A zero-filled or uninitialised region cannot accidentally look like a valid
record: `0` is never a valid `Type`, and an unrecognised first byte leads to
an immediate crash.

`Type = 0` is reserved forever to better detect potential zero-filled
binary issues in the code or VM.

## 4. State changes

### 4.1 New per-store field

```erlang
%% Added to #msstate, #client_msstate, #gc_state.
last_v1_file :: integer() | none
```

Single integer per store. Mirrored into the client and GC states so that
every code path that opens a segment knows which extension to use without an
ETS lookup.

### 4.2 Format dispatch

```erlang
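%% The `none` clause must come first: in Erlang's term order every number
%% sorts before every atom, so `File =< none` would otherwise be true.
file_format(_File, none)                       -> v2;
file_format(File,  LastV1) when File =< LastV1 -> v1;
file_format(_,     _)                          -> v2.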

filenum_to_name(File, LastV1) ->
    case file_format(File, LastV1) of
        v1 -> integer_to_list(File) ++ ".rdq";
        v2 -> integer_to_list(File) ++ ".sqs"
    end.
```

`filenum_to_name/2` is the only indirection added at every file-open site
(writer, reader, scanner, GC). It costs a single integer comparison.

### 4.3 Persistence

`last_v1_file` is persisted in `clean.dot` recovery terms alongside
`{client_refs, _}`:

```erlang
[{client_refs, ...},
 {last_v1_file, integer() | none}]
```

On dirty recovery (no `clean.dot`), `last_v1_file` is recomputed from disk:

```erlang
case filelib:wildcard("*.rdq", Dir) of
    []    -> none;
    Files -> lists:max([filename_to_num(F) || F <- Files])
end
```

### 4.4 Eager ETS index entry deletion

We can enable eager index entry deletion again: https://github.com/rabbitmq/rabbitmq-server/pull/16142

Deleting many segment files should be very fast because we don't need to scan.

### 4.5 Salvage tools

Salvage tools (if any) will need to be updated to handle both v1 and v2 file formats.

### 4.6 What is NOT added

- No new fields on `#file_summary`.
- No format flag on `#msg_location`.
- No on-disk file header validation on the read path.

cc @lukebakken @gomoripeti 
