CQ shared store v2 file format #16194

@lhoguin

The CQ shared message store has always been slow at compaction and deletion. Earlier work improved this but did not fully fix it. Doing better requires a new file format: the existing one has no header or other marker that would rule out collisions (however rare) between old and new record layouts, so it cannot be changed in place. I came up with this plan:

1. Goals

The current message store ("v1") has significant limitations addressed by
v2:

  1. Hole representation is implicit. Removed messages leave gaps in segment
    files. To prevent stray bytes from being misinterpreted as messages,
    compaction zero-fills every gap. The cost of that zero-fill is proportional
    to the gap size and is the dominant I/O cost of compaction on large files.

  2. Records are not self-describing. Every record is a message. There is no
    way to introduce new record kinds (e.g. explicit free-space markers,
    checksummed messages) without breaking compatibility with existing files.

  3. Index entries are kept in memory for too long. Removing index entries
    eagerly means removed messages can no longer be recognised through the
    index, so we have to parse one byte at a time. Keeping index entries means
    we have to clean up on segment deletion.

v2 addresses all three by introducing typed records with explicit, cheap
HOLE markers that the scanner skips in O(1) per hole regardless of hole
size, and by leaving a clear extension path (additional record types).

2. Strategy: format identity by extension + per-store boundary

2.1 Decided model

  • Format identity is encoded in the file extension.
    • .rdq -> v1 segment.
    • .sqs -> v2 segment.
  • Each store carries a single boundary integer, last_v1_file:
    • File =< last_v1_file -> v1.
    • File > last_v1_file -> v2.
    • last_v1_file = none -> the store has no v1 files at all.
  • Writes always produce v2 from release N onward. v1 files are only ever
    read, never created.
  • Conversion happens by attrition. v1 files disappear naturally as their
    messages are removed (compaction reduces them, then delete_file_if_empty
    unlinks them).
  • Forced migration of any v1 tail happens in a later release N+X.

2.2 Phased rollout

Release  Behaviour
N        Ship v2 encoder/decoder/scanner, the last_v1_file boundary, and
         dual-format reads. New files are always v2. Existing v1 files keep
         being read in place. No bulk migration.
N+X      On startup, if any *.rdq files remain, migrate each to a fresh v2
         segment (next integer, swap index entries, delete the old *.rdq).
         Set last_v1_file = none.
N+Y      Remove all v1 code. Refuse to start if *.rdq is present? Operator
         must boot through N+X first.

The volume to migrate at N+X is whatever survived natural drain across X
releases — typically small for normal workloads.

3. File format

3.1 File header (fixed 64 bytes at offset 0)

Every v2 segment begins with a 64-byte header, modelled on the
rabbit_classic_queue_store_v2 segment header.

+---------+---------+----------------------------+
| MAGIC   | VERSION | reserved (zero-filled)     |
| 32 bit  |  8 bit  | 59 bytes                   |
+---------+---------+----------------------------+
total: 64 bytes
  • MAGIC = TBD 4-byte ASCII constant (e.g. "RCQV" / chosen at
    implementation time; must differ from "RCQS" and "RCQI").
  • VERSION = 2.
  • The remaining bytes are reserved for future use and zero-filled.
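
The header layout above can be sketched in Erlang bit syntax. This is a minimal illustration, not the implementation: the module name is hypothetical, and "RCQV" is only the placeholder example given above since MAGIC is still TBD.

```erlang
-module(v2_header_sketch).
-export([encode/0, check/1]).

%% Placeholder only: the proposal leaves MAGIC undecided ("RCQV" is its example).
-define(MAGIC, "RCQV").
-define(VERSION, 2).

%% Build the 64-byte header: 4-byte magic, 1-byte version, 59 zero bytes.
encode() ->
    <<?MAGIC, ?VERSION:8, 0:59/unit:8>>.

%% Validation as a dirty-recovery or salvage tool might do it.
%% The reserved bytes are deliberately not inspected.
check(<<?MAGIC, ?VERSION:8, _Reserved:59/binary, _Records/binary>>) ->
    ok;
check(_Other) ->
    {error, bad_header}.
```

`check/1` accepts a whole segment binary as well as a bare header, since anything after byte 64 is ignored.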

There is no per-file FromSeqId/ToSeqId (the shared store does not have a
segment-relative seq id concept). The header is written but not validated
on the hot path; readers unconditionally start at offset 64. Dirty recovery
scanning and salvage tools may validate the header for forensic purposes.

The first record of a freshly created v2 segment lives at offset 64. The
current_file_offset of a new v2 segment starts at 64 instead of 0. All
absolute offsets stored in the index, all pread calls, and all gap
arithmetic in compaction continue to work unchanged because they are all
expressed in terms of absolute file offsets.

3.2 Records

All multi-byte integers are big-endian. The first byte of every record is its
Type. The remaining fields are type-specific.

Type = 0 : RESERVED (must not appear; surfaces zero-filled regions during scan)

Type = 1 : SMALL_HOLE
  +------+
  | 1:8  |    1 byte total, no length field
  +------+
  Used only when a contiguous gap is < 5 bytes.
  A gap of 1..4 bytes is encoded as 1..4 consecutive SMALL_HOLE bytes.

Type = 2 : HOLE
  +------+----------+--------------------+
  | 2:8  | Size:32  | inner:(Size-5)     |    Size = total record length
  +------+----------+--------------------+
  Min 5 bytes. Used for any gap >= 5 bytes.
  Inner bytes are NOT interpreted (left as-is, no zero-fill required).

Type = 3 : MESSAGE
  +------+----------+-----------+----------------+
  | 3:8  | Size:32  | MsgId:128 | Body:(Size-21) |    Size = total record length
  +------+----------+-----------+----------------+
  Min 21 bytes (empty body permitted).
  Body = term_to_binary(Msg) (same encoding as v1).

Type = 4 : MESSAGE_WITH_CHECKSUM (reserved for the future)
  e.g.
  +------+----------+-----------+--------+----------------+
  | 4:8  | Size:32  | MsgId:128 | CRC:32 | Body:(Size-25) |
  +------+----------+-----------+--------+----------------+

Type = 5..255 : RESERVED
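
As an illustration of the record layouts above, here is a hedged sketch of a MESSAGE encoder and a decoder for the three active record types. The module, function names and return shapes are hypothetical; only the byte layouts come from this section.

```erlang
-module(v2_record_sketch).
-export([encode_message/2, decode/1]).

%% MESSAGE: Type = 3, Size = total record length, 16-byte MsgId, then the body.
%% Integers are big-endian, which is the bit syntax default.
encode_message(MsgId, Msg) when byte_size(MsgId) =:= 16 ->
    Body = term_to_binary(Msg),
    Size = 21 + byte_size(Body),
    <<3:8, Size:32, MsgId:16/binary, Body/binary>>.

%% Decode the record at the head of Bin and return it with the remaining bytes.
decode(<<1:8, Rest/binary>>) ->
    {ok, small_hole, Rest};
decode(<<2:8, Size:32, Tail/binary>>)
  when Size >= 5, byte_size(Tail) >= Size - 5 ->
    InnerLen = Size - 5,
    %% The inner bytes of a HOLE are skipped, never interpreted.
    <<_Inner:InnerLen/binary, Rest/binary>> = Tail,
    {ok, {hole, Size}, Rest};
decode(<<3:8, Size:32, MsgId:16/binary, Tail/binary>>)
  when Size >= 21, byte_size(Tail) >= Size - 21 ->
    BodyLen = Size - 21,
    <<Body:BodyLen/binary, Rest/binary>> = Tail,
    {ok, {message, MsgId, Body}, Rest};
decode(_Other) ->
    {error, invalid_record}.
```

A round trip such as `decode(encode_message(<<1:128>>, hello))` yields the MsgId and a body that `binary_to_term/1` turns back into `hello`.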

3.3 Per-message overhead vs v1

Body     v1 record    v2 MESSAGE   delta
0 B      25 B         21 B         -4
64 B     89 B         85 B         -4
1 KiB    1049 B       1045 B       -4
4 MiB    4194329 B    4194325 B    -4

v2 is 4 bytes smaller per message: the narrower size field (32 bit instead
of 64 bit) saves 4 bytes and dropping the trailer byte saves 1, which more
than offsets the 1 added Type byte.
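
The arithmetic behind the table can be checked with a small sketch. The v1 breakdown (64-bit size, 16-byte MsgId, body, 1 trailer byte) is inferred from the figures and the description in this section; the module and function names are hypothetical.

```erlang
-module(v2_overhead_sketch).
-export([v1_size/1, v2_size/1, delta/1]).

%% v1 record as implied by the table: 8-byte size field, 16-byte MsgId,
%% body, 1 trailer byte.
v1_size(BodyLen) -> 8 + 16 + BodyLen + 1.

%% v2 MESSAGE: 1 Type byte, 4-byte Size, 16-byte MsgId, body.
v2_size(BodyLen) -> 1 + 4 + 16 + BodyLen.

%% Difference in bytes; constant because the body is the same in both formats.
delta(BodyLen) -> v2_size(BodyLen) - v1_size(BodyLen).
```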

3.4 No record trailer

v1 records end with a 255 byte. In v2 there is no trailer.

The trailer in v1 was a 1-in-256 false-positive guard for the salvage scanner
and a weak torn-write detector. v2 obtains equivalent robustness from
structural validation during scan:

  • Type is a known value.
  • Size is within [min_for_type, FileSize - Offset].
  • For MESSAGE, MsgId is checked against the "already seen" map (carried
    over from v1's scanner).

A torn last record at EOF still fails Offset + Size =< FileSize and is
discarded, just as in v1. Bit-flip detection is out of scope for both v1 and
v2; if needed it must be added via MESSAGE_WITH_CHECKSUM.
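
A scan applying these checks might look like the sketch below. It is an assumption-laden illustration: it works on a whole segment held in memory, the module name is hypothetical, and the duplicate check is a simplified stand-in for the "already seen" map.

```erlang
-module(v2_scan_sketch).
-export([scan/1]).

-define(HEADER_SIZE, 64).

%% Returns [{MsgId, Offset, Size}] in file order. The header is skipped,
%% not validated, matching the hot-path behaviour described in 3.1.
scan(<<_Header:?HEADER_SIZE/binary, Records/binary>> = File) ->
    scan(Records, ?HEADER_SIZE, byte_size(File), #{}, []).

scan(<<>>, _Offset, _FileSize, _Seen, Acc) ->
    lists:reverse(Acc);
scan(<<1:8, Rest/binary>>, Offset, FileSize, Seen, Acc) ->
    %% SMALL_HOLE: one byte, no length field.
    scan(Rest, Offset + 1, FileSize, Seen, Acc);
scan(<<2:8, Size:32, _/binary>> = Bin, Offset, FileSize, Seen, Acc)
  when Size >= 5, Offset + Size =< FileSize ->
    %% HOLE: jump over it without looking at the inner bytes.
    <<_Hole:Size/binary, Rest/binary>> = Bin,
    scan(Rest, Offset + Size, FileSize, Seen, Acc);
scan(<<3:8, Size:32, MsgId:16/binary, _/binary>> = Bin, Offset, FileSize, Seen, Acc)
  when Size >= 21, Offset + Size =< FileSize ->
    <<_Msg:Size/binary, Rest/binary>> = Bin,
    case Seen of
        #{MsgId := _} ->
            %% Already seen in this scan: do not report it twice.
            scan(Rest, Offset + Size, FileSize, Seen, Acc);
        _ ->
            scan(Rest, Offset + Size, FileSize, Seen#{MsgId => true},
                 [{MsgId, Offset, Size} | Acc])
    end;
scan(<<Type:8, _/binary>>, _Offset, _FileSize, _Seen, Acc)
  when Type =:= 2; Type =:= 3 ->
    %% Known type that fails the Size check (for example a torn record
    %% at EOF): discard it and stop.
    lists:reverse(Acc);
scan(<<Type:8, _/binary>>, Offset, _FileSize, _Seen, _Acc) ->
    %% Type 0 and all other reserved values crash immediately.
    error({unknown_record_type, Type, Offset}).
```

Each HOLE costs one pattern match and one offset addition regardless of its size, which is where the O(1)-per-hole property comes from.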

3.5 Hole encoding rule (writers / compaction)

gap_size == 0  -> nothing
gap_size <  5  -> gap_size x <<1>>          (1..4 SMALL_HOLE bytes)
gap_size >= 5  -> one HOLE record, Size = gap_size

Every non-zero gap is representable. Compaction never has to refuse moving a
message into a hole because of an awkward remainder.
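
The rule maps directly onto a three-clause function. A minimal sketch with a hypothetical module name; it returns only the bytes that have to be written at the start of the gap, since the inner bytes of a HOLE are left untouched.

```erlang
-module(v2_hole_sketch).
-export([encode_gap/1]).

%% Bytes to write at the start of a gap of GapSize bytes.
encode_gap(0) ->
    <<>>;
encode_gap(GapSize) when GapSize >= 1, GapSize =< 4 ->
    %% 1..4 consecutive SMALL_HOLE bytes.
    binary:copy(<<1>>, GapSize);
encode_gap(GapSize) when GapSize >= 5, GapSize =< 16#FFFFFFFF ->
    %% One HOLE record: only the 5-byte Type + Size prefix is written.
    <<2:8, GapSize:32>>.
```

A 1 MiB gap therefore costs a 5-byte write instead of a 1 MiB zero-fill.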

3.6 Unknown types and why Type = 0 is reserved

A zero-filled or uninitialised region cannot accidentally look like a valid
record: an unrecognised first byte leads to an immediate crash.

Type = 0 is reserved forever to better detect potential zero-filled
binary issues in the code or VM.

4. State changes

4.1 New per-store field

%% Added to #msstate, #client_msstate, #gc_state.
last_v1_file :: integer() | none

Single integer per store. Mirrored into the client and GC states so that
every code path that opens a segment knows which extension to use without an
ETS lookup.

4.2 Format dispatch

file_format(_File, none)                       -> v2;
file_format(File,  LastV1) when File =< LastV1 -> v1;
file_format(_,     _)                          -> v2.

filenum_to_name(File, LastV1) ->
    case file_format(File, LastV1) of
        v1 -> integer_to_list(File) ++ ".rdq";
        v2 -> integer_to_list(File) ++ ".sqs"
    end.

This function is the only indirection added at every file-open site (writer,
reader, scanner, GC). It is a single integer comparison.

4.3 Persistence

last_v1_file is persisted in clean.dot recovery terms alongside
{client_refs, _}:

[{client_refs, ...},
 {last_v1_file, integer() | none}]

On dirty recovery (no clean.dot), last_v1_file is recomputed from disk:

case filelib:wildcard("*.rdq", Dir) of
    []    -> none;
    Files -> lists:max([filename_to_num(F) || F <- Files])
end
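
Both recovery paths can be combined in one lookup. This is a sketch under stated assumptions: recovery terms are taken to be a proplist, or `undefined` when clean.dot is missing; the module and function names are hypothetical; and falling back to the disk scan when a clean.dot lacks the key (for example one written by an older release) is an assumption here, not something decided above. `filename_to_num/1` is written out so the example is self-contained.

```erlang
-module(v2_boundary_sketch).
-export([last_v1_file/2]).

%% Prefer the persisted value, otherwise recompute from the *.rdq files on disk.
last_v1_file(undefined, Dir) ->
    recompute(Dir);
last_v1_file(Terms, Dir) when is_list(Terms) ->
    case proplists:lookup(last_v1_file, Terms) of
        {last_v1_file, Value} -> Value;   % integer() | none
        none                  -> recompute(Dir)
    end.

%% Highest-numbered v1 segment in Dir, or none if there are no v1 files.
recompute(Dir) ->
    case filelib:wildcard("*.rdq", Dir) of
        []    -> none;
        Files -> lists:max([filename_to_num(F) || F <- Files])
    end.

%% "12.rdq" -> 12
filename_to_num(FileName) ->
    list_to_integer(filename:rootname(FileName)).
```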

4.4 Eager ETS index entry deletion

We can enable eager index entry deletion again: #16142

Deleting many segment files should be very fast because we don't need to scan.

4.5 Salvage tools

Salvage tools (if any) will need to be updated to handle both v1 and v2 file formats.

4.6 What is NOT added

  • No new fields on #file_summary.
  • No format flag on #msg_location.
  • No on-disk file header validation on the read path.

cc @lukebakken @gomoripeti
