The CQ shared message store has always had performance issues around compaction/deletion. These were improved over time, but never fully resolved. Doing compaction/deletion properly requires a new file format: the existing format has no header, nor anything else that would avoid potential collisions (however rare), so it cannot be evolved in place. I came up with the following plan:
1. Goals
The current message store ("v1") has significant limitations addressed by v2:

- Hole representation is implicit. Removed messages leave gaps in segment
  files. To prevent stray bytes from being misinterpreted as messages,
  compaction zero-fills every gap. The cost of that zero-fill is proportional
  to the gap size and is the dominant I/O cost of compaction on large files.
- Records are not self-describing. Every record is a message. There is no
  way to introduce new record kinds (e.g. explicit free-space markers,
  checksummed messages) without breaking compatibility with existing files.
- Index entries are kept in memory for too long. Removing index entries
  eagerly means holes can no longer be detected during scanning and we have
  to parse one byte at a time. Keeping index entries means we have to clean
  them up on segment deletion.

v2 addresses all of these by introducing typed records with explicit, cheap
HOLE markers that the scanner skips in O(1) per hole regardless of hole
size, and by leaving a clear extension path (additional record types).
2. Strategy: format identity by extension + per-store boundary
2.1 Decided model
- Format identity is encoded in the file extension:
  - .rdq -> v1 segment.
  - .sqs -> v2 segment.
- Each store carries a single boundary integer, last_v1_file:
  - File =< last_v1_file -> v1.
  - File > last_v1_file -> v2.
  - last_v1_file = none -> the store has no v1 files at all.
- Writes always produce v2 from release N onward. v1 files are only ever
  read, never created.
- Conversion happens by attrition. v1 files disappear naturally as their
  messages are removed (compaction reduces them, then delete_file_if_empty
  unlinks them).
- Forced migration of any v1 tail happens in a later release N+X.
2.2 Phased rollout
| Release | Behaviour |
| ------- | --------- |
| N | Ship v2 encoder/decoder/scanner, the last_v1_file boundary, and dual-format reads. New files are always v2. Existing v1 files keep being read in place. No bulk migration. |
| N+X | On startup, if any *.rdq files remain, migrate each to a fresh v2 segment (next integer, swap index entries, delete the old *.rdq). Set last_v1_file = none. |
| N+Y | Remove all v1 code. Refuse to start if *.rdq is present? The operator must boot through N+X first. |

The volume to migrate at N+X is whatever survived natural drain across X
releases — typically small for normal workloads.
3. File format
3.1 File header (fixed 64 bytes at offset 0)
Every v2 segment begins with a 64-byte header, modelled on the
rabbit_classic_queue_store_v2 segment header.
+---------+---------+----------------------------+
| MAGIC | VERSION | reserved (zero-filled) |
| 32 bit | 8 bit | 59 bytes |
+---------+---------+----------------------------+
total: 64 bytes
MAGIC = TBD 4-byte ASCII constant (e.g. "RCQV"; chosen at
implementation time; must differ from "RCQS" and "RCQI").
VERSION = 2.
- The remaining bytes are reserved for future use and zero-filled.
There is no per-file FromSeqId/ToSeqId (the shared store does not have a
segment-relative seq id concept). The header is written but not validated
on the hot path; readers unconditionally start at offset 64. Dirty recovery
scanning and salvage tools may validate the header for forensic purposes.
The first record of a freshly created v2 segment lives at offset 64. The
current_file_offset of a new v2 segment starts at 64 instead of 0. All
absolute offsets stored in the index, all pread calls, and all gap
arithmetic in compaction continue to work unchanged because they are all
expressed in terms of absolute file offsets.
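As an illustration, the header layout above could be built and checked like this (a Python sketch for brevity; the eventual implementation is Erlang, and the magic value below is a placeholder since the actual constant is still TBD):

```python
import struct

HEADER_SIZE = 64
MAGIC = b"XXXX"  # placeholder: the real 4-byte ASCII constant is TBD
VERSION = 2

def build_header() -> bytes:
    # MAGIC (4 bytes) + VERSION (1 byte) + 59 reserved zero bytes = 64 bytes
    return MAGIC + struct.pack("B", VERSION) + b"\x00" * 59

def check_header(header: bytes) -> bool:
    # Only recovery/salvage tooling calls this; the hot read path
    # unconditionally starts at offset 64 without validating.
    return (len(header) == HEADER_SIZE
            and header[:4] == MAGIC
            and header[4] == VERSION
            and header[5:] == b"\x00" * 59)
```

Readers never branch on the header contents, which is why the read path needs no change beyond starting at offset 64.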
3.2 Records
All multi-byte integers are big-endian. The first byte of every record is its
Type. The remaining fields are type-specific.
Type = 0 : RESERVED (must not appear; surfaces zero-filled regions during scan)
Type = 1 : SMALL_HOLE
+------+
| 1:8 | 1 byte total, no length field
+------+
Used only when a contiguous gap is < 5 bytes.
A gap of 1..4 bytes is encoded as 1..4 consecutive SMALL_HOLE bytes.
Type = 2 : HOLE
+------+----------+--------------------+
| 2:8 | Size:32 | inner:(Size-5) | Size = total record length
+------+----------+--------------------+
Min 5 bytes. Used for any gap >= 5 bytes.
Inner bytes are NOT interpreted (left as-is, no zero-fill required).
Type = 3 : MESSAGE
+------+----------+-----------+----------------+
| 3:8 | Size:32 | MsgId:128 | Body:(Size-21) | Size = total record length
+------+----------+-----------+----------------+
Min 21 bytes (empty body permitted).
Body = term_to_binary(Msg) (same encoding as v1).
Type = 4 : MESSAGE_WITH_CHECKSUM (reserved for the future)
e.g.
+------+----------+-----------+--------+----------------+
| 4:8 | Size:32 | MsgId:128 | CRC:32 | Body:(Size-25) |
+------+----------+-----------+--------+----------------+
Type = 5..255 : RESERVED
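A scan loop over these record types can be sketched as follows. This is a Python model of the logic (the real implementation is Erlang); it shows how holes are skipped in O(1) per hole without reading their inner bytes, and how a torn last record at EOF is discarded:

```python
import struct

SMALL_HOLE, HOLE, MESSAGE = 1, 2, 3
MIN_SIZE = {HOLE: 5, MESSAGE: 21}

def scan(segment: bytes, start: int = 64):
    """Yield (offset, msg_id, body) for each MESSAGE record.
    Holes are skipped in O(1) per hole; a torn record at EOF is discarded."""
    pos, end = start, len(segment)
    while pos < end:
        rtype = segment[pos]
        if rtype == SMALL_HOLE:
            pos += 1                      # 1-byte record, no length field
        elif rtype in (HOLE, MESSAGE):
            if pos + 5 > end:
                break                     # torn header at EOF: discard
            (rec_size,) = struct.unpack_from(">I", segment, pos + 1)
            if rec_size < MIN_SIZE[rtype]:
                raise ValueError(f"record too small at offset {pos}")
            if pos + rec_size > end:
                break                     # torn record at EOF: discard
            if rtype == MESSAGE:
                msg_id = segment[pos + 5 : pos + 21]
                body = segment[pos + 21 : pos + rec_size]
                yield pos, msg_id, body
            pos += rec_size               # HOLE inner bytes are never read
        else:
            raise ValueError(f"unknown record type {rtype} at offset {pos}")
```

Note that a HOLE costs a single 5-byte header read regardless of its size, which is the core win over v1's zero-filled gaps.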
3.3 Per-message overhead vs v1
| Body | v1 record | v2 MESSAGE | delta |
| ----- | --------- | ---------- | ----- |
| 0 B | 25 B | 21 B | -4 |
| 64 B | 89 B | 85 B | -4 |
| 1 KiB | 1049 B | 1045 B | -4 |
| 4 MiB | 4194329 B | 4194325 B | -4 |
v2 is 4 bytes smaller per message: the narrower size field (32 bit instead
of 64 bit) saves 4 bytes, and dropping the v1 trailer byte offsets the
added Type byte.
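The table follows directly from the two layouts; a quick sketch of the arithmetic (Python, for checking only):

```python
def v1_record_size(body_len: int) -> int:
    # v1: 64-bit size field + 128-bit MsgId + body + 1-byte trailer (255)
    return 8 + 16 + body_len + 1

def v2_message_size(body_len: int) -> int:
    # v2 MESSAGE: 1-byte Type + 32-bit Size + 128-bit MsgId + body, no trailer
    return 1 + 4 + 16 + body_len
```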
3.4 No record trailer
v1 records end with a 255 byte. In v2 there is no trailer.
The trailer in v1 was a 1-in-256 false-positive guard for the salvage scanner
and a weak torn-write detector. v2 obtains equivalent robustness from
structural validation during scan:
- Type is a known value.
- Size is within [min_for_type, FileSize - Offset].
- For MESSAGE, MsgId is checked against the "already seen" map (carried
  over from v1's scanner).
A torn last record at EOF still fails Offset + Size =< FileSize and is
discarded, just as in v1. Bit-flip detection is out of scope for both v1 and
v2; if needed it must be added via MESSAGE_WITH_CHECKSUM.
3.5 Hole encoding rule (writers / compaction)
gap_size == 0 -> nothing
gap_size < 5 -> gap_size x <<1>> (1..4 SMALL_HOLE bytes)
gap_size >= 5 -> one HOLE record, Size = gap_size
Every non-zero gap is representable. Compaction never has to refuse moving a
message into a hole because of an awkward remainder.
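The rule above can be expressed directly. Note that for a gap >= 5 only the 5-byte HOLE header is ever written; the inner bytes stay untouched on disk, so the write cost no longer scales with the gap size (a Python sketch of the Erlang logic):

```python
import struct

SMALL_HOLE = b"\x01"  # Type = 1, one byte total

def hole_bytes_to_write(gap_size: int) -> bytes:
    """Bytes compaction actually writes over a gap of gap_size bytes.
    gap_size == 0 -> nothing
    gap_size <  5 -> gap_size SMALL_HOLE bytes
    gap_size >= 5 -> a 5-byte HOLE header (Type=2, Size=gap_size);
                     the remaining gap_size - 5 inner bytes are left as-is."""
    if gap_size == 0:
        return b""
    if gap_size < 5:
        return SMALL_HOLE * gap_size
    return bytes([2]) + struct.pack(">I", gap_size)
```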
3.6 Unknown types and why Type = 0 is reserved
A zero-filled or uninitialised region cannot accidentally look like a valid
record: an unrecognised first byte leads to an immediate crash.
Type = 0 is reserved forever so that zero-fill bugs in the code or the VM
are detected immediately rather than silently misread.
4. State changes
4.1 New per-store field
%% Added to #msstate, #client_msstate, #gc_state.
last_v1_file :: integer() | none
Single integer per store. Mirrored into the client and GC states so that
every code path that opens a segment knows which extension to use without an
ETS lookup.
4.2 Format dispatch
file_format(_File, none) -> v2;
file_format(File, LastV1) when File =< LastV1 -> v1;
file_format(_File, _LastV1) -> v2.

filenum_to_name(File, LastV1) ->
    case file_format(File, LastV1) of
        v1 -> integer_to_list(File) ++ ".rdq";
        v2 -> integer_to_list(File) ++ ".sqs"
    end.

These functions are the only indirection added at every file-open site
(writer, reader, scanner, GC): a single integer comparison.
4.3 Persistence
last_v1_file is persisted in clean.dot recovery terms alongside
{client_refs, _}:
[{client_refs, ...},
{last_v1_file, integer() | none}]
On dirty recovery (no clean.dot), last_v1_file is recomputed from disk:
case filelib:wildcard("*.rdq", Dir) of
    [] -> none;
    Files -> lists:max([filename_to_num(F) || F <- Files])
end
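The same recomputation, sketched in Python for clarity (filename_to_num is modelled inline; Python's None stands in for Erlang's 'none'):

```python
import glob
import os

def recompute_last_v1_file(directory: str):
    """Dirty-recovery fallback: derive last_v1_file from whatever *.rdq
    files remain on disk. Returns None when no v1 files remain."""
    nums = [int(os.path.basename(f)[:-len(".rdq")])
            for f in glob.glob(os.path.join(directory, "*.rdq"))]
    return max(nums) if nums else None
```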
4.4 Eager ETS index entry deletion
We can enable eager index entry deletion again: #16142
Deleting many segment files should be very fast because we no longer need
to scan them.
4.5 Salvage tools
Salvage tools (if any) will need to be updated to handle both v1 and v2 file formats.
4.6 What is NOT added
- No new fields on #file_summary.
- No format flag on #msg_location.
- No on-disk file header validation on the read path.
cc @lukebakken @gomoripeti