From a8839b7244f150557913f0e459e901e6e9141906 Mon Sep 17 00:00:00 2001
From: "randomizedcoder dave.seddon.ca@gmail.com" <dave.seddon.ca@gmail.com>
Date: Fri, 19 Jun 2026 20:28:12 -0700
Subject: [PATCH 1/5] docs: add Parquet format reference for data teams
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New docs/parquet-format.md explains the S3/Parquet export for an enterprise
data/analytics audience consuming xtcp2's TCP telemetry:
- Hive partition layout (host=/date=/hour=, UTC) and object naming
- file size/cadence (~63 MiB uncompressed soft cap) and per-column
  compression (ZSTD strings/bytes, SNAPPY numerics)
- how to read it (DuckDB/pandas/Trino) with partition pruning
- the grain (one row per socket per poll; cumulative counters; socket cookie)
- a 'start here' set of the key TCP columns with units (rtt µs, cwnd packets,
  delivery_rate bytes/s, total_retrans, byte counters, congestion algo)
- decoding cheat sheet (raw-byte IPs via family, TCP state map, enums, ts)
- full schema grouping + types, proto3 no-null gotchas, and where the schema
  is defined (ParquetRow + drift test).

Cross-linked from the docs hub and output-and-destinations (S3 section).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/README.md                  |   1 +
 docs/output-and-destinations.md |   2 +
 docs/parquet-format.md          | 226 ++++++++++++++++++++++++++++++++
 3 files changed, 229 insertions(+)
 create mode 100644 docs/parquet-format.md
diff --git a/docs/README.md b/docs/README.md
index a6b63ef..7f19015 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -59,6 +59,7 @@ The captured netlink `.pcap` fixture corpus spanning many kernel versions, the r
 See **[CONTRIBUTING.md](../CONTRIBUTING.md)** for the development environment, the full set of Nix build/test targets, the automated test suite (unit, race, per-flavor, and microVM integration tests), linting tiers, and protobuf regeneration. Related references:
 
 - [Protobuf formats](protobuf-formats.md) — the config/data/ClickHouse schemas, generated code, and how to regenerate.
+- [Parquet format](parquet-format.md) — the S3/Parquet layout and schema, for data teams consuming the export.
 - [Build flavors](build-flavors.md) — the build-variant × destination-flavor matrix.
 - [Integration testing](integration-testing.md) — the QEMU microVM test harness.
 - [Quality report](quality-report.md) — auto-generated linter/coverage status.
diff --git a/docs/output-and-destinations.md b/docs/output-and-destinations.md
index 1292111..4ddaf79 100644
--- a/docs/output-and-destinations.md
+++ b/docs/output-and-destinations.md
@@ -79,6 +79,8 @@ Notes on the stream sinks:
 
 `pkg/xtcp/destinations_s3parquet.go` (build tag `dest_s3parquet`) writes Hive-partitioned Parquet files to an S3-compatible store (e.g. MinIO) instead of streaming to a broker. The record-to-Parquet schema mapping is in `destinations_s3parquet_schema.go`. Files are partitioned `host=…/date=…/hour=…/<file>.parquet` and finalized/uploaded when the in-memory builder crosses `-s3ParquetFlushBytes` (default 63 MiB uncompressed). Credentials and endpoint come from `-s3*` flags or `S3_*` environment variables; the bucket must already exist.
 
+For a consumer-facing reference of the Parquet layout, schema, and the key TCP columns — written for a **data team** ingesting this into an enterprise platform — see [parquet-format.md](parquet-format.md).
+
 ## The record schema
 
 The per-socket record and its batch wrapper are defined in `proto/xtcp_flat_record/v1/xtcp_flat_record.proto`:
diff --git a/docs/parquet-format.md b/docs/parquet-format.md
new file mode 100644
index 0000000..e2924f8
--- /dev/null
+++ b/docs/parquet-format.md
@@ -0,0 +1,226 @@
+# Parquet format (for data teams)
+
+This document is written for a **data / analytics team** that needs to consume xtcp2's TCP
+telemetry into an enterprise data platform (lakehouse, warehouse, query engine). It assumes
+you're fluent in Parquet, object storage, and SQL, but only have a *basic* understanding of
+TCP — so it explains the columns that matter most and where to focus your first
+implementation.
+
+The short version: when xtcp2 runs with the S3/Parquet destination it writes **Hive-style
+partitioned, column-compressed Apache Parquet files** to an S3-compatible bucket. One Parquet
+**row = one socket observed at one poll**. The schema is flat (no nested or repeated fields),
+one column per field, so it loads cleanly into any Parquet reader.
+
+## Table of contents
+
+- [Where the files land](#where-the-files-land)
+- [File size, cadence, and compression](#file-size-cadence-and-compression)
+- [Reading the data](#reading-the-data)
+- [The grain: one row per socket per poll](#the-grain-one-row-per-socket-per-poll)
+- [Start here: the columns that matter](#start-here-the-columns-that-matter)
+- [Decoding cheat sheet](#decoding-cheat-sheet)
+- [Full schema and column types](#full-schema-and-column-types)
+- [Types, nulls, and gotchas](#types-nulls-and-gotchas)
+- [Where the schema is defined](#where-the-schema-is-defined)
+- [See also](#see-also)
+
+## Where the files land
+
+Object keys are **Hive-partitioned** by host, date, and hour (all UTC):
+
+```
+<prefix>/host=<hostname>/date=<YYYY-MM-DD>/hour=<HH>/<unix_ts>_<rand>.parquet
+```
+
+Example:
+
+```
+xtcp/host=web-01/date=2026-06-19/hour=14/1750345200_9f3a1c20.parquet
+```
+
+- `host=` — the emitting machine (`hostname`); sanitized for object-store safety, empty →
+  `unknown`.
+- `date=` / `hour=` — **UTC** wall clock at write time, ready for partition pruning
+  (`WHERE date = '...' AND hour = '...'`).
+- The file name is `<unix-seconds>_<random-hex>.parquet` — unique, append-only; xtcp2 never
+  rewrites a file.
+
+These partition keys are part of the **path, not the file** (standard Hive convention). Most
+engines (Spark, Trino/Athena, DuckDB, BigQuery external tables) expose `host`, `date`, `hour`
+as virtual columns automatically when you point them at `<prefix>/`.
+
+## File size, cadence, and compression
+
+- **Size:** xtcp2 finalizes and uploads a file when its in-memory builder reaches a soft cap
+  of **~63 MiB uncompressed** (configurable via `-s3ParquetFlushBytes`). On the wire the
+  `.parquet` is several times smaller after compression. A partial file is also flushed on
+  shutdown, so the last file of a run may be small.
+- **Cadence:** depends on traffic volume — a busy host fills 63 MiB quickly; a quiet host may
+  take a while, so don't assume one file per hour. Use the `date`/`hour` partitions, not file
+  counts.
+- **Compression:** per-column. String and address columns use **ZSTD** (high ratio);
+  numeric columns use **SNAPPY** (fast, widely supported). Every mainstream Parquet reader
+  handles both — you don't need to configure anything.
+
+## Reading the data
+
+Point any Parquet engine at the prefix. A few starting points:
+
+```sql
+-- DuckDB (great for exploration); hive_partitioning surfaces host/date/hour as columns
+SELECT host, date, hour, count(*) AS rows
+FROM read_parquet('s3://bucket/xtcp/**/*.parquet', hive_partitioning = true)
+WHERE date = '2026-06-19'
+GROUP BY 1,2,3 ORDER BY 1,2,3;
+```
+
+```python
+# pandas / pyarrow
+import pyarrow.dataset as ds
+dataset = ds.dataset("s3://bucket/xtcp/", format="parquet", partitioning="hive")
+df = dataset.to_table(columns=["timestamp_ns","hostname","tcp_info_rtt"]).to_pandas()
+```
+
+```sql
+-- Trino / Athena: create an external table over the prefix with
+-- partitions (host string, date string, hour string); project columns you need.
+```
+
+**Always select only the columns you need** — there are ~120, and columnar pruning is where
+Parquet earns its keep. Likewise filter on `date`/`hour` for partition pruning.
+
+## The grain: one row per socket per poll
+
+xtcp2 polls every network namespace on a fixed interval (default 10s) and emits one row per
+TCP socket per poll. So:
+
+- A long-lived connection appears in **many** rows over time — one per poll while it exists.
+- Most counters (bytes, segments, retransmits) are **cumulative over the socket's lifetime**,
+  so you typically `MAX()` them per socket or take deltas between consecutive polls.
+- Identify a single socket across polls with `inet_diag_msg_socket_cookie` (a stable
+  kernel-assigned id) together with `hostname`/`netns`.
+
+## Start here: the columns that matter
+
+If you're scoping an initial implementation, these are the high-value columns. Everything else
+can come later.
+
+### Identity & time
+| Column | Type | Meaning |
+|---|---|---|
+| `timestamp_ns` | double | When the sample was taken — **Unix epoch nanoseconds, UTC**. Divide by 1e9 for seconds. |
+| `hostname` | string | Emitting machine (also the `host=` partition). |
+| `netns` | string | Network namespace path — distinguishes host vs container/pod sockets. |
+| `inet_diag_msg_socket_cookie` | uint64 | Stable per-socket id; use to track one connection across polls. |
+
+### The connection (4-tuple)
+| Column | Type | Meaning |
+|---|---|---|
+| `inet_diag_msg_family` | uint32 | Address family: **2 = IPv4, 10 = IPv6**. Tells you how to read the address bytes. |
+| `inet_diag_msg_socket_source` | bytes | Local IP, **raw bytes** (4 for v4, 16 for v6). See [decoding](#decoding-cheat-sheet). |
+| `inet_diag_msg_socket_source_port` | uint32 | Local port (host byte order; use as-is). |
+| `inet_diag_msg_socket_destination` | bytes | Remote IP, raw bytes. |
+| `inet_diag_msg_socket_destination_port` | uint32 | Remote port. |
+| `inet_diag_msg_state` | uint32 | TCP state (see the [state map](#decoding-cheat-sheet)); `1`=ESTABLISHED, `10`=LISTEN. |
+
+### Health & performance (the metrics most teams want)
+| Column | Type | Unit / meaning |
+|---|---|---|
+| `tcp_info_rtt` | uint32 | Smoothed round-trip time, **microseconds**. The headline latency metric. |
+| `tcp_info_min_rtt` | uint32 | Minimum RTT seen, microseconds — a cleaner latency baseline. |
+| `tcp_info_rtt_var` | uint32 | RTT variance, microseconds (jitter). |
+| `tcp_info_snd_cwnd` | uint32 | Congestion window, **in packets/segments** (not bytes). |
+| `tcp_info_total_retrans` | uint32 | Cumulative retransmitted segments — the simplest "is this connection healthy?" signal. |
+| `tcp_info_bytes_sent` / `tcp_info_bytes_acked` | uint64 | Cumulative bytes sent / acknowledged. |
+| `tcp_info_bytes_received` | uint64 | Cumulative bytes received. |
+| `tcp_info_delivery_rate` | uint64 | Recent delivery rate, **bytes/second** — effective throughput. |
+| `tcp_info_pacing_rate` | uint64 | Sender pacing rate, bytes/second. |
+| `congestion_algorithm_string` | string | Congestion-control algorithm name (e.g. `cubic`, `bbr`) — easiest to read. |
+
+A solid first dashboard: per host/destination, `MAX(tcp_info_rtt)` and `MAX(tcp_info_min_rtt)`,
+the delta of `tcp_info_total_retrans`, and throughput from `tcp_info_delivery_rate` — filtered
+to `inet_diag_msg_state = 1` (ESTABLISHED).
+
+## Decoding cheat sheet
+
+A few columns are stored as machine values for fidelity/size and need decoding for humans:
+
+- **IP addresses** (`inet_diag_msg_socket_source` / `_destination`) are raw bytes. Read them
+  with `inet_diag_msg_family`: 4 bytes → dotted-quad IPv4, 16 bytes → IPv6. In DuckDB you can
+  reconstruct IPv4 as `concat_ws('.', get_byte(col,0), get_byte(col,1), get_byte(col,2),
+  get_byte(col,3))`. If you'd rather not decode in SQL at all, the daemon can emit **already
+  humanized** CSV/JSON instead — see [output formats](output-and-destinations.md) — but the
+  Parquet path keeps raw bytes so nothing is lost.
+- **TCP state** (`inet_diag_msg_state`, and `tcp_info_state`) is a kernel integer. Map:
+
+  | value | name | value | name |
+  |---|---|---|---|
+  | 1 | ESTABLISHED | 7 | CLOSE |
+  | 2 | SYN_SENT | 8 | CLOSE_WAIT |
+  | 3 | SYN_RECV | 9 | LAST_ACK |
+  | 4 | FIN_WAIT1 | 10 | LISTEN |
+  | 5 | FIN_WAIT2 | 11 | CLOSING |
+  | 6 | TIME_WAIT | 12 | NEW_SYN_RECV |
+
+- **Congestion algorithm**: prefer `congestion_algorithm_string` (the kernel name). The
+  `congestion_algorithm_enum` integer is `0`=UNSPECIFIED, `1`=CUBIC, `2`=DCTCP, `3`=VEGAS,
+  `4`=PRAGUE, `5`=BBR1, `6`=BBR2, `7`=BBR3.
+- **timestamp_ns** is a double; `to_timestamp(timestamp_ns / 1e9)` (or your engine's
+  equivalent) gives a UTC timestamp.
+
+## Full schema and column types
+
+The complete column list (~120) groups as follows; column names are the proto's snake_case
+names, identical to the ClickHouse table columns:
+
+- **Metadata** — `timestamp_ns` (double), `hostname`, `netns`, `nsid`, `label`, `tag`,
+  `record_counter`, `socket_fd`, `netlinker_id`.
+- **`inet_diag_msg_*`** — the socket id/4-tuple, state, queues, uid/inode, ASN annotations.
+- **`mem_info_*` / `sk_mem_info_*`** — socket memory accounting.
+- **`tcp_info_*`** — the bulk of the data: RTT, cwnd, ssthresh, MSS, windows, segment and byte
+  counters, pacing/delivery rates, RTO stats, busy/limited times.
+- **`congestion_algorithm_*`** — enum (`int32`) + string name.
+- **Per-algorithm blocks** — `vegas_info_*`, `dctcp_info_*`, `bbr_info_*` (only meaningful when
+  that algorithm is in use).
+- **QoS / misc** — `type_of_service`, `traffic_class`, `shutdown_state`, `class_id`,
+  `sock_opt`, `c_group`.
+
+Column types are: `double` (timestamp only), `string` (hostname/netns/label/tag/congestion
+string), `bytes` (the two IP-address columns), `int32` (congestion enum), and `uint32`/`uint64`
+for everything else. The authoritative, field-by-field list with types and compression is the
+[`ParquetRow` struct](../pkg/xtcp/destinations_s3parquet_schema.go); field meanings are in the
+[protobuf schema](../proto/xtcp_flat_record/v1/xtcp_flat_record.proto) and
+[protobuf-formats.md](protobuf-formats.md).
+
+## Types, nulls, and gotchas
+
+- **No NULLs.** The records come from proto3, which has no null — an absent/zero value is the
+  numeric `0` (or empty string/bytes). Treat `0` as "unset or genuinely zero"; don't expect
+  SQL `NULL`.
+- **Counters are cumulative**, per socket lifetime — delta between consecutive polls (matched
+  by `inet_diag_msg_socket_cookie`) for per-interval rates, or `MAX()` for totals.
+- **Units differ**: RTTs are microseconds; rates are bytes/second; `snd_cwnd` is packets;
+  byte counters are bytes. The per-column units are in the tables above.
+- **Per-algorithm columns are sparse-in-meaning**: `bbr_info_*` is only populated when the
+  socket uses BBR, etc. Filter on `congestion_algorithm_string` before trusting them.
+- **Schema evolution**: new fields are *added* (never renamed/reordered in place), so plan for
+  forward-compatible reads (select by name, tolerate new columns).
+
+## Where the schema is defined
+
+The Parquet columns are generated from the
+[`ParquetRow`](../pkg/xtcp/destinations_s3parquet_schema.go) struct, whose `parquet:` tags set
+the column names and per-column compression. A drift test
+(`TestS3ParquetSchema_matchesProto`) asserts that set matches the
+[`XtcpFlatRecord` proto](../proto/xtcp_flat_record/v1/xtcp_flat_record.proto) field-for-field,
+so the Parquet schema, the protobuf, and the ClickHouse table never diverge. To change the
+schema you edit the proto, regenerate (`nix run .#regen-protos`), and mirror the field in
+`ParquetRow`. The S3/Parquet destination itself is documented in
+[output formats & destinations](output-and-destinations.md#s3-and-parquet); it ships only in
+builds that include the `dest_s3parquet` tag (see [build flavors](build-flavors.md)).
+
+## See also
+
+- [Protobuf formats](protobuf-formats.md) — the canonical schema and field semantics.
+- [Output formats & destinations](output-and-destinations.md) — the S3/Parquet destination and the alternative humanized CSV/JSON formats.
+- [Build flavors](build-flavors.md) — enabling the `s3parquet` destination.

From 59f56494c7d11003b21f061c276a7f8f4559b0c5 Mon Sep 17 00:00:00 2001
From: "randomizedcoder dave.seddon.ca@gmail.com" <dave.seddon.ca@gmail.com>
Date: Sat, 20 Jun 2026 12:15:53 -0700
Subject: [PATCH 2/5] docs: soft-wrap parquet-format.md (let the renderer wrap)

---
 docs/parquet-format.md | 121 +++++++++++------------------------------
 1 file changed, 31 insertions(+), 90 deletions(-)

diff --git a/docs/parquet-format.md b/docs/parquet-format.md
index e2924f8..72ceeb5 100644
--- a/docs/parquet-format.md
+++ b/docs/parquet-format.md
@@ -1,15 +1,8 @@
 # Parquet format (for data teams)
 
-This document is written for a **data / analytics team** that needs to consume xtcp2's TCP
-telemetry into an enterprise data platform (lakehouse, warehouse, query engine). It assumes
-you're fluent in Parquet, object storage, and SQL, but only have a *basic* understanding of
-TCP — so it explains the columns that matter most and where to focus your first
-implementation.
+This document is written for a **data / analytics team** that needs to consume xtcp2's TCP telemetry into an enterprise data platform (lakehouse, warehouse, query engine). It assumes you're fluent in Parquet, object storage, and SQL, but only have a *basic* understanding of TCP — so it explains the columns that matter most and where to focus your first implementation.
 
-The short version: when xtcp2 runs with the S3/Parquet destination it writes **Hive-style
-partitioned, column-compressed Apache Parquet files** to an S3-compatible bucket. One Parquet
-**row = one socket observed at one poll**. The schema is flat (no nested or repeated fields),
-one column per field, so it loads cleanly into any Parquet reader.
+The short version: when xtcp2 runs with the S3/Parquet destination it writes **Hive-style partitioned, column-compressed Apache Parquet files** to an S3-compatible bucket. One Parquet **row = one socket observed at one poll**. The schema is flat (no nested or repeated fields), one column per field, so it loads cleanly into any Parquet reader.
 
 ## Table of contents
 
@@ -38,29 +31,17 @@ Example:
 xtcp/host=web-01/date=2026-06-19/hour=14/1750345200_9f3a1c20.parquet
 ```
 
-- `host=` — the emitting machine (`hostname`); sanitized for object-store safety, empty →
-  `unknown`.
-- `date=` / `hour=` — **UTC** wall clock at write time, ready for partition pruning
-  (`WHERE date = '...' AND hour = '...'`).
-- The file name is `<unix-seconds>_<random-hex>.parquet` — unique, append-only; xtcp2 never
-  rewrites a file.
+- `host=` — the emitting machine (`hostname`); sanitized for object-store safety, empty → `unknown`.
+- `date=` / `hour=` — **UTC** wall clock at write time, ready for partition pruning (`WHERE date = '...' AND hour = '...'`).
+- The file name is `<unix-seconds>_<random-hex>.parquet` — unique, append-only; xtcp2 never rewrites a file.
 
-These partition keys are part of the **path, not the file** (standard Hive convention). Most
-engines (Spark, Trino/Athena, DuckDB, BigQuery external tables) expose `host`, `date`, `hour`
-as virtual columns automatically when you point them at `<prefix>/`.
+These partition keys are part of the **path, not the file** (standard Hive convention). Most engines (Spark, Trino/Athena, DuckDB, BigQuery external tables) expose `host`, `date`, `hour` as virtual columns automatically when you point them at `<prefix>/`.
 
 ## File size, cadence, and compression
 
-- **Size:** xtcp2 finalizes and uploads a file when its in-memory builder reaches a soft cap
-  of **~63 MiB uncompressed** (configurable via `-s3ParquetFlushBytes`). On the wire the
-  `.parquet` is several times smaller after compression. A partial file is also flushed on
-  shutdown, so the last file of a run may be small.
-- **Cadence:** depends on traffic volume — a busy host fills 63 MiB quickly; a quiet host may
-  take a while, so don't assume one file per hour. Use the `date`/`hour` partitions, not file
-  counts.
-- **Compression:** per-column. String and address columns use **ZSTD** (high ratio);
-  numeric columns use **SNAPPY** (fast, widely supported). Every mainstream Parquet reader
-  handles both — you don't need to configure anything.
+- **Size:** xtcp2 finalizes and uploads a file when its in-memory builder reaches a soft cap of **~63 MiB uncompressed** (configurable via `-s3ParquetFlushBytes`). On the wire the `.parquet` is several times smaller after compression. A partial file is also flushed on shutdown, so the last file of a run may be small.
+- **Cadence:** depends on traffic volume — a busy host fills 63 MiB quickly; a quiet host may take a while, so don't assume one file per hour. Use the `date`/`hour` partitions, not file counts.
+- **Compression:** per-column. String and address columns use **ZSTD** (high ratio); numeric columns use **SNAPPY** (fast, widely supported). Every mainstream Parquet reader handles both — you don't need to configure anything.
 
 ## Reading the data
 
@@ -86,24 +67,19 @@ df = dataset.to_table(columns=["timestamp_ns","hostname","tcp_info_rtt"]).to_pan
 -- partitions (host string, date string, hour string); project columns you need.
 ```
 
-**Always select only the columns you need** — there are ~120, and columnar pruning is where
-Parquet earns its keep. Likewise filter on `date`/`hour` for partition pruning.
+**Always select only the columns you need** — there are ~120, and columnar pruning is where Parquet earns its keep. Likewise filter on `date`/`hour` for partition pruning.
 
 ## The grain: one row per socket per poll
 
-xtcp2 polls every network namespace on a fixed interval (default 10s) and emits one row per
-TCP socket per poll. So:
+xtcp2 polls every network namespace on a fixed interval (default 10s) and emits one row per TCP socket per poll. So:
 
 - A long-lived connection appears in **many** rows over time — one per poll while it exists.
-- Most counters (bytes, segments, retransmits) are **cumulative over the socket's lifetime**,
-  so you typically `MAX()` them per socket or take deltas between consecutive polls.
-- Identify a single socket across polls with `inet_diag_msg_socket_cookie` (a stable
-  kernel-assigned id) together with `hostname`/`netns`.
+- Most counters (bytes, segments, retransmits) are **cumulative over the socket's lifetime**, so you typically `MAX()` them per socket or take deltas between consecutive polls.
+- Identify a single socket across polls with `inet_diag_msg_socket_cookie` (a stable kernel-assigned id) together with `hostname`/`netns`.
 
 ## Start here: the columns that matter
 
-If you're scoping an initial implementation, these are the high-value columns. Everything else
-can come later.
+If you're scoping an initial implementation, these are the high-value columns. Everything else can come later.
 
 ### Identity & time
 | Column | Type | Meaning |
@@ -137,20 +113,13 @@ can come later.
 | `tcp_info_pacing_rate` | uint64 | Sender pacing rate, bytes/second. |
 | `congestion_algorithm_string` | string | Congestion-control algorithm name (e.g. `cubic`, `bbr`) — easiest to read. |
 
-A solid first dashboard: per host/destination, `MAX(tcp_info_rtt)` and `MAX(tcp_info_min_rtt)`,
-the delta of `tcp_info_total_retrans`, and throughput from `tcp_info_delivery_rate` — filtered
-to `inet_diag_msg_state = 1` (ESTABLISHED).
+A solid first dashboard: per host/destination, `MAX(tcp_info_rtt)` and `MAX(tcp_info_min_rtt)`, the delta of `tcp_info_total_retrans`, and throughput from `tcp_info_delivery_rate` — filtered to `inet_diag_msg_state = 1` (ESTABLISHED).
 
 ## Decoding cheat sheet
 
 A few columns are stored as machine values for fidelity/size and need decoding for humans:
 
-- **IP addresses** (`inet_diag_msg_socket_source` / `_destination`) are raw bytes. Read them
-  with `inet_diag_msg_family`: 4 bytes → dotted-quad IPv4, 16 bytes → IPv6. In DuckDB you can
-  reconstruct IPv4 as `concat_ws('.', get_byte(col,0), get_byte(col,1), get_byte(col,2),
-  get_byte(col,3))`. If you'd rather not decode in SQL at all, the daemon can emit **already
-  humanized** CSV/JSON instead — see [output formats](output-and-destinations.md) — but the
-  Parquet path keeps raw bytes so nothing is lost.
+- **IP addresses** (`inet_diag_msg_socket_source` / `_destination`) are raw bytes. Read them with `inet_diag_msg_family`: 4 bytes → dotted-quad IPv4, 16 bytes → IPv6. In DuckDB you can reconstruct IPv4 as `concat_ws('.', get_byte(col,0), get_byte(col,1), get_byte(col,2), get_byte(col,3))`. If you'd rather not decode in SQL at all, the daemon can emit **already humanized** CSV/JSON instead — see [output formats](output-and-destinations.md) — but the Parquet path keeps raw bytes so nothing is lost.
 - **TCP state** (`inet_diag_msg_state`, and `tcp_info_state`) is a kernel integer. Map:
 
   | value | name | value | name |
@@ -162,62 +131,34 @@ A few columns are stored as machine values for fidelity/size and need decoding f
   | 5 | FIN_WAIT2 | 11 | CLOSING |
   | 6 | TIME_WAIT | 12 | NEW_SYN_RECV |
 
-- **Congestion algorithm**: prefer `congestion_algorithm_string` (the kernel name). The
-  `congestion_algorithm_enum` integer is `0`=UNSPECIFIED, `1`=CUBIC, `2`=DCTCP, `3`=VEGAS,
-  `4`=PRAGUE, `5`=BBR1, `6`=BBR2, `7`=BBR3.
-- **timestamp_ns** is a double; `to_timestamp(timestamp_ns / 1e9)` (or your engine's
-  equivalent) gives a UTC timestamp.
+- **Congestion algorithm**: prefer `congestion_algorithm_string` (the kernel name). The `congestion_algorithm_enum` integer is `0`=UNSPECIFIED, `1`=CUBIC, `2`=DCTCP, `3`=VEGAS, `4`=PRAGUE, `5`=BBR1, `6`=BBR2, `7`=BBR3.
+- **timestamp_ns** is a double; `to_timestamp(timestamp_ns / 1e9)` (or your engine's equivalent) gives a UTC timestamp.
 
 ## Full schema and column types
 
-The complete column list (~120) groups as follows; column names are the proto's snake_case
-names, identical to the ClickHouse table columns:
+The complete column list (~120) groups as follows; column names are the proto's snake_case names, identical to the ClickHouse table columns:
 
-- **Metadata** — `timestamp_ns` (double), `hostname`, `netns`, `nsid`, `label`, `tag`,
-  `record_counter`, `socket_fd`, `netlinker_id`.
+- **Metadata** — `timestamp_ns` (double), `hostname`, `netns`, `nsid`, `label`, `tag`, `record_counter`, `socket_fd`, `netlinker_id`.
 - **`inet_diag_msg_*`** — the socket id/4-tuple, state, queues, uid/inode, ASN annotations.
 - **`mem_info_*` / `sk_mem_info_*`** — socket memory accounting.
-- **`tcp_info_*`** — the bulk of the data: RTT, cwnd, ssthresh, MSS, windows, segment and byte
-  counters, pacing/delivery rates, RTO stats, busy/limited times.
+- **`tcp_info_*`** — the bulk of the data: RTT, cwnd, ssthresh, MSS, windows, segment and byte counters, pacing/delivery rates, RTO stats, busy/limited times.
 - **`congestion_algorithm_*`** — enum (`int32`) + string name.
-- **Per-algorithm blocks** — `vegas_info_*`, `dctcp_info_*`, `bbr_info_*` (only meaningful when
-  that algorithm is in use).
-- **QoS / misc** — `type_of_service`, `traffic_class`, `shutdown_state`, `class_id`,
-  `sock_opt`, `c_group`.
-
-Column types are: `double` (timestamp only), `string` (hostname/netns/label/tag/congestion
-string), `bytes` (the two IP-address columns), `int32` (congestion enum), and `uint32`/`uint64`
-for everything else. The authoritative, field-by-field list with types and compression is the
-[`ParquetRow` struct](../pkg/xtcp/destinations_s3parquet_schema.go); field meanings are in the
-[protobuf schema](../proto/xtcp_flat_record/v1/xtcp_flat_record.proto) and
-[protobuf-formats.md](protobuf-formats.md).
+- **Per-algorithm blocks** — `vegas_info_*`, `dctcp_info_*`, `bbr_info_*` (only meaningful when that algorithm is in use).
+- **QoS / misc** — `type_of_service`, `traffic_class`, `shutdown_state`, `class_id`, `sock_opt`, `c_group`.
+
+Column types are: `double` (timestamp only), `string` (hostname/netns/label/tag/congestion string), `bytes` (the two IP-address columns), `int32` (congestion enum), and `uint32`/`uint64` for everything else. The authoritative, field-by-field list with types and compression is the [`ParquetRow` struct](../pkg/xtcp/destinations_s3parquet_schema.go); field meanings are in the [protobuf schema](../proto/xtcp_flat_record/v1/xtcp_flat_record.proto) and [protobuf-formats.md](protobuf-formats.md).
 
 ## Types, nulls, and gotchas
 
-- **No NULLs.** The records come from proto3, which has no null — an absent/zero value is the
-  numeric `0` (or empty string/bytes). Treat `0` as "unset or genuinely zero"; don't expect
-  SQL `NULL`.
-- **Counters are cumulative**, per socket lifetime — delta between consecutive polls (matched
-  by `inet_diag_msg_socket_cookie`) for per-interval rates, or `MAX()` for totals.
-- **Units differ**: RTTs are microseconds; rates are bytes/second; `snd_cwnd` is packets;
-  byte counters are bytes. The per-column units are in the tables above.
-- **Per-algorithm columns are sparse-in-meaning**: `bbr_info_*` is only populated when the
-  socket uses BBR, etc. Filter on `congestion_algorithm_string` before trusting them.
-- **Schema evolution**: new fields are *added* (never renamed/reordered in place), so plan for
-  forward-compatible reads (select by name, tolerate new columns).
+- **No NULLs.** The records come from proto3, which has no null — an absent/zero value is the numeric `0` (or empty string/bytes). Treat `0` as "unset or genuinely zero"; don't expect SQL `NULL`.
+- **Counters are cumulative**, per socket lifetime — delta between consecutive polls (matched by `inet_diag_msg_socket_cookie`) for per-interval rates, or `MAX()` for totals.
+- **Units differ**: RTTs are microseconds; rates are bytes/second; `snd_cwnd` is packets; byte counters are bytes. The per-column units are in the tables above.
+- **Per-algorithm columns are sparse-in-meaning**: `bbr_info_*` is only populated when the socket uses BBR, etc. Filter on `congestion_algorithm_string` before trusting them.
+- **Schema evolution**: new fields are *added* (never renamed/reordered in place), so plan for forward-compatible reads (select by name, tolerate new columns).
 
 ## Where the schema is defined
 
-The Parquet columns are generated from the
-[`ParquetRow`](../pkg/xtcp/destinations_s3parquet_schema.go) struct, whose `parquet:` tags set
-the column names and per-column compression. A drift test
-(`TestS3ParquetSchema_matchesProto`) asserts that set matches the
-[`XtcpFlatRecord` proto](../proto/xtcp_flat_record/v1/xtcp_flat_record.proto) field-for-field,
-so the Parquet schema, the protobuf, and the ClickHouse table never diverge. To change the
-schema you edit the proto, regenerate (`nix run .#regen-protos`), and mirror the field in
-`ParquetRow`. The S3/Parquet destination itself is documented in
-[output formats & destinations](output-and-destinations.md#s3-and-parquet); it ships only in
-builds that include the `dest_s3parquet` tag (see [build flavors](build-flavors.md)).
+The Parquet columns are generated from the [`ParquetRow`](../pkg/xtcp/destinations_s3parquet_schema.go) struct, whose `parquet:` tags set the column names and per-column compression. A drift test (`TestS3ParquetSchema_matchesProto`) asserts that set matches the [`XtcpFlatRecord` proto](../proto/xtcp_flat_record/v1/xtcp_flat_record.proto) field-for-field, so the Parquet schema, the protobuf, and the ClickHouse table never diverge. To change the schema you edit the proto, regenerate (`nix run .#regen-protos`), and mirror the field in `ParquetRow`. The S3/Parquet destination itself is documented in [output formats & destinations](output-and-destinations.md#s3-and-parquet); it ships only in builds that include the `dest_s3parquet` tag (see [build flavors](build-flavors.md)).
 
 ## See also
 

From 073e5060f06e6f1b0bb5de098d232eff0ae80807 Mon Sep 17 00:00:00 2001
From: "randomizedcoder dave.seddon.ca@gmail.com" <dave.seddon.ca@gmail.com>
Date: Sat, 20 Jun 2026 12:23:38 -0700
Subject: [PATCH 3/5] docs(parquet): add a Snowflake ingestion section

Short, name-matched COPY INTO recipe (file format + stage + INFER_SCHEMA
auto-create + MATCH_BY_COLUMN_NAME), an external-table/AUTO_REFRESH note for
continuous ingest, and the two Snowflake gotchas: path-based Hive partitions
(derive from metadata$filename) and BINARY address columns.
---
 docs/parquet-format.md | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/docs/parquet-format.md b/docs/parquet-format.md
index 72ceeb5..ca01673 100644
--- a/docs/parquet-format.md
+++ b/docs/parquet-format.md
@@ -9,6 +9,7 @@ The short version: when xtcp2 runs with the S3/Parquet destination it writes **H
 - [Where the files land](#where-the-files-land)
 - [File size, cadence, and compression](#file-size-cadence-and-compression)
 - [Reading the data](#reading-the-data)
+- [Loading into Snowflake](#loading-into-snowflake)
 - [The grain: one row per socket per poll](#the-grain-one-row-per-socket-per-poll)
 - [Start here: the columns that matter](#start-here-the-columns-that-matter)
 - [Decoding cheat sheet](#decoding-cheat-sheet)
@@ -69,6 +70,41 @@ df = dataset.to_table(columns=["timestamp_ns","hostname","tcp_info_rtt"]).to_pan
 
 **Always select only the columns you need** — there are ~120, and columnar pruning is where Parquet earns its keep. Likewise filter on `date`/`hour` for partition pruning.
 
+## Loading into Snowflake
+
+If your platform team already manages an S3 storage integration, ingesting is a few statements. The column names match the Parquet schema, so name-based matching does the mapping for you.
+
+```sql
+-- One-time: a Parquet file format and an external stage over the prefix.
+-- STORAGE_INTEGRATION is the bucket grant your platform team provisions.
+CREATE OR REPLACE FILE FORMAT xtcp_parquet TYPE = PARQUET;
+
+CREATE OR REPLACE STAGE xtcp_stage
+  URL = 's3://bucket/xtcp/'
+  STORAGE_INTEGRATION = my_s3_int
+  FILE_FORMAT = xtcp_parquet;
+
+-- Auto-create a table whose columns mirror the Parquet schema.
+CREATE OR REPLACE TABLE xtcp_flat_records
+  USING TEMPLATE (
+    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
+    FROM TABLE(INFER_SCHEMA(LOCATION => '@xtcp_stage', FILE_FORMAT => 'xtcp_parquet'))
+  );
+
+-- Load. MATCH_BY_COLUMN_NAME maps Parquet columns → table columns by name.
+COPY INTO xtcp_flat_records
+  FROM @xtcp_stage
+  FILE_FORMAT = (TYPE = PARQUET)
+  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
+```
+
+For **continuous, query-in-place** ingestion instead of a one-shot load, define an external table with `AUTO_REFRESH = TRUE` (backed by Snowpipe + an S3 event notification) over the same stage.
+
+Two Snowflake-specific notes:
+
+- **Partitions live in the path, not the file.** Snowflake won't surface `host`/`date`/`hour` automatically the way Spark/Trino do — derive them from `metadata$filename` (e.g. `REGEXP_SUBSTR(metadata$filename, 'date=([^/]+)', 1, 1, 'e', 1)`) as virtual columns on an external table, or as extra columns during `COPY` via a transform. They make great clustering/pruning keys.
+- **The two IP-address columns load as `BINARY`.** Decode them with `inet_diag_msg_family` as in the [cheat sheet](#decoding-cheat-sheet) — or, if you'd rather not decode in SQL, point the daemon at the humanized CSV/JSON output instead (see [output formats](output-and-destinations.md)).
+
 ## The grain: one row per socket per poll
 
 xtcp2 polls every network namespace on a fixed interval (default 10s) and emits one row per TCP socket per poll. So:

From bfe6c7a4e469f63ac56d91cd6c681a5d52b81f29 Mon Sep 17 00:00:00 2001
From: "randomizedcoder dave.seddon.ca@gmail.com" <dave.seddon.ca@gmail.com>
Date: Sat, 20 Jun 2026 12:25:57 -0700
Subject: [PATCH 4/5] docs(parquet): state the exact column count (122, not
 ~120)

---
 docs/parquet-format.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/parquet-format.md b/docs/parquet-format.md
index ca01673..cef1237 100644
--- a/docs/parquet-format.md
+++ b/docs/parquet-format.md
@@ -68,7 +68,7 @@ df = dataset.to_table(columns=["timestamp_ns","hostname","tcp_info_rtt"]).to_pan
 -- partitions (host string, date string, hour string); project columns you need.
 ```
 
-**Always select only the columns you need** — there are ~120, and columnar pruning is where Parquet earns its keep. Likewise filter on `date`/`hour` for partition pruning.
+**Always select only the columns you need** — there are 122, and columnar pruning is where Parquet earns its keep. Likewise filter on `date`/`hour` for partition pruning.
 
 ## Loading into Snowflake
 
@@ -172,7 +172,7 @@ A few columns are stored as machine values for fidelity/size and need decoding f
 
 ## Full schema and column types
 
-The complete column list (~120) groups as follows; column names are the proto's snake_case names, identical to the ClickHouse table columns:
+The complete column list (122 columns) groups as follows; column names are the proto's snake_case names, identical to the ClickHouse table columns:
 
 - **Metadata** — `timestamp_ns` (double), `hostname`, `netns`, `nsid`, `label`, `tag`, `record_counter`, `socket_fd`, `netlinker_id`.
 - **`inet_diag_msg_*`** — the socket id/4-tuple, state, queues, uid/inode, ASN annotations.

From f41aebdad94d886f7f4045039983ebfafd0088ae Mon Sep 17 00:00:00 2001
From: "randomizedcoder dave.seddon.ca@gmail.com" <dave.seddon.ca@gmail.com>
Date: Sat, 20 Jun 2026 12:45:53 -0700
Subject: [PATCH 5/5] docs: add socket-analysis guide (RTT bands + clustering
 for data teams)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New docs/socket-analysis.md: a methodology guide for finding the natural RTT
bands statistically (min_rtt on a log scale; GMM+BIC for adaptive, drift-aware
bands; Jenks/KDE simple alternative; Snowflake quantile quick-win), with
labeling/validation against dest ASN/geo and per-DC/over-time tracking. Adds
multi-feature clustering (HDBSCAN) and other groupings (throughput, loss,
congestion algo, per-ASN, diurnal), a worked SQL→Python example, and a
pitfalls section (per-socket grain, cumulative counters, µs units, survivorship,
app-limited throughput, drift). Cross-linked from the docs hub and parquet doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/README.md          |   1 +
 docs/parquet-format.md  |   1 +
 docs/socket-analysis.md | 209 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 211 insertions(+)
 create mode 100644 docs/socket-analysis.md

diff --git a/docs/README.md b/docs/README.md
index 7f19015..1fd712e 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -60,6 +60,7 @@ See **[CONTRIBUTING.md](../CONTRIBUTING.md)** for the development environment, t
 
 - [Protobuf formats](protobuf-formats.md) — the config/data/ClickHouse schemas, generated code, and how to regenerate.
 - [Parquet format](parquet-format.md) — the S3/Parquet layout and schema, for data teams consuming the export.
+- [Socket analysis](socket-analysis.md) — finding RTT bands and other socket groupings by clustering (data-team methodology).
 - [Build flavors](build-flavors.md) — the build-variant × destination-flavor matrix.
 - [Integration testing](integration-testing.md) — the QEMU microVM test harness.
 - [Quality report](quality-report.md) — auto-generated linter/coverage status.
diff --git a/docs/parquet-format.md b/docs/parquet-format.md
index cef1237..40ccb1a 100644
--- a/docs/parquet-format.md
+++ b/docs/parquet-format.md
@@ -198,6 +198,7 @@ The Parquet columns are generated from the [`ParquetRow`](../pkg/xtcp/destinatio
 
 ## See also
 
+- [Socket analysis](socket-analysis.md) — finding RTT bands and other socket groupings by clustering, once the data is loaded.
 - [Protobuf formats](protobuf-formats.md) — the canonical schema and field semantics.
 - [Output formats & destinations](output-and-destinations.md) — the S3/Parquet destination and the alternative humanized CSV/JSON formats.
 - [Build flavors](build-flavors.md) — enabling the `s3parquet` destination.
diff --git a/docs/socket-analysis.md b/docs/socket-analysis.md
new file mode 100644
index 0000000..03a3e6e
--- /dev/null
+++ b/docs/socket-analysis.md
@@ -0,0 +1,209 @@
+# Analyzing socket data: RTT bands and clustering
+
+This guide is for a **data / analytics team** turning xtcp2's per-socket TCP telemetry into insight. It assumes you're comfortable with SQL and basic statistics / clustering. The headline use case is discovering the natural **RTT bands** in your fleet — and doing it *statistically*, because the bands differ by data center and **drift over time**, so hardcoded thresholds go stale. It also sketches other groupings (throughput, retransmission/loss, congestion algorithm, per-ASN, time-of-day).
+
+Read [parquet-format.md](parquet-format.md) first for how to load the data and what the columns mean; this doc is the analysis companion.
+
+## Table of contents
+
+- [The RTT-band mental model](#the-rtt-band-mental-model)
+- [Pick the right signal](#pick-the-right-signal)
+- [Data preparation](#data-preparation)
+- [Finding RTT bands](#finding-rtt-bands)
+- [Multi-feature clustering](#multi-feature-clustering)
+- [Other useful analyses](#other-useful-analyses)
+- [Worked example](#worked-example)
+- [Pitfalls](#pitfalls)
+- [See also](#see-also)
+
+## The RTT-band mental model
+
+Round-trip time is the strongest single signal for *where* a socket's peer is and *what kind* of path it's on. In a typical fleet you'd expect RTT to fall into a handful of bands, roughly:
+
+1. **Intra-data-center** — peers in the same DC; sub-millisecond to a few ms, usually high throughput.
+2. **Metro / same-region CDN edge** — Cloudflare, Fastly, Google, Akamai POPs in the same metro; ~10–30 ms.
+3. **Regional services** — another region or a more distant POP; ~60–120 ms.
+4. **Outliers** — cross-continent, satellite, or pathological paths; much higher.
+5. **Mobile / last-mile** — end users on 4G/LTE/Wi-Fi reaching your hosts; often >150 ms with high variance.
+
+Treat these as **hypotheses, not constants.** The actual band count, centroids, and boundaries vary by DC and move over time (new POPs, peering/routing changes, congestion). The goal is to let the data define the bands and to re-derive them on a schedule, comparing across DCs.
+
+## Pick the right signal
+
+Use **`tcp_info_min_rtt`** as the primary banding feature, not the smoothed `tcp_info_rtt` (srtt):
+
+- `min_rtt` is the *minimum* RTT the kernel has seen on the socket — it approximates the propagation/path floor and is largely free of transient queueing and load. That makes it a clean proxy for distance/path, which is exactly what bands are about.
+- `tcp_info_rtt` (srtt) is useful as a *current latency* feature and, together with `tcp_info_rtt_var`, as a **jitter** signal — but it inflates under load, so it's noisier for geography.
+
+**All RTT fields are microseconds** — divide by 1000 for milliseconds. RTT spans several orders of magnitude (0.1 ms intra-DC to 300 ms mobile), so **analyze it on a log scale**; the modes that correspond to bands are far clearer in `log10(min_rtt_ms)` than in linear space.
+
+## Data preparation
+
+This is the make-or-break step. Two issues dominate.
+
+**1. Fix the grain.** The export is one row per socket **per poll** (default every 10s), so a long-lived connection contributes many rows. If you cluster raw rows, long flows dominate and you're really clustering "poll-seconds," not sockets. **Aggregate to one feature vector per socket**, keyed by `inet_diag_msg_socket_cookie` + `hostname` + `netns` (the cookie is unique only within a host/namespace). Sensible aggregations: `MIN(tcp_info_min_rtt)`, median or last srtt, `MAX(tcp_info_delivery_rate)`, and the **last** value of cumulative counters.
+
+**2. Filter and normalize.**
+
+- Keep `inet_diag_msg_state = 1` (ESTABLISHED). Drop listeners (`10`), `TIME_WAIT`, etc.
+- Drop loopback / intra-host noise (source == destination, `127.0.0.0/8`, `::1`).
+- Require a minimum lifetime — e.g. ≥ N samples or ≥ some bytes — so ephemeral sockets with one noisy RTT sample don't pollute the bands (survivorship filtering).
+- Convert units: µs → ms for RTT; bytes/s → Mbit/s for throughput.
+- Counters (`tcp_info_bytes_*`, `tcp_info_total_retrans`, `tcp_info_segs_*`) are **cumulative over the socket lifetime** — use the last value per socket, or deltas between consecutive polls for rates.
+- Derive the **data-center** dimension. xtcp2 doesn't emit "DC" directly — derive it from your `hostname` convention, or set it explicitly with the daemon's `-label`/`-tag` (carried in the `label`/`tag` columns).
+
+The per-socket feature table (one row per socket) is the input to everything below. See the [worked example](#worked-example) for the SQL.
+
+## Finding RTT bands
+
+This is a one-dimensional clustering problem on `log(min_rtt)`.
+
+**Look before you cluster.** Plot a histogram and a kernel-density estimate (KDE) of `log10(min_rtt_ms)`. You'll usually *see* the modes (one bump per band) and the valleys between them. That sanity-checks everything that follows.
+
+**Method A — Gaussian Mixture Model + BIC (recommended, adaptive).** Fit a 1-D GMM to `log(min_rtt)` and let **BIC** choose the number of components. Each component is a band; the boundaries are where adjacent components cross over. This auto-discovers how many bands exist *today* — re-fit per time window (e.g. daily, per DC) and the band count/locations track drift automatically. This is the method to build the production pipeline on.
+
+**Method B — natural breaks (simple, explainable).** Jenks natural-breaks optimization (or finding the valleys/minima of the KDE) on `log(min_rtt)` gives defensible cut points that are easy to explain to stakeholders ("we split where the data has gaps"). Good for a first pass or when a GMM is overkill.
+
+**Method C — quantile bands in the warehouse (quick win).** `NTILE(n)` or `APPROX_PERCENTILE` in Snowflake gives instant coarse bands with zero data movement. Caveat: quantiles cut the data into *equal-sized* groups, which is **not** the same as finding natural modes — a quantile boundary can land in the middle of a real band. Use it for a fast look, not as the definition of a band.
+
+**Label and validate.** Name each band by its centroid RTT, then **confirm the physical story** with independent columns: `inet_diag_msg_socket_dest_asn`, an IP-geolocation join on the decoded `inet_diag_msg_socket_destination`, or `inet_diag_msg_socket_destination_port`. The lowest band should be mostly intra-DC peers, the next mostly CDN ASNs in your metro, and so on. If the physical story doesn't match the statistical band, investigate before trusting it. Keep the labels **derived from the centroids**, not hardcoded.
+
+**Track over time and per DC.** Persist each run's band centroids and boundaries keyed by (data center, day) as a time series. Now you can compare EU vs US vs AU, watch a band's RTT creep, and alert when a boundary jumps (a routing change, a new POP, or an outage).
+
+## Multi-feature clustering
+
+RTT is the headline, but sockets cluster on more than latency. Build a standardized feature vector per socket and cluster in several dimensions:
+
+- `log(min_rtt_ms)` — path/distance
+- `log(throughput_mbps)` — capacity / flow size
+- `retrans_rate` — loss (`bytes_retrans / bytes_sent`)
+- `rtt_var / min_rtt` — *relative* jitter (dimensionless)
+- `log(snd_cwnd)` — window the path sustains
+
+Standardize (z-score) after log-transforming the heavy-tailed features. Algorithm trade-offs:
+
+| Algorithm | Pros | Cons |
+|---|---|---|
+| **K-means** | Fast, simple baseline | Must pick K; assumes round, equal-size clusters |
+| **GMM** | Soft assignments; BIC picks K; elliptical clusters | Assumes Gaussian components |
+| **HDBSCAN** | No K; arbitrary shapes; **labels outliers as noise** | Sensitive to `min_cluster_size`; needs scaled features |
+
+**HDBSCAN is the recommended default** here — it doesn't need a predetermined cluster count and its built-in noise label naturally captures the "outliers" band (item 4 above) instead of forcing every socket into a group. Use PCA or UMAP to project to 2-D for a scatter plot colored by cluster. Validate with silhouette score (or BIC for GMM), **stability across time windows** (do the same clusters reappear tomorrow?), and **external agreement** — clusters should line up with `dest_asn`, `congestion_algorithm_string`, or DC.
+
+## Other useful analyses
+
+- **Throughput bands.** `log(tcp_info_delivery_rate)` is heavy-tailed; cluster it to separate "elephant" flows from "mice." Exclude `tcp_info_delivery_rate_app_limited = 1` rows when you want *path* capacity (those flows were limited by the application, not the network).
+- **Retransmission / loss bands.** `bytes_retrans / bytes_sent` (or `total_retrans / segs_out`) splits healthy (~0) from lossy paths. Cross-tab with the RTT band — high-RTT mobile paths often also show elevated loss.
+- **Congestion-algorithm comparison.** Group by `congestion_algorithm_string` (e.g. BBR vs CUBIC) and compare RTT/throughput/loss distributions for the same destination band.
+- **Per-ASN / per-CDN performance.** Aggregate by `inet_diag_msg_socket_dest_asn` to rank CDN edges or transit providers by latency and loss from each DC.
+- **Diurnal patterns.** Bucket by hour-of-day (`timestamp_ns`); mobile/last-mile RTT typically rises in the evening peak. Useful for capacity planning.
+- **Anomaly / drift detection.** Monitor band centroids over time; a sudden shift is a strong signal of a routing change or incident.
+
+## Worked example
+
+The snippets below are **illustrative — adapt names and thresholds to your environment.**
+
+**Step 1 — per-socket feature table (DuckDB; near-identical in Snowflake).** Aggregate the poll-grain rows to one row per socket.
+
+```sql
+WITH socket AS (
+  SELECT
+    hostname,
+    netns,
+    inet_diag_msg_socket_cookie                       AS cookie,
+    -- DC from a hostname convention like "iad1-web-07"; adapt the regex,
+    -- or use the daemon's -label which lands in the `label` column.
+    regexp_extract(hostname, '^([a-z]+[0-9]+)', 1)    AS dc,
+    inet_diag_msg_socket_dest_asn                     AS dest_asn,
+    MIN(tcp_info_min_rtt) / 1000.0                    AS min_rtt_ms,
+    MEDIAN(tcp_info_rtt)  / 1000.0                    AS srtt_ms,
+    MEDIAN(tcp_info_rtt_var) / 1000.0                 AS rtt_var_ms,
+    MAX(tcp_info_delivery_rate) * 8.0 / 1e6           AS mbps,
+    MAX(tcp_info_snd_cwnd)                            AS cwnd,
+    -- cumulative counters: last value ≈ MAX over the socket's life
+    MAX(tcp_info_bytes_sent)                          AS bytes_sent,
+    MAX(tcp_info_bytes_retrans)                       AS bytes_retrans,
+    ANY_VALUE(congestion_algorithm_string)            AS congestion,
+    COUNT(*)                                          AS samples
+  FROM read_parquet('s3://bucket/xtcp/**/*.parquet', hive_partitioning => true)
+  WHERE inet_diag_msg_state = 1                       -- ESTABLISHED only
+  GROUP BY 1,2,3,4,5
+)
+SELECT
+  *,
+  CASE WHEN bytes_sent > 0 THEN bytes_retrans::DOUBLE / bytes_sent ELSE 0 END AS retrans_rate
+FROM socket
+WHERE samples >= 3            -- survivorship: drop ephemeral sockets
+  AND min_rtt_ms > 0;
+```
+
+**Step 2 — RTT bands in Python (GMM + BIC).**
+
+```python
+import numpy as np, pandas as pd
+from sklearn.mixture import GaussianMixture
+
+feat = pd.read_parquet("socket_features.parquet")      # the table from step 1
+x = np.log10(feat["min_rtt_ms"].to_numpy()).reshape(-1, 1)
+
+# Let BIC choose the number of bands (1..8).
+models = {k: GaussianMixture(k, covariance_type="full", random_state=0).fit(x)
+          for k in range(1, 9)}
+k = min(models, key=lambda k: models[k].bic(x))
+gmm = models[k]
+feat["band"] = gmm.predict(x)
+
+# Order bands by RTT and summarize.
+centroids = (10 ** gmm.means_.ravel())
+order = np.argsort(centroids)
+print(f"{k} bands; centroids (ms):", np.round(centroids[order], 2))
+print(feat.groupby("band")["min_rtt_ms"].describe()[["count", "mean", "min", "max"]])
+# Plot log10(min_rtt_ms) histogram + the fitted components to eyeball the fit.
+```
+
+**Step 3 — multi-feature clusters (HDBSCAN).**
+
+```python
+import hdbscan
+from sklearn.preprocessing import StandardScaler
+
+cols = ["min_rtt_ms", "mbps", "rtt_var_ms", "cwnd"]
+F = feat[cols].copy()
+F[["min_rtt_ms", "mbps", "cwnd"]] = np.log10(F[["min_rtt_ms", "mbps", "cwnd"]].clip(lower=1e-6))
+F["rel_jitter"] = feat["rtt_var_ms"] / feat["min_rtt_ms"]
+Z = StandardScaler().fit_transform(F)
+
+labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(Z)  # -1 == outlier/noise
+feat["cluster"] = labels
+print(feat.groupby("cluster")[["min_rtt_ms", "mbps", "retrans_rate"]].median())
+```
+
+**Quick warehouse bands (Snowflake).** When you just need coarse buckets fast:
+
+```sql
+SELECT APPROX_PERCENTILE(tcp_info_min_rtt/1000.0, 0.5)  AS p50_ms,
+       APPROX_PERCENTILE(tcp_info_min_rtt/1000.0, 0.9)  AS p90_ms,
+       NTILE(5) OVER (ORDER BY tcp_info_min_rtt)        AS rtt_quintile
+FROM xtcp_flat_records
+WHERE inet_diag_msg_state = 1;
+```
+
+Snowflake users who want clustering in-warehouse can also use Snowflake ML / Cortex's k-means on the per-socket feature table instead of exporting to Python.
+
+## Pitfalls
+
+- **RTT is microseconds** — divide by 1000 for ms. Easy to be off by 1000×.
+- **`min_rtt` vs srtt** — band on `min_rtt`; srtt is inflated by load and queueing.
+- **Cluster per-socket, not per-poll-row** — otherwise long-lived flows dominate the model.
+- **Counters are cumulative** — use the last value per socket, or deltas for rates; don't sum across polls.
+- **Survivorship** — filter out very short / low-byte sockets; their RTT is a single noisy sample.
+- **App-limited throughput** — `tcp_info_delivery_rate_app_limited = 1` means the app, not the network, capped the rate; exclude for path-capacity work.
+- **Addresses are raw bytes** — decode with `inet_diag_msg_family` before any geo/ASN join (see the [parquet decoding cheat sheet](parquet-format.md#decoding-cheat-sheet)).
+- **Bands drift** — re-fit on a schedule; a model trained last quarter will mislabel today.
+- **Per-host clocks** — `timestamp_ns` is each host's wall clock; don't assume sub-second alignment across machines.
+
+## See also
+
+- [Parquet format](parquet-format.md) — how to load the data and what every column means.
+- [Protobuf formats](protobuf-formats.md) — the authoritative schema and field semantics.
+- [Output formats & destinations](output-and-destinations.md) — the export pipeline and the humanized formats.