docs: Parquet format reference for data teams #41
Open
randomizedcoder wants to merge 2 commits into
New docs/parquet-format.md explains the S3/Parquet export for an enterprise data/analytics audience consuming xtcp2's TCP telemetry:

- Hive partition layout (host=/date=/hour=, UTC) and object naming
- file size/cadence (~63 MiB uncompressed soft cap) and per-column compression (ZSTD strings/bytes, SNAPPY numerics)
- how to read it (DuckDB/pandas/Trino) with partition pruning
- the grain (one row per socket per poll; cumulative counters; socket cookie)
- a 'start here' set of the key TCP columns with units (rtt µs, cwnd packets, delivery_rate bytes/s, total_retrans, byte counters, congestion algo)
- decoding cheat sheet (raw-byte IPs via family, TCP state map, enums, ts)
- full schema grouping + types, proto3 no-null gotchas, and where the schema is defined (ParquetRow + drift test).

Cross-linked from the docs hub and output-and-destinations (S3 section).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
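The Hive partition layout described above can be illustrated with a small sketch that pulls the `host`, `date`, and `hour` values out of an object key. The example key, the file name, and the `YYYY-MM-DD` date format are assumptions made for illustration; only the `host=/date=/hour=` ordering and the UTC convention come from the commit message.

```python
from datetime import datetime, timezone


def parse_partition(key: str) -> dict:
    """Collect every name=value path segment of an object key into a dict."""
    parts = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep:  # only segments that contain '=' are partition columns
            parts[name] = value
    return parts


def partition_start(parts: dict) -> datetime:
    """Return the start of the partition's hour as a timezone-aware UTC datetime.

    Assumes date is formatted YYYY-MM-DD and hour is 0-23 (hypothetical formats).
    """
    stamp = f"{parts['date']} {parts['hour']}"
    return datetime.strptime(stamp, "%Y-%m-%d %H").replace(tzinfo=timezone.utc)


# Hypothetical key; the real object naming is defined in docs/parquet-format.md.
example_key = "exports/host=web01/date=2024-05-01/hour=13/part-0001.parquet"
example_parts = parse_partition(example_key)
example_start = partition_start(example_parts)
```

Query engines such as DuckDB, pandas, and Trino typically expose these path segments as ordinary columns, so filtering on them lets the engine skip whole prefixes rather than scanning every file.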
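Adds `docs/parquet-format.md`, a consumer-facing guide to the S3/Parquet export, written for an enterprise data/analytics team that has only a basic grasp of TCP.

What it covers

- Hive partition layout `host=/date=/hour=` (UTC), object naming, how engines expose the partitions.
- File size/cadence (~63 MiB uncompressed soft cap, `-s3ParquetFlushBytes`), per-column ZSTD (strings/bytes) + SNAPPY (numerics).
- How to read it (DuckDB/pandas/Trino) with partition pruning.
- The grain: one row per socket per poll, cumulative counters, and the socket cookie `inet_diag_msg_socket_cookie`.
- A "start here" set of the key TCP columns with units (rtt µs, `snd_cwnd` packets, `delivery_rate` bytes/s, `total_retrans`, byte counters, congestion algorithm) so the team knows where to focus first.
- Decoding cheat sheet: raw-byte IPs via `inet_diag_msg_family`, the TCP state integer→name map, congestion enum, `timestamp_ns`.
- Full schema grouping + types, proto3 no-null gotchas, and where the schema is defined (`ParquetRow` + the drift test that keeps Parquet/proto/ClickHouse in lockstep).

Cross-linked from the docs hub and `output-and-destinations.md` (S3 section).

Notes

- … `ParquetRow` (`destinations_s3parquet_schema.go`), the `objectKey` layout, and the 63 MiB flush cap.
- … (`../pkg/…`, `../proto/…`, sibling docs) resolve; no broken intra-doc anchors.

🤖 Generated with Claude Code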
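As a rough illustration of the decoding cheat sheet, the sketch below turns a raw-byte address plus its family into a printable IP, and maps a TCP state integer to a name. It assumes the family column carries the Linux `AF_INET` (2) and `AF_INET6` (10) values, that an IPv4 address may arrive either as 4 bytes or zero-padded to 16, and that state integers follow the Linux kernel numbering. The function names are hypothetical; `docs/parquet-format.md` remains the authority on the actual encoding.

```python
import ipaddress

# Linux address-family constants (assumed to be what the family column stores).
AF_INET = 2
AF_INET6 = 10

# TCP state numbering as used by the Linux kernel.
TCP_STATES = {
    1: "ESTABLISHED", 2: "SYN_SENT", 3: "SYN_RECV", 4: "FIN_WAIT1",
    5: "FIN_WAIT2", 6: "TIME_WAIT", 7: "CLOSE", 8: "CLOSE_WAIT",
    9: "LAST_ACK", 10: "LISTEN", 11: "CLOSING", 12: "NEW_SYN_RECV",
}


def decode_ip(raw: bytes, family: int) -> str:
    """Render a raw-byte address as text, using the family to pick the width."""
    if family == AF_INET:
        # Only the first 4 bytes are meaningful, even if padded to 16.
        return str(ipaddress.IPv4Address(raw[:4]))
    if family == AF_INET6:
        return str(ipaddress.IPv6Address(raw[:16]))
    raise ValueError(f"unsupported address family: {family}")


def decode_state(state: int) -> str:
    """Map a TCP state integer to its name, keeping unknown values visible."""
    return TCP_STATES.get(state, f"UNKNOWN({state})")
```

Keeping unknown states as `UNKNOWN(n)` rather than raising is a deliberate choice for analytics use: a single unexpected value should not abort a scan over a whole partition.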