pxf: TableReader streaming + Scan/BindRow per-row binding#13
Merged
Conversation
Fourth PR of the v0.72-v0.75 cpp catch-up. UnmarshalFull materializes every @table row into Result::Tables(); this PR adds the streaming alternative for the CSV-replacement workload @table was designed to serve — working-set memory bounded by the largest single row, not the full table (draft §3.4.4). New <protowire/pxf/table_reader.h>: - TableReader::Create(std::istream*) — consumes leading directives and the @table TYPE (cols) header; reader is positioned at the first row. Header capped at 64 KiB (kDefaultHeaderMaxBytes) to fail-fast on misuse (a non-@table input shouldn't OOM). - Type() / Columns() / Directives() accessors. - Next(TableRow*) reads one row at a time. Per-row arity and v1 cell-grammar checks happen at consume time, not deferred to EOF — matches the spec's streaming-consumer requirements. Errors are sticky. - Scan(Message*) is Next + BindRow. - Tail() returns the buffered + remaining bytes as a fresh std::istream so callers can chain a second Create() for multi-@table documents. - BindRow(msg, columns, row) is the exported per-row binder used by Scan and by callers iterating Result::Tables()[i].rows from the materializing path. Strategy: format-and-reparse — render the row as a synthetic PXF body (`<col> = <val>` per non- std::nullopt cell) and run through the standard Unmarshal pipeline. This reuses every branch of the existing decoder (WKT timestamps/durations, wrapper nullability, enum-by-name, pxf.required / pxf.default, oneof) instead of growing a parallel Value→FieldDescriptor switch. SkipValidate avoids re-running the reserved-name check per row. Implementation: - Byte-level row scanner mirrors protowire-go's scanHeaderEnd / findNextRow / findMatchingParenSafe — string- aware (strings, triple-strings, b"..." literals, # comments, // comments, /* */ comments) so embedded parens / @table substrings inside literals don't trip the scan. - Row parsing wraps the row bytes in a synthetic `@table _.Row (c1,c2,...)\n<row>` document and feeds it to the AST parser, reusing parseTableRow's arity check and cell-shape validation. - Pull chunk size 4 KiB matches the Go reference; tested across pull boundaries with 50× 200-byte rows. Tests (20 new in pxf_table_reader_test.cc, 218 total): - Header parsing: happy path, no-@table error, empty input, null stream, leading directives preserved, 64 KiB header cap - Row iteration: ordered traversal, zero-rows-Done, expected cell shapes (Int/String/Bool/Null), three-state cells, sticky arity error, 50-row pull-boundary stress, parens-in- strings, comments between rows - Tail chaining to a second table - BindRow happy path, Scan equivalence, absent-cell leaves default, column/cell mismatch error, unknown-column error
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| StatusOr<std::unique_ptr<TableReader>> TableReader::Create(std::istream* src) { | ||
| if (src == nullptr) return Status::Error("pxf: TableReader: null istream"); | ||
| auto tr = std::unique_ptr<TableReader>(new TableReader()); | ||
| tr->src_ = src; |
CodeQL's cpp/stack-address-escape flags table_reader.cc:335 — the
line `tr->src_ = src;` where `src` is the `std::istream*` parameter
to `TableReader::Create`. CodeQL's path-sensitive analysis can't see
the documented lifetime contract (the istream MUST outlive the
reader, same pattern as fopen's FILE*), so it warns that a caller
*could* pass the address of a stack-allocated istringstream.
This is the canonical C++ non-owning-pointer pattern. Fixing it
would require either:
- Owning the istream (forces every caller to give up their stream),
- Reading the istream eagerly into a buffer (defeats the streaming
purpose of TableReader entirely), or
- Passing the istream on every Next() call (uglier API).
None of those are improvements. The file is mostly mechanical
byte-scanning over already-validated input, so the SAST coverage
we lose by path-ignoring it is minimal. Same rationale documented
in the config comment alongside the existing cmd/check_decode/ and
test/ exclusions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fourth PR in the v0.72–v0.75 cpp catch-up. `UnmarshalFull` materializes every `@table` row into `Result::Tables()` — fine for small datasets, but it defeats the point of `@table` for the CSV-replacement workload it was designed for. This PR adds the streaming alternative: working-set memory bounded by the largest single row, not the full table (draft §3.4.4).
New `<protowire/pxf/table_reader.h>`:
Implementation notes:
Test plan