Merged
52 changes: 52 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,58 @@ format changes.

## [Unreleased]

## [1.0.0] — 2026-05-13

First major-version cut. Implements the three one-time spec changes
from the protowire v1.0 freeze line (`STABILITY.md` in the spec
repo) in lockstep with `protowire`, `protowire-go`, `protowire-java`
(v1.0.1), `protowire-typescript`, and `protowire-kotlin`.
**Breaking** — there is no alias period; v1.0 is itself the major
bump.

### v1.0 spec changes

- **`@table` → `@dataset` rename** (draft §3.4.4). Public API
follows: `Ast::TableDirective` → `Ast::DatasetDirective`,
`Ast::TableRow` → `Ast::DatasetRow`, `TokenKind::kAtTable` →
`TokenKind::kAtDataset`, `Result::Tables()` → `Result::Datasets()`,
`Result::AddTable()` → `Result::AddDataset()`,
`class TableReader` → `class DatasetReader`. Headers
`protowire/pxf/table_reader.h` → `dataset_reader.h`; source
`src/pxf/table_reader.cc` → `dataset_reader.cc`. Hard cutover.

- **`@proto` directive added** (draft §3.4.5). New `Ast::ProtoDirective`
struct + `Ast::ProtoShape` enum (`kAnonymous`, `kNamed`, `kSource`,
`kDescriptor`). Four body shapes lexically distinguished
(anonymous `{ ... }`, named `<dotted-name> { ... }`,
source `"""..."""`, descriptor `b"..."`). Exposed via
`Document::protos` and `Result::Protos()`. Descriptor form is
the MUST-support shape per spec; this port supports all four.

- **Reserved directive names** expanded from 5 to 13 (draft §3.4.6).
`IsFutureReservedDirective(name)` exposed from
`protowire/pxf/schema.h`. Parser + fast decoder reject `@table`,
`@datasource`, `@view`, `@procedure`, `@function`,
`@permissions` as spec-reserved.
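
As a reading aid for the reserved-names bullet, here is a minimal sketch of such a check. The signature of the real `IsFutureReservedDirective` in `protowire/pxf/schema.h` is not shown in this changelog, so `IsReservedSketch` and `kRejectedNames` are hypothetical stand-ins, and the table holds only the six names listed in the bullet rather than the full set of 13.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical table: only the six names the changelog says are rejected.
// The full spec-reserved set (13 names) is not reproduced here.
inline constexpr std::array<std::string_view, 6> kRejectedNames = {
    "table", "datasource", "view", "procedure", "function", "permissions"};

// Returns true when `name` (without the leading '@') is in the table above.
// Stand-in for illustration; not the library's IsFutureReservedDirective.
constexpr bool IsReservedSketch(std::string_view name) {
  for (std::size_t i = 0; i < kRejectedNames.size(); ++i) {
    if (kRejectedNames[i] == name) return true;
  }
  return false;
}
```

A parser following this pattern would call the check right after lexing `@<name>` and report an error instead of building a generic directive when it returns true.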

`@dataset`'s row message type is now optional in the AST — binding
to an anonymous `@proto` per draft §3.4.4 Anonymous binding.

`Lexer::RepositionTo(int)` added so the parser can skip past an
`@proto` brace-body whose interior is protobuf source rather than
PXF.
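
The section above describes the four `@proto` body shapes as lexically distinguished. The following sketch shows one way a parser could tell them apart from the first characters after `@proto`. It is an assumption about the approach, not the port's parser: `ShapeSketch` mirrors the enumerators named in the changelog, and `ClassifyProtoBody` is a hypothetical helper.

```cpp
#include <cassert>
#include <cctype>
#include <optional>
#include <string_view>

// Mirrors the four enumerators listed in the changelog (illustrative copy).
enum class ShapeSketch { kAnonymous, kNamed, kSource, kDescriptor };

// Classifies the text following `@proto` by its first significant characters.
// Returns nullopt when no recognizable body start follows.
inline std::optional<ShapeSketch> ClassifyProtoBody(std::string_view rest) {
  // Skip whitespace between `@proto` and the body.
  while (!rest.empty() &&
         std::isspace(static_cast<unsigned char>(rest.front()))) {
    rest.remove_prefix(1);
  }
  if (rest.empty()) return std::nullopt;
  if (rest.front() == '{') return ShapeSketch::kAnonymous;
  if (rest.substr(0, 3) == "\"\"\"") return ShapeSketch::kSource;
  // Checked before the identifier case, because 'b' is also a letter.
  if (rest.substr(0, 2) == "b\"") return ShapeSketch::kDescriptor;
  const unsigned char c = static_cast<unsigned char>(rest.front());
  if (std::isalpha(c) || c == '_') return ShapeSketch::kNamed;
  return std::nullopt;
}
```

The sketch only looks at the start of the body. A real parser would still have to confirm that a named shape is followed by `{` and that each body is properly terminated.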

### Build

- CMake `project(protowire VERSION ...)` bumped `0.75.0` → `1.0.0`.

### Tests

- New `test/pxf_proto_directive_test.cc` with 11 cases covering all
four `@proto` body shapes, anonymous binding, multi-`@proto`,
nested-brace bodies, reserved-name rejection, `@type` coexistence.
- `ctest`: 229 tests, 0 failures.

## [0.75.0] — 2026-05-12

First release after the v0.70.0 baseline that closes the v0.72–v0.75
4 changes: 2 additions & 2 deletions CMakeLists.txt
@@ -1,5 +1,5 @@
 cmake_minimum_required(VERSION 3.20)
-project(protowire VERSION 0.75.0 LANGUAGES CXX)
+project(protowire VERSION 1.0.0 LANGUAGES CXX)

 set(CMAKE_CXX_STANDARD 20)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
@@ -83,7 +83,7 @@ _pwx_existing_sources(_pxf_srcs
   src/pxf/lexer.cc src/pxf/ast.cc src/pxf/parser.cc src/pxf/decode_fast.cc
   src/pxf/decode.cc src/pxf/encode.cc src/pxf/format.cc src/pxf/wellknown.cc
   src/pxf/annotations.cc src/pxf/result.cc src/pxf/options.cc src/pxf/schema.cc
-  src/pxf/table_reader.cc)
+  src/pxf/dataset_reader.cc)
 _pwx_existing_sources(_sbe_srcs
   src/sbe/annotations.cc src/sbe/template.cc src/sbe/codec.cc
   src/sbe/marshal.cc src/sbe/unmarshal.cc src/sbe/view.cc
4 changes: 2 additions & 2 deletions include/protowire/pxf.h
@@ -23,11 +23,11 @@
 #include <google/protobuf/message.h>

 #include "protowire/detail/status.h"
+#include "protowire/pxf/dataset_reader.h" // DatasetReader, BindRow
 #include "protowire/pxf/options.h"
 #include "protowire/pxf/parser.h" // Document, Parse
 #include "protowire/pxf/result.h"
-#include "protowire/pxf/schema.h" // ValidateDescriptor, Violation
-#include "protowire/pxf/table_reader.h" // TableReader, BindRow
+#include "protowire/pxf/schema.h" // ValidateDescriptor, Violation

 namespace protowire::pxf {

82 changes: 64 additions & 18 deletions include/protowire/pxf/ast.h
@@ -132,8 +132,9 @@ struct BlockVal {
 // entry. The canonical use is side-channel metadata that sits alongside
 // the schema-typed body — e.g. chameleon's
 // `@header chameleon.v1.LayerHeader { id = "x" }` — but the grammar is
-// open-ended: any name except `type` / `table` is parsed as a generic
-// Directive. Prefix identifiers are positional and per-directive.
+// open-ended: any name not in the spec-reserved set (draft §3.4.6) is
+// parsed as a generic Directive. Prefix identifiers are positional
+// and per-directive.
 //
 // Specific registrations:
 // - One prefix (v0.72.0 conventional shape) — names the inner block's
@@ -146,7 +147,7 @@ struct BlockVal {
 // (both exclusive) — empty when the directive has no inline block.
 struct Directive {
   Position pos;
-  std::string name; // e.g. "header"; never "type" / "table"
+  std::string name; // e.g. "header"; never a spec-reserved name (§3.4.6)
   std::vector<std::string> prefixes; // identifiers between @<name> and the optional `{ ... }`
   // Back-compat: when exactly one prefix identifier was supplied, `type`
   // holds it (matching v0.72.0's single-Type shape). Empty otherwise.
@@ -156,36 +157,81 @@ struct Directive {
   std::vector<Comment> leading_comments;
 };

-// TableRow is one parenthesized cell tuple in a `@table` directive.
-// `cells` is the same length as the containing TableDirective.columns.
+// DatasetRow is one parenthesized cell tuple in a `@dataset` directive.
+// `cells` is the same length as the containing DatasetDirective.columns.
 // A `std::nullopt` cell denotes an absent field (the "empty cell"
 // between two commas); a non-empty optional holding a `NullVal` denotes
 // a present-but-null field; any other Value denotes a present field.
-struct TableRow {
+struct DatasetRow {
   Position pos;
   std::vector<std::optional<ValuePtr>> cells;
 };

-// TableDirective is a `@table <type> ( col1, col2, ... ) row*` entry at
-// document root (draft §3.4.4). It carries many instances of one
-// message type in a single document — the protowire-native CSV.
+// DatasetDirective is a `@dataset <type> ( col1, col2, ... ) row*` entry
+// at document root (draft §3.4.4). It carries many instances of one
+// message type in a single document — the protowire-native CSV
+// replacement.
 //
-// Per draft §3.4.4, a document with any TableDirective MUST NOT have a
-// @type directive or any top-level field entries: the @table header IS
-// the document's type declaration. Decoders enforce this in Parse.
-struct TableDirective {
+// Per draft §3.4.4, a document with any DatasetDirective MUST NOT have
+// a @type directive or any top-level field entries: the @dataset header
+// IS the document's type declaration. Decoders enforce this in Parse.
+//
+// `type` MAY be empty when an anonymous `@proto` directive (§3.4.5)
+// precedes the dataset in document order; the anonymous schema is
+// consumed as the row message type.
+struct DatasetDirective {
   Position pos;
   std::string type; // row message type, e.g. "trades.v1.Trade"
   std::vector<std::string> columns; // top-level field names on `type`; len >= 1
-  std::vector<TableRow> rows; // zero or more rows
+  std::vector<DatasetRow> rows; // zero or more rows
   std::vector<Comment> leading_comments;
 };

+// ProtoShape distinguishes the four body shapes of a @proto directive
+// (draft §3.4.5).
+enum class ProtoShape : uint8_t {
+  // `@proto { <message-body> }` — defines an unnamed message used by
+  // the next typed directive in document order.
+  kAnonymous = 0,
+  // `@proto <dotted-name> { <message-body> }` — sugar for a single named
+  // message; `type_name` carries the dotted name.
+  kNamed,
+  // `@proto """<proto-source>"""` — complete .proto source file.
+  kSource,
+  // `@proto b"<base64-FileDescriptorSet>"` — base64-encoded
+  // google.protobuf.FileDescriptorSet bytes.
+  kDescriptor,
+};
+
+const char* ProtoShapeName(ProtoShape s);
+
+// ProtoDirective is a `@proto <body>` entry at document root
+// (draft §3.4.5). It carries an embedded protobuf schema, making the
+// PXF document self-describing.
+//
+// `body` carries raw bytes per shape:
+// - kAnonymous, kNamed: bytes between the opening `{` and matching
+// `}` (both exclusive). The bytes are protobuf message-body source.
+// - kSource: contents of the triple-quoted string (with leading-LF
+// stripping and common-prefix dedent applied). The bytes are a
+// complete .proto source file.
+// - kDescriptor: base64-decoded bytes of the bytes literal. The
+// bytes are a serialised google.protobuf.FileDescriptorSet.
+//
+// `type_name` is non-empty only when `shape == kNamed`.
+struct ProtoDirective {
+  Position pos;
+  ProtoShape shape = ProtoShape::kAnonymous;
+  std::string type_name;
+  std::string body;
+  std::vector<Comment> leading_comments;
+};
+
 struct Document {
-  std::string type_url; // empty if no @type directive
-  std::vector<Directive>
-      directives; // @<name> *(prefix) [{ ... }] entries in source order; excludes @type and @table
-  std::vector<TableDirective> tables; // @table directives in source order
+  std::string type_url; // empty if no @type directive
+  std::vector<Directive> directives; // @<name> directives in source order; excludes spec-defined
+  std::vector<DatasetDirective> datasets; // @dataset directives in source order (draft §3.4.4)
+  std::vector<ProtoDirective> protos; // @proto directives in source order (draft §3.4.5)
   int body_offset =
       0; // byte offset where the schema-typed body begins (after all leading directives)
   std::vector<EntryPtr> entries;
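
The `DatasetRow` comment in this hunk distinguishes three cell states: absent (`std::nullopt`), present-but-null (an optional holding a `NullVal`), and present with a value. A minimal sketch of that tri-state follows. The real `ValuePtr` and `NullVal` types are not shown in this diff, so `NullSketch`, `CellValue`, `Cell`, `CellState`, and `ClassifyCell` are hypothetical stand-ins built on `std::variant`.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <variant>

struct NullSketch {};  // stands in for the AST's NullVal (assumption)

// Stands in for ValuePtr: either an explicit null or a string payload.
using CellValue = std::variant<NullSketch, std::string>;

// One cell of a row: nullopt means the cell was left empty between commas.
using Cell = std::optional<CellValue>;

enum class CellState { kAbsent, kNull, kValue };

// Maps a cell onto the three states described in the DatasetRow comment.
inline CellState ClassifyCell(const Cell& cell) {
  if (!cell.has_value()) return CellState::kAbsent;
  if (std::holds_alternative<NullSketch>(*cell)) return CellState::kNull;
  return CellState::kValue;
}
```

Keeping "absent" outside the value type is what lets a binder leave a field untouched for an empty cell while still clearing it for an explicit null.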
@@ -1,13 +1,13 @@
 // SPDX-License-Identifier: MIT
 // Copyright (c) 2026 TrendVidia, LLC.
 //
-// Streaming consumption for the `@table` directive (draft §3.4.4).
+// Streaming consumption for the `@dataset` directive (draft §3.4.4).
 //
-// `UnmarshalFull` materializes an entire `@table` directive — every row
-// — into `Result::Tables()`. That works for small datasets and breaks
-// for the CSV-replacement workload `@table` was designed to serve.
-// `TableReader` provides the streaming alternative: it pulls bytes from
-// a `std::istream` on demand and yields one `TableRow` per `Next()`
+// `UnmarshalFull` materializes an entire `@dataset` directive — every row
+// — into `Result::Datasets()`. That works for small datasets and breaks
+// for the CSV-replacement workload `@dataset` was designed to serve.
+// `DatasetReader` provides the streaming alternative: it pulls bytes from
+// a `std::istream` on demand and yields one `DatasetRow` per `Next()`
 // call, with working-set memory bounded by the size of the largest
 // single row.
 //
@@ -20,7 +20,7 @@
 //
 // Convenience: `Scan(msg)` reads the next row and binds its cells to
 // `msg`'s fields by column name; `BindRow` is exported for callers that
-// iterate the materializing path's `Result::Tables()[i].rows`.
+// iterate the materializing path's `Result::Datasets()[i].rows`.

 #pragma once

@@ -32,46 +32,46 @@
 #include <google/protobuf/message.h>

 #include "protowire/detail/status.h"
-#include "protowire/pxf/ast.h" // Directive, TableRow
+#include "protowire/pxf/ast.h" // Directive, DatasetRow

 namespace protowire::pxf {

-// Default cap on the @table header (leading directives plus the
-// `@table TYPE ( cols )` declaration). Real headers are tiny — a few
+// Default cap on the @dataset header (leading directives plus the
+// `@dataset TYPE ( cols )` declaration). Real headers are tiny — a few
 // hundred bytes at most. The cap exists to fail-fast on misuse: a
-// TableReader pointed at a multi-gigabyte document with no `@table`
+// DatasetReader pointed at a multi-gigabyte document with no `@dataset`
 // directive shouldn't OOM trying to find one.
 constexpr int kDefaultHeaderMaxBytes = 64 * 1024;

-// Streaming row reader for a single `@table` directive.
+// Streaming row reader for a single `@dataset` directive.
 //
-// A TableReader is positioned at the first row after `Create()`
+// A DatasetReader is positioned at the first row after `Create()`
 // returns. Call `Next(&row)` in a loop until `Done()` returns true;
 // the table's row sequence is exhausted at that point. Any parse or
 // I/O error makes the reader sticky: subsequent `Next` / `Scan` calls
 // return the same Status.
 //
-// For documents containing multiple `@table` directives, call
+// For documents containing multiple `@dataset` directives, call
 // `Create()` again on `tr->Tail()` to read the next table.
 //
-// A TableReader is NOT safe for concurrent use.
-class TableReader {
+// A DatasetReader is NOT safe for concurrent use.
+class DatasetReader {
  public:
-  // Construct a TableReader and consume the leading directives and the
-  // `@table TYPE ( cols )` header. `src` must outlive the reader.
-  // Returns a non-OK Status if the input ends before any `@table`
-  // directive is seen (the message contains "no @table directive in
+  // Construct a DatasetReader and consume the leading directives and the
+  // `@dataset TYPE ( cols )` header. `src` must outlive the reader.
+  // Returns a non-OK Status if the input ends before any `@dataset`
+  // directive is seen (the message contains "no @dataset directive in
   // stream") or on a parse / I/O error.
-  static StatusOr<std::unique_ptr<TableReader>> Create(std::istream* src);
+  static StatusOr<std::unique_ptr<DatasetReader>> Create(std::istream* src);

-  // Row message type declared by the @table header (e.g. "trades.v1.Trade").
+  // Row message type declared by the @dataset header (e.g. "trades.v1.Trade").
   const std::string& Type() const { return type_; }

-  // Column field names declared by the @table header, in source order.
+  // Column field names declared by the @dataset header, in source order.
   const std::vector<std::string>& Columns() const { return columns_; }

   // Side-channel directives (`@<name>` / `@entry` / etc., NOT `@type`
-  // or `@table`) that appeared before the `@table` header. Stable for
+  // or `@dataset`) that appeared before the `@dataset` header. Stable for
   // the lifetime of the reader.
   const std::vector<Directive>& Directives() const { return directives_; }

@@ -82,7 +82,7 @@ class TableReader {
   //
   // After EOF or error, all subsequent calls return the same sticky
   // result.
-  Status Next(TableRow* out);
+  Status Next(DatasetRow* out);

   // Reads the next row and binds its cells to fields of `msg` by column
   // name. Equivalent to `Next` + `BindRow`. At EOF, returns OK and sets
@@ -98,15 +98,15 @@ class TableReader {
   // Returns a fresh istream-derived source that yields the bytes the
   // reader buffered but didn't consume, followed by the remaining
   // bytes from the underlying source. Use to chain a second
-  // `Create()` for documents with multiple `@table` directives.
+  // `Create()` for documents with multiple `@dataset` directives.
   //
   // MUST only be called after `Next` has reported `Done()`. Calling
   // earlier returns bytes the current reader still intends to consume,
   // which will desync the next reader.
   std::unique_ptr<std::istream> Tail();

  private:
-  TableReader() = default;
+  DatasetReader() = default;

   Status ReadHeader();
   Status Pull(size_t n);
@@ -133,10 +133,10 @@ class TableReader {
 // wrapper / oneof; rejects on non-nullable scalars).
 // - any other Value — field set to that value.
 //
-// Exported so callers iterating `Result::Tables()[i].rows` can reuse
+// Exported so callers iterating `Result::Datasets()[i].rows` can reuse
 // the same logic.
 Status BindRow(google::protobuf::Message* msg,
                const std::vector<std::string>& columns,
-               const TableRow& row);
+               const DatasetRow& row);

 } // namespace protowire::pxf
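
The header comment in this file explains that the 64 KiB cap exists so a reader pointed at a huge document with no `@dataset` directive fails fast instead of buffering everything. A minimal sketch of that idea follows. It is not the library's `ReadHeader`: `FindDatasetHeader` and `kHeaderCapSketch` are hypothetical, and the plain substring search for `@dataset` is a simplification that a real lexer-driven reader would not use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Same value as kDefaultHeaderMaxBytes in the diff; the name is illustrative.
constexpr std::size_t kHeaderCapSketch = 64 * 1024;

// Pulls bytes from `src` in small chunks until "@dataset" shows up in the
// buffered prefix. Returns false if the stream ends, or if `max_bytes` have
// been buffered, before the marker is seen. Never buffers more than max_bytes.
inline bool FindDatasetHeader(std::istream& src, std::size_t max_bytes,
                              std::string* buffered) {
  assert(buffered != nullptr);
  static const std::string kMarker = "@dataset";
  char chunk[256];
  while (buffered->size() < max_bytes) {
    const std::size_t want =
        std::min(sizeof(chunk), max_bytes - buffered->size());
    src.read(chunk, static_cast<std::streamsize>(want));
    const std::streamsize got = src.gcount();
    if (got <= 0) return false;  // EOF or I/O error before any header
    const std::size_t old_size = buffered->size();
    buffered->append(chunk, static_cast<std::size_t>(got));
    // Back up far enough to catch a marker split across two chunks.
    const std::size_t from =
        old_size >= kMarker.size() ? old_size - kMarker.size() + 1 : 0;
    if (buffered->find(kMarker, from) != std::string::npos) return true;
  }
  return false;  // cap reached without finding a header
}
```

Because the loop never requests more than `max_bytes - buffered->size()` bytes, memory use stays bounded by the cap regardless of how large the underlying stream is.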
21 changes: 21 additions & 0 deletions include/protowire/pxf/lexer.h
@@ -23,6 +23,27 @@ class Lexer {
   // between '{' and '}' once the matching brace has been located.
   std::string_view Input() const { return input_; }

+  // Reposition the lexer to byte offset `target`, recomputing line/col
+  // by scanning forward from the current position. Used by parseProto-
+  // Directive to skip past an @proto brace-body whose interior is
+  // protobuf source (not PXF) without lexing through it.
+  void RepositionTo(int target) {
+    if (target < static_cast<int>(pos_)) {
+      pos_ = 0;
+      line_ = 1;
+      column_ = 1;
+    }
+    while (static_cast<int>(pos_) < target && pos_ < input_.size()) {
+      uint8_t ch = static_cast<uint8_t>(input_[pos_++]);
+      if (ch == '\n') {
+        ++line_;
+        column_ = 1;
+      } else {
+        ++column_;
+      }
+    }
+  }
+
  private:
   uint8_t Peek(size_t offset = 0) const {
     size_t i = pos_ + offset;
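
`RepositionTo` needs a target offset, which the parser presumably obtains by locating the `}` that closes the `@proto` body. A minimal sketch of that step, together with the same line/column counting rule, is shown below. `SkipBraceBody`, `LineCol`, and `LineColAt` are hypothetical helpers, and the brace counter deliberately ignores braces inside protobuf string literals or comments, which the real parser may treat differently.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Returns the offset one past the '}' matching the '{' at `open_pos`,
// or nullopt if `open_pos` is not a '{' or the braces never balance.
inline std::optional<std::size_t> SkipBraceBody(std::string_view input,
                                                std::size_t open_pos) {
  if (open_pos >= input.size() || input[open_pos] != '{') return std::nullopt;
  int depth = 0;
  for (std::size_t i = open_pos; i < input.size(); ++i) {
    if (input[i] == '{') {
      ++depth;
    } else if (input[i] == '}') {
      if (--depth == 0) return i + 1;
    }
  }
  return std::nullopt;
}

struct LineCol {
  int line = 1;
  int column = 1;
};

// Same counting rule as RepositionTo in the diff: a newline starts a new
// line at column 1, any other byte advances the column by one.
inline LineCol LineColAt(std::string_view input, std::size_t target) {
  LineCol lc;
  for (std::size_t i = 0; i < target && i < input.size(); ++i) {
    if (input[i] == '\n') {
      ++lc.line;
      lc.column = 1;
    } else {
      ++lc.column;
    }
  }
  return lc;
}
```

Counting depth rather than searching for the first `}` is what makes nested message bodies, one of the cases the new tests cover, skip correctly.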