beava-dev · dosubot · May 14, 2026
@@ -0,0 +1,277 @@
+# Pipeline DSL Expressions (`bv.col`)
+
+> **Status:** Authoritative for v0. Documents the **post-13.5 target** Python
+> expression DSL. The current `python/beava/_col.py` already implements every
+> surface in this doc — Phase 13.5 polishes naming + cross-language parity.
+> **Last reviewed:** 2026-05-03 (Phase 13.0).
+
+## Overview
+
+`bv.col(name)` constructs a **column-reference expression** — a leaf in an
+operator-overloaded AST. Composing it with arithmetic, comparison, boolean,
+and method calls yields more nodes; calling `.to_expr_string()` (the SDK
+does this implicitly at register time) emits a canonical parenthesised
+string the server's expression evaluator parses back into a predicate.
+
+The grammar is **locked** — the server-side parser depends on the canonical
+shape; SDK ports MUST produce the same string for the same Python source.
+
+## Grammar (canonical)
+
+```
+expr      := field | literal | bin_op | unary_op | call
+field     := identifier | identifier "." identifier         # e.g. x, Stream.x
+literal   := number | "'" string "'" | "true" | "false" | "null"
+bin_op    := "(" expr <space> op <space> expr ")"           # EVERY binary op is parenthesized
+op        := "+" | "-" | "*" | "/" | ">" | ">=" | "<" | "<=" | "==" | "!=" | "and" | "or"
+unary_op  := "(" "not" <space> expr ")"
+call      := ident "(" expr ("," <space> expr)* ")"
+```
+
+String literal escaping: `\\` becomes `\\\\`; `'` becomes `\\'`. This is the
+single mitigation point for predicate-string injection from user-supplied
+strings (T-03-02-01).
+
+## `bv.col(name)` — column reference
+
+```python
+import beava as bv
+
+amount = bv.col("amount")            # Field('amount')
+amount.to_expr_string()              # "amount"
+
+# Qualified field reference (used when ops cross multiple upstream sources):
+foreign = bv.col("Txn.amount")
+foreign.to_expr_string()             # "Txn.amount"
+```
+
+Args:
+
+- `name` — non-empty string. `TypeError` if absent / empty.
+
+Returns: an `_ExprAST` leaf node, ready to compose with operators.
+
+## Arithmetic: `+ - * /`
+
+The four binary arithmetic operators are overloaded on `_ExprAST` and accept
+either another expression or a Python scalar (auto-wrapped via `bv.lit(...)`):
+
+```python
+(bv.col("a") + bv.col("b")).to_expr_string()        # "(a + b)"
+(bv.col("amount") * 2).to_expr_string()             # "(amount * 2)"
+(bv.col("amount") / 100).to_expr_string()           # "(amount / 100)"
+(5 - bv.col("balance")).to_expr_string()            # "(5 - balance)"
+```
+
+Both forms (left-operand or right-operand scalar) compile correctly because
+`_ExprAST` implements both `__add__` and `__radd__` (and the same for `-`,
+`*`, `/`).
+
+**Type rules** (server-side, applied at register time during schema
+propagation):
+
+- Both operands must be numeric (`i64` or `f64`); `bool` is NOT numeric in
+  arithmetic context. Cast first via `.cast("int")`.
+- Division (`/`) always widens to `f64` to avoid integer-truncation surprises.
+- Otherwise, `f64 + i64 → f64`; `i64 + i64 → i64`.
+
+## Comparison: `> >= < <= == !=`
+
+```python
+(bv.col("amount") > 100).to_expr_string()           # "(amount > 100)"
+(bv.col("amount") >= 100).to_expr_string()          # "(amount >= 100)"
+(bv.col("status") == "ok").to_expr_string()         # "(status == 'ok')"
+(bv.col("status") != "ok").to_expr_string()         # "(status != 'ok')"
+```
+
+All comparison ops return `bool` regardless of operand types. String
+literals on the right-hand side are auto-quoted and backslash-escaped per
+the grammar.
+
+## Boolean combinators: `& | ~`
+
+Python's keyword `and / or / not` cannot be operator-overloaded; beava uses
+`& / | / ~` instead and emits the keywords in the canonical grammar:
+
+```python
+left = bv.col("amount") > 100
+right = bv.col("merchant") == "amazon"
+
+(left & right).to_expr_string()   # "((amount > 100) and (merchant == 'amazon'))"
+(left | right).to_expr_string()   # "((amount > 100) or (merchant == 'amazon'))"
+(~left).to_expr_string()          # "(not (amount > 100))"
+```
+
+> **Warning:** The `~` operator applied to a bare column reference (e.g., `~bv.col("field")`) currently produces `!(field)` instead of the correct `(not field)` format. The server's expression parser rejects this with `aggregation_invalid_where` because it does not recognize the `!` operator. This is a known bug (confirmed in `test_invert_currently_rejected_by_server`). Use `~(bv.col("field") == True)` or other boolean sub-expressions as a workaround; the bug only affects `~` applied directly to a column reference.
+
+**Both operands MUST be boolean** — the server-side type inference rejects
+`bool & i64` etc. with `schema_mismatch`. Use `.cast("bool")` if you have a
+0/1 column you want to combine; or use `(col != 0)` to coerce first.
+
+**Operator precedence:** Python's `&` and `|` bind **tighter** than `>` /
+`==`. To get the obvious "either of two predicates" reading, parenthesise:
+
+```python
+# WRONG — Python parses this as `bv.col("a") > (100 & bv.col("b")) > 0`
+bv.col("a") > 100 & bv.col("b") > 0
+
+# RIGHT
+(bv.col("a") > 100) & (bv.col("b") > 0)
+```
+
+The SDK does NOT detect or rewrite the wrong form; it produces a
+schema-mismatch error at register time. Always parenthesise your sub-predicates.
+
+## `.isnull()` — null-check
+
+```python
+bv.col("amount").isnull().to_expr_string()        # "(amount == null)"
+```
+
+Shorthand for the `(col == null)` form. Emitted as a `_BinOp` so it composes
+with boolean combinators:
+
+```python
+non_null_pos = (~bv.col("amount").isnull()) & (bv.col("amount") > 0)
+non_null_pos.to_expr_string()
+# "((not (amount == null)) and (amount > 0))"
+```
+
+Use `.isnull()` rather than `(col == None)` for clarity — both compile to
+the same wire form, but `.isnull()` reads better in chains.
+
+## `.cast(type_name)` — type coercion
+
+```python
+bv.col("flag_str").cast("bool").to_expr_string()  # "cast(flag_str, bool)"
+bv.col("amount").cast("int").to_expr_string()     # "cast(amount, int)"
+```
+
+Args:
+
+- `type_name` — one of `"str"`, `"int"`, `"float"`, `"bool"`. Other strings
+  are rejected at decoration time with `ValueError`.
+
+`.cast()` is a `_Call` node (not a `_BinOp`); the canonical form is
+`cast(<expr>, <type>)`. The cast target name renders as a **bare
+identifier** (NOT a quoted string), so the server's parser can dispatch on
+the type without a string-strip.
+
+**Use case 1 — within `with_columns(...)` to derive a new column:**
+
+```python
+@bv.event
+def TxnWithFlagInt(txn: Txn) -> bv.Event:
+    return txn.with_columns(is_fraud_int=bv.col("is_fraud").cast("int"))
+```
+
+This produces a new `is_fraud_int` column on the downstream derivation that
+aggregations can then sum. **This is the recommended boolean-sum pattern** —
+see [compilation-rules.md § Boolean-sum trick](compilation-rules.md#boolean-sum-trick-recommended-pattern-for-conditional-counts).
+
+**Use case 2 — within `where=` predicates:**
+
+```python
+@bv.table(key="user_id")
+def UserStats(txn) -> bv.Table:
+    return (
+        txn.group_by("user_id")
+           .agg(big_count=bv.count(window="1h",
+                                    where=bv.col("amount").cast("int") > 100))
+    )
+```
+
+## `.alias(name)` — rename in derivation context
+
+> **Status:** Implemented post-13.5; current code uses `**kwargs` naming on
+> `with_columns(name=expr)` instead. SDK porters in 13.6 may add `.alias()`
+> as a convenience method. Documented here for forward compatibility.
+
+```python
+expensive = (bv.col("amount") > 100).alias("is_expensive")
+```
+
+In the v0 API the recommended form is the kwarg-as-name shape:
+
+```python
+@bv.event
+def TxnAnnotated(txn: Txn) -> bv.Event:
+    return txn.with_columns(is_expensive=bv.col("amount") > 100)
+```
+
+The kwarg name (`is_expensive`) becomes the new column name on the wire;
+`.alias(...)` exists in the spec for completeness with Polars conventions
+but is not required to author v0 pipelines.
+
+## `bv.lit(value)` — literal-value expression
+
+> **Status:** Optional — most SDK code paths auto-wrap scalars via the
+> internal `_wrap()` helper (e.g., `bv.col("amount") > 100` works without an
+> explicit `bv.lit(100)`). `bv.lit(...)` is exposed for clarity in
+> auto-generated / programmatic pipelines.
+
+```python
+import beava as bv
+
+bv.lit(100).to_expr_string()       # "100"
+bv.lit(3.14).to_expr_string()      # "3.14"
+bv.lit(True).to_expr_string()      # "true"
+bv.lit(None).to_expr_string()      # "null"
+bv.lit("amazon").to_expr_string()  # "'amazon'"
+```
+
+Supported value types: `int`, `float`, `bool`, `str`, `None`. Other types
+raise `TypeError` at decoration time. Strings are quoted with single quotes
+and backslash-escape `\\` and `'` per the grammar.
+
+You almost never need `bv.lit(...)` in user code — the operator overloading
+auto-wraps Python scalars on either side of binary ops. Use it when you
+need an explicit literal at a position the SDK cannot infer (e.g., as a
+positional argument to a future call expression).
+
+Per [ADR-003](../../.planning/decisions/ADR-003-global-aggregation-and-bv-lit.md), `bv.lit` is the **canonical public surface** for literal construction across all three SDKs (Python `bv.lit(value)`, TypeScript `bv.lit(value)`, Go `beava.Lit(value)`). The implicit operator-overloading coercion path keeps working in Python; `bv.lit` is exposed for explicit cases (constant columns via `events.with_columns(source=bv.lit("web"))`, type-coercion patterns, and cross-language parity with TS/Go SDKs that lack Python's flexible operator overloading).
+
+## Compilation: every node knows how to emit
+
+Each `_ExprAST` subclass implements `to_expr_string()`. The SDK calls this
+at register time when serialising:
+
+- `EventDerivation.filter(expr)` → `{"op": "filter", "expr": expr.to_expr_string()}`
+- `EventDerivation.with_columns(name=expr, ...)` → `{"op": "with_columns",
+  "exprs": {name: expr.to_expr_string(), ...}}`
+- `bv.<agg>(field, where=expr, ...)` (e.g. `bv.count(where=bv.col("status") == "ok")`)
+  → `{"op": "<agg>", "params": {"where": expr.to_expr_string(), ...}}`
+
+The expression string is the **canonical contract** between SDK and server.
+SDK porters MUST produce the same string for semantically equivalent
+expressions; round-trip tests on every fixture in
+[`examples/wire/`](../../examples/wire/) verify this.
+
+## Validation at register-time
+
+When the server parses the register payload (per
+[wire-spec OP_REGISTER](../wire-spec.md#op_register-0x0001)) it runs Phase 4
+expression validation on every `expr` string:
+
+- **Parse error** (malformed grammar, unbalanced parens) →
+  `RegistrationError(code="invalid_expression")`.
+- **Field reference unknown** (e.g., `bv.col("typo_amount")` referencing a
+  field not in the upstream schema) → `RegistrationError(code="unknown_field_reference")`.
+- **Type mismatch** (e.g., arithmetic on a `bool` field, boolean op on a
+  numeric field) → `RegistrationError(code="schema_mismatch")`.
+- **Cast target invalid** (e.g., `cast(x, "complex64")`) →
+  `RegistrationError(code="invalid_cast_target")`.
+
+The server emits all errors in a fail-soft batch — a single register call
+returns the full list of validation failures, not just the first.
+
+## Cross-references
+
+- [Pipeline DSL Overview](overview.md) — how expressions fit in the larger
+  pipeline-authoring story.
+- [Pipeline DSL Compilation Rules](compilation-rules.md) — per-method H3
+  worked examples; expressions are referenced from `.filter`, `.with_columns`,
+  `.cast`, and aggregation `where=` kwargs.
+- [Wire spec](../wire-spec.md) — canonical JSON contract.
+- [Error codes](../error-codes.md) — `invalid_expression`,
+  `unknown_field_reference`, `schema_mismatch`, `invalid_cast_target`.