Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions crates/beava-server/tests/phase12_6_mio_only_dataplane.rs
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,8 @@
//! `crates/beava-server/src/recovery.rs::replay_handrolled_wal_dir` plus
//! `replay_wal_from_lsn` (cold-path WAL replay on boot). Any third caller is
//! an architectural regression per `project_phase18_no_dual_runtime` +
//! `project_redis_shaped_no_event_time_ever`.
//! `project_redis_shaped_no_event_time_ever`
//! (post-2026-05-19: partially overturned by ADR-004 for v0.1 bucketing).
//!
//! **2. axum is restricted to the admin sidecar.**
//! `axum::Router`, `axum::Json`, `axum::Extension`, `axum::extract::*`,
Expand Down Expand Up @@ -145,7 +146,8 @@ fn only_apply_shard_and_recovery_call_apply_event_to_aggregations() {
violations.is_empty(),
"Architectural regression: only `apply_shard.rs::dispatch_push_sync` (mio data plane) \
and `recovery.rs::replay_*` (WAL replay) may call `apply_event_to_aggregations`. \
Per `project_phase18_no_dual_runtime` + `project_redis_shaped_no_event_time_ever`, \
Per `project_phase18_no_dual_runtime` + `project_redis_shaped_no_event_time_ever` \
(post-2026-05-19: partially overturned by ADR-004 for v0.1 bucketing), \
the mio event loop is the SOLE data-plane runtime. Any third caller is a regression.\n\
Violations:\n{}",
violations.join("\n")
Expand Down
158 changes: 158 additions & 0 deletions docs/operators/buffer-geo/histogram.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,158 @@
# bv.histogram

> Fixed-bucket count histogram of a numeric field. `buckets` is a required register-time kwarg per [V0-MEM-GOV-02](../../../.planning/REQUIREMENTS.md).

## Signature

```python
bv.histogram(
field: str,
*,
buckets: list[float], # REQUIRED — register-time kwarg
where: bv.Col | None = None,
) -> AggDescriptor
```

## Description

`bv.histogram` returns a count per fixed numeric bucket of `field` across
events that match the optional `where=` predicate. `buckets` is a strictly
increasing list of split points; the cells are `(-inf, b[0])`, `[b[0], b[1])`,
…, `[b[n-1], +inf)`. For `buckets=[10, 20, 50]` you get four cells with
labels `"<10"`, `"10-20"`, `"20-50"`, `">=50"`. Use it for "transaction
amount distribution by tier" or "request size in p50 / p90 / p99 buckets" —
features where you want a coarse shape, not a quantile estimate.

`buckets` is a **required keyword argument** per
[V0-MEM-GOV-02](../../../.planning/REQUIREMENTS.md): the lifetime-aggregation
memory contract requires every unbounded-by-default operator to declare a
finite per-entity ceiling at register time. `bv.histogram`'s ceiling is
exactly `len(buckets) + 1` `u64` counters per entity. The register-time
JSON-prelude shim (`pre_check_unbounded_op_in_lifetime_mode`) rejects any
`histogram` payload missing `buckets` (or with an empty `buckets` array)
with the structured error code `unbounded_op_in_lifetime_mode`. There is
no fallback default — picking the bucket edges is a deliberate
capacity-planning + signal-design step.

`bv.histogram` belongs to the **bounded-buffer** family. Per-event update
is a linear scan over `buckets` (≤ ~20 in practice) plus a saturating
counter increment — Tier 1 floor (~10 ns / ~30 ns measured). The query
path materializes a `BTreeMap` of labelled counts and is therefore listed
under Tier 3 in [cost-class.md](../cost-class.md); the apply-path cost
remains Tier 1. There is no `window=` kwarg in v0 — `bv.histogram` is
**lifetime-only**. For "amount histogram for the last 24 h", compose with
`@bv.event(cold_after=...)` or use [`bv.quantile`](../sketch/quantile.md)
for a quantile sketch instead.

## Parameters

| Name | Type | Required | Default | Description |
|------|------|----------|---------|-------------|
| `field` | `str` | Yes | — | Name of the numeric field to bucket (`f64` / `i64`). |
| `buckets` | `list[float]` | **Yes** | — | Strictly increasing list of split points. `n` values produce `n + 1` cells. Caps per-entity memory at `n + 1` `u64` counters per [V0-MEM-GOV-02 BoundedByRequiredKwarg("buckets")](../../../.planning/REQUIREMENTS.md). |
| `where` | `bv.Col` | No | `None` | Boolean expression on event fields; only matching events update the counters. |

## Returns

A `dict[str, int]` keyed by bucket label (e.g. `"<10"`, `"10-20"`,
`">=50"`) with `i64` count values. Wire form is `Value::Map` with
`BTreeMap`-sorted iteration. When the entity has seen zero matching
events, the result is the dict with all counters at `0`.

## Complexity

| Resource | Bound |
|----------|-------|
| CPU per event | **Tier 1** (~10 ns floor / ~30 ns measured — linear scan over ≤ ~20 buckets + saturating add) — see [cost-class.md](../cost-class.md#tier-1-fast-40-nscall--38-ops) |
| Query | **Tier 3** allocates a `BTreeMap` of `len(buckets) + 1` entries on each `app.get(...)` — see [cost-class.md](../cost-class.md#tier-3-algorithmic-floor-100-300-nscall--9-ops) — apply-thread cost is Tier 1; query-time cost is the asymmetry to flag when profiling |
| Memory per entity | **`BoundedByRequiredKwarg("buckets")`** — `(len(buckets) + 1) × 8` bytes per [Phase 12.8 V0-MEM-GOV-02](../../../.planning/REQUIREMENTS.md) |
| Lifetime mode | **Required** — `bv.histogram` has no `window=` kwarg in v0; lifetime is the only mode |

## Examples

### Example 1: Transaction-amount histogram per user

```python
import beava as bv

@bv.event
class Txn:
user_id: str
amount: float

@bv.table(key="user_id")
def UserAmountHistogram(txn) -> bv.Table:
return (
txn.group_by("user_id")
.agg(amount_hist=bv.histogram("amount",
buckets=[10.0, 50.0, 100.0, 500.0]))
)

# Push events
for amt in [5.0, 12.0, 25.0, 80.0, 200.0, 750.0]:
app.push("Txn", {"user_id": "alice", "amount": amt})

# Query
result = app.get("UserAmountHistogram", "alice")
# result == {"amount_hist": {"<10": 1, "10-50": 2, "50-100": 1, "100-500": 1, ">=500": 1}}
```

### Example 2: Successful-only request size histogram

```python
@bv.table(key="endpoint")
def EndpointReqSizeHist(reqs) -> bv.Table:
return (
reqs.group_by("endpoint")
.agg(req_size_hist=bv.histogram("size_bytes",
buckets=[1024, 65536, 1048576],
where=bv.col("status") == 200))
)
```

## Wire

JSON wire form in a register payload:

```json
{
"kind": "derivation",
"name": "UserAmountHistogram",
"output_kind": "table",
"key": ["user_id"],
"agg": {
"amount_hist": {
"op": "histogram",
"params": {
"field": "amount",
"buckets": [10.0, 50.0, 100.0, 500.0]
}
}
}
}
```

See [examples/wire/register-fraud-team.request.json](../../../examples/wire/register-fraud-team.request.json) for a full payload example.

## Edge cases

- **`buckets` missing or empty at register time:** rejected with structured error code `unbounded_op_in_lifetime_mode` per [V0-MEM-GOV-02](../../../.planning/REQUIREMENTS.md). The JSON-prelude shim catches this before any state is allocated.
- **`buckets` not strictly increasing:** rejected with `aggregation_invalid_param` at register time.
- **Empty stream / cold-start:** all counters are `0`; the result dict is the full label set with zero values.
- **`window=` kwarg attempted:** raises `TypeError` at SDK-helper-call time. v0 has no windowed histogram; use `@bv.event(cold_after=...)` to scope the lifetime via cold-entity eviction, or [`bv.quantile`](../sketch/quantile.md) for a windowed shape estimate.
- **Non-numeric source field (`Value::Str`, `Value::Bool`):** event silently dropped (not bucketed). For categorical histograms use [`bv.event_type_mix`](./event_type_mix.md).
- **NaN inputs:** dropped — `numeric_from_row` returns `None` on non-`F64`/`I64` payloads. For cleaner semantics filter with `where=~bv.col(field).isnull()`.
- **Bucket-label format:** integer-valued edges render without trailing `.0` (`"<10"` not `"<10.0"`); fractional edges keep their decimal form. Stable across versions — Python SDKs can dict-key on the labels.
- **Counter overflow:** the per-bucket `u64` saturates at `2^64 - 1` (impossible in practice for a single entity).
- **Out-of-order event-time:** **does not matter.** beava is processing-time-only per [`project_redis_shaped_no_event_time_ever`](../../../.planning/PROJECT.md) (post-2026-05-19: partially overturned by ADR-004 for v0.1 bucketing); buckets are populated in arrival order.
- **Lifetime mode:** **the only mode.** Per-entity ceiling is `(len(buckets) + 1) × 8` bytes per [V0-MEM-GOV-02](../../../.planning/REQUIREMENTS.md) BoundedByRequiredKwarg("buckets").

## See also

- [cost-class.md](../cost-class.md) — performance tier (Tier 1 update / Tier 3 query)
- [bv.quantile](../sketch/quantile.md) — quantile-estimate companion (when you want p50/p90/p99 instead of fixed buckets)
- [bv.event_type_mix](./event_type_mix.md) — categorical-distribution companion (proportions per category, not counts per bucket)
- [bv.hour_of_day_histogram](./hour_of_day_histogram.md) — fixed 24-bin time-of-day histogram (no `buckets=` needed)
- [bv.dow_hour_histogram](./dow_hour_histogram.md) — fixed 168-bin day-of-week × hour histogram
- [V0-MEM-GOV-02](../../../.planning/REQUIREMENTS.md) — BoundedByRequiredKwarg memory governance contract
- [pipeline-dsl/compilation-rules.md](../../pipeline-dsl/compilation-rules.md) — chain compilation rules
43 changes: 43 additions & 0 deletions docs/operators/decay/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
# Decay-Family Aggregation Operators

The 6 decay-family ops cover **exponentially-weighted statistics** (EWMA / EWVar / EW Z-Score), **forward-decay accumulators** (Cormode 2009), and **time-weighted averaging**. Five of the six use a `half_life` parameter to set an exponential decay rate; `bv.twa` uses `window` instead because it integrates held-time exactly rather than fading.

| Op | Required kwarg | Returns | CPU tier | Notes |
|----|----------------|---------|----------|-------|
| [`bv.ewma`](./ewma.md) (alias `bv.ema`) | `half_life` | `f64` or `null` | Tier 1 | Exponentially-weighted mean. |
| [`bv.ewvar`](./ewvar.md) | `half_life` | `f64` or `null` | Tier 1 | Exponentially-weighted variance — companion second-moment to EWMA. |
| [`bv.ew_zscore`](./ew_zscore.md) | `half_life` | `f64` or `null` | Tier 1 | Current-event z-score against EWMA / EWVar baseline; the standard drift-aware anomaly primitive. |
| [`bv.decayed_sum`](./decayed_sum.md) | `half_life` | `f64` or `null` | Tier 1 | Cormode forward-decay sum — recency-weighted total that converges to a stable steady-state. |
| [`bv.decayed_count`](./decayed_count.md) | `half_life` | `f64` or `null` | Tier 1 | Same primitive without `field` — answers "how active recently?". The cheapest decay op. |
| [`bv.twa`](./twa.md) | `window` | `f64` or `null` | Tier 1 | Time-weighted average for irregularly-sampled gauge fields. |

All six are `O(1)` memory per entity and Tier 1 CPU per [cost-class.md](../cost-class.md). The five EW-decay ops share a state shape of `(value, last_now_ms, initialized)` plus per-op extras; `bv.twa` carries `(sum_v_dt, sum_dt, last_v, last_t, initialized)`.

## Key invariants

- **Server processing-time only.** Decay coefficients use `Δt = now_ms() at this matching event - now_ms() at the previous matching event` per [`project_redis_shaped_no_event_time_ever`](../../../.planning/PROJECT.md) (post-2026-05-19: partially overturned by ADR-004 for v0.1 bucketing). Producers cannot influence decay via payload fields. Late events (`Δt ≤ 0`) fall back to an unweighted blend and do **not** advance `last_now_ms`.
- **`half_life` is mandatory and finite for the EW family.** `bv.ewma`, `bv.ewvar`, `bv.ew_zscore`, `bv.decayed_sum`, `bv.decayed_count` all reject `"forever"` at SDK-helper-call time (regex `[1-9]\d*(?:ms|s|m|h|d)$`); use the corresponding lifetime ops ([`bv.first`](../point-ordinal/first.md), [`bv.var`](../core/var.md), [`bv.z_score`](../velocity/z_score.md), [`bv.sum`](../core/sum.md) `window="forever"`, [`bv.count`](../core/count.md) `window="forever"`) when an undecayed lifetime reading is what you want.
- **`bv.twa` accepts `window="forever"`.** Time-weighted average integrates held-time exactly, so the lifetime form is well-defined; per-op classification at `crates/beava-core/src/register_validate.rs` (~line 436) classifies all six ops as `O(1)` lifetime-bound (`OpLifetimeBound::O1`).
- **Reads do not decay forward.** `app.get(...)` returns the running statistic as of the **last matching event** — the helper does not re-decay the value to query time. (Re-decaying on read would mutate state on every `get`, breaking idempotence.)
- **Cold-start returns `null`** for all six ops. `bv.ewvar` and `bv.ew_zscore` additionally return `null` after only one matching event (variance is `0`; no spread to normalize against).
- **Cold-entity eviction (`@bv.event(cold_after=...)`)** drops the underlying state per the Redis-TTL pattern (V0-MEM-GOV-01); decay ops rebuild fresh on the next post-eviction matching event.

## When to use which

- **Smoothed running mean** that adapts to drift → [`bv.ewma`](./ewma.md). Pick `half_life` ≈ the timescale of the behaviour you care about.
- **Smoothed running variance** for anomaly scoring or volatility tracking → [`bv.ewvar`](./ewvar.md), usually paired with [`bv.ew_zscore`](./ew_zscore.md).
- **Recency-weighted total** (e.g. "spend in roughly the last hour") → [`bv.decayed_sum`](./decayed_sum.md). Steady-state ≈ `rate * value * half_life / ln(2)`.
- **Recency-weighted activity rate** → [`bv.decayed_count`](./decayed_count.md). Steady-state ≈ `rate * half_life / ln(2)`.
- **True time-weighted average** for gauges sampled at irregular intervals → [`bv.twa`](./twa.md).

> Note: `bv.rate_of_change` is **not** in the decay family — it lives under [velocity/rate_of_change.md](../velocity/rate_of_change.md) per the Phase 9 op classification (it computes a slope across two adjacent windows, not an exponentially-weighted statistic). Polished by [Plan 13.0-09](../../../.planning/phases/13.0-design-contract-spec-docs/).

## See also

- [Operator catalog index](../index.md) — full 53-op catalogue (decay is the 6-op family)
- [cost-class.md](../cost-class.md) — per-op CPU tier metadata (all six decay ops are Tier 1)
- [Velocity family](../velocity/) — sibling family for slope-style and inter-arrival statistics, including `rate_of_change`
- [Core family](../core/) — fixed-window arithmetic mean / variance / sum / count counterparts
- [shared.md window grammar](../../sdk-api/shared.md) — duration-string format (`\d+(ms\|s\|m\|h\|d)` and the `"forever"` literal)
- Per-operator memory governance: [V0-MEM-GOV-02](../../../.planning/REQUIREMENTS.md) — every lifetime aggregation operator declares a finite per-entity memory ceiling at register-time
- [Pipeline DSL compilation rules](../../pipeline-dsl/compilation-rules.md) — how `bv.<op>(...)` calls compile to JSON wire form
34 changes: 34 additions & 0 deletions docs/operators/recency/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
# Recency Aggregation Operators

The 10 recency ops cover **time-since semantics** (how long since first / most-recent match), **windowed-recency booleans** (was the last match within window N?), and **streak counters** (consecutive matches and non-matches). All recency ops use **server processing-time** (`now_ms()` at the apply path) per [`project_redis_shaped_no_event_time_ever`](../../../.planning/PROJECT.md) — beava intentionally has no event-time concept (post-2026-05-19: partially overturned by ADR-004 for v0.1 bucketing).

| Op | Returns | Time source | Notes |
|----|---------|-------------|-------|
| [`bv.first_seen`](./first_seen.md) | `Datetime` (i64 ms) | server `now_ms()` at apply | First match's server arrival timestamp |
| [`bv.last_seen`](./last_seen.md) | `Datetime` (i64 ms) | server `now_ms()` at apply | Most recent match's server arrival timestamp |
| [`bv.age`](./age.md) | `i64` ms | computed at **read time** | `now_ms() - first_seen` |
| [`bv.has_seen`](./has_seen.md) | `bool` | n/a | Cumulative ever-matched flag |
| [`bv.time_since`](./time_since.md) | `i64` ms or `null` | computed at **read time** | `now_ms() - last_seen` |
| [`bv.time_since_last_n`](./time_since_last_n.md) | `i64` ms or `null` | computed at **read time**; `n` required | Generalization: ms since the kth most recent match |
| [`bv.streak`](./streak.md) | `i64` | event arrival order | Live consecutive-match counter |
| [`bv.max_streak`](./max_streak.md) | `i64` | event arrival order | All-time max of `streak`; never decreases |
| [`bv.negative_streak`](./negative_streak.md) | `i64` | event arrival order | Live consecutive-non-match counter (mirror of `streak`) |
| [`bv.first_seen_in_window`](./first_seen_in_window.md) | `bool` | `now_ms() - last_ms < window` at read time; `window=` required | "Was the last match within the last N ms?" |

## Key invariants

- **Server processing-time only.** Per [`project_redis_shaped_no_event_time_ever`](../../../.planning/PROJECT.md) (locked 2026-04-30) (post-2026-05-19: partially overturned by ADR-004 for v0.1 bucketing), beava records server `now_ms()` at apply for `first_seen` / `last_seen`, and computes elapsed-ms using server `now_ms()` at read for `age` / `time_since` / `time_since_last_n` / `first_seen_in_window`. Producers cannot influence captured timestamps via payload fields for recency operators, though v0.1 supports event-time bucketing via a reserved `_ts` field. The recency operators themselves continue to use server processing-time for their timestamp semantics.
- **Read-time computation.** `age`, `time_since`, `time_since_last_n`, and `first_seen_in_window` change between reads **without any new events** — the right-hand side of the elapsed calculation is captured at query time, not apply time. This makes them useful staleness/recency features.
- **`bv.time_since_last_n` requires `n`.** Per [V0-MEM-GOV-02 BoundedByRequiredKwarg("n")](../../../.planning/REQUIREMENTS.md), the deque of timestamps must have a register-time ceiling. Missing `n` is rejected by the JSON-prelude shim with code `unbounded_op_in_lifetime_mode`.
- **`bv.first_seen_in_window` requires `window`.** The windowed-recency boolean is meaningless without a horizon length. The `window` parameter is enforced at register time; `"forever"` is rejected (use [`bv.has_seen`](./has_seen.md) for that semantic).
- **Cold-start behavior is per-op.** Streaks return `0`. Booleans (`has_seen`, `first_seen_in_window`) return `false`. Datetime/duration ops (`first_seen`, `last_seen`, `age`, `time_since`, `time_since_last_n`) return `null`.
- **Cold-entity eviction (`@bv.event(cold_after=...)`)** drops the underlying state per the Redis-TTL pattern (V0-MEM-GOV-01); recency state rebuilds from the next post-eviction event.
- **9 of 10 ops share `SeenState`** (`first_seen`, `last_seen`, `age`, `has_seen`, `time_since`) or related `StreakState` (`streak`, `max_streak`) / `NegativeStreakState` / `FirstSeenInWindowState` — registering several siblings on the same `where=` predicate costs roughly the same as registering one.

## See also

- [Operator catalog index](../index.md) — full 53-op catalogue
- [cost-class.md](../cost-class.md) — per-op CPU tier metadata (all recency ops are Tier 1)
- [Point/ordinal family](../point-ordinal/) — value-based first/last/N counterparts to the timestamp ops here
- Per-operator memory governance: [V0-MEM-GOV-02](../../../.planning/REQUIREMENTS.md) — every lifetime aggregation operator declares a finite per-entity memory ceiling at register-time
- [Pipeline DSL compilation rules](../../pipeline-dsl/compilation-rules.md) — how `bv.<op>(...)` calls compile to JSON wire form
Loading
Loading