9 changes: 9 additions & 0 deletions docs/cli.md
@@ -445,6 +445,15 @@ JSON array:
]
```

**Edge cases:**

| Scenario | Behavior |
|----------|----------|
| Empty file (0 bytes or whitespace only) | Succeeds with `Inserted 0 events`. Not an error. |
| Malformed JSONL (invalid JSON on any line) | Fails with a non-zero exit code and a parse error message. |
| JSON array file | Parsed as a list of events; each element is validated individually. |
| Duplicate `run_id` | Silently skipped; count reflects only newly inserted rows. Re-ingesting the same file is safe. |
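
The edge cases in the table can be sketched as follows. This is a minimal illustration, not the CLI's actual implementation: `parse_events`, `ingest`, and the in-memory `seen_run_ids` set are hypothetical names, and detecting a JSON array by a leading `[` is an assumption.

```python
import json


def parse_events(text: str) -> list[dict]:
    """Parse file content as a JSON array or as JSONL (hypothetical helper)."""
    stripped = text.strip()
    if not stripped:
        return []  # empty or whitespace-only file: zero events, not an error
    if stripped.startswith("["):
        # Assumption: a leading "[" marks a JSON array file.
        return json.loads(stripped)
    events = []
    for lineno, line in enumerate(stripped.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines between records are ignored
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # A real CLI would report this and exit with a non-zero code.
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
    return events


def ingest(text: str, seen_run_ids: set[str]) -> int:
    """Return the number of newly inserted events, skipping known run_ids."""
    inserted = 0
    for event in parse_events(text):
        run_id = event["run_id"]
        if run_id in seen_run_ids:
            continue  # duplicate run_id: silently skipped
        seen_run_ids.add(run_id)
        inserted += 1
    return inserted
```

Because duplicates are skipped rather than rejected, calling `ingest` twice with the same content returns 0 the second time, which matches the "re-ingesting the same file is safe" behavior.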

See [http-api.md § POST /v1/events](http-api.md) for the full `RunEvent` field reference.

---
63 changes: 50 additions & 13 deletions docs/operations-and-policy.md
@@ -145,6 +145,30 @@ This can happen if `run_id` values from different agents were ingested under the
`release_id`. Ensure every `RunEvent` for a release carries the correct `agent_id`
matching `spec.agent.agent_id` in the release artifact.

### Pricing and model change detection

`DiffOutcome` includes a `pricing_or_model_changed` flag that is `True` when any of the
following differ between baseline and candidate:

- `spec.pricing_reference.provider` (e.g. `"openai"` vs. `"anthropic"`)
- `spec.pricing_reference.pricing_version` (e.g. `"openai-2026-04-30"` vs. a newer table)
- `spec.runtime.model` (e.g. `"gpt-4.1-mini"` vs. `"gpt-4.1"`)

When this flag is `True`, the CLI prints a note:

```
NOTE: cost delta includes pricing/model assumption changes (pricing reference and/or model differ).
```

The HTTP API's `/v1/diff` response includes `pricing.pricing_or_model_changed: true`, and the
web UI's `DiffPage` shows an `fd-alert--warn` banner. The flag is informational: the diff still
computes and the policy still evaluates, but cost deltas may reflect pricing assumption changes
in addition to actual usage changes.

Cross-provider diffs (e.g. OpenAI baseline vs. Anthropic candidate) are supported as long as
separate pricing tables for each provider/version are imported. Each side is priced against its
own table independently before deltas are computed.
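
The flag logic described in this section amounts to a three-field comparison. The sketch below assumes a flattened, hypothetical `SpecView` type standing in for the relevant fields of the release artifact; the real `DiffOutcome` computation may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecView:
    """Hypothetical flattened view of the three compared spec fields."""
    provider: str          # spec.pricing_reference.provider
    pricing_version: str   # spec.pricing_reference.pricing_version
    model: str             # spec.runtime.model


def pricing_or_model_changed(baseline: SpecView, candidate: SpecView) -> bool:
    """True when any of the three fields differs between the two sides."""
    return (
        baseline.provider != candidate.provider
        or baseline.pricing_version != candidate.pricing_version
        or baseline.model != candidate.model
    )


baseline = SpecView("openai", "openai-2026-04-30", "gpt-4.1-mini")
candidate = SpecView("openai", "openai-2026-04-30", "gpt-4.1")

if pricing_or_model_changed(baseline, candidate):
    # Same note text as the CLI output shown above.
    print("NOTE: cost delta includes pricing/model assumption changes "
          "(pricing reference and/or model differ).")
```

Here only the model differs, so the flag is set even though provider and pricing version match.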

### Rollup semantics

`ledger.compute_rollup` aggregates a list of `RunEvent` objects into a `Rollup`:
@@ -278,19 +302,6 @@ min_low_runs: 20

JSON Schema: [`schemas/v1/policy.schema.json`](../schemas/v1/policy.schema.json).

### Constraint evaluation

`ledger.evaluate_policy` checks constraints in order:

1. **`max_cost_per_run_usd`** — candidate average cost must not exceed the limit.
2. **`max_latency_ms`** — candidate average latency must not exceed the limit. Skipped
if the candidate window has no latency data.
3. **`max_error_rate`** — candidate error rate must not exceed the limit.
4. **`require_high_diff_confidence`** — when `True`, the diff must reach HIGH confidence.

Each failed constraint appends a human-readable reason to the result. An empty `reasons`
list means the policy passed (`passed = True`).

### Confidence tiers

Confidence is determined by comparing event counts against resolved thresholds:
@@ -317,6 +328,32 @@ min_low_runs: 0
require_high_diff_confidence: false
```

**`confidence_reason` format:** when confidence is not `HIGH`, a human-readable explanation is
set on `DiffResult.confidence_reason`. It is a semicolon-joined string of the applicable parts:

- `"candidate sample < {N} runs"` — candidate count is below `min_candidate_runs`
- `"baseline sample < {N} runs"` — baseline count is below `min_baseline_runs`
- `"LOW floor is {N} runs"` — either side is below `min_low_runs`
- Falls back to `"insufficient sample size"` when none of the above apply (should not occur in practice).

The same reason string is appended to the policy failure message when
`require_high_diff_confidence` blocks promotion, e.g.:
`"diff confidence is MEDIUM (candidate sample < 500 runs); promotion requires HIGH"`.
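
As a rough illustration of how such a reason string could be assembled, the sketch below follows the parts listed above. The function names, the ordering of parts, and the `"; "` separator are assumptions; only the individual message fragments come from this section.

```python
def build_confidence_reason(
    baseline_count: int,
    candidate_count: int,
    min_baseline_runs: int,
    min_candidate_runs: int,
    min_low_runs: int,
) -> str:
    """Assemble a reason string for a non-HIGH confidence (hypothetical helper)."""
    parts = []
    if candidate_count < min_candidate_runs:
        parts.append(f"candidate sample < {min_candidate_runs} runs")
    if baseline_count < min_baseline_runs:
        parts.append(f"baseline sample < {min_baseline_runs} runs")
    if baseline_count < min_low_runs or candidate_count < min_low_runs:
        parts.append(f"LOW floor is {min_low_runs} runs")
    # Fallback when no part applies; the separator "; " is an assumption.
    return "; ".join(parts) if parts else "insufficient sample size"


def confidence_failure_message(confidence: str, reason: str) -> str:
    """Format the policy failure message in the shape quoted above."""
    return f"diff confidence is {confidence} ({reason}); promotion requires HIGH"


reason = build_confidence_reason(
    baseline_count=1000,
    candidate_count=120,
    min_baseline_runs=500,
    min_candidate_runs=500,
    min_low_runs=20,
)
message = confidence_failure_message("MEDIUM", reason)
```

With these inputs only the candidate side is under its threshold, so `message` reproduces the example string quoted above.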

### Constraint evaluation: all constraints are checked

`evaluate_policy` checks **all** enabled constraints in order and accumulates every failure
reason before returning. A single promotion attempt can fail multiple constraints simultaneously:

1. `max_cost_per_run_usd` — candidate average cost must not exceed the limit.
2. `max_latency_ms` — candidate average latency must not exceed the limit. Skipped when candidate has no latency data.
3. `max_error_rate` — candidate error rate must not exceed the limit.
4. `require_high_diff_confidence` — when `True`, the diff must reach HIGH confidence.

Each failed constraint appends one entry to `policy.reasons`, so several entries can appear
at once: cost and error rate both over the limit produce two entries. An empty `reasons`
list means the policy passed (`passed = True`).
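
The accumulate-then-return behavior can be sketched as below. This is not the source of `evaluate_policy`: the `Policy`, `CandidateStats`, and `PolicyResult` types, the function name, and the wording of the reason strings are hypothetical and chosen only to mirror the four constraints listed above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Policy:
    # None means the constraint is not enabled (an assumption of this sketch).
    max_cost_per_run_usd: Optional[float] = None
    max_latency_ms: Optional[float] = None
    max_error_rate: Optional[float] = None
    require_high_diff_confidence: bool = False


@dataclass
class CandidateStats:
    avg_cost_usd: float
    avg_latency_ms: Optional[float]  # None when the window has no latency data
    error_rate: float


@dataclass
class PolicyResult:
    reasons: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.reasons


def evaluate_policy_sketch(
    policy: Policy, stats: CandidateStats, confidence: str
) -> PolicyResult:
    """Check every enabled constraint; never stop at the first failure."""
    result = PolicyResult()
    if (policy.max_cost_per_run_usd is not None
            and stats.avg_cost_usd > policy.max_cost_per_run_usd):
        result.reasons.append("average cost per run exceeds max_cost_per_run_usd")
    if (policy.max_latency_ms is not None
            and stats.avg_latency_ms is not None  # skipped without latency data
            and stats.avg_latency_ms > policy.max_latency_ms):
        result.reasons.append("average latency exceeds max_latency_ms")
    if (policy.max_error_rate is not None
            and stats.error_rate > policy.max_error_rate):
        result.reasons.append("error rate exceeds max_error_rate")
    if policy.require_high_diff_confidence and confidence != "HIGH":
        result.reasons.append(
            f"diff confidence is {confidence}; promotion requires HIGH")
    return result


policy = Policy(max_cost_per_run_usd=0.05, max_latency_ms=2000, max_error_rate=0.02)
stats = CandidateStats(avg_cost_usd=0.08, avg_latency_ms=None, error_rate=0.05)
result = evaluate_policy_sketch(policy, stats, confidence="HIGH")
```

In this example the cost and error-rate limits are both exceeded and the latency check is skipped for lack of data, so `result.reasons` holds two entries and `result.passed` is `False`.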

### Promotion blocked by policy

When policy fails, the promotion/rollback attempt is **recorded in the audit ledger**