2 changes: 1 addition & 1 deletion .github/workflows/release-pypi.yml
@@ -1,4 +1,4 @@
# Publish sdist + wheel to PyPI when a SemVer tag is pushed (e.g. v1.0.2).
# Publish sdist + wheel to PyPI when a SemVer tag is pushed (e.g. v1.0.3).
# Configure "trusted publishing" on PyPI for this workflow + repository + optional GitHub environment.
# https://docs.pypi.org/trusted-publishers/

11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,17 @@ This project follows [Semantic Versioning](https://semver.org/). From **v1.0.0**

## Unreleased

## 1.0.3 - 2026-05-03

### Added

- **Tests:** **`tests/test_ledger.py`** — MEDIUM vs **`require_high_diff_confidence`**, LOW sample-floor boundary, **`max_latency_ms`** (and skip when latency absent), **`max_error_rate`**, multiple simultaneous policy failure reasons; **`tests/test_spine.py`** — MEDIUM confidence blocks second **`release promote`**, **`runs ingest`** on empty file / malformed JSONL / JSON array payload, **`release diff`** across different pricing providers and across different models on one provider table (plus **`POST /v1/diff`** `pricing.pricing_or_model_changed` assertion).
- **Web UI:** structured **Promote & rollback** outcome (policy badge, pointer status, action/release/baseline IDs, reason list) with raw response in a collapsed **`JsonPanel`**; **Run diff** shows a pricing/model-change callout when **`pricing.pricing_or_model_changed`** is true.

### Changed

- **Roadmap:** **Phase 0 progress** subsection and **Next release** pointer for **v1.0.3**; docs aligned with the patch scope above.

## 1.0.2 - 2026-05-02

### Added
2 changes: 1 addition & 1 deletion DEVELOPMENT.md
@@ -119,7 +119,7 @@ Merging to **`main` does not publish packages** — PyPI uploads are **tag-drive
1. **PyPI:** add a **trusted publisher** for **[github.com/flightdeckdev/flightdeck](https://github.com/flightdeckdev/flightdeck)** — workflow **`release-pypi.yml`**. If PyPI offers **Environment name: (Any)**, you can still use a GitHub **Environment** named **`pypi`** for approval gates; otherwise match whatever you register on PyPI ([trusted publishers](https://docs.pypi.org/trusted-publishers/)).
2. **GitHub:** Settings → **Environments** → create **`pypi`** (optional: required reviewers / wait timer before OIDC publish).
3. Bump **`version`** in **`pyproject.toml`** and **`src/flightdeck/__init__.py`**, update **`CHANGELOG.md`**, merge to **`main`**.
4. **`git tag vX.Y.Z`** (must match **`pyproject.toml`** exactly, e.g. **`v1.0.2`**) then **`git push origin vX.Y.Z`**.
4. **`git tag vX.Y.Z`** (must match **`pyproject.toml`** exactly, e.g. **`v1.0.3`**) then **`git push origin vX.Y.Z`**.

The workflow runs **ruff**, **pytest**, and the schema drift check, then **`uv build`**; it publishes **sdist + wheel** to **PyPI** via **OIDC** (no long-lived API token in repo secrets), enables **publish attestations**, and creates a **GitHub Release** with generated notes and **`dist/*`** assets.

4 changes: 4 additions & 0 deletions RELEASE_NOTES.md
@@ -4,6 +4,10 @@ High-level notes for **shipping FlightDeck**. Detailed history: **[CHANGELOG.md]

Narrative docs (including the CLI reference) are maintained on **[github.com/flightdeckdev/flightdeck](https://github.com/flightdeckdev/flightdeck)** `main`; this file and **`schemas/`** ship in minimal clones.

## v1.0.3 — Phase 0 hardening (tests + UI)

Patch release (see **[CHANGELOG.md](CHANGELOG.md)**): broader **pytest** coverage for **`diff_releases`** (MEDIUM/LOW confidence, **`max_latency_ms`**, **`max_error_rate`**, combined failures), **CLI** integration for MEDIUM confidence blocking promotion when **`require_high_diff_confidence`** is on, **`runs ingest`** edge cases (empty file, bad JSONL, JSON array file), and **multi-provider / cross-model** **`release diff`** plus **`POST /v1/diff`** parity on **`pricing.pricing_or_model_changed`**. **Web UI:** promote/rollback responses use structured panels (raw JSON optional); **Run diff** surfaces the same pricing/model-change note as the CLI when the diff payload flags it. **Stable contracts:** no CLI flag removals, no **`v1`** schema or **`POST /v1/events`** shape changes; **HTTP** diff and action response shapes are unchanged (additive UI only on the client).

## v1.0.2 — CI examples, serve packaging, and policy gate CLI

Patch release (see **[CHANGELOG.md](CHANGELOG.md)**): **`flightdeck release diff --fail-on-policy`** for CI gates; **`examples/ci/`** (`ledger-gate.sh`, GitHub Actions templates) exercised in root CI; **`examples/deploy/`** (Docker/Compose for **`flightdeck serve`**); **`examples/integration/`** (SDK sample emitter for **`POST /v1/events`**); **`GET /health`** adds non-secret **`mutation_auth`** (`loopback` vs `bearer`); web shell shows mutation/token ergonomics and optional read-only UI (**`VITE_FLIGHTDECK_UI_READ_ONLY`**). Fix: policy **`min_*_runs`** explicit **`0`** overrides workspace defaults (**`is not None`** resolution in **`diff_releases`**). **Stable contracts:** additive **`/health`** field only; CLI flag is backward-compatible.
17 changes: 17 additions & 0 deletions ROADMAP.md
@@ -19,6 +19,12 @@ This roadmap is meant to be clear from **what is already shipped** to **near-ter

---

## Next release

**v1.0.3** (patch): Phase 0 hardening — expanded **pytest** coverage for diff confidence (MEDIUM/LOW, policy on latency and error rate), **runs ingest** edge cases (empty file, malformed JSONL, JSON array payload), and **multi-provider / cross-model** `release diff` paths; web UI structured **promote/rollback** outcome plus a **pricing/model changed** banner on **Run diff** when the API reports it. See **[CHANGELOG.md](CHANGELOG.md)** and **[RELEASE_NOTES.md](RELEASE_NOTES.md)**. No breaking changes to stable CLI, HTTP, or **`api_version` `v1`** contracts.

---

## Production readiness gaps (why it can feel standalone)

These are current gaps between "works locally" and "easy to use across production services."
@@ -47,6 +53,17 @@ Goal: prove the wedge with real teams using FlightDeck as release governance sou
- Strengthen local security ergonomics: explicit token/env status in UI, mutation guardrails, optional read-only UX.
- Continue UI productization for current scope (structured views over raw JSON where stable).

### Phase 0 progress (toward v1.0.3)

Shipped on **`main`** for the next patch:

- **Policy / diff tests:** `diff_releases` coverage for MEDIUM confidence vs `require_high_diff_confidence`, LOW sample floor boundaries, `max_latency_ms` (including skip when latency is absent), `max_error_rate`, and stacked policy failure reasons; CLI integration for MEDIUM blocking a second promotion after a baseline is established.
- **Ingest tests:** empty JSONL (zero inserts), malformed line (non-zero exit), JSON array file accepted.
- **Multi-provider pricing:** integration tests that diff baseline vs candidate releases with different **`pricing_reference`** providers (and same-provider different models), including parity checks on **`POST /v1/diff`** `pricing.pricing_or_model_changed`.
- **Web UI:** structured outcome card after promote/rollback (policy, pointer, IDs) with raw JSON in a collapsible panel; Diff summary shows pricing/model change when the server marks it.
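
The three ingest cases above (empty file, malformed line, JSON array file) can be illustrated with a small parser sketch. `parse_run_events` is a hypothetical function written for this note; it is an assumption about one way to accept both JSONL and a JSON array, not the code behind **`runs ingest`**.

```python
import json


def parse_run_events(text: str) -> list[dict]:
    """Parse run events from JSONL text or from a single JSON array."""
    stripped = text.strip()
    if not stripped:
        return []  # empty file: nothing to insert
    if stripped.startswith("["):
        payload = json.loads(stripped)  # whole file is one JSON array
        if not isinstance(payload, list):
            raise ValueError("expected a JSON array of events")
        return payload
    events: list[dict] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report the offending line so a CLI can exit non-zero with context.
            raise ValueError(f"line {lineno}: malformed JSON ({exc.msg})") from exc
    return events
```

A caller would map the `ValueError` to a non-zero exit code, which matches the malformed-line behaviour the tests expect.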

**Still open in Phase 0** (see gaps table and Phase 1 for larger items): richer **pricing normalization** product semantics (beyond per-side tables + flags), broader **integration** and **deployment** narrative in docs, and **observability** paths remain roadmap-sized rather than single-patch work.

### Phase-0 success signals

- Teams use release versioning + checksum verification as the source of truth for promotion decisions.
2 changes: 1 addition & 1 deletion examples/ci/README.md
@@ -37,7 +37,7 @@ uv run python examples/ci/ledger_gate.py
Example (**PyPI** install):

```bash
pip install "flightdeck-ai>=1.0.2"
pip install "flightdeck-ai>=1.0.3"
export WORKSPACE="$(mktemp -d)"
export QUICKSTART_ROOT=/path/to/flightdeck/examples/quickstart
python /path/to/flightdeck/examples/ci/ledger_gate.py
2 changes: 1 addition & 1 deletion examples/ci/github-actions/policy-gate-pypi.yml
@@ -11,7 +11,7 @@ on:
env:
# Pin to a tag or SHA that matches your installed flightdeck-ai version when possible.
FLIGHTDECK_REF: main
FLIGHTDECK_AI_SPEC: ">=1.0.2"
FLIGHTDECK_AI_SPEC: ">=1.0.3"

jobs:
ledger-gate:
2 changes: 1 addition & 1 deletion examples/deploy/Dockerfile
@@ -2,7 +2,7 @@
FROM python:3.14-slim

RUN pip install --no-cache-dir --upgrade pip \
&& pip install --no-cache-dir "flightdeck-ai>=1.0.2"
&& pip install --no-cache-dir "flightdeck-ai>=1.0.3"

WORKDIR /workspace

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "flightdeck-ai"
version = "1.0.2"
version = "1.0.3"
description = "AI Release Governance for production agents."
readme = "README.md"
license = "Apache-2.0"
2 changes: 1 addition & 1 deletion src/flightdeck/__init__.py
@@ -1,3 +1,3 @@
"""FlightDeck - AI Release Governance for production agents."""

__version__ = "1.0.2"
__version__ = "1.0.3"
11 changes: 11 additions & 0 deletions src/flightdeck/server/static/assets/index-B_1jz54d.js

Large diffs are not rendered by default.

11 changes: 0 additions & 11 deletions src/flightdeck/server/static/assets/index-Be9J5wBP.js

This file was deleted.

2 changes: 1 addition & 1 deletion src/flightdeck/server/static/index.html
@@ -4,7 +4,7 @@
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>FlightDeck</title>
<script type="module" crossorigin src="/assets/index-Be9J5wBP.js"></script>
<script type="module" crossorigin src="/assets/index-B_1jz54d.js"></script>
<link rel="stylesheet" crossorigin href="/assets/index-Dl91dBdu.css">
</head>
<body>
242 changes: 238 additions & 4 deletions tests/test_ledger.py
@@ -17,7 +17,16 @@
)


def _event(*, agent_id: str, run_id: str, release_id: str) -> RunEvent:
def _event(
    *,
    agent_id: str,
    run_id: str,
    release_id: str,
    latency_ms: int | None = 100,
    success: bool = True,
    input_tokens: int = 100,
    output_tokens: int = 50,
) -> RunEvent:
    return RunEvent(
        timestamp=datetime.now(tz=timezone.utc),
        agent_id=agent_id,
@@ -30,11 +39,11 @@ def _event(*, agent_id: str, run_id: str, release_id: str) -> RunEvent:
            model=RunEventModelUsage(
                provider="openai",
                model="gpt-4.1-mini",
                input_tokens=100,
                output_tokens=50,
                input_tokens=input_tokens,
                output_tokens=output_tokens,
            )
        ),
        metrics=RunEventMetrics(latency_ms=100, success=True),
        metrics=RunEventMetrics(latency_ms=latency_ms, success=success),
    )


@@ -113,3 +122,228 @@ def test_diff_releases_respects_zero_policy_sample_thresholds() -> None:

    assert result.confidence == "HIGH"
    assert result.policy.passed


def _events(*, n: int, release_id: str, agent_id: str = "agent_a", **kwargs) -> list[RunEvent]:
    return [_event(agent_id=agent_id, run_id=f"{release_id}_{i}", release_id=release_id, **kwargs) for i in range(n)]


def test_medium_confidence_blocks_when_require_high_flag_set() -> None:
    cfg = WorkspaceConfig()
    policy = Policy(require_high_diff_confidence=True)
    table = _pricing_table()

    result = diff_releases(
        cfg=cfg,
        policy=policy,
        baseline_events=_events(n=200, release_id="rel_b"),
        candidate_events=_events(n=200, release_id="rel_c"),
        baseline_pricing_table=table,
        candidate_pricing_table=table,
        window="7d",
    )

    assert result.confidence == "MEDIUM"
    assert not result.policy.passed
    assert any("MEDIUM" in r for r in result.policy.reasons)
    assert any("HIGH" in r for r in result.policy.reasons)


def test_medium_confidence_passes_without_require_high_flag() -> None:
    cfg = WorkspaceConfig()
    policy = Policy(require_high_diff_confidence=False)
    table = _pricing_table()

    result = diff_releases(
        cfg=cfg,
        policy=policy,
        baseline_events=_events(n=200, release_id="rel_b"),
        candidate_events=_events(n=200, release_id="rel_c"),
        baseline_pricing_table=table,
        candidate_pricing_table=table,
        window="7d",
    )

    assert result.confidence == "MEDIUM"
    assert result.policy.passed


def test_confidence_reason_populated_for_medium_and_low() -> None:
    cfg = WorkspaceConfig()
    policy = Policy(require_high_diff_confidence=False)
    table = _pricing_table()

    medium = diff_releases(
        cfg=cfg,
        policy=policy,
        baseline_events=_events(n=200, release_id="rel_b"),
        candidate_events=_events(n=200, release_id="rel_c"),
        baseline_pricing_table=table,
        candidate_pricing_table=table,
        window="7d",
    )
    assert medium.confidence == "MEDIUM"
    assert medium.confidence_reason
    assert "sample" in medium.confidence_reason

    low = diff_releases(
        cfg=cfg,
        policy=policy,
        baseline_events=_events(n=10, release_id="rel_b"),
        candidate_events=_events(n=200, release_id="rel_c"),
        baseline_pricing_table=table,
        candidate_pricing_table=table,
        window="7d",
    )
    assert low.confidence == "LOW"
    assert low.confidence_reason
    assert "sample" in low.confidence_reason or "floor" in low.confidence_reason


def test_low_floor_boundary() -> None:
    cfg = WorkspaceConfig()
    # Override defaults so we can drive the LOW floor at runs=50 deterministically.
    policy = Policy(
        min_baseline_runs=500,
        min_candidate_runs=500,
        min_low_runs=50,
        require_high_diff_confidence=False,
    )
    table = _pricing_table()

    just_below = diff_releases(
        cfg=cfg,
        policy=policy,
        baseline_events=_events(n=49, release_id="rel_b"),
        candidate_events=_events(n=200, release_id="rel_c"),
        baseline_pricing_table=table,
        candidate_pricing_table=table,
        window="7d",
    )
    assert just_below.confidence == "LOW"

    at_floor = diff_releases(
        cfg=cfg,
        policy=policy,
        baseline_events=_events(n=50, release_id="rel_b"),
        candidate_events=_events(n=200, release_id="rel_c"),
        baseline_pricing_table=table,
        candidate_pricing_table=table,
        window="7d",
    )
    assert at_floor.confidence == "MEDIUM"


def test_policy_max_latency_ms_blocks() -> None:
    cfg = WorkspaceConfig()
    policy = Policy(
        max_latency_ms=50,
        min_baseline_runs=0,
        min_candidate_runs=0,
        min_low_runs=0,
        require_high_diff_confidence=False,
    )
    table = _pricing_table()

    result = diff_releases(
        cfg=cfg,
        policy=policy,
        baseline_events=_events(n=5, release_id="rel_b", latency_ms=100),
        candidate_events=_events(n=5, release_id="rel_c", latency_ms=200),
        baseline_pricing_table=table,
        candidate_pricing_table=table,
        window="7d",
    )

    assert not result.policy.passed
    assert any("latency_ms_avg" in r for r in result.policy.reasons)


def test_policy_max_latency_ms_skipped_when_no_data() -> None:
    cfg = WorkspaceConfig()
    policy = Policy(
        max_latency_ms=50,
        min_baseline_runs=0,
        min_candidate_runs=0,
        min_low_runs=0,
        require_high_diff_confidence=False,
    )
    table = _pricing_table()

    result = diff_releases(
        cfg=cfg,
        policy=policy,
        baseline_events=_events(n=5, release_id="rel_b", latency_ms=None),
        candidate_events=_events(n=5, release_id="rel_c", latency_ms=None),
        baseline_pricing_table=table,
        candidate_pricing_table=table,
        window="7d",
    )

    assert result.candidate.latency_ms_avg is None
    assert result.policy.passed
    assert not any("latency" in r for r in result.policy.reasons)


def test_policy_max_error_rate_blocks() -> None:
    cfg = WorkspaceConfig()
    policy = Policy(
        max_error_rate=0.1,
        min_baseline_runs=0,
        min_candidate_runs=0,
        min_low_runs=0,
        require_high_diff_confidence=False,
    )
    table = _pricing_table()

    candidate_events = [
        _event(agent_id="agent_a", run_id=f"c_{i}", release_id="rel_c", success=(i < 4))
        for i in range(8)
    ]

    result = diff_releases(
        cfg=cfg,
        policy=policy,
        baseline_events=_events(n=5, release_id="rel_b"),
        candidate_events=candidate_events,
        baseline_pricing_table=table,
        candidate_pricing_table=table,
        window="7d",
    )

    assert result.candidate.error_rate == 0.5
    assert not result.policy.passed
    assert any("error_rate" in r for r in result.policy.reasons)


def test_policy_multiple_failures_accumulate() -> None:
    cfg = WorkspaceConfig()
    policy = Policy(
        max_cost_per_run_usd=0.0001,
        max_error_rate=0.1,
        min_baseline_runs=0,
        min_candidate_runs=0,
        min_low_runs=0,
        require_high_diff_confidence=False,
    )
    table = _pricing_table()

    candidate_events = [
        _event(agent_id="agent_a", run_id=f"c_{i}", release_id="rel_c", success=(i < 4))
        for i in range(8)
    ]

    result = diff_releases(
        cfg=cfg,
        policy=policy,
        baseline_events=_events(n=5, release_id="rel_b"),
        candidate_events=candidate_events,
        baseline_pricing_table=table,
        candidate_pricing_table=table,
        window="7d",
    )

    assert not result.policy.passed
    assert any("cost_per_run_usd" in r for r in result.policy.reasons)
    assert any("error_rate" in r for r in result.policy.reasons)
    assert len(result.policy.reasons) >= 2
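
Read together, the confidence tests in this file suggest a three-tier rule driven by sample counts. The sketch below is one reading of the test expectations only, not the `diff_releases` implementation: `classify_confidence` is a hypothetical function, and the real code may combine the thresholds differently or apply workspace defaults that these tests do not reveal.

```python
def classify_confidence(
    baseline_runs: int,
    candidate_runs: int,
    *,
    min_baseline_runs: int,
    min_candidate_runs: int,
    min_low_runs: int,
) -> str:
    """Map sample counts to LOW / MEDIUM / HIGH (assumed rule, inferred from tests)."""
    # Below the floor on either side: too few samples to trust the diff at all.
    if min(baseline_runs, candidate_runs) < min_low_runs:
        return "LOW"
    # Both sides meet their configured minimums: full confidence.
    if baseline_runs >= min_baseline_runs and candidate_runs >= min_candidate_runs:
        return "HIGH"
    # Above the floor but short of a minimum on at least one side.
    return "MEDIUM"
```

Under this assumed rule, 49 baseline runs against a floor of 50 yields LOW and exactly 50 yields MEDIUM, which is the boundary `test_low_floor_boundary` exercises.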