flightdeckdev · Gsbreddy · May 2, 2026
diff --git a/.github/workflows/release-pypi.yml b/.github/workflows/release-pypi.yml
@@ -1,4 +1,4 @@
-# Publish sdist + wheel to PyPI when a SemVer tag is pushed (e.g. v1.0.2).
+# Publish sdist + wheel to PyPI when a SemVer tag is pushed (e.g. v1.0.3).
 # Configure "trusted publishing" on PyPI for this workflow + repository + optional GitHub environment.
 # https://docs.pypi.org/trusted-publishers/
 

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,17 @@ This project follows [Semantic Versioning](https://semver.org/). From **v1.0.0**
 
 ## Unreleased
 
+## 1.0.3 - 2026-05-03
+
+### Added
+
+- **Tests:** **`tests/test_ledger.py`** covers MEDIUM confidence vs **`require_high_diff_confidence`**, the LOW sample-floor boundary, **`max_latency_ms`** (skipped when latency is absent), **`max_error_rate`**, and multiple simultaneous policy failure reasons; **`tests/test_spine.py`** covers MEDIUM confidence blocking a second **`release promote`**, **`runs ingest`** on an empty file / malformed JSONL / JSON array payload, and **`release diff`** across different pricing providers and across different models on one provider table (plus a **`POST /v1/diff`** `pricing.pricing_or_model_changed` assertion).
+- **Web UI:** structured **Promote & rollback** outcome (policy badge, pointer status, action/release/baseline IDs, reason list) with raw response in a collapsed **`JsonPanel`**; **Run diff** shows a pricing/model-change callout when **`pricing.pricing_or_model_changed`** is true.
+
+### Changed
+
+- **Roadmap:** **Phase 0 progress** subsection and **Next release** pointer for **v1.0.3**; docs aligned with the patch scope above.
+
 ## 1.0.2 - 2026-05-02
 
 ### Added

diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
@@ -119,7 +119,7 @@ Merging to **`main` does not publish packages** — PyPI uploads are **tag-drive
 1. **PyPI:** add a **trusted publisher** for **[github.com/flightdeckdev/flightdeck](https://github.com/flightdeckdev/flightdeck)** — workflow **`release-pypi.yml`**. If PyPI offers **Environment name: (Any)**, you can still use a GitHub **Environment** named **`pypi`** for approval gates; otherwise match whatever you register on PyPI ([trusted publishers](https://docs.pypi.org/trusted-publishers/)).
 2. **GitHub:** Settings → **Environments** → create **`pypi`** (optional: required reviewers / wait timer before OIDC publish).
 3. Bump **`version`** in **`pyproject.toml`** and **`src/flightdeck/__init__.py`**, update **`CHANGELOG.md`**, merge to **`main`**.
-4. **`git tag vX.Y.Z`** (must match **`pyproject.toml`** exactly, e.g. **`v1.0.2`**) then **`git push origin vX.Y.Z`**.
+4. **`git tag vX.Y.Z`** (must match **`pyproject.toml`** exactly, e.g. **`v1.0.3`**) then **`git push origin vX.Y.Z`**.
 
 The workflow runs **ruff**, **pytest**, schema drift, **`uv build`**, publishes **sdist + wheel** to **PyPI** via **OIDC** (no long-lived API token in repo secrets), enables **publish attestations**, and creates a **GitHub Release** with generated notes and **`dist/*`** assets.
 

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
@@ -4,6 +4,10 @@ High-level notes for **shipping FlightDeck**. Detailed history: **[CHANGELOG.md]
 
 Narrative docs (including the CLI reference) are maintained on **[github.com/flightdeckdev/flightdeck](https://github.com/flightdeckdev/flightdeck)** `main`; this file and **`schemas/`** ship in minimal clones.
 
+## v1.0.3 — Phase 0 hardening (tests + UI)
+
+Patch release (see **[CHANGELOG.md](CHANGELOG.md)**): broader **pytest** coverage for **`diff_releases`** (MEDIUM/LOW confidence, **`max_latency_ms`**, **`max_error_rate`**, combined failures), **CLI** integration for MEDIUM confidence blocking promotion when **`require_high_diff_confidence`** is on, **`runs ingest`** edge cases (empty file, bad JSONL, JSON array file), and **multi-provider / cross-model** **`release diff`** plus **`POST /v1/diff`** parity on **`pricing.pricing_or_model_changed`**. **Web UI:** promote/rollback responses use structured panels (raw JSON optional); **Run diff** surfaces the same pricing/model-change note as the CLI when the diff payload flags it. **Stable contracts:** no CLI flag removals, no **`v1`** schema or **`POST /v1/events`** shape changes; **HTTP** diff and action response shapes are unchanged (additive UI only on the client).
+
 ## v1.0.2 — CI examples, serve packaging, and policy gate CLI
 
 Minor release (see **[CHANGELOG.md](CHANGELOG.md)**): **`flightdeck release diff --fail-on-policy`** for CI gates; **`examples/ci/`** (`ledger-gate.sh`, GitHub Actions templates) exercised in root CI; **`examples/deploy/`** (Docker/Compose for **`flightdeck serve`**); **`examples/integration/`** (SDK sample emitter for **`POST /v1/events`**); **`GET /health`** adds non-secret **`mutation_auth`** (`loopback` vs `bearer`); web shell shows mutation/token ergonomics and optional read-only UI (**`VITE_FLIGHTDECK_UI_READ_ONLY`**). Fix: policy **`min_*_runs`** explicit **`0`** overrides workspace defaults ( **`is not None`** resolution in **`diff_releases`** ). **Stable contracts:** additive **`/health`** field only; CLI flag is backward-compatible.
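
The **`runs ingest`** edge cases named in the v1.0.3 notes above (empty file, bad JSONL, JSON array file) can be pictured with a small parser. This is only a sketch under stated assumptions: `parse_runs_payload` is a hypothetical helper, not FlightDeck's ingest code, and the choice to raise `ValueError` (so a CLI wrapper could exit non-zero) is an assumption about how the behaviour might be implemented.

```python
import json


def parse_runs_payload(text: str) -> list[dict]:
    """Parse a runs file that is either JSONL or a single JSON array.

    Hypothetical helper for illustration: returns [] for an empty file and
    raises ValueError on malformed input.
    """
    stripped = text.strip()
    if not stripped:
        return []  # empty file: nothing to insert
    if stripped.startswith("["):
        # JSON array payload; json.JSONDecodeError is a ValueError subclass.
        records = json.loads(stripped)
    else:
        records = []
        for lineno, line in enumerate(text.splitlines(), start=1):
            if not line.strip():
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
    if not all(isinstance(record, dict) for record in records):
        raise ValueError("every record must be a JSON object")
    return records
```

Reporting the line number in the error is a design choice that makes a malformed JSONL file easier to repair; the real CLI may report failures differently.
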

diff --git a/ROADMAP.md b/ROADMAP.md
@@ -19,6 +19,12 @@ This roadmap is meant to be clear from **what is already shipped** to **near-ter
 
 ---
 
+## Next release
+
+**v1.0.3** (patch): Phase 0 hardening — expanded **pytest** coverage for diff confidence (MEDIUM/LOW, policy on latency and error rate), **runs ingest** edge cases (empty file, malformed JSONL, JSON array payload), and **multi-provider / cross-model** `release diff` paths; web UI structured **promote/rollback** outcome plus a **pricing/model changed** banner on **Run diff** when the API reports it. See **[CHANGELOG.md](CHANGELOG.md)** and **[RELEASE_NOTES.md](RELEASE_NOTES.md)**. No breaking changes to stable CLI, HTTP, or **`api_version` `v1`** contracts.
+
+---
+
 ## Production readiness gaps (why it can feel standalone)
 
 These are current gaps between "works locally" and "easy to use across production services."
@@ -47,6 +53,17 @@ Goal: prove the wedge with real teams using FlightDeck as release governance sou
 - Strengthen local security ergonomics: explicit token/env status in UI, mutation guardrails, optional read-only UX.
 - Continue UI productization for current scope (structured views over raw JSON where stable).
 
+### Phase 0 progress (toward v1.0.3)
+
+Shipped on **`main`** for the next patch:
+
+- **Policy / diff tests:** `diff_releases` coverage for MEDIUM confidence vs `require_high_diff_confidence`, LOW sample floor boundaries, `max_latency_ms` (including skip when latency is absent), `max_error_rate`, and stacked policy failure reasons; CLI integration for MEDIUM blocking a second promotion after a baseline is established.
+- **Ingest tests:** empty JSONL (zero inserts), malformed line (non-zero exit), JSON array file accepted.
+- **Multi-provider pricing:** integration tests that diff baseline vs candidate releases with different **`pricing_reference`** providers (and same-provider different models), including parity checks on **`POST /v1/diff`** `pricing.pricing_or_model_changed`.
+- **Web UI:** structured outcome card after promote/rollback (policy, pointer, IDs) with raw JSON in a collapsible panel; Diff summary shows pricing/model change when the server marks it.
+
+**Still open in Phase 0** (see gaps table and Phase 1 for larger items): richer **pricing normalization** product semantics (beyond per-side tables + flags), broader **integration** and **deployment** narrative in docs, and **observability** paths remain roadmap-sized rather than single-patch work.
+
 ### Phase-0 success signals
 
 - Teams use release versioning + checksum verification as the source of truth for promotion decisions.
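
The confidence tiers exercised by the new policy / diff tests can be read as a two-threshold rule. The sketch below is inferred from the test expectations in this PR only (49 runs against a floor of 50 gives LOW, 50 runs gives MEDIUM, all-zero thresholds give HIGH); `classify_confidence` is a hypothetical name, and the real `diff_releases` logic is not shown in this diff and may differ.

```python
def classify_confidence(
    baseline_runs: int,
    candidate_runs: int,
    *,
    min_baseline_runs: int,
    min_candidate_runs: int,
    min_low_runs: int,
) -> str:
    """Return "LOW", "MEDIUM", or "HIGH" from sample counts (assumed rule)."""
    # Either side below the LOW floor: too few samples to trust the diff.
    if baseline_runs < min_low_runs or candidate_runs < min_low_runs:
        return "LOW"
    # Above the floor but short of the per-side minimums: usable, not HIGH.
    if baseline_runs < min_baseline_runs or candidate_runs < min_candidate_runs:
        return "MEDIUM"
    return "HIGH"
```

Under this reading, a policy with `require_high_diff_confidence` enabled would reject both LOW and MEDIUM results, which matches the MEDIUM-blocks-promotion behaviour described above.
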

diff --git a/examples/ci/README.md b/examples/ci/README.md
@@ -37,7 +37,7 @@ uv run python examples/ci/ledger_gate.py
 Example (**PyPI** install):
 
 ```bash
-pip install "flightdeck-ai>=1.0.2"
+pip install "flightdeck-ai>=1.0.3"
 export WORKSPACE="$(mktemp -d)"
 export QUICKSTART_ROOT=/path/to/flightdeck/examples/quickstart
 python /path/to/flightdeck/examples/ci/ledger_gate.py

diff --git a/examples/ci/github-actions/policy-gate-pypi.yml b/examples/ci/github-actions/policy-gate-pypi.yml
@@ -11,7 +11,7 @@ on:
 env:
   # Pin to a tag or SHA that matches your installed flightdeck-ai version when possible.
   FLIGHTDECK_REF: main
-  FLIGHTDECK_AI_SPEC: ">=1.0.2"
+  FLIGHTDECK_AI_SPEC: ">=1.0.3"
 
 jobs:
   ledger-gate:

diff --git a/examples/deploy/Dockerfile b/examples/deploy/Dockerfile
@@ -2,7 +2,7 @@
 FROM python:3.14-slim
 
 RUN pip install --no-cache-dir --upgrade pip \
-    && pip install --no-cache-dir "flightdeck-ai>=1.0.2"
+    && pip install --no-cache-dir "flightdeck-ai>=1.0.3"
 
 WORKDIR /workspace
 

diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "flightdeck-ai"
-version = "1.0.2"
+version = "1.0.3"
 description = "AI Release Governance for production agents."
 readme = "README.md"
 license = "Apache-2.0"

diff --git a/src/flightdeck/__init__.py b/src/flightdeck/__init__.py
@@ -1,3 +1,3 @@
 """FlightDeck - AI Release Governance for production agents."""
 
-__version__ = "1.0.2"
+__version__ = "1.0.3"
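
The release steps in **`DEVELOPMENT.md`** require the pushed tag to match **`pyproject.toml`** exactly, and this PR bumps the version in both **`pyproject.toml`** and **`src/flightdeck/__init__.py`**. A pre-tag check along the following lines could catch a mismatch. It is a sketch, not part of the repository: the function names are hypothetical, and the regular expressions are a simplification (a TOML parser such as `tomllib` would be more robust than matching the first `version = "..."` line).

```python
import re


def version_from_pyproject(pyproject_text: str) -> str:
    """Return the first `version = "..."` value found in pyproject.toml text."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE)
    if match is None:
        raise ValueError("no version field found in pyproject.toml text")
    return match.group(1)


def version_from_init(init_text: str) -> str:
    """Return the __version__ string assigned in a package __init__.py."""
    match = re.search(r'^__version__\s*=\s*"([^"]+)"', init_text, re.MULTILINE)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def tag_matches(tag: str, pyproject_text: str, init_text: str) -> bool:
    """True when tag is "v" + version and both files agree on that version."""
    version = version_from_pyproject(pyproject_text)
    return tag == f"v{version}" and version_from_init(init_text) == version
```

A script could read both files, call `tag_matches("v1.0.3", ...)`, and refuse to create the tag when it returns `False`.
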
diff --git a/src/flightdeck/server/static/assets/index-B_1jz54d.js b/src/flightdeck/server/static/assets/index-B_1jz54d.js
diff --git a/src/flightdeck/server/static/assets/index-Be9J5wBP.js b/src/flightdeck/server/static/assets/index-Be9J5wBP.js
diff --git a/src/flightdeck/server/static/index.html b/src/flightdeck/server/static/index.html
@@ -4,7 +4,7 @@
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
     <title>FlightDeck</title>
-    <script type="module" crossorigin src="/assets/index-Be9J5wBP.js"></script>
+    <script type="module" crossorigin src="/assets/index-B_1jz54d.js"></script>
     <link rel="stylesheet" crossorigin href="/assets/index-Dl91dBdu.css">
   </head>
   <body>

diff --git a/tests/test_ledger.py b/tests/test_ledger.py
@@ -17,7 +17,16 @@
 )
 
 
-def _event(*, agent_id: str, run_id: str, release_id: str) -> RunEvent:
+def _event(
+    *,
+    agent_id: str,
+    run_id: str,
+    release_id: str,
+    latency_ms: int | None = 100,
+    success: bool = True,
+    input_tokens: int = 100,
+    output_tokens: int = 50,
+) -> RunEvent:
     return RunEvent(
         timestamp=datetime.now(tz=timezone.utc),
         agent_id=agent_id,
@@ -30,11 +39,11 @@ def _event(*, agent_id: str, run_id: str, release_id: str) -> RunEvent:
             model=RunEventModelUsage(
                 provider="openai",
                 model="gpt-4.1-mini",
-                input_tokens=100,
-                output_tokens=50,
+                input_tokens=input_tokens,
+                output_tokens=output_tokens,
             )
         ),
-        metrics=RunEventMetrics(latency_ms=100, success=True),
+        metrics=RunEventMetrics(latency_ms=latency_ms, success=success),
     )
 
 
@@ -113,3 +122,228 @@ def test_diff_releases_respects_zero_policy_sample_thresholds() -> None:
 
     assert result.confidence == "HIGH"
     assert result.policy.passed
+
+
+def _events(*, n: int, release_id: str, agent_id: str = "agent_a", **kwargs) -> list[RunEvent]:
+    return [_event(agent_id=agent_id, run_id=f"{release_id}_{i}", release_id=release_id, **kwargs) for i in range(n)]
+
+
+def test_medium_confidence_blocks_when_require_high_flag_set() -> None:
+    cfg = WorkspaceConfig()
+    policy = Policy(require_high_diff_confidence=True)
+    table = _pricing_table()
+
+    result = diff_releases(
+        cfg=cfg,
+        policy=policy,
+        baseline_events=_events(n=200, release_id="rel_b"),
+        candidate_events=_events(n=200, release_id="rel_c"),
+        baseline_pricing_table=table,
+        candidate_pricing_table=table,
+        window="7d",
+    )
+
+    assert result.confidence == "MEDIUM"
+    assert not result.policy.passed
+    assert any("MEDIUM" in r for r in result.policy.reasons)
+    assert any("HIGH" in r for r in result.policy.reasons)
+
+
+def test_medium_confidence_passes_without_require_high_flag() -> None:
+    cfg = WorkspaceConfig()
+    policy = Policy(require_high_diff_confidence=False)
+    table = _pricing_table()
+
+    result = diff_releases(
+        cfg=cfg,
+        policy=policy,
+        baseline_events=_events(n=200, release_id="rel_b"),
+        candidate_events=_events(n=200, release_id="rel_c"),
+        baseline_pricing_table=table,
+        candidate_pricing_table=table,
+        window="7d",
+    )
+
+    assert result.confidence == "MEDIUM"
+    assert result.policy.passed
+
+
+def test_confidence_reason_populated_for_medium_and_low() -> None:
+    cfg = WorkspaceConfig()
+    policy = Policy(require_high_diff_confidence=False)
+    table = _pricing_table()
+
+    medium = diff_releases(
+        cfg=cfg,
+        policy=policy,
+        baseline_events=_events(n=200, release_id="rel_b"),
+        candidate_events=_events(n=200, release_id="rel_c"),
+        baseline_pricing_table=table,
+        candidate_pricing_table=table,
+        window="7d",
+    )
+    assert medium.confidence == "MEDIUM"
+    assert medium.confidence_reason
+    assert "sample" in medium.confidence_reason
+
+    low = diff_releases(
+        cfg=cfg,
+        policy=policy,
+        baseline_events=_events(n=10, release_id="rel_b"),
+        candidate_events=_events(n=200, release_id="rel_c"),
+        baseline_pricing_table=table,
+        candidate_pricing_table=table,
+        window="7d",
+    )
+    assert low.confidence == "LOW"
+    assert low.confidence_reason
+    assert "sample" in low.confidence_reason or "floor" in low.confidence_reason
+
+
+def test_low_floor_boundary() -> None:
+    cfg = WorkspaceConfig()
+    # Override defaults so we can drive the LOW floor at runs=50 deterministically.
+    policy = Policy(
+        min_baseline_runs=500,
+        min_candidate_runs=500,
+        min_low_runs=50,
+        require_high_diff_confidence=False,
+    )
+    table = _pricing_table()
+
+    just_below = diff_releases(
+        cfg=cfg,
+        policy=policy,
+        baseline_events=_events(n=49, release_id="rel_b"),
+        candidate_events=_events(n=200, release_id="rel_c"),
+        baseline_pricing_table=table,
+        candidate_pricing_table=table,
+        window="7d",
+    )
+    assert just_below.confidence == "LOW"
+
+    at_floor = diff_releases(
+        cfg=cfg,
+        policy=policy,
+        baseline_events=_events(n=50, release_id="rel_b"),
+        candidate_events=_events(n=200, release_id="rel_c"),
+        baseline_pricing_table=table,
+        candidate_pricing_table=table,
+        window="7d",
+    )
+    assert at_floor.confidence == "MEDIUM"
+
+
+def test_policy_max_latency_ms_blocks() -> None:
+    cfg = WorkspaceConfig()
+    policy = Policy(
+        max_latency_ms=50,
+        min_baseline_runs=0,
+        min_candidate_runs=0,
+        min_low_runs=0,
+        require_high_diff_confidence=False,
+    )
+    table = _pricing_table()
+
+    result = diff_releases(
+        cfg=cfg,
+        policy=policy,
+        baseline_events=_events(n=5, release_id="rel_b", latency_ms=100),
+        candidate_events=_events(n=5, release_id="rel_c", latency_ms=200),
+        baseline_pricing_table=table,
+        candidate_pricing_table=table,
+        window="7d",
+    )
+
+    assert not result.policy.passed
+    assert any("latency_ms_avg" in r for r in result.policy.reasons)
+
+
+def test_policy_max_latency_ms_skipped_when_no_data() -> None:
+    cfg = WorkspaceConfig()
+    policy = Policy(
+        max_latency_ms=50,
+        min_baseline_runs=0,
+        min_candidate_runs=0,
+        min_low_runs=0,
+        require_high_diff_confidence=False,
+    )
+    table = _pricing_table()
+
+    result = diff_releases(
+        cfg=cfg,
+        policy=policy,
+        baseline_events=_events(n=5, release_id="rel_b", latency_ms=None),
+        candidate_events=_events(n=5, release_id="rel_c", latency_ms=None),
+        baseline_pricing_table=table,
+        candidate_pricing_table=table,
+        window="7d",
+    )
+
+    assert result.candidate.latency_ms_avg is None
+    assert result.policy.passed
+    assert not any("latency" in r for r in result.policy.reasons)
+
+
+def test_policy_max_error_rate_blocks() -> None:
+    cfg = WorkspaceConfig()
+    policy = Policy(
+        max_error_rate=0.1,
+        min_baseline_runs=0,
+        min_candidate_runs=0,
+        min_low_runs=0,
+        require_high_diff_confidence=False,
+    )
+    table = _pricing_table()
+
+    candidate_events = [
+        _event(agent_id="agent_a", run_id=f"c_{i}", release_id="rel_c", success=(i < 4))
+        for i in range(8)  # runs 0-3 succeed, 4-7 fail: error rate 0.5
+    ]
+
+    result = diff_releases(
+        cfg=cfg,
+        policy=policy,
+        baseline_events=_events(n=5, release_id="rel_b"),
+        candidate_events=candidate_events,
+        baseline_pricing_table=table,
+        candidate_pricing_table=table,
+        window="7d",
+    )
+
+    assert result.candidate.error_rate == 0.5
+    assert not result.policy.passed
+    assert any("error_rate" in r for r in result.policy.reasons)
+
+
+def test_policy_multiple_failures_accumulate() -> None:
+    cfg = WorkspaceConfig()
+    policy = Policy(
+        max_cost_per_run_usd=0.0001,
+        max_error_rate=0.1,
+        min_baseline_runs=0,
+        min_candidate_runs=0,
+        min_low_runs=0,
+        require_high_diff_confidence=False,
+    )
+    table = _pricing_table()
+
+    candidate_events = [
+        _event(agent_id="agent_a", run_id=f"c_{i}", release_id="rel_c", success=(i < 4))
+        for i in range(8)  # runs 0-3 succeed, 4-7 fail: error rate 0.5
+    ]
+
+    result = diff_releases(
+        cfg=cfg,
+        policy=policy,
+        baseline_events=_events(n=5, release_id="rel_b"),
+        candidate_events=candidate_events,
+        baseline_pricing_table=table,
+        candidate_pricing_table=table,
+        window="7d",
+    )
+
+    assert not result.policy.passed
+    assert any("cost_per_run_usd" in r for r in result.policy.reasons)
+    assert any("error_rate" in r for r in result.policy.reasons)
+    assert len(result.policy.reasons) >= 2