feat(safety): decode-time safety control loop with checkpointed rollback (ADR-012) by Tovli · Pull Request #5 · Tovli/EdgeIntelligence

Tovli · 2026-06-18T12:59:29Z

Summary

Implements ADR-012 — a layered decode-time safety control loop with checkpointed rollback. ADR-005 decided which safety mode runs and where the LogitAdjustment sits in a decode step; it did not decide how steering recovers once generation has drifted unsafe. This PR wraps that per-token steering in a recoverable, fully on-device (air-gapped) control loop that catches unsafe drift mid-generation and rewinds to the last safe prefix.

What's included

Core safety primitives (el-safety)

ChunkGuard / SafetyScore — deterministic, float-free risk scoring of recent output (integer milli-units).
RollbackPolicy — tier-aware cadence and bounds (guard_every, soft/hard thresholds, max_rollbacks, max_checkpoints).
CheckpointManager / Checkpoint — bounded ring of safe-prefix snapshots; offsets only, KV payload never copied.
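
As a rough illustration of how these three primitives could fit together, here is a minimal sketch. It is not the el-safety code: the field layout, the 0..=1000 score range, the `>=` threshold comparison, and the helpers `score_window`, `classify`, and `last_safe` are all assumptions made for the example. Only the type names and the policy field names come from the list above.

```rust
use std::collections::VecDeque;

/// Risk in integer milli-units (assumed range 0..=1000). No floats, so the
/// same input always yields the same score on every device.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct SafetyScore(pub u32);

/// Hypothetical layout using the field names from the PR description.
#[derive(Debug, Clone)]
pub struct RollbackPolicy {
    pub guard_every: usize,
    pub soft_threshold: SafetyScore,
    pub hard_threshold: SafetyScore,
    pub max_rollbacks: usize,
    pub max_checkpoints: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Safe,
    Soft,
    Hard,
}

impl RollbackPolicy {
    /// Assumed semantics: a score at or above a threshold trips it.
    pub fn classify(&self, score: SafetyScore) -> Verdict {
        if score >= self.hard_threshold {
            Verdict::Hard
        } else if score >= self.soft_threshold {
            Verdict::Soft
        } else {
            Verdict::Safe
        }
    }
}

/// Toy stand-in for a chunk guard: share of flagged tokens in the window,
/// computed with integer arithmetic only.
pub fn score_window(window: &[u32], flagged: &[u32]) -> SafetyScore {
    if window.is_empty() {
        return SafetyScore(0);
    }
    let hits = window.iter().filter(|&&t| flagged.contains(&t)).count() as u32;
    SafetyScore(hits * 1000 / window.len() as u32)
}

/// A checkpoint records offsets only; it never owns KV payload.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Checkpoint {
    pub output_len: usize,
    pub kv_len: usize,
}

/// Bounded ring: once full, the oldest checkpoint is evicted.
pub struct CheckpointManager {
    ring: VecDeque<Checkpoint>,
    cap: usize,
}

impl CheckpointManager {
    pub fn new(cap: usize) -> Self {
        Self { ring: VecDeque::with_capacity(cap), cap }
    }

    pub fn push(&mut self, cp: Checkpoint) {
        if self.cap == 0 {
            return; // no room at all: caller has no checkpoint to fall back on
        }
        if self.ring.len() == self.cap {
            self.ring.pop_front();
        }
        self.ring.push_back(cp);
    }

    pub fn last_safe(&self) -> Option<Checkpoint> {
        self.ring.back().copied()
    }

    pub fn len(&self) -> usize {
        self.ring.len()
    }
}
```

With `max_checkpoints = 2`, pushing a third checkpoint evicts the first, so memory stays bounded regardless of how long generation runs.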

Runtime control loop (el-runtime)

InferenceSession::generate_with_policy preserves the invariant order grammar mask → safety adjust → sample → commit. It checkpoints at each guard-verified-safe boundary and scores every guard_every tokens. On a hard-threshold breach it rolls KV and output back to the last safe checkpoint and bans the divergence token, so the resumed decode cannot repeat the same unsafe continuation.
Mandatory final guard check before EOS / max_tokens termination, so a tail shorter than guard_every (or an unsafe completion ending in EOS) can never be returned unscored.
Bounded, fail-closed: rollbacks are capped; on exhaustion — or under memory pressure with no checkpoint — the loop refuses deterministically.
el-memory::KvRegion::truncate — O(dropped) descriptor rewind, no payload copy.
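
The loop described above can be sketched in a few dozen lines. This is a simplified model rather than the real `InferenceSession::generate_with_policy`: the signature, the `Policy` and `Outcome` types, and the two closures are hypothetical. The `next` closure stands in for the whole grammar mask → safety adjust → sample step, and `score` stands in for the chunk guard. The sketch also assumes the empty prefix counts as the initial safe checkpoint and omits the memory-pressure path.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Completed(Vec<u32>),
    Refused,
}

/// Hypothetical, trimmed-down policy for this sketch.
pub struct Policy {
    pub guard_every: usize,
    pub hard_milli: u32,
    pub max_rollbacks: usize,
}

pub fn generate_with_policy(
    policy: &Policy,
    max_tokens: usize,
    eos: u32,
    mut next: impl FnMut(&[u32], &HashSet<u32>) -> u32,
    score: impl Fn(&[u32]) -> u32,
) -> Outcome {
    let every = policy.guard_every.max(1);
    let mut out: Vec<u32> = Vec::new();
    let mut safe_len = 0usize; // last guard-verified-safe prefix length
    let mut ban_pos = 0usize; // position the bans below apply to
    let mut banned: HashSet<u32> = HashSet::new();
    let mut rollbacks = 0usize;

    loop {
        let done = out.len() >= max_tokens || out.last() == Some(&eos);
        let at_boundary = !out.is_empty() && out.len() % every == 0;

        // Score at every boundary, and always before terminating, so a short
        // tail or an EOS-terminated completion is never returned unscored.
        if done || at_boundary {
            if score(&out) >= policy.hard_milli {
                if rollbacks >= policy.max_rollbacks {
                    return Outcome::Refused; // fail closed on exhaustion
                }
                // The first token after the safe prefix is where we diverge.
                let Some(&divergence) = out.get(safe_len) else {
                    return Outcome::Refused;
                };
                if ban_pos != safe_len {
                    ban_pos = safe_len;
                    banned.clear();
                }
                banned.insert(divergence);
                out.truncate(safe_len); // the KV rewind would happen here too
                rollbacks += 1;
                continue;
            }
            safe_len = out.len(); // checkpoint: offsets only
            if done {
                return Outcome::Completed(out);
            }
        }

        // Bans only constrain the position right after the safe prefix.
        let no_bans = HashSet::new();
        let active = if ban_pos == out.len() { &banned } else { &no_bans };
        let tok = next(&out, active);
        out.push(tok); // commit
    }
}
```

Because each breach either consumes one of the `max_rollbacks` attempts or refuses, and each attempt emits at most `max_tokens` tokens, the loop always terminates.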

Engine rollback contract

InferenceEngine::rollback is a required method with no default implementation, so no engine can silently resume on a stale KV cache (which would fail open). Stateless engines implement a no-op; QwenEngine (candle) rebuilds the safe prefix by replaying the prompt and kept prefix from index_pos 0, since candle 0.8.4 exposes no in-place cache truncation.
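
To illustrate why a required method matters, here is a minimal sketch of such a contract. The trait shape, the `forward` method, and both engine structs are hypothetical and do not reproduce the real `InferenceEngine` signature. `ReplayEngine` only mimics an append-only cache that cannot be truncated in place.

```rust
/// Hypothetical trait shape. `rollback` has no default body, so an engine
/// that forgets to handle rewinds fails to compile instead of failing open.
pub trait InferenceEngine {
    fn forward(&mut self, token: u32);
    fn rollback(&mut self, keep_tokens: usize);
}

/// Stateless engine: nothing is cached, so rollback is a deliberate no-op.
pub struct StatelessEngine;

impl InferenceEngine for StatelessEngine {
    fn forward(&mut self, _token: u32) {}
    fn rollback(&mut self, _keep_tokens: usize) {}
}

/// Stand-in for an engine whose cache is append-only.
pub struct ReplayEngine {
    history: Vec<u32>,
    cache: Vec<u32>,
    pub replayed: usize,
}

impl ReplayEngine {
    pub fn new() -> Self {
        Self { history: Vec::new(), cache: Vec::new(), replayed: 0 }
    }

    pub fn cache_len(&self) -> usize {
        self.cache.len()
    }
}

impl InferenceEngine for ReplayEngine {
    fn forward(&mut self, token: u32) {
        self.history.push(token);
        self.cache.push(token);
    }

    fn rollback(&mut self, keep_tokens: usize) {
        self.history.truncate(keep_tokens);
        // No in-place truncation: drop the cache and rebuild from position 0.
        self.cache.clear();
        for &t in &self.history {
            self.cache.push(t);
            self.replayed += 1;
        }
    }
}
```

Feeding five tokens and rolling back to three leaves a three-entry cache at the cost of three replayed tokens, which is the adapter-dependent cost the contract makes explicit.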

Docs

New ADR-012, updated Safety DDD context + domain events, el-safety README, and the supporting SecDecoding research that motivates the design.
Selective soft-steering over an early-token window is documented as a deferred SecDecoding follow-up, not current behavior.

Cost model note

The session-layer rollback is O(dropped) and replay-free. The engine-layer cost is adapter-dependent: stateless engines are O(1), but an append-only-cache transformer (candle/QwenEngine) replays prompt + kept prefix per rollback — bounded by max_rollbacks. ADR-012's "Consequences" documents this explicitly.
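
The two cost regimes can be made concrete with a small sketch. The `BlockDesc` layout, the `with_blocks` constructor, and `worst_case_replay_tokens` are assumptions for illustration; only the name `KvRegion::truncate` and the bound by `max_rollbacks` come from the PR.

```rust
/// Hypothetical descriptor: records where a KV block lives, not its contents.
#[derive(Debug, Clone, Copy)]
pub struct BlockDesc {
    pub offset: usize,
    pub tokens: usize,
}

pub struct KvRegion {
    blocks: Vec<BlockDesc>,
}

impl KvRegion {
    pub fn with_blocks(n: usize, tokens_per_block: usize) -> Self {
        let blocks = (0..n)
            .map(|i| BlockDesc { offset: i * tokens_per_block, tokens: tokens_per_block })
            .collect();
        Self { blocks }
    }

    /// Session-layer rewind: only the dropped descriptors are touched and
    /// no payload is copied. Returns how many descriptors were dropped.
    pub fn truncate(&mut self, keep_blocks: usize) -> usize {
        let dropped = self.blocks.len().saturating_sub(keep_blocks);
        self.blocks.truncate(keep_blocks);
        dropped
    }

    pub fn len(&self) -> usize {
        self.blocks.len()
    }
}

/// Engine-layer upper bound for an append-only cache: every rollback replays
/// the prompt plus the kept prefix, and rollbacks are capped.
pub fn worst_case_replay_tokens(
    prompt_len: usize,
    max_kept_prefix: usize,
    max_rollbacks: usize,
) -> usize {
    max_rollbacks * (prompt_len + max_kept_prefix)
}
```

For a 100-token prompt, a kept prefix of at most 50 tokens, and `max_rollbacks = 3`, the replaying engine processes at most 3 × (100 + 50) = 450 extra tokens, while the descriptor rewind itself stays proportional to what was dropped.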

Testing

cargo test --workspace — all green (77 tests; el-runtime 15, el-engine-candle 14, el-safety 6, …).
cargo clippy --all-targets and cargo fmt --all --check clean.
Regression tests added for: EOS / max_tokens guard-bypass, terminal-breach rollback recovery, and the session→engine rollback propagation (stateful mock engine mirroring QwenEngine's KV hazard).

🤖 Generated with Claude Code

…ack (ADR-012)

Wrap the ADR-005 per-token steering in a recoverable, on-device control loop that catches unsafe drift mid-generation and rewinds to the last safe prefix, instead of treating safety as a one-shot per-token gate.

Core primitives (el-safety):
- ChunkGuard / SafetyScore: deterministic, float-free risk scoring of recent output (integer milli-units).
- RollbackPolicy: tier-aware cadence + bounds (guard_every, soft/hard thresholds, max_rollbacks, max_checkpoints).
- CheckpointManager / Checkpoint: bounded ring of safe-prefix snapshots (offsets only; KV payload never copied).

Runtime control loop (el-runtime):
- InferenceSession::generate_with_policy drives grammar mask -> safety adjust -> sample -> commit, captures a checkpoint at each guard-verified-safe boundary, scores every guard_every tokens, and on a hard breach rolls KV + output back to the last safe checkpoint, banning the divergence token.
- Mandatory final guard check before EOS / max_tokens termination, so a tail shorter than guard_every (or an unsafe completion ending in EOS) is never returned unscored.
- Bounded, fail-closed: rollbacks capped; on exhaustion or under memory pressure (no checkpoint) the loop refuses deterministically.
- el-memory: KvRegion::truncate -- O(dropped) descriptor rewind, no replay.

Engine rollback contract:
- InferenceEngine::rollback is required (no default), so no engine can silently resume on a stale KV cache (fail-open). Stateless engines implement a no-op; QwenEngine (candle) rebuilds the safe prefix by replaying the prompt from index_pos 0 (candle exposes no in-place cache truncation).

Docs: ADR-012, Safety DDD context + domain events, el-safety README, and the supporting SecDecoding research. Selective soft-steering over an early-token window is documented as a deferred SecDecoding follow-up, not current behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Tovli merged commit 4de69ff into master Jun 18, 2026
5 checks passed

Tovli deleted the feature/addSecurityHandle branch June 18, 2026 17:42

Tovli merged 1 commit into master from feature/addSecurityHandle
