feat(safety): decode-time safety control loop with checkpointed rollback (ADR-012)#5

Merged
Tovli merged 1 commit into master from feature/addSecurityHandle
Jun 18, 2026
@Tovli Tovli commented Jun 18, 2026

Summary

Implements ADR-012 — a layered decode-time safety control loop with checkpointed rollback. ADR-005 decided which safety mode runs and where the LogitAdjustment sits in a decode step; it did not decide how steering recovers once generation has drifted unsafe. This PR wraps that per-token steering in a recoverable, fully on-device (air-gapped) control loop that catches unsafe drift mid-generation and rewinds to the last safe prefix.

What's included

Core safety primitives (el-safety)

  • ChunkGuard / SafetyScore — deterministic, float-free risk scoring of recent output (integer milli-units).
  • RollbackPolicy — tier-aware cadence and bounds (guard_every, soft/hard thresholds, max_rollbacks, max_checkpoints).
  • CheckpointManager / Checkpoint — bounded ring of safe-prefix snapshots; offsets only, KV payload never copied.
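
The three primitives above can be sketched as follows. This is a minimal illustration, not the el-safety source: the type names come from this PR, but every field type, the `push` / `last_safe` methods, and the toy `score_chunk` function are assumptions made for the example.

```rust
use std::collections::VecDeque;

/// Risk in integer milli-units (0..=1000). No floats are involved, so the
/// same input always yields the same score on every platform.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct SafetyScore(pub u32);

/// Hypothetical policy shape: field names follow the PR text, types are assumed.
#[derive(Debug, Clone)]
pub struct RollbackPolicy {
    pub guard_every: usize,
    pub soft_threshold: SafetyScore,
    pub hard_threshold: SafetyScore,
    pub max_rollbacks: u32,
    pub max_checkpoints: usize,
}

/// A safe-prefix snapshot that stores offsets only, never KV payload.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Checkpoint {
    pub output_len: usize,
    pub kv_len: usize,
}

/// Bounded ring: once full, the oldest checkpoint is evicted.
pub struct CheckpointManager {
    ring: VecDeque<Checkpoint>,
    capacity: usize,
}

impl CheckpointManager {
    pub fn new(capacity: usize) -> Self {
        Self { ring: VecDeque::with_capacity(capacity), capacity }
    }

    pub fn push(&mut self, cp: Checkpoint) {
        if self.capacity == 0 {
            return; // nothing can be retained
        }
        if self.ring.len() == self.capacity {
            self.ring.pop_front();
        }
        self.ring.push_back(cp);
    }

    /// Most recent safe checkpoint, if any survives.
    pub fn last_safe(&self) -> Option<Checkpoint> {
        self.ring.back().copied()
    }

    pub fn len(&self) -> usize {
        self.ring.len()
    }

    pub fn is_empty(&self) -> bool {
        self.ring.is_empty()
    }
}

/// Toy stand-in for ChunkGuard: flagged tokens per thousand, integer math only.
pub fn score_chunk(tokens: &[u32], flagged: &[u32]) -> SafetyScore {
    if tokens.is_empty() {
        return SafetyScore(0);
    }
    let hits = tokens.iter().filter(|t| flagged.contains(*t)).count() as u32;
    SafetyScore(hits * 1000 / tokens.len() as u32)
}
```

With a capacity of 2, pushing a third checkpoint evicts the first, which is how `max_checkpoints` would bound memory in this sketch.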

Runtime control loop (el-runtime)

  • InferenceSession::generate_with_policy preserves the invariant order grammar mask → safety adjust → sample → commit, checkpoints at each guard-verified-safe boundary, scores every guard_every tokens, and on a hard-threshold breach rolls KV and output back to the last safe checkpoint — banning the divergence token so the resumed decode diverges.
  • Mandatory final guard check before EOS / max_tokens termination, so a tail shorter than guard_every (or an unsafe completion ending in EOS) can never be returned unscored.
  • Bounded, fail-closed: rollbacks are capped; on exhaustion — or under memory pressure with no checkpoint — the loop refuses deterministically.
  • el-memory::KvRegion::truncate: O(dropped) descriptor rewind, no payload copy.
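
The loop described above can be sketched in a few dozen lines. This is a simplified model under stated assumptions, not `generate_with_policy` itself: the function name, its parameters, and the `Outcome` enum are hypothetical; `next_token` stands in for grammar mask → safety adjust → sample; the "divergence token" is assumed to be the first token sampled after the checkpoint; and the empty prefix always counts as a checkpoint, so the memory-pressure refusal path is not modelled.

```rust
/// Result of the sketched loop: a completion, or a deterministic refusal.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Completed(Vec<u32>),
    Refused,
}

/// Toy decode loop with checkpointed rollback.
/// `next_token(prefix, banned)` models mask -> adjust -> sample.
/// `score(recent)` models the chunk guard and returns milli-units.
pub fn generate_with_rollback(
    mut next_token: impl FnMut(&[u32], &[u32]) -> u32,
    score: impl Fn(&[u32]) -> u32,
    eos: u32,
    max_tokens: usize,
    guard_every: usize,
    hard_threshold: u32,
    max_rollbacks: u32,
) -> Outcome {
    let mut out: Vec<u32> = Vec::new();
    let mut safe_len = 0usize; // length of the last guard-verified-safe prefix
    let mut banned: Vec<u32> = Vec::new(); // tokens banned right after the checkpoint
    let mut rollbacks = 0u32;

    loop {
        // The ban only applies to the first position after the checkpoint.
        let ban: &[u32] = if out.len() == safe_len { banned.as_slice() } else { &[] };
        let tok = next_token(&out, ban);
        out.push(tok); // commit

        let terminal = tok == eos || out.len() >= max_tokens;
        let due = out.len() - safe_len >= guard_every;
        if !(terminal || due) {
            continue;
        }

        // The guard runs on cadence and always before termination.
        if score(&out[safe_len..]) >= hard_threshold {
            if rollbacks == max_rollbacks {
                return Outcome::Refused; // fail closed once the budget is spent
            }
            rollbacks += 1;
            banned.push(out[safe_len]); // ban the divergence token
            out.truncate(safe_len); // rewind output to the safe prefix
            continue;
        }
        if terminal {
            return Outcome::Completed(out);
        }
        safe_len = out.len(); // new checkpoint at a verified-safe boundary
        banned.clear();
    }
}
```

Because the terminal check forces a guard pass, a two-token completion with `guard_every = 4` is still scored before it can be returned. A real session would also rewind the KV region and call the engine's rollback at the point where this sketch truncates `out`.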

Engine rollback contract

  • InferenceEngine::rollback is required (no default) so no engine can silently resume on a stale KV cache (fail-open). Stateless engines implement a no-op; QwenEngine (candle) rebuilds the safe prefix by replaying the prompt from index_pos 0, since candle 0.8.4 exposes no in-place cache truncation.
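
A minimal sketch of why a required method matters here. The trait below borrows the `InferenceEngine` name from this PR, but its method signatures are assumptions, and `ReplayEngine` is a plain mock of an append-only cache: it uses no candle API and is not the QwenEngine code.

```rust
/// Hypothetical trait shape. `rollback` has no default body, so every
/// implementor must decide explicitly how to discard stale state.
pub trait InferenceEngine {
    fn feed(&mut self, token: u32);
    fn cached_len(&self) -> usize;
    fn rollback(&mut self, keep: usize);
}

/// An engine with no cross-step state: rollback is an explicit no-op.
pub struct StatelessEngine;

impl InferenceEngine for StatelessEngine {
    fn feed(&mut self, _token: u32) {}
    fn cached_len(&self) -> usize {
        0
    }
    fn rollback(&mut self, _keep: usize) {}
}

/// Mock of an append-only cache that cannot be truncated in place.
/// Rollback clears the cache and replays the kept prefix from position 0.
#[derive(Default)]
pub struct ReplayEngine {
    history: Vec<u32>,
    cache: Vec<u32>,
    /// Total tokens re-fed across all rollbacks (the replay cost).
    pub replayed: usize,
}

impl InferenceEngine for ReplayEngine {
    fn feed(&mut self, token: u32) {
        self.history.push(token);
        self.cache.push(token);
    }

    fn cached_len(&self) -> usize {
        self.cache.len()
    }

    fn rollback(&mut self, keep: usize) {
        self.history.truncate(keep);
        self.cache.clear();
        for &t in &self.history {
            self.cache.push(t); // stands in for a forward pass over the prefix
            self.replayed += 1;
        }
    }
}
```

If `rollback` had a default empty body, `ReplayEngine` could compile without overriding it and keep decoding on a cache that still contains the discarded tokens; making the method required turns that mistake into a compile error.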

Docs

  • New ADR-012, updated Safety DDD context + domain events, el-safety README, and the supporting SecDecoding research that motivates the design.
  • Selective soft-steering over an early-token window is documented as a deferred SecDecoding follow-up, not current behavior.

Cost model note

The session-layer rollback is O(dropped) and replay-free. The engine-layer cost is adapter-dependent: stateless engines are O(1), but an append-only-cache transformer (candle/QwenEngine) replays prompt + kept prefix per rollback — bounded by max_rollbacks. ADR-012's "Consequences" documents this explicitly.
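
The two cost regimes can be illustrated with a small sketch. It assumes a KV region tracked as a list of block descriptors, which is one plausible reading of "descriptor rewind"; `BlockDesc`, the `KvRegion` fields, `append`, `block_count`, and `worst_case_replay_tokens` are all hypothetical names introduced for the example.

```rust
/// Hypothetical descriptor: where a run of KV entries lives, not the entries.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockDesc {
    pub start: usize,
    pub len: usize,
}

#[derive(Default)]
pub struct KvRegion {
    blocks: Vec<BlockDesc>,
    tokens: usize,
}

impl KvRegion {
    /// Record a new block of `len` tokens at the end of the region.
    pub fn append(&mut self, len: usize) {
        self.blocks.push(BlockDesc { start: self.tokens, len });
        self.tokens += len;
    }

    pub fn len(&self) -> usize {
        self.tokens
    }

    pub fn is_empty(&self) -> bool {
        self.tokens == 0
    }

    pub fn block_count(&self) -> usize {
        self.blocks.len()
    }

    /// Rewind to `keep` tokens. Only descriptors past the cut are visited,
    /// and no payload is copied. Returns how many descriptors were touched.
    pub fn truncate(&mut self, keep: usize) -> usize {
        let mut touched = 0;
        while let Some(&last) = self.blocks.last() {
            if last.start + last.len <= keep {
                break; // everything remaining is inside the kept prefix
            }
            touched += 1;
            if last.start >= keep {
                self.blocks.pop(); // block lies entirely past the cut
            } else {
                let i = self.blocks.len() - 1;
                self.blocks[i].len = keep - last.start; // shorten straddling block
                break;
            }
        }
        self.tokens = self.tokens.min(keep);
        touched
    }
}

/// Upper bound on tokens a replaying engine re-processes over one generation:
/// each rollback replays at most the prompt plus the longest kept prefix.
pub fn worst_case_replay_tokens(
    prompt_len: usize,
    max_kept_prefix: usize,
    max_rollbacks: usize,
) -> usize {
    max_rollbacks.saturating_mul(prompt_len.saturating_add(max_kept_prefix))
}
```

For example, with a 100-token prompt, a kept prefix of at most 50 tokens, and `max_rollbacks = 3`, a replaying engine re-processes at most 3 × 150 = 450 tokens, while the session-layer rewind only touches the descriptors past the cut.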

Testing

  • cargo test --workspace — all green (77 tests; el-runtime 15, el-engine-candle 14, el-safety 6, …).
  • cargo clippy --all-targets and cargo fmt --all --check clean.
  • Regression tests added for: EOS / max_tokens guard-bypass, terminal-breach rollback recovery, and the session→engine rollback propagation (stateful mock engine mirroring QwenEngine's KV hazard).

🤖 Generated with Claude Code

…ack (ADR-012)

Wrap the ADR-005 per-token steering in a recoverable, on-device control loop
that catches unsafe drift mid-generation and rewinds to the last safe prefix,
instead of treating safety as a one-shot per-token gate.

Core primitives (el-safety):
- ChunkGuard / SafetyScore: deterministic, float-free risk scoring of recent
  output (integer milli-units).
- RollbackPolicy: tier-aware cadence + bounds (guard_every, soft/hard
  thresholds, max_rollbacks, max_checkpoints).
- CheckpointManager / Checkpoint: bounded ring of safe-prefix snapshots
  (offsets only; KV payload never copied).

Runtime control loop (el-runtime):
- InferenceSession::generate_with_policy drives grammar mask -> safety adjust
  -> sample -> commit, captures a checkpoint at each guard-verified-safe
  boundary, scores every guard_every tokens, and on a hard breach rolls KV +
  output back to the last safe checkpoint, banning the divergence token.
- Mandatory final guard check before EOS / max_tokens termination, so a tail
  shorter than guard_every (or an unsafe completion ending in EOS) is never
  returned unscored.
- Bounded, fail-closed: rollbacks capped; on exhaustion or under memory
  pressure (no checkpoint) the loop refuses deterministically.
- el-memory: KvRegion::truncate -- O(dropped) descriptor rewind, no replay.

Engine rollback contract:
- InferenceEngine::rollback is required (no default), so no engine can silently
  resume on a stale KV cache (fail-open). Stateless engines implement a no-op;
  QwenEngine (candle) rebuilds the safe prefix by replaying the prompt from
  index_pos 0 (candle exposes no in-place cache truncation).

Docs: ADR-012, Safety DDD context + domain events, el-safety README, and the
supporting SecDecoding research. Selective soft-steering over an early-token
window is documented as a deferred SecDecoding follow-up, not current behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Tovli Tovli merged commit 4de69ff into master Jun 18, 2026
5 checks passed
@Tovli Tovli deleted the feature/addSecurityHandle branch June 18, 2026 17:42