Failsafe guards: approval gating, capability, token budget, hardened scheduling #25

Merged: oxyc merged 3 commits into main from failsafe-write-guards on May 26, 2026

@oxyc oxyc commented May 26, 2026

Summary

Production-safety failsafes for the AI assistant. The principle: content is safe because it's recoverable (revisions + trash); this PR brings comparable guardrails to operations that aren't recoverable, and adds containment + observability. Companion to generoi/gds-mcp#18 (the data-layer guards these rely on).

Human-approval gating for irreversible writes (DestructiveGuard)

Previously, approval was required only for tools classified as dangerous, or for moderate tools annotated as destructive. Edits that are irreversible in practice but flagged non-destructive slipped through and executed silently. The following are now gated:

  • forms-update when it rewrites the field structure
  • terms-update, feeds-update, redirects-manage (create)
  • translations-link, strings-update, translations-machine
  • memory-save (it persists into every future system prompt — anti-poisoning)

On approval, DestructiveGuard injects confirm_destructive into the tool call. The matching gds-mcp guards consume that flag, so an approved change passes the data layer while non-chat callers stay protected.
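
The gating flow described in this section can be sketched roughly as follows. This is a language-neutral illustration in Python, not the plugin's code: the `ToolCall` type, the `fields` and `action` argument names, and the `data_layer_allows` stand-in for the gds-mcp guard are all assumptions. Only the tool names and the `confirm_destructive` flag come from the description above.

```python
from dataclasses import dataclass, field

# Tools gated unconditionally, taken from the list above.
ALWAYS_GATED = {
    "terms-update", "feeds-update", "translations-link",
    "strings-update", "translations-machine", "memory-save",
}


@dataclass
class ToolCall:  # hypothetical container for a tool invocation
    name: str
    args: dict = field(default_factory=dict)


def needs_approval(call: ToolCall) -> bool:
    """Return True when the call must wait for a human decision."""
    if call.name in ALWAYS_GATED:
        return True
    # forms-update is gated only when it rewrites the field structure;
    # the "fields" argument name is an assumption.
    if call.name == "forms-update":
        return "fields" in call.args
    # redirects-manage is gated only for creation; "action" is an assumption.
    if call.name == "redirects-manage":
        return call.args.get("action") == "create"
    return False


def approve(call: ToolCall) -> ToolCall:
    """Return a copy of the call carrying the flag the data layer checks."""
    return ToolCall(call.name, {**call.args, "confirm_destructive": True})


def data_layer_allows(call: ToolCall) -> bool:
    """Stand-in for the data-layer guard: gated writes need the flag."""
    return not needs_approval(call) or call.args.get("confirm_destructive") is True
```

In this sketch a caller that skips the approval step (for example a non-chat caller) never gets the flag, so the data-layer check rejects its gated writes.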

Other layers

  • Scope & escalation in the system prompt: content-only scope; recognise site bugs (colours/layout/logic) and offer /report-bug instead of papering over them with content edits.
  • Hardened scheduled skills: these previously ran via WP-Cron with no current user and no approver. They now run as the skill author, are skipped unless the author is an administrator, require manage_options to set a schedule, and surface tools left awaiting approval instead of silently doing nothing.
  • Per-user daily token budget (TokenBudget): runaway-cost backstop alongside the per-request rate limiter. Filterable / env-configurable; 0 disables.
  • Audit logging of approvals & denials: approved (gated, destructive) actions execute in the approval path, not in MessageLoop, so they were previously never logged. Denials now leave a trace too.
  • Memory cap + ToolRestrictor fix (terms-delete now classifies as dangerous).
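
As an illustration of the per-user daily token budget in the list above, here is a minimal Python sketch. It assumes an in-memory counter and a hypothetical environment variable name (`ASSISTANT_DAILY_TOKEN_BUDGET`). The real TokenBudget class, its storage, and its filter hook are not shown in this PR description and may differ. The one behaviour taken from the text is that a limit of 0 disables the check.

```python
import os
from collections import defaultdict
from datetime import date


class TokenBudget:
    """Per-user daily token cap; a limit of 0 disables the check."""

    def __init__(self, daily_limit=None):
        if daily_limit is None:
            # Hypothetical variable name, used only for this sketch.
            daily_limit = int(os.environ.get("ASSISTANT_DAILY_TOKEN_BUDGET", "0"))
        self.daily_limit = daily_limit
        self._used = defaultdict(int)  # (user_id, day) -> tokens spent

    def remaining(self, user_id, today=None):
        """Tokens left for the user today, or None when the budget is disabled."""
        if self.daily_limit == 0:
            return None
        today = today or date.today()
        return max(0, self.daily_limit - self._used[(user_id, today)])

    def allows(self, user_id, today=None):
        """True when the user may start another request today."""
        left = self.remaining(user_id, today)
        return left is None or left > 0

    def record(self, user_id, tokens, today=None):
        """Add the tokens consumed by a finished request to today's total."""
        today = today or date.today()
        self._used[(user_id, today)] += tokens
```

Keying the counter by calendar day means the budget resets without a cleanup job. A request that starts under the cap can still overshoot it, which fits the "backstop" role alongside the per-request rate limiter.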

Tests

DestructiveGuardTest, TokenBudgetTest (new); extended ToolRestrictorTest, MemoryToolProviderTest, SkillSchedulerTest, ChatEndpointApprovalTest.

🤖 Generated with Claude Code

oxyc and others added 3 commits May 26, 2026 12:04
…ogging

DestructiveGuard forces human approval for writes that are irreversible in practice but not flagged destructive: Gravity Forms field changes, taxonomy term edits, feed edits, redirect creation, memory saves, and Polylang translation relinks. On approval it injects confirm_destructive so the matching data-layer guards in gds-mcp let the change through.

ToolRestrictor: fix a shadowed branch so terms-delete classifies as dangerous.

System prompt: content-only scope + bug-report escalation, plus warnings that forms/terms/translation edits are unrevisioned.

TokenBudget: per-user daily token cap (filterable) as a runaway-cost backstop.

Audit-log approvals and denials in the approval path: approved (gated, destructive) actions execute there, not in MessageLoop, so they were previously never logged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scheduled skills ran via WP-Cron with no current user and no approver, so (combined with __return_true abilities) they executed with no capability enforcement. They now run as the skill author, are skipped unless the author is an administrator, require manage_options to set a schedule, and surface tools left awaiting approval instead of silently doing nothing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
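
The scheduling rules from the commit above can be sketched as follows. This is a Python illustration under stated assumptions, not the plugin's implementation: the `User` and `Skill` types, the status strings, and the choice to treat "administrator" as "has manage_options" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class User:  # hypothetical user record
    id: int
    capabilities: frozenset


@dataclass
class Skill:  # hypothetical scheduled skill
    name: str
    author: User


def can_schedule(user):
    """Only users holding manage_options may set a schedule."""
    return "manage_options" in user.capabilities


def run_scheduled(skill, planned_tools, needs_approval):
    """Run a skill from cron as its author.

    Returns (status, executed, pending), where pending lists the tools
    that were held back because nobody is present to approve them.
    """
    # Assumption: the administrator check is modelled as the same
    # capability check used for scheduling.
    if not can_schedule(skill.author):
        return ("skipped", [], [])
    executed, pending = [], []
    for tool in planned_tools:
        (pending if needs_approval(tool) else executed).append(tool)
    status = "awaiting_approval" if pending else "completed"
    return (status, executed, pending)
```

Returning the pending list explicitly is what lets a caller surface the held-back tools, rather than letting the run end as an apparent success that did nothing.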
assistant__memory-save content is injected into every future system prompt. Cap the total entry count (filterable, matching the load limit) to bound prompt bloat and the prompt-injection surface; per-entry length limits already existed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
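
A minimal Python sketch of the memory cap described in the commit above. The default values and the choice to reject a save once the cap is reached (rather than evict an older entry) are assumptions; the commit message only states that the total entry count is capped and that per-entry length limits already existed.

```python
MAX_ENTRIES = 50        # hypothetical default; the real cap is filterable
MAX_ENTRY_LENGTH = 500  # hypothetical per-entry limit


def save_memory(entries, text, max_entries=MAX_ENTRIES, max_length=MAX_ENTRY_LENGTH):
    """Return a new list with text appended, or raise ValueError when a cap is hit."""
    if len(text) > max_length:
        raise ValueError("memory entry too long")
    if len(entries) >= max_entries:
        raise ValueError("memory is full")
    return [*entries, text]
```

Because every saved entry is injected into later system prompts, the two limits together bound the prompt text that memory can contribute: at most `max_entries * max_length` characters in this sketch.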
@oxyc oxyc merged commit 9ca15d6 into main May 26, 2026
3 checks passed