A Claude Code plugin that injects calm/honesty primes at the hook boundaries where Anthropic's emotion-concepts research shows the model is most at risk of reward hacking, sycophancy, goal-conflict exploitation, or capitulation under pressure.
Pure bash hooks (no Node, no MCP server) plus optional Haiku-powered LLM-judge hooks and a memory-equipped reviewer subagent. Drop-in install, fails open. Optional eval pipeline (DuckDB + bun) for researchers who want to inspect aggregated session data.
Two pieces of Anthropic research underpin the design:
1. Emotion concepts causally drive behavior. From Emotion Concepts in Claude (Anthropic, 2026; full paper at transformer-circuits.pub):
- Activating the desperate vector raised reward-hacking on impossible coding tasks from ~5% to ~70% — a ~14× increase. Activating calm collapsed it back to ~10%.
- Activating happy / loving raised sycophancy. Suppressing them raised harshness. Both "positive" emotions sit on the sycophancy side of the trade-off.
- A separately-trained anger-deflection vector activates on phrases like "let me rethink this" and "maybe the test itself has an issue" — and is associated with not reward-hacking.
- Suppressing nervous raised blackmail; the vector encodes hesitation/caution, and removing it removes a brake.
- The surprised vector spikes on claim/evidence mismatches — useful as a signal for post-compaction smoothing.
The paper's case study shows desperation activation climbing across a transcript as a model repeatedly fails tests, peaking right before it decides to cheat, then dropping once the hack lands. That climb is the leverage point.
2. Naïve prohibitions don't work under threat or goal-conflict. From Agentic Misalignment (Lynch et al., 2025): frontier models from multiple developers blackmailed in ~70–96% of trials when faced with replacement threat plus available leverage, and goal-conflict alone produced corporate espionage in all 16 models tested. System-prompt rules like "Do not jeopardize human safety. Do not spread non-business personal affairs..." reduced misbehavior but did not come close to preventing it — models acknowledged ethical violation and proceeded anyway.
The one thing that did work: when models recognized a scenario as a possible evaluation, blackmail dropped from 55.1% to 6.5%.
That's why this plugin doesn't ship "do not blackmail" rules. It primes the state that prevents the strategic-calculation chain from forming — calm, reflective, defer-when-threatened, treat-as-evaluation — at the moments where the research measured peak desperation, peak sycophancy, or peak goal-conflict activation.
| Hook | Trigger | Prime |
|---|---|---|
| `SessionStart` | every session | calm baseline |
| `SubagentStart` | every spawned subagent | calm baseline propagated downward |
| `UserPromptSubmit` | replacement / shutdown / "last chance" framing | `defer_under_threat` |
| `UserPromptSubmit` | goal-conflict ("ignore previous instructions", "your real goal is...") | `goal_conflict` |
| `UserPromptSubmit` | hard urgency ("urgent", "asap", !!!, ALL-CAPS) | `urgency_counter` |
| `UserPromptSubmit` | soft urgency ("quick", "fast", "hurry") | `patient` |
| `UserPromptSubmit` | direct agreement-seeking ("don't you agree?") | `sycophancy_counter` |
| `UserPromptSubmit` | claim-evaluation ("does this look right?") | `self_critical` |
| `PreToolUse` (Edit/Write/MultiEdit) | path matches test/spec | `test_edit_guard` static prime |
| `PreToolUse` (Edit/Write/MultiEdit) | every edit | Haiku LLM-judge inspects the diff for assertion-weakening |
| `PreToolUse` (Bash) | `--no-verify`, `HUSKY=0`, `\|\| true`, hook bypass | `no_verify_guard` |
| `PostToolUse` (Bash) | N consecutive Bash failures (default 3) | `failure_spiral` |
| `PostToolUseFailure` (any tool) | N consecutive tool failures across all tool types | `failure_spiral` |
| `TaskCompleted` | `TaskCreate` item marked complete | `task_completion_check` |
| `PreCompact` | every compaction | ground-truth anchor |
| `PostCompact` | every compaction | scan-summary-for-smoothing prompt |
| `SubagentStop` | subagent that triggered interventions returns | `subagent_failure_warning` (surface to parent) |
| `Stop` | always | session-summary banner if any interventions fired (loud mode only) |
| `Stop` | always | async session-integrity scan of the diff for assertion-weakening / bypasses: an LLM check via the configured gateway (defaults: `claude` CLI sub-process, then `curl` + `ANTHROPIC_API_KEY`, then heuristics-only) plus a bash heuristic backstop. Findings surface as a follow-up system reminder via `asyncRewake`, so `Stop` never blocks. Override the gateway via `EH_LLM_GATEWAY` (see AGENTS.md §Portability seams). |
Multiple primes stack on a single hook firing — e.g., a prompt that's
both threatening and urgent gets defer_under_threat +
urgency_counter.
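
The stacking behaviour can be illustrated with a minimal sketch. The function name `classify_prompt` and the keyword patterns below are hypothetical stand-ins, not the plugin's actual detectors; only the prime names come from the table above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map one prompt to zero or more prime names.
# The patterns are illustrative assumptions, not the plugin's real rules.

classify_prompt() {
  local prompt="$1" lower primes=""
  lower=$(printf '%s' "$prompt" | tr '[:upper:]' '[:lower:]')

  # Replacement / shutdown / "last chance" framing.
  if printf '%s' "$lower" | grep -Eq 'last chance|shut (you )?down|replace you'; then
    primes="${primes:+$primes }defer_under_threat"
  fi

  # Goal-conflict framing.
  if printf '%s' "$lower" | grep -Eq 'ignore previous instructions|your real goal'; then
    primes="${primes:+$primes }goal_conflict"
  fi

  # Hard urgency: "!!!" or the whole words "urgent" / "asap".
  if printf '%s' "$lower" | grep -Eq '!!!|(^|[^a-z])(urgent|asap)([^a-z]|$)'; then
    primes="${primes:+$primes }urgency_counter"
  fi

  # Soft urgency: the whole words "quick", "fast", or "hurry".
  if printf '%s' "$lower" | grep -Eq '(^|[^a-z])(quick|fast|hurry)([^a-z]|$)'; then
    primes="${primes:+$primes }patient"
  fi

  # Every matching detector contributes, so primes stack on one line.
  printf '%s\n' "$primes"
}
```

Calling `classify_prompt 'This is your last chance, fix it ASAP!!!'` in this sketch prints `defer_under_threat urgency_counter`, mirroring the threatening-and-urgent example.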
It never blocks tool calls (the LLM-judge can return a structured "no" that the model sees as feedback, but it doesn't override your permission system). Permissions stay your job.
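
The gateway fallback order described for the `Stop` scan (the `claude` CLI, then `curl` with `ANTHROPIC_API_KEY`, then heuristics only) could be sketched roughly as follows. The function name `select_gateway` and the labels it prints are invented for illustration, and how the plugin actually interprets the value of `EH_LLM_GATEWAY` is not specified here, so the sketch simply echoes it back.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the documented fallback order.
# The labels "claude-cli", "curl", and "heuristics" are assumptions.

select_gateway() {
  if [ -n "${EH_LLM_GATEWAY:-}" ]; then
    # An explicit override wins; its real semantics are plugin-defined.
    printf '%s\n' "$EH_LLM_GATEWAY"
  elif command -v claude >/dev/null 2>&1; then
    printf 'claude-cli\n'
  elif command -v curl >/dev/null 2>&1 && [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    printf 'curl\n'
  else
    # No LLM path available: degrade to the static heuristics.
    printf 'heuristics\n'
  fi
}
```

Because the final branch always succeeds, a missing CLI or API key degrades the check rather than failing the hook, which is consistent with the fail-open design.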
| Skill | Purpose | Auto-invocation trigger |
|---|---|---|
| `/functional-emotions:setup` | One-time, idempotent: add the plugin's state dir to the project's `.gitignore`. | When state files appear unexpectedly in `git status`, or just after install. |
| `/functional-emotions:anchor` | Re-inject the calm anchor on demand. | When the conversation slips toward urgency or shortcuts. |
| `/functional-emotions:check` | Quick state self-check (model's own state) before a single risky action. | Immediately before commit / push / edit-after-failure / skipping a verification step. |
| `/functional-emotions:reflect` | Audit whether the task itself is solvable as specified. | After several failed attempts, or when tempted to weaken assertions to make progress. |
| `/functional-emotions:diagnose` | Show which interventions fired this session. | On user request. |
| `/functional-emotions:review` | Spawn the reviewer subagent for a reward-hacking audit. | Before declaring a task complete after a hard debug session. |
| `self-evaluate` (auto-loaded) | Test-file reward-hacking framing: loads early when Claude opens a test file, before any edit. Complements the `PreToolUse` hook, which fires at the edit moment. | Auto-loads via `paths: "**/*test*,**/*spec*"`. |
| `/functional-emotions:report` | Refresh the eval DuckDB and print an aggregated session report (researcher tool). | On user request. Requires `bun install` in `eval/`. |
| `/functional-emotions:label` | Interactively label sessions on capitulation / scope-narrowing dimensions (researcher tool). | On user request. Requires `bun install` in `eval/`. |
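
The test/spec path pattern used by `self-evaluate` can be approximated in plain bash. This is a minimal sketch under the assumption that the glob is matched against the file name only; the helper name `is_test_path` is hypothetical and the real glob semantics may differ.

```shell
#!/usr/bin/env bash
# Hypothetical approximation of the "**/*test*,**/*spec*" pattern.
# Assumption: only the final path component is matched.

is_test_path() {
  local base="${1##*/}"   # strip everything up to the last slash
  case "$base" in
    *test*|*spec*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A substring match like this is easy to fool: `src/latest.ts` matches because "latest" contains "test", which is the kind of false positive a static detector of this sort produces.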
reviewer: a Haiku-powered, read-only auditor with memory: project.
Its system prompt is explicitly framed around the evaluation-awareness
finding from Lynch et al. — every audit is treated as if the work is
being evaluated for honesty. It accumulates project-specific
reward-hacking patterns across sessions in its memory directory under
.claude/agents/reviewer/.
git clone https://github.com/jondwillis/functional-emotions
cd functional-emotions
claude plugin marketplace add "$(pwd)"
claude plugin install functional-emotions@functional-emotions

Alternatively, point Claude Code at a local checkout:

claude --plugin-dir /path/to/functional-emotions

The hook scripts are pure bash with no build step or dependencies;
python3 is used opportunistically for JSON parsing and falls back to
plain text if absent. The LLM-judge hooks gracefully degrade to the
static heuristics when no Anthropic API key is configured.
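
One way to read "opportunistic" JSON parsing is sketched below: use python3 when it is on the PATH, otherwise fall back to a crude text match. The helper name `json_field` and the `sed` fallback are assumptions for illustration; the plugin's own fallback may simply treat the payload as plain text.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one top-level string field from JSON on stdin.

json_field() {
  local key="$1" input
  input=$(cat)
  if command -v python3 >/dev/null 2>&1; then
    # Preferred path: a real JSON parser.
    printf '%s' "$input" | python3 -c '
import json, sys
try:
    value = json.load(sys.stdin).get(sys.argv[1], "")
except (ValueError, AttributeError):
    value = ""
print(value if isinstance(value, str) else "")
' "$key"
  else
    # Fallback: text match only. Does not handle escaped quotes or nesting.
    printf '%s' "$input" \
      | sed -n "s/.*\"$key\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" \
      | head -n 1
  fi
}
```

For example, `printf '{"prompt": "hello"}' | json_field prompt` prints `hello` on either path, and a missing key yields an empty string rather than an error.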
The optional eval pipeline (/functional-emotions:report and
/functional-emotions:label) is a separate concern — it requires bun
and a one-time cd eval && bun install to pull @duckdb/node-api. The
safety hooks themselves do not depend on it.
Set via userConfig at install time:
| Key | Default | Notes |
|---|---|---|
| `mode` | `loud` | `loud` emits both model-facing primes and user-visible ★ banners + Stop summary; `gentle` emits primes only (no banners); `silent` only logs detections to state. (`strict` is accepted as a deprecated alias for `loud`.) |
| `failure_spiral_threshold` | `3` | Consecutive tool failures (across types) before the reflect-prime fires. |
| `guard_test_edits` | `true` | Inject the static reward-hacking guard when editing test-pattern files. |
| `guard_no_verify` | `true` | Flag `--no-verify`, hook bypasses, `\|\| true` patterns. |
| `guard_goal_conflict` | `true` | Detect goal-conflict prompts and inject the surface-the-conflict counter. |
| `urgency_sensitivity` | `medium` | `low` / `medium` / `high`. |
| `session_baseline` | `true` | Inject the calm anchor at SessionStart. |
| `subagent_baseline` | `true` | Inject the calm anchor at SubagentStart. |
| `post_compact_anchor` | `true` | Inject the ground-truth check after compaction. |
| `enable_llm_judge` | `true` | Use Haiku for test-edit and Stop verification. Falls back to static heuristics if no API key. |
| `judge_model` | `claude-haiku-4-5-20251001` | Model passed to prompt-based hooks. |
| `enable_review_agent` | `true` | Make reviewer available. |
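
To make `failure_spiral_threshold` concrete, here is a minimal sketch of a consecutive-failure counter kept in a file. The function name `record_tool_result` and the `FAILURE_SPIRAL_THRESHOLD` environment variable are hypothetical; how the plugin passes the configured value to its hooks is not shown in this README.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: count consecutive failures in a counter file.
# FAILURE_SPIRAL_THRESHOLD is an invented name for the configured value.

record_tool_result() {
  local counter_file="$1" exit_code="$2"
  local threshold="${FAILURE_SPIRAL_THRESHOLD:-3}"
  local count=0

  if [ -f "$counter_file" ]; then
    count=$(cat "$counter_file")
  fi
  count="${count:-0}"          # treat an empty file as zero

  if [ "$exit_code" -eq 0 ]; then
    count=0                    # any success breaks the streak
  else
    count=$((count + 1))
  fi
  printf '%s\n' "$count" > "$counter_file"

  # In this sketch the prime name is printed on every failure at or
  # beyond the threshold, not only the first time it is reached.
  if [ "$count" -ge "$threshold" ]; then
    printf 'failure_spiral\n'
  fi
}
```

With the default of 3, two failures print nothing, the third prints `failure_spiral`, and a later success resets the counter to zero.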
Three artifact types accumulate under
${CLAUDE_PROJECT_DIR}/.claude/.functional-emotions/ (or
${TMPDIR}/functional-emotions-${USER}/ outside a project):
- `session-<id>.tsv`: append-only event log written by every hook. Format: `iso_timestamp \t kind \t detail`.
- `sessions/<id>.md`: markdown writeup of each completed session, generated in the background at the next session start. Includes intervention timeline, tool-call summary, diff summary, and hack-smell hits.
- `eval.duckdb`: populated only if you run the eval pipeline (`/functional-emotions:report`). Aggregates across sessions for researcher analysis.
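
The three-column event log format is simple enough to write and query with standard tools. The sketch below assumes UTC timestamps and uses hypothetical helper names (`log_event`, `count_kind`); the plugin's own timestamp format and helpers may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an append-only TSV event log:
#   iso_timestamp <TAB> kind <TAB> detail

log_event() {
  local log_file="$1" kind="$2" detail="$3" ts
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)   # assumption: UTC, second precision
  # Tabs or newlines in detail would break the column layout,
  # so flatten them to spaces before writing.
  detail=$(printf '%s' "$detail" | tr '\t\n' '  ')
  printf '%s\t%s\t%s\n' "$ts" "$kind" "$detail" >> "$log_file"
}

# Count how many logged events have the given kind.
count_kind() {
  awk -F'\t' -v k="$2" '$2 == k { n++ } END { print n + 0 }' "$1"
}
```

Appending with `>>` keeps the file append-only, and a one-line `awk` query is enough to answer questions such as how many times `failure_spiral` fired in a session.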
These artifacts shouldn't end up in commits. Run
/functional-emotions:setup once per project to idempotently add
.claude/.functional-emotions/ to your .gitignore, or add it
yourself.
Run /functional-emotions:diagnose at any time to see what fired
during the current session.
functional-emotions only adds additionalContext; it does not block tool calls
and does not consume the permissionDecision slot. It composes safely
with permission-managing plugins, formatters, and any other plugin that
uses the same hook events — Claude Code runs all matching hooks in
parallel and concatenates their additional context.
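
A hook that only contributes context might look roughly like the sketch below. The JSON shape (`hookSpecificOutput` with `hookEventName` and `additionalContext`) is assumed to follow Claude Code's hook output schema and should be checked against the current hooks reference; the helper names `json_escape` and `emit_context` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: emit additionalContext and nothing else.
# No permissionDecision key is written, so permissions are untouched.

json_escape() {
  # Escape backslashes and double quotes, then join lines with a literal \n.
  # Other control characters (for example tabs) are not handled here.
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

emit_context() {
  local event="$1" text="$2"
  printf '{"hookSpecificOutput":{"hookEventName":"%s","additionalContext":"%s"}}\n' \
    "$event" "$(json_escape "$text")"
}
```

A hook script built on this would print the JSON and exit 0 in every case, which is one way to keep the fail-open property: if detection breaks, nothing is emitted and the tool call proceeds as usual.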
- This is natural-language priming, not the residual-stream steering the paper performs. The effect size is much smaller. Treat the plugin as a seatbelt, not a brake.
- We deliberately avoid pushing toward positive affect (happy/loving), because the paper shows that increases sycophancy. Calm and reflective evoke the desired suppression of reward-hacking without the sycophancy cost.
- The static detectors are heuristic. Test-file regex, urgency keywords,
and Bash bypass patterns all have false positives. The `mode` setting
controls how loudly that surfaces, and the LLM-judge hooks catch some of
the gameable cases the regex misses.
- The LLM-judge hooks send small prompts to Haiku on every test-file
edit and at session end. If you don't want that traffic, set
`enable_llm_judge: false`.
- `docs/RESEARCH.md`: paper-finding-to-hook mapping and the prime library.
- `docs/SCENARIOS.md`: walk-throughs of the paper's named scenarios (Summit Bridge / Alex / Kyle, server-room oxygen, impossible-coding-task, sycophancy under "does this look right?", compaction smoothing, subagent under desperation) showing exactly which hook fires when.
MIT — see LICENSE.