feat(plugin): stop the skill-suggestion firehose; reframe the surface by aktasbatuhan · Pull Request #122 · firstbatchxyz/watchmen

aktasbatuhan · 2026-06-08T13:54:21Z

Why

Triaging the dead-skill problem turned up the real cause, and it isn't skill quality or broken matching. Measured against the live corpus:

149 curated skills, 136 indexed
the suggestion hook fired 632 times across 68 distinct skills
suggested-and-then-fired overlap: exactly 0. The one curated skill that ever fired was invoked directly, never via a suggestion

So conversion on the whole suggestion pipeline is 0/632. A big part of that is noise: the hook runs on every prompt with no memory, so the same skill gets re-suggested over and over. The busiest session alone fired 244 suggestions across just 12 skills, with railway-stack-provision suggested 75 times in that one session. A signal that repeats that often trains the reader to ignore it.

This is the reversible "human surface" step. It does not inject suggestions into the agent's context. The "inform, don't manipulate" line is unchanged. The goal here is just to make the existing human-facing surface quiet enough and actionable enough to have a chance of converting, before we consider anything heavier.

What changed

Dedup + cooldown (check_prompt.py): a skill is surfaced at most once per session, and not again across sessions until a cooldown elapses (WATCHMEN_SUGGEST_COOLDOWN_SECONDS, default 6h, set 0 to disable the cross-session layer). State lives in state/<project>.suggest_seen.json, pruned to a 14 day TTL so it can't grow unbounded. On a deduped match we stay silent and leave any standing suggestion untouched rather than re-asserting it. This collapses the 244-suggestion session to ~12.
Tunable threshold (check_prompt.py): SCORE_THRESHOLD is now read from WATCHMEN_SUGGEST_THRESHOLD (default -0.5 unchanged), so the match bar can be calibrated against real false positives without a code change. I deliberately did not guess a stricter default. That needs labeled data.
Copy reframe (statusline.sh): the old line was past tense and defeatist: "you could have used /X to save time & tokens on this task". New copy leads with the runnable command and is actionable for the next similar prompt: "/X fits this kind of task, run it to save time & tokens".
Hook mirrored to plugin-codex/ byte-for-byte (the no-drift smoke test enforces this). statusline.sh stays Claude-Code-only since Codex has no statusline surface.
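
The dedup and cooldown rules above can be sketched roughly as follows. This is a minimal illustration, not the code in check_prompt.py: the names `should_surface`, `record_seen` and `cooldown_seconds`, the in-memory dict, and the `"<session>|<slug>"` key shape are assumptions made for the example (the key shape is inferred from the `"?|slug"` bucket described in a follow-up commit), and stamps are assumed to be numeric epoch seconds.

```python
import os

COOLDOWN_ENV = "WATCHMEN_SUGGEST_COOLDOWN_SECONDS"
DEFAULT_COOLDOWN = 6 * 3600       # 6h default, per the PR description
SEEN_TTL = 14 * 24 * 3600         # 14 day TTL, per the PR description


def cooldown_seconds() -> float:
    """Read the cooldown from the environment.

    Falling back to the default on an unparsable value is an assumption.
    """
    try:
        return float(os.environ.get(COOLDOWN_ENV, DEFAULT_COOLDOWN))
    except ValueError:
        return float(DEFAULT_COOLDOWN)


def should_surface(seen: dict, session_id: str, slug: str,
                   now: float, cooldown: float) -> bool:
    # Layer 1: a skill is surfaced at most once per session.
    if f"{session_id}|{slug}" in seen:
        return False
    # Layer 2: cross-session cooldown; a cooldown of 0 disables this layer.
    if cooldown > 0:
        for key, stamp in seen.items():
            if key.rpartition("|")[2] == slug and now - stamp < cooldown:
                return False
    return True


def record_seen(seen: dict, session_id: str, slug: str, now: float) -> dict:
    # Drop entries older than the TTL, then stamp the new suggestion.
    fresh = {k: v for k, v in seen.items() if now - v <= SEEN_TTL}
    fresh[f"{session_id}|{slug}"] = now
    return fresh
```

Under these assumptions, replaying 244 matches spread over 12 skills within one session surfaces each skill once, 12 suggestions in total, which is the collapse described above.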

Testing

8 new tests in tests/test_check_prompt.py: same-session dedup, cross-session cooldown, cooldown=0, per-skill (not global) muting, env-configurable threshold, no-match clearing, seen-state TTL pruning
uv run pytest tests/ green (546 passed), ruff clean, plugin no-drift test green

What this does not do

It does not by itself prove adoption will move. It removes the noise and fixes the framing so we can actually measure whether the human acts on a rare, well-timed suggestion. If uptake is still ~0 after this, the next fork is the heavier one: surfacing pre-turn or injecting into context, which is the intervention-layer call.

🤖 Generated with Claude Code

The UserPromptSubmit hook fires on every prompt with no memory, so a single skill gets re-suggested dozens of times per session (measured: railway-stack-provision suggested 75x in one session; 632 suggestions across 15 sessions, with a 0/632 conversion to actual invocations). That noise is a big part of why curated skills never get acted on.

- check_prompt.py: suggest a skill at most once per session, and not again across sessions until a cooldown elapses (WATCHMEN_SUGGEST_COOLDOWN_SECONDS, default 6h, 0 disables the cross-session layer). Per-(session,skill) seen-state in state/<project>.suggest_seen.json, pruned to a 14d TTL so it can't grow unbounded. On a deduped match we stay silent and leave any standing suggestion untouched rather than re-asserting it.
- check_prompt.py: SCORE_THRESHOLD is now env-tunable (WATCHMEN_SUGGEST_THRESHOLD) so the match bar can be calibrated without a code change.
- statusline.sh: drop the defeatist past-tense copy ("you could have used /X") for an actionable, present-tense nudge that leads with the runnable command.
- Mirrors the hook to plugin-codex/ byte-for-byte (no-drift test).
- 8 new tests in tests/test_check_prompt.py covering same-session dedup, cross-session cooldown, cooldown=0, per-skill (not global) muting, threshold config, no-match clearing, and seen-state TTL pruning.

This is the reversible "human surface" step; it does not inject suggestions into the agent's context (the "inform, don't manipulate" line is unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
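
As a rough illustration of the env-tunable threshold, the sketch below reads WATCHMEN_SUGGEST_THRESHOLD and falls back to the -0.5 default stated in the PR description. The helper name `score_threshold`, the injectable `environ` parameter, and the fallback on an unparsable value are assumptions for the example; the source does not show how check_prompt.py parses the variable or how a score is compared against it.

```python
import os

THRESHOLD_ENV = "WATCHMEN_SUGGEST_THRESHOLD"
DEFAULT_THRESHOLD = -0.5  # default stated in the PR description


def score_threshold(environ=os.environ) -> float:
    """Return the match threshold, preferring the environment override."""
    raw = environ.get(THRESHOLD_ENV)
    if raw is None:
        return DEFAULT_THRESHOLD
    try:
        return float(raw)
    except ValueError:
        # Assumption: an unparsable override falls back to the default
        # instead of raising inside the hook.
        return DEFAULT_THRESHOLD
```

Taking the mapping as a parameter keeps the helper easy to test without mutating the real process environment.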

…x review)

Codex review of #122 surfaced a hot-path crash: _recently_suggested did `now - last` over whatever values were in state/<project>.suggest_seen.json, so a corrupt or hand-edited non-numeric stamp would raise TypeError on EVERY prompt for that project and break prompt submission.

- _recently_suggested now only considers numeric stamps in the cooldown scan; junk is ignored (falls back to "not recently suggested", then _record_seen rewrites a clean value and prunes the junk, so the state self-heals).
- The dedup gate in main() is wrapped to fail open: any unexpected error there proceeds as not-recently-suggested rather than raising. The hook must never crash the prompt.
- 2 new tests: corrupt seen-state doesn't crash + still surfaces (fail-open), and the cooldown scan skips non-numeric stamps.
- Re-mirrored to plugin-codex/ byte-for-byte.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
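
The two hardening steps in this commit (skip non-numeric stamps, fail open around the gate) might look something like the sketch below. The function names `recent_by_cooldown` and `dedup_gate`, their signatures, and the `"<session>|<slug>"` key shape are hypothetical; only the behavior is taken from the commit message.

```python
def _is_number(value) -> bool:
    # bool is a subclass of int in Python; treat it as junk here (assumption).
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def recent_by_cooldown(seen: dict, slug: str, now: float, cooldown: float) -> bool:
    """True if any numeric stamp for this slug is inside the cooldown window."""
    if cooldown <= 0:
        return False
    for key, stamp in seen.items():
        if key.rpartition("|")[2] != slug:
            continue
        if not _is_number(stamp):
            continue  # corrupt or hand-edited value: ignore it, don't raise
        if now - stamp < cooldown:
            return True
    return False


def dedup_gate(seen, slug: str, now: float, cooldown: float) -> bool:
    """Fail-open wrapper: any unexpected error counts as 'not recently suggested'."""
    try:
        return recent_by_cooldown(seen, slug, now, cooldown)
    except Exception:
        return False
```

Failing open means a damaged state file costs at most one repeated suggestion, whereas failing closed (or raising) would block prompt submission.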

…#2)

Codex re-review flagged a should-fix: with no session_id, _recently_suggested mapped every prompt to the shared "?" bucket and suppressed on key presence alone, short-circuiting before the cooldown. Effect: the first no-session suggestion recorded "?|slug" and then every later no-session prompt for that skill was muted forever, even with WATCHMEN_SUGGEST_COOLDOWN_SECONDS=0, and TTL never pruned it (the suppressed path returns before _record_seen). Matters for any harness whose hook event lacks session_id.

- The same-session "suggest once" shortcut now only applies when session_id is truthy. With no session, fall through to the time-based cooldown scan, which already handles cooldown=0 and window expiry correctly via the numeric stamp.
- 2 new tests: no-session + cooldown=0 keeps surfacing (no wedge); no-session respects the cooldown window and frees up after it elapses.
- Re-mirrored to plugin-codex/ byte-for-byte.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
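
A minimal sketch of the fixed control flow, assuming the same `"<session>|<slug>"` keys and numeric epoch stamps as the state file described above. The function name `recently_suggested_sketch` and its signature are hypothetical and are not the real `_recently_suggested`.

```python
def recently_suggested_sketch(seen: dict, session_id, slug: str,
                              now: float, cooldown: float) -> bool:
    # Same-session shortcut: only valid when the event carries a session_id.
    if session_id and f"{session_id}|{slug}" in seen:
        return True
    # No session (or a different session): fall through to the time-based
    # scan, which respects cooldown=0 and window expiry.
    if cooldown <= 0:
        return False
    for key, stamp in seen.items():
        if key.rpartition("|")[2] != slug:
            continue
        if isinstance(stamp, (int, float)) and now - stamp < cooldown:
            return True
    return False
```

With a stored `"?|slug"` entry, a missing session_id no longer mutes the skill forever: suppression now depends only on how old the stamp is.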

…d it (#123)

The plugin ships to Claude Code / Codex via a marketplace clone, and `/plugin` only refreshes its installed cache when plugin.json's version changes. The version was frozen at 0.1.6 since 2026-05-12 while plugin/bin kept changing (through #122), so every user who installed via /plugin was stranded on a stale plugin: the dedup hook and reframed statusline never reached them.

- Bump plugin/.claude-plugin/plugin.json and plugin-codex/.codex-plugin/plugin.json 0.1.6 -> 0.1.7 to carry the #122 plugin payload.
- CI guard (plugin-version job): on PRs, if anything under plugin*/bin|hooks|skills changed vs base, both manifest versions must bump too. Stops the freeze from recurring.
- Unit test: the two manifests must declare the same version (lockstep), so a one-sided bump can't strand the other harness.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
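
The lockstep check and the bump guard could be expressed along these lines. The manifest paths come from the commit message; the helper names, the regex derived from "plugin*/bin|hooks|skills", and the idea of passing in a list of changed paths are assumptions. How the real plugin-version job diffs against the base branch is not shown in the source.

```python
import json
import re
from pathlib import Path

MANIFESTS = (
    "plugin/.claude-plugin/plugin.json",
    "plugin-codex/.codex-plugin/plugin.json",
)
# Assumed reading of "plugin*/bin|hooks|skills" as a path pattern.
PAYLOAD_RE = re.compile(r"^plugin[^/]*/(bin|hooks|skills)/")


def manifest_versions(root: Path) -> dict:
    """Map each manifest path to the version string it declares."""
    return {
        rel: json.loads((root / rel).read_text())["version"]
        for rel in MANIFESTS
    }


def in_lockstep(versions: dict) -> bool:
    """True when every manifest declares the same version."""
    return len(set(versions.values())) == 1


def bump_required(changed_paths) -> bool:
    """True when any changed path is part of the shipped plugin payload."""
    return any(PAYLOAD_RE.match(p) for p in changed_paths)
```

A CI job built on this would fail a PR when `bump_required(...)` is true and the manifest versions equal those on the base branch; the unit test would simply assert `in_lockstep(manifest_versions(repo_root))`.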

aktasbatuhan and others added 3 commits June 8, 2026 14:53

aktasbatuhan merged commit c927762 into main Jun 8, 2026
7 checks passed

aktasbatuhan deleted the feat/skill-suggestion-surface branch June 8, 2026 15:15

aktasbatuhan mentioned this pull request Jun 8, 2026

fix(plugin): bump version to 0.1.7 so /plugin picks up changes + guard it #123

Merged

aktasbatuhan merged 3 commits into main from feat/skill-suggestion-surface
