feat(plugin): stop the skill-suggestion firehose; reframe the surface #122
Merged
The UserPromptSubmit hook fires on every prompt with no memory, so a single
skill gets re-suggested dozens of times per session (measured: railway-stack-
provision suggested 75x in one session; 632 suggestions across 15 sessions,
with a 0/632 conversion to actual invocations). That noise is a big part of
why curated skills never get acted on.
- check_prompt.py: suggest a skill at most once per session, and not again
across sessions until a cooldown elapses (WATCHMEN_SUGGEST_COOLDOWN_SECONDS,
default 6h, 0 disables the cross-session layer). Per-(session,skill) seen-
state in state/<project>.suggest_seen.json, pruned to a 14d TTL so it can't
grow unbounded. On a deduped match we stay silent and leave any standing
suggestion untouched rather than re-asserting it.
- check_prompt.py: SCORE_THRESHOLD is now env-tunable
(WATCHMEN_SUGGEST_THRESHOLD) so the match bar can be calibrated without a
code change.
- statusline.sh: drop the defeatist past-tense copy ("you could have used /X")
for an actionable, present-tense nudge that leads with the runnable command.
- Mirrors the hook to plugin-codex/ byte-for-byte (no-drift test).
- 8 new tests in tests/test_check_prompt.py covering same-session dedup,
cross-session cooldown, cooldown=0, per-skill (not global) muting, threshold
config, no-match clearing, and seen-state TTL pruning.
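The dedup rules above can be sketched roughly as follows. This is a minimal illustration, not the code in check_prompt.py: the function names, the in-memory dict, and the `session|slug` key layout are assumptions made for the example. Only the env variable name, the 6h default, and the 14d TTL come from this commit message.

```python
import os

TTL_SECONDS = 14 * 24 * 3600   # 14d TTL, per the commit message
DEFAULT_COOLDOWN = 6 * 3600    # 6h default, per the commit message


def cooldown_seconds() -> float:
    """Read the cross-session cooldown from the environment; 0 disables it."""
    raw = os.environ.get("WATCHMEN_SUGGEST_COOLDOWN_SECONDS", "")
    try:
        return max(0.0, float(raw))
    except ValueError:
        # Unset or unparsable: fall back to the default window.
        return float(DEFAULT_COOLDOWN)


def should_suggest(seen: dict, session_id: str, slug: str,
                   now: float, cooldown: float) -> bool:
    """Hypothetical gate: once per session, then once per cooldown window."""
    # Layer 1: this session has already been shown this skill.
    if f"{session_id}|{slug}" in seen:
        return False
    # Layer 2: any session was shown this skill inside the cooldown window.
    if cooldown > 0:
        suffix = f"|{slug}"
        for key, stamp in seen.items():
            if key.endswith(suffix) and now - stamp < cooldown:
                return False
    return True


def record_seen(seen: dict, session_id: str, slug: str, now: float) -> dict:
    """Return a pruned copy of the seen-state with this suggestion stamped."""
    fresh = {k: v for k, v in seen.items() if now - v < TTL_SECONDS}
    fresh[f"{session_id}|{slug}"] = now
    return fresh
```

Muting is keyed per skill, so one suppressed skill does not silence the others. With the cooldown set to 0, only the same-session layer remains.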
This is the reversible "human surface" step; it does not inject suggestions
into the agent's context (the "inform, don't manipulate" line is unchanged).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…x review)
Codex review of #122 surfaced a hot-path crash: _recently_suggested did `now - last` over whatever values were in state/<project>.suggest_seen.json, so a corrupt or hand-edited non-numeric stamp would raise TypeError on EVERY prompt for that project and break prompt submission.
- _recently_suggested now only considers numeric stamps in the cooldown scan; junk is ignored (falls back to "not recently suggested", then _record_seen rewrites a clean value and prunes the junk, so the state self-heals).
- The dedup gate in main() is wrapped to fail open: any unexpected error there proceeds as not-recently-suggested rather than raising. The hook must never crash the prompt.
- 2 new tests: corrupt seen-state doesn't crash + still surfaces (fail-open), and the cooldown scan skips non-numeric stamps.
- Re-mirrored to plugin-codex/ byte-for-byte.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
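A rough sketch of the two defences described in this commit: skipping non-numeric stamps and failing open. The helper names and signatures are hypothetical and the seen-state is a plain dict; the real _recently_suggested and main() may be shaped differently.

```python
def recently_suggested_safe(seen: dict, slug: str,
                            now: float, cooldown: float) -> bool:
    """Cooldown scan that ignores junk stamps instead of raising TypeError."""
    if cooldown <= 0:
        return False
    suffix = f"|{slug}"
    for key, stamp in seen.items():
        if not isinstance(key, str) or not key.endswith(suffix):
            continue
        # bool is a subclass of int, so exclude it explicitly; everything
        # else that is not int/float (str, None, list, ...) is junk.
        if isinstance(stamp, bool) or not isinstance(stamp, (int, float)):
            continue
        if now - stamp < cooldown:
            return True
    return False


def suppress(seen, slug: str, now: float, cooldown: float) -> bool:
    """Fail-open wrapper: any unexpected error means 'do not suppress'."""
    try:
        return recently_suggested_safe(seen, slug, now, cooldown)
    except Exception:
        # The hook must never break prompt submission, so surface anyway.
        return False
```

Failing open costs at most one redundant suggestion, while failing closed would either crash the prompt or mute the skill silently.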
…#2)
Codex re-review flagged a should-fix: with no session_id, _recently_suggested mapped every prompt to the shared "?" bucket and suppressed on key presence alone, short-circuiting before the cooldown. Effect: the first no-session suggestion recorded "?|slug" and then every later no-session prompt for that skill was muted forever (even with WATCHMEN_SUGGEST_COOLDOWN_SECONDS=0), and TTL never pruned it (the suppressed path returns before _record_seen). Matters for any harness whose hook event lacks session_id.
- The same-session "suggest once" shortcut now only applies when session_id is truthy. With no session, fall through to the time-based cooldown scan, which already handles cooldown=0 and window expiry correctly via the numeric stamp.
- 2 new tests: no-session + cooldown=0 keeps surfacing (no wedge); no-session respects the cooldown window and frees up after it elapses.
- Re-mirrored to plugin-codex/ byte-for-byte.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
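The fix can be illustrated with a small sketch. The function name and signature are assumptions for the example; only the "?|slug" key, the truthiness check on session_id, and the fall-through to the cooldown scan come from the commit message.

```python
from typing import Optional


def is_muted(seen: dict, session_id: Optional[str], slug: str,
             now: float, cooldown: float) -> bool:
    """Hypothetical gate showing the no-session fall-through."""
    # The "once per session" shortcut only applies to a real session id.
    # A missing id would otherwise share one "?" bucket and wedge forever.
    if session_id and f"{session_id}|{slug}" in seen:
        return True
    # No session, or a new session: rely on the time-based cooldown only.
    if cooldown <= 0:
        return False
    suffix = f"|{slug}"
    for key, stamp in seen.items():
        if not key.endswith(suffix):
            continue
        if isinstance(stamp, bool) or not isinstance(stamp, (int, float)):
            continue
        if now - stamp < cooldown:
            return True
    return False
```

With a stale "?|slug" entry on disk, a no-session prompt is muted only while the stamp is inside the cooldown window, and never when the cooldown is 0.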
aktasbatuhan added a commit that referenced this pull request on Jun 8, 2026
…d it (#123)
The plugin ships to Claude Code / Codex via a marketplace clone, and `/plugin` only refreshes its installed cache when plugin.json's version changes. The version was frozen at 0.1.6 since 2026-05-12 while plugin/bin kept changing (through #122), so every user who installed via /plugin was stranded on a stale plugin: the dedup hook and reframed statusline never reached them.
- Bump plugin/.claude-plugin/plugin.json and plugin-codex/.codex-plugin/plugin.json 0.1.6 -> 0.1.7 to carry the #122 plugin payload.
- CI guard (plugin-version job): on PRs, if anything under plugin*/bin|hooks|skills changed vs base, both manifest versions must bump too. Stops the freeze from recurring.
- Unit test: the two manifests must declare the same version (lockstep), so a one-sided bump can't strand the other harness.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
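A lockstep check along these lines could look like the sketch below. The two manifest paths are the ones named in the commit message; the helper names, the `repo_root` parameter, and the assumption that each manifest has a top-level "version" key are illustrative, not taken from the actual test.

```python
import json
from pathlib import Path

# Manifest paths as named in the commit message.
MANIFESTS = (
    "plugin/.claude-plugin/plugin.json",
    "plugin-codex/.codex-plugin/plugin.json",
)


def manifest_versions(repo_root: Path) -> dict:
    """Map each manifest path to its declared version string."""
    versions = {}
    for rel in MANIFESTS:
        data = json.loads((repo_root / rel).read_text(encoding="utf-8"))
        # Assumes a top-level "version" key in each manifest.
        versions[rel] = data["version"]
    return versions


def versions_in_lockstep(repo_root: Path) -> bool:
    """True when both manifests declare exactly the same version."""
    return len(set(manifest_versions(repo_root).values())) == 1
```

In a pytest suite this would reduce to `assert versions_in_lockstep(repo_root)`, so a one-sided bump fails before it ships.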
Why
Triaging the dead-skill problem turned up the real cause, and it isn't skill quality or broken matching. Measured against the live corpus:
So conversion on the whole suggestion pipeline is 0/632. A big part of that is noise: the hook runs on every prompt with no memory, so the same skill gets re-suggested over and over. The busiest session alone fired 244 suggestions across just 12 skills, with railway-stack-provision suggested 75 times in that one session. A signal that repeats that hard is trained to be ignored.
This is the reversible "human surface" step. It does not inject suggestions into the agent's context. The "inform, don't manipulate" line is unchanged. The goal here is just to make the existing human-facing surface quiet enough and actionable enough to have a chance of converting, before we consider anything heavier.
What changed
- check_prompt.py: a skill is surfaced at most once per session, and not again across sessions until a cooldown elapses (WATCHMEN_SUGGEST_COOLDOWN_SECONDS, default 6h, set 0 to disable the cross-session layer). State lives in state/<project>.suggest_seen.json, pruned to a 14 day TTL so it can't grow unbounded. On a deduped match we stay silent and leave any standing suggestion untouched rather than re-asserting it. This collapses the 244-suggestion session to ~12.
- check_prompt.py: SCORE_THRESHOLD is now read from WATCHMEN_SUGGEST_THRESHOLD (default -0.5 unchanged), so the match bar can be calibrated against real false positives without a code change. I deliberately did not guess a stricter default. That needs labeled data.
- statusline.sh: the old line was past tense and defeatist: "you could have used /X to save time & tokens on this task". New copy leads with the runnable command and is actionable for the next similar prompt: "/X fits this kind of task, run it to save time & tokens".
- Mirrored to plugin-codex/ byte-for-byte (the no-drift smoke test enforces this). statusline.sh stays Claude-Code-only since Codex has no statusline surface.
Testing
- tests/test_check_prompt.py: same-session dedup, cross-session cooldown, cooldown=0, per-skill (not global) muting, env-configurable threshold, no-match clearing, seen-state TTL pruning
- uv run pytest tests/ green (546 passed), ruff clean, plugin no-drift test green
What this does not do
It does not by itself prove adoption will move. It removes the noise and fixes the framing so we can actually measure whether the human acts on a rare, well-timed suggestion. If uptake is still ~0 after this, the next fork is the heavier one: surfacing pre-turn or injecting into context, which is the intervention-layer call.
🤖 Generated with Claude Code