feat(plugin): stop the skill-suggestion firehose; reframe the surface#122

Merged
aktasbatuhan merged 3 commits into main from feat/skill-suggestion-surface
Jun 8, 2026

@aktasbatuhan

Member

Why

Triaging the dead-skill problem turned up the real cause, and it isn't skill quality or broken matching. Measured against the live corpus:

  • 149 curated skills, 136 indexed
  • the suggestion hook fired 632 times across 68 distinct skills
  • suggested-and-then-fired overlap: exactly 0. The one curated skill that ever fired was invoked directly, never via a suggestion

So conversion across the whole suggestion pipeline is 0/632. A big part of that is noise: the hook runs on every prompt with no memory, so the same skill gets re-suggested over and over. The busiest session alone fired 244 suggestions across just 12 skills, with railway-stack-provision suggested 75 times in that one session. A signal that repeats that often trains the reader to ignore it.

This is the reversible "human surface" step. It does not inject suggestions into the agent's context. The "inform, don't manipulate" line is unchanged. The goal here is just to make the existing human-facing surface quiet enough and actionable enough to have a chance of converting, before we consider anything heavier.

What changed

  • Dedup + cooldown (check_prompt.py): a skill is surfaced at most once per session, and not again across sessions until a cooldown elapses (WATCHMEN_SUGGEST_COOLDOWN_SECONDS, default 6h; set it to 0 to disable the cross-session layer). State lives in state/<project>.suggest_seen.json, pruned to a 14-day TTL so it can't grow unbounded. On a deduped match we stay silent and leave any standing suggestion untouched rather than re-asserting it. This collapses the 244-suggestion session to ~12.
  • Tunable threshold (check_prompt.py): SCORE_THRESHOLD is now read from WATCHMEN_SUGGEST_THRESHOLD (default -0.5 unchanged), so the match bar can be calibrated against real false positives without a code change. I deliberately did not guess a stricter default. That needs labeled data.
  • Copy reframe (statusline.sh): the old line was past tense and defeatist: "you could have used /X to save time & tokens on this task". The new copy leads with the runnable command and is actionable for the next similar prompt: "/X fits this kind of task, run it to save time & tokens".
  • Hook mirrored to plugin-codex/ byte-for-byte (the no-drift smoke test enforces this). statusline.sh stays Claude-Code-only since Codex has no statusline surface.
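
The dedup and cooldown layers described above can be sketched roughly as follows. This is an illustrative sketch, not the code in check_prompt.py: the function names (`cooldown_seconds`, `should_surface`, `record_seen`), the in-memory dict, and the fallback to the default on an unparsable value are assumptions. Only the environment variable name, the 6h default, the 14-day TTL, and the `session|slug` key shape come from this PR.

```python
import os

COOLDOWN_DEFAULT = 6 * 60 * 60    # 6h default, per the PR description
SEEN_TTL = 14 * 24 * 60 * 60      # 14-day TTL for seen-state entries


def cooldown_seconds(env=None) -> float:
    """Read the cross-session cooldown; 0 disables that layer."""
    env = os.environ if env is None else env
    raw = env.get("WATCHMEN_SUGGEST_COOLDOWN_SECONDS", "")
    try:
        return float(raw)
    except ValueError:
        # Assumption: unset or unparsable values fall back to the default.
        return float(COOLDOWN_DEFAULT)


def should_surface(seen: dict, session_id: str, slug: str,
                   now: float, cooldown: float) -> bool:
    """Return True if the skill may be shown for this prompt."""
    # Layer 1: at most once per session.
    if f"{session_id}|{slug}" in seen:
        return False
    # Layer 2: not again in other sessions until the cooldown elapses.
    if cooldown > 0:
        for key, stamp in seen.items():
            if key.rpartition("|")[2] == slug and now - stamp < cooldown:
                return False
    return True


def record_seen(seen: dict, session_id: str, slug: str, now: float) -> None:
    """Stamp the suggestion and prune entries older than the TTL."""
    seen[f"{session_id}|{slug}"] = now
    for key in [k for k, t in seen.items() if now - t > SEEN_TTL]:
        del seen[key]
```

Muting is per skill: a second skill in the same session is unaffected by the first skill's stamp, which is how 244 suggestions across 12 skills would shrink to roughly 12.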

Testing

  • 8 new tests in tests/test_check_prompt.py: same-session dedup, cross-session cooldown, cooldown=0, per-skill (not global) muting, env-configurable threshold, no-match clearing, seen-state TTL pruning
  • uv run pytest tests/ green (546 passed), ruff clean, plugin no-drift test green

What this does not do

It does not by itself prove adoption will move. It removes the noise and fixes the framing so we can actually measure whether the human acts on a rare, well-timed suggestion. If uptake is still ~0 after this, the next fork is the heavier one: surfacing pre-turn or injecting into context, which is the intervention-layer call.

🤖 Generated with Claude Code

aktasbatuhan and others added 3 commits June 8, 2026 14:53
The UserPromptSubmit hook fires on every prompt with no memory, so a single
skill gets re-suggested dozens of times per session (measured:
railway-stack-provision suggested 75x in one session; 632 suggestions across
15 sessions, with a 0/632 conversion to actual invocations). That noise is a
big part of why curated skills never get acted on.

- check_prompt.py: suggest a skill at most once per session, and not again
  across sessions until a cooldown elapses (WATCHMEN_SUGGEST_COOLDOWN_SECONDS,
  default 6h, 0 disables the cross-session layer). Per-(session,skill)
  seen-state in state/<project>.suggest_seen.json, pruned to a 14d TTL so it
  can't grow unbounded. On a deduped match we stay silent and leave any
  standing suggestion untouched rather than re-asserting it.
- check_prompt.py: SCORE_THRESHOLD is now env-tunable
  (WATCHMEN_SUGGEST_THRESHOLD) so the match bar can be calibrated without a
  code change.
- statusline.sh: drop the defeatist past-tense copy ("you could have used /X")
  for an actionable, present-tense nudge that leads with the runnable command.
- Mirrors the hook to plugin-codex/ byte-for-byte (no-drift test).
- 8 new tests in tests/test_check_prompt.py covering same-session dedup,
  cross-session cooldown, cooldown=0, per-skill (not global) muting, threshold
  config, no-match clearing, and seen-state TTL pruning.
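
For illustration, an env-tunable threshold of the kind described in the list above might be read like this. The helper name `read_threshold` and the fallback to the default on an unparsable value are assumptions. The variable name and the -0.5 default are taken from the PR description, and how the score is compared against the threshold is deliberately left out because the PR does not show it.

```python
import os

DEFAULT_THRESHOLD = -0.5  # default stated in the PR description


def read_threshold(env=None) -> float:
    """Return the match threshold, overridable via WATCHMEN_SUGGEST_THRESHOLD."""
    env = os.environ if env is None else env
    raw = env.get("WATCHMEN_SUGGEST_THRESHOLD")
    if raw is None:
        return DEFAULT_THRESHOLD
    try:
        return float(raw)
    except ValueError:
        # Assumption: a malformed value should not break the hook,
        # so fall back to the default instead of raising.
        return DEFAULT_THRESHOLD
```

Reading the value on each invocation keeps calibration a matter of exporting a variable, with no code change.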

This is the reversible "human surface" step; it does not inject suggestions
into the agent's context (the "inform, don't manipulate" line is unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…x review)

Codex review of #122 surfaced a hot-path crash: _recently_suggested did
`now - last` over whatever values were in state/<project>.suggest_seen.json,
so a corrupt or hand-edited non-numeric stamp would raise TypeError on EVERY
prompt for that project and break prompt submission.

- _recently_suggested now only considers numeric stamps in the cooldown scan;
  junk is ignored (falls back to "not recently suggested", then _record_seen
  rewrites a clean value and prunes the junk — self-healing).
- The dedup gate in main() is wrapped to fail open: any unexpected error there
  proceeds as not-recently-suggested rather than raising. The hook must never
  crash the prompt.
- 2 new tests: corrupt seen-state doesn't crash + still surfaces (fail-open),
  and the cooldown scan skips non-numeric stamps.
- Re-mirrored to plugin-codex/ byte-for-byte.
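
A rough sketch of the two defenses listed above: skipping non-numeric stamps in the cooldown scan, and wrapping the gate so it fails open. The names `cooldown_hit` and `dedup_gate` are hypothetical stand-ins and do not reproduce `_recently_suggested` or `main()`; the `session|slug` key shape is assumed from the later commit message.

```python
def cooldown_hit(seen: dict, slug: str, now: float, cooldown: float) -> bool:
    """Cooldown scan that ignores junk stamps instead of raising TypeError."""
    if cooldown <= 0:
        return False
    for key, last in seen.items():
        if key.rpartition("|")[2] != slug:
            continue
        # bool is a subclass of int, so exclude it explicitly;
        # strings, None, lists and other junk are skipped as well.
        if isinstance(last, bool) or not isinstance(last, (int, float)):
            continue
        if now - last < cooldown:
            return True
    return False


def dedup_gate(seen, slug: str, now: float, cooldown: float) -> bool:
    """Fail open: any unexpected error means 'not recently suggested'."""
    try:
        return cooldown_hit(seen, slug, now, cooldown)
    except Exception:
        return False
```

Failing open trades an occasional repeated suggestion for the guarantee that a corrupt state file cannot block prompt submission.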

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#2)

Codex re-review flagged a should-fix: with no session_id, _recently_suggested
mapped every prompt to the shared "?" bucket and suppressed on key presence
alone, short-circuiting before the cooldown. Effect: the first no-session
suggestion recorded "?|slug" and then every later no-session prompt for that
skill was muted forever — even with WATCHMEN_SUGGEST_COOLDOWN_SECONDS=0 — and
TTL never pruned it (the suppressed path returns before _record_seen). Matters
for any harness whose hook event lacks session_id.

- The same-session "suggest once" shortcut now only applies when session_id is
  truthy. With no session, fall through to the time-based cooldown scan, which
  already handles cooldown=0 and window expiry correctly via the numeric stamp.
- 2 new tests: no-session + cooldown=0 keeps surfacing (no wedge); no-session
  respects the cooldown window and frees up after it elapses.
- Re-mirrored to plugin-codex/ byte-for-byte.
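
The fix above can be illustrated with a small sketch. The function name `is_suppressed` is hypothetical, and the `"?"` bucket for missing session ids is taken from the commit message; the rest is an assumed simplification of the real hook.

```python
def is_suppressed(seen: dict, session_id, slug: str,
                  now: float, cooldown: float) -> bool:
    """Decide whether a matched skill should stay silent."""
    # Same-session "suggest once" shortcut: only when a session id exists.
    # Without this guard, every no-session prompt would share the "?" bucket
    # and be muted on key presence alone, regardless of the cooldown.
    if session_id and f"{session_id}|{slug}" in seen:
        return True
    # No session (or a different one): fall through to the time-based scan.
    if cooldown <= 0:
        return False
    for key, last in seen.items():
        if key.rpartition("|")[2] != slug:
            continue
        if isinstance(last, bool) or not isinstance(last, (int, float)):
            continue
        if now - last < cooldown:
            return True
    return False
```

With no session id, a recorded `"?|slug"` entry mutes the skill only while its stamp is inside the cooldown window, so a cooldown of 0 keeps surfacing and the window frees up once it elapses.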

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aktasbatuhan aktasbatuhan merged commit c927762 into main Jun 8, 2026
7 checks passed
@aktasbatuhan aktasbatuhan deleted the feat/skill-suggestion-surface branch June 8, 2026 15:15
aktasbatuhan added a commit that referenced this pull request Jun 8, 2026
…d it (#123)

The plugin ships to Claude Code / Codex via a marketplace clone, and `/plugin`
only refreshes its installed cache when plugin.json's version changes. The
version was frozen at 0.1.6 since 2026-05-12 while plugin/bin kept changing
(through #122), so every user who installed via /plugin was stranded on a stale
plugin — the dedup hook and reframed statusline never reached them.

- Bump plugin/.claude-plugin/plugin.json and
  plugin-codex/.codex-plugin/plugin.json 0.1.6 -> 0.1.7 to carry the #122
  plugin payload.
- CI guard (plugin-version job): on PRs, if anything under
  plugin*/bin|hooks|skills changed vs base, both manifest versions must bump
  too. Stops the freeze from recurring.
- Unit test: the two manifests must declare the same version (lockstep), so a
  one-sided bump can't strand the other harness.
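
A lockstep check along the lines described above could look like the following sketch. The helper names are hypothetical, and it assumes each manifest is JSON with a top-level `version` key; the two manifest paths are the ones named in the commit message.

```python
import json
from pathlib import Path

# Manifest paths as named in the commit message.
MANIFESTS = (
    "plugin/.claude-plugin/plugin.json",
    "plugin-codex/.codex-plugin/plugin.json",
)


def declared_version(manifest_text: str) -> str:
    """Extract the version from manifest JSON (assumes a top-level 'version')."""
    return json.loads(manifest_text)["version"]


def in_lockstep(*manifest_texts: str) -> bool:
    """True when every manifest declares the same version."""
    return len({declared_version(t) for t in manifest_texts}) == 1


def repo_in_lockstep(root: str = ".") -> bool:
    """Read both manifests under root and compare their versions."""
    texts = [Path(root, rel).read_text(encoding="utf-8") for rel in MANIFESTS]
    return in_lockstep(*texts)
```

A unit test could then assert `repo_in_lockstep()` from the repository root, so a one-sided bump fails before it ships.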

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>