From 0c13846925f0cb52372c549d8386977a105e14e8 Mon Sep 17 00:00:00 2001
From: Yogesh Rao
Date: Wed, 29 Apr 2026 10:48:49 +0530
Subject: [PATCH 1/2] feat: improve skill scores for forge
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hey @alxxpersonal 👋

I ran your skills through `tessl skill review` at work and found some targeted improvements. Here's the full before/after:

| Skill | Before | After | Change |
|-------|--------|-------|--------|
| commit-forge | 90% | 100% | +10% |
| model-forge | 90% | 99% | +9% |
| skill-forge | 86% | 90% | +4% |
| readme-forge | 86% | 90% | +4% |
| prompt-forge | 86% | 89% | +3% |

![Score Card](https://raw.githubusercontent.com/yogesh-tessl/forge/improve/skill-review-optimization/score_card.png)

What changed

**commit-forge (90% → 100%)**
- Removed redundant "What Makes a Good/Bad Commit" tables and checklist that duplicated the hard rules and process steps - the skill now trusts Claude to internalize rules without restating them

**model-forge (90% → 99%)**
- Extracted codex-specific sections (failure modes, prompt requirements, optimal workflow) into a separate `CODEX.md` reference file, improving progressive disclosure and reducing token load for non-codex routing decisions
- Consolidated anti-patterns list by removing items already covered in codex failure modes

**skill-forge (86% → 90%)**
- Expanded description with specific sub-actions (generate YAML frontmatter, define trigger conditions, structure instructions, validate against best practices)
- Removed "Common Mistakes" section that duplicated the validation checklist
- Trimmed explanation of routing internals Claude already understands

**readme-forge (86% → 90%)**
- Expanded description with concrete deliverables (project overview, quickstart, features, architecture diagram)
- Consolidated style rules by removing items already covered in the template section (banner/emoji rules, bold formatting rules)

**prompt-forge (86% → 89%)**
- Added anti-trigger clause ("NOT for general AI application architecture or LLM integration plumbing") to reduce false matches
- Added explicit 5-step development workflow (Draft → Test → Evaluate → Iterate → Ship)
- Trimmed explanatory rationale Claude already knows (primacy/recency effects, why structure beats prose, ReAct pattern explanation)
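
A rough way to sanity-check the description changes listed above on a local checkout. This is a minimal sketch and an assumption on my part: it is not how `tessl skill review` scores anything. `read_description` and `describe_shape` are hypothetical helpers, and the parser only handles the single-line `description:` form used in these skills, not general YAML.

```python
import re


def read_description(skill_md: str) -> str:
    """Return the `description` value from a SKILL.md frontmatter block.

    Hypothetical helper: supports only a single-line `description:` entry.
    """
    match = re.match(r"\A---\n(.*?)\n---\n", skill_md, flags=re.DOTALL)
    if match is None:
        raise ValueError("no frontmatter block found")
    for line in match.group(1).splitlines():
        if line.startswith("description:"):
            # Drop the key, surrounding whitespace, and optional double quotes.
            return line[len("description:"):].strip().strip('"')
    raise ValueError("frontmatter has no description field")


def describe_shape(description: str) -> dict[str, bool]:
    """Report whether a trigger clause and an anti-trigger clause are present."""
    lowered = description.lower()
    return {
        "has_trigger": "use when" in lowered,
        "has_anti_trigger": "not for" in lowered,
    }


# Shortened stand-in for the new prompt-forge frontmatter (illustrative only).
sample = (
    "---\n"
    "name: prompt-forge\n"
    'description: "Write, review, and optimize LLM prompts. '
    "Use when writing system prompts. "
    'NOT for general AI application architecture."\n'
    "---\n"
    "\n"
    "# Prompt Forge\n"
)

shape = describe_shape(read_description(sample))
print(shape)
```

The substring checks are deliberately crude: they only flag whether a "Use when" trigger and a "NOT for" anti-trigger appear at all, which is enough to spot a skill whose description is missing one of them.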

I kept this PR focused on the 5 core forge skills with the biggest improvements to keep the diff reviewable. Happy to follow up with the charm-* skills in a separate PR if you'd like.

Honest disclosure - I work at @tesslio where we build tooling around skills like these. Not a pitch - just saw room for improvement and wanted to contribute.

Want to self-improve your skills? Just point your agent (Claude Code, Codex, etc.) at [this Tessl guide](https://docs.tessl.io/evaluate/optimize-a-skill-using-best-practices) and ask it to optimize your skill. Ping me - [@yogesh-tessl](https://github.com/yogesh-tessl) - if you hit any snags.

Thanks in advance 🙏
---
 skills/commit-forge/SKILL.md | 27 -----------------
 skills/model-forge/CODEX.md  | 44 ++++++++++++++++++++++++++++
 skills/model-forge/SKILL.md  | 56 +++++-------------------------------
 skills/prompt-forge/SKILL.md | 34 ++++++++--------------
 skills/readme-forge/SKILL.md | 16 ++---------
 skills/skill-forge/SKILL.md  | 14 +--------
 6 files changed, 67 insertions(+), 124 deletions(-)
 create mode 100644 skills/model-forge/CODEX.md

diff --git a/skills/commit-forge/SKILL.md b/skills/commit-forge/SKILL.md
index 3697ab8..a5d32fb 100644
--- a/skills/commit-forge/SKILL.md
+++ b/skills/commit-forge/SKILL.md
@@ -121,30 +121,3 @@ git show --stat
 
 Confirm the commit looks right.
 
-## What Makes a Good Commit
-
-| Good | Bad |
-|------|-----|
-| One logical change | Multiple unrelated changes |
-| All tests pass | Breaks something |
-| Message says why | Message says "update stuff" |
-| Can be reverted cleanly | Reverting would break other things |
-| Specific files staged | `git add .` |
-
-## What Makes a Bad Commit Message
-
-- "fix stuff" / "update" / "changes" / "WIP"
-- Using "and" to describe two unrelated things
-- Describing how instead of what/why
-- Over 72 characters in the subject line
-
-## Checklist
-
-Before committing:
-
-- [ ] Changes are atomic (one logical unit)
-- [ ] Specific files staged (not `git add .`)
-- [ ] Tests pass (if applicable)
-- [ ] Message follows `type(scope): description` format
-- [ ] No co-author tags
-- [ ] No sensitive files staged (.env, credentials, keys)
diff --git a/skills/model-forge/CODEX.md b/skills/model-forge/CODEX.md
new file mode 100644
index 0000000..5646c4a
--- /dev/null
+++ b/skills/model-forge/CODEX.md
@@ -0,0 +1,44 @@
+# Codex Reference
+
+Detailed guidance for using gpt-5.4 (codex) effectively in a claude + codex hybrid workflow.
+
+## Failure Modes
+
+Codex is **too literal**. It will fulfill the exact request and break adjacent things. Real example from HN: asked to "fix compiler warnings", it made a bunch of values nullable to silence the warnings - technically correct, broke data integrity downstream.
+
+**Codex will fail at (NEVER use it for):**
+- **UI work of any kind** - layouts, components, styling, UX, design judgment. Codex has zero taste and no eye for visual hierarchy. Always claude.
+- **Anything you need to discuss** - "yo i need to talk about this", brainstorming, exploration, "what do you think", trade-off conversations. Codex is a one-shot executor, not a collaborator. Always claude.
+- **Reading user intent** - "make it better" → won't make leaps, will pick the most literal interpretation
+- **Architecture/product judgment** - no sense of tradeoffs beyond what's written
+- **Obscure libraries** - hallucinates rather than admitting unknown
+- **Multi-file dependency chains** - gets stuck in circles
+- **Continuous conversation context** - context degrades fast, not built for back-and-forth
+
+**Codex will excel at:**
+- Tasks where being literal is a feature (test writing, mechanical refactors)
+- Clearly-bounded specs with acceptance criteria
+- Adversarial passes with explicit focus arguments
+
+## Prompt Requirements
+
+- One concern per prompt (don't bundle)
+- Explicit scope boundaries (which files, what behavior, what done looks like)
+- Structured sections: General → Autonomy → Code Implementation → Editing Constraints → Exploration → Plan Tool
+- Use the positional focus argument: `/codex:adversarial-review challenge whether this caching design was right`
+- Heavy XML structure beats prose for multi-step specs
+
+## Optimal Workflow
+
+The pattern that works (claude + codex hybrid):
+
+1. **Claude builds** - architecture, exploration, back-and-forth iteration
+2. **Codex reviews** - dispatch `/codex:adversarial-review` with a specific focus after a major change. 6-10 min wait, actionable findings
+3. **Claude filters** - triage codex output: "which are real issues vs noise, which need design decisions"
+4. **Codex fixes** - confirmed bugs with clear specs get delegated back to codex
+5. **Batch maintenance** - queue codex with 3-5 P2 tasks at session start, work on real stuff while it churns
+
+What doesn't work:
+- Letting codex drive architecture decisions
+- Vague task descriptions
+- Auto-running codex on every claude response (drains usage fast)
diff --git a/skills/model-forge/SKILL.md b/skills/model-forge/SKILL.md
index a0bd732..f5b81b7 100644
--- a/skills/model-forge/SKILL.md
+++ b/skills/model-forge/SKILL.md
@@ -144,58 +144,16 @@ Adversarial engineer with different priors than claude. Mechanically excellent o
 | bug hunt | codex | worktree | no | yes |
 | security audit | codex | feat | no | yes |
 
-## Codex-Specific Failure Modes
-
-Codex is **too literal**. It will fulfill the exact request and break adjacent things. Real example from HN: asked to "fix compiler warnings", it made a bunch of values nullable to silence the warnings - technically correct, broke data integrity downstream.
-
-**Codex will fail at (NEVER use it for):**
-- **UI work of any kind** - layouts, components, styling, UX, design judgment. Codex has zero taste and no eye for visual hierarchy. Always claude.
-- **Anything you need to discuss** - "yo i need to talk about this", brainstorming, exploration, "what do you think", trade-off conversations. Codex is a one-shot executor, not a collaborator. Always claude.
-- **Reading user intent** - "make it better" → won't make leaps, will pick the most literal interpretation
-- **Architecture/product judgment** - no sense of tradeoffs beyond what's written
-- **Obscure libraries** - hallucinates rather than admitting unknown
-- **Multi-file dependency chains** - gets stuck in circles
-- **Continuous conversation context** - context degrades fast, not built for back-and-forth
-
-**Codex will excel at:**
-- Tasks where being literal is a feature (test writing, mechanical refactors)
-- Clearly-bounded specs with acceptance criteria
-- Adversarial passes with explicit focus arguments
-
-**Prompt requirements for codex:**
-- One concern per prompt (don't bundle)
-- Explicit scope boundaries (which files, what behavior, what done looks like)
-- Structured sections: General → Autonomy → Code Implementation → Editing Constraints → Exploration → Plan Tool
-- Use the positional focus argument: `/codex:adversarial-review challenge whether this caching design was right`
-- Heavy XML structure beats prose for multi-step specs
-
-## Optimal Codex Workflow
-
-The pattern that works (claude + codex hybrid):
-
-1. **Claude builds** - architecture, exploration, back-and-forth iteration
-2. **Codex reviews** - dispatch `/codex:adversarial-review` with a specific focus after a major change. 6-10 min wait, actionable findings
-3. **Claude filters** - triage codex output: "which are real issues vs noise, which need design decisions"
-4. **Codex fixes** - confirmed bugs with clear specs get delegated back to codex
-5. **Batch maintenance** - queue codex with 3-5 P2 tasks at session start, work on real stuff while it churns
-
-What doesn't work:
-- Letting codex drive architecture decisions
-- Vague task descriptions
-- Auto-running codex on every claude response (drains usage fast)
+## Codex Reference
+
+For codex failure modes, prompt requirements, and the optimal claude + codex hybrid workflow, read `CODEX.md`.
 
 ## Anti-patterns
 
-1. **Codex for time-sensitive work** - 10+ min latency kills interactive flow
-2. **Codex with vague prompts** - it will interpret literally and break things
-3. **Opus for trivial tool calls** - haiku does it at 5% cost
-4. **Worktrees for non-overlapping single-agent work** - merge overhead with no upside
-5. **Big refactors on main** - hard rollback, polluted history
-6. **Parallel agents on same files without worktrees** - concurrent write conflicts
-7. **Dispatching codex synchronously** - always background, never wait
-8. **Forgetting `--model gpt-5.4`** - default selection is unreliable
-9. **Codex for UI work** - zero taste, no visual judgment, will produce bricks
-10. **Codex when you need to discuss/brainstorm** - it's a one-shot executor, not a collaborator
+1. **Opus for trivial tool calls** - haiku does it at 5% cost
+2. **Worktrees for non-overlapping single-agent work** - merge overhead with no upside
+3. **Big refactors on main** - hard rollback, polluted history
+4. **Parallel agents on same files without worktrees** - concurrent write conflicts
 
 ## Examples
 
diff --git a/skills/prompt-forge/SKILL.md b/skills/prompt-forge/SKILL.md
index 2863152..7954ebf 100644
--- a/skills/prompt-forge/SKILL.md
+++ b/skills/prompt-forge/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: prompt-forge
-description: Universal prompt engineering guide for writing, reviewing, and optimizing LLM prompts across Claude and OpenAI models. Use when writing system prompts, designing extraction pipelines, building classification or summarization prompts, optimizing for cost/latency, reviewing existing prompts for quality, or any task involving prompt design for production AI systems. Trigger on keywords like "prompt", "system prompt", "few-shot", "extraction prompt", "prompt engineering", "prompt review", or when the user is building any AI-powered feature that needs a well-crafted prompt.
+description: "Write, review, and optimize LLM prompts across Claude and OpenAI models - design extraction pipelines, build classification or summarization prompts, optimize for cost/latency, and audit existing prompts for quality. Use when writing system prompts, designing extraction pipelines, building classification or summarization prompts, or reviewing prompts. Trigger on 'prompt', 'system prompt', 'few-shot', 'extraction prompt', 'prompt engineering', 'prompt review'. NOT for general AI application architecture or LLM integration plumbing."
 ---
 
 # Prompt Forge
@@ -13,9 +13,9 @@ For SDK code examples and implementation patterns, read `references/code-pattern
 
 1. **Be Explicit, Not Implicit.** Treat every prompt like onboarding a new hire - spell out role, task, constraints, output format, and edge cases.
 
-2. **Structure Beats Prose.** Structured prompts with clear sections outperform wall-of-text instructions. Use XML tags for Claude, markdown headers or XML for GPT.
+2. **Structure Beats Prose.** Use XML tags for Claude, markdown headers or XML for GPT.
 
-3. **Show, Don't Just Tell.** Few-shot examples are the single highest-leverage technique. 3-5 examples covering happy paths and edge cases. With prompt caching, afford 20+.
+3. **Show, Don't Just Tell.** 3-5 few-shot examples covering happy paths and edge cases. With prompt caching, afford 20+.
 
 4. **Constrain the Output Space.** Define exactly what success looks like. Schemas, templates, or format specs. Tighter output contract = more reliable results.
 
@@ -23,7 +23,7 @@ For SDK code examples and implementation patterns, read `references/code-pattern
 
 6. **Positive Instructions Over Negative.** "Write in plain prose paragraphs" beats "Don't use markdown". Tell the model what TO do.
 
-7. **Order Matters.** Place long documents ABOVE instructions. Place critical instructions at beginning and end (primacy and recency effects).
+7. **Order Matters.** Place long documents ABOVE instructions. Place critical instructions at beginning and end.
 
 ## Prompt Section Ordering
 
@@ -43,12 +43,6 @@ System prompts are not monolithic strings. They are ordered arrays of sections.
 10. **User instructions** - user-provided rules (config files, overrides)
 11. **Memory** - persistent cross-session context
 
-**Why this order works:**
-- Static sections first = cacheable prefix (cache matches prefixes)
-- Identity and rules before tools = model internalizes constraints before seeing capabilities
-- User instructions AFTER defaults but marked as overrides = user can override any default
-- Dynamic sections last = only the tail changes between turns, maximizing cache hits
-
 **User instruction override pattern:**
 ```
 User instructions are shown below. Be sure to adhere to these instructions.
@@ -131,18 +125,6 @@ Preprocess inputs: strip signatures, disclaimers, HTML, whitespace. Set max_char
 ### Code Generation
 Role + `` (existing patterns, stack, style) + `` (scope tightly, error handling, follow patterns) + `` (relevant existing code).
 
-### Multi-Step Reasoning (ReAct)
-```
-For each step:
-1. THOUGHT: reason about what information you need
-2. ACTION: call the appropriate tool
-3. OBSERVATION: analyze the result
-4. Repeat until you have enough to answer
-5. ANSWER: provide the final response
-```
-
-Use Claude's extended thinking or GPT's reasoning_effort for complex reasoning rather than forcing CoT when the model natively supports it.
-
 ### Agent Delegation
 Subagents should NOT inherit the full parent prompt. Strip to: identity, task scope, constraints, environment. For full patterns, read `references/agentic-patterns.md`.
 
@@ -192,6 +174,14 @@ Build confidence tracking into schemas: per-field confidence (HIGH/MEDIUM/LOW/MI
 
 - **Batch API:** 50% discount for non-realtime workloads (both providers)
 
+## Development Workflow
+
+1. **Draft** - define role, rules, output format, and examples using the patterns above
+2. **Test** - run against 5+ real inputs covering happy path, edge cases, and malformed input
+3. **Evaluate** - score outputs against expected results, identify weak fields or failure modes
+4. **Iterate** - add examples targeting weak spots, tighten constraints, adjust model tier
+5. **Ship** - enable caching, configure model tiering, set up monitoring
+
 ## Prompt Checklist
 
 ### Structure
diff --git a/skills/readme-forge/SKILL.md b/skills/readme-forge/SKILL.md
index 36fa8cb..9bb3350 100644
--- a/skills/readme-forge/SKILL.md
+++ b/skills/readme-forge/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: readme-forge
-description: Generate a README in alxx's personal style. Use when user says "write a README", "create README", "readme for this project", "update README", or needs a repo README.
+description: "Generate a README with project overview, quickstart, features, and architecture diagram in alxx's personal style. Use when user says 'write a README', 'create README', 'readme for this project', 'update README', or needs a repo README."
 disable-model-invocation: false
 argument-hint: "[repo path or leave blank for current]"
 ---
@@ -112,22 +112,12 @@ One line linking to LICENSE file. e.g. `[MIT](LICENSE)`
 
 These are non-negotiable:
 
-- **h1 or banner** - either `# Name` or full-width banner image with `---` separator
-- **Bold tagline** directly under h1/banner, no gap
-- **Overview** - first section is always Overview. use 🚀 emoji only when banner is present, plain `## Overview` otherwise
-- **Blockquote pitch** for the audience, in or after overview
-- **Bold analogy** in overview (comparing to something familiar)
-- **Bold italic** (`***text***`) for punchy one-liners
-- **Bold feature names** with dash separator in features list
-- **Mermaid diagram** - always include one, flowchart LR preferred
 - **Quickstart must be copy-paste** - someone should be able to run every command in order
-- **FAQ is optional** - rarely used, only include for projects with common objections (e.g. "why not X?"). if something needs explaining, prefer putting it in the relevant section
 - **No badges** unless the project has CI set up
-- **Bold key phrases in longer text** - in paragraphs, `**bold**` the important terms/concepts so readers can scan (e.g. "Nebula is **Git for agent context & tasks**")
+- **Bold key phrases in longer text** - `**bold**` important terms so readers can scan
 - **Dense, no fluff** - every sentence carries information
 - **No "Getting Started" header** - use "Quickstart" for apps/CLIs, "Install" for packages/skills/libraries
-- **No separate "Installation" header** - fold into Quickstart or Install
-- **No "Usage" header** - fold into Quickstart or Features
+- **No separate "Installation" or "Usage" headers** - fold into Quickstart or Features
 - **No long dashes** - never use "—" (em dash), use "-" or commas instead
 - **No AI slop** - no "furthermore", "it's worth noting", "comprehensive"
 - **Casual but professional** - not corporate, not too informal for a repo README
diff --git a/skills/skill-forge/SKILL.md b/skills/skill-forge/SKILL.md
index 2be7a4c..f2bd129 100644
--- a/skills/skill-forge/SKILL.md
+++ b/skills/skill-forge/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: skill-forge
-description: Write or improve an LLM agent skill. Use when user says "create skill", "write a skill", "improve this skill", "new skill", "skill for X", or wants to build a SKILL.md file.
+description: "Write or improve an LLM agent skill - generate YAML frontmatter, define trigger conditions, structure step-by-step instructions, and validate against best practices. Use when user says 'create skill', 'write a skill', 'improve this skill', 'new skill', 'skill for X', or wants to build a SKILL.md file."
 disable-model-invocation: true
 argument-hint: "[skill name or path to existing SKILL.md]"
 ---
@@ -71,8 +71,6 @@ description: ... # THE most important field. See below.
 
 ## Step 4: Write the Description
 
-The description is what Claude pattern-matches against to decide whether to load the skill. There is NO algorithmic routing - Claude's language model reads all descriptions and semantically matches.
-
 ### Formula
 
 1. **What it does** - one sentence, outcome-first
@@ -165,13 +163,3 @@ Run this checklist on the generated skill:
 - [ ] Steps are numbered when order matters
 - [ ] Supporting files referenced if body would exceed 500 lines
 
-## Common Mistakes
-
-1. **Vague descriptions** - "Use when user wants to plan" triggers on everything
-2. **Missing `disable-model-invocation: true` on side-effect skills** - Claude will auto-commit, auto-deploy
-3. **Body too long** - 500+ lines burns context. Split into supporting files.
-4. **Redundant description + body** - description matches, body instructs. Don't duplicate.
-5. **`context: fork` on reference content** - subagent gets instructions with no actionable task
-6. **Over-explaining in description** - the description isn't the instructions
-7. **Missing NOT conditions** - for behavioral modes, add explicit "NEVER activate unless..."
-8. **`user-invocable: false` on things users should manually trigger** - confusing

From 32aee2dccee463972ec7566a808dadf15f21d6be Mon Sep 17 00:00:00 2001
From: Yogesh Rao
Date: Wed, 29 Apr 2026 20:21:58 +0530
Subject: [PATCH 2/2] ci: add skill-review GitHub Action for automated skill review on PRs

---
 .github/workflows/skill-review.yml | 13 +++++++++++++
 1 file changed, 13 insertions(+)
 create mode 100644 .github/workflows/skill-review.yml

diff --git a/.github/workflows/skill-review.yml b/.github/workflows/skill-review.yml
new file mode 100644
index 0000000..f0ab779
--- /dev/null
+++ b/.github/workflows/skill-review.yml
@@ -0,0 +1,13 @@
+name: Skill Review
+on:
+  pull_request:
+    paths: ['**/SKILL.md']
+jobs:
+  review:
+    runs-on: ubuntu-latest
+    permissions:
+      pull-requests: write
+      contents: read
+    steps:
+      - uses: actions/checkout@v4
+      - uses: tesslio/skill-review@22e928dd837202b2b1d1397e0114c92e0fae5ead # main