From 9bf55333801be428b62744795f31e4986181b6a6 Mon Sep 17 00:00:00 2001 From: Erich Kuerschner Date: Fri, 26 Jun 2026 08:51:21 -0500 Subject: [PATCH 1/5] upgrade skill-creator skill --- .agents/skills/skill-creator/SKILL.md | 41 +- .../skills/skill-creator/agents/analyzer.md | 31 +- .../skills/skill-creator/agents/comparator.md | 25 +- .agents/skills/skill-creator/agents/grader.md | 6 +- .../skill-creator/assets/eval_review.html | 384 +-- .../skill-creator/eval-viewer/viewer.html | 2665 ++++++++--------- .../skill-creator/references/schemas.md | 43 +- .../__pycache__/__init__.cpython-313.pyc | Bin 189 -> 0 bytes .../aggregate_benchmark.cpython-313.pyc | Bin 18264 -> 0 bytes skills-lock.json | 1 + 10 files changed, 1433 insertions(+), 1763 deletions(-) delete mode 100644 .agents/skills/skill-creator/scripts/__pycache__/__init__.cpython-313.pyc delete mode 100644 .agents/skills/skill-creator/scripts/__pycache__/aggregate_benchmark.cpython-313.pyc diff --git a/.agents/skills/skill-creator/SKILL.md b/.agents/skills/skill-creator/SKILL.md index 8f12eaa0f7..65b3a402db 100644 --- a/.agents/skills/skill-creator/SKILL.md +++ b/.agents/skills/skill-creator/SKILL.md @@ -86,7 +86,6 @@ skill-name/ #### Progressive Disclosure Skills use a three-level loading system: - 1. **Metadata** (name + description) - Always in context (~100 words) 2. **SKILL.md body** - In context whenever skill triggers (<500 lines ideal) 3. **Bundled resources** - As needed (unlimited, scripts can execute without loading) @@ -94,13 +93,11 @@ Skills use a three-level loading system: These word counts are approximate and you can feel free to go longer if needed. **Key patterns:** - - Keep SKILL.md under 500 lines; if you're approaching this limit, add an additional layer of hierarchy along with clear pointers about where the model using the skill should go next to follow up. - Reference files clearly from SKILL.md with guidance on when to read them - For large reference files (>300 lines), include a table of contents **Domain organization**: When a skill supports multiple domains/frameworks, organize by variant: - ``` cloud-deploy/ ├── SKILL.md (workflow + selection) @@ -109,7 +106,6 @@ cloud-deploy/ ├── gcp.md └── azure.md ``` - Claude reads only the relevant reference file. #### Principle of Lack of Surprise @@ -121,26 +117,18 @@ This goes without saying, but skills must not contain malware, exploit code, or Prefer using the imperative form in instructions. **Defining output formats** - You can do it like this: - ```markdown ## Report structure - ALWAYS use this exact template: - # [Title] - ## Executive summary - ## Key findings - ## Recommendations ``` **Examples pattern** - It's useful to include examples. You can format them like this (but if "Input" and "Output" are in the examples you might want to deviate a little): - ```markdown ## Commit message format - **Example 1:** Input: Added user authentication with JWT tokens Output: feat(auth): implement JWT-based authentication @@ -194,7 +182,6 @@ Execute this task: ``` **Baseline run** (same prompt, but the baseline depends on context): - - **Creating a new skill**: no skill at all. Same prompt, no skill path, save to `without_skill/outputs/`. - **Improving an existing skill**: the old version. Before editing, snapshot the skill (`cp -r /skill-snapshot/`), then point the baseline subagent at the snapshot. Save to `old_skill/outputs/`. @@ -238,18 +225,15 @@ Once all runs are done: 1. **Grade each run** — spawn a grader subagent (or grade inline) that reads `agents/grader.md` and evaluates each assertion against the outputs. Save results to `grading.json` in each run directory. The grading.json expectations array must use the fields `text`, `passed`, and `evidence` (not `name`/`met`/`details` or other variants) — the viewer depends on these exact field names. For assertions that can be checked programmatically, write and run a script rather than eyeballing it — scripts are faster, more reliable, and can be reused across iterations. 2. **Aggregate into benchmark** — run the aggregation script from the skill-creator directory: - ```bash python -m scripts.aggregate_benchmark /iteration-N --skill-name ``` - This produces `benchmark.json` and `benchmark.md` with pass_rate, time, and tokens for each configuration, with mean ± stddev and the delta. If generating benchmark.json manually, see `references/schemas.md` for the exact schema the viewer expects. - Put each with_skill version before its baseline counterpart. +Put each with_skill version before its baseline counterpart. 3. **Do an analyst pass** — read the benchmark data and surface patterns the aggregate stats might hide. See `agents/analyzer.md` (the "Analyzing Benchmark Results" section) for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs. 4. **Launch the viewer** with both qualitative outputs and quantitative data: - ```bash nohup python /eval-viewer/generate_review.py \ /iteration-N \ @@ -258,7 +242,6 @@ Once all runs are done: > /dev/null 2>&1 & VIEWER_PID=$! ``` - For iteration 2+, also pass `--previous-workspace /iteration-`. **Cowork / headless environments:** If `webbrowser.open()` is not available or the environment has no display, use `--static ` to write a standalone HTML file instead of starting a server. Feedback will be downloaded as a `feedback.json` file when the user clicks "Submit All Reviews". After download, copy `feedback.json` into the workspace directory for the next iteration to pick up. @@ -270,7 +253,6 @@ Note: please use generate_review.py to create the viewer; there's no need to wri ### What the user sees in the viewer The "Outputs" tab shows one test case at a time: - - **Prompt**: the task that was given - **Output**: the files the skill produced, rendered inline where possible - **Previous Output** (iteration 2+): collapsed section showing last iteration's output @@ -289,13 +271,9 @@ When the user tells you they're done, read `feedback.json`: ```json { "reviews": [ - { - "run_id": "eval-0-with_skill", - "feedback": "the chart is missing axis labels", - "timestamp": "..." - }, - { "run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..." }, - { "run_id": "eval-2-with_skill", "feedback": "perfect, love this", "timestamp": "..." } + {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."}, + {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}, + {"run_id": "eval-2-with_skill", "feedback": "perfect, love this", "timestamp": "..."} ], "status": "complete" } @@ -321,7 +299,7 @@ This is the heart of the loop. You've run the test cases, the user has reviewed 2. **Keep the prompt lean.** Remove things that aren't pulling their weight. Make sure to read the transcripts, not just the final outputs — if it looks like the skill is making the model waste a bunch of time doing things that are unproductive, you can try getting rid of the parts of the skill that are making it do that and seeing what happens. -3. **Explain the why.** Try hard to explain the **why** behind everything you're asking the model to do. Today's LLMs are _smart_. They have good theory of mind and when given a good harness can go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task and why the user is writing what they wrote, and what they actually wrote, and then transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag — if possible, reframe and explain the reasoning so that the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach. +3. **Explain the why.** Try hard to explain the **why** behind everything you're asking the model to do. Today's LLMs are *smart*. They have good theory of mind and when given a good harness can go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task and why the user is writing what they wrote, and what they actually wrote, and then transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag — if possible, reframe and explain the reasoning so that the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach. 4. **Look for repeated work across test cases.** Read the transcripts from the test runs and notice if the subagents all independently wrote similar helper scripts or took the same multi-step approach to something. If all 3 test cases resulted in the subagent writing a `create_docx.py` or a `build_chart.py`, that's a strong signal the skill should bundle that script. Write it once, put it in `scripts/`, and tell the skill to use it. This saves every future invocation from reinventing the wheel. @@ -338,7 +316,6 @@ After improving the skill: 5. Read the new feedback, improve again, repeat Keep going until: - - The user says they're happy - The feedback is all empty (everything looks good) - You're not making meaningful progress @@ -363,8 +340,8 @@ Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save ```json [ - { "query": "the user prompt", "should_trigger": true }, - { "query": "another prompt", "should_trigger": false } + {"query": "the user prompt", "should_trigger": true}, + {"query": "another prompt", "should_trigger": false} ] ``` @@ -459,7 +436,6 @@ In Claude.ai, the core workflow is the same (draft → test → review → impro **Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file. **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. In this case: - - **Preserve the original name.** Note the skill's directory name and `name` frontmatter field -- use them unchanged. E.g., if the installed skill is `research-helper`, output `research-helper.skill` (not `research-helper-v2`). - **Copy to a writeable location before editing.** The installed skill path may be read-only. Copy to `/tmp/skill-name/`, edit there, and package from the copy. - **If packaging manually, stage in `/tmp/` first**, then copy to the output directory -- direct writes may fail due to permissions. @@ -472,7 +448,7 @@ If you're in Cowork, the main things to know are: - You have subagents, so the main workflow (spawn test cases in parallel, run baselines, grade, etc.) all works. (However, if you run into severe problems with timeouts, it's OK to run the test prompts in series rather than parallel.) - You don't have a browser or display, so when generating the eval viewer, use `--static ` to write a standalone HTML file instead of starting a server. Then proffer a link that the user can click to open the HTML in their browser. -- For whatever reason, the Cowork setup seems to disincline Claude from generating the eval viewer after running the tests, so just to reiterate: whether you're in Cowork or in Claude Code, after running tests, you should always generate the eval viewer for the human to look at examples before revising the skill yourself and trying to make corrections, using `generate_review.py` (not writing your own boutique html code). Sorry in advance but I'm gonna go all caps here: GENERATE THE EVAL VIEWER _BEFORE_ evaluating inputs yourself. You want to get them in front of the human ASAP! +- For whatever reason, the Cowork setup seems to disincline Claude from generating the eval viewer after running the tests, so just to reiterate: whether you're in Cowork or in Claude Code, after running tests, you should always generate the eval viewer for the human to look at examples before revising the skill yourself and trying to make corrections, using `generate_review.py` (not writing your own boutique html code). Sorry in advance but I'm gonna go all caps here: GENERATE THE EVAL VIEWER *BEFORE* evaluating inputs yourself. You want to get them in front of the human ASAP! - Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download `feedback.json` as a file. You can then read it from there (you may have to request access first). - Packaging works — `package_skill.py` just needs Python and a filesystem. - Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape. @@ -489,7 +465,6 @@ The agents/ directory contains instructions for specialized subagents. Read them - `agents/analyzer.md` — How to analyze why one version beat another The references/ directory has additional documentation: - - `references/schemas.md` — JSON structures for evals.json, grading.json, etc. --- diff --git a/.agents/skills/skill-creator/agents/analyzer.md b/.agents/skills/skill-creator/agents/analyzer.md index bd9e6d67c8..14e41d6068 100644 --- a/.agents/skills/skill-creator/agents/analyzer.md +++ b/.agents/skills/skill-creator/agents/analyzer.md @@ -49,7 +49,6 @@ You receive these parameters in your prompt: ### Step 4: Analyze Instruction Following For each transcript, evaluate: - - Did the agent follow the skill's explicit instructions? - Did the agent use the skill's provided tools/scripts? - Were there missed opportunities to leverage skill content? @@ -60,7 +59,6 @@ Score instruction following 1-10 and note specific issues. ### Step 5: Identify Winner Strengths Determine what made the winner better: - - Clearer instructions that led to better behavior? - Better scripts/tools that produced better output? - More comprehensive examples that guided edge cases? @@ -71,7 +69,6 @@ Be specific. Quote from skills/transcripts where relevant. ### Step 6: Identify Loser Weaknesses Determine what held the loser back: - - Ambiguous instructions that led to suboptimal choices? - Missing tools/scripts that forced workarounds? - Gaps in edge case coverage? @@ -80,7 +77,6 @@ Determine what held the loser back: ### Step 7: Generate Improvement Suggestions Based on the analysis, produce actionable suggestions for improving the loser skill: - - Specific instruction changes to make - Tools/scripts to add or modify - Examples to include @@ -117,7 +113,9 @@ Write a JSON file with this structure: "instruction_following": { "winner": { "score": 9, - "issues": ["Minor: skipped optional logging step"] + "issues": [ + "Minor: skipped optional logging step" + ] }, "loser": { "score": 6, @@ -169,14 +167,14 @@ Write a JSON file with this structure: Use these categories to organize improvement suggestions: -| Category | Description | -| ---------------- | ---------------------------------------------- | -| `instructions` | Changes to the skill's prose instructions | -| `tools` | Scripts, templates, or utilities to add/modify | -| `examples` | Example inputs/outputs to include | -| `error_handling` | Guidance for handling failures | -| `structure` | Reorganization of skill content | -| `references` | External docs or resources to add | +| Category | Description | +|----------|-------------| +| `instructions` | Changes to the skill's prose instructions | +| `tools` | Scripts, templates, or utilities to add/modify | +| `examples` | Example inputs/outputs to include | +| `error_handling` | Guidance for handling failures | +| `structure` | Reorganization of skill content | +| `references` | External docs or resources to add | ## Priority Levels @@ -213,7 +211,6 @@ You receive these parameters in your prompt: ### Step 2: Analyze Per-Assertion Patterns For each expectation across all runs: - - Does it **always pass** in both configurations? (may not differentiate skill value) - Does it **always fail** in both configurations? (may be broken or beyond capability) - Does it **always pass with skill but fail without**? (skill clearly adds value here) @@ -223,7 +220,6 @@ For each expectation across all runs: ### Step 3: Analyze Cross-Eval Patterns Look for patterns across evals: - - Are certain eval types consistently harder/easier? - Do some evals show high variance while others are stable? - Are there surprising results that contradict expectations? @@ -231,7 +227,6 @@ Look for patterns across evals: ### Step 4: Analyze Metrics Patterns Look at time_seconds, tokens, tool_calls: - - Does the skill significantly increase execution time? - Is there high variance in resource usage? - Are there outlier runs that skew the aggregates? @@ -239,13 +234,11 @@ Look at time_seconds, tokens, tool_calls: ### Step 5: Generate Notes Write freeform observations as a list of strings. Each note should: - - State a specific observation - Be grounded in the data (not speculation) - Help the user understand something the aggregate metrics don't show Examples: - - "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value" - "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure that may be flaky" - "Without-skill runs consistently fail on table extraction expectations (0% pass rate)" @@ -269,14 +262,12 @@ Save notes to `{output_path}` as a JSON array of strings: ## Guidelines **DO:** - - Report what you observe in the data - Be specific about which evals, expectations, or runs you're referring to - Note patterns that aggregate metrics would hide - Provide context that helps interpret the numbers **DO NOT:** - - Suggest improvements to the skill (that's for the improvement step, not benchmarking) - Make subjective quality judgments ("the output was good/bad") - Speculate about causes without evidence diff --git a/.agents/skills/skill-creator/agents/comparator.md b/.agents/skills/skill-creator/agents/comparator.md index 990f9960ec..80e00eb45d 100644 --- a/.agents/skills/skill-creator/agents/comparator.md +++ b/.agents/skills/skill-creator/agents/comparator.md @@ -53,7 +53,6 @@ Based on the task, generate a rubric with two dimensions: | Usability | Difficult to use | Usable with effort | Easy to use | Adapt criteria to the specific task. For example: - - PDF form → "Field alignment", "Text readability", "Data placement" - Document → "Section structure", "Heading hierarchy", "Paragraph flow" - Data output → "Schema correctness", "Data types", "Completeness" @@ -145,25 +144,25 @@ Write a JSON file with this structure: "A": { "passed": 4, "total": 5, - "pass_rate": 0.8, + "pass_rate": 0.80, "details": [ - { "text": "Output includes name", "passed": true }, - { "text": "Output includes date", "passed": true }, - { "text": "Format is PDF", "passed": true }, - { "text": "Contains signature", "passed": false }, - { "text": "Readable text", "passed": true } + {"text": "Output includes name", "passed": true}, + {"text": "Output includes date", "passed": true}, + {"text": "Format is PDF", "passed": true}, + {"text": "Contains signature", "passed": false}, + {"text": "Readable text", "passed": true} ] }, "B": { "passed": 3, "total": 5, - "pass_rate": 0.6, + "pass_rate": 0.60, "details": [ - { "text": "Output includes name", "passed": true }, - { "text": "Output includes date", "passed": false }, - { "text": "Format is PDF", "passed": true }, - { "text": "Contains signature", "passed": false }, - { "text": "Readable text", "passed": true } + {"text": "Output includes name", "passed": true}, + {"text": "Output includes date", "passed": false}, + {"text": "Format is PDF", "passed": true}, + {"text": "Contains signature", "passed": false}, + {"text": "Readable text", "passed": true} ] } } diff --git a/.agents/skills/skill-creator/agents/grader.md b/.agents/skills/skill-creator/agents/grader.md index ba7a31e57e..558ab05c0a 100644 --- a/.agents/skills/skill-creator/agents/grader.md +++ b/.agents/skills/skill-creator/agents/grader.md @@ -61,7 +61,6 @@ This catches issues that predefined expectations might miss. ### Step 5: Read User Notes If `{outputs_dir}/user_notes.md` exists: - 1. Read it and note any uncertainties or issues flagged by the executor 2. Include relevant concerns in the grading output 3. These may reveal problems even when expectations pass @@ -70,10 +69,9 @@ If `{outputs_dir}/user_notes.md` exists: After grading, consider whether the evals themselves could be improved. Only surface suggestions when there's a clear gap. -Good suggestions test meaningful outcomes — assertions that are hard to satisfy without actually doing the work correctly. Think about what makes an assertion _discriminating_: it passes when the skill genuinely succeeds and fails when it doesn't. +Good suggestions test meaningful outcomes — assertions that are hard to satisfy without actually doing the work correctly. Think about what makes an assertion *discriminating*: it passes when the skill genuinely succeeds and fails when it doesn't. Suggestions worth raising: - - An assertion that passed but would also pass for a clearly wrong output (e.g., checking filename existence but not file content) - An important outcome you observed — good or bad — that no assertion covers at all - An assertion that can't actually be verified from the available outputs @@ -87,13 +85,11 @@ Save results to `{outputs_dir}/../grading.json` (sibling to outputs_dir). ## Grading Criteria **PASS when**: - - The transcript or outputs clearly demonstrate the expectation is true - Specific evidence can be cited - The evidence reflects genuine substance, not just surface compliance (e.g., a file exists AND contains correct content, not just the right filename) **FAIL when**: - - No evidence found for the expectation - Evidence contradicts the expectation - The expectation cannot be verified from available information diff --git a/.agents/skills/skill-creator/assets/eval_review.html b/.agents/skills/skill-creator/assets/eval_review.html index 771a437d97..938ff32aed 100644 --- a/.agents/skills/skill-creator/assets/eval_review.html +++ b/.agents/skills/skill-creator/assets/eval_review.html @@ -1,226 +1,92 @@ - + - - - - Eval Set Review - __SKILL_NAME_PLACEHOLDER__ - - - - - - -

Eval Set Review: __SKILL_NAME_PLACEHOLDER__

-

- Current description: __SKILL_DESCRIPTION_PLACEHOLDER__ -

+ + + + Eval Set Review - __SKILL_NAME_PLACEHOLDER__ + + + + + + +

Eval Set Review: __SKILL_NAME_PLACEHOLDER__

+

Current description: __SKILL_DESCRIPTION_PLACEHOLDER__

-
- - -
+
+ + +
- - - - - - - - - -
QueryShould TriggerActions
+ + + + + + + + + +
QueryShould TriggerActions
-

+

- - + render(); + + diff --git a/.agents/skills/skill-creator/eval-viewer/viewer.html b/.agents/skills/skill-creator/eval-viewer/viewer.html index f3b7d9e2b4..6d8e96348a 100644 --- a/.agents/skills/skill-creator/eval-viewer/viewer.html +++ b/.agents/skills/skill-creator/eval-viewer/viewer.html @@ -1,1478 +1,1325 @@ - + - - - - Eval Review - - - - - + + +
+
+
+

Eval Review:

+
Review each output and leave feedback below. Navigate with arrow keys or buttons. When done, copy feedback and paste into Claude Code.
+
+
+
- /* ---- Benchmark view ---- */ - .benchmark-view { - padding: 1.5rem 2rem; - overflow-y: auto; - flex: 1; - } - .benchmark-table { - border-collapse: collapse; - background: var(--surface); - border: 1px solid var(--border); - border-radius: var(--radius); - font-size: 0.8125rem; - width: 100%; - margin-bottom: 1.5rem; - } - .benchmark-table th, - .benchmark-table td { - padding: 0.625rem 0.75rem; - text-align: left; - border: 1px solid var(--border); - } - .benchmark-table th { - font-family: 'Poppins', sans-serif; - background: var(--header-bg); - color: var(--header-text); - font-weight: 500; - font-size: 0.75rem; - text-transform: uppercase; - letter-spacing: 0.04em; - } - .benchmark-table tr:hover { - background: var(--bg); - } - .benchmark-table tr.benchmark-row-with { - background: rgba(33, 150, 243, 0.06); - } - .benchmark-table tr.benchmark-row-without { - background: rgba(255, 193, 7, 0.06); - } - .benchmark-table tr.benchmark-row-with:hover { - background: rgba(33, 150, 243, 0.12); - } - .benchmark-table tr.benchmark-row-without:hover { - background: rgba(255, 193, 7, 0.12); - } - .benchmark-table tr.benchmark-row-avg { - font-weight: 600; - border-top: 2px solid var(--border); - } - .benchmark-table tr.benchmark-row-avg.benchmark-row-with { - background: rgba(33, 150, 243, 0.12); - } - .benchmark-table tr.benchmark-row-avg.benchmark-row-without { - background: rgba(255, 193, 7, 0.12); - } - .benchmark-delta-positive { - color: var(--green); - font-weight: 600; - } - .benchmark-delta-negative { - color: var(--red); - font-weight: 600; - } - .benchmark-notes { - background: var(--surface); - border: 1px solid var(--border); - border-radius: var(--radius); - padding: 1rem; - } - .benchmark-notes h3 { - font-family: 'Poppins', sans-serif; - font-size: 0.875rem; - margin-bottom: 0.75rem; - } - .benchmark-notes ul { - list-style: disc; - padding-left: 1.25rem; - } - .benchmark-notes li { - font-size: 0.8125rem; - line-height: 1.6; - margin-bottom: 0.375rem; - } - .benchmark-empty { - color: var(--text-muted); - font-style: italic; - text-align: center; - padding: 3rem; - } + + - /* ---- Navigation ---- */ - .nav { - display: flex; - justify-content: space-between; - align-items: center; - padding: 1rem 2rem; - border-top: 1px solid var(--border); - background: var(--surface); - flex-shrink: 0; - } - .nav-btn { - font-family: 'Poppins', sans-serif; - padding: 0.5rem 1.25rem; - border: 1px solid var(--border); - border-radius: var(--radius); - background: var(--surface); - cursor: pointer; - font-size: 0.875rem; - font-weight: 500; - color: var(--text); - transition: all 0.15s; - } - .nav-btn:hover:not(:disabled) { - background: var(--bg); - border-color: var(--text-muted); - } - .nav-btn:disabled { - opacity: 0.4; - cursor: not-allowed; - } - .done-btn { - font-family: 'Poppins', sans-serif; - padding: 0.5rem 1.5rem; - border: 1px solid var(--border); - border-radius: var(--radius); - background: var(--surface); - color: var(--text); - cursor: pointer; - font-size: 0.875rem; - font-weight: 500; - transition: all 0.15s; - } - .done-btn:hover { - background: var(--bg); - border-color: var(--text-muted); - } - .done-btn.ready { - border: none; - background: var(--accent); - color: white; - font-weight: 600; - } - .done-btn.ready:hover { - background: var(--accent-hover); - } - /* ---- Done overlay ---- */ - .done-overlay { - display: none; - position: fixed; - inset: 0; - background: rgba(0, 0, 0, 0.5); - z-index: 100; - justify-content: center; - align-items: center; - } - .done-overlay.visible { - display: flex; - } - .done-card { - background: var(--surface); - border-radius: 12px; - padding: 2rem 3rem; - text-align: center; - box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3); - max-width: 500px; - } - .done-card h2 { - font-size: 1.5rem; - margin-bottom: 0.5rem; - } - .done-card p { - color: var(--text-muted); - margin-bottom: 1.5rem; - line-height: 1.5; - } - .done-card .btn-row { - display: flex; - gap: 0.5rem; - justify-content: center; - } - .done-card button { - padding: 0.5rem 1.25rem; - border: 1px solid var(--border); - border-radius: var(--radius); - background: var(--surface); - cursor: pointer; - font-size: 0.875rem; - } - .done-card button:hover { - background: var(--bg); - } - /* ---- Toast ---- */ - .toast { - position: fixed; - bottom: 5rem; - left: 50%; - transform: translateX(-50%); - background: var(--header-bg); - color: var(--header-text); - padding: 0.625rem 1.25rem; - border-radius: var(--radius); - font-size: 0.875rem; - opacity: 0; - transition: opacity 0.3s; - pointer-events: none; - z-index: 200; - } - .toast.visible { - opacity: 1; - } - - - -
-
-
-

Eval Review:

-
- Review each output and leave feedback below. Navigate with arrow keys or buttons. When - done, copy feedback and paste into Claude Code. -
+ +
+
+ +
+
Prompt
+
+
-
- -