docs(fix-flaky): de-overfit diagnosis playbook to generic classes #58
Merged
d45f8e1 to 3a6b1bb
The step-4 table cells were reverse-engineered from semaphore's top
flakes (count_children → %{active:N,workers:N}, `in [:lt,:eq]`), so they
read as answer-keys a plain run parrots on unrelated repos.
- reframe step 4: pull the real `flaky failure` first; the table names
the class + fix direction only, never the fix itself
- genericize the cells (drop semaphore-specific shapes / Elixir-only naming)
Diagnosis now driven by real failure evidence — still nails semaphore,
no longer canned on other projects.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3a6b1bb to f899a25
loadez approved these changes on Jun 17, 2026
Why
The step-4 diagnosis table in the `fix-flaky` skill had cells reverse-engineered from a few specific tests (e.g. `count_children` → `%{active: N, workers: N}`, `compare == :lt` → fix `in [:lt, :eq]`). On those exact tests the cells are a perfect match; on any other repo they read as answer keys, and a clean run tends to parrot the canned fix instead of diagnosing the test in front of it.
What
Reframe step 4: pull the real `flaky failure` first. The table now names the class of nondeterminism and the fix direction only, never the fix itself.
Net: diagnosis is driven by the real failure evidence, so the skill still nails the cases it used to without going canned on unfamiliar repos.
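To make the `compare == :lt` example concrete, here is a minimal Python sketch of the class of nondeterminism that cell appears to belong to: a strict-ordering assertion on two timestamps that can tie. The `compare` and `truncated` helpers and the whole-second truncation are hypothetical, chosen only for illustration; the actual cause of the tie in the original test is not stated here.

```python
from datetime import datetime, timedelta


def compare(a, b):
    """Three-way comparison, loosely mirroring an :lt / :eq / :gt result."""
    if a < b:
        return "lt"
    if a > b:
        return "gt"
    return "eq"


def truncated(ts):
    """Hypothetical clock or storage layer that keeps whole seconds only."""
    return ts.replace(microsecond=0)


base = datetime(2026, 1, 1, 12, 0, 0)
created = truncated(base + timedelta(milliseconds=100))
updated = truncated(base + timedelta(milliseconds=900))   # same second as created
later = truncated(base + timedelta(milliseconds=1500))    # next second

tie_result = compare(created, updated)
ordered_result = compare(created, later)

# A strict assertion passes only when the two readings happen to differ.
strict_passes_on_tie = tie_result == "lt"
# Allowing equality removes the dependence on timing.
tolerant_passes_on_tie = tie_result in ("lt", "eq")
```

Whether the strict check passes depends on how far apart the two readings land, which is why a canned "allow equality" answer fits the test it was mined from but says nothing about an unrelated failure.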
Validation
Ran the skill from a clean Claude instance (only sem-ai installed) against the hardest case: the exact test one of the old answer-key cells was mined from. It led with `flaky failure`, diagnosed from the actual code (a shared named supervisor with transient restart-loop workers leaking across tests), and wrote a one-place `setup` drain using the manager's real `children/0` + `finish_schedule_task/1`. No parroting, and it never needed a worked-example crutch.
No version bump (bundled at release).
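The leak-and-drain pattern described under Validation can be sketched generically. This is a rough Python analogue, not the fix the skill wrote: `TaskManager`, `start_task`, `finish_task`, and `drain` are hypothetical names standing in for a shared, named process manager and its own listing and finishing calls.

```python
class TaskManager:
    """Hypothetical stand-in for a shared, named process manager."""

    def __init__(self):
        self._children = {}
        self._next_id = 0

    def start_task(self, name):
        self._next_id += 1
        self._children[self._next_id] = name
        return self._next_id

    def children(self):
        # Return a copy of the ids so callers can finish tasks while iterating.
        return list(self._children)

    def finish_task(self, task_id):
        self._children.pop(task_id, None)


MANAGER = TaskManager()  # module-level, so every test shares it


def drain(manager):
    """Setup step: finish whatever earlier tests left running."""
    for task_id in manager.children():
        manager.finish_task(task_id)


def count_after_start(setup=None):
    """Body of a test that expects exactly one child after starting one."""
    if setup is not None:
        setup(MANAGER)
    MANAGER.start_task("job")
    return len(MANAGER.children())


# Simulate a worker leaked by an earlier test.
MANAGER.start_task("leftover from an earlier test")

leaky_count = count_after_start()             # sees the leftover too
clean_count = count_after_start(setup=drain)  # drained first
```

Keeping the drain in one setup step, built on the manager's own calls, means individual tests do not need to know what ran before them.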