Domain-agnostic research-coding workflow discipline for computational science (the JAX/Python research family — gravax, stellax, progenax, radax, …), packaged as a Claude Code plugin. The human is the scientist-in-the-loop, PI-level collaborator, and supervisor; the skills enforce evidence-first execution, structural correctness over compatibility, falsifiability, and reproducible artifacts. Domain specifics (e.g. MESA parity) live in thin lenses, so the stances stay sharp while the suite stays general.
| Phase | Skill |
|---|---|
| Collaborate | researcher-in-the-loop · high-impact-checkpoint |
| Ideate | research-ideation · research-brainstorming |
| Literature | prior-art-check · reading-notes-discipline · related-work-map |
| Scope | minimal-falsifiable-slice · discriminating-experiment-design · testing-strategist |
| Build correctly | ownership-and-structure · correct-cutover · numerical-precision · derivation-before-implementation · staleness-sweep · no-silent-except |
| Equation-critical sources | pdf-equation-extraction · equation-to-code-traceability · reference-license-firewall · equation-errata-ledger |
| Verify | evidence-first-execution · verification-gate · numerical-method-validation · gradient-validation · reference-parity-audit · adversarial-result-check · uncertainty-reporting-gate · plausibility-envelope · ai-self-distrust · seed-and-stochasticity · prior-sensitivity · systematic-error-hunting · figure-interpretation-guard · no-stub-when-done |
| Inference rigor | mcmc-convergence-gate · predictive-checks · model-selection-discipline |
| Review (audit written code/figures) | scientific-code-reviewer · numerical-methods-auditor · jax-code-validator · error-handling-reviewer · code-craft-reviewer · benchmark-generator · plot-faithfulness-inspector · plot-craft-reviewer |
| Performance & scale | profiling-discipline · scaling-validation · jax-performance · cluster-run-contract |
| Record | decision-log-and-commits · provenance-of-constants · experiment-tracking · data-provenance · data-io-validator · null-result-integrity · assumption-ledger · no-secrets-in-git |
| Communicate (docs & figures) | myst-expert · docs-writing-voice · myst-ci · interactive-figures · mystmd-plugin-dev · astro-plotting-craft · plot-design-inspector · publication-figure-validator |
| Reproduce & release | artifact-first-reproducibility · reproducible-environment-contract · software-citation · research-release-checklist · data-management-plan |
Each skill's description carries a "Don't use when… (→ sibling)" partition and a ## Related block, so the suite reads as one ordered protocol. reference-parity-audit loads a domain lens when one exists (lenses/mesa.md and lenses/nbody.md ship; lenses/rad-transfer.md is added on first need).
The Ideate and Literature clusters (v1.4.0) complete the front of the funnel the suite previously lacked: research-ideation (divergent — generate and triage directions) → research-brainstorming (convergent — sharpen one into a falsifiable hypothesis + discriminating observable) → prior-art-check (is it novel?) → discriminating-experiment-design → minimal-falsifiable-slice → Build. The Inference rigor cluster gates the inference itself for the NumPyro family — sampler convergence (R-hat/ESS/divergences), prior/posterior predictive fit, and honest out-of-sample model selection — distinct from the forward-numerics Verify cluster. Performance & scale covers measure-first profiling, strong/weak scaling, JAX compile-boundary performance, and the HPC job→artifact contract. Reproduce & release extends reproducibility to the citable public artifact (CITATION.cff/DOI, the figure→release trace, FAIR data management plans).
The Equation-critical sources cluster is for papers whose equations become code, tests, or benchmark fixtures. It keeps rendered-PDF verification, implementation traceability, reference-code licensing boundaries, and errata/conflict decisions separate on purpose. The equation-verifier agent is the adversarial row checker for promoting digest rows to verified.
The Review and Communicate clusters and several MyST references were consolidated in v1.2.0 from the former astro-code-review and myst plugins (now retired) — see the Status section. MyST authoring skills ship co-located references (myst-cheatsheet, math-and-gotchas, myst-projects-and-workflows, voice-fingerprint, page-anatomy) and the shippable mystmd-plugins/interactive.mjs directive bundle.
The skills document the discipline; eight path-/command-scoped, self-limiting hooks (hooks/hooks.json) enforce it. Each stays inert outside research code (e.g. during course work or quick edits) and fails open on any error, so it never blocks legitimate work.

| Hook | Event | Fires on | Action |
|---|---|---|---|
| deletion gate | PreToolUse(Bash) | rm / git rm / git clean / shred | asks for confirmation before a destructive op |
| no-secrets-in-git | PreToolUse(Bash) | git add/commit that names a credential file (.env, *.pem, …) or stages a secret signature (AWS/GitHub/Slack/Google token, PRIVATE KEY block, api_key=…) | asks before a secret enters git history |
| test-integrity | PreToolUse(Edit/Write) | edits to test_*.py / tests/** that loosen a tolerance, drop an assert, or add skip/xfail | asks before a test is weakened to pass |
| no-silent-except | PreToolUse(Edit/Write) | new Python that catches an exception and does nothing (bare except:, or except …: pass/…/continue) | asks before an error is silently swallowed |
| myst-docs-hygiene | PreToolUse(Edit/Write) | MyST docs (docs/**/*.md, myst.yml) with legacy Sphinx-MyST syntax ({toctree}/{eval-rst}/autodoc/RST), or a page missing the house-minimum title+description frontmatter | asks before legacy/incomplete MyST docs land (pairs with the myst@myst-dev plugin) |
| provenance | PreToolUse(Edit/Write) | uncited numeric literals in constants/calibration files, or references to external datasets/checkpoints (data-file URLs, data/raw/…) with no source/version/checksum | asks for a source (DOI/arXiv/Zenodo/checksum) |
| evidence-before-done | Stop (+ SubagentStop when RWF_SUBAGENT_EVIDENCE set) | a code/test/result/build claim ("fixed / passing / converged / built") with no fresh command output in the turn | blocks until the verification command + output are shown |
| no-stub-when-done | Stop (+ SubagentStop when RWF_SUBAGENT_EVIDENCE set) | a completion claim ("implemented / complete / ready") while an edit this turn left a stub in code (NotImplementedError, TODO/FIXME, placeholder body) | blocks until the stub is finished or the scope is restated |
| jq sanity check | SessionStart | jq not on PATH | warns that the gates are inactive (they need jq) |

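
The scoping and fail-open behavior described above can be illustrated with a minimal Python sketch of the deletion gate's decision step. This is not the plugin's implementation (the shipped hooks are shell scripts that rely on jq); the function name, the rule sets, and the payload shape (`tool_name`, `tool_input.command`) are assumptions made for illustration.

```python
import shlex

# Hypothetical rule sets mirroring the deletion gate's "Fires on" column.
DESTRUCTIVE_COMMANDS = {"rm", "shred"}
DESTRUCTIVE_GIT_SUBCOMMANDS = {"rm", "clean"}


def deletion_gate(event):
    """Return "ask" for a destructive Bash command, otherwise "allow".

    Assumes the hook payload is a dict with `tool_name` and
    `tool_input.command`; anything unexpected fails open to "allow".
    """
    try:
        if event["tool_name"] != "Bash":
            return "allow"  # scoped: inert for every other tool
        command = event["tool_input"]["command"]
        if not isinstance(command, str):
            return "allow"
        tokens = shlex.split(command)
    except (KeyError, TypeError, ValueError):
        # Fail open: a malformed payload or unparsable command never blocks work.
        return "allow"

    for i, token in enumerate(tokens):
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if token in DESTRUCTIVE_COMMANDS:
            return "ask"
        if token == "git" and next_token in DESTRUCTIVE_GIT_SUBCOMMANDS:
            return "ask"
    return "allow"
```

The match is token-based, so it deliberately errs toward asking (for example on `echo rm`) and would miss forms such as `a;rm x` where the command is not a separate token. A real gate would need a more careful parse.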
Hooks load at session start, so restart Claude Code after installing or updating the plugin to activate them. Smoke tests: bash hooks/tests/run_tests.sh.
Prerequisite: the hooks use jq. If jq is not on PATH, they fail open (no-op), so install it (e.g. brew install jq) for the gates to be active.
Debugging: the hooks are silent by default. To see when each fires and what it decided (allow:* / ask:* / block:*), set RWF_HOOK_DEBUG=1; entries append to $RWF_HOOK_LOG (default ${TMPDIR:-/tmp}/research-workflow-hooks.log). Example:
export RWF_HOOK_DEBUG=1
tail -f "${TMPDIR:-/tmp}/research-workflow-hooks.log"
# 2026-06-15T22:41:48 [evidence] block:claim-without-evidence — All tests pass.
# 2026-06-15T22:41:48 [deletion] ask:destructive — rm -rf x
# 2026-06-15T22:41:49 [skill] invoke:research-workflow:numerical-precision

The log also records skill invocations ([skill] invoke:<name>, via a PreToolUse(Skill) hook), so a week of RWF_HOOK_DEBUG data shows not just which gates fired but which of the 70 skills actually surface in real work: the missing signal for auditing the advisory layer. (Caveat: this captures only skills invoked through the Skill tool; guidance the model follows without an explicit invocation is not logged, so the count is a lower bound.)
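
One way to turn a week of log entries into the audit signal described above is a small tally script. The sketch below assumes each entry has the shape shown in the sample lines (timestamp, bracketed tag, then a `verdict:detail` token); the function name and the regular expression are illustrative and not part of the plugin.

```python
import re
from collections import Counter

# Assumed entry shape, inferred from the sample lines:
#   <timestamp> [<tag>] <decision> <free-text detail>
# An optional leading "# " is tolerated so pasted examples also parse.
LOG_LINE = re.compile(r"^(?:# )?\S+ \[(?P<tag>[^\]]+)\] (?P<decision>\S+)")


def tally_hook_log(lines):
    """Count gate verdicts per hook and invocations per skill."""
    gate_decisions = Counter()
    skill_invocations = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip anything that does not look like a log entry
        tag, decision = match["tag"], match["decision"]
        if tag == "skill" and decision.startswith("invoke:"):
            # Keep the fully qualified skill name after the "invoke:" prefix.
            skill_invocations[decision[len("invoke:"):]] += 1
        else:
            # Reduce e.g. "block:claim-without-evidence" to its verdict.
            verdict = decision.split(":", 1)[0]
            gate_decisions[(tag, verdict)] += 1
    return gate_decisions, skill_invocations
```

Passing it the lines of the file at $RWF_HOOK_LOG returns two counters. Skills that are missing from the second one were never invoked through the Skill tool during the logged period, subject to the lower-bound caveat above.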
Six slash commands give deliberate entry points (skills also auto-surface by description); each does more than restate a skill:
| Command | Does |
|---|---|
| /checkpoint [action] | Go/no-go before an expensive or irreversible run (high-impact-checkpoint). |
| /review [target] | Multi-lens scientific code/figure review of a changeset: the deterministic entry point for the Review cluster (correctness · numerics · JAX · robustness · craft · figures), producing a severity-tagged report. Beats hoping the review skills auto-surface. |
| /parity <ref> | Reference-parity audit vs. an external reference, loading the matching lens (mesa/nbody). |
| /reproduce | Capture a reproducibility contract: env lock, seeds, precision, input ids, commit. |
| /equation-digest <source> | Create or review an equation-critical digest from a PDF/source note, with rendered-PDF verification states, traceability, reference-license firewalling, and errata handling. In installed plugin form, Claude may expose it as /research-workflow:equation-digest. |
| /hooks-debug [status\|tail\|on\|off] | Inspect/enable the hook decision log (see Debugging above; enabling needs a settings.json env entry + restart). |

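
To make the /reproduce row concrete, the five fields it names (env lock, seeds, precision, input ids, commit) could be bundled into one JSON-serializable record. The sketch below is an assumption about what such a record might look like, not the format the command actually writes. The function name, the key names, and the choice of SHA-256 digests as input ids are all hypothetical.

```python
import hashlib
import json
import platform


def reproducibility_contract(*, commit, seeds, precision, inputs, env_lock):
    """Bundle the contract fields into one JSON-serializable record.

    `inputs` maps a label to the raw bytes of that input. Each input is
    recorded by SHA-256 digest, so the contract identifies the data
    without embedding it. `env_lock` is the text of a lock file.
    """
    return {
        "commit": commit,
        "seeds": list(seeds),
        "precision": precision,
        "input_ids": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(inputs.items())
        },
        "env_lock_sha256": hashlib.sha256(env_lock.encode("utf-8")).hexdigest(),
        "python": platform.python_version(),
    }


def dump_contract(contract):
    """Serialize with sorted keys so identical contracts diff as identical."""
    return json.dumps(contract, indent=2, sort_keys=True)
```

Hashing the lock file rather than storing a package list keeps the record small while still detecting any change to the environment specification.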

| Agent | Does |
|---|---|
| equation-verifier | Adversarially checks equation-digest rows against rendered PDFs or trusted publisher sources before rows are promoted to verified. |

This plugin is research-workflow; the dev marketplace (in .claude-plugin/marketplace.json) is research-workflow-dev. Public repo: https://github.com/drannarosen/research-workflow.
git clone https://github.com/drannarosen/research-workflow.git
# then, in Claude Code:
/plugin marketplace add ./research-workflow
/plugin install research-workflow@research-workflow-dev

Then restart Claude Code (hooks load at session start). The version is single-sourced in .claude-plugin/plugin.json; keep marketplace.json in sync.
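
The version-sync rule can be expressed as a small check over the two parsed manifests. The sketch below is illustrative and is not the logic in scripts/checks.sh. It assumes plugin.json has a top-level `version` key and that marketplace.json lists plugins under a `plugins` array whose entries carry `name` and `version`; verify those key names against the actual files before relying on it.

```python
def versions_in_sync(plugin_manifest, marketplace_manifest, plugin_name):
    """True when the marketplace entry for `plugin_name` matches plugin.json.

    Both arguments are already-parsed JSON objects (dicts). The key names
    ("version", "plugins", "name") are assumptions about the manifest layout.
    """
    source_version = plugin_manifest.get("version")
    if source_version is None:
        return False  # the single source of truth must declare a version
    for entry in marketplace_manifest.get("plugins", []):
        if entry.get("name") == plugin_name:
            return entry.get("version") == source_version
    return False  # no matching entry counts as out of sync
```

Taking parsed dicts rather than file paths keeps the comparison testable on its own. A caller would load both files with `json.loads` and pass `"research-workflow"` as the plugin name.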
CI (.github/workflows/ci.yml) runs on every push / PR: shellcheck, the consistency checks, and the hook smoke tests. Run the same locally before committing:
bash scripts/checks.sh # version sync (plugin.json == marketplace.json) + skill/command/agent/hook/lens lint
bash hooks/tests/run_tests.sh # hook smoke tests (60 cases)

Consolidated 2026-05-30 from a former 15-skill scientific-workflow plugin: the ownership cluster merged → ownership-and-structure + correct-cutover; the MESA pair → reference-parity-audit + lenses/mesa.md; decision + commit → decision-log-and-commits; the rest were renamed and de-stellarified into a domain-agnostic numerical-research substrate.
v1.1.0 added the gradient-validation skill (finite-difference grad-checks, NaN/zero-gradient traps) and the first four enforcement hooks, and refined every skill — sharper "Use when…" descriptions with sibling disambiguation, concrete computational-astrophysics worked examples, dedupe-by-pointer cross-references, and explicit hard-vs-adaptable stances.
v1.1.x then grew the suite to 32 skills and added an epistemic-integrity set (derivation-before-implementation, plausibility-envelope, ai-self-distrust, null-result-integrity, the inference-robustness trio) plus four more deterministic gates (no-silent-except, no-secrets-in-git, no-stub-when-done, myst-docs-hygiene).
v1.2.0 consolidates this into one comprehensive research plugin: the former astro-code-review plugin (11 of its 12 skills — reproducibility-auditor dropped as a duplicate of reproducible-environment-contract / artifact-first-reproducibility) and the former myst plugin (5 skills + the interactive.mjs directive bundle) were migrated in, adding the Review (computational-physics code/figure review) and Communicate (MyST docs + figure design/publication) clusters → 48 skills, eight enforcement hooks. Both source plugins are retired (disabled, marked deprecated). MyST skills were re-scoped to research docs (teaching moved to the sophie platform).
v1.3.0 adds the equation-critical source layer: pdf-equation-extraction, equation-to-code-traceability, reference-license-firewall, equation-errata-ledger, the /equation-digest command, and the equation-verifier agent. This is additive to the research workflow rather than a refactor: ordinary source ingest stays lightweight, while implementation-critical equations now have rendered-PDF verification, traceability, firewall, and errata gates.
v1.3.1 hardens the plugin after adversarial review: Task delegation no longer counts as verification by itself, completion claims scan final touched code files for stubs, the shipped interactive.mjs escapes JSON/JS values safely, CI runs official Claude plugin validation, and scripts/checks.sh enforces the skill graph promises.
v1.4.0 extends the suite from 52 to 65 skills across four new clusters, completing the research lifecycle end-to-end: a true front-of-funnel (Ideate — research-ideation, research-brainstorming; Literature — prior-art-check), Inference rigor for the Bayesian/NumPyro family (mcmc-convergence-gate, predictive-checks, model-selection-discipline), Performance & scale for HPC work (profiling-discipline, scaling-validation, jax-performance, cluster-run-contract), and a Reproduce & release tail for citable artifacts (software-citation, research-release-checklist, data-management-plan). Additive — no hooks added, no skills renamed; sibling plugins keep their boundaries (manuscripts → manuscript-workflow, grants → grant-writing).
v1.4.1 completes the Literature cluster deferred in v1.4.0: reading-notes-discipline (per-paper claim/evidence/caveat intake) and related-work-map (durable cross-paper field map) — 67 skills. The boundary against manuscript-workflow:lit-scan (per-manuscript citation completeness) was confirmed clean before building: reading-notes is per-paper intake, related-work-map is cross-paper synthesis you maintain, prior-art-check is the one-time novelty gate, lit-scan is late-stage citation completeness in a different plugin.
v1.5.0 adds the figure craft & interpretation layer (70 skills), completing the figure lifecycle from authoring to reading: astro-plotting-craft (author publication-grade plots in the house style — the jaxstroviz theme/figure helpers as source of truth, seaborn perceptually-uniform colormaps, CVD-safe color discipline, log/linear and LaTeX-not-unicode rules), plot-craft-reviewer (audit existing plot code for craft defects — wrong axis scale, unicode-vs-LaTeX, mathtext syntax errors, overlays, off-brand/unsafe colormaps), and figure-interpretation-guard (what a finished figure lets you conclude — over-reading, visual traps, reproducing/comparing paper figures, AI-misread plots). These sit beside the existing design/faithfulness/publication trio (ADR-0009) without overlap; the house style matches the jaxstroviz package as-is (its themes/architecture are near-SoTA), with the colorblind-safety, color×marker, reproducible-font, and perceptually-uniform-colormap upgrade targets named in astro-plotting-craft/references/house-style.md.