feat(skills): add Dynamo lifecycle agent skill set under .agents/skills/#9847
Draft
dagil-nvidia wants to merge 7 commits into
Add seven agentskills.io-compatible skills covering the Dynamo user journey end-to-end: plan, optimize, serve (local), deploy, frontend (request path), troubleshoot (day-2), benchmark.

Each skill ships a SKILL.md with a strict 4-phase workflow, DESTRUCTIVE/MUTATING/SAFE command tables, Human-in-the-Loop decision points, plus references/ (annotated field references, deployment patterns, known-issue signatures) and scripts/ (shellcheck-clean pre-/post-operation validators using pass/fail/warn or check() helpers).

The .agents/skills/README.md documents the methodology: heavy attribution to NVIDIA's internal ai-infra-agent repository for the structural conventions (4-phase workflow, command-tier rubric, HITL contract, script patterns, references+scripts layout), plus ten documented extensions (architectural survey, citation manifest, master-derivative publication pipeline, lifecycle decomposition, per-release skill versioning, NV-ACES standard-headers extension, run_script protocol surfacing, phase-shape variance for day-2, survey-driven feature-naming check, XML-tag-in-YAML pitfall).

NV-ACES Tier 1 deterministic scoring (2026-05-21): average 92.1/100, lowest 90. All grades A- or A.

.claude/skills/ remains a symlink to .agents/skills/ in this repo, so Claude Code sees the new content under its native path.

The companion NVIDIA/skills catalog PR (one-line components.d/dynamo.yml registry entry) is submitted separately, gated on this PR merging.

Signed-off-by: Dan Gil <dagil@nvidia.com>
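The pass/fail/warn and check() helper pattern named in this commit can be illustrated with a small Python analogue. The shipped validators are shell scripts, so this is only a sketch: the class name `Tally`, its method names, and the exact output format are assumptions made for illustration, not the scripts' real interface.

```python
from dataclasses import dataclass, field


@dataclass
class Tally:
    """Hypothetical counter mirroring a pass/fail/warn helper trio."""

    passed: int = 0
    failed: int = 0
    warned: int = 0
    lines: list = field(default_factory=list)

    def _emit(self, level: str, message: str) -> None:
        # One structured line per result, easy for an agent to parse.
        line = f"{level}: {message}"
        self.lines.append(line)
        print(line)

    def ok(self, message: str) -> None:
        # Named ok() because "pass" is a reserved word in Python.
        self.passed += 1
        self._emit("PASS", message)

    def fail(self, message: str) -> None:
        self.failed += 1
        self._emit("FAIL", message)

    def warn(self, message: str) -> None:
        self.warned += 1
        self._emit("WARN", message)

    def check(self, condition: bool, message: str) -> bool:
        """Record PASS when the condition holds, FAIL otherwise."""
        (self.ok if condition else self.fail)(message)
        return condition

    def summary(self) -> str:
        return f"{self.passed} passed / {self.failed} failed / {self.warned} warned"

    def exit_code(self) -> int:
        # Assumption: warnings do not fail the run; any FAIL does.
        return 1 if self.failed else 0
```

A validator built this way would call `check()` once per condition and finish by printing `summary()` and exiting with `exit_code()`.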
Hard-wrapped lines render with awkward line breaks in some GitHub viewports. Unwrap prose paragraphs while preserving headings, tables, code blocks, and lists. No content changes.

Signed-off-by: Dan Gil <dagil@nvidia.com>
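The unwrapping described in this commit can be sketched as a single pass that joins consecutive prose lines and leaves structural lines alone. This is a minimal illustration, not the tool actually used for the commit: the function name `unwrap` and the set of patterns treated as structural are assumptions, and only backtick fences are recognized.

```python
import re

FENCE = "`" * 3  # backtick fence marker; tilde fences are not handled here

# Lines that must never be merged into a neighbouring paragraph: headings,
# table rows, list items, block quotes, indented lines, and rules or
# setext underlines.
_STRUCTURAL = re.compile(
    r"(#{1,6}\s|\||[-*+]\s|\d+[.)]\s|>|\s|(-{3,}|={3,}|\*{3,})\s*$)"
)


def unwrap(markdown: str) -> str:
    """Join hard-wrapped prose lines into single-line paragraphs."""
    out = []
    in_fence = False
    prev_is_prose = False
    for line in markdown.split("\n"):
        if line.lstrip().startswith(FENCE):
            # Fence lines toggle code-block state and are copied verbatim.
            in_fence = not in_fence
            out.append(line)
            prev_is_prose = False
            continue
        is_prose = (
            not in_fence
            and line.strip() != ""
            and _STRUCTURAL.match(line) is None
        )
        if is_prose and prev_is_prose:
            # Continuation of the current paragraph: merge with one space.
            out[-1] = out[-1].rstrip() + " " + line.strip()
        else:
            out.append(line)
        prev_is_prose = is_prose
    return "\n".join(out)
```

Treating every indented line as structural is deliberately conservative: wrapped continuation lines inside list items stay untouched rather than risk corrupting a nested block.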
- Mention Claude Code / Cursor / Codex by name in the intro and clients sections; do not link to their docs.
- Strip NVIDIA-internal documentation URLs (Confluence, gitlab-master); the public PR should not reference SSO-walled resources that external reviewers can't access.

No substantive content changes.

Signed-off-by: Dan Gil <dagil@nvidia.com>
…ze to placeholder

Adds .agents/skills/dynamo-skill-author/, a meta-skill that teaches an agent (or a human reader) how to author additional Dynamo agent skills with the same rigor floor as the seven lifecycle skills already in this directory.

Structure mirrors the other skills:

- SKILL.md (~430 lines) with a Gather -> Scaffold -> Author -> Validate 4-phase workflow, DESTRUCTIVE / MUTATING / SAFE command tables, decision points, refusal conditions, and standard NV-ACES headers.
- references/frontmatter-shape.md spec for the YAML frontmatter (mechanically enforced by check-frontmatter.sh).
- references/body-shape.md spec for the SKILL.md body, command-safety rubric, per-phase section shape, script patterns, references conventions.
- references/known-issues.md cataloguing the pitfalls observed during the initial seven-skill authoring pass (XML-tag pitfall, non-existent feature claim, NV-ACES sub-90 lift, shellcheck warnings).
- scripts/scaffold-skill.sh to mkdir a new skill directory from a sibling with frontmatter reset and placeholders inserted (idempotent).
- scripts/validate-skill.sh to run shellcheck, frontmatter parse, cross-link audit, and length budget against an authored skill.
- scripts/check-frontmatter.sh, a shell wrapper around an inline python3 PyYAML parse for strict frontmatter validation. The shell-wrapper shape keeps the validator out of the repository's Python formatter scope while preserving the parse rigor.

The skill eats its own dogfood: validate-skill.sh against itself reports 9 passed / 0 failed / 0 warned, and the validator runs clean against the seven existing lifecycle skills as well.

Two follow-on changes are bundled:

- dynamo-optimize is reset to a placeholder (SKILL.md replaced; references and scripts removed). The recipe-runner workflow proposed under .agents/skills/dynamo-recipe-runner/ in a parallel in-flight PR will be brought through dynamo-skill-author's rigor floor and landed here in a follow-up commit on this branch.
- dynamo-troubleshoot/SKILL.md has an internal master-path leak fixed: '~/dynamo-skills/ALL_Skills/dynamo-deploy/scripts/verify-platform.sh' becomes the public-repo relative path '../dynamo-deploy/scripts/verify-platform.sh'. Caught by the new validate-skill.sh cross-link audit on its first run against the existing skills.

README catalog updated to list dynamo-skill-author and to mark dynamo-optimize as placeholder; NV-ACES paragraph adjusted to score the six lifecycle skills with shipping content (avg 91.7 / 100; lowest 90).

Signed-off-by: Dan Gil <dagil@nvidia.com>
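The kind of cross-link audit that caught the master-path leak can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the logic of validate-skill.sh itself: the function name `audit_links`, both regular expressions, and the finding strings are hypothetical.

```python
import re
from pathlib import Path

# Relative links into references/ or scripts/, optionally via a sibling skill,
# e.g. "references/known-issues.md" or "../dynamo-deploy/scripts/x.sh".
# The lookbehind skips fragments that sit inside a longer path.
_LINK = re.compile(
    r"(?<![\w/.~-])((?:\.\./[\w.-]+/)?(?:references|scripts)/[\w./-]+\.(?:md|sh|py))"
)
# Paths that only resolve on the author's machine (assumed leak signatures).
_LEAK = re.compile(r"(?:~/|/home/|/Users/)[\w./-]+")


def audit_links(skill_dir: Path) -> list[str]:
    """Return FAIL findings for SKILL.md links that will not resolve."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    findings = []
    for match in _LEAK.finditer(text):
        findings.append(f"FAIL: non-portable path {match.group(0)}")
    for rel in sorted(set(_LINK.findall(text))):
        # Resolve relative to the skill directory, as a reader of SKILL.md would.
        if not (skill_dir / rel).exists():
            findings.append(f"FAIL: unresolved link {rel}")
    return findings
```

An empty result means every detected link resolves on disk and no home-directory path leaked into the public file.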
…wrapper)

The previous commit (3a2152d) shipped check-frontmatter as a shell wrapper around inline python3 to dodge a pre-commit failure: black 23.1.0 crashes on Python 3.14 with 'module ast has no attribute Str' because ast.Str, deprecated since 3.8, was removed in 3.14. The shell wrapper was a workaround for an 80-line validator and never appropriate for the real Python tools that follow in later commits.

With the local pre-commit cache patched to black 23.12.1 (cache path ~/.cache/pre-commit/reporutdbgwp/py_env-python3.14/bin/black; supports Python 3.14; accepts the .py file unchanged), this commit restores the canonical Python form:

- Re-creates scripts/check-frontmatter.py with the same parse rigor as the shell wrapper (pyyaml frontmatter parse + 11 named checks: required fields, name prefix, name-matches-dir, description type / length / XML-tag absence, version format, author, tags presence and count, tools non-empty).
- Deletes scripts/check-frontmatter.sh.
- Repoints validate-skill.sh to invoke 'python3 check-frontmatter.py'.
- Repoints all references in SKILL.md, references/frontmatter-shape.md, and references/known-issues.md back to the .py form.

Other contributors on Python 3.14+ need the same local cache patch:

    BLACK_VENV=~/.cache/pre-commit/<black-repo-dir>/py_env-python3.14
    "$BLACK_VENV/bin/pip" install --upgrade 'black==23.12.1'

No project-config changes. The repo's pinned black version stays 23.1.0; this commit leaves the project policy untouched and addresses the local environment mismatch only.

Validators all clean:

    shellcheck .agents/skills/*/scripts/*.sh -> clean
    python3 .../check-frontmatter.py .../SKILL.md -> 11/0
    bash .../validate-skill.sh -d .agents/skills/dynamo-skill-author -> 8/0/0
    black --check .../check-frontmatter.py (23.12.1) -> unchanged

Signed-off-by: Dan Gil <dagil@nvidia.com>
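A few of the named frontmatter checks can be illustrated with a short sketch. It is not the contents of check-frontmatter.py: the real validator parses with PyYAML, whereas this version substitutes a deliberately naive `key: value` line parser so it runs on the standard library alone. The function names, the `dynamo-` prefix rule, and the failure messages are assumptions for illustration.

```python
import re

REQUIRED = ("name", "description", "version", "author", "tags", "tools")


def parse_frontmatter(text: str) -> dict:
    """Naive parser for top-level 'key: value' lines between two '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        # Only unindented keys count; nested YAML is out of scope here.
        if sep and key.strip() and not line[0].isspace():
            fields[key.strip()] = value.strip()
    raise ValueError("missing closing '---'")


def check_frontmatter(text: str, dir_name: str) -> list[str]:
    """Return a list of failure messages; an empty list means all checks pass."""
    fields = parse_frontmatter(text)
    failures = [f"missing required field: {k}" for k in REQUIRED if not fields.get(k)]
    name = fields.get("name", "")
    if not name.startswith("dynamo-"):  # assumed prefix rule
        failures.append("name lacks the 'dynamo-' prefix")
    if name != dir_name:
        failures.append("name does not match directory")
    # Placeholders such as <backend> look like XML tags to a strict parser.
    if re.search(r"<[A-Za-z][^>]*>", fields.get("description", "")):
        failures.append("description contains an XML-like tag")
    if not re.fullmatch(r"\d+\.\d+\.\d+", fields.get("version", "")):
        failures.append("version is not MAJOR.MINOR.PATCH")
    return failures
```

Returning messages instead of raising lets a caller report every failed check in one run, which matches the passed/failed tally style used by the validators.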
…content Signed-off-by: Dan Gil <dagil@nvidia.com>
Adds the blog-figures skill for Dynamo blog post figure creation. The skill picks the right pathway (Python+Plotly, D2, hand-crafted SVG, dynamo-svg, or HTML+CSS to PNG) for each figure task, enforces palette and typography from canonical design tokens, and runs a mandatory render-and-critique loop. DESIGN.md is the cite-able source of truth for the two type families (flash-indexer for compact data, Digital Twin / DynoSim for hero and section anchors), palette, layout conventions, and forbidden anti-patterns. Companion files cover diagram aesthetic, chart craft, and HTML-to-PNG authoring recipes. Signed-off-by: Dan Gil <dagil@nvidia.com>
mosheabr pushed a commit to NVIDIA/skills that referenced this pull request May 29, 2026
Register NVIDIA Dynamo with the catalog. Seven agent skills (Plan, Optimize, Serve, Deploy, Frontend, Troubleshoot, Benchmark) covering the full Dynamo lifecycle are maintained at ai-dynamo/dynamo under .agents/skills/ and will sync to this catalog daily once ai-dynamo/dynamo#9847 (the upstream PR landing the skills) merges. NVIDIA Dynamo is a distributed LLM inference framework. The skill content follows conventions inherited from NVIDIA's internal ai-infra-agent repository; NV-ACES Tier 1 deterministic scoring averages 92.1/100 across the seven skills, lowest 90. Submitted as draft pending the upstream PR merging. Signed-off-by: Dan Gil <dagil@nvidia.com>
Overview:
Adds seven NVIDIA Dynamo agent skills under .agents/skills/, plus a README documenting the methodology, attribution to ai-infra-agent, and the per-release update model.

This contribution is driven by Dynamo's own need to ship agent skills for its users. A complete lifecycle skill set lets agentskills.io-compatible clients like Claude Code, Cursor, and Codex guide a user through the Dynamo workflow end-to-end. The public NVIDIA/skills catalog will mirror these skills automatically via a companion components.d/dynamo.yml registration (NVIDIA/skills#74, draft, gated on this PR merging).

Details:
Motivation
Dynamo's surface area has grown to the point where new users (and seasoned users picking up a new workflow) benefit from agent-guided onboarding rather than reading the docs end-to-end. Seven coherent skills cover the user journey:
- … python3 -m dynamo.<backend> for iteration.
- … dynamo-platform Helm chart, DGD/DGDR, or recipes.
- … DynamoModel CR, multi-model serving, and gateway integration.
- … benchmarks/ suites.

Each skill is invocable independently. The "cross-cutting integrations stay integrated, not standalone" principle (NIXL, Grove, KAI, GAIE, model caching, Snapshot, observability) avoids forcing users to chain across skills for one workflow.
What's Included
Seven skills (NV-ACES Tier 1 deterministic scoring, run 2026-05-21):
- dynamo-plan: searchStrategy, SLA framing, recipe selection
- dynamo-optimize: … modelopt) quantization: FP8 / NVFP4 / INT8 / AWQ
- dynamo-serve: python3 -m dynamo.<backend> workstation workflow; per-backend flag matrix
- dynamo-deploy: dynamo-platform Helm, DGD + DGDR authoring, recipes, conversion webhooks, day-2 ops
- dynamo-frontend: DynamoModel CR, multi-model, GAIE / kgateway / Istio
- dynamo-troubleshoot: …
- dynamo-benchmark: benchmarks/ suites, recipe-attached benchmarks

Average score 92.1 / 100. Lowest 90. All A- or A. Zero errors. 1-2 warnings per skill.
File structure:
Totals: 38 files, ~7,500 lines (skills only — does not include the methodology docs which stay in an internal repo).
Each skill ships:

- SKILL.md with a strict 4-phase workflow, DESTRUCTIVE / MUTATING / SAFE command tables, Human-in-the-Loop decision points, refusal conditions, cross-skill references, observability hooks, and an explicit ## Workflow / ## Available Scripts / ## Prerequisites / ## Limitations / ## Troubleshooting section set.
- references/ subdirectory: annotated field references, deployment patterns, known-issue signatures, decision matrices.
- scripts/ subdirectory: pre-/post-operation validators. All scripts default to non-destructive modes (kubectl apply --dry-run=server, port-forward-based probes) and emit structured PASS / FAIL / WARN output for agent consumption.

.agents/skills/ vs .claude/skills/: this PR lands new content only under .agents/skills/, the cross-client interop convention from agentskills.io. This repository already has .claude/skills/ as a symlink to .agents/skills/, so Claude Code sees the new skills under its native path; clients on the cross-client convention (e.g., Cursor) walk .agents/skills/ directly. Existing .claude/skills/ skills (debug-session, dep-status, gh-issue-bug, etc.) are unchanged.

Methodology Attribution
The structural conventions (4-phase workflow shape, DESTRUCTIVE / MUTATING / SAFE command-tier rubric, Human-in-the-Loop behavioral contract, pass / fail / warn and check() script helper patterns, strict 6-element known-issues entry format, frontmatter shape, progressive disclosure block, references/ + scripts/ subdirectory layout, NV-ACES evaluation infrastructure) were developed and battle-tested by the team behind NVIDIA's internal ai-infra-agent project. Several exemplar skills there established the patterns this skill set inherits:

- nvidia-inference-stack: 4-phase workflow, pass / fail / warn script helper, check() verification function, known-issues 6-element format, frontmatter shape.
- nvidia-inference-ra-orchestrator: Human-in-the-Loop behavioral contract (present, then wait; no soft-language interpretation).
- gpu-operator: DESTRUCTIVE / MUTATING / SAFE command-tier rubric with the explicit "you are responsible for the outcome" prompt.
- KAI: minimal-frontmatter exemplar; directory-casing vs name-field-casing convention.

0% of the content here is copy-pasted; ~100% of the structural rigor is inherited with explicit row-ID citations to specific ai-infra-agent file:line locations. The Dynamo work is a domain-specific application of an existing methodology, not a new convention. Where this skill set extends the methodology, the extensions are upstreamable to ai-infra-agent for the same lift on its skills.

Improvements on Top of ai-infra-agent
Ten extensions worth flagging; each is a candidate for upstream adoption. Full discussion in the README. Summary:

1. Architectural survey: …
2. Citation manifest: …
3. Master-derivative publication pipeline: … scripts/derive-public.sh (shellcheck-clean) mechanically derives public artifacts from a rigorous internal master; emits MANUAL_REVIEW markers for editorial gaps it can't resolve mechanically.
4. Lifecycle decomposition: … ai-infra-agent's nvidia-inference-ra-orchestrator top-down routing).
5. Per-release skill versioning: … version field tied to the Dynamo release line (1.2.0, 1.3.0, ...), not arbitrary semver; codified refresh workflow.
6. NV-ACES standard-headers extension: … ## Workflow / ## Available Scripts / ## Prerequisites / ## Limitations / ## Troubleshooting lifted average NV-ACES Tier 1 from 76.3 → 92.1 (Grade C → A-). Non-invasive patch; existing content stays.
7. run_script() protocol surfacing: pairs human-facing bash scripts/x.sh with agent-facing run_script("scripts/x.sh", args=[...]).
8. Phase-shape variance for day-2: … dynamo-troubleshoot substitutes Triage → Inspect → Diagnose → Remediate while keeping the four-phase contract. Extension documented in the authoring guide so other day-2 skills can follow.
9. Survey-driven feature-naming check: … dynamo-run early: that CLI does not exist in the Dynamo source. The principle generalizes: skills cite tools against a verified inventory, not training-data memory.
10. XML-tag-in-YAML pitfall: … <backend>, <n>) inside YAML folded-scalar description fields get parsed as XML by NV-ACES and fail. Documented and avoided.

Patches 6 and 7 are particularly upstreamable to ai-infra-agent: a 15+ point NV-ACES lift on its existing skills with zero content rewrites.

Quality and Testing
NV-ACES Tier 1 (deterministic scoring, NVIDIA SSO): ran astra-skill-eval evaluate ALL_Skills/<skill> --static against each skill. All 7 skills above the min_score: 70 threshold. Average 92.1. Zero errors across all skills. Detailed dimensions per skill:

shellcheck: all 9 scripts (7 per-skill plus 2 in dynamo-deploy) shellcheck-clean, with no warnings and no errors. Verified 2026-05-21.

YAML frontmatter: all 7 SKILL.md frontmatters parse with pyyaml. Required fields (name, description, version, author, tags, tools) present on every skill. No XML tag findings.

Cross-link audit: every references/*.md and scripts/*.sh link in every SKILL.md resolves on disk. 29 unique row IDs cited across all 7 skills ([A6], [A9], [A11], [C1]-[C12], [D1]-[D6], [F1]-[F7]); all resolve in the internal citation manifest.

Pre-MR checklist (codified in the internal authoring guide): every load-bearing claim cites a VERIFIED row; every cluster-mutating command appears in a DESTRUCTIVE or MUTATING table; every script defaults to a non-destructive mode (kubectl apply --dry-run=server); every script emits PASS / FAIL / WARN structured output.

Per-Release Update Model
- … citations.md Checked column
- … NVIDIA/skills catalog mirror
- … .agents/skills/

Each release cycle, refreshing all seven skills is ~30 minutes of mechanical work: rerun the verification commands documented in the authoring guide, update pin tables, refresh known-issues against the active QA tracker, bump the version field, re-run astra-skill-eval. The catalog sync handles the public-mirror update without human intervention.

Full refresh procedure in the README.
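The "bump the version field" step of the refresh can be sketched as a small text rewrite. This is a hedged illustration rather than the procedure documented in the README: the function name `bump_version`, the assumption that the field appears as a single `version: X.Y.Z` line, and the refusal to move backwards are all choices made for this example.

```python
import re

# Matches a whole 'version: X.Y.Z' line without consuming the newline.
_VERSION_LINE = re.compile(r"^version:[ \t]*(\d+)\.(\d+)\.(\d+)[ \t]*$", re.MULTILINE)


def bump_version(skill_md: str, release: str) -> str:
    """Rewrite the first 'version:' line to the given Dynamo release."""
    parts = release.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"release must look like 1.3.0, got {release!r}")
    new = tuple(int(p) for p in parts)
    match = _VERSION_LINE.search(skill_md)
    if match is None:
        raise ValueError("no 'version: X.Y.Z' line found")
    old = tuple(int(g) for g in match.groups())
    if new < old:
        # Assumed policy: the field follows the release line and never regresses.
        raise ValueError(f"refusing to downgrade {old} -> {new}")
    return skill_md[: match.start()] + f"version: {release}" + skill_md[match.end():]
```

Running it over each SKILL.md with the new release string (for example 1.3.0) leaves the rest of the file untouched.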
Companion Catalog PR
NVIDIA/skills#74 (draft) registers Dynamo with the public catalog via a one-file components.d/dynamo.yml manifest. Once this PR merges, the next daily sync workflow at NVIDIA/skills reads .agents/skills/ from the merged main, renders skills/Dynamo/ in the catalog, and regenerates the catalog README with Dynamo listed alongside the 16 existing products. No additional human action required for catalog publication.

Design Decisions
The seventeen decisions locked during authoring, each with rationale:
- … nvidia-inference-stack bundle
- dynamo-<verb> for Dynamo-owned components; product brands preserved for external (AIConfigurator, AIPerf, ModelOpt, NIXL, Grove, KAI)
- dynamo-run skill rejected: … python3 -m dynamo.<backend> (now owned by dynamo-serve)
- dynamo-install and dynamo-upgrade deferred
- dynamo-frontend covers Frontend + gateway in one skill
- .agents/skills/ as canonical: … .agents/skills/ natively; this repo already symlinks .claude/skills/ to it for Claude Code compatibility
- … ai-infra-agent; substituted for day-2 skill (dynamo-troubleshoot)
- … gpu-operator
- … nvidia-inference-ra-orchestrator
- references/ + scripts/ subdirectories
- … nvidia-inference-stack
- ## Workflow anchor (not phase header renames)
- version field tied to release line
- … NVIDIA/skills is a daily mirror; one components.d/dynamo.yml is the entire catalog contribution

Out of Scope and Future Work
Out of scope for this PR:
- dynamo-install skill (one-time cluster install of dynamo-platform) and dynamo-upgrade (release-to-release migration). Both are real workflows but need the end-to-end command sequences captured first. Tracked for a follow-up MR.
- … .agents/skills/ content into the alternate "public catalog style" (flat numbered steps, Scope/Input/Output lead-in). Discovered mid-stream that NVIDIA/skills mirrors source verbatim (no transform required), so this pass is unnecessary.
- … .claude/skills/ skills. Those are agent-developer skills (debug-session, dep-status, gh-issue-bug, etc.) with a lighter convention. Unchanged.

Future work (next Dynamo release cycle):

- … version field to the new Dynamo release (e.g., 1.3.0) and re-run the per-release refresh procedure. Touchpoints: backend pins, container tags, NIXL refs, recipe set, known issues against the new release-line QA tracker.
- … dynamo-install and dynamo-upgrade to shippable skills once the install and upgrade command sequences are captured end-to-end.
- … run_script() protocol surfacing back to ai-infra-agent.

Acknowledgments
- ai-infra-agent team for the methodology: 4-phase workflow, command-tier rubric, HITL contract, script patterns, references/scripts layout, NV-ACES evaluation infrastructure. This skill set is a domain-specific application of work they did first.
- NVIDIA/skills catalog team for the daily sync infrastructure that makes the public mirror automatic once this PR merges.

Where should the reviewer start?

- .agents/skills/README.md: the philosophy, attribution, ten documented improvements, and full per-release update procedure. Most reviewer questions are answered here.
- .agents/skills/dynamo-deploy/SKILL.md: the most cross-cutting skill; canonical worked example for the structure used by all seven. Validates that the 4-phase workflow + DESTRUCTIVE/MUTATING/SAFE tables + HITL decision points pattern produces a usable skill for a real-world workflow.
- .agents/skills/dynamo-deploy/references/dgdr-shape.md: representative references/ content, an annotated CRD field reference. Shows how internal row-ID citations were stripped before publication (the stripped-cite convention is documented in the README).
- .agents/skills/dynamo-deploy/scripts/validate-dgdr.sh: representative scripts/ content, a pre-apply DGDR validator using the pass / fail / warn helper pattern with kubectl apply --dry-run=server as the safe default.
- .agents/skills/dynamo-troubleshoot/references/symptom-signatures.md: the day-2 signature library. Representative of the alternate 4-phase shape (Triage → Inspect → Diagnose → Remediate) for non-install workflows.

Skim any of the other six SKILL.md files to confirm the structural consistency. Each follows the same ## Workflow / phase-section / ## Available Scripts / ## Prerequisites / ## Limitations / ## Troubleshooting / ## References and Scripts shape.

Reviewer notes:

- … [A4], [F6], etc.) referenced in the README is stripped from the publication path by scripts/derive-public.sh (internal tool, not part of this PR). The skill content itself contains no row-ID markers.
- … references/*.md files help agents navigate between skills and the efficiency penalty is small (5 points).

Related Issues: