docs(trtllm): deployment example yamls + HPA autoscale demo for MX P2P by KavinKrishnan · Pull Request #271 · ai-dynamo/modelexpress

KavinKrishnan · 2026-05-08T01:59:07Z

Adds runnable Kubernetes manifests and a step-by-step run book for the ModelExpress + TRT-LLM P2P weight-loading integration on GCP GB200. This is the deployment-side companion to TRT-LLM PR #13531 (MXCheckpointLoader, merged 2026-05-06) and Dynamo PR #8037 (--model-express-url CLI, in review).

What's added under examples/p2p_transfer_k8s/client/trtllm/:

mx-infra-decode.yaml — ModelExpress server + Redis deployment with placeholders for namespace, registry, CPU node pool. Tunable reaper / GC timeouts (MX_HEARTBEAT_TIMEOUT_SECS=4500s, MX_GC_TIMEOUT_SECS=4500s) sized to cover 685B-class disk loads without permanently disabling cleanup.
kimi-source-decode-dgd.yaml — Source DGD (Kimi K2.5 TP=8 across 2 nodes). Loads from disk, then publishes via publish_model_params. Auto-detected as source by probing MX server for existing entries.
kimi-disagg-mx-tp8-dgd.yaml — Target DGD (Frontend + Prefill TP=8 + Decode TP=8). Auto-detected as target. Receives weights via MxLiveWeightLoader / NIXL RDMA in ~2 seconds per rank instead of 15-20 minutes from disk.
hpa/kimi-agg-autoscale-dgd.yaml — Aggregated worker DGD where every replica uses the same spec; auto-detect handles source vs target. Designed for HPA-driven horizontal scaling.
hpa/kimi-agg-autoscale-hpa.yaml — DGDSA + HorizontalPodAutoscaler wrapping the aggregated DGD. Exposes the scale subresource so HPA can drive replica count.
hpa/README.md — End-to-end demo: first replica ~22 minutes from disk, subsequent HPA-driven replicas ~5 minutes via RDMA at 361-583 Gbps/rank (validated on Kimi K2.5, GCP GB200, 4x 400G RoCE).
README.md (overall guide) — Quick-start (5 steps), build instructions, image-stack guidance for both the post-PR-13531 recommended path (just pip install modelexpress on top of tensorrtllm-runtime once it bumps to TRT-LLM 1.3.0rc15+) and a bridge path that points at the patch-based Dockerfiles still hosted on the kavink/trtllm_clean branch (PR #218) for users on older runtime images.
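
The source/target auto-detection described for the DGDs above can be sketched roughly as follows. This is a minimal illustration and not the MXCheckpointLoader implementation: `detect_role`, `load_weights`, the dict-backed registry, and the two callbacks are hypothetical stand-ins for the MX server probe, the disk load, and the RDMA receive. Only the order of operations (probe first, then either load from disk and publish, or receive from an existing source) is taken from the description.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for the ModelExpress server's metadata store:
# model name -> list of worker ids that have published weights.
Registry = Dict[str, List[str]]


def detect_role(registry: Registry, model: str) -> str:
    """Return "target" if the model already has published entries, else "source"."""
    return "target" if registry.get(model) else "source"


def load_weights(
    registry: Registry,
    model: str,
    worker_id: str,
    load_from_disk: Callable[[str], None],
    receive_via_rdma: Callable[[str, str], None],
) -> str:
    """Probe the registry, then take the source or target path accordingly."""
    role = detect_role(registry, model)
    if role == "source":
        # No existing entry: load from disk, then publish so later replicas can pull.
        load_from_disk(model)
        registry.setdefault(model, []).append(worker_id)
    else:
        # An entry exists: pull weights from the first published worker.
        receive_via_rdma(model, registry[model][0])
    return role
```

Because every replica runs the same logic, the first replica to start becomes the source and later replicas become targets, which is why the aggregated HPA example can use a single spec for all replicas.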

All yamls use placeholders (<NAMESPACE>, <REGISTRY>/<NAME>:<TAG>, <GPU_NODE_POOL>, <CPU_NODE_POOL>, <MX_INFRA_*>) so users replace cluster-specific values before applying.
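
One way to guard against applying a manifest with a placeholder still in it is to substitute values in a small script and fail on anything left over. The sketch below is an assumption about workflow, not something shipped in this PR: `render_manifest` and the two-line sample manifest are hypothetical, and the pattern simply matches upper-case `<...>` tokens such as `<GPU_NODE_POOL>` or `<MX_INFRA_*>`.

```python
import re

# Matches upper-case angle-bracket placeholders, e.g. <NAMESPACE> or <MX_INFRA_*>.
PLACEHOLDER = re.compile(r"<[A-Z][A-Z0-9_]*\*?>")


def render_manifest(text: str, values: dict) -> str:
    """Replace <KEY> placeholders with values and fail if any remain."""
    for key, value in values.items():
        text = text.replace(f"<{key}>", value)
    leftover = sorted(set(PLACEHOLDER.findall(text)))
    if leftover:
        raise ValueError(f"unreplaced placeholders: {leftover}")
    return text


if __name__ == "__main__":
    # Hypothetical fragment for illustration; not copied from the real manifests.
    sample = "namespace: <NAMESPACE>\nnodePool: <GPU_NODE_POOL>\n"
    print(render_manifest(sample, {"NAMESPACE": "demo", "GPU_NODE_POOL": "gb200-pool"}))
```

The rendered text can then be piped to `kubectl apply -f -`; failing early is cheaper than debugging a pod scheduled against a literal `<GPU_NODE_POOL>` selector.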

This PR carries the deployment-example slice of PR #218 (kavink/trtllm_clean). The patch-shim slice retires once the next tensorrtllm-runtime image cuts post-PR #13531; PR #218 will be closed without merging at that point.

Validated end-to-end on GCP GB200 (dynamo-gcp-dev-02, kavin namespace): 16 target ranks x 90.75 GB transferred at 363-506 Gbps, end-to-end disaggregated serving (prefill + decode + frontend) verified.
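
As a quick sanity check, the reported numbers are consistent with the "~2 seconds per rank" figure quoted earlier. The arithmetic below assumes that 90.75 GB is the per-rank payload, that GB means 10^9 bytes, and that Gbps means 10^9 bits per second; none of these conventions is stated explicitly in the PR.

```python
def transfer_seconds(size_gb: float, rate_gbps: float) -> float:
    """Seconds needed to move size_gb gigabytes at rate_gbps gigabits per second."""
    return size_gb * 8 / rate_gbps


if __name__ == "__main__":
    for rate in (363, 506):  # slowest and fastest reported per-rank rates
        print(f"{rate} Gbps -> {transfer_seconds(90.75, rate):.2f} s per rank")
```

Under those assumptions the slowest reported rank takes about 2.0 s and the fastest about 1.4 s for the transfer itself.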

Companion PRs:

TRT-LLM #13531: MXCheckpointLoader (merged 2026-05-06)
ModelExpress #202: MxLiveWeightLoader, publish_model_params (merged)
ModelExpress #267: MX_POOL_REG allocation-based registration (merged)
Dynamo #8037: --model-express-url CLI integration (open)

Summary by CodeRabbit

Documentation
- Added comprehensive autoscaling guide for TRT-LLM deployments with Kubernetes HPA.
- Expanded ModelExpress P2P deployment documentation with updated workflows and validation steps.
New Features
- Added Kubernetes manifests for HPA-driven autoscaling of TRT-LLM inference deployments.
Configuration Updates
- Enhanced deployment manifests with namespace and registry parameterization for improved flexibility.
- Updated server configuration for improved stability and metadata handling.

coderabbitai · 2026-05-08T02:01:47Z

Walkthrough

This PR updates Kubernetes manifests and documentation for ModelExpress P2P weight transfer with TRT-LLM autoscaling on GCP GB200. The main README is rewritten to focus on source vs target P2P behavior, Quick Start deployment steps, and TRT-LLM-specific configuration. A new HPA autoscaling example is documented and provided with working manifests. Existing deployment manifests are parameterized to support multiple namespaces and container registries.

Changes

TRT-LLM P2P Documentation and HPA Example

Layer / File(s)	Summary
P2P Transfer Concepts `examples/p2p_transfer_k8s/client/trtllm/README.md`, `examples/p2p_transfer_k8s/client/trtllm/hpa/README.md`	Documentation explains Source behavior (auto-detect, disk load, publish) and Target behavior (auto-detect, MX RDMA transfer, presharding). HPA README introduces autoscaling approach with validated timing benchmarks.
Prerequisites, Image Setup, Config `examples/p2p_transfer_k8s/client/trtllm/README.md`	Guides for building TRT-LLM container images based on upstream PR merges, GCP GB200 required config, and file reference tables for new P2P-focused examples.
Deployment Walkthrough `examples/p2p_transfer_k8s/client/trtllm/README.md`, `examples/p2p_transfer_k8s/client/trtllm/hpa/README.md`	Step-by-step Quick Start flow for MX infrastructure, source, targets, RDMA verification, and cleanup. HPA walkthrough includes manual/inference-driven scale-up, RDMA log validation, and production tuning notes.
HPA Example Manifests `examples/p2p_transfer_k8s/client/trtllm/hpa/kimi-agg-autoscale-dgd.yaml`, `examples/p2p_transfer_k8s/client/trtllm/hpa/kimi-agg-autoscale-hpa.yaml`	New ConfigMap with Kimi K2.5 aggregated inference settings; new DynamoGraphDeployment source service (TP=8, multinode, namespace-parameterized); new DGDSA and HPA manifests for autoscaling (minReplicas=1, maxReplicas=3, CPU-based metric).
Legacy Content Removal `examples/p2p_transfer_k8s/client/trtllm/README.md`	Removes the older aggregated vs disaggregated inference sections and legacy testing commands, replacing them with consolidated TRT-LLM P2P guidance.
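
For readers unfamiliar with how a CPU-based HPA picks a replica count, the core rule from the Kubernetes documentation is `desired = ceil(current * currentMetric / targetMetric)`, clamped to the configured bounds. The sketch below applies that rule with the minReplicas=1 and maxReplicas=3 bounds mentioned in the table. The metric values in the usage lines are illustrative only, since the target utilization in `kimi-agg-autoscale-hpa.yaml` is not stated here, and the real controller also applies stabilization windows and pod-readiness handling that this sketch omits.

```python
import math


def desired_replicas(
    current: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 3,
    tolerance: float = 0.1,  # Kubernetes default tolerance around a ratio of 1.0
) -> int:
    """Simplified HPA rule: scale by the metric ratio, then clamp to the bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to target: no change
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(1, 90, 50))   # above target: scale up
    print(desired_replicas(2, 200, 50))  # far above target: capped at max_replicas
    print(desired_replicas(3, 10, 50))   # well below target: scale down
```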

Deployment Manifest Parameterization and Infrastructure Updates

Layer / File(s)	Summary
Container Image Parameterization `examples/p2p_transfer_k8s/client/trtllm/kimi-disagg-mx-tp8-dgd.yaml`, `examples/p2p_transfer_k8s/client/trtllm/kimi-source-decode-dgd.yaml`	Updates frontend, prefill, and decode container images from hardcoded `dynamo-trtllm-mx` to parameterized `<REGISTRY>/<NAME>:<TAG>` for reusability across registries.
Service Endpoint Parameterization `examples/p2p_transfer_k8s/client/trtllm/kimi-disagg-mx-tp8-dgd.yaml`, `examples/p2p_transfer_k8s/client/trtllm/kimi-source-decode-dgd.yaml`	Updates ModelExpress, NATS, and ETCD endpoint environment variables from hardcoded `default` namespace to `<NAMESPACE>` placeholder for multi-namespace deployments.
Compute Domain and RDMA Config `examples/p2p_transfer_k8s/client/trtllm/kimi-disagg-mx-tp8-dgd.yaml`, `examples/p2p_transfer_k8s/client/trtllm/kimi-source-decode-dgd.yaml`	Updates compute-domain resource claim templates to namespace-based `<NAMESPACE>-compute-domain-channel`; removes MX_RDMA_NIC_PIN environment variables from workers.
ModelExpress Server Updates `examples/p2p_transfer_k8s/client/trtllm/mx-infra-decode.yaml`	Updates server image to `nvcr.io/nvidian/dynamo-dev/modelexpress-server:latest` and adds `MX_HEARTBEAT_TIMEOUT_SECS` and `MX_GC_TIMEOUT_SECS` (both 4500) to reduce metadata reaping aggressiveness for long model loading.
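
To make the timeout sizing concrete: 4500 s is 75 minutes, comfortably above the ~22 minute first-replica disk load reported in the HPA README. The sketch below assumes the reaper drops an entry once the time since its last heartbeat exceeds the timeout. That semantics is an assumption for illustration, and `is_stale` is a hypothetical helper, not a ModelExpress function.

```python
HEARTBEAT_TIMEOUT_SECS = 4500  # value set in mx-infra-decode.yaml per this PR


def is_stale(last_heartbeat: float, now: float,
             timeout: float = HEARTBEAT_TIMEOUT_SECS) -> bool:
    """Assumed reaper rule: stale once the heartbeat gap exceeds the timeout."""
    return now - last_heartbeat > timeout


if __name__ == "__main__":
    disk_load_secs = 22 * 60  # ~22 minute first-replica load from disk
    print(HEARTBEAT_TIMEOUT_SECS / 60)   # timeout in minutes
    print(is_stale(0, disk_load_secs))   # a worker silent for the whole load
    print(is_stale(0, 80 * 60))          # a worker silent for 80 minutes
```

Under that assumption, a worker that is silent for an entire disk load still survives, while a worker that has genuinely disappeared is cleaned up after 75 minutes.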

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes


🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the primary change: documentation additions for TRT-LLM deployment examples and HPA autoscaling demo for ModelExpress P2P, matching the core content of the pull request.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/p2p_transfer_k8s/client/trtllm/README.md (1)

10-31: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace ASCII architecture diagram with Mermaid

The block on Lines 10-31 should be converted to a Mermaid diagram to match repository markdown standards.

As per coding guidelines, "**/*.md: Use mermaid diagrams instead of ASCII art in markdown files".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/p2p_transfer_k8s/client/trtllm/README.md` around lines 10 - 31,
Replace the ASCII art block (the fenced code block containing "ModelExpress
Server + Redis", "Source (TP=8, 2 nodes)", "Targets (prefill + decode)", and the
"NIXL RDMA (RoCE)" link) with a fenced mermaid diagram (```mermaid```), modeling
the same entities and relationships: Source -> publish metadata -> ModelExpress
Server + Redis, Targets -> query metadata -> ModelExpress Server + Redis, and
the bidirectional NIXL RDMA link connecting Source and Targets with the
throughput/note rendered as a sublabel or note; remove the ASCII block and
ensure the mermaid labels preserve the checklist items (checkpoint_format, MX
detection, fallbacks, publish_as_source, post_load_weights, _weights_presharded,
Serve) so the diagram conveys the same steps and roles.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/p2p_transfer_k8s/client/trtllm/hpa/README.md`:
- Around line 36-39: The README contains inconsistent status for TRT-LLM PR
`#13531`: the paragraph mentioning `MXCheckpointLoader` and
`checkpoint_format="MX"` says "merged 2026-05-06" while another reference
(around line 210) marks it as "Ready"; pick one canonical status and update both
occurrences so they match (e.g., change the "Ready" label at line 210 to "Merged
(2026-05-06)" or change the earlier sentence to "Ready") ensuring every
reference to TRT-LLM PR `#13531`, `MXCheckpointLoader`, and
`checkpoint_format="MX"` uses the same status text.
- Around line 148-150: The JSONPath filter in the NEW_LDR command uses an
unsupported `contains`/regex syntax; replace the invalid JSONPath with a
pipeline that outputs JSON and performs pattern matching externally (e.g., run
the kubectl get pods ... -o json and pipe into jq to select .items[] by
.metadata.name, then grep -E "source-1.*ldr$" and take the first match) so
NEW_LDR is assigned a single pod name reliably; update the command that sets
NEW_LDR (the kubectl get pods ... -o jsonpath=... invocation) to use this
jq/grep pipeline instead.
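
The second inline comment suggests moving the pattern match out of JSONPath. The review proposes a jq/grep pipeline; the same idea can be sketched in Python by parsing the output of `kubectl get pods -o json` and applying the regex from the comment. `first_matching_pod` is a hypothetical helper, and the README's exact kubectl invocation (namespace, label selectors) is not shown here.

```python
import json
import re
from typing import Optional


def first_matching_pod(pods_json: str, pattern: str = r"source-1.*ldr$") -> Optional[str]:
    """Return the first pod name matching pattern, or None if nothing matches."""
    regex = re.compile(pattern)
    for item in json.loads(pods_json).get("items", []):
        name = item["metadata"]["name"]
        if regex.search(name):
            return name
    return None


if __name__ == "__main__":
    import sys
    # Usage: kubectl get pods -n <NAMESPACE> -o json | python find_ldr.py
    print(first_matching_pod(sys.stdin.read()) or "")
```

Doing the match outside kubectl avoids depending on filter operators that kubectl's JSONPath implementation does not support, and returning only the first match keeps `NEW_LDR` a single pod name.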


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5494a23c-0ee5-41a6-8036-52a6f4dd73f2

📥 Commits

Reviewing files that changed from the base of the PR and between 7059e68 and e591f52.

📒 Files selected for processing (7)

examples/p2p_transfer_k8s/client/trtllm/README.md
examples/p2p_transfer_k8s/client/trtllm/hpa/README.md
examples/p2p_transfer_k8s/client/trtllm/hpa/kimi-agg-autoscale-dgd.yaml
examples/p2p_transfer_k8s/client/trtllm/hpa/kimi-agg-autoscale-hpa.yaml
examples/p2p_transfer_k8s/client/trtllm/kimi-disagg-mx-tp8-dgd.yaml
examples/p2p_transfer_k8s/client/trtllm/kimi-source-decode-dgd.yaml
examples/p2p_transfer_k8s/client/trtllm/mx-infra-decode.yaml

Signed-off-by: Kavin Krishnan <kavink@nvidia.com>

copy-pr-bot · 2026-05-21T18:25:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

pull-request-size Bot added the size/XL label May 8, 2026

KavinKrishnan had a problem deploying to GITLAB May 8, 2026 01:59 — with GitHub Actions Error

github-actions Bot added the docs label May 8, 2026

coderabbitai Bot reviewed May 8, 2026

Comment thread examples/p2p_transfer_k8s/client/trtllm/hpa/README.md

Comment thread examples/p2p_transfer_k8s/client/trtllm/hpa/README.md Outdated

KavinKrishnan force-pushed the kavink/trtllm-deploy-examples branch from e591f52 to be558f9 Compare May 8, 2026 02:13

KavinKrishnan temporarily deployed to GITLAB May 8, 2026 02:13 — with GitHub Actions Inactive

KavinKrishnan force-pushed the kavink/trtllm-deploy-examples branch from be558f9 to a7c77b7 Compare May 21, 2026 18:25

KavinKrishnan temporarily deployed to GITLAB May 21, 2026 18:26 — with GitHub Actions Inactive

KavinKrishnan wants to merge 1 commit into main from kavink/trtllm-deploy-examples
