docs(trtllm): deployment example yamls + HPA autoscale demo for MX P2P #271
KavinKrishnan wants to merge 1 commit into
Walkthrough
This PR updates Kubernetes manifests and documentation for ModelExpress P2P weight transfer with TRT-LLM autoscaling on GCP GB200. The main README is rewritten to focus on source vs target P2P behavior, Quick Start deployment steps, and TRT-LLM-specific configuration. A new HPA autoscaling example is documented and provided with working manifests. Existing deployment manifests are parameterized to support multiple namespaces and container registries.
Changes
TRT-LLM P2P Documentation and HPA Example
Deployment Manifest Parameterization and Infrastructure Updates
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/p2p_transfer_k8s/client/trtllm/README.md (1)
10-31: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Replace ASCII architecture diagram with Mermaid
The block on Lines 10-31 should be converted to a Mermaid diagram to match repository markdown standards.
As per coding guidelines, "**/*.md: Use mermaid diagrams instead of ASCII art in markdown files".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/p2p_transfer_k8s/client/trtllm/hpa/README.md`:
- Around line 36-39: The README contains inconsistent status for TRT-LLM PR
`#13531`: the paragraph mentioning `MXCheckpointLoader` and
`checkpoint_format="MX"` says "merged 2026-05-06" while another reference
(around line 210) marks it as "Ready"; pick one canonical status and update both
occurrences so they match (e.g., change the "Ready" label at line 210 to "Merged
(2026-05-06)" or change the earlier sentence to "Ready") ensuring every
reference to TRT-LLM PR `#13531`, `MXCheckpointLoader`, and
`checkpoint_format="MX"` uses the same status text.
- Around line 148-150: The JSONPath filter in the NEW_LDR command uses an
unsupported `contains`/regex syntax; replace the invalid JSONPath with a
pipeline that outputs JSON and performs pattern matching externally (e.g., run
the kubectl get pods ... -o json and pipe into jq to select .items[] by
.metadata.name, then grep -E "source-1.*ldr$" and take the first match) so
NEW_LDR is assigned a single pod name reliably; update the command that sets
NEW_LDR (the kubectl get pods ... -o jsonpath=... invocation) to use this
jq/grep pipeline instead.
---
Outside diff comments:
In `@examples/p2p_transfer_k8s/client/trtllm/README.md`:
- Around line 10-31: Replace the ASCII art block (the fenced code block
containing "ModelExpress Server + Redis", "Source (TP=8, 2 nodes)", "Targets
(prefill + decode)", and the "NIXL RDMA (RoCE)" link) with a fenced mermaid
diagram (```mermaid```), modeling the same entities and relationships: Source ->
publish metadata -> ModelExpress Server + Redis, Targets -> query metadata ->
ModelExpress Server + Redis, and the bidirectional NIXL RDMA link connecting
Source and Targets with the throughput/note rendered as a sublabel or note;
remove the ASCII block and ensure the mermaid labels preserve the checklist
items (checkpoint_format, MX detection, fallbacks, publish_as_source,
post_load_weights, _weights_presharded, Serve) so the diagram conveys the same
steps and roles.
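For the `NEW_LDR` finding above, one way to do the pattern matching outside JSONPath is a few lines of Python over the output of `kubectl get pods -o json`. This is a minimal sketch, not the command from the README: the function name `first_matching_pod`, the sample pod names, and the choice of Python instead of the jq/grep pipeline suggested in the review are assumptions made for illustration. Only the pattern `source-1.*ldr$` comes from the review comment.

```python
import json
import re
from typing import Optional


def first_matching_pod(pods_json: str, pattern: str = r"source-1.*ldr$") -> Optional[str]:
    """Return the first pod name matching `pattern`, or None if none match.

    `pods_json` is expected to have the shape of `kubectl get pods -o json`
    output: a top-level "items" list whose entries carry metadata.name.
    """
    regex = re.compile(pattern)
    for item in json.loads(pods_json).get("items", []):
        name = item.get("metadata", {}).get("name", "")
        if regex.search(name):
            return name
    return None


# Hypothetical pod list for illustration; real names depend on the deployment.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "example-source-0-ldr"}},
        {"metadata": {"name": "example-source-1-wkr"}},
        {"metadata": {"name": "example-source-1-ldr"}},
    ]
})

print(first_matching_pod(SAMPLE))
```

Returning `None` when nothing matches lets the caller fail loudly instead of continuing with an empty `NEW_LDR`.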
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5494a23c-0ee5-41a6-8036-52a6f4dd73f2
📒 Files selected for processing (7)
examples/p2p_transfer_k8s/client/trtllm/README.md
examples/p2p_transfer_k8s/client/trtllm/hpa/README.md
examples/p2p_transfer_k8s/client/trtllm/hpa/kimi-agg-autoscale-dgd.yaml
examples/p2p_transfer_k8s/client/trtllm/hpa/kimi-agg-autoscale-hpa.yaml
examples/p2p_transfer_k8s/client/trtllm/kimi-disagg-mx-tp8-dgd.yaml
examples/p2p_transfer_k8s/client/trtllm/kimi-source-decode-dgd.yaml
examples/p2p_transfer_k8s/client/trtllm/mx-infra-decode.yaml
e591f52 to be558f9
Adds runnable Kubernetes manifests and a step-by-step runbook for the ModelExpress + TRT-LLM P2P weight-loading integration on GCP GB200. This is the deployment-side companion to TRT-LLM PR #13531 (MXCheckpointLoader, merged 2026-05-06) and Dynamo PR #8037 (--model-express-url CLI, in review).

What's added under examples/p2p_transfer_k8s/client/trtllm/:

* mx-infra-decode.yaml: ModelExpress server + Redis deployment with placeholders for namespace, registry, and CPU node pool. Tunable reaper / GC timeouts (MX_HEARTBEAT_TIMEOUT_SECS=4500s, MX_GC_TIMEOUT_SECS=4500s) sized to cover 685B-class disk loads without permanently disabling cleanup.
* kimi-source-decode-dgd.yaml: Source DGD (Kimi K2.5 TP=8 across 2 nodes). Loads from disk, then publishes via publish_model_params. Auto-detected as source by probing the MX server for existing entries.
* kimi-disagg-mx-tp8-dgd.yaml: Target DGD (Frontend + Prefill TP=8 + Decode TP=8). Auto-detected as target. Receives weights via MxLiveWeightLoader / NIXL RDMA in ~2 seconds per rank instead of 15-20 minutes from disk.
* hpa/kimi-agg-autoscale-dgd.yaml: Aggregated worker DGD where every replica uses the same spec; auto-detect handles source vs target. Designed for HPA-driven horizontal scaling.
* hpa/kimi-agg-autoscale-hpa.yaml: DGDSA + HorizontalPodAutoscaler wrapping the aggregated DGD. Exposes the scale subresource so HPA can drive replica count.
* hpa/README.md: End-to-end demo: first replica ~22 minutes from disk, subsequent HPA-driven replicas ~5 minutes via RDMA at 361-583 Gbps/rank (validated on Kimi K2.5, GCP GB200, 4x 400G RoCE).
* README.md (overall guide): Quick-start (5 steps), build instructions, and image-stack guidance for two paths: the recommended post-PR-13531 path (just pip install modelexpress on top of tensorrtllm-runtime once it bumps to TRT-LLM 1.3.0rc15+), and a bridge path that points at the patch-based Dockerfiles still hosted on the kavink/trtllm_clean branch (PR #218) for users on older runtime images.
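The source-versus-target auto-detection described in the list above can be illustrated with a toy model. This sketch assumes a simplified in-memory registry: `FakeMxRegistry`, `start_replica`, and the model key are hypothetical names for illustration and are not the ModelExpress API. It only mirrors the decision the commit message describes, where a replica that finds no existing entry acts as the source and publishes, and a replica that finds one acts as a target.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeMxRegistry:
    """Toy stand-in for the MX server's metadata store (hypothetical)."""

    entries: Dict[str, List[str]] = field(default_factory=dict)

    def has_source(self, model: str) -> bool:
        # True once at least one replica has published this model.
        return bool(self.entries.get(model))

    def publish(self, model: str, replica: str) -> None:
        self.entries.setdefault(model, []).append(replica)


def start_replica(registry: FakeMxRegistry, model: str, replica: str) -> str:
    """Return the role a replica would take under the assumed probe logic."""
    if registry.has_source(model):
        # An entry exists: receive weights from a peer instead of disk.
        return "target"
    # No entry yet: load from disk (not modeled here), then publish.
    registry.publish(model, replica)
    return "source"


registry = FakeMxRegistry()
roles = [start_replica(registry, "kimi-k2.5", f"replica-{i}") for i in range(3)]
print(roles)
```

The sketch ignores races between replicas that start at the same time and the delay between starting a disk load and publishing; how the real integration handles those cases is not stated here.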
All yamls use placeholders (<NAMESPACE>, <REGISTRY>/<NAME>:<TAG>, <GPU_NODE_POOL>, <CPU_NODE_POOL>, <MX_INFRA_*>) so users replace cluster-specific values before applying.

This PR carries the deployment-example slice of PR #218 (kavink/trtllm_clean). The patch-shim slice retires once the next tensorrtllm-runtime image cuts post-PR #13531; PR #218 will be closed without merging at that point.

Validated end-to-end on GCP GB200 (dynamo-gcp-dev-02, kavin namespace): 16 target ranks x 90.75 GB transferred at 363-506 Gbps, end-to-end disaggregated serving (prefill + decode + frontend) verified.

Companion PRs:

* TRT-LLM #13531: MXCheckpointLoader (merged 2026-05-06)
* ModelExpress #202: MxLiveWeightLoader, publish_model_params (merged)
* ModelExpress #267: MX_POOL_REG allocation-based registration (merged)
* Dynamo #8037: --model-express-url CLI integration (open)

Signed-off-by: Kavin Krishnan <kavink@nvidia.com>
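The validation figures in this message are consistent with the "~2 seconds per rank" claim, which a quick calculation confirms. This sketch assumes decimal units (1 GB = 10^9 bytes, 1 Gbps = 10^9 bits per second) and that the quoted 363-506 Gbps range is per rank; `transfer_seconds` is an illustrative helper, not part of any library.

```python
def transfer_seconds(size_gb: float, rate_gbps: float) -> float:
    """Wire time in seconds for `size_gb` gigabytes at `rate_gbps` gigabits/s."""
    # 8 bits per byte converts gigabytes to gigabits.
    return size_gb * 8.0 / rate_gbps


SIZE_GB = 90.75  # per-rank payload from the validation note

slowest = transfer_seconds(SIZE_GB, 363.0)
fastest = transfer_seconds(SIZE_GB, 506.0)

print(f"{slowest:.2f} s at 363 Gbps, {fastest:.2f} s at 506 Gbps")
# prints "2.00 s at 363 Gbps, 1.43 s at 506 Gbps"
```

This covers only the transfer itself; the ~5 minute figure quoted for HPA-driven replicas describes a whole replica start, so it is expected to be much larger than the per-rank wire time.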
be558f9 to a7c77b7