docs(trtllm): deployment example yamls + HPA autoscale demo for MX P2P #271

Open
KavinKrishnan wants to merge 1 commit into main from kavink/trtllm-deploy-examples

KavinKrishnan commented May 8, 2026

Adds runnable Kubernetes manifests and a step-by-step run book for the ModelExpress + TRT-LLM P2P weight-loading integration on GCP GB200. This is the deployment-side companion to TRT-LLM PR #13531 (MXCheckpointLoader, merged 2026-05-06) and Dynamo PR #8037 (--model-express-url CLI, in review).

What's added under examples/p2p_transfer_k8s/client/trtllm/:

  • mx-infra-decode.yaml — ModelExpress server + Redis deployment with placeholders for namespace, registry, and CPU node pool. The reaper / GC timeouts are tunable (MX_HEARTBEAT_TIMEOUT_SECS=4500, MX_GC_TIMEOUT_SECS=4500, i.e. 75 minutes each) and sized to cover 685B-class disk loads without permanently disabling cleanup.

  • kimi-source-decode-dgd.yaml — Source DGD (Kimi K2.5 TP=8 across 2 nodes). Loads from disk, then publishes via publish_model_params. Auto-detected as source by probing MX server for existing entries.

  • kimi-disagg-mx-tp8-dgd.yaml — Target DGD (Frontend + Prefill TP=8 + Decode TP=8). Auto-detected as target. Receives weights via MxLiveWeightLoader / NIXL RDMA in ~2 seconds per rank instead of 15-20 minutes from disk.

  • hpa/kimi-agg-autoscale-dgd.yaml — Aggregated worker DGD where every replica uses the same spec; auto-detect handles source vs target. Designed for HPA-driven horizontal scaling.

  • hpa/kimi-agg-autoscale-hpa.yaml — DGDSA + HorizontalPodAutoscaler wrapping the aggregated DGD. Exposes the scale subresource so HPA can drive replica count.

  • hpa/README.md — End-to-end demo: first replica ~22 minutes from disk, subsequent HPA-driven replicas ~5 minutes via RDMA at 361-583 Gbps/rank (validated on Kimi K2.5, GCP GB200, 4x 400G RoCE).

  • README.md (overall guide) — Quick start (5 steps), build instructions, and image-stack guidance for two paths: the recommended post-PR-13531 path (pip install modelexpress on top of tensorrtllm-runtime once it bumps to TRT-LLM 1.3.0rc15+) and a bridge path that points at the patch-based Dockerfiles still hosted on the kavink/trtllm_clean branch (PR #218, "feat(trtllm): MXCheckpointLoader deployment support with auto-detect and RDMA validation") for users on older runtime images.
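
The source-versus-target auto-detect described in the list above can be pictured with a small sketch. This is not the MXCheckpointLoader implementation: the in-memory `registry` dict and the `detect_role` and `publish` helpers are hypothetical stand-ins for probing the MX server, and the model name is illustrative. The sketch only shows the decision itself: a replica that finds no published entry for its model acts as source, and any later replica acts as target.

```python
from typing import Dict, List

# Hypothetical stand-in for the MX server's metadata store:
# model name -> list of ranks that have published weights.
Registry = Dict[str, List[int]]


def detect_role(registry: Registry, model: str) -> str:
    """Return "target" if some source already published this model, else "source"."""
    return "target" if registry.get(model) else "source"


def publish(registry: Registry, model: str, rank: int) -> None:
    """Record that `rank` has loaded `model` and can serve it to targets."""
    registry.setdefault(model, []).append(rank)


registry: Registry = {}
first = detect_role(registry, "kimi-k2.5")   # nothing published yet
publish(registry, "kimi-k2.5", rank=0)       # first replica loads from disk, then publishes
second = detect_role(registry, "kimi-k2.5")  # an entry exists now
print(first, second)
```

Because every replica runs the same check, the same spec can be used for all replicas, which is the property the aggregated HPA example relies on.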

All yamls use placeholders (<NAMESPACE>, <REGISTRY>/<NAME>:<TAG>, <GPU_NODE_POOL>, <CPU_NODE_POOL>, <MX_INFRA_*>) so users replace cluster-specific values before applying.
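
One way to fill in such placeholders before applying a manifest is a small substitution helper. The sketch below is an assumption about workflow, not something shipped in this PR: `render_manifest`, the sample fragment, and all the values are hypothetical. It fails loudly when a placeholder has no value, so a half-rendered manifest is never applied. Wildcard families such as <MX_INFRA_*> are not handled here and would need their concrete names listed.

```python
import re
from typing import Dict

# Matches placeholders written as <UPPER_CASE_NAME>.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def render_manifest(text: str, values: Dict[str, str]) -> str:
    """Replace <NAME>-style placeholders; raise if any placeholder has no value."""
    found = {m.group(1) for m in PLACEHOLDER.finditer(text)}
    missing = sorted(found - set(values))
    if missing:
        raise KeyError(f"no value for placeholders: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], text)


# Illustrative fragment only; not copied from the PR's manifests.
fragment = "namespace: <NAMESPACE>\nimage: <REGISTRY>/<NAME>:<TAG>\n"
rendered = render_manifest(
    fragment,
    {"NAMESPACE": "demo", "REGISTRY": "registry.example.com", "NAME": "trtllm-mx", "TAG": "dev"},
)
print(rendered)
```

The rendered text can then be written to a file and passed to `kubectl apply -f`.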

This PR carries the deployment-example slice of PR #218 (kavink/trtllm_clean). The patch-shim slice retires once the next tensorrtllm-runtime image cuts post-PR #13531; PR #218 will be closed without merging at that point.

Validated end-to-end on GCP GB200 (dynamo-gcp-dev-02, kavin namespace): 16 target ranks x 90.75 GB transferred at 363-506 Gbps, end-to-end disaggregated serving (prefill + decode + frontend) verified.
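
As a rough consistency check on these figures, the per-rank transfer time can be derived from the payload size and the reported rates. The sketch assumes that 90.75 GB is the per-rank payload, that GB and Gbps are decimal units, and that the rate is sustained for the whole transfer; `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(size_gb: float, rate_gbps: float) -> float:
    """Seconds to move size_gb gigabytes at rate_gbps gigabits per second."""
    return size_gb * 8 / rate_gbps


per_rank_gb = 90.75
slow = transfer_seconds(per_rank_gb, 363)  # lowest reported rate
fast = transfer_seconds(per_rank_gb, 506)  # highest reported rate
total_tb = 16 * per_rank_gb / 1000         # all 16 target ranks
print(f"{fast:.2f}-{slow:.2f} s per rank, {total_tb:.3f} TB total")
```

Under those assumptions each rank needs roughly 1.4 to 2.0 seconds, which agrees with the "~2 seconds per rank" quoted for the target DGD.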

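Companion PRs:

  • TRT-LLM #13531: MXCheckpointLoader (merged 2026-05-06)
  • ModelExpress #202: MxLiveWeightLoader, publish_model_params (merged)
  • ModelExpress #267: MX_POOL_REG allocation-based registration (merged)
  • Dynamo #8037: --model-express-url CLI integration (open)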

Summary by CodeRabbit

  • Documentation

    • Added comprehensive autoscaling guide for TRT-LLM deployments with Kubernetes HPA.
    • Expanded ModelExpress P2P deployment documentation with updated workflows and validation steps.
  • New Features

    • Added Kubernetes manifests for HPA-driven autoscaling of TRT-LLM inference deployments.
  • Configuration Updates

    • Enhanced deployment manifests with namespace and registry parameterization for improved flexibility.
    • Updated server configuration for improved stability and metadata handling.

coderabbitai Bot commented May 8, 2026

Walkthrough

This PR updates Kubernetes manifests and documentation for ModelExpress P2P weight transfer with TRT-LLM autoscaling on GCP GB200. The main README is rewritten to focus on source vs target P2P behavior, Quick Start deployment steps, and TRT-LLM-specific configuration. A new HPA autoscaling example is documented and provided with working manifests. Existing deployment manifests are parameterized to support multiple namespaces and container registries.

Changes

TRT-LLM P2P Documentation and HPA Example

| Layer | File(s) | Summary |
|---|---|---|
| P2P Transfer Concepts | examples/p2p_transfer_k8s/client/trtllm/README.md, examples/p2p_transfer_k8s/client/trtllm/hpa/README.md | Documentation explains Source behavior (auto-detect, disk load, publish) and Target behavior (auto-detect, MX RDMA transfer, presharding). HPA README introduces autoscaling approach with validated timing benchmarks. |
| Prerequisites, Image Setup, Config | examples/p2p_transfer_k8s/client/trtllm/README.md | Guides for building TRT-LLM container images based on upstream PR merges, GCP GB200 required config, and file reference tables for new P2P-focused examples. |
| Deployment Walkthrough | examples/p2p_transfer_k8s/client/trtllm/README.md, examples/p2p_transfer_k8s/client/trtllm/hpa/README.md | Step-by-step Quick Start flow for MX infrastructure, source, targets, RDMA verification, and cleanup. HPA walkthrough includes manual/inference-driven scale-up, RDMA log validation, and production tuning notes. |
| HPA Example Manifests | examples/p2p_transfer_k8s/client/trtllm/hpa/kimi-agg-autoscale-dgd.yaml, examples/p2p_transfer_k8s/client/trtllm/hpa/kimi-agg-autoscale-hpa.yaml | New ConfigMap with Kimi K2.5 aggregated inference settings; new DynamoGraphDeployment source service (TP=8, multinode, namespace-parameterized); new DGDSA and HPA manifests for autoscaling (minReplicas=1, maxReplicas=3, CPU-based metric). |
| Legacy Content Removal | examples/p2p_transfer_k8s/client/trtllm/README.md | Removes the older aggregated vs disaggregated inference sections and legacy testing commands, replacing them with consolidated TRT-LLM P2P guidance. |
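
For readers unfamiliar with how an HPA turns a CPU metric into a replica count, the following sketch reproduces the scaling rule documented for the Kubernetes HorizontalPodAutoscaler: desired = ceil(current × metric / target), skipped when the ratio is within a tolerance, then clamped to the configured bounds. The bounds default to the minReplicas=1 and maxReplicas=3 from the summary above. The 80% target utilization in the usage lines is an assumption, since the summary does not state the target, and `desired_replicas` is a hypothetical helper, not part of the manifests.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 3,
                     tolerance: float = 0.1) -> int:
    """Replica count an HPA would request for the observed metric value."""
    ratio = metric / target
    # Within tolerance of the target the HPA leaves the replica count alone.
    wanted = current if abs(ratio - 1.0) <= tolerance else math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# Assumed target of 80% average CPU utilization (not taken from the PR).
print(desired_replicas(current=1, metric=160, target=80))  # load doubles
print(desired_replicas(current=2, metric=200, target=80))  # capped by max_replicas
print(desired_replicas(current=3, metric=20, target=80))   # load drops
```

With P2P weight loading, each replica added this way is expected to join as a target rather than reload from disk.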

Deployment Manifest Parameterization and Infrastructure Updates

| Layer | File(s) | Summary |
|---|---|---|
| Container Image Parameterization | examples/p2p_transfer_k8s/client/trtllm/kimi-disagg-mx-tp8-dgd.yaml, examples/p2p_transfer_k8s/client/trtllm/kimi-source-decode-dgd.yaml | Updates frontend, prefill, and decode container images from hardcoded dynamo-trtllm-mx to parameterized <REGISTRY>/<NAME>:<TAG> for reusability across registries. |
| Service Endpoint Parameterization | examples/p2p_transfer_k8s/client/trtllm/kimi-disagg-mx-tp8-dgd.yaml, examples/p2p_transfer_k8s/client/trtllm/kimi-source-decode-dgd.yaml | Updates ModelExpress, NATS, and ETCD endpoint environment variables from hardcoded default namespace to <NAMESPACE> placeholder for multi-namespace deployments. |
| Compute Domain and RDMA Config | examples/p2p_transfer_k8s/client/trtllm/kimi-disagg-mx-tp8-dgd.yaml, examples/p2p_transfer_k8s/client/trtllm/kimi-source-decode-dgd.yaml | Updates compute-domain resource claim templates to namespace-based <NAMESPACE>-compute-domain-channel; removes MX_RDMA_NIC_PIN environment variables from workers. |
| ModelExpress Server Updates | examples/p2p_transfer_k8s/client/trtllm/mx-infra-decode.yaml | Updates server image to nvcr.io/nvidian/dynamo-dev/modelexpress-server:latest and adds MX_HEARTBEAT_TIMEOUT_SECS and MX_GC_TIMEOUT_SECS (both 4500) to reduce metadata reaping aggressiveness for long model loading. |
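
To see how much headroom the 4500-second timeouts leave, they can be compared against the disk-load durations quoted in the PR description (15-20 minutes, and about 22 minutes for the first HPA replica). This is a back-of-the-envelope sketch; `headroom` is a hypothetical helper, and it assumes the timeout only has to outlast a single disk load.

```python
MX_HEARTBEAT_TIMEOUT_SECS = 4500
MX_GC_TIMEOUT_SECS = 4500


def headroom(timeout_secs: int, load_minutes: float) -> float:
    """How many times the load duration fits inside the timeout."""
    return timeout_secs / (load_minutes * 60)


timeout_minutes = MX_HEARTBEAT_TIMEOUT_SECS / 60  # 75 minutes
for load in (15, 20, 22):  # disk-load durations from the PR description, in minutes
    ratio = headroom(MX_HEARTBEAT_TIMEOUT_SECS, load)
    print(f"{load} min load: timeout is {ratio:.1f}x the load time")
```

Even for the slowest quoted load the timeout is more than three times the load duration, while still being finite so stale metadata is eventually reaped.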

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 Weights now hop through networks swift,
No disk delays, just RDMA gift—
P2P transfers at the speed of thought,
AutoScale the grace that NVIDIA brought!
Namespaced configs, templates that shine,
TB/s bandwidth, oh so fine! 🚀

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the primary change: documentation additions for TRT-LLM deployment examples and HPA autoscaling demo for ModelExpress P2P, matching the core content of the pull request. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/p2p_transfer_k8s/client/trtllm/README.md (1)

10-31: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace ASCII architecture diagram with Mermaid

The block on Lines 10-31 should be converted to a Mermaid diagram to match repository markdown standards.

As per coding guidelines, "**/*.md: Use mermaid diagrams instead of ASCII art in markdown files".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/p2p_transfer_k8s/client/trtllm/README.md` around lines 10 - 31,
Replace the ASCII art block (the fenced code block containing "ModelExpress
Server + Redis", "Source (TP=8, 2 nodes)", "Targets (prefill + decode)", and the
"NIXL RDMA (RoCE)" link) with a fenced mermaid diagram (```mermaid```), modeling
the same entities and relationships: Source -> publish metadata -> ModelExpress
Server + Redis, Targets -> query metadata -> ModelExpress Server + Redis, and
the bidirectional NIXL RDMA link connecting Source and Targets with the
throughput/note rendered as a sublabel or note; remove the ASCII block and
ensure the mermaid labels preserve the checklist items (checkpoint_format, MX
detection, fallbacks, publish_as_source, post_load_weights, _weights_presharded,
Serve) so the diagram conveys the same steps and roles.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/p2p_transfer_k8s/client/trtllm/hpa/README.md`:
- Around line 36-39: The README contains inconsistent status for TRT-LLM PR
`#13531`: the paragraph mentioning `MXCheckpointLoader` and
`checkpoint_format="MX"` says "merged 2026-05-06" while another reference
(around line 210) marks it as "Ready"; pick one canonical status and update both
occurrences so they match (e.g., change the "Ready" label at line 210 to "Merged
(2026-05-06)" or change the earlier sentence to "Ready") ensuring every
reference to TRT-LLM PR `#13531`, `MXCheckpointLoader`, and
`checkpoint_format="MX"` uses the same status text.
- Around line 148-150: The JSONPath filter in the NEW_LDR command uses an
unsupported `contains`/regex syntax; replace the invalid JSONPath with a
pipeline that outputs JSON and performs pattern matching externally (e.g., run
the kubectl get pods ... -o json and pipe into jq to select .items[] by
.metadata.name, then grep -E "source-1.*ldr$" and take the first match) so
NEW_LDR is assigned a single pod name reliably; update the command that sets
NEW_LDR (the kubectl get pods ... -o jsonpath=... invocation) to use this
jq/grep pipeline instead.
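
The pattern match requested in the second inline comment can also be done outside JSONPath in a few lines of Python instead of jq and grep. This is a sketch only: `first_matching_pod` is a hypothetical helper, the sample pod names are made up, and the default pattern is the `source-1.*ldr$` expression quoted in the comment.

```python
import json
import re
from typing import Optional


def first_matching_pod(pods_json: str, pattern: str = r"source-1.*ldr$") -> Optional[str]:
    """Return the first pod name in `kubectl get pods -o json` output matching pattern."""
    regex = re.compile(pattern)
    for item in json.loads(pods_json).get("items", []):
        name = item.get("metadata", {}).get("name", "")
        if regex.search(name):
            return name
    return None


# Hand-written sample shaped like a PodList; the pod names are made up.
sample = json.dumps({"items": [
    {"metadata": {"name": "demo-source-0-ldr"}},
    {"metadata": {"name": "demo-source-1-wkr-0"}},
    {"metadata": {"name": "demo-source-1-ldr"}},
]})
print(first_matching_pod(sample))
```

In a real run the JSON would come from `kubectl get pods -n <NAMESPACE> -o json`, for example captured with `subprocess.run(..., capture_output=True, text=True)`. The helper returns `None` when nothing matches, so the caller can fail explicitly rather than continue with an empty pod name.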

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5494a23c-0ee5-41a6-8036-52a6f4dd73f2

📥 Commits

Reviewing files that changed from the base of the PR and between 7059e68 and e591f52.

📒 Files selected for processing (7)
  • examples/p2p_transfer_k8s/client/trtllm/README.md
  • examples/p2p_transfer_k8s/client/trtllm/hpa/README.md
  • examples/p2p_transfer_k8s/client/trtllm/hpa/kimi-agg-autoscale-dgd.yaml
  • examples/p2p_transfer_k8s/client/trtllm/hpa/kimi-agg-autoscale-hpa.yaml
  • examples/p2p_transfer_k8s/client/trtllm/kimi-disagg-mx-tp8-dgd.yaml
  • examples/p2p_transfer_k8s/client/trtllm/kimi-source-decode-dgd.yaml
  • examples/p2p_transfer_k8s/client/trtllm/mx-infra-decode.yaml

KavinKrishnan force-pushed the kavink/trtllm-deploy-examples branch from e591f52 to be558f9 on May 8, 2026 02:13

Adds runnable Kubernetes manifests and a step-by-step run book for the
ModelExpress + TRT-LLM P2P weight-loading integration on GCP GB200.
This is the deployment-side companion to TRT-LLM PR #13531
(MXCheckpointLoader, merged 2026-05-06) and Dynamo PR #8037
(--model-express-url CLI, in review).

What's added under examples/p2p_transfer_k8s/client/trtllm/:

* mx-infra-decode.yaml — ModelExpress server + Redis deployment
  with placeholders for namespace, registry, CPU node pool. Tunable
  reaper / GC timeouts (MX_HEARTBEAT_TIMEOUT_SECS=4500s,
  MX_GC_TIMEOUT_SECS=4500s) sized to cover 685B-class disk loads
  without permanently disabling cleanup.

* kimi-source-decode-dgd.yaml — Source DGD (Kimi K2.5 TP=8 across
  2 nodes). Loads from disk, then publishes via publish_model_params.
  Auto-detected as source by probing MX server for existing entries.

* kimi-disagg-mx-tp8-dgd.yaml — Target DGD (Frontend + Prefill TP=8 +
  Decode TP=8). Auto-detected as target. Receives weights via
  MxLiveWeightLoader / NIXL RDMA in ~2 seconds per rank instead of
  15-20 minutes from disk.

* hpa/kimi-agg-autoscale-dgd.yaml — Aggregated worker DGD where
  every replica uses the same spec; auto-detect handles source vs
  target. Designed for HPA-driven horizontal scaling.

* hpa/kimi-agg-autoscale-hpa.yaml — DGDSA + HorizontalPodAutoscaler
  wrapping the aggregated DGD. Exposes the scale subresource so HPA
  can drive replica count.

* hpa/README.md — End-to-end demo: first replica ~22 minutes from
  disk, subsequent HPA-driven replicas ~5 minutes via RDMA at
  361-583 Gbps/rank (validated on Kimi K2.5, GCP GB200, 4x 400G RoCE).

* README.md (overall guide) — Quick-start (5 steps), build
  instructions, image-stack guidance for both the post-PR-13531
  recommended path (just pip install modelexpress on top of
  tensorrtllm-runtime once it bumps to TRT-LLM 1.3.0rc15+) and a
  bridge path that points at the patch-based Dockerfiles still
  hosted on the kavink/trtllm_clean branch (PR #218) for users on
  older runtime images.

All yamls use placeholders (<NAMESPACE>, <REGISTRY>/<NAME>:<TAG>,
<GPU_NODE_POOL>, <CPU_NODE_POOL>, <MX_INFRA_*>) so users replace
cluster-specific values before applying.

This PR carries the deployment-example slice of PR #218
(kavink/trtllm_clean). The patch-shim slice retires once the next
tensorrtllm-runtime image cuts post-PR #13531; PR #218 will be
closed without merging at that point.

Validated end-to-end on GCP GB200 (dynamo-gcp-dev-02, kavin
namespace): 16 target ranks x 90.75 GB transferred at 363-506 Gbps,
end-to-end disaggregated serving (prefill + decode + frontend)
verified.

Companion PRs:
* TRT-LLM #13531: MXCheckpointLoader (merged 2026-05-06)
* ModelExpress #202: MxLiveWeightLoader, publish_model_params (merged)
* ModelExpress #267: MX_POOL_REG allocation-based registration (merged)
* Dynamo #8037: --model-express-url CLI integration (open)

Signed-off-by: Kavin Krishnan <kavink@nvidia.com>
KavinKrishnan force-pushed the kavink/trtllm-deploy-examples branch from be558f9 to a7c77b7 on May 21, 2026 18:25

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
