Add low-latency TensorRT-LLM example with http_server by zhouhelena1 · Pull Request #1589 · modal-labs/modal-examples

zhouhelena1 · 2026-06-15T19:44:38Z

Rewrote trtllm_latency.py with @modal.experimental.http_server for low-latency routing.

Type of Change

New example for the GitHub repo
- New example for the documentation site (Linked from a discoverable page, e.g. via the sidebar in /docs/examples)

Monitoring Checklist

Example is configured for testing in the synthetic monitoring system, or lambda-test: false is provided in the example frontmatter and I have gotten approval from a maintainer
- Example is tested by executing with modal run, or an alternative cmd is provided in the example frontmatter
- Example is tested by running the cmd with no arguments, or the args are provided in the example frontmatter
- Example does not require third-party dependencies besides fastapi to be installed locally

Documentation Site Checklist

Content

Example is documented with comments throughout, in a Literate Programming style
All media assets for the example that are rendered in the documentation site page are retrieved from modal-cdn.com

Build Stability

Example pins all dependencies in container images
- Example pins container images to a stable tag like v1, not a dynamic tag like latest
- Example specifies a python_version for the base image, if it is used
- Example pins all dependencies to at least SemVer minor version
  - Example dependencies with version < 1 are pinned to patch version

Link to Devin session: https://modal.devinenterprise.com/sessions/37b2b42588544b69bd089433e08fb9a0
Requested by: @zhouhelena1

New example combining TensorRT-LLM with Modal's experimental http_server for low-latency routing. Uses FP8 quantization and low-latency GEMM plugins via trtllm-serve, mirroring the patterns from sglang_low_latency.py and vllm_low_latency.py.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

devin-ai-integration · 2026-06-15T19:44:41Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment, CI, and merge conflict monitoring

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

charlesfrye · 2026-06-15T19:51:57Z

nice! @zhouhelena1 could we just overwrite the existing trtllm_latency instead of creating a new example?

trtllm-serve 0.18.0 does not support --extra_llm_api_options. Instead, build the optimized engine (FP8, low-latency GEMM plugins) using the Python API and cache it in the Volume, then start trtllm-serve pointing at the pre-built engine directory.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

The shared Volume had a cached engine built with lookahead speculative decoding by trtllm_latency.py. trtllm-serve cannot load that engine without matching runtime config. Use 'serve' subdirectory instead of 'fast' to isolate engine caches.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Ensures the cached engine is visible to other containers immediately, avoiding redundant rebuilds on concurrent starts.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
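
For readers following the build-then-cache flow described in the commit messages above, here is a minimal sketch of the "build once, commit, reuse" logic. It is not the code from this PR. The `build` and `commit` callables and the `config.json` marker file are hypothetical stand-ins: in the example they would presumably correspond to the TensorRT-LLM engine build step and to the Modal Volume's commit call, but those are injected here so the sketch runs without either library.

```python
from pathlib import Path
from typing import Callable


def ensure_engine(
    engine_dir: Path,
    build: Callable[[Path], None],
    commit: Callable[[], None],
    marker: str = "config.json",  # assumption: a file whose presence means "engine is built"
) -> bool:
    """Build the engine into engine_dir unless a cached copy already exists.

    Returns True if a build ran, False if the cached engine was reused.
    """
    if (engine_dir / marker).exists():
        return False  # cache hit: reuse the engine an earlier container built

    engine_dir.mkdir(parents=True, exist_ok=True)
    build(engine_dir)  # hypothetical: the engine build-and-save step
    commit()  # hypothetical: persist the write so other containers can see it
    return True
```

Calling `commit()` right after the build is the point of the commit above: without it, a container that starts concurrently may not see the new files and would rebuild the engine itself.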

devin-ai-integration

Devin Review found 2 new potential issues.

devin-ai-integration · 2026-06-15T20:39:32Z

+        cmd = [
+            "trtllm-serve",
+            model_path,
+            "--host",
+            "0.0.0.0",
+            "--port",
+            str(PORT),
+        ]
+
+        self.process = subprocess.Popen(cmd)
+        wait_ready(self.process)
+        warmup()
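
The diff above calls `wait_ready(self.process)` without showing its body. As an illustration only, a readiness wait of this kind might look like the sketch below. The signature differs from the one in the PR: the `is_up` probe, `timeout`, and `interval` parameters are assumptions added so the sketch is self-contained, and the real helper may check readiness differently.

```python
import time
from typing import Callable, Optional, Protocol


class HasPoll(Protocol):
    """Anything with a Popen-style poll(), e.g. subprocess.Popen."""

    def poll(self) -> Optional[int]: ...


def wait_ready(
    process: HasPoll,
    is_up: Callable[[], bool],  # hypothetical probe, e.g. an HTTP health check
    timeout: float = 600.0,
    interval: float = 1.0,
) -> None:
    """Block until is_up() returns True, the process exits, or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = process.poll()
        if code is not None:
            # The server died during startup; fail fast instead of waiting.
            raise RuntimeError(f"server exited early with code {code}")
        if is_up():
            return
        time.sleep(interval)
    raise TimeoutError(f"server not ready after {timeout} seconds")
```

Checking `process.poll()` on every iteration means a crashed server surfaces as an immediate startup error rather than a long timeout.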


🚩 trtllm-serve model name: no --served-model-name flag, relying on server default

The trtllm-serve command at lines 434-443 doesn't include a --served-model-name flag, unlike the vllm serve (vllm_low_latency.py:278-279) and sglang.launch_server (sglang_low_latency.py:399-400) examples which explicitly set the served model name. Meanwhile, the warmup function (line 325) and the test client (line 568) both send "model": MODEL_ID (NousResearch/Meta-Llama-3-8B-Instruct) in their payloads. If trtllm-serve defaults to a different model name (e.g. the engine path), these requests would get a 4xx error. Since warmup runs during @modal.enter(), this would surface immediately as a startup failure, so it's likely already validated. But worth confirming the trtllm-serve 0.20 default model name behavior.
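
One way to confirm the default name is to ask the running server which model ids it serves and compare them with `MODEL_ID`. The sketch below assumes that trtllm-serve exposes an OpenAI-style `/v1/models` endpoint returning `{"data": [{"id": ...}]}`; that endpoint shape is an assumption to verify against the installed version, and both function names are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "NousResearch/Meta-Llama-3-8B-Instruct"  # value quoted in the review comment


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_served_model_ids(base_url: str, timeout: float = 10.0) -> list[str]:
    """Query the server (assumed endpoint: /v1/models) and return its model ids."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

If `MODEL_ID` is missing from the returned list, the warmup and client payloads would need to use whatever id the server reports instead.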

The cached TRT-LLM 0.18.0 engine was incompatible with 0.20.0's TensorRT runtime. Including the version in the cache path (serve-0.20) forces a rebuild on upgrade. Also pins pynvml==12.0 per repo pinning rules.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
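
As a rough illustration of the version-keyed cache path described in the commit above, the sketch below derives a directory name such as `serve-0.20` from the installed TensorRT-LLM version. The function name and the `volume_root` parameter are hypothetical; only the `serve-0.20` naming comes from the commit message.

```python
from pathlib import Path


def engine_cache_dir(volume_root: Path, trtllm_version: str) -> Path:
    """Return a cache directory keyed by the TensorRT-LLM major.minor version."""
    major, minor = trtllm_version.split(".")[:2]
    # A new major.minor maps to a new directory, so an engine built by an
    # older runtime is never loaded and a rebuild happens automatically.
    return volume_root / f"serve-{major}.{minor}"
```

Keying on major.minor rather than the full version is a design choice in this sketch: it assumes patch releases keep engines compatible, which would need checking before relying on it.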

Keeps pynvml>=12 (user's original) so Modal can reuse the cached container image from the previous CI run. This avoids a redundant image rebuild and gives the full 14-minute CI timeout for the engine build at the new path.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

devin-ai-integration Bot assigned zhouhelena1 Jun 15, 2026

devin-ai-integration Bot reviewed Jun 15, 2026

zhouhelena1 and others added 2 commits June 15, 2026 19:58

Add volume.commit() after engine build

31d922a

devin-ai-integration Bot reviewed Jun 15, 2026

replace trtllm_latency

be07fae

zhouhelena1 force-pushed the devin/1781552641-trt-low-latency branch from c451e34 to be07fae Compare June 15, 2026 21:14

zhouhelena1 and others added 2 commits June 15, 2026 21:38

zhouhelena1 requested a review from charlesfrye June 15, 2026 22:28

zhouhelena1 wants to merge 7 commits into main from devin/1781552641-trt-low-latency

2 participants
