Add low-latency TensorRT-LLM example with http_server #1589

Open
zhouhelena1 wants to merge 7 commits into main from devin/1781552641-trt-low-latency

@zhouhelena1 zhouhelena1 commented Jun 15, 2026

Contributor

Rewrote trtllm_latency.py with @modal.experimental.http_server for low-latency routing.

Type of Change

  • New example for the GitHub repo
    • New example for the documentation site (Linked from a discoverable page, e.g. via the sidebar in /docs/examples)

Monitoring Checklist

  • Example is configured for testing in the synthetic monitoring system, or lambda-test: false is provided in the example frontmatter and I have gotten approval from a maintainer
    • Example is tested by executing with modal run, or an alternative cmd is provided in the example frontmatter
    • Example is tested by running the cmd with no arguments, or the args are provided in the example frontmatter
    • Example does not require third-party dependencies besides fastapi to be installed locally

Documentation Site Checklist

Content

  • Example is documented with comments throughout, in a Literate Programming style
  • All media assets for the example that are rendered in the documentation site page are retrieved from modal-cdn.com

Build Stability

  • Example pins all dependencies in container images
    • Example pins container images to a stable tag like v1, not a dynamic tag like latest
    • Example specifies a python_version for the base image, if it is used
    • Example pins all dependencies to at least SemVer minor version
      • Example dependencies with version < 1 are pinned to patch version

Link to Devin session: https://modal.devinenterprise.com/sessions/37b2b42588544b69bd089433e08fb9a0
Requested by: @zhouhelena1


New example combining TensorRT-LLM with Modal's experimental
http_server for low-latency routing. Uses FP8 quantization and
low-latency GEMM plugins via trtllm-serve, mirroring the patterns
from sglang_low_latency.py and vllm_low_latency.py.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@devin-ai-integration

Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

@charlesfrye

Collaborator

nice! @zhouhelena1 could we just overwrite the existing trtllm_latency instead of creating a new example?

zhouhelena1 and others added 2 commits June 15, 2026 19:58
trtllm-serve 0.18.0 does not support --extra_llm_api_options.
Instead, build the optimized engine (FP8, low-latency GEMM plugins)
using the Python API and cache it in the Volume, then start
trtllm-serve pointing at the pre-built engine directory.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
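
The build-then-serve flow described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `ensure_engine`, the `.complete` marker file, and `fake_build` are hypothetical names, and the real engine build (FP8, low-latency GEMM plugins via the TensorRT-LLM Python API) is replaced by a stand-in so the sketch runs anywhere. Only the shape of the `trtllm-serve` command is taken from the diff under review.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_engine(engine_dir: Path, build: Callable[[Path], None]) -> bool:
    """Build an engine into engine_dir unless a finished build is already cached.

    Returns True if a build ran, False if the cached engine was reused.
    The ".complete" marker is a hypothetical convention, used here so that a
    partially written directory is not mistaken for a finished build.
    """
    marker = engine_dir / ".complete"
    if marker.exists():
        return False
    engine_dir.mkdir(parents=True, exist_ok=True)
    build(engine_dir)
    marker.touch()
    return True


def fake_build(out_dir: Path) -> None:
    """Stand-in for the real engine build; writes a placeholder file only."""
    (out_dir / "rank0.engine").write_bytes(b"placeholder")


def serve_command(engine_dir: Path, port: int) -> list[str]:
    """Assemble a trtllm-serve command pointing at the pre-built engine directory."""
    return ["trtllm-serve", str(engine_dir), "--host", "0.0.0.0", "--port", str(port)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        engine_dir = Path(root) / "serve"
        print(ensure_engine(engine_dir, fake_build))  # first call builds
        print(ensure_engine(engine_dir, fake_build))  # second call reuses the cache
        print(serve_command(engine_dir, 8000))
```

Writing the marker only after the build finishes is one possible way to keep an interrupted build from being served; the PR may handle this differently.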
The shared Volume had a cached engine built with lookahead
speculative decoding by trtllm_latency.py. trtllm-serve cannot
load that engine without matching runtime config. Use 'serve'
subdirectory instead of 'fast' to isolate engine caches.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
devin-ai-integration[bot]

This comment was marked as resolved.

Ensures the cached engine is visible to other containers
immediately, avoiding redundant rebuilds on concurrent starts.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
devin-ai-integration[bot]

This comment was marked as resolved.

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 2 new potential issues.

Comment on lines +183 to +194
cmd = [
    "trtllm-serve",
    model_path,
    "--host",
    "0.0.0.0",
    "--port",
    str(PORT),
]

self.process = subprocess.Popen(cmd)
wait_ready(self.process)
warmup()

@devin-ai-integration devin-ai-integration Bot Jun 15, 2026

Contributor

🚩 trtllm-serve model name: no --served-model-name flag, relying on server default

The trtllm-serve command at lines 434-443 doesn't include a --served-model-name flag, unlike the vllm serve (vllm_low_latency.py:278-279) and sglang.launch_server (sglang_low_latency.py:399-400) examples, which set the served model name explicitly. Meanwhile, the warmup function (line 325) and the test client (line 568) both send "model": MODEL_ID (NousResearch/Meta-Llama-3-8B-Instruct) in their payloads. If trtllm-serve defaults to a different model name (e.g. the engine path), these requests would get a 4xx error. Since warmup runs during @modal.enter(), such a mismatch would surface immediately as a startup failure, so the current behavior is likely already validated. It is still worth confirming the default model name behavior of trtllm-serve 0.20.
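
One way to confirm the served model name, sketched under assumptions: if the server exposes an OpenAI-compatible model listing (assumed here to be `GET /v1/models`, returning a `data` list of entries with an `id` field), the listing can be compared against MODEL_ID before any warmup request is sent. The helper names and the sample payloads below are hypothetical and do not come from the PR or from captured server output.

```python
MODEL_ID = "NousResearch/Meta-Llama-3-8B-Instruct"


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style model listing payload."""
    return [entry["id"] for entry in models_response.get("data", [])]


def check_model_name(models_response: dict, expected: str) -> None:
    """Raise ValueError if the expected model name is not among the served ids."""
    ids = served_model_ids(models_response)
    if expected not in ids:
        raise ValueError(
            f"server lists {ids!r}, but requests will send model={expected!r}"
        )


if __name__ == "__main__":
    # Hypothetical payloads, for illustration only.
    matching = {"object": "list", "data": [{"id": MODEL_ID, "object": "model"}]}
    mismatched = {"object": "list", "data": [{"id": "/engines/serve", "object": "model"}]}

    check_model_name(matching, MODEL_ID)  # passes silently
    try:
        check_model_name(mismatched, MODEL_ID)
    except ValueError as err:
        print(f"mismatch detected: {err}")
```

Running a check like this before warmup() would turn a silent naming mismatch into an explicit error message rather than a generic 4xx response.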

Comment thread 06_gpu_and_ml/llm-serving/trt_low_latency.py Outdated
@zhouhelena1 zhouhelena1 force-pushed the devin/1781552641-trt-low-latency branch from c451e34 to be07fae Compare June 15, 2026 21:14
devin-ai-integration[bot]

This comment was marked as resolved.

zhouhelena1 and others added 2 commits June 15, 2026 21:38
The cached TRT-LLM 0.18.0 engine was incompatible with 0.20.0's
TensorRT runtime. Including the version in the cache path
(serve-0.20) forces a rebuild on upgrade.

Also pins pynvml==12.0 per repo pinning rules.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
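
The versioned cache path from the commit above can be illustrated with a small helper. This is a sketch under assumptions: `engine_cache_dir`, the `/engines` root, and the directory layout are hypothetical; only the `serve` variant name and the `serve-0.20` suffix come from the commit messages in this thread.

```python
from pathlib import PurePosixPath


def engine_cache_dir(
    root: str, model_id: str, trtllm_version: str, variant: str = "serve"
) -> PurePosixPath:
    """Return a cache directory keyed by model, engine variant, and major.minor version.

    Putting the version in the path means an upgrade resolves to a fresh
    directory and triggers a rebuild instead of loading an incompatible engine.
    """
    major_minor = ".".join(trtllm_version.split(".")[:2])
    return PurePosixPath(root) / model_id / f"{variant}-{major_minor}"


if __name__ == "__main__":
    model = "NousResearch/Meta-Llama-3-8B-Instruct"
    print(engine_cache_dir("/engines", model, "0.18.0"))
    print(engine_cache_dir("/engines", model, "0.20.0"))
```

Keying on major.minor rather than the full version is a design choice in this sketch; it assumes patch releases keep engines loadable, which would need to be confirmed for TensorRT-LLM.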
Keeps pynvml>=12 (user's original) so Modal can reuse the
cached container image from the previous CI run. This avoids
a redundant image rebuild and gives the full 14-minute CI
timeout for the engine build at the new path.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@zhouhelena1 zhouhelena1 requested a review from charlesfrye June 15, 2026 22:28