Add low-latency TensorRT-LLM example with http_server #1589
New example combining TensorRT-LLM with Modal's experimental http_server for low-latency routing. Uses FP8 quantization and low-latency GEMM plugins via trtllm-serve, mirroring the patterns from sglang_low_latency.py and vllm_low_latency.py.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
nice! @zhouhelena1 could we just overwrite the existing |
trtllm-serve 0.18.0 does not support --extra_llm_api_options. Instead, build the optimized engine (FP8, low-latency GEMM plugins) using the Python API and cache it in the Volume, then start trtllm-serve pointing at the pre-built engine directory.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The shared Volume had a cached engine built with lookahead speculative decoding by trtllm_latency.py. trtllm-serve cannot load that engine without matching runtime config. Use 'serve' subdirectory instead of 'fast' to isolate engine caches.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Ensures the cached engine is visible to other containers immediately, avoiding redundant rebuilds on concurrent starts.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
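The build-then-cache flow described in these commits can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `ensure_engine`, the `DONE_MARKER` file, and the injected `build` and `commit` callables are all hypothetical names. In the real example, `build` would presumably be the FP8 engine build through the TensorRT-LLM Python API and `commit` would be the Volume commit.

```python
from pathlib import Path
from typing import Callable

# Hypothetical marker file; the PR does not say how a finished engine is detected.
DONE_MARKER = ".build_complete"


def ensure_engine(
    engine_dir: Path,
    build: Callable[[Path], None],
    commit: Callable[[], None],
) -> bool:
    """Build the engine into engine_dir unless a finished build is already cached.

    Returns True if a build ran, False if the cached engine was reused.
    """
    marker = engine_dir / DONE_MARKER
    if marker.exists():
        return False  # cache hit: an earlier container already built the engine

    engine_dir.mkdir(parents=True, exist_ok=True)
    build(engine_dir)  # stand-in for the engine build step
    marker.touch()     # written last, so a partial build is not mistaken for a hit
    commit()           # stand-in for making the files visible to other containers
    return True
```

Writing the marker only after the build finishes, and committing only after the marker exists, is one way to keep a half-written engine directory from being treated as a valid cache entry by a container that starts concurrently.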
    cmd = [
        "trtllm-serve",
        model_path,
        "--host",
        "0.0.0.0",
        "--port",
        str(PORT),
    ]

    self.process = subprocess.Popen(cmd)
    wait_ready(self.process)
    warmup()
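The diff calls `wait_ready` without showing its body. A readiness wait of this kind might look like the sketch below, which is an assumption rather than the helper from the PR: the `port`, `host`, `timeout`, and `interval` parameters are hypothetical, and it only checks that the port accepts TCP connections. A real check would likely issue an HTTP request to the server instead.

```python
import socket
import subprocess
import time


def wait_ready(
    process: subprocess.Popen,
    port: int,
    host: str = "127.0.0.1",
    timeout: float = 600.0,
    interval: float = 0.5,
) -> None:
    """Block until host:port accepts TCP connections, or fail fast."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # If the server process has already exited, waiting longer cannot help.
        code = process.poll()
        if code is not None:
            raise RuntimeError(f"server exited early with code {code}")
        try:
            with socket.create_connection((host, port), timeout=interval):
                return  # something is listening; treat the server as up
        except OSError:
            time.sleep(interval)
    raise TimeoutError(f"server not ready on {host}:{port} after {timeout}s")
```

Polling `process.poll()` on every iteration means a crash during engine loading surfaces as an immediate error instead of a full timeout.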
🚩 trtllm-serve model name: no --served-model-name flag, relying on server default
The trtllm-serve command at lines 434-443 doesn't include a --served-model-name flag, unlike the vllm serve (vllm_low_latency.py:278-279) and sglang.launch_server (sglang_low_latency.py:399-400) examples which explicitly set the served model name. Meanwhile, the warmup function (line 325) and the test client (line 568) both send "model": MODEL_ID (NousResearch/Meta-Llama-3-8B-Instruct) in their payloads. If trtllm-serve defaults to a different model name (e.g. the engine path), these requests would get a 4xx error. Since warmup runs during @modal.enter(), this would surface immediately as a startup failure, so it's likely already validated. But worth confirming the trtllm-serve 0.20 default model name behavior.
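One way to confirm the default model name, sketched under the assumption that the server exposes an OpenAI-style `GET /v1/models` route, is to list the served ids before sending warmup traffic. The function names here are hypothetical and not part of the example.

```python
import json
import urllib.request


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style model-list response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_served_model_ids(base_url: str, timeout: float = 10.0) -> list[str]:
    """Query a running server; assumes it serves an OpenAI-style /v1/models route."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return served_model_ids(json.load(resp))


def check_model_name(expected: str, served: list[str]) -> None:
    """Raise if the name used in request payloads is not one the server reports."""
    if expected not in served:
        raise ValueError(f"model {expected!r} not served; server reports {served}")
```

Calling `check_model_name(MODEL_ID, fetch_served_model_ids(...))` ahead of warmup would turn a name mismatch into an explicit error message rather than a 4xx from the first completion request.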
c451e34 to be07fae
The cached TRT-LLM 0.18.0 engine was incompatible with 0.20.0's TensorRT runtime. Including the version in the cache path (serve-0.20) forces a rebuild on upgrade. Also pins pynvml==12.0 per repo pinning rules.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
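A version-keyed cache path along these lines could be built as in the sketch below. Only the `serve-0.20` directory name comes from the commit message; the function name, its parameters, and the rest of the directory layout are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def engine_cache_dir(
    volume_root: str,
    model_id: str,
    trtllm_version: str,
    variant: str = "serve",
) -> PurePosixPath:
    """Return a cache directory keyed by variant and TensorRT-LLM major.minor."""
    # Keep only major.minor, so "0.20.0" maps to the "serve-0.20" directory.
    major_minor = ".".join(trtllm_version.split(".")[:2])
    return PurePosixPath(volume_root) / model_id / f"{variant}-{major_minor}"
```

Because the version is part of the path, an upgrade never finds the old engine directory and falls through to a fresh build, while engines for other variants or versions stay untouched in the same Volume.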
Keeps pynvml>=12 (user's original) so Modal can reuse the cached container image from the previous CI run. This avoids a redundant image rebuild and gives the full 14-minute CI timeout for the engine build at the new path.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Rewrote trtllm_latency.py with @modal.experimental.http_server for low-latency routing.

Type of Change
/docs/examples)

Monitoring Checklist
lambda-test: false is provided in the example frontmatter and I have gotten approval from a maintainer
modal run, or an alternative cmd is provided in the example frontmatter
cmd with no arguments, or the args are provided in the example frontmatter
fastapi to be installed locally

Documentation Site Checklist
Content
modal-cdn.com

Build Stability
v1, not a dynamic tag like latest
python_version for the base image, if it is used
version < 1 are pinned to patch version

Link to Devin session: https://modal.devinenterprise.com/sessions/37b2b42588544b69bd089433e08fb9a0
Requested by: @zhouhelena1