Create app._experimental_server() version of LLM Inference Examples #1580
molocule wants to merge 38 commits into
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
"""Start SGLang server process and wait for it to be ready"""
self.proc = _start_server()
wait_for_server_ready()
In the lift and shift world, I think we can use a readiness probe to specify this:
@app._experimental_server(
readiness_probe=modal.Probe.with_http("/healthz", SGLANG_PORT)
)
We do not have with_http yet, but it's basically like Kubernetes's readinessProbe + httpGet.
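The Kubernetes-style httpGet readiness check referred to here amounts to polling an HTTP path until it answers with a success status. The following is a minimal sketch in plain Python and says nothing about how Modal would implement it: `wait_for_http_ready`, its parameters, and the local demo server are hypothetical names for illustration only. The 200-399 success range mirrors what Kubernetes treats as a passing httpGet probe.

```python
import http.server
import threading
import time
import urllib.error
import urllib.request


def wait_for_http_ready(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `url` until it answers with a 2xx/3xx status or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if 200 <= resp.status < 400:
                    return True
        except (urllib.error.URLError, OSError):
            # Not accepting connections yet, or an error status was returned.
            pass
        time.sleep(interval)
    return False


class _HealthHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for a server that exposes /healthz."""

    def do_GET(self):
        self.send_response(200 if self.path == "/healthz" else 404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def demo() -> bool:
    # Bind to port 0 so the OS picks a free port for the demo server.
    server = http.server.HTTPServer(("127.0.0.1", 0), _HealthHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return wait_for_http_ready(f"http://127.0.0.1:{port}/healthz", timeout=5.0)
    finally:
        server.shutdown()
        server.server_close()
```

A declarative probe would move this loop out of the example code and into the platform, which is the point of the suggestion.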
@modal.exit()
def stop(self):
    """Terminate the SGLang server process"""
    self.proc.terminate()
    self.proc.wait()
In the lift and shift world:
app._experimental_server(
name="Server",
cmd=["sglang", ...], # When you pass `cmd` you can no longer decorate a class
readiness_probe=modal.Probe.with_http("/healthz", SGLANG_PORT),
)

# allow generous time for all replicas to spin up based on rough heuristic;
# remove this sleep and increase CONTAINERS
# to observe session routing changes during autoscaling
await asyncio.sleep(5 + ((CONTAINERS - 10) // 2))
🚩 server_sticky.py sleep heuristic gives only 1 second wait with default CONTAINERS=2
At line 134, asyncio.sleep(5 + ((CONTAINERS - 10) // 2)) evaluates to asyncio.sleep(1) when CONTAINERS=2. The comment says "allow generous time for all replicas to spin up" but 1 second may not be enough for containers to become ready. The formula only gives meaningful positive delays when CONTAINERS > 10. This could cause flaky test results if the second container isn't ready after 1 second, though the test would just see routing to a single container (possibly causing false assertion failures for the sticky test).
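The arithmetic can be checked directly. The sketch below evaluates the expression from the diff for a few container counts. Python's floor division rounds toward negative infinity, which is why small counts shrink the delay. `clamped_delay` and its 5-second floor are hypothetical and not part of the PR; they only illustrate one way to keep the wait from dropping below a minimum.

```python
def original_delay(containers: int) -> int:
    """The expression from the diff: 5 + ((CONTAINERS - 10) // 2)."""
    return 5 + ((containers - 10) // 2)


def clamped_delay(containers: int, floor_seconds: int = 5) -> int:
    """Hypothetical variant: never wait less than `floor_seconds`."""
    return max(floor_seconds, original_delay(containers))


for n in (1, 2, 10, 20):
    print(n, original_delay(n), clamped_delay(n))
# prints:
# 1 0 5
# 2 1 5
# 10 5 5
# 20 10 10
```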
  - name: Install the modal client
    shell: bash
-   run: uv pip install --system modal
+   run: uv pip install --system --prerelease allow modal
🚩 CI now installs pre-release modal versions
The setup action changes from uv pip install --system modal to uv pip install --system --prerelease allow modal, which means CI will pick up pre-release versions of the modal package. This is presumably intentional since the PR uses new API methods like app._experimental_server and Server.get_url() that may only exist in pre-release builds. Worth verifying this is temporary (for testing the new API) or intended as permanent.
yeah, let's remove this once the release goes out and before merging
🚩 Incomplete migration: sglang_snapshot.py and http_server.py still use old API
Files 06_gpu_and_ml/llm-serving/sglang_snapshot.py and 07_web/http_server.py still use the old import modal.experimental + @modal.experimental.http_server + @modal.concurrent pattern. These were not touched by this PR. If the old API is being deprecated, these will need updates in a follow-up. The http_server_sticky.py appears to be the predecessor of server_sticky.py (both exist simultaneously).
import aiohttp
import modal
import modal.experimental
from modal.server import Server
📝 Info: AGENTS.md: from modal.server import Server is acceptable for submodule imports
Three files (lfm_snapshot.py, sglang_kitchen_sink.py, vllm_low_latency.py) use from modal.server import Server. AGENTS.md says to prefer modal.X over direct imports, but Server is not available on the top-level modal module — it's only accessible via modal.server.Server. Since the rule was written for items like Image, Volume, etc. that are available as modal.Image, modal.Volume, this import pattern is the practical way to access the Server class and doesn't violate the spirit of the rule.
Type of Change
/docs/examples)

Monitoring Checklist
lambda-test: false is provided in the example frontmatter and I have gotten approval from a maintainer
modal run, or an alternative cmd is provided in the example frontmatter (e.g. cmd: ["modal", "serve"])
cmd with no arguments, or the args are provided in the example frontmatter (e.g. args: ["--prompt", "Formula for room temperature superconductor:"]
fastapi to be installed locally (e.g. does not import requests or torch in the global scope or other code executed locally)

Documentation Site Checklist
Content
modal-cdn.com

Build Stability
v1, not a dynamic tag like latest
python_version for the base image, if it is used
~=x.y.z or ==x.y, or we expect this example to work across major versions of the dependency and are committed to maintenance across those versions
version < 1 are pinned to patch version, ==0.y.z

Outside Contributors
You're great! Thanks for your contribution.