Skip to content

Create app._experimental_server() version of LLM Inference Examples#1580

Open
molocule wants to merge 38 commits into
mainfrom
create-server-version-of-examples
Open

Create app._experimental_server() version of LLM Inference Examples#1580
molocule wants to merge 38 commits into
mainfrom
create-server-version-of-examples

Conversation

@molocule

@molocule molocule commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Type of Change

  • New example for the GitHub repo
    • New example for the documentation site (Linked from a discoverable page, e.g. via the sidebar in /docs/examples)
  • Example updates (Bug fixes, new features, etc.)
  • Other (Changes to the codebase, but not to examples)

Monitoring Checklist

  • Example is configured for testing in the synthetic monitoring system, or lambda-test: false is provided in the example frontmatter and I have gotten approval from a maintainer
    • Example is tested by executing with modal run, or an alternative cmd is provided in the example frontmatter (e.g. cmd: ["modal", "serve"])
    • Example is tested by running the cmd with no arguments, or the args are provided in the example frontmatter (e.g. args: ["--prompt", "Formula for room temperature superconductor:"]
    • Example does not require third-party dependencies besides fastapi to be installed locally (e.g. does not import requests or torch in the global scope or other code executed locally)

Documentation Site Checklist

Content

  • Example is documented with comments throughout, in a Literate Programming style
  • All media assets for the example that are rendered in the documentation site page are retrieved from modal-cdn.com

Build Stability

  • Example pins all dependencies in container images
    • Example pins container images to a stable tag like v1, not a dynamic tag like latest
    • Example specifies a python_version for the base image, if it is used
    • Example pins all dependencies to at least SemVer minor version, ~=x.y.z or ==x.y, or we expect this example to work across major versions of the dependency and are committed to maintenance across those versions
      • Example dependencies with version < 1 are pinned to patch version, ==0.y.z

Outside Contributors

You're great! Thanks for your contribution.


Open in Devin Review

devin-ai-integration[bot]

This comment was marked as resolved.

@molocule molocule marked this pull request as draft June 2, 2026 00:48
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Comment on lines +212 to +214
"""Start SGLang server process and wait for it to be ready"""
self.proc = _start_server()
wait_for_server_ready()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the lift and shift world, I think we can use a readiness probe to specify this:

@app._experimental_server(
     readiness_probe=modal.Probe.with_http("/healthz", SGLANG_PORT)
)

We do not have with_http yet, but it's basically like kubernetes's readinessProbe + httpGet.

Comment on lines +216 to +220
@modal.exit()
def stop(self):
"""Terminate the SGLang server process"""
self.proc.terminate()
self.proc.wait()

@thomasjpfan thomasjpfan Jun 5, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the lift and shift world:

app._experimental_server(
      name="Server",
     cmd=["sglang", ...],  # When you pass `cmd` you can no longer decorate a class
     readiness_probe=modal.Probe.with_http("/healthz", SGLANG_PORT),
)

@molocule molocule marked this pull request as ready for review June 12, 2026 16:39
devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

Open in Devin Review

Comment thread 06_gpu_and_ml/llm-serving/lfm_snapshot.py Outdated
Comment thread 07_web/server_sticky.py

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

Open in Devin Review

Comment thread 07_web/server_sticky.py
# allow generous time for all replicas to spin up based on rough heuristic;
# remove this sleep and increase CONTAINERS
# to observe session routing changes during autoscaling
await asyncio.sleep(5 + ((CONTAINERS - 10) // 2))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 server_sticky.py sleep heuristic gives only 1 second wait with default CONTAINERS=2

At line 134, asyncio.sleep(5 + ((CONTAINERS - 10) // 2)) evaluates to asyncio.sleep(1) when CONTAINERS=2. The comment says "allow generous time for all replicas to spin up" but 1 second may not be enough for containers to become ready. The formula only gives meaningful positive delays when CONTAINERS > 10. This could cause flaky test results if the second container isn't ready after 1 second, though the test would just see routing to a single container (possibly causing false assertion failures for the sticky test).

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 new potential issues.

Open in Devin Review

Comment thread 06_gpu_and_ml/llm-serving/nemotron_inference.py
Comment thread 06_gpu_and_ml/llm-serving/sglang_low_latency.py
- name: Install the modal client
shell: bash
run: uv pip install --system modal
run: uv pip install --system --prerelease allow modal

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 CI now installs pre-release modal versions

The setup action changes from uv pip install --system modal to uv pip install --system --prerelease allow modal, which means CI will pick up pre-release versions of the modal package. This is presumably intentional since the PR uses new API methods like app._experimental_server and Server.get_url() that may only exist in pre-release builds. Worth verifying this is temporary (for testing the new API) or intended as permanent.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, let's remove this once the release goes out and before merging

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 4 new potential issues.

Open in Devin Review

Comment thread 07_web/server_sticky.py

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Incomplete migration: sglang_snapshot.py and http_server.py still use old API

Files 06_gpu_and_ml/llm-serving/sglang_snapshot.py and 07_web/http_server.py still use the old import modal.experimental + @modal.experimental.http_server + @modal.concurrent pattern. These were not touched by this PR. If the old API is being deprecated, these will need updates in a follow-up. The http_server_sticky.py appears to be the predecessor of server_sticky.py (both exist simultaneously).

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment thread 06_gpu_and_ml/llm-serving/lfm_snapshot.py
Comment thread 06_gpu_and_ml/llm-serving/sglang_kitchen_sink.py
Comment thread 06_gpu_and_ml/llm-serving/vllm_low_latency.py

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

Open in Devin Review

import aiohttp
import modal
import modal.experimental
from modal.server import Server

@devin-ai-integration devin-ai-integration Bot Jun 12, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Info: AGENTS.md: from modal.server import Server is acceptable for submodule imports

Three files (lfm_snapshot.py, sglang_kitchen_sink.py, vllm_low_latency.py) use from modal.server import Server. AGENTS.md says to prefer modal.X over direct imports, but Server is not available on the top-level modal module — it's only accessible via modal.server.Server. Since the rule was written for items like Image, Volume, etc. that are available as modal.Image, modal.Volume, this import pattern is the practical way to access the Server class and doesn't violate the spirit of the rule.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment thread 07_web/server.py

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 4 new potential issues.

Open in Devin Review

Comment thread 06_gpu_and_ml/llm-serving/lfm_snapshot.py Outdated
Comment thread 06_gpu_and_ml/llm-serving/lfm_snapshot.py
Comment thread 06_gpu_and_ml/llm-serving/vllm_low_latency.py
Comment thread 06_gpu_and_ml/llm-serving/sglang_kitchen_sink.py

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

Open in Devin Review

Comment thread 06_gpu_and_ml/llm-serving/sglang_low_latency.py
@charlesfrye

Copy link
Copy Markdown
Collaborator

We need .aio on the calls to get_url, eg here, in order to avoid warnings when the examples are run, eg here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants