RFC: SGLang Hardware Plugin System
Motivation
SGLang supports CUDA, ROCm, Intel XPU, Ascend NPU, Habana HPU, and CPU. As more
hardware vendors seek integration with SGLang (Moore Threads, …),
the current architecture has the following problems:
Hardware logic is scattered across many files. There is no stable integration contract. Taking NPU as an example, the only way in is to read the source, find every if _is_npu: guard, and add a new branch alongside it.
Maintaining a backend without dedicated CI is unsustainable. Backends that lack full CI coverage and continuous contribution slowly rot.
New features stall. Every generic optimization (chunked prefill, speculative decoding, disaggregation) must be propagated to every backend separately or it is silently broken.
Proposed Change
SGL-Diffusion already solved this problem with a Platform class and a current_platform singleton. SRT has not. This RFC proposes:
A unified Platform base class shared by both runtimes.
A plugin mechanism (Python entry points) so vendors can ship support as a standalone pip install package.
The directory layout after migration from sgl-diffusion to srt:
1. Unified Platform Class
A single Platform base class covers both runtimes. One hardware = one subclass.
Methods irrelevant to a runtime return a safe default and are never called.
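The "safe default" idea can be sketched as follows. This is a minimal illustration, not the actual SGLang implementation: the method and enum names follow the class hierarchy in this RFC, while the return types, the default values, and the import path in NpuPlatform are assumptions.

```python
import enum
from typing import Dict, Optional


class PlatformEnum(enum.Enum):
    CUDA = enum.auto()
    ROCM = enum.auto()
    NPU = enum.auto()
    XPU = enum.auto()
    HPU = enum.auto()
    CPU = enum.auto()
    OOT = enum.auto()  # out-of-tree vendor plugin


class Platform:
    """Sketch of a base class shared by both runtimes (assumed shape)."""

    _enum: PlatformEnum = PlatformEnum.CPU
    device_name: str = "cpu"

    def is_cuda(self) -> bool:
        return self._enum is PlatformEnum.CUDA

    def is_npu(self) -> bool:
        return self._enum is PlatformEnum.NPU

    def is_out_of_tree(self) -> bool:
        return self._enum is PlatformEnum.OOT

    # Safe defaults: a runtime that has no use for a hook never calls it,
    # and a subclass that has nothing to add never overrides it.
    def get_attention_backends(self) -> Dict[str, str]:
        return {}

    def get_graph_runner_class(self) -> Optional[type]:
        return None

    def apply_server_args_defaults(self, args) -> None:
        return None


class NpuPlatform(Platform):
    """One hardware = one subclass; only the relevant hooks are overridden."""

    _enum = PlatformEnum.NPU
    device_name = "npu"

    def get_attention_backends(self) -> Dict[str, str]:
        # Backend name -> import path. The path is a placeholder.
        return {"ascend": "example.attention.AscendAttnBackend"}
```

With this shape, NpuPlatform().get_graph_runner_class() falls back to the base default of None, while get_attention_backends() returns the NPU-specific mapping.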
2. Plugin Registration: Entry Points
Vendors ship a Python package and declare their Platform subclass as a Python entry point in the package metadata (for example, the [project.entry-points] table of pyproject.toml):
After pip install sglang-musa, SGLang discovers the plugin automatically on the next startup; no upstream changes are required.
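On the SGLang side, discovery could be built on the standard library's importlib.metadata. The sketch below is an assumption about how such a loader might look: the group name sglang.platform_plugins comes from the Plan section, but the function names and the selection policy (an explicit name wins, a single installed plugin auto-activates) are hypothetical.

```python
from importlib.metadata import EntryPoint, entry_points
from typing import Dict, Optional

PLUGIN_GROUP = "sglang.platform_plugins"  # group name taken from the Plan section


def discover_platform_plugins(group: str = PLUGIN_GROUP) -> Dict[str, EntryPoint]:
    """Return the installed entry points in the given group, keyed by name."""
    eps = entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        selected = eps.select(group=group)
    else:  # Python 3.8/3.9 return a plain dict of lists
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


def pick_plugin(
    plugins: Dict[str, EntryPoint], preferred: Optional[str] = None
) -> Optional[EntryPoint]:
    """Assumed policy: explicit name wins, a single plugin auto-activates."""
    if preferred is not None:
        return plugins[preferred]
    if len(plugins) == 1:
        return next(iter(plugins.values()))
    if not plugins:
        return None  # fall back to built-in platform detection
    raise RuntimeError(f"Multiple platform plugins installed: {sorted(plugins)}")


def load_platform_class(ep: EntryPoint):
    """Import the module named by the entry point and return the target object."""
    return ep.load()
```

Nothing is imported from a vendor package until load_platform_class is called, so merely having a plugin installed costs little at startup.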
For development without packaging, set:
3. Class Hierarchy
classDiagram
    class Platform {
        <<abstract>>
        +device_name: str
        +_enum: PlatformEnum
        +is_cuda() bool
        +is_rocm() bool
        +is_npu() bool
        +is_out_of_tree() bool
        +initialize_device(rank, local_rank)
        +get_attention_backends() Dict
        +get_graph_runner_class() Type
        +create_kv_pool(runner) Pool
        +apply_server_args_defaults(args)
        +get_attn_backend_cls_str(...) str
        +get_device_communicator_cls() str
    }
    class PlatformEnum {
        <<enumeration>>
        CUDA
        ROCM
        NPU
        XPU
        HPU
        CPU
        OOT
    }
    class CudaPlatform {
        +get_attention_backends() flashinfer/fa3/triton/...
        +get_graph_runner_class() CudaGraphRunner
        +create_kv_pool() MHATokenToKVPool
        +get_attn_backend_cls_str() FlashAttn/SDPA
    }
    class RocmPlatform {
        +get_attention_backends() aiter/wave/flashinfer
        +get_graph_runner_class() CudaGraphRunner
        +get_attn_backend_cls_str() AITER/SDPA
    }
    class NpuPlatform {
        +initialize_device() init_npu_backend()
        +apply_server_args_defaults()
        +get_attention_backends() ascend
        +get_graph_runner_class() NPUGraphRunner
        +create_kv_pool() NPUMHATokenToKVPool
    }
    class XpuPlatform
    class HpuPlatform
    class CpuPlatform
    class OotPlatform {
        <<vendor package>>
        _enum = OOT
        implements only needed methods
    }
    Platform --> PlatformEnum
    Platform <|-- CudaPlatform
    Platform <|-- RocmPlatform
    Platform <|-- NpuPlatform
    Platform <|-- XpuPlatform
    Platform <|-- HpuPlatform
    Platform <|-- CpuPlatform
    Platform <|-- OotPlatform
Writing a Platform Plugin
A vendor ships one Python package that works for both SRT and SGL-Diffusion.
Package layout
Minimal example (Moore Threads MUSA)
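As an illustration only, a vendor subclass might look roughly like the sketch below. The stand-in Platform base, the backend name musa_sdpa, and the sglang_musa.attention import path are assumptions made for this example and are not defined by the RFC.

```python
from typing import Dict


class Platform:
    """Stand-in for the proposed shared base class (assumed interface)."""

    device_name = "cpu"

    def is_out_of_tree(self) -> bool:
        return False

    def initialize_device(self, rank: int, local_rank: int) -> str:
        return self.device_name

    def get_attention_backends(self) -> Dict[str, str]:
        return {}


class MusaPlatform(Platform):
    """Hypothetical vendor subclass; overrides only what MUSA would need."""

    device_name = "musa"

    def is_out_of_tree(self) -> bool:
        return True

    def initialize_device(self, rank: int, local_rank: int) -> str:
        # A real plugin would bind the process to its device here.
        return f"{self.device_name}:{local_rank}"

    def get_attention_backends(self) -> Dict[str, str]:
        # Backend name -> import path. Both are placeholders.
        return {"musa_sdpa": "sglang_musa.attention.MusaSdpaBackend"}
```

Everything the subclass leaves untouched is inherited from the base, which is what lets a vendor implement only the needed methods.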
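Installation and activation
# Production: auto-activated on MUSA hardware after pip install
pip install sglang sglang-musa
# Development: no install required
python -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct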
Plan
Phase 1: Create sglang platforms (Q1)
Create sglang/platforms/interface.py: Platform base class
sglang/platforms/{cuda,rocm,npu,xpu,hpu,cpu}.py
sglang/platforms/__init__.py: current_platform + plugin detection
sglang.platform_plugins entry point group in pyproject.toml
MockPlatform
Phase 2: Migrate SGL-Diffusion (Q1)
multimodal_gen/runtime/platforms/interface.py: inherit from sglang.platforms.Platform
multimodal_gen/runtime/platforms/__init__.py: thin re-export
All Diffusion behavior preserved, detection logic unified
Phase 3: Wire SRT Core (Q2)
Replace 5 extension points with current_platform calls:
File                                             Change
model_executor/model_runner.py                   initialize_device() + get_graph_runner_class()
model_executor/model_runner_kv_cache_mixin.py    create_kv_pool()
server_args.py                                   apply_server_args_defaults()
layers/attention/attention_registry.py           get_attention_backends()
distributed/parallel_state.py                    get_device_communicator_cls()
Built-in CUDA/ROCm/NPU paths unchanged: Platform methods delegate to the
same code that runs today.
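The effect on a single call site can be sketched as below. The runner and platform classes are empty stand-ins named after the class hierarchy above, and both build functions are hypothetical; the real model_runner.py code is not reproduced here.

```python
class CudaGraphRunner:
    """Stand-in for the existing CUDA graph runner."""


class NPUGraphRunner:
    """Stand-in for the existing NPU graph runner."""


class CudaPlatform:
    def get_graph_runner_class(self) -> type:
        return CudaGraphRunner


class NpuPlatform:
    def get_graph_runner_class(self) -> type:
        return NPUGraphRunner


def build_graph_runner_guarded(is_npu: bool):
    # Guard style: the call site itself branches on the hardware,
    # so each new backend adds another branch here.
    if is_npu:
        return NPUGraphRunner()
    return CudaGraphRunner()


def build_graph_runner(current_platform):
    # Platform style: the call site asks the platform and stays
    # hardware-agnostic. The platform returns the same class as before.
    return current_platform.get_graph_runner_class()()
```

Both functions construct the same runner for a given device, which is the sense in which the built-in paths stay unchanged.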
Phase 4: Clean Up is_xxx() Guards (Q2)
Migrate ~40 call sites across SRT to current_platform checks such as current_platform.is_npu().
Existing is_npu() helpers in utils/common.py become thin wrappers that delegate to current_platform for backward compatibility.
Phase 5: Documentation + Vendor Template (Q2)
docs/developer_guide/platform_plugin.md: step-by-step vendor guide
sglang-platform-template skeleton repository on GitHub
docs/supported_models/hardware_support.md