SGLang Deep Dive
SGLang Deep Dive: A Full-Stack Tour of a High-Performance LLM Serving System
This article is based on the current SGLang repository and documentation. It walks through SGLang as a full serving stack: frontend language, OpenAI-compatible APIs, runtime architecture, scheduling, KV-cache management, model execution, custom kernels, model gateway, distributed deployment, observability, benchmarks, and tests.
1. What SGLang Is
SGLang is a high-performance serving framework for large language models, multimodal models, embedding models, reward models, and diffusion-style image/video generation models. It is not just a thin wrapper around model.generate(). It is a full inference system: APIs, scheduling, tokenization, KV-cache management, model execution, distributed parallelism, custom kernels, production observability, and cluster routing.
In one sentence:
SGLang turns modern model inference from “the model can run” into “the model can serve real traffic efficiently, reliably, and at scale.”
At a high level, SGLang provides:
| Area | Capabilities |
|---|---|
| Serving interfaces | Native /generate, OpenAI-compatible APIs, Ollama-compatible APIs, gRPC, offline engine, Python API |
| Runtime performance | RadixAttention, continuous batching, chunked prefill, paged attention, CUDA graphs, overlap scheduling, speculative decoding |
| Cache system | GPU KV cache, RadixCache, HiCache, L3 distributed KV storage, cache-aware scheduling |
| Parallelism | Tensor parallelism, pipeline parallelism, data parallelism, expert parallelism, DP attention, PD disaggregation |
| Model support | LLMs, VLMs, embeddings, reward models, rerankers, classifiers, diffusion image/video models |
| Generation control | Sampling parameters, stop conditions, logprobs, JSON schema, regex, EBNF, tool parsing, reasoning parsing |
| Production deployment | Rust model gateway, load balancing, health checks, rate limiting, circuit breakers, Prometheus, OpenTelemetry |
| Low-level acceleration | sgl-kernel custom CUDA/HIP/CUTLASS/Triton kernels for attention, MoE, GEMM, quantization, sampling, KV-cache I/O |
SGLang as a Serving Stack
A layered system, from application-facing APIs down to specialized GPU kernels.
/generate, OpenAI-compatible /v1/chat/completions, embeddings, rerank, score, gRPC, Ollama-compatible endpoints, Python Engine2. Repository Map
The repository is organized as a full system rather than a single Python package.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
sglang/
├── python/sglang/ # Main Python package
│ ├── lang/ # SGLang frontend language and interpreter
│ ├── srt/ # SGLang Runtime, the main LLM serving engine
│ ├── multimodal_gen/ # Diffusion/image/video generation runtime
│ ├── jit_kernel/ # JIT kernels and experimental kernels
│ ├── cli/ # CLI entrypoints
│ └── bench_*.py # Common benchmark scripts
├── sgl-kernel/ # Standalone optimized kernel package
├── sgl-model-gateway/ # Rust gateway for routing/control plane
├── docs/ # User docs, advanced features, platforms, developer guide
├── examples/ # Runtime, monitoring, and usage examples
├── benchmark/ # End-to-end and workload-specific benchmarks
├── test/ # Manual, registered, unit, and SRT tests
├── scripts/ # CI, release, playground, conversion, utility scripts
├── docker/ # Docker, Kubernetes, SageMaker deployment assets
└── 3rdparty/ # Third-party/platform-specific code
The most important directories are:
| Directory | Role |
|---|---|
python/sglang/lang |
The frontend language: programmatic prompting, IR, interpreter, backends |
python/sglang/srt |
SGLang Runtime: request handling, scheduling, KV cache, model execution |
sgl-kernel |
Optimized CUDA/HIP/CUTLASS/Torch-extension kernels |
sgl-model-gateway |
Rust-based routing layer for large model fleets |
python/sglang/multimodal_gen |
Diffusion/image/video generation serving runtime |
benchmark |
Performance and workload experiments |
test |
CI and correctness coverage across subsystems |
3. Public Interfaces
SGLang exposes several ways to use the system.
3.1 HTTP Server
The common deployment path is:
1
2
3
4
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 30000
This starts a FastAPI/uvicorn server implemented mainly in:
1
2
3
python/sglang/srt/entrypoints/http_server.py
python/sglang/srt/entrypoints/openai/
python/sglang/srt/entrypoints/ollama/
The server exposes:
| Endpoint family | Examples |
|---|---|
| Native SGLang | /generate, /encode, /classify |
| OpenAI-compatible | /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/rerank, /v1/score, /v1/tokenize, /v1/detokenize |
| Ollama-compatible | /api/chat, /api/generate, /api/tags, /api/show |
| Ops/admin | /health, /metrics, profiling, cache flush, LoRA load/unload, weight update, pause/resume/abort |
| Platform integration | SageMaker /invocations, Vertex-style route, gRPC mode |
3.2 Python Engine
You can also instantiate an engine directly from Python:
1
2
3
4
5
6
7
8
import sglang as sgl
engine = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
out = engine.generate(
prompt="Explain KV cache in one paragraph.",
sampling_params={"temperature": 0, "max_new_tokens": 128},
)
print(out)
The Python Engine still launches the runtime components internally. The difference is only that the entrypoint is a Python object rather than an HTTP server.
3.3 SGLang Frontend Language
python/sglang/lang provides the original SGLang programming interface: a small language embedded in Python for writing prompt programs.
It exposes:
@sgl.functionfor defining SGL programssgl.genfor model generationsgl.selectfor constrained choice selection- role helpers for chat-style prompts
sgl.imageandsgl.videofor multimodal inputssgl.Runtimefor remote endpointssgl.Enginefor local runtime execution
Internally it contains:
1
2
3
4
5
python/sglang/lang/api.py
python/sglang/lang/ir.py
python/sglang/lang/interpreter.py
python/sglang/lang/tracer.py
python/sglang/lang/backend/
Frontend Language Execution
SGL programs are interpreted into backend calls, with optional tracing for common-prefix pre-cache.
4. The SGLang Runtime: Three Core Processes
The central serving engine lives in:
1
python/sglang/srt/
SGLang Runtime, often called SRT, is built around three major components:
| Component | Process | Main job |
|---|---|---|
TokenizerManager |
Main process | Request intake, templates, tokenization, multimodal preprocessing, request state, streaming |
Scheduler |
Subprocess | Waiting/running queues, batching, KV-cache management, GPU worker scheduling |
DetokenizerManager |
Subprocess | Incremental token-id to text decoding, stop trimming, response return path |
IPC between these processes is done through ZMQ.
Runtime Request Path
The main serving loop is split so CPU-heavy tokenization and detokenization do not block GPU scheduling.
4.1 TokenizerManager
Source:
1
python/sglang/srt/managers/tokenizer_manager.py
TokenizerManager is the runtime’s front door. It converts external request objects into tokenized internal messages that the scheduler can handle.
It initializes:
ModelConfig- tokenizer or multimodal processor
- IPC channels
- request state and streaming state
- request logging, request dump, crash dump
- weight update state
- LoRA registry
- PD disaggregation or encoder disaggregation services
- metrics collector and watchdog
- type-based request dispatcher
During request processing it handles:
- chat template and completion template application
- text and input-id validation
- image/audio/video preprocessing
- conversion into
TokenizedGenerateReqInputorTokenizedEmbeddingReqInput ReqStatetracking for streaming output- timing metrics such as first-token time and finish time
4.2 Scheduler
Source:
1
python/sglang/srt/managers/scheduler.py
The Scheduler is SGLang’s performance center. It is much more than a queue. It coordinates scheduling policy, KV-cache allocation, model workers, speculative decoding, disaggregation, LoRA, metrics, profiling, and distributed execution.
At startup, it initializes:
- model configuration
- IPC channels
- tokenizer or processor
- MoE and GEMM configuration
TpModelWorker- optional speculative draft worker
- cache and memory pools
- running/waiting queues
- sessions
- chunked prefill
- schedule policy
- watchdog, memory saver, input blocker, receive skipper
- profiler
- prefill/decode disaggregation
- overlap scheduling
- deterministic inference settings
- grammar manager
- LoRA overlap loader
The class is built through many mixins:
1
2
3
4
5
6
7
8
9
10
11
SchedulerOutputProcessorMixin
SchedulerUpdateWeightsMixin
SchedulerProfilerMixin
SchedulerMetricsMixin
SchedulerDisaggregationDecodeMixin
SchedulerDisaggregationPrefillMixin
SchedulerMultiplexMixin
SchedulerRuntimeCheckerMixin
SchedulerPPMixin
SchedulerDPAttnMixin
SchedulerDllmMixin
That mixin list tells the story: Scheduler is the runtime hub.
4.3 DetokenizerManager
Source:
1
python/sglang/srt/managers/detokenizer_manager.py
DetokenizerManager receives token IDs from Scheduler, maintains per-request incremental decode state, batch-decodes IDs into strings, trims stop strings/tokens, and returns BatchStrOutput to TokenizerManager.
Detokenization is separated into its own process because token decoding is CPU work. Keeping it away from Scheduler protects the GPU scheduling loop.
5. Core Batch Data Structures
SGLang explicitly divides batch state into three layers:
1
ScheduleBatch -> ModelWorkerBatch -> ForwardBatch
| Structure | Owner | Where it lives | Purpose |
|---|---|---|---|
ScheduleBatch |
Scheduler | Mostly CPU | High-level scheduling data: requests, prefix hits, cache allocations, sampling info |
ModelWorkerBatch |
TpModelWorker | CPU/GPU boundary | Subset of scheduling state needed by model execution |
ForwardBatch |
ModelRunner | Mostly GPU tensors | Low-level tensor state consumed by model forward and attention backend |
Batch State Compression
As a request gets closer to the GPU, its representation becomes more tensor-heavy and less policy-heavy.
ForwardMode describes what a forward pass is doing:
| Mode | Meaning |
|---|---|
EXTEND |
Prefill/extend a sequence |
DECODE |
Generate one token per request |
MIXED |
Chunked prefill mixed with decode |
IDLE |
Worker has no local sequence, common with DP attention |
TARGET_VERIFY |
Target-model verification in speculative decoding |
DRAFT_EXTEND / DRAFT_EXTEND_V2 |
Draft-model extension in speculative decoding |
PREBUILT |
Decode worker receives ready KV cache in PD mode |
SPLIT_PREFILL |
Split prefill for PD multiplexing |
DLLM_EXTEND |
Diffusion LLM extension path |
6. Scheduling: The Brain of Throughput
Scheduling policy lives in:
1
python/sglang/srt/managers/schedule_policy.py
SGLang supports both cache-aware and cache-agnostic strategies.
6.1 Cache-Aware Policies
| Policy | Meaning |
|---|---|
lpm |
Longest Prefix Match. Requests with more reusable prefix KV cache are prioritized. |
dfs-weight |
Sorts according to DFS-style radix-tree weights to cluster shared-prefix requests. |
6.2 Cache-Agnostic Policies
| Policy | Meaning |
|---|---|
fcfs |
First come, first served |
lof |
Longest output first |
random |
Random order |
routing-key |
Prioritize requests with routing keys frequent in the running batch |
Cache-Aware Scheduling Loop
The scheduler tries to spend GPU time on requests that can reuse existing computation.
This is why SGLang’s Scheduler is central to performance. It must consider:
- waiting queue and running batch
- prefix cache hit length
- KV-cache capacity
- maximum prefill tokens
- maximum running requests
- LoRA count per batch
- chunked prefill
- priority scheduling
- speculative decoding state
- PD disaggregation state
- distributed rank behavior
7. KV Cache, Memory Pools, and RadixAttention
LLM inference has two very different phases:
- Prefill: process the prompt and produce KV cache. This is compute-heavy.
- Decode: repeatedly generate one token while reading previous KV cache. This is memory-bandwidth-heavy.
SGLang optimizes both phases with memory pools and RadixAttention.
7.1 Two-Level Memory Pool
The memory pool design is described directly in memory_pool.py: ReqToTokenPool maps a request to token locations, while token-to-KV allocators manage physical KV-cache indices.
KV Memory Addressing
The runtime separates request identity from physical KV-cache storage.
Conceptually:
1
req_to_token[request_slot, token_position] = physical_kv_cache_index
7.2 RadixCache
Source:
1
python/sglang/srt/mem_cache/radix_cache.py
RadixCache is a radix tree for reusable prefix KV cache. Each node represents a span of tokens and stores metadata about whether the KV for that span exists on device, host, or storage.
Node metadata includes:
- token key
- device value
- host value
- page hashes
- parent and children
- lock/reference counters
- hit count
- last access time
- creation time
- eviction priority
RadixCache Prefix Reuse
Shared prompt prefixes become shared KV-cache paths.
For workloads such as RAG, multi-turn chat, long-context QA, and batch evaluation, this prefix reuse can save substantial prefill compute.
7.3 RadixAttention
Source:
1
python/sglang/srt/layers/radix_attention.py
RadixAttention is the model-layer attention module. It reshapes Q/K/V, optionally writes K/V into cache, and dispatches actual attention computation to the selected attention backend through forward_batch.attn_backend.
Important point: RadixAttention is not one fixed kernel. It is a cache-aware attention abstraction that can use FlashInfer, Triton, FlashAttention, FlashMLA, CUTLASS MLA, TensorRT-LLM, AITER, Wave, Ascend, NSA, or other backends depending on model and platform.
8. HiCache: Hierarchical KV Cache
RadixAttention uses idle GPU memory to cache reusable prefix KV. HiCache extends the same idea into a three-level hierarchy:
| Level | Storage | Role |
|---|---|---|
| L1 | GPU memory | Fastest, smallest, local to an inference instance |
| L2 | Host memory | Larger local capacity |
| L3 | Distributed storage | Cluster-wide sharing and much larger capacity |
HiCache: L1/L2/L3 KV Cache
HiCache expands prefix reuse beyond GPU memory and across instances.
HiCache operations:
- Local match: traverse local HiRadixTree metadata.
- Prefetch: query L3 and load useful KV pages into local memory.
- Write-back: move new or hot KV data from L1 to L2/L3.
- Multi-rank synchronization: use collectives so TP ranks agree on hit lengths.
- Transfer optimization: use page-first layouts and GPU-assisted I/O kernels.
HiCache is especially useful for:
- multi-turn chat
- repeated system prompts
- long-context QA
- multi-document QA
- RAG workloads
- cluster-level KV sharing across model instances
9. Model Execution: TpModelWorker and ModelRunner
Model execution is centered around:
1
2
3
python/sglang/srt/managers/tp_worker.py
python/sglang/srt/model_executor/model_runner.py
python/sglang/srt/model_executor/forward_batch_info.py
Scheduler does not directly call the model. It sends work to TpModelWorker, which converts ModelWorkerBatch into ForwardBatch and calls ModelRunner.
ModelRunner Initialization
ModelRunner owns the low-level device execution context.
ModelRunner is responsible for:
- distributed environment initialization
- tensor/pipeline/expert/data parallel group setup
- model loading
- quantization setup
- LoRA manager
- CPU/offload integration
- KV-cache dtype and memory pool setup
- attention backend initialization
- CUDA graph and CPU graph runners
- kernel warmup
- weight update pathways
- remote-instance transfer engine integration
- expert location and expert distribution tracking
- forward paths for decode, extend, idle, split prefill, and more
9.1 Attention Backend Registry
Source:
1
python/sglang/srt/layers/attention/attention_registry.py
SGLang registers attention backends by name:
| Backend | Purpose |
|---|---|
flashinfer |
Default high-performance backend, MHA and MLA paths |
triton |
Triton attention implementation and fallback |
torch_native |
Compatibility path |
flex_attention |
PyTorch FlexAttention |
fa3 / fa4 |
FlashAttention v3/v4 |
flashmla |
MLA-specific path |
cutlass_mla |
CUTLASS MLA backend |
trtllm_mla / trtllm_mha |
TensorRT-LLM attention backend |
aiter / wave |
AMD/ROCm-oriented optimized paths |
ascend |
Ascend NPU backend |
nsa |
Native Sparse Attention |
intel_amx |
Intel CPU AMX backend |
dual_chunk_flash_attn |
Long-context dual-chunk attention |
This registry is one of the reasons SGLang can support many models and hardware platforms without forcing every model through one attention implementation.
10. Model Support and Registry
Native model implementations live in:
1
python/sglang/srt/models/
The repository includes implementations for many model families:
- Llama, Llama4, MLLama
- Qwen, Qwen2, Qwen3, Qwen-VL, Qwen-Omni
- DeepSeek, DeepSeek-VL, DeepSeek-OCR, DeepSeek NextN
- Kimi, Kimi Linear, Kimi VL
- GLM, GLM-V, GLM-MoE
- Gemma and Gemma reward variants
- Mistral, Mixtral, Ministral
- GPT-OSS, GPT2, GPT-J, StarCoder
- Phi and Phi4MM
- InternVL, LLaVA, NVILA, Pixtral
- BERT, RoBERTa, embedding, reward, classification models
- Mamba, hybrid linear attention, MoE, MTP, EAGLE draft models
The registry is implemented in:
1
python/sglang/srt/models/registry.py
It scans sglang.srt.models, imports modules with an EntryClass, and registers architectures by class name. At load time:
- SGLang reads
architecturesfrom Hugging Face config. - It normalizes and looks up supported model classes.
- If a native implementation exists, it uses that implementation.
- Otherwise, it can fall back to
TransformersForCausalLM. - External model packages can be registered through an environment variable.
Model Resolution Path
11. Sampling, Structured Outputs, Tools, and Reasoning
Related directories:
1
2
3
4
python/sglang/srt/sampling/
python/sglang/srt/constrained/
python/sglang/srt/function_call/
python/sglang/srt/parser/
11.1 Sampling
SGLang supports the common generation controls:
- temperature
- top-p
- top-k
- min-p
- frequency penalty
- presence penalty
- stop strings
- stop token IDs
- stop regex
- max/min tokens
- logprobs and top logprobs
- ignore EOS
- custom logit processor
Sampling backends include FlashInfer, PyTorch, and platform-specific paths.
11.2 Structured Outputs
SGLang can constrain generation with:
- JSON schema
- regular expression
- EBNF grammar
Grammar backends:
| Backend | Support |
|---|---|
| XGrammar | Default. JSON schema, regex, EBNF |
| Outlines | JSON schema, regex |
| llguidance | JSON schema, regex, EBNF |
Structured Generation
The grammar backend restricts the sampler to valid next tokens.
11.3 Tool Calls and Reasoning Parsers
SGLang includes model-specific detectors and parsers for tool calls and reasoning content. This matters because different models encode tool calls and chain-of-thought style reasoning in different formats.
The function-call parser area includes formats for DeepSeek, Qwen, Kimi, GLM, GPT-OSS, Mistral, Llama, Step, and more. The reasoning parser can separate hidden or explicit reasoning content from user-facing assistant output.
12. Speculative Decoding
Speculative decoding uses a cheaper draft path to propose tokens and the target model to verify them. When drafts are accepted, the target model performs fewer sequential decode steps.
SGLang supports:
| Method | Draft source | When to use |
|---|---|---|
| EAGLE-2 | EAGLE draft model | Strong general default |
| EAGLE-3 | EAGLE3 draft model | Higher throughput when supported |
| MTP | Built-in multi-token prediction heads | Models with native MTP layers |
| STANDALONE | Separate smaller draft LLM | When a good smaller draft model is available |
| NGRAM | N-gram candidates from previous tokens | No extra model, CUDA-only path |
| SpecV2 | Experimental overlap scheduler path | Aggressive overlap scheduling |
Speculative Decoding Loop
In code, Scheduler checks SpeculativeAlgorithm and may launch a draft worker alongside the target worker. Users control the behavior with arguments such as:
--speculative-algorithm--speculative-draft-model-path--speculative-num-steps--speculative-eagle-topk--speculative-num-draft-tokens--speculative-token-map
13. Prefill/Decode Disaggregation
LLM inference has two phases with different bottlenecks:
- Prefill is compute-heavy.
- Decode is memory-bandwidth-heavy.
In a unified engine, prefill batches can interrupt decode batches and hurt latency. PD disaggregation separates them:
- Prefill workers process prompts and produce KV cache.
- Decode workers receive KV cache and perform autoregressive generation.
- Transfer backends move KV cache between worker groups.
- Gateway/router coordinates request flow.
Supported transfer backends include Mooncake, NIXL, Ascend, fake, and MoRI paths.
PD Disaggregation Topology
Compute-heavy prefill and memory-heavy decode are scaled independently.
This is particularly important for large models, long contexts, high concurrency, and rack-scale deployments.
14. Parallelism: TP, PP, DP, EP, and DP Attention
SGLang supports several parallel dimensions:
| Parallelism | Purpose |
|---|---|
| Tensor Parallelism | Split tensors, attention heads, MLP matrices across GPUs |
| Pipeline Parallelism | Split model layers into stages |
| Data Parallelism | Run multiple serving replicas |
| Expert Parallelism | Distribute MoE experts across ranks |
| DP Attention | Specialized data-parallel attention path for large-model decode |
| Context Parallel / NSA | Long-context and sparse-attention optimizations |
Parallel Execution Axes
ModelRunner initializes distributed environment and model-parallel groups. Scheduler uses rank information to decide which rank receives requests, sends outputs, synchronizes cache metadata, and handles pipeline proxy tensors.
15. LoRA, Weight Updates, and RL/Post-Training
15.1 LoRA
Related directory:
1
python/sglang/srt/lora/
SGLang supports:
- LoRA enabled at launch
- dynamic LoRA adapter loading
- loading LoRA adapters from tensors
- unloading adapters
- multi-LoRA batching
- LoRA overlap loading
- LoRA eviction policy
HTTP endpoints include:
/load_lora_adapter/load_lora_adapter_from_tensors/unload_lora_adapter
15.2 Weight Updates
SGLang exposes many weight update paths:
/update_weights_from_disk/init_weights_update_group/destroy_weights_update_group/update_weights_from_tensor/update_weights_from_distributed/update_weights_from_ipc/update_weight_version/get_weights_by_name
This is critical for RL rollout, post-training, checkpoint-engine integration, online weight refresh, and distributed model update workflows.
15.3 RL and Post-Training Backbone
The project positions SGLang as a rollout backend for post-training frameworks. That is supported by:
- weight sync
- checkpoint engine integration
- distributed and IPC weight updates
- request replay
- memory release/resume
- metrics and tracing
16. Quantization
Docs:
1
2
docs/advanced_features/quantization.md
python/sglang/srt/layers/quantization/
SGLang supports offline and online quantization:
| Mode | Meaning | Recommendation |
|---|---|---|
| Offline quantization | Load pre-quantized weights | Recommended for production |
| Online quantization | Quantize during startup | Convenient but slower startup and higher memory pressure |
Supported families include:
- AWQ
- GPTQ
- Marlin / GPTQ Marlin / AWQ Marlin
- FP8 / MXFP8
- FP4 / MXFP4 / NVFP4
- W8A8 INT8 / W8A8 FP8
- bitsandbytes
- GGUF
- ModelOpt FP8/FP4
- AutoRound
- compressed tensors
- MoE-specific formats such as WNA16 and W4AFP8
High-performance quantized execution can route into sgl-kernel, CUTLASS, Triton, FlashInfer, or platform backends.
17. sgl-kernel: The Low-Level Acceleration Library
sgl-kernel is a standalone package:
1
2
3
4
5
6
7
sgl-kernel/
├── csrc/ # CUDA/HIP/C++ extension sources
├── include/ # Kernel op headers
├── python/sgl_kernel/ # Python bindings
├── tests/ # Kernel tests
├── benchmark/ # Kernel benchmarks
└── CMakeLists.txt
It provides optimized primitives for LLM and VLM inference engines.
| Kernel family | Examples |
|---|---|
| Attention | Flash attention, FlashMLA, sparse flash attention, merge state |
| GEMM | FP8/FP4/INT8 GEMM, blockwise GEMM, BMM FP8, Marlin, CUTLASS |
| MoE | Top-k, fused gate, MoE align, FP8/FP4 blockwise MoE, Kimi K2 MoE |
| Quantization | AWQ/GPTQ/FP8/FP4 quant/dequant, per-token/per-tensor/per-group quant |
| Sampling | Top-k, sampling, speculative sampling, token bitmask |
| KV cache I/O | KV cache transfer, copy, store cache |
| Distributed | Custom allreduce, quick allreduce, MSCCL++ |
| Norm/activation/RoPE | RMSNorm, activation, rotary embedding, fused QK norm RoPE |
| Mamba/SSM | Causal convolution and Mamba-related kernels |
Kernel Integration Path
New kernels follow a clear path: implement source, expose headers, register torch extension, update CMake, add Python binding, add tests and benchmarks.
18. SGLang Model Gateway
The gateway is a Rust project:
1
sgl-model-gateway/
It turns a set of model workers into an operational model-serving fleet.
18.1 Control Plane
The control plane includes:
- worker manager
- worker registry
- worker service
- job queue
- health checker
- load monitor
- tokenizer registry
- Kubernetes service discovery
- WASM module registration
- MCP registration
18.2 Data Plane
The data plane supports:
- regular HTTP routing
- HTTP PD routing
- gRPC routing
- gRPC PD routing
- OpenAI-compatible backend proxy
- multi-model inference gateway mode
- tokenize/detokenize/parser endpoints
- conversation and response history connectors
Model Gateway Architecture
Rust gateway separates fleet control from request routing.
18.3 Load Balancing and Reliability
Gateway policies include:
- random
- round robin
- cache-aware
- power-of-two
- bucket
- prefix hash
- consistent hashing
- manual
- tree-like policies
Reliability features include:
- retry with jitter
- per-worker circuit breaker
- token-bucket rate limiting
- request queueing
- health checks
- Prometheus metrics
- OpenTelemetry tracing
- structured logs
- request ID propagation
19. Diffusion and Multimodal Generation
The diffusion runtime lives in:
1
python/sglang/multimodal_gen/
SGLang Diffusion targets accelerated image/video generation. It supports:
- Wan and FastWan
- Hunyuan
- Qwen-Image and Qwen-Image-Edit
- Flux
- Z-Image
- GLM-Image
- NVIDIA GPUs
- AMD ROCm
- Moore Threads MUSA
- OpenAI-compatible API
- CLI
- Python SDK
- LoRA
The runtime has its own structure:
1
2
3
4
5
6
7
8
9
10
runtime/
├── entrypoints/
├── managers/
├── models/
├── pipelines/
├── layers/
├── loader/
├── distributed/
├── cache/
└── platforms/
This shows that SGLang is expanding beyond LLM serving into a broader multimodal serving platform.
20. Observability
Docs:
1
2
3
4
docs/advanced_features/observability.md
docs/references/production_metrics.md
docs/references/production_request_trace.md
examples/monitoring/
SGLang supports:
- Prometheus metrics through
--enable-metrics - Grafana dashboard examples
- OpenTelemetry tracing
- request logging
- request dump
- request replay
- crash dump
- crash replay
- function timers
- CPU monitor
- tokenizer/scheduler/detokenizer metrics
Important metrics include:
| Metric | Meaning |
|---|---|
prompt_tokens_total |
Number of prefill tokens processed |
generation_tokens_total |
Number of generated tokens |
token_usage |
KV token usage |
cache_hit_rate |
Prefix/cache hit rate |
time_to_first_token_seconds |
TTFT |
time_per_output_token_seconds |
TPOT |
e2e_request_latency_seconds |
End-to-end latency |
num_running_reqs |
Number of running requests |
num_queue_reqs |
Waiting queue size |
gen_throughput |
Generation throughput in token/s |
Production Visibility
21. Benchmarks and Tests
SGLang has broad benchmark coverage, not just one throughput script.
Benchmark areas include:
- serving benchmark
- batch benchmark
- tokenizer benchmark
- HiCache benchmark
- JSON schema, regex, and jump-forward decoding
- LoRA benchmark
- MTBench
- MMLU
- GSM8K
- HellaSwag
- BoolQ
- CEval
- MMMU
- LLaVA bench
- multi-turn chat
- multi-document QA
- reasoning benchmark
- DeepSeek V3
- GPT-OSS
- prefill-only embedding and scoring
- kernel and attention sink benchmarks
Test layout:
| Directory | Purpose |
|---|---|
test/unit |
Unit tests |
test/srt |
SRT subsystem tests |
test/registered |
CI-registered functional coverage |
test/manual |
Manual and platform-specific tests |
sgl-kernel/tests |
Kernel-level tests |
test/registered covers many categories: OpenAI server, scheduler, radix cache, disaggregation, distributed execution, HiCache, LoRA, quantization, kernels, models, VLM, metrics, parsers, function calls, speculative decoding, performance, and stress tests.
22. End-to-End Request Flow
The following diagram summarizes a single generation request.
End-to-End Generation Path
From an OpenAI-style chat request to streamed text output.
This is why SGLang is a large codebase: high-performance serving is not a single model call. It is a request-lifecycle system.
23. Component Index
| Component / directory | Function |
|---|---|
python/sglang/lang |
Frontend language, IR, interpreter, backend abstraction |
python/sglang/cli |
CLI commands such as serve and generate |
python/sglang/launch_server.py |
Server launch entrypoint |
srt/entrypoints/http_server.py |
FastAPI server, OpenAI/Ollama/native/admin routes |
srt/entrypoints/engine.py |
Python Engine, launches tokenizer/scheduler/detokenizer |
srt/server_args.py |
Server arguments, backend choices, deployment/performance switches |
srt/managers/tokenizer_manager.py |
Tokenization, request state, multimodal preprocessing, streaming |
srt/managers/scheduler.py |
Queues, batches, cache, workers, scheduling, parallelism, PD, speculative decoding |
srt/managers/detokenizer_manager.py |
Token IDs to incremental text |
srt/managers/schedule_policy.py |
LPM, DFS-weight, FCFS, LOF, random, routing-key |
srt/managers/schedule_batch.py |
Req, ScheduleBatch, ModelWorkerBatch |
srt/managers/tp_worker.py |
Tensor-parallel model worker |
srt/model_executor/model_runner.py |
Model loading, distributed setup, attention backend, forward, CUDA graph |
srt/model_executor/forward_batch_info.py |
ForwardBatch and ForwardMode |
srt/mem_cache |
Memory pools, RadixCache, HiCache, storage backends, sparse cache |
srt/layers/attention |
Attention backends |
srt/layers/quantization |
Quantization configuration and kernel integration |
srt/layers/moe |
MoE layers, experts, routing, kernel integration |
srt/models |
Native model implementations and registry |
srt/model_loader |
Weight loading, format adapters, remote loaders |
srt/sampling |
Sampling parameters and sampling backends |
srt/constrained |
Grammar backends and structured outputs |
srt/function_call |
Tool-call detectors and parsers |
srt/parser |
Reasoning parsers |
srt/speculative |
EAGLE, MTP, standalone, NGRAM speculative decoding |
srt/disaggregation |
Prefill/decode disaggregation and KV transfer |
srt/distributed |
TP/PP/DP/EP communication and parallel state |
srt/lora |
Dynamic LoRA loading, batching, management |
srt/metrics / srt/tracing |
Prometheus, timers, CPU monitor, OpenTelemetry tracing |
sgl-kernel |
Optimized kernel package |
sgl-model-gateway |
Rust gateway, routing, control plane, load balancing, reliability |
multimodal_gen |
Diffusion/image/video generation runtime |
benchmark |
Real workload and performance benchmarks |
test |
Unit, registered, platform, manual, and kernel tests |
docs |
Installation, usage, advanced features, platform, developer docs |
24. Final Takeaways
SGLang has four defining design traits.
First, it is a production serving engine. The presence of OpenAI-compatible APIs, gRPC, metrics, tracing, gateway routing, rate limiting, request replay, Docker, Kubernetes, SageMaker, CI, and benchmarks makes it infrastructure, not just a model wrapper.
Second, its performance story is system-level. RadixCache, schedule policy, chunked prefill, CUDA graphs, overlap scheduling, PD disaggregation, HiCache, attention backends, MoE communication, speculative decoding, and custom kernels work together.
Third, it is built for broad model and hardware coverage. The repository supports many text, vision-language, embedding, reward, classification, rerank, and diffusion model families, while targeting NVIDIA, AMD, CPU, TPU, Ascend, XPU, MUSA, and other platforms.
Fourth, it is moving from single-server inference to cluster-level model infrastructure. The Rust gateway, PD disaggregation, HiCache L3 storage, service discovery, history connectors, MCP integration, and inference gateway mode all point in that direction.
If SGLang were a machine, sglang.lang would be the operator panel, HTTP/OpenAI/gRPC would be the public interface, Tokenizer/Scheduler/Detokenizer would be the control loop, RadixCache/HiCache would be the memory system, ModelRunner and attention backends would be the engine, sgl-kernel would be the precision-machined parts, and sgl-model-gateway would be the traffic tower coordinating a fleet.
That is SGLang’s core value: it turns model inference into a fast, observable, extensible, production-ready serving system.