SGLang Deep Dive

Posted Dec 29, 2025

By Zhenhuan Wei

SGLang Deep Dive: A Full-Stack Tour of a High-Performance LLM Serving System

This article is based on the current SGLang repository and documentation. It walks through SGLang as a full serving stack: frontend language, OpenAI-compatible APIs, runtime architecture, scheduling, KV-cache management, model execution, custom kernels, model gateway, distributed deployment, observability, benchmarks, and tests.

1. What SGLang Is

SGLang is a high-performance serving framework for large language models, multimodal models, embedding models, reward models, and diffusion-style image/video generation models. It is not just a thin wrapper around model.generate(). It is a full inference system: APIs, scheduling, tokenization, KV-cache management, model execution, distributed parallelism, custom kernels, production observability, and cluster routing.

In one sentence:

SGLang turns modern model inference from “the model can run” into “the model can serve real traffic efficiently, reliably, and at scale.”

At a high level, SGLang provides:

Area	Capabilities
Serving interfaces	Native `/generate`, OpenAI-compatible APIs, Ollama-compatible APIs, gRPC, offline engine, Python API
Runtime performance	RadixAttention, continuous batching, chunked prefill, paged attention, CUDA graphs, overlap scheduling, speculative decoding
Cache system	GPU KV cache, RadixCache, HiCache, L3 distributed KV storage, cache-aware scheduling
Parallelism	Tensor parallelism, pipeline parallelism, data parallelism, expert parallelism, DP attention, PD disaggregation
Model support	LLMs, VLMs, embeddings, reward models, rerankers, classifiers, diffusion image/video models
Generation control	Sampling parameters, stop conditions, logprobs, JSON schema, regex, EBNF, tool parsing, reasoning parsing
Production deployment	Rust model gateway, load balancing, health checks, rate limiting, circuit breakers, Prometheus, OpenTelemetry
Low-level acceleration	`sgl-kernel` custom CUDA/HIP/CUTLASS/Triton kernels for attention, MoE, GEMM, quantization, sampling, KV-cache I/O

SGLang as a Serving Stack

A layered system, from application-facing APIs down to specialized GPU kernels.

Applications

Agents, chat products, post-training rollout workers, benchmark clients, internal platforms

API Surface

/generate, OpenAI-compatible /v1/chat/completions, embeddings, rerank, score, gRPC, Ollama-compatible endpoints, Python Engine

Runtime

TokenizerManager, Scheduler, DetokenizerManager, request state, streaming, batching, metrics

Execution

TpModelWorker, ModelRunner, ForwardBatch, attention backends, sampler, grammar constraints, LoRA, quantization

Memory

ReqToTokenPool, TokenToKV pool, RadixCache, HiCache, host/offload/storage backends

Acceleration

sgl-kernel, FlashInfer, Triton, FlashAttention, FlashMLA, CUTLASS, AITER, Wave, platform backends

2. Repository Map

The repository is organized as a full system rather than a single Python package.

sglang/
├── python/sglang/              # Main Python package
│   ├── lang/                   # SGLang frontend language and interpreter
│   ├── srt/                    # SGLang Runtime, the main LLM serving engine
│   ├── multimodal_gen/         # Diffusion/image/video generation runtime
│   ├── jit_kernel/             # JIT kernels and experimental kernels
│   ├── cli/                    # CLI entrypoints
│   └── bench_*.py              # Common benchmark scripts
├── sgl-kernel/                 # Standalone optimized kernel package
├── sgl-model-gateway/          # Rust gateway for routing/control plane
├── docs/                       # User docs, advanced features, platforms, developer guide
├── examples/                   # Runtime, monitoring, and usage examples
├── benchmark/                  # End-to-end and workload-specific benchmarks
├── test/                       # Manual, registered, unit, and SRT tests
├── scripts/                    # CI, release, playground, conversion, utility scripts
├── docker/                     # Docker, Kubernetes, SageMaker deployment assets
└── 3rdparty/                   # Third-party/platform-specific code

The most important directories are:

Directory	Role
`python/sglang/lang`	The frontend language: programmatic prompting, IR, interpreter, backends
`python/sglang/srt`	SGLang Runtime: request handling, scheduling, KV cache, model execution
`sgl-kernel`	Optimized CUDA/HIP/CUTLASS/Torch-extension kernels
`sgl-model-gateway`	Rust-based routing layer for large model fleets
`python/sglang/multimodal_gen`	Diffusion/image/video generation serving runtime
`benchmark`	Performance and workload experiments
`test`	CI and correctness coverage across subsystems

3. Public Interfaces

SGLang exposes several ways to use the system.

3.1 HTTP Server

The common deployment path is:

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000

This starts a FastAPI/uvicorn server implemented mainly in:

python/sglang/srt/entrypoints/http_server.py
python/sglang/srt/entrypoints/openai/
python/sglang/srt/entrypoints/ollama/

The server exposes:

Endpoint family	Examples
Native SGLang	`/generate`, `/encode`, `/classify`
OpenAI-compatible	`/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/rerank`, `/v1/score`, `/v1/tokenize`, `/v1/detokenize`
Ollama-compatible	`/api/chat`, `/api/generate`, `/api/tags`, `/api/show`
Ops/admin	`/health`, `/metrics`, profiling, cache flush, LoRA load/unload, weight update, pause/resume/abort
Platform integration	SageMaker `/invocations`, Vertex-style route, gRPC mode
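
As a minimal sketch of calling the native `/generate` endpoint, the snippet below builds a request body and defines a small POST helper using only the standard library. The body fields mirror the Python Engine example in this article (`text` plus `sampling_params`). The helper names `build_generate_payload` and `post_json` are illustrative, and the response shape is deliberately not assumed here.

```python
import json
import urllib.request


def build_generate_payload(prompt, temperature=0.0, max_new_tokens=64):
    """Build a JSON body for the native /generate endpoint."""
    return {
        "text": prompt,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
    }


def post_json(url, payload, timeout=30.0):
    """POST a JSON payload and return the decoded JSON response."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


payload = build_generate_payload(
    "Explain KV cache in one paragraph.", max_new_tokens=128
)
# Run this only when a server started with the launch command is listening:
# result = post_json("http://localhost:30000/generate", payload)
```

The request is left commented out so the snippet can be imported without a running server.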

3.2 Python Engine

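You can also instantiate an engine directly from Python:

import sglang as sgl


def main():
    # Engine launches the runtime components internally.
    engine = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
    out = engine.generate(
        prompt="Explain KV cache in one paragraph.",
        sampling_params={"temperature": 0, "max_new_tokens": 128},
    )
    print(out)
    engine.shutdown()  # stop the runtime subprocesses and free GPU memory


# The guard matters because the engine spawns subprocesses.
if __name__ == "__main__":
    main()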

The Python Engine still launches the runtime components internally. The only difference is that the entrypoint is a Python object rather than an HTTP server.

3.3 SGLang Frontend Language

python/sglang/lang provides the original SGLang programming interface: a small language embedded in Python for writing prompt programs.

It exposes:

@sgl.function for defining SGL programs
sgl.gen for model generation
sgl.select for constrained choice selection
role helpers for chat-style prompts
sgl.image and sgl.video for multimodal inputs
sgl.RuntimeEndpoint for remote endpoints
sgl.Engine for local runtime execution

Internally it contains:

python/sglang/lang/api.py
python/sglang/lang/ir.py
python/sglang/lang/interpreter.py
python/sglang/lang/tracer.py
python/sglang/lang/backend/

Frontend Language Execution

SGL programs are interpreted into backend calls, with optional tracing for common-prefix pre-cache.

@sgl.function: User writes a Python prompt program.
→ SGL IR: Text, gen, select, role, image, video nodes.
→ Interpreter: StreamExecutor runs the program.
→ Backend: RuntimeEndpoint, Engine, OpenAI, Anthropic, LiteLLM.
→ ProgramState: Final text, variables, streaming state.

4. The SGLang Runtime: Three Core Processes

The central serving engine lives in:

python/sglang/srt/

SGLang Runtime, often called SRT, is built around three major components:

Component	Process	Main job
`TokenizerManager`	Main process	Request intake, templates, tokenization, multimodal preprocessing, request state, streaming
`Scheduler`	Subprocess	Waiting/running queues, batching, KV-cache management, GPU worker scheduling
`DetokenizerManager`	Subprocess	Incremental token-id to text decoding, stop trimming, response return path

IPC between these processes is done through ZMQ.

Runtime Request Path

The main serving loop is split so CPU-heavy tokenization and detokenization do not block GPU scheduling.

1. TokenizerManager: Receives GenerateReqInput, applies templates, tokenizes text, preprocesses images/audio/video, and tracks streaming request state.

2. Scheduler: Chooses requests, matches prefix cache, allocates KV slots, builds batches, invokes model workers, and processes generated token IDs.

3. DetokenizerManager: Decodes token IDs into text, trims stop sequences, handles incremental output, and sends BatchStrOutput back to the tokenizer layer.

Client: HTTP, OpenAI, gRPC, Python Engine
→ Tokenizer: Normalize and tokenize request
→ Scheduler: Batch, cache, dispatch
→ ModelRunner: Forward, attention, sampling
→ Detokenizer: Text streaming or final JSON
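
To make the three-stage split concrete, here is a toy pipeline in which each stage runs in its own thread and passes messages through queues. This is only an analogy: the real runtime uses separate processes and ZMQ, and the "scheduler" below fakes generation by reversing token IDs. The vocabulary and every function name are hypothetical.

```python
import queue
import threading

VOCAB = {"hello": 1, "world": 2, "!": 3}  # hypothetical toy vocabulary
ID_TO_TEXT = {v: k for k, v in VOCAB.items()}
STOP = None  # sentinel that shuts a stage down


def tokenizer_stage(inbox, outbox):
    # Stand-in for TokenizerManager: text -> token IDs.
    while (item := inbox.get()) is not STOP:
        rid, text = item
        outbox.put((rid, [VOCAB[w] for w in text.split()]))
    outbox.put(STOP)


def scheduler_stage(inbox, outbox):
    # Stand-in for Scheduler plus model: "generates" by reversing the IDs.
    while (item := inbox.get()) is not STOP:
        rid, ids = item
        outbox.put((rid, list(reversed(ids))))
    outbox.put(STOP)


def detokenizer_stage(inbox, results):
    # Stand-in for DetokenizerManager: token IDs -> text.
    while (item := inbox.get()) is not STOP:
        rid, ids = item
        results[rid] = " ".join(ID_TO_TEXT[i] for i in ids)


def run_pipeline(requests):
    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    results = {}
    threads = [
        threading.Thread(target=tokenizer_stage, args=(q1, q2)),
        threading.Thread(target=scheduler_stage, args=(q2, q3)),
        threading.Thread(target=detokenizer_stage, args=(q3, results)),
    ]
    for t in threads:
        t.start()
    for r in requests:
        q1.put(r)
    q1.put(STOP)
    for t in threads:
        t.join()
    return results


pipeline_results = run_pipeline([("r1", "hello world"), ("r2", "world !")])
```

Because each stage only blocks on its own inbox, slow tokenization or detokenization never stalls the middle stage, which is the property the real design protects for the GPU loop.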

4.1 TokenizerManager

Source:

python/sglang/srt/managers/tokenizer_manager.py

TokenizerManager is the runtime’s front door. It converts external request objects into tokenized internal messages that the scheduler can handle.

It initializes:

ModelConfig
tokenizer or multimodal processor
IPC channels
request state and streaming state
request logging, request dump, crash dump
weight update state
LoRA registry
PD disaggregation or encoder disaggregation services
metrics collector and watchdog
type-based request dispatcher

During request processing it handles:

chat template and completion template application
text and input-id validation
image/audio/video preprocessing
conversion into TokenizedGenerateReqInput or TokenizedEmbeddingReqInput
ReqState tracking for streaming output
timing metrics such as first-token time and finish time

4.2 Scheduler

Source:

python/sglang/srt/managers/scheduler.py

The Scheduler is SGLang’s performance center. It is much more than a queue. It coordinates scheduling policy, KV-cache allocation, model workers, speculative decoding, disaggregation, LoRA, metrics, profiling, and distributed execution.

At startup, it initializes:

model configuration
IPC channels
tokenizer or processor
MoE and GEMM configuration
TpModelWorker
optional speculative draft worker
cache and memory pools
running/waiting queues
sessions
chunked prefill
schedule policy
watchdog, memory saver, input blocker, receive skipper
profiler
prefill/decode disaggregation
overlap scheduling
deterministic inference settings
grammar manager
LoRA overlap loader

The class is built through many mixins:

SchedulerOutputProcessorMixin
SchedulerUpdateWeightsMixin
SchedulerProfilerMixin
SchedulerMetricsMixin
SchedulerDisaggregationDecodeMixin
SchedulerDisaggregationPrefillMixin
SchedulerMultiplexMixin
SchedulerRuntimeCheckerMixin
SchedulerPPMixin
SchedulerDPAttnMixin
SchedulerDllmMixin

That mixin list tells the story: Scheduler is the runtime hub.

4.3 DetokenizerManager

Source:

python/sglang/srt/managers/detokenizer_manager.py

DetokenizerManager receives token IDs from Scheduler, maintains per-request incremental decode state, batch-decodes IDs into strings, trims stop strings/tokens, and returns BatchStrOutput to TokenizerManager.

Detokenization is separated into its own process because token decoding is CPU work. Keeping it away from Scheduler protects the GPU scheduling loop.
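
The incremental decode and stop-trimming behavior can be sketched as a small per-request state machine. This is an illustrative simplification, not the DetokenizerManager code: `IncrementalDecoder` and its fields are hypothetical, and a plain id-to-string table replaces a real tokenizer. The key idea is to hold back any trailing text that might be the beginning of a stop string, so it is never streamed to the client.

```python
class IncrementalDecoder:
    """Per-request decode state that streams only text known to be safe."""

    def __init__(self, id_to_piece, stop_strings):
        self.id_to_piece = id_to_piece
        self.stop_strings = stop_strings
        self.text = ""         # everything decoded so far
        self.sent = 0          # number of characters already streamed
        self.finished = False

    def _partial_stop_len(self):
        # Longest suffix of the text that is a proper prefix of a stop string.
        best = 0
        for stop in self.stop_strings:
            for k in range(1, len(stop)):
                if self.text.endswith(stop[:k]):
                    best = max(best, k)
        return best

    def push(self, token_id):
        """Add one token ID and return the newly streamable text."""
        if self.finished:
            return ""
        self.text += self.id_to_piece[token_id]
        for stop in self.stop_strings:
            pos = self.text.find(stop)
            if pos != -1:
                self.text = self.text[:pos]  # trim the stop string
                self.finished = True
                break
        hold = 0 if self.finished else self._partial_stop_len()
        end = max(self.sent, len(self.text) - hold)
        delta = self.text[self.sent:end]
        self.sent = end
        return delta


pieces = {1: "Hel", 2: "lo", 3: " <", 4: "END", 5: "> ignored"}
decoder = IncrementalDecoder(pieces, stop_strings=["<END>"])
deltas = [decoder.push(t) for t in [1, 2, 3, 4, 5]]
streamed = "".join(deltas)
```

In this run the pieces `<`, `END`, and `>` arrive in three different tokens, yet none of them reaches the client because each partial match is held back until the stop string either completes or fails.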

5. Core Batch Data Structures

SGLang explicitly divides batch state into three layers:

ScheduleBatch -> ModelWorkerBatch -> ForwardBatch

Structure	Owner	Where it lives	Purpose
`ScheduleBatch`	Scheduler	Mostly CPU	High-level scheduling data: requests, prefix hits, cache allocations, sampling info
`ModelWorkerBatch`	TpModelWorker	CPU/GPU boundary	Subset of scheduling state needed by model execution
`ForwardBatch`	ModelRunner	Mostly GPU tensors	Low-level tensor state consumed by model forward and attention backend

Batch State Compression

As a request gets closer to the GPU, its representation becomes more tensor-heavy and less policy-heavy.

Req: Original request object and decoded request metadata.
→ ScheduleBatch: Queue state, prefix match, memory budget, sampling settings.
→ ModelWorkerBatch: Worker-facing subset for model execution.
→ ForwardBatch: GPU tensors: input IDs, seq lens, cache locations, attention backend.
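
The narrowing from scheduler state to tensor-like state can be illustrated with plain dataclasses. The class names follow the article, but every field below is a simplified assumption chosen for the example, and Python lists stand in for GPU tensors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Req:
    rid: str
    input_ids: List[int]
    prefix_len: int      # tokens already covered by cached KV
    temperature: float


@dataclass
class ScheduleBatch:     # scheduler-level view: whole request objects
    reqs: List[Req]


@dataclass
class ModelWorkerBatch:  # worker-facing subset: flat arrays, no policy state
    input_ids: List[int]
    seq_lens: List[int]
    extend_lens: List[int]
    temperatures: List[float]


@dataclass
class ForwardBatch:      # what the model forward would consume
    input_ids: List[int]
    positions: List[int]
    seq_lens: List[int]


def to_worker_batch(batch):
    input_ids, seq_lens, extend_lens, temps = [], [], [], []
    for req in batch.reqs:
        new_tokens = req.input_ids[req.prefix_len:]  # skip cached prefix
        input_ids.extend(new_tokens)
        seq_lens.append(len(req.input_ids))
        extend_lens.append(len(new_tokens))
        temps.append(req.temperature)
    return ModelWorkerBatch(input_ids, seq_lens, extend_lens, temps)


def to_forward_batch(wb):
    positions = []
    for seq_len, extend_len in zip(wb.seq_lens, wb.extend_lens):
        positions.extend(range(seq_len - extend_len, seq_len))
    return ForwardBatch(wb.input_ids, positions, wb.seq_lens)


schedule_batch = ScheduleBatch(
    reqs=[
        Req("r1", [10, 11, 12, 13], prefix_len=2, temperature=0.0),
        Req("r2", [20, 21, 22], prefix_len=0, temperature=0.7),
    ]
)
worker_batch = to_worker_batch(schedule_batch)
forward_batch = to_forward_batch(worker_batch)
```

Request `r1` has two tokens already cached, so only tokens 12 and 13 are sent forward, at positions 2 and 3.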

ForwardMode describes what a forward pass is doing:

Mode	Meaning
`EXTEND`	Prefill/extend a sequence
`DECODE`	Generate one token per request
`MIXED`	Chunked prefill mixed with decode
`IDLE`	Worker has no local sequence, common with DP attention
`TARGET_VERIFY`	Target-model verification in speculative decoding
`DRAFT_EXTEND` / `DRAFT_EXTEND_V2`	Draft-model extension in speculative decoding
`PREBUILT`	Decode worker receives ready KV cache in PD mode
`SPLIT_PREFILL`	Split prefill for PD multiplexing
`DLLM_EXTEND`	Diffusion LLM extension path

6. Scheduling: The Brain of Throughput

Scheduling policy lives in:

python/sglang/srt/managers/schedule_policy.py

SGLang supports both cache-aware and cache-agnostic strategies.

6.1 Cache-Aware Policies

Policy	Meaning
`lpm`	Longest Prefix Match. Requests with more reusable prefix KV cache are prioritized.
`dfs-weight`	Sorts according to DFS-style radix-tree weights to cluster shared-prefix requests.

6.2 Cache-Agnostic Policies

Policy	Meaning
`fcfs`	First come, first served
`lof`	Longest output first
`random`	Random order
`routing-key`	Prioritize requests with routing keys frequent in the running batch
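
A rough sketch of how `lpm` and `fcfs` order the same waiting queue differently. The data layout is hypothetical: requests are dicts, and a flat list of cached token sequences stands in for the radix tree.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_len(input_ids, cached_sequences):
    """Longest prefix of input_ids found in any cached sequence."""
    return max(
        (common_prefix_len(input_ids, c) for c in cached_sequences), default=0
    )


def order_waiting_queue(waiting, cached_sequences, policy):
    if policy == "fcfs":
        return sorted(waiting, key=lambda r: r["arrival"])
    if policy == "lpm":
        # Longest prefix match first; arrival order breaks ties.
        return sorted(
            waiting,
            key=lambda r: (
                -prefix_hit_len(r["input_ids"], cached_sequences),
                r["arrival"],
            ),
        )
    raise ValueError(f"unknown policy: {policy}")


waiting = [
    {"rid": "a", "arrival": 0, "input_ids": [1, 2, 9]},
    {"rid": "b", "arrival": 1, "input_ids": [1, 2, 3, 4]},
    {"rid": "c", "arrival": 2, "input_ids": [7, 8]},
]
cached = [[1, 2, 3, 5]]

lpm_order = [r["rid"] for r in order_waiting_queue(waiting, cached, "lpm")]
fcfs_order = [r["rid"] for r in order_waiting_queue(waiting, cached, "fcfs")]
```

Request `b` arrived second but shares three cached tokens, so `lpm` runs it first.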

Cache-Aware Scheduling Loop

The scheduler tries to spend GPU time on requests that can reuse existing computation.

Waiting Queue: Incoming tokenized requests.
→ Radix Match: Find reusable prefix KV cache.
→ Priority Sort: LPM, DFS-weight, FCFS, LOF, routing-key.
→ Budget Check: Tokens, requests, KV pages, LoRA limits.
→ Run Batch: Build ScheduleBatch and dispatch.

Factors: prefix reuse, KV budget, batching, priority, LoRA/spec/PD constraints.

This is why SGLang’s Scheduler is central to performance. It must consider:

waiting queue and running batch
prefix cache hit length
KV-cache capacity
maximum prefill tokens
maximum running requests
LoRA count per batch
chunked prefill
priority scheduling
speculative decoding state
PD disaggregation state
distributed rank behavior
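
A few of these constraints can be combined into a simplified admission loop. This sketch makes assumptions the real scheduler does not necessarily share: it reserves KV space for `max_new_tokens` up front, it stops at the first request that does not fit rather than chunking it, and it ignores LoRA, speculative, and PD state. All names are hypothetical.

```python
def admit_requests(
    waiting, free_kv_tokens, max_prefill_tokens, max_running, num_running
):
    """Admit requests in priority order until a budget would be exceeded."""
    admitted = []
    prefill_used = 0
    for req in waiting:  # assumed to be sorted by the scheduling policy
        new_tokens = req["num_tokens"] - req["prefix_hit"]
        kv_needed = new_tokens + req["max_new_tokens"]
        if num_running + len(admitted) >= max_running:
            break
        if prefill_used + new_tokens > max_prefill_tokens:
            break
        if kv_needed > free_kv_tokens:
            break
        admitted.append(req["rid"])
        prefill_used += new_tokens
        free_kv_tokens -= kv_needed
    return admitted


queue_snapshot = [
    {"rid": "r1", "num_tokens": 100, "prefix_hit": 80, "max_new_tokens": 50},
    {"rid": "r2", "num_tokens": 200, "prefix_hit": 0, "max_new_tokens": 50},
    {"rid": "r3", "num_tokens": 30, "prefix_hit": 0, "max_new_tokens": 10},
]

roomy = admit_requests(queue_snapshot, 400, 256, 8, 0)
tight_prefill = admit_requests(queue_snapshot, 400, 128, 8, 0)
tight_slots = admit_requests(queue_snapshot, 400, 256, 8, 7)
```

Note how the prefix hit changes the cost of `r1`: a 100-token prompt with 80 cached tokens consumes only 20 tokens of prefill budget.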

7. KV Cache, Memory Pools, and RadixAttention

LLM inference has two very different phases:

Prefill: process the prompt and produce KV cache. This is compute-heavy.
Decode: repeatedly generate one token while reading previous KV cache. This is memory-bandwidth-heavy.

SGLang optimizes both phases with memory pools and RadixAttention.

7.1 Two-Level Memory Pool

The memory pool design is described directly in memory_pool.py: ReqToTokenPool maps a request to token locations, while token-to-KV allocators manage physical KV-cache indices.

KV Memory Addressing

The runtime separates request identity from physical KV-cache storage.

Request Slot: Request ID and logical token positions.
→ ReqToTokenPool: Maps request positions to KV indices.
→ TokenToKV Allocator: Allocates and frees physical KV pages.
→ KV Cache Tensors: Physical K/V tensors consumed by attention.

Conceptually:

req_to_token[request_slot, token_position] = physical_kv_cache_index
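
The mapping above can be sketched with two small classes. The class names echo the article, but the methods and the list-based storage are assumptions for illustration; the real pools hold device tensors and page-granular allocators.

```python
class TokenToKVAllocator:
    """Hands out physical KV-cache indices from a free list."""

    def __init__(self, size):
        self.free_indices = list(range(size))

    def alloc(self, n):
        if n > len(self.free_indices):
            return None  # caller must evict or wait
        taken = self.free_indices[:n]
        self.free_indices = self.free_indices[n:]
        return taken

    def free(self, indices):
        self.free_indices.extend(indices)


class ReqToTokenPool:
    """Maps (request slot, token position) to a physical KV index."""

    def __init__(self, max_reqs, max_context_len):
        self.req_to_token = [[-1] * max_context_len for _ in range(max_reqs)]
        self.free_slots = list(range(max_reqs))

    def alloc(self):
        return self.free_slots.pop(0) if self.free_slots else None

    def write(self, slot, start, kv_indices):
        self.req_to_token[slot][start:start + len(kv_indices)] = kv_indices

    def free(self, slot, seq_len):
        kv = self.req_to_token[slot][:seq_len]
        self.req_to_token[slot] = [-1] * len(self.req_to_token[slot])
        self.free_slots.append(slot)
        return kv


allocator = TokenToKVAllocator(size=8)
pool = ReqToTokenPool(max_reqs=2, max_context_len=6)

slot = pool.alloc()
pool.write(slot, 0, allocator.alloc(3))   # prefill writes three positions
pool.write(slot, 3, allocator.alloc(1))   # one decode step appends one
row_after_decode = list(pool.req_to_token[slot])

released = pool.free(slot, seq_len=4)     # request finished
allocator.free(released)
```

Keeping the two levels separate is what lets several requests point at the same physical KV indices when they share a prefix.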

7.2 RadixCache

Source:

python/sglang/srt/mem_cache/radix_cache.py

RadixCache is a radix tree for reusable prefix KV cache. Each node represents a span of tokens and stores metadata about whether the KV for that span exists on device, host, or storage.

Node metadata includes:

token key
device value
host value
page hashes
parent and children
lock/reference counters
hit count
last access time
creation time
eviction priority

RadixCache Prefix Reuse

Shared prompt prefixes become shared KV-cache paths.

Root: The empty prefix. Every request starts here.
System Prompt: Shared policy, role, or instruction prefix.
User Prefix: Repeated task pattern such as summarize, translate, classify.
Document A: Request-specific suffix reuses all previous nodes.
Document B: Another suffix shares the same system and task prefix.
Eviction Policy: LRU, LFU, FIFO/FILO, MRU, priority-aware policies.

For workloads such as RAG, multi-turn chat, long-context QA, and batch evaluation, this prefix reuse can save substantial prefill compute.
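
Prefix reuse can be demonstrated with a much simpler structure than the real RadixCache. The sketch below is a per-token trie, not a compressed radix tree, and it omits locking, eviction, and host/storage tiers. `PrefixCache` and `TrieNode` are hypothetical names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # token id -> TrieNode
        self.kv_index = None  # physical KV index for this token


class PrefixCache:
    """Token-level trie mapping cached prefixes to KV indices."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids, kv_indices):
        node = self.root
        for tok, kv in zip(token_ids, kv_indices):
            child = node.children.get(tok)
            if child is None:
                child = TrieNode()
                child.kv_index = kv
                node.children[tok] = child
            node = child  # existing nodes keep their original KV index

    def match_prefix(self, token_ids):
        """Return KV indices for the longest cached prefix of token_ids."""
        node, hits = self.root, []
        for tok in token_ids:
            node = node.children.get(tok)
            if node is None:
                break
            hits.append(node.kv_index)
        return hits


cache = PrefixCache()
cache.insert([1, 2, 3, 4], [100, 101, 102, 103])
cache.insert([1, 2, 7], [100, 101, 200])

hit_a = cache.match_prefix([1, 2, 3, 9])
hit_b = cache.match_prefix([1, 2, 7, 8])
hit_c = cache.match_prefix([5])
```

A request that matches three cached tokens needs prefill only for the remaining ones, which is where the compute savings come from.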

7.3 RadixAttention

Source:

python/sglang/srt/layers/radix_attention.py

RadixAttention is the model-layer attention module. It reshapes Q/K/V, optionally writes K/V into cache, and dispatches actual attention computation to the selected attention backend through forward_batch.attn_backend.

Important point: RadixAttention is not one fixed kernel. It is a cache-aware attention abstraction that can use FlashInfer, Triton, FlashAttention, FlashMLA, CUTLASS MLA, TensorRT-LLM, AITER, Wave, Ascend, NSA, or other backends depending on model and platform.

8. HiCache: Hierarchical KV Cache

RadixAttention uses idle GPU memory to cache reusable prefix KV. HiCache extends the same idea into a three-level hierarchy:

Level	Storage	Role
L1	GPU memory	Fastest, smallest, local to an inference instance
L2	Host memory	Larger local capacity
L3	Distributed storage	Cluster-wide sharing and much larger capacity

HiCache: L1/L2/L3 KV Cache

HiCache expands prefix reuse beyond GPU memory and across instances.

Local Match

New request tokens are matched against HiRadixTree metadata for L1 GPU and L2 host cache hits.

L1 GPU

Hottest KV pages. Directly consumed by attention kernels.

L2 Host

Larger local KV pool. Can feed GPU via direct copy or GPU-assisted I/O kernels.

L3 Storage

Mooncake, HF3FS, NIXL, AIBrix KVCache, LMCache, file or dynamic backends.

Write-Back

New or hot KV pages are written to lower tiers using write-through, selective write-through, or write-back.

HiCache operations:

Local match: traverse local HiRadixTree metadata.
Prefetch: query L3 and load useful KV pages into local memory.
Write-back: move new or hot KV data from L1 to L2/L3.
Multi-rank synchronization: use collectives so TP ranks agree on hit lengths.
Transfer optimization: use page-first layouts and GPU-assisted I/O kernels.

HiCache is especially useful for:

multi-turn chat
repeated system prompts
long-context QA
multi-document QA
RAG workloads
cluster-level KV sharing across model instances
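
The tiered lookup can be sketched with three dictionaries. This is a conceptual model only: keys stand in for page hashes, L1 uses simple LRU eviction, writes go through to L2, and hits in lower tiers are promoted upward. The real HiCache involves asynchronous prefetch, rank synchronization, and storage backends that are not modeled here. `TieredKVCache` is a hypothetical name.

```python
from collections import OrderedDict


class TieredKVCache:
    def __init__(self, l1_capacity):
        self.l1_capacity = l1_capacity
        self.l1 = OrderedDict()  # GPU tier, LRU ordered
        self.l2 = {}             # host tier
        self.l3 = {}             # distributed storage tier

    def _put_l1(self, key, value):
        self.l1[key] = value
        self.l1.move_to_end(key)
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)  # evict least recently used

    def put(self, key, value):
        self._put_l1(key, value)
        self.l2[key] = value  # write-through to the host tier

    def get(self, key):
        """Return (value, tier) and promote lower-tier hits into L1."""
        if key in self.l1:
            self.l1.move_to_end(key)
            return self.l1[key], "L1"
        if key in self.l2:
            self._put_l1(key, self.l2[key])
            return self.l2[key], "L2"
        if key in self.l3:
            value = self.l3[key]
            self.l2[key] = value
            self._put_l1(key, value)
            return value, "L3"
        return None, "miss"


tiers = TieredKVCache(l1_capacity=2)
for page in ["a", "b", "c"]:
    tiers.put(page, page.upper())
tiers.l3["z"] = "Z"  # pretend another instance wrote this page

lookups = [tiers.get(k) for k in ["a", "a", "z", "z", "nope"]]
```

Page `a` was evicted from L1 when `c` arrived, but the write-through copy in L2 means the next request still avoids recomputing it.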

9. Model Execution: TpModelWorker and ModelRunner

Model execution is centered around:

python/sglang/srt/managers/tp_worker.py
python/sglang/srt/model_executor/model_runner.py
python/sglang/srt/model_executor/forward_batch_info.py

Scheduler does not directly call the model. It sends work to TpModelWorker, which converts ModelWorkerBatch into ForwardBatch and calls ModelRunner.

ModelRunner Initialization

ModelRunner owns the low-level device execution context.

Distributed Init: TP, PP, DP, EP groups and backend setup.
→ Load Model: HF/local/remote formats, dtype, model-specific adjustment.
→ Optimization: Quantization, LoRA, offloaders, torchao, expert metadata.
→ Memory: KV dtype, ReqToTokenPool, TokenToKV pool, max tokens.
→ Execution: Attention backend, kernel warmup, CUDA graph, piecewise graph.

ModelRunner is responsible for:

distributed environment initialization
tensor/pipeline/expert/data parallel group setup
model loading
quantization setup
LoRA manager
CPU/offload integration
KV-cache dtype and memory pool setup
attention backend initialization
CUDA graph and CPU graph runners
kernel warmup
weight update pathways
remote-instance transfer engine integration
expert location and expert distribution tracking
forward paths for decode, extend, idle, split prefill, and more

9.1 Attention Backend Registry

Source:

python/sglang/srt/layers/attention/attention_registry.py

SGLang registers attention backends by name:

Backend	Purpose
`flashinfer`	Default high-performance backend, MHA and MLA paths
`triton`	Triton attention implementation and fallback
`torch_native`	Compatibility path
`flex_attention`	PyTorch FlexAttention
`fa3` / `fa4`	FlashAttention v3/v4
`flashmla`	MLA-specific path
`cutlass_mla`	CUTLASS MLA backend
`trtllm_mla` / `trtllm_mha`	TensorRT-LLM attention backend
`aiter` / `wave`	AMD/ROCm-oriented optimized paths
`ascend`	Ascend NPU backend
`nsa`	Native Sparse Attention
`intel_amx`	Intel CPU AMX backend
`dual_chunk_flash_attn`	Long-context dual-chunk attention

This registry is one of the reasons SGLang can support many models and hardware platforms without forcing every model through one attention implementation.
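
The name-based registry pattern looks roughly like this. It is a generic decorator-registry sketch, not the contents of attention_registry.py: the backend names come from the table above, while the factory functions and their return values are placeholders.

```python
ATTENTION_BACKENDS = {}


def register_attention_backend(name):
    """Decorator that records a backend factory under a string name."""

    def decorator(factory):
        ATTENTION_BACKENDS[name] = factory
        return factory

    return decorator


@register_attention_backend("torch_native")
def create_torch_native_backend():
    return "TorchNativeBackend"  # placeholder for a real backend object


@register_attention_backend("triton")
def create_triton_backend():
    return "TritonBackend"  # placeholder for a real backend object


def create_backend(name):
    if name not in ATTENTION_BACKENDS:
        known = ", ".join(sorted(ATTENTION_BACKENDS))
        raise ValueError(f"unknown attention backend {name!r}; known: {known}")
    return ATTENTION_BACKENDS[name]()


selected = create_backend("triton")
```

Because model code only asks for a backend by name, adding a new platform means registering one more factory rather than touching every model.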

10. Model Support and Registry

Native model implementations live in:

python/sglang/srt/models/

The repository includes implementations for many model families:

Llama, Llama4, MLLama
Qwen, Qwen2, Qwen3, Qwen-VL, Qwen-Omni
DeepSeek, DeepSeek-VL, DeepSeek-OCR, DeepSeek NextN
Kimi, Kimi Linear, Kimi VL
GLM, GLM-V, GLM-MoE
Gemma and Gemma reward variants
Mistral, Mixtral, Ministral
GPT-OSS, GPT2, GPT-J, StarCoder
Phi and Phi4MM
InternVL, LLaVA, NVILA, Pixtral
BERT, RoBERTa, embedding, reward, classification models
Mamba, hybrid linear attention, MoE, MTP, EAGLE draft models

The registry is implemented in:

python/sglang/srt/models/registry.py

It scans sglang.srt.models, imports modules with an EntryClass, and registers architectures by class name. At load time:

SGLang reads architectures from Hugging Face config.
It normalizes and looks up supported model classes.
If a native implementation exists, it uses that implementation.
Otherwise, it can fall back to TransformersForCausalLM.
External model packages can be registered through an environment variable.
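
The lookup-with-fallback logic reduces to a few lines. This is a sketch under stated assumptions: the `NATIVE_MODELS` table is a hypothetical stand-in for the scanned registry, the config is a plain dict, and name normalization is skipped.

```python
# Hypothetical stand-in for the registry built by scanning model modules.
NATIVE_MODELS = {
    "LlamaForCausalLM": "native Llama implementation",
    "Qwen2ForCausalLM": "native Qwen2 implementation",
}
FALLBACK_CLASS = "TransformersForCausalLM"


def resolve_model_class(hf_config):
    """Return (class_name, is_native) for a Hugging Face style config dict."""
    for arch in hf_config.get("architectures", []):
        if arch in NATIVE_MODELS:
            return arch, True
    return FALLBACK_CLASS, False


native = resolve_model_class({"architectures": ["LlamaForCausalLM"]})
fallback = resolve_model_class({"architectures": ["SomeNewModelForCausalLM"]})
```

The first matching architecture wins, so a config that lists several names still resolves deterministically.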

Model Resolution Path

HF Config: Read architectures from model config.
→ ModelRegistry: Look up native EntryClass.
→ Native or Fallback: SGLang model class or TransformersForCausalLM.
→ Loader: Load weights from configured format.
→ ModelRunner: Execute forward passes.

11. Sampling, Structured Outputs, Tools, and Reasoning

Related directories:

python/sglang/srt/sampling/
python/sglang/srt/constrained/
python/sglang/srt/function_call/
python/sglang/srt/parser/

11.1 Sampling

SGLang supports the common generation controls:

temperature
top-p
top-k
min-p
frequency penalty
presence penalty
stop strings
stop token IDs
stop regex
max/min tokens
logprobs and top logprobs
ignore EOS
custom logit processor

Sampling backends include FlashInfer, PyTorch, and platform-specific paths.
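
For readers who want to see how temperature, top-k, and top-p interact, here is a pure-Python reference sketch. It is not how the FlashInfer or PyTorch sampling backends are implemented, and the order of filters (top-k, then top-p) is an assumption of this example. Function names are hypothetical.

```python
import math
import random


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: prob} after temperature, top-k, and top-p."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}  # greedy decoding
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break  # smallest set whose mass reaches top_p
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


def sample(logits, rng, **kwargs):
    dist = filter_probs(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.0]
greedy = filter_probs(logits, temperature=0)
top2 = filter_probs(logits, top_k=2)
nucleus = filter_probs(logits, top_p=0.5)
drawn = sample(logits, random.Random(0), top_k=2)
```

With these logits the most likely token already holds more than half of the probability mass, so `top_p=0.5` keeps only that token.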

11.2 Structured Outputs

SGLang can constrain generation with:

JSON schema
regular expression
EBNF grammar

Grammar backends:

Backend	Support
XGrammar	Default. JSON schema, regex, EBNF
Outlines	JSON schema, regex
llguidance	JSON schema, regex, EBNF

Structured Generation

The grammar backend restricts the sampler to valid next tokens.

Prompt: User asks for JSON, regex, or grammar-constrained output.
→ Grammar Backend: XGrammar, Outlines, or llguidance.
→ Token Mask: Allowed next-token set.
→ Sampler: Samples only valid tokens.
→ Valid Output: JSON, regex match, or EBNF-conforming text.
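
The masking step can be shown with a toy "grammar" that only accepts one of a fixed set of strings. This is far simpler than what XGrammar, Outlines, or llguidance do, and every name here is hypothetical. The point is that invalid tokens get a logit of negative infinity before selection, so the model cannot leave the allowed language even when it prefers an invalid token.

```python
import math

VOCAB = ["y", "e", "s", "n", "o", "<eos>"]  # toy character-level vocabulary
CHOICES = ["yes", "no"]                     # the only valid outputs


def allowed_next_tokens(prefix):
    """Tokens that keep the output a prefix of some valid choice."""
    allowed = set()
    for choice in CHOICES:
        if choice.startswith(prefix):
            if len(choice) == len(prefix):
                allowed.add("<eos>")
            else:
                allowed.add(choice[len(prefix)])
    return allowed


def masked_greedy_step(logits, prefix):
    allowed = allowed_next_tokens(prefix)
    masked = [
        logit if tok in allowed else -math.inf
        for tok, logit in zip(VOCAB, logits)
    ]
    return VOCAB[max(range(len(VOCAB)), key=lambda i: masked[i])]


def constrained_decode(logit_fn, max_steps=8):
    prefix = ""
    for _ in range(max_steps):
        tok = masked_greedy_step(logit_fn(prefix), prefix)
        if tok == "<eos>":
            break
        prefix += tok
    return prefix


# A fake model that always prefers "o", then "s", regardless of context.
prefers_o = constrained_decode(lambda p: [0.1, 0.2, 0.9, 0.5, 1.0, 0.0])
prefers_y = constrained_decode(lambda p: [1.0, 0.2, 0.9, 0.5, 0.3, 0.0])
```

Unmasked, the first fake model would emit `o` forever. With the mask it is steered to `n`, then `o`, then end of sequence.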

11.3 Tool Calls and Reasoning Parsers

SGLang includes model-specific detectors and parsers for tool calls and reasoning content. This matters because different models encode tool calls and chain-of-thought style reasoning in different formats.

The function-call parser area includes formats for DeepSeek, Qwen, Kimi, GLM, GPT-OSS, Mistral, Llama, Step, and more. The reasoning parser can separate hidden or explicit reasoning content from user-facing assistant output.

12. Speculative Decoding

Speculative decoding uses a cheaper draft path to propose tokens and the target model to verify them. When drafts are accepted, the target model performs fewer sequential decode steps.

SGLang supports:

Method	Draft source	When to use
EAGLE-2	EAGLE draft model	Strong general default
EAGLE-3	EAGLE3 draft model	Higher throughput when supported
MTP	Built-in multi-token prediction heads	Models with native MTP layers
STANDALONE	Separate smaller draft LLM	When a good smaller draft model is available
NGRAM	N-gram candidates from previous tokens	No extra model, CUDA-only path
SpecV2	Experimental overlap scheduler path	Aggressive overlap scheduling

Speculative Decoding Loop

Draft Worker: Proposes a tree or sequence of candidate tokens.
→ Target Worker: Verifies candidates in a batched target pass.
→ Accept/Reject: Accepted tokens are committed; rejection falls back to target sample.
→ KV Update: Request state and KV cache move forward.
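
A greedy, single-sequence version of this loop fits in a short function. It is a conceptual sketch only: real implementations verify all draft tokens in one batched target pass and may use draft trees and probabilistic acceptance, whereas this example calls the target token by token and accepts on exact match. The toy "models" are simple functions, and all names are hypothetical.

```python
def speculative_step(context, draft_next, target_next, num_draft_tokens):
    """Return the tokens committed by one draft-then-verify round."""
    # 1. Draft proposes a chain of candidate tokens.
    draft, ctx = [], list(context)
    for _ in range(num_draft_tokens):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Target verifies candidates in order.
    committed, ctx = [], list(context)
    for tok in draft:
        expected = target_next(ctx)
        if expected != tok:
            committed.append(expected)  # rejection: keep the target's token
            return committed
        committed.append(tok)
        ctx.append(tok)

    # 3. All drafts accepted: the target's next token comes for free.
    committed.append(target_next(ctx))
    return committed


def target_next(ctx):
    return (ctx[-1] + 1) % 10  # toy target: count upward


def draft_next(ctx):
    # Toy draft: agrees with the target except right after token 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


partial_accept = speculative_step([1], draft_next, target_next, 4)
full_accept = speculative_step([5], draft_next, target_next, 2)
```

In both runs the committed tokens are exactly what the target alone would have produced, which is the correctness guarantee of greedy speculative decoding. The gain is that several tokens are committed per verification round.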

In code, Scheduler checks SpeculativeAlgorithm and may launch a draft worker alongside the target worker. Users control the behavior with arguments such as:

--speculative-algorithm
--speculative-draft-model-path
--speculative-num-steps
--speculative-eagle-topk
--speculative-num-draft-tokens
--speculative-token-map

13. Prefill/Decode Disaggregation

LLM inference has two phases with different bottlenecks:

Prefill is compute-heavy.
Decode is memory-bandwidth-heavy.

In a unified engine, prefill batches can interrupt decode batches and hurt latency. PD disaggregation separates them:

Prefill workers process prompts and produce KV cache.
Decode workers receive KV cache and perform autoregressive generation.
Transfer backends move KV cache between worker groups.
Gateway/router coordinates request flow.

Supported transfer backends include Mooncake, NIXL, Ascend, fake, and MoRI paths.

PD Disaggregation Topology

Compute-heavy prefill and memory-heavy decode are scaled independently.

Client: Sends OpenAI/native request to the gateway.
Gateway: Selects prefill and decode workers using cache/load-aware policies.
Prefill Workers: Run prompt prefill and generate KV cache.
Decode Workers: Receive KV cache and stream generated tokens.

KV cache transfer is handled by backends such as Mooncake or NIXL. This lets prefill and decode fleets scale differently and reduces interference between the two phases.

This is particularly important for large models, long contexts, high concurrency, and rack-scale deployments.

14. Parallelism: TP, PP, DP, EP, and DP Attention

SGLang supports several parallel dimensions:

Parallelism	Purpose
Tensor Parallelism	Split tensors, attention heads, MLP matrices across GPUs
Pipeline Parallelism	Split model layers into stages
Data Parallelism	Run multiple serving replicas
Expert Parallelism	Distribute MoE experts across ranks
DP Attention	Specialized data-parallel attention path for large-model decode
Context Parallel / NSA	Long-context and sparse-attention optimizations

Parallel Execution Axes

TP: Shard compute within layers.
PP: Pipeline different layer ranges.
DP: Serve independent request shards or replicas.
EP: Place MoE experts across ranks.
DP Attention: Separate attention parallelism from base TP in selected configurations.
Collectives: NCCL, RCCL, HCCL, XCCL, custom allreduce, MSCCL++ and related backends.

ModelRunner initializes distributed environment and model-parallel groups. Scheduler uses rank information to decide which rank receives requests, sends outputs, synchronizes cache metadata, and handles pipeline proxy tensors.

15. LoRA, Weight Updates, and RL/Post-Training

15.1 LoRA

Related directory:

python/sglang/srt/lora/

SGLang supports:

LoRA enabled at launch
dynamic LoRA adapter loading
loading LoRA adapters from tensors
unloading adapters
multi-LoRA batching
LoRA overlap loading
LoRA eviction policy

HTTP endpoints include:

/load_lora_adapter
/load_lora_adapter_from_tensors
/unload_lora_adapter

15.2 Weight Updates

SGLang exposes many weight update paths:

/update_weights_from_disk
/init_weights_update_group
/destroy_weights_update_group
/update_weights_from_tensor
/update_weights_from_distributed
/update_weights_from_ipc
/update_weight_version
/get_weights_by_name

This is critical for RL rollout, post-training, checkpoint-engine integration, online weight refresh, and distributed model update workflows.

15.3 RL and Post-Training Backbone

The project positions SGLang as a rollout backend for post-training frameworks. That is supported by:

weight sync
checkpoint engine integration
distributed and IPC weight updates
request replay
memory release/resume
metrics and tracing

16. Quantization

Docs:

docs/advanced_features/quantization.md
python/sglang/srt/layers/quantization/

SGLang supports offline and online quantization:

Mode	Meaning	Recommendation
Offline quantization	Load pre-quantized weights	Recommended for production
Online quantization	Quantize during startup	Convenient but slower startup and higher memory pressure

Supported families include:

AWQ
GPTQ
Marlin / GPTQ Marlin / AWQ Marlin
FP8 / MXFP8
FP4 / MXFP4 / NVFP4
W8A8 INT8 / W8A8 FP8
bitsandbytes
GGUF
ModelOpt FP8/FP4
AutoRound
compressed tensors
MoE-specific formats such as WNA16 and W4AFP8

High-performance quantized execution can route into sgl-kernel, CUTLASS, Triton, FlashInfer, or platform backends.

17. sgl-kernel: The Low-Level Acceleration Library

sgl-kernel is a standalone package:

sgl-kernel/
├── csrc/                  # CUDA/HIP/C++ extension sources
├── include/               # Kernel op headers
├── python/sgl_kernel/     # Python bindings
├── tests/                 # Kernel tests
├── benchmark/             # Kernel benchmarks
└── CMakeLists.txt

It provides optimized primitives for LLM and VLM inference engines.

Kernel family	Examples
Attention	Flash attention, FlashMLA, sparse flash attention, merge state
GEMM	FP8/FP4/INT8 GEMM, blockwise GEMM, BMM FP8, Marlin, CUTLASS
MoE	Top-k, fused gate, MoE align, FP8/FP4 blockwise MoE, Kimi K2 MoE
Quantization	AWQ/GPTQ/FP8/FP4 quant/dequant, per-token/per-tensor/per-group quant
Sampling	Top-k, sampling, speculative sampling, token bitmask
KV cache I/O	KV cache transfer, copy, store cache
Distributed	Custom allreduce, quick allreduce, MSCCL++
Norm/activation/RoPE	RMSNorm, activation, rotary embedding, fused QK norm RoPE
Mamba/SSM	Causal convolution and Mamba-related kernels

Kernel Integration Path

Python Runtime: SGLang calls Python wrapper or custom op.
→ Torch Extension: Schema and dispatch registration.
→ C++ Binding: Adapts PyTorch types and launches kernels.
→ CUDA/HIP/CUTLASS: Optimized device code.
→ GPU: Low-latency execution path.

New kernels follow a clear path: implement source, expose headers, register torch extension, update CMake, add Python binding, add tests and benchmarks.

18. SGLang Model Gateway

The gateway is a Rust project:

sgl-model-gateway/

It turns a set of model workers into an operational model-serving fleet.

18.1 Control Plane

The control plane includes:

worker manager
worker registry
worker service
job queue
health checker
load monitor
tokenizer registry
Kubernetes service discovery
WASM module registration
MCP registration

18.2 Data Plane

The data plane supports:

regular HTTP routing
HTTP PD routing
gRPC routing
gRPC PD routing
OpenAI-compatible backend proxy
multi-model inference gateway mode
tokenize/detokenize/parser endpoints
conversation and response history connectors

Model Gateway Architecture

Rust gateway separates fleet control from request routing.

Control Plane

Worker Manager: Registers, validates, and removes workers.
Health Checker: Tracks readiness and circuit-breaker state.
Load Monitor: Feeds cache-aware and load-aware policies.
Service Discovery: Kubernetes and dynamic registry updates.

Data Plane

HTTP Router: Regular SGLang and OpenAI-compatible traffic.
PD Router: Coordinates prefill and decode workers.
gRPC Router: Rust tokenizer, reasoning parser, tool parser pipeline.
OpenAI Proxy: Routes to external OpenAI-compatible providers.

18.3 Load Balancing and Reliability

Gateway policies include:

random
round robin
cache-aware
power-of-two
bucket
prefix hash
consistent hashing
manual
tree-like policies
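
As one example from this list, power-of-two choices can be sketched in a few lines. The gateway itself is written in Rust. This Python version only illustrates the general technique, and the function name and load representation are assumptions: sample two distinct workers at random and route to the less loaded one.

```python
import random


def pick_power_of_two(loads, rng):
    """Pick two distinct workers at random and return the less loaded one."""
    a, b = rng.sample(sorted(loads), 2)
    return a if loads[a] <= loads[b] else b


worker_loads = {"w0": 5, "w1": 1, "w2": 3}
rng = random.Random(0)
picks = [pick_power_of_two(worker_loads, rng) for _ in range(100)]
```

The most loaded worker can never win a comparison, so it receives no new traffic until its load drops, yet the router only inspects two workers per request.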

Reliability features include:

retry with jitter
per-worker circuit breaker
token-bucket rate limiting
request queueing
health checks
Prometheus metrics
OpenTelemetry tracing
structured logs
request ID propagation
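
Token-bucket rate limiting, one of the features above, works as follows in its textbook form. This is not the gateway's Rust implementation. The class below is a hypothetical sketch that takes the current time as an argument so its behavior is deterministic.

```python
class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.last = now

    def allow(self, now, cost=1.0):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=2.0)
decisions = [
    bucket.allow(0.0),  # burst request 1
    bucket.allow(0.0),  # burst request 2
    bucket.allow(0.0),  # bucket is empty
    bucket.allow(1.0),  # one token refilled after one second
    bucket.allow(1.0),  # empty again
]
```

In production code the timestamp would come from a monotonic clock such as `time.monotonic()`.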

19. Diffusion and Multimodal Generation

The diffusion runtime lives in:

python/sglang/multimodal_gen/

SGLang Diffusion targets accelerated image/video generation. It supports:

Models: Wan and FastWan, Hunyuan, Qwen-Image and Qwen-Image-Edit, Flux, Z-Image, GLM-Image
Hardware: NVIDIA GPUs, AMD ROCm, Moore Threads MUSA
Interfaces: OpenAI-compatible API, CLI, Python SDK
Adapters: LoRA

The runtime has its own structure:

runtime/
├── entrypoints/
├── managers/
├── models/
├── pipelines/
├── layers/
├── loader/
├── distributed/
├── cache/
└── platforms/

This shows that SGLang is expanding beyond LLM serving into a broader multimodal serving platform.

20. Observability

Docs:

docs/advanced_features/observability.md
docs/references/production_metrics.md
docs/references/production_request_trace.md
examples/monitoring/

SGLang supports:

Prometheus metrics through --enable-metrics
Grafana dashboard examples
OpenTelemetry tracing
request logging
request dump
request replay
crash dump
crash replay
function timers
CPU monitor
tokenizer/scheduler/detokenizer metrics

Important metrics include:

Metric	Meaning
`prompt_tokens_total`	Number of prefill tokens processed
`generation_tokens_total`	Number of generated tokens
`token_usage`	KV token usage
`cache_hit_rate`	Prefix/cache hit rate
`time_to_first_token_seconds`	TTFT
`time_per_output_token_seconds`	TPOT
`e2e_request_latency_seconds`	End-to-end latency
`num_running_reqs`	Number of running requests
`num_queue_reqs`	Waiting queue size
`gen_throughput`	Generation throughput in tokens/s
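The three latency metrics in the table relate to each other through simple arithmetic. The sketch below uses a common convention: TTFT is the time from arrival to the first token, and TPOT is the mean gap between the remaining output tokens. Whether SGLang computes TPOT with exactly this formula is an assumption to check against the metrics docs; `RequestTiming` and its fields are hypothetical and not an SGLang data structure.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request timestamps, in seconds."""
    arrival: float
    first_token: float
    finished: float
    output_tokens: int


def ttft(t):
    """Time to first token."""
    return t.first_token - t.arrival


def e2e_latency(t):
    """End-to-end request latency."""
    return t.finished - t.arrival


def tpot(t):
    """Mean time per output token after the first; undefined for fewer than 2 tokens."""
    if t.output_tokens < 2:
        return None
    return (t.finished - t.first_token) / (t.output_tokens - 1)


# Made-up timing for one request: 41 output tokens over 2.25 seconds.
timing = RequestTiming(arrival=0.0, first_token=0.25, finished=2.25, output_tokens=41)
print(ttft(timing), tpot(timing), e2e_latency(timing))
```

Under this convention, end-to-end latency decomposes as TTFT plus TPOT times the number of output tokens after the first, which is a useful sanity check when reading dashboards.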

Production Visibility

Metrics: Prometheus endpoint and Grafana dashboard.
Tracing: OpenTelemetry request flow across runtime components.
Replay: request dump and crash dump replay for debugging.
Logs: request logging, structured gateway logs, watchdog diagnostics.

21. Benchmarks and Tests

SGLang has broad benchmark coverage, not just one throughput script.

Benchmark areas include:

serving benchmark
batch benchmark
tokenizer benchmark
HiCache benchmark
JSON schema, regex, and jump-forward decoding
LoRA benchmark
MTBench
MMLU
GSM8K
HellaSwag
BoolQ
CEval
MMMU
LLaVA bench
multi-turn chat
multi-document QA
reasoning benchmark
DeepSeek V3
GPT-OSS
prefill-only embedding and scoring
kernel and attention sink benchmarks

Test layout:

Directory	Purpose
`test/unit`	Unit tests
`test/srt`	SRT subsystem tests
`test/registered`	CI-registered functional coverage
`test/manual`	Manual and platform-specific tests
`sgl-kernel/tests`	Kernel-level tests

test/registered covers many categories: OpenAI server, scheduler, radix cache, disaggregation, distributed execution, HiCache, LoRA, quantization, kernels, models, VLM, metrics, parsers, function calls, speculative decoding, performance, and stress tests.

22. End-to-End Request Flow

The following steps summarize a single generation request, from an OpenAI-style chat request to streamed text output.

End-to-End Generation Path

1. API: FastAPI route or Python Engine receives the request.
2. Protocol: validate the request and build GenerateReqInput.
3. Tokenizer: apply the chat template, tokenize, and preprocess multimodal data.
4. Scheduler: queue, prefix match, allocate KV, build ScheduleBatch.
5. Worker: build ForwardBatch and run ModelRunner.forward.
6. Attention: RadixAttention dispatches to the backend and updates the KV cache.
7. Sampling: logits processor, grammar constraints, sampler.
8. Detokenizer: decode token IDs and trim stop sequences.
9. Response: TokenizerManager returns JSON or an SSE stream.

This is why SGLang is a large codebase: high-performance serving is not a single model call. It is a request-lifecycle system.
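To show the shape of that lifecycle in miniature, here is a toy Python pipeline with one stand-in function per stage. Nothing in it is SGLang code: the vocabulary, the fake "model", and the `serve` loop are invented for illustration, and there is no KV cache, prefix matching, or batching. The round-robin requeue is only a crude stand-in for iteration-level scheduling.

```python
from collections import deque

# Tiny made-up vocabulary; index 0 doubles as the end-of-sequence token.
VOCAB = ["<eos>", "hello", "world", "foo", "bar"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}


def tokenize(text):
    """Tokenizer stand-in: whitespace split over the fixed vocabulary."""
    return [TOKEN_ID[w] for w in text.split()]


def toy_forward(token_ids):
    """Worker/attention stand-in: score the token after the last one highest."""
    nxt = (token_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == nxt else 0.0 for i in range(len(VOCAB))]


def sample_greedy(logits):
    """Sampling stand-in: greedy argmax over the scores."""
    return max(range(len(logits)), key=logits.__getitem__)


def detokenize(token_ids):
    """Detokenizer stand-in: map IDs to text and drop the end-of-sequence token."""
    return " ".join(VOCAB[i] for i in token_ids if VOCAB[i] != "<eos>")


def serve(prompts, max_new_tokens=8):
    """Scheduler stand-in: FIFO queue, one decode step per request per turn."""
    waiting = deque((rid, tokenize(p), []) for rid, p in enumerate(prompts))
    results = {}
    while waiting:
        rid, prompt_ids, out_ids = waiting.popleft()
        next_id = sample_greedy(toy_forward(prompt_ids + out_ids))
        out_ids.append(next_id)
        if VOCAB[next_id] == "<eos>" or len(out_ids) >= max_new_tokens:
            results[rid] = detokenize(out_ids)          # request finished
        else:
            waiting.append((rid, prompt_ids, out_ids))  # requeue for the next step
    return results


print(serve(["hello world", "foo"]))
```

Even in this toy, a request finishes only after several passes through the queue, and the stages exchange token IDs rather than text. A real runtime adds caching, batching, and parallelism at each of those handoffs.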

23. Component Index

Component / directory	Function
`python/sglang/lang`	Frontend language, IR, interpreter, backend abstraction
`python/sglang/cli`	CLI commands such as serve and generate
`python/sglang/launch_server.py`	Server launch entrypoint
`srt/entrypoints/http_server.py`	FastAPI server, OpenAI/Ollama/native/admin routes
`srt/entrypoints/engine.py`	Python Engine, launches tokenizer/scheduler/detokenizer
`srt/server_args.py`	Server arguments, backend choices, deployment/performance switches
`srt/managers/tokenizer_manager.py`	Tokenization, request state, multimodal preprocessing, streaming
`srt/managers/scheduler.py`	Queues, batches, cache, workers, scheduling, parallelism, PD, speculative decoding
`srt/managers/detokenizer_manager.py`	Token IDs to incremental text
`srt/managers/schedule_policy.py`	LPM, DFS-weight, FCFS, LOF, random, routing-key
`srt/managers/schedule_batch.py`	`Req`, `ScheduleBatch`, `ModelWorkerBatch`
`srt/managers/tp_worker.py`	Tensor-parallel model worker
`srt/model_executor/model_runner.py`	Model loading, distributed setup, attention backend, forward, CUDA graph
`srt/model_executor/forward_batch_info.py`	`ForwardBatch` and `ForwardMode`
`srt/mem_cache`	Memory pools, RadixCache, HiCache, storage backends, sparse cache
`srt/layers/attention`	Attention backends
`srt/layers/quantization`	Quantization configuration and kernel integration
`srt/layers/moe`	MoE layers, experts, routing, kernel integration
`srt/models`	Native model implementations and registry
`srt/model_loader`	Weight loading, format adapters, remote loaders
`srt/sampling`	Sampling parameters and sampling backends
`srt/constrained`	Grammar backends and structured outputs
`srt/function_call`	Tool-call detectors and parsers
`srt/parser`	Reasoning parsers
`srt/speculative`	EAGLE, MTP, standalone, NGRAM speculative decoding
`srt/disaggregation`	Prefill/decode disaggregation and KV transfer
`srt/distributed`	TP/PP/DP/EP communication and parallel state
`srt/lora`	Dynamic LoRA loading, batching, management
`srt/metrics` / `srt/tracing`	Prometheus, timers, CPU monitor, OpenTelemetry tracing
`sgl-kernel`	Optimized kernel package
`sgl-model-gateway`	Rust gateway, routing, control plane, load balancing, reliability
`multimodal_gen`	Diffusion/image/video generation runtime
`benchmark`	Real workload and performance benchmarks
`test`	Unit, registered, platform, manual, and kernel tests
`docs`	Installation, usage, advanced features, platform, developer docs

24. Final Takeaways

SGLang has four defining design traits.

First, it is a production serving engine. The presence of OpenAI-compatible APIs, gRPC, metrics, tracing, gateway routing, rate limiting, request replay, Docker, Kubernetes, SageMaker, CI, and benchmarks makes it infrastructure, not just a model wrapper.

Second, its performance story is system-level. RadixCache, schedule policy, chunked prefill, CUDA graphs, overlap scheduling, PD disaggregation, HiCache, attention backends, MoE communication, speculative decoding, and custom kernels work together.

Third, it is built for broad model and hardware coverage. The repository supports many text, vision-language, embedding, reward, classification, rerank, and diffusion model families, while targeting NVIDIA, AMD, CPU, TPU, Ascend, XPU, MUSA, and other platforms.

Fourth, it is moving from single-server inference to cluster-level model infrastructure. The Rust gateway, PD disaggregation, HiCache L3 storage, service discovery, history connectors, MCP integration, and inference gateway mode all point in that direction.

If SGLang were a machine, sglang.lang would be the operator panel, HTTP/OpenAI/gRPC would be the public interface, Tokenizer/Scheduler/Detokenizer would be the control loop, RadixCache/HiCache would be the memory system, ModelRunner and attention backends would be the engine, sgl-kernel would be the precision-machined parts, and sgl-model-gateway would be the traffic tower coordinating a fleet.

That is SGLang’s core value: it turns model inference into a fast, observable, extensible, production-ready serving system.

This post is licensed under CC BY 4.0 by the author.