Post

SGLang Deep Dive

SGLang Deep Dive

SGLang Deep Dive: A Full-Stack Tour of a High-Performance LLM Serving System

This article is based on the current SGLang repository and documentation. It walks through SGLang as a full serving stack: frontend language, OpenAI-compatible APIs, runtime architecture, scheduling, KV-cache management, model execution, custom kernels, model gateway, distributed deployment, observability, benchmarks, and tests.

1. What SGLang Is

SGLang is a high-performance serving framework for large language models, multimodal models, embedding models, reward models, and diffusion-style image/video generation models. It is not just a thin wrapper around model.generate(). It is a full inference system: APIs, scheduling, tokenization, KV-cache management, model execution, distributed parallelism, custom kernels, production observability, and cluster routing.

In one sentence:

SGLang turns modern model inference from “the model can run” into “the model can serve real traffic efficiently, reliably, and at scale.”

At a high level, SGLang provides:

Area Capabilities
Serving interfaces Native /generate, OpenAI-compatible APIs, Ollama-compatible APIs, gRPC, offline engine, Python API
Runtime performance RadixAttention, continuous batching, chunked prefill, paged attention, CUDA graphs, overlap scheduling, speculative decoding
Cache system GPU KV cache, RadixCache, HiCache, L3 distributed KV storage, cache-aware scheduling
Parallelism Tensor parallelism, pipeline parallelism, data parallelism, expert parallelism, DP attention, PD disaggregation
Model support LLMs, VLMs, embeddings, reward models, rerankers, classifiers, diffusion image/video models
Generation control Sampling parameters, stop conditions, logprobs, JSON schema, regex, EBNF, tool parsing, reasoning parsing
Production deployment Rust model gateway, load balancing, health checks, rate limiting, circuit breakers, Prometheus, OpenTelemetry
Low-level acceleration sgl-kernel custom CUDA/HIP/CUTLASS/Triton kernels for attention, MoE, GEMM, quantization, sampling, KV-cache I/O

SGLang as a Serving Stack

A layered system, from application-facing APIs down to specialized GPU kernels.

Applications
Agents, chat products, post-training rollout workers, benchmark clients, internal platforms
API Surface
/generate, OpenAI-compatible /v1/chat/completions, embeddings, rerank, score, gRPC, Ollama-compatible endpoints, Python Engine
Runtime
TokenizerManager, Scheduler, DetokenizerManager, request state, streaming, batching, metrics
Execution
TpModelWorker, ModelRunner, ForwardBatch, attention backends, sampler, grammar constraints, LoRA, quantization
Memory
ReqToTokenPool, TokenToKV pool, RadixCache, HiCache, host/offload/storage backends
Acceleration
sgl-kernel, FlashInfer, Triton, FlashAttention, FlashMLA, CUTLASS, AITER, Wave, platform backends

2. Repository Map

The repository is organized as a full system rather than a single Python package.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
sglang/
├── python/sglang/              # Main Python package
│   ├── lang/                   # SGLang frontend language and interpreter
│   ├── srt/                    # SGLang Runtime, the main LLM serving engine
│   ├── multimodal_gen/         # Diffusion/image/video generation runtime
│   ├── jit_kernel/             # JIT kernels and experimental kernels
│   ├── cli/                    # CLI entrypoints
│   └── bench_*.py              # Common benchmark scripts
├── sgl-kernel/                 # Standalone optimized kernel package
├── sgl-model-gateway/          # Rust gateway for routing/control plane
├── docs/                       # User docs, advanced features, platforms, developer guide
├── examples/                   # Runtime, monitoring, and usage examples
├── benchmark/                  # End-to-end and workload-specific benchmarks
├── test/                       # Manual, registered, unit, and SRT tests
├── scripts/                    # CI, release, playground, conversion, utility scripts
├── docker/                     # Docker, Kubernetes, SageMaker deployment assets
└── 3rdparty/                   # Third-party/platform-specific code

The most important directories are:

Directory Role
python/sglang/lang The frontend language: programmatic prompting, IR, interpreter, backends
python/sglang/srt SGLang Runtime: request handling, scheduling, KV cache, model execution
sgl-kernel Optimized CUDA/HIP/CUTLASS/Torch-extension kernels
sgl-model-gateway Rust-based routing layer for large model fleets
python/sglang/multimodal_gen Diffusion/image/video generation serving runtime
benchmark Performance and workload experiments
test CI and correctness coverage across subsystems

3. Public Interfaces

SGLang exposes several ways to use the system.

3.1 HTTP Server

The common deployment path is:

1
2
3
4
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000

This starts a FastAPI/uvicorn server implemented mainly in:

1
2
3
python/sglang/srt/entrypoints/http_server.py
python/sglang/srt/entrypoints/openai/
python/sglang/srt/entrypoints/ollama/

The server exposes:

Endpoint family Examples
Native SGLang /generate, /encode, /classify
OpenAI-compatible /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/rerank, /v1/score, /v1/tokenize, /v1/detokenize
Ollama-compatible /api/chat, /api/generate, /api/tags, /api/show
Ops/admin /health, /metrics, profiling, cache flush, LoRA load/unload, weight update, pause/resume/abort
Platform integration SageMaker /invocations, Vertex-style route, gRPC mode

3.2 Python Engine

You can also instantiate an engine directly from Python:

1
2
3
4
5
6
7
8
import sglang as sgl

engine = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
out = engine.generate(
    prompt="Explain KV cache in one paragraph.",
    sampling_params={"temperature": 0, "max_new_tokens": 128},
)
print(out)

The Python Engine still launches the runtime components internally. The difference is only that the entrypoint is a Python object rather than an HTTP server.

3.3 SGLang Frontend Language

python/sglang/lang provides the original SGLang programming interface: a small language embedded in Python for writing prompt programs.

It exposes:

  • @sgl.function for defining SGL programs
  • sgl.gen for model generation
  • sgl.select for constrained choice selection
  • role helpers for chat-style prompts
  • sgl.image and sgl.video for multimodal inputs
  • sgl.Runtime for remote endpoints
  • sgl.Engine for local runtime execution

Internally it contains:

1
2
3
4
5
python/sglang/lang/api.py
python/sglang/lang/ir.py
python/sglang/lang/interpreter.py
python/sglang/lang/tracer.py
python/sglang/lang/backend/

Frontend Language Execution

SGL programs are interpreted into backend calls, with optional tracing for common-prefix pre-cache.

@sgl.functionUser writes a Python prompt program.
SGL IRText, gen, select, role, image, video nodes.
InterpreterStreamExecutor runs the program.
BackendRuntimeEndpoint, Engine, OpenAI, Anthropic, LiteLLM.
ProgramStateFinal text, variables, streaming state.

4. The SGLang Runtime: Three Core Processes

The central serving engine lives in:

1
python/sglang/srt/

SGLang Runtime, often called SRT, is built around three major components:

Component Process Main job
TokenizerManager Main process Request intake, templates, tokenization, multimodal preprocessing, request state, streaming
Scheduler Subprocess Waiting/running queues, batching, KV-cache management, GPU worker scheduling
DetokenizerManager Subprocess Incremental token-id to text decoding, stop trimming, response return path

IPC between these processes is done through ZMQ.

Runtime Request Path

The main serving loop is split so CPU-heavy tokenization and detokenization do not block GPU scheduling.

1. TokenizerManagerReceives GenerateReqInput, applies templates, tokenizes text, preprocesses images/audio/video, and tracks streaming request state.
2. SchedulerChooses requests, matches prefix cache, allocates KV slots, builds batches, invokes model workers, and processes generated token IDs.
3. DetokenizerManagerDecodes token IDs into text, trims stop sequences, handles incremental output, and sends BatchStrOutput back to the tokenizer layer.
ClientHTTP, OpenAI, gRPC, Python Engine
TokenizerNormalize and tokenize request
SchedulerBatch, cache, dispatch
ModelRunnerForward, attention, sampling
DetokenizerText streaming or final JSON

4.1 TokenizerManager

Source:

1
python/sglang/srt/managers/tokenizer_manager.py

TokenizerManager is the runtime’s front door. It converts external request objects into tokenized internal messages that the scheduler can handle.

It initializes:

  • ModelConfig
  • tokenizer or multimodal processor
  • IPC channels
  • request state and streaming state
  • request logging, request dump, crash dump
  • weight update state
  • LoRA registry
  • PD disaggregation or encoder disaggregation services
  • metrics collector and watchdog
  • type-based request dispatcher

During request processing it handles:

  • chat template and completion template application
  • text and input-id validation
  • image/audio/video preprocessing
  • conversion into TokenizedGenerateReqInput or TokenizedEmbeddingReqInput
  • ReqState tracking for streaming output
  • timing metrics such as first-token time and finish time

4.2 Scheduler

Source:

1
python/sglang/srt/managers/scheduler.py

The Scheduler is SGLang’s performance center. It is much more than a queue. It coordinates scheduling policy, KV-cache allocation, model workers, speculative decoding, disaggregation, LoRA, metrics, profiling, and distributed execution.

At startup, it initializes:

  • model configuration
  • IPC channels
  • tokenizer or processor
  • MoE and GEMM configuration
  • TpModelWorker
  • optional speculative draft worker
  • cache and memory pools
  • running/waiting queues
  • sessions
  • chunked prefill
  • schedule policy
  • watchdog, memory saver, input blocker, receive skipper
  • profiler
  • prefill/decode disaggregation
  • overlap scheduling
  • deterministic inference settings
  • grammar manager
  • LoRA overlap loader

The class is built through many mixins:

1
2
3
4
5
6
7
8
9
10
11
SchedulerOutputProcessorMixin
SchedulerUpdateWeightsMixin
SchedulerProfilerMixin
SchedulerMetricsMixin
SchedulerDisaggregationDecodeMixin
SchedulerDisaggregationPrefillMixin
SchedulerMultiplexMixin
SchedulerRuntimeCheckerMixin
SchedulerPPMixin
SchedulerDPAttnMixin
SchedulerDllmMixin

That mixin list tells the story: Scheduler is the runtime hub.

4.3 DetokenizerManager

Source:

1
python/sglang/srt/managers/detokenizer_manager.py

DetokenizerManager receives token IDs from Scheduler, maintains per-request incremental decode state, batch-decodes IDs into strings, trims stop strings/tokens, and returns BatchStrOutput to TokenizerManager.

Detokenization is separated into its own process because token decoding is CPU work. Keeping it away from Scheduler protects the GPU scheduling loop.

5. Core Batch Data Structures

SGLang explicitly divides batch state into three layers:

1
ScheduleBatch -> ModelWorkerBatch -> ForwardBatch
Structure Owner Where it lives Purpose
ScheduleBatch Scheduler Mostly CPU High-level scheduling data: requests, prefix hits, cache allocations, sampling info
ModelWorkerBatch TpModelWorker CPU/GPU boundary Subset of scheduling state needed by model execution
ForwardBatch ModelRunner Mostly GPU tensors Low-level tensor state consumed by model forward and attention backend

Batch State Compression

As a request gets closer to the GPU, its representation becomes more tensor-heavy and less policy-heavy.

ReqOriginal request object and decoded request metadata.
ScheduleBatchQueue state, prefix match, memory budget, sampling settings.
ModelWorkerBatchWorker-facing subset for model execution.
ForwardBatchGPU tensors: input IDs, seq lens, cache locations, attention backend.

ForwardMode describes what a forward pass is doing:

Mode Meaning
EXTEND Prefill/extend a sequence
DECODE Generate one token per request
MIXED Chunked prefill mixed with decode
IDLE Worker has no local sequence, common with DP attention
TARGET_VERIFY Target-model verification in speculative decoding
DRAFT_EXTEND / DRAFT_EXTEND_V2 Draft-model extension in speculative decoding
PREBUILT Decode worker receives ready KV cache in PD mode
SPLIT_PREFILL Split prefill for PD multiplexing
DLLM_EXTEND Diffusion LLM extension path

6. Scheduling: The Brain of Throughput

Scheduling policy lives in:

1
python/sglang/srt/managers/schedule_policy.py

SGLang supports both cache-aware and cache-agnostic strategies.

6.1 Cache-Aware Policies

Policy Meaning
lpm Longest Prefix Match. Requests with more reusable prefix KV cache are prioritized.
dfs-weight Sorts according to DFS-style radix-tree weights to cluster shared-prefix requests.

6.2 Cache-Agnostic Policies

Policy Meaning
fcfs First come, first served
lof Longest output first
random Random order
routing-key Prioritize requests with routing keys frequent in the running batch

Cache-Aware Scheduling Loop

The scheduler tries to spend GPU time on requests that can reuse existing computation.

Waiting QueueIncoming tokenized requests.
Radix MatchFind reusable prefix KV cache.
Priority SortLPM, DFS-weight, FCFS, LOF, routing-key.
Budget CheckTokens, requests, KV pages, LoRA limits.
Run BatchBuild ScheduleBatch and dispatch.
Prefix reuse KV budget Batching Priority LoRA/spec/PD constraints

This is why SGLang’s Scheduler is central to performance. It must consider:

  • waiting queue and running batch
  • prefix cache hit length
  • KV-cache capacity
  • maximum prefill tokens
  • maximum running requests
  • LoRA count per batch
  • chunked prefill
  • priority scheduling
  • speculative decoding state
  • PD disaggregation state
  • distributed rank behavior

7. KV Cache, Memory Pools, and RadixAttention

LLM inference has two very different phases:

  1. Prefill: process the prompt and produce KV cache. This is compute-heavy.
  2. Decode: repeatedly generate one token while reading previous KV cache. This is memory-bandwidth-heavy.

SGLang optimizes both phases with memory pools and RadixAttention.

7.1 Two-Level Memory Pool

The memory pool design is described directly in memory_pool.py: ReqToTokenPool maps a request to token locations, while token-to-KV allocators manage physical KV-cache indices.

KV Memory Addressing

The runtime separates request identity from physical KV-cache storage.

Request SlotRequest ID and logical token positions.
ReqToTokenPoolMaps request positions to KV indices.
TokenToKV AllocatorAllocates and frees physical KV pages.
KV Cache TensorsPhysical K/V tensors consumed by attention.

Conceptually:

1
req_to_token[request_slot, token_position] = physical_kv_cache_index

7.2 RadixCache

Source:

1
python/sglang/srt/mem_cache/radix_cache.py

RadixCache is a radix tree for reusable prefix KV cache. Each node represents a span of tokens and stores metadata about whether the KV for that span exists on device, host, or storage.

Node metadata includes:

  • token key
  • device value
  • host value
  • page hashes
  • parent and children
  • lock/reference counters
  • hit count
  • last access time
  • creation time
  • eviction priority

RadixCache Prefix Reuse

Shared prompt prefixes become shared KV-cache paths.

RootThe empty prefix. Every request starts here.
System PromptShared policy, role, or instruction prefix.
User PrefixRepeated task pattern such as summarize, translate, classify.
Document ARequest-specific suffix reuses all previous nodes.
Document BAnother suffix shares the same system and task prefix.
Eviction PolicyLRU, LFU, FIFO/FILO, MRU, priority-aware policies.

For workloads such as RAG, multi-turn chat, long-context QA, and batch evaluation, this prefix reuse can save substantial prefill compute.

7.3 RadixAttention

Source:

1
python/sglang/srt/layers/radix_attention.py

RadixAttention is the model-layer attention module. It reshapes Q/K/V, optionally writes K/V into cache, and dispatches actual attention computation to the selected attention backend through forward_batch.attn_backend.

Important point: RadixAttention is not one fixed kernel. It is a cache-aware attention abstraction that can use FlashInfer, Triton, FlashAttention, FlashMLA, CUTLASS MLA, TensorRT-LLM, AITER, Wave, Ascend, NSA, or other backends depending on model and platform.

8. HiCache: Hierarchical KV Cache

RadixAttention uses idle GPU memory to cache reusable prefix KV. HiCache extends the same idea into a three-level hierarchy:

Level Storage Role
L1 GPU memory Fastest, smallest, local to an inference instance
L2 Host memory Larger local capacity
L3 Distributed storage Cluster-wide sharing and much larger capacity

HiCache: L1/L2/L3 KV Cache

HiCache expands prefix reuse beyond GPU memory and across instances.

Local Match
New request tokens are matched against HiRadixTree metadata for L1 GPU and L2 host cache hits.
L1 GPU
Hottest KV pages. Directly consumed by attention kernels.
L2 Host
Larger local KV pool. Can feed GPU via direct copy or GPU-assisted I/O kernels.
L3 Storage
Mooncake, HF3FS, NIXL, AIBrix KVCache, LMCache, file or dynamic backends.
Write-Back
New or hot KV pages are written to lower tiers using write-through, selective write-through, or write-back.

HiCache operations:

  • Local match: traverse local HiRadixTree metadata.
  • Prefetch: query L3 and load useful KV pages into local memory.
  • Write-back: move new or hot KV data from L1 to L2/L3.
  • Multi-rank synchronization: use collectives so TP ranks agree on hit lengths.
  • Transfer optimization: use page-first layouts and GPU-assisted I/O kernels.

HiCache is especially useful for:

  • multi-turn chat
  • repeated system prompts
  • long-context QA
  • multi-document QA
  • RAG workloads
  • cluster-level KV sharing across model instances

9. Model Execution: TpModelWorker and ModelRunner

Model execution is centered around:

1
2
3
python/sglang/srt/managers/tp_worker.py
python/sglang/srt/model_executor/model_runner.py
python/sglang/srt/model_executor/forward_batch_info.py

Scheduler does not directly call the model. It sends work to TpModelWorker, which converts ModelWorkerBatch into ForwardBatch and calls ModelRunner.

ModelRunner Initialization

ModelRunner owns the low-level device execution context.

Distributed InitTP, PP, DP, EP groups and backend setup.
Load ModelHF/local/remote formats, dtype, model-specific adjustment.
OptimizationQuantization, LoRA, offloaders, torchao, expert metadata.
MemoryKV dtype, ReqToTokenPool, TokenToKV pool, max tokens.
ExecutionAttention backend, kernel warmup, CUDA graph, piecewise graph.

ModelRunner is responsible for:

  • distributed environment initialization
  • tensor/pipeline/expert/data parallel group setup
  • model loading
  • quantization setup
  • LoRA manager
  • CPU/offload integration
  • KV-cache dtype and memory pool setup
  • attention backend initialization
  • CUDA graph and CPU graph runners
  • kernel warmup
  • weight update pathways
  • remote-instance transfer engine integration
  • expert location and expert distribution tracking
  • forward paths for decode, extend, idle, split prefill, and more

9.1 Attention Backend Registry

Source:

1
python/sglang/srt/layers/attention/attention_registry.py

SGLang registers attention backends by name:

Backend Purpose
flashinfer Default high-performance backend, MHA and MLA paths
triton Triton attention implementation and fallback
torch_native Compatibility path
flex_attention PyTorch FlexAttention
fa3 / fa4 FlashAttention v3/v4
flashmla MLA-specific path
cutlass_mla CUTLASS MLA backend
trtllm_mla / trtllm_mha TensorRT-LLM attention backend
aiter / wave AMD/ROCm-oriented optimized paths
ascend Ascend NPU backend
nsa Native Sparse Attention
intel_amx Intel CPU AMX backend
dual_chunk_flash_attn Long-context dual-chunk attention

This registry is one of the reasons SGLang can support many models and hardware platforms without forcing every model through one attention implementation.

10. Model Support and Registry

Native model implementations live in:

1
python/sglang/srt/models/

The repository includes implementations for many model families:

  • Llama, Llama4, MLLama
  • Qwen, Qwen2, Qwen3, Qwen-VL, Qwen-Omni
  • DeepSeek, DeepSeek-VL, DeepSeek-OCR, DeepSeek NextN
  • Kimi, Kimi Linear, Kimi VL
  • GLM, GLM-V, GLM-MoE
  • Gemma and Gemma reward variants
  • Mistral, Mixtral, Ministral
  • GPT-OSS, GPT2, GPT-J, StarCoder
  • Phi and Phi4MM
  • InternVL, LLaVA, NVILA, Pixtral
  • BERT, RoBERTa, embedding, reward, classification models
  • Mamba, hybrid linear attention, MoE, MTP, EAGLE draft models

The registry is implemented in:

1
python/sglang/srt/models/registry.py

It scans sglang.srt.models, imports modules with an EntryClass, and registers architectures by class name. At load time:

  1. SGLang reads architectures from Hugging Face config.
  2. It normalizes and looks up supported model classes.
  3. If a native implementation exists, it uses that implementation.
  4. Otherwise, it can fall back to TransformersForCausalLM.
  5. External model packages can be registered through an environment variable.

Model Resolution Path

HF ConfigRead architectures from model config.
ModelRegistryLookup native EntryClass.
Native or FallbackSGLang model class or TransformersForCausalLM.
LoaderLoad weights from configured format.
ModelRunnerExecute forward passes.

11. Sampling, Structured Outputs, Tools, and Reasoning

Related directories:

1
2
3
4
python/sglang/srt/sampling/
python/sglang/srt/constrained/
python/sglang/srt/function_call/
python/sglang/srt/parser/

11.1 Sampling

SGLang supports the common generation controls:

  • temperature
  • top-p
  • top-k
  • min-p
  • frequency penalty
  • presence penalty
  • stop strings
  • stop token IDs
  • stop regex
  • max/min tokens
  • logprobs and top logprobs
  • ignore EOS
  • custom logit processor

Sampling backends include FlashInfer, PyTorch, and platform-specific paths.

11.2 Structured Outputs

SGLang can constrain generation with:

  • JSON schema
  • regular expression
  • EBNF grammar

Grammar backends:

Backend Support
XGrammar Default. JSON schema, regex, EBNF
Outlines JSON schema, regex
llguidance JSON schema, regex, EBNF

Structured Generation

The grammar backend restricts the sampler to valid next tokens.

PromptUser asks for JSON, regex, or grammar-constrained output.
Grammar BackendXGrammar, Outlines, or llguidance.
Token MaskAllowed next-token set.
SamplerSamples only valid tokens.
Valid OutputJSON, regex match, or EBNF-conforming text.

11.3 Tool Calls and Reasoning Parsers

SGLang includes model-specific detectors and parsers for tool calls and reasoning content. This matters because different models encode tool calls and chain-of-thought style reasoning in different formats.

The function-call parser area includes formats for DeepSeek, Qwen, Kimi, GLM, GPT-OSS, Mistral, Llama, Step, and more. The reasoning parser can separate hidden or explicit reasoning content from user-facing assistant output.

12. Speculative Decoding

Speculative decoding uses a cheaper draft path to propose tokens and the target model to verify them. When drafts are accepted, the target model performs fewer sequential decode steps.

SGLang supports:

Method Draft source When to use
EAGLE-2 EAGLE draft model Strong general default
EAGLE-3 EAGLE3 draft model Higher throughput when supported
MTP Built-in multi-token prediction heads Models with native MTP layers
STANDALONE Separate smaller draft LLM When a good smaller draft model is available
NGRAM N-gram candidates from previous tokens No extra model, CUDA-only path
SpecV2 Experimental overlap scheduler path Aggressive overlap scheduling

Speculative Decoding Loop

Draft WorkerProposes a tree or sequence of candidate tokens.
Target WorkerVerifies candidates in a batched target pass.
Accept/RejectAccepted tokens are committed; rejection falls back to target sample.
KV UpdateRequest state and KV cache move forward.

In code, Scheduler checks SpeculativeAlgorithm and may launch a draft worker alongside the target worker. Users control the behavior with arguments such as:

  • --speculative-algorithm
  • --speculative-draft-model-path
  • --speculative-num-steps
  • --speculative-eagle-topk
  • --speculative-num-draft-tokens
  • --speculative-token-map

13. Prefill/Decode Disaggregation

LLM inference has two phases with different bottlenecks:

  • Prefill is compute-heavy.
  • Decode is memory-bandwidth-heavy.

In a unified engine, prefill batches can interrupt decode batches and hurt latency. PD disaggregation separates them:

  • Prefill workers process prompts and produce KV cache.
  • Decode workers receive KV cache and perform autoregressive generation.
  • Transfer backends move KV cache between worker groups.
  • Gateway/router coordinates request flow.

Supported transfer backends include Mooncake, NIXL, Ascend, fake, and MoRI paths.

PD Disaggregation Topology

Compute-heavy prefill and memory-heavy decode are scaled independently.

ClientSends OpenAI/native request to the gateway.
GatewaySelects prefill and decode workers using cache/load-aware policies.
Prefill WorkersRun prompt prefill and generate KV cache.
Decode WorkersReceive KV cache and stream generated tokens.
KV cache transfer is handled by backends such as Mooncake or NIXL. This lets prefill and decode fleets scale differently and reduces interference between the two phases.

This is particularly important for large models, long contexts, high concurrency, and rack-scale deployments.

14. Parallelism: TP, PP, DP, EP, and DP Attention

SGLang supports several parallel dimensions:

Parallelism Purpose
Tensor Parallelism Split tensors, attention heads, MLP matrices across GPUs
Pipeline Parallelism Split model layers into stages
Data Parallelism Run multiple serving replicas
Expert Parallelism Distribute MoE experts across ranks
DP Attention Specialized data-parallel attention path for large-model decode
Context Parallel / NSA Long-context and sparse-attention optimizations

Parallel Execution Axes

TPShard compute within layers.
PPPipeline different layer ranges.
DPServe independent request shards or replicas.
EPPlace MoE experts across ranks.
DP AttentionSeparate attention parallelism from base TP in selected configurations.
CollectivesNCCL, RCCL, HCCL, XCCL, custom allreduce, MSCCL++ and related backends.

ModelRunner initializes distributed environment and model-parallel groups. Scheduler uses rank information to decide which rank receives requests, sends outputs, synchronizes cache metadata, and handles pipeline proxy tensors.

15. LoRA, Weight Updates, and RL/Post-Training

15.1 LoRA

Related directory:

1
python/sglang/srt/lora/

SGLang supports:

  • LoRA enabled at launch
  • dynamic LoRA adapter loading
  • loading LoRA adapters from tensors
  • unloading adapters
  • multi-LoRA batching
  • LoRA overlap loading
  • LoRA eviction policy

HTTP endpoints include:

  • /load_lora_adapter
  • /load_lora_adapter_from_tensors
  • /unload_lora_adapter

15.2 Weight Updates

SGLang exposes many weight update paths:

  • /update_weights_from_disk
  • /init_weights_update_group
  • /destroy_weights_update_group
  • /update_weights_from_tensor
  • /update_weights_from_distributed
  • /update_weights_from_ipc
  • /update_weight_version
  • /get_weights_by_name

This is critical for RL rollout, post-training, checkpoint-engine integration, online weight refresh, and distributed model update workflows.

15.3 RL and Post-Training Backbone

The project positions SGLang as a rollout backend for post-training frameworks. That is supported by:

  • weight sync
  • checkpoint engine integration
  • distributed and IPC weight updates
  • request replay
  • memory release/resume
  • metrics and tracing

16. Quantization

Docs:

1
2
docs/advanced_features/quantization.md
python/sglang/srt/layers/quantization/

SGLang supports offline and online quantization:

Mode Meaning Recommendation
Offline quantization Load pre-quantized weights Recommended for production
Online quantization Quantize during startup Convenient but slower startup and higher memory pressure

Supported families include:

  • AWQ
  • GPTQ
  • Marlin / GPTQ Marlin / AWQ Marlin
  • FP8 / MXFP8
  • FP4 / MXFP4 / NVFP4
  • W8A8 INT8 / W8A8 FP8
  • bitsandbytes
  • GGUF
  • ModelOpt FP8/FP4
  • AutoRound
  • compressed tensors
  • MoE-specific formats such as WNA16 and W4AFP8

High-performance quantized execution can route into sgl-kernel, CUTLASS, Triton, FlashInfer, or platform backends.

17. sgl-kernel: The Low-Level Acceleration Library

sgl-kernel is a standalone package:

1
2
3
4
5
6
7
sgl-kernel/
├── csrc/                  # CUDA/HIP/C++ extension sources
├── include/               # Kernel op headers
├── python/sgl_kernel/     # Python bindings
├── tests/                 # Kernel tests
├── benchmark/             # Kernel benchmarks
└── CMakeLists.txt

It provides optimized primitives for LLM and VLM inference engines.

Kernel family Examples
Attention Flash attention, FlashMLA, sparse flash attention, merge state
GEMM FP8/FP4/INT8 GEMM, blockwise GEMM, BMM FP8, Marlin, CUTLASS
MoE Top-k, fused gate, MoE align, FP8/FP4 blockwise MoE, Kimi K2 MoE
Quantization AWQ/GPTQ/FP8/FP4 quant/dequant, per-token/per-tensor/per-group quant
Sampling Top-k, sampling, speculative sampling, token bitmask
KV cache I/O KV cache transfer, copy, store cache
Distributed Custom allreduce, quick allreduce, MSCCL++
Norm/activation/RoPE RMSNorm, activation, rotary embedding, fused QK norm RoPE
Mamba/SSM Causal convolution and Mamba-related kernels

Kernel Integration Path

Python RuntimeSGLang calls Python wrapper or custom op.
Torch ExtensionSchema and dispatch registration.
C++ BindingAdapts PyTorch types and launches kernels.
CUDA/HIP/CUTLASSOptimized device code.
GPULow-latency execution path.

New kernels follow a clear path: implement source, expose headers, register torch extension, update CMake, add Python binding, add tests and benchmarks.

18. SGLang Model Gateway

The gateway is a Rust project:

1
sgl-model-gateway/

It turns a set of model workers into an operational model-serving fleet.

18.1 Control Plane

The control plane includes:

  • worker manager
  • worker registry
  • worker service
  • job queue
  • health checker
  • load monitor
  • tokenizer registry
  • Kubernetes service discovery
  • WASM module registration
  • MCP registration

18.2 Data Plane

The data plane supports:

  • regular HTTP routing
  • HTTP PD routing
  • gRPC routing
  • gRPC PD routing
  • OpenAI-compatible backend proxy
  • multi-model inference gateway mode
  • tokenize/detokenize/parser endpoints
  • conversation and response history connectors

Model Gateway Architecture

Rust gateway separates fleet control from request routing.

Control Plane
Worker ManagerRegisters, validates, and removes workers.
Health CheckerTracks readiness and circuit-breaker state.
Load MonitorFeeds cache-aware and load-aware policies.
Service DiscoveryKubernetes and dynamic registry updates.
Data Plane
HTTP RouterRegular SGLang and OpenAI-compatible traffic.
PD RouterCoordinates prefill and decode workers.
gRPC RouterRust tokenizer, reasoning parser, tool parser pipeline.
OpenAI ProxyRoutes to external OpenAI-compatible providers.

18.3 Load Balancing and Reliability

Gateway policies include:

  • random
  • round robin
  • cache-aware
  • power-of-two
  • bucket
  • prefix hash
  • consistent hashing
  • manual
  • tree-like policies

Reliability features include:

  • retry with jitter
  • per-worker circuit breaker
  • token-bucket rate limiting
  • request queueing
  • health checks
  • Prometheus metrics
  • OpenTelemetry tracing
  • structured logs
  • request ID propagation

19. Diffusion and Multimodal Generation

The diffusion runtime lives in:

1
python/sglang/multimodal_gen/

SGLang Diffusion targets accelerated image/video generation. It supports:

  • Wan and FastWan
  • Hunyuan
  • Qwen-Image and Qwen-Image-Edit
  • Flux
  • Z-Image
  • GLM-Image
  • NVIDIA GPUs
  • AMD ROCm
  • Moore Threads MUSA
  • OpenAI-compatible API
  • CLI
  • Python SDK
  • LoRA

The runtime has its own structure:

1
2
3
4
5
6
7
8
9
10
runtime/
├── entrypoints/
├── managers/
├── models/
├── pipelines/
├── layers/
├── loader/
├── distributed/
├── cache/
└── platforms/

This shows that SGLang is expanding beyond LLM serving into a broader multimodal serving platform.

20. Observability

Docs:

1
2
3
4
docs/advanced_features/observability.md
docs/references/production_metrics.md
docs/references/production_request_trace.md
examples/monitoring/

SGLang supports:

  • Prometheus metrics through --enable-metrics
  • Grafana dashboard examples
  • OpenTelemetry tracing
  • request logging
  • request dump
  • request replay
  • crash dump
  • crash replay
  • function timers
  • CPU monitor
  • tokenizer/scheduler/detokenizer metrics

Important metrics include:

Metric Meaning
prompt_tokens_total Number of prefill tokens processed
generation_tokens_total Number of generated tokens
token_usage KV token usage
cache_hit_rate Prefix/cache hit rate
time_to_first_token_seconds TTFT
time_per_output_token_seconds TPOT
e2e_request_latency_seconds End-to-end latency
num_running_reqs Number of running requests
num_queue_reqs Waiting queue size
gen_throughput Generation throughput in token/s

Production Visibility

MetricsPrometheus endpoint and Grafana dashboard.
TracingOpenTelemetry request flow across runtime components.
ReplayRequest dump and crash dump replay for debugging.
LogsRequest logging, structured gateway logs, watchdog diagnostics.

21. Benchmarks and Tests

SGLang has broad benchmark coverage, not just one throughput script.

Benchmark areas include:

  • serving benchmark
  • batch benchmark
  • tokenizer benchmark
  • HiCache benchmark
  • JSON schema, regex, and jump-forward decoding
  • LoRA benchmark
  • MTBench
  • MMLU
  • GSM8K
  • HellaSwag
  • BoolQ
  • CEval
  • MMMU
  • LLaVA bench
  • multi-turn chat
  • multi-document QA
  • reasoning benchmark
  • DeepSeek V3
  • GPT-OSS
  • prefill-only embedding and scoring
  • kernel and attention sink benchmarks

Test layout:

Directory Purpose
test/unit Unit tests
test/srt SRT subsystem tests
test/registered CI-registered functional coverage
test/manual Manual and platform-specific tests
sgl-kernel/tests Kernel-level tests

test/registered covers many categories: OpenAI server, scheduler, radix cache, disaggregation, distributed execution, HiCache, LoRA, quantization, kernels, models, VLM, metrics, parsers, function calls, speculative decoding, performance, and stress tests.

22. End-to-End Request Flow

The following diagram summarizes a single generation request.

End-to-End Generation Path

From an OpenAI-style chat request to streamed text output.

1. APIFastAPI route or Python Engine receives request.
2. ProtocolValidate request and build GenerateReqInput.
3. TokenizerTemplate, tokenize, preprocess multimodal data.
4. SchedulerQueue, prefix match, allocate KV, build ScheduleBatch.
5. WorkerBuild ForwardBatch and run ModelRunner.forward.
6. AttentionRadixAttention dispatches to backend and updates KV cache.
7. SamplingLogits processor, grammar constraints, sampler.
8. DetokenizerDecode token IDs and trim stop sequences.
9. ResponseTokenizerManager returns JSON or SSE stream.

This is why SGLang is a large codebase: high-performance serving is not a single model call. It is a request-lifecycle system.

23. Component Index

Component / directory Function
python/sglang/lang Frontend language, IR, interpreter, backend abstraction
python/sglang/cli CLI commands such as serve and generate
python/sglang/launch_server.py Server launch entrypoint
srt/entrypoints/http_server.py FastAPI server, OpenAI/Ollama/native/admin routes
srt/entrypoints/engine.py Python Engine, launches tokenizer/scheduler/detokenizer
srt/server_args.py Server arguments, backend choices, deployment/performance switches
srt/managers/tokenizer_manager.py Tokenization, request state, multimodal preprocessing, streaming
srt/managers/scheduler.py Queues, batches, cache, workers, scheduling, parallelism, PD, speculative decoding
srt/managers/detokenizer_manager.py Token IDs to incremental text
srt/managers/schedule_policy.py LPM, DFS-weight, FCFS, LOF, random, routing-key
srt/managers/schedule_batch.py Req, ScheduleBatch, ModelWorkerBatch
srt/managers/tp_worker.py Tensor-parallel model worker
srt/model_executor/model_runner.py Model loading, distributed setup, attention backend, forward, CUDA graph
srt/model_executor/forward_batch_info.py ForwardBatch and ForwardMode
srt/mem_cache Memory pools, RadixCache, HiCache, storage backends, sparse cache
srt/layers/attention Attention backends
srt/layers/quantization Quantization configuration and kernel integration
srt/layers/moe MoE layers, experts, routing, kernel integration
srt/models Native model implementations and registry
srt/model_loader Weight loading, format adapters, remote loaders
srt/sampling Sampling parameters and sampling backends
srt/constrained Grammar backends and structured outputs
srt/function_call Tool-call detectors and parsers
srt/parser Reasoning parsers
srt/speculative EAGLE, MTP, standalone, NGRAM speculative decoding
srt/disaggregation Prefill/decode disaggregation and KV transfer
srt/distributed TP/PP/DP/EP communication and parallel state
srt/lora Dynamic LoRA loading, batching, management
srt/metrics / srt/tracing Prometheus, timers, CPU monitor, OpenTelemetry tracing
sgl-kernel Optimized kernel package
sgl-model-gateway Rust gateway, routing, control plane, load balancing, reliability
multimodal_gen Diffusion/image/video generation runtime
benchmark Real workload and performance benchmarks
test Unit, registered, platform, manual, and kernel tests
docs Installation, usage, advanced features, platform, developer docs

24. Final Takeaways

SGLang has four defining design traits.

First, it is a production serving engine. The presence of OpenAI-compatible APIs, gRPC, metrics, tracing, gateway routing, rate limiting, request replay, Docker, Kubernetes, SageMaker, CI, and benchmarks makes it infrastructure, not just a model wrapper.

Second, its performance story is system-level. RadixCache, schedule policy, chunked prefill, CUDA graphs, overlap scheduling, PD disaggregation, HiCache, attention backends, MoE communication, speculative decoding, and custom kernels work together.

Third, it is built for broad model and hardware coverage. The repository supports many text, vision-language, embedding, reward, classification, rerank, and diffusion model families, while targeting NVIDIA, AMD, CPU, TPU, Ascend, XPU, MUSA, and other platforms.

Fourth, it is moving from single-server inference to cluster-level model infrastructure. The Rust gateway, PD disaggregation, HiCache L3 storage, service discovery, history connectors, MCP integration, and inference gateway mode all point in that direction.

If SGLang were a machine, sglang.lang would be the operator panel, HTTP/OpenAI/gRPC would be the public interface, Tokenizer/Scheduler/Detokenizer would be the control loop, RadixCache/HiCache would be the memory system, ModelRunner and attention backends would be the engine, sgl-kernel would be the precision-machined parts, and sgl-model-gateway would be the traffic tower coordinating a fleet.

That is SGLang’s core value: it turns model inference into a fast, observable, extensible, production-ready serving system.

This post is licensed under CC BY 4.0 by the author.

Trending Tags