# SGLang Memory Management & KV Cache (Part 1)

## SGLang Memory Management System

### TL;DR
- Key–Value (KV) cache entries computed for earlier tokens can be reused during subsequent generation steps, avoiding recomputation and improving inference efficiency.
- SGLang maintains mapping tables that translate each request's token positions into the KV-cache indices of tensors stored in the memory pool.
- The mapping mechanism supports multiple attention backends (e.g., MHA, MLA, NSA).
- SGLang provides several cache backends to meet different usage scenarios, performance goals, and implementation constraints.
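To make the mapping mechanism concrete, here is a minimal Python sketch of the two-level indexing described above: a request-to-token table that maps each request's token positions to slots in a flat token-level KV pool. This is a deliberate simplification for illustration; the class and attribute names mirror SGLang's `req_to_token` / `token_to_kv_pool` concepts, but real SGLang stores these tables as GPU tensors with different signatures.

```python
class ReqToTokenPool:
    """Maps a request slot to the KV-pool indices of its tokens (simplified sketch)."""

    def __init__(self, max_reqs: int, max_context_len: int):
        # req_to_token[req_idx][token_pos] -> index into the token-level KV pool
        self.req_to_token = [[-1] * max_context_len for _ in range(max_reqs)]
        self.free_slots = list(range(max_reqs))

    def alloc(self) -> int:
        return self.free_slots.pop()

    def free(self, req_idx: int) -> None:
        self.free_slots.append(req_idx)


class TokenToKVPool:
    """Flat pool of KV-cache slots; each slot would hold one token's K/V tensors."""

    def __init__(self, size: int):
        self.free_slots = list(range(size))
        self.kv_data = [None] * size  # placeholder for per-token K/V tensors

    def alloc(self, n: int) -> list[int]:
        out = self.free_slots[:n]
        del self.free_slots[:n]
        return out


# Usage: place a 4-token request into the pools.
req_pool = ReqToTokenPool(max_reqs=8, max_context_len=16)
kv_pool = TokenToKVPool(size=64)

req_idx = req_pool.alloc()
kv_indices = kv_pool.alloc(4)
for pos, kv_idx in enumerate(kv_indices):
    req_pool.req_to_token[req_idx][pos] = kv_idx
```

The indirection is what makes prefix reuse possible: two requests sharing a prefix can point their early token positions at the same KV-pool slots instead of recomputing them.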
| Cache Class | Module | When to Use / Condition |
|---|---|---|
| RadixCache | mem_cache/radix_cache.py | Default |
| ChunkCache | mem_cache/chunk_cache.py | disable_radix_cache=True |
| SWAChunkCache | mem_cache/chunk_cache.py | disable_radix_cache=True + sliding window |
| HiRadixCache | mem_cache/hiradix_cache.py | enable_hierarchical_cache=True |
| SWARadixCache | mem_cache/swa_radix_cache.py | Sliding-window attention models |
| MambaRadixCache | mem_cache/mamba_radix_cache.py | Mamba / SSM-hybrid models |
| LMCRadixCache | mem_cache/storage/lmcache/ | enable_lmcache=True |
| RadixCacheCpp | mem_cache/radix_cache_cpp.py | Experimental C++ radix tree |
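The conditions in the table can be read as a dispatch over server configuration. The sketch below shows one way such a selection might look; the `ServerArgs` fields and the precedence among flags are illustrative assumptions for this post, not SGLang's exact API.

```python
from dataclasses import dataclass


@dataclass
class ServerArgs:
    # Hypothetical configuration fields mirroring the table's conditions.
    disable_radix_cache: bool = False
    enable_hierarchical_cache: bool = False
    enable_lmcache: bool = False
    is_sliding_window_model: bool = False
    is_mamba_hybrid_model: bool = False


def select_cache_backend(args: ServerArgs) -> str:
    """Pick a cache class name from the configuration (assumed precedence)."""
    if args.enable_lmcache:
        return "LMCRadixCache"
    if args.enable_hierarchical_cache:
        return "HiRadixCache"
    if args.disable_radix_cache:
        return "SWAChunkCache" if args.is_sliding_window_model else "ChunkCache"
    if args.is_mamba_hybrid_model:
        return "MambaRadixCache"
    if args.is_sliding_window_model:
        return "SWARadixCache"
    return "RadixCache"  # default


print(select_cache_backend(ServerArgs()))  # -> RadixCache
```

With no flags set, the default `RadixCache` is chosen, matching the first row of the table.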