Ren Zezhong

TurboQuant: Near-Optimal Online Vector Quantization

Fri, 27 Mar 2026 00:00:00 +0000

TL;DR

TurboQuant is a near-optimal online vector quantization method for compressing KV cache during inference.

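It applies a random rotation to the input vector (QR-based in the paper; replaced by a fast Walsh–Hadamard transform, FWHT, in the SGLang PR). Under a uniformly random rotation, a unit-norm input becomes uniformly distributed on the unit sphere, so each rotated coordinate follows the same (shifted and scaled) Beta distribution regardless of the original data. TurboQuant then quantizes each coordinate independently with a scalar quantizer optimized for MSE, adds a QJL term to correct inner-product estimates, stores the quantized indices in the KV cache, and reconstructs the vector at retrieval time by looking up the centroids and applying the inverse rotation.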
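
For reference, the coordinate distribution mentioned above has a closed form. This is a standard fact about the uniform distribution on the sphere, written here in my own notation rather than the paper's. If \(y\) is uniform on the unit sphere in \(\mathbb{R}^d\), each coordinate \(y_i\) has density

\[
f_d(t) = \frac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma\big((d-1)/2\big)} \left(1 - t^2\right)^{(d-3)/2}, \qquad t \in [-1, 1],
\]

which is equivalent to \((y_i + 1)/2 \sim \mathrm{Beta}\big(\tfrac{d-1}{2}, \tfrac{d-1}{2}\big)\), with \(\mathbb{E}[y_i] = 0\) and \(\mathrm{Var}(y_i) = 1/d\).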

Key points

Random rotation reduces sensitivity to the original data distribution, making the rotated vectors more uniformly distributed on the sphere; each coordinate follows a Beta distribution.
As the dimension d increases, the coordinate distribution concentrates toward N(0, 1/d), and coordinates become approximately independent (in a probabilistic sense).
Because the coordinates are nearly independent, TurboQuant can quantize each one separately with a scalar quantizer (e.g., Lloyd–Max) whose centroids are near-optimal for the known coordinate distribution.
QJL is introduced to mitigate inner-product errors beyond MSE (note: the current PR has not fully integrated this part yet).
TurboQuant performs better when the model’s K-norm is close to 1, because it assumes the input vectors lie on (or near) the unit hypersphere, i.e., \( \|x\|_2^2 = 1 \).
This helps explain why Mistral-7B (≈ 1.3×) performs better than Qwen3 (≈ 2.1×–2.4×) under TurboQuant.
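
The key points above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's or the PR's implementation: it uses a QR-based rotation, fits the scalar codebook with plain Lloyd iterations on samples from N(0, 1/d) (the large-d approximation) instead of the exact Beta density, and leaves out the QJL term entirely. All function names are my own.

```python
import numpy as np

def random_rotation(d, rng):
    # QR of a Gaussian matrix yields a random orthogonal matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs using diag(r) so the rotation is uniformly distributed.
    return q * np.sign(np.diag(r))

def lloyd_max_codebook(samples, bits, iters=30):
    # 1-D Lloyd iterations: alternate nearest-centroid assignment and mean update.
    k = 2 ** bits
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        boundaries = (centroids[:-1] + centroids[1:]) / 2
        idx = np.searchsorted(boundaries, samples)
        for j in range(k):
            members = samples[idx == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids

def quantize(x, rotation, centroids):
    # Rotate, then map every coordinate to the index of its nearest centroid.
    y = rotation @ x
    boundaries = (centroids[:-1] + centroids[1:]) / 2
    return np.searchsorted(boundaries, y).astype(np.uint8)

def dequantize(idx, rotation, centroids):
    # Look up centroids, then undo the rotation (inverse = transpose).
    return rotation.T @ centroids[idx]

# Usage: quantize one unit-norm vector at 4 bits per coordinate.
rng = np.random.default_rng(0)
d, bits = 128, 4
rotation = random_rotation(d, rng)
codebook = lloyd_max_codebook(rng.standard_normal(100_000) / np.sqrt(d), bits)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)  # the method assumes unit-norm inputs
x_hat = dequantize(quantize(x, rotation, codebook), rotation, codebook)
print("squared reconstruction error:", float(np.sum((x - x_hat) ** 2)))
```

Note that the codebook depends only on d and the bit width, not on the data, which is what makes the scheme usable online: nothing has to be fitted to the keys and values seen at inference time.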

Note: This is my personal interpretation and summary. I recommend reading the original paper and the PR side by side.
In my view, AI infrastructure is one of the most tightly coupled areas between academia and industry: mathematical elegance is like an artist’s ideas, while AI infra is the craft—brushes, canvas, and technique. Its impact will likely keep expanding, much like the ripple effects of the Industrial Revolution.

SGLang Memory Management & KV Cache (Part 1)

Wed, 11 Mar 2026 00:00:00 +0000

SGLang Memory Management System

TL;DR

Key–Value (KV) cache entries computed for earlier tokens can be reused at every later generation step, so keys and values for cached tokens do not have to be recomputed.
SGLang maintains mapping tables that locate where each token's KV tensors are stored in the memory pool.
The mapping mechanism supports multiple attention backends (e.g., MHA, MLA, NSA).
SGLang provides several cache backends to meet different usage scenarios, performance goals, and implementation constraints.
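
To make the mapping idea concrete, here is a toy two-level lookup in NumPy: a per-request table points at token slots, and the token slots index rows of a flat KV buffer. The class `ToyKVPool` and all of its fields are hypothetical and heavily simplified (single layer, single head, no eviction); it is meant as a mental model, not as SGLang's actual data structures.

```python
import numpy as np

class ToyKVPool:
    """Hypothetical mapping: request -> token slots -> rows of a flat KV buffer."""

    def __init__(self, max_requests, max_context_len, pool_size, head_dim):
        # req_to_token[r, p] = pool slot holding position p of request r (-1 = unused).
        self.req_to_token = np.full((max_requests, max_context_len), -1, dtype=np.int64)
        self.seq_lens = np.zeros(max_requests, dtype=np.int64)
        # Flat storage: one row per token slot.
        self.k_buffer = np.zeros((pool_size, head_dim), dtype=np.float32)
        self.v_buffer = np.zeros((pool_size, head_dim), dtype=np.float32)
        self.free_slots = list(range(pool_size))

    def append(self, req, k, v):
        # Allocate one slot, store K/V there, and record the slot for this request.
        slot = self.free_slots.pop()  # raises IndexError when the pool is full
        self.k_buffer[slot] = k
        self.v_buffer[slot] = v
        self.req_to_token[req, self.seq_lens[req]] = slot
        self.seq_lens[req] += 1
        return slot

    def gather(self, req):
        # Resolve the request's slots, then read K/V rows in sequence order.
        slots = self.req_to_token[req, : self.seq_lens[req]]
        return self.k_buffer[slots], self.v_buffer[slots]

    def release(self, req):
        # Return the request's slots to the free list.
        slots = self.req_to_token[req, : self.seq_lens[req]]
        self.free_slots.extend(slots.tolist())
        self.req_to_token[req, : self.seq_lens[req]] = -1
        self.seq_lens[req] = 0

# Usage: cache three tokens for request 0, then read them back in order.
pool = ToyKVPool(max_requests=2, max_context_len=8, pool_size=16, head_dim=4)
for step in range(3):
    pool.append(0, k=np.full(4, step), v=np.full(4, -step))
keys, values = pool.gather(0)
```

The indirection is the point of the design: tokens of one request do not need to be contiguous in memory, so slots can be allocated and freed individually.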

| Cache Class | Module | When to Use / Condition |
| --- | --- | --- |
| RadixCache | `mem_cache/radix_cache.py` | Default |
| ChunkCache | `mem_cache/chunk_cache.py` | `disable_radix_cache=True` |
| SWAChunkCache | `mem_cache/chunk_cache.py` | `disable_radix_cache=True` + sliding window |
| HiRadixCache | `mem_cache/hiradix_cache.py` | `enable_hierarchical_cache=True` |
| SWARadixCache | `mem_cache/swa_radix_cache.py` | Sliding-window attention models |
| MambaRadixCache | `mem_cache/mamba_radix_cache.py` | Mamba / SSM-hybrid models |
| LMCRadixCache | `mem_cache/storage/lmcache/` | `enable_lmcache=True` |
| RadixCacheCpp | `mem_cache/radix_cache_cpp.py` | Experimental C++ radix tree |
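
One way to read the table above is as a decision function. The sketch below is hypothetical: `CacheOptions` and `pick_cache_class` are names I made up, the three `sliding_window` / `mamba_hybrid` / `experimental_cpp_radix` fields are stand-ins for model and runtime properties, and the precedence between conditions is my assumption. Check the scheduler code for the real order.

```python
from dataclasses import dataclass

@dataclass
class CacheOptions:
    # Flags named in the table above.
    disable_radix_cache: bool = False
    enable_hierarchical_cache: bool = False
    enable_lmcache: bool = False
    # Hypothetical stand-ins for model/runtime properties.
    sliding_window: bool = False
    mamba_hybrid: bool = False
    experimental_cpp_radix: bool = False

def pick_cache_class(opts: CacheOptions) -> str:
    # Assumed precedence: explicit opt-outs first, then special cases, then default.
    if opts.disable_radix_cache:
        return "SWAChunkCache" if opts.sliding_window else "ChunkCache"
    if opts.experimental_cpp_radix:
        return "RadixCacheCpp"
    if opts.enable_hierarchical_cache:
        return "HiRadixCache"
    if opts.sliding_window:
        return "SWARadixCache"
    if opts.mamba_hybrid:
        return "MambaRadixCache"
    if opts.enable_lmcache:
        return "LMCRadixCache"
    return "RadixCache"

# Usage: the default configuration falls through to the radix cache.
print(pick_cache_class(CacheOptions()))
print(pick_cache_class(CacheOptions(disable_radix_cache=True, sliding_window=True)))
```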

2026-03-11 Personal Summary

Wed, 11 Mar 2026 00:00:00 +0000

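A summary of my journey from starting my PhD in 2022 to my fourth year in 2026:

Met a wonderful girlfriend.
Applied for CSC funding to visit EPFL.
Published a paper at NDSS26. It got in entirely thanks to T1 collaborators such as my senior labmate Wang Qinying (王琴应), Zheng Han (郑晗), Mathias Payer, and Brother Chao (超哥); everyone in the lab is very capable and very nice. The one exception is an original collaborator who never managed to earn my respect: may every drink you have from now on taste like plain water, because you do not deserve German beer that good (it really was delicious, both in Hamburg and on the snowy mountaintop; sadly there is no equivalent back home).
My request to negotiate graduation was rejected **. It was my own choice, and honestly it is fine; it gives me the chance to work on AI Infra, so I am all in. Whatever the outcome, I will keep smiling and keep working.
Brought my girlfriend home to meet my parents!
All in on AI Infra. Because my background is security testing, I am starting from system robustness: learning the optimization strategies and how they are implemented, and studying robustness problems. (Currently doing an internship.)

So much happened in 2025; overall, the effort was not wasted.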

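Technical

My biggest gain came from studying the limitations of Syzkaller (Google), the current SOTA fuzzing framework, and building Sysyphuzz, a fuzzing framework that focuses on dynamic low-frequency regions without sacrificing coverage. What this work helped me with most is that, when facing a complex system, I went from scrambling to staying composed. The whole research process can be summarized as: observe, test, analyze, identify the problem, solve the problem. It is more engineering than novelty, which is the kind of work I like. The paper's acceptance was a complete surprise, and I currently hold a 100% first-author acceptance rate at the big-four venues (laughs: 1 accepted out of 1 submitted).

This project gave me the confidence to move into AI Infra. Complex systems are all alike and the underlying knowledge is the same: understand it, debug it, and improve it hands-on.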

Current focus:

SGLang (in progress) – KV caching – Attention backend – PD disaggregation
CS336 (lectures finished; Assignment 1 in progress)

I will write up and share summaries in the Tec section later.

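Life

2022–2023: I started my PhD feeling lost and looking for a way out. Fortunately, my girlfriend was there to support me, and I did not give up; I kept feeling my way forward.
2024: I was awarded the CSC scholarship. With guidance from Prof. Zhang Chao (张超) and Prof. Mathias Payer, I found a direction. With help from Zheng Han (郑晗) and Wang Qinying (王琴应), I learned for the first time how research is done, how a paper is written, what needs to be done, and what needs to be considered. Everything I experienced at EPFL from March 2024 to March 2025 was a first in my life and deeply meaningful.
March 2025: After returning to China, I kept gritting my teeth through revisions, rebuttal, major revision, and acceptance. By the time it was done, it was already October. I still remember countless nights walking along the road at 1 a.m., feeling that whatever the result, I had done right by myself. Thank you to everyone who supported me, and thank you to myself.
December 2025: My midterm review application was rejected, which meant that every plan premised on graduating in June 2026 fell through. I told myself I cannot let this trap me; I have to survive.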

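Outlook for 2026

Keep working on AI Infra – understand SGLang's design and optimization strategies. – Analyze the challenges it faces. – Implement optimizations and aim to contribute a valuable PR. – Produce a paper.
Apply for the midterm review in December.
Submit the post-fuzzing work.
Go to Dali with my girlfriend.
Spend the Lunar New Year in Hainan.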

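I know you have been working hard and that you are tired, so thank you for all the effort. Keep walking forward with yourself. Full speed ahead!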