TurboQuant: Near-Optimal Online Vector Quantization
TL;DR

TurboQuant is a near-optimal online vector quantization method for compressing the KV cache during LLM inference.

It applies a random rotation to the input vector (QR-based in the paper; replaced by the Fast Walsh–Hadamard Transform, FWHT, in the SGLang PR), after which the rotated vector’s coordinates follow a Beta distribution on the sphere. Each coordinate is then quantized independently (MSE objective plus a QJL term), the quantized indices are stored in the KV cache, and the vector is reconstructed at retrieval time by dequantizing and applying the inverse rotation.
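The rotate → quantize → store → inverse-rotate pipeline can be sketched as below. This is a simplified illustration, not the paper's or the PR's implementation: it uses a randomized Hadamard transform (random sign flips followed by FWHT) as the rotation, and a plain uniform scalar quantizer in place of the paper's MSE+QJL-optimal per-coordinate quantizer. The function names `quantize`/`dequantize` and the `bits` parameter are my own for illustration.

```python
import numpy as np

def fwht(x):
    """Iterative Fast Walsh-Hadamard Transform (unnormalized); len(x) must be a power of 2."""
    x = x.copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def quantize(v, signs, bits=4):
    """Rotate v with a randomized Hadamard transform, then quantize each coordinate."""
    d = len(v)
    # Random sign flip + FWHT, normalized so the rotation is orthonormal.
    r = fwht(v * signs) / np.sqrt(d)
    # Uniform scalar quantizer per coordinate (stand-in for the paper's
    # MSE+QJL quantizer); only the indices and the scale are stored.
    scale = np.abs(r).max()
    levels = 2 ** bits
    q = np.clip(np.round((r / scale + 1) / 2 * (levels - 1)), 0, levels - 1)
    return q.astype(np.uint8), scale

def dequantize(q, scale, signs, bits=4):
    """Reconstruct the vector: dequantize, then apply the inverse rotation."""
    levels = 2 ** bits
    r = (q.astype(np.float64) / (levels - 1) * 2 - 1) * scale
    # The normalized FWHT is self-inverse; undoing the signs completes the inverse rotation.
    return fwht(r) / np.sqrt(len(r)) * signs
```

Because the normalized Hadamard transform is orthogonal, the reconstruction error equals the per-coordinate quantization error in the rotated space, which is what the independent coordinate quantizers are designed to minimize.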

Key points

Note: This is my personal interpretation and summary. I recommend reading the original paper and the PR side by side.
In my view, AI infrastructure is one of the areas where academia and industry are most tightly coupled: mathematical elegance is like an artist’s ideas, while AI infra is the craft—the brushes, canvas, and technique. Its impact will likely keep expanding, much like the ripple effects of the Industrial Revolution.

Table of Contents