SQuat: Subspace-orthogonal KV Cache Quantization

COLM 2025
Red Hat AI Innovation
*Equal contribution

Abstract

We introduce a new KV cache quantization algorithm: SQuat (Subspace-orthogonal KV cache quantization). SQuat is training-free, requires no calibration data, runs on-the-fly, and is grounded in a theoretical framework we developed. Empirically, it reduces GPU peak memory by 2.17× to 2.82×, improves throughput by 2.45× to 3.60×, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.

Method

KV Cache Quantization Overview

SQuat first constructs a subspace that captures the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace. Because attention scores are inner products between queries and keys, error components confined to directions outside this subspace barely perturb the scores, which minimizes the impact of quantization errors on the attention mechanism's outputs.
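To make the orthogonality constraint concrete, below is a minimal NumPy sketch, not the paper's algorithm: it applies plain round-to-nearest quantization to a key vector, then removes the error component that falls inside a given orthonormal subspace basis `S`, so the residual error is orthogonal to the subspace. The basis `S`, the 2-bit setting, and the post-hoc correction step are all illustrative assumptions; SQuat itself enforces the constraint during quantization rather than after it.

```python
import numpy as np

def quantize_uniform(x, n_bits=2):
    """Round-to-nearest uniform quantization with a per-vector scale/offset."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**n_bits - 1)
    q = np.round((x - lo) / scale)
    return q * scale + lo  # dequantized values

def subspace_orthogonal_dequant(k, S, n_bits=2):
    """Illustrative only: quantize k, then project the quantization error
    out of the subspace spanned by the orthonormal columns of S, so the
    remaining error is orthogonal to S. (This post-hoc correction leaves
    the quantization grid; SQuat enforces the constraint on-the-fly.)"""
    k_hat = quantize_uniform(k, n_bits)
    err = k_hat - k
    return k_hat - S @ (S.T @ err)  # remove the in-subspace error component

rng = np.random.default_rng(0)
d, r = 128, 8
S, _ = np.linalg.qr(rng.standard_normal((d, r)))  # orthonormal basis, d x r
k = rng.standard_normal(d)

k_hat = subspace_orthogonal_dequant(k, S)
print(np.abs(S.T @ (k_hat - k)).max())  # ~1e-16: error is orthogonal to S
```

The final check mirrors the guarantee the method targets: any query direction lying inside the subspace forms the same inner product with the corrected key as with the original key.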

Results

[Figure: SQuat method overview]

Our experimental evaluation demonstrates improvements along three dimensions: SQuat reduces GPU peak memory by 2.17× to 2.82×, improves throughput by 2.45× to 3.60×, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.

[Figures: performance results and benchmark comparison]

Citation

@inproceedings{wang2025squat,
    title={SQuat: Subspace-orthogonal KV Cache Quantization},
    author={Wang, Hao and Han, Ligong and Xu, Kai and Srivastava, Akash},
    booktitle={Conference on Language Modeling},
    year={2025}
}