SQuat: Subspace-orthogonal KV Cache Quantization

COLM 2025
Red Hat AI Innovation
*Equal contribution

Abstract

We introduce a new KV cache quantization algorithm: SQuat (Subspace-orthogonal KV cache quantization). SQuat is training-free, requires no calibration data, runs on-the-fly, and is grounded in a theoretical framework we developed. Empirically, it reduces GPU peak memory by 2.17× to 2.82×, improves throughput by 2.45× to 3.60×, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.

Method

KV Cache Quantization Overview

SQuat first constructs a subspace that captures the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace. Because attention scores are inner products between queries and keys, error components confined to directions outside this subspace barely perturb the scores, which minimizes the impact of quantization errors on the attention mechanism's outputs.
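To make the orthogonality constraint concrete, below is a minimal NumPy sketch, not the paper's algorithm: it applies plain round-to-nearest quantization to a key vector, then removes the error component that falls inside a given orthonormal subspace basis `S`, so the residual error is orthogonal to the subspace. The basis `S`, the 2-bit setting, and the post-hoc correction step are all illustrative assumptions; SQuat itself enforces the constraint during quantization rather than after it.

```python
import numpy as np

def quantize_uniform(x, n_bits=2):
    """Round-to-nearest uniform quantization with a per-vector scale/offset."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**n_bits - 1)
    q = np.round((x - lo) / scale)
    return q * scale + lo  # dequantized values

def subspace_orthogonal_dequant(k, S, n_bits=2):
    """Illustrative only: quantize k, then project the quantization error
    out of the subspace spanned by the orthonormal columns of S, so the
    remaining error is orthogonal to S. (This post-hoc correction leaves
    the quantization grid; SQuat enforces the constraint on-the-fly.)"""
    k_hat = quantize_uniform(k, n_bits)
    err = k_hat - k
    return k_hat - S @ (S.T @ err)  # remove the in-subspace error component

rng = np.random.default_rng(0)
d, r = 128, 8
S, _ = np.linalg.qr(rng.standard_normal((d, r)))  # orthonormal basis, d x r
k = rng.standard_normal(d)

k_hat = subspace_orthogonal_dequant(k, S)
print(np.abs(S.T @ (k_hat - k)).max())  # ~1e-16: error is orthogonal to S
```

The final check mirrors the guarantee the method targets: any query direction lying inside the subspace forms the same inner product with the corrected key as with the original key.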

Results

[Figure: SQuat method overview]

Our experimental evaluation demonstrates improvements along three dimensions: SQuat reduces GPU peak memory by 2.17× to 2.82×, improves throughput by 2.45× to 3.60×, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.

[Figures: performance results and benchmark comparison]

Citation

@inproceedings{wang2025squat,
    title={SQuat: Subspace-orthogonal KV Cache Quantization},
    author={Wang, Hao and Han, Ligong and Xu, Kai and Srivastava, Akash},
    booktitle={Conference on Language Modeling},
    year={2025}
}