🏢 Dept. of Computer Science, Rice University
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
3037 words · 15 mins
Natural Language Processing
Large Language Models
Boost LLM inference speed by 1.4-3.5x by using Coupled Quantization (CQ) to compress the KV cache down to 1 bit per channel while preserving model accuracy.
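The core idea behind CQ is to quantize several KV-cache channels jointly rather than one at a time, so that dependence between channels can be exploited: a group of channels shares a single codebook index, and with a small enough codebook the cost amortizes to about 1 bit per channel. Below is a minimal NumPy sketch of that idea; the group size of 4, the 16-entry codebook, and plain k-means codebook learning are illustrative assumptions, not the paper's actual algorithm.

```python
# Sketch of coupled (vector) quantization of a KV cache, assuming groups of 4
# channels and a 16-entry codebook: log2(16) = 4 bits per group = 1 bit/channel.
# Group size, codebook size, and k-means are illustrative choices only.
import numpy as np


def fit_codebook(kv: np.ndarray, group_size: int = 4, n_centroids: int = 16,
                 iters: int = 25, seed: int = 0) -> np.ndarray:
    """Learn a codebook of `n_centroids` vectors over coupled channel groups."""
    groups = kv.reshape(-1, group_size)                # (num_groups, group_size)
    rng = np.random.default_rng(seed)
    centroids = groups[rng.choice(len(groups), n_centroids, replace=False)]
    for _ in range(iters):                             # plain k-means (Lloyd's)
        dists = ((groups[:, None, :] - centroids[None]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(n_centroids):
            members = groups[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    return centroids


def quantize(kv: np.ndarray, centroids: np.ndarray, group_size: int = 4) -> np.ndarray:
    """Replace each channel group with the index of its nearest codebook entry."""
    groups = kv.reshape(-1, group_size)
    dists = ((groups[:, None, :] - centroids[None]) ** 2).sum(-1)
    return dists.argmin(1).astype(np.uint8)            # one small code per group


def dequantize(codes: np.ndarray, centroids: np.ndarray, shape) -> np.ndarray:
    """Reconstruct an approximate KV cache from the codes and the codebook."""
    return centroids[codes].reshape(shape)


if __name__ == "__main__":
    keys = np.random.randn(128, 64).astype(np.float32)  # (tokens, head_dim), toy data
    cb = fit_codebook(keys)
    codes = quantize(keys, cb)
    approx = dequantize(codes, cb, keys.shape)
    print("reconstruction MSE:", float(((keys - approx) ** 2).mean()))
```

In this toy setup the stored cache shrinks to one 4-bit code per 4-channel group plus a shared codebook, which is where the "1 bit per channel" budget in the title comes from; the paper's contribution lies in how the channel coupling and codebooks are chosen so that accuracy is preserved at that budget.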