🏢 Dept. of Computer Science, Rice University
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
3037 words · 15 mins
Natural Language Processing
Large Language Models
Boost LLM inference speed by 1.4-3.5x by using Coupled Quantization (CQ) to compress the KV cache down to 1 bit per channel while preserving model accuracy.
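The core idea behind CQ is to quantize several KV-cache channels jointly rather than one at a time, so that dependence between channels can be exploited: a group of channels shares a single codebook index, and with a small enough codebook the cost amortizes to about 1 bit per channel. Below is a minimal NumPy sketch of that idea; the group size of 4, the 16-entry codebook, and plain k-means codebook learning are illustrative assumptions, not the paper's actual algorithm.

```python
# Sketch of coupled (vector) quantization of a KV cache, assuming groups of 4
# channels and a 16-entry codebook: log2(16) = 4 bits per group = 1 bit/channel.
# Group size, codebook size, and k-means are illustrative choices only.
import numpy as np


def fit_codebook(kv: np.ndarray, group_size: int = 4, n_centroids: int = 16,
                 iters: int = 25, seed: int = 0) -> np.ndarray:
    """Learn a codebook of `n_centroids` vectors over coupled channel groups."""
    groups = kv.reshape(-1, group_size)                # (num_groups, group_size)
    rng = np.random.default_rng(seed)
    centroids = groups[rng.choice(len(groups), n_centroids, replace=False)]
    for _ in range(iters):                             # plain k-means (Lloyd's)
        dists = ((groups[:, None, :] - centroids[None]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(n_centroids):
            members = groups[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    return centroids


def quantize(kv: np.ndarray, centroids: np.ndarray, group_size: int = 4) -> np.ndarray:
    """Replace each channel group with the index of its nearest codebook entry."""
    groups = kv.reshape(-1, group_size)
    dists = ((groups[:, None, :] - centroids[None]) ** 2).sum(-1)
    return dists.argmin(1).astype(np.uint8)            # one small code per group


def dequantize(codes: np.ndarray, centroids: np.ndarray, shape) -> np.ndarray:
    """Reconstruct an approximate KV cache from the codes and the codebook."""
    return centroids[codes].reshape(shape)


if __name__ == "__main__":
    keys = np.random.randn(128, 64).astype(np.float32)  # (tokens, head_dim), toy data
    cb = fit_codebook(keys)
    codes = quantize(keys, cb)
    approx = dequantize(codes, cb, keys.shape)
    print("reconstruction MSE:", float(((keys - approx) ** 2).mean()))
```

In this toy setup the stored cache shrinks to one 4-bit code per 4-channel group plus a shared codebook, which is where the "1 bit per channel" budget in the title comes from; the paper's contribution lies in how the channel coupling and codebooks are chosen so that accuracy is preserved at that budget.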