
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
4689 words · 23 mins
AI Generated · 🤗 Daily Papers · Natural Language Processing · Large Language Models · 🏢 California Institute of Technology
HEADINFER achieves memory-efficient LLM inference by offloading the key-value cache to the CPU on a per-head basis, enabling 4-million-token inference on a single consumer GPU.
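
The core idea is that the KV cache can live in CPU memory and only one head's keys and values need to reside on the GPU at a time. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: per-head KV tensors are kept in (pinned) CPU memory and copied to the GPU only for the head currently being attended to. The sizes, function names, and single-query decode step are illustrative assumptions.

```python
# Illustrative sketch of head-wise KV-cache offloading (not the HeadInfer code).
import torch

num_heads, head_dim, seq_len = 8, 64, 4096
device = "cuda" if torch.cuda.is_available() else "cpu"
pin = torch.cuda.is_available()  # pinned memory enables async host-to-device copies

# KV cache stored head-by-head on the CPU instead of the GPU.
k_cache = [torch.randn(seq_len, head_dim, pin_memory=pin) for _ in range(num_heads)]
v_cache = [torch.randn(seq_len, head_dim, pin_memory=pin) for _ in range(num_heads)]

def attend_one_head(q_h, h):
    """Fetch one head's KV from the CPU, run attention on the GPU, then drop it."""
    k_h = k_cache[h].to(device, non_blocking=True)    # (seq_len, head_dim)
    v_h = v_cache[h].to(device, non_blocking=True)
    scores = (q_h @ k_h.T) / head_dim ** 0.5          # (1, seq_len)
    out = torch.softmax(scores, dim=-1) @ v_h         # (1, head_dim)
    return out                                        # k_h, v_h freed after return

# One decode step: only a single head's KV occupies GPU memory at any moment.
q = torch.randn(num_heads, 1, head_dim, device=device)
out = torch.cat([attend_one_head(q[h], h) for h in range(num_heads)], dim=0)
print(out.shape)  # torch.Size([num_heads, head_dim])
```

In practice the copies for the next head can be overlapped with the attention computation of the current head on a separate CUDA stream, which is what makes this kind of offloading viable at long context lengths.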