
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
4689 words · 23 mins
AI Generated · 🤗 Daily Papers · Natural Language Processing · Large Language Models · 🏢 California Institute of Technology
HEADINFER achieves memory-efficient LLM inference by offloading the key-value cache to the CPU on a per-head basis, enabling 4-million-token inference on a single consumer GPU.
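
The core idea is that the KV cache can live in CPU memory and only one head's keys and values need to reside on the GPU at a time. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: per-head KV tensors are kept in (pinned) CPU memory and copied to the GPU only for the head currently being attended to. The sizes, function names, and single-query decode step are illustrative assumptions.

```python
# Illustrative sketch of head-wise KV-cache offloading (not the HeadInfer code).
import torch

num_heads, head_dim, seq_len = 8, 64, 4096
device = "cuda" if torch.cuda.is_available() else "cpu"
pin = torch.cuda.is_available()  # pinned memory enables async host-to-device copies

# KV cache stored head-by-head on the CPU instead of the GPU.
k_cache = [torch.randn(seq_len, head_dim, pin_memory=pin) for _ in range(num_heads)]
v_cache = [torch.randn(seq_len, head_dim, pin_memory=pin) for _ in range(num_heads)]

def attend_one_head(q_h, h):
    """Fetch one head's KV from the CPU, run attention on the GPU, then drop it."""
    k_h = k_cache[h].to(device, non_blocking=True)    # (seq_len, head_dim)
    v_h = v_cache[h].to(device, non_blocking=True)
    scores = (q_h @ k_h.T) / head_dim ** 0.5          # (1, seq_len)
    out = torch.softmax(scores, dim=-1) @ v_h         # (1, head_dim)
    return out                                        # k_h, v_h freed after return

# One decode step: only a single head's KV occupies GPU memory at any moment.
q = torch.randn(num_heads, 1, head_dim, device=device)
out = torch.cat([attend_one_head(q[h], h) for h in range(num_heads)], dim=0)
print(out.shape)  # torch.Size([num_heads, head_dim])
```

In practice the copies for the next head can be overlapped with the attention computation of the current head on a separate CUDA stream, which is what makes this kind of offloading viable at long context lengths.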