HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
·4689 words·23 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 California Institute of Technology
HEADINFER achieves memory-efficient LLM inference by offloading the key-value (KV) cache to the CPU on a per-head basis, enabling 4-million-token inference on a single consumer GPU.
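To make the head-wise offloading idea concrete, here is a minimal sketch (not the paper's actual implementation; the class and method names are hypothetical). It assumes the full KV cache lives in pinned CPU memory, and only the K/V tensors for the head currently being attended to are streamed back to the GPU.

```python
# Hypothetical sketch of head-wise KV-cache offloading, assuming PyTorch and a CUDA device.
# Idea: keep the full KV cache in pinned CPU memory and fetch one head's K/V back to the
# GPU just before that head's attention is computed.
import torch

class HeadwiseOffloadedKVCache:
    def __init__(self, num_heads, head_dim, max_tokens, device="cuda"):
        self.device = device
        # Full cache allocated in pinned CPU memory so copies can be asynchronous.
        self.k_cpu = torch.empty(num_heads, max_tokens, head_dim, pin_memory=True)
        self.v_cpu = torch.empty(num_heads, max_tokens, head_dim, pin_memory=True)
        self.length = 0

    def append(self, k_new, v_new):
        # k_new, v_new: (num_heads, new_tokens, head_dim), computed on the GPU.
        t = k_new.shape[1]
        self.k_cpu[:, self.length:self.length + t].copy_(k_new, non_blocking=True)
        self.v_cpu[:, self.length:self.length + t].copy_(v_new, non_blocking=True)
        self.length += t

    def fetch_head(self, head_idx):
        # Stream one head's K/V back to the GPU for its attention computation.
        k = self.k_cpu[head_idx, :self.length].to(self.device, non_blocking=True)
        v = self.v_cpu[head_idx, :self.length].to(self.device, non_blocking=True)
        return k, v
```

Because only one head's cache is resident on the GPU at a time, peak GPU memory for the KV cache drops by roughly the number of heads, which is what makes million-token contexts feasible on consumer hardware.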