↓Skip to main content

🏢 Integrated Vision and Language Lab, KAIST, South Korea

Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing

29 November 2024·3199 words·16 mins· loading · loading

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Integrated Vision and Language Lab, KAIST, South Korea

Video-Ma²mba efficiently handles long videos by using State Space Models, achieving linear scaling in memory and time, and employing a novel Multi-Axis Gradient Checkpointing (MA-GC) for significant m…