🏢 Integrated Vision and Language Lab, KAIST, South Korea
Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
Video-Ma²mba efficiently handles long videos by using State Space Models, achieving linear scaling in memory and time, and employing a novel Multi-Axis Gradient Checkpointing (MA-GC) for significant m…