🏢 Carnegie Mellon University
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
·2584 words·13 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
VideoGLaMM, a new large multimodal model, achieves precise pixel-level visual grounding in videos by integrating a dual vision encoder, a spatio-temporal decoder, and a large language model.
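The one-liner names the key architectural pieces; below is a minimal PyTorch sketch of how such a pipeline could be wired together. All module choices, dimensions, and the stand-in pixel decoder are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: module names, dimensions, and wiring are
# assumptions made for this example, not the paper's implementation.

class DualVisionEncoder(nn.Module):
    """Encode a clip twice: per-frame spatial patches and clip-level spatio-temporal tubes."""
    def __init__(self, dim=256):
        super().__init__()
        self.spatial = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.temporal = nn.Conv3d(3, dim, kernel_size=(4, 16, 16), stride=(4, 16, 16))

    def forward(self, video):                              # video: (B, T, 3, H, W)
        b = video.shape[0]
        sp = self.spatial(video.flatten(0, 1))             # (B*T, D, H/16, W/16)
        sp = sp.flatten(2).transpose(1, 2)                 # (B*T, patches, D)
        sp = sp.reshape(b, -1, sp.shape[-1])               # (B, T*patches, D)
        grid = self.temporal(video.transpose(1, 2))        # (B, D, T', H', W')
        return sp, grid


class VideoGroundingLMM(nn.Module):
    """The LLM reads visual + text tokens; a spatio-temporal decoder turns its
    final hidden state (think of a special grounding token) into coarse per-frame masks."""
    def __init__(self, dim=256, vocab=32000):
        super().__init__()
        self.vision = DualVisionEncoder(dim)
        self.text_embed = nn.Embedding(vocab, dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.mask_decoder = nn.Conv3d(dim, 1, kernel_size=1)   # stand-in for a pixel decoder

    def forward(self, video, text_ids):
        sp, grid = self.vision(video)
        temporal_tokens = grid.flatten(2).transpose(1, 2)      # (B, T'*H'*W', D)
        tokens = torch.cat([sp, temporal_tokens, self.text_embed(text_ids)], dim=1)
        hidden = self.llm(tokens)
        seg = hidden[:, -1].reshape(grid.shape[0], -1, 1, 1, 1)  # grounding-token embedding
        return self.mask_decoder(grid + seg)                   # coarse masks: (B, 1, T', H', W')


if __name__ == "__main__":
    model = VideoGroundingLMM()
    masks = model(torch.randn(1, 8, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
    print(masks.shape)                                         # torch.Size([1, 1, 2, 14, 14])
```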
Inference Optimal VLMs Need Only One Visual Token but Larger Models
·3063 words·15 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
Under a fixed inference compute budget, Vision-Language Models (VLMs) perform best by pairing a larger language model with just a single visual token, rather than spending the budget on more visual tokens.
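The trade-off is easiest to see with a rough compute estimate. The sketch below uses the common "2 × parameters × tokens" FLOP approximation for decoder inference; the model sizes and token counts are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope sketch of the inference-compute trade-off.
# The 2 * params * tokens FLOP rule of thumb and all model sizes / token
# counts below are assumptions for illustration, not numbers from the paper.

def inference_flops(params: float, n_visual_tokens: int, n_text_tokens: int = 100) -> float:
    """Approximate decoder FLOPs for one forward pass: ~2 * params * total tokens."""
    return 2 * params * (n_visual_tokens + n_text_tokens)

# A typical LLaVA-style setup: a 7B LLM fed 576 visual tokens per image.
budget = inference_flops(7e9, n_visual_tokens=576)

# Same compute budget, but spend it on a bigger LLM and a single visual token.
for params in (7e9, 13e9, 34e9, 70e9):
    cost = inference_flops(params, n_visual_tokens=1)
    print(f"{params/1e9:>4.0f}B LLM, 1 visual token: "
          f"{cost/budget:.2f}x the 7B/576-token budget")

# Even a ~34B model reading one visual token stays below the 7B/576-token
# budget -- the regime the paper's title points to as inference-optimal.
```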
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
·5414 words·26 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Carnegie Mellon University
Specialized Sparse Autoencoders (SSAEs) decode foundation models’ ‘dark matter’ features, efficiently extracting rare subdomain concepts for improved interpretability and safety.
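For readers unfamiliar with sparse autoencoders, the sketch below shows the basic reconstruction-plus-sparsity objective such a model is trained on; in the SSAE setting the training activations would come from targeted subdomain text. The dimensions, hyperparameters, and toy training loop are assumptions for illustration, not the paper's recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sparse-autoencoder sketch. The "specialized" part of SSAEs comes from
# training on activations drawn from a targeted subdomain corpus; everything
# below (sizes, loop, coefficients) is an illustrative assumption.

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_dict: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, acts):
        codes = F.relu(self.encoder(acts))        # sparse feature activations
        recon = self.decoder(codes)               # reconstruction of the LLM activations
        return recon, codes


def sae_loss(recon, acts, codes, l1_coef: float = 1e-3):
    # Reconstruction term + L1 sparsity penalty encouraging few active features.
    return F.mse_loss(recon, acts) + l1_coef * codes.abs().mean()


if __name__ == "__main__":
    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    # Stand-in for residual-stream activations collected on subdomain text.
    subdomain_acts = torch.randn(4096, 768)
    for step in range(100):
        batch = subdomain_acts[torch.randint(0, 4096, (256,))]
        recon, codes = sae(batch)
        loss = sae_loss(recon, batch, codes)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final loss: {loss.item():.4f}")
```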