
🏢 Carnegie Mellon University

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
·2584 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
VideoGLaMM, a new large multimodal model, achieves precise pixel-level visual grounding in videos by integrating a dual vision encoder, a spatio-temporal decoder, and a large language model.
Inference Optimal VLMs Need Only One Visual Token but Larger Models
·3063 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
Inference-optimal Vision-Language Models (VLMs) need only one visual token but larger models: for a fixed inference budget, compute is better spent on a larger language model than on additional visual tokens.
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
·5414 words·26 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Carnegie Mellon University
Specialized Sparse Autoencoders (SSAEs) decode foundation models’ ‘dark matter’ features, efficiently extracting rare subdomain concepts for improved interpretability and safety.