🏢 Carnegie Mellon University

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
·2677 words·13 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Carnegie Mellon University
AI agents are tested in a simulated company, revealing both their ability to automate simpler tasks and their shortcomings with complex workflows and interfaces.
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
·4233 words·20 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
MAmmoTH-VL: A novel approach to instruction tuning at scale builds a 12M-sample dataset that elicits chain-of-thought reasoning, yielding state-of-the-art multimodal reasoning capabilities.
Evaluating Language Models as Synthetic Data Generators
·4403 words·21 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Carnegie Mellon University
AGORABENCH: A new benchmark reveals surprising strengths & weaknesses of LMs as synthetic data generators, showing that problem-solving ability isn’t the sole indicator of data quality.
Video Depth without Video Models
·3150 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Carnegie Mellon University
RollingDepth: Achieving state-of-the-art video depth estimation without using complex video models, by cleverly extending a single-image depth estimator.
Soft Robotic Dynamic In-Hand Pen Spinning
·2419 words·12 mins
AI Generated 🤗 Daily Papers AI Applications Robotics 🏢 Carnegie Mellon University
SWIFT, a new system, enables a soft robotic hand to learn dynamic pen spinning via real-world trial-and-error, achieving 100% success across diverse pen properties without explicit object modeling.
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
·2584 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
VideoGLaMM: a new large multimodal model achieves precise pixel-level visual grounding in videos by seamlessly integrating a dual vision encoder, a spatio-temporal decoder, and a large language model.
Inference Optimal VLMs Need Only One Visual Token but Larger Models
·3063 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
For inference-optimal performance, Vision Language Models (VLMs) should pair a larger language model with drastically fewer visual tokens, often just one.
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
·5414 words·26 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Carnegie Mellon University
Specialized Sparse Autoencoders (SSAEs) decode foundation models’ ‘dark matter’ features, efficiently extracting rare subdomain concepts for improved interpretability and safety.