
🏢 Carnegie Mellon University

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
·2584 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
VideoGLaMM, a new large multimodal model, achieves precise pixel-level visual grounding in videos by integrating a dual vision encoder, a spatio-temporal decoder, and a large language model.
Inference Optimal VLMs Need Only One Visual Token but Larger Models
·3063 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
Inference-optimal Vision-Language Models (VLMs) need only one visual token but larger models: for a fixed inference budget, compute is better spent on a larger language model than on additional visual tokens.
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
·5414 words·26 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Carnegie Mellon University
Specialized Sparse Autoencoders (SSAEs) decode foundation models’ ‘dark matter’ features, efficiently extracting rare subdomain concepts for improved interpretability and safety.