
2025-01-22


Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
·4089 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance
Video Depth Anything achieves consistent depth estimation for super-long videos by enhancing Depth Anything V2 with a spatial-temporal head and a novel temporal consistency loss, setting a new state of the art.
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
·4964 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 ByteDance Seed, Tsinghua University
UI-TARS, a novel native GUI agent, achieves state-of-the-art performance by solely using screenshots as input, eliminating the need for complex agent frameworks and expert-designed workflows.
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
·4649 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Google DeepMind
TokenVerse: Extract & combine visual concepts from multiple images for creative image generation!
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
·6574 words·31 mins
AI Generated 🤗 Daily Papers Natural Language Processing Question Answering 🏢 Yale NLP
MMVU: A new benchmark pushing multimodal video understanding to the expert level, revealing the limitations of current models and paving the way for more advanced AI.
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
·2690 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2.5-Reward: A novel multi-modal reward model boosting Large Vision Language Model performance.
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
·3101 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tencent AI Lab
Hunyuan3D 2.0: A groundbreaking open-source system generating high-resolution, textured 3D assets using scalable diffusion models, surpassing prior state-of-the-art performance.
GPS as a Control Signal for Image Generation
·3156 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Michigan
GPS-guided image generation is here! This paper leverages GPS data to create highly realistic images reflecting specific locations, even reconstructing 3D models from 2D photos.
Reasoning Language Models: A Blueprint
·3562 words·17 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 ETH Zurich
Democratizing advanced reasoning in AI, this blueprint introduces a modular framework for building Reasoning Language Models (RLMs), simplifying development and enhancing accessibility.
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
·2333 words·11 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 University of Illinois Urbana-Champaign
Mobile-Agent-E: A self-evolving mobile assistant conquering complex tasks with hierarchical agents and a novel self-evolution module, significantly outperforming prior approaches.
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
·4105 words·20 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Fudan University
Agent-R: A novel self-training framework enables language model agents to learn from errors by dynamically constructing training data that corrects erroneous actions, resulting in significantly improved performance.
EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
·2205 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Alibaba Group
EMO2 achieves realistic audio-driven avatar video generation by employing a two-stage framework: first generating hand poses directly from audio and then using a diffusion model to synthesize full-body video.
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
·3786 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Google DeepMind
New multimodal safety test suite (MSTS) reveals vision-language models’ vulnerabilities and underscores the unique challenges of multimodal inputs.