
2025-01-22


Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
·4089 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance
Video Depth Anything achieves consistent depth estimation for super-long videos by enhancing Depth Anything V2 with a spatial-temporal head and a novel temporal consistency loss, setting a new state of the art.
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
·4964 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 ByteDance Seed, Tsinghua University
UI-TARS, a novel native GUI agent, achieves state-of-the-art performance by solely using screenshots as input, eliminating the need for complex agent frameworks and expert-designed workflows.
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
·4649 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Google DeepMind
TokenVerse: Extract & combine visual concepts from multiple images for creative image generation!
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
·6574 words·31 mins
AI Generated 🤗 Daily Papers Natural Language Processing Question Answering 🏢 Yale NLP
MMVU: A new benchmark pushing multimodal video understanding to the expert level, revealing the limitations of current models and paving the way for more advanced AI.
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
·2690 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2.5-Reward: A novel multi-modal reward model boosting Large Vision Language Model performance.
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
·3101 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tencent AI Lab
Hunyuan3D 2.0: A groundbreaking open-source system generating high-resolution, textured 3D assets using scalable diffusion models, surpassing prior state-of-the-art performance.
GPS as a Control Signal for Image Generation
·3156 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Michigan
GPS-guided image generation is here! This paper leverages GPS data to create highly realistic images reflecting specific locations, even reconstructing 3D models from 2D photos.
Reasoning Language Models: A Blueprint
·3562 words·17 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 ETH Zurich
Democratizing advanced reasoning in AI, this blueprint introduces a modular framework for building Reasoning Language Models (RLMs), simplifying development and enhancing accessibility.
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
·2333 words·11 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 University of Illinois Urbana-Champaign
Mobile-Agent-E: A self-evolving mobile assistant conquering complex tasks with hierarchical agents and a novel self-evolution module, significantly outperforming prior approaches.
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
·4105 words·20 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Fudan University
Agent-R: A novel self-training framework enables language model agents to learn from errors by dynamically constructing training data that corrects erroneous actions, resulting in significantly improved performance.
EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
·2205 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Alibaba Group
EMO2 achieves realistic audio-driven avatar video generation by employing a two-stage framework: first generating hand poses directly from audio and then using a diffusion model to synthesize full-body video.
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
·3786 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Google DeepMind
New multimodal safety test suite (MSTS) reveals vision-language models’ vulnerabilities and underscores the unique challenges of multimodal inputs.