Multimodal Learning

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
·2632 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Zhejiang University, China
SegAgent: Improves MLLMs’ pixel understanding by mimicking human annotation, enabling mask refinement without altering output space.
Referring to Any Person
·3096 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 International Digital Economy Academy (IDEA)
Introducing HumanRef, a new dataset, and RexSeek, a multimodal LLM, to improve human-centric referring tasks by addressing limitations of existing methods.
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
·2951 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Understanding 🏢 Huazhong University of Science & Technology
OmniMamba: Efficient multimodal understanding and generation via SSMs, trained on 2M image-text pairs.
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
·2477 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
GTR: Prevents thought collapse in RL-based VLM agents by process guidance, enhancing performance in complex visual reasoning tasks.
RFLAV: Rolling Flow Matching for Infinite Audio-Video Generation
·2128 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Audio-Visual Learning 🏢 University of Parma
RFLAV: A novel rolling flow matching model for infinite audio-video generation with high quality, synchronization, and temporal coherence.
Video Action Differencing
·3793 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Stanford
VidDiff: Identify subtle action differences in videos for coaching and skill learning.
Should VLMs be Pre-trained with Image Data?
·3469 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Toyota Research Institute
Image data during pre-training can boost Vision-Language Model (VLM) performance, especially when introduced later in the process.
Motion Anything: Any to Motion Generation
·7987 words·38 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 ANU
Motion Anything: Control human motion generation with multimodal conditions like text and music.
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering
·2597 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Zhongguancun Laboratory
VisualSimpleQA: A new benchmark for fine-grained evaluation of visual and linguistic modules in fact-seeking LVLMs.
ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
·2549 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 WHU
ProJudge: A benchmark for MLLM-based process judges in scientific reasoning, plus an instruction-tuning dataset that boosts their performance.
DiffCLIP: Differential Attention Meets CLIP
·2247 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 KAUST
DiffCLIP: Enhancing CLIP models by integrating differential attention, achieving superior performance with minimal overhead.
ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
·2871 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Understanding 🏢 Nankai University
ARMOR: Empowers MLLMs with interleaved multimodal generation via asymmetric synergy, using limited resources.
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
·2695 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Audio-Visual Learning 🏢 Imperial College London
Llama-MTSK: AVSR via Matryoshka LLMs, adapting to computational limits without sacrificing accuracy!
Unified Reward Model for Multimodal Understanding and Generation
·368 words·2 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Understanding 🏢 Fudan University
UNIFIEDREWARD: A unified reward model that enhances multimodal understanding and generation!
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
·1187 words·6 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Alibaba Group
R1-Omni: RLVR enhances multimodal emotion recognition, boosting reasoning and generalization.
SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing
·2729 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Shanghai Artificial Intelligence Laboratory
SURVEYFORGE automates survey generation, improving quality and evaluation.
EgoLife: Towards Egocentric Life Assistant
·3562 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 NTU S-Lab
EgoLife: Ultra-long egocentric dataset & benchmark enabling AI assistants to understand and enhance daily life. Datasets and models released!
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
·5020 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 National University of Singapore
VLMs often disproportionately trust text over visual data, leading to performance drops and safety concerns.
Visual-RFT: Visual Reinforcement Fine-Tuning
·3386 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Jiaotong University
Visual-RFT: Enhance LVLMs’ visual reasoning via reinforcement learning with verifiable rewards, achieving strong performance with limited data.
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
·1959 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Technology Sydney
VideoUFO: A new user-focused, million-scale dataset that improves text-to-video generation by aligning training data with real user interests and preferences!