Multimodal Learning
MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
·4224 words·20 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
MoLE: Mixture of Low-rank Experts enhances human-centric text-to-image diffusion models by using low-rank modules trained on high-quality face and hand datasets to improve the realism of generated faces and hands.
MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling
·2749 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Alibaba Group
MoGenTS revolutionizes human motion generation by quantizing individual joints into 2D tokens, enabling efficient spatial-temporal modeling and significantly outperforming existing methods.
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
·3087 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Beijing Jiaotong University
Mobile-Agent-v2 uses a three-agent collaborative framework (planning, decision, reflection) to improve mobile device operation accuracy by over 30%, overcoming the limitations of single-agent architectures.
MO-DDN: A Coarse-to-Fine Attribute-based Exploration Agent for Multi-Object Demand-driven Navigation
·4206 words·20 mins·
AI Generated
Multimodal Learning
Embodied AI
🏢 Peking University
MO-DDN: a new benchmark and a coarse-to-fine exploration agent that boost embodied AI’s ability to handle multi-object, preference-based task planning.
MMSite: A Multi-modal Framework for the Identification of Active Sites in Proteins
·2970 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 School of Computer Science, National Engineering Research Center for Multimedia Software and Institute of Artificial Intelligence, Wuhan University
MMSite: a novel multi-modal framework accurately identifies protein active sites using protein sequences and textual descriptions, achieving state-of-the-art performance.
Mixtures of Experts for Audio-Visual Learning
·2112 words·10 mins·
Multimodal Learning
Audio-Visual Learning
🏢 Fudan University
AVMoE, a novel parameter-efficient transfer learning approach for audio-visual learning, dynamically allocates expert models (unimodal and cross-modal adapters) based on task demands, achieving superior performance.
Mitigating Object Hallucination via Concentric Causal Attention
·2174 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanyang Technological University
Concentric Causal Attention (CCA) significantly reduces object hallucination in LVLMs by cleverly reorganizing visual tokens to mitigate the impact of long-term decay in Rotary Position Encoding.
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
·3263 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 KAIST
Meteor: Mamba-based Traversal of Rationale achieves significant vision-language improvements by efficiently embedding multifaceted rationales in a large language model, without scaling the model or using additional vision encoders.
MemVLT: Vision-Language Tracking with Adaptive Memory-based Prompts
·2803 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 School of Artificial Intelligence, University of Chinese Academy of Sciences
MemVLT: Adaptive Vision-Language Tracking leverages memory to generate dynamic prompts, surpassing existing methods by adapting to changing target appearances.
Membership Inference Attacks against Large Vision-Language Models
·3357 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 LIONS, EPFL
Introduces the first benchmark for detecting training data in large vision-language models (VLLMs), improving data security.
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
·1886 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
MaVEn: A novel multi-granularity hybrid visual encoding framework significantly boosts MLLM’s multi-image reasoning capabilities by combining discrete and continuous visual representations.
Matryoshka Query Transformer for Large Vision-Language Models
·1913 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 UC Los Angeles
Matryoshka Query Transformer (MQT) empowers large vision-language models with flexible visual token encoding, drastically reducing inference costs while maintaining high accuracy across multiple benchmarks.
MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models
·2671 words·13 mins·
AI Generated
Multimodal Learning
Human-AI Interaction
🏢 Tsinghua University
MambaTalk: Efficient holistic gesture synthesis using selective state space models to overcome computational complexity and improve gesture quality.
Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials
·4579 words·22 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Stanford University
Make-it-Real uses a large multimodal language model to automatically paint realistic materials onto 3D objects, drastically improving realism and saving developers time.
Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function
·3895 words·19 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanjing University of Aeronautics and Astronautics
Magnet: Enhancing Text-to-Image Synthesis by Disentangling Attributes in CLIP.
M³GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation
·4046 words·19 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tencent AI Lab
M³GPT, a novel multimodal framework, achieves superior motion comprehension and generation by integrating text, music, and motion data into a unified LLM representation.
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
·2931 words·14 mins·
Multimodal Learning
Multimodal Generation
🏢 Beijing University of Posts and Telecommunications
Lumina-Next supercharges image generation: faster, more efficient, and higher-resolution, thanks to a new architecture and improved sampling techniques.
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
·2500 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Fudan University
Lumen: A novel LMM architecture decouples perception learning into task-agnostic and task-specific stages, enabling versatile vision-centric capabilities and surpassing existing LMM-based approaches.
LOVA3: Learning to Visual Question Answering, Asking and Assessment
·3398 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 Show Lab, National University of Singapore
LOVA³ enhances MLLMs by teaching them to ask and assess image-based questions, improving their multimodal understanding and performance on various benchmarks.
LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
·3183 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
LoTLIP boosts language-image pre-training for superior long text understanding by cleverly integrating corner tokens and utilizing a massive dataset of 100M long-caption images.