Multimodal Learning
MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
·4224 words·20 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
MoLE: Mixture of Low-rank Experts enhances human-centric text-to-image diffusion models by using low-rank modules trained on high-quality face and hand datasets to improve the realism of generated faces and hands.
MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling
·2749 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Alibaba Group
MoGenTS revolutionizes human motion generation by quantizing individual joints into 2D tokens, enabling efficient spatial-temporal modeling and significantly outperforming existing methods.
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
·3087 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Beijing Jiaotong University
Mobile-Agent-v2 uses a three-agent collaborative framework (planning, decision, reflection) to improve mobile device operation accuracy by over 30%, overcoming the limitations of single-agent architectures.
MO-DDN: A Coarse-to-Fine Attribute-based Exploration Agent for Multi-Object Demand-driven Navigation
·4206 words·20 mins·
AI Generated
Multimodal Learning
Embodied AI
🏢 Peking University
MO-DDN: a new benchmark and a coarse-to-fine exploration agent that boost embodied AI’s ability to handle multi-object, preference-based task planning.
MMSite: A Multi-modal Framework for the Identification of Active Sites in Proteins
·2970 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 School of Computer Science, National Engineering Research Center for Multimedia Software and Institute of Artificial Intelligence, Wuhan University
MMSite: a novel multi-modal framework accurately identifies protein active sites using protein sequences and textual descriptions, achieving state-of-the-art performance.
Mixtures of Experts for Audio-Visual Learning
·2112 words·10 mins·
Multimodal Learning
Audio-Visual Learning
🏢 Fudan University
AVMoE, a novel parameter-efficient transfer learning approach for audio-visual learning, dynamically allocates expert models (unimodal and cross-modal adapters) based on task demands, achieving superior performance.
Mitigating Object Hallucination via Concentric Causal Attention
·2174 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanyang Technological University
Concentric Causal Attention (CCA) significantly reduces object hallucination in LVLMs by cleverly reorganizing visual tokens to mitigate the impact of long-term decay in Rotary Position Encoding.
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
·3263 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 KAIST
Meteor: Mamba-based Traversal of Rationale achieves significant vision-language improvements by efficiently embedding multifaceted rationales in a large language model, without scaling the model or using additional vision encoders.
MemVLT: Vision-Language Tracking with Adaptive Memory-based Prompts
·2803 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 School of Artificial Intelligence, University of Chinese Academy of Sciences
MemVLT: Adaptive Vision-Language Tracking leverages memory to generate dynamic prompts, surpassing existing methods by adapting to changing target appearances.
Membership Inference Attacks against Large Vision-Language Models
·3357 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 LIONS, EPFL
Introduces the first benchmark for detecting training data in large vision-language models (VLLMs), improving data security.
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
·1886 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
MaVEn: A novel multi-granularity hybrid visual encoding framework significantly boosts MLLM’s multi-image reasoning capabilities by combining discrete and continuous visual representations.
Matryoshka Query Transformer for Large Vision-Language Models
·1913 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 UC Los Angeles
Matryoshka Query Transformer (MQT) empowers large vision-language models with flexible visual token encoding, drastically reducing inference costs while maintaining high accuracy across multiple benchmarks.
MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models
·2671 words·13 mins·
AI Generated
Multimodal Learning
Human-AI Interaction
🏢 Tsinghua University
MambaTalk: Efficient holistic gesture synthesis using selective state space models to overcome computational complexity and improve gesture quality.
Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials
·4579 words·22 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Stanford University
Make-it-Real uses a large multimodal language model to automatically paint realistic materials onto 3D objects, drastically improving realism and saving developers time.
Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function
·3895 words·19 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanjing University of Aeronautics and Astronautics
Magnet: Enhancing Text-to-Image Synthesis by Disentangling Attributes in CLIP.
M³GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation
·4046 words·19 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tencent AI Lab
M³GPT, a novel multimodal framework, achieves superior motion comprehension and generation by integrating text, music, and motion data into a unified LLM representation.
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
·2931 words·14 mins·
Multimodal Learning
Multimodal Generation
🏢 Beijing University of Posts and Telecommunications
Lumina-Next supercharges image generation: faster, more efficient, and higher-resolution, thanks to a new architecture and improved sampling techniques.
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
·2500 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Fudan University
Lumen: A novel LMM architecture decouples perception learning into task-agnostic and task-specific stages, enabling versatile vision-centric capabilities and surpassing existing LMM-based approaches.
LOVA3: Learning to Visual Question Answering, Asking and Assessment
·3398 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 Show Lab, National University of Singapore
LOVA³ enhances MLLMs by teaching them to ask and assess image-based questions, improving their multimodal understanding and performance on various benchmarks.
LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
·3183 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
LoTLIP boosts language-image pre-training for superior long text understanding by cleverly integrating corner tokens and utilizing a massive dataset of 100M long-caption images.