Vision-Language Models
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
·1886 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
MaVEn: A novel multi-granularity hybrid visual encoding framework significantly boosts MLLM’s multi-image reasoning capabilities by combining discrete and continuous visual representations.
Matryoshka Query Transformer for Large Vision-Language Models
·1913 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 UC Los Angeles
Matryoshka Query Transformer (MQT) empowers large vision-language models with flexible visual token encoding, drastically reducing inference costs while maintaining high accuracy across multiple bench…
Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials
·4579 words·22 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Stanford University
Make-it-Real uses a large multimodal language model to automatically paint realistic materials onto 3D objects, drastically improving realism and saving developers time.
Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function
·3895 words·19 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanjing University of Aeronautics and Astronautics
Magnet: Enhancing Text-to-Image Synthesis by Disentangling Attributes in CLIP.
M³GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation
·4046 words·19 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tencent AI Lab
M³GPT, a novel multimodal framework, achieves superior motion comprehension and generation by integrating text, music, and motion data into a unified LLM representation.
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
·2500 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Fudan University
Lumen: A novel LMM architecture decouples perception learning into task-agnostic and task-specific stages, enabling versatile vision-centric capabilities and surpassing existing LMM-based approaches.
LOVA³: Learning to Visual Question Answering, Asking and Assessment
·3398 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 Show Lab, National University of Singapore
LOVA³ enhances MLLMs by teaching them to ask and assess image-based questions, improving their multimodal understanding and performance on various benchmarks.
LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
·3183 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
LoTLIP boosts language-image pre-training for superior long text understanding by cleverly integrating corner tokens and utilizing a massive dataset of 100M long-caption images.
LocCa: Visual Pretraining with Location-aware Captioners
·2114 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
LocCa, a novel visual pretraining paradigm, uses location-aware captioning tasks to boost downstream localization performance while maintaining holistic task capabilities.
LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Control and Rendering
·2138 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Fudan University
LiveScene: Language-embedded interactive radiance fields efficiently reconstruct and control complex scenes with multiple interactive objects, achieving state-of-the-art results.
Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes
·2315 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
LipFD: a novel method that leverages audio-visual inconsistencies to accurately spot lip-syncing deepfakes, outperforming existing methods and introducing a high-quality dataset for future research.
LG-VQ: Language-Guided Codebook Learning
·3656 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 Harbin Institute of Technology
LG-VQ: A novel language-guided codebook learning framework boosts multi-modal performance.
LG-CAV: Train Any Concept Activation Vector with Language Guidance
·3860 words·19 mins·
AI Generated
Computer Vision
Vision-Language Models
🏢 Zhejiang University
LG-CAV leverages vision-language models to train any concept activation vector (CAV) with language guidance and without labeled data, achieving superior accuracy and enabling state-of-the-art model…
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
·2448 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Show Lab, National University of Singapore
Visual tokens extend text contexts in multi-modal learning, boosting long-text performance.
Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models
·2923 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Southeast University
Lever-LM configures effective in-context demonstrations for large vision-language models using a small language model, significantly improving their performance on visual question answering and image …
LESS: Label-Efficient and Single-Stage Referring 3D Segmentation
·2019 words·10 mins·
Natural Language Processing
Vision-Language Models
🏢 College of Computer Science and Software Engineering, Shenzhen University
LESS achieves state-of-the-art Referring 3D Segmentation using only binary masks, significantly reducing labeling effort and improving efficiency with a novel single-stage pipeline.
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
·2514 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
Boosting complex visual reasoning, a new Iterative and Parallel Reasoning Mechanism (IPRM) outperforms existing methods by combining step-by-step and simultaneous computations, improving accuracy and …
Learning Cortico-Muscular Dependence through Orthonormal Decomposition of Density Ratios
·2506 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Department of Bioengineering, Imperial College London
FMCA-T unveils cortico-muscular dependence through orthonormal decomposition of density ratios, enhancing movement classification and revealing channel-temporal dependencies.
Learning 1D Causal Visual Representation with De-focus Attention Networks
·2168 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
De-focus Attention Networks achieve comparable performance to 2D non-causal models using 1D causal visual representation, solving the ‘over-focus’ issue in existing 1D causal vision models.
LaSe-E2V: Towards Language-guided Semantic-aware Event-to-Video Reconstruction
·2343 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
LaSe-E2V: Language-guided semantic-aware event-to-video reconstruction uses text descriptions to improve video quality and consistency.