Vision-Language Models
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
·1886 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
MaVEn: A novel multi-granularity hybrid visual encoding framework significantly boosts MLLM’s multi-image reasoning capabilities by combining discrete and continuous visual representations.
Matryoshka Query Transformer for Large Vision-Language Models
·1913 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 UC Los Angeles
Matryoshka Query Transformer (MQT) empowers large vision-language models with flexible visual token encoding, drastically reducing inference costs while maintaining high accuracy across multiple bench…
Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials
·4579 words·22 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Stanford University
Make-it-Real uses a large multimodal language model to automatically paint realistic materials onto 3D objects, drastically improving realism and saving developers time.
Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function
·3895 words·19 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanjing University of Aeronautics and Astronautics
Magnet: Enhancing Text-to-Image Synthesis by Disentangling Attributes in CLIP.
M³GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation
·4046 words·19 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tencent AI Lab
M³GPT, a novel multimodal framework, achieves superior motion comprehension and generation by integrating text, music, and motion data into a unified LLM representation.
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
·2500 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Fudan University
Lumen: A novel LMM architecture decouples perception learning into task-agnostic and task-specific stages, enabling versatile vision-centric capabilities and surpassing existing LMM-based approaches.
LOVA³: Learning to Visual Question Answering, Asking and Assessment
·3398 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 Show Lab, National University of Singapore
LOVA³ enhances MLLMs by teaching them to ask and assess image-based questions, improving their multimodal understanding and performance on various benchmarks.
LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
·3183 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
LoTLIP boosts language-image pre-training for superior long text understanding by cleverly integrating corner tokens and utilizing a massive dataset of 100M long-caption images.
LocCa: Visual Pretraining with Location-aware Captioners
·2114 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
LocCa, a novel visual pretraining paradigm, uses location-aware captioning tasks to boost downstream localization performance while maintaining holistic task capabilities.
LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Control and Rendering
·2138 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Fudan University
LiveScene: Language-embedded interactive radiance fields efficiently reconstruct and control complex scenes with multiple interactive objects, achieving state-of-the-art results.
Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes
·2315 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
LipFD: a novel method that leverages audio-visual inconsistencies to accurately spot lip-syncing deepfakes, outperforming existing methods and introducing a high-quality dataset for future research.
LG-VQ: Language-Guided Codebook Learning
·3656 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 Harbin Institute of Technology
LG-VQ: A novel language-guided codebook learning framework boosts multi-modal performance.
LG-CAV: Train Any Concept Activation Vector with Language Guidance
·3860 words·19 mins·
AI Generated
Computer Vision
Vision-Language Models
🏢 Zhejiang University
LG-CAV leverages vision-language models to train any concept activation vector (CAV) with language guidance and without labeled data, achieving superior accuracy and enabling state-of-the-art model…
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
·2448 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Show Lab, National University of Singapore
Visual tokens extend text contexts in multi-modal learning, boosting long-text performance.
Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models
·2923 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Southeast University
Lever-LM configures effective in-context demonstrations for large vision-language models using a small language model, significantly improving their performance on visual question answering and image …
LESS: Label-Efficient and Single-Stage Referring 3D Segmentation
·2019 words·10 mins·
Natural Language Processing
Vision-Language Models
🏢 College of Computer Science and Software Engineering, Shenzhen University
LESS achieves state-of-the-art Referring 3D Segmentation using only binary masks, significantly reducing labeling effort and improving efficiency with a novel single-stage pipeline.
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
·2514 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
Boosting complex visual reasoning, a new Iterative and Parallel Reasoning Mechanism (IPRM) outperforms existing methods by combining step-by-step and simultaneous computations, improving accuracy and …
Learning Cortico-Muscular Dependence through Orthonormal Decomposition of Density Ratios
·2506 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Department of Bioengineering, Imperial College London
FMCA-T unveils cortico-muscular dependence through orthonormal decomposition of density ratios, enhancing movement classification and revealing channel-temporal dependencies.
Learning 1D Causal Visual Representation with De-focus Attention Networks
·2168 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
De-focus Attention Networks achieve comparable performance to 2D non-causal models using 1D causal visual representation, solving the ‘over-focus’ issue in existing 1D causal vision models.
LaSe-E2V: Towards Language-guided Semantic-aware Event-to-Video Reconstruction
·2343 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
LaSe-E2V: Language-guided semantic-aware event-to-video reconstruction uses text descriptions to improve video quality and consistency.