Vision-Language Models
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
·3165 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong, Shenzhen
MM-Detect: a novel framework that detects data contamination in multimodal LLMs, improving benchmark reliability by identifying training-set leakage and the inflated performance evaluations it causes.
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
·2197 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Technology Sydney
TIP-I2V: a million-scale dataset of 1.7 million real user text and image prompts for image-to-video generation, supporting both model development and safety research.
Inference Optimal VLMs Need Only One Visual Token but Larger Models
·3063 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
Inference-optimal VLMs: for a fixed inference budget, performance is maximized by pairing the largest feasible language model with a visual input compressed to as little as one token.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
·3628 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai AI Laboratory
OS-Atlas: a new open-source toolkit and model that dramatically improves GUI agent performance through a massive grounding dataset and innovative training methods, enabling superior generalization to unseen interfaces.
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays
·3405 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Institute of High Performance Computing (IHPC)
BenchX: a unified benchmark framework that reveals surprising performance rankings among medical vision-language pretraining (MedVLP) methods under consistent evaluation, challenging prior conclusions and advancing the field.