Vision-Language Models
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
·3165 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong, Shenzhen
MM-Detect: a novel framework that detects data contamination in multimodal LLMs, improving benchmark reliability by identifying training-set leakage and the inflated performance evaluations it causes.
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
·2197 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Technology Sydney
TIP-I2V: a million-scale dataset of 1.7 million real user text and image prompts for image-to-video generation, supporting both model development and safety research.
Inference Optimal VLMs Need Only One Visual Token but Larger Models
·3063 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
Inference-optimal VLMs: for a fixed inference budget, performance is maximized by pairing the largest feasible language model with a visual input compressed to as little as one token.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
·3628 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai AI Laboratory
OS-Atlas: a new open-source toolkit and model that dramatically improves GUI agent performance through a massive grounding dataset and innovative training methods, enabling superior generalization to unseen interfaces.
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays
·3405 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Institute of High Performance Computing (IHPC)
BenchX: a unified benchmark framework that reveals surprising performance rankings among medical vision-language pretraining (MedVLP) methods under consistent evaluation, challenging prior conclusions and advancing the field.