TL;DR#
Vision-Language Models (VLMs) have shown strong perception abilities, but their reasoning is held back by a scarcity of suitable training data. Existing datasets are limited: they focus on narrow domains, rely on synthetic data, or cover only elementary knowledge. VisualWebInstruct addresses this with a method for building a diverse, high-quality dataset spanning disciplines such as math, physics, finance, and chemistry. It uses Google Image Search to locate relevant webpages, gathers HTML from 700K+ sources, and synthesizes roughly 900K question-answer pairs.
The study presents VisualWebInstruct, which leverages search engines to create diverse multimodal datasets. Starting from 30,000 seed images, visually similar images are found via Google Image Search; the associated HTML is collected and processed into a dataset of about 900K question-answer pairs, roughly 40% of which are visual. Fine-tuning on this data yields absolute gains of 10-20% over Llava-OV-mid and about 5% over MAmmoTH-VL. The resulting model, MAmmoTH-VL2, achieves state-of-the-art performance among comparable open-source models, highlighting the impact of training data on complex reasoning in VLMs.
Key Takeaways#
Why does it matter?#
This paper is important because it addresses the scarcity of high-quality, diverse data for training VLMs to enhance their reasoning abilities. By introducing a novel method for generating multimodal instruction data, it paves the way for creating more capable VLMs that can tackle complex real-world tasks and opens new avenues for research in multimodal learning and reasoning.
Visual Insights#
🔼 This figure illustrates the VisualWebInstruct dataset creation process and its impact. The pipeline starts with 30,000 seed images, uses Google Lens to find similar images on the web, extracts QA pairs from the resulting HTML, and filters for quality. The final dataset contains approximately 900K QA pairs. The bar charts show the performance gains from fine-tuning on this new dataset, comparing MAmmoTH-VL2 against models such as LLaVA-CoT and Qwen2-VL on several benchmarks (MMMU, MMMU-Pro, MathVista, and DynaMath). The improvements demonstrate the effectiveness of VisualWebInstruct in enhancing visual reasoning capabilities.
Figure 1: Overview of our automated data curation approach and major experimental results.
Dataset | Size | Source & Domains | Coverage |
---|---|---|---|
ScienceQA [45] | 21K | Elementary and high school science | Science Q&A, diagrams, K-12 Exam |
IconQA [37] | 107K | Abstract diagrams and visual reasoning | Visual reasoning, diagrams |
Geo170K [16] | 170K | Synthesized from LLMs | Geometry |
CLEVR [24] | 700K | Synthesized from rules | Shapes |
FigureQA [25] | 1.3M | Synthesized from rules | Bar, Line, Pie |
ChartQA [40] | 23K | Charts from Statista, Pew, etc. | Charts |
Math360V [47] | 260K | FigureQA [25], CLEVR [24], IconQA [37], etc | Math reasoning, diagrams |
Mulberry [60] | 260K | Geo3K [36], IconQA [37], ChartQA [40], ScienceQA [45], etc | Geo, Figure, Medical, K-12 Exam |
Llava-CoT [57] | 100K | ChartQA [40], AI2D [26], GeoQA [7], CLEVR [24], etc | Geo, General VQA, K-12 Exam |
VisualWebInstruct | 906K | Internet (Homework Website, Forums, etc) | All Above + College Exams |
🔼 This table compares the VisualWebInstruct dataset with other existing multimodal datasets, highlighting key differences in size, source, domains covered, and types of visual content. It shows that VisualWebInstruct is significantly larger and more diverse than existing datasets, encompassing a broader range of scientific disciplines and image types, making it particularly suitable for training vision-language models on complex reasoning tasks.
Table 1: Comparison between our dataset and the existing datasets. VisualWebInstruct is the most diverse dataset with very broad coverage of disciplines and image types.
In-depth insights#
Web-scale VLM#
Web-scale Vision-Language Models (VLMs) represent a significant leap in AI, moving beyond limited datasets to harness the vastness of internet data. This scaling is crucial for several reasons. First, it provides VLMs with exposure to a much more diverse range of visual concepts, linguistic expressions, and real-world scenarios. This broader training base directly translates to improved generalization capabilities, allowing these models to perform effectively on a wider array of tasks and unseen data. Second, web-scale data enables VLMs to learn more nuanced and complex relationships between vision and language. They can capture subtle contextual cues and associations that would be impossible to discern from smaller, curated datasets. Third, the sheer volume of data helps to mitigate biases that may be present in smaller datasets. By training on a more representative sample of the world, web-scale VLMs can achieve fairer and more robust performance.

However, the move to web-scale also presents significant challenges. Data quality becomes a major concern, as the internet contains a large amount of noisy, irrelevant, or even harmful information. Sophisticated data cleaning and filtering techniques are essential to ensure that VLMs are trained on high-quality data. Computational resources are also a major bottleneck, as training VLMs on web-scale datasets requires massive amounts of compute power and memory. Finally, ethical considerations become paramount, as web-scale VLMs may inadvertently learn and perpetuate biases present in the data, or be used for malicious purposes. Addressing these challenges requires careful attention to data curation, model design, and ethical oversight.
Data Curation#
Data curation in the context of training Vision-Language Models (VLMs) is crucial for enhancing their reasoning capabilities. The traditional reliance on human annotation, while providing high-precision data, suffers from scalability issues. Automated approaches, like leveraging web search, offer a solution to create diverse and high-quality datasets. The process typically involves starting with seed images, using them to identify relevant webpages via search engines like Google, and then extracting HTML content. This raw data undergoes a series of refinement steps including content extraction, filtering, and synthesis to generate question-answer pairs. The quality of extracted data is paramount, so techniques like filtering based on question validity and image relevance, and ensuring consistency among synthesized answers are necessary. This process has the potential to dramatically scale up the amount of reasoning-focused multimodal data available for training VLMs.
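To make that flow concrete, here is a minimal sketch of such a curation loop in Python. The helper functions (`image_search`, `fetch_html`, `extract_qa_pairs`, `passes_filters`) are hypothetical placeholders standing in for the search, scraping, and filtering steps described above, not the authors' actual tooling.

```python
# Sketch of a web-search-driven QA curation loop.
# All helper callables are hypothetical placeholders, not the paper's implementation.

def curate_qa_dataset(seed_images, image_search, fetch_html,
                      extract_qa_pairs, passes_filters):
    """Collect candidate QA pairs starting from a set of seed images."""
    dataset, seen_urls = [], set()
    for image in seed_images:
        # 1. Use an image-search backend to find pages with visually similar content.
        for url in image_search(image):
            if url in seen_urls:
                continue
            seen_urls.add(url)
            # 2. Download the page and pull out candidate question-answer pairs.
            html = fetch_html(url)
            if html is None:
                continue
            for qa in extract_qa_pairs(html):
                # 3. Keep only pairs that pass validity and image-relevance filters.
                if passes_filters(qa):
                    dataset.append({"source_url": url, **qa})
    return dataset
```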
Instruct Tuning#
While the provided document does not explicitly contain a section titled “Instruct Tuning,” we can infer its relevance within the context of vision-language models (VLMs). Instruct tuning, in general, involves fine-tuning a pre-trained model using a dataset of instructions and corresponding desired outputs. This process significantly enhances the model’s ability to follow user commands, perform specific tasks, and generalize to unseen scenarios. In the context of VLMs, instruct tuning could be applied to improve the model’s ability to perform complex reasoning tasks based on visual and textual inputs. For instance, the VisualWebInstruct dataset, described in the paper, could be used to fine-tune VLMs, where the instructions are questions about images and the outputs are the corresponding answers. This would enable the model to better understand the relationships between visual elements and textual descriptions, leading to improved performance on tasks such as visual question answering, image captioning, and visual reasoning. The paper’s results suggest that this style of data-driven instruction tuning is effective, producing more capable vision-language models.
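As a simple illustration, one image QA pair from such a dataset could be wrapped into a conversation-style training record along the following lines; the schema, field names, and the example question are assumptions made for this sketch, not the authors' exact format.

```python
# Hypothetical conversion of one image QA pair into an instruction-tuning record;
# the conversation schema below is assumed, not necessarily the authors' format.

def to_instruction_record(image_path: str, question: str, answer: str) -> dict:
    """Wrap an image QA pair as a single-turn instruction-tuning sample."""
    return {
        "images": [image_path],
        "conversations": [
            {"role": "user", "content": f"<image>\n{question}"},
            {"role": "assistant", "content": answer},
        ],
    }

# Example usage with a made-up geometry question.
record = to_instruction_record(
    "figures/circle_geometry.png",
    "What is the measure of an inscribed angle that subtends the diameter?",
    "An inscribed angle subtending a diameter is a right angle, so it measures 90 degrees.",
)
```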
Reasoning Gains#
Reasoning gains in multimodal models are significantly impacted by training data. High-quality, diverse datasets like VisualWebInstruct enhance reasoning by exposing models to various disciplines and problem types. Fine-tuning on such datasets demonstrably improves performance on benchmarks like MMMU and MathVista. The combination of web-sourced data with techniques like chain-of-thought further amplifies reasoning capabilities, allowing models to tackle complex, multi-step problems that require deliberate thought and visual context understanding. The effectiveness hinges on balancing data quality and diversity to facilitate broader generalization and prevent overfitting to specific domains.
Dataset Scaling#
Dataset scaling is a crucial aspect of training effective machine learning models, particularly vision-language models (VLMs). The paper addresses the scarcity of reasoning-focused datasets and proposes a novel approach, VisualWebInstruct, to scale up instruction data. Scaling involves using web search to gather a diverse dataset, including images from various disciplines. The authors start with 30K seed images and use Google Image Search to find similar images, extracting data from 700K unique URLs. The scaling aims to increase data quantity and diversity and improve the model’s ability to handle complex multimodal tasks. The use of web search is a clever way to overcome the data scarcity issue, enabling the creation of a large dataset without extensive human annotation. The dataset size and the comprehensive nature of web-derived data help enhance the model’s reasoning capabilities.
More visual insights#
More on figures
🔼 This figure details the two-stage pipeline used to create the VisualWebInstruct dataset. Stage 1 begins with seed images, uses Google Image Search to find related webpages, converts those pages into accessibility trees, extracts initial question-answer pairs, and then refines them. Stage 2 generates multiple synthesized answers to filter for consistency, aligning them with original web content to improve accuracy.
Figure 2: Comprehensive Pipeline for VisualWebInstruct Dataset Generation. The workflow illustrates our multi-stage approach for creating high-quality multimodal instruction data. Stage 1: starting with seed images, we leverage Google Image Search to identify relevant webpages, which are processed into accessibility trees. The raw QA pairs are extracted from the trees and refined through a post-processing step to ensure the validity of the data. Stage 2: we first generate multiple synthesized answers for consistency filtering, then align these with original web-sourced content to enhance the accuracy of the answers.
🔼 This figure shows how Google Lens was used to search for webpages containing images related to circle geometry problems. The query image (a circle diagram) is shown, illustrating how Google Lens’s image search capability was used to find relevant webpages for data collection. The process starts with a seed image as input, and the search engine returns a list of URLs to webpages containing visually similar content. These webpages were then used to extract the question-answer pairs for creating the final dataset.
Figure 3: Example of Google Lens search functionality for circle geometry problems.
🔼 Figure 4 shows an example of how the authors extract structured data from an educational website. The figure presents an accessibility tree, a hierarchical representation of a webpage’s content. The tree nodes represent either textual content (questions, options, answers) or image elements. The tree structure is used to extract question-answer pairs, preserving the relationship between textual components and related images.
Figure 4: Example of an accessibility tree structure extracted from an educational website.
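As a rough illustration of this idea, the sketch below walks a simplified accessibility tree in reading order and groups question text, answer text, and image references; the node structure and the question/answer heuristics are assumptions for the sketch, not the paper's implementation.

```python
# Illustrative traversal of a simplified accessibility tree to group question text,
# answer text, and image references; the Node structure here is an assumption.

from dataclasses import dataclass, field

@dataclass
class Node:
    role: str                        # e.g. "heading", "text", "image"
    text: str = ""                   # textual content or image alt text / reference
    children: list = field(default_factory=list)

def iter_nodes(node):
    """Yield nodes in depth-first (reading) order."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)

def collect_qa(root):
    """Group question text, answer text, and nearby images into QA items."""
    pairs, current = [], None
    for node in iter_nodes(root):
        text = node.text.strip()
        if text.lower().startswith(("question", "q:")):
            # A question-like node starts a new QA item.
            current = {"question": text, "answer": "", "images": []}
            pairs.append(current)
        elif current is not None and node.role == "image":
            current["images"].append(text)
        elif current is not None and text.lower().startswith(("answer", "a:")):
            current["answer"] = text
    return pairs
```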
🔼 This figure illustrates the process of ensuring answer consistency in the VisualWebInstruct dataset. Multiple answers are generated for each question using GPT-4o. Then, an LLM judge (also GPT-4o) evaluates the consistency of these answers. Only questions with a majority of consistent answers are retained in the final dataset, improving the quality and reliability of the data.
Figure 5: Illustration of our consistency checking methodology.
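A minimal sketch of such majority-based consistency filtering is given below; `generate_answer` and `judge_equivalent` are placeholders for calls to the answer model and the LLM judge (GPT-4o in the paper), and the sample count and agreement threshold are illustrative assumptions.

```python
# Sketch of majority-based consistency filtering; generate_answer and judge_equivalent
# are placeholders for model and LLM-judge calls, and the defaults are illustrative.

def consistency_filter(question, generate_answer, judge_equivalent,
                       n_samples=4, min_agreement=0.5):
    """Keep a question only if a majority of sampled answers agree with one another."""
    answers = [generate_answer(question) for _ in range(n_samples)]
    best_answer, best_votes = answers[0], 0
    for i, candidate in enumerate(answers):
        # Count how many of the other samples the judge deems equivalent to this one.
        votes = sum(judge_equivalent(question, candidate, other)
                    for j, other in enumerate(answers) if j != i)
        if votes > best_votes:
            best_answer, best_votes = candidate, votes
    agreement = best_votes / max(n_samples - 1, 1)
    # Retain the question with its most-agreed answer, or drop it as inconsistent.
    return best_answer if agreement >= min_agreement else None
```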
More on tables
Category | Percentage | Num of QA Pairs |
---|---|---|
Math | 62.50% | 566K |
Physics | 14.50% | 132K |
Finance | 7.25% | 66K |
Chemistry | 4.80% | 43K |
Engineering | 4.35% | 39K |
Others | 6.60% | 60K |
🔼 Table 2 shows the distribution of different categories of questions within the VisualWebInstruct dataset. The dataset covers multiple disciplines, with the majority of questions falling under mathematics (62.5%). Other significant categories include physics, finance, chemistry, and engineering, while a smaller ‘Others’ category encompasses subjects such as general knowledge, computer science, biology, and humanities.
Table 2: Distribution of Categories in VisualWebInstruct
Processing Stage | Total QA Pairs | Image-Associated QA | Unique Questions | Total Images | Unique Images |
---|---|---|---|---|---|
Stage 1: Mining Data from the Internet | |||||
Seed Data Collection | - | - | - | 30,000 | 30,000 |
QA Pairs Extraction | 421,320 | 248,643 | 421,320 | 552,269 | 362,728 |
Post-Processing | 361,015 | 159,059 | 361,015 | 331,818 | 212,530 |
Stage 2: Dataset Refinement | |||||
Answer Refinement | 1,041,598 | 407,218 | 257,201 | 577,455 | 167,493 |
Answer Alignment | 906,160 | 347,313 | 257,201 | 475,099 | 163,743 |
🔼 This table details the statistics of the VisualWebInstruct dataset creation pipeline. It shows the number of question-answer pairs and images at each stage: initial data mining from the internet, post-processing for quality control, and the answer refinement steps. The table illustrates the data’s evolution and the impact of the various filtering and refinement steps on the dataset’s final size and composition.
Table 3: Statistics of different milestones in the data processing pipeline of VisualWebInstruct.
Model | Size | MMMU (val) | MMMU-Pro (standard) | MMMU-Pro (vision) | MathVista (testmini) | MMVet (test) | MathVerse (testmini) | Dyna-Math (test) | Avg |
---|---|---|---|---|---|---|---|---|---|
Closed-source Models | |||||||||
GPT-4o | - | 69.1 | 54.0 | 49.7 | 63.8 | 76.2 | 50.2 | 63.7 | 61.0 |
Gemini-1.5-Pro | - | 59.1 | 49.4 | 65.8 | 63.9 | 64.0 | 41.2 | 64.8 | 58.3 |
Claude-3.5-Sonnet | - | 68.3 | 55.0 | 48.0 | 67.7 | 75.4 | 44.2 | 60.5 | 59.9 |
Open-source General Vision-Language Models | |||||||||
Molmo | 8B | 45.3 | 28.3 | 18.9 | 51.6 | 58.0 | 18.9 | 41.6 | 37.5 |
Llava-OV | 7B | 48.8 | 29.5 | 18.7 | 63.2 | 58.6 | 26.2 | 40.3 | 40.8 |
Llama-3.2-Inst | 11B | 50.7 | 33.0 | 23.7 | 51.5 | 59.3 | 31.6 | 40.5 | 41.5 |
Qwen2-VL | 7B | 52.1 | 37.0 | 26.9 | 58.2 | 62.0 | 28.2 | 42.1 | 43.8 |
MAmmoTH-VL | 7B | 50.8 | 33.2 | 25.3 | 66.0 | 62.3 | 34.2 | 44.7 | 45.2 |
InternVL2.5 | 7B | 55.8 | 38.2 | 30.4 | 64.4 | 62.8 | 39.5 | 49.8 | 48.7 |
Phi-4-mini | 5.6B | 55.1 | 39.7 | 31.2 | 62.4 | 60.5 | 37.6 | 51.4 | 48.6 |
DeepSeek-VL2 | 27B | 51.1 | 31.4 | 24.3 | 62.8 | - | - | - | - |
Llava-CoT-L | 11B | 50.1 | 31.6 | 20.4 | 54.8 | 60.3 | 30.2 | 44.8 | 41.7 |
Llava-CoT-M | 7B | 51.4 | 33.0 | 23.7 | 63.8 | 58.6 | 39.4 | 48.3 | 45.5 |
LlamaV-o1 | 11B | 49.1 | 31.5 | 22.4 | 54.4 | 63.6 | - | - | - |
Mulberry | 7B | 55.0 | 36.8 | 23.6 | 63.1 | 60.9 | 31.0 | 45.1 | 45.0 |
Insight-V | 8B | 50.2 | 30.7 | 20.5 | 59.9 | 60.8 | 28.7 | 47.8 | 42.6 |
MM-Eureka | 8B | 49.2 | - | - | 67.1 | 60.7 | 40.4 | - | - |
MAmmoTH-VL2 | 7B | 54.7 | 40.7 | 26.3 | 68.1 | 64.5 | 42.6 | 55.7 | 50.4 |
over SoTA | - | -1.1 | +1.0 | -4.9 | +2.1 | +0.9 | +3.1 | +4.3 | +1.7 |
🔼 This table presents a comprehensive comparison of the performance of various vision-language models, including the authors’ model (MAmmoTH-VL2), across seven different visual reasoning benchmarks. The benchmarks cover a range of tasks and complexity levels, assessing the models’ capabilities in handling various aspects of visual reasoning. The table highlights the superior performance of the authors’ model compared to other open-source models, particularly in mathematical reasoning tasks. Results from closed-source models are also included for context. The best and second-best performances among open-source models are clearly indicated.
Table 4: Evaluation Results of our model and other baseline models. Most of the baseline results are taken from other papers. The best and second-best results across all open-source models are highlighted in bold and underlined, respectively.
Training Data | MMMU (val) | MMMU-Pro (standard) | MMMU-Pro (vision) | MathVista (testmini) | MMVet (test) | MathVerse (testmini) | Dyna-Math (test) | Avg |
---|---|---|---|---|---|---|---|---|
Training from Llava-OV-mid | |||||||
- | 40.1 | 21.2 | 12.2 | 36.0 | 32.1 | 18.1 | 24.4 | 26.3 |
Llava-CoT | 40.8 | 25.8 | 14.6 | 45.7 | 47.5 | 27.2 | 33.9 | 33.6 |
Ours | 45.3 | 31.5 | 20.9 | 43.9 | 57.6 | 27.4 | 40.3 | 38.1 |
Ours+Llava-CoT | 47.6 | 31.6 | 20.9 | 48.8 | 51.7 | 34.9 | 42.3 | 39.7 |
Training from MAmmoTH-VL | ||||||||
- | 50.8 | 34.8 | 25.3 | 66.0 | 62.3 | 34.2 | 44.7 | 45.4 |
Llava-CoT | 51.4 | 35.2 | 24.6 | 63.8 | 58.7 | 39.4 | 48.3 | 45.9 |
Ours | 52.6 | 38.6 | 29.0 | 65.9 | 61.8 | 39.4 | 55.7 | 49.0 |
Ours+Llava-CoT | 54.7 | 40.7 | 26.3 | 68.1 | 64.5 | 42.6 | 55.7 | 50.4 |
🔼 This table presents the ablation study results, showing the impact of different training datasets and their combinations on model performance. It compares two base models, Llava-OV-mid and MAmmoTH-VL, trained on various datasets: Llava-CoT, VisualWebInstruct, and a combination of both. Results are reported on seven benchmark splits (MMMU, MMMU-Pro standard and vision, MathVista, MMVet, MathVerse, and DynaMath) together with their average for each training configuration. The best performance for each base model is highlighted in bold, illustrating the contribution of the VisualWebInstruct dataset and data-mixing strategies to improved performance on various visual reasoning tasks.
Table 5: Ablation Results of our experiments. We show experimental results from different backbones to show the impact of consistency filtering and data mixing with Llava-CoT. For each base model, the best performance is highlighted in bold.