TL;DR#
Current open-source multimodal large language models (MLLMs) struggle with complex reasoning tasks because existing instruction-tuning datasets typically lack detailed rationales and focus on simpler tasks. This limits the models’ ability to tackle real-world problems that require deeper reasoning.
To address this, the researchers present MAmmoTH-VL, an approach to building a large-scale multimodal instruction-tuning dataset. They use open-source models to rewrite existing datasets, adding detailed rationales and increasing task complexity, which yields 12 million instruction-response pairs. Training an MLLM on this dataset achieves state-of-the-art performance on a range of benchmarks, particularly those involving complex reasoning, and the pipeline offers the open-source MLLM community a scalable, cost-efficient way to build high-quality multimodal datasets.
Key Takeaways#
Why does it matter?#
This paper is crucial for researchers in multimodal learning and large language models. It introduces a scalable and cost-effective methodology for creating high-quality instruction-tuning datasets, addressing a major bottleneck in the field. The resulting dataset and model significantly advance the state-of-the-art, opening new avenues for research and providing valuable resources for the broader community.
Visual Insights#
🔼 Figure 1 presents a performance comparison of the MAmmoTH-VL-8B model against several baseline models across eight multimodal datasets. The key finding is that using a simple rewriting technique with open-source language models significantly improves the quality of visual instruction data. This rewriting method encourages chain-of-thought (CoT) reasoning. Training MAmmoTH-VL-8B on this enhanced data leads to substantial gains in performance that scale with the model size. LLaVA-OneVision and LLaVA-CoT models serve as baselines for comparison.
Figure 1: Scaling effects of MAmmoTH-VL-8B on eight multimodal evaluation datasets. A simple rewriting approach using open models improves the quality of visual instruction data by eliciting chain-of-thought (CoT) reasoning. Training on this rewritten data demonstrates significant performance gains through increased model scale. Llava-OneVision-7B&72B (Li et al., 2024b) and Llava-CoT (Xu et al., 2024a) are included as references.
🔼 This table details the hyperparameters and settings used during the three-stage training process of the MAmmoTH-VL-8B multimodal large language model. It specifies the resolution, number of tokens, dataset used, number of samples, vision tower architecture, LLM backbone, trainable model parameters, batch size, maximum model length, and learning rates for the vision and language model components for each training stage. The stages represent distinct phases in the model’s training: Language-Image Alignment, Visual Instruction Tuning (Single Image), and Visual Instruction Tuning (One Vision).
Table 1: Detailed configuration for each training stage of the MAmmoTH-VL-8B model.
In-depth insights#
Multimodal Reasoning#
Multimodal reasoning, the capacity to integrate and interpret information from diverse sources like text, images, and audio, is a crucial frontier in AI. The paper’s focus on eliciting this ability in large language models (LLMs) is significant because current models often struggle with reasoning-heavy multimodal tasks. This highlights a critical gap between the potential of multimodal LLMs and their actual performance. The proposed solution—a scalable, cost-effective method for creating instruction-tuning datasets with rich rationales—directly addresses this weakness. By focusing on reasoning-intensive tasks and detailed rationales, the methodology aims to move beyond simplistic tasks that dominate existing datasets. The results, showing significant improvement in reasoning benchmarks, demonstrate the success of this approach. This improvement is particularly notable in tasks requiring intricate reasoning and alignment between different modalities, suggesting the methodology’s effectiveness in fostering higher-order cognitive abilities in LLMs. However, limitations exist, particularly regarding dataset scale for multimodal tasks involving video and multiple images. Future research could explore methods to efficiently scale data collection and tackle the complexities inherent in processing these more demanding data types. The overall contribution emphasizes the need for high-quality, reasoning-focused datasets and the potential of open-source methods to bridge the gap between cutting-edge research and practical application.
Instruction Tuning#
Instruction tuning is a crucial technique for aligning large language models (LLMs) with human intentions. It involves fine-tuning pre-trained LLMs on a dataset of instruction-response pairs, enabling the model to better understand and follow diverse instructions. The key to successful instruction tuning lies in the quality and diversity of the instruction dataset. A high-quality dataset comprises various instructions, encompassing diverse levels of complexity and nuanced expression, often including detailed rationales or chain-of-thought reasoning. The scale of the dataset also significantly impacts performance, with larger, more diverse datasets leading to superior results. Furthermore, the choice of model architecture and training methodology is critical for optimizing performance and ensuring that the LLM generalizes well to unseen instructions. Careful consideration of these factors ensures a fine-tuned LLM capable of reliably following complex and nuanced instructions, ultimately enhancing its overall utility and usability.
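To make the format concrete, here is a minimal sketch of what a single multimodal instruction-tuning example and its training-ready text might look like; the field names, the chat template, and the image placeholder are illustrative assumptions, not the paper’s actual data schema.

```python
# A minimal, illustrative sketch of one multimodal instruction-tuning example.
# Field names and the chat template are assumptions for illustration, not the
# paper's actual data schema.

example = {
    "image": "chart_001.png",  # visual input referenced by the instruction
    "instruction": "Which quarter shows the largest revenue growth, and why?",
    "response": (
        "Step 1: Read the revenue values for each quarter from the chart.\n"
        "Step 2: Compute the quarter-over-quarter differences.\n"
        "Step 3: Q3 shows the largest increase, so Q3 has the largest growth."
    ),
}

def to_training_text(ex: dict) -> str:
    """Render an instruction-response pair into a simple chat-style training string."""
    return (
        f"<image>{ex['image']}</image>\n"
        f"User: {ex['instruction']}\n"
        f"Assistant: {ex['response']}"
    )

print(to_training_text(example))
```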
Data Augmentation#
Data augmentation is a crucial technique in machine learning, particularly when dealing with limited datasets. In the context of multimodal learning, it is even more critical as obtaining large, diverse, high-quality multimodal datasets is expensive and time-consuming. The paper explores a novel, cost-effective approach for data augmentation that involves using open-source large language models (LLMs) to rewrite and enhance existing visual instruction datasets. This process focuses on eliciting chain-of-thought (CoT) reasoning by adding detailed rationales and intermediate steps to simplistic instruction-response pairs, thus greatly expanding the amount of training data. The effectiveness of this approach is validated through experiments demonstrating significant performance gains compared to models trained on non-augmented data. The strategy prioritizes a scalable and open-source solution and avoids reliance on computationally expensive or proprietary methods for generating augmented data. The pipeline’s steps – data collection, augmentation using open LLMs, and rigorous filtering – are designed for broad applicability. Furthermore, the research highlights the importance of self-filtering techniques for data quality control, and addresses potential issues such as hallucinations during the generation of augmented data.
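As a rough illustration of the rewriting step described above, the sketch below asks an open-weight model served behind an OpenAI-compatible endpoint to expand a terse question-answer pair into a step-by-step rationale. The endpoint URL, model name, and prompt wording are assumptions for illustration, not the paper’s exact pipeline.

```python
import requests

# Sketch of the rewriting step: ask an open-weight (M)LLM to expand a terse
# question-answer pair into a detailed, step-by-step rationale. The endpoint,
# model name, and prompt wording are illustrative assumptions (e.g., a local
# server exposing an OpenAI-compatible chat API), not the paper's exact setup.

REWRITE_PROMPT = (
    "Rewrite the following question and short answer into a richer instruction "
    "with a detailed, step-by-step rationale that ends with the final answer.\n\n"
    "Question: {question}\nAnswer: {answer}"
)

def rewrite_pair(question: str, answer: str,
                 endpoint: str = "http://localhost:8000/v1/chat/completions",
                 model: str = "open-rewriter-model") -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user",
                      "content": REWRITE_PROMPT.format(question=question, answer=answer)}],
        "temperature": 0.7,
    }
    resp = requests.post(endpoint, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Example usage on a simplistic pair:
# print(rewrite_pair("What is 15% of 80?", "12"))
```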
Open-Source Methods#
The embrace of open-source methodologies in research significantly impacts reproducibility and accessibility. Open-source code allows other researchers to verify results, adapt methods, and build upon existing work, fostering collaboration and accelerating progress. Open-access datasets democratize research by removing financial barriers and enabling broader participation. This inclusivity encourages diverse perspectives and contributes to more robust and generalizable findings. However, relying solely on open-source tools can present challenges. The quality of open-source tools and datasets can vary significantly, requiring careful evaluation and validation. Furthermore, the open-source landscape may lack the comprehensive features or specialized functionalities available in commercial software, potentially limiting the scope of some research endeavors. Successfully leveraging open-source methods requires a strategic approach, balancing cost-effectiveness with the need for quality and appropriate functionality. While open-source offers significant advantages, researchers must consider its limitations to ensure the reliability and impact of their research.
Ablation Studies#
Ablation studies systematically remove components of a model or system to assess their individual contributions. In the context of a research paper, this involves isolating variables to determine their effect on overall performance. For a multimodal model, ablation could focus on removing specific components like the visual encoder, language model, or the fusion mechanism to understand their importance. A well-designed ablation study should highlight the relative impact of various model components, offering insights into which parts are most crucial and others that are less impactful or even detrimental. It also helps to validate design choices, determine if features are overfitting or underfitting, and refine future model iterations. By carefully controlling which components are removed and measuring the consequent changes in performance metrics, researchers can draw definitive conclusions about the model’s architecture and its strengths and weaknesses. This process is essential in building robust and explainable AI models. The insights gained from a comprehensive ablation study are invaluable to guide future research and development efforts, allowing for more efficient and effective model design.
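A generic sketch of such an ablation loop is shown below: the same system is rebuilt with one component switched off at a time and re-evaluated against the full configuration. The component names and the `build_model`/`evaluate` callables are hypothetical placeholders, not the paper’s actual ablation setup.

```python
# Generic sketch of an ablation loop: evaluate the system with one component
# disabled at a time and compare against the full configuration.
# `build_model`, `evaluate`, and the component names are hypothetical.

FULL_CONFIG = {"vision_encoder": True, "cot_rewriting": True, "data_filtering": True}

def run_ablations(build_model, evaluate, config=FULL_CONFIG):
    baseline = evaluate(build_model(config))
    results = {"full": baseline}
    for component in config:
        ablated = dict(config, **{component: False})  # switch off one component
        score = evaluate(build_model(ablated))
        results[f"-{component}"] = score
        print(f"without {component}: {score:.1f} (delta {score - baseline:+.1f})")
    return results
```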
More visual insights#
More on figures
🔼 Figure 2 illustrates the three-step pipeline used to create the MAmmoTH-VL instruction dataset. First, existing open-source multimodal datasets are manually collected and categorized. Second, these datasets are rewritten using large language models (LLMs) and multimodal LLMs (MLLMs) to generate more complex questions and answers with detailed, step-by-step reasoning. This rewriting process elicits Chain-of-Thought (CoT) reasoning, improving the quality of the instructions. Finally, the same MLLM acts as a judge to filter out low-quality or hallucinated entries. The figure provides examples illustrating how simple question-answer pairs are transformed into more detailed, step-by-step CoT responses in both math and science domains.
Figure 2: Overview of our simple yet scalable visual instruction data rewriting pipeline with three steps: manual data source collection, rewriting using MLLMs/LLMs, and filtering via the same MLLM as a judge. Examples below illustrate transformations in math and science categories, showcasing detailed, step-by-step responses.
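The filtering step of this pipeline can be pictured as in the sketch below, where the same MLLM is reused as a judge to score each rewritten entry and low-scoring (likely hallucinated) entries are dropped. The `judge_with_mllm` callable, the prompt wording, and the score threshold are illustrative assumptions, not the paper’s exact implementation.

```python
# Sketch of the third pipeline step: the same MLLM is reused as a judge to
# score each rewritten entry, and low-scoring (likely hallucinated) entries
# are dropped. `judge_with_mllm` and the threshold are illustrative assumptions.

JUDGE_PROMPT = (
    "Rate the following instruction-response pair for factual consistency with "
    "the image and relevance to the question, on a scale of 1-5. Reply with a "
    "single integer.\n\n{entry}"
)

def self_filter(entries, judge_with_mllm, threshold=4):
    """Keep only entries whose judge score meets the threshold."""
    kept = []
    for entry in entries:
        score = int(judge_with_mllm(JUDGE_PROMPT.format(entry=entry)))
        if score >= threshold:
            kept.append(entry)
    return kept
```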
🔼 Figure 3 presents a comprehensive overview of the MAmmoTH-VL-Instruct dataset, which comprises 12 million multimodal instruction-response pairs. The left panel displays the category distribution within the dataset, illustrating the proportion of data points belonging to ten major categories: General, OCR, Chart, Caption, Domain-specific, Code&Math, Language, Detection, Multi-Image, and Video. Each category represents a distinct set of tasks or scenarios covered by the data. The right panel provides detailed information about the individual data sources used to build the MAmmoTH-VL-Instruct dataset, demonstrating the breadth and diversity of the data collection process. The figure clearly highlights both the diversity of the tasks covered and the range of sources that contributed to the construction of this large-scale dataset.
Figure 3: The data distribution of MAmmoTH-VL-Instruct (12M). Left: Category distribution. Right: Details of data sources.
🔼 Figure 4 compares the quality of the original and rewritten multimodal instruction data on two metrics: (1) content and relevance scores assigned by a language-model judge, which are higher for the rewritten data, indicating improved quality; and (2) token-length distributions, where the rewritten responses are longer because the rewriting process adds detailed rationales to the instruction-response pairs.
Figure 4: Comparison of original and rewritten data across two metrics: (1) Content and Relevance Scores judged by MLLMs show that rewritten data scores higher, indicating improved quality; (2) Token Length distribution suggests that rewritten data tends to be longer, including more tokens for rationales.
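The token-length half of this comparison can be approximated as in the sketch below, which contrasts length statistics of original versus rewritten responses; a whitespace split stands in for a real tokenizer, and the toy strings are not drawn from the dataset.

```python
import statistics

# Sketch of the token-length comparison behind Figure 4. A whitespace split is
# used as a stand-in tokenizer; the paper's actual tokenizer and data loading
# are not specified here.

def length_stats(responses):
    lengths = [len(r.split()) for r in responses]
    return {"mean": statistics.mean(lengths), "median": statistics.median(lengths)}

original = ["12", "The cat."]  # toy originals with phrase-level answers
rewritten = [
    "Step 1: 15% of 80 is 0.15 x 80. Step 2: That equals 12, so the answer is 12.",
    "The animal has pointed ears, whiskers, and a long tail, so it is a cat.",
]

print("original :", length_stats(original))
print("rewritten:", length_stats(rewritten))
```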
🔼 This t-SNE plot visualizes the difference in topic distribution between the original and rewritten multimodal instruction datasets. The original dataset’s points cluster more tightly, indicating less diversity in the types of instructions and questions. The rewritten dataset shows a wider spread of points, demonstrating a significant increase in the variety of tasks and the complexity of reasoning required to answer them. The expansion beyond the original dataset’s clusters indicates that the rewriting process successfully generated new and diverse instructions that go beyond the scope of existing datasets, improving coverage of complex reasoning tasks.
Figure 5: The t-SNE data distribution plot demonstrates how the rewritten data expands beyond the original dataset, increasing topic diversity and enhancing coverage of complex queries and reasoning.
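A plot of this kind can be produced roughly as in the sketch below: embed the instructions from both sets, project them to 2-D with t-SNE, and color points by source. TF-IDF features stand in for whatever embedding model the authors used, and the toy texts are only placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Sketch of how a plot like Figure 5 can be produced: embed instructions from
# the original and rewritten sets, project to 2-D with t-SNE, and color by
# source. TF-IDF is a stand-in for whatever embedding model was actually used.

original_texts = ["What is shown?", "Read the value.", "Name the object."]
rewritten_texts = [
    "Describe the chart, then reason step by step about which series grows fastest.",
    "Explain how the axis labels constrain the possible answer before giving it.",
    "Identify the object and justify the identification from visible attributes.",
]

texts = original_texts + rewritten_texts
labels = ["original"] * len(original_texts) + ["rewritten"] * len(rewritten_texts)

embeddings = TfidfVectorizer().fit_transform(texts).toarray()
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embeddings)

for name, color in [("original", "tab:blue"), ("rewritten", "tab:orange")]:
    idx = [i for i, l in enumerate(labels) if l == name]
    plt.scatter(coords[idx, 0], coords[idx, 1], label=name, c=color)
plt.legend()
plt.savefig("tsne_topics.png")
```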
🔼 This figure shows the percentage of data filtered out during the data cleaning process for different categories of data. The filtering aimed to remove low-quality or hallucinated data entries. The results reveal that certain data types, like those involving general question answering (GeneralQA) and mathematical problems (Math), had lower filter rates, indicating higher initial data quality. In contrast, Optical Character Recognition (OCR) and chart-related data (Chart) experienced significantly more extensive filtering, suggesting that these categories contained a higher proportion of problematic data points requiring removal.
Figure 6: The filter rates of different data types after filtering, with a lower filtering rate seen in categories like GeneralQA and Math, while OCR and Chart data experience more extensive filtering.
🔼 This figure presents a bar chart comparing the performance of a model trained on filtered and unfiltered data. The chart shows that filtering significantly improves the model’s accuracy, especially in tasks involving chart and document understanding. This improvement is attributed to the reduction of hallucinations, which are errors where the model generates incorrect information that seems plausible but isn’t supported by the data. The results highlight the importance of data filtering in improving the overall quality and reliability of datasets for training multimodal large language models (MLLMs).
Figure 7: Data filtering significantly improves the quality of generated data, particularly in chart and document understanding, where hallucinations are more frequent.
More on tables
 | Stage-1 | Stage-2 | Stage-3 |
---|---|---|---|
Resolution | 384 | 384 × {1×1, …} | |
#Tokens | 729 | Max 729×5 | |
Dataset | LCS | Single Image | |
#Samples | 558K | 10M | |
Vision Tower | siglip-so400m-patch14-384 | siglip-so400m-patch14-384 | |
LLM Backbone | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct | |
Trainable Model Parameters | Projector: 20.0M | Full Model: 8.0B | |
Batch Size | 512 | 256 | |
Model Max Length | 8192 | 8192 | |
Learning Rate: ψ_vision | 1×10⁻³ | 2×10⁻⁶ | |
Learning Rate: {θ_proj, Φ_LLM} | 1×10⁻³ | 1×10⁻⁵ | |
Epoch | 1 | 1 | |
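The per-component learning rates in Table 1 (e.g., 2×10⁻⁶ for the vision tower versus 1×10⁻⁵ for the projector and LLM in Stage-2) are typically wired up with optimizer parameter groups. A minimal PyTorch sketch is shown below, with toy modules standing in for the real SigLIP tower and Qwen2.5 backbone.

```python
import torch

# Sketch of how the per-component learning rates in Table 1 (e.g., Stage-2:
# 2e-6 for the vision tower, 1e-5 for the projector and LLM) can be wired up
# with optimizer parameter groups. The toy modules below are placeholders.

class ToyMLLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = torch.nn.Linear(32, 16)  # stands in for SigLIP
        self.projector = torch.nn.Linear(16, 16)
        self.llm = torch.nn.Linear(16, 8)            # stands in for Qwen2.5-7B

model = ToyMLLM()
optimizer = torch.optim.AdamW([
    {"params": model.vision_tower.parameters(), "lr": 2e-6},
    {"params": list(model.projector.parameters()) + list(model.llm.parameters()),
     "lr": 1e-5},
])
```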
🔼 Table 2 presents the performance comparison of various large language models (LLMs) across a suite of benchmark tests evaluating multi-disciplinary knowledge and mathematical reasoning capabilities. The benchmarks cover diverse tasks requiring complex reasoning and problem-solving skills. Models are categorized into three groups based on their accessibility and transparency: closed-source (proprietary), open-weight (model weights are publicly available but training details are not), and fully open-source (both weights and training details are open). Performance metrics were sourced either from the official publications of the respective LLMs or calculated using the lmms-eval package. This table is crucial for illustrating the significant improvement achieved by the proposed MAmmoTH-VL-8B model, particularly when compared to fully open-source models of a similar size.
Table 2: Performance on multi-discipline knowledge and mathematical reasoning benchmarks. We highlight different groups of models with different colors: closed-source models, open weights but closed training details, and fully open-source models. Results are from official sources or running with lmms-eval package if unavailable.
Model | MMStar (test) | MMMU (val) | MMMU-Pro (vision) | SeedBench (test) | MMBench (en-test) | MMVet (test) | MathVerse (mini-vision) | MathVista (testmini) |
---|---|---|---|---|---|---|---|---|
GPT-4o (OpenAI, 2024) | 64.7 | 69.1 | 49.7 | 76.2 | 82.1 | 76.2 | 50.2 | 63.8 |
Gemini-1.5-Pro (Gemini Team, 2023) | 59.1 | 65.8 | 44.4 | 76.0 | 73.9 | 64.0 | - | 63.9 |
Claude-3.5-Sonnet (Anthropic, 2024) | 62.2 | 68.3 | 48.0 | 72.2 | 79.7 | 75.4 | - | 67.7 |
InternVL2-76B (Chen et al., 2023b) | 67.1 | 58.2 | 38.0 | 77.6 | 86.5 | 64.4 | - | 65.5 |
Qwen2-VL-72B (Wang et al., 2024c) | 68.6 | 64.5 | 37.1 | 77.9 | 86.9 | 73.9 | 37.3 | 70.5 |
LLaVA-OV-72B (SI) (Li et al., 2024b) | 65.2 | 57.4 | 26.0 | 77.6 | 86.6 | 60.0 | 37.7 | 66.5 |
LLaVA-OV-72B (Li et al., 2024b) | 66.1 | 56.8 | 24.0 | 78.0 | 85.9 | 63.7 | 39.1 | 67.5 |
MiniCPM-V-2.6-8B (Yao et al., 2024) | 57.5 | 49.8 | 21.7 | 74.0 | 81.5 | 60.0 | - | 60.6 |
INXComp-2.5-7B (Zhang et al., 2024b) | 59.9 | 42.9 | - | 75.4 | 74.4 | 51.7 | 20.0 | 59.6 |
Llama-3.2-11B-Vision-Ins. (Meta, 2024b) | 49.8 | 50.7 | 23.7 | 72.7 | 73.2 | 57.6 | 23.6 | 51.5 |
InternVL-2-8B (Chen et al., 2023b) | 59.4 | 49.3 | 25.4 | 76.0 | 81.7 | 60.0 | 27.5 | 58.3 |
Qwen2-VL-7B-Ins. (Wang et al., 2024c) | 60.7 | 52.1 | 26.9 | 74.3 | 83.0 | 62.0 | 28.2 | 58.2 |
Cambrian-1-8B (Tong et al., 2024) | - | 42.7 | 14.7 | 73.3 | 74.6 | 48.0 | - | 49.0 |
Llava-CoT-11B (Xu et al., 2024b) | 57.6 | 48.9 | 18.5 | 75.2 | 75.0 | 60.3 | 24.2 | 54.8 |
Molmo-7B-D (Deitke et al., 2024) | 50.5 | 45.3 | 18.9 | 74.1 | 73.6 | 58.0 | 21.5 | 51.6
LLaVA-OV-7B (SI) (Li et al., 2024b) | 60.9 | 47.3 | 16.8 | 74.8 | 80.5 | 58.8 | 26.9 | 56.1 |
LLaVA-OV-7B (Li et al., 2024b) | 61.7 | 48.8 | 18.7 | 75.4 | 80.8 | 58.6 | 26.2 | 63.2 |
MAmmoTH-VL-8B (SI) | 55.4 | 49.4 | 26.0 | 73.3 | 83.0 | 60.6 | 35.0 | 67.6 |
MAmmoTH-VL-8B | 63.0 | 50.8 | 25.3 | 76.0 | 83.4 | 62.3 | 34.2 | 67.6 |
Δ Over Best Open-Source (~10B Scale) | +1.3 | +1.9 | +7.1 | +0.6 | +2.6 | +2.0 | +8.1 | +4.4 |
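The final Δ row subtracts the best fully open-source baseline at roughly the 10B scale from MAmmoTH-VL-8B on each benchmark. A small sketch of that computation for the MMStar column is shown below, using scores from the table; treating the fully open-source entries as the comparison group is an assumption based on the caption’s model grouping.

```python
# Sketch of how the "Δ Over Best Open-Source (~10B Scale)" row is computed:
# subtract the best ~10B fully open-source baseline from MAmmoTH-VL-8B per
# benchmark. Scores below are the MMStar column from the table above.

mmstar_baselines = {
    "Llava-CoT-11B": 57.6,
    "Molmo-7B-D": 50.5,
    "LLaVA-OV-7B (SI)": 60.9,
    "LLaVA-OV-7B": 61.7,
}
mammoth_vl_8b = 63.0

delta = mammoth_vl_8b - max(mmstar_baselines.values())
print(f"Delta on MMStar: {delta:+.1f}")  # +1.3, matching the table
```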
🔼 Table 3 presents the performance of various models on a range of benchmarks focused on Chart & Doc Understanding, and Multimodal Interactions & Preferences. These benchmarks evaluate the models’ abilities to comprehend and reason with charts, diagrams, documents, and real-world multimodal scenarios, measuring their accuracy and overall performance in nuanced interaction tasks. Results are compared using consistent evaluation settings as established in Table 2.
Table 3: Main results on Chart, Diagram, and Document Understanding, and Real-world Multimodal Interactions and Human Preferences benchmarks. Follow the same settings as in Table 2.
Model | AI2D (test) | ChartQA (test) | InfoVQA (test) | DocVQA (test) | RealWorldQA (test) | WildVision (0617) | L-Wilder (small) |
---|---|---|---|---|---|---|---|
GPT-4o (OpenAI, 2024) | 94.2 | 85.7 | 79.2 | 92.8 | 76.5 | 89.4 | 85.9 |
Gemini-1.5-Pro (Gemini Team, 2023) | 94.4 | 87.2 | 81.0 | 93.1 | 70.4 | - | - |
Claude-3.5-Sonnet (Anthropic, 2024) | 94.7 | 90.8 | 49.7 | 95.2 | 60.1 | 50.0 | 83.1 |
InternVL2-76B (Chen et al., 2023b) | 88.4 | 88.4 | 82.0 | 94.1 | 72.7 | - | - |
Qwen2-VL-72B (Wang et al., 2024c) | 88.1 | 88.3 | 84.5 | 96.5 | 77.8 | 52.3 | 53.6 |
LLaVA-OV-72B (SI) (Li et al., 2024b) | 85.1 | 84.9 | 74.6 | 91.8 | 73.8 | 49.5 | 72.9 |
LLaVA-OV-72B (Li et al., 2024b) | 85.6 | 83.7 | 74.9 | 91.3 | 71.9 | 52.3 | 72.0 |
MiniCPM-V-2.6-8B (Yao et al., 2024) | 82.1 | 82.4 | - | 90.8 | 65.0 | 11.7 | -
INXComp-2.5-7B (Zhang et al., 2024b) | 81.5 | 82.2 | 70.0 | 90.9 | 67.8 | - | 61.4 |
Llama-3.2-11B-Vision-Ins (Meta, 2024b) | 77.3 | 83.4 | 65.0 | 88.4 | 63.3 | 49.7 | 62.0 |
InternVL-2-8B (Chen et al., 2023b) | 83.8 | 83.3 | 74.8 | 91.6 | 64.4 | 51.5 | 62.5 |
Qwen2-VL-7B-Ins (Wang et al., 2024c) | 83.0 | 83.0 | 76.5 | 94.5 | 70.1 | 44.0 | 66.3 |
Cambrian-1-8B (Tong et al., 2024) | 73.3 | 73.3 | 41.6 | 77.8 | 64.2 | - | 34.1 |
Llava-CoT-11B (Xu et al., 2024b) | - | 67.0 | 44.8 | - | - | - | 65.3 |
Molmo-7B-D (Deitke et al., 2024) | 81.0 | 84.1 | 72.6 | 92.2 | 70.7 | 40.0 | - |
LLaVA-OV-7B (SI) (Li et al., 2024b) | 81.6 | 78.8 | 65.3 | 86.9 | 65.5 | 39.2 | 69.1 |
LLaVA-OV-7B (Li et al., 2024b) | 81.4 | 80.0 | 68.8 | 87.5 | 66.3 | 53.8 | 67.8 |
MAmmoTH-VL-8B (SI) | 83.4 | 85.9 | 74.8 | 93.8 | 71.3 | 51.9 | 71.3 |
MAmmoTH-VL-8B | 84.0 | 86.2 | 73.1 | 93.7 | 69.9 | 51.1 | 70.8 |
Δ Over Best Open-Source (~10B Scale) | +2.4 | +2.1 | +2.2 | +1.6 | +0.6 | -1.9 | +2.2 |
🔼 This table reports the performance of various models on multi-image and video benchmarks. It compares scores across several datasets, highlighting each model’s relative strengths and weaknesses on tasks that require processing rich visual information from multiple images or video frames.
Table 4: Main results on Multi-Image and Video benchmarks. Follow the same settings as in Table 2.
Model | MuirBench (test) | MEGABench (test) | EgoSchema (test) | PerceptionTest (test) | SeedBench (video) | MLVU (dev) | MVBench (test) | VideoMME (w/o subs) |
---|---|---|---|---|---|---|---|---|
GPT-4o (OpenAI, 2024) | 68.0 | 54.2 | - | - | - | 64.6 | - | 71.9 |
GPT-4V (OpenAI, 2023) | 62.3 | - | - | - | 60.5 | 49.2 | 43.5 | 59.9 |
LLaVA-OV-72B (SI) (Li et al., 2024b) | 33.2 | - | 58.6 | 62.3 | 60.9 | 60.9 | 57.1 | 64.8 |
LLaVA-OV-72B (Li et al., 2024b) | 54.8 | 33.8 | 62.0 | 66.9 | 62.1 | 66.4 | 59.4 | 66.2 |
InternVL-2-8B (Chen et al., 2023b) | 59.4 | 27.7 | 54.2 | 57.4 | 54.9 | 30.2 | 66.4 | 54.0 |
Qwen2-VL-7B-Ins. (Wang et al., 2024c) | 41.6 | 36.0 | 66.7 | 62.3 | 55.3 | 58.6 | 67.0 | 63.3 |
LLaVA-OV-7B (SI) (Li et al., 2024b) | 32.7 | 22.1 | 52.9 | 54.9 | 51.1 | 60.2 | 51.2 | 55.0 |
LLaVA-OV-7B (Li et al., 2024b) | 41.8 | 23.9 | 60.1 | 57.1 | 56.9 | 64.7 | 56.7 | 58.2 |
MAmmoTH-VL-8B | 55.1 | 28.2 | 58.5 | 59.3 | 57.1 | 64.7 | 59.1 | 58.8 |
Δ Over Best Open-Source ~10B Scale | +13.3 | +4.3 | -1.6 | +2.2 | +0.2 | +0 | +2.4 | +0.6 |
🔼 This table compares the performance of models trained on filtered versus unfiltered data across multiple benchmarks. It quantifies how the data-filtering step affects accuracy on each evaluation metric and on average, showing the extent to which data quality drives the final model’s performance.
Table A1: Performance Comparison of Models Trained on Filtered versus Unfiltered Data Across Multiple Benchmarks.
Bench Name | Before Filter | After Filter |
---|---|---|
MMMU | 39.6 | 40.9 |
MMStar | 14.0 | 44.6 |
SeedBench | 66.4 | 67.9 |
MMMU-Pro Vision | 15.5 | 13.7 |
MathVista | 39.5 | 42.0 |
MMBench EN | 58.6 | 65.1 |
MMVet | 40.5 | 43.9 |
MathVerse | 19.3 | 22.6 |
AI2D | 56.9 | 61.8 |
ChartQA | 26.8 | 63.0 |
InfoVQA | 41.5 | 48.0 |
DocVQA | 71.7 | 76.5 |
L-Wilder Small | 58.8 | 59.8 |
WildVision | 40.2 | 42.2 |
RealWorldQA | 50.3 | 56.0 |
Avg | 42.6 | 49.9 |
🔼 This table reports, for each data type in the MAmmoTH-VL dataset, the number of entries before and after filtering and the resulting filter rate, i.e., the percentage of entries removed by the quality-control step targeting low-quality or hallucinated data. OCR and Chart data lose roughly half of their entries, whereas GeneralQA and Math require far less filtering, indicating which categories were already strong and where future data-generation improvements would pay off most.
Table A2: Filter Rates Of Different Data Types After Data Filtering.
Data Type | Before Filter | After Filter | Filter Rate (%) |
---|---|---|---|
OCR | 1104960 | 498337 | 54.9 |
Chart | 7326189 | 3782029 | 48.4 |
GeneralQA | 1726180 | 1584308 | 8.2 |
Caption | 244874 | 199853 | 18.3 |
Math | 590894 | 518393 | 12.3 |
Other | 1315039 | 1178275 | 10.4 |
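The filter-rate column follows directly from the before/after counts: it is the share of entries removed, 1 − after/before, expressed as a percentage. The sketch below reproduces the table’s rates (up to rounding) from those counts.

```python
# The "Filter Rate" column in Table A2 is the share of entries removed by
# filtering: 1 - after/before, expressed as a percentage.

counts = {
    "OCR":       (1_104_960,   498_337),
    "Chart":     (7_326_189, 3_782_029),
    "GeneralQA": (1_726_180, 1_584_308),
    "Caption":   (  244_874,   199_853),
    "Math":      (  590_894,   518_393),
    "Other":     (1_315_039, 1_178_275),
}

for data_type, (before, after) in counts.items():
    rate = 100 * (1 - after / before)
    print(f"{data_type:<10} filter rate: {rate:.1f}%")  # e.g. OCR ~54.9%, as in the table
```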
🔼 This table compares the performance of models trained on datasets with varying ratios of original and rewritten data. It shows how benchmark scores change as the proportion of rewritten data in the training set increases, covering models trained on only original data, only rewritten data, and several mixtures of the two.
Table A3: Benchmark Performance Of Models Trained On Data With Different Mix Ratios.
Bench Name | Rewrite | Original | Mix 3:7 | Mix 7:3 | Mix 5:5 |
---|---|---|---|---|---|
MMMU | 40.9 | 41.9 | 41.5 | 41.3 | 41.7 |
MMStar | 44.6 | 43.3 | 43.4 | 42.3 | 43.7 |
SeedBench | 67.9 | 69.9 | 68.7 | 69.3 | 68.9 |
MMMU-Pro Vision | 13.7 | 13.0 | 13.8 | 13.5 | 13.5 |
MathVista | 42.0 | 40.4 | 41.8 | 40.6 | 39.5 |
MMBench EN | 65.1 | 67.8 | 66.1 | 67.9 | 66.4 |
MMVet | 43.9 | 37.3 | 45.5 | 40.7 | 38.9 |
MathVerse | 22.6 | 19.8 | 21.4 | 21.0 | 20.4 |
AI2D | 61.8 | 63.1 | 62.9 | 62.5 | 62.8 |
ChartQA | 63.1 | 56.5 | 61.1 | 56.8 | 56.6 |
InfoVQA | 48.0 | 47.3 | 49.0 | 45.7 | 45.6 |
DocVQA | 76.5 | 76.6 | 77.4 | 76.0 | 75.7 |
L-Wilder Small | 59.8 | 56.4 | 60.9 | 56.8 | 57.4 |
WildVision | 42.2 | 34.9 | 38.7 | 34.5 | 36.7 |
RealworldQA | 56.0 | 56.1 | 57.1 | 55.7 | 54.8 |
Avg | 49.9 | 48.3 | 50.0 | 48.3 | 48.2 |
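Building such a mixed training set amounts to sampling from the original and rewritten pools in the stated ratio; a small sketch is given below. Which side of a ratio like 3:7 refers to rewritten versus original data is not spelled out here, so the argument order in the sketch is an illustrative assumption.

```python
import random

# Sketch of constructing a training mix at a given ratio (as in Table A3's
# "Mix 3:7", "Mix 5:5", "Mix 7:3" settings). The mapping of the ratio's sides
# to rewritten vs. original data is an assumption for illustration.

def build_mix(pool_a, pool_b, ratio=(3, 7), total=10_000, seed=0):
    """Sample `total` examples from two pools in the given ratio (with replacement)."""
    rng = random.Random(seed)
    n_a = round(total * ratio[0] / sum(ratio))
    mixed = rng.choices(pool_a, k=n_a) + rng.choices(pool_b, k=total - n_a)
    rng.shuffle(mixed)
    return mixed

# e.g. build_mix(rewritten_data, original_data, ratio=(3, 7))
```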
🔼 This table compares the performance of models trained on data rewritten by different models, alongside a model trained on the original data. Each entry is a benchmark score, allowing a direct comparison of how the choice of rewriting model affects downstream performance.
Table A4: Performance On Different Benchmarks Of Models Trained On Data Rewritten By Different Models
Bench Name | Original | Rewrite (Qwen2-VL-7B) | Rewrite (InternVL2-8B) | Rewrite (InternVL2-76B) |
---|---|---|---|---|
MMMU | 40.4 | 40.6 | 40.9 | 40.78 |
MMStar | 40.9 | 41.7 | 41.7 | 37.9 |
SeedBench | 50.6 | 52.1 | 65.0 | 67.0 |
MMMU-Pro Vision | 12.3 | 12.9 | 12.9 | 15.3 |
MathVista | 36.4 | 38.8 | 37.4 | 39.0 |
MMBench EN | 65.8 | 59.1 | 60.1 | 58.3 |
MMVet | 38.6 | 38.1 | 38.6 | 41.1 |
MathVerse | 17.6 | 21.6 | 19.8 | 20.6 |
AI2D | 61.8 | 62.3 | 61.7 | 59.6 |
ChartQA | 49.4 | 48.1 | 50.6 | 58.7 |
InfoVQA | 43.8 | 43.1 | 43.7 | 44.3 |
DocVQA | 73.4 | 70.8 | 71.3 | 72.2 |
L-Wilder Small | 44.5 | 55.7 | 55.7 | 60.5 |
WildVision | 32.7 | 32.0 | 30.8 | 41.7 |
RealWorldQA | 56.5 | 55.1 | 56.8 | 53.5 |
Avg | 46.8 | 47.3 | 48.4 | 50.0 |
🔼 This table presents the inter-rater reliability scores (Cohen’s Kappa) comparing the model’s filtering decisions against three human evaluators. The values show how consistently the model’s automated filtering process agrees with human judgment in identifying high-quality data entries.
Table A5: Kappa Value Between Any Two.
 | Model | Evaluator1 | Evaluator2 | Evaluator3 |
---|---|---|---|---|
Model | - | 0.73 | 0.70 | 0.63 |
Evaluator1 | 0.73 | - | 0.70 | 0.42 |
Evaluator2 | 0.70 | 0.70 | - | 0.53 |
Evaluator3 | 0.63 | 0.42 | 0.53 | - |
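Cohen’s kappa between any two raters can be computed as in the sketch below, treating each filtering verdict as a keep/drop label; the labels shown are toy data, not the paper’s actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Sketch of how the pairwise agreement in Table A5 can be computed: Cohen's
# kappa between the model's keep/drop decisions and a human evaluator's.
# The labels below are toy data, not the paper's annotations.

model_decisions = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]  # 1 = keep, 0 = filter out
evaluator_1     = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

kappa = cohen_kappa_score(model_decisions, evaluator_1)
print(f"Cohen's kappa (model vs. evaluator 1): {kappa:.2f}")
```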
🔼 This table presents a quantitative comparison of the quality of original and rewritten multimodal instruction data. Specifically, it shows the average content and relevance scores for each dataset before and after the rewriting process. Higher scores indicate better quality data, implying richer information content and stronger alignment between the visual and textual components. The scores were obtained using an MLLM (large language model) as a judge.
Table A6: Comparison of Original and Rewrite Average Content and Relevance Scores