TL;DR#
Current large multimodal models (LMMs) struggle with long-duration and high-resolution videos, largely because high-quality video instruction data for these settings is scarce: existing datasets are either low-resolution or limited to short clips, which is insufficient for training robust models.
To overcome this, the researchers introduce VISTA, a video augmentation framework that synthesizes long-duration and high-resolution training videos by spatially and temporally combining existing videos and their captions. Using this framework, they build VISTA-400K, a new video instruction-following dataset, and HRVideoBench, a benchmark for high-resolution video understanding. Experiments show that fine-tuning models on VISTA-400K improves performance across a range of benchmarks, with average gains of 3.3% on long-video tasks and 6.5% on high-resolution tasks.
Key Takeaways#
Why does it matter?#
This paper is important because it addresses the critical need for high-quality datasets in long-duration and high-resolution video understanding, a currently under-explored area. VISTA-400K, a novel dataset generated by the proposed method, significantly improves the performance of existing models. This opens new avenues for research in video understanding and establishes a new benchmark, HRVideoBench, for high-resolution video analysis, pushing the field forward.
Visual Insights#
🔼 The figure illustrates the VISTA framework, which leverages existing video-caption datasets to produce high-quality video instruction-following data. VISTA combines videos both spatially and temporally to create synthetic videos with longer durations and higher resolutions. These videos are then paired with newly generated question-answer pairs, effectively augmenting the original dataset. The figure also shows a bar graph comparing the average accuracy of baseline models versus models fine-tuned with the VISTA-400K dataset, highlighting the performance improvements achieved on various video understanding benchmarks (Long Video Bench, LVBench, HRVideoBench, Short Video Bench). The improvement demonstrates VISTA’s effectiveness in enhancing video understanding models’ ability to handle long-duration and high-resolution video content.
Figure 1: VISTA is a simple but effective framework that generates high-quality video instruction data from existing video-caption pairs. Our VISTA-400K dataset enhances model performances on various long and high-resolution video benchmarks.
Subset | Instruction Type | Video Source | #Videos | Avg. Duration | Avg. Resolution |
---|---|---|---|---|---|
Long Video Captioning | Video Captioning | Panda-70M [5] | 58,617 | 33.2s | 1277x720 |
Event Relationship QA | Freeform QA/MCQ | Panda-70M [5] | 56,854 | 33.4s | 1278x720 |
Temporal NIAH | Freeform QA/MCQ | Panda-70M [5] (N), MiraData [14] (H) | 59,751 | 67.6s | 640x358 |
Two Needle NIAH | Freeform QA | Panda-70M [5] (N), FineVideo [8] (H) | 52,349 | 112.4s | 591x382 |
Spatial NIAH | Freeform QA/MCQ | InternVid [50] (N), OpenVid-1M [33] (H) | 59,978 | 9.9s | 1726x971 |
Spatiotemporal NIAH | Freeform QA/MCQ | OpenVid-1M [33] (N), FineVideo [8] (H) | 56,494 | 89.9s | 591x383 |
HR Video Grid QA | Freeform QA/MCQ | InternVid [50] | 59,901 | 3s | 1920x1080 |
VISTA-400K | - | - | 403,944 | 48.6s | 1160x666 |
🔼 This table presents a statistical summary of the VISTA-400K dataset, a synthetic video instruction-following dataset created using the VISTA framework. It details the number of videos, average duration, and average resolution for each of the seven subsets of the dataset. Each subset employs a different video augmentation technique to create synthetic videos of varying lengths and resolutions. The ‘Needle-in-a-Haystack’ (NIAH) subsets combine short, low-resolution videos (‘N’) with longer, high-resolution videos (‘H’) to create more challenging training examples for video understanding models. The table provides crucial information for understanding the characteristics and composition of the VISTA-400K dataset.
Table 1: Statistics of our synthetic video instruction-following dataset. "(N)" and "(H)" correspond to the "needle" (short or low-res videos) and the "haystack" (long or high-res videos) in the NIAH subsets.
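As a quick sanity check on the table above, the short snippet below (assuming the summary row is a simple video-count-weighted average over the seven subsets) reproduces the VISTA-400K totals:

```python
# Verify that the VISTA-400K summary row is consistent with the per-subset
# statistics, assuming a video-count-weighted average for duration.
subsets = [
    # (name, num_videos, avg_duration_seconds)
    ("Long Video Captioning", 58_617, 33.2),
    ("Event Relationship QA", 56_854, 33.4),
    ("Temporal NIAH",         59_751, 67.6),
    ("Two Needle NIAH",       52_349, 112.4),
    ("Spatial NIAH",          59_978, 9.9),
    ("Spatiotemporal NIAH",   56_494, 89.9),
    ("HR Video Grid QA",      59_901, 3.0),
]

total_videos = sum(n for _, n, _ in subsets)
weighted_duration = sum(n * d for _, n, d in subsets) / total_videos
print(total_videos)                 # 403944
print(round(weighted_duration, 1))  # ~48.6 seconds
```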
In-depth insights#
Long-Video Augmentation#
The concept of ‘Long-Video Augmentation’ presents a crucial advancement in video understanding, particularly concerning the limitations of current models with short-duration video data. The core idea revolves around artificially extending the length of existing video clips to create a larger, more diverse training dataset. This addresses the scarcity of long-duration, high-quality video data, a significant bottleneck for training robust and effective video understanding models. The augmentation process likely involves techniques like concatenation of multiple short clips, possibly with careful selection to maintain narrative coherence and contextual relevance. Synthesizing long videos offers a cost-effective alternative to the expensive process of acquiring and annotating extensive, real-world long video datasets. However, careful consideration is needed to avoid introducing artificial artifacts or inconsistencies that could negatively impact model performance or lead to overfitting. The effectiveness of this method hinges on several factors, including the quality of the original short videos, the sophistication of the concatenation algorithms, and the potential need for additional data augmentation strategies to further improve the diversity of the augmented data. Ultimately, the success of long-video augmentation rests on its ability to create a synthetic dataset that sufficiently resembles real-world long videos, enabling models to generalize well to unseen long-duration video inputs.
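As a rough illustration of the idea, the sketch below concatenates several short clips (represented as numpy frame arrays) into one longer synthetic video and merges their captions in order. The `Clip` type and helper name are hypothetical and not taken from the VISTA codebase:

```python
# Minimal sketch of temporal concatenation: several short clips are joined
# into one longer synthetic video, and their captions are kept in order so a
# language model can later generate questions about the combined timeline.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Clip:
    frames: np.ndarray  # shape (T, H, W, 3), uint8
    caption: str


def concatenate_clips(clips: List[Clip]) -> Clip:
    """Join clips along the time axis; captions are merged chronologically."""
    # Assumes all clips share the same spatial resolution; resize beforehand otherwise.
    frames = np.concatenate([c.frames for c in clips], axis=0)
    caption = " Then, ".join(c.caption.rstrip(".") + "." for c in clips)
    return Clip(frames=frames, caption=caption)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shorts = [
        Clip(rng.integers(0, 255, size=(32, 360, 640, 3), dtype=np.uint8),
             f"A short scene number {i}")
        for i in range(4)
    ]
    long_video = concatenate_clips(shorts)
    print(long_video.frames.shape)  # (128, 360, 640, 3)
```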
HR-Video Benchmark#
A high-resolution video benchmark is crucial for evaluating the capabilities of video language models (VLMs) to understand fine details and subtle actions within high-resolution videos. Existing benchmarks often focus on low-resolution videos, limiting our understanding of VLM performance on the increasingly common high-resolution video data. A comprehensive HR-video benchmark would need to include diverse video types, with varied object details, subtle actions, and complex scenes. This would require careful consideration of video resolution, frame rate, and overall quality, as these factors significantly impact the performance of VLMs. The benchmark should also incorporate diverse question types, testing not just object recognition but also higher-order reasoning and temporal understanding, reflecting the nuanced complexity of high-resolution videos. A robust HR-video benchmark would greatly advance the field by facilitating the development of more sophisticated VLMs, capable of handling the richness of high-resolution video information and contributing to numerous real-world applications. Furthermore, it could highlight the limitations of current VLMs and guide future research on model architecture and training data towards improving their comprehension and reasoning abilities with high-resolution videos.
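For concreteness, here is a plausible, entirely hypothetical schema for a high-resolution benchmark item together with a simple multiple-choice accuracy metric; the actual HRVideoBench format is not described here, so treat this only as an illustration of how such a benchmark could be evaluated:

```python
# Hypothetical benchmark item schema and a basic MCQ accuracy metric.
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    video_path: str          # path to a 1080p/4K clip
    question: str            # e.g. about a small object or a subtle action
    options: list[str]       # multiple-choice candidates
    answer_index: int        # index of the correct option
    category: str            # e.g. "object" or "action"


def accuracy(items: list[BenchmarkItem], predictions: list[int]) -> float:
    """Fraction of items where the predicted option index matches the answer."""
    correct = sum(p == it.answer_index for it, p in zip(items, predictions))
    return correct / len(items) if items else 0.0
```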
VISTA Dataset#
The VISTA dataset represents a novel approach to augmenting video data for improved long-duration and high-resolution video understanding. Instead of relying solely on collecting new videos, VISTA cleverly synthesizes new video-instruction pairs from existing datasets. This is achieved by spatially and temporally combining existing videos and generating corresponding question-answer pairs, thereby expanding the scope and resolution of the training data. The resulting VISTA-400K dataset is substantial, comprising a diverse array of synthesized videos, significantly increasing the quantity of high-quality long and high-resolution video-instruction data. This data augmentation strategy addresses a critical bottleneck in video LMM training, proving its effectiveness through improved performance on various benchmarks, highlighting the power of data-centric solutions to enhance video comprehension capabilities. A particularly valuable contribution is the introduction of HRVideoBench, a benchmark specifically designed for evaluating high-resolution video understanding, further underscoring the impact of VISTA’s contribution.
Model Finetuning#
Model finetuning in the context of large multimodal models (LMMs) for video understanding involves adapting pre-trained models to excel at specific video-related tasks. This process is crucial because LMMs, while powerful, often require further specialization to handle the nuances of long-duration and high-resolution videos. The effectiveness of finetuning hinges on the quality and diversity of the training dataset. A well-curated dataset, such as the VISTA-400K dataset described in the paper, allows the model to learn essential spatiotemporal relationships and high-resolution details. Augmentation techniques further enhance the dataset, creating synthetic data to address the scarcity of naturally occurring high-quality, long videos. The results demonstrate that finetuning on this augmented data leads to substantial improvements across various video understanding benchmarks, showcasing the significance of a data-centric approach to improving LMMs for video. The improvements highlight the importance of high-quality data in finetuning. Careful consideration of the benchmark selection is also essential; as demonstrated in the paper, the creation of HRVideoBench enables a proper assessment of high-resolution video understanding, an area previously overlooked. Finally, the choice of base model significantly influences the results; different models will have varying levels of adaptability and benefit differently from finetuning. Therefore, a comprehensive model finetuning strategy must consider the dataset, augmentation techniques, benchmark choice, and the suitability of the base model.
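A minimal sketch of what such a finetuning loop could look like is shown below, assuming the video LMM exposes a causal-LM style interface that returns a loss from video frames plus tokenized instruction and answer text; real recipes (frozen vision towers, LoRA, learning-rate schedules, packing) are omitted, and every identifier is hypothetical:

```python
# Highly simplified supervised finetuning loop for a video LMM.
import torch
from torch.utils.data import DataLoader


def finetune(model, dataset, epochs=1, lr=2e-5, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=2, shuffle=True)
    for _ in range(epochs):
        for batch in loader:
            # batch["frames"]: (B, T, 3, H, W); input_ids/labels: tokenized Q&A pairs
            outputs = model(
                pixel_values=batch["frames"].to(device),
                input_ids=batch["input_ids"].to(device),
                labels=batch["labels"].to(device),
            )
            outputs.loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()
    return model
```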
Future Work#
The paper’s ‘Future Work’ section could explore several promising avenues. Improving the video augmentation techniques is crucial. Currently, the methods are primarily based on simple spatial and temporal combinations; more sophisticated techniques like generative models or advanced video editing algorithms could create more realistic and diverse synthetic data. Expanding the dataset is vital. While VISTA-400K is significant, a larger and more varied dataset with a broader range of video types and qualities would further improve model performance. In addition to quantity, improving the quality of captions and QA pairs through more advanced language models or human annotation will result in more accurate and informative training data. Finally, investigating the transferability of models trained on VISTA-400K to other video understanding tasks is key to validating the framework’s generality. This would involve comprehensive testing on various benchmarks for diverse downstream tasks. Addressing these aspects will enhance the robustness and applicability of the proposed approach.
More visual insights#
More on figures
🔼 This figure illustrates the VISTA framework’s seven video augmentation methods for generating synthetic video instruction-following data. Starting with input short videos and their captions, VISTA spatially and temporally combines them to create longer, higher-resolution videos (e.g., by concatenating clips, inserting short clips into longer ones at different timepoints or locations, or arranging low-resolution videos in a grid). Then, using a large language model, VISTA synthesizes question-answer pairs about these new, augmented videos. Each of the seven subsets demonstrates a different augmentation technique, including methods for creating long videos, long video captions, questions about event relationships, and various needle-in-a-haystack (NIAH) QA pairs for testing temporal and spatial video understanding.
Figure 2: Our proposed video augmentation and instruction-following data synthesis schemes for VISTA-400K. Given input videos, we perform spatiotemporal video combinations to produce augmented video samples with longer duration and higher resolution.
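To make two of these augmentations concrete, the sketch below shows (under the assumption that videos are plain frame arrays) a temporal needle-in-a-haystack insertion that records where the needle landed, and a 2x2 spatial grid that turns four low-resolution clips into one high-resolution video. Function names are illustrative only, not the paper's implementation:

```python
# Two illustrative VISTA-style augmentations on frame arrays of shape (T, H, W, 3).
import numpy as np


def insert_needle(haystack: np.ndarray, needle: np.ndarray, rng) -> tuple[np.ndarray, int]:
    """Insert a short needle clip into a long haystack clip at a random frame index."""
    t = int(rng.integers(0, haystack.shape[0] + 1))
    mixed = np.concatenate([haystack[:t], needle, haystack[t:]], axis=0)
    return mixed, t  # t lets a QA pair ask about the moment the needle appears


def tile_grid(clips: list[np.ndarray]) -> np.ndarray:
    """Arrange four equal-sized clips into a 2x2 spatial grid."""
    assert len(clips) == 4
    top = np.concatenate(clips[:2], axis=2)       # side by side along width
    bottom = np.concatenate(clips[2:], axis=2)
    return np.concatenate([top, bottom], axis=1)  # stack along height


rng = np.random.default_rng(0)
hay = rng.integers(0, 255, size=(120, 360, 640, 3), dtype=np.uint8)
ndl = rng.integers(0, 255, size=(16, 360, 640, 3), dtype=np.uint8)
video, frame_idx = insert_needle(hay, ndl, rng)

quads = [rng.integers(0, 255, size=(30, 540, 960, 3), dtype=np.uint8) for _ in range(4)]
grid_video = tile_grid(quads)  # (30, 1080, 1920, 3): a synthetic full-HD video
```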
🔼 Figure 3 presents a qualitative analysis comparing the performance of baseline video language models (VLMs) against VLMs fine-tuned using the VISTA dataset. Two example scenarios are shown: ‘helicopter’ and ’table tennis’. Each scenario displays the question, followed by the responses generated by several models (baseline LongVA, baseline VideoLLaVA, VISTA-enhanced LongVA, and VISTA-enhanced VideoLLaVA for the ‘helicopter’ example; baseline Mantis-Idefics2, and VISTA-enhanced Mantis-Idefics2 for the ’table tennis’ example). Incorrect or hallucinated responses are highlighted in red, while accurate responses are shown in green. This visual comparison highlights how VISTA improves the accuracy and reduces hallucinatory outputs of VLMs.
Figure 3: Qualitative comparisons between the baseline models and our VISTA-finetuned models. Red text indicates hallucinations or incorrect responses, while green text highlights the correct responses that correspond accurately to the video content.
🔼 This figure showcases two example questions from the HRVideoBench dataset, a high-resolution video understanding benchmark. The examples highlight the dataset’s focus on evaluating fine-grained object details and subtle actions within high-resolution videos. The first example requires identifying a car’s color in a rearview mirror, demonstrating the need for precise object recognition at high resolution. The second example tasks the model with determining the directional movement of a person from a specific angle and at a specific location within the video, showcasing the benchmark’s assessment of localized action recognition. The image emphasizes the high resolution and detailed nature of the videos used in the HRVideoBench.
Figure 4: Example questions from our HRVideoBench. Zoom in for better visualizations.
More on tables
Long video understanding benchmarks: Video-MME (w/o subtitles), MLVU, LVBench, LongVideoBench; short video understanding benchmarks: MVBench, NExT-QA. The Video-MME (w/o subtitles) average is shown below.

Models | Size | Video-MME w/o sub. (avg) |
---|---|---|
Proprietary Models | | |
GPT-4V [1] | - | 59.9 |
GPT-4o [35] | - | 71.9 |
Gemini-1.5-Pro [44] | - | 75.0 |
Open-source Models | | |
VideoChat2 [23] | 7B | 39.5 |
LLaMA-VID [25] | 7B | - |
ST-LLM [29] | 7B | 37.9 |
ShareGPT4Video [4] | 7B | 39.9 |
LongVILA [55] | 7B | 50.5 |
LongLLaVA [49] | 7B | 52.9 |
Video-XL [41] | 7B | 55.5 |
VideoLLaVA [26] | 7B | 39.9 |
VISTA-VideoLLaVA | 7B | 43.7 |
Δ - VideoLLaVA | | +3.8 |
Mantis-Idefics2 [13] | 8B | 45.4 |
VISTA-Mantis | 8B | 48.2 |
Δ - Mantis-Idefics2 | | +2.8 |
LongVA [63] | 7B | 52.4 |
VISTA-LongVA | 7B | 55.5 |
Δ - LongVA | | +3.1 |
🔼 This table compares the performance of several baseline video language models (LLMs) against versions of those same models fine-tuned on the VISTA-400K dataset. The comparison is made across multiple benchmarks designed to test both long and short video understanding capabilities. The table shows the average performance across multiple categories, including ‘short,’ ‘medium,’ and ’long’ video lengths, as well as overall performance. The best results achieved by open-source models are highlighted in bold. The final column indicates the performance improvement (Δ) after fine-tuning with VISTA-400K.
Table 2: Comparisons between baseline models and VISTA-finetuned models on long/short video understanding benchmarks. The best results among open-source models are bolded. Δ denotes the performance differences before and after finetuning on VISTA-400K.
Models | HRVideoBench avg | object | action | MSVD-QA acc. | score | MSRVTT-QA acc. | score | TGIF-QA acc. | score | ActivityNet-QA acc. | score |
---|---|---|---|---|---|---|---|---|---|---|---|
VideoLLaVA [26] | 32.5 | 36.0 | 27.9 | 60.3 | 3.7 | 42.1 | 3.0 | 63.5 | 3.8 | 48.6 | 3.3 |
VISTA-VideoLLaVA | 47.5 | 50.0 | 44.2 | 71.5 | 4.0 | 58.5 | 3.5 | 78.0 | 4.3 | 49.1 | 3.4 |
Δ - VideoLLaVA | +15.0 | +14.0 | +16.3 | +11.2 | +0.3 | +16.4 | +0.5 | +14.5 | +0.5 | +0.5 | +0.1 |
Mantis-Idefics2 [13] | 48.5 | 50.9 | 45.4 | 57.4 | 3.5 | 34.9 | 2.7 | 65.7 | 3.8 | 46.5 | 3.1 |
VISTA-Mantis | 51.0 | 53.5 | 47.7 | 65.2 | 3.8 | 46.4 | 3.1 | 71.4 | 4.0 | 48.8 | 3.3 |
Δ - Mantis | +2.5 | +2.6 | +2.3 | +7.8 | +0.3 | +11.5 | +0.4 | +5.7 | +0.2 | +2.3 | +0.2 |
LongVA [63] | 48.0 | 52.6 | 41.9 | 56.3 | 3.5 | 37.7 | 2.8 | 55.4 | 3.4 | 48.0 | 3.2 |
VISTA-LongVA | 50.0 | 56.1 | 41.9 | 61.0 | 3.7 | 42.5 | 3.0 | 67.5 | 3.9 | 51.8 | 3.4 |
Δ - LongVA | +2.0 | +3.5 | +0.0 | +4.7 | +0.2 | +4.8 | +0.2 | +12.1 | +0.5 | +3.8 | +0.2 |
🔼 This table presents a quantitative comparison of different video language models’ performance on high-resolution video understanding and open-ended video question answering tasks. The HRVideoBench benchmark assesses the models’ ability to understand high-resolution video details, while the open-ended benchmarks (MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA) evaluate their performance on general video question-answering tasks. The table shows the average accuracy (acc.) and scores achieved by each model on each benchmark. Specifically for HRVideoBench, both object and action understanding accuracies are shown.
Table 3: Quantitative results on HRVideoBench and open-ended video QA benchmarks. “acc.” represents accuracy.
Models | Video-MME (w/o sub.) avg | HRVideoBench avg |
---|---|---|
VISTA-Mantis | 48.2 | 51.0 |
w/o Long Video Captioning | 47.9 | 48.0 |
w/o Event Relationship QA | 47.7 | 49.5 |
w/o Temporal NIAH | 47.5 | 48.0 |
w/o Two Needle NIAH | 48.1 | 50.5 |
w/o Spatial NIAH | 47.2 | 47.5 |
w/o Spatiotemporal NIAH | 47.7 | 50.0 |
w/o HR Video Grid QA | 47.8 | 48.0 |
w/o Video Augmentation | 45.7 | 44.5 |
🔼 This ablation study investigates the impact of each video augmentation subset within VISTA-400K on the performance of the Mantis-Idefics2 model. For each row, a modified version of VISTA-400K is created by replacing one of the seven subsets with an equal number of training examples from the VideoChat2-IT dataset. The table shows the average performance scores on the Video-MME and HRVideoBench benchmarks for the modified models, highlighting the contribution of each subset to the overall performance gains.
Table 4: Ablation study results for VISTA-Mantis. Each “w/o [Subset]” denotes a Mantis-Idefics2 model finetuned on a modified VISTA-400K by replacing the corresponding subset with the same amount of training examples from VideoChat2-IT [23].
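A sketch of how one such ablation split could be assembled is given below: drop a single VISTA subset and backfill the same number of examples from VideoChat2-IT so the total training-set size stays constant. Dataset objects and field names are hypothetical placeholders:

```python
# Build an ablation training set by swapping one VISTA subset for an equal
# number of VideoChat2-IT examples, keeping the total size unchanged.
import random


def build_ablation_set(vista_subsets: dict, videochat2_it: list, removed: str, seed: int = 0):
    """vista_subsets maps subset name -> list of training examples."""
    kept = [ex for name, examples in vista_subsets.items()
            if name != removed for ex in examples]
    n_replaced = len(vista_subsets[removed])
    rng = random.Random(seed)
    filler = rng.sample(videochat2_it, n_replaced)  # assumes VideoChat2-IT is large enough
    return kept + filler
```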
Models | Video-MME (w/o sub.) avg | short | medium | long | MLVU m-avg | LVBench test | LongVideoBench val |
---|---|---|---|---|---|---|---|
VideoLLaVA | 39.9 | 45.3 | 38.0 | 36.2 | 45.0 | 29.3 | 39.1 |
VideoLLaVA (SFT on VISTA-400K) | 43.6 | 47.3 | 43.8 | 39.8 | 48.7 | 32.6 | 41.0 |
Δ - VideoLLaVA | +3.7 | +2.0 | +5.8 | +3.6 | +3.7 | +3.3 | +1.9 |
VideoLLaVA (SFT on VISTA-400K + 300K VideoChat2-IT) | 43.7 | 48.2 | 43.9 | 38.9 | 49.5 | 33.8 | 42.3 |
Δ - VideoLLaVA (SFT on VISTA-400K) | +0.1 | +0.9 | +0.1 | -0.9 | +0.8 | +1.2 | +1.3 |
🔼 This table compares three models on long video understanding benchmarks: the baseline VideoLLaVA model, VideoLLaVA finetuned on the VISTA-400K dataset, and VideoLLaVA finetuned on VISTA-400K plus an additional 300K examples from the VideoChat2-IT dataset. The benchmarks are Video-MME (reported overall and broken down into short, medium, and long videos), MLVU, LVBench, and LongVideoBench, along with the improvement achieved by each stage of fine-tuning. 'SFT' denotes supervised finetuning.
Table 5: Comparison between the baseline VideoLLaVA model, VideoLLaVA finetuned on VISTA-400K and VideoLLaVA finetuned on VISTA-400K + 300K VideoChat2-IT data (VISTA-VideoLLaVA in the main paper) on long video understanding benchmarks. “SFT” indicates supervised finetuning.
Models | HRVideoBench avg | object | action |
---|---|---|---|
VideoLLaVA | 32.5 | 36.0 | 27.9 |
VideoLLaVA (SFT on VISTA-400K) | 44.0 | 42.1 | 46.5 |
Δ - VideoLLaVA | +11.5 | +6.1 | +18.6 |
VideoLLaVA (SFT on VISTA-400K + 300K VideoChat2-IT) | 47.5 | 50.0 | 44.2 |
Δ - VideoLLaVA (SFT on VISTA-400K) | +3.5 | +7.9 | -2.3 |
🔼 This table compares the performance of three different models on the HRVideoBench benchmark: the baseline VideoLLaVA model, VideoLLaVA fine-tuned on the VISTA-400K dataset, and VideoLLaVA fine-tuned on both VISTA-400K and an additional 300K samples from the VideoChat2-IT dataset. The results show the average performance, object recognition accuracy, and action recognition accuracy for each model on the benchmark. The Δ values report the performance differences relative to the model named in each Δ row. 'SFT' denotes that supervised fine-tuning was used.
Table 6: Comparison between the baseline VideoLLaVA model, VideoLLaVA finetuned on VISTA-400K and VideoLLaVA finetuned on VISTA-400K + 300K VideoChat2-IT data (VISTA-VideoLLaVA in the main paper) on HRVideoBench. “SFT” indicates supervised finetuning.
Models | avg | short | medium | long |
---|---|---|---|---|
VideoLLaVA | 41.6 | 46.1 | 40.7 | 38.1 |
VISTA-VideoLLaVA | 45.1 | 50.2 | 45.7 | 39.3 |
Δ - VideoLLaVA | +3.5 | +4.1 | +5.0 | +1.2 |
Mantis-Idefics2 | 49.0 | 60.4 | 46.1 | 40.3 |
VISTA-Mantis | 50.9 | 61.8 | 48.6 | 42.3 |
Δ - Mantis-Idefics2 | +1.9 | +1.4 | +2.5 | +2.0 |
LongVA | 54.3 | 61.6 | 53.6 | 47.6 |
VISTA-LongVA | 59.3 | 70.0 | 57.6 | 50.3 |
Δ - LongVA | +5.0 | +8.4 | +4.0 | +2.7 |
🔼 This table presents a comparison of the performance of baseline video language models and their VISTA-finetuned counterparts on the Video-MME benchmark, specifically using the ‘with subtitles’ setting. It shows the average accuracy scores, as well as scores for short, medium, and long video questions, demonstrating the improvement achieved by fine-tuning on the VISTA dataset.
Table 7: Comparison between VISTA-finetuned models and baseline models on Video-MME w/ subtitle benchmark.