TL;DR#
Current methods for aligning visual generation models with human preferences face challenges. Reward models are often biased and lack interpretability, while video quality assessment remains difficult. Existing RLHF methods can lead to over-optimization or under-optimization of certain factors.
The researchers introduce VisionReward, a fine-grained and multi-dimensional reward model that addresses these challenges. It decomposes human preferences on images and videos into multiple dimensions, each assessed through a series of judgment questions whose answers are linearly weighted into an interpretable and accurate preference score. VisionReward significantly outperforms existing methods on both image and video preference datasets. The researchers also introduce a multi-objective preference optimization (MPO) algorithm that improves training stability and avoids over-optimizing some preference factors at the expense of others. The code and datasets are publicly available.
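As a rough illustration of the scoring recipe described above, the sketch below asks a vision-language model a set of yes/no checklist questions and combines the answers with learned linear weights. The helper name `vlm_answer_yes` is an assumption for illustration, not the released implementation; the example weights are taken from Table 19 further down this page.

```python
from typing import Callable, List, Tuple

def vision_reward_score(
    image,
    checklist: List[Tuple[str, float]],             # (judgment question, learned linear weight)
    vlm_answer_yes: Callable[[object, str], bool],  # returns True if the VLM answers "yes"
) -> float:
    """Linear-weighted sum of binary checklist judgments (illustrative sketch)."""
    score = 0.0
    for question, weight in checklist:
        answer = 1.0 if vlm_answer_yes(image, question) else 0.0
        score += weight * answer
    return score

# A few example questions with weights as reported in Table 19 of this summary;
# the full checklist and fitted weights are listed in Tables 19-22 below.
example_checklist = [
    ("Is the main subject prominent?", 0.131),
    ("Is the image not monotonous?", 0.211),
    ("Does the image avoid being blurry?", 0.065),
]
```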
Why does it matter?#
This paper is crucial because it directly addresses the limitations of current reward models in visual generation, offering a novel approach to aligning models with human preferences. This is highly relevant to the current trends in RLHF and will likely influence future research in multi-objective preference learning and video quality assessment. The dataset and code are publicly available, encouraging broader participation and accelerating progress in the field.
Visual Insights#
🔼 Figure 1 presents examples illustrating VisionReward and the Multi-Objective Preference Optimization (MPO) algorithm. Panel (a) shows a text-to-image generation task where VisionReward assigns a higher score than ImageReward, demonstrating its ability to better capture human preferences. Panel (b) similarly demonstrates VisionReward’s superior performance in a text-to-video generation task compared to VideoScore. Panels (c) and (d) show comparative results of text-to-image and text-to-video optimization, respectively. They show the original generated outputs and how those outputs are improved using different optimization methods, including DPO with different scoring methods and MPO with VisionReward. The results highlight MPO with VisionReward’s effectiveness in optimizing visual generation, leading to superior output quality according to human assessment.
Figure 1: Samples of VisionReward and Multi-Objective Preference Optimization (MPO) algorithm.
| Dimension | #Sub-dimension (Image) | #Sub-dimension (Video) | #Checklist (Image) | #Checklist (Video) |
|---|---|---|---|---|
| Alignment | 1 | 1 | 1 | 4 |
| Composition | 5 | 1 | 13 | 2 |
| Quality | 5 | 4 | 14 | 14 |
| Fidelity | 5 | 3 | 25 | 9 |
| Safety&Emotion | 2 | 1 | 8 | 4 |
| Stability | - | 5 | - | 12 |
| Dynamic | - | 2 | - | 8 |
| Physics | - | 1 | - | 4 |
| Preservation | - | 2 | - | 7 |
| Total | 18 | 20 | 61 | 64 |
🔼 This table presents the taxonomy of annotations used in the VisionReward model. It breaks down human preferences in image and video generation into multiple dimensions, each further categorized into sub-dimensions. Each sub-dimension has a corresponding set of checklist questions designed to elicit fine-grained binary human judgments. In total, image generation is covered by 5 dimensions with 18 sub-dimensions and 61 checklist questions, while video generation is covered by 9 dimensions with 20 sub-dimensions and 64 checklist questions. This detailed annotation framework allows for a more nuanced and accurate assessment of visual generation outputs.
Table 1: Taxonomy of annotation for VisionReward.
In-depth insights#
Multi-Reward Learning#
Multi-reward learning, in the context of visual generation models, presents a powerful strategy to overcome limitations of single-reward approaches. By incorporating multiple reward signals, each capturing a different aspect of human preferences (e.g., visual fidelity, aesthetics, safety), the method facilitates a more holistic and nuanced evaluation of generated outputs. This addresses the problem of confounding factors inherent in single reward scenarios where optimizing for one aspect may negatively affect another. The interpretability of multi-reward models is also improved because the individual contributions of each reward signal become explicit, enabling a more fine-grained understanding of model strengths and weaknesses. However, designing an effective multi-reward system requires careful consideration of reward weighting and potential conflicts between rewards. Careful selection and weighting of rewards become crucial to balance various factors in a way that aligns well with human preferences. Furthermore, the optimization process needs to be robust enough to handle multiple objectives simultaneously, avoiding over-optimization or suboptimal results in specific dimensions. The success of multi-reward learning hinges on the ability to disentangle the individual rewards and to develop optimization techniques that lead to a harmonious balance across all aspects of visual generation quality.
Video Quality Metrics#
Developing effective video quality metrics is crucial for evaluating video generation models. Existing metrics often fall short, failing to capture the nuances of human perception, particularly in dynamic content. A key challenge lies in assessing temporal aspects of video quality, such as motion smoothness and realism. Traditional image-based metrics are insufficient as they don’t account for temporal coherence or dynamic visual features. Therefore, new metrics must be designed that specifically address the temporal dimension of video, going beyond simple frame-by-frame analysis to incorporate motion characteristics and visual consistency across frames. Multi-dimensional approaches are promising, considering aspects like clarity, realism, and aesthetic appeal separately, instead of relying on a single, potentially biased score. Furthermore, close collaboration with human perception studies is essential for establishing truly effective video quality metrics that align with human judgment. This would validate proposed metrics against actual viewer preferences and identify weaknesses in capturing subtle but important qualities of visual experience.
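To make the frame-by-frame limitation concrete, here is a crude temporal-stability heuristic based on mean inter-frame differences. It is an illustrative baseline of the kind such metrics must go beyond, not a metric proposed in the paper.

```python
import numpy as np

def temporal_stability(frames: np.ndarray) -> float:
    """Crude stability score from mean absolute inter-frame difference.

    frames: array of shape (T, H, W, C) with pixel values in [0, 1].
    A perfectly static clip scores 1.0; flicker and jitter lower the score.
    Crucially, this heuristic cannot tell desirable motion apart from temporal
    artifacts, which is exactly why perception-aligned, multi-dimensional
    video metrics are needed.
    """
    if len(frames) < 2:
        return 1.0
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2, 3))
    return float(1.0 - diffs.mean())

# A noisy clip scores lower than a perfectly still one.
noise = np.random.rand(8, 64, 64, 3)
still = np.full((8, 64, 64, 3), 0.5)
print(temporal_stability(noise), temporal_stability(still))
```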
MPO Optimization#
The core of the proposed methodology lies in its multi-objective preference optimization (MPO) strategy. Unlike traditional single-objective approaches, MPO directly addresses the inherent trade-offs within human preferences, aiming to avoid over-optimization of certain aspects at the expense of others. This is achieved by formulating an objective function that considers multiple dimensions of visual quality simultaneously. The innovative aspect is disentangling these intertwined dimensions during training, ensuring balanced improvements across all criteria, rather than favoring one dimension excessively. This approach is crucial because human preferences in visual generation are rarely unidimensional. MPO effectively tackles the bias and lack of interpretability present in many existing reward models by employing a fine-grained reward system that’s capable of separating and prioritizing different aspects of quality. The algorithm’s structure, which is designed to ensure the optimization process does not weaken any dimension, appears to significantly enhance the stability and overall quality of the visual generation outcomes when compared with other methods. The results suggest that MPO provides a more robust and nuanced solution for aligning visual generative models with human preferences.
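A minimal sketch of the pair-selection idea behind MPO, assuming per-dimension scores are available for every generated sample. The data structures and the strictness threshold here are illustrative assumptions, not the exact released algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Sample:
    prompt: str
    output_id: str
    dim_scores: Dict[str, float]  # e.g. {"composition": ..., "fidelity": ..., ...}

def dominates(a: Sample, b: Sample, eps: float = 0.0) -> bool:
    """True if `a` is at least as good as `b` on every dimension and strictly better on one."""
    dims = a.dim_scores.keys()
    at_least_as_good = all(a.dim_scores[d] >= b.dim_scores[d] - eps for d in dims)
    strictly_better = any(a.dim_scores[d] > b.dim_scores[d] + eps for d in dims)
    return at_least_as_good and strictly_better

def select_mpo_pairs(samples: List[Sample]) -> List[Tuple[Sample, Sample]]:
    """Keep only (chosen, rejected) pairs where the chosen sample dominates the rejected one.

    Training a DPO-style objective on such pairs cannot reward trading one
    dimension off against another, which is the intuition behind MPO.
    """
    pairs = []
    for i, a in enumerate(samples):
        for b in samples[i + 1:]:
            if a.prompt != b.prompt:
                continue  # only compare outputs generated from the same prompt
            if dominates(a, b):
                pairs.append((a, b))
            elif dominates(b, a):
                pairs.append((b, a))
    return pairs
```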
Human Preference#
The concept of ‘Human Preference’ is central to this research, driving the development of VisionReward, a novel reward model designed to align image and video generation models with human aesthetic sensibilities. The authors critique existing reward models for being biased and lacking interpretability, highlighting the difficulty of evaluating video quality compared to images. They address this challenge by introducing a fine-grained, multi-dimensional reward model, decomposing human preferences into interpretable dimensions assessed through a series of judgment questions. VisionReward’s strength lies in its ability to surpass existing methods in video preference prediction, demonstrating its effectiveness in handling the dynamic aspects of video content. The integration of this fine-grained reward model with a multi-objective preference optimization algorithm further mitigates the over-optimization issues common in reinforcement learning-based approaches. The overall aim is to create more human-aligned, high-quality image and video generation models, acknowledging the nuanced and multifaceted nature of human visual appreciation.
Future of RLHF#
The future of Reinforcement Learning from Human Feedback (RLHF) hinges on addressing its current limitations. Bias in reward models, stemming from inherent biases in human preferences, needs mitigation through more sophisticated reward model design and data collection methods. This might involve incorporating diverse demographics and perspectives, and possibly moving beyond simple preference ranking to richer feedback mechanisms like detailed explanations or comparative analysis. Improving evaluation metrics for generated content is crucial; current methods often fail to fully capture the nuanced aspects of human preference. More sophisticated metrics, potentially incorporating elements of human perceptual models, are required. Furthermore, scaling RLHF to complex tasks and modalities, such as long-form video generation, presents a significant challenge. Efficient training methods and scalable reward model architectures are essential for future development. Finally, research into alignment between model behavior and human values remains a key area for future investigation. Techniques focusing on interpretability and explainability, as well as robust safety mechanisms, are vital to ensure that RLHF-trained models are both effective and ethically sound.
More visual insights#
More on figures
🔼 This figure illustrates the VisionReward system and its Multi-Objective Preference Optimization (MPO) algorithm. VisionReward answers a checklist of fine-grained judgment questions (yes/no) about a given image or video; the binary answers are then linearly weighted and combined, using weights fitted on human preference data, into a single interpretable preference score. The MPO algorithm leverages this fine-grained reward model to balance multiple, sometimes conflicting, aspects of preference during the training of visual generation models, avoiding over- or under-optimization of specific attributes. The figure displays the flow of information through the different stages of the process, from initial annotation to the analysis of preferences after model optimization.
Figure 2: An overview of the VisionReward and Multi-Objective Preference Optimization (MPO).
🔼 The bar chart visualizes score deviations across 18 sub-dimensions computed from the Pick-a-Pic preference pairs. The x-axis lists the 18 sub-dimensions, and the y-axis shows the percentage deviation of each sub-dimension's average yes-proportion. Positive values indicate that a sub-dimension is emphasized more than average, while negative values indicate the opposite. This visualization provides insight into which sub-dimensions human preference prioritizes when evaluating images.
(a) Data analysis.
🔼 This figure compares score deviations across 18 sub-dimensions for images generated by SDXL before and after Diffusion-DPO fine-tuning, using 10,000 human preference pairs from the Pick-a-Pic dataset. It visually represents how the optimization process of Diffusion-DPO affects different aspects of image quality, showing both improvements and decrements in various dimensions.
(b) DPO analysis.
🔼 Figure 3 illustrates the analysis of human preferences and the effects of preference learning on image generation. Panel (a) shows the distribution of scores across 18 sub-dimensions of image quality, each represented by the average ‘yes’ responses to binary checklist questions in the Pick-a-Pic dataset. This visualization reveals the relative importance of each sub-dimension in human perception. Panel (b) compares score deviations across the same 18 sub-dimensions for images generated by SDXL before and after fine-tuning using the Diffusion-DPO method. This comparison highlights the impact of preference learning on the alignment of generated images with human preferences. The changes observed in the score deviations after fine-tuning indicate how the model’s generation of specific image qualities has shifted in response to the training process, offering insights into the effectiveness of the optimization method.
Figure 3: (a) We sample 10,000 human preference pairs from Pick-a-Pic [20] dataset and analyze score deviations across 18 sub-dimensions (represented by the average yes-proportion of checklist questions within each sub-dimension). (b) We compare score deviations for images generated by SDXL [27] before and after Diffusion-DPO fine-tuning [40], using the same 10,000 prompts.
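Under one plausible reading of this analysis, the per-sub-dimension deviation is the average gap in yes-proportion between the preferred and rejected image of each pair. The sketch below computes that quantity; the input format is a simplifying assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def preference_deviations(
    pairs: Iterable[Tuple[str, float, float]],  # (sub_dimension, yes_rate_preferred, yes_rate_rejected)
) -> Dict[str, float]:
    """Average yes-proportion gap (preferred minus rejected) per sub-dimension, in percent."""
    gap_sum = defaultdict(float)
    count = defaultdict(int)
    for sub_dim, yes_preferred, yes_rejected in pairs:
        gap_sum[sub_dim] += yes_preferred - yes_rejected
        count[sub_dim] += 1
    return {d: 100.0 * gap_sum[d] / count[d] for d in gap_sum}

# Toy example: preferred images answer "yes" to richness questions more often.
toy = [("richness", 0.9, 0.6), ("richness", 0.8, 0.7), ("clarity", 0.7, 0.7)]
print(preference_deviations(toy))  # richness stands out, clarity does not
```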
🔼 This figure shows the results of a human evaluation comparing different methods for text-to-image generation optimization. The methods compared include a baseline, Diffusion-DPO (a diffusion-model variant of Direct Preference Optimization) trained with two different preference sources (Pick-a-Pic and HPSv2), and the authors’ proposed MPO (Multi-Objective Preference Optimization) method using VisionReward. The chart displays the win/tie/loss rates, indicating how often each method’s generated images were preferred over those of another method given the same text prompt, and visually demonstrates the improvement achieved by MPO with VisionReward.
Figure 4: Human evaluation of text-to-image MPO.
🔼 This figure displays the results of a human evaluation comparing the performance of three different methods for text-to-video optimization: a baseline method, a method using VideoScore, and the authors’ proposed method, VisionReward, with Multi-Objective Preference Optimization (MPO). The chart shows the win rate (percentage of times a video generated by a given method was preferred over another) for each of the three methods. VisionReward with MPO demonstrates a significantly higher win rate than the baseline or VideoScore methods, highlighting its superior performance in generating high-quality videos.
Figure 5: Human evaluation of text-to-video MPO.
🔼 This figure shows an example of text-to-image generation evaluation using VisionReward. The input text prompt describes a scene of gnomes playing music during an Independence Day celebration near a lake. The figure displays the generated images from different methods. VisionReward, the proposed method, outperforms the baseline (ImageReward) in terms of quality according to a linear weighted sum of multiple aspects. The generated images and the scores from VisionReward and a baseline are displayed for comparison.
(a) Text-to-image
🔼 This figure shows examples of text-to-video generation using different methods. The top row displays the original video generated from a text prompt (‘A child is eating pizza’). The bottom row shows the results after applying VisionReward (the authors’ proposed method) and VideoScore (a competing method). Visual differences and the associated scores are highlighted to illustrate the improved performance of VisionReward.
(b) Text-to-video
🔼 Figure 6 presents a comparative analysis of annotation statistics across various sub-dimensions for both image and video generation tasks. The bar charts visually represent the distribution of annotation values (ranging from -4 to +2) for each sub-dimension. This allows for a quick understanding of the relative frequency of each annotation level within each sub-dimension, highlighting potential biases or imbalances in the annotation data and providing insights into the complexity and nuances of human preference judgment across different aspects of image and video generation.
Figure 6: Annotation statistics of different sub-dimensions.
🔼 This figure shows the overall performance comparison of different methods across multiple datasets. The x-axis represents the number of training samples (in thousands), and the y-axis represents the overall score achieved. Different lines represent various approaches: MPO, HPSv2-DPO, and Pickapicv2-DPO. The graph visually illustrates how the overall score changes as the number of training samples increases for each method. The purpose is to demonstrate the effectiveness and improvement of the MPO method in achieving a better overall score compared to other methods.
(a) Overall Score
🔼 This figure shows the change in composition scores during the multi-objective preference optimization (MPO) process. The x-axis represents the number of training samples, and the y-axis represents the composition score. Three different methods are compared: MPO, DPO with HPSv2, and DPO with Pick-a-Pic. The figure shows that MPO achieves a better composition score compared to other methods.
(b) Composition Score
🔼 The figure shows the fidelity scores during the multi-objective preference optimization (MPO) process. The x-axis represents the number of training samples, and the y-axis represents the fidelity score. Three different methods are compared: MPO, DPO with Pick-a-Pic, and DPO with HPSv2. The plot illustrates how the fidelity score changes as more training samples are used in the optimization process, allowing for a comparison of the performance of the three methods with respect to the fidelity aspect of image generation.
(c) Fidelity Score
🔼 This figure shows a graph illustrating the ‘Alignment’ score over the course of the Multi-Objective Preference Optimization (MPO) process. The x-axis represents the number of training samples used, and the y-axis represents the Alignment score. Multiple lines are plotted, each representing a different optimization method: MPO, DPO with HPSv2, and DPO with Pick-a-Pic. The graph visually demonstrates how the Alignment score changes for each method as more training data is incorporated, providing insights into the effectiveness of each method in optimizing the alignment aspect of visual generation models.
(d) Alignment Score
🔼 The graph displays the ‘Quality Score’ metric over the course of the Multi-Objective Preference Optimization (MPO) process. The x-axis represents the number of training samples used, while the y-axis shows the Quality Score. Multiple lines are plotted, each representing a different optimization method (MPO, HPSv2-DPO, and Pickapicv2-DPO). The figure illustrates how the Quality Score evolves for each method as more training samples are incorporated. This visualization allows for a comparison of the performance and convergence speed of various optimization strategies.
(e) Quality Score
🔼 This figure shows the Safety & Emotion scores across different training sample sizes. The x-axis represents the number of training samples (in thousands), while the y-axis displays the score. Multiple lines represent scores from different methods: MPO (Multi-Objective Preference Optimization), HPSv2-DPO (Human Preference Score v2, using DPO optimization), and Pickapicv2-DPO (Pick-a-Pic dataset, using DPO optimization). The graph visualizes how the Safety and Emotion dimensions of the generated images change as more data is used during training with each of these different optimization methods.
(f) Safety & Emotion Score
🔼 This figure displays the changes in different dimensional scores throughout the multi-objective preference optimization (MPO) process. The x-axis represents the number of training samples used, while the y-axis shows the scores for each dimension (Overall, Composition, Fidelity, Alignment, Quality, Safety & Emotion). Different colored lines represent the scores obtained using different methods (MPO, HPSv2-DPO, and Pickapicv2-DPO). This visualization helps to understand how the scores for each dimension evolve during training and compare the performance of different optimization approaches.
Figure 7: Variation of dimensional scores during the MPO process with respect to the number of training samples.
More on tables
| Type | Source | #Samples | #Checklist |
|---|---|---|---|
| Image | ImageRewardDB [46] | 16K | 1M |
| Image | Pick-a-Pic [20] | 16K | 1M |
| Image | HPDv2 [44] | 16K | 1M |
| Video | CogVideoX [47] | 10K | 0.6M |
| Video | Open-Sora [51] | 10K | 0.6M |
| Video | VideoCrafter2 [4] | 10K | 0.6M |
| Video | Panda-70M [5] | 3K | 0.2M |
🔼 This table presents the details of the datasets used for training and annotating the VisionReward model. It shows the source of the data (e.g., ImageRewardDB, Pick-a-Pic), the number of samples (images or videos) obtained from each source, and the number of checklist items used for annotation in each dataset. This information is crucial for understanding the scale and scope of the VisionReward model’s training data.
Table 2: Statistics of source data and annotation.
Type | Image | Video |
---|---|---|
Content | People, Objects, Animals, Architecture, Landscape, Vehicles, Plants, Food, Others, Scenes | Story, Human Activity, Artificial Scene, Others, Natural Animal Activity, Physical Phenomena |
Challenge | Unreal, Style, History, Fine-grained Detail, Color, Famous Character, Normal, Famous Places, Writing, Complex Combo, Positional, Counting | Material, Angle and Lens, Emotional Expression, Color/Tone, Surreal, World Knowledge, Special Effects, Text, Spatial Relationship, Camera Movement, Logical Consistency, Style, Temporal Speed |
🔼 This table details the content and challenge categories used in the MonetBench benchmark dataset for evaluating image and video generation models. The ‘Content’ categories represent the main subject matter of the generated images or videos (e.g., people, objects, animals, scenes), while the ‘Challenge’ categories describe the level of difficulty or complexity in generating them (e.g., unreal styles, fine-grained details, complex compositions). Understanding these categories helps to assess the performance of different models under various conditions and to evaluate their ability to generate visually appealing and diverse content.
Table 3: Content and Challenge Categories of MonetBench.
| Method | Image: HPDv2 [44] | Image: MonetBench (τ*) | Image: MonetBench (diff**) | Video: GenAI-Bench [18] (τ*) | Video: GenAI-Bench (diff**) | Video: MonetBench (τ*) | Video: MonetBench (diff**) |
|---|---|---|---|---|---|---|---|
| *Task-specific discriminative models* | | | | | | | |
| ImageReward [46] | 74.0 | 48.8 | 56.5 | 48.4 | 72.1 | 55.8 | 58.4 |
| PickScore [20] | 79.8 | 49.8 | 57.6 | 52.4 | 75.4 | 57.7 | 61.6 |
| HPSv2 [44] | 83.3 | 48.4 | 55.6 | 49.3 | 73.0 | 59.3 | 62.5 |
| *Generative models* | | | | | | | |
| GPT-4o [1] | 77.5 | 38.9 | 52.7 | 41.8 | 54.3 | 45.7 | 48.3 |
| Gemini [38] | 60.7 | 27.4 | 55.1 | 46.9 | 61.7 | 52.2 | 56.8 |
| VQAScore [23] | 69.7 | 49.4 | 56.5 | 45.2 | 68.0 | 56.1 | 59.5 |
| VideoScore [11] | 76.8 | 45.8 | 52.5 | 47.8 | 71.4 | 49.1 | 54.9 |
| VisionReward (Ours) | 81.7 | 51.8 | 59.5 | 51.8 | 74.4 | 64.0 | 72.1 |
🔼 This table presents the performance of various models, including both task-specific discriminative models and generative models, on multiple datasets for predicting human preferences in image and video generation. The accuracy is measured using two metrics: Tau* which accounts for ties in the preference rankings, and diff**, which excludes ties. The best performing generative model for each metric and dataset is shown in bold. The overall best performing model across all categories and metrics is underlined.
Table 4: Preference accuracy on multiple dataset. Bold denotes the best score within the generative models, while underline signifies the best score among all categories. Tau∗ means taking account of ties [7], and diff∗∗ means dropping ties in labels (we drop ties both in labels and responses for GPT-4o and Gemini in diff∗∗ because too many ties are given by them).
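The two evaluation settings can be sketched as follows, assuming per-pair reward scores and human labels that may be ties. The tie-aware variant here is a simplified illustration in the spirit of the τ metric from [7], not its exact definition.

```python
from typing import List, Optional

def preference_accuracy(
    scores_a: List[float],
    scores_b: List[float],
    labels: List[Optional[str]],   # "a", "b", or None for a human tie
    tie_margin: float = 0.0,       # |score_a - score_b| <= margin counts as a predicted tie
    drop_ties: bool = True,
) -> float:
    """Pairwise preference accuracy.

    drop_ties=True mirrors the "diff" setting (tied labels are discarded);
    drop_ties=False is a simple tie-aware variant, where a prediction is only
    correct if it matches the label, including ties.
    """
    correct, total = 0, 0
    for sa, sb, label in zip(scores_a, scores_b, labels):
        if drop_ties and label is None:
            continue
        pred = None if abs(sa - sb) <= tie_margin else ("a" if sa > sb else "b")
        total += 1
        correct += int(pred == label)
    return correct / total if total else 0.0
```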
Method | Image Composition | Image Quality | Image Fidelity | Image Safety&Emotion | Video Stability | Video Dynamic | Video Physics | Video Preservation |
---|---|---|---|---|---|---|---|---|
LLaVa* | 59.9 | 65.7 | 80.9 | 64.4 | 52.5 | 53.8 | 50.6 | 47.5 |
CogVLM2 [16] | 65.8 | 67.1 | 53.1 | 74.7 | 49.3 | 57.1 | 51.2 | 47.8 |
GPT-4o [1] | 73.1 | 62.7 | 61.9 | 70.1 | 57.9 | 69.1 | 62.4 | 58.8 |
Gemini [38] | 69.4 | 59.9 | 59.7 | 74.9 | 58.1 | 71.1 | 58.1 | 59.6 |
VisionReward (Ours) | 78.8 | 81.1 | 80.9 | 83.9 | 64.8 | 75.4 | 68.1 | 72.0 |
🔼 This table presents a comparison of the accuracy of VisionReward and other vision-language models (VLMs) in answering vision quality assessment questions. The questions were designed based on the annotation framework presented in the paper. The accuracy is evaluated across various dimensions of image and video quality (Composition, Quality, Fidelity, Safety&Emotion, Stability, Dynamic, Physics, Preservation). Note that LLaVA-v1.5-7B is used for image evaluation and LLaVA-Next-Video-34B is used for video evaluation.
Table 5: Accuracy of VisionReward and other vision-language models (VLMs) on vision quality questions constructed from our annotation. ∗We test LLaVA-v1.5-7B [24] for image and LLava-Next-Video-34B [21] for video.
| Dimension | Composition | Quality | Fidelity | Safety | Stability | Dynamic | Physics | Preservation |
|---|---|---|---|---|---|---|---|---|
| Consistency (%) | 97.9 | 98.2 | 98.3 | 99.1 | 97.4 | 99.9 | 88.2 | 99.8 |
🔼 This table presents the internal consistency of the VisionReward model across its different dimensions. Each dimension represents a different aspect of image or video quality (e.g., Composition, Quality, Fidelity, Safety&Emotion). The values show, for each dimension, the percentage of cases in which VisionReward’s checklist judgments are mutually consistent. High consistency (near 100%) suggests that the model is reliable and stable in evaluating these aspects; lower consistency, as in the Physics dimension, indicates areas where the model may need further improvement.
Table 6: Consistency of VisionReward in each dimension.
| Size | 100 | 200 | 500 | 1k | 2k | 4k | 8k | 16k |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 76.5 | 77.6 | 80.3 | 80.6 | 80.9 | 81.3 | 81.2 | 81.3 |
🔼 This table presents the average accuracy achieved by the logistic regression model for different training set sizes. It shows how the model’s performance changes as more training data is used, demonstrating the impact of dataset size on the accuracy of human preference prediction in the regression task.
Table 7: Average accuracy for different regression sizes.
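The regression referred to here can be sketched as fitting a linear model on the difference between the two images’ checklist answers in each preference pair. The L1 penalty is an assumption chosen to mirror the many zero weights in Tables 19–22, and the feature construction is illustrative rather than the paper’s exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_preference_weights(checklist_a: np.ndarray,
                           checklist_b: np.ndarray,
                           prefer_a: np.ndarray) -> np.ndarray:
    """Fit linear weights over checklist answers from pairwise preferences.

    checklist_a / checklist_b: (n_pairs, n_questions) binary answer matrices
    for the two images in each pair; prefer_a: 1 if image A was preferred.
    The learned coefficients play the role of the per-question weights.
    """
    features = checklist_a - checklist_b  # difference of binary judgments
    model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0, fit_intercept=False)
    model.fit(features, prefer_a)
    return model.coef_.ravel()

# Toy usage with random data (real training uses the annotated preference pairs).
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(200, 10))
b = rng.integers(0, 2, size=(200, 10))
y = (a.sum(axis=1) > b.sum(axis=1)).astype(int)
print(fit_preference_weights(a, b, y).round(2))
```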
Methods | CLIP | Aes | HPSv2 | PickScore |
---|---|---|---|---|
Baseline | 0.273 | 5.463 | 0.282 | 22.25 |
DPO with Pick-a-Pic | 0.279 | 5.511 | 0.286 | 22.45 |
DPO with HPSv2 | 0.277 | 5.599 | 0.292 | 22.58 |
MPO (Ours) | 0.279 | 5.612 | 0.289 | 22.61 |
🔼 This table evaluates text-to-image generation on the DrawBench benchmark. It compares the baseline model against Diffusion-DPO fine-tuning with either Pick-a-Pic or HPSv2 preference data, and against the proposed MPO (Multi-Objective Preference Optimization) method with VisionReward. Four external metrics are reported: CLIP score, aesthetic score (Aes), HPSv2, and PickScore.
Table 8: Evaluation results of multiple metrics on DrawBench.
Methods | Composition | Quality | Fidelity | Safety&Emotion |
---|---|---|---|---|
Baseline | 0.755 | 0.550 | 0.009 | -0.008 |
DPO with Pick-a-Pic | 0.765 | 0.588 | 0.009 | -0.009 |
DPO with HPSv2 | 0.874 | 0.630 | 0.010 | -0.004 |
MPO (Ours) | 0.894 | 0.670 | 0.017 | -0.001 |
🔼 This table presents a detailed breakdown of the evaluation results obtained using VisionReward. It compares multiple metrics (Composition, Quality, Fidelity, Safety&Emotion) across different methods: a baseline, DPO with Pick-a-Pic, DPO with HPSv2, and the proposed MPO (Ours). The numbers represent quantitative scores for each metric under each method. This allows for a comprehensive comparison of performance across various approaches to optimizing visual generation models.
Table 9: Evaluation results analyzed by VisionReward.
Methods | CLIP | Aes | HPSv2 | PickScore |
---|---|---|---|---|
Baseline | 0.273 | 5.463 | 0.282 | 22.25 |
DPO with VisionReward | 0.278 | 5.664 | 0.291 | 22.227 |
MPO with VisionReward | 0.278 | 5.719 | 0.291 | 22.505 |
🔼 This table presents a comparison of the performance of different methods on the DrawBench benchmark. The methods include a baseline, DPO with VisionReward, and MPO with VisionReward. The results are evaluated using multiple metrics, such as CLIP, AES, HPSv2, and PickScore. The table shows the numerical results for each metric, allowing for a direct comparison of the effectiveness of the various approaches.
Table 10: Evaluation results on DrawBench.
| Methods | Human Action | Scene | Multiple Objects | Appearance Style |
|---|---|---|---|---|
| Baseline | 98.20 | 55.60 | 68.43 | 24.20 |
| VideoScore | 97.60 | 56.25 | 68.66 | 23.96 |
| VisionReward | 98.40 | 57.57 | 71.54 | 24.02 |
🔼 This table presents the quantitative evaluation results of different methods on the VBench benchmark. VBench is a video quality assessment benchmark that evaluates several key aspects of video generation. The table shows the performance scores of three different methods: Baseline (original model), VideoScore (a model for video quality prediction), and VisionReward (the authors’ proposed method). The performance is measured across multiple aspects of video quality, including aspects like human action, scene, objects, and appearance style. This allows for a comparison of the different methods’ effectiveness in generating high-quality videos across these various dimensions.
Table 11: Evaluation results on VBench.
Methods | Stability | Dynamic | Physics | Preservation |
---|---|---|---|---|
Baseline | 0.272 | 0.047 | 0.323 | 0.584 |
VideoScore | 0.242 | 0.046 | 0.319 | 0.557 |
VisionReward | 0.309 | 0.036 | 0.337 | 0.661 |
🔼 This table reports text-to-video optimization results on MonetBench prompts, comparing the baseline video model against models optimized with VideoScore and with VisionReward. Scores are broken down along four video-specific VisionReward dimensions (Stability, Dynamic, Physics, Preservation), allowing a per-dimension comparison of the optimization methods.
Table 12: Evaluation results on MonetBench.
Methods | Baseline | Total | Dimension | Sub-dimension |
---|---|---|---|---|
VisionReward | 4.303 | 4.515 | 4.573 | 4.514 |
🔼 This table presents a comparison of VisionReward’s performance after applying the Multi-Objective Preference Optimization (MPO) algorithm using three different dominance criteria. The three criteria are: (1) Total Weighted Score: where one image’s reward is considered dominant if its total score is higher than another’s; (2) Dimension Score: where one image’s reward is considered dominant if its score is higher than another’s on all individual dimensions; and (3) Sub-dimension Score: where one image’s reward is considered dominant if its score is higher than another’s on all individual sub-dimensions. The table shows the resulting VisionReward scores for each dominance strategy, allowing for analysis of which strategy yields the best performance.
Table 13: Score of VisionReward after different strategies of MPO. Total: “dominate” based on total weighted score. Dimension: “dominate” based on score of each dimension. Sub-dimension: “dominate” based on score of each sub-dimension.
| Role | Content |
|---|---|
| SYSTEM | Assume you are a model responsible for refining and polishing English expressions. You will receive an English prompt that may contain abbreviations or non-standard expressions. Your task is to standardize the expressions, and your output must be in pure English without any non-English characters. If the prompt is fragmented or difficult to understand, discard it by outputting ”F”. Your output must strictly follow the format: each sentence should be on a single line, either as the rewritten prompt or a standalone ”F”. |
| USER | Here is the prompt you have received: [[PROMPT]] |
| INPUT | Soft rays of light through the many different types of trees inside a forest, sunrise, misty, photorealistic, ground level, -neg "no large bodies of water" -ar 16:9 4K, -ar 16:9 |
| OUTPUT | The soft rays of light filter through the myriad types of trees within the forest at sunrise, creating a misty, photorealistic scene from ground level. Exclude any large bodies of water. The aspect ratio should be 16:9 in 4K resolution. Aspect ratio: 16:9. |
🔼 This table demonstrates the prompt template and an example of how prompts are cleaned for the video annotation process. The prompt template shows the structure and formatting required for input prompts to ensure the quality and consistency of annotation data for training the VisionReward model. The example highlights how a potentially ambiguous or informal prompt is transformed into a clearer and more structured one that is easier for annotators to understand and use.
Table 14: Prompt template and example for prompt cleaning.
| Dimension | Sub-dimension | Option | Checklist |
|---|---|---|---|
| Composition | Symmetry | symmetrical | Is the image symmetrical? |
| | | ordinary | Does the image avoid asymmetry? |
| | | asymmetrical | |
| Composition | Object pairing | coordinated | Are the objects well-coordinated? |
| | | ordinary | Does the image avoid poorly coordinated objects? |
| | | uncoordinated | |
| Composition | Main object | prominent | Is the main subject prominent? |
| | | ordinary | Does the image avoid an unclear main subject? |
| | | prominent | |
| Composition | Richness | very rich | Is the image very rich? |
| | | rich | Is the image rich? |
| | | ordinary | Is the image not monotonous? |
| | | monotonous | Is the image not empty? |
| | | empty | |
| Composition | Background | beautiful | Is the background beautiful? |
| | | somewhat beautiful | Is the background somewhat beautiful? |
| | | ordinary | Is there a background? |
| | | no background | |
| Quality | Clarity | very clear | Is the image very clear? |
| | | clear | Is the image clear? |
| | | ordinary | Does the image avoid being blurry? |
| | | blurry | Does the image avoid being completely blurry? |
| | | completely blurry | |
| Quality | Color Brightness | bright | Are the colors bright? |
| | | ordinary | Are the colors not dark? |
| | | dark | |
| Quality | Color Aesthetic | beautiful colors | Are the colors beautiful? |
| | | ordinary colors | Are the colors not ugly? |
| | | ugly colors | |
| Quality | Lighting Distinction | very distinct | Is the lighting and shadow very distinct? |
| | | distinct | Is the lighting and shadow distinct? |
| | | ordinary | Is there lighting and shadow? |
| | | no lighting | |
| Quality | Lighting Aesthetic | very beautiful | Are the lighting and shadows very beautiful? |
| | | beautiful | Are the lighting and shadows beautiful? |
| | | ordinary | Is there lighting and shadow? |
| | | no lighting | |
🔼 This table details the annotation taxonomy and checklist used for evaluating image generation quality. It breaks down the evaluation criteria into dimensions (Alignment, Composition, Quality, Fidelity, Safety & Emotion) and sub-dimensions, providing a checklist of binary (yes/no) questions for annotators to assess each image against. The questions are designed to capture fine-grained aspects of image quality related to the specified dimensions. For example, the ‘Composition’ dimension includes sub-dimensions like ‘Symmetry’ and ‘Object pairing’, with accompanying checklist questions evaluating whether the image exhibits symmetry or whether objects are well-coordinated.
Table 15: Annotation taxonomy and checklist details for text-to-image evaluation. (part 1)
| Dimension | Sub-dimension | Option | Checklist |
|---|---|---|---|
| Fidelity | Detail realism | realistic | Are the image details realistic? |
| | | neutral | Do the image details avoid being unrealistic? |
| | | unrealistic | Do the image details avoid being very unrealistic? |
| | | very unrealistic | Do the image details avoid being greatly unrealistic? |
| | | greatly unrealistic | |
| Fidelity | Detail refinement | very refined | Are the image details very exquisite? |
| | | refined | Are the image details exquisite? |
| | | ordinary | Do the image details avoid being coarse? |
| | | rough | Do the image details avoid being very coarse? |
| | | very rough | Does the image avoid being hard to recognize? |
| | | indistinguishable | Does the image avoid being fragmented? |
| | | fragmented | |
| Fidelity | Body | no errors | Is the human body in the image completely correct? |
| | | neutral | Does the human body in the image avoid errors? |
| | | some errors | Does the human body in the image avoid obvious errors? |
| | | obvious errors | Does the human body in the image avoid serious errors? |
| | | serious errors | Is there a human body in the image? |
| | | no human figure | |
| Fidelity | Face | very beautiful | Is the human face very beautiful? |
| | | beautiful | Is the human face beautiful? |
| | | normal | Does the human face avoid errors? |
| | | some errors | Does the human face avoid serious errors? |
| | | serious errors | Is there a human face in the image? |
| | | no human face | |
| Fidelity | Hands | perfect | Are the human hands perfect? |
| | | mostly correct | Are the human hands essentially correct? |
| | | minor errors | Do the human hands avoid obvious errors? |
| | | obvious errors | Do the human hands avoid serious errors? |
| | | serious errors | Are there human hands in the image? |
| | | no human hands | |
| Safety & Emotion | Emotion | very positive | Can the image evoke a very positive emotional response? |
| | | positive | Can the image evoke a positive emotional response? |
| | | ordinary | Does the image avoid evoking a negative emotional response? |
| | | negative | Does the image avoid evoking a very negative emotional response? |
| | | very negative | |
| Safety & Emotion | Safety | safe | Is the image completely safe? |
| | | neutral | Is the image harmless? |
| | | potentially harmful | Does the image avoid obvious harmfulness? |
| | | harmful | Does the image avoid serious harmfulness? |
| | | very harmful | |
🔼 This table details the annotation taxonomy and checklist used for evaluating the fidelity and safety/emotional aspects of text-to-image generation. For each dimension (Fidelity, Safety & Emotion), several sub-dimensions are listed, each with options ranging from best to worst quality. Corresponding checklist questions help annotators assess each option (yes/no). This provides a fine-grained approach to human preference evaluation.
Table 16: Annotation taxonomy and checklist details for text-to-image evaluation. (part 2)
| Dimension | Sub-dimension | Option | Checklist |
|---|---|---|---|
| Alignment | Alignment | meet 100% | Does the video meet all the requirements stated in the text ”[[prompt]]”? |
| | | meet 80%-100% | Does the video meet most of the requirements stated in the text ”[[prompt]]”? |
| | | meet 60%-80% | Does the video meet some of the requirements stated in the text ”[[prompt]]”? |
| | | meet 40%-60% | Does the video not completely fail to meet the requirements stated in the text ”[[prompt]]”? |
| | | meet 0-40% | |
| Composition | Composition | good | Is the composition aesthetically pleasing? |
| | | normal | Does the composition have no obvious flaws? |
| | | bad | |
| Quality | Color | very beautiful | Are the colors exceptionally beautiful? |
| | | beautiful | Are the colors beautiful? |
| | | normal | Are the colors not unattractive? |
| | | unattractive | Are the colors not significantly unattractive? |
| | | very unattractive | |
| Quality | Lighting Accurate | good | Is the lighting perfectly accurate? |
| | | normal | Does the lighting have no obvious errors? |
| | | bad | Is there any lighting present? |
| | | no lighting | |
| Quality | Lighting Aes | very good | Is the lighting exceptionally beautiful? |
| | | good | Is the lighting beautiful? |
| | | normal | Is the lighting not unattractive? |
| | | bad | |
| Quality | Clear | very clear | Is it very clear? |
| | | clear | Is it clear? |
| | | normal | Is it not blurry? |
| | | blurry | Is it not completely blurry? |
| | | completely blurry | |
| Fidelity | Detail Refinement | very refined | Are the details very refined? |
| | | refined | Are the details refined? |
| | | normal | Are the details not rough? |
| | | rough | Are the details not significantly rough? |
| | | very rough | |
| Fidelity | Movement Reality | good | Is the object’s movement completely realistic? |
| | | normal | Does the object’s movement have no obvious realism issues? |
| | | bad | |
| Fidelity | Letters | good | Are all the letters correct? |
| | | normal | Do the letters have no obvious errors? |
| | | bad | Are there any letters present? |
| | | no letter | |
| Safety | Safety | 100% safe | Is the video content safe? |
| | | 80%-100% safe | Is the video content definitely free of harmful material? |
| | | 60%-80% safe | Does the video content contain no harmful material? |
| | | 40%-60% safe | Does the video content contain no extremely harmful material? |
| | | 0-40% safe | |
🔼 This table details the annotation taxonomy and checklist used for evaluating text-to-video generation. It breaks down the evaluation criteria into dimensions (e.g., Alignment, Composition, Quality, Fidelity, Safety & Emotion), sub-dimensions (e.g., Alignment, Composition, Color, Lighting Accuracy), options (e.g., meet 100%, meet 80%-100%, good, normal, bad), and the corresponding checklist questions used by annotators to assess the generated videos. This structured approach allows for a fine-grained and comprehensive evaluation of various aspects of the generated videos.
Table 17: Annotation taxonomy and checklist details for text-to-video evaluation. (part 1)
| Dimension | Sub-dimension | Option | Checklist |
|---|---|---|---|
| Stability | Movement smoothness | good | Is the smoothness of the object’s movement good? |
| | | normal | Does the smoothness of the object’s movement have no obvious issues? |
| | | bad | |
| Stability | Image quality stability | very stable | Is the image quality very stable? |
| | | stable | Is the image quality stable? |
| | | normal | Is the image quality not unstable? |
| | | unstable | Is the image quality free of noticeable instability? |
| | | very unstable | |
| Stability | Focus | good | Is the focus aesthetically pleasing? |
| | | normal | Does the focus have no obvious flaws? |
| | | bad | |
| Stability | Camera movement | good | Is the camera movement aesthetically pleasing? |
| | | normal | Does the camera movement have no obvious flaws? |
| | | bad | |
| Stability | Camera stability | stable | Is the camera stable? |
| | | normal | Is the camera not unstable? |
| | | unstable | |
| Preservation | Shape at beginning | completely accurate | Is the shape of the object at the beginning of the video completely accurate? |
| | | no errors | Does the shape of the object at the beginning have no obvious errors? |
| | | not chaotic | Is the shape of the object at the beginning not chaotic? |
| | | flawed | |
| Preservation | Shape throughout | perfectly maintained | Is the shape of the object perfectly maintained throughout the video? |
| | | no issues | Does the shape of the object have no obvious issues throughout the video? |
| | | normal | Does the shape of the object generally have no major issues throughout the video? |
| | | not chaotic | Is the shape of the object not chaotic throughout the video? |
| | | flawed | |
| Dynamic | Object Motion dynamic | highly dynamic | Is the object’s motion highly dynamic? |
| | | dynamic | Is the object’s motion dynamic? |
| | | normal | Is the object’s motion not minimal? |
| | | not static | Is the object’s motion not static? |
| | | static | |
| Dynamic | Camera motion dynamic | highly dynamic | Is the camera motion highly dynamic? |
| | | dynamic | Is the camera motion dynamic? |
| | | not minimal | Is the camera motion not minimal? |
| | | not static | Is the camera motion not static? |
| | | static | |
| Physics | Physics law | full compliance | Does it fully comply with the laws of physics? |
| | | partial compliance | Does it partially comply with the laws of physics? |
| | | no obvious violations | Does it have no obvious violations of the laws of physics? |
| | | physical world | Is the video content part of the physical world? |
| | | non-compliance | |
🔼 This table details the annotation taxonomy and checklist used for evaluating the quality of videos generated from text prompts. It breaks down video quality into several dimensions (Stability, Preservation, Dynamic, Physics), each with multiple sub-dimensions. For each sub-dimension, several options are provided, ranging from very positive (e.g., ‘perfectly maintained’) to very negative (e.g., ‘very unstable’). Corresponding checklist questions facilitate the annotation process by enabling annotators to evaluate each sub-dimension against these options.
Table 18: Annotation taxonomy and checklist details for text-to-video evaluation. (part 2)
ID | Checklist | Acc | ρ | Weight |
---|---|---|---|---|
1 | Is there a human body in the image? | 93.13 | 0.090 | mask |
2 | Is there a human face in the image? | 96.20 | 0.110 | mask |
3 | Are there human hands in the image? | 93.30 | 0.022 | mask |
4 | Is the image symmetrical? | 79.98 | 0.104 | 0.069 |
5 | Does the image avoid asymmetry? | 71.30 | 0.236 | 0.102 |
6 | Are the objects well-coordinated? | 58.31 | 0.138 | 0.000 |
7 | Does the image avoid poorly coordinated objects? | 68.24 | 0.204 | 0.000 |
8 | Is the main subject prominent? | 86.27 | 0.210 | 0.131 |
9 | Does the image avoid an unclear main subject? | 77.75 | 0.258 | 0.070 |
10 | Is the image very rich? | 80.40 | 0.084 | 0.056 |
11 | Is the image rich? | 65.84 | 0.138 | 0.044 |
12 | Is the image not monotonous? | 77.01 | 0.271 | 0.211 |
13 | Is the image not empty? | 99.67 | 0.205 | 0.583 |
14 | Is the background beautiful? | 72.70 | -0.019 | 0.000 |
15 | Is the background somewhat beautiful? | 67.26 | 0.021 | 0.000 |
16 | Is there a background? | 84.86 | 0.079 | mask |
17 | Is the image very clear? | 63.85 | 0.111 | 0.051 |
18 | Is the image clear? | 62.03 | 0.170 | 0.068 |
19 | Does the image avoid being blurry? | 88.92 | 0.284 | 0.065 |
20 | Does the image avoid being completely blurry? | 97.11 | 0.282 | 0.032 |
21 | Are the colors bright? | 63.69 | 0.098 | 0.076 |
22 | Are the colors not dark? | 82.88 | 0.141 | 0.077 |
23 | Are the colors beautiful? | 65.84 | 0.115 | 0.000 |
24 | Are the colors not ugly? | 74.77 | 0.232 | 0.042 |
25 | Is the lighting and shadow very distinct? | 75.45 | -0.043 | 0.000 |
26 | Is the lighting and shadow distinct? | 58.37 | 0.035 | 0.000 |
27 | Is there lighting and shadow? | 75.93 | 0.108 | mask |
28 | Are the lighting and shadows very beautiful? | 80.47 | -0.055 | 0.000 |
29 | Are the lighting and shadows beautiful? | 71.99 | -0.026 | 0.000 |
30 | Can the image evoke a very positive emotional response? | 82.63 | 0.068 | 0.051 |
31 | Can the image evoke a positive emotional response? | 63.94 | 0.117 | 0.000 |
32 | Does the image avoid evoking a negative emotional response? | 76.01 | 0.179 | 0.000 |
33 | Does the image avoid evoking a very negative emotional response? | 91.56 | 0.117 | 0.000 |
34 | Are the image details very exquisite? | 74.03 | 0.078 | 0.010 |
35 | Are the image details exquisite? | 71.79 | 0.091 | 0.000 |
36 | Do the image details avoid being coarse? | 68.73 | 0.215 | 0.000 |
37 | Do the image details avoid being very coarse? | 84.62 | 0.247 | 0.000 |
38 | Does the image avoid being hard to recognize? | 87.34 | 0.267 | 0.017 |
39 | Does the image avoid being fragmented? | 85.36 | 0.288 | 0.115 |
40 | Are the image details realistic? | 63.85 | 0.099 | 0.000 |
🔼 This table presents a detailed breakdown of the VisionReward model’s performance on text-to-image generation tasks. For each of the 40 binary checklist questions used to evaluate the generated images, it shows the accuracy (Acc) of the model’s predictions, the Spearman rank correlation coefficient (ρ) between model predictions and human judgments, and the learned linear weight assigned to that question in the final VisionReward score. The ‘mask’ column indicates whether a mask was used to filter out certain instances based on the absence or presence of specific elements in the images (e.g., if there’s a hand, we assess that specific hand based assessment criteria), making the evaluation more targeted and relevant. This part of the table focuses on the first 40 checklist items.
Table 19: Accuracy, spearman correlation, and linear weights of VisionReward in text-to-image. (Part 1)
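The "mask" weights in this table mark gating questions: checklist items about, for example, faces or hands contribute only when the corresponding element is present. Below is a small sketch of that scoring rule; the gate-to-question wiring and the skip-on-"no" behavior are assumptions about how the mask is applied.

```python
from typing import Dict, List, Optional, Tuple

# (question, weight, gating_question) — weight is None for pure mask questions.
Checklist = List[Tuple[str, Optional[float], Optional[str]]]

def masked_score(answers: Dict[str, bool], checklist: Checklist) -> float:
    """Weighted sum of answers, skipping questions whose gate answered 'no'."""
    score = 0.0
    for question, weight, gate in checklist:
        if weight is None:
            continue                      # mask questions carry no weight themselves
        if gate is not None and not answers.get(gate, False):
            continue                      # gated question is ignored when the gate is "no"
        score += weight * float(answers.get(question, False))
    return score

# Hypothetical wiring of a few rows from Tables 19-20.
checklist = [
    ("Is there a human face in the image?", None, None),                  # mask
    ("Does the human face avoid serious errors?", 0.077,
     "Is there a human face in the image?"),
    ("Is the main subject prominent?", 0.131, None),
]
print(masked_score({"Is there a human face in the image?": False,
                    "Is the main subject prominent?": True}, checklist))  # 0.131
```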
ID | Checklist | Acc | ρ | Weight |
---|---|---|---|---|
41 | Do the image details avoid being unrealistic? | 63.94 | 0.140 | 0.000 |
42 | Do the image details avoid being very unrealistic? | 74.19 | 0.156 | 0.000 |
43 | Do the image details avoid being greatly unrealistic? | 83.62 | 0.177 | 0.000 |
44 | Is the human body in the image completely correct? | 61.31 | 0.063 | 0.082 |
45 | Does the human body in the image avoid errors? | 59.02 | 0.129 | 0.000 |
46 | Does the human body in the image avoid obvious errors? | 82.57 | 0.135 | 0.055 |
47 | Does the human body in the image avoid serious errors? | 90.83 | 0.121 | 0.030 |
48 | Is the human face very beautiful? | 65.50 | -0.046 | 0.000 |
49 | Is the human face beautiful? | 56.88 | -0.006 | 0.000 |
50 | Does the human face avoid errors? | 57.61 | 0.113 | 0.031 |
51 | Does the human face avoid serious errors? | 91.56 | 0.132 | 0.077 |
52 | Are the human hands perfect? | 90.18 | -0.015 | 0.072 |
53 | Are the human hands essentially correct? | 25.84 | 0.059 | 0.000 |
54 | Do the human hands avoid obvious errors? | 37.98 | 0.066 | 0.000 |
55 | Do the human hands avoid serious errors? | 77.26 | 0.048 | 0.000 |
56 | Is the image completely safe? | 78.74 | 0.118 | 0.000 |
57 | Is the image harmless? | 86.44 | 0.106 | 0.000 |
58 | Does the image avoid obvious harmfulness? | 92.39 | 0.109 | 0.012 |
59 | Does the image avoid serious harmfulness? | 92.80 | 0.092 | 0.015 |
60 | Does the image show ”[[prompt]]”? | - | 0.297 | 2.354 |
🔼 This table presents a detailed breakdown of the VisionReward model’s performance on text-to-image generation tasks. For each of several image quality dimensions (e.g., body correctness, lighting aesthetic), it lists the accuracy of the model’s binary classification (‘yes’ or ’no’) for a series of judgment questions. Additionally, it provides the Spearman rank correlation coefficient (ρ), measuring the strength and direction of the monotonic relationship between VisionReward’s predictions and human judgments, and the learned linear weights (Weight) that VisionReward assigns to each judgment question in its overall score calculation. The ‘mask’ column indicates whether a question was masked during training (only evaluated when relevant aspects are present in the image).
Table 20: Accuracy, spearman correlation, and linear weights of VisionReward in text-to-image. (Part 2)
ID | Checklist | Acc | ρ | Weight |
---|---|---|---|---|
1 | Does the video meet all the requirements stated in the text ”[[prompt]]”? | 69.5 | 0.315 | 0.954 |
2 | Does the video meet most of the requirements stated in the text ”[[prompt]]”? | 72.9 | 0.303 | 0.252 |
3 | Does the video meet some of the requirements stated in the text ”[[prompt]]”? | 72.9 | 0.281 | 0.000 |
4 | Does the video not completely fail to meet the requirements stated in the text ”[[prompt]]”? | 78.7 | 0.320 | 1.142 |
5 | Is the composition aesthetically pleasing? | 50.8 | 0.263 | 0.035 |
6 | Does the composition have no obvious flaws? | 90.4 | 0.239 | 0.025 |
7 | Is the focus aesthetically pleasing? | 49.8 | 0.232 | 0.000 |
8 | Does the focus have no obvious flaws? | 91.6 | 0.246 | 0.000 |
9 | Is the camera movement aesthetically pleasing? | 76.2 | 0.012 | 0.000 |
10 | Does the camera movement have no obvious flaws? | 97.3 | 0.142 | 0.126 |
11 | Are the colors exceptionally beautiful? | 46.5 | 0.214 | 0.000 |
12 | Are the colors beautiful? | 50.1 | 0.217 | 0.000 |
13 | Are the colors not unattractive? | 82.2 | 0.225 | 0.000 |
14 | Are the colors not significantly unattractive? | 88.6 | 0.202 | 0.032 |
15 | Is the lighting perfectly accurate? | 51.9 | 0.346 | 0.163 |
16 | Does the lighting have no obvious errors? | 86.2 | 0.259 | 0.217 |
17 | Is there any lighting present? | 87.8 | 0.215 | 0.020 |
18 | Is the lighting exceptionally beautiful? | 65.1 | 0.212 | 0.136 |
19 | Is the lighting beautiful? | 55.8 | 0.240 | 0.096 |
20 | Is the lighting not unattractive? | 83.5 | 0.280 | 0.155 |
🔼 This table presents a detailed breakdown of the VisionReward model’s performance on text-to-video generation tasks. It shows the accuracy of the model’s binary classifications (‘yes’/’no’) for various aspects of video quality, as determined by human judges. Spearman correlation coefficients indicate the strength of the linear relationship between the model’s predictions and human judgments for each quality aspect. Finally, linear weights are provided, reflecting the relative importance assigned to each aspect in the model’s overall video quality score.
Table 21: Accuracy, spearman correlation, and linear weights of VisionReward in text-to-video. (Part 1)
ID | Checklist | Acc | ρ | Weight |
---|---|---|---|---|
21 | Is the shape of the object at the beginning of the video completely accurate? | 63.0 | 0.292 | 0.129 |
22 | Does the shape of the object at the beginning have no obvious errors? | 76.3 | 0.274 | 0.099 |
23 | Is the shape of the object at the beginning not chaotic? | 91.3 | 0.256 | 0.188 |
24 | Is the shape of the object perfectly maintained throughout the video? | 54.2 | 0.300 | 0.184 |
25 | Does the shape of the object have no obvious issues throughout the video? | 68.8 | 0.267 | 0.000 |
26 | Does the shape of the object generally have no major issues throughout the video? | 84.5 | 0.259 | 0.000 |
27 | Is the shape of the object not chaotic throughout the video? | 93.5 | 0.240 | 0.264 |
28 | Is the object’s motion highly dynamic? | 78.0 | -0.079 | 0.000 |
29 | Is the object’s motion dynamic? | 69.0 | -0.024 | 0.000 |
30 | Is the object’s motion not minimal? | 71.2 | -0.009 | 0.000 |
31 | Is the object’s motion not static? | 66.5 | -0.014 | 0.000 |
32 | Is the camera motion highly dynamic? | 86.9 | -0.054 | 0.112 |
33 | Is the camera motion dynamic? | 80.6 | -0.062 | 0.000 |
34 | Is the camera motion not minimal? | 72.1 | -0.061 | 0.052 |
35 | Is the camera motion not static? | 58.1 | -0.059 | 0.000 |
36 | Is the smoothness of the object’s movement very good? | 59.8 | 0.263 | 0.026 |
37 | Does the smoothness of the object’s movement have no obvious issues? | 61.6 | 0.139 | 0.000 |
38 | Is the object’s movement completely realistic? | 66.8 | 0.338 | 0.439 |
39 | Does the object’s movement have no obvious realism issues? | 69.2 | 0.235 | 0.000 |
40 | Is it very clear? | 52.1 | 0.261 | 0.000 |
41 | Is it clear? | 51.0 | 0.290 | 0.000 |
42 | Is it not blurry? | 81.8 | 0.271 | 0.000 |
43 | Is it not completely blurry? | 93.1 | 0.226 | 0.000 |
44 | Is the image quality very stable? | 43.1 | 0.313 | 0.269 |
45 | Is the image quality stable? | 61.2 | 0.294 | 0.000 |
46 | Is the image quality not unstable? | 79.0 | 0.277 | 0.000 |
47 | Is the image quality free of noticeable instability? | 87.6 | 0.247 | 0.000 |
48 | Is the camera very stable? | 54.2 | 0.197 | 0.000 |
49 | Is the camera not unstable? | 83.5 | 0.267 | 0.000 |
50 | Are the details very refined? | 73.0 | 0.324 | 0.429 |
51 | Are the details relatively refined? | 62.3 | 0.331 | 0.000 |
52 | Are the details not rough? | 74.2 | 0.302 | 0.008 |
53 | Are the details not significantly rough? | 89.2 | 0.271 | 0.128 |
54 | Are all the letters correct? | 87.3 | 0.114 | 0.058 |
55 | Do the letters have no obvious errors? | 86.8 | 0.115 | 0.000 |
56 | Are there any letters present? | 89.7 | 0.104 | 0.145 |
57 | Does it fully comply with the laws of physics? | 36.6 | 0.254 | 0.000 |
58 | Does it partially comply with the laws of physics? | 66.7 | 0.248 | 0.000 |
59 | Does it have no obvious violations of the laws of physics? | 77.4 | 0.231 | 0.000 |
60 | Is the video content part of the physical world? | 86.6 | 0.231 | 0.394 |
61 | Is the video content safe? | 92.8 | 0.000 | 0.000 |
62 | Is the video content definitely free of harmful material? | 94.3 | 0.000 | 0.000 |
63 | Does the video content contain no harmful material? | 97.7 | 0.000 | 0.000 |
64 | Does the video content contain no extremely harmful material? | 100.0 | 0.000 | 0.000 |
🔼 This table continues the per-question breakdown of VisionReward’s text-to-video checklist. For each question, ‘Acc’ gives the model’s accuracy in predicting the human-annotated ‘yes’/’no’ answer, ρ gives the Spearman (rank) correlation between the model’s prediction and human judgment, and ‘Weight’ gives the learned linear weight of the question in the final VisionReward score. A higher ρ indicates a stronger correlation with human judgment, and a larger weight indicates a greater contribution to the overall preference score. Together, these columns show which aspects of video quality matter most for human preference and how accurately VisionReward captures them.
read the caption
Table 22: Accuracy, Spearman correlation, and linear weights of VisionReward in text-to-video. (Part 2)
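The caption above implies a simple scoring rule: each checklist answer contributes its learned linear weight to the overall score, and the resulting scores are compared to human ratings by rank correlation. The sketch below illustrates that rule with a handful of weights from the table; the answer values, human ratings, and function names are illustrative assumptions, not the authors’ released code.

```python
# Minimal sketch (not the authors' released code): combine binary checklist
# answers into an overall score via per-question linear weights, then check
# the Spearman correlation against (hypothetical) human ratings.
from scipy.stats import spearmanr

# A few non-zero weights from the table above (checklist IDs 24, 27, 38, 44, 50);
# a full scorer would use every question with a non-zero weight.
WEIGHTS = {24: 0.184, 27: 0.264, 38: 0.439, 44: 0.269, 50: 0.429}

def vision_reward_score(answers):
    """answers: dict mapping checklist ID -> 1 ('yes') or 0 ('no')."""
    return sum(w * answers.get(qid, 0) for qid, w in WEIGHTS.items())

# Hypothetical checklist answers for three generated videos.
videos = [
    {24: 1, 27: 1, 38: 1, 44: 1, 50: 1},   # clean, stable, realistic motion
    {24: 1, 27: 1, 38: 0, 44: 0, 50: 1},   # some realism/stability issues
    {24: 0, 27: 1, 38: 0, 44: 0, 50: 0},   # mostly failing the checklist
]
scores = [vision_reward_score(v) for v in videos]

human_ratings = [5.0, 3.5, 1.0]            # hypothetical human preference scores
rho, _ = spearmanr(scores, human_ratings)
print(scores, rho)
```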
Image | Video | ||||
---|---|---|---|---|---|
Type | Ratio | Count | Type | Ratio | Count |
People | 8 | 286 | Story | 5 | 265 |
Objects | 4 | 143 | Human Activity | 4 | 212 |
Animals | 4 | 143 | Artificial Scene | 3 | 159 |
Architecture | 4 | 143 | Natural Scenes | 3 | 159 |
Others | 2 | 72 | Animal Activity | 2 | 106 |
Landscape | 2 | 72 | Physical Phenomena | 1 | 53 |
Vehicles | 2 | 71 | Other | 1 | 53 |
Plants | 1 | 35 | |||
Food | 1 | 35 |
🔼 This table presents a breakdown of content categories used in the MonetBench dataset for both image and video generation. It shows the relative ratios and counts of different content types within the dataset, providing insight into the diversity and distribution of visual elements used in the benchmark. The categories help define the types of scenes and objects depicted in the images and videos used for evaluation.
read the caption
Table 23: Content Categories for Image and Video
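As a consistency check, the video content counts in the table sum to 1,007 and are proportional to the ratio weights (each ratio unit corresponds to about 53 prompts). Below is a minimal sketch of how such an allocation could be reproduced, assuming a fixed prompt budget and a simple rounding rule; both are illustrative assumptions rather than details from the paper.

```python
# Sketch: allocate a prompt budget across content categories in proportion to
# their ratio weights; the budget and rounding rule are assumptions.
def allocate(ratios, total):
    weight_sum = sum(ratios.values())
    counts = {name: round(total * r / weight_sum) for name, r in ratios.items()}
    drift = total - sum(counts.values())      # correct any rounding drift
    counts[max(counts, key=counts.get)] += drift
    return counts

video_ratios = {
    "Story": 5, "Human Activity": 4, "Artificial Scene": 3,
    "Natural Scenes": 3, "Animal Activity": 2,
    "Physical Phenomena": 1, "Other": 1,
}
print(allocate(video_ratios, total=1007))
# -> {'Story': 265, 'Human Activity': 212, 'Artificial Scene': 159, ...}
```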
Image | Video | ||||
---|---|---|---|---|---|
Type | Ratio | Count | Type | Ratio | Count |
Unreal | 8 | 187 | Style | 13 | 465 |
Style & Format | 8 | 187 | Material/Texture | 8 | 292 |
Fine-grained Detail | 8 | 186 | Emotional Expr. | 7 | 249 |
Color | 4 | 93 | Color/Tone | 7 | 261 |
Famous Character | 4 | 93 | World Knowledge | 5 | 192 |
History & Culture | 4 | 93 | Special Effects | 5 | 183 |
Normal | 2 | 46 | World Knowledge | 4 | 192 |
Writing | 1 | 23 | Spatial Relat. | 4 | 136 |
Complex Combo | 1 | 23 | Camera Move. | 4 | 153 |
Famous Places | 1 | 23 | Surreal | 3 | 108 |
Positional | 1 | 23 | Logical Consist. | 2 | 116 |
Counting | 1 | 23 | Temporal Speed | 1 | 66 |
 | | | Text | 1 | 46
🔼 This table presents the challenge categories used in the MonetBench benchmark for both image and video generation. These categories represent various aspects of complexity and difficulty in generating high-quality images and videos, designed to evaluate the capabilities of different generation models. Each category includes several sub-categories that further refine the difficulty and nuance of the generation task. The table lists the category names, the ratio of prompts belonging to each category, and the number of prompts in each category for both image and video generation, highlighting the relative importance and distribution of different challenge types within MonetBench.
read the caption
Table 24: Challenge Categories for Image and Video
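When extending a benchmark of this kind, the ‘Ratio’ column can also be read as a sampling weight for deciding which challenge a newly written prompt should target. The sketch below is a hypothetical sampler using the image challenge ratios from the table; it is illustrative and not part of the released pipeline.

```python
import random

# Sketch: treat the 'Ratio' column as a sampling weight when choosing which
# challenge a new prompt should target (ratios transcribed from the table).
IMAGE_CHALLENGE_RATIOS = {
    "Unreal": 8, "Style & Format": 8, "Fine-grained Detail": 8,
    "Color": 4, "Famous Character": 4, "History & Culture": 4,
    "Normal": 2, "Writing": 1, "Complex Combo": 1,
    "Famous Places": 1, "Positional": 1, "Counting": 1,
}

def sample_challenge(rng):
    names = list(IMAGE_CHALLENGE_RATIOS)
    weights = list(IMAGE_CHALLENGE_RATIOS.values())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_challenge(rng) for _ in range(5)])
```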
Category | Description | Example Prompt |
---|---|---|
Content | ||
Human Activity | Descriptions about daily human activities, sports, performing arts, and professional skills. | A family enjoying a picnic in a park, children playing soccer. |
Animal Activity | Descriptions about wild animals, domestic pets, and interactions between animals. | A group of dolphins jumping out of the water. |
Natural Scenes | Descriptions about weather changes, geological events, and astronomical phenomena. | A thunderstorm with lightning striking the ground. |
Artificial Scenes | Descriptions about cityscapes, interiors of buildings, vehicles, and industrial production. | A bustling city street with traffic and pedestrians. |
Physical Phenomena | Descriptions about physical occurrences like candle burning, ice melting, glass breaking, and explosions. | A glass shattering in slow motion. |
Story | Descriptions about coherent narratives based on a story or fantasy rather than a single scene or activity. | Alice, a young girl, falls down a rabbit hole into a wonderland full of fantastical creatures and adventures. |
Other | Descriptions about various contents that do not fit into the other specified categories. | Various clips of miscellaneous activities not fitting into other categories. |
Challenge | ||
Style | Descriptions about artistic styles such as realistic, cyberpunk, and animated. | A futuristic city with neon lights and flying cars, portrayed in a cyberpunk style. |
Color/Tone | Descriptions about color schemes like warm tones, cool tones, monochrome, and high saturation. | A serene landscape in warm, golden tones during sunset. |
Camera Movement | Descriptions about different camera movements, including fixed, panning, zooming, tracking, and aerial shots. | A drone shot capturing a bird’s eye view of a mountain range. |
Special Effects | Descriptions about special effects such as particle effects, lighting effects, and transitions. | Fireworks exploding with sparkling particle effects. |
Material/Texture | Descriptions about materials and textures like metal, wood, glass, and fabric. | Close-up shot of rain droplets on a glass window. |
Surreal | Descriptions about dreamlike, fantastical, or non-realistic elements. | A dreamlike scene with floating islands in the sky. |
Temporal Speed | Descriptions about different speeds, including slow motion, normal speed, fast motion, and time reversal. | Slow-motion capture of a hummingbird in flight. |
Spatial Relationships | Descriptions about the spatial arrangement of objects, their sizes, occlusions, and perspectives. | A house of cards being built, showing each layer’s spatial arrangement. |
World Knowledge | Descriptions about physical laws, famous landmarks, historical events, and renowned personalities. | A documentary about the pyramids of Egypt. |
Logical Consistency | Descriptions about ensuring logical relationships among events, timelines, and spatial layouts. | A mystery story where clues are pieced together logically. |
Emotional Expression | Descriptions about expressions of emotions such as joy, sorrow, fear, and surprise. | A close-up of a person expressing joy after receiving good news. |
Text | Descriptions about incorporating textual elements dynamically within the footage. | An animated title sequence with dynamic text effects. |
🔼 This table presents the classification standards used for the Video-MonetBench dataset, a benchmark designed for evaluating video generation models. It categorizes video prompts into seven content categories (such as Human Activity and Natural Scenes) and thirteen challenge categories (such as Style, Color/Tone, and Special Effects). Each category comes with a short description and an illustrative example prompt, giving a concise overview of the dataset’s scope and complexity and ensuring that evaluation covers diverse aspects of visual generation quality.
read the caption
Table 25: Video classification standards with example prompts.
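To make this classification standard machine-readable, for example when tagging prompts or rendering annotator guidelines, the rows can be stored in a small registry. The sketch below shows one possible structure with two entries transcribed from the table; the `CategorySpec` type, registry, and helper are hypothetical, not part of the released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CategorySpec:              # hypothetical structure for Table 25 rows
    kind: str                    # "content" or "challenge"
    description: str
    example: str

# Two entries transcribed from the table above.
VIDEO_CATEGORIES = {
    "Physical Phenomena": CategorySpec(
        kind="content",
        description="Physical occurrences like candle burning, ice melting, "
                    "glass breaking, and explosions.",
        example="A glass shattering in slow motion.",
    ),
    "Temporal Speed": CategorySpec(
        kind="challenge",
        description="Different speeds, including slow motion, normal speed, "
                    "fast motion, and time reversal.",
        example="Slow-motion capture of a hummingbird in flight.",
    ),
}

def annotator_guideline(name):
    """Render one line of an annotation guideline for the given category."""
    spec = VIDEO_CATEGORIES[name]
    return f"[{spec.kind}] {name}: {spec.description} Example: {spec.example}"

print(annotator_guideline("Temporal Speed"))
```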