TL;DR#
Large Language Models (LLMs) often generate hallucinations, mixing truthful and incorrect information within the same response. Existing alignment methods that rely on response-level preference feedback inadvertently introduce noise during training: they reward errors that appear in preferred responses and penalize correct statements in dispreferred ones. To address this, the paper introduces Mask-DPO, a fine-grained approach that uses sentence-level factuality as a mask signal, so the model learns only from factually correct sentences in preferred samples and is not penalized for truthful content in dispreferred ones.
Experimental results show that Mask-DPO significantly improves factuality on both in-domain and out-of-domain test sets, surpassing existing models. Scaling the training data is more effective when the number of topics is increased rather than the number of questions per topic, underscoring the importance of topic diversity. The paper further hypothesizes that factuality alignment adjusts the LLM's internal knowledge structure, which explains the observed generalization to unseen topics.
Key Takeaways#
Why does it matter?#
This paper is important to researchers since it addresses a critical issue in LLMs: hallucinations. By introducing a novel and effective method (Mask-DPO) for factuality alignment, the study paves the way for more reliable and trustworthy LLMs. Furthermore, the investigation of data scaling strategies and their impact on generalization is valuable for future research in this area, opening avenues for improving LLMs.
Visual Insights#
🔼 This figure illustrates the core difference between the vanilla Direct Preference Optimization (DPO) method and the proposed Mask-DPO method for factuality alignment in LLMs. Vanilla DPO uses response-level factuality, leading to the model inadvertently encouraging incorrect information in preferred samples and penalizing correct information in non-preferred samples. Mask-DPO addresses this issue by incorporating sentence-level factuality as a mask signal during training. This allows Mask-DPO to learn only from factually correct sentences in the preferred samples, thus resolving ambiguity and enhancing the effectiveness of factuality alignment.
Figure 1: Comparison between DPO and Mask-DPO. Vanilla DPO (a) inadvertently encourages and penalizes all the content in the preferred and non-preferred samples, respectively, regardless of its correctness. Instead, Mask-DPO (b) incorporates sentence-level factuality into the mask signal, preventing incorrect reward signals and resolving the ambiguity in preference learning.
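To make the masking concrete, here is a minimal sketch of how a sentence-level mask could enter the DPO objective, assuming per-token log-probabilities and a sentence index for each token are already available. The tensor layout and helper names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def masked_logprob(token_logps, sentence_ids, keep_sentences):
    # token_logps:    (seq_len,) per-token log-probs of a response
    # sentence_ids:   (seq_len,) index of the sentence each token belongs to
    # keep_sentences: set of sentence indices allowed to contribute to the loss
    mask = torch.tensor([float(int(s) in keep_sentences) for s in sentence_ids],
                        device=token_logps.device)
    return (token_logps * mask).sum()

def mask_dpo_loss(policy_w, ref_w, policy_l, ref_l,
                  sent_ids_w, sent_ids_l,
                  correct_sents_w, hallucinated_sents_l, beta=0.1):
    # Preferred response: only factually correct sentences are encouraged.
    logratio_w = (masked_logprob(policy_w, sent_ids_w, correct_sents_w)
                  - masked_logprob(ref_w, sent_ids_w, correct_sents_w))
    # Dispreferred response: only hallucinated sentences are penalized.
    logratio_l = (masked_logprob(policy_l, sent_ids_l, hallucinated_sents_l)
                  - masked_logprob(ref_l, sent_ids_l, hallucinated_sents_l))
    return -F.logsigmoid(beta * (logratio_w - logratio_l))
```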
| Model | ANAH / ANAH-v2: # Cor. | # Inc. | % Score | ANAH / FactScore: # Cor. | # Inc. | % Score | Biography / FactScore: # Cor. | # Inc. | % Score |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2-7B | 450 | 872 | 34.03 | 5.19 | 30.60 | 15.57 | 14.46 | 39.27 | 27.28 |
| Qwen2-72B | 538 | 690 | 43.81 | 6.45 | 27.97 | 20.76 | 20.13 | 40.09 | 34.06 |
| Gemma2-9B | 635 | 837 | 43.13 | 4.51 | 25.92 | 18.29 | 15.53 | 29.00 | 29.52 |
| Gemma2-27B | 889 | 1110 | 44.47 | 7.81 | 35.65 | 20.14 | 19.27 | 36.59 | 29.99 |
| Yi1.5-6B | 440 | 1143 | 27.79 | 6.01 | 44.48 | 11.92 | 10.84 | 59.65 | 15.81 |
| Yi1.5-9B | 483 | 1136 | 29.83 | 5.06 | 39.79 | 10.42 | 10.63 | 62.67 | 15.33 |
| Yi1.5-34B | 535 | 911 | 36.99 | 5.25 | 36.96 | 13.27 | 17.01 | 52.76 | 25.49 |
| Llama3.1-8B | 461 | 476 | 49.19 | 5.95 | 21.29 | 19.43 | 16.83 | 31.57 | 30.29 |
| Llama3.1-70B | 520 | 453 | 53.44 | 6.81 | 22.17 | 21.92 | 23.58 | 31.59 | 40.47 |
| FactTune | 657 | 499 | 56.83 | 7.45 | 23.08 | 22.67 | 15.93 | 23.97 | 37.97 |
| Ours | 547.6 | 161.4 | 77.53 | 5.90 | 18.59 | 25.56 | 12.16 | 15.24 | 39.39 |
🔼 This table presents a comparison of factuality scores achieved by various large language models (LLMs) on two datasets: ANAH (in-domain) and Biography (out-of-domain). The models compared include several open-source LLMs, FactTune (a factuality enhancement method), and the proposed Mask-DPO method. Two evaluation metrics are used: ANAH-v2 and FactScore. For each model and metric, the numbers of correct and incorrect factual statements are reported, along with the overall factuality score (the percentage of correct statements). In the original paper, the best and second-best results for each metric are highlighted.
Table 1: Evaluation results for the open-source models, FactTune, and our Mask-DPO. Here, ANAH and Biography represent the in-domain and out-of-domain test sets, respectively. ANAH-v2 and FactScore represent the corresponding evaluation strategies. For each evaluation strategy, we report the number of correct facts (# Cor.), the number of inaccurate facts (# Inc.), and the factuality score (% Score), i.e., the proportion of correct facts out of the total number of facts. Bold and underlined represent the best and second-best performance, respectively.
In-depth insights#
Factuality F-tuning#
Factuality Fine-tuning is crucial for reliable LLMs. Aligning models with factual knowledge requires addressing inherent hallucinations. Existing methods using response-level preferences introduce noise, penalizing correct content alongside errors. A fine-grained approach is needed, masking sentence-level factuality. Mask-DPO leverages sentence-level factuality to guide training, avoiding penalization of truthful content. This resolves preference learning ambiguity, improving in-domain and out-of-domain generalization. Scaling by increasing the number of topics during alignment is more effective for knowledge adaptation. The underlying knowledge graph adjusts, influencing generalization. This addresses the need for targeted adjustments, rather than broad distributional changes. This refines LLM knowledge, boosting real-world applications.
Mask-DPO Insight#
Mask-DPO refines LLM factuality by masking sentence-level inaccuracies during DPO, addressing the issue where truthful content is penalized alongside hallucinations. This fine-grained approach enables more precise knowledge alignment, leading to improved performance on both in-domain and out-of-domain datasets. The method’s success hinges on sentence-level factuality, as this allows for accurate reward signals during preference learning. It prevents inadvertently encouraging inaccuracies or penalizing truths, a problem inherent in response-level methods like vanilla DPO. By integrating sentence-level factuality, Mask-DPO effectively improves the factuality of LLMs, showcasing its potential in enhancing the reliability of these models across various domains. This is particularly important since mitigating hallucinations improves LLM trustworthiness.
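In equation form, one plausible way to write this masked objective, using our own notation rather than necessarily the paper's, is to restrict the DPO log-likelihood ratios to sums over masked sentences:

```latex
\mathcal{L}_{\text{Mask-DPO}}
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
\left[\log \sigma\!\left(
  \beta \sum_{s \in y_w} m_w^{s}
  \log \frac{\pi_\theta(s \mid x, s_{<})}{\pi_{\mathrm{ref}}(s \mid x, s_{<})}
  \;-\;
  \beta \sum_{s \in y_l} m_l^{s}
  \log \frac{\pi_\theta(s \mid x, s_{<})}{\pi_{\mathrm{ref}}(s \mid x, s_{<})}
\right)\right]
```

Here the mask m_w^s is 1 only for factually correct sentences in the preferred response y_w, m_l^s is 1 only for hallucinated sentences in the dispreferred response y_l, and s_< denotes the sentences preceding s; setting all masks to 1 recovers vanilla DPO.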
Topic Scaling#
Topic scaling appears to be a critical factor in improving the factuality and generalization capabilities of LLMs. The research suggests that diversifying the topics within the training data leads to better performance than simply increasing the number of questions within the same topics. This highlights the importance of exposing the model to a broader range of knowledge domains to improve its ability to answer questions accurately and generalize to new, unseen topics. The success of topic scaling likely stems from the LLM’s ability to develop a more robust and well-rounded understanding of the world when trained on a diverse set of subjects. Furthermore, the research implies that focusing on topics helps the model grasp broader concepts, while question variety under a single topic might lead to overfitting to specific question structures, hindering generalization.
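As an illustration of the two scaling axes compared in the paper, the sketch below subsamples a hypothetical pool of training data, keyed by topic, either by dropping topics or by dropping questions within each topic; the data layout is assumed, not taken from the paper's code.

```python
import random

def scale_by_topic(data, fraction, seed=0):
    """Keep every question, but only `fraction` of the topics.

    data: dict mapping topic -> list of questions.
    """
    rng = random.Random(seed)
    topics = sorted(data)
    kept = rng.sample(topics, int(len(topics) * fraction))
    return {t: data[t] for t in kept}

def scale_by_question(data, fraction, seed=0):
    """Keep every topic, but only `fraction` of the questions under each."""
    rng = random.Random(seed)
    return {t: rng.sample(qs, max(1, int(len(qs) * fraction)))
            for t, qs in data.items()}
```

Under the paper's findings (Table 4), the first strategy, more topics with a fixed number of questions per topic, yields larger factuality gains than the second at the same overall data scale.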
Knowledge Graph#
Knowledge graphs (KGs) offer a structured way to represent information, connecting entities through relationships, and can enhance LLMs by supplying factual knowledge. Because LLMs often struggle with hallucination, KGs can serve as external knowledge sources that ground responses in verifiable facts. By integrating KGs, LLMs can access and reason over structured data, reducing reliance on parametric memory and improving accuracy. Factuality alignment methods remain essential to minimize errors and provide trustworthy information. The effectiveness hinges on KG quality and the integration strategy; a poorly constructed KG or an ineffective integration can yield limited benefits.
RLHF Extension#
Reinforcement Learning from Human Feedback (RLHF) has become a pivotal technique for aligning LLMs with human preferences, extending beyond mere task completion to encompass aspects like robustness, factuality, and safety. This extension involves a complex pipeline: initially training a reward model to mimic human judgment and iteratively optimizing the language model using the reward signal. RLHF’s adaptability allows it to integrate diverse feedback sources, offering granular control over model behavior. Recent advancements explore more efficient methods like Direct Preference Optimization (DPO) to sidestep explicit reward modeling, directly optimizing the policy model using preference data. These methods also seek to refine data creation utilizing AI feedback.
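For reference, the standard DPO objective that these methods build on can be written as follows (standard notation; y_w and y_l are the preferred and dispreferred responses):

```latex
\mathcal{L}_{\text{DPO}}
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
\left[\log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```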
More visual insights#
More on tables#
| Training Setting | ANAH-v2: # Correct | # Incorrect | % Score | FactScore: # Correct | # Incorrect | % Score |
|---|---|---|---|---|---|---|
| w/o mask | 446.60 | 206.00 | 68.44 | 5.71 | 18.98 | 23.43 |
| w/ mask | 547.60 | 161.40 | 77.53 | 5.90 | 18.59 | 25.56 |
🔼 This table presents an ablation study comparing the performance of the proposed Mask-DPO method against a baseline DPO method without the masking scheme. The study evaluates the impact of the masking scheme on the factuality of the model's responses. The "w/ mask" row shows results using the Mask-DPO method (DPO with masking), while "w/o mask" shows results using standard DPO without masking. The table reports the number of correct and incorrect facts identified by two different metrics, ANAH-v2 and FactScore, and the corresponding factuality score for each setting, allowing a quantitative assessment of how the masking scheme contributes to improved factuality.
Table 2: Ablation study for the masking scheme applied to the DPO. Here, “w/ mask” means that the factual alignment algorithm is DPO with mask, which is the default setting of Mask-DPO. “w/o mask” means that the algorithm is the vanilla DPO.
| Sampling Setting | ANAH-v2: # Correct | # Incorrect | % Score | FactScore: # Correct | # Incorrect | % Score |
|---|---|---|---|---|---|---|
| InternLM | 419.20 | 296.80 | 59.43 | 5.19 | 16.59 | 21.33 |
| Llama+Doc | 877.00 | 1347.00 | 39.43 | 7.98 | 33.07 | 16.16 |
| Llama | 547.60 | 161.40 | 77.53 | 5.90 | 18.59 | 25.56 |
🔼 This table presents an ablation study comparing different sampling strategies for creating preference data used in Mask-DPO. The default strategy samples data from the Llama3.1-8B-Instruct policy model. The comparison includes using data from InternLM2-7B-Chat-SFT, and a strategy where the policy model is used, but preferred samples include the reference document in the prompt, while dispreferred samples do not.
Table 3: Ablation study for the sampling strategy in preference dataset construction. Here, “Llama” means that the data used to construct preference pairs is sampled from the policy model (Llama3.1-8B-Instruct), which is the default setting of Mask-DPO. “InternLM” means that the data is sampled from InternLM2-7B-Chat-SFT. And “Llama+Doc” means that the data is sampled from the policy model, but the sampling strategies for preferred and dispreferred samples are different. When generating preferred samples, the reference document corresponding to the question is added to the prompt, but not to the dispreferred ones.
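To illustrate the "Llama+Doc" variant described above, the toy snippet below builds the two sampling prompts differently: the preferred generation is conditioned on the reference document, while the dispreferred one is not. The prompt wording and function are placeholders, not taken from the paper.

```python
def build_sampling_prompts(question, reference_doc):
    # Preferred sample: ground the generation in the reference document.
    preferred_prompt = (
        "Reference document:\n"
        f"{reference_doc}\n\n"
        "Answer the question factually, using the document above.\n"
        f"Question: {question}"
    )
    # Dispreferred sample: no document, so hallucinations are more likely.
    dispreferred_prompt = f"Answer the question.\nQuestion: {question}"
    return preferred_prompt, dispreferred_prompt
```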
| Dimension | # Topic | # Question | Scale | # Correct | # Incorrect | % Score |
|---|---|---|---|---|---|---|
| Topic | 894 | 3 | 1/3 | 501.80 | 270.20 | 65.04 |
| | 1788 | 3 | 2/3 | 518.00 | 206.20 | 71.56 (+6.52) |
| | 2682 | 3 | 1 | 547.60 | 161.40 | 77.53 (+5.97) |
| Question | 2682 | 1 | 1/3 | 486.00 | 218.00 | 69.03 |
| | 2682 | 2 | 2/3 | 444.00 | 164.00 | 73.03 (+4.00) |
| | 2682 | 3 | 1 | 547.60 | 161.40 | 77.53 (+4.50) |
🔼 This table presents a study on the impact of scaling training data on the efficiency and generalization of Mask-DPO, a fine-grained factuality alignment method. Two scaling strategies were used: scaling the number of different topics and scaling the number of questions within each topic. The table shows the number of topics and questions used for training at different scales (1/3, 2/3, and 1x the original data), along with the corresponding number of correct and incorrect facts, factuality score, and the change in score compared to the baseline. This allows for analysis of how different scaling approaches affect the performance of the model, specifically its ability to generalize to unseen data.
Table 4: Evaluation for the efficiency and the generalization effects of different training sample scaling strategies. Here, “Topic” means scaling from the number of different topics, and “Question” means scaling from the number of different questions under the same topic. “# Topic” denotes the number of topics used for training, and “# Question” denotes the number of questions under the same topic used for training. We also report changes in scores (in parentheses) when scaling the data.
| Base Model | Embedding Model | Distance | # Correct | # Incorrect | % Score | Diff |
|---|---|---|---|---|---|---|
| InternLM2-7B | InternLM2-7B | far | 493.10 | 267.60 | 64.99 | 2.43 |
| | | near | 456.40 | 222.30 | 67.42 | |
| | Llama3.1-8B | far | 454.80 | 252.20 | 64.32 | 0.65 |
| | | near | 487.00 | 263.00 | 64.97 | |
| | OpenAI | far | 487.64 | 256.09 | 65.59 | 1.25 |
| | | near | 489.50 | 242.70 | 66.84 | |
| | Random | - | 453.60 | 247.40 | 64.78 | 0.77 |
| | | - | 521.70 | 275.90 | 65.55 | |
| Llama3.1-8B | InternLM2-7B | far | 455.50 | 206.50 | 68.97 | 0.22 |
| | | near | 551.75 | 280.75 | 69.19 | |
| | Llama3.1-8B | far | 568.00 | 265.75 | 68.94 | 4.49 |
| | | near | 500.00 | 182.33 | 73.43 | |
| | OpenAI | far | 761.25 | 317.00 | 71.32 | 0.85 |
| | | near | 616.60 | 253.60 | 72.17 | |
| | Random | - | 566.00 | 219.00 | 70.60 | 0.56 |
| | | - | 454.00 | 182.33 | 71.11 | |
🔼 This table presents a quantitative analysis of the impact of different topic clustering strategies on the effectiveness of Mask-DPO’s factuality alignment. Four different methods for generating topic embeddings are compared: InternLM2-7B-Chat-SFT, Llama3.1-8B-Instruct, Text-Embedding-3-Large, and a random baseline. For each embedding method, the training data is divided into ’near’ and ‘far’ groups based on the distance of each topic’s embedding to the cluster center. The table shows the number of correct and incorrect facts, the overall factuality score, and the difference in score between the ’near’ and ‘far’ groups for each embedding method. The results demonstrate the relationship between topic proximity and factuality improvement after alignment.
Table 5: Evaluation for the effects of different clustering strategies for preference data construction. Here, “InternLM2-7B”, “Llama3.1-8B” and “OpenAI” denotes InternLM2-7B-Chat-SFT, Llama3.1-8B-Instruct and Text-Embedding-3-Large, respectively. And “Random” means randomly selecting samples. “Distance” represents the distance between the training set and the cluster center. “Diff” represents the difference between the score at near distance settings and the one at far.
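The near/far split can be reproduced in spirit with a small clustering routine: embed every training topic, cluster the embeddings, and rank topics by their distance to the assigned cluster center. The embedding source, cluster count, and split fraction below are assumptions for illustration, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_near_far(topic_embeddings, n_clusters=10, keep_fraction=0.5):
    """Return indices of topics nearest to / farthest from their cluster centers.

    topic_embeddings: (num_topics, dim) array from any embedding model.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(topic_embeddings)
    assigned_centers = km.cluster_centers_[km.labels_]
    dists = np.linalg.norm(topic_embeddings - assigned_centers, axis=1)
    order = np.argsort(dists)                     # nearest first
    k = int(len(order) * keep_fraction)
    return order[:k], order[-k:]                  # (near topics, far topics)
```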
| Model | Best of 1 | Best of 16 | Best of 32 | Best of 64 | Best of 128 | Best of 256 |
|---|---|---|---|---|---|---|
| Baseline | 47.79 | 73.38 | 76.30 | 76.93 | 79.42 | 80.98 |
| Mask-DPO | 70.67 | 89.80 | 90.53 | 92.81 | 93.47 | 94.09 |
🔼 This table presents the performance comparison between the baseline model (Llama3.1-8B-Instruct) and the Mask-DPO model under different best-of-N settings, where best-of-N means selecting the top-scoring response from N generated candidates. The table reports the factuality score for each model at each value of N (1, 16, 32, 64, 128, 256). The discrepancy between the best-of-1 results here and those reported in Table 1 arises because the inference setting used here is top-k=40, while Table 1 used top-k=1, highlighting the impact of inference parameters on measured performance.
Table 6: Evaluation for the performance under the setting of best-of-N. Here, "Baseline" represents the base model Llama3.1-8B-Instruct and "Mask-DPO" represents the aligned model. Note: the results for best-of-1 do not match those in Table 1 because the inference setting here is top-k=40, whereas in Table 1 it is top-k=1.
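A best-of-N selection loop of the kind evaluated here might look like the following sketch, with `sample_response` and `factuality_score` (e.g., an ANAH-v2-style scorer) as stand-in callables rather than the paper's actual pipeline.

```python
def best_of_n(question, n, sample_response, factuality_score):
    """Sample n candidate answers and keep the one with the highest factuality score."""
    best_answer, best_score = None, float("-inf")
    for _ in range(n):
        candidate = sample_response(question)          # e.g., sampling with top-k=40
        score = factuality_score(question, candidate)
        if score > best_score:
            best_answer, best_score = candidate, score
    return best_answer, best_score
```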
| Model | # Correct | # Incorrect | % Score |
|---|---|---|---|
| Llama-3.1-8B + SFT + Mask-DPO | 631 | 192 | 76.67 |
| Llama-3.1-8B-Instruct + Mask-DPO | 547.6 | 161.4 | 77.53 |
🔼 This table compares the performance of two approaches to factuality alignment: 1) directly applying Mask-DPO to an instruction-tuned LLM and 2) first applying supervised fine-tuning (SFT) to a base LLM and then applying Mask-DPO. It shows the number of correct and incorrect facts, as well as the factuality score achieved by each method. This comparison helps demonstrate that the effectiveness of Mask-DPO is not dependent on a prior SFT step.
Table 7: Comparison with SFT+PO.
| Method | # Correct | # Incorrect | % Score |
|---|---|---|---|
| DPO | 446.4 | 206 | 68.44 |
| DPOP | 593 | 291 | 67.08 |
| TDPO | 703.6 | 284 | 70.95 |
| SePO | 254 | 611.4 | 58.45 |
| Mask-DPO | 547.6 | 161.4 | 77.53 |
🔼 This table compares the performance of Mask-DPO against other Preference Optimization (PO) strategies, namely TDPO, DPOP, and SePO, in enhancing the factuality of LLMs. It uses Llama3.1-8B-Instruct as the base model and ANAH-v2 for evaluation. The table shows the number of correct and incorrect facts, as well as the factuality score for each method, illustrating Mask-DPO’s superior performance compared to others.
Table 8: Comparison with other PO Strategies.
| Method | # Correct | # Incorrect | % Score |
|---|---|---|---|
| FactTune | 657 | 499 | 56.83 |
| Flame | 447 | 263 | 62.96 |
| Mask-DPO | 547.6 | 161.4 | 77.53 |
🔼 This table presents a comparison of the factuality scores achieved by three different factuality alignment methods: Mask-DPO (the proposed method), FactTune, and Flame. The scores are based on the ANAH-v2 evaluation metric, indicating the proportion of correctly stated facts in the model’s responses. The table highlights Mask-DPO’s superior performance in terms of factuality compared to the other state-of-the-art methods.
Table 9: Comparison with factuality alignment methods.
| Model | GSM8K | MATH | GPQA | HumanEval | MBPP |
|---|---|---|---|---|---|
| Baseline | 84.91 | 52.72 | 26.77 | 70.73 | 71.21 |
| Mask-DPO | 86.20 | 47.68 | 31.82 | 68.29 | 70.04 |
🔼 This table presents the performance of the Llama3.1-8B-Instruct model on several standard benchmarks before and after applying the Mask-DPO factuality alignment technique. The benchmarks cover diverse tasks, including mathematical and scientific reasoning (GSM8K, MATH, GPQA) and code generation (HumanEval, MBPP). Comparing scores before and after alignment indicates whether the factuality gains from Mask-DPO come at the cost of the model's general capabilities.
Table 10: Evaluation on Standard Benchmarks.
| Temperature | # Correct | # Incorrect | % Score |
|---|---|---|---|
| 0.25 | 485 | 259.2 | 65.30 |
| 0.5 | 539 | 244.8 | 67.96 |
| 0.75 | 733 | 219 | 76.02 |
| 1.0 | 732 | 240.6 | 74.24 |
| 0.8 (default) | 547.6 | 161.4 | 77.53 |
🔼 This table presents the results of an ablation study on the impact of different inference temperatures on the performance of the Mask-DPO model. The study evaluates the model’s performance using four different temperatures: 0.25, 0.5, 0.75, and 1.0, with the default temperature of 0.8 included for comparison. The results show the number of correct and incorrect facts, along with the factuality score for each temperature setting. This analysis helps to determine the optimal temperature setting for the Mask-DPO model, balancing the trade-off between diversity and accuracy in the model’s outputs.
Table 11: Evaluation on different inference setup.
| Distance | # Correct | # Incorrect | % Score | Diff |
|---|---|---|---|---|
| far | 541.4 | 214 | 71.69 | 0.23 |
| near | 518 | 195 | 71.92 | |
🔼 This table presents an ablation study on strategies for constructing preference data in Mask-DPO. Specifically, it compares clustering based on questions with the topic-based clustering of Table 5, to determine which dimension is more effective in improving the model's factuality. The table shows the number of correct and incorrect facts, along with the resulting factuality score, for each setting. The purpose is to assess whether the model's internal knowledge representation is organized primarily around topics or questions, which in turn informs data scaling strategies for factuality alignment.
Table 12: Evaluation on question-based clustering strategies for preference data construction.