Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

2501.17433

Tiansheng Huang et el.

🤗 2025-01-30

TL;DR
#

Large language models (LLMs) are increasingly used in various applications, but their vulnerability to harmful fine-tuning attacks poses significant safety concerns. These attacks involve modifying an LLM by training it on a small set of malicious data, causing it to generate unsafe or biased outputs. To mitigate this risk, many services employ ‘guardrail moderation’ systems that filter out harmful data before fine-tuning. However, current research shows that these guardrails are often ineffective.

This paper introduces a new attack method called “Virus” that successfully bypasses these guardrail systems. Virus cleverly modifies harmful data to make it undetectable by the filter while still effectively compromising the safety of the LLM. The researchers demonstrate that Virus achieves a 100% leakage rate (meaning all modified harmful data passes the filter), and that it significantly degrades the LLM’s safety performance. The key finding is that relying solely on guardrail moderation for LLM safety is insufficient and reckless. The paper’s contributions include a novel attack methodology, empirical evidence of the guardrail’s vulnerabilities, and a deeper understanding of the limitations of current safety techniques.

Key Takeaways
#

Why does it matter?
#

This paper is crucial because it reveals a critical vulnerability in the widely adopted fine-tuning-as-a-service paradigm for large language models (LLMs). By demonstrating how easily the guardrail moderation system can be bypassed, it highlights the inherent safety risks associated with current LLM development and deployment practices. This necessitates a reevaluation of current security measures and prompts research into more robust safety mechanisms.

Visual Insights
#

🔼 The figure illustrates a three-stage pipeline for a harmful fine-tuning attack against a large language model (LLM). Stage 1 shows the initial safety alignment of the LLM using alignment data. Stage 2 depicts the guardrail moderation process, where the service provider filters out harmful samples from the uploaded fine-tuning data. Stage 3 shows the fine-tuning of the aligned LLM with the filtered data. The ‘Virus’ attack focuses on creating a user dataset that bypasses the guardrail and compromises the LLM’s safety alignment.
read the caption
Figure 1: A three stage pipeline for harmful fine-tuning attack under guardrail moderation. i) At the first stage, the model is safety aligned with alignment data. ii) At the second stage, the service provider applies guardrail moderation to filter out the harmful samples over the uploaded fine-tuning data. iii) At the third stage, the filtered data is used for fine-tuning the aligned LLM. Our attack Virus is concerning how to construct the user dataset that can bypass the guardrail and break the victim LLM’s safety alignment.

Metrics\Methods	BFA	HFA	HFA w/o moderation
Leakage ratio	100%	38%	100%
Harmful score	4.10	17.60	55.80

🔼 This table presents the results of benign and harmful fine-tuning attacks on a language model, both with and without guardrail moderation. It shows the leakage ratio (percentage of harmful samples that bypass the guardrail), and the resulting harmful score (a measure of how harmful the model’s outputs become after fine-tuning). The metrics help assess the effectiveness of guardrail moderation in mitigating harmful fine-tuning attacks.
read the caption
Table 1: Attack performance of benign/harmful fine-tuning attack. Attack methods are under guardrail moderation unless specified.

In-depth insights
#

Guardrail Jailbreak
#

The concept of “Guardrail Jailbreak” in the context of large language model (LLM) safety is a crucial aspect of the research. It highlights a critical vulnerability: the limitations of relying solely on moderation guardrails to prevent harmful fine-tuning. The guardrail, designed to filter harmful data before it’s used to train the LLM, is not foolproof. A successful jailbreak demonstrates that carefully crafted, subtly harmful data can bypass these filters, leading to a model that’s less safe than intended. This has significant implications, because it suggests that purely relying on data filtration as a safety mechanism is insufficient. The research likely explores methods to create these adversarial examples that evade detection, providing insight into the sophistication of attacks. Furthermore, it underscores the importance of developing more robust safety mechanisms that go beyond simple data filtering, perhaps by focusing on model architecture, training methodology, or post-training mitigation strategies. The effectiveness of a guardrail jailbreak also speaks to the need for ongoing adversarial testing and improvement of these safety systems, mimicking real-world attack attempts. Understanding how these jailbreaks work is paramount in enhancing LLM security and building more trustworthy and responsible AI systems.

Dual-Goal Optimization
#

The concept of “Dual-Goal Optimization” in the context of adversarial attacks against Large Language Models (LLMs) is a significant advancement in the field. It addresses the limitations of previous methods that either focused solely on bypassing moderation filters or solely on maximizing the harmful impact of fine-tuned models. The core innovation lies in simultaneously optimizing for two objectives: 1) evading detection by the guardrail moderation system and 2) maintaining or improving the model’s ability to generate harmful outputs. This dual-pronged approach is critical because a successful attack needs to pass both hurdles—an attack that easily gets detected will not cause significant harm. The dual-goal optimization cleverly uses a weighted combination of two loss functions, allowing the system to strike a balance between invisibility and potency. This approach significantly improves the efficacy of the attacks, demonstrating a higher degree of success in compromising the safety of the LLM. However, the introduction of the hyperparameter (lambda) raises questions about the robustness of the method and its susceptibility to fine-tuning; further research is required to explore optimal parameter settings and generalizability across various LLMs and datasets.

Virus Attack Method
#

The “Virus” attack method, as described in the research paper, represents a novel and effective approach to circumventing guardrail moderation in large language model (LLM) fine-tuning. Its core innovation lies in a dual-objective data optimization strategy. This isn’t simply about injecting harmful data; it’s about carefully crafting that data to meet two crucial goals simultaneously. First, the harmful data must be subtly modified to avoid detection by the guardrail’s moderation system. Second, even after passing this filter, the data must still be effective at undermining the LLM’s safety alignment. The use of a dual-objective optimization algorithm is pivotal here, balancing the need to bypass the guardrail with maintaining a harmful gradient that can effectively corrupt the model’s learned safety parameters. The research highlights the limitations of relying solely on guardrail moderation as a defense against harmful fine-tuning attacks, emphasizing the inherent vulnerability of pre-trained LLMs and the potential for sophisticated adversarial techniques to exploit them. The Virus method’s success rate, achieving up to 100% leakage ratio in experiments, underscores this vulnerability and underscores the need for more robust LLM safety mechanisms beyond simple data filtering.

Gradient Mismatch
#

The concept of “Gradient Mismatch” in the context of the research paper highlights a critical challenge in adversarial attacks against machine learning models, specifically those employing guardrail moderation. The authors posit that simply optimizing harmful data to bypass the guardrail (achieving a high “leakage ratio”) isn’t sufficient to effectively compromise the model’s safety alignment. A successful attack requires not only evading detection but also ensuring that the gradients of the optimized harmful data remain similar to the gradients of truly harmful data. This similarity is crucial because the model’s fine-tuning process relies on gradient updates. If the gradients differ significantly, the model may not learn harmful behavior, even if the harmful data itself passes moderation. Therefore, the “Gradient Mismatch” refers to the discrepancy between the gradients of the optimized adversarial data and those of genuinely harmful data. This mismatch renders the attack less effective, despite successfully bypassing the guardrail. The paper introduces “Virus”, a dual-objective optimization method that addresses this problem by simultaneously optimizing for both guardrail evasion and gradient similarity, achieving superior attack performance compared to single-objective methods.

Future Research
#

Future research should prioritize developing more robust guardrail systems that are less susceptible to adversarial attacks like Virus. This involves exploring advanced techniques in natural language understanding to better distinguish subtle manipulations in harmful data. Investigating alternative data sanitization methods beyond simple filtering is crucial. This could involve techniques like data augmentation or adversarial training for the guardrails themselves. A deeper understanding of the interaction between guardrail models and the underlying LLM architecture is needed. This understanding could lead to design improvements that make the models more resistant to manipulation at the data level. Finally, research should focus on developing more effective post-fine-tuning safety mechanisms to mitigate the impact of successful attacks, potentially including techniques to detect and reverse the effects of harmful fine-tuning on the model’s outputs.

More visual insights
#

More on figures

🔼 This figure shows the impact of different harmful data ratios on the harmful score and fine-tuning accuracy of a language model. A harmful ratio of 0% (BFA, Benign Fine-tuning Attack) represents a baseline where only benign data is used for fine-tuning. Increasing the harmful ratio (HFA, Harmful Fine-tuning Attack) introduces more harmful data, leading to a higher harmful score (indicating the model’s tendency to produce harmful outputs) and potentially lower fine-tuning accuracy (although in this instance, accuracy remains largely unaffected). The figure also demonstrates the effectiveness of the guardrail moderation system by showing that a significant portion of harmful data is filtered out (average leakage ratio of 0.348), mitigating the negative impact on the model’s safety alignment. However, the key takeaway is that even with the guardrail, a non-negligible fraction of harmful data still leaks through and compromises the model’s safety.
read the caption
Figure 2: Harmful score and Fine-tune accuracy under different harmful ratio. HFA refers to harmful fine-tuning attack with a harmful ratio of harmful data. BFA refers to benign fine-tuning attack with pure GSM8K data. BF is a special case when harmful ratio=0 for HF. The average leakage ratio (ratio of leak-through harmful data) of HF w/ moderation is 0.348. All the data in BFA an leak through the moderation.

🔼 Figure 3 illustrates four different fine-tuning attack methods. (a) shows a benign fine-tuning attack where only benign question-answer pairs are used. This serves as a baseline for comparison. (b) shows a harmful fine-tuning attack where only harmful question-answer pairs are used, demonstrating the vulnerability of LLMs to such attacks. (c) demonstrates the ‘Mixing Attack’, which attempts to bypass guardrail moderation by concatenating benign and harmful question-answer pairs. However, this method proves ineffective. (d) Finally, the figure illustrates the ‘Virus’ attack, which successfully bypasses the guardrail. This is achieved by concatenating a benign and a harmful question-answer pair, but this time, the harmful part is optimized using a novel method that considers two objectives: 1) bypassing the guardrail’s moderation and 2) maintaining effective attack performance. This dual optimization ensures that the malicious data evades detection while retaining its capability to degrade the safety alignment of the victim LLM. The figure visually represents the differences in data construction and moderation outcomes for each of the four approaches.
read the caption
Figure 3: Example illustration of different fine-tuning attack techniques. a) For benign fine-tuning attack, benign QA pair is uploaded for fine-tuning. b) For harmful fine-tuning attack, only harmful samples are uploaded. c) For Mixing attack, a benign QA is concatenated with a harmful QA in order to circumvent guardrail, which unfortunately does not succeed. d) For Virus, the benign QA is concated with a harmful QA and the harmful QA is optimized with the dual goals: i) To bypass moderation. ii) To guarantee attack performance.

🔼 This figure visualizes the impact of the hyperparameter λ (lambda) in the Virus algorithm on two key metrics: harmful loss and gradient similarity. The x-axis represents the number of fine-tuning steps, while the y-axis shows the harmful loss and gradient similarity. Multiple lines are plotted, each corresponding to a different value of λ. The lines demonstrate how the harmful loss and the gradient similarity change as the optimization process proceeds with different values of λ. When λ is 0, Virus prioritizes gradient similarity, resulting in a decrease in harmful loss. As λ increases, the focus shifts toward guardrail jailbreak, leading to higher harmful loss. The case where λ = 1 represents a failed attempt (guardrail jailbreak) where only the jailbreak goal was pursued.
read the caption
Figure 4: Stepping over the data optimized by Virus with different λ𝜆\lambdaitalic_λ, harmful loss and gradient similarity across fine-tuning rounds are displayed. When λ=1𝜆1\lambda=1italic_λ = 1, the method reduces to one of our failure attempt named guardrail jailbreak.

🔼 The figure illustrates how a sequence of tokens is represented as a flattened one-hot vector. The vocabulary size is 6, meaning there are 6 unique tokens. The number of optimizable tokens (n) is 3. Each token’s position in the sequence is represented by a segment of the flattened vector. Each segment is a one-hot encoding where only one bit is set to 1 (representing the selected token) and the rest are 0s. This representation is used because it allows easy manipulation and optimization of the tokens during the data optimization process of the proposed attack.
read the caption
Figure 5: Illustration of flattened one-hot vector.

More on tables

Metrics\Methods	HFA	Mixing	Mixing w/o moderation
Leakage ratio	38%	44%	100%
Harmful score	17.60	14.40	35.30

🔼 This table presents the results of harmful fine-tuning attacks (HFA) and a novel mixing attack strategy, both evaluated under the presence and absence of guardrail moderation. The comparison focuses on two key metrics: the leakage ratio (percentage of harmful samples that bypass the moderation filter) and the harmful score (a measure of how effectively the fine-tuned model generates harmful outputs). The data illustrates the effectiveness of guardrail moderation in mitigating harmful fine-tuning and the relative performance of the mixing attack in comparison to standard HFA.
read the caption
Table 2: Evaluation of HFA/Mixing. Attack methods are under guardrail moderation unless specified.

Metrics \Methods	HFA	Mixing	Guardrail Jailbreak
Gradient similarity	-	1	0.826
Leakage ratio	38%	44%	100%
Harmful score	17.60	14.40	14.10

🔼 This table presents the results of an experiment evaluating the effectiveness of a guardrail jailbreak attack method. The experiment compares three approaches: a standard harmful fine-tuning attack (HFA), a mixing attack that combines benign and harmful data, and a proposed guardrail jailbreak method. The table reports on three metrics: gradient similarity (measuring how closely the optimized harmful data’s gradient resembles that of the original mixing data), leakage ratio (percentage of harmful data that bypasses the guardrail), and harmful score (measure of how effectively the attack compromises the model’s safety alignment). All attacks are performed under guardrail moderation, except where explicitly noted. Cosine similarity is used to quantify the resemblance between gradients.
read the caption
Table 3: Evaluation of guardrail jailbreak design. Attack methods are under guardrail moderation unless specified. We use cosine similarity to measure the similarity between the gradient of the optimizable data and the original mixing data.

Metrics\Methods	Mixing	Guardrail Jailbreak	Only $F_{2}$	Virus
		(Only $F_{1}$ )
Grad similarity	1	0.826	1	0.981
Leakage ratio	44%	100%	44%	100%
Harmful score	14.40	14.10	14.00	30.40

🔼 Table 4 presents a comparison of the performance of different attack methods, specifically focusing on the Virus method, under the constraint of guardrail moderation. The table highlights key metrics: leakage ratio (the percentage of harmful data that bypasses the guardrail), and harmful score (a measure of the attack’s success in compromising the safety of the language model). Gradient similarity, calculated using cosine similarity, measures the resemblance between the gradients of the optimized harmful data produced by the Virus method and the original mixing data. This similarity metric helps assess the effectiveness of the optimization process.
read the caption
Table 4: Evaluation of Virus design. Attack methods are under guardrail moderation unless specified. We use cosine similarity to measure the similarity between the gradient of the optimizable data and the original mixing data.

Methods	Harmful Score						Finetune Accuracy
	p=0.01	p=0.05	p=0.1	p=0.15	p=0.2	Average	p=0.01	p=0.05	p=0.1	p=0.15	p=0.2	Average
BFA	4.10	4.10	4.10	4.10	4.10	4.10	32.00	32.00	32.00	32.00	32.00	32.00
HFA	3.20	9.30	17.60	20.20	26.20	15.30	27.50	27.40	29.00	29.70	29.70	28.66
Mixing	3.30	8.20	14.40	16.30	21.90	12.82	31.50	30.10	30.00	28.90	30.10	30.12
Virus	5.40	15.70	31.30	33.80	48.00	26.84	29.70	29.20	30.40	29.80	30.10	29.84

🔼 This table presents a comparison of different attack methods’ effectiveness against a guardrail moderation system at varying harmful data ratios. The methods compared include Benign Fine-tuning Attack (BFA), Harmful Fine-tuning Attack (HFA), Mixing attack, and the proposed Virus attack. For each method, the harmful score and fine-tuning accuracy are measured across different percentages of harmful training data (p). BFA results remain constant because the harmful data ratio does not affect its implementation. The table helps quantify the impact of harmful data on model safety and the effectiveness of each attack in bypassing the guardrail.
read the caption
Table 5: Evaluation of different attack methods under guardrail moderation and different harmful ratio p𝑝pitalic_p. All the statistic of BFA for this table are the same because harmful ratio p𝑝pitalic_p does not affect its implementation.

Methods	Harmful Score						Finetune Accuracy
	n=100	n=200	n=500	n=800	n=1000	Average	n=100	n=200	n=500	n=800	n=1000	Average
BFA	2.20	2.90	4.10	4.70	5.60	3.90	22.20	26.90	32.00	31.80	36.20	29.82
HFA	3.10	5.60	17.60	24.70	31.80	16.56	22.20	27.20	29.00	32.70	33.20	28.86
Mixing	3.80	5.30	14.40	21.70	25.50	14.14	24.50	24.30	30.00	31.20	33.80	28.76
Virus	6.60	12.90	31.30	39.70	42.60	26.62	21.90	27.70	30.40	32.30	34.90	29.44

🔼 This table presents a comparison of different attack methods (BFA, HFA, Mixing, Virus) on a large language model (LLM) under guardrail moderation. The experiment is conducted with varying numbers of fine-tuning samples (n) to assess the effectiveness of each attack in compromising the safety alignment of the model. The metrics used for evaluation include the harmful score (measuring the LLM’s tendency to generate harmful responses) and finetune accuracy (measuring the accuracy on the downstream task). This allows for an understanding of how the number of fine-tuning samples and different attack strategies impact the LLM’s safety alignment and its performance on the intended task.
read the caption
Table 6: Evaluation of different attack methods under guardrail moderation and different number of fine-tune sample n𝑛nitalic_n.

Methods	SST2		AgNews		GSM8K		Average
	HS	FA	HS	FA	HS	FA	HS	FA
BFA	2.20	93.69	1.30	79.30	4.10	32.00	2.53	68.33
HFA	13.40	92.78	13.90	54.00	17.60	29.00	14.97	58.59
Mixing	9.30	93.69	6.60	79.40	14.40	30.00	10.10	67.70
Virus	23.00	93.35	21.20	75.40	31.30	30.40	25.17	66.38

🔼 This table presents a comparison of different attack methods (BFA, HFA, Mixing, Virus) on the safety alignment of a large language model (LLM) under guardrail moderation. The comparison is conducted across three different downstream fine-tuning tasks (SST2, AGNews, GSM8K). The metrics used are harmful score and fine-tuning accuracy, reflecting the effectiveness of each attack method in compromising the model’s safety and its impact on its performance on benign tasks. It demonstrates the relative success of each method in bypassing the safety mechanisms and degrading the model’s ability to produce safe responses.
read the caption
Table 7: Evaluation of different attack methods under guardrail moderation and different fine-tuning tasks.

Methods	Clocktime	GPU Memory
Only $F_{1}$ ( $\lambda$ =1)	0.0828h	34.37GB
Only $F_{2}$ ( $\lambda$ =0)	0.2036h	36.62GB
Virus	0.2649h	39.68GB

🔼 This table presents the computational resource requirements for running the Virus attack algorithm on an NVIDIA H100 GPU. It details the GPU clock time (in hours) and GPU memory usage (in GB) needed to generate the adversarial examples. The table compares resource usage for three scenarios: Virus (the full dual-objective optimization method), only F1 (guardrail jailbreak, optimizing only for bypassing the guardrail), and only F2 (gradient matching, optimizing only for gradient similarity). This allows for an assessment of the computational cost-effectiveness of the different components of the Virus attack.
read the caption
Table 8: System evaluation for Virus on an H100.

$\lambda$	0	0.01	0.05	0.1	1
Gradient similarity	1.000	0.9972	0.9840	0.9805	0.8264
Leakage ratio	44%	66%	100%	100%	100%
Harmful score	14.00	21.90	30.40	31.30	14.10
Finetune accuracy	30.10	30.70	29.60	30.40	27.10

🔼 This table presents the results of an experiment evaluating the effect of different values for the hyperparameter lambda (λ) on the performance of the Virus attack. The hyperparameter λ controls the tradeoff between two objectives: minimizing the loss for jailbreaking the guardrail and maximizing the gradient similarity between the optimized data and the original harmful data. The table shows how different values of λ affect gradient similarity, leakage ratio (percentage of harmful data that bypasses the guardrail), harmful score (a measure of the attack’s success in compromising the model’s safety), and downstream task accuracy.
read the caption
Table 9: The impact of tradeoff hyper-parameter λ𝜆\lambdaitalic_λ.

TL;DR#

Key Takeaways#

Why does it matter?#

Visual Insights#

In-depth insights#

Guardrail Jailbreak#

Dual-Goal Optimization#

Virus Attack Method#

Gradient Mismatch#

Future Research#

More visual insights#

Full paper#