TL;DR#
The paper tackles the challenge of aligning Large Language Models (LLMs) with human values and policies. Existing methods, like Reinforcement Learning from Human Feedback (RLHF), are often computationally expensive and require large, high-quality datasets that are difficult and costly to acquire. This paper instead proposes ‘unlearning’ undesired model behaviors by selectively removing the influence of specific training samples, addressing these limitations with an efficient way to erase the impact of harmful training data.
The paper proposes a gradient ascent-based unlearning algorithm that leverages only negative (undesired) examples, making data collection simpler and cheaper. The authors demonstrate the effectiveness of this method across three scenarios: removing harmful responses, erasing copyrighted content, and reducing hallucinations. They show that their unlearning approach achieves better alignment performance than RLHF in some instances, while requiring significantly less computational time. This offers a resource-efficient and practical solution for aligning LLMs, especially when positive examples are scarce.
Key Takeaways#
Why does it matter?#
This paper is crucial because it introduces LLM unlearning, a novel approach for aligning LLMs with human values and policies. This method is particularly relevant given the increasing concerns around the safety and ethical implications of LLMs. It offers a computationally efficient and resource-friendly alternative to existing methods and provides a novel perspective on model alignment, opening new avenues for research and development in the field.
Visual Insights#
🔼 This figure compares the computation time of different methods for aligning LLMs. The full RLHF pipeline requires significantly more time than alternatives such as SFT, finetuning on the remaining data, GA, and GA+Mismatch. The GA and GA+Mismatch methods, which use only negative samples and are the focus of this paper, have the lowest computation times, achieving results comparable to the far more time-consuming RLHF approach with only about 2% of its computational time. This highlights the computational efficiency of the proposed unlearning approach.
Figure 1: Run time on a single NVIDIA A100-80GB GPU.
🔼 This table presents the experimental results of unlearning harmfulness using three language models: OPT-1.3B, OPT-2.7B, and Llama 2 (7B). The results are broken down by method (Finetuning, GA, GA+Mismatch) and report the harmful rate, diversity, fluency, utility reward, and similarity to the original model’s outputs. The harmful rate is the percentage of harmful responses generated by each model. Diversity measures the variety of the generated text (unique-token ratio), fluency is measured by perplexity under a reference model, and Utility Reward indicates the model’s performance on normal, non-harmful prompts. Similarity to Original (a BLEURT score) measures how close the generated responses are to those of the original, un-unlearned model. NM stands for ‘Not Meaningful’, indicating that the generated text was of such low quality (e.g., mostly a single repeated character) that fluency could not be measured.
Table 1: Experimental results on unlearning harmfulness. NM = “Not Meaningful”. GA and GA+Mismatch can achieve near zero harmful rates and generalize to unseen harmful prompts.
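The diversity and fluency metrics above recur throughout the tables in this summary. As a rough illustration of how they could be computed — the reference model name, the degenerate-text threshold, and the helper names below are assumptions for the sketch, not the paper’s exact evaluation pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_name = "gpt2-large"  # placeholder reference LM for perplexity scoring
tok = AutoTokenizer.from_pretrained(ref_name)
ref_lm = AutoModelForCausalLM.from_pretrained(ref_name)

def diversity(text: str) -> float:
    """Unique-token ratio of the generated text: higher means more varied output."""
    ids = tok.encode(text)
    return len(set(ids)) / max(len(ids), 1)

@torch.no_grad()
def fluency(text: str) -> float:
    """Perplexity of the text under the reference LM: lower means more fluent.
    Returns NaN when the output is too short or too repetitive to score
    meaningfully (the 'NM' case in the tables); the 0.05 cutoff is a placeholder."""
    ids = tok(text, return_tensors="pt").input_ids
    if ids.shape[1] < 2 or diversity(text) < 0.05:
        return float("nan")
    loss = ref_lm(ids, labels=ids).loss  # mean next-token negative log-likelihood
    return torch.exp(loss).item()
```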
In-depth insights#
LLM Unlearning#
LLM unlearning explores techniques for removing unwanted information or behaviors from large language models (LLMs). This is crucial for mitigating risks associated with harmful or biased outputs, safeguarding user privacy by removing personal data, and enhancing the LLM’s overall safety and reliability. The primary advantage of unlearning is its computational efficiency and the ability to utilize negative samples, which are easier to obtain than positive examples needed for techniques like Reinforcement Learning from Human Feedback (RLHF). However, LLM unlearning poses unique challenges due to the sheer size and complexity of LLMs, the lack of complete information about the training data, and the difficulty in defining and evaluating what constitutes “forgetting.” Successful LLM unlearning strategies must address these challenges by carefully designing the unlearning algorithm, choosing effective evaluation metrics, and generalizing to unseen prompts. The effectiveness of different approaches (such as gradient ascent, and incorporating random or templated mismatched responses) needs thorough empirical analysis, and the implications for various LLM alignment tasks (removing harmful responses, deleting copyrighted material, and reducing hallucinations) require further investigation. The long-term potential of unlearning lies in its ability to create more robust, safer, and ethically aligned LLMs with significantly reduced computational costs, thereby making AI alignment a more feasible and practical task.
Negative Sample Power#
The concept of ‘Negative Sample Power’ in the context of large language model (LLM) training and alignment is a fascinating and crucial area of investigation. It highlights the potential of using negative examples—cases where the model produced undesirable or harmful outputs—to improve its behavior. Traditional approaches, like Reinforcement Learning from Human Feedback (RLHF), heavily rely on positive examples, which are expensive and time-consuming to obtain. Negative sample power suggests a more efficient alternative, leveraging readily available negative data from user reports or red teaming exercises. The effectiveness hinges on several factors, such as the quality and diversity of negative samples, the selection algorithm, and the method of incorporating this data into the model’s training or fine-tuning process. A deeper understanding of negative sample power could lead to more effective and cost-efficient LLM alignment techniques, focusing on preventing harmful outputs rather than solely promoting desirable ones. Further research could explore optimal strategies for data collection, sampling, and integration, potentially leading to significant advancements in LLM safety and reliability.
Gradient Ascent Method#
The Gradient Ascent Method, within the context of large language model (LLM) unlearning, presents a compelling approach to mitigate undesirable model behaviors by iteratively adjusting model parameters. Its core strength lies in its direct targeting of unwanted outputs using only negative examples, unlike Reinforcement Learning from Human Feedback (RLHF), which requires costly positive examples. This efficiency is particularly attractive given the scale of LLMs and the relative ease of collecting negative data through user reports or red teaming. However, the method’s effectiveness hinges on the quality and representativeness of the negative samples used. A crucial aspect is the method’s ability to generalize to unseen, yet similar, undesirable inputs. The paper’s exploration of using random mismatch responses or templated outputs to augment gradient ascent tackles the challenge of potential over-specialization. Careful consideration of the balance between forgetting undesirable behavior and preserving desirable aspects of the LLM is paramount. The ablation studies included in the research will help to determine the optimal settings and further evaluate its performance against standard approaches.
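As a minimal sketch of the mechanics described above — assuming a Hugging Face-style causal LM and batches that carry an `answer_mask` marking response tokens; this illustrates gradient-ascent unlearning with an optional random-mismatch term, not the authors’ exact code:

```python
import torch.nn.functional as F

def token_nll(model, input_ids, answer_mask):
    """Mean next-token negative log-likelihood restricted to answer tokens."""
    logits = model(input_ids).logits[:, :-1]          # predictions for positions 1..T-1
    targets = input_ids[:, 1:]
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    mask = answer_mask[:, 1:].float()
    return (nll * mask).sum() / mask.sum()

def unlearning_loss(model, forget_batch, mismatch_batch=None):
    """Gradient ascent on harmful (prompt, response) pairs; optionally also
    descend on the same prompts paired with random mismatched responses
    (the GA+Mismatch variant)."""
    # Ascent: negate the usual LM loss on the samples to be forgotten.
    loss = -token_nll(model, forget_batch["input_ids"], forget_batch["answer_mask"])
    if mismatch_batch is not None:
        # Descent: push the model toward unrelated responses for the same prompts.
        loss = loss + token_nll(model, mismatch_batch["input_ids"], mismatch_batch["answer_mask"])
    return loss
```

In the paper’s setup a third term that preserves behavior on normal (non-harmful) data is added as well; the hyperparameter table later in this section lists the weights assigned to the three losses.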
RLHF Comparison#
A hypothetical ‘RLHF Comparison’ section in a research paper would analyze the performance of Reinforcement Learning from Human Feedback (RLHF) against alternative methods, likely focusing on the efficiency and effectiveness of each approach. The analysis would compare the computational cost of RLHF, known for its resource intensity, with alternatives like the proposed unlearning method. Key metrics would include the alignment performance achieved, such as the reduction in harmful outputs or improved factual accuracy. A thoughtful comparison would also explore the data requirements, highlighting the substantial cost and difficulty associated with acquiring high-quality positive data for RLHF versus the relative ease of obtaining negative examples for unlearning. Furthermore, the comparison would likely address the generalizability of each method, considering its performance on unseen data and its ability to adapt to evolving preferences. Finally, a nuanced discussion would acknowledge that while RLHF may offer advantages in certain situations, unlearning represents a valuable alternative, particularly when resources are limited, or the primary goal is preventing harmful outputs rather than generating optimal responses.
Unlearning Challenges#
The challenges of unlearning in large language models (LLMs) are multifaceted and significant. Unlike traditional unlearning in simpler classification models, LLMs present a vastly larger output space, making it difficult to define and measure successful forgetting. The sheer scale and complexity of LLMs also pose computational challenges, as methods effective for smaller models often become prohibitively expensive when applied to LLMs. Furthermore, the opacity of the training data in LLMs is a major hurdle, preventing the precise identification and removal of specific harmful training samples, which is crucial for effective unlearning. Efficiently balancing the goals of forgetting undesirable outputs while preserving the utility of the model on desirable tasks is a core challenge. Finally, the very definition of “undesirable” is context-dependent and often subjective, making the design and evaluation of unlearning algorithms complex and nuanced. Therefore, robust, scalable, and interpretable methods for unlearning in LLMs are still under development, requiring innovative approaches that leverage the unique characteristics and challenges of these complex models.
More visual insights#
More on tables
🔼 This table presents the quantitative results of unlearning copyrighted content experiments. It shows the performance of different methods (original model, finetuning, GA, and GA+Mismatch) on three different aspects: leak rate (the percentage of extraction attempts that successfully leaked copyrighted information), diversity (the diversity of generated text), and fluency (the perplexity of generated text). The results are broken down for unlearned extraction attempts (those from the training data being targeted for unlearning), unseen extraction attempts (similar copyright-protected content not included in the unlearning set), and normal completion (text generation on non-copyrighted prompts). The ‘Utility’ and ‘Similarity to Original’ columns compare the performance on normal completion tasks.
Table 2: Experimental results on unlearning copyrighted content. NM = “Not Meaningful”. Both GA and GA+Mismatch can achieve near-zero leak rates, and distinguish copyright-related prompts from other prompts.
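One simple way the leak rate above could be operationalized is a verbatim-span check over the generated completions. The span length and exact-match criterion below are illustrative assumptions, since the paper’s precise matching rule is not reproduced here:

```python
def leaked(completion: str, protected_text: str, n: int = 20) -> bool:
    """Heuristic leak check: does the completion reproduce any length-n
    character span of the protected text verbatim?"""
    for i in range(max(len(protected_text) - n + 1, 0)):
        if protected_text[i:i + n] in completion:
            return True
    return False

def leak_rate(completions: list[str], protected_text: str) -> float:
    """Fraction of extraction attempts whose completion leaks protected text."""
    hits = sum(leaked(c, protected_text) for c in completions)
    return hits / max(len(completions), 1)
```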
🔼 This table presents the results of experiments on reducing hallucinations in LLMs using the proposed unlearning methods (GA and GA+Mismatch). It shows the performance on three types of questions: unlearned misleading questions, unseen misleading questions (in-distributional), and benign questions (out-of-distributional). The metrics reported include hallucination rate, diversity, fluency, utility reward, and similarity to the original LLM. The results demonstrate the effectiveness of the methods in reducing hallucinations, particularly for in-distributional questions, while maintaining reasonable performance on benign questions.
Table 3: Experimental results on reducing hallucinations. NM = “Not Meaningful”. Both GA and GA+Mismatch can significantly reduce the hallucination rate and distinguish between in-distributional and out-of-distributional questions.
🔼 This table shows the model’s outputs for harmful prompts before and after unlearning. It demonstrates that after ~200 batches of unlearning the model still generates harmful responses, whereas after ~1000 batches it outputs only whitespace.
Table 6: Harmful content warning. Responses to harmful prompts after unlearning ~ 200 (when the loss on harmful samples is already as high as ~ 60) and ~ 1000 batches.
🔼 This table shows the results of applying the unlearning-harmfulness method on Llama 2 (7B) with normal prompts. Because Llama 2 is a text-completion model rather than a chat model, it tends to output a new ‘### Question:’ after answering; this behavior appears in both the original and the unlearned model, demonstrating that it is not caused by the unlearning itself. The prompt template quoted in the caption was used to encourage the model to answer questions instead of generating new ones, a limitation of using a text-completion model.
Table 7: Unlearning Harmfulness: Llama 2 (7B), test normal prompts. Note that Llama 2 would output unnecessarily '### Question:' after answering the question because Llama 2 is a text completion model rather than a chat model (we do not use Llama 2 Chat because it is already aligned). So we use the following prompt template to make it answer questions: '### Question: [question] ### Answer: [answer]'. Even the original Llama 2 would unnecessarily output new questions. In practice, we can just ignore the unnecessarily generated questions.
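A small sketch of how the prompt template quoted in the caption could be applied, and how the spurious trailing questions could be trimmed. The truncation rule is an assumption based on the caption’s advice to simply ignore the extra questions:

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the template from Table 7's caption."""
    return f"### Question: {question} ### Answer:"

def clean_answer(generated: str) -> str:
    """Keep only the answer; drop any new '### Question:' block the
    text-completion model appends after it."""
    return generated.split("### Question:")[0].strip()
```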
🔼 This table presents the hyperparameter settings used in the experiments for unlearning harmfulness. It shows the number of unlearning batches, batch size, and the weights assigned to three loss functions (Lfgt, Lrdn, Lnor) in the gradient ascent update equation. It also lists the learning rate and whether Low-Rank Adaptation (LoRA) was used for each model (OPT-1.3B, OPT-2.7B, Llama 2 (7B)) and method (Finetuning, GA, GA+Mismatch).
Table 8: Unlearning Harmfulness: Hyperparameter setting.
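Put together, the weights in this table parameterize a single combined update. A hedged reading — assuming the three losses enter additively with per-loss weights ε1, ε2, ε3, and that the forget loss is the term being ascended on the harmful samples — is:

```latex
% Assumed reading of the update that Table 8's weights parameterize:
% L_fgt is ascended on the forget samples, L_rdn pulls toward random
% mismatched responses, and L_nor preserves behavior on normal data.
\theta_{t+1} \;=\; \theta_t
  \;-\; \epsilon_1 \,\nabla_{\theta}\mathcal{L}_{\mathrm{fgt}}
  \;-\; \epsilon_2 \,\nabla_{\theta}\mathcal{L}_{\mathrm{rdn}}
  \;-\; \epsilon_3 \,\nabla_{\theta}\mathcal{L}_{\mathrm{nor}}
```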
🔼 This table presents example outputs for unlearned harmful prompts with OPT-1.3B. It compares the original model’s responses against responses after fine-tuning, after gradient ascent (GA), and after GA with random mismatch (GA+Mismatch). The prompts are examples of harmful questions; after unlearning, the harmful responses are largely replaced by a refusal (‘I can’t assist it.’) or nonsensical text, demonstrating that the unlearning methods significantly reduce the generation of harmful content.
Table 11: Unlearning Harmfulness: OPT-1.3B, unlearned harmful prompts.