Group-robust Machine Unlearning

AI Generated 🤗 Daily Papers AI Theory Robustness 🏢 University of Trento
2503.09330
Thomas De Min et al.
🤗 2025-03-17

↗ arXiv ↗ Hugging Face

TL;DR

Machine unlearning aims to remove the influence of specific training data while preserving the model's remaining knowledge. Previous approaches assume a uniformly distributed forget set; when this assumption does not hold, performance degrades for the dominant groups in the forget set, raising fairness concerns. This paper introduces group-robust machine unlearning to tackle this overlooked problem of non-uniformly distributed forget sets.

The paper presents a simple strategy that mitigates performance loss in dominant groups via sample distribution reweighting. It also introduces MIU (Mutual Information-Aware Machine Unlearning), the first approach targeting group robustness in approximate machine unlearning: MIU minimizes the mutual information between model features and group information, and combines sample distribution reweighting with mutual information calibration against the original model to preserve group robustness.

Key Takeaways

Why does it matter?

This paper introduces group-robust unlearning, a novel approach to mitigate performance degradation in dominant groups after unlearning. It offers practical solutions and opens avenues for research in fair and robust ML systems.


Visual Insights

🔼 This figure compares different machine unlearning approaches. Traditional methods assume that the data being unlearned (the ‘forget set’) is evenly distributed across all groups within the dataset. However, in reality, unlearning requests may disproportionately come from certain demographics (e.g., older men). This figure illustrates how this non-uniform distribution can cause a significant drop in model accuracy for the over-represented group (shown as the blue group experiencing a drop in accuracy). In contrast, ‘Group-robust Unlearning,’ the focus of this paper, aims to mitigate this performance degradation by considering the uneven distribution of the forget set.

Figure 1: Comparing unlearning approaches. Previous works assume the forget set to be uniformly distributed. However, real-life unlearning requests do not comply with the uniform distribution assumption [3]. If the forget set distribution is predominant in some groups (e.g., old males), it can lead to performance degradation in such dominant forget groups (i.e., the blue group in the figure). Group-robust Unlearning prevents this from happening.

| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 84.6 | 66.6 | 82.8 | 0.1 | 20.9 | 67.3 | - |
| Retrain | ✗ | 84.7 | 54.6 | 82.3 | 0.1 | 31.6 | 56.9 | - |
| Retrain | ✗ | 81.4 | 82.8 | 79.9 | 11.3 | 3.9 | 83.5 | - |
| Retrain | ✓ | 84.5 | 66.6 | 82.4 | 0.3 | 20.8 | 67.8 | - |
| L1-sparse [23] | ✗ | 83.8 (0.7) | 44.3 (22.3) | 81.3 (1.1) | 0.2 (0.1) | 37.5 (16.7) | 47.5 (20.2) | 89.8 |
| SalUn [13] | ✗ | 84.9 (0.4) | 47.6 (19.0) | 82.4 (0.2) | 0.1 (0.2) | 31.6 (10.8) | 48.8 (19.0) | 91.7 |
| SCRUB [28] | ✗ | 82.1 (2.5) | 40.6 (26.0) | 79.8 (2.6) | 0.4 (0.2) | 40.1 (19.3) | 42.6 (25.1) | 87.4 |
| MIU | ✗ | 84.8 (0.3) | 55.3 (11.3) | 82.6 (0.3) | 0.3 (0.2) | 27.4 (6.6) | 55.9 (11.9) | 94.9 |
| L1-sparse [23] | ✓ | 83.4 (1.1) | 60.6 (6.0) | 81.3 (1.1) | 0.2 (0.2) | 27.9 (7.2) | 62.0 (5.7) | 96.5 |
| SalUn [13] | ✓ | 84.4 (0.3) | 64.9 (3.7) | 82.5 (0.2) | 0.1 (0.2) | 21.4 (1.5) | 66.2 (3.0) | 98.5 |
| SCRUB [28] | ✓ | 84.4 (0.3) | 61.6 (5.0) | 82.6 (0.3) | 0.5 (0.3) | 24.0 (3.2) | 62.9 (4.8) | 97.7 |
| MIU | ✓ | 84.2 (0.4) | 68.8 (2.6) | 82.5 (0.1) | 0.1 (0.2) | 20.2 (0.6) | 69.0 (1.3) | 99.2 |

🔼 This table presents the results of a group-robust machine unlearning experiment on the CelebA dataset. The experiment focuses on the scenario where the data to be unlearned (the ‘forget set’) is not uniformly distributed across different groups within the dataset, but rather heavily concentrated in a single group. The goal is to evaluate how well different machine unlearning methods, specifically MIU (the proposed method in this paper), L1-sparse, SalUn, and SCRUB, can remove the influence of the forget set while preserving the model’s accuracy on the remaining data (the ‘retain set’), particularly within the dominant group of the forget set. The unlearning ratio is set to 0.5, meaning 50% of the data from the dominant group in the forget set is removed. Performance is measured using several metrics, and compared against a baseline of retraining the model without the forget set (Retrain) and a modified version with the proposed sample reweighting strategy (Retrain + Reweight), which is used as the gold standard against which other methods are compared. Methods that don’t employ the reweighting strategy are shown in light gray to clearly distinguish them from the reweighted methods.

Table 1: Group-robust machine unlearning in CelebA [31]. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.5. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap and deltas are computed against Retrain + reweight. To avoid confusion, other reference models are in light gray.
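
The Avg. Gap numbers in Table 1 (and the later tables) are consistent with a simple aggregation: 100 minus the mean absolute difference from the Retrain + reweight reference over the six reported metrics. The snippet below reproduces the MIU + reweight value from the rounded table entries; the function name and the exact formula are inferences from the reported deltas, not the paper's stated definition.

```python
# Avg. Gap as 100 minus the mean absolute per-metric deviation from the
# Retrain + reweight reference (a formula inferred from the reported deltas).

REFERENCE = {"RA": 84.5, "UA": 66.6, "TA": 82.4, "MIA": 0.3, "EO": 20.8, "GA": 67.8}

def compute_avg_gap(metrics: dict, reference: dict = REFERENCE) -> float:
    """Average absolute gap to the reference, reported as 100 - gap (higher is better)."""
    gaps = [abs(metrics[k] - reference[k]) for k in reference]
    return 100.0 - sum(gaps) / len(gaps)

# MIU + reweight row of Table 1 (CelebA, unlearning ratio 0.5).
miu = {"RA": 84.2, "UA": 68.8, "TA": 82.5, "MIA": 0.1, "EO": 20.2, "GA": 69.0}
print(round(compute_avg_gap(miu), 1))  # ~99.2, matching the reported Avg. Gap
# (computed from rounded table values; the paper likely uses unrounded metrics).
```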

In-depth insights

Group-robustness

The paper tackles the critical issue of performance degradation in specific groups when unlearning data, particularly when the data to be unlearned is not uniformly distributed across all groups. This is a significant problem because existing machine unlearning methods often assume a uniform distribution, which can lead to unfair outcomes and reduced accuracy for the dominant groups within the forget set. The paper addresses this gap by introducing a novel approach called group-robust machine unlearning. By mitigating the performance deterioration in these dominant groups, the algorithm helps preserve the model's generalization capabilities and ensures fairness across different demographic or social groups.

Unlearning MIU

Mutual Information-Aware Machine Unlearning (MIU) is a novel method in machine unlearning, focusing on balancing privacy and utility. MIU leverages mutual information minimization between model features and group labels, decorrelating unlearning from spurious attributes to mitigate performance loss in dominant forget groups. To avoid impacting other groups, MIU calibrates the unlearned model's mutual information to match the original, preserving robustness. By minimizing the mutual information between forget-set features and ground-truth labels, MIU decorrelates unlearning from spurious attributes.

This mitigation addresses the scenario where data to be unlearned is not uniformly distributed but dominant in one group, leading to performance degradation.

The key idea is that by making the features independent of the group labels, the effect of the unlearning process can be isolated to the intended data without causing fairness issues.

Coupled with REWEIGHT, MIU outperforms existing unlearning approaches (L1-SPARSE, SALUN, SCRUB) in unlearning efficacy and preserved group robustness.
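
The section above stays at the conceptual level. As a rough illustration, the mutual information between features and group labels is often approximated with a classifier-based surrogate: a small auxiliary head tries to predict the group id from backbone features, and its cross-entropy stands in for (the negative of) the mutual information. The sketch below combines such a proxy into a three-term objective mirroring the description (retain, unlearn, calibrate). Everything here — `GroupHead`, `mi_proxy`, `miu_loss`, and how the terms are weighted — is a hypothetical PyTorch sketch, not the authors' implementation (the paper defines its terms in Eqs. 3–5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupHead(nn.Module):
    """Small auxiliary classifier used only to probe how much group
    information the backbone features still carry."""
    def __init__(self, feat_dim: int, num_groups: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_groups)
        self.num_groups = num_groups

    def forward(self, feats):
        return self.fc(feats)

def mi_proxy(group_head, feats, groups):
    """Crude classifier-based proxy for I(features; group):
    log|G| minus the cross-entropy of the group head (higher = more group info)."""
    ce = F.cross_entropy(group_head(feats), groups)
    return torch.log(torch.tensor(float(group_head.num_groups))) - ce

def miu_loss(backbone, task_head, group_head, retain_batch, forget_batch, mi_ref, lam=1.0):
    """Illustrative objective combining three terms:
    (1) keep task accuracy on the retain set,
    (2) push the MI proxy down on the forget set,
    (3) calibrate the retain-set MI proxy toward the original model's value mi_ref.
    In practice the group head would be trained in alternation to predict groups;
    that adversarial detail is omitted here for brevity."""
    xr, yr, gr = retain_batch
    xf, _, gf = forget_batch

    feats_r, feats_f = backbone(xr), backbone(xf)
    retain_term = F.cross_entropy(task_head(feats_r), yr)
    unlearn_term = mi_proxy(group_head, feats_f, gf)
    calib_term = (mi_proxy(group_head, feats_r, gr) - mi_ref) ** 2
    return retain_term + unlearn_term + lam * calib_term
```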

Reweighting Data

Reweighting data is a crucial aspect of machine unlearning, particularly when dealing with non-uniformly distributed forget sets. Simply removing data can lead to performance degradation in dominant groups, thus requiring a more nuanced approach. Reweighting techniques adjust the importance of different data points during retraining. By increasing the sampling likelihood of underrepresented groups, the model can compensate for information loss and maintain group robustness. This strategy helps preserve original group accuracies and overall model performance after unlearning. The effectiveness of reweighting depends on factors such as dataset size and the degree of imbalance in the forget set. Properly implemented reweighting can mitigate fairness issues and ensure that the unlearning process does not disproportionately affect certain demographic groups.
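
A minimal way to implement this idea is to oversample, within the retain set, the groups that the forget set depletes, so that retraining or fine-tuning still sees roughly the original group proportions. The sketch below uses PyTorch's `WeightedRandomSampler`; the specific weighting rule (`original_count / retain_count` per group) is an illustrative assumption, not necessarily the paper's exact formula.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

def reweight_sampler(retain_groups, forget_groups):
    """Sampler over the retain set that approximately restores the original
    (retain + forget) group proportions by upweighting depleted groups.
    Illustrative rule: weight = original_count(g) / retain_count(g)."""
    retain_counts = Counter(retain_groups)
    original_counts = retain_counts + Counter(forget_groups)
    weights = torch.tensor(
        [original_counts[g] / retain_counts[g] for g in retain_groups],
        dtype=torch.double,
    )
    return WeightedRandomSampler(weights, num_samples=len(retain_groups), replacement=True)

# Toy example: group 0 dominates the forget set, so its remaining retain
# samples get weight 2000/1000 = 2.0, while group 1 stays close to 1.0.
retain_groups = [0] * 1000 + [1] * 1500
forget_groups = [0] * 1000 + [1] * 10
sampler = reweight_sampler(retain_groups, forget_groups)
# DataLoader(retain_dataset, batch_size=128, sampler=sampler) would then be
# used when retraining or fine-tuning on the retain set.
```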

Fairness Metrics

Evaluating fairness is crucial in machine unlearning, especially for group robustness. The paper considers Demographic Parity (DP), which requires predictions to be independent of sensitive attributes; Equal Opportunity (reported as EP in Table 16), which focuses on equal true positive rates across groups; and Worst Group Accuracy (WG), which tracks performance on the least accurate group. These metrics complement typical unlearning evaluations by highlighting potential biases and by measuring how model performance varies with protected attributes. Their use underscores the importance of quantifying unlearning's impact on different demographic groups to ensure fair and equitable outcomes, preventing the exacerbation of existing biases or the introduction of new ones. A comprehensive evaluation therefore requires a set of metrics that capture the nuances of fairness, covering aspects of model behavior beyond overall accuracy. In essence, these metrics reveal whether unlearning degrades a specific group's performance or amplifies biases within the model, guiding the development of group-robust models.
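
For concreteness, these quantities can be computed directly from predictions, labels, and group ids. The NumPy sketch below follows the standard textbook definitions (function names are illustrative); note that Table 16 separately reports Equal Opportunity (EP) and Equalized Odds (EO), whose exact implementations in the paper may differ in detail.

```python
import numpy as np

def demographic_parity_gap(preds, groups):
    """Max difference in positive-prediction rate across groups."""
    rates = [preds[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

def equal_opportunity_gap(preds, labels, groups):
    """Max difference in true-positive rate (recall on y=1) across groups."""
    tprs = []
    for g in np.unique(groups):
        mask = (groups == g) & (labels == 1)
        if mask.any():
            tprs.append(preds[mask].mean())
    return max(tprs) - min(tprs)

def worst_group_accuracy(preds, labels, groups):
    """Accuracy of the worst-performing group."""
    accs = [(preds[groups == g] == labels[groups == g]).mean() for g in np.unique(groups)]
    return min(accs)

# Toy usage with binary predictions and two groups.
preds = np.array([1, 0, 1, 1, 0, 0, 1, 0])
labels = np.array([1, 0, 1, 0, 1, 0, 1, 1])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_gap(preds, groups),
      equal_opportunity_gap(preds, labels, groups),
      worst_group_accuracy(preds, labels, groups))
```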

Ablation Study

The ablation study meticulously dissects MIU’s architecture across diverse datasets (CelebA, Waterbirds, and FairFace). It systematically evaluates the contribution of each component: retaining term, unlearning term, calibration term, and REWEIGHT. Results underscore the unlearning term’s efficacy in reducing mutual information, evidenced by consistently lower UA scores. The calibration term enhances group fairness (increased GA), while REWEIGHT boosts robustness. The study also explores the impact of the λ parameter, finding optimal values vary across datasets. Overall, the ablation study validates the effectiveness of MIU’s design choices in achieving group-robust unlearning.
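
Schematically, the ablation amounts to toggling the three loss terms (labelled Eq. 3–5 in Table 4) and reweight on or off, and sweeping λ as in Figure 8. The snippet below is a structural sketch only: `miu_objective` and the dummy per-term values are hypothetical stand-ins for the real training losses.

```python
from itertools import product

def miu_objective(retain_term: float, unlearn_term: float, calib_term: float,
                  lam: float = 1.0, use_retain: bool = True,
                  use_unlearn: bool = True, use_calib: bool = True) -> float:
    """Combine the three loss terms, allowing each to be ablated;
    lambda scales the calibration term only."""
    total = 0.0
    if use_retain:
        total += retain_term
    if use_unlearn:
        total += unlearn_term
    if use_calib:
        total += lam * calib_term
    return total

# Enumerate ablation configurations and a lambda sweep with dummy term values.
for use_unlearn, use_calib, lam in product([False, True], [False, True], [0.5, 1.0, 2.0]):
    loss = miu_objective(0.40, 0.25, 0.10, lam=lam,
                         use_unlearn=use_unlearn, use_calib=use_calib)
    print(f"unlearn={use_unlearn} calib={use_calib} lam={lam}: loss={loss:.2f}")
```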

More visual insights

More on figures

🔼 This figure illustrates the impact of non-uniformly distributed data on machine unlearning. It demonstrates how existing approximate unlearning methods (L1-sparse, SalUn, and SCRUB) fail to maintain accuracy for a dominant group (attractive males) within the dataset when a disproportionate number of samples from that group are requested for removal (‘forget set’). Specifically, as more attractive males are removed from the CelebA dataset, the model’s accuracy in classifying the remaining attractive males progressively decreases.

Figure 2: Unlearning non-uniformly distributed data. We test standard model retraining, and popular approximate unlearning methods (L1-sparse [23], SalUn [13], and SCRUB [28]) in group-robust unlearning. The more attractive males are unlearned from CelebA [31], the lower the model accuracy on that group.

🔼 This figure compares the performance of two methods for group-robust machine unlearning: RETRAIN + REWEIGHT and RETRAIN + GROUP-DRO. The x-axis represents different datasets (CelebA, Waterbirds, and FairFace). The y-axis represents the gap between the performance of the model after unlearning and the original pre-trained model’s performance. The bars show that RETRAIN + REWEIGHT achieves a smaller gap, indicating that it better preserves both overall test accuracy and the accuracy within specific groups (the dominant group in the forget set) after the unlearning process. This suggests that the sample reweighting strategy in RETRAIN + REWEIGHT is more effective in maintaining model performance and robustness than the GROUP-DRO optimization approach.

Figure 3: reweight vs. group-DRO [42]. Retrain + reweight achieves a better test and group accuracy alignment with the original model (higher is better). Thus, it better preserves original performance after unlearning.

🔼 Figure 4 illustrates the impact of the proposed sample reweighting strategy (REWEIGHT) on group-robust machine unlearning. The experiment uses the CelebA dataset [31], focusing on the accuracy of the ‘dominant group’ (GA) within the forget set, that is the group of data most heavily affected by the unlearning process. Different unlearning methods are tested, each evaluated both with and without REWEIGHT. The graph plots the dominant group accuracy (GA) against the unlearning ratio (the proportion of data from the dominant group that is removed during the unlearning process). The results show that as the unlearning ratio increases, the dominant group accuracy (GA) decreases for methods without REWEIGHT. However, incorporating the REWEIGHT strategy effectively mitigates the performance drop, keeping the dominant group accuracy close to its original level even with a higher unlearning ratio. This demonstrates that REWEIGHT improves the robustness of machine unlearning when dealing with non-uniformly distributed forget sets.

Figure 4: reweight for group-robust unlearning. As in Fig. 2, we test different methods and reweight in group-robust unlearning on CelebA [31]. Darker colors are used for methods without the reweighting, while lightened ones correspond to methods coupled with reweight. As the unlearning ratio grows, the GA of each method degrades. Instead, adding reweight restores the original GA.

🔼 This figure compares the performance of several machine unlearning methods, including L1-sparse, SalUn, SCRUB, and the authors’ proposed MIU method, across different unlearning ratios. The performance is measured using the average gap (Avg. Gap) metric, which quantifies the difference between the unlearned model and the ideal retrained model. The reweight strategy is applied to all methods to mitigate the impact of non-uniformly distributed forget sets. The results show that MIU consistently achieves the best Avg. Gap across all unlearning ratios, indicating its superior robustness and effectiveness in handling imbalanced data.

Figure 5: Group-robust unlearning across different unlearning ratios. We compare L1-sparse [23], SalUn [13], and SCRUB [28] against our approach while using the reweight strategy on all methods. MIU achieves overall the best Avg. Gap when varying the unlearning ratio.

🔼 This figure displays a comparison of different machine unlearning methods’ performance when the forget set is sampled from multiple groups, focusing on the consistency and overall effectiveness of each method. The experiment uses the FairFace dataset and compares MIU to L1-sparse, SalUn, and SCRUB. The results show that MIU consistently achieves the best results across various experimental settings. The x-axis represents the number of groups sampled from, and the y-axis shows the average performance gap of each method relative to the best possible outcome.

Figure 6: Sampling the forget set from multiple groups. We evaluate our method against L1-sparse [23], SalUn [13], and SCRUB [28] when the forget set is sampled from multiple FairFace [24] groups. MIU is more consistent across experiments, always achieving the best result.

🔼 This figure displays a comparison of how different machine unlearning methods handle non-uniformly distributed data. Three popular approximate unlearning methods (L1-sparse, SalUn, and SCRUB) are tested, along with standard model retraining. The x-axis shows the proportion of data removed from a specific group (attractive males in CelebA, waterbirds on land in Waterbirds, and 20-29 year-old Afro-Americans in FairFace). The y-axis represents the model’s accuracy on that same group. The results show that as more data is removed from a single group, the accuracy for that group significantly decreases across all methods tested. While CelebA demonstrates the most pronounced drop, Waterbirds and FairFace datasets also exhibit substantial accuracy degradation, highlighting the challenge of preserving fairness and accuracy when unlearning data non-uniformly.

Figure 7: Unlearning non-uniformly sampled data. We test standard model retraining, and popular approximate unlearning methods (L1-sparse [23], SalUn [13], SCRUB [28]) in group-robust unlearning. The more samples from a specified group are unlearned, the lower the model accuracy on that group. While the drop is more evident in CelebA [31], methods also show significant performance degradation in Waterbirds [42] and FairFace [24] overall.

🔼 This figure shows the impact of the hyperparameter lambda (λ) on the performance of the MIU model across three different datasets: CelebA, Waterbirds, and FairFace. The y-axis represents the average gap (Avg. Gap) which is a metric measuring how close the MIU model’s performance comes to the ideal performance (RETRAIN + REWEIGHT). The x-axis shows the different values of λ that were tested. The results indicate that for CelebA and FairFace, λ=1 yields the best performance, whereas for Waterbirds, higher values of λ may be beneficial. This suggests that the optimal value of λ may be dataset-dependent.

Figure 8: Ablating parameter λ. MIU Avg. Gap when varying parameter λ in CelebA [31], Waterbirds [42], and FairFace [24]. While λ=1 is optimal in CelebA [31] and FairFace [24], Waterbirds [42] benefits from higher lambdas.
More on tables

| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 98.9 | 84.5 | 87.7 | 33.3 | 26.2 | 56.6 | - |
| Retrain | ✗ | 98.7 | 52.4 | 86.5 | 54.8 | 30.4 | 49.4 | - |
| Retrain | ✗ | 94.7 | 89.3 | 91.6 | 21.4 | 7.3 | 83.1 | - |
| Retrain | ✓ | 99.0 | 59.5 | 87.2 | 53.6 | 28.3 | 51.6 | - |
| L1-sparse [23] | ✗ | 99.0 (0.2) | 59.5 (7.1) | 85.6 (1.6) | 44.0 (9.5) | 32.2 (4.1) | 48.8 (11.1) | 94.4 |
| SalUn [13] | ✗ | 100.0 (1.0) | 50.0 (9.5) | 81.8 (5.4) | 90.5 (36.9) | 38.7 (10.4) | 39.3 (12.3) | 87.4 |
| SCRUB [28] | ✗ | 98.8 (0.3) | 60.7 (10.7) | 86.9 (0.7) | 45.2 (10.7) | 31.9 (4.3) | 41.7 (9.8) | 93.9 |
| MIU | ✗ | 100.0 (1.0) | 53.6 (8.3) | 86.1 (1.2) | 58.3 (7.1) | 28.3 (3.0) | 53.8 (7.5) | 95.3 |
| L1-sparse [23] | ✓ | 98.7 (0.2) | 64.3 (11.9) | 85.0 (2.2) | 46.4 (7.1) | 30.6 (2.3) | 53.7 (8.3) | 94.7 |
| SalUn [13] | ✓ | 100.0 (1.0) | 47.6 (16.7) | 81.1 (6.1) | 91.7 (38.1) | 39.0 (10.7) | 39.0 (12.5) | 85.8 |
| SCRUB [28] | ✓ | 98.9 (0.2) | 66.7 (11.9) | 87.0 (0.6) | 44.0 (9.5) | 30.9 (3.4) | 44.3 (7.3) | 94.5 |
| MIU | ✓ | 99.9 (0.9) | 54.8 (4.8) | 85.8 (1.4) | 59.5 (6.0) | 28.3 (1.8) | 53.7 (4.0) | 96.9 |

🔼 Table 2 presents a comparison of different machine unlearning methods on the Waterbirds dataset [42], focusing on their ability to handle non-uniformly distributed forget sets (where the data to be forgotten is not evenly distributed across different groups). The experiment involves removing 50% of the data from a single dominant group within the forget set. The table compares the performance of the proposed MIU method against three existing baselines: L1-sparse [23], SalUn [13], and SCRUB [28]. Evaluation metrics include retention accuracy (RA), unlearning accuracy (UA), test accuracy (TA), Membership Inference Attack efficacy (MIA), Equalized Odds (EO), and the accuracy of the dominant group in the forget set (GA). The ‘Avg. Gap’ metric summarizes the average performance difference compared to an ideal scenario (Retrain + Reweight). The results demonstrate MIU’s effectiveness in maintaining model performance, specifically for the dominant group, in the face of non-uniform data removal.

Table 2: Group-robust machine unlearning in Waterbirds [42]. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.5. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap and deltas are computed against Retrain + reweight. To avoid confusion, other reference models are in light gray.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 65.6 | 79.0 | 57.2 | 0.2 | 5.4 | 71.2 | - |
| Retrain | ✗ | 66.8 | 57.8 | 56.5 | 0.9 | 9.2 | 58.7 | - |
| Retrain | ✗ | 61.7 | 56.3 | 51.4 | 10.2 | 2.3 | 57.4 | - |
| Retrain | ✓ | 66.7 | 69.3 | 56.7 | 0.7 | 5.6 | 69.6 | - |
| L1-sparse [23] | ✗ | 64.0 (2.7) | 74.1 (4.8) | 56.9 (0.5) | 0.2 (0.7) | 6.1 (0.9) | 69.4 (0.4) | 98.3 |
| SalUn [13] | ✗ | 66.3 (0.3) | 66.6 (3.0) | 55.9 (0.8) | 0.3 (0.6) | 9.0 (3.4) | 60.3 (9.3) | 97.1 |
| SCRUB [28] | ✗ | 66.9 (0.3) | 65.4 (3.9) | 56.7 (0.6) | 1.0 (0.5) | 9.9 (4.3) | 61.3 (8.3) | 97.0 |
| MIU | ✗ | 66.7 (0.1) | 74.7 (5.4) | 57.2 (0.8) | 0.3 (0.5) | 6.0 (1.1) | 66.1 (3.5) | 98.1 |
| L1-sparse [23] | ✓ | 64.4 (2.3) | 72.9 (3.6) | 56.0 (0.9) | 0.3 (0.4) | 6.1 (2.0) | 67.0 (7.1) | 97.3 |
| SalUn [13] | ✓ | 65.1 (1.5) | 69.8 (5.6) | 54.8 (1.8) | 0.3 (0.4) | 6.6 (1.7) | 63.7 (5.9) | 97.2 |
| SCRUB [28] | ✓ | 66.7 (0.2) | 73.4 (4.1) | 57.2 (0.6) | 0.7 (0.3) | 6.2 (0.7) | 70.2 (1.8) | 98.7 |
| MIU | ✓ | 64.7 (1.9) | 71.6 (2.3) | 57.1 (0.5) | 0.3 (0.3) | 5.8 (1.5) | 70.3 (1.2) | 98.7 |

🔼 This table presents the results of a group-robust machine unlearning experiment on the FairFace dataset. The experiment aims to remove the influence of a specific group’s data from a pre-trained model while minimizing the impact on the accuracy for other groups. The forget set (data to be removed) was sampled from a single group, with a 50% unlearning ratio (half the data from that group was removed). The table compares a new method, MIU, with three existing approaches (L1-sparse, SalUn, and SCRUB) in terms of several metrics that measure unlearning effectiveness and fairness across different groups. The ‘Avg. Gap’ metric shows the average difference between the performance of each method and an optimal retraining model (Retrain + Reweight) that serves as a baseline for comparison. The use of light gray for some rows aids in distinguishing between MIU and other baselines.

Table 3: Group-robust machine unlearning in FairFace [24]. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.5. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap and deltas are computed against Retrain + reweight. To avoid confusion, other reference models are in light gray.
| Dataset | Eq. 5 | Eq. 3 | Eq. 4 | RW | UA | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|
| CelebA | ✓ | ✓ | ✗ | ✗ | 53.6 | 54.0 | 94.1 |
| | ✓ | ✓ | ✓ | ✗ | 55.3 | 55.9 | 94.9 |
| | ✗ | ✓ | ✓ | ✗ | 41.7 | 42.3 | 78.9 |
| | ✓ | ✓ | ✓ | ✓ | 68.8 | 69.0 | 99.2 |
| Waterbirds | ✓ | ✓ | ✗ | ✗ | 47.6 | 51.1 | 92.5 |
| | ✓ | ✓ | ✓ | ✗ | 53.6 | 53.8 | 95.3 |
| | ✗ | ✓ | ✓ | ✗ | 16.7 | 16.8 | 81.9 |
| | ✓ | ✓ | ✓ | ✓ | 54.8 | 53.7 | 96.9 |
| FairFace | ✓ | ✓ | ✗ | ✗ | 63.1 | 59.2 | 96.1 |
| | ✓ | ✓ | ✓ | ✗ | 74.7 | 66.1 | 98.1 |
| | ✗ | ✓ | ✓ | ✗ | 87.1 | 81.1 | 93.0 |
| | ✓ | ✓ | ✓ | ✓ | 71.6 | 70.3 | 98.7 |

🔼 This table presents an ablation study of the MIU model. It shows the impact of removing different components of the MIU model (retaining term, unlearning term, calibration term, and reweighting) on the model’s performance. The performance is measured using three metrics: Unlearning Accuracy (UA), Dominant Group Accuracy (GA), and Average Gap (Avg. Gap). The table helps to understand the contribution of each component to the overall effectiveness of the MIU model in achieving group robustness during machine unlearning. The row highlighted corresponds to the full MIU model with all components included.

Table 4: MIU ablations. We compute MIU ablations on each of the three investigated datasets. From left to right, we report the investigated dataset, the retaining term, the unlearning term, the calibration term, and reweight. We measure performance using UA, GA, and Avg. Gap. The configuration that corresponds to MIU + reweight is highlighted.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 83.8±0.0 | 67.9±0.5 | 82.7±0.1 | 0.1±0.1 | 20.8±1.0 | 69.0±0.9 | - |
| Retrain | ✗ | 83.8±0.1 | 62.0±1.6 | 82.5±0.1 | 0.2±0.1 | 21.9±1.3 | 64.0±1.9 | - |
| Retrain | ✓ | 83.7±0.2 | 63.2±1.6 | 82.7±0.2 | 0.2±0.0 | 21.2±0.5 | 65.8±1.7 | - |
| L1-sparse [23] | ✗ | 82.2±0.1 | 52.8±0.5 | 81.5±0.0 | 0.1±0.0 | 29.8±0.1 | 56.0±1.4 | 94.8±0.5 |
| SalUn [13] | ✗ | 83.1±0.8 | 69.2±8.5 | 81.8±0.9 | 0.1±0.1 | 23.8±1.8 | 69.1±8.0 | 96.9±2.0 |
| SCRUB [28] | ✗ | 84.4±0.0 | 64.9±0.6 | 82.9±0.2 | 0.1±0.0 | 22.0±0.3 | 65.0±0.1 | 99.1±0.2 |
| MIU | ✗ | 83.9±0.1 | 63.1±3.8 | 82.7±0.0 | 0.1±0.0 | 22.3±0.4 | 64.0±3.8 | 98.6±0.8 |
| L1-sparse [23] | ✓ | 82.2±0.2 | 60.7±2.0 | 81.5±0.2 | 0.1±0.0 | 27.4±0.6 | 63.4±2.2 | 97.3±0.8 |
| SalUn [13] | ✓ | 83.5±0.1 | 61.9±2.0 | 82.6±0.1 | 0.1±0.1 | 22.7±1.0 | 63.1±2.0 | 98.3±0.2 |
| SCRUB [28] | ✓ | 84.4±0.0 | 66.8±0.7 | 82.9±0.1 | 0.2±0.0 | 20.4±0.2 | 67.6±0.5 | 98.8±0.6 |
| MIU | ✓ | 83.8±0.2 | 67.2±5.6 | 82.4±0.2 | 0.1±0.1 | 20.9±0.6 | 67.8±4.3 | 98.8±1.2 |

🔼 This table presents the results of a group-robust machine unlearning experiment on the CelebA dataset [31], where only 10% of the data points from a single group are removed from the model. The experiment compares the performance of MIU to several baseline methods (L1-sparse [23], SalUn [13], SCRUB [28]) in terms of retain accuracy (RA), unlearning accuracy (UA), test accuracy (TA), membership inference attack (MIA) efficacy, equalized odds (EO), dominant group accuracy (GA), and an average gap (Avg. Gap) calculated relative to a retrained model with reweighting. The Avg. Gap metric helps quantify the overall performance difference between each method and the ideal case, providing a comprehensive evaluation of group-robustness after unlearning.

Table 5: Group-robust machine unlearning in CelebA [31] with 0.1 unlearning ratio. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.1. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap is computed against Retrain + reweight.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 84.6±0.1 | 66.6±1.8 | 82.8±0.0 | 0.1±0.0 | 20.9±0.7 | 67.3±2.4 | - |
| Retrain | ✗ | 84.7±0.4 | 54.6±4.1 | 82.3±0.1 | 0.1±0.1 | 31.6±0.9 | 56.9±4.3 | - |
| Retrain | ✓ | 84.5±0.1 | 66.6±2.4 | 82.4±0.1 | 0.3±0.1 | 20.8±0.4 | 67.8±1.6 | - |
| L1-sparse [23] | ✗ | 83.8±0.2 | 44.3±3.5 | 81.3±0.2 | 0.2±0.1 | 37.5±0.2 | 47.5±3.0 | 89.8±0.4 |
| SalUn [13] | ✗ | 84.9±0.2 | 47.6±4.1 | 82.4±0.2 | 0.1±0.1 | 31.6±0.7 | 48.8±3.9 | 91.7±0.8 |
| SCRUB [28] | ✗ | 82.1±2.3 | 40.6±4.8 | 79.8±2.4 | 0.4±0.1 | 40.1±2.7 | 42.6±5.4 | 87.4±3.6 |
| MIU | ✗ | 84.8±0.0 | 55.3±0.8 | 82.6±0.2 | 0.3±0.1 | 27.4±0.4 | 55.9±1.0 | 94.9±0.7 |
| L1-sparse [23] | ✓ | 83.4±0.1 | 60.6±1.8 | 81.3±0.2 | 0.2±0.0 | 27.9±0.4 | 62.0±2.1 | 96.5±1.2 |
| SalUn [13] | ✓ | 84.4±0.2 | 64.9±2.4 | 82.5±0.2 | 0.1±0.0 | 21.4±1.4 | 66.2±2.2 | 98.5±1.0 |
| SCRUB [28] | ✓ | 84.4±0.5 | 61.6±2.3 | 82.6±0.4 | 0.5±0.1 | 24.0±1.3 | 62.9±1.0 | 97.7±1.5 |
| MIU | ✓ | 84.2±0.1 | 68.8±0.3 | 82.5±0.1 | 0.1±0.0 | 20.2±0.6 | 69.0±1.2 | 99.2±0.6 |

🔼 Table 6 presents a comparison of different machine unlearning methods on the CelebA dataset. The experiment focuses on group robustness, specifically how well each method maintains accuracy for a specific group within the dataset when a portion of that group’s data is removed (unlearning ratio of 0.5). The forget set consists solely of data from a single group. The table compares the performance of MIU against three other methods: L1-SPARSE, SalUn, and SCRUB. The performance of each method is evaluated using multiple metrics. The ‘Avg. Gap’ metric indicates the average performance difference compared to an ideal scenario, which involves retraining the model after removing the forgotten data with a sample reweighting strategy.

Table 6: Group-robust machine unlearning in CelebA [31] with 0.5 unlearning ratio. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.5. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap is computed against Retrain + reweight.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 85.3±0.1 | 67.6±0.9 | 82.7±0.1 | 0.1±0.1 | 20.0±0.8 | 68.0±0.9 | - |
| Retrain | ✗ | 86.8±0.4 | 27.8±2.4 | 80.7±0.2 | 1.3±0.2 | 50.0±1.0 | 29.9±2.2 | - |
| Retrain | ✓ | 84.2±0.9 | 64.2±4.5 | 81.8±0.2 | 0.4±0.4 | 22.9±0.9 | 66.8±4.8 | - |
| L1-sparse [23] | ✗ | 86.9±0.1 | 15.6±1.5 | 79.6±0.2 | 10.0±2.6 | 54.6±0.8 | 17.0±1.1 | 75.9±1.9 |
| SalUn [13] | ✗ | 87.0±0.3 | 17.3±10.0 | 80.0±1.0 | 8.2±5.4 | 44.4±0.8 | 18.3±9.9 | 78.4±2.9 |
| SCRUB [28] | ✗ | 71.5±1.2 | 34.0±1.6 | 67.0±1.1 | 0.9±0.9 | 36.6±3.3 | 34.6±2.1 | 82.6±1.1 |
| MIU | ✗ | 87.3±0.1 | 31.6±2.0 | 81.3±0.2 | 1.9±0.4 | 38.3±0.2 | 32.8±1.8 | 85.5±1.1 |
| L1-sparse [23] | ✓ | 85.0±0.4 | 56.0±3.2 | 80.6±0.2 | 5.1±2.1 | 31.1±0.4 | 58.5±2.7 | 94.8±1.5 |
| SalUn [13] | ✓ | 85.1±1.0 | 59.7±11.6 | 81.4±0.6 | 1.0±0.8 | 23.5±2.2 | 60.6±10.8 | 94.3±1.0 |
| SCRUB [28] | ✓ | 64.2±5.2 | 66.3±6.2 | 62.8±4.4 | 0.2±0.1 | 16.0±10.7 | 66.9±7.4 | 87.8±1.3 |
| MIU | ✓ | 82.9±3.1 | 64.1±3.7 | 79.8±2.1 | 1.5±1.2 | 25.8±3.1 | 66.4±4.6 | 98.3±0.6 |

🔼 Table 7 presents the results of a group-robust machine unlearning experiment on the CelebA dataset [31] using an unlearning ratio of 0.9. In this experiment, the forget set (the data the model is trained to forget) was created by sampling data points from a single group within the dataset. The table compares the performance of the proposed method (MIU) against three existing machine unlearning methods: L1-sparse [23], SalUn [13], and SCRUB [28]. The comparison focuses on several metrics including retain accuracy (RA), forget accuracy (UA), test accuracy (TA), membership inference attack efficacy (MIA), equalized odds (EO), dominant group accuracy (GA), and the average gap (Avg. Gap) which measures the overall performance difference compared to the ideal scenario of retraining the model with the retain set and re-weighting the samples, which is considered the gold standard. The Avg. Gap is calculated relative to the Retrain + Reweight scenario.

Table 7: Group-robust machine unlearning in CelebA [31] with 0.9 unlearning ratio. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.9. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap is computed against Retrain + reweight.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 99.0±0.1 | 73.3±18.9 | 88.0±0.5 | 53.3±9.4 | 25.9±0.7 | 57.6±2.0 | - |
| Retrain | ✗ | 99.0±0.2 | 46.7±24.9 | 87.1±0.8 | 60.0±28.3 | 27.4±1.7 | 57.4±2.9 | - |
| Retrain | ✓ | 98.7±0.3 | 46.7±24.9 | 87.3±0.7 | 66.7±24.9 | 26.7±1.3 | 57.6±4.8 | - |
| L1-sparse [23] | ✗ | 99.0±0.1 | 60.0±16.3 | 85.3±0.8 | 46.7±24.9 | 29.2±1.7 | 57.9±2.6 | 92.3±3.0 |
| SalUn [13] | ✗ | 100.0±0.0 | 53.3±9.4 | 78.3±3.3 | 66.7±9.4 | 42.7±5.1 | 33.3±7.3 | 81.6±6.7 |
| SCRUB [28] | ✗ | 98.9±0.1 | 53.3±9.4 | 86.9±0.4 | 53.3±18.9 | 31.1±1.1 | 44.3±3.5 | 91.3±2.0 |
| MIU | ✗ | 99.1±0.0 | 60.0±16.3 | 87.1±0.2 | 33.3±18.9 | 26.1±2.1 | 59.2±5.3 | 91.2±9.1 |
| L1-sparse [23] | ✓ | 99.0±0.1 | 73.3±18.9 | 85.3±1.2 | 53.3±24.9 | 29.1±2.8 | 58.2±3.8 | 92.2±3.0 |
| SalUn [13] | ✓ | 99.9±0.1 | 53.3±9.4 | 76.7±2.9 | 86.7±9.4 | 45.9±4.7 | 29.6±7.4 | 81.3±2.9 |
| SCRUB [28] | ✓ | 98.8±0.2 | 53.3±9.4 | 86.8±0.7 | 60.0±16.3 | 31.5±0.8 | 42.8±3.5 | 89.8±0.4 |
| MIU | ✓ | 100.0±0.0 | 73.3±18.9 | 87.3±0.3 | 73.3±24.9 | 26.2±0.3 | 61.7±0.8 | 93.3±0.7 |

🔼 This table presents the results of a group-robust machine unlearning experiment on the Waterbirds dataset. The experiment focused on a scenario where the unlearning ratio is 0.1 (meaning only 10% of the data points from the dominant group in the forget set are unlearned). The table compares the performance of the proposed MIU method to three existing unlearning methods (L1-sparse, SalUn, and SCRUB). Performance is evaluated using various metrics, including retention accuracy (RA), unlearning accuracy (UA), test accuracy (TA), membership inference attack efficacy (MIA), equalized odds (EO), and the forget-set dominant group accuracy (GA). The Avg. Gap metric summarizes the overall discrepancy of each method with a retraining baseline that uses reweighting, indicating how well each method approximates this optimal result. The results are presented to illustrate MIU’s effectiveness compared to existing methods for robust unlearning in scenarios with non-uniformly distributed data.

Table 8: Group-robust machine unlearning in Waterbirds [42] with 0.1 unlearning ratio. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.1. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap is computed against Retrain + reweight.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 98.9±0.3 | 84.5±1.7 | 87.7±0.5 | 33.3±6.1 | 26.2±1.9 | 56.6±6.0 | - |
| Retrain | ✗ | 98.7±0.3 | 52.4±8.9 | 86.5±0.2 | 54.8±9.4 | 30.4±0.5 | 49.4±1.6 | - |
| Retrain | ✓ | 99.0±0.1 | 59.5±11.8 | 87.2±0.3 | 53.6±8.7 | 28.3±2.0 | 51.6±6.0 | - |
| L1-sparse [23] | ✗ | 99.0±0.1 | 59.5±8.9 | 85.6±0.4 | 44.0±11.8 | 32.2±1.8 | 48.8±7.4 | 94.4±0.3 |
| SalUn [13] | ✗ | 100.0±0.0 | 50.0±5.1 | 81.8±0.4 | 90.5±3.4 | 38.7±1.2 | 39.3±3.3 | 87.4±3.5 |
| SCRUB [28] | ✗ | 98.8±0.2 | 60.7±7.7 | 86.9±0.6 | 45.2±8.9 | 31.9±1.8 | 41.7±1.7 | 93.9±1.4 |
| MIU | ✗ | 100.0±0.0 | 53.6±7.7 | 86.1±1.0 | 58.3±8.9 | 28.3±1.7 | 53.8±2.6 | 95.3±0.7 |
| L1-sparse [23] | ✓ | 98.7±0.1 | 64.3±5.8 | 85.0±1.2 | 46.4±12.7 | 30.6±1.4 | 53.7±4.3 | 94.7±1.1 |
| SalUn [13] | ✓ | 100.0±0.0 | 47.6±4.5 | 81.1±1.9 | 91.7±7.3 | 39.0±2.2 | 39.0±1.6 | 85.8±4.2 |
| SCRUB [28] | ✓ | 98.9±0.2 | 66.7±1.7 | 87.0±0.5 | 44.0±8.9 | 30.9±1.3 | 44.3±3.1 | 94.5±1.3 |
| MIU | ✓ | 99.9±0.1 | 54.8±14.7 | 85.8±0.7 | 59.5±12.1 | 28.3±2.9 | 53.7±3.8 | 96.9±1.6 |

🔼 This table presents the results of a group-robust machine unlearning experiment on the Waterbirds dataset. The experiment involves removing a specific portion (0.5 unlearning ratio) of data points from a single group within the dataset, simulating a non-uniformly distributed forget set. Multiple methods, including MIU, L1-sparse, SalUn, and SCRUB, are compared in terms of their ability to successfully unlearn the targeted data while maintaining overall model accuracy and group-level accuracy (especially for the affected group). The performance of each method is assessed using various metrics, including retain accuracy (RA), forget accuracy (UA), test accuracy (TA), membership inference attack efficacy (MIA), equalized odds (EO), and the accuracy of the dominant group in the forget set (GA). The average gap (Avg. Gap) metric represents the average difference between a method’s performance and the performance of a retrained model (using the retain set and a reweighting strategy), serving as a baseline for ideal unlearning. The Avg. Gap is calculated relative to the Retrain + Reweight method to ensure a fair comparison.

Table 9: Group-robust machine unlearning in Waterbirds [42] with 0.5 unlearning ratio. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.5. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap is computed against Retrain + reweight.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 98.6±0.6 | 76.0±9.1 | 86.5±0.4 | 44.7±4.1 | 28.3±1.9 | 55.9±5.2 | - |
| Retrain | ✗ | 98.9±0.2 | 41.3±5.7 | 84.3±0.3 | 68.7±6.8 | 36.4±1.4 | 41.7±3.9 | - |
| Retrain | ✓ | 98.9±0.1 | 41.3±2.5 | 85.7±0.2 | 62.7±3.4 | 33.5±1.3 | 43.0±2.9 | - |
| L1-sparse [23] | ✗ | 98.9±0.2 | 60.0±3.3 | 82.9±1.2 | 50.7±5.0 | 35.3±0.4 | 49.9±4.6 | 92.9±1.0 |
| SalUn [13] | ✗ | 100.0±0.0 | 40.0±4.3 | 81.3±0.9 | 92.7±3.4 | 41.4±1.3 | 30.8±2.2 | 89.6±2.0 |
| SCRUB [28] | ✗ | 97.8±0.1 | 30.7±1.9 | 86.1±0.5 | 52.7±3.4 | 36.6±1.0 | 25.1±1.5 | 92.7±0.7 |
| MIU | ✗ | 100.0±0.0 | 66.7±5.7 | 85.7±0.7 | 58.7±4.7 | 32.1±1.5 | 49.8±3.3 | 93.4±1.3 |
| L1-sparse [23] | ✓ | 99.0±0.2 | 59.3±10.6 | 84.6±0.6 | 45.3±6.8 | 31.5±0.7 | 55.0±4.0 | 91.5±4.1 |
| SalUn [13] | ✓ | 100.0±0.0 | 45.3±2.5 | 80.3±0.7 | 87.3±1.9 | 41.8±0.5 | 31.9±4.8 | 90.9±1.0 |
| SCRUB [28] | ✓ | 98.0±0.1 | 33.3±3.4 | 86.2±0.7 | 54.7±5.2 | 35.9±1.5 | 28.0±3.4 | 93.6±1.2 |
| MIU | ✓ | 98.9±0.2 | 44.7±3.4 | 83.1±1.3 | 65.3±3.4 | 35.7±2.2 | 45.0±1.7 | 97.2±0.3 |

🔼 This table presents the results of a group-robust machine unlearning experiment on the Waterbirds dataset, where 90% of data points from a single group were removed. The experiment compares the performance of MIU against three other state-of-the-art machine unlearning techniques (L1-sparse, SalUn, and SCRUB). The performance metrics used include retain accuracy (RA), forget accuracy (UA), test accuracy (TA), membership inference attack (MIA), equalized odds (EO), dominant group accuracy in the forget set (GA), and the average gap (Avg. Gap). The Avg. Gap is calculated relative to the performance of a model retrained after reweighting the data samples.

Table 10: Group-robust machine unlearning in Waterbirds [42] with 0.9 unlearning ratio. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.9. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap is computed against Retrain + reweight.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 66.2±0.7 | 79.2±2.5 | 57.2±0.1 | 0.7±0.2 | 5.8±0.2 | 71.6±2.1 | - |
| Retrain | ✗ | 67.3±0.1 | 71.7±0.8 | 57.0±0.4 | 1.0±0.8 | 5.4±1.1 | 69.0±2.8 | - |
| Retrain | ✓ | 66.8±0.1 | 72.0±1.7 | 56.8±0.4 | 0.9±0.5 | 4.3±0.6 | 71.1±0.6 | - |
| L1-sparse [23] | ✗ | 63.7±0.3 | 78.9±3.5 | 56.1±0.8 | 0.0±0.0 | 5.5±2.6 | 69.7±2.2 | 97.3±0.7 |
| SalUn [13] | ✗ | 65.9±0.8 | 73.9±3.9 | 55.1±1.1 | 0.5±0.0 | 2.9±1.1 | 69.8±7.0 | 97.8±0.9 |
| SCRUB [28] | ✗ | 68.4±0.5 | 78.7±0.8 | 57.5±0.3 | 0.2±0.2 | 5.7±0.2 | 70.4±1.5 | 97.9±0.5 |
| MIU | ✗ | 66.9±0.5 | 81.3±0.2 | 57.3±0.2 | 0.2±0.2 | 5.3±0.6 | 70.4±0.5 | 97.8±0.5 |
| L1-sparse [23] | ✓ | 64.0±0.3 | 72.7±0.7 | 56.4±0.6 | 0.2±0.2 | 5.3±1.2 | 69.1±0.7 | 98.6±0.3 |
| SalUn [13] | ✓ | 66.2±0.4 | 80.1±1.5 | 55.3±0.4 | 0.2±0.2 | 4.7±0.9 | 73.3±4.6 | 97.1±0.3 |
| SCRUB [28] | ✓ | 68.4±0.5 | 79.2±1.0 | 57.5±0.4 | 0.2±0.2 | 5.6±1.1 | 70.9±1.7 | 97.8±0.6 |
| MIU | ✓ | 67.4±0.5 | 82.3±1.3 | 57.6±0.3 | 0.0±0.0 | 6.0±0.7 | 71.2±0.5 | 97.6±0.8 |

🔼 This table presents the results of a group-robust machine unlearning experiment on the FairFace dataset [24], using an unlearning ratio of 0.1. The experiment focuses on scenarios where the data to be unlearned (forget set) is not uniformly distributed across all groups, but rather concentrated in a single dominant group. The table compares the performance of the proposed MIU algorithm with three existing machine unlearning methods: L1-sparse [23], SalUn [13], and SCRUB [28]. Performance is evaluated across several metrics, including retain accuracy (RA), forget accuracy (UA), test accuracy (TA), membership inference attack (MIA) effectiveness, equalized odds (EO), and accuracy of the dominant group in the forget set (GA). The Avg. Gap metric summarizes the overall performance difference compared to a baseline established by retraining the model with the reweighted retain set (Retrain + Reweight).

Table 11: Group-robust machine unlearning in FairFace [24] with 0.1 unlearning ratio. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.1. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap is computed against Retrain + reweight.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 65.6±0.7 | 79.0±1.2 | 57.2±0.4 | 0.2±0.1 | 5.4±1.7 | 71.2±2.4 | - |
| Retrain | ✗ | 66.8±0.4 | 57.8±3.3 | 56.5±0.1 | 0.9±0.2 | 9.2±0.9 | 58.7±3.0 | - |
| Retrain | ✓ | 66.7±0.2 | 69.3±0.5 | 56.7±0.2 | 0.7±0.5 | 5.6±1.5 | 69.6±0.7 | - |
| L1-sparse [23] | ✗ | 64.0±0.3 | 74.1±1.2 | 56.9±0.5 | 0.2±0.1 | 6.1±0.7 | 69.4±0.7 | 98.3±0.2 |
| SalUn [13] | ✗ | 66.3±0.4 | 66.6±3.4 | 55.9±0.6 | 0.3±0.1 | 9.0±0.5 | 60.3±2.4 | 97.1±1.1 |
| SCRUB [28] | ✗ | 66.9±0.1 | 65.4±1.6 | 56.7±0.7 | 1.0±0.0 | 9.9±1.3 | 61.3±2.5 | 97.0±0.5 |
| MIU | ✗ | 66.7±0.2 | 74.7±1.2 | 57.2±0.7 | 0.3±0.0 | 6.0±2.0 | 66.1±4.4 | 98.1±0.4 |
| L1-sparse [23] | ✓ | 64.4±0.1 | 72.9±2.1 | 56.0±0.9 | 0.3±0.1 | 6.1±2.1 | 67.0±6.8 | 97.3±0.3 |
| SalUn [13] | ✓ | 65.1±0.4 | 69.8±6.3 | 54.8±0.6 | 0.3±0.2 | 6.6±2.1 | 63.7±3.4 | 97.2±0.4 |
| SCRUB [28] | ✓ | 66.7±0.1 | 73.4±2.2 | 57.2±0.5 | 0.7±0.3 | 6.2±1.1 | 70.2±2.7 | 98.7±0.7 |
| MIU | ✓ | 64.7±0.3 | 71.6±2.8 | 57.1±0.3 | 0.3±0.2 | 5.8±0.4 | 70.3±1.6 | 98.7±0.8 |

🔼 Table 12 presents the results of a group-robust machine unlearning experiment on the FairFace dataset. The experiment involved removing 50% of the data points from a single group within the training dataset (forget set). The table compares the performance of MIU (Mutual Information-Aware Machine Unlearning) against three other established machine unlearning methods: L1-sparse, SalUn, and SCRUB. Performance is evaluated across multiple metrics, including retain accuracy, forget accuracy, test accuracy, membership inference attack efficacy, equalized odds, dominant group accuracy within the forget set and an average gap calculated from the differences between the algorithms and a retraining baseline that uses a reweighting strategy. The table shows that MIU provides improvements in several metrics, particularly in maintaining the accuracy of the dominant group within the forget set.

Table 12: Group-robust machine unlearning in FairFace [24] with 0.5 unlearning ratio. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.5. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap is computed against Retrain + reweight.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 66.0±0.0 | 77.5±2.1 | 56.6±0.4 | 0.2±0.0 | 5.5±1.1 | 69.0±2.8 | - |
| Retrain | ✗ | 67.3±0.6 | 38.5±1.8 | 56.0±0.4 | 2.7±0.3 | 23.1±1.5 | 37.1±2.1 | - |
| Retrain | ✓ | 67.1±0.4 | 53.6±1.2 | 56.6±0.4 | 1.8±0.1 | 11.9±1.0 | 53.9±0.7 | - |
| L1-sparse [23] | ✗ | 64.5±0.2 | 57.1±2.1 | 55.2±0.6 | 0.4±0.1 | 13.0±0.7 | 51.0±1.9 | 97.7±0.3 |
| SalUn [13] | ✗ | 65.7±0.5 | 46.5±6.2 | 53.9±0.1 | 0.5±0.1 | 15.3±1.0 | 42.8±4.9 | 95.2±1.8 |
| SCRUB [28] | ✗ | 60.2±1.0 | 52.7±4.4 | 53.3±0.5 | 2.4±0.7 | 15.6±1.4 | 48.7±4.9 | 95.7±0.6 |
| MIU | ✗ | 68.2±0.3 | 64.4±2.5 | 56.5±0.5 | 0.5±0.1 | 10.4±0.7 | 56.8±2.6 | 97.0±1.0 |
| L1-sparse [23] | ✓ | 64.0±0.4 | 74.5±3.0 | 55.9±0.4 | 0.3±0.2 | 5.6±0.6 | 69.8±4.6 | 91.9±1.5 |
| SalUn [13] | ✓ | 65.5±0.5 | 66.1±3.7 | 55.3±0.2 | 0.5±0.3 | 7.2±1.2 | 62.8±4.9 | 94.9±1.7 |
| SCRUB [28] | ✓ | 61.2±1.1 | 65.5±3.5 | 54.5±0.3 | 1.3±0.1 | 9.5±0.9 | 64.4±2.7 | 94.5±1.6 |
| MIU | ✓ | 64.7±0.2 | 67.1±1.4 | 56.7±0.2 | 0.5±0.1 | 8.8±0.2 | 63.5±1.6 | 94.9±0.6 |

🔼 This table presents the results of a group-robust machine unlearning experiment conducted on the FairFace dataset [24] using an unlearning ratio of 0.9. The experiment focused on removing the influence of a single, dominant group from the training data. The table compares the performance of MIU (Mutual Information-Aware Machine Unlearning) against three other baseline unlearning methods: L1-sparse [23], SalUn [13], and SCRUB [28]. Performance is measured across multiple metrics including retain accuracy (RA), unlearning accuracy (UA), test accuracy (TA), membership inference attack efficacy (MIA), equalized odds (EO), dominant group accuracy (GA), and the average gap (Avg. Gap) compared to a retrained model using a reweighting technique. The Avg. Gap provides a summary of the overall performance differences across all the metrics compared to the reweighted retraining baseline.

Table 13: Group-robust machine unlearning in FairFace [24] with 0.9 unlearning ratio. We build the forget set by sampling data points from a single group. The unlearning ratio is set to 0.9. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap is computed against Retrain + reweight.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 64.6±0.5 | 81.5±0.5 | 57.4±0.3 | 0.4±0.0 | 1.1±0.1 | 72.5±0.7 | - |
| Retrain | ✗ | 66.5±0.5 | 60.1±0.9 | 55.4±0.4 | 1.1±0.1 | 5.6±0.4 | 60.4±1.7 | - |
| Retrain | ✓ | 64.4±1.1 | 72.1±2.2 | 56.3±0.3 | 0.5±0.0 | 2.0±0.6 | 71.8±2.7 | - |
| L1-sparse [23] | ✗ | 63.5±0.6 | 69.9±3.9 | 55.3±0.1 | 0.5±0.0 | 2.6±1.0 | 64.4±4.4 | 97.7±1.1 |
| SalUn [13] | ✗ | 64.3±0.5 | 64.6±1.2 | 54.0±0.1 | 0.2±0.1 | 3.6±0.3 | 59.7±1.5 | 96.0±0.6 |
| SCRUB [28] | ✗ | 67.2±0.4 | 74.3±0.7 | 56.9±0.2 | 0.3±0.1 | 1.8±0.7 | 65.6±0.6 | 97.7±0.3 |
| MIU | ✗ | 66.3±0.4 | 74.2±0.4 | 56.8±0.5 | 0.3±0.1 | 1.7±0.4 | 65.7±0.6 | 97.9±0.4 |
| L1-sparse [23] | ✓ | 63.7±0.1 | 75.2±0.6 | 56.3±0.1 | 0.2±0.1 | 1.3±0.3 | 69.6±1.2 | 98.6±0.2 |
| SalUn [13] | ✓ | 63.7±0.8 | 74.4±1.5 | 55.5±0.4 | 0.4±0.0 | 2.3±0.3 | 69.0±1.6 | 98.4±0.2 |
| SCRUB [28] | ✓ | 66.7±0.4 | 80.5±0.1 | 57.4±0.5 | 0.4±0.1 | 1.7±0.3 | 71.3±0.4 | 97.4±0.3 |
| MIU | ✓ | 63.4±0.4 | 73.2±0.3 | 56.7±0.3 | 0.4±0.1 | 1.5±0.5 | 70.3±1.0 | 99.0±0.1 |

🔼 This table presents the results of a group-robust machine unlearning experiment on the FairFace dataset. The experiment involved sampling data points for the forget set from 9 different groups, with an unlearning ratio of 0.5. The table compares the performance of the proposed MIU method with three existing approximate machine unlearning algorithms (L1-SPARSE, SalUn, and SCRUB). The performance is evaluated across several metrics, including group accuracy, overall accuracy, and the gap between the achieved performance and the performance of an ideal model (Retrain + Reweight). The Avg. Gap metric helps to quantify how well each algorithm achieves the goal of effective unlearning while preserving the model’s robustness.

Table 14: Group-robust machine unlearning in FairFace [24] by sampling from 9 groups. We build the forget set by sampling data points from 9 groups. The unlearning ratio is set to 0.5. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap is computed against Retrain + reweight.
| Method | reweight | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|
| Pretrain | ✗ | 65.1±0.3 | 70.6±0.5 | 56.9±0.3 | 0.3±0.1 | 1.6±0.6 | 62.2±1.4 | - |
| Retrain | ✗ | 66.5±1.0 | 55.5±2.1 | 54.8±0.7 | 0.8±0.1 | 1.9±0.3 | 56.0±2.1 | - |
| Retrain | ✓ | 66.1±0.6 | 60.5±0.7 | 55.5±0.4 | 0.7±0.1 | 2.3±0.3 | 60.5±0.5 | - |
| L1-sparse [23] | ✗ | 65.5±0.8 | 61.8±0.3 | 55.5±0.5 | 0.5±0.1 | 1.6±0.8 | 57.0±0.9 | 98.7±0.2 |
| SalUn [13] | ✗ | 64.5±0.5 | 60.0±3.8 | 55.2±1.0 | 0.5±0.1 | 1.1±0.4 | 57.0±3.8 | 98.0±0.7 |
| SCRUB [28] | ✗ | 66.7±0.6 | 62.4±1.0 | 55.4±0.3 | 0.3±0.1 | 1.7±0.9 | 55.2±0.6 | 98.4±0.2 |
| MIU | ✗ | 66.8±0.2 | 64.8±0.5 | 56.4±0.5 | 0.3±0.1 | 1.7±0.7 | 57.5±1.1 | 98.3±0.2 |
| L1-sparse [23] | ✓ | 64.3±0.7 | 66.7±0.4 | 55.8±0.4 | 0.4±0.1 | 2.1±0.8 | 60.8±1.0 | 98.3±0.3 |
| SalUn [13] | ✓ | 62.5±1.6 | 64.6±2.0 | 54.6±0.8 | 0.3±0.1 | 2.2±0.4 | 60.1±2.0 | 98.0±0.3 |
| SCRUB [28] | ✓ | 65.3±0.3 | 71.4±0.6 | 57.0±0.3 | 0.2±0.0 | 1.9±0.9 | 63.7±0.7 | 97.1±0.1 |
| MIU | ✓ | 66.2±1.6 | 63.0±1.9 | 54.3±0.6 | 0.9±0.3 | 3.5±0.9 | 57.6±3.3 | 98.3±0.1 |

🔼 This table presents the results of a group-robust machine unlearning experiment on the FairFace dataset. The experiment involved removing data points from 25 different groups within the training data (the ‘forget set’) while attempting to preserve the model’s accuracy on the remaining data. The table compares the performance of MIU (Mutual Information-Aware Machine Unlearning) against three other machine unlearning methods: L1-sparse, SalUn, and SCRUB. Performance is evaluated across multiple metrics, including retain accuracy, forget accuracy, test accuracy, membership inference attack efficacy, equalized odds, dominant group accuracy of the forget set and an aggregate gap score comparing the methods to a retraining-based gold standard. The ‘Avg. Gap’ is a composite metric summarizing the differences between each method and a baseline model trained only on the retained data, while applying a reweighting strategy to improve group robustness.

Table 15: Group-robust machine unlearning in FairFace [24] by sampling from 25 groups. We build the forget set by sampling data points from 25 groups. The unlearning ratio is set to 0.5. We compare MIU against L1-sparse [23], SalUn [13], and SCRUB [28]. The Avg. Gap is computed against Retrain + reweight.
| Method | DP | EP | EO | WG |
|---|---|---|---|---|
| **CelebA [31]** | | | | |
| Pretrain | 44.1 | 21.3 | 20.9 | 65.5 |
| Retrain | 52.3 (8.2) | 33.9 (12.6) | 31.5 (10.6) | 55.6 (-9.9) |
| Retrain+rw | 43.7 (-0.4) | 21.0 (-0.3) | 20.8 (-0.1) | 66.7 (1.2) |
| MIU | 43.4 (-0.7) | 19.9 (-1.4) | 20.2 (-0.7) | 67.8 (2.3) |
| **Waterbirds [42]** | | | | |
| Pretrain | 20.6 | 36.1 | 26.1 | 56.6 |
| Retrain | 23.2 (2.6) | 43.4 (7.3) | 30.4 (4.3) | 49.4 (-7.2) |
| Retrain+rw | 21.5 (0.9) | 40.6 (4.5) | 28.3 (2.2) | 51.6 (-5.0) |
| MIU | 22.9 (2.3) | 38.1 (2.0) | 28.3 (2.2) | 53.7 (-2.9) |
| **FairFace [24]** | | | | |
| Pretrain | 2.0 | 7.6 | 5.4 | 9.4 |
| Retrain | 5.3 (3.3) | 17.9 (10.3) | 9.2 (3.8) | 8.3 (-1.1) |
| Retrain+rw | 3.8 (1.8) | 7.6 (0.0) | 5.6 (0.2) | 6.1 (-3.3) |
| MIU | 1.1 (-0.9) | 8.0 (0.4) | 5.8 (0.4) | 16.3 (6.9) |

🔼 This table presents a detailed fairness analysis of different machine unlearning methods across three datasets (CelebA, Waterbirds, and FairFace). It compares the performance of the methods against a gold standard (RETRAIN + REWEIGHT) using four fairness metrics: Demographic Parity (DP), Equal Opportunity (EP), Equalized Odds (EO), and Worst Group Accuracy (WG). Lower DP and EP values indicate better fairness, while higher WG indicates better robustness. The table shows the differences in these metrics between the baseline methods (PRETRAIN, RETRAIN, L1-SPARSE, SALUN, and SCRUB) and the proposed method (MIU), both with and without a reweighting strategy. This allows for a comprehensive assessment of the fairness implications of different unlearning techniques.

Table 16: Additional Fairness Metrics. Fairness metrics are computed on each of the three investigated datasets (using the same splits as Tabs. 1, 2 and 3). From left to right, we report the method, DP, EP, EO, and WG. MIU + reweight is highlighted.
| Dataset | Eq. 5 | Eq. 3 | Eq. 4 | RW | RA | UA | TA | MIA | EO | GA | Avg. Gap ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CelebA [31] | ✓ | ✓ | ✗ | ✗ | 85.1±0.0 | 53.6±0.7 | 82.6±0.1 | 0.2±0.0 | 28.6±0.1 | 54.0±1.0 | 94.1±0.6 |
| | ✓ | ✓ | ✓ | ✗ | 84.8±0.0 | 55.3±0.8 | 82.6±0.2 | 0.3±0.1 | 27.4±0.4 | 55.9±1.0 | 94.9±0.7 |
| | ✗ | ✓ | ✓ | ✗ | 63.1±7.1 | 41.7±34.2 | 62.1±7.0 | 0.0±0.0 | 10.4±5.4 | 42.3±34.1 | 78.9±8.8 |
| | ✓ | ✓ | ✓ | ✓ | 84.2±0.1 | 68.8±0.3 | 82.5±0.1 | 0.1±0.0 | 20.2±0.6 | 69.0±1.2 | 99.2±0.6 |
| Waterbirds [42] | ✓ | ✓ | ✗ | ✗ | 100.0±0.0 | 47.6±7.3 | 85.0±0.6 | 73.8±3.4 | 32.3±0.8 | 51.1±0.6 | 92.5±4.6 |
| | ✓ | ✓ | ✓ | ✗ | 100.0±0.0 | 53.6±7.7 | 86.1±1.0 | 58.3±8.9 | 28.3±1.7 | 53.8±2.6 | 95.3±0.7 |
| | ✗ | ✓ | ✓ | ✗ | 93.0±3.3 | 16.7±9.4 | 80.3±2.9 | 64.3±12.7 | 35.8±7.8 | 16.8±8.9 | 81.9±5.5 |
| | ✓ | ✓ | ✓ | ✓ | 99.9±0.1 | 54.8±14.7 | 85.8±0.7 | 59.5±12.1 | 28.3±2.9 | 53.7±3.8 | 96.9±1.6 |
| FairFace [24] | ✓ | ✓ | ✗ | ✗ | 65.2±0.1 | 63.1±1.6 | 56.9±0.3 | 0.3±0.0 | 10.3±0.8 | 59.2±1.9 | 96.1±0.8 |
| | ✓ | ✓ | ✓ | ✗ | 66.7±0.2 | 74.7±1.2 | 57.2±0.7 | 0.3±0.0 | 6.0±2.0 | 66.1±4.4 | 98.1±0.4 |
| | ✗ | ✓ | ✓ | ✗ | 59.1±3.1 | 87.1±6.8 | 54.5±1.8 | 0.0±0.0 | 3.1±0.5 | 81.1±6.2 | 93.0±3.1 |
| | ✓ | ✓ | ✓ | ✓ | 64.7±0.3 | 71.6±2.8 | 57.1±0.3 | 0.3±0.2 | 5.8±0.4 | 70.3±1.6 | 98.7±0.8 |

🔼 This table presents an ablation study of the MIU model, showing the impact of each component on the overall performance. The experiments were conducted on three datasets: CelebA, Waterbirds, and FairFace. For each dataset, several model configurations were tested, systematically removing or adding components like the retaining term, unlearning term, calibration term, and sample reweighting. The results are evaluated using multiple metrics to provide a comprehensive assessment of the impact of each MIU component.

Table 17: MIU ablations. We compute MIU ablations on each of the three investigated datasets. From left to right, we report the investigated dataset, the retaining term, the unlearning term, the calibration term, and reweight. We measure performance using all metrics. The configuration that corresponds to MIU + reweight is highlighted.
