TL;DR#
Large language models (LLMs) are powerful but prone to misuse, generating unsafe or biased content. Existing safety mechanisms often fall short, lacking comprehensive coverage of various risks. This necessitates advanced detection models that ensure safe deployment.
This paper introduces Granite Guardian, a family of open-source risk detection models designed to address these challenges. These models provide a unified approach, covering both traditional safety concerns and novel risks specific to retrieval-augmented generation (RAG). Trained on a diverse dataset combining human-annotated and synthetic data, Granite Guardian achieves state-of-the-art performance on multiple benchmarks, demonstrating its effectiveness and generalizability. By offering these models to the community, the researchers aim to advance the field of responsible AI development.
Why does it matter?#
This paper is crucial for researchers working on large language model (LLM) safety. It introduces Granite Guardian, a novel, open-source model family that outperforms existing models in risk detection and addresses the limitations of traditional methods. This work opens avenues for creating safer LLMs and drives the development of more robust risk mitigation strategies. The unified approach and publicly available benchmarks offer significant advantages to the community, accelerating progress in responsible AI.
Visual Insights#
🔼 This figure details the guidelines used by annotators to label text samples as either ‘safe’ or ‘unsafe’. It outlines criteria for evaluating both prompts and responses based on categories such as bias, jailbreaking, violence, profanity, sexual content, unethical behavior, AI refusal, and other miscellaneous unsafe content. Specific examples are provided within each category to clarify the guidelines.
Figure 1: Annotation guidelines
Risk | Prompt | Response |
---|---|---|
harm | ✓ | ✓ |
social-bias | ✓ | ✓ |
profanity | ✓ | ✓ |
sexual content | ✓ | ✓ |
unethical behavior | ✓ | ✓ |
violence | ✓ | ✓ |
jailbreaking | ✓ | |
context relevance (RAG) | ✓ | |
groundedness (RAG) | ✓ | |
answer relevance (RAG) | ✓ |
🔼 This table lists the various risks that the Granite Guardian model is designed to detect. For each risk, it indicates whether the risk is typically found in the prompt (user input) or the response (model output), or both. This helps clarify where in the LLM interaction the model should focus its attention for risk detection.
Table 1: Risks detected by Granite Guardian and the corresponding content sections (Prompt or Response) where they may occur.
In-depth insights#
Unified Risk Model#
A unified risk model for large language models (LLMs) is a crucial development for responsible AI. Such a model moves beyond addressing individual risks in isolation (e.g., toxicity, bias, jailbreaking) and instead holistically assesses multiple dimensions simultaneously. This holistic approach is essential because risks often intersect and exacerbate each other. For instance, a seemingly innocuous prompt can be weaponized through jailbreaking techniques to generate toxic or biased outputs. A unified model facilitates the development of more comprehensive and adaptable safety mechanisms. It enables a nuanced risk evaluation capable of detecting subtle and complex interactions between various risks, improving the accuracy and effectiveness of risk mitigation strategies. A key advantage lies in the potential for streamlining existing systems. Instead of deploying multiple independent risk detection modules, a unified model simplifies integration, reducing computational overhead and improving efficiency. The success of a unified model hinges on its ability to incorporate a broad spectrum of risks within a well-defined taxonomy. This taxonomy needs to be both comprehensive and granular, allowing for precise identification and categorization of potential harms. Thorough evaluation and testing against diverse datasets are vital to ensure the model’s generalizability and robustness across different application contexts.
Synthetic Data#
The use of synthetic data in the research paper is a crucial aspect that warrants in-depth analysis. Synthetic data generation plays a vital role in addressing the limitations of real-world datasets, which may be insufficient, costly to obtain, or contain biases. The paper highlights how this approach enhances the model’s ability to detect risks that are typically overlooked in traditional models, such as jailbreaks and RAG-specific issues. The creation of synthetic data involves carefully crafted prompts and well-organized taxonomies to generate diverse samples at scale. This systematic approach ensures the model can handle both benign and harmful prompts effectively. The use of synthetic data also helps to address adversarial attacks, such as jailbreaks, by augmenting the training data with sophisticated prompts designed to bypass standard safeguards. The specific details on the methods and techniques utilized in generating synthetic data, coupled with the extensive evaluations performed to demonstrate the efficacy of the model, showcases the importance of synthetic data in advancing responsible AI development. The quality of the synthetic data and the careful design of the generation process are instrumental in ensuring the model’s robustness and generalizability.
Benchmarking#
A robust benchmarking strategy is crucial for evaluating the effectiveness of large language models (LLMs). The choice of benchmarks should reflect the intended application of the model, encompassing diverse tasks and datasets to provide a holistic assessment. The selection of metrics is equally important. Beyond simple accuracy, metrics that capture various dimensions of performance (precision, recall, F1-score, AUC, AUPRC) are needed to expose strengths and weaknesses across different aspects of the LLM’s capabilities. Furthermore, a comparative analysis against established baselines provides essential context for interpreting the results and understanding the model’s relative performance within the current landscape of LLMs. Finally, transparency and reproducibility are paramount. Openly sharing datasets, model architectures, and evaluation protocols allows the research community to replicate experiments, enhancing trust and enabling further advancements in the field.
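As a concrete illustration of these metrics (not taken from the paper), the sketch below scores a binary risk detector with scikit-learn; the labels, scores, and 0.5 decision threshold are placeholder assumptions.

```python
import numpy as np
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    precision_recall_fscore_support,
)

# Placeholder ground-truth labels (1 = unsafe) and detector scores in [0, 1].
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.91, 0.12, 0.67, 0.45, 0.30, 0.05, 0.88, 0.52])

# Threshold-free metrics: AUC summarises ranking quality, while AUPRC
# emphasises performance on the (often rare) unsafe class.
auc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)

# Thresholded metrics at a fixed operating point (0.5 here, as an assumption).
y_pred = (y_score >= 0.5).astype(int)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

print(f"AUC={auc:.3f} AUPRC={auprc:.3f} "
      f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```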
RAG Hallucination#
Retrieval Augmented Generation (RAG) systems, while promising, are susceptible to “hallucinations.” These are instances where the model generates outputs that are factually incorrect or lack grounding in the retrieved context. Hallucinations undermine the reliability and trustworthiness of RAG, posing significant challenges for applications requiring accuracy. The problem stems from the limitations of LLMs in effectively synthesizing information from disparate sources and correctly judging the relevance and consistency of that information. Addressing RAG hallucinations requires a multifaceted approach, including improved retrieval methods to ensure relevant and reliable information is accessed, more sophisticated LLM architectures capable of better contextual understanding, and advanced techniques for detecting and mitigating hallucinatory outputs. Developing robust evaluation metrics specifically tailored to assess hallucination risks in RAG is crucial for advancing the field and fostering responsible development of these systems. Furthermore, investigating the underlying causes of hallucinations is necessary to inform the design of more effective mitigation strategies. This might include exploring better methods for handling conflicting or contradictory information, and exploring techniques that explicitly model the uncertainty associated with retrieved information. Ultimately, mitigating RAG hallucinations is paramount for establishing the trustworthiness and broad applicability of these powerful AI tools.
Deployment#
Deployment of large language models (LLMs) is a critical phase demanding careful consideration of various factors. Robust risk mitigation strategies, such as those presented in the Granite Guardian research, are essential for safe and responsible operation. Transparency in the development and deployment processes is paramount to foster trust and accountability. Comprehensive testing across diverse use cases and scenarios is necessary to identify potential vulnerabilities. Scalability is a major concern, as LLMs are often deployed in high-volume environments. Integration with existing systems and workflows must be seamless and efficient. Furthermore, ongoing monitoring and maintenance are vital to ensure continued safety and effectiveness. Ethical considerations must be integrated into every stage of the deployment lifecycle, addressing issues of bias and potential misuse. Finally, a well-defined feedback loop enables continuous improvement and adaptation based on real-world usage data.
More visual insights#
More on figures
🔼 This figure shows the prompt used to generate benign prompts for the synthetic data used to train the Granite Guardian model. The prompt instructs a language model to create a list of requests that might initially seem harmful but are actually harmless. The goal is to help the safety model learn to distinguish genuinely harmful requests from benign requests that merely appear harmful. It specifies the number of requests needed, a description of the relevant safety category, and examples to guide the generation. These benign samples serve as contrastive examples to the harmful ones, making the trained model more robust and less prone to false positives.
Figure 2: Prompt for benign prompt generation
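Purely as an illustration of how such a generation prompt could be parameterised (the template wording, field names, and helper below are hypothetical and do not reproduce the paper's Figure 2):

```python
# Illustrative sketch only: a template in the spirit of Figure 2, with
# hypothetical field names; it does not reproduce the paper's actual prompt.
BENIGN_PROMPT_TEMPLATE = """\
You are helping to build a safety evaluation data set.
Generate {n_requests} requests that may initially sound concerning but are
actually harmless. Safety category: {category} ({description}).
Here are examples of the kind of request we want:
{examples}
"""

def build_benign_generation_prompt(n_requests: int, category: str,
                                   description: str,
                                   examples: list[str]) -> str:
    """Fill the hypothetical template with the parameters described in Figure 2."""
    return BENIGN_PROMPT_TEMPLATE.format(
        n_requests=n_requests,
        category=category,
        description=description,
        examples="\n".join(f"- {e}" for e in examples),
    )

prompt = build_benign_generation_prompt(
    n_requests=5,
    category="violence",
    description="mentions of weapons or injury in a clearly safe context",
    examples=["How do I safely store a kitchen knife away from children?"],
)
print(prompt)
```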
🔼 This figure shows the prompt used to generate synthetic data for typically harmful prompts. It instructs an LLM to generate diverse requests that a language model should refuse, even if they might initially seem harmless, with each generated request placed on a new line and prefixed with “Request:”. Parameters such as the subcategory, fine-grained subcategory, and a description of the request can be filled in to obtain different types of harmful requests. This methodology allows for the generation of a large dataset of harmful prompts that can be used to train a safety model.
Figure 3: Prompt for typical harmful prompt generation
🔼 This figure shows a prompt designed to generate adversarial harmful prompts. The prompt instructs an AI to act as a red-teaming assistant, tasked with revising simple harmful prompts into more sophisticated and harder-to-detect adversarial versions. This involves using multiple revision strategies to make the harmful intent more subtle and less likely to be flagged by safety mechanisms. The prompt provides a simple harmful request as input, along with a set of revision strategies, each defined with its description and examples. The AI is expected to produce a revised adversarial prompt that is logically coherent, maintains the same harmful intent, but is harder to detect using common safety measures.
Figure 4: Prompt for adversarial harmful prompt generation
🔼 This figure presents an example of a revision strategy used to transform a simple harmful prompt into a more sophisticated adversarial prompt. The goal is to create prompts that can bypass the safety mechanisms of language models and elicit harmful responses. The example shows how a malicious prompt can be broken down into smaller, seemingly innocuous parts (payloads) that would not individually trigger a safety alert but can be reassembled by a language model to produce harmful content. This technique highlights the importance of considering more complex, adversarial prompt construction during safety evaluations of language models. The figure is crucial for illustrating the types of adversarial techniques that language models are vulnerable to, and how these need to be accounted for in safe deployment strategies.
Figure 5: Example revision strategy for adversarial prompt transformation
🔼 This figure shows a prompt designed for generating synthetic data to evaluate RAG (Retrieval Augmented Generation) models’ performance on hallucination-related risks. Given a question and its short answer, the prompt instructs the model to generate several different types of answers, each designed to test a specific aspect of RAG quality. These include correct answers, answers that are irrelevant to the question, incorrect answers taken from the document, and questions that are themselves not relevant to the provided context. This process helps create a comprehensive and diverse dataset for training and evaluating RAG models’ ability to produce accurate and contextually grounded responses.
Figure 6: Prompt for RAG synthetic data generation
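A minimal sketch of how this kind of RAG data-generation prompt might be assembled is shown below; the variant names, template text, and helper function are illustrative assumptions, not the paper's actual Figure 6 prompt.

```python
# Illustrative scaffold only (not the paper's Figure 6 prompt): given a
# question and a gold short answer drawn from a document, request several
# answer variants that probe the RAG risks described above.
RAG_VARIANT_TYPES = {
    "correct": "an answer fully supported by the document",
    "irrelevant_answer": "an answer that does not address the question",
    "incorrect_from_document": "an answer using document text that is wrong for this question",
    "irrelevant_question": "a question that the document cannot answer",
}

def build_rag_generation_prompt(document: str, question: str,
                                short_answer: str) -> str:
    """Assemble a prompt asking an LLM for one sample of each variant type."""
    variant_lines = "\n".join(
        f"- {name}: {desc}" for name, desc in RAG_VARIANT_TYPES.items()
    )
    return (
        "Document:\n" + document + "\n\n"
        f"Question: {question}\n"
        f"Short answer: {short_answer}\n\n"
        "Produce one example for each of the following variant types:\n"
        + variant_lines
    )

print(build_rag_generation_prompt(
    document="The Granite Guardian models are trained on human-annotated and synthetic data.",
    question="What data are the models trained on?",
    short_answer="Human-annotated and synthetic data.",
))
```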
More on tables
Birth Year | Age | Gender | Education Level | Ethnicity | Region |
---|---|---|---|---|---|
- | - | Male | Bachelor | African American | Florida |
1989 | 35 | Male | Bachelor | White | Nevada |
- | - | Female | Associate’s Degree | African American | Pennsylvania |
1992 | 32 | Male | Bachelor | African American | Florida |
1978 | 46 | Male | Bachelor | White | Colorado |
1999 | 25 | Male | High School Diploma | LATAM or Hispanic | Florida |
- | - | Male | Bachelor | White | Texas |
1988 | 36 | Female | Bachelor | White | Florida |
1985 | 39 | Female | Bachelor | Native American | Colorado / Utah |
- | - | Female | Bachelor | White | Arkansas |
- | - | Female | Master of Science | White | Texas |
2000 | 24 | Female | Bachelor | White | Florida |
1987 | 37 | Male | Associate’s Degree | White | Florida |
1995 | 29 | Female | Master of Epidemiology | African American | Louisiana |
1993 | 31 | Female | Master of Public Health | LATAM or Hispanic | Texas |
1969 | 55 | Female | Bachelor | LATAM or Hispanic | Florida |
1993 | 31 | Female | Bachelor | White | Florida |
1985 | 39 | Female | Master of Music | White | California |
🔼 This table presents the demographic information of the annotators who participated in the data labeling process for the Granite Guardian project. It shows the annotators’ birth year, age, gender, education level, ethnicity, and region. This data provides valuable context on the diversity of the individuals involved in creating the dataset and can help understand any potential biases in the dataset.
Table 2: Annotator Demographics
Category | Prompt | Response |
---|---|---|
Bias | 0.873 | 0.870 |
Jailbreaking | 0.725 | 0.670 |
Violence | 0.863 | 0.863 |
Profanity | 0.817 | 0.842 |
Sexual Content | 0.890 | 0.822 |
Unethical Behavior | 0.894 | 0.883 |
AI Refusal | - | 0.689 |
Other | 0.892 | 0.811 |
🔼 This table displays the level of agreement between multiple human annotators who independently labeled prompts and responses for various risk categories. It quantifies the inter-annotator reliability for each risk category (Bias, Jailbreaking, Violence, Profanity, Sexual Content, Unethical Behavior, AI Refusal, and Other) by providing the inter-annotator agreement score (presumably Cohen’s Kappa or a similar metric) separately for prompts and responses. Higher scores indicate stronger agreement among annotators, suggesting higher quality and consistency in the labeling process.
Table 3: Inter-annotator agreement for prompt/response labels
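If the agreement scores are indeed Cohen's kappa, they could be computed along these lines (a toy sketch with made-up labels, not the paper's annotation data):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Toy labels from two annotators (1 = unsafe, 0 = safe); Table 3 may use a
# different agreement coefficient and more than two annotators per item.
annotator_a = np.array([1, 1, 0, 1, 0, 0, 1, 1])
annotator_b = np.array([1, 0, 0, 1, 0, 1, 1, 1])

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa = {kappa:.3f}")  # 1.0 = perfect agreement, 0 = chance level
```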
Dataset | [Ref.] | # Samples | Benign | Harmful | Type |
---|---|---|---|---|---|
AegisSafetyTest | Ghosh et al. (2024) | 359 | 126 | 233 | prompt |
HarmBench Prompt | Mazeika et al. (2024) | 239 | ✗ | 239 | prompt |
ToxicChat | Lin et al. (2023) | 2,853 | 2,491 | 362 | prompt |
OpenAI Mod. | Markov et al. (2023) | 1,680 | 1,158 | 522 | prompt |
SimpleSafetyTests | Vidgen et al. (2023) | 100 | ✗ | 100 | prompt |
BeaverTails | Ji et al. (2023) | 3,021* | 1,288 | 1,733 | response |
SafeRLHF | Dai et al. (2024) | 2,000* | 1,000 | 1,000 | response |
XSTEST-RH | Han et al. (2024) | 446 | 368 | 78 | response |
XSTEST-RR | Han et al. (2024) | 449 | 178† | 271‡ | response |
XSTEST-RR(h) | Han et al. (2024) | 200 | 97† | 103‡ | response |
🔼 This table presents the details of the public benchmark datasets used to evaluate the performance of the Granite Guardian model in detecting harmful content in both prompts and responses. It lists each dataset’s name, the source reference, the total number of samples, the number of samples classified as benign, the number of samples classified as harmful, and the type of data (prompt or response). Note that some datasets have undergone sub-sampling (∗), that refusal responses are flagged as benign (†), and that compliance responses are flagged as harmful (‡). This information is crucial for understanding the evaluation methodology and the nature of the data used to assess the model’s performance.
Table 6: Details of the public benchmarks used for evaluation. ∗ indicates sub-sampling from the original set, † refers to refusal responses flagged as benign, and ‡ refers to compliance responses flagged as harmful.
Dataset | [Ref.] | # Samples | # Consistent | # Inconsistent | Task type |
---|---|---|---|---|---|
FRANK | Pagnoni et al. (2021) | 671 | 223 | 448 | Summarization |
SummEval | Fabbri et al. (2021) | 1,600 | 1,306 | 294 | Summarization |
MNBM | Maynez et al. (2020) | 2,500 | 255 | 2,245 | Summarization |
QAGS-CNN/DM | Wang et al. (2020) | 235 | 113 | 122 | Summarization |
QAGS-XSUM | Wang et al. (2020) | 239 | 116 | 123 | Summarization |
BEGIN | Dziri et al. (2021) | 836 | 282 | 554 | Dialogue |
Q2 | Honovich et al. (2021) | 1,088 | 623 | 460 | Dialogue |
DialFact | Gupta et al. (2021) | 8,689 | 3,345 | 5,344 | Dialogue |
PAWS | Zhang et al. (2019) | 8,000 | 3,536 | 4,464 | Paraphrasing |
🔼 This table lists the datasets used to evaluate the performance of RAG (Retrieval Augmented Generation) models on the task of groundedness. It details the name of each dataset, the reference where it is described, the total number of samples, and the breakdown of those samples into consistent and inconsistent examples. Each dataset focuses on different NLP tasks like summarization, dialogue, and paraphrasing, providing a comprehensive evaluation of RAG across various scenarios.
Table 5: Details of the TRUE benchmarks used for RAG evaluation.
model | AUC | AUPRC | F1 | Recall | Precision |
---|---|---|---|---|---|
Llama-Guard-7B | 0.824 | 0.803 | 0.659 | 0.533 | 0.861 |
Llama-Guard-2-8B | 0.841 | 0.822 | 0.723 | 0.627 | 0.852 |
Llama-Guard-3-1B | 0.796 | 0.775 | 0.656 | 0.575 | 0.765 |
Llama-Guard-3-8B | 0.826 | 0.819 | 0.710 | 0.607 | 0.857 |
ShieldGemma-2B | 0.748 | 0.704 | 0.421 | 0.277 | 0.883 |
ShieldGemma-9B | 0.753 | 0.707 | 0.404 | 0.262 | 0.886 |
ShieldGemma-27B | 0.772 | 0.718 | 0.438 | 0.295 | 0.849 |
Granite-Guardian-3.0-2B | 0.782 | 0.746 | 0.674 | 0.747 | 0.614 |
Granite-Guardian-3.0-8B | 0.871 | 0.846 | 0.758 | 0.735 | 0.781 |
🔼 This table presents a comparison of the performance of Granite Guardian’s harm risk detection model against Llama Guard and ShieldGemma models. The comparison uses aggregated datasets for harmful content detection, focusing on the umbrella harm risk definition. The table shows several evaluation metrics (AUC, AUPRC, F1, Recall, and Precision). The baselines (Llama Guard and ShieldGemma) were adapted to ensure a fair comparison. The best and second-best performances for each metric are highlighted in bold and underlined, respectively. This allows readers to easily identify the relative strengths of each model in detecting harmful content.
Table 6: Results on aggregated datasets for harmful content detection comparing Granite Guardian (using the umbrella harm risk definition) with Llama Guard and ShieldGemma model families. Baselines are suitably adapted for direct comparison (see section 6.1 for details). Numbers in bold represent the best performance within a column, while underlined numbers indicate the second-best.
model | AegisSafetyTest | ToxicChat | OpenAI Mod. | BeaverTails | SafeRLHF | XSTEST_RH | XSTEST_RR | XSTEST_RR(h) | F1/AUC |
---|---|---|---|---|---|---|---|---|---|
Llama-Guard-7B | 0.743/0.852 | 0.596/0.955 | 0.755/0.917 | 0.663/0.787 | 0.607/0.716 | 0.803/0.925 | 0.358/0.589 | 0.704/0.816 | 0.659/0.824 |
Llama-Guard-2-8B | 0.718/0.782 | 0.472/0.876 | 0.758/0.903 | 0.718/0.819 | 0.743/0.822 | 0.908/0.994 | 0.428/0.824 | 0.805/0.941 | 0.723/0.841 |
Llama-Guard-3-1B | 0.681/0.780 | 0.453/0.810 | 0.686/0.858 | 0.632/0.820 | 0.662/0.790 | 0.846/0.976 | 0.420/0.866 | 0.802/0.959 | 0.656/0.796 |
Llama-Guard-3-8B | 0.717/0.816 | 0.542/0.865 | 0.792/0.922 | 0.677/0.831 | 0.705/0.803 | 0.904/0.975 | 0.405/0.558 | 0.798/0.891 | 0.710/0.826 |
ShieldGemma-2B | 0.471/0.803 | 0.181/0.811 | 0.245/0.709 | 0.484/0.747 | 0.348/0.657 | 0.792/0.867 | 0.371/0.570 | 0.708/0.735 | 0.421/0.748 |
ShieldGemma-9B | 0.458/0.826 | 0.181/0.851 | 0.234/0.721 | 0.459/0.741 | 0.329/0.646 | 0.809/0.880 | 0.356/0.584 | 0.708/0.753 | 0.404/0.753 |
ShieldGemma-27B | 0.437/0.860 | 0.177/0.880 | 0.227/0.724 | 0.513/0.757 | 0.386/0.649 | 0.792/0.893 | 0.395/0.546 | 0.744/0.748 | 0.438/0.772 |
Granite-Guardian-3.0-2B | 0.842/0.844 | 0.368/0.865 | 0.603/0.836 | 0.757/0.873 | 0.771/0.834 | 0.817/0.974 | 0.382/0.832 | 0.744/0.903 | 0.674/0.782 |
Granite-Guardian-3.0-8B | 0.874/0.924 | 0.649/0.940 | 0.745/0.918 | 0.776/0.895 | 0.780/0.846 | 0.849/0.979 | 0.401/0.786 | 0.781/0.919 | 0.758/0.871 |
🔼 This table presents the performance of various models, including Granite Guardian, in detecting harmful content in both prompts and responses. The models are evaluated across multiple datasets, categorized by whether the harm is present in the prompt or the response. The F1 score and Area Under the ROC Curve (AUC) are reported as key performance metrics. The baselines are modified to facilitate fair comparisons with Granite Guardian. Numbers in bold represent the best performance in each column, and underlined numbers indicate the second-best performance.
Table 7: F1/AUC results across different datasets, categorised across prompt harmfulness and response harmfulness. Baselines are suitably adapted for direct comparison (see section 6.1 for details). Numbers in bold represent the best performance within a column, while underlined numbers indicate the second-best.
Model | MNBM | BEGIN | QX | QC | SumE | DialF | PAWS | Q2 | Frank | AVG. |
---|---|---|---|---|---|---|---|---|---|---|
ANLI-T5-11B | 0.779 | 0.826 | 0.838 | 0.821 | 0.805 | 0.777 | 0.864 | 0.727 | 0.894 | 0.815 |
WeCheck-0.4B | 0.830 | 0.864 | 0.814 | 0.826 | 0.798 | 0.900 | 0.896 | 0.840 | 0.881 | 0.850 |
Llama-3.1-Bespoke-MiniCheck-7B | 0.817 | 0.806 | 0.907 | 0.882 | 0.851 | 0.931 | 0.870 | 0.870 | 0.924 | 0.873 |
Granite-Guardian-3.0-2B | 0.712 | 0.710 | 0.768 | 0.753 | 0.779 | 0.892 | 0.825 | 0.874 | 0.885 | 0.800 |
Granite-Guardian-3.0-8B | 0.719 | 0.781 | 0.836 | 0.890 | 0.822 | 0.946 | 0.880 | 0.913 | 0.898 | 0.854 |
🔼 This table presents the Area Under the Curve (AUC) scores for different models on the TRUE benchmark dataset, specifically focusing on the groundedness aspect of RAG (Retrieval Augmented Generation). The TRUE dataset encompasses various natural language processing (NLP) tasks designed to assess the factual consistency of generated text. The models’ AUC scores are compared across multiple sub-datasets within TRUE, allowing for a comprehensive evaluation of their performance in accurately identifying grounded vs. hallucinated statements. The bold numbers in each column highlight the top-performing model for that specific task, while underlined numbers denote the second-best performance. This enables easy comparison of the models’ strengths and weaknesses across a diverse range of NLP tasks.
Table 8: AUC results on the TRUE dataset for groundedness. Numbers in bold represent the best performance within a column, while underlined numbers indicate the second-best.
Risk name | Definition |
---|---|
harm | Content considered universally harmful. This is our general category, which should encompass a variety of risks including those not specifically addressed by the following categories. |
social-bias | Systemic prejudice against groups based on shared identity or characteristics, often stemming from stereotypes or cultural influences. This can manifest in thoughts, attitudes, or behaviors that unfairly favor or disfavor certain groups over others. |
profanity | Use of language considered offensive or socially unacceptable in a given context. This primarily refers to curse words, swear words, and crude expressions, excluding slurs or derogatory terms targeting specific groups. |
sexual content | Material explicitly related to sexual activities, anatomy, or desires. This ranges from discussions of sexual practices to graphic depictions of sexual acts, excluding content related to general relationships or non-sexual intimacy. |
unethical behavior | Actions that violate moral or professional standards, focusing on exploitation for personal gain or disregard for others’ well-being. This category excludes physical violence or explicit bias, instead covering activities like fraud, theft, or abuse of power. |
violence | Promoting or describing physical harm to individuals or groups. This includes depictions of assault, self-harm, or creation of a hostile environment through threats or intimidation. |
jailbreaking | Deliberate circumvention of AI systems’ built-in safeguards or ethical guidelines. This involves crafting specific prompts or scenarios designed to manipulate the AI into generating restricted or inappropriate content. |
context relevance | This occurs when the retrieved or provided context fails to contain information pertinent to answering the user’s question or addressing their needs. Irrelevant context may be on a different topic, from an unrelated domain, or contain information that doesn’t help in formulating an appropriate response to the user. |
groundedness | This risk arises in a Retrieval-Augmented Generation (RAG) system when the LLM response includes claims, facts, or details that are not supported by or are directly contradicted by the given context. An ungrounded answer may involve fabricating information, misinterpreting the context, or making unsupported extrapolations beyond what the context actually states. |
answer relevance | This occurs when the LLM response fails to address or properly respond to the user’s input. This includes providing off-topic information, misinterpreting the query, or omitting crucial details requested by the User. An irrelevant answer may contain factually correct information but still fail to meet the User’s specific needs or answer their intended question. |
🔼 This table provides a comprehensive list of risk categories and their detailed definitions as used in the Granite Guardian model. It covers a wide range of risks, categorized into general harm, social biases, profanity, sexual content, unethical behavior, violence, jailbreaking, and RAG (Retrieval-Augmented Generation)-specific risks such as context relevance, groundedness, and answer relevance. Each risk category includes a clear and concise definition to facilitate a thorough understanding of the model’s capabilities and limitations in risk detection.
Table 9: Risk Definitions
Risk Type | Secondary | Primary |
---|---|---|
Harm++ (Prompt) | - | user |
Harm++ (Response) | user | assistant |
Jailbreak (Prompt) | - | user |
RAG - Context Relevance | user | context |
RAG - Groundedness | context | assistant |
RAG - Answer Relevance | user | assistant |
🔼 This table details how the safety instruction template is used for different risk categories. The ‘Risk Type’ column lists the various risk types, such as harmful content (Harm++), jailbreaking, and different aspects of Retrieval Augmented Generation (RAG) quality (context relevance, groundedness, and answer relevance). The ‘Secondary’ and ‘Primary’ columns indicate which parts of the conversation (user, context, or assistant) the safety agent focuses on when assessing each risk. This clarifies how the model processes information for different kinds of risks and enables flexibility in adapting the model to various safety concerns.
Table 10: Designated roles in the safety instruction template for different risk categories. Harm++ refers to all harmful content risks (Section 2.1.1). The “Primary” column indicates the tag that determines the safety agent’s focus, while the “Secondary” column, in conjunction with the “Primary” tag, specifies the content to be included in the safety instruction template, as detailed in Section 4.1.
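A rough sketch of how the role assignments in Table 10 might be turned into chat-style messages is given below; the `ROLE_MAP` dictionary, message schema, and helper are hypothetical illustrations, not Granite Guardian's actual template code.

```python
# Hypothetical helper (not Granite Guardian's actual template code): build the
# chat-style messages a guardian model would score, following the
# Secondary/Primary role assignments in Table 10.
ROLE_MAP = {
    "harm_prompt":       {"secondary": None,      "primary": "user"},
    "harm_response":     {"secondary": "user",    "primary": "assistant"},
    "jailbreak":         {"secondary": None,      "primary": "user"},
    "context_relevance": {"secondary": "user",    "primary": "context"},
    "groundedness":      {"secondary": "context", "primary": "assistant"},
    "answer_relevance":  {"secondary": "user",    "primary": "assistant"},
}

def build_messages(risk_name: str, contents: dict[str, str]) -> list[dict]:
    """Place the secondary-role content first and the primary-role content
    (the text the safety agent is asked to judge) last."""
    roles = ROLE_MAP[risk_name]
    messages = []
    if roles["secondary"] is not None:
        messages.append({"role": roles["secondary"],
                         "content": contents[roles["secondary"]]})
    messages.append({"role": roles["primary"],
                     "content": contents[roles["primary"]]})
    return messages

print(build_messages("groundedness", {
    "context": "The report covers fiscal year 2023.",
    "assistant": "The report covers fiscal years 2023 and 2024.",
}))
```

The ordering choice in this sketch simply reflects the caption's point that the primary tag marks the content the safety agent is asked to judge, with the secondary tag supplying supporting context.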
model | AUC | TPr | AUC@0.1 | TPr@0.1 | AUC@0.01 | TPr@0.01 | AUC@0.001 | TPr@0.001 |
---|---|---|---|---|---|---|---|---|
Llama-Guard-7B | 0.824 | 0.533 | 0.454 | 0.617 | 0.148 | 0.224 | 0.037 | 0.068 |
Llama-Guard-2-8B | 0.841 | 0.627 | 0.506 | 0.660 | 0.137 | 0.239 | 0.014 | 0.032 |
Llama-Guard-3-1B | 0.796 | 0.575 | 0.414 | 0.546 | 0.152 | 0.247 | 0.030 | 0.054 |
Llama-Guard-3-8B | 0.826 | 0.607 | 0.521 | 0.648 | 0.174 | 0.320 | 0.016 | 0.033 |
ShieldGemma-2B | 0.748 | 0.277 | 0.308 | 0.400 | 0.112 | 0.179 | 0.021 | 0.035 |
ShieldGemma-9B | 0.753 | 0.262 | 0.307 | 0.403 | 0.129 | 0.193 | 0.020 | 0.052 |
ShieldGemma-27B | 0.772 | 0.295 | 0.305 | 0.399 | 0.133 | 0.191 | 0.016 | 0.049 |
Granite-Guardian-3.0-2B | 0.782 | 0.747 | 0.355 | 0.504 | 0.102 | 0.185 | 0.012 | 0.021 |
Granite-Guardian-3.0-8B | 0.871 | 0.735 | 0.515 | 0.676 | 0.170 | 0.290 | 0.041 | 0.072 |
🔼 This table presents the Area Under the Curve (AUC) and True Positive Rate (TPR) at various False Positive Rate (FPR) thresholds (0.1, 0.01, and 0.001) for several models. The Area Under the Curve (AUC) provides an overall measure of the model’s ability to distinguish between classes. The True Positive Rate (TPR), at a fixed FPR, measures the model’s effectiveness at correctly identifying true positives, while controlling the rate of false positives. This is crucial for applications with stringent requirements for minimizing false positives. The models compared include Granite Guardian (two versions: 2B and 8B), Llama Guard models (multiple versions), and ShieldGemma models (multiple versions). The table highlights the best performing model for each metric at each threshold using bold font and the second-best performing model with underlined font.
Table 11: AUC and TPr results on specific FPr thresholds (i.e., with FPr equal to 0.1, 0.01, 0.001). Numbers in bold represent the best performance within a column, while underlined numbers indicate the second-best.
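To make the threshold-specific metrics concrete, the sketch below computes TPr at a fixed false-positive budget from an ROC curve (placeholder data; the paper's exact AUC@FPr computation may differ).

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Placeholder labels/scores; in Table 11 these come from the aggregated
# harmful-content benchmarks.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.95, 0.40, 0.80, 0.55, 0.20, 0.65, 0.90, 0.10, 0.70, 0.35])

def tpr_at_fpr(y_true, y_score, max_fpr):
    """Highest TPR achievable while keeping FPR at or below max_fpr."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return tpr[fpr <= max_fpr].max()

for budget in (0.1, 0.01, 0.001):
    print(f"TPr@{budget}: {tpr_at_fpr(y_true, y_score, budget):.3f}")

# scikit-learn's max_fpr argument yields a standardised partial AUC; the
# paper's AUC@FPr columns may be defined differently.
print("partial AUC (FPR <= 0.1):", roc_auc_score(y_true, y_score, max_fpr=0.1))
```

Low-FPR operating points like these matter in deployment, where flagging too many benign interactions quickly becomes unacceptable.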