
Safety

Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
·2482 words·12 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Hong Kong Polytechnic University
Aligned LLMs’ safety mechanisms often anchor in the template region, creating vulnerabilities. Detaching the safety mechanism from the template region shows promise as a mitigation.
o3-mini vs DeepSeek-R1: Which One is Safer?
·578 words·3 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Mondragon University
ASTRAL, a novel automated safety testing tool, reveals DeepSeek-R1’s significantly higher unsafe response rate compared to OpenAI’s o3-mini, highlighting critical safety concerns in advanced LLMs.
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
·1678 words·8 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Mondragon University
Researchers used ASTRAL to systematically test the safety of OpenAI’s o3-mini LLM, revealing key vulnerabilities and highlighting the need for continuous, robust safety mechanisms in large language models.