Safety

LookAhead Tuning: Safer Language Models via Partial Answer Previews
·2175 words·11 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Zhejiang University
LookAhead Tuning makes fine-tuned LLMs safer by previewing partial answers, preserving the model’s initial token distributions.
Measuring AI Ability to Complete Long Tasks
·6252 words·30 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Model Evaluation & Threat Research (METR)
AI progress is tracked with a new metric, the 50%-task-completion time horizon, which shows exponential growth with a doubling time of ~7 months, hinting at significant automation potential in the near future…
Beyond Release: Access Considerations for Generative AI Systems
·1284 words·7 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Hugging Face
AI system access is more than just release; it’s about how accessible system components are, impacting benefits, risks, and scalability.
Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
·5119 words·25 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 KAIST
LLMs fail to act safely when user-specific safety standards are considered; a new benchmark is introduced to evaluate and address this gap.
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
·2482 words·12 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Hong Kong Polytechnic University
Aligned LLMs’ safety mechanisms are often anchored in the template region, creating vulnerabilities; detaching them from the template shows promise as a mitigation.
o3-mini vs DeepSeek-R1: Which One is Safer?
·578 words·3 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Mondragon University
ASTRAL, a novel automated safety testing tool, reveals DeepSeek-R1’s significantly higher unsafe response rate compared to OpenAI’s o3-mini, highlighting critical safety concerns in advanced LLMs.
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
·1678 words·8 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Mondragon University
Researchers used ASTRAL to systematically test the safety of OpenAI’s o3-mini LLM, revealing key vulnerabilities and highlighting the need for continuous, robust safety mechanisms in large language models.