Safety

LookAhead Tuning: Safer Language Models via Partial Answer Previews
·2175 words·11 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Zhejiang University
LookAhead Tuning makes fine-tuned LLMs safer by previewing partial answers, preserving the model’s initial token distributions.
Measuring AI Ability to Complete Long Tasks
·6252 words·30 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Model Evaluation & Threat Research (METR)
AI progress is tracked with a new metric, the 50%-task-completion time horizon, which shows exponential growth with a doubling time of ~7 months, hinting at significant automation potential in the near future…
Beyond Release: Access Considerations for Generative AI Systems
·1284 words·7 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Hugging Face
AI system access is more than just release; it’s about how accessible system components are, impacting benefits, risks, and scalability.
Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
·5119 words·25 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 KAIST
LLMs fail to act safely when user-specific safety standards are considered; a new benchmark is introduced to evaluate and address this gap.
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
·2482 words·12 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Hong Kong Polytechnic University
Aligned LLMs’ safety mechanisms are often anchored in the template region, creating vulnerabilities; detaching them from the template shows promise as a mitigation.
o3-mini vs DeepSeek-R1: Which One is Safer?
·578 words·3 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Mondragon University
ASTRAL, a novel automated safety testing tool, reveals DeepSeek-R1’s significantly higher unsafe response rate compared to OpenAI’s o3-mini, highlighting critical safety concerns in advanced LLMs.
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
·1678 words·8 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Mondragon University
Researchers used ASTRAL to systematically test the safety of OpenAI’s o3-mini LLM, revealing key vulnerabilities and highlighting the need for continuous, robust safety mechanisms in large language models.