
2025-01-30


Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
·3468 words·17 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Georgia Institute of Technology
Virus: A new attack method easily bypasses LLM guardrails, highlighting the inadequacy of current safety measures and calling for more robust defenses.
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
·1678 words·8 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Mondragon University
Researchers used ASTRAL to systematically test the safety of OpenAI's o3-mini LLM, revealing key vulnerabilities and highlighting the need for continuous, robust safety mechanisms in large language models.
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
·2552 words·12 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Carnegie Mellon University
Critique Fine-Tuning (CFT) outperforms traditional supervised fine-tuning (SFT) in training language models, achieving comparable results with significantly less data and opening new avenues for LLM training.
Atla Selene Mini: A General Purpose Evaluation Model
·1893 words·9 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Atla
Atla Selene Mini: A state-of-the-art small LLM judge surpassing larger models in benchmark performance!