
2025-01-30


Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
·3468 words·17 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Georgia Institute of Technology
Virus: A new attack method easily bypasses LLM guardrails, highlighting the inadequacy of current safety measures and calling for more robust defenses.
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
·1678 words·8 mins
AI Generated 🤗 Daily Papers AI Theory Safety 🏢 Mondragon University
Researchers used ASTRAL to systematically test the safety of OpenAI's o3-mini LLM, revealing key vulnerabilities and highlighting the need for continuous, robust safety mechanisms in large language models.
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
·2552 words·12 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Carnegie Mellon University
Critique Fine-Tuning (CFT) outperforms traditional supervised fine-tuning (SFT) in training language models, achieving comparable results with significantly less data and opening new avenues for LLM training.
Atla Selene Mini: A General Purpose Evaluation Model
·1893 words·9 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Atla
Atla Selene Mini: A state-of-the-art small LLM judge surpassing larger models in benchmark performance!