2025-01-30
2025
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
·3468 words·17 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Georgia Institute of Technology
Virus: A new harmful fine-tuning attack easily bypasses LLM guardrail moderation, exposing the inadequacy of current safety measures and calling for more robust defenses.
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
·1678 words·8 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Mondragon University
Researchers used ASTRAL to systematically test the safety of OpenAI's o3-mini LLM, revealing key vulnerabilities and highlighting the need for continuous, robust safety mechanisms in large language models.
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
·2552 words·12 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Carnegie Mellon University
Critique Fine-Tuning (CFT) is more effective than traditional supervised fine-tuning (SFT) for training language models, achieving comparable results with significantly less data and opening new avenues for model training.
Atla Selene Mini: A General Purpose Evaluation Model
·1893 words·9 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Atla
Atla Selene Mini: a state-of-the-art small LLM judge that surpasses larger models in benchmark performance.