AI Theory
Effectively Controlling Reasoning Models through Thinking Intervention
·3981 words·19 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Princeton University
Thinking Intervention offers a novel paradigm for controlling reasoning in LLMs, enabling fine-grained guidance and improvements in instruction-following and safety.
ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging
·1815 words·9 mins·
AI Generated
🤗 Daily Papers
AI Theory
Privacy
🏢 Zhejiang University
Model merging as an unlearning system: by combining specialized models, it selectively erases sensitive knowledge and achieves top results in SemEval-2025 Task 4.
LookAhead Tuning: Safer Language Models via Partial Answer Previews
·2175 words·11 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Zhejiang University
LookAhead Tuning makes LLMs safer via partial answer previews, preserving the model’s initial token distributions.
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
·3290 words·16 mins·
AI Generated
🤗 Daily Papers
AI Theory
Interpretability
🏢 AIRI
LLMs’ reasoning is decoded via sparse autoencoders, revealing key features that, when steered, enhance performance. First mechanistic account of reasoning in LLMs!
Measuring AI Ability to Complete Long Tasks
·6252 words·30 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Model Evaluation & Threat Research (METR)
AI progress is tracked with a new metric, 50%-task-completion time horizon, showing exponential growth with a doubling time of ~7 months, hinting at significant automation potential in the near future…
Why Do Multi-Agent LLM Systems Fail?
·2168 words·11 mins·
AI Generated
🤗 Daily Papers
AI Theory
Robustness
🏢 UC Berkeley
Multi-Agent Systems (MAS) often underperform despite enthusiasm. This paper analyzes 5 popular frameworks across 150+ tasks, identifying 14 failure modes categorized into specification/design, inter-a…
Implicit Bias-Like Patterns in Reasoning Models
·3234 words·16 mins·
AI Generated
🤗 Daily Papers
AI Theory
Fairness
🏢 Washington University in St. Louis
AI reasoning models reveal bias-like patterns, processing association-incompatible info with more computational effort, mirroring human implicit biases.
Group-robust Machine Unlearning
·7203 words·34 mins·
AI Generated
🤗 Daily Papers
AI Theory
Robustness
🏢 University of Trento
Group-robust machine unlearning via MIU reduces performance degradation in dominant groups after unlearning, preserving model robustness without compromising accuracy.
Mixture of Experts Made Intrinsically Interpretable
·3052 words·15 mins·
AI Generated
🤗 Daily Papers
AI Theory
Interpretability
🏢 University of Oxford
MoE-X: An intrinsically interpretable Mixture-of-Experts language model that uses sparse, wide networks to enhance transparency.
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
·2433 words·12 mins·
AI Generated
🤗 Daily Papers
AI Theory
Robustness
🏢 M-a-P
CodeCriticBench: A new benchmark for holistic code critique by Large Language Models.
Beyond Release: Access Considerations for Generative AI Systems
·1284 words·7 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Hugging Face
AI system access is about more than release: how accessible system components are shapes benefits, risks, and scalability.
Forecasting Open-Weight AI Model Growth on Hugging Face
·2415 words·12 mins·
AI Generated
🤗 Daily Papers
AI Theory
Representation Learning
🏢 Rensselaer Polytechnic Institute
Predicting open-weight AI model growth on Hugging Face using a citation-style model, revealing adoption dynamics and influencing factors.
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
·1710 words·9 mins·
AI Generated
🤗 Daily Papers
AI Theory
Interpretability
🏢 AIRI
LLMs rely on seemingly trivial tokens such as punctuation for context memory, and these tokens surprisingly boost performance.
Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
·5119 words·25 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 KAIST
LLMs often fail to act safely when user-specific safety standards are taken into account; a new benchmark is introduced to evaluate and address this gap.
Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning
·6573 words·31 mins·
AI Generated
🤗 Daily Papers
AI Theory
Optimization
🏢 University of Texas at Austin
RL optimizes quantum error-correcting codes, slashing physical qubit overhead for fault-tolerant quantum computing.
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
·2482 words·12 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Hong Kong Polytechnic University
Aligned LLMs’ safety often anchors in the template region, creating vulnerabilities. Detaching safety mechanisms shows promise in mitigation.
The snake in the Brownian sphere
·1555 words·8 mins·
AI Generated
🤗 Daily Papers
AI Theory
Representation Learning
🏢 University of British Columbia, Department of Mathematics
Unveiling the Brownian snake within the Brownian sphere! This research constructs the inverse of the CVS bijection, mapping the sphere back to its underlying snake.
Presumed Cultural Identity: How Names Shape LLM Responses
·2724 words·13 mins·
AI Generated
🤗 Daily Papers
AI Theory
Fairness
🏢 University of Copenhagen
LLMs personalize based on user names, but this study reveals that cultural presumptions in LLM responses risk reinforcing stereotypes.
o3-mini vs DeepSeek-R1: Which One is Safer?
·578 words·3 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Mondragon University
ASTRAL, a novel automated safety testing tool, reveals DeepSeek-R1’s significantly higher unsafe response rate compared to OpenAI’s o3-mini, highlighting critical safety concerns in advanced LLMs.
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
·1678 words·8 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Mondragon University
Researchers used ASTRAL to systematically test the safety of OpenAI’s o3-mini LLM, revealing key vulnerabilities and highlighting the need for continuous, robust safety mechanisms in large language models.