AI Theory
Effectively Controlling Reasoning Models through Thinking Intervention
·3981 words·19 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Princeton University
Thinking Intervention offers a novel paradigm for controlling reasoning in LLMs, enabling fine-grained guidance and improvements in instruction-following and safety.
ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging
·1815 words·9 mins·
AI Generated
🤗 Daily Papers
AI Theory
Privacy
🏢 Zhejiang University
Model merging as an unlearning system: by combining specialized models, it selectively erases sensitive knowledge and achieves top results in SemEval-2025 Task 4.
LookAhead Tuning: Safer Language Models via Partial Answer Previews
·2175 words·11 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Zhejiang University
LookAhead Tuning makes LLMs safer via partial answer previews, preserving the model’s initial token distributions.
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
·3290 words·16 mins·
AI Generated
🤗 Daily Papers
AI Theory
Interpretability
🏢 AIRI
LLMs’ reasoning is decoded via sparse autoencoders, revealing key features that, when steered, enhance performance. First mechanistic account of reasoning in LLMs!
Measuring AI Ability to Complete Long Tasks
·6252 words·30 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Model Evaluation & Threat Research (METR)
AI progress is tracked with a new metric, 50%-task-completion time horizon, showing exponential growth with a doubling time of ~7 months, hinting at significant automation potential in the near future…
Why Do Multi-Agent LLM Systems Fail?
·2168 words·11 mins·
AI Generated
🤗 Daily Papers
AI Theory
Robustness
🏢 UC Berkeley
Multi-Agent Systems (MAS) often underperform despite enthusiasm. This paper analyzes 5 popular frameworks across 150+ tasks, identifying 14 failure modes categorized into specification/design, inter-a…
Implicit Bias-Like Patterns in Reasoning Models
·3234 words·16 mins·
AI Generated
🤗 Daily Papers
AI Theory
Fairness
🏢 Washington University in St. Louis
AI reasoning models reveal bias-like patterns, processing association-incompatible info with more computational effort, mirroring human implicit biases.
Group-robust Machine Unlearning
·7203 words·34 mins·
AI Generated
🤗 Daily Papers
AI Theory
Robustness
🏢 University of Trento
Group-robust machine unlearning via MIU reduces performance degradation in dominant groups after unlearning, preserving model robustness without compromising accuracy.
Mixture of Experts Made Intrinsically Interpretable
·3052 words·15 mins·
AI Generated
🤗 Daily Papers
AI Theory
Interpretability
🏢 University of Oxford
MoE-X: An intrinsically interpretable Mixture-of-Experts language model that uses sparse, wide networks to enhance transparency.
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
·2433 words·12 mins·
AI Generated
🤗 Daily Papers
AI Theory
Robustness
🏢 M-a-P
CodeCriticBench: A new benchmark for holistic code critique by Large Language Models.
Beyond Release: Access Considerations for Generative AI Systems
·1284 words·7 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Hugging Face
AI system access is about more than release: how accessible system components are shapes benefits, risks, and scalability.
Forecasting Open-Weight AI Model Growth on Hugging Face
·2415 words·12 mins·
AI Generated
🤗 Daily Papers
AI Theory
Representation Learning
🏢 Rensselaer Polytechnic Institute
Predicting open-weight AI model growth on Hugging Face using a citation-style model, revealing adoption dynamics and influencing factors.
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
·1710 words·9 mins·
AI Generated
🤗 Daily Papers
AI Theory
Interpretability
🏢 AIRI
LLMs rely on seemingly trivial tokens such as punctuation for context memory, and these tokens surprisingly boost performance.
Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
·5119 words·25 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 KAIST
LLMs often fail to act safely when user-specific safety standards are taken into account; a new benchmark is introduced to evaluate and address this gap.
Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning
·6573 words·31 mins·
AI Generated
🤗 Daily Papers
AI Theory
Optimization
🏢 University of Texas at Austin
RL optimizes quantum error-correcting codes, slashing physical qubit overhead for fault-tolerant quantum computing.
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
·2482 words·12 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Hong Kong Polytechnic University
Aligned LLMs’ safety often anchors in the template region, creating vulnerabilities. Detaching safety mechanisms shows promise in mitigation.
The snake in the Brownian sphere
·1555 words·8 mins·
AI Generated
🤗 Daily Papers
AI Theory
Representation Learning
🏢 University of British Columbia, Department of Mathematics
Unveiling the Brownian snake within the Brownian sphere! This research constructs the inverse of the CVS bijection, mapping the sphere back to its underlying snake.
Presumed Cultural Identity: How Names Shape LLM Responses
·2724 words·13 mins·
AI Generated
🤗 Daily Papers
AI Theory
Fairness
🏢 University of Copenhagen
LLMs personalize based on user names, but this study reveals that cultural presumptions in LLM responses risk reinforcing stereotypes.
o3-mini vs DeepSeek-R1: Which One is Safer?
·578 words·3 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Mondragon University
ASTRAL, a novel automated safety testing tool, reveals DeepSeek-R1’s significantly higher unsafe response rate compared to OpenAI’s o3-mini, highlighting critical safety concerns in advanced LLMs.
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation
·1678 words·8 mins·
AI Generated
🤗 Daily Papers
AI Theory
Safety
🏢 Mondragon University
Researchers used ASTRAL to systematically test the safety of OpenAI’s o3-mini LLM, revealing key vulnerabilities and highlighting the need for continuous, robust safety mechanisms in large language models.