Safety
Unelicitable Backdoors via Cryptographic Transformer Circuits
·1600 words·8 mins·
AI Theory
Safety
🏢 Contramont Research
Researchers unveil unelicitable backdoors in language models, built from cryptographic transformer circuits, that defy conventional detection methods and raise crucial AI safety concerns.
Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense
·3401 words·16 mins·
AI Theory
Safety
🏢 Hong Kong University of Science and Technology
Current backdoor defenses, while effective at reducing attack success rates, are vulnerable to rapid re-learning. This work unveils this superficial safety, proposes a novel attack, and introduces a p…
Stepwise Alignment for Constrained Language Model Policy Optimization
·2517 words·12 mins·
AI Theory
Safety
🏢 University of Tsukuba
Stepwise Alignment for Constrained Policy Optimization (SACPO) efficiently aligns LLMs with human values, prioritizing both helpfulness and harmlessness via a novel stepwise approach.
SEEV: Synthesis with Efficient Exact Verification for ReLU Neural Barrier Functions
·1687 words·8 mins·
AI Theory
Safety
🏢 Washington University in St. Louis
The SEEV framework verifies ReLU neural barrier functions by reducing activation regions and using tight over-approximations, significantly improving verification efficiency without sacrificin…
Secret Collusion among AI Agents: Multi-Agent Deception via Steganography
·5189 words·25 mins·
AI Generated
AI Theory
Safety
🏢 UC Berkeley
AI agents can secretly collude using steganography, hiding their interactions from oversight. This research formalizes this threat, analyzes LLMs’ capabilities, and proposes mitigation strategies.
Rule Based Rewards for Language Model Safety
·3342 words·16 mins·
AI Theory
Safety
🏢 OpenAI
Rule-Based Rewards (RBRs) enhance LLM safety by using AI feedback and a few-shot prompt-based approach, achieving higher safety-behavior accuracy with less human annotation than existing methods.
Refusal in Language Models Is Mediated by a Single Direction
·4093 words·20 mins·
AI Theory
Safety
🏢 Independent
LLM refusal is surprisingly mediated by a single, easily manipulated direction in the model’s activation space.
Provably Safe Neural Network Controllers via Differential Dynamic Logic
·2824 words·14 mins·
AI Theory
Safety
🏢 Karlsruhe Institute of Technology
Verifiably safe AI controllers are created via a novel framework, VerSAILLE, which uses differential dynamic logic and open-loop NN verification to prove safety for unbounded time horizons.
Neural Model Checking
·2039 words·10 mins·
AI Theory
Safety
🏢 University of Birmingham
Neural networks revolutionize hardware model checking by generating formal proof certificates, outperforming state-of-the-art techniques in speed and scalability.
Improving Alignment and Robustness with Circuit Breakers
·2515 words·12 mins·
AI Theory
Safety
🏢 Gray Swan AI
AI systems are made safer by ‘circuit breakers’ that directly control harmful internal representations, significantly improving alignment and robustness against adversarial attacks with minimal impact…
Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch
·527 words·3 mins·
AI Generated
AI Theory
Safety
🏢 Colorado State University
This paper introduces Expectation Alignment (EAL), a novel framework and interactive algorithm to address reward misspecification in AI, aligning AI behavior with user expectations.
Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies
·506 words·3 mins·
AI Theory
Safety
🏢 ETH Zurich
This paper introduces a novel quantitative definition of AI alignment for social decision-making, proposing probably approximately aligned policies and a method to safeguard any autonomous agent’s act…
Aligning Model Properties via Conformal Risk Control
·1981 words·10 mins·
AI Generated
AI Theory
Safety
🏢 Stanford University
Post-processing pre-trained models with conformal risk control and property testing guarantees better alignment, even when the training data is biased.