Safety
Unelicitable Backdoors via Cryptographic Transformer Circuits
·1600 words·8 mins·
AI Theory
Safety
🏢 Contramont Research
Researchers unveil unelicitable backdoors in language models, built from cryptographic transformer circuits, that defy conventional detection methods and raise crucial AI safety concerns.
Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense
·3401 words·16 mins·
AI Theory
Safety
🏢 Hong Kong University of Science and Technology
Current backdoor defenses, while effective at reducing attack success rates, are vulnerable to rapid re-learning. This work unveils this superficial safety, proposes a novel attack, and introduces a p…
Stepwise Alignment for Constrained Language Model Policy Optimization
·2517 words·12 mins·
AI Theory
Safety
🏢 University of Tsukuba
Stepwise Alignment for Constrained Policy Optimization (SACPO) efficiently aligns LLMs with human values, prioritizing both helpfulness and harmlessness via a novel stepwise approach.
SEEV: Synthesis with Efficient Exact Verification for ReLU Neural Barrier Functions
·1687 words·8 mins·
AI Theory
Safety
🏢 Washington University in St. Louis
The SEEV framework verifies ReLU neural barrier functions by reducing activation regions and using tight over-approximations, significantly improving verification efficiency without sacrificin…
Secret Collusion among AI Agents: Multi-Agent Deception via Steganography
·5189 words·25 mins·
AI Generated
AI Theory
Safety
🏢 UC Berkeley
AI agents can secretly collude using steganography, hiding their interactions from oversight. This research formalizes this threat, analyzes LLMs’ capabilities, and proposes mitigation strategies.
Rule Based Rewards for Language Model Safety
·3342 words·16 mins·
AI Theory
Safety
🏢 OpenAI
Rule-Based Rewards (RBRs) enhance LLM safety by using AI feedback and a few-shot prompt-based approach, achieving higher safety-behavior accuracy with less human annotation than existing methods.
Refusal in Language Models Is Mediated by a Single Direction
·4093 words·20 mins·
AI Theory
Safety
🏢 Independent
LLM refusal is surprisingly mediated by a single, easily manipulated direction in the model’s activation space.
Provably Safe Neural Network Controllers via Differential Dynamic Logic
·2824 words·14 mins·
AI Theory
Safety
🏢 Karlsruhe Institute of Technology
Verifiably safe AI controllers are created via a novel framework, VerSAILLE, which uses differential dynamic logic and open-loop NN verification to prove safety for unbounded time horizons.
Neural Model Checking
·2039 words·10 mins·
AI Theory
Safety
🏢 University of Birmingham
Neural networks revolutionize hardware model checking by generating formal proof certificates, outperforming state-of-the-art techniques in speed and scalability.
Improving Alignment and Robustness with Circuit Breakers
·2515 words·12 mins·
AI Theory
Safety
🏢 Gray Swan AI
AI systems are made safer by ‘circuit breakers’ that directly control harmful internal representations, significantly improving alignment and robustness against adversarial attacks with minimal impact…
Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch
·527 words·3 mins·
AI Generated
AI Theory
Safety
🏢 Colorado State University
This paper introduces Expectation Alignment (EAL), a novel framework and interactive algorithm to address reward misspecification in AI, aligning AI behavior with user expectations.
Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies
·506 words·3 mins·
AI Theory
Safety
🏢 ETH Zurich
This paper introduces a novel quantitative definition of AI alignment for social decision-making, proposing probably approximately aligned policies and a method to safeguard any autonomous agent’s act…
Aligning Model Properties via Conformal Risk Control
·1981 words·10 mins·
AI Generated
AI Theory
Safety
🏢 Stanford University
Post-processing pre-trained models with conformal risk control and property testing guarantees better alignment, even when the training data is biased.