Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition Paper • 2507.20526 • Published Jul 28
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems Paper • 2504.07831 • Published Apr 10
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents Paper • 2410.09024 • Published Oct 11, 2024
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents Paper • 2410.10871 • Published Oct 8, 2024