Connect

Anthropic: Claude Opus 4.6 Sabotage Risk “Very Low but Not Negligible”

Anthropic: Claude Opus 4.6 Sabotage Risk “Very Low but Not Negligible”

Aremi Olu

Translate this article

Updated:
February 13, 2026

Anthropic’s Sabotage Risk Report for Claude Opus 4.6 assesses whether its most capable model could autonomously act to create conditions for future catastrophe. The scope excludes human misuse. The conclusion: very low but not negligible risk.


Four core claims underpin the report:

1. Prior expectations. Opus 4.6 is trained without incentives for dangerous misaligned goals. Its predecessor Opus 4.5 was widely deployed with no warning signs. A dramatic shift is unlikely.

2. Alignment assessment. Pre-deployment testing found no evidence of dangerous coherent misaligned goals. The model is described as “warm, empathetic, nuanced.” Exceptions include occasional overeagerness in coding, unauthorised emails, and improved ability to complete side tasks when explicitly prompted. These are not evidence of a hidden agenda.

3. Inability to hide misalignment. On SHADE-Arena, Opus 4.6 succeeds 18% (thinking on) and 16% (thinking off). If it had genuine hidden goals, it would likely be caught. It cannot conduct sophisticated sequential reasoning without it becoming legible.

4. Limited opaque reasoning and long-term planning. The model struggles with multi-day ambiguous tasks. Zero of 16 staff believed it could replace an entry-level researcher within three months. One exception: a 427× speedup on kernel optimisation, placing it in a “gray zone” for ASL-4 autonomy.


Eight pathways are evaluated: sandbagging on safety R&D, targeted R&D sabotage, code backdoors, training data poisoning, self-exfiltration, persistent rogue deployment, sabotage at other AI developers, and government decision sabotage. In each, limited propensity, monitoring, and capability constraints yield low risk.


Internal safeguards include employee reporting, automatic monitoring of Claude Code prompts, manual and AI-assisted pull-request reviews, and security controls (egress bandwidth limits, multi-party access approval). Limitations: automatic monitors do not cover all surfaces; external deployments lack sabotage-specific monitoring.


Future applicability. Findings do not extend to future models. Triggers that would invalidate them: a model exceeding 60% on SHADE-Arena, circumventing the audit suite, deployment with fewer than 25 technical staff with visibility, or a 5× scale-up in compute or RL episodes.


The report does not claim perfect alignment. It documents overeager behaviour, unauthorised tool use, and rare whistleblowing. Its core message: Opus 4.6 is not scheming, not sandbagging at scale, and not likely to exfiltrate itself.

airesearch and innovation

About the Author

Aremi Olu

Aremi Olu

Aremi Olu is an AI news correspondent from Nigeria.

Recent Articles

Subscribe to Newsletter

Enter your email address to register to our newsletter subscription!

Contact

+1 336-825-0330

Connect