Anthropic Shares Postmortem on Infrastructure Bugs That Impacted Claude

Jack Carter

Updated:
September 19, 2025

Anthropic has published a detailed postmortem analyzing three infrastructure issues that recently degraded the quality of responses from its Claude models. The company emphasized that the incidents were not tied to demand, server load, or deliberate throttling of quality, but were instead the result of unforeseen technical bugs across its large-scale deployment systems.

What Happened

Between August and early September, users began reporting inconsistencies in Claude’s performance. At first the feedback appeared to fall within normal variation, but as reports accumulated, Anthropic uncovered three overlapping infrastructure problems:

1. Context Window Routing Error: Introduced on August 5, this bug misrouted some Claude Sonnet 4 requests to servers configured for a 1M-token context window. It initially affected less than 1% of traffic but worsened after a load-balancing change on August 29, peaking at 16% of Sonnet 4 requests on August 31. Because routing was “sticky,” a user whose request was once misrouted often continued to receive degraded responses on subsequent requests.
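The persistence described here can be pictured with a small sketch. Nothing below is Anthropic’s actual routing code; the pool names, hashing scheme, and sticky table are hypothetical stand-ins for how session-affinity routing can pin a user to a bad assignment:

```python
# Hypothetical sketch: why "sticky" routing keeps a session on the pool
# it was first assigned to, so one bad routing decision persists.
import hashlib

SERVER_POOLS = ["standard-200k", "long-context-1m"]  # assumed pool names

def route(session_id: str, sticky_table: dict) -> str:
    """Route a request; once a session is assigned a pool, it stays there."""
    if session_id in sticky_table:
        return sticky_table[session_id]  # sticky: reuse prior assignment
    # Hash-based assignment; a load-balancer change that alters this mapping
    # can silently send a slice of sessions to the wrong pool.
    idx = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % len(SERVER_POOLS)
    pool = SERVER_POOLS[idx]
    sticky_table[session_id] = pool
    return pool

table = {}
first = route("user-42", table)
# Every later request from the same session lands on the same pool,
# so a bad initial assignment persists until the table is cleared.
assert all(route("user-42", table) == first for _ in range(5))
```

This is why, in the incident, affected users tended to stay affected: affinity is a feature for cache locality, but it also makes a misrouting error stick.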

2. Output Corruption on TPU Servers: On August 25, a misconfiguration on TPU servers caused occasional corruption of model outputs, such as Thai or Chinese characters appearing mid-sentence in English responses, or syntax errors in generated code. The issue affected Claude Opus 4.1, Opus 4, and Sonnet 4 until September 2, when the change was rolled back.
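A toy illustration of why this symptom appears: if something in the serving path corrupts even one logit before sampling, a token the model considers near-impossible can suddenly dominate. The vocabulary, values, and corruption mechanism below are invented for illustration and are not Anthropic’s serving stack:

```python
# Toy illustration (assumed, not Anthropic's system): a corrupted logit
# lets a near-impossible token (a Thai character mid-English) win sampling.
import math

vocab = ["the", " cat", "ฮ"]   # toy vocabulary; "ฮ" is a Thai letter
logits = [5.0, 4.0, -20.0]     # healthy model: ~zero probability on "ฮ"

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

assert softmax(logits)[2] < 1e-9   # healthy: Thai token effectively never sampled

corrupted = logits[:]
corrupted[2] += 30.0               # a misconfigured kernel corrupts one value
assert softmax(corrupted)[2] > 0.9  # corrupted: Thai token now dominates
```

The point is that output corruption of this kind need not look like random noise; a single flipped value can produce coherent-looking but wrong tokens.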

3. Approximate Top-k Compiler Bug: Also on August 25, a code change intended to improve how tokens are selected during generation exposed a latent bug in the XLA:TPU compiler. The approximation occasionally dropped the highest-probability token outright, producing inconsistent or degraded responses. The issue affected Claude Haiku 3.5 and possibly some Opus and Sonnet models. Anthropic reverted to an exact top-k implementation and is collaborating with TPU compiler engineers to address the bug.
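The difference between exact and approximate top-k can be shown with a deliberately coarse approximation. The bucketing scheme below is a stand-in invented for illustration, not the actual XLA:TPU behavior, but it reproduces the same failure mode: the single highest-probability token gets dropped.

```python
# Illustrative sketch (assumed, simplified): an approximate top-k that
# coarsens scores can drop the true argmax; an exact top-k never does.
def exact_top_k(logits, k):
    """Exact top-k: full sort by score, descending."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def approx_top_k(logits, k, bucket=0.5):
    """Coarse bucketing as a stand-in for a compiler approximation:
    scores in the same bucket are tie-broken by index, so the true
    argmax can lose to a lower-scoring neighbor."""
    return sorted(range(len(logits)),
                  key=lambda i: (round(logits[i] / bucket), -i),
                  reverse=True)[:k]

logits = [2.4, 2.6, 1.0, 0.3]          # token 1 has the highest probability
assert exact_top_k(logits, 1) == [1]   # exact selection keeps the argmax
# Tokens 0 and 1 land in the same bucket (round(2.4/0.5) == round(2.6/0.5) == 5),
# and the index tie-break then prefers token 0, dropping the true best token.
assert approx_top_k(logits, 1) == [0]
```

Dropping the argmax is exactly the kind of error that is invisible in aggregate benchmarks but visible to users as occasional odd word choices.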

Why It Took Time to Fix

The overlapping nature of the issues made diagnosis difficult. Each bug produced different symptoms depending on platform and model configuration, creating contradictory reports. Anthropic also noted that its strict privacy controls limited engineers’ ability to directly inspect user interactions, which slowed reproduction and debugging.

Fixes and Changes Going Forward

All three bugs have now been resolved. Fixes were deployed gradually across Anthropic’s own API, Amazon Bedrock, and Google Cloud’s Vertex AI. To prevent similar incidents, Anthropic is:

- Developing more sensitive evaluation methods to detect subtle degradation.

- Running continuous quality checks on production systems rather than relying solely on canary deployments.

- Building faster debugging tools to investigate community-reported issues while preserving user privacy.
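As a rough sketch of what a continuous production quality check might look like (the `QualityMonitor` class, window size, and threshold are assumptions for illustration, not Anthropic’s tooling):

```python
# Minimal sketch (assumed design): score a rolling sample of production
# responses and alert on degradation, instead of relying only on
# pre-deploy canary checks.
from collections import deque

class QualityMonitor:
    def __init__(self, window=100, threshold=0.9):
        self.scores = deque(maxlen=window)  # rolling window of per-response scores
        self.threshold = threshold

    def record(self, score: float) -> bool:
        """Record one response score in [0, 1]; return True if quality degraded."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False                    # not enough data yet
        avg = sum(self.scores) / len(self.scores)
        return avg < self.threshold         # alert when the rolling mean drops

mon = QualityMonitor(window=5, threshold=0.9)
for s in [0.95, 0.96, 0.94, 0.95]:
    assert mon.record(s) is False           # healthy traffic: no alert
assert mon.record(0.2) is True              # sudden degradation trips the alert
```

Running a check like this against live traffic catches the slow, partial degradations that a one-time canary deployment can miss.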

Bigger Picture

Anthropic stressed that maintaining consistent model quality across hardware platforms—AWS Trainium, NVIDIA GPUs, and Google TPUs—remains a core priority. While infrastructure complexity enables global scale, it also introduces risks when optimizations interact unexpectedly.

The company closed its postmortem by acknowledging the role of user reports in uncovering these issues and encouraged continued feedback through its apps and developer tools.


About the Author

Jack Carter

Jack Carter is an AI Correspondent based in the United States.
