**Summary: The 16-Hour Amazon Web Services Outage and Its Far-Reaching Impact**
In October, Amazon Web Services (AWS) experienced a massive outage that disrupted thousands of businesses and millions of users worldwide. According to a detailed post-mortem shared by Amazon engineers, the entire debacle was sparked by a single software bug in one region of Amazon’s vast cloud infrastructure. The incident, which lasted over 15 hours, underscores the risks of concentrated dependencies and single points of failure in global cloud computing.
### The Outage: Scope and Immediate Impact
The outage persisted for 15 hours and 32 minutes, during which a cascade of failures spread from system to system within the AWS network. Network intelligence firm Ookla reported that its Downdetector service registered over 17 million reports of disrupted services from more than 3,500 organizations. The largest number of reports came from the United States, United Kingdom, and Germany. Among the most affected services were Snapchat, Roblox, and AWS itself. Ookla described the incident as “among the largest internet outages on record for Downdetector,” highlighting its unprecedented scale and global reach.
### Root Cause: A Software Bug and the Domino Effect
Amazon traced the root cause to a software bug within the DynamoDB DNS management system in its US-East-1 region, AWS’s oldest and most heavily used hub. DynamoDB is AWS’s managed NoSQL database service and a dependency for many other AWS systems; its DNS management layer keeps the domain-name records for DynamoDB’s regional endpoints up to date so that client traffic is routed to healthy, load-balanced servers.
The specific failure was triggered by a “race condition”—a type of software error where the outcome depends on the unpredictable timing of events. In this case, the problem lay in how two key components of the DynamoDB DNS management system, called the DNS Enactor and the DNS Planner, interacted under unexpected delays.
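To illustrate the general pattern, the following minimal Python sketch (hypothetical code, not AWS’s) shows two workers writing to a shared record without coordination. The final state depends on which write happens to land last, not on which plan is newest:

```python
import threading
import time

# Hypothetical illustration (not AWS code): two workers race to apply
# "plans" to a shared record. Each writes its plan without checking
# whether a newer one has already been applied, so the final state
# depends on which write lands last.
current_plan = {"version": 0}

def apply_plan(version, delay):
    time.sleep(delay)                   # simulates the delayed worker
    current_plan["version"] = version   # blind write: no freshness check

t_old = threading.Thread(target=apply_plan, args=(1, 0.2))  # older plan, delayed
t_new = threading.Thread(target=apply_plan, args=(2, 0.0))  # newer plan, fast
t_old.start(); t_new.start()
t_old.join(); t_new.join()

print(current_plan)  # {'version': 1}: the stale plan overwrote the newer one
```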
The DNS Enactor is tasked with constantly updating domain lookup tables at individual AWS endpoints. During the incident, the Enactor encountered unusually high delays while retrying updates on several DNS endpoints. Meanwhile, the DNS Planner continued generating new configuration plans, and a second DNS Enactor began applying these fresh plans.
This overlap created a problematic situation: the delayed first Enactor ended up overwriting the newer, correct DNS plan with outdated information. The second Enactor’s clean-up routine then deleted the plan that had just been applied and, crucially, all associated IP addresses for the regional endpoint. This left the system in an inconsistent state, unable to process further DNS updates and effectively severing connectivity.
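A compressed, hypothetical sketch of that sequence is below. The names (`DnsStore`, `apply_plan`, `cleanup_stale_plans`) are invented for illustration and are not AWS’s actual components, but the sketch follows the order of failure described above: a stale apply, followed by a clean-up that removes the endpoint’s addresses entirely.

```python
# Hypothetical sketch of the failure sequence described above. The class and
# method names are invented for illustration; they are not AWS components.

class DnsStore:
    def __init__(self):
        self.active_plan_id = None
        self.records = {}   # endpoint -> list of IP addresses

    def apply_plan(self, plan):
        # Blind apply: no check that this plan is newer than the active one.
        self.active_plan_id = plan["id"]
        self.records[plan["endpoint"]] = plan["ips"]

    def cleanup_stale_plans(self, newest_plan_id):
        # Clean-up treats any active plan older than the newest known plan
        # as obsolete and deletes its records, including the records that
        # the delayed apply just wrote.
        if self.active_plan_id is not None and self.active_plan_id < newest_plan_id:
            self.records.clear()        # all endpoint IPs are gone
            self.active_plan_id = None  # no usable plan remains


store = DnsStore()

# The second (fast) Enactor applies the fresh plan from the DNS Planner.
store.apply_plan({"id": 42, "endpoint": "dynamodb.us-east-1", "ips": ["10.0.0.1", "10.0.0.2"]})

# The delayed first Enactor finally finishes, overwriting the newer plan
# with its outdated one.
store.apply_plan({"id": 41, "endpoint": "dynamodb.us-east-1", "ips": ["10.0.9.9"]})

# The second Enactor's clean-up then removes what it sees as a stale plan,
# leaving the regional endpoint with no IP addresses at all.
store.cleanup_stale_plans(newest_plan_id=42)
print(store.records)         # {}   -- inconsistent, empty state
print(store.active_plan_id)  # None -- further automated updates cannot proceed
```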
Manual intervention was ultimately required to resolve the situation, as automated recovery mechanisms could not cope with the inconsistencies.
### Cascading Failures Across AWS Services
The initial disruption in DynamoDB’s DNS management quickly spread to other AWS services. Systems and customers relying on the US-East-1 region began experiencing errors and connectivity failures. The impact was not limited to customer-facing applications; Amazon’s own internal services were also affected.
As DynamoDB faltered, other AWS services that depend on it internally, including EC2 instance launches and network load balancing, began to fail as well, and the disruption rippled outward to the many applications built on top of them.
