AWS Resolves Major Outage After Nearly 24 Hours of Disruption

Amazon Web Services (AWS), the leading global cloud computing provider, has announced the resolution of a significant outage in its US-EAST-1 region. The disruption lasted nearly 24 hours, affected millions of users worldwide, and underscored how heavily the internet relies on AWS infrastructure.

Incident Overview

The outage began late on October 19, 2025, and persisted until the mid-afternoon of October 20. By 3:01 PM PDT on October 20, AWS confirmed that all services had returned to normal operations, though data-processing backlogs for services such as AWS Config and Redshift were expected to take a few more hours to clear.

Root Cause Analysis

The disruption originated from DNS resolution issues affecting the DynamoDB API endpoint in US-EAST-1, AWS’s oldest and busiest region, located in Northern Virginia. At 11:49 PM PDT on October 19, elevated error rates and latencies were detected across multiple services and were initially traced to DynamoDB, a core database service that countless applications depend on.

By 12:26 AM PDT on October 20, AWS engineers identified the root cause: a faulty DNS update that prevented applications from resolving the IP addresses of DynamoDB’s servers. The effect resembled a corrupted phone book: clients could no longer look up the addresses they needed, so requests failed before ever reaching the service.
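To make the failure mode concrete, the minimal Python sketch below performs the same DNS lookup an AWS SDK performs before opening a connection to DynamoDB’s regional endpoint. It is illustrative only: the hostname is the standard public one, and the error path shows roughly what affected applications would have seen.

```python
import socket

# The standard public hostname for DynamoDB's regional API in us-east-1.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # getaddrinfo performs the same DNS lookup an SDK does before
    # opening a connection; during the outage, this step itself failed.
    records = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
    for _family, _type, _proto, _canonname, sockaddr in records:
        print(f"resolved {ENDPOINT} -> {sockaddr[0]}")
except socket.gaierror as exc:
    # A resolution failure surfaces here; SDK calls built on top of it
    # fail with connection errors rather than service-level errors.
    print(f"DNS resolution failed for {ENDPOINT}: {exc}")
```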

Cascading Service Failures

The DNS failure triggered a cascade of service disruptions (a sketch of how client applications typically defend against such failures appears after the list):

– EC2 Instances: Launches stalled due to dependencies on DynamoDB.
– Network Load Balancers: Health checks failed, leading to connectivity issues.
– Other Services: AWS Lambda, SQS, and CloudWatch experienced connectivity problems.
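Applications caught downstream of a cascade like this generally rely on SDK-level retries to ride out transient errors. The Python sketch below is a minimal illustration using boto3’s built-in adaptive retry mode; the table name and key are placeholders. Retries would not have fixed the DNS failure itself, but adaptive mode backs off and rate-limits the client, which limits retry storms against a recovering service.

```python
import boto3
from botocore.config import Config

# Adaptive retry mode layers client-side rate limiting on top of
# exponential backoff, useful when a dependency is degraded.
retry_config = Config(
    region_name="us-east-1",
    retries={"max_attempts": 10, "mode": "adaptive"},
)

dynamodb = boto3.client("dynamodb", config=retry_config)

# "example-table" and its key are placeholders; any read behaves the same.
response = dynamodb.get_item(
    TableName="example-table",
    Key={"pk": {"S": "item-1"}},
)
print(response.get("Item"))
```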

Widespread Impact

The outage’s reach was extensive, affecting over 100 AWS services and numerous consumer-facing platforms:

– Social Media and Entertainment: Applications such as Snapchat, Fortnite, and Roblox went offline, preventing users from logging in or accessing features.
– Financial Services: Platforms such as Coinbase and Venmo, along with banking apps from Lloyds and Halifax in the UK, faced login difficulties.
– Amazon Ecosystem: Prime Video experienced increased buffering, Ring doorbells lost remote access, and e-commerce checkouts encountered issues.
– Other Sectors: Government agencies, airlines such as Delta, the Disney+ streaming service, and media outlets including The New York Times reported interruptions.

The incident highlighted AWS’s central role in global cloud infrastructure, where it holds a market share of approximately 33%.

Criticism and Transparency Concerns

AWS faced criticism for a 75-minute delay in diagnosing the issue and for initial status page messages that suggested all systems were operational. These transparency concerns echoed previous critiques regarding AWS’s outage notifications. Importantly, the incident was attributed to an internal update error in a foundational service, with no evidence of a cyberattack.

AWS’s Response and Mitigation Efforts

AWS implemented several measures to address the outage (the general throttling technique is sketched after the list):

– DNS Cache Flushing: Clearing stale records so clients could resolve the affected endpoints again.
– Throttling EC2 Launches: Reducing load on affected subsystems while they stabilized.
– Scaling Up Polling Rates: Increasing the rate at which SQS queues feeding Lambda were polled, to work through event backlogs.
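AWS’s internal throttling mechanism is not publicly documented, but the underlying idea, admitting work no faster than a recovering subsystem can absorb, is a standard token-bucket pattern. The sketch below illustrates that general technique only; the rate and the “launch” loop are hypothetical stand-ins, not AWS’s actual implementation.

```python
import threading
import time

class TokenBucket:
    """Simple token bucket: permits at most `rate` operations per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill tokens proportionally to elapsed time, capped at capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(1.0 / self.rate)

# Hypothetical example: admit at most 5 launch requests per second
# while a downstream subsystem recovers.
limiter = TokenBucket(rate=5, capacity=5)
for i in range(10):
    limiter.acquire()
    print(f"launch request {i} forwarded")
```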

By 2:24 AM PDT, the core DNS issue with DynamoDB was addressed, leading to early signs of recovery. However, network issues persisted into the morning. Temporary throttles on operations like asynchronous Lambda invocations were applied to prioritize critical processes, with full restoration of EC2 launches achieved by 2:48 PM PDT.

Global features dependent on the US-EAST-1 region, such as IAM updates and DynamoDB Global Tables, also recovered, allowing support case creation to resume. AWS committed to publishing a detailed post-incident analysis and noted that backlogs for services such as Amazon Connect and Redshift were still being processed.

Expert Analysis

Analysts at network monitoring firms such as ThousandEyes observed no anomalies on the public internet, confirming that the issue was internal to AWS. As services returned to normal, affected users were advised to retry failed operations and consult the AWS Health Dashboard for further updates.
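For programmatic status checks, the AWS Health API exposes the event data behind the dashboard. The sketch below assumes an account with a Business or Enterprise support plan, which the API requires, and queries open events in us-east-1; accounts without such a plan would rely on the public dashboard instead.

```python
import boto3

# The AWS Health API requires a Business or Enterprise support plan;
# its single public endpoint is itself hosted in us-east-1.
health = boto3.client("health", region_name="us-east-1")

events = health.describe_events(
    filter={
        "regions": ["us-east-1"],
        "eventStatusCodes": ["open"],
    }
)

for event in events["events"]:
    print(event["service"], event["eventTypeCode"], event["statusCode"])
```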

Conclusion

This significant outage serves as a stark reminder of the internet’s dependency on major cloud service providers like AWS. Organizations are encouraged to develop robust contingency plans and diversify their cloud strategies to mitigate the impact of potential future disruptions.