AWS US-EAST-1 Region Faces Significant Service Disruptions Due to EC2 Instance Launch Delays
On October 28, 2025, Amazon Web Services (AWS) encountered substantial operational challenges within its US-EAST-1 region, leading to elevated latencies and failures in launching EC2 instances. This disruption had a cascading effect on various container orchestration services, underscoring the intricate interdependencies within AWS’s infrastructure.
Incident Overview
The issues originated around midday Pacific Daylight Time (PDT) in the use1-az2 Availability Zone. Customers reported prolonged delays and failures when attempting to launch EC2 instances and container tasks, a reminder of how central this region is to global operations. AWS promptly notified affected users through the Personal Health Dashboard, acknowledging the problem's scope.
Impact on Services
The EC2 launch delays extended to the Elastic Container Service (ECS), driving elevated failure rates for task launches on both the EC2 and Fargate launch types. Some customers also saw container instances disconnect unexpectedly, halting running tasks and disrupting workflows.
Beyond core compute services, the outage affected analytics and data processing tools such as EMR Serverless, which relies on ECS warm pools for rapid job execution. Jobs in EMR faced execution delays or outright failures due to unhealthy clusters in impacted cells. Other affected services included Elastic Kubernetes Service (EKS) for Fargate pod launches, AWS Glue for ETL operations, and Managed Workflows for Apache Airflow (MWAA), where environments stalled in unhealthy states. App Runner, DataSync, CodeBuild, and AWS Batch also experienced increased error rates, though existing EC2 instances remained operational.
Root Cause Analysis
AWS identified the root issues in a small number of ECS cells but did not disclose specific details about the underlying cause. This incident is reminiscent of prior dependency failures in the same region, highlighting ongoing vulnerabilities in AWS’s densely interconnected infrastructure.
Recovery Timeline
To stabilize the system, AWS throttled mutating API calls in use1-az2 and advised customers to retry any operations that failed with RequestLimitExceeded errors. By 3:36 PM PDT, EC2 launches began to normalize; however, ECS recovery lagged, with no immediate customer-visible improvement.
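For customers scripting instance launches, that retry guidance amounts to backing off and retrying any call that fails with a RequestLimitExceeded error. The following is a minimal boto3 sketch of that pattern; the AMI ID, instance type, and retry bounds are illustrative placeholders rather than values from the incident.

```python
# Sketch: retrying an EC2 RunInstances call that is being throttled with
# RequestLimitExceeded, as AWS advised during the incident. AMI ID, instance
# type, and retry limits are placeholders for illustration only.
import random
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")


def launch_with_backoff(max_attempts: int = 8) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            resp = ec2.run_instances(
                ImageId="ami-0123456789abcdef0",  # placeholder AMI
                InstanceType="t3.micro",
                MinCount=1,
                MaxCount=1,
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            if err.response["Error"]["Code"] != "RequestLimitExceeded":
                raise  # unrelated failure; surface it immediately
            # Exponential backoff with jitter before retrying the throttled call.
            delay = min(2 ** attempt, 60) * random.uniform(0.5, 1.0)
            time.sleep(delay)
    raise RuntimeError("EC2 launch still throttled after retries")
```

Boto3 can also apply a similar policy automatically via its built-in retry configuration (for example, the adaptive retry mode), which is often preferable to a hand-rolled loop.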
Progress accelerated by 5:31 PM, as AWS refreshed EMR warm pools and observed reductions in AWS Glue error rates, estimating full resolution within 2-3 hours. At 6:50 PM, ECS task launches showed positive signs, prompting recommendations for customers to recreate impacted clusters with new identifiers or update MWAA environments without configuration changes.
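Both recommendations can be carried out programmatically. Below is a brief boto3 sketch of the two steps, creating a replacement ECS cluster under a new name and issuing a no-change update to an MWAA environment; the cluster and environment names are hypothetical, and the console's edit-and-save-without-changes flow achieves roughly the same MWAA refresh.

```python
# Sketch of the two remediation steps AWS suggested: recreate an affected ECS
# cluster under a new identifier, and trigger an update on a stalled MWAA
# environment without changing its configuration. Names are hypothetical.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
mwaa = boto3.client("mwaa", region_name="us-east-1")

# 1. Create a replacement cluster with a new name; services and task
#    definitions would then be pointed at this cluster instead of the old one.
ecs.create_cluster(clusterName="orders-cluster-v2")

# 2. Issue an update to the unhealthy MWAA environment, leaving every
#    configuration field unset so nothing actually changes.
mwaa.update_environment(Name="data-pipelines-airflow")
```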
Throttles remained in place in three ECS cells, but by 8:08 PM, EMR warm pools were fully refreshed and ECS launch success rates were climbing, with full recovery estimated within one to two hours. A significant recovery milestone was reached at 8:54 PM, and by 9:52 PM, two cells had fully recovered and their throttles were lifted, while the third lagged behind.
The issue was fully resolved at 10:43 PM PDT, restoring normal operations across all services. AWS confirmed no lingering impact, noting only that working through backlogs might cause minor residual delays.
Implications and Recommendations
This incident, following a major US-EAST-1 outage on October 20, exposes persistent fragility resulting from internal service interdependencies. While not as widespread as the earlier DynamoDB-triggered event, it disrupted workflows for developers and enterprises in AWS’s busiest region.
Experts note that such incidents, though contained, erode confidence in architectures that lack robust failover across Availability Zones and regions. AWS has urged customers to diversify cluster placements and monitor proactively to mitigate future risks.
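In practice, diversifying placement usually means spreading container capacity across several Availability Zones and alarming early when instance launches or registrations stall. The sketch below shows one way to do this with boto3, pairing an Auto Scaling group that spans three subnets with a Container Insights alarm; the subnet IDs, resource names, metric choice, and thresholds are assumptions for illustration, not AWS-prescribed values.

```python
# Sketch of the recommended mitigations: spread ECS capacity across multiple
# Availability Zones and alert when a cluster stops registering container
# instances. Subnet IDs, names, ARNs, and thresholds are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Capacity spread across three AZ-distinct subnets, so a single impaired AZ
# (such as use1-az2 in this incident) cannot take out the whole cluster.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="ecs-capacity-multi-az",
    LaunchTemplate={"LaunchTemplateName": "ecs-node-template", "Version": "$Latest"},
    MinSize=3,
    MaxSize=12,
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",  # one per AZ
)

# Proactive monitoring: alarm when the cluster's registered container-instance
# count falls below one (requires ECS Container Insights to be enabled).
cloudwatch.put_metric_alarm(
    AlarmName="ecs-no-registered-container-instances",
    Namespace="ECS/ContainerInsights",
    MetricName="ContainerInstanceCount",
    Dimensions=[{"Name": "ClusterName", "Value": "orders-cluster-v2"}],
    Statistic="Minimum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder ARN
)
```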