Amazon Web Services has explained what went wrong to cause the major outage that crippled many businesses this week. In a post-event summary, AWS outlined how an initial issue in its DynamoDB database service had a cascading impact that prolonged the outage.
Between 11:48 p.m. on Oct. 19 and 2:40 a.m. on Oct. 20, Amazon DynamoDB experienced “increased API error rates” in its US-East-1 Region in Northern Virginia, AWS's oldest and one of its most heavily used regions for deploying applications.
This left various apps and services unusable, including Snapchat, Fortnite, Ring, Roblox, Coinbase and the messaging app Signal.
AWS described how, during this period, “customers and other AWS services with dependencies on DynamoDB were unable to establish new connections to the service.”
It says the incident was triggered by “a latent defect” — in other words, a hidden fault — within the service’s automated DNS management system. This caused endpoint resolution failures for DynamoDB, AWS noted.
DNS — also known as the internet’s phone book — is the system that translates domain names such as Forbes.com to IP addresses so browsers can load internet resources.
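For readers who want to see what that lookup step looks like in practice, the short Python sketch below resolves a hostname to its IP addresses. The hostname is DynamoDB's real regional endpoint, but the snippet itself is purely illustrative and is not taken from AWS's report.

```python
import socket

# Resolve a hostname to its IP addresses: the same lookup step that failed
# when the DynamoDB regional endpoint's DNS record was left empty.
# Illustrative only; the hostname is DynamoDB's real endpoint, but this
# snippet is not drawn from AWS's report.
def resolve(hostname: str) -> list[str]:
    try:
        results = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in results})
    except socket.gaierror as err:
        # An empty or missing DNS record surfaces here as a resolution error,
        # so clients cannot even open a connection to the service.
        print(f"DNS resolution failed for {hostname}: {err}")
        return []

print(resolve("dynamodb.us-east-1.amazonaws.com"))
```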
Services such as DynamoDB maintain “hundreds of thousands of DNS records to operate a very large heterogeneous fleet of load balancers in each Region,” AWS said, adding: “Automation is crucial to ensuring that these DNS records are updated frequently to add additional capacity as it becomes available, to correctly handle hardware failures, and to efficiently distribute traffic to optimize customers’ experience.”
But the “latent race condition” in the DynamoDB DNS management system, a flaw in which two automated processes act on the same data at once and the outcome depends on which finishes first, resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com), which the automation then failed to repair.
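To make that failure mode concrete, here is a deliberately simplified Python sketch of how such a race can leave a DNS record empty: a slower worker, acting on a stale view, deletes the record a faster worker has just written. The worker names, data structure and scripted timing are invented for illustration and do not reflect how AWS's DNS Planner and Enactor actually work.

```python
import threading
import time

# A deliberately simplified, timing-scripted sketch of how a race between two
# automation workers can leave a DNS record empty. Every name here is invented
# for illustration; this is not how AWS's DNS Planner and Enactor are built.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"
dns_table = {ENDPOINT: "192.0.2.10"}  # the record clients resolve

def apply_new_plan(new_ip: str) -> None:
    # The faster worker writes the newest record almost immediately.
    time.sleep(0.01)
    dns_table[ENDPOINT] = new_ip

def stale_cleanup() -> None:
    # The slower worker decided earlier that the record was obsolete and never
    # re-checks before acting, so it deletes the record the other worker just
    # wrote. Which worker finishes first is a matter of timing: the race.
    time.sleep(0.05)
    dns_table.pop(ENDPOINT, None)

t1 = threading.Thread(target=apply_new_plan, args=("192.0.2.20",))
t2 = threading.Thread(target=stale_cleanup)
t1.start(); t2.start(); t1.join(); t2.join()
print(dns_table)  # {}: the regional endpoint now resolves to nothing
```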
Issues In The Network Load Balancer
Then, as systems started to recover, the Network Load Balancer (NLB) service experienced increased connection errors for some customers in the same region between 5:30 a.m. and 2:09 p.m. on Oct. 20. “This was caused by health check failures in the NLB fleet, which resulted in increased connection errors on some NLBs,” AWS explained.
Over an overlapping period, between 2:25 a.m. and 10:36 a.m. on Oct. 20, new EC2 instance launches failed. While instance launches began to succeed from 10:37 a.m., some newly launched instances experienced connectivity issues, which were resolved by 1:50 p.m., according to AWS.
“The delays in network state propagations for newly launched EC2 instances also caused impact to the network load balancer service and AWS services that use NLB,” AWS said.
Amazon Apologizes For Outage, Explains Next Steps
AWS has now issued an apology for the incident. “We apologize for the impact this event caused our customers,” AWS wrote. “While we have a strong track record of operating our services with the highest levels of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.”
AWS said it is “making several changes as a result of this operational event.”
For example, it has already disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide. “In advance of re-enabling this automation, we will fix the race condition scenario and add additional protections to prevent the application of incorrect DNS plans.”
For NLB, AWS is adding a velocity control mechanism to limit the capacity a single NLB can remove when health check failures cause Availability Zone (AZ) failover.
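AWS has not published details of that mechanism, but in principle a velocity control of this kind caps how much capacity automation may remove within a rolling time window. The Python sketch below is a hypothetical illustration; the class name, 20% threshold and five-minute window are invented, not AWS's figures.

```python
from collections import deque
import time

# A toy sketch of a "velocity control": cap how much capacity automation may
# remove within a rolling window. The class name, window and 20% threshold
# are invented for illustration; AWS has not published its implementation.
class RemovalVelocityLimiter:
    def __init__(self, total_capacity: int, max_fraction: float = 0.2,
                 window_seconds: float = 300.0):
        self.total_capacity = total_capacity
        self.max_fraction = max_fraction
        self.window_seconds = window_seconds
        self.removals = deque()  # (timestamp, amount) of recent removals

    def allow_removal(self, amount: int) -> bool:
        now = time.monotonic()
        # Drop removals that have aged out of the rolling window.
        while self.removals and now - self.removals[0][0] > self.window_seconds:
            self.removals.popleft()
        recent = sum(a for _, a in self.removals)
        if recent + amount > self.max_fraction * self.total_capacity:
            return False  # refuse: removing this much at once is too fast
        self.removals.append((now, amount))
        return True

limiter = RemovalVelocityLimiter(total_capacity=100)
print(limiter.allow_removal(15))  # True
print(limiter.allow_removal(10))  # False: would exceed 20% within the window
```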
For EC2, AWS is building an additional test suite to augment its existing scale testing, which will exercise the DropletWorkflow Manager (DWFM) recovery workflow to “identify any future regressions.”
The AWS outage had a huge impact, leaving some firms unable to operate for hours because of issues with the apps they depend on. AWS has delivered its post-event analysis quickly, which is to its credit, but the damage to its reputation has already been done.