Vlatko Kotevski
AWS Service Downtime - Reassessing Cloud Architecture
Why architecture decisions must be taken into consideration when assessing the AWS S3 service downtime
To begin with, on the 28th of February 2017 AWS was not down and the Internet was not down. In our opinion, what really happened is that a single cloud service (Simple Storage Service, also known as S3) in a single AWS region had downtime, and please note that I use the word SINGLE (see the official statement from AWS).
Also, it’s very important to emphasize that not all sites were affected, even though they were running in the same region. The reason is that AWS offers the opportunity, with a correctly architected solution, to avoid even this kind of downtime, which happens once in a decade. But it comes at a cost.
Before getting deeper into the costs, let's first explain two fundamental concepts in AWS: Availability Zones (AZs) and Regions. A Region in AWS contains one or more Availability Zones. In keeping with disaster recovery best practices, Availability Zones are not in the same location, even when they belong to the same Region. For many AWS services, a simple setting lets customers run their workloads in more than one Availability Zone in the same Region. This allows a robust and failsafe architecture on AWS at just a small incremental cost.
In the case of S3, the only option for a failsafe architecture is to replicate the storage in more than one Region, which automatically doubles the cost of storage. So even though AWS has a well-described and documented solution, many customers have opted not to replicate their storage, because the reliability of S3 did not give them a justification for the increased cost. This is also why Amazon itself has not made this a recommendation in the Well-Architected principles. After this incident, I expect Amazon to revise its guidance and issue a more robust recommendation, especially if the SLA remains at three 9s (99.9% availability).
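To put the three 9s in perspective, here is a rough back-of-the-envelope sketch in Python. It is an illustration only: it treats the 99.9% SLA figure as if it were the actual availability, and it assumes that two regions fail independently of each other, which is a simplification. The function names are my own and not part of any AWS tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted per period at a given availability."""
    return (1.0 - availability) * period_hours


def combined_availability(availability: float, copies: int) -> float:
    """Availability of `copies` replicas, assuming independent failures
    and that the data is reachable while at least one replica is up."""
    return 1.0 - (1.0 - availability) ** copies


single_region = 0.999  # three 9s, taken from the SLA figure above
two_regions = combined_availability(single_region, 2)

print(f"single region: {allowed_downtime_hours(single_region):.2f} hours/year")
print(f"two regions:   {allowed_downtime_hours(two_regions) * 60:.2f} minutes/year")
```

Under these assumptions, one region at three 9s allows a little under nine hours of downtime per year, while two replicated regions bring that down to well under a minute. That is the benefit customers have to weigh against the doubled storage cost.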
So, for me, the reason some felt the downtime and some didn’t can only be that the affected customers either decided not to go multi-region or did not take it into consideration. In either case it would be wrong to blame the infrastructure provider (in this case AWS, but it would have been the same with Azure, IBM or Google for that matter). Even contractually, AWS worked within its SLA perimeter and kept a constant flow of information.
Alite-International also manages many different businesses in many different industries on our platform, TyphoonX. TyphoonX is a Cloud Native platform that makes extensive use of the advantages that Cloud Infrastructure providers offer. We follow the Well-Architected principles and have also made a few innovations of our own in order to provide a stable and uninterrupted service for our customers. When making architectural decisions, we take into consideration the impact of content and storage for our customers, and we try to give parameters for decisions based on well-evaluated risk.
In our opinion, cloud is here to stay, and just as poor architecture caused a lot of headaches in private data centres and client-server solutions, it will continue to be an issue in Cloud Native solutions. We also believe this is a moment where Amazon will have to act fast and put more effort into not allowing such an easy SPOF (single point of failure).