Dissecting the S3 SLA
A previous article discussed some practical advice when setting the SLA. In this article, let’s dissect an example SLA to see what it actually means. If you want to write one, WorkOS has a good blog post.
Disclaimer: I’m not a legal expert. I’m just an SRE reading through what AWS has committed to publicly. For what it’s worth, I’m an AWS user and although this article takes a critical look at their SLA, I still believe this is a decent contract.
Why S3?
I’m going to pick S3 because:
- It is a relatively simple service (S3 literally stands for Simple Storage Service)
- It is from one of the world’s largest cloud providers
- It is an internal dependency for some of the other popular services at AWS like EC2, Lambda, etc. which may affect their SLA
- It has a relatively high SLA (99.9% availability) and a very good track record in the 16 years it’s been live
- It hasn’t changed much since Jeff Bezos wanted malloc (a key memory allocation function for C programs) for the Internet. (source)
- In its marketing copy, it brags about a whopping 11 9s!!!
Throughout this article we use this quotation style to copy/paste snippets from the SLA (except this paragraph).
Here’s the SLA if you want to have it open while reading. Let’s get to it!
Uptime
There are many performance metrics to even a simple service like S3, but Amazon has decided to guarantee only the uptime, which is admittedly the most important one. Here’s how the SLA treats each of the four golden signals:
- Latency: the S3 SLA makes no commitment on latency, even if a GET or PUT request to S3 takes so long that it times out
- Traffic: the S3 SLA doesn’t commit to how much load can be put simultaneously on one S3 bucket or object in a specific region. However, if that load leads to errors, it’s covered (see below)
- Errors: the percentage of good requests (e.g. to a valid object) that return an error. Note that instead of just relying on a synthetic ping, the S3 SLA is measured on your own requests, which is more realistic
- Saturation: as a managed service with no theoretical limit on storage, this is not critical for S3
The S3 SLA only counts “internal error” and “service unavailable” responses as errors:
“Error Rate” means: (i) the total number of internal server errors returned by the Amazon S3 Service as error status “InternalError” or “ServiceUnavailable” divided by (ii) the total number of requests for the applicable request type during that 5-minute interval.
Although S3 is primarily a storage service and can theoretically hold the data forever, the S3 SLA does not make any commitment to data integrity, validity or any other data metric. This means that, in theory, you may get back different data than what you put in, and that still doesn’t breach the S3 SLA.
Only when you use the service
If you did not make any requests in a given 5-minute interval, that interval is assumed to have a 0% Error Rate.
In other words if you don’t make any request to S3 during the downtime (even if your ability to do so is limited because some higher level system is broken due to a problem with S3), the SLA is not breached.
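To make the definition concrete, here is a minimal Python sketch of the per-interval Error Rate, including the zero-request rule. The Monthly Uptime Percentage formula (100% minus the average of the 5-minute Error Rates) is my reading of the SLA text rather than something quoted above, and the function names and sample counts are made up for illustration.

```python
from typing import List, Tuple


def error_rate(failed: int, total: int) -> float:
    """Error Rate for one 5-minute interval, as a fraction between 0 and 1.

    `failed` counts only InternalError / ServiceUnavailable responses.
    An interval without any requests counts as a 0% Error Rate.
    """
    if total == 0:
        return 0.0
    return failed / total


def monthly_uptime_percentage(intervals: List[Tuple[int, int]]) -> float:
    """Assumed formula: 100% minus the average Error Rate of all intervals."""
    if not intervals:
        return 100.0
    rates = [error_rate(failed, total) for failed, total in intervals]
    return 100.0 * (1 - sum(rates) / len(rates))


# Hypothetical (failed, total) counts for four 5-minute intervals.
sample = [(0, 200), (50, 100), (0, 0), (0, 150)]
print(monthly_uptime_percentage(sample))  # 87.5
```

Note how the third interval, where no request was made, pulls the average Error Rate down exactly as a perfectly healthy interval would.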
No rolling error budget
For the billing cycle in which the Monthly Uptime Percentage fell within the ranges set forth in the table below.
Error budget is the reverse of SLA. For example a service that has 99.9% availability SLA can be down for a maximum of 0.1% without having to pay a penalty. 0.1% of a month is 43m 49s (here’s an online SLA calculator). The idea is that the team behind the service is allowed to break the system (usually as a result of making changes) as long as there is an error budget and stop doing so otherwise.
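The arithmetic behind that number can be sketched in a few lines of Python. The average month length (a 365.2425-day year divided by 12) is my assumption; it happens to reproduce the 43m 49s figure. `error_budget` is a hypothetical helper name.

```python
from datetime import timedelta

# Average month length, assuming a 365.2425-day year (an assumption).
AVERAGE_MONTH = timedelta(days=365.2425 / 12)


def error_budget(sla_percent: float, period: timedelta = AVERAGE_MONTH) -> timedelta:
    """Maximum downtime allowed in `period` without breaching the SLA."""
    budget = period * (1 - sla_percent / 100)
    # Truncate to whole seconds for readability.
    return timedelta(seconds=int(budget.total_seconds()))


print(error_budget(99.9))  # 0:43:49
```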
Let’s stick to the “month” as the evaluation period. There are two ways to calculate the error budget:
- Calendar month: if the service was down for 43 minutes at the end of December 31st, it’ll earn a new error budget on Jan 1 and can immediately be down for another 43 minutes, bringing the total to 86 minutes. However, as far as AWS is concerned, both December and January are in the green.
- Rolling month: calculate the error budget for the last 30 days. This prevents the caveat above but doesn’t exactly map to how the billing periods are set up.
We’ve discussed the pros and cons along with some diagrams in this article.
The S3 SLA is based on the calendar month. Note that if your teams use a rolling month on top of an infrastructure that uses a calendar month, your theoretical maximum downtime within a rolling window is 86 minutes, which corresponds to an availability of roughly 99.8%.
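Here is a small Python sketch of the difference between the two ways of counting, using the December/January example from the list above. The outage data and the helper names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical outages: (day, minutes of downtime).
outages = [(date(2022, 12, 31), 43), (date(2023, 1, 1), 43)]


def downtime_per_calendar_month(events):
    """Total downtime minutes keyed by (year, month)."""
    totals = defaultdict(int)
    for day, minutes in events:
        totals[(day.year, day.month)] += minutes
    return dict(totals)


def worst_rolling_window(events, window_days=30):
    """Largest downtime total inside any window of `window_days` days."""
    worst = 0
    # The worst window always ends on the day of some outage.
    for end, _ in events:
        start = end - timedelta(days=window_days - 1)
        total = sum(m for day, m in events if start <= day <= end)
        worst = max(worst, total)
    return worst


print(downtime_per_calendar_month(outages))  # {(2022, 12): 43, (2023, 1): 43}
print(worst_rolling_window(outages))  # 86
```

Each calendar month stays within its 43-minute budget, while the rolling 30-day view sees all 86 minutes at once.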
10% for <99.9%
The SLA and credit is clearly mentioned in a table:
Putting that data in a chart we get this:
The longer the service is down, the more credit you get, and the credit grows in steep steps rather than in proportion to the downtime. In fact, if the service was down for more than 5% of the length of a month (1.5 days!), you will get all your money back (in the form of credit). But if the service is down for less (for example an entire day!), you’ll get 25%.
It is important to understand this point because although S3 has one of the industry’s highest SLAs, it can still die on you for an extended period, and if your customers cannot afford that level of downtime, you need to have a fallback mechanism.
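The credit tiers described in this section can be sketched as a simple lookup. The 99.9% and 95% boundaries come from the text above; the 99.0% boundary between the 10% and 25% tiers is my recollection of the SLA table and should be treated as an assumption, as should the 30-day month used in the loop.

```python
def service_credit_percent(monthly_uptime: float) -> int:
    """Service credit (% of the monthly S3 charges) for a given uptime %."""
    if monthly_uptime >= 99.9:
        return 0  # SLA met, no credit
    if monthly_uptime >= 99.0:  # assumed boundary, see the note above
        return 10
    if monthly_uptime >= 95.0:
        return 25
    return 100


MINUTES_PER_DAY = 24 * 60
MONTH_MINUTES = 30 * MINUTES_PER_DAY  # assuming a 30-day billing cycle

for downtime in (30, 60, MINUTES_PER_DAY, 2 * MINUTES_PER_DAY):
    uptime = 100 * (1 - downtime / MONTH_MINUTES)
    print(downtime, "minutes down ->", service_credit_percent(uptime), "% credit")
```

With these assumptions, a full day of downtime lands in the 25% tier and two full days land in the 100% tier.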
You don’t get it automatically
To receive a Service Credit, you must submit a claim by opening a case in the AWS Support Center.
One would guess that AWS is on top of their metrics and pays you automatically when the SLA is breached. But given that they lose money, it’s in their interest to see if you notice and want to go through the trouble of creating a support ticket with evidence.
In fact, for the incident that broke the internet back in 2017, they didn’t even mention SLA or credit a single time.
You are responsible for the evidence
When contacting support to get the credit, you need to have evidence supporting that the SLA was breached:
The dates and times of each incident of non-zero Error Rates that you are claiming; and […] your request logs that document claimed incident(s) when the Amazon S3 Service did not meet the Service Commitment
You may have to have extra tooling and/or improve your log messages just for gathering this evidence and of course you have to pay for that.
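As a rough idea of what that tooling might look like, the sketch below groups request log entries into 5-minute intervals and reports the ones with a non-zero Error Rate. The log format (an ISO 8601 timestamp plus an error code or `None`) is entirely hypothetical, and for simplicity it ignores the per-request-type split mentioned in the SLA.

```python
from collections import Counter
from datetime import datetime, timezone

INTERVAL_SECONDS = 5 * 60
SLA_ERRORS = {"InternalError", "ServiceUnavailable"}

# Hypothetical request log: (ISO 8601 timestamp, error code or None).
request_log = [
    ("2023-03-01T10:01:12+00:00", None),
    ("2023-03-01T10:03:40+00:00", "ServiceUnavailable"),
    ("2023-03-01T10:04:59+00:00", "NoSuchKey"),  # not counted as an SLA error
    ("2023-03-01T10:07:05+00:00", None),
]


def intervals_with_errors(log):
    """Return {interval start (UTC): Error Rate} for intervals with SLA errors."""
    totals, failures = Counter(), Counter()
    for timestamp, error_code in log:
        epoch = int(datetime.fromisoformat(timestamp).timestamp())
        bucket = epoch - epoch % INTERVAL_SECONDS  # start of the 5-minute interval
        totals[bucket] += 1
        if error_code in SLA_ERRORS:
            failures[bucket] += 1
    return {
        datetime.fromtimestamp(bucket, tz=timezone.utc).isoformat():
            failures[bucket] / totals[bucket]
        for bucket in sorted(failures)
    }


print(intervals_with_errors(request_log))
```

For the sample log, only the interval starting at 10:00 UTC is reported, with one failing request out of three.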
You have to apply within a deadline
To be eligible, the credit request must be received by us by the end of the second billing cycle after which the incident occurred
And it’s not particularly easy. You have to open a support case with the words “SLA Credit Request” in the subject line, and include detailed evidence like your request logs.
Your failure to provide the request and other information as required above will disqualify you from receiving a Service Credit.
Needless to say, filing a request doesn’t automatically guarantee a service credit. The claim needs to be confirmed by AWS (sounds like manual work) and the credit has to be above a certain threshold.
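To get a feel for the deadline, here is a small sketch that computes the last day to file a claim. It assumes that billing cycles are calendar months and that “the second billing cycle after” means two months after the month of the incident; both are my interpretation, so check your own billing setup. `claim_deadline` is a hypothetical helper.

```python
import calendar
from datetime import date


def claim_deadline(incident: date) -> date:
    """Last day of the second calendar month after the incident's month."""
    # Count months from year 0 so that year rollovers are handled by divmod.
    months = incident.year * 12 + (incident.month - 1) + 2
    year, month_index = divmod(months, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


print(claim_deadline(date(2017, 2, 28)))  # 2017-04-30
```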
Credit is not refund
Instead of getting a refund, you get credit which basically acts as a discount coupon towards your next billing cycle. You do however pay actual money to AWS. By giving you credit instead of actual money, AWS ensures that you stay a customer even if the service level didn’t match the expectation.
your sole and exclusive remedy for any unavailability, non-performance, or other failure by us to provide the Amazon S3 Services is the receipt of a Service Credit (if eligible) in accordance with the terms of this SLA.
What’s more:
Service Credits may not be transferred or applied to any other account.
So you cannot simply abandon an AWS account after the hiccup. You gotta refactor whatever you got in order to use the credit or forget about it.
There’s a lower bound
Credit will be applicable and issued only if the credit amount for the applicable monthly billing cycle is greater than one dollar ($1 USD)
$1 may not be much but S3 is cheap! If you make 1 GET request every second towards an S3 standard storage of 10 GB with 100 GB of data transfer per month, you barely make it above that $1 threshold! (here’s the S3 price calculator)
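Put differently, the $1 floor translates into a minimum monthly S3 bill for each credit tier. A quick sketch of that arithmetic, using the 10%, 25% and 100% tiers mentioned earlier (`minimum_bill_for_credit` is a made-up helper name):

```python
MINIMUM_CREDIT_USD = 1.00


def minimum_bill_for_credit(credit_percent: float) -> float:
    """Monthly S3 bill (USD) at which the credit equals the $1 floor."""
    return MINIMUM_CREDIT_USD / (credit_percent / 100)


for percent in (10, 25, 100):
    bill = minimum_bill_for_credit(percent)
    print(f"{percent}% credit tier: bill must exceed ${bill:.2f}")
```

So for the most common breach (the 10% tier), anything under a $10 monthly S3 bill yields no credit at all.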
In case of an outage, you will get a few dollars but you may have to pay thousands towards your own users depending on your SLA. Needless to say the S3 SLA is not an insurance policy covering your risks towards your customers.
Other services have a lower SLA (99%)
Not everything that goes under the S3 brand enjoys the same SLA:
For requests to S3 Intelligent-Tiering, S3 Standard-Infrequent Access, S3 One Zone-Infrequent Access, and S3 Glacier Instant Retrieval
The SLA is 99%. What’s the difference?
- 99.9% allows for 43m 49s downtime per month
- 99% allows for 7h 18m 17s downtime per month
No disaster recovery
They don’t cover any issues:
caused by factors outside of our reasonable control, including any force majeure event or Internet access or related problems beyond the demarcation point of the Amazon S3 Service
So if your business cannot afford to suffer from such incidents, you need to have a disaster recovery strategy in place.
Conclusion
AWS is one of the oldest and most popular cloud providers on the internet and S3 is one of their first services. Its SLA is a good example of what to look for when evaluating the SLA of a vendor. There are 3 main concerns:
- What they promise
- How they measure it
- When/how credits are paid
S3 is cheap but you may be building something expensive on top of it. In a future post we’ll dig deeper into the actual reliability engineering patterns and introduce some resources for building more reliable services on top of less reliable services. Yes, that’s possible!