Some practical advice when setting SLA

Alex Ewerlöf (moved to substack)
15 min read · Jan 24, 2022

If you have a product, you have a commitment towards your end users. An SLA (service level agreement) is a great tool to align product teams behind what your customers value most.

This article goes through some of the key practical points when setting an SLA: What to measure? How to measure? Where to measure it? And most importantly what target thresholds to set?

SLA is a core concept of SRE (Site Reliability Engineering). Although the “S” in SRE stands for “site”, nothing limits SLAs to websites or APIs, because it is all about connecting engineering effort to user experience.

The “S” in SRE should stand for service

Let’s start simple. You have a service that exposes an endpoint over the internet to your end users.

User connectivity

There is a complex network of routers, internet service providers (ISPs) and transmission media (satellites, fibre-optic cables, wireless, etc.) between your public endpoint and the end user, and they fail all the time. This is important because if the average user has no connectivity to the internet for 30 minutes per month, even if your service had an impossible SLA of 100%, the end user would still experience it as 99.93%.
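As a back-of-the-envelope sketch, the availability the user perceives is the product of the independent availabilities along the path (a simplifying assumption: the failures are independent):

```python
# Perceived availability is the product of each independent availability
# in the path. Uses the article's figure of 43,800 minutes per month (730 h).
MINUTES_PER_MONTH = 43_800

service_sla = 1.00                               # a hypothetical "perfect" service
user_connectivity = 1 - 30 / MINUTES_PER_MONTH   # 30 min of ISP downtime per month

perceived = service_sla * user_connectivity
print(f"perceived availability: {perceived:.2%}")  # → 99.93%
```

Adding more hops (CDN, DNS, the user's wifi) multiplies in more factors below 1, which is why the perceived number can only go down.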

In certain conditions, there are ways around using the public internet. For example if your customer is a business application running on the same cloud provider, the traffic doesn’t have to go through the public internet.

If the customer can afford it, Azure ExpressRoute or AWS Direct Connect offers a dedicated physical connection to the customer that’s more stable than the public internet.

But for the end user accessing the service via wifi or a 5G connection, the electromagnetic waves are even less reliable than the wires connecting the backbone of the internet.

Legal and Finance aspect

The main difference between an SLO and an SLA is that the SLA has a contractual dimension: it is the legal commitment that your business makes towards the customer.

There are often financial penalties in place. For example, here’s how different database products commit to “credit” the customer in case of breaching the availability SLA:

Disclaimer: this is marketing material for Azure SQL Database

There’s nothing wrong with asking the internal teams to aim for 99.999% availability, but that’s an objective, not an agreement. The difference being:

  • SLO is an objective used internally between teams
  • SLA is a legal commitment used externally between the business and the customers, often entailing a penalty as a guarantee

When compensating for an SLA breach, companies usually commit to credit instead of cashback. That’s a way to keep the customer attached.

Cost

Even the most stable services like S3 officially commit to a penalty if their availability drops below three nines. An SLA of 100% is just a pipe dream. A demand for a high SLA from top management is usually a knee-jerk reaction to unhappy users and a much lower current SLA, for example 98% with a poor MTBF. There is good news and bad news. The good news is that the business is taking the user feedback seriously. The bad news is that improving reliability is a step-by-step process that takes time, energy and money. One doesn’t simply jump from 98% to 99%, or even from 99% to 99.9%, overnight.

As a rule of thumb, for every 9 that is added to the SLA, the operational and organizational cost increases 10x.

That’s because:

  • Infrastructure cost: A higher SLA generally demands more redundant or powerful infrastructure to achieve higher reliability.
  • Complexity: The architecture will be more complex, which increases the maintenance cost.
  • Productivity: Change is one of the primary sources of issues. The developer teams will have to spend more time on quality assurance and ship relatively less often, which ultimately means fewer features and bug fixes for the end users (arguably a larger team is not a shortcut).
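The 10x rule of thumb is easier to feel when you tabulate the monthly downtime budget each extra nine leaves. A quick sketch (assuming a 30-day month):

```python
# Monthly downtime budget for each availability target ("nines"),
# assuming a 30-day month of 43,200 minutes.
MINUTES_PER_MONTH = 30 * 24 * 60

for nines in range(2, 6):
    sla = 1 - 10 ** -nines                     # 99%, 99.9%, 99.99%, 99.999%
    budget = MINUTES_PER_MONTH * 10 ** -nines  # allowed downtime in minutes
    print(f"{sla:.3%} -> {budget:>7.2f} min of downtime per month")
```

Each extra nine divides the budget by ten: three nines leave about 43 minutes a month, five nines leave under half a minute, which is why the cost of meeting it grows so steeply.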

For example, a Highly Available (HA) system often has a sophisticated graph of redundant failovers and high-performance fallbacks to eliminate single points of failure (SPOF). HA implies redundancy and imposes more complexity, which conflicts with other objectives like maintainability and feature release cadence. It is not free, and not every business has the capacity and/or justification for HA.

Take it up a notch and you’re in the realm of fault-tolerant (FT) systems, which can tolerate a variety of hardware and software failures and recover automatically. FT is typically even more expensive than HA and requires a software/hardware architecture designed with that requirement from scratch.

It may very well be the case that, all things considered, you would rather pay the occasional penalty than overspend or slow down.

There’s a law of diminishing returns for SLAs: an overly high SLA doesn’t necessarily guarantee a matching ROI.

When you think of Netflix, you probably don’t think of a service with poor uptime. Yet Dave Hahn, their former SRE manager, says that they don’t aim for “uptime at all costs”. If you have the time, watch the entire video (it’s like standup comedy for tech):

Offloading the penalty to 3rd parties

When committing to an SLA towards your customers, it is important to know the SLA and compensation model of the services that your business relies upon. For example, if you rely on S3, AWS commits to 99.9% availability and will credit up to 100% of the cost in case the availability drops below 95%. This should compensate for part of the money you have to pay to your customers and reduce the business risk, but not eliminate it.

If your service does not have a fallback strategy, any 3rd party dependency may affect your SLA, and although they may compensate for the costs, it may still be way below what you are committing towards your users. Bear that in mind when setting a payback model for your SLA breach.

Business match

Since the SLA has legal and financial penalty aspects, it’s good to get legal and finance in the room. Their conservatism can help ground overly ambitious managers in business facts.

If the product targets healthcare, air traffic control, banking or the military, an SLA of 5 nines or more may make sense. If it’s a multiplayer video game, social network or streaming platform, there’s no reason to burn the budget on an unrealistic SLA that doesn’t match the business.

The two key questions to ask are:

  1. What SLI is key to your customers’ experience? Availability is a common one, but it can be any other golden signal or even something else.
  2. What is the lowest SLA commitment that you can get away with? Consider the cost, team capacity and your business model.

Another way to look at it is to figure out what service outage is acceptable without seriously stretching your business risk appetite.

For example, if you’re an online news site:

  • If the site can’t show the news or is too slow, the readers may get their news from another site. 30 minutes downtime per month may be acceptable but 5 hours downtime may turn into news itself!
  • If the ads don’t render, you may lose some ad revenue. 1 day may be acceptable but not an entire week. You may not have an obligation to pay the advertisers, but you’re losing ad revenue.
  • If the paywall functionality is broken, the users may be able to read some paywalled articles for free and it hurts conversion. A week may be acceptable but not an entire month.
  • If the login functionality is broken, the paid subscribers may not be able to read paywalled articles. In that case, you may have to credit them.

Know your risks. Separate business-impacting penalties from user-impacting ones. As a general rule of thumb, if it doesn’t make sense to compensate the customers, it is not an SLA, it is an SLO.

Pragmatism

A pragmatic way to set the SLA is to look at the historical operational data for a service and deduce the error budget from there. For example, most monthly subscription services have an evaluation window of one month for their SLA. The idea is: how much poor UX will the users tolerate before they decide to take their money to your rivals?

In other words:

What is the maximum number of minutes that the users will comfortably tolerate bad service?

If the service falls below that level, they’ll scream. If it stays above it, there won’t be much feedback in that regard.

Let’s say we had the following service availability during last 30 days:

In this example, the endpoint was down for a total of 62 + 17 + 184 + 44 = 307 minutes. Given that a month has 43800 minutes, the availability SLA is 100 * (43800 - 307) / 43800 = 99.29%.
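The arithmetic above as a quick script:

```python
# Availability over the monthly window from the downtime figures above.
downtime_minutes = [62, 17, 184, 44]  # the incidents from the example
window = 43_800                       # minutes per month, as used in the text

total_down = sum(downtime_minutes)    # 307
availability = 100 * (window - total_down) / window
print(f"down {total_down} min -> {availability:.3f}% availability")
```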

Whether an SLA of 99.29% is reasonable depends on the type of product, the competition and what penalty strategy is in place.

For example, if this is an online video streaming service, the users may not be happy to pay a full month’s subscription fee if they fail to play their favourite show for 5 hours per month (307m ≈ 5h). Not during the Corona lockdown, at least!

Calendar month vs lookback window

The error budget for many SLAs is calculated per calendar month. For example, the S3 SLA mentions “during any monthly billing cycle”. This sounds OK, until you realise that the actual user experience may be different.

For example, suppose that we have a service with an availability SLA of 99.9%. This allows for about 43 minutes of downtime per month. But what if the downtime happens at the end of one month and continues into the next?

In this scenario, the users experience 86 minutes of downtime but as far as the business is concerned, the SLA is not breached.

This means that the team has 3 minutes left in their error budget during March and can deploy new code which can potentially break the service even more from the user’s perspective.

Another way to set the evaluation window is to use a 30-day lookback window. At any point in time, the error budget is calculated based on the incidents in the last 30 days, and if within budget, the team is allowed to do a code deploy.
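A minimal sketch of such a lookback check, using two 43-minute incidents straddling a month boundary (the dates and the incident log format are made up for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (start time, duration in minutes).
incidents = [
    (datetime(2022, 2, 26, 23, 40), 43),  # end of February
    (datetime(2022, 3, 1, 0, 5), 43),     # start of March
]

def downtime_in_window(incidents, now, window=timedelta(days=30)):
    """Total downtime minutes inside the lookback window ending at `now`."""
    start = now - window
    total = 0.0
    for began, minutes in incidents:
        ended = began + timedelta(minutes=minutes)
        overlap = min(ended, now) - max(began, start)
        total += max(overlap.total_seconds() / 60, 0)  # clip incidents outside the window
    return total

BUDGET = 43_200 * 0.001  # 99.9% over a 30-day window ≈ 43.2 minutes

now = datetime(2022, 3, 2)
spent = downtime_in_window(incidents, now)
print(f"spent {spent:.0f} of {BUDGET:.1f} min -> deploy allowed: {spent < BUDGET}")
```

A calendar-month view would see each month as within budget, while the lookback window correctly counts 86 minutes and blocks the deploy.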

MTBF

In reality, the downtime may happen during the night when fewer users experience it. Or it may be scattered as shorter periods across the entire month like this:

Despite their impact on the end user experience, the SLA usually doesn’t consider those factors. Focusing on increasing Mean time between failures (MTBF) can alleviate this problem.

MTBF is an important metric for systems with long continuous sessions. For example, when watching a movie, the “Loading…” pauses can seriously hurt the user experience.

On the other hand, for a transactional service like a REST API, MTBF might be less important because there may be other resilience patterns in play on the client side, like retry. In that case, reducing the MTTR (mean time to resolve an issue) has a higher priority for the user experience.
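Both metrics fall straight out of an incident log. A sketch using the repair times from the earlier example and hypothetical uptime gaps between failures:

```python
# MTBF and MTTR from an incident log (all figures are illustrative).
# Each tuple: (minutes of uptime since the previous recovery, minutes to resolve).
incidents = [(9000, 62), (12000, 17), (8000, 184), (13000, 44)]

uptimes = [up for up, _ in incidents]
repairs = [fix for _, fix in incidents]

mtbf = sum(uptimes) / len(uptimes)  # mean time between failures
mttr = sum(repairs) / len(repairs)  # mean time to resolve
print(f"MTBF: {mtbf:.0f} min, MTTR: {mttr:.2f} min")
```

Note that the two data sets can yield the same availability: many short outages (low MTBF, low MTTR) and one long outage (high MTBF, high MTTR) burn the same budget but feel very different to users.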

Runtime

There are other factors that affect the SLA. Let’s say the service has one single instance. In December 2021, AWS experienced several incidents with large blast radii. Even if there was no change to your service or its configuration, the endpoint could still be unavailable to the end user, affecting your SLA:

There are smaller failures that don’t make it to the news but still affect your SLA and are out of your control. Natural disasters are one kind that requires a disaster recovery strategy. There are a few ways to build a more reliable system on top of a less reliable runtime:

  • Since AWS regions are isolated from each other, a failure is usually contained to that scope. One way to improve reliability is to have redundancy across multiple regions.
  • An even more reliable (and costly) solution is a multi-provider strategy where the service is replicated across multiple cloud providers. The downside is having to avoid any vendor-specific features in order to create a portable solution. This avoidance may limit the solution to “unmanaged” services, which may lead to a lower SLA.

Note that the 3rd party status pages may not tell the whole story. That’s why startups like Metrist.io exist! Having accurate data is the key to calculate an SLA you can commit to.

One may naïvely think that an on-prem runtime can guarantee a better SLA! AWS is the largest cloud provider and their whole business model depends on having a reliable service. They know their shit but still screw up. There’s no guarantee that an on-prem or bespoke solution can beat AWS unless there is a massive budget and an army of experts.

Measurement tool

Another aspect is how the SLA is measured. Companies like Pingdom or Datadog build their entire business model on robust monitoring, yet they fail from time to time too. Some companies bootstrap their own monitoring tools in an effort to save costs. That can be a premature cost optimization for potential scale, or completely valid. But more often than not, the tooling is subpar compared to what’s commercially available, and it hurts observability, which is the very means of identifying and troubleshooting reliability issues. Besides, when the SLA tool fails at the same time as the very system it is supposed to monitor, you are practically running blind:

In this scenario, the observed SLA is 99.48% instead of the actual 99.29%. The difference may not be much, but it may be the deciding factor in whether to deploy new changes or wait until there’s enough error budget.

Point of measurement

Another relevant aspect is where to measure the SLI. For example, the availability SLI can be measured at any of these locations:

  1. From inside the cluster: For example, counting the number of successful Kubernetes health checks divided by the total health checks. This does not give a good picture of the uptime from the user’s perspective. Besides, a failed health check is a signal to the Kubernetes scheduler, and there’s no guarantee that it affects the end user at all.
  2. Outside the cluster but in the same cloud provider: If the cloud provider encounters issues, both the service and the monitoring tool may be affected and we’ll be running blind.
  3. Using a 3rd party provider through the public internet: If the 3rd party is not running solely on the same cloud provider, this can give a good signal about the user experience. On the other hand, these 3rd parties usually run their tests from a limited number of locations. Moreover, the link from these 3rd parties to the internet is usually more stable than, say, a wifi user’s.
  4. From the user’s client application: This gives the most accurate signal because it measures when the system is down from the end user’s perspective. On the other hand, it may pollute your data with whatever irrelevant connectivity issue the end user might be having that is out of your control.

Depending on the product and risk appetite, you may pick one or more of these measurement locations.

Status page

Sometimes, a decent error message can compensate for poor performance. It really does! If the customers’ systems break because of your SLA breach, the least you can do is help them identify the cause and save them some head-scratching.

Having a status page that actively monitors the endpoints can improve the customer experience. There’s no point in hiding it.

Build trust with every incident.

Here’s how AWS does it and here’s a list of 3rd parties that can create a status page for your endpoints.

Sampling

It may be expensive to gather, store and analyze the full data set to calculate the SLA. It may even hurt other critical metrics like performance or latency.

For example, a high-traffic real-time IoT sensor monitoring system may receive millions of data points per second. Assuming that the error rate SLI is something worthy of an SLA, it might be tempting to monitor every single data point. In practice, that may slow down the system depending on the architecture. Also, the sheer amount of data may not add to the required resolution for the SLA.

If the system is aiming for an SLA of 99% and you receive 1–5 million submissions per second, it is enough to sample just 0.1% of the data to compute the SLA with enough resolution. This still gives us 1–5K data points per second, which allows measuring the SLI with 0.1% resolution.

When sampling instead of analyzing the whole data set, it is extremely important to ensure that the sample statistically represents the whole, i.e. you are not receiving data only from one node that just happens to work fine while the rest are in trouble.
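A small simulation of the idea (all the numbers, node names and error rates here are made up): one node out of four is unhealthy, and a uniform 0.1% sample taken across the whole stream still recovers the overall error rate, because the unhealthy node is represented proportionally.

```python
import random

random.seed(7)  # deterministic for the example

# Simulated stream of 1M submissions. Only "node-4" (1 of 4 nodes) is failing,
# which yields a ~1% overall error rate.
events = []
for _ in range(1_000_000):
    node = random.choice(["node-1", "node-2", "node-3", "node-4"])
    is_error = node == "node-4" and random.random() < 0.04
    events.append(is_error)

# Uniform 0.1% sample over the WHOLE stream (not per node).
sample = random.sample(events, k=1_000)

actual = sum(events) / len(events)
estimated = sum(sample) / len(sample)
print(f"actual error rate {actual:.3%}, estimate from 0.1% sample {estimated:.3%}")
```

Sampling only from the healthy nodes would report a 0% error rate, which is exactly the blind spot the text warns about.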

Synthetic vs actual flows

Many tools like UptimeRobot or Pingdom allow you to create synthetic requests towards your services to measure your SLA. Metrist.io is a more advanced version, probing more advanced aspects like all possible actions against S3. Then there are tools like Datadog Synthetics and New Relic Synthetics, which allow you to script a synthetic user journey across multiple pages/endpoints.

This is all good, but at the end of the day you are measuring a synthetic journey. More realistic (and admittedly harder to gather) data comes from measuring the actual user experience.

Synthetic measurement is simpler, cheaper and more predictable but it has a few disadvantages:

  • It may not represent how actual users experience your service in its entirety, giving a false sense of confidence
  • Even if it models the user experience with good accuracy when set up, it may drift from the actual flow as the product evolves
  • It creates fake load against the system, and extra measures need to be in place to filter that data out of the actual business metrics (e.g. number of active users per day). It may even add to the costs as the synthetic load causes unnecessary scale-out or prevents a service from being retired.
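One common way to make that filtering possible is to tag synthetic requests so downstream metrics can exclude them. A minimal sketch (the header name and the request records are illustrative assumptions, not a standard):

```python
# Tag synthetic probes with a marker header so business metrics
# (e.g. daily active users) can exclude them.
requests = [
    {"user": "alice", "headers": {}},
    {"user": "probe", "headers": {"X-Synthetic-Test": "uptime-check"}},
    {"user": "bob",   "headers": {}},
]

def is_synthetic(req):
    return "X-Synthetic-Test" in req["headers"]

real = [r for r in requests if not is_synthetic(r)]
print("active users:", len({r["user"] for r in real}))  # → 2
```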

Not just availability

This article has been mainly focused on the Availability SLI (AKA uptime) because, typically, when people talk about SLA, that’s what they think about.

However, the service may be up but slow or erroneous, which effectively hurts the customer experience.

An SLA may be set for any SLI that has a direct impact on the customer experience or business risk factors. Unlike SLOs, which can be defined for every SLI, SLAs may be defined only for some SLIs. Remember, there’s a legal and financial penalty involved.

As a rule of thumb the SLIs that directly bind to your business risks are good candidates for SLA.

Example:

  • GitHub: if pushing a commit takes too long to start, it may break CI/CD pipelines and damage GitHub’s reputation to the extent that people may consider migrating to rivals like Bitbucket or GitLab. Therefore they may define an SLA for commit push latency.
  • CNN: if too many ads fail to show up properly or be seen on the page, there might be some client-side issue that hurts the ad revenue. Therefore they may define an SLA for the error rate of ads.
  • 3-letter agency: if too many people switch to Linux because the data collection agent built into other operating systems consumes too much CPU or network bandwidth, it reduces surveillance data points and increases national security risk! Therefore they may define an SLA for CPU/network saturation metrics.

Conclusion

The real world is messy and there are many variables that may affect system reliability. A good SLA should consider the relevant metrics that align with the customer experience, business model and risks.

If you hear someone say “I want 5 nines” with resources for 2 nines, don’t laugh. Send them this article. But feel free to leave the room if they ask for 100%!

In a perfect world the SLA should be set together with educated marketing and legal teams who both understand the user tolerance and are willing to chip in when the SLA is breached.
