Some practical advice when setting SLA

User connectivity

There is a complex network of routers, internet service providers (ISPs) and media (satellites, fibre-optic cables, wireless, etc.) between your public endpoint and the end user, and these components fail all the time. This matters because if the average user has no internet connectivity for 30 minutes per month, then even if your service had an impossible SLA of 100%, the end user would still experience it as 99.93%.
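To make that arithmetic concrete, here is a minimal Python sketch. The 30-minute figure is the one from the paragraph above; a 30-day month is assumed:

```python
# Effective availability as seen by the end user: even a perfect service
# is capped by the user's own connectivity.

MINUTES_PER_MONTH = 30 * 24 * 60  # ~43,200 minutes in a 30-day month

def effective_availability(service_sla: float, user_downtime_min: float) -> float:
    """Combine the service SLA with the user's own connectivity loss."""
    user_availability = 1 - user_downtime_min / MINUTES_PER_MONTH
    return service_sla * user_availability

print(round(effective_availability(1.00, 30) * 100, 2))   # 99.93 — even at a "100%" SLA
print(round(effective_availability(0.999, 30) * 100, 2))  # 99.83 — three nines in practice
```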

Legal and Finance aspect

The main difference between the SLO and the SLA is that the SLA has a contractual dimension: it is the legal commitment that your business makes towards the customer.

Disclaimer: this is marketing material for Azure SQL Database
  • SLA is a legal commitment used externally between the business and the customers, often entailing a penalty as a guarantee

Cost

Even the most stable services like S3 officially commit to a penalty if their SLA drops below 3 nines. An SLA of 100% is just a pipe dream. A high SLA demand from top management is usually a knee-jerk reaction to unhappy users suffering from a much lower current SLA, for example 98% with a poor MTBF. I have good news and bad news. The good news is that the business is taking the user feedback seriously. The bad news is that improving reliability is a step-by-step process that takes time, energy and money. One doesn’t simply jump from 98% to 99%, or even from 99% to 99.9%, overnight.

As a rule of thumb, for every 9 that is added to the SLA, the operational and organizational cost increases 10x.
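The shrinking error budget behind that rule of thumb can be computed directly (assuming a 30-day month):

```python
# Allowed downtime per month for each extra "nine" of availability.
# Each added nine cuts the error budget by 10x, which is why the
# operational cost climbs so steeply.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

for nines in range(1, 6):
    sla = 1 - 10 ** -nines                       # 90%, 99%, 99.9%, ...
    budget_min = (1 - sla) * MINUTES_PER_MONTH   # 4,320 min down to ~26 seconds
    print(f"{sla:.5%} -> {budget_min:,.2f} min of downtime per month")
```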

That’s because:

  • Complexity: The architecture will be more complex, which increases the maintenance cost.
  • Productivity: Change is one of the primary sources of issues. The developer teams will have to spend more time on quality assurance and ship relatively less often, which ultimately means fewer features and bug fixes for the end users (and arguably, a larger team is not a shortcut).

There’s a law of diminishing returns for SLA: an excessively high SLA doesn’t necessarily guarantee a matching ROI.

When you think of Netflix, you probably don’t think of a service with poor uptime. Yet Dave Hahn, their former SRE Manager, says that they don’t aim to have “uptime at all costs”. If you have the time, watch the entire video (it’s like standup comedy for tech):

Offloading the penalty to 3rd parties

When committing to an SLA towards your customers, it is important to know the SLA and compensation model of the services that your business relies upon. For example, if you rely on S3, AWS commits to 99.9% availability and will credit up to 100% of the cost if the availability drops below 95%. This compensates for part of the money you have to pay your customers and reduces the business risk, but does not eliminate it.
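A tiered credit schedule like S3’s can be modeled as a simple lookup. In this sketch, the 100%-credit-below-95% tier comes from the paragraph above; the intermediate tiers are illustrative, so always check the provider’s current SLA page:

```python
# Sketch of a tiered service-credit schedule in the style of the S3 SLA.
# Intermediate tiers (10% and 25%) are illustrative assumptions.

def service_credit(availability: float) -> float:
    """Return the fraction of the monthly bill credited back."""
    if availability < 0.95:
        return 1.00   # full credit, per the tier described in the text
    if availability < 0.99:
        return 0.25   # illustrative intermediate tier
    if availability < 0.999:
        return 0.10   # illustrative intermediate tier
    return 0.0        # SLA met, no credit

print(service_credit(0.94))  # 1.0 — the credit offsets part of what you
                             # owe your own customers, not all of it
```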

Business match

Since the SLA has legal and financial penalty aspects, it’s good to get legal and finance in the room. Their conservatism can help ground overly ambitious managers in business facts.

  1. What is the lowest SLA commitment that you can get away with? Consider the cost, team capacity and your business model.
  • If the ads don’t render, you may lose some ad revenue. 1 day may be acceptable but not an entire week. You may not have an obligation to pay the advertisers, but you’re losing ad revenue.
  • If the paywall functionality is broken, the users may be able to read some paywalled articles for free and it hurts conversion. A week may be acceptable but not an entire month.
  • If the login functionality is broken, the paid subscribers may not be able to read paywalled articles. In that case, you may have to credit them.

Pragmatism

A pragmatic way to set the SLA is to look at the historical operational data for a service and deduce the error budget from there. For example, most monthly subscription services have an evaluation window of one month for their SLA. The idea is: how much poor UX will the users tolerate before they decide to take their money to your rivals?

What is the maximum number of minutes that the users will comfortably tolerate bad service?

If you’re below that level, they’ll scream. If you’re above it, there won’t be much feedback in that regard.
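A minimal sketch of this pragmatic approach, using made-up monthly availability numbers: commit to roughly the level you already achieve in a bad (but tolerated) month, e.g. the 2nd-worst month of the past year:

```python
# Derive a candidate SLA and error budget from historical data.
# `monthly_availability` is invented sample data, not real measurements.

monthly_availability = [0.9991, 0.9987, 0.9995, 0.9978, 0.9993,
                        0.9989, 0.9996, 0.9985, 0.9992, 0.9990,
                        0.9982, 0.9994]

# Commit slightly below what you historically achieve, leaving headroom:
candidate_sla = sorted(monthly_availability)[1]   # 2nd-worst month
error_budget_min = (1 - candidate_sla) * 30 * 24 * 60

print(f"Candidate SLA: {candidate_sla:.2%}, "
      f"error budget: {error_budget_min:.0f} min/month")
```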

Calendar month vs lookback window

The error budget for many SLAs is calculated per calendar month. For example, the S3 SLA mentions “during any monthly billing cycle”. This sounds OK, until you realise that the actual user experience may be different.
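A small sketch of why that matters: an outage that straddles a month boundary can pass two calendar-month SLAs while blowing the budget of the 30-day window the user actually experienced (the downtime numbers are invented):

```python
# Calendar-month windows can mask user pain: a 60-minute outage split
# across a month boundary counts 30 min in each month, so both months
# pass a 99.9% SLA even though users saw one long outage.

BUDGET_MIN = 43.2  # 99.9% of a 43,200-minute (30-day) month

# Daily downtime minutes: 30 min on the last day of month 1,
# 30 min on the first day of month 2.
month1 = [0.0] * 29 + [30.0]
month2 = [30.0] + [0.0] * 29

print(sum(month1) <= BUDGET_MIN)  # True — month 1 passes
print(sum(month2) <= BUDGET_MIN)  # True — month 2 passes

# A 30-day rolling window centred on the outage sees all 60 minutes:
rolling = month1[15:] + month2[:15]
print(sum(rolling) <= BUDGET_MIN)  # False — the user-felt window breaches
```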

MTBF

In reality, the downtime may happen during the night when fewer users experience it. Or it may be scattered as shorter periods across the entire month like this:

Runtime

There are other factors that affect the SLA. Let’s say the service has a single instance. In December 2021, AWS experienced several incidents with large blast radii. Even if there was no change to your service or its configuration, the endpoint could still be unavailable to the end user, affecting your SLA:

  • An even more reliable (and costly) solution is a multi-provider strategy where the service is replicated across multiple cloud providers. The downside is that you have to avoid any vendor-specific feature in order to keep the solution portable. This avoidance may limit the solution to “unmanaged” services, which may lead to a lower SLA.

Measurement tool

Another aspect is how the SLA is measured. Companies like Pingdom or DataDog build their entire business model on robust monitoring, yet they fail from time to time too. Some companies bootstrap their own monitoring tools in an effort to save costs. This can be a premature cost-optimization for potential scale, or completely valid. But more often than not, the tooling is subpar compared to what’s commercially available, and it hurts the observability that is the very means to identify and troubleshoot reliability issues. Besides, when the SLA tool fails at the same time as the very system it is supposed to monitor, you are practically running blind:

Point of measurement

Another relevant aspect is where to measure the SLI. For example, the availability SLI can be measured at any of these locations:

  1. Outside the cluster but in the same cloud provider: If the cloud provider encounters issues, both the service and the monitoring tool may be affected and we’ll be running blind.
  2. Using a 3rd-party provider through the public internet: If the 3rd party is not running solely on the same cloud provider, this can give a good signal about the user experience. On the other hand, these 3rd parties usually have a limited set of locations that they run their tests from. Moreover, the link from these 3rd parties to the internet is usually more stable than, say, a Wi-Fi user’s.
  3. From the user client application: This gives the most accurate signal because it measures when the system is down from the end user’s perspective. On the other hand, it may pollute your data with whatever irrelevant connectivity issue the end user might be having that is out of your control.

Status page

Sometimes, a decent error message can make up for poor performance. It really does! If the customers’ systems break because of your SLA breach, the least you can do is help them identify the cause and save them some head-scratching.

Build trust with every incident.

Here’s how AWS does it and here’s a list of 3rd parties that can create a status page for your endpoints.

Sampling

It may be expensive to gather, store, and analyze the full data set to calculate the SLA. It may even hurt other critical metrics like performance or latency.

Synthetic vs actual flows

Many tools like UptimeRobot or Pingdom allow you to create synthetic requests toward your services to measure your SLA. Metrist.io is a more advanced version, allowing you to probe more advanced aspects like all possible actions against S3. Then there are tools like DataDog Synthetics and New Relic Synthetics which allow you to script a synthetic user journey across multiple pages/endpoints.

  • Even if it models the user experience with good accuracy when it is set up, it may drift from the actual flow as the product evolves.
  • It creates fake load against the system, and extra measures need to be taken to filter out that data from the actual business metrics (e.g. number of active users per day). It may even add to the costs, as the synthetic load can cause unnecessary scale-out or prevent a service from being retired.
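One common mitigation for the second point is to tag synthetic traffic so it can be filtered out of business metrics. A minimal sketch, where the header name is our own convention rather than any standard:

```python
# Keep synthetic probes out of business metrics: tag probe requests with
# a header and exclude them when aggregating. "X-Synthetic" is an
# assumed in-house convention, not a standard header.

SYNTHETIC_HEADER = "X-Synthetic"

requests_log = [
    {"path": "/login", "headers": {}},
    {"path": "/login", "headers": {SYNTHETIC_HEADER: "uptime-probe"}},
    {"path": "/article", "headers": {}},
]

real_traffic = [r for r in requests_log
                if SYNTHETIC_HEADER not in r["headers"]]

print(len(real_traffic))  # 2 — the probe is excluded from "active users"
```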

Not just availability

This article has been mainly focused on the Availability SLI (AKA uptime) because, typically, when people talk about SLA, that’s what they think about.

  • CNN: if too many ads fail to render properly or be seen on the page, there might be some client-side issue that hurts the ad revenue. Therefore they may define an SLA for the error rate of ads.
  • 3-letter agency: if too many people switch to Linux because the data collection agent built into other operating systems consumes too much CPU or network bandwidth, it reduces surveillance data points and increases the national security risk! Therefore they may define an SLA for CPU/network saturation metrics.

Conclusion

The real world is messy and there are many variables that may affect system reliability. A good SLA should consider the relevant metrics that align with the customer experience, business model and risks.
