Some practical advice when setting SLA

User connectivity

Legal and Finance aspect

Disclaimer: this is marketing material for Azure SQL Database
  • SLO is an objective used internally between teams
  • SLA is a legal commitment used externally between the business and its customers, often entailing a penalty as a guarantee

Cost

As a rule of thumb, for every 9 that is added to the SLA, the operational and organizational cost increases 10x:

  • Infrastructure cost: Higher SLA generally demands more redundant or powerful infrastructure to achieve higher reliability.
  • Complexity: The architecture will be more complex, which increases the maintenance cost.
  • Productivity: Change is one of the primary sources of issues. The developer teams will have to spend more time on quality assurance and ship less often, which ultimately means fewer features and bug fixes for the end users (arguably, a larger team is not a shortcut).

There’s a law of diminishing returns for SLA: too high an SLA doesn’t necessarily guarantee a matching ROI.
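To make the trade-off concrete, here is a minimal sketch that converts an availability target into the downtime it allows per month, next to the relative cost implied by the 10x rule of thumb. The function names, the 30-day month and the choice of two nines as the cost baseline are assumptions made for illustration, and the multiplier is the rule of thumb itself, not a measured figure.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month (43,200 minutes)


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Downtime budget per 30-day month for a given availability target."""
    return MINUTES_PER_MONTH * (1 - sla_percent / 100)


def relative_cost(nines: int, baseline_nines: int = 2) -> float:
    """Cost multiplier under the 10x-per-nine rule of thumb (illustrative only)."""
    return 10 ** (nines - baseline_nines)


for nines, sla in [(2, 99.0), (3, 99.9), (4, 99.99), (5, 99.999)]:
    budget = allowed_downtime_minutes(sla)
    print(f"{sla}%: {budget:.2f} min/month of downtime, ~{relative_cost(nines)}x cost")
```

Going from 99.9% to 99.99% shrinks the monthly budget from roughly 43 minutes to roughly 4 minutes, while the rule of thumb suggests the cost grows tenfold. Whether those extra minutes are worth it is a business question.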

Offloading the penalty to 3rd parties

Business match

  1. What SLI is key to your customers’ experience? Availability is a common one but it can be any other golden signal or even something else.
  2. What is the least SLA commitment that you can get away with? Consider the cost, team capacity and your business model.
  • If the site can’t show the news or is too slow, the readers may get their news from another site. 30 minutes of downtime per month may be acceptable but 5 hours of downtime may turn into news itself!
  • If the ads don’t render, you may lose some ad revenue. 1 day may be acceptable but not an entire week. You may not have an obligation to pay the advertisers but you’re losing ad revenue.
  • If the paywall functionality is broken, the users may be able to read some paywalled articles for free and it hurts conversion. A week may be acceptable but not an entire month.
  • If the login functionality is broken, the paid subscribers may not be able to read paywalled articles. In that case, you may have to credit them.

Pragmatism

What is the maximum number of minutes that the users will comfortably tolerate bad service?
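One way to turn the answer to that question into a number is to convert the tolerated minutes into an availability percentage. The sketch below assumes a 30-day month, and the function name is made up for this example.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def sla_from_tolerated_downtime(minutes_per_month: float) -> float:
    """Availability percentage implied by a tolerated monthly downtime."""
    if not 0 <= minutes_per_month <= MINUTES_PER_MONTH:
        raise ValueError("downtime must be between 0 and the length of the month")
    return 100 * (1 - minutes_per_month / MINUTES_PER_MONTH)


print(f"{sla_from_tolerated_downtime(30):.2f}%")      # 30 minutes per month
print(f"{sla_from_tolerated_downtime(5 * 60):.2f}%")  # 5 hours per month
```

30 minutes per month works out to roughly 99.93%, and 5 hours to roughly 99.31%. Starting from what the users tolerate tends to give a more defensible target than picking a number of nines up front.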

Calendar month vs lookback window

MTBF

Runtime

  • Since the AWS regions are isolated from each other, a failure is usually contained within a single region. One way to improve the reliability is to have redundancy across multiple regions.
  • An even more reliable (and costly) solution is a multi-provider strategy where the service is replicated across multiple cloud providers. The downside is that you have to avoid any vendor-specific feature in order to create a portable solution. This avoidance may limit the solution to “unmanaged” services, which may lead to a lower SLA.
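The effect of redundancy can be estimated with a bit of probability. The sketch below assumes that the replicas (regions or providers) fail independently and that the service is up as long as at least one replica is up. Both are simplifications, and the function name is hypothetical.

```python
def combined_availability(per_replica: float, replicas: int) -> float:
    """Probability that at least one of N independent replicas is up.

    per_replica is a fraction between 0 and 1 (e.g. 0.99 for 99%).
    """
    if not 0 <= per_replica <= 1 or replicas < 1:
        raise ValueError("per_replica must be in [0, 1] and replicas >= 1")
    # The service is down only when every replica is down at the same time.
    return 1 - (1 - per_replica) ** replicas


for n in (1, 2, 3):
    print(f"{n} replica(s) at 99%: {combined_availability(0.99, n) * 100:.4f}%")
```

Treat the result as an upper bound: anything shared between the replicas (DNS, a global load balancer, the deployment pipeline, a bad release) breaks the independence assumption and lowers the real number.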

Measurement tool

Point of measurement

  1. From inside the cluster: For example, counting the number of successful Kubernetes health checks divided by the total number of health checks. This does not give a good picture of the uptime from the user’s perspective. Besides, a failed health check is a signal to Kubernetes, which may restart the container or stop routing traffic to the pod, and there’s no guarantee that it affects the end user at all.
  2. Outside the cluster but in the same cloud provider: If the cloud provider encounters issues, both the service and the monitoring tool may be affected and we’ll be running blind.
  3. Using a 3rd party provider through the public internet: If the 3rd party does not run on the same cloud provider, this can give a good signal about the user experience. On the other hand, these 3rd parties usually run their tests from only a limited number of locations. Moreover, the link from these 3rd parties to the internet is usually more stable than that of, say, a user on wifi.
  4. From the user client application: This gives the most accurate signal because it measures when the system is down from the end user’s perspective. On the other hand, it may pollute your data with whatever unrelated connectivity issues the end user is having that are out of your control.
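Whichever point of measurement you pick, the calculation itself is the same: good events divided by total events. Here is a minimal sketch that computes it per point of measurement so the numbers can be compared side by side. The `ProbeResult` type and the source labels are hypothetical, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    source: str    # hypothetical label for the point of measurement
    success: bool  # whether the check or request succeeded


def availability_by_source(results: list[ProbeResult]) -> dict[str, float]:
    """Ratio of successful probes to total probes, grouped by source."""
    totals: dict[str, tuple[int, int]] = {}
    for r in results:
        good, total = totals.get(r.source, (0, 0))
        totals[r.source] = (good + int(r.success), total + 1)
    return {source: good / total for source, (good, total) in totals.items()}


sample = [
    ProbeResult("health_check", True),
    ProbeResult("health_check", True),
    ProbeResult("third_party", True),
    ProbeResult("third_party", False),
]
print(availability_by_source(sample))
```

A large gap between the sources is itself a useful signal: it hints that something between the cluster and the user is failing.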

Status page

Build trust with every incident.

Sampling

Synthetic vs actual flows

  • It may not represent how the actual users experience your service in its entirety, and it may give a false sense of confidence
  • Even if it models the user experience with good accuracy when it was set up, it may drift from the actual flow as the product evolves
  • It creates fake load against the system, and extra measures need to be taken to filter that data out of the actual business metrics (e.g. number of active users per day). It may even add to the costs if the synthetic load causes unnecessary scale-out or prevents a service from being retired.
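As a sketch of that filtering step, suppose every event carries a flag that marks synthetic traffic (for example, set from a header that the synthetic client sends). The `is_synthetic` and `user_id` field names are assumptions for this example, not something a particular tool provides.

```python
def daily_active_users(events: list[dict]) -> int:
    """Count distinct users for the day, ignoring synthetic traffic.

    Events without the flag are treated as real user traffic.
    """
    return len({
        e["user_id"]
        for e in events
        if not e.get("is_synthetic", False)
    })


events = [
    {"user_id": "alice"},
    {"user_id": "bob"},
    {"user_id": "alice"},
    {"user_id": "probe-1", "is_synthetic": True},
]
print(daily_active_users(events))
```

The tagging has to happen at the source. If the synthetic client forgets to mark its requests, the fake load silently leaks into the business metrics.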

Not just availability

  • GitHub: if pushing a commit takes too long to start, it may break CI/CD pipelines and ruin GitHub’s reputation to the extent that people may consider migrating to rivals like Bitbucket or GitLab. Therefore they may define an SLA for commit push latency.
  • CNN: if too many ads fail to show up properly or be seen on the page, there might be some client-side issue that hurts the ad revenue. Therefore they may define an SLA for the error rate of ads.
  • 3 letter agency: if too many people switch to Linux because the data collection agent built into other operating systems consumes too much CPU or network bandwidth, it reduces surveillance data points and increases national security risk! Therefore they may define an SLA for CPU/Network saturation metrics.
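Non-availability indicators can be expressed in the same "good events over total events" form. The sketch below shows a latency indicator and an error rate indicator. The thresholds, targets and function names are invented for illustration and are not taken from any of the companies above.

```python
def fraction_within(latencies_ms: list[float], threshold_ms: float) -> float:
    """Share of requests that completed within the latency threshold."""
    if not latencies_ms:
        raise ValueError("need at least one measurement")
    return sum(1 for ms in latencies_ms if ms <= threshold_ms) / len(latencies_ms)


def error_rate(failed: int, total: int) -> float:
    """Share of events that failed (e.g. ads that did not render)."""
    if total <= 0 or not 0 <= failed <= total:
        raise ValueError("need total > 0 and 0 <= failed <= total")
    return failed / total


# Hypothetical targets: 99% of pushes start within 500 ms, at most 1% of ads fail.
push_latencies_ms = [100, 200, 300, 1200]
print(fraction_within(push_latencies_ms, 500) >= 0.99)
print(error_rate(failed=5, total=1000) <= 0.01)
```

With the sample data, the latency check fails (only 3 of 4 pushes are fast enough) while the ad error rate check passes.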

Conclusion

Sr. Staff Engineer @volvocars, Knowledge Worker, MSc Systems Engineering, Tech Lead, Web Developer

Alex Ewerlöf