Calculating composite SLA

As site reliability engineers (SRE), we need to understand and reason about the reliability of complex systems behind a site or endpoint.

While the Service Level Agreement (SLA) is the commitment toward the end users, the Service Level Objective (SLO) is more of a guide for the team behind a service and serves as a guardrail for their deployment using Error Budgets. Explaining that basic jargon is outside the scope of this article. Google has a good blog post if you are interested.

Pretty much any complex system is composed of multiple subsystems. But how does one go about calculating the total SLA of the system based on the SLO of individual subsystems? Well, think of the system dependency graph as a series of parallel and serial dependencies. In this article we’ll break down the maths into simple calculations.

TL;DR: for serial dependencies, multiply the availabilities; for parallel ones, multiply the unavailabilities.

Yeah, that about sums it up! 🥳 It’ll make more sense in the end.

For the sake of simplicity, we focus only on the availability Service Level Indicator (SLI), but similar reasoning can be adapted to other SLIs like latency, traffic, error rate, etc.

Tip

We’re going to work with numbers in this article but it doesn’t have to be painful. Fortunately, Google has smart features in its search box. For example, you can search “how many minutes in a month” and get an answer. You can also do math there.

To convert an SLA to absolute time values, there are a bunch of tools like uptime.is or slatools.com.
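If you would rather script the conversion, here is a minimal Python sketch. It assumes an average month of 43,800 minutes (365 days / 12), and the function name `downtime_minutes_per_month` is just an illustrative choice.

```python
MINUTES_PER_MONTH = 43_800  # 365 days * 24 h * 60 min / 12 months


def downtime_minutes_per_month(availability_percent: float) -> float:
    """Return the downtime allowed per average month, in minutes."""
    return (1 - availability_percent / 100) * MINUTES_PER_MONTH


for sla in (99.0, 99.5, 99.9):
    minutes = downtime_minutes_per_month(sla)
    print(f"{sla}% -> {minutes:.1f} min/month")
# 99.0% -> 438.0 min/month
# 99.5% -> 219.0 min/month
# 99.9% -> 43.8 min/month
```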

A single service

Let’s start simple: assume there is only one single system behind the endpoint — a “monolith”, if you will:

In this silliest and simplest case, the endpoint availability SLA is equal to the uptime of the system behind it.

Setting an SLA has many nuances that I’ve tried to cover in a separate article:

Serial dependency

Now let’s look at another example. Suppose our monolith has a hard dependency on another system (e.g. a database). If either System A or System B fails, the endpoint fails:

This dependency affects the SLA like so:

Let’s assume the following availability for the two systems:

  • 99.5% availability for System A
  • 99.6% availability for System B

The endpoint availability SLA is simply calculated by multiplying the availability of the two systems: 99.5% * 99.6% = 99.10%

The endpoint is less available than either of those systems. This is not surprising: because of the hard dependency between the two systems, the endpoint fails if either of them fails.
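If the multiplication feels too convenient, a quick simulation can act as a sanity check. The sketch below is only an illustration and assumes that the two systems fail independently of each other; the function name, trial count and seed are arbitrary choices.

```python
import random


def simulate_serial(avail_a: float, avail_b: float,
                    trials: int = 200_000, seed: int = 42) -> float:
    """Estimate endpoint availability when both systems must be up.

    Assumption: the two systems fail independently of each other.
    """
    rng = random.Random(seed)
    up = 0
    for _ in range(trials):
        a_up = rng.random() < avail_a
        b_up = rng.random() < avail_b
        if a_up and b_up:  # hard dependency: both must be up
            up += 1
    return up / trials


estimate = simulate_serial(0.995, 0.996)
exact = 0.995 * 0.996
print(f"simulated: {estimate:.4f}, multiplied: {exact:.4f}")
```

The simulated value should land very close to the product. If failures are correlated (for example both systems share a power supply), the independence assumption breaks and so does the simple multiplication.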

A bit more realistic

The single instance System A may fail for a variety of causes:

  • Application: unhandled exceptions, time sync issues, security issues, etc.
  • Runtime: cloud provider hiccups, lack of space/memory, etc.

When you think about it, the application cannot be more reliable than the runtime it is running on (e.g. EC2, Lambda, Google Compute Engine, etc.). There are workarounds to build a more reliable system on top of a less reliable one, and we’ll touch upon them in the failover and fallback sections below. Without them, the runtime acts as a serial dependency of System A.

Let’s assume it is running on GCE. According to Google, it has an SLA of 99.5%:

Availability of System A = Application Availability * Runtime Availability = 99.5% * 99.5% = 99.0%

Oops! All of a sudden we went from roughly 3.6h/month of downtime to 7.3h/month!

Remember the network between our public endpoint and the end user’s machine? It is a sophisticated network of routers, switches, fiber optics, satellites, wifi, etc. That is not 100% reliable either. If we have the numbers, we can treat the internet as a serial dependency and calculate the perceived SLA.

Just as a hypothetical example, if a user on a 5G network has connectivity 90% of the time, then even if our endpoint has an availability SLA of 99.0%, they still perceive it as 90% * 99.0% = 89.1%, which amounts to close to 80 hours of downtime per month!!!

Internet connectivity varies dramatically from user to user. This is one reason that the SLA is usually calculated against our public endpoint and not from inside the user’s machine (a mobile app for example).
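The layering in this section (application, runtime, user’s network) can be sketched as a running product. This is only an illustration: the numbers are the ones used above, the 90% connectivity figure is hypothetical, and a month is assumed to have 730 hours.

```python
from itertools import accumulate
from operator import mul

HOURS_PER_MONTH = 730  # 8,760 hours in a year / 12

# Availability of each layer as a fraction. The 90% figure is the
# hypothetical 5G connectivity from the text, not a measured value.
layers = [
    ("application", 0.995),
    ("runtime (GCE)", 0.995),
    ("user's network", 0.90),
]

# Running product: availability as seen "up to and including" each layer.
cumulative = list(accumulate((a for _, a in layers), mul))

for (name, _), total in zip(layers, cumulative):
    downtime = (1 - total) * HOURS_PER_MONTH
    print(f"up to {name}: {total:.2%} available, ~{downtime:.1f} h/month down")
```

Each extra serial layer can only lower the total, which is why the last line lands at about 89.1%.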

Multiple dependencies

It is more common for a system to depend on multiple other systems. For example Backend For Frontend (BFF) or GraphQL may talk to multiple backends to offer a unified interface to the frontend.

For example we may have a system that depends on an authentication service and a database. When a request comes, it first has to check whether the user is authenticated and then operate on the database.

Validating requests like this is a terrible idea. Apart from hurting the availability SLA, it also adds latency. HTTP Basic or JWT can be validated locally, but let’s say that in this architecture every request must be validated against the authentication server and there’s no cache in place either.

This dependency affects the SLA like so:

Let’s assume the following availability for the three systems:

  • 99.5% availability for System A
  • 99.6% availability for System B (auth)
  • 99.1% availability for System C (database)

It turns out the math is no different than if these systems were serially connected, i.e. the endpoint availability SLA is: 99.5% * 99.6% * 99.1% = 98.21%

As you might have guessed, having two hard dependencies dramatically affects the SLA. 98.21% uptime means that the service will be down for 784 minutes or 13+ hours per month!
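The same calculation works for any number of hard dependencies. A minimal sketch in Python, where `serial_availability` is a hypothetical helper name and a month is taken as 43,800 minutes:

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain where every system must be up (fractions in [0, 1])."""
    return prod(availabilities)


# System A, System B (auth), System C (database)
endpoint = serial_availability(0.995, 0.996, 0.991)
downtime_minutes = (1 - endpoint) * 43_800

print(f"{endpoint:.2%}")        # 98.21%
print(round(downtime_minutes))  # 784
```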

Parallel failover

To improve the endpoint reliability, let’s say we run two equal instances of our system behind a load balancer so that if one is not available, the other can handle the load (failover).

This dependency affects the SLA like so:

There are different strategies for load balancing (round robin, weighted, least response time, etc.) but let’s not get into that. For now we assume that the load balancer is able to get a response to the end user’s request if either of the two replicas is available.

If we only have two replicas, it is best to decouple their availability from each other by running each in its own availability zone or even in a different geographical region. That way if the cloud provider encounters an issue in one region, the replica in the other region can continue to serve the users.

For the sake of simplicity, assume that the load balancer is fault tolerant and we are not concerned with its SLA at this time.

A fault tolerant system has no service interruption but a significantly higher cost, while a highly available environment has a minimal service interruption.

So we have two systems each with their own availability based on the availability of the underlying infrastructure where they run. For example:

  • 99.5% availability for replica 1
  • 99.6% availability for replica 2

To calculate the endpoint availability SLA, let’s first see what the risk of each individual replica being unavailable is:

  • 0.5% unavailability for replica 1
  • 0.4% unavailability for replica 2

Assuming the replicas fail independently of each other, the risk of both failing at the same time is: 0.5% * 0.4% = 0.002% (as fractions, 0.005 * 0.004 = 0.00002)

Therefore the endpoint availability SLA is: 100% - 0.002% = 99.998%

Not bad! The replicas improved the endpoint availability SLA but that was the point of it, wasn’t it? Note that to improve the reliability, we have to spend more money on infrastructure. Besides, designing the system so that it can run as replicas may require some upfront refactoring (for example making it stateless or ensuring idempotency). Therefore there is an upfront refactoring cost and potential maintenance complexity.
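The parallel case can be sketched the same way: multiply the unavailabilities and subtract the result from 1. The helper name `parallel_availability` is illustrative, and the calculation assumes that replicas fail independently, which is the reason for spreading them across zones or regions.

```python
from math import prod


def parallel_availability(*availabilities: float) -> float:
    """Fraction of time at least one replica is up.

    Assumption: replicas fail independently of each other.
    """
    all_down = prod(1 - a for a in availabilities)
    return 1 - all_down


endpoint = parallel_availability(0.995, 0.996)
print(f"{endpoint:.3%}")  # 99.998%
```

Passing more replicas to the same function shows how quickly the remaining risk shrinks, as long as the independence assumption holds.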

Fallback

Failover, fallback, … what’s going on? Both are patterns for improving reliability but let’s clarify the distinction:

  • Failover: Perform the activity against identical copies of the system (either wait for one to fail or just send the request to all and return the quickest response)
  • Fallback: Use a different mechanism to achieve the same result.

Let’s say our endpoint has only one job: to take data from the endpoint, validate it and write it to the database. But the database availability is not that high. One solution is to have a queue and throw the data in there when the database is not available. As soon as the database is available, we consume the queue and dump the data to the database:

This dependency affects the SLA like so:

As you can see the availability of System A has a higher impact on the endpoint Availability.

For the sake of simplicity, let’s assume that the data dumping mechanism always works. Assume the availability of each system is as follows:

  • 99.5% availability for System A
  • 99.1% availability for System B (database)
  • 99.8% availability for System C (queue)

This is a bit more complex. System A depends on either System B or C being available. So the availability of System A’s dependency can be computed by the fact that System B and C are parallel:

  • 0.9% unavailability for System B (database)
  • 0.2% unavailability for System C (queue)

The risk of both System B and its fallback, System C, failing at the same time is: 0.9% * 0.2% = 0.0018% (as fractions, 0.009 * 0.002 = 0.000018)

Therefore the availability for System A’s dependency is: 100% - 0.0018% = 99.9982%

But System A isn’t perfect either. Let’s see how it affects the endpoint availability by picturing the dependencies (System B & C) as one system:

As we saw with the serial composition, the endpoint availability SLA is (availability of System A) * (availability of dependencies of System A): 99.5% * 99.9982% ≈ 99.50%

Not bad! If we did not have the fallback queue (System C), the total endpoint availability SLA would be (availability of System A) * (availability of System B): 99.5% * 99.1% = 98.60%

That difference of roughly 0.89 percentage points might not sound like much, but when put in the context of 43800 minutes in a month, it means about 391 minutes or 6.5 hours less downtime per month for the end users. That might be the difference between a relatively happy user and a user who takes their money to your competitor.
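Serial and parallel compositions nest, so the fallback architecture can be expressed as a serial chain that contains a parallel group. A minimal sketch, assuming independent failures and using the hypothetical helper names `serial` and `parallel`:

```python
from math import prod

MINUTES_PER_MONTH = 43_800


def serial(*parts: float) -> float:
    """All parts must be up."""
    return prod(parts)


def parallel(*parts: float) -> float:
    """At least one part must be up (independent failures assumed)."""
    return 1 - prod(1 - p for p in parts)


system_a, database, queue = 0.995, 0.991, 0.998

with_fallback = serial(system_a, parallel(database, queue))
without_fallback = serial(system_a, database)
saved_minutes = (with_fallback - without_fallback) * MINUTES_PER_MONTH

print(f"{with_fallback:.2%}")     # 99.50%
print(f"{without_fallback:.2%}")  # 98.60%
print(round(saved_minutes))       # 391
```

With the fallback in place, the endpoint availability is almost entirely bounded by System A itself.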

Obviously adding the queue and its dumping mechanism are not free. It increases system complexity and total infrastructure cost but if the product and market demand a stricter error budget, this is the price to pay.

Conclusion

We looked at how serial and parallel dependencies affect the SLA calculations. In practice there are many more factors that can affect the reliability as well as architectural concepts to mitigate them. Key takeaways:

  • Reliability is not free
  • SLA is tied to system architecture

If you liked this article, please share it and follow me for an upcoming article about architectural patterns to improve reliability.
