Calculating the SLA of a system behind a CDN

Alex Ewerlöf (moved to substack)
4 min read · Jan 30, 2022

This article builds on a previous one and uses what we learned there to calculate the SLA of a system that sits behind a CDN. If you want to read that one, here it is:

Cache

A content delivery network (CDN) consists of multiple caches spread across the globe in what are called “edge locations”. Let’s first examine a single cache.

One way to improve an endpoint’s reliability (and reduce its costs) is to put it behind a more reliable cache. How exactly does the cache improve reliability?

To answer that question, let’s establish a few terms:

  • Cache hit: when the request can be answered directly from the cache memory instead of going to the origin (our service)
  • Cache miss: when there’s no cached response available and the request has to go all the way to the origin (our service), whose response will probably be cached
  • Hit ratio = (Cache hit) / (Cache hit + Cache miss) * 100
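As a quick illustration, the hit ratio can be computed from raw request counters. This is only a minimal sketch: the function name and the example counts are made up for illustration and do not come from any particular cache product.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Return the cache hit ratio as a percentage (0 to 100)."""
    total = hits + misses
    if total == 0:
        # No traffic yet: report 0% instead of dividing by zero.
        return 0.0
    return 100 * hits / total


# Hypothetical counters: 800 requests served from the cache, 200 sent to the origin.
print(hit_ratio(800, 200))  # 80.0
```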

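One way to see this system is to picture the internal cache memory as a system that runs in parallel to the origin. The chance of using the cache memory equals the hit ratio. Let’s say our hit ratio is 80%:

We have already seen how to calculate the availability SLA of such a system. Here System A is the cache (99.99% available), System B is its cache memory (available 80% of the time, the hit ratio) and System C is the origin (98.42% available):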

System A dependency unavailability = (System B unavailability) * (System C unavailability) = 20% * 1.58% = 0.316%

System A dependency availability = 100% - 0.316% = 99.684%

Endpoint Availability SLA = (System A Availability) * (System A dependency availability) = 99.99% * 99.684% = 99.67%
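The same arithmetic can be written as a small function. This is a sketch under the assumptions used in this example (a 99.99% cache, a 98.42% origin and an 80% hit ratio); the function and parameter names are illustrative only.

```python
def endpoint_availability(cache: float, origin: float, hit_ratio: float) -> float:
    """Availability of an endpoint served by a cache in front of an origin.

    All arguments and the result are fractions between 0 and 1.
    """
    # The cache memory "fails" on a miss, so the dependency is down only
    # when there is a miss AND the origin is unavailable.
    dependency_unavailability = (1 - hit_ratio) * (1 - origin)
    dependency_availability = 1 - dependency_unavailability
    # The cache itself is a serial dependency in front of that parallel pair.
    return cache * dependency_availability


availability = endpoint_availability(cache=0.9999, origin=0.9842, hit_ratio=0.80)
print(f"{availability:.2%}")  # 99.67%
```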

As you can see, the endpoint availability is lower than that of the cache itself but much better than that of the origin. The problem with these calculations is that the result greatly depends on the hit ratio, which is a variable factor.

Now let’s calculate the worst-case scenario, when the cache hit ratio is 0% and the cache essentially has to call the origin for every single request:

System B can be totally ignored. The cache now acts as if it has a hard serial dependency on the origin.

Endpoint Availability SLA = (System A Availability) * (System C Availability) = 99.99% * 98.42% = 98.41%

Unsurprisingly, the endpoint availability is even worse than if we directly called the origin without having to go through a useless cache.

For the sake of argument, if the hit ratio were 100%, the endpoint availability would be equal to the availability of the cache (99.99%).

Realistically speaking, however, the endpoint availability is somewhere between 98.41% and 99.99%, depending on the cache hit ratio. In other words, the endpoint Availability SLA is between:

  • Worst case scenario: cache Availability * origin Availability
  • Best case scenario: cache Availability

When deciding the SLA and committing to a penalty, the safe path is to commit to the worst-case scenario.
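To see how the endpoint availability moves between those two bounds, the calculation can be swept over a few hit ratios. This sketch reuses the 99.99% cache and 98.42% origin from the example; the chosen hit ratios are arbitrary sample points.

```python
CACHE = 0.9999   # cache availability from the example
ORIGIN = 0.9842  # origin availability from the example


def endpoint_availability(hit_ratio: float) -> float:
    """Endpoint availability (0..1) for a given hit ratio (0..1)."""
    return CACHE * (1 - (1 - hit_ratio) * (1 - ORIGIN))


worst_case = endpoint_availability(0.0)  # equals CACHE * ORIGIN
best_case = endpoint_availability(1.0)   # equals CACHE

# Arbitrary sample points between the two bounds.
for ratio in (0.0, 0.25, 0.5, 0.8, 1.0):
    print(f"hit ratio {ratio:>4.0%}: {endpoint_availability(ratio):.2%}")
```

The availability grows linearly with the hit ratio, which is why only the 0% case is a safe basis for an SLA commitment.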

CDN

A content delivery network acts as a load balancer in front of multiple caches spread across the world. However, if all these caches reuse the same origin, that origin becomes a single point of failure (SPOF):

In this scenario, systems B, C and D run in parallel and improve the availability, but the resulting availability is then multiplied by the availability of System E, which is a serial dependency.

In this example, the cache layer has an availability of 99.999999% but the endpoint availability is 99.999%*99.999999%*98.2%=98.199%. This is assuming that the origin (system E) will in fact have that availability while handling load from 3 different caches. Even if the origin maintains that availability under that traffic, it may affect other important metrics like latency.
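The serial product above can be reproduced in a few lines. This is a minimal sketch that takes the three factors exactly as stated in the text; the variable names are illustrative, and the first factor is labelled as the CDN DNS layer on the assumption that it is the 99.999% figure the article attributes to it.

```python
import math


def serial(*availabilities: float) -> float:
    """Every component must be up, so the availabilities multiply."""
    return math.prod(availabilities)


cdn_dns = 0.99999         # assumed to be the CDN DNS / load-balancing layer
cache_layer = 0.99999999  # the parallel caches combined, as stated in the text
origin = 0.982            # the single origin shared by all caches

print(f"{serial(cdn_dns, cache_layer, origin):.3%}")  # 98.199%
```

However many nines the layers in front have, the product can never exceed the availability of the shared origin.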

A more reliable system has multiple origin replicas:

In this scenario, each cache (system B, C or D) has a failover origin and, as we’ve seen, that dramatically improves the reliability.

We can think of this system as having multiple layers:

The CDN DNS availability is 99.999%

We already know that the cache layer availability is 99.999999%

And the availability of the two origin replicas, each 98.2% available, is 100% - (1.8% * 1.8%) = 99.9676%

The endpoint Availability SLA is 99.999% * 99.999999% * 99.9676% = 99.9666%

This is an example where a less reliable origin (98.2%) can be behind an endpoint with a higher availability (~99.97%).
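The layered view lends itself to two small helpers, one for parallel (redundant) components and one for serial (chained) ones. The sketch below assumes each of the two origin replicas is as available as the single origin discussed earlier (98.2%); the function names are illustrative.

```python
import math


def parallel(*availabilities: float) -> float:
    """Redundant components: the group is down only if all of them are down."""
    return 1 - math.prod(1 - a for a in availabilities)


def serial(*availabilities: float) -> float:
    """Chained components: the chain is up only if all of them are up."""
    return math.prod(availabilities)


def layered_endpoint(dns: float, cache_layer: float, replicas: list[float]) -> float:
    """DNS layer -> cache layer -> origin layer made of redundant replicas."""
    origin_layer = parallel(*replicas)
    return serial(dns, cache_layer, origin_layer)


# Assumption: two replicas, each 98.2% available.
endpoint = layered_endpoint(0.99999, 0.99999999, [0.982, 0.982])
print(f"{endpoint:.4%}")
```

Passing a longer `replicas` list shows how each additional replica pushes the origin layer, and with it the endpoint, closer to the availability of the layers in front of it.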
