The Stack
In my previous post I established “The Three Service Truths”: “reliability is the most important requirement of a service”, “users determine this reliability”, and finally “it’s okay not to be perfect all the time”. I also gave a brief overview of the Reliability Stack so that we are on the same page. In this post you will hear more about the three layers of that stack: SLIs, SLOs and Error Budgets.
Service Level Indicators (SLI)
SLIs are the most critical part of the reliability stack. They sit at the base of the stack (figure 1), and the SLOs and Error Budgets rely entirely on them. Thinking about your services from your users’ perspective will be a watershed moment for your organisation. Without a good SLI, you may never reach the point of having a reasonable SLO target or a calculated Error Budget that you can use to trigger decision making. An SLI is a measurement derived from a metric or another piece of data that represents a property of a service.
An SLI is most useful when each measurement can be classified as “good” or “bad”. Think about loading a website. Users like a website that loads quickly. Quickly does not mean instantly; it means quickly enough to satisfy users. For example, users might be happy as long as a page loads within 3 seconds. That means you can define a page load time of 3 seconds or less as a “good” value and anything beyond 3 seconds as a “bad” value. Once you have these “good” and “bad” values, you can calculate a percentage: the number of “good” events divided by the total number of events. Let’s say you have 50,000 visitors in a given period and you measure that 49,950 page loads completed within 3 seconds. Simple math gives the percentage of “good” events: 49,950 / 50,000 = 0.999 = 99.9%. In other words, 99.9% of page loads were quick enough for your users.
An SLI is a powerful indicator that you can use to make decisions, to alert during incidents, and so on, because it takes the user’s perspective into account. An SLI can become quite complicated for complex, deeply layered systems. However, a proper SLI has to be understandable by a large audience.
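The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function name `sli_percentage`, the `threshold_seconds` parameter and the synthetic sample data are all assumptions made for the example.

```python
def sli_percentage(load_times_seconds, threshold_seconds=3.0):
    """Return the percentage of page loads at or under the threshold."""
    total = len(load_times_seconds)
    if total == 0:
        raise ValueError("need at least one event to compute an SLI")
    # A load is "good" when it finishes within the threshold.
    good = sum(1 for t in load_times_seconds if t <= threshold_seconds)
    return 100.0 * good / total


# Synthetic data matching the example: 49,950 good loads, 50 bad ones.
sample = [1.2] * 49950 + [4.5] * 50
print(round(sli_percentage(sample), 3))
```

In a real system the load times would come from your monitoring data instead of a hard-coded list.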
Service Level Objective (SLO)
The next level of the stack is the SLO, which is built on your SLIs. You have seen how to turn the number of good events out of the total number of events into a percentage. Your SLO is a target for what that percentage should be. Continuing with our example, suppose you did not hear any complaints from your users when 99.9% of your page loads completed within 3 seconds, the “good” requests. This is the happy number for your webpage. So your target, the SLO, is to keep the percentage of good page loads at or above 99.9%. Do not focus on achieving 100%. Things will fail, and you cannot always prevent it; some failures are out of your hands. Suppose for a moment that your service achieved 100% but depends heavily on another service in its critical path that has a lower SLO. When you calculate the SLO of your service, take that critical-path service into account: your users cannot experience more reliability than it delivers, so your realistic SLO is capped by its SLO. The bottom line is to focus on reliability, not perfection.
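Here is a small sketch of how the SLO comparison might look in code. The names `meets_slo` and `reliability_ceiling` are hypothetical. Taking the minimum across dependencies is a deliberately simple reading of the point above; if failures are independent, the combined reliability can be lower still.

```python
def meets_slo(sli_percent, slo_target_percent=99.9):
    """True when the measured SLI is at or above the SLO target."""
    return sli_percent >= slo_target_percent


def reliability_ceiling(own_percent, dependency_percents):
    """Upper bound on what users can experience: the weakest link
    among this service and its critical-path dependencies."""
    return min([own_percent, *dependency_percents])


# Hypothetical numbers: a measured SLI, then a "perfect" service
# that depends on a 99.5% critical-path service.
print(meets_slo(99.93))
print(reliability_ceiling(100.0, [99.5]))
```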
It is important to revisit your SLOs. You should feel free to change your SLO target. For example, say you have a school management system that is not used on weekends. That gives you a massive opportunity to work on the service: releasing new features, taking considerable downtime, or perhaps even shutting the system down over the weekend to reduce infrastructure cost. No one would care if your system was unavailable during weekends, but the situation is different on weekdays, when users expect the system to work. So focus on your users’ expectations and change your SLO target accordingly. Re-evaluating your SLO is important for meeting what users actually want.
I will explore SLOs further, including how to pick SLO targets and when you should change them.
Error Budgets
Error Budgets are in a way the most advanced part of the reliability stack, not because they rely on both SLIs and SLOs, but because they are the most difficult to implement effectively. An Error Budget is incredibly useful when explaining the reliability status of your service to other people. There are two approaches you can use to calculate an Error Budget: “event-based” and “time-based”. Which approach is right for you depends heavily on your system.
Let’s focus on the “event-based” approach first. Think about “good” events and “bad” events. The aim is to figure out how many bad events you can sustain during a defined error budget time window without your user base becoming dissatisfied.
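As a rough sketch, an event-based budget can be derived from the SLO target and the number of events in the window. The function names, the figure of one million events and the count of bad events so far are all made up for illustration. The SLO is passed as a string so that `Decimal` keeps the arithmetic exact.

```python
from decimal import Decimal


def allowed_bad_events(total_events, slo_target_percent):
    """Bad events the window can absorb before the SLO is missed.

    slo_target_percent is a string such as "99.9".
    """
    failure_fraction = (Decimal(100) - Decimal(slo_target_percent)) / Decimal(100)
    # int() truncates, so a fractional event is never counted as budget.
    return int(total_events * failure_fraction)


def remaining_bad_events(total_events, slo_target_percent, bad_so_far):
    """Budget left after subtracting the bad events already seen."""
    return allowed_bad_events(total_events, slo_target_percent) - bad_so_far


# Hypothetical window: 1,000,000 events, 99.9% SLO, 400 bad events so far.
print(allowed_bad_events(1_000_000, "99.9"))
print(remaining_bad_events(1_000_000, "99.9", 400))
```

A negative result from `remaining_bad_events` means the budget for the window has already been overspent.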
The second approach is based on “bad time intervals”, often referred to as “bad minutes”. This gives you another way of measuring the current status of your service. Let’s say you have a 30-day time window and your SLO says your target reliability is 99.9%. This means you can have 0.1% failures or bad events over the 30-day period before you exceed your error budget. A 30-day window has 43,200 minutes, and 0.1% of that is about 43 minutes, so you can also say “we have 43 bad minutes every month”.
An Error Budget is simply a representation of how much your service can fail over a period of time. It allows you to say either “we have 30 minutes of Error Budget remaining this month” or “we can incur 5,000 more errors every day before we run out of error budget”.
The Error Budget is very useful for communication and decision making. If you have excess error budget, you can ship more features, do Chaos Engineering, and so on. If you are out of error budget, you should focus more on reliability.
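The “bad minutes” figure follows directly from the window length and the SLO target. A minimal sketch, assuming a hypothetical helper named `bad_minutes_budget` and an SLO passed as a string to keep the decimal math exact:

```python
from decimal import Decimal


def bad_minutes_budget(window_days, slo_target_percent):
    """Minutes of failure the window can absorb at the given SLO."""
    window_minutes = window_days * 24 * 60
    failure_percent = Decimal(100) - Decimal(slo_target_percent)
    return window_minutes * failure_percent / Decimal(100)


# 30-day window at 99.9%: 43,200 minutes * 0.1% = 43.2 minutes.
print(bad_minutes_budget(30, "99.9"))
```

Rounding 43.2 down gives the “43 bad minutes” quoted above.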
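The decision rule described here could be expressed as a tiny helper. This is only a sketch: the name `budget_status` and the wording of the messages are assumptions, and a real policy would likely have more than two states.

```python
def budget_status(budget_minutes, bad_minutes_so_far):
    """Return the remaining budget and a suggested focus."""
    remaining = budget_minutes - bad_minutes_so_far
    if remaining > 0:
        return remaining, "budget left: ship features or run chaos experiments"
    return remaining, "budget exhausted: prioritize reliability work"


# Hypothetical month: 43 minutes of budget, 13 already spent.
remaining, advice = budget_status(43, 13)
print(f"We have {remaining} minutes of Error Budget remaining: {advice}")
```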