How to Think About Reliability
Overview
When you decide what to measure for reliability, remember that users may expect things to be reliable in ways you never imagined. Reliability is not just about the service being “Up” or “Available”; it is about delivering what users expect to be delivered. An SLO is one way of collecting data about this. With that data, teams can make better-informed decisions and work towards their reliability goals. This page presents general guidance on how to think about the reliability of a service and what you need to consider in order to achieve it.
Note: “Users” could refer to end users, dependent services, and so on.
100% isn’t Necessary
Achieving 100% reliability is neither realistic nor expected of any service. Consider telling the time, for example. Watches tend to fall a little behind, but more often than not the owner will either not notice or will adapt to the situation. Either way, the owner is likely to remain happy with the watch, because it is “good enough”: it provides the service they want.
Users aren’t going to rush out and buy a new watch. However, if the watch starts falling behind by an hour every day, they may consider replacing it. The bottom line is that errors are tolerable as long as they stay within an acceptable range.
Reliability is Expensive
Reliability is expensive in many different ways. Financially, building a system that can tolerate failure requires several things. Firstly, the service needs to be distributed, so that a problem in one location doesn’t stop your entire service from operating. This applies whether you run your service on your own hardware in your own datacenter, with a cloud provider, or both. Secondly, the service needs to be highly available. That means a presence in more than one physical or logical location, which adds further expense. Finally, you need a thorough testing infrastructure. You are much more likely to encounter failures and unreliability if you do not properly examine your changes before they go out. This can cover everything from QA and staging environments to proper canarying techniques.
In addition to financial costs, there are human costs. Let’s take a common reliability target of 99.99% over a 30-day window. A 30-day window contains 43,200 minutes, and 0.01% of that is 4.32 minutes, so this target implies that you can only be unreliable for about 4 minutes and 19 seconds during those 30 days. Furthermore, anyone on call for this service needs to respond immediately. You can’t have a service that is 99.99% reliable if it takes on-call engineers five minutes just to get in front of a laptop and start diagnosing things. To achieve this level of reliability, you would almost certainly need multiple teams in different geographical locations.
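The arithmetic behind a downtime allowance can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function names `allowed_downtime_seconds` and `format_duration` are made up for this example, and the default 30-day window is an assumption taken from the scenario above. Targets are passed as strings so that `Decimal` keeps the percentages exact.

```python
from decimal import Decimal


def allowed_downtime_seconds(target_percent: str, window_days: int = 30) -> Decimal:
    """Return how many seconds of unreliability a target permits over the window."""
    window_seconds = Decimal(window_days) * 24 * 60 * 60
    # Unreliability is whatever remains after the target, e.g. 100 - 99.99 = 0.01 (%).
    unreliability = (Decimal(100) - Decimal(target_percent)) / Decimal(100)
    return unreliability * window_seconds


def format_duration(seconds: Decimal) -> str:
    """Render a number of seconds as minutes and seconds."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes)}m {float(secs):.1f}s"


for target in ("99.9", "99.95", "99.99"):
    print(f"{target}% -> {format_duration(allowed_downtime_seconds(target))}")
# 99.9% -> 43m 12.0s
# 99.95% -> 21m 36.0s
# 99.99% -> 4m 19.2s
```

Seen this way, each tightening of the target visibly shrinks the time available to detect, diagnose, and fix a problem.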
When you define SLOs, think carefully about each extra digit in the target, since a single decimal place can greatly affect the cost. For example:
99.9% reliability implies an unreliability of 0.1%, whereas 99.95% reliability implies an unreliability of 0.05%. Moving from 0.1% to 0.05% is a change factor of 2:
0.1 / 0.05 = 2
Meanwhile, 99.99% reliability implies an unreliability of 0.01%. This means you are moving from 0.05% to 0.01% if you are going from 99.95% to 99.99%. This is a change factor of 5:
0.05 / 0.01 = 5
Moving from 99.95% to 99.99% is thus 2.5 times the change factor of moving from 99.9% to 99.95%. The resources you need to achieve it increase accordingly. The closer you try to get to a 100% reliable service, the more expensive each step becomes. Since users don’t need you to be 100% reliable, there is no point in trying to get there.
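The same comparison can be reproduced for any pair of targets. The sketch below is illustrative only: `unreliability` and `change_factor` are hypothetical helper names, and rejecting a 100% target is a design choice for this example, reflecting that the ratio is undefined when no unreliability is allowed at all.

```python
from decimal import Decimal


def unreliability(target_percent: str) -> Decimal:
    """Return the unreliability (in percent) implied by a reliability target."""
    return Decimal(100) - Decimal(target_percent)


def change_factor(current: str, proposed: str) -> Decimal:
    """Return how many times smaller the allowed unreliability becomes."""
    proposed_unreliability = unreliability(proposed)
    if proposed_unreliability <= 0:
        # A 100% target leaves no error budget, so the factor is undefined.
        raise ValueError("proposed target must be below 100%")
    return unreliability(current) / proposed_unreliability


first_step = change_factor("99.9", "99.95")    # 0.1 / 0.05
second_step = change_factor("99.95", "99.99")  # 0.05 / 0.01
print(first_step, second_step, second_step / first_step)
# 2 5 2.5
```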
How Reliable Should You Be?
“How should you think about reliability?” There is no single answer; it depends on a variety of factors. We could say something like “Be as reliable as your users need you to be”, but it is incredibly difficult to know exactly what that level is, because it rises and falls over time. SLOs are dynamic and can be adapted to the current reality. There will be times when you should hold fast and enforce your Error Budget strictly. There will be other times when you experience unexpected events and have to temporarily set aside what the numbers tell you. An SLO is just a way of collecting data and using that data to make good decisions. When you define reliability, the best thing to do is take a step back and put yourself in the user’s shoes.
References
Books
- Site Reliability Engineering
- The Site Reliability Workbook
- Building Secure & Reliable Systems
- Implementing Service Level Objectives
- SLO Adoption and Usage in Site Reliability Engineering (https://static.googleusercontent.com/media/sre.google/en//static/pdf/slo-adoption-and-usage-in-sre.pdf)
Blogs
- https://www.datadoghq.com/videos/solving-reliability-fears-with-service-level-objectives/#setting-user-focused-slos
- https://cloud.google.com/blog/products/gcp/sre-fundamentals-slis-slas-and-slos
Google Site Reliability Engineering official space (https://sre.google/)