Why Implement SLOs?
Understand how SLOs can help your organization.
To understand the advantages of SLOs and how you can leverage them within your organization, it makes sense to first look at more traditional approaches to operations.
In many organizations today, failures are detected in one of two ways: users report an issue they are experiencing by opening support tickets, or, if a monitoring system is in place, some operational threshold is breached. A major problem with these traditional approaches is that they cannot convey the severity of an issue from a user’s perspective. A single support ticket doesn’t tell you how localized or how broad an issue might be, and traditional operational alerting correlates very poorly with actual user experience. This usually leads to wasted engineering resources, burned-out on-call teams, and fruitless operational work that doesn’t solve a real problem for your customers.
The SLO approach holds that monitoring should be done from the point of view of how your users experience your product. Executive and development teams come to an agreement on the desired performance level of an application based on how they want customers to experience their products. From this consensus, they can create insightful SLOs that accurately tie service performance to actual business goals. In this way, SLOs can help create a shared understanding of reliability among all product stakeholders and help turn long, drawn-out conversations about roadmaps and prioritization into data-driven decisions.
Just as importantly, SLOs introduce the concept of error budgets: a quantitative measure of how unreliable your services are allowed to be before they start significantly impacting the customer's experience. Error budgets help teams strike a balance between reliability and development velocity, and they can drive short-term operational response as well as long-term prioritization. They also empower teams to take more ownership of the code they are shipping by providing a way for them to autonomously manage the risk of releasing new features.
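To make the idea of an error budget concrete, here is a minimal sketch of the arithmetic, assuming a simple event-based SLO (for example, the share of successful requests in a time window). The function names, the 99.9% target, and the request counts are illustrative assumptions, not part of any particular product or API.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates in the window.

    slo_target is a fraction such as 0.999 for a 99.9% objective
    (an assumed example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the SLO has been breached.
    """
    budget = error_budget(slo_target, total_events)
    return 1.0 - bad_events / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 of them failed.
    allowed = error_budget(0.999, 1_000_000)
    remaining = budget_remaining(0.999, 1_000_000, 250)
    print(f"allowed bad events: {allowed:.0f}")
    print(f"budget remaining:   {remaining:.0%}")
```

In this sketch, a team with most of its budget left might choose to ship riskier changes, while a team close to zero might prioritize reliability work. Where those cut-offs lie is a policy decision for each organization.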