Analyzing SLO performance
SLOs need to be reviewed frequently to ensure that business and client expectations are being met and to identify areas that require improvement. To put this process into action, SLO performance must be clearly understood so that it can drive development decisions and maximize the ROI of your reliability efforts.
Through the SLO Workbench, you can access the SLO Insights panel by clicking on any SLO element in the list. This panel provides you with detailed information about your SLO's performance.
At the top of the SLO Insights page, you can find a summary overview of your SLO's configuration parameters. You can consult the following information:
- SLO name: the SLO name is displayed as the title of the page.
- Status: the current compliance status of your SLO based on its level of remaining error budget.
- Target: the target that the SLI is being measured against.
- Compliance Window: the size of the rolling window applied to the SLO. This determines how much data is being used to calculate the various SLO indicators such as SLI, Remaining Error Budget and Burn Rate.
- Data Source: which one of your available data sources is being used to collect the SLO's underlying metrics.
- SLI Category: the performance area the SLO's SLI is representing.
- SLI Evaluation Method: the evaluation method being applied to calculate the SLI. This can be directly mapped to the SLI's underlying formula: ratio formulas translate to request-based SLIs and threshold formulas translate to time-based SLIs.
SLO overview panel
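To make the two SLI evaluation methods listed above concrete, here is a minimal sketch of a ratio (request-based) and a threshold (time-based) calculation. The function names, the one-value-per-minute sampling, and the "at or below the threshold is good" rule are illustrative assumptions, not Rely.io's actual implementation.

```python
def request_based_sli(good_requests: int, total_requests: int) -> float:
    """Ratio formula: share of good requests among all requests."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully compliant
    return good_requests / total_requests


def time_based_sli(minute_values: list[float], threshold: float) -> float:
    """Threshold formula: share of minutes whose metric stayed at or below the threshold."""
    if not minute_values:
        return 1.0  # assumption: no samples counts as fully compliant
    good_minutes = sum(1 for value in minute_values if value <= threshold)
    return good_minutes / len(minute_values)


# Hypothetical data: 9,990 good requests out of 10,000
print(f"{request_based_sli(9_990, 10_000):.3f}")  # 0.999

# Hypothetical per-minute latencies (ms) checked against a 300 ms threshold
print(f"{time_based_sli([120, 180, 450, 90], threshold=300):.2f}")  # 0.75
```

In the request-based case the unit of account is a request; in the time-based case it is a minute, which is why error budgets for the two kinds of SLO are reported in requests and minutes respectively.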
The first chart in the SLO Insights page shows you how the monitoring metrics used for the SLI calculation have been evolving over time. This chart can help you identify error rate spikes or abnormal patterns in your monitoring metrics. This can be useful to bring context to post-mortem reviews, or to keep an eye on your service's operational status during incident response procedures. Moreover, you can use this information to better understand why an SLI might be behaving in a certain way and to reason about fluctuations in performance.
SLI Raw Metrics chart
You can click the SLI Metrics toggle to see the full metrics being used for your SLI calculations.
The second chart in the SLO Insights page shows you how your Service Level Indicator has been performing over time in relation to the SLO's target. The chart's description shows the last measured value for your SLI, and intuitive color coding lets you quickly see whether this value is above or below its intended target.
This chart also gives you a quick understanding of how the SLI has evolved in relation to some past moment in time. You can interact with the look back window selector at the end of the chart's description to see how much the SLI's value has varied in relation to the selected time. This is useful when you want to quickly understand how fast an SLI has recovered after a significant incident was resolved, for example.
One important thing to remember is that SLIs are calculated using rolling windows and that, due to the mechanics of how this type of window works at Rely.io, you can sometimes see your SLI spike at the beginning of each day. You can learn more about this and about how rolling windows work here.
Scrolling further down the page, you can find the Remaining Error Budget (REB) chart. This is one of the most useful charts in the SLO Insights page since it can help you quickly understand how much unreliability you can still tolerate inside your SLO's compliance window, before your target is breached.
Since all SLO time windows are rolling windows, you can expect this chart to adopt a sawtooth pattern, with error budget spikes at the end of each day. This is because, as the day finishes and a new one begins, the window rolls, dropping the data points corresponding to the earliest day of the compliance window. This means that any bad events or downtime detected in the tail end of the compliance window are no longer considered in the REB calculation after the window rolls. This causes the REB value to jump up in proportion to the amount of unreliability data dropped by sliding the compliance window.
Remaining Error Budget chart
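The sawtooth behavior described above can be reproduced with a small simulation. This is a sketch under stated assumptions: a hypothetical 7-day compliance window, a 99% target, 1,000 requests per day, a single incident of 35 bad requests, and a window that rolls once per day. The names and the formula are illustrative and not taken from Rely.io's implementation.

```python
def remaining_error_budget(good_per_day, total_per_day, target):
    """Fraction of the error budget still unused over the given days (target must be below 1)."""
    total = sum(total_per_day)
    if total == 0:
        return 1.0  # assumption: no traffic means no budget was consumed
    bad = total - sum(good_per_day)
    allowed_bad = (1 - target) * total  # the error budget, expressed in requests
    return 1 - bad / allowed_bad


WINDOW_DAYS = 7  # hypothetical compliance window
TARGET = 0.99    # hypothetical SLO target

total_per_day = [1000] * 10
good_per_day = [1000] * 10
good_per_day[2] = 965  # incident: 35 bad requests on day 3

reb_by_day = []
for end in range(WINDOW_DAYS - 1, len(total_per_day)):
    start = end - WINDOW_DAYS + 1
    reb = remaining_error_budget(
        good_per_day[start:end + 1], total_per_day[start:end + 1], TARGET
    )
    reb_by_day.append(reb)
    print(f"window ending on day {end + 1}: REB = {reb:.0%}")
```

With these numbers the budget is 70 requests, so the REB stays at 50% while day 3 is inside the window and jumps back to 100% for the window ending on day 10, once the incident day has been dropped.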
In the chart's description, you can find an intuitive interpretation of the results displayed in the chart. The first line tells you how much of its error budget your SLO has consumed over a configurable look back period. This information is displayed in requests or minutes, depending on whether the SLO is request-based or time-based, respectively.
The second line of the chart's description tells you how much error budget you have left until the window rolls at the end of the current day. This information is given as a normalized percentage and as the actual number of requests or minutes your SLO has left, again depending on whether the SLO is request-based or time-based, respectively.
To accurately interpret this chart, it's good to remember that the displayed REB is being calculated using the SLO's rolling compliance window.
Since rolling windows at Rely.io slide at the end of each day, the level of Remaining Error Budget displayed in this chart always refers to the amount of error budget that can still be consumed until the end of the day.
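As an illustration of how the same remaining error budget can be expressed both as a normalized percentage and in absolute units, here is a minimal sketch for a time-based SLO. The 30-day window, the 99.9% target, and the 10.8 bad minutes are hypothetical values, and the formula is the generic error budget definition rather than a confirmed Rely.io calculation.

```python
def time_based_budget(window_days: float, target: float, bad_minutes: float):
    """Return (budget, remaining, remaining_fraction) with budget values in minutes."""
    window_minutes = window_days * 24 * 60
    budget_minutes = (1 - target) * window_minutes  # total downtime the target allows
    remaining_minutes = budget_minutes - bad_minutes
    return budget_minutes, remaining_minutes, remaining_minutes / budget_minutes


budget, remaining, fraction = time_based_budget(
    window_days=30, target=0.999, bad_minutes=10.8
)
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min ({fraction:.0%})")
```

For these inputs the budget is about 43.2 minutes, of which about 32.4 minutes (75%) remain. A request-based SLO follows the same arithmetic with requests in place of minutes.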
The Burn Rate chart tells you about the rate of consumption of the SLO's error budget over time. Every time your SLI misbehaves, this means some level of unreliability was introduced in your application. This is always reflected in the Burn Rate chart. As a quick reminder from the Concepts and Definitions chapter, the burn rate value is meant to be interpreted as the speed at which the error budget is being consumed in relation to its expected depletion rate. The expected depletion rate, also interpreted as your SLO's allowed consumption rate, corresponds to a burn rate value of 1. A burn rate value of 2, for example, would indicate that, if this rate of consumption is maintained, the error budget will be depleted in half of the SLO's compliance window duration.
Burn Rate chart
The first line of this chart's description tells you how many times more errors your SLO produced than its allowed consumption rate permits, within the configurable look back period. Every time this number is above 1, it means your SLO is burning error budget at a rate higher than what's permissible if it is to meet its expected target.
The second line in the description provides you with a forecast of how long the SLO's error budget is expected to last if the current burn rate is maintained. This information is very useful for understanding how impactful an ongoing incident might be with regard to meeting the SLO's goal.
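The burn rate and the depletion forecast described above can be sketched as follows. This assumes the common definition of burn rate as the observed error rate divided by the error rate the target allows; the function names and sample values are hypothetical, and the exact formula Rely.io uses may differ.

```python
def burn_rate(sli_over_lookback: float, target: float) -> float:
    """Observed error rate divided by the error rate the target allows."""
    return (1 - sli_over_lookback) / (1 - target)


def days_until_exhaustion(remaining_fraction: float, window_days: float, rate: float) -> float:
    """Days the remaining budget lasts if the given burn rate is maintained."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return remaining_fraction * window_days / rate


# Hypothetical SLO: 99% target over a 28-day window, SLI of 98% over the look back period
rate = burn_rate(sli_over_lookback=0.98, target=0.99)
print(f"burn rate: {rate:.1f}x")  # 2.0x

# With the full budget still available, a 2x burn rate depletes it in half the window
print(f"days left: {days_until_exhaustion(1.0, 28, rate):.1f}")  # 14.0
```

A burn rate of 1 with a full budget returns the full window length, which matches the interpretation of 1 as the allowed consumption rate.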
Since all charts in the SLO Insights are synchronized, you can visually match burn rate spikes to significant drops in the level of remaining error budget, drops in the SLI value, or to an increase in undesired events in the SLI Raw Metrics chart.
In the image below, you can see how a very high burn rate spike of 52x translated into a sudden 20% drop of the SLO's remaining error budget for the rest of the day. You can also see how before that, around the 8th of April, a longer but lower burn rate spike of around 10x translated into the steady depletion of the SLO's error budget, indicating a more sustained problem.
Burn rate impact on remaining error budget levels
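To reason about how a spike's height and duration combine, the relationship between burn rate and budget consumption can be sketched as below. It follows from the definition that a burn rate of 1 consumes the whole budget over one full compliance window. The 30-day window is an assumption (the window length of the SLO in the image is not stated), budget fractions are taken relative to the total budget, and the function names are hypothetical.

```python
def budget_fraction_consumed(burn_rate: float, duration_hours: float, window_days: float) -> float:
    """Share of the total error budget consumed by a constant burn rate held for a duration."""
    return burn_rate * duration_hours / (window_days * 24)


def hours_to_consume(budget_fraction: float, burn_rate: float, window_days: float) -> float:
    """How long a constant burn rate must last to consume the given share of the budget."""
    return budget_fraction * window_days * 24 / burn_rate


WINDOW_DAYS = 30  # assumption: not stated in the example above

# A 52x spike needs under three hours to consume 20% of the budget
spike_hours = hours_to_consume(0.20, burn_rate=52, window_days=WINDOW_DAYS)
print(f"52x spike: {spike_hours:.1f} h to consume 20% of the budget")

# A 10x burn rate sustained for a full day consumes a third of the budget
sustained = budget_fraction_consumed(10, duration_hours=24, window_days=WINDOW_DAYS)
print(f"10x for 24 h: {sustained:.0%} of the budget")
```

Under these assumptions, a short but very tall spike and a lower but longer one can consume comparable amounts of budget, which is why both the height and the duration of a burn rate spike matter.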
Quick burn rate spikes indicate short bursts of unreliability. These can be caused by, for example, poor capacity allocation of a service, which is readily mitigated by automatically scaling infrastructure resources. Such situations should be considered when agreeing on the SLO's target, which implicitly defines its error budget. Long, sustained burn rate spikes are usually much more indicative of a real problem that can be severely impacting your users.