A common SRE requirement is the ability to report on the performance of SLOs within specific time periods. Reliability reports allow you to evaluate your SLOs for compliance within such periods. These reports provide an effective way to communicate to both technical and less-technical product stakeholders how an application has been performing, with the ultimate goal of understanding whether it has been meeting client expectations. With these insights, organizations can make data-driven decisions about how to allocate reliability and development resources and plan releases so as to mitigate the impact of unexpected bugs and issues on the end customer.
These reports are structured by grouping SLOs into customizable hierarchies and then analyzing the performance of each level of the hierarchy within the desired time frame. Hierarchies are built by creating groups and sub-groups of SLOs. Groups are created mainly from the dimensions assigned to SLOs, but users can also group their SLOs by owner or category.
To create a group, users first select a dimension, which tells the Reliability Insights page how to group your SLOs. For example, selecting the User Journey dimension as the main grouping attribute would group your SLOs by the different user journey instances you have assigned to them. SLOs that do not contain the selected dimension are grouped together in a "No [dimension]" group placed at the bottom of the groups list.
In some situations, SLOs will have more than one instance of the same dimension. Take as an example a high-level SLO that monitors both an Authentication service and an Access Control service. This SLO would be associated with two different instances, Access Control and Authentication, of the same dimension, Service. In these situations, the SLO appears in both the Access Control group and the Authentication group.
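As a rough sketch of this grouping behavior (the SLO records and field names below are illustrative, not Rely.io's actual data model), grouping by a dimension might look like:

```python
from collections import defaultdict

# Hypothetical SLO records: each SLO maps dimension names to the list of
# instances assigned to it. Names are illustrative only.
slos = [
    {"name": "Login latency", "dimensions": {"Service": ["Authentication"]}},
    {"name": "Permission checks", "dimensions": {"Service": ["Access Control"]}},
    # A high-level SLO tied to two instances of the same dimension:
    {"name": "Sign-in flow availability",
     "dimensions": {"Service": ["Authentication", "Access Control"]}},
    {"name": "Checkout errors", "dimensions": {}},  # lacks the Service dimension
]

def group_by_dimension(slos, dimension):
    """Group SLOs by instances of `dimension`. An SLO with several
    instances appears in several groups; SLOs lacking the dimension fall
    into a "No <dimension>" group placed at the bottom of the list."""
    groups = defaultdict(list)
    for slo in slos:
        instances = slo["dimensions"].get(dimension)
        if not instances:
            groups[f"No {dimension}"].append(slo["name"])
        else:
            for instance in instances:
                groups[instance].append(slo["name"])
    # Sort alphabetically, but force the fallback group to the bottom.
    ordered = sorted(groups, key=lambda g: (g == f"No {dimension}", g))
    return {g: groups[g] for g in ordered}

grouped = group_by_dimension(slos, "Service")
```

Note how the sign-in flow SLO lands in both the Authentication and Access Control groups, matching the behavior described above.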
Besides being able to build custom hierarchical views, the Reliability Insights page allows users to specify two different time-based parameters: the reporting period and the reporting frequency. The reporting period specifies the time-frame covered by the report and is defined by a start and end date. The reporting frequency informs the page about how frequently the SLOs in the report should be evaluated for compliance within the defined reporting period.
For example, you could specify the reporting period to be the month of January (Jan 1st - Jan 31st) with a reporting frequency of one week. This means that the report would evaluate each SLO for compliance for every week of the month of January.
In other words, the user would be able to see how many SLOs were in a Compliant, In Danger, or Non-Compliant state, plus the overall performance score, for every week of January. The reporting period cannot be longer than a quarter, and the reporting frequency cannot be shorter than one day.
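Assuming Monday-aligned calendar weeks (Rely.io's exact week alignment may differ), splitting a reporting period into weekly evaluation periods can be sketched as:

```python
from datetime import date, timedelta

def weekly_evaluation_periods(start, end):
    """Split a reporting period into calendar-aligned weekly evaluation
    periods (weeks starting on Monday), clipped to the reporting period.
    A sketch of the idea, not Rely.io's actual implementation."""
    periods = []
    cursor = start
    while cursor <= end:
        # Monday that starts the calendar week containing `cursor`.
        week_start = cursor - timedelta(days=cursor.weekday())
        week_end = week_start + timedelta(days=6)
        # Clip the first and last windows to the reporting period itself.
        periods.append((max(week_start, start), min(week_end, end)))
        cursor = week_end + timedelta(days=1)
    return periods

# January 2024 (Jan 1 is a Monday) yields five evaluation periods,
# the last one clipped to Jan 29-31.
periods = weekly_evaluation_periods(date(2024, 1, 1), date(2024, 1, 31))
```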
Within Rely.io, when SLOs are evaluated for compliance, they receive one of three possible scores:
• Compliant: remaining error budget > 25%
• In Danger: 0% < remaining error budget <= 25%
• Non-Compliant: remaining error budget <= 0%
These scores are determined based on how an SLO performed within a certain evaluation period. Evaluation periods are determined by the reporting frequency. If you have selected a weekly reporting frequency, then each evaluation period will correspond to one week.
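The scoring rules above can be expressed as a small function (a sketch; the name `compliance_score` is illustrative, and the budget is taken as a fraction rather than a percentage):

```python
def compliance_score(remaining_error_budget: float) -> str:
    """Map remaining error budget (a fraction, e.g. 0.10 for 10%) to one
    of the three compliance states listed above."""
    if remaining_error_budget > 0.25:
        return "Compliant"
    if remaining_error_budget > 0.0:
        return "In Danger"
    return "Non-Compliant"
```

Note the boundaries: exactly 25% remaining is In Danger, and exactly 0% is already Non-Compliant.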
Within the context of the Reliability Insights page, SLO evaluation periods are always calendar-aligned compliance windows. Learn more about calendar-aligned windows here.
Each level of the reporting view hierarchy displays a performance breakdown indicator for every evaluation period. More specifically, this indicator tells you how many of a given group’s SLOs were in the Compliant, In Danger, and Non-Compliant states within a specific evaluation period.
The most granular level of the reporting view hierarchy is always the SLOs themselves. For this level, the report displays the compliance score of each individual SLO for every evaluation period. Moreover, the report also displays the actual amount of error budget that each SLO had left by the end of the evaluation period.
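Conceptually, the per-period breakdown indicator is just a tally of each group's per-SLO scores. A minimal sketch, using hypothetical SLO names and scores:

```python
from collections import Counter

# Hypothetical per-SLO compliance scores for one evaluation period.
scores = {
    "Login latency": "Compliant",
    "Permission checks": "In Danger",
    "Sign-in flow availability": "Non-Compliant",
}

def breakdown(scores):
    """Count how many SLOs in a group landed in each compliance state
    for one evaluation period, mirroring the breakdown indicator."""
    counts = Counter(scores.values())
    # Emit all three states, including zero counts, in a fixed order.
    return {state: counts.get(state, 0)
            for state in ("Compliant", "In Danger", "Non-Compliant")}
```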
You might want to create a reliability report for a past or a current period of time; which one you get is determined by whether the end date of your report has already passed or lies in the future, respectively. In these situations it’s important to distinguish between past and current evaluation periods.
Past evaluation periods provide an accurate picture of how your SLO performed since they have access to a complete set of SLO data between the start and end dates of the evaluation period.
Current evaluation periods can be seen as forward-looking indicators that tell you how much error budget each SLO has left until the end of the evaluation period. The information displayed for current evaluation periods is constantly updating as more data is generated in real time. This type of information can be used to plan the best time for a new release, or to identify areas of your application that are more likely to impact your customers ahead of time. By the end of the evaluation period, a fixed compliance score is calculated for each SLO.