A glossary of popular terms and concepts related to SRE and to the detech.ai platform.
Definitions for core SRE concepts according to detech.ai specifications.
Service Level Objectives (SLOs) set customer expectations and define specific operational goals that IT and DevOps teams need to hit and measure themselves against. An SLO is usually derived from a third-party agreement such as an SLA and specifies a performance target for a specific metric, such as uptime or latency. SLOs can be seen as internal promises a company makes to itself in order to ensure its customer-facing agreements are honored.
Service Level Indicators (SLIs) are a measurement of service performance. SLIs are represented on a 0-100% scale and are good reflections of the level of performance a client is getting. SLIs are calculated from metrics collected by monitoring systems and measured for compliance against an SLO.
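As an illustration of the 0-100% scale, the sketch below computes an SLI as the share of good events and checks it against an SLO target. The good-events ratio and the function names `sli_percent` and `meets_slo` are assumptions made for this example. They are not taken from the detech.ai platform.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return the share of good events as a percentage on a 0-100 scale."""
    if total_events == 0:
        # Assumption: with no events there is nothing to violate, so report 100%.
        return 100.0
    return 100.0 * good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """Check an SLI value (in percent) against an SLO target (in percent)."""
    return sli >= slo_target


sli = sli_percent(good_events=9_985, total_events=10_000)
print(f"SLI: {sli:.2f}%")
print("Meets 99.9% target:", meets_slo(sli, 99.9))
print("Meets 99.5% target:", meets_slo(sli, 99.5))
```

With these sample numbers the SLI is 99.85%, which misses a 99.9% target but meets a 99.5% one.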
The Service Level is the most recent value of an SLI.
Error budgets are derived from an SLO’s target and define the maximum amount of unreliability allowed for a specific technical system within a defined time period. They can be interpreted as a buffer of time during which failures are tolerated, and which is meant to be spent on product development.
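For a time-based reading of this definition, the budget can be expressed in minutes of allowed unreliability. The sketch below assumes the budget is simply the complement of the SLO target applied to the length of the window. The function name `error_budget_minutes` is hypothetical.

```python
def error_budget_minutes(slo_target_percent: float, window_days: int) -> float:
    """Minutes of unreliability allowed by an SLO target over a window."""
    window_minutes = window_days * 24 * 60
    # The budget is whatever the target leaves over: 100% minus the target.
    allowed_fraction = (100.0 - slo_target_percent) / 100.0
    return allowed_fraction * window_minutes


budget = error_budget_minutes(slo_target_percent=99.9, window_days=30)
print(f"Error budget: {budget:.1f} minutes")
```

A 99.9% target over a 30-day window leaves roughly 43.2 minutes of allowed unreliability.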
The Burn Rate is a measurement of the rate of consumption of an SLO’s error budget. It indicates how fast a given SLO’s error budget is being depleted. The reciprocal of the burn rate (1 / burn rate) is the fraction of the SLO’s compliance window within which the error budget would be fully depleted (EB = 0) at the current error rate. For example, a burn rate of 2 means that, at the current error rate, the error budget would go from 100% to 0% within half (1/2) the duration of the SLO’s compliance window. A burn rate of 5 means that the error budget would be fully depleted in one fifth (1/5) the duration of the SLO’s compliance window.
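The relationship between burn rate and time to depletion can be sketched as follows. The formula for `burn_rate` (observed error rate divided by the error rate the SLO allows) is a commonly used definition and is an assumption here, as the exact calculation used by detech.ai is not stated. Both function names are hypothetical.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Observed error rate relative to the error rate the SLO allows.

    Both arguments are fractions, e.g. 0.002 and 0.999.
    """
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def days_until_budget_exhausted(rate: float, window_days: float) -> float:
    """Time for a full budget to reach 0% at a constant burn rate."""
    if rate <= 0:
        # No errors are being produced, so the budget is never depleted.
        return float("inf")
    return window_days / rate


print(days_until_budget_exhausted(2, window_days=30))  # half the window
print(days_until_budget_exhausted(5, window_days=30))  # one fifth of the window
print(round(burn_rate(observed_error_rate=0.002, slo_target=0.999), 3))
```

With a 30-day compliance window, a burn rate of 2 depletes a full budget in 15 days and a burn rate of 5 depletes it in 6 days.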
Remaining Error Budget is the amount of error budget an SLO has left within its compliance window. For request-based SLOs, the remaining error budget is calculated using an estimated error budget value. This estimate is based on historical data from past compliance periods. For time-based SLOs, the remaining error budget is calculated using the actual error budget derived from the SLO's target for a specific compliance window.
Compliance windows, also referred to as compliance periods or evaluation periods, are time windows over which a given SLI is evaluated against an SLO. These periods correspond to the SLO's time window. SLO error budgets are allocated for the duration of the SLO's compliance window. Compliance windows can be either rolling or calendar-aligned. All SLOs have a rolling window assigned to them when they are created, which is used for calculating the SLI. Calendar-aligned windows can be applied to SLOs in the Reliability Insights page.
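The difference between the two cases can be sketched as below. For the request-based case, the estimate is taken as the SLO's allowed error fraction applied to the mean request count of past compliance periods. That averaging rule is an assumption made for illustration, since the actual estimation method is not described here. All function names are hypothetical.

```python
from statistics import mean


def time_based_budget(slo_target: float, window_minutes: int) -> float:
    """Actual budget in bad minutes, derived directly from the SLO target."""
    return (1.0 - slo_target) * window_minutes


def estimated_request_budget(slo_target: float, past_request_counts: list) -> float:
    """Estimated budget in bad requests (assumption: mean of past periods)."""
    return (1.0 - slo_target) * mean(past_request_counts)


def remaining_budget_percent(budget: float, consumed: float) -> float:
    """Share of the budget still available; negative once it is overspent."""
    return 100.0 * (budget - consumed) / budget


minutes_budget = time_based_budget(slo_target=0.99, window_minutes=10_000)
requests_budget = estimated_request_budget(0.99, [1_000_000, 1_200_000, 800_000])

print(round(remaining_budget_percent(minutes_budget, consumed=25), 2))
print(round(remaining_budget_percent(requests_budget, consumed=2_500), 2))
```

In both sample cases a quarter of the budget has been consumed, so about 75% remains.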
Rolling windows are time windows that slide as time moves forward, meaning that as new datapoints are generated, older ones are dropped. This ensures that the total number of datapoints inside a rolling window is always the same, and that the window always contains the most recent data. Within detech.ai, SLIs are calculated using rolling windows, meaning that each SLI value is calculated taking into consideration all events that happened within the rolling window's time frame.
Important: rolling windows at detech.ai only roll at the end of each day (00:00h UTC). This might cause whatever is being measured by the rolling window to spike when a new day begins, because when the window rolls, the data pertaining to the oldest day within the window is dropped. These spikes can sometimes be seen in SLI or Remaining Error Budget charts: when the oldest day of the window is dropped, so are any bad events or bad minutes that happened that day, causing these indicators to rapidly increase in value.
Calendar-aligned windows are time windows that have a clear start and end date. This type of window can be applied to SLOs in the Reliability Insights page. When using calendar-aligned windows, SLI values are calculated taking into consideration only the events that happened between the window's start date and end date. If the end date is in the future, only events that occurred between the start date and the current time are considered for SLI and error budget calculations.
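The spike that appears when the window rolls can be reproduced with a small simulation. The sketch below assumes per-day totals of good and total events and a window measured in whole days. The function name `rolling_sli` and the sample data are hypothetical.

```python
from collections import deque


def rolling_sli(daily_counts, window_days):
    """SLI (in percent) after each day, using only the last `window_days` days.

    `daily_counts` is a list of (good_events, total_events) tuples, one per day.
    """
    window = deque(maxlen=window_days)  # appending to a full deque drops the oldest day
    slis = []
    for good, total in daily_counts:
        window.append((good, total))
        good_sum = sum(g for g, _ in window)
        total_sum = sum(t for _, t in window)
        slis.append(100.0 * good_sum / total_sum)
    return slis


# Day 1 has 100 bad events; the following days are clean.
daily = [(900, 1000), (1000, 1000), (1000, 1000), (1000, 1000)]
slis = rolling_sli(daily, window_days=3)
print([round(value, 2) for value in slis])
```

On day 4 the window rolls and day 1 is dropped together with its bad events, so the SLI jumps from about 96.67% to 100%.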
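The event selection for a calendar-aligned window can be sketched as a filter on timestamps. Treating the start as inclusive and the end as exclusive is an assumption, as is the representation of events as `(timestamp, is_good)` pairs. The function name `calendar_window_sli` is hypothetical.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple


def calendar_window_sli(
    events: List[Tuple[datetime, bool]],
    start: datetime,
    end: datetime,
    now: datetime,
) -> Optional[float]:
    """SLI (in percent) over events inside [start, min(end, now))."""
    # If the window ends in the future, only count events up to the current time.
    effective_end = min(end, now)
    in_window = [ok for ts, ok in events if start <= ts < effective_end]
    if not in_window:
        return None  # no events yet, so there is nothing to report
    return 100.0 * sum(in_window) / len(in_window)


utc = timezone.utc
events = [
    (datetime(2023, 12, 31, 12, tzinfo=utc), False),  # before the window, ignored
    (datetime(2024, 1, 2, tzinfo=utc), True),
    (datetime(2024, 1, 5, tzinfo=utc), True),
    (datetime(2024, 1, 10, tzinfo=utc), False),
    (datetime(2024, 1, 14, tzinfo=utc), True),
]
sli = calendar_window_sli(
    events,
    start=datetime(2024, 1, 1, tzinfo=utc),
    end=datetime(2024, 2, 1, tzinfo=utc),
    now=datetime(2024, 1, 15, tzinfo=utc),
)
print(sli)
```

Here the bad event from December 31 falls outside the window, so three of the four counted events are good and the SLI is 75%.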
Dimensions are sets of key/value pairs useful for providing context to an SLO within the application’s ecosystem. Each dimension represents a different segment of an application. Grouping SLOs by dimensions allows for a quick breakdown of how each one of these segments is performing with regards to reliability, as measured by the underlying SLOs assigned to each specific dimension.
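Grouping by dimension can be pictured as follows. The SLO records, the dimension keys `team` and `env`, and the function `summarize_by_dimension` are all hypothetical and only illustrate the idea of a per-segment breakdown.

```python
from collections import defaultdict

# Hypothetical SLOs, each tagged with key/value dimensions.
slos = [
    {"name": "checkout-availability", "dimensions": {"team": "payments", "env": "prod"},
     "sli": 99.95, "target": 99.9},
    {"name": "checkout-latency", "dimensions": {"team": "payments", "env": "prod"},
     "sli": 98.2, "target": 99.0},
    {"name": "search-availability", "dimensions": {"team": "search", "env": "prod"},
     "sli": 99.99, "target": 99.9},
]


def summarize_by_dimension(slos, key):
    """Count SLOs, and SLOs meeting their target, per value of one dimension key."""
    summary = defaultdict(lambda: {"total": 0, "meeting_target": 0})
    for slo in slos:
        value = slo["dimensions"].get(key)
        if value is None:
            continue  # this SLO is not tagged with the requested dimension
        summary[value]["total"] += 1
        if slo["sli"] >= slo["target"]:
            summary[value]["meeting_target"] += 1
    return dict(summary)


by_team = summarize_by_dimension(slos, "team")
print(by_team)
```

In this sample, the `payments` segment has one of its two SLOs meeting target, while `search` has its single SLO meeting target.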
Definitions for the various SLI categories. These categories specify which measure of performance an SLI represents.
Availability: The proportion of requests that returned a successful response.
Latency: The proportion of requests that returned a response faster than a given threshold.
Throughput: The proportion of data being processed at or above a given rate-of-production threshold.
Durability: The likelihood of having a known-healthy copy of an application’s or a user’s data records.
Quality: The proportion of data returned in an undegraded state.
Freshness: The proportion of up-to-date data records according to a given time threshold.
Correctness: The proportion of processed data records that returned correct and consistent values.
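Most of the categories above share one shape: a proportion of records that satisfy some notion of "good". The sketch below applies that shape to two of them, the proportion of successful requests and the proportion of requests faster than a threshold. Treating any status below 500 as successful, the 300 ms threshold, and the sample `request_log` are assumptions chosen for illustration.

```python
# Hypothetical request records collected by a monitoring system.
request_log = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 480},
    {"status": 500, "latency_ms": 90},
    {"status": 200, "latency_ms": 310},
]


def proportion(records, is_good):
    """Percentage of records for which `is_good` returns True."""
    if not records:
        return None  # nothing was measured
    good = sum(1 for record in records if is_good(record))
    return 100.0 * good / len(records)


# Assumption: a response counts as successful when it is not a server error.
successful = proportion(request_log, lambda r: r["status"] < 500)
# Assumption: a request counts as fast when it completes in under 300 ms.
fast = proportion(request_log, lambda r: r["latency_ms"] < 300)

print(successful, fast)
```

The same `proportion` helper could be reused for the other proportion-style categories by swapping in a different `is_good` predicate, for example one that checks a record's age against a time threshold.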