How and why to create an SRE error budget
No IT service is completely immune to disruption. A realistic error budget is a powerful way to set up a service for success.
Organizations that deliver IT services face an interesting conundrum: They need to make their systems as technologically advanced as possible, while maintaining impeccable uptime. But not even the best service providers can achieve 100% availability.
That doesn't mean, however, that teams can stray far from that percentage without consequences. Any service provider is accountable to its users -- especially when those users are not internal. Top-notch reliability and functionality are expected.
Occasionally, users and service providers contractually determine a compromise between uptime and feature development. These terms help parties define what's known as an error budget. The site reliability engineer (SRE) plays a key role in this process.
Let's discuss what error budgets are, why they're beneficial, and some potential drawbacks.
To understand error budgets, you must first understand service level agreements (SLAs) and service level objectives (SLOs). When vendors provide a service, they do so with a baseline understanding of that service's performance under a variety of circumstances.
Accordingly, users expect performance to fall in line on their end. An SLA is a contract stating that a service will perform in line with those expectations. This includes metrics like uptime, throughput and latency. Technical support and general communication practices are also critical components of an SLA. The SLA is binding -- failure to provide quality service results in penalties, which are often financial, for the service provider.
SLOs are more granular. These are specific, measurable targets built around the KPIs that matter to users. The SLA is an all-encompassing contract, and the SLOs form a subset of commitments within that contract. SLOs enable DevOps, IT and SRE teams to set service delivery goals.
These performance goals must align with predetermined error budgets -- the amount of downtime or failure a service can accumulate while still meeting its contractual commitments. No service is perfect, but it's critical to keep errors below a certain threshold.
Error budgets are relevant across the entirety of the service ecosystem. Whether teams assess uptime, downtime, request errors or latency, it's possible to create KPI-based allowances.
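As a rough illustration of a time-based allowance, the sketch below converts an availability target into permitted downtime. It assumes the allowance is simply the share of the period not covered by the target; the 99.9% figure, the 30-day period and the function name are hypothetical choices, not values from any particular contract.

```python
from fractions import Fraction


def allowed_downtime_minutes(slo_percent: str, period_days: int = 30) -> float:
    """Return the downtime allowance, in minutes, for an availability target."""
    # Fraction parses the decimal string exactly, which avoids float rounding.
    budget_fraction = 1 - Fraction(slo_percent) / 100
    period_minutes = period_days * 24 * 60
    return float(budget_fraction * period_minutes)


# Hypothetical target: 99.9% availability over a 30-day period.
print(allowed_downtime_minutes("99.9"))  # 43.2
```

The same approach applies to other KPIs: pick the target, pick the measurement window and treat whatever the target leaves over as the allowance.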
Determine an error budget
There are many ways to measure service engagement, including real-time user traffic, consumed bandwidth and API requests. That last metric is well suited to determining an error budget, a calculation that is, at its core, relatively simple.
API requests are typically HTTP-based and thus made via a network connection. As anyone who's visited a website or tried to use online services knows, attempts aren't always successful. Webpages hang, requests time out and access is sometimes denied due to authentication issues or bad gateways. Users receive feedback when this happens. SRE teams receive similar information.
Any API requests made within a monitored service are logged; this enables retrospective analysis after problems arise. Teams can see exactly how many errors occur during a period of time. They can then take this count, weigh it against successful requests and determine an error percentage.
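A minimal sketch of that calculation is shown here, assuming the request log has already been reduced to a list of HTTP status codes. Which responses count as errors is a policy decision; this example assumes only 5xx server errors do, and the sample log is made up for illustration.

```python
def error_percentage(status_codes: list[int]) -> float:
    """Return the share of failed requests as a percentage of all requests."""
    if not status_codes:
        return 0.0
    # Assumption: only 5xx responses count as errors in this sketch.
    failed = sum(1 for code in status_codes if code >= 500)
    return 100 * failed / len(status_codes)


# Hypothetical log: 10 requests, 2 of which failed with server errors.
sample_log = [200, 200, 503, 200, 504, 200, 200, 200, 200, 200]
print(error_percentage(sample_log))  # 20.0
```

A team that also treats certain 4xx responses as failures, such as authentication denials, would adjust the condition accordingly.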
Here's where an SLO comes in. According to Google's SRE appendix, an error budget is 1 minus the SLO of a given service. For example, let's say a service's API requests must succeed 99.8% of the time, per its SLO. The math would be as follows:
100% baseline - 99.8% SLO = 0.2% error budget
Let's assume that a service receives 100,000 API requests per month. If only 0.2% of those requests can result in errors, then the monthly error budget amounts to just 200 failed requests.
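The arithmetic above can be sketched in a few lines. The function name is hypothetical, and the choice to round down to a whole number of requests is an assumption; exact fractions are used so that 99.8% does not pick up floating-point error.

```python
import math
from fractions import Fraction


def error_budget_requests(slo_percent: str, total_requests: int) -> int:
    """Return how many requests may fail in a period without missing the SLO."""
    # Error budget = 100% - SLO, expressed here as an exact fraction.
    budget_fraction = 1 - Fraction(slo_percent) / 100
    # Assumption: round down, since a partial request cannot fail.
    return math.floor(budget_fraction * total_requests)


print(error_budget_requests("99.8", 100_000))  # 200
```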
Where does the SRE team come in?
SREs actively work to stay within predetermined error budgets. To accomplish this, the reliability engineer tackles a variety of DevOps and IT administration tasks. For example, an SRE must:
- maintain service availability;
- mitigate latency;
- boost service performance and efficiency;
- monitor service(s);
- manage changes;
- respond to outages and emergencies; and
- perform resource and capacity planning.
SREs must ensure that compute, memory and networking resources can adequately support a scaling user base. Servers must be able to direct traffic and load balance effectively.
SREs must also know which metrics are most crucial to troubleshooting and error reduction. They can then use this intel to make recommendations. The SRE's goal is to remove the burden of service availability from other teams' shoulders. They achieve this through task automation and the creation of self-service tooling. A core step in this process is consultation with other teams and those integral to the services in question.
Through automated logging, remediation and metrics-gathering, SREs make it easier for their colleagues to improve existing services more quickly than before. This increases service quality -- thus reducing errors and ensuring teams don't exceed their error budgets.
Error budget benefits and drawbacks
Error budgets rally diverse technical teams around a central goal. The assignment of a hard number incentivizes high-quality software development. It encourages developers to take risks and experiment with new functionality -- within reason. Software feature changes, adjustments to adjacent services, and even hotfixes can introduce errors; an error budget caps how much disruption those changes are allowed to cause.
Determining an error budget is crucial for strategic remediation. If services exceed the budget, DevOps and SRE professionals must determine which fixes will be the most effective. This is where the SRE's tooling and automation come in.
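One way to see when a service has exceeded its budget is to track how much of the allowance has been spent. The sketch below is illustrative only: the function name, the sample figures and the reaction to an overspend are assumptions, and a real team would define its own policy.

```python
from fractions import Fraction


def budget_consumed(slo_percent: str, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget used so far (1.0 means fully spent)."""
    allowed = (1 - Fraction(slo_percent) / 100) * total_requests
    if allowed == 0:
        # No budget at all: any failure counts as exceeding it.
        return float("inf") if failed_requests else 0.0
    return float(failed_requests / allowed)


# Hypothetical month: 99.8% SLO, 100,000 requests, 250 failures.
consumed = budget_consumed("99.8", 100_000, 250)
print(f"{consumed:.0%} of budget used")
if consumed > 1:
    print("Budget exceeded: prioritize stability work")
```

A value above 1.0 signals that the budget is gone and that remediation should take priority over new changes.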
Unfortunately, there are two sides to that coin. A variety of conditions might determine whether teams reactively prioritize feature development or stability. Keeping track of these factors -- code bugs, procedural errors, outage origins and user scope -- can be tricky. The process might not be black and white.
SLO-based error budgets can be complicated and difficult to measure. SREs don't always receive clear guidelines pertaining to SLOs. This makes it difficult to determine clear error budgets and stick to them.
Furthermore, teams that set error budgets for the first time -- or set budgets that are exceedingly ambitious -- can overpromise and underdeliver. Pushing for extreme reliability is an admirable goal, but SREs must be realistic about service limits. Otherwise, a failure to deliver comes at a cost.