6 steps to reduce SRE toil

All organizations suffer from toil -- even if they don't call it that. And while SRE is not a single product, there are many ways to instill its concepts into IT environments.

Clive Longbottom

Published: 21 Jun 2021

The work around an IT platform can be separated into two types: work that adds value to the business and work that keeps the platform running. The aim of an IT operations team should be to maximize the amount of the former while minimizing the time and cost spent on the latter.

Work that keeps the IT platform running has many different names -- "keeping the lights on" is one -- but one term, toil, is growing in acceptance. The way to reduce toil is to adopt site reliability engineering (SRE).

Toil covers tasks such as patching, updates, firefighting issues and replacing broken parts. For most IT workers, toil is mind-numbing work with little technical attraction -- and is unlikely to ever result in plaudits from IT management, never mind business management.

Google -- where SRE originated -- advises that a specific SRE team should be set up to minimize toil. For many organizations, this might not be attractive or possible. Instead, they can embed the main concepts of toil minimization by implementing SRE approaches within development and operations teams.

The aim should not be to completely eliminate toil -- there will always be tasks that cannot be magicked away. Certain work, such as better application of systems management tools or a more efficient platform monitoring approach, adds no direct value to the business -- but is not toil. Such work lays the foundation for improved toil management and brings discernible value to the business over time as the platform becomes more reliable through improved engineering.

Frighteningly, and as recently as a decade ago, the costs of toil could be up to 80% of an IT operations group's budget. Increasing equipment reliability, improved systems management tools and the increased use of automation have reduced the costs -- but they remain substantial.

Even if an IT group has reached a 50/50 split between toil and value-added IT work, a move to 40/60 is a 20% increase in the IT budget allocated to value-add -- without any change in the overall IT funding. At the 80/20 level, shifting 10 percentage points of the budget from toil to value-add increases actual value-add spend by half.
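The arithmetic behind those figures can be checked with a short sketch. The function name `value_add_gain` is hypothetical, and the calculation assumes a fixed overall budget with toil expressed as a whole-number percentage of it.

```python
def value_add_gain(toil_before_pct: int, toil_after_pct: int) -> float:
    """Relative growth in value-add spend when the toil share of a fixed budget falls.

    Both arguments are whole-number percentages of the total IT budget.
    """
    if not 0 <= toil_after_pct <= toil_before_pct < 100:
        raise ValueError("expected 0 <= toil_after_pct <= toil_before_pct < 100")
    value_before = 100 - toil_before_pct  # share spent on value-add today
    value_after = 100 - toil_after_pct    # share spent on value-add after the shift
    return (value_after - value_before) / value_before


if __name__ == "__main__":
    # 50/50 -> 40/60: value-add grows from 50 to 60 points of budget
    print(f"{value_add_gain(50, 40):.0%}")
    # 80/20 -> 70/30: value-add grows from 20 to 30 points of budget
    print(f"{value_add_gain(80, 70):.0%}")
```

The same 10-point shift is worth far more, in relative terms, to a group that starts with only 20% of its budget available for value-add work.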

How to reduce toil

There are many ways SRE minimizes the costs of toil. The following six techniques will help your IT organization improve SRE management.

Standardize

A lack of standardization leads to a more complex IT platform, which then increases toil. Minimize the number of IT platforms in place -- for example, by consolidating different types of Unix, different versions of Windows Server and multiple separate hardware suppliers. Also, interrogate function repetition. Multiple applications that carry out the same functions -- for example, overlapping customer relationship management and sales force automation applications -- increase the complexity, and therefore the toil, of the environment. Standardization makes it easier to manage the platform as other steps are taken.
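One way to start interrogating function repetition is to list which business functions are served by more than one application. The sketch below assumes a hand-maintained inventory mapping each application to the functions it provides; the application and function names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def overlapping_functions(inventory: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return each function provided by two or more applications, with those applications."""
    apps_by_function = defaultdict(list)
    for app, functions in inventory.items():
        for function in functions:
            apps_by_function[function].append(app)
    # Keep only functions that more than one application covers.
    return {
        function: sorted(apps)
        for function, apps in apps_by_function.items()
        if len(apps) > 1
    }


if __name__ == "__main__":
    # Hypothetical inventory: names are illustrative, not real products.
    inventory = {
        "crm_suite": {"contact management", "pipeline tracking"},
        "sfa_tool": {"pipeline tracking", "quota reporting"},
    }
    for function, apps in overlapping_functions(inventory).items():
        print(f"{function}: {', '.join(apps)}")
```

Each function the report lists is a candidate for consolidation onto a single application.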

Reuse

Many toil tasks are repetitive. Therefore, once a fix is found for a task, engineers should apply it repeatedly to the same task, even on a different part of the platform. A library of callable scripts will help reduce toil. Increasingly, many tools used in SRE come with preexisting libraries that cover the most common areas.
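A library of callable scripts can be as simple as a registry that maps a fix name to a function. The following is a minimal sketch, not a recommendation of any particular tool: the `register_fix` decorator, the `run_fix` helper and the example fix are all hypothetical, and the fix body is a placeholder for whatever a real script would do on the target host.

```python
from typing import Callable, Dict

# Registry of reusable fixes, keyed by a short name.
FIXES: Dict[str, Callable[[str], str]] = {}


def register_fix(name: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Decorator that adds a fix function to the shared registry."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        FIXES[name] = func
        return func
    return decorator


@register_fix("clear_temp_files")
def clear_temp_files(host: str) -> str:
    # Placeholder: a real script would connect to the host and do the work.
    return f"cleared temp files on {host}"


def run_fix(name: str, host: str) -> str:
    """Look up a registered fix by name and apply it to the given host."""
    try:
        fix = FIXES[name]
    except KeyError:
        raise LookupError(f"no fix registered for {name!r}") from None
    return fix(host)


if __name__ == "__main__":
    # The same fix is reused across different parts of the platform.
    for host in ("web-01", "db-02"):
        print(run_fix("clear_temp_files", host))
```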

Monitor

Triage, also called firefighting, is the worst thing that can happen to an IT platform. A problem that affects users harms the business, gives the business a negative perception of IT and encourages responders to cut corners. Operations teams must institute a solid procedure for monitoring the entire IT platform -- a system that can identify possible problems before they become issues and then initiate events to fix them.
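The idea of catching a possible problem early and initiating an event to fix it can be sketched as a warning threshold that sits below the point of failure. Everything here is illustrative: the `Check` class, the metric readers and the remediation callbacks are hypothetical stand-ins for whatever a real monitoring system provides.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str
    read_metric: Callable[[], float]  # returns the current value, e.g. disk usage in %
    warn_at: float                    # threshold set below the point of failure
    remediate: Callable[[], None]     # event to initiate when the threshold is crossed


def run_checks(checks: List[Check]) -> List[str]:
    """Run every check once and return the names of those that triggered remediation."""
    triggered = []
    for check in checks:
        if check.read_metric() >= check.warn_at:
            check.remediate()
            triggered.append(check.name)
    return triggered


if __name__ == "__main__":
    events: List[str] = []
    checks = [
        # Assumed readings: disk is at 85% against a warning level of 80%.
        Check("disk_usage", lambda: 85.0, 80.0, lambda: events.append("rotate logs")),
        Check("cpu_load", lambda: 10.0, 90.0, lambda: events.append("add capacity")),
    ]
    print(run_checks(checks))
    print(events)
```

In practice, a scheduler would run the checks at intervals and the remediation step would call into something like the fix library, rather than appending to a list.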

Automate

Humans are, unfortunately, often the root of problems in the IT environment. Unchecked changes can domino into catastrophic issues across the platform. Therefore, look to systems that check any change before implementation, automate that change and roll back if any problems are identified post-deployment.
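The check, apply and roll back sequence described above can be expressed as a small guard around any change. This is a sketch under stated assumptions: the four callables are hypothetical hooks supplied by the caller, and real change-management tooling will have its own interfaces.

```python
from typing import Callable


def apply_change(
    validate: Callable[[], bool],
    apply: Callable[[], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> str:
    """Check a change before implementation, apply it, and roll back if the platform is unhealthy."""
    if not validate():
        return "rejected"      # the change never reaches the platform
    apply()
    if healthy():
        return "applied"
    rollback()                 # problems found post-deployment: restore the previous state
    return "rolled back"


if __name__ == "__main__":
    # Hypothetical change: raise a connection limit; the health check fails above 300.
    config = {"max_connections": 100}
    previous = dict(config)

    result = apply_change(
        validate=lambda: True,
        apply=lambda: config.update(max_connections=500),
        healthy=lambda: config["max_connections"] <= 300,
        rollback=lambda: config.update(previous),
    )
    print(result, config)
```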

Improve

Poor code leads to more problems, which means more toil. Use an integrated DevOps approach with solid testing to improve initial code quality, with automated feedback loops between operations and development to raise any identified issues, along with indications of priority for fixing.

Embrace new technologies

But not too fast -- and don't assume they will remedy all problems. Machine learning, deep learning and AI will increasingly improve SRE capabilities but are still at an early stage of maturation in the market. However, waiting until they are 100% proven will cost your organization in toil levels. Introduce such technologies in small, defined areas and judge their effectiveness. Organizations can then begin to roll them out across the total platform as faith in their capabilities grows.
