Error budget instills reckless IT project management mindset

Just how many errors is the IT operations team allowed to make? Trendy organizations are answering this question with a budget, but that budget tempts some admins to push the envelope.

Is someone who makes no mistakes guilty of excessive caution? We might ask whether an athlete ahead by two in a championship basketball game should risk taking a foul to block the opponent's three-point shot in the final seconds. Nearly everyone would say yes. In tech, let's ask the equivalent question: Are error budgets a way of pushing IT thinking to its limits? Or is the error budget trend a simplistic justification of adrenaline junkie thinking?

An error budget, popularized by site reliability engineering (SRE), is the idea that some amount of risk is acceptable and indeed necessary to move quickly in IT operations. If a team makes no errors, it likely does not innovate or attempt to improve operations. Likewise, a team that makes a great many errors has sacrificed reliability in the name of innovation and must scale back. An error budget is a way to measure the risks taken by individuals and organizations.

I'm all for aggressive thinking in tech, but I have a big problem with the concept of an error budget. It has the potential to turn technology planning into chest-beating, adrenaline junkie thinking, encouraging risk for risk's sake. It's the wrong way to approach a legitimate question, which is whether IT project approval processes achieve the best balance of risk and reward.

Improve IT risk assessment and management

Business hyperbole about an error budget represents a contamination of its original goal. The intent of an error budget was originally to represent the level of risk to business goals created by a given approach. That is, the project's error budget was an indication of just how far off predictions could be before the deviation terminally messed up the business case for the project. Failure is tolerable at some level, and the goal of an error budget in this original context is to measure how much failure risk the venture can absorb before it affects the project justification.

An error budget in SRE has become a border that some cutting-edge IT professionals feel duty-bound to push against, and perhaps a little beyond. This morphed meaning was, in some ways, inevitable: Any time you give somebody a new measuring stick, they start measuring things with it. Charlie blew his error budget twice, and Amy never did. So, maybe, Amy isn't pushing the envelope, or Charlie is pushing it too far. There's a competition of slogans: Nothing ventured, nothing gained versus better safe than sorry.

What's missing in the popular error budget dialogue for IT change management processes is precision. Error budgeting should quantify a project's tradeoffs and risks. You must understand exactly what is generating the risks to service objectives and exactly what benefits you get from pushing the boundaries. Yet, IT professionals rarely have any solid numbers to back up the error budgeting process. In CIMI Corporation's 2016 survey of how enterprises make IT project decisions, respondents said that, in more than 90% of cases, they never truly quantified the risk side, only the benefit side.
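One way to picture what quantifying both sides might look like is a simple expected-value calculation. The sketch below is illustrative only: the function name, the inputs and the dollar figures are assumptions made up for this example, not numbers from the survey or from any SRE tool.

```python
def expected_net_benefit(benefit: float,
                         failure_probability: float,
                         failure_cost: float) -> float:
    """Benefit of a change minus its probability-weighted failure cost.

    All three inputs are hypothetical estimates the project team
    would have to supply; nothing here measures them.
    """
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    return benefit - failure_probability * failure_cost


# Hypothetical project: $100,000 in benefit, a 25% chance of a
# failure that would cost $200,000.
net = expected_net_benefit(100_000, 0.25, 200_000)
print(f"Expected net benefit: ${net:,.0f}")
```

With these made-up inputs, the risk side eats half of the claimed benefit. The point is not the formula, which is trivial, but that the risk inputs have to exist before the arithmetic means anything.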

The IT project management process is flawed. For example, most applications have an uptime goal: a percentage of time where the application is expected to be available. Only 2% of businesses in the survey had even asked what would happen if the uptime percentage were lower -- much less quantified the consequences of a change like that. Even fewer businesses estimated the benefit of better app availability. So, how did they first set the goal?
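Asking what a lower uptime percentage would mean starts with plain arithmetic. As a minimal sketch, assuming a 30-day measurement period and a hypothetical per-minute cost of downtime (both are assumptions for illustration), an uptime goal converts into tolerated downtime and a rough cost like this:

```python
def allowed_downtime_minutes(uptime_goal_pct: float,
                             period_days: int = 30) -> float:
    """Minutes of downtime an uptime goal tolerates over the period."""
    if not 0.0 <= uptime_goal_pct <= 100.0:
        raise ValueError("uptime_goal_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_goal_pct) / 100.0


def tolerated_downtime_cost(uptime_goal_pct: float,
                            cost_per_minute: float,
                            period_days: int = 30) -> float:
    """Cost of using all tolerated downtime, given an assumed cost per minute."""
    return allowed_downtime_minutes(uptime_goal_pct, period_days) * cost_per_minute


# Compare two candidate goals using a hypothetical $50-per-minute cost.
for goal in (99.9, 99.0):
    minutes = allowed_downtime_minutes(goal)
    cost = tolerated_downtime_cost(goal, cost_per_minute=50.0)
    print(f"{goal}% uptime: about {minutes:.0f} minutes, roughly ${cost:,.0f}")
```

Over 30 days, a 99.9% goal tolerates about 43 minutes of downtime and a 99% goal about 432 minutes. The cost per minute is the number most businesses in the survey never tried to estimate.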

What we can't quantify, we talk about. Instead of error budgets being a way to predict the headroom your service objectives offer, they become something people talk about around the water cooler. Companies report that their team members make comments such as: "I really pushed the old error budget with that code!" They also said that managers sometimes expect their teams to operate close to their error budget line. Without any objective measurement of risk, there's no way to know how close to the line you should go.
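For contrast, here is a rough sketch of what an objective measure of "how close to the line" could look like: the share of the tolerated downtime already used. The 30-day period, the function name and the sample figures are assumptions for illustration, not a description of how any particular team tracks its budget.

```python
def budget_consumed(uptime_goal_pct: float,
                    observed_downtime_minutes: float,
                    period_days: int = 30) -> float:
    """Fraction of the period's error budget already used (1.0 = all of it)."""
    if not 0.0 <= uptime_goal_pct <= 100.0:
        raise ValueError("uptime_goal_pct must be between 0 and 100")
    budget_minutes = period_days * 24 * 60 * (100.0 - uptime_goal_pct) / 100.0
    if budget_minutes == 0:
        raise ValueError("a 100% uptime goal leaves no error budget")
    return observed_downtime_minutes / budget_minutes


# Hypothetical month: a 99% uptime goal and 216 minutes of recorded downtime.
used = budget_consumed(99.0, observed_downtime_minutes=216)
print(f"Error budget used: {used:.0%}")
```

A figure like this says how much headroom is left. It does not say whether spending that headroom was worth it, which still depends on knowing what the downtime costs and what the risky change gained.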

There's a management failure hidden in this error budget trend. Management by objectives is a nice notion if you can define objectives as the things that you really need to care about. All too often, we pick objectives because we can measure them, not because they need to be measured.

Slash the budget

The term budget is also a problem. How seriously does government take budgets, after all? At the IT project management level, we have a project budget, and almost every manager knows that the secret is to go just a little over to ensure that you get your share of the wealth. Come in under budget by some percentage, says classical business wisdom, and expect your next project's budget to be cut by that same percentage before it even starts. Three or four years later, your budget -- and you -- vanish.

If budgets are boundaries, then pushing them is natural. Show a competitive IT professional a ribbon, and they want to break through it. Show them a starting gun, and they want to jump it. If we called error budgeting by a different name -- marginal risk/reward analysis -- would it generate the same weird turns of management behavior? It might well put management to sleep, in fact. Try it out, and you'll see that there's no sizzle.

Sizzle has a way of overselling. Jazzy terms that get people's attention can successfully publicize something, but what they sell is often not what was intended. Over time, correct usage could redeem the real notion of error budgeting. You don't foul to win, but you do play the odds to win.
