Enterprise software developers: Forget minor software bugs

Many enterprises think software isn't perfect unless they eliminate minor defects. However, this approach won't prevent major disasters, and it hinders the practices that would.

By George Lawton

Published: 12 Jun 2018

Trying to eradicate every minor software bug in a complex system is counterproductive. In the process, enterprise software developers lose the ability to detect the hidden disasters that lurk beneath the surface of seemingly error-free systems.

Some of the worst safety disasters in modern industrial history occurred in organizations that previously had a perfect safety record, argued Sidney Dekker, professor at Griffith University, at DevOps Enterprise Summit 2017 in San Francisco.

When software bugs are a good thing

Defect-free software is a noble goal, but that ambition can discourage an organization from honest communication about bigger systemic defects. "Anything that puts downwards pressure in your organization on honesty, disclosure and openness is bad for your organization and is bad for your business," Dekker noted. By contrast, teams that openly discuss bugs that would be inconvenient to fix have a chance to catch big defects and problems.

There is a sweet spot when it comes to rules and standardization -- both create conditions for better software. However, if you use the wrong metrics for what quality software looks like, you incentivize bad behavior across an enterprise.

"This fascination with counting and tabulating negative events -- as if they are predictive of a big event over the horizon -- is an illusion," Dekker said. "We should do something different if we want to understand how complex systems will collapse and fail."

Companies within physical verticals often tabulate the number of days without an accident. In enterprise software development, managers routinely count the number of days of error-free orders. Dekker believes both are bad ideas. "This is an invitation for a big blowup down the horizon," Dekker said.

Besides, a less-than-perfect safety record isn't necessarily what reveals a flawed system. "I cannot keep my house injury- and incident-free for a week," Dekker joked. Organizations should not overemphasize the importance of smaller negative incidents in comparison to flaws in the larger system itself.

Foster a culture of responsibility

In the medical industry, management looks at the underlying factors in reported incidents -- a strategy software teams also follow in post-mortems. Dekker worked with one hospital where 7% of patients walked out the door sicker than when they arrived. The hospital focused all of its resources on what went wrong: communication failures, misconstrued guidelines, procedural violations and human error. However, when the hospital investigated the patients who fared well, it found the same four types of mistakes.
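The hospital's finding can be read as a base-rate check: a factor helps explain bad outcomes only if it shows up more often there than in good outcomes. The Python sketch below illustrates that comparison. The records are invented for illustration and are not data from the hospital Dekker described, and the function name `factor_rates` and the `outcome` and `factors` fields are hypothetical.

```python
from collections import Counter

# The four factor types named in the hospital example
FACTORS = [
    "communication failure",
    "misconstrued guideline",
    "procedural violation",
    "human error",
]

def factor_rates(cases):
    """Return {outcome: {factor: share of cases in which the factor appears}}."""
    totals = Counter()
    hits = {}
    for case in cases:
        outcome = case["outcome"]
        totals[outcome] += 1
        counts = hits.setdefault(outcome, Counter())
        for factor in set(case["factors"]):
            counts[factor] += 1
    return {
        outcome: {f: hits[outcome][f] / totals[outcome] for f in FACTORS}
        for outcome in totals
    }

# Invented records, for illustration only
cases = [
    {"outcome": "bad", "factors": ["human error", "communication failure"]},
    {"outcome": "bad", "factors": ["procedural violation"]},
    {"outcome": "good", "factors": ["human error"]},
    {"outcome": "good", "factors": ["communication failure", "procedural violation"]},
    {"outcome": "good", "factors": ["misconstrued guideline"]},
    {"outcome": "good", "factors": []},
]

rates = factor_rates(cases)
for factor in FACTORS:
    print(f"{factor}: bad={rates['bad'][factor]:.2f} good={rates['good'][factor]:.2f}")
```

If a factor appears at a similar rate in both groups, counting it in failed cases alone says little about where the next failure will come from.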

Enterprise software developers tend to make similar missteps. You should focus on which errors still occur -- or are masked -- when the results are positive. This requires having:

the ability for anyone to say "stop";
a recognition that past success is not a guarantee of future success;
a culture that allows for a diversity of opinion and dissent; and
a culture that keeps discussion on risk alive.

Getting enterprise software developers -- not to mention everyone else in the organization -- to share their insights into disasters in the making, rather than those that have already happened, is not easy. The reward for not speaking up tends to be more immediate and certain than any incentive to call out problems.

Dekker noted how individuals on a team are often in circumstances where they don't have to own a problem -- even if they see it coming or are a part of the issue. "If you don't have to own the problem, then the reward for speaking up is not there," Dekker explained.

Learn from what goes right

Reduce the likelihood of big disasters in any complex enterprise system by creating conditions that foster honest feedback across the organization. "If we want to understand in complex systems how things are going to go wrong badly, we should not try and glean predictive capacities from little bugs and error counts," Dekker explained. "We need to understand how success is created."

Much more goes right than wrong, suggested Erik Hollnagel, professor at University of Southern Denmark and a leading safety expert on resilience engineering. Organizations tend to do post-mortems most often when things go wrong.

Dekker agreed that teams should do post-mortems, but on one condition. "For us to know how things really go wrong, we need to understand how they go right," he said. In other words, post-mortems should also determine what was behind any successful aspects.

Abraham Wald, whom many consider a founder of operations research, showed during World War II how to learn from the successful elements of a predicament. Wald was tasked with figuring out the best place to put armor on airplanes that were getting shot at over Germany. Armor is dead weight on an airplane, and pilots wanted the bare minimum required to keep these complex systems flying.

After they measured and counted the holes on returning planes, Wald's colleagues suggested they put more armor where there were holes. Wald's colleagues were akin to today's enterprise software developers who focus their attention on where software bugs occur. Instead, Wald realized they should put armor where there were no holes. The planes that made it back were the ones that had not been hit in the sections where the downed planes must have been hit.
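Wald's reasoning can be illustrated with a small simulation. The Python sketch below is a toy model, not a reconstruction of Wald's analysis: the section names, the per-hit lethality values in `LETHALITY` and the function `simulate` are assumptions made up for illustration. Hits land uniformly across sections, but a hit to the engine is far more likely to bring the plane down.

```python
import random

# Hypothetical chance that a single hit in a section downs the plane
LETHALITY = {
    "engine": 0.8,
    "cockpit": 0.1,
    "fuselage": 0.1,
    "wings": 0.1,
    "tail": 0.1,
}

def simulate(planes=5000, hits_per_plane=3, seed=0):
    """Tally hits per section on returned planes and on lost planes."""
    rng = random.Random(seed)
    sections = list(LETHALITY)
    returned = dict.fromkeys(sections, 0)
    lost = dict.fromkeys(sections, 0)
    for _ in range(planes):
        # Every section is equally likely to be hit
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        # The plane is lost if any one of its hits proves fatal
        downed = any(rng.random() < LETHALITY[s] for s in hits)
        tally = lost if downed else returned
        for s in hits:
            tally[s] += 1
    return returned, lost

returned, lost = simulate()
print("holes counted on returned planes:", returned)
print("hits on planes that never came back:", lost)
```

Under these assumptions, the returned planes show the fewest holes in the engine, even though every section was hit equally often. The missing holes mark the section where hits were fatal, which is the section that needs the armor.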

"When I want to understand where the next fatality is coming from, you might think I look at the incident errors and bugs. I will look at the place where there are no bugs or holes," Dekker said. "I want to understand how we create success, because that is where the failures will hide."
