9 techniques for fixing bugs in production

Some companies defend against bugs with a strong offense of rapid iterations and feature flags. Others find the best defense is thorough test coverage. Here's what works and why.

Despite best practices and developers' diligence, bugs are an inevitable part of software development. Accordingly, so are defects within live software. A bug -- particularly a defect in production -- is a problem because it poses the risk of losing customers.

IT organizations have many ways of fixing bugs in production. The variety of techniques reflects a range of tolerance for risk and how urgently the team wants to push out new features. Bug fixes can vary depending on the type of product and its mission criticality. The magnitude of a bug can also determine whether it gets an immediate fix or not.

Software teams can follow these nine ways of fixing bugs in production:

  1. Establish a standardized process.
  2. Plan to quickly fix defects.
  3. Practice time management.
  4. Implement benchmarks.
  5. Prioritize test code.
  6. Perform chaos engineering.
  7. Move fast and break things.
  8. Adopt a mission-critical mentality.
  9. Mature the product, then stabilize it.

Establish a standardized process

It's important to establish a standardized process to address bugs.

"Escaped defects are an unfortunate reality of software engineering," said Casey Gordon, director, Agile engineering at Liberty Mutual Insurance. Despite best efforts to stop defects before they end up in customers' hands, a production problem inevitably arises. Gordon's staff relies on proactive monitoring, observability tools and customer feedback for alerts on such defects.

At Liberty Mutual, software engineers cannot directly modify code in production. Instead, these developers use pipelines and trunk-based development to quickly deploy code changes to production. This release management approach can promote stability in the long run, as it enables developers to experiment with smaller releases.

"We've learned that being too cautious and applying legacy approaches to release management only yields bigger batch sizes, longer mean time to recovery and increased risk," Gordon said. Through DevOps practices, Liberty Mutual developers push smaller, more targeted releases into production. These engineers have also separated software components into loosely coupled microservices. Microservices enable them to fix issues with minor code changes and less process than it would take for a monolithic architecture.

Plan to quickly fix defects

Another approach to fixing bugs in production is to work backward: assume a bug has already been identified, then determine what would let the team address it swiftly.

"While we think that we should do everything in our power to avoid bugs in production, you should still plan for when, not if, a bug surfaces [there]," said Yoseph Radding, a software development engineer at Amazon who also provides DevOps consulting. Developers can ensure faster resolution by proactively working on bottlenecks that could slow fixes.

First, focus on making the code and its dependencies easy to run locally using the 12-factor app methodology, Radding recommends. The 12 factors relate to the following:

  • code base
  • dependencies
  • configuration
  • backing services
  • build, release, run
  • processes
  • port binding
  • concurrency
  • disposability
  • dev/prod parity
  • logs
  • admin processes
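
As one illustration of the configuration factor, here is a minimal Python sketch that reads settings from environment variables, so the same code can run on a laptop and in production. The variable names (DATABASE_URL, LOG_LEVEL, PORT), the Settings class and the default values are hypothetical and not taken from Radding's setup.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str
    port: int


def load_settings(env=None):
    """Build settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return Settings(
        # Defaults are illustrative values that make a local run work
        # without any setup; production would set the real values.
        database_url=env.get("DATABASE_URL", "sqlite:///local.db"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        port=int(env.get("PORT", "8000")),
    )
```

Because nothing environment-specific is hard-coded, a developer reproducing a production bug only needs to change environment variables, not the code.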

Then, make the deployment process automatic and easy, and keep deployments small.

Next, focus on monitoring and logging to help catch issues quickly. Technologies like feature flags, which are switches to turn parts of the code base on and off, can stop problem areas easily and quickly -- in as little as 200 milliseconds. "We actually keep some feature flags permanently around particularly nasty code and integrations, to give us a kill switch in case anything goes wrong at any time," Radding said.
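
A kill switch of this kind can be sketched in a few lines of Python. This is an in-memory illustration only. The FeatureFlags class, the flag name and the charge function are hypothetical, and a production system would more likely read flags from a shared flag service so that one change reaches every running instance.

```python
import threading


class FeatureFlags:
    """In-memory flag store, standing in for a shared flag service."""

    def __init__(self, defaults=None):
        self._flags = dict(defaults or {})
        self._lock = threading.Lock()

    def is_enabled(self, name):
        with self._lock:
            # Unknown flags are treated as off, so new code is dark by default.
            return self._flags.get(name, False)

    def set(self, name, enabled):
        with self._lock:
            self._flags[name] = enabled


flags = FeatureFlags({"new_payment_integration": True})


def charge(amount):
    # Hypothetical code paths: the flag guards the risky integration.
    if flags.is_enabled("new_payment_integration"):
        return f"new integration charged {amount}"
    return f"legacy path charged {amount}"
```

Calling flags.set("new_payment_integration", False) routes every later call to the legacy path without a deployment.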

He also recommends reverting changes rather than rolling them back. Radding uses the git revert command, which adds a new commit that undoes the changes made in a specific earlier commit. This technique ensures that other changes remain when a developer removes the problem code.
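
For teams that script their fixes, the revert step might look like the following Python sketch, which shells out to git. The revert_commit helper is hypothetical. It assumes git is installed, the working tree is clean and the revert applies without conflicts.

```python
import subprocess


def revert_commit(repo_dir, commit_sha):
    """Run `git revert --no-edit <sha>` in repo_dir and return the new HEAD."""
    # --no-edit accepts git's default revert message instead of opening an editor.
    subprocess.run(
        ["git", "revert", "--no-edit", commit_sha],
        cwd=repo_dir,
        check=True,
        capture_output=True,
    )
    head = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return head.stdout.strip()
```

The history keeps both the faulty commit and the commit that undoes it, so commits made after the faulty one stay in place.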

Practice time management

Bug fixes are faster and less disruptive to production when the team has a well-planned approach. There are a few paths to follow -- each with pros and cons -- said Phil Crippen, CEO of John Adams IT, an IT services consultancy.

Assign the team or an individual to fix bugs for a set time slot every day. This arrangement creates a buffer period that becomes a regular part of the daily routine. Teams are continuously aware of bugs and can address them daily. However, this buffer period can steal time from other commitments and not all bug fixes will fit in the allotted timeframe.

Create bug estimates from existing data. Often, software teams work on multiple programs, and they can gather information about past bugs and how long it took to fix them. Categorize these findings in terms of size and time to fix and use that data to generate an estimate of how long it should take to fix a future bug. This technique can speed up the process and show the team what types of bugs are likely to occur. However, you can't anticipate all cases and data collection takes time.
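
A rough sketch of this idea in Python follows. The history records and category names are invented for illustration, and the median is just one reasonable summary to use. A team would substitute data exported from its own tracker.

```python
from collections import defaultdict
from statistics import median

# Hypothetical history: (category, hours to fix).
PAST_BUGS = [
    ("ui", 2.0),
    ("ui", 4.0),
    ("ui", 3.0),
    ("database", 8.0),
    ("database", 12.0),
]


def build_estimates(history):
    """Return the median fix time in hours for each bug category."""
    hours_by_category = defaultdict(list)
    for category, hours in history:
        hours_by_category[category].append(hours)
    return {cat: median(hours) for cat, hours in hours_by_category.items()}


def estimate(category, estimates, fallback=None):
    """Look up an estimate; unseen categories return the fallback."""
    return estimates.get(category, fallback)
```

A category that has never occurred returns the fallback, which reflects the limitation noted above: past data cannot anticipate every case.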

Implement benchmarks

Software teams should use benchmarks to estimate how many bugs the team can fix in a month. For example, in the U.S., an average programmer can fix between nine and 10 bugs in a month, Crippen said. An experienced programmer can fix up to 20 bugs in that time. With these averages in mind, IT leaders can estimate how many bugs the team can tackle.

But such estimates may not be accurate for all bugs, or broadly applicable to programmers in various countries. This approach is useful when paired with other techniques on fixing bugs in production.
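
The arithmetic behind such an estimate is simple enough to sketch in Python. The function name is hypothetical, and the lower bound rests on an assumption that is not in Crippen's figures: an experienced programmer fixes at least as many bugs as an average one.

```python
# Figures cited above: roughly 9 to 10 fixes a month for an average
# programmer, up to 20 for an experienced one.
AVERAGE_RANGE = (9, 10)
EXPERIENCED_MAX = 20


def monthly_capacity(average_devs, experienced_devs):
    """Return a (low, high) range of bug fixes per month for the team."""
    # Assumption: experienced programmers fix at least the average low figure.
    low = (average_devs + experienced_devs) * AVERAGE_RANGE[0]
    high = average_devs * AVERAGE_RANGE[1] + experienced_devs * EXPERIENCED_MAX
    return low, high
```

For a team of four average and two experienced programmers, monthly_capacity(4, 2) gives a range of 54 to 80 fixes a month, which is wide enough to show why the benchmark works better as a rough guide than as a commitment.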

Another scheduling option is to use placeholder times for every bug fix and then dedicate a portion of the workday to resolving them. A placeholder for a fix often offers enough time to complete the work, and it can be helpful when working in the Scrum framework for Agile. But the downside is that it can be more time-consuming than other scheduling strategies to address bugs.

Prioritize test code

A development team can prioritize the code it uses for testing at the same level as production code. The result is that fewer bugs will slip through to live environments.

"Development might slow down in the beginning as you start writing tests for existing functionality, but the quality will go way up," said Shayne Sherman, CEO of TechLoris, an IT consulting service.

Maintain test code as carefully as project code, and always write unit tests for any change developers make. It's impossible to have a less rigorous system and get better results, according to Sherman. He argues the only alternative is to halt development to fix bugs -- sometimes called a fix-bugs-first approach -- which he's unfortunately seen happen.
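
As a small illustration of a change shipped together with its tests, here is a sketch using Python's built-in unittest module. The apply_discount function is a hypothetical stand-in for production code.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical production function: reduce price by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_zero_discount_leaves_price_unchanged(self):
        self.assertEqual(apply_discount(80.0, 0), 80.0)

    def test_rejects_out_of_range_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(80.0, 120)
```

Running python -m unittest against the file executes all three tests. The out-of-range case is the kind of edge condition that tends to reach production when tests cover only the expected path.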

Perform chaos engineering

Software testing verifies the code does what it's supposed to. However, such QA can miss bugs caused at the level of the systems where the code runs. Chaos engineering hits software -- usually live in production or in a realistic staging environment -- with unpredictable disruptions.

"Chaos engineering is a way of testing that the entire system is doing what you want it to, and code is just one part of the mix," said Manish Mistry, CTO at Infostretch, a digital engineering consultancy. To test effectively, the system needs to be running in production. After all, it's only in production that a team can work with factors like state, inputs and how external systems behave.

It's useful to budget for dark debt, Mistry said. Dark debt is the unforeseen anomalies that happen in complex systems of software and hardware. A development team can't predict every interaction in these systems. The term is a portmanteau of technical debt from IT terminology and dark matter from space.

One downside of chaos engineering, however, is how risky experiments are in a full production environment. A staging environment that is as close to production as possible is another option, Mistry said.
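
To make the idea concrete, the Python sketch below injects failures into a single function call. It is a deliberately narrow stand-in: real chaos experiments usually disrupt infrastructure such as hosts and networks rather than individual functions. All names here are hypothetical.

```python
import random


class InjectedFault(Exception):
    """Raised when the chaos wrapper simulates a failing dependency."""


def with_chaos(func, failure_rate, rng=None):
    """Wrap func so a fraction of calls fail with InjectedFault."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() is in [0, 1), so a rate of 1.0 always fails
        # and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFault(f"simulated failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    # Hypothetical call to an external system.
    return {"id": user_id, "name": "example"}


def fetch_profile_with_fallback(fetch, user_id):
    """The behavior under test: degrade gracefully when the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return {"id": user_id, "name": None, "degraded": True}
```

Wrapping fetch_profile with a small failure_rate in a staging environment would show whether the fallback path holds up before a real outage tests it.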

Move fast and break things

A company might take a relaxed attitude about releasing production code with bugs if business growth and popularity depend on pushing out new functionality quickly.

For example, startups get their products out the door to attract investors. Releasing a product with known imperfections may be the only way to keep the lights on and stay afloat long enough for their next release, said Dave Wade-Stein, senior instructor of software development at DevelopIntelligence, an enterprise tech learning program.

But companies should pair fast and seamless rollouts with fast and seamless rollbacks.

As it grew, Facebook embodied a "move fast and break things" mindset that helped the company dominate the social networking space. "Given that Facebook is offering a free product delivered via the web, a bug in the production code wasn't a calamity," Wade-Stein said.

However, this approach won't always work. Frequent bugs can irritate customers and push them elsewhere. Facebook dropped the motto in 2014.

Adopt a mission-critical mentality

Companies should avoid the previous section's approach when building mission-critical software like avionics, autonomous cars or medical equipment. "You're going to have to adhere to extremely rigorous processes if people's lives or expensive equipment are at stake," Wade-Stein said. A mission-critical mindset helps developers build software that enhances the product's brand reputation. However, the extra precautions are expensive.

A mission-critical mentality can entail a shift to a consensus-driven process in which anyone can stop a new release. In a command-and-control management structure -- exemplified by Waterfall development -- a manager can decide to ship the product, even if the team doesn't agree. Mission-critical development should empower employees to voice concerns.

"When software development processes include more voices in the decision-making, the resulting products tend to be higher quality and more robust," Wade-Stein said.

Mature the product, then stabilize it

In early stages of the product lifecycle, teams can lose valuable time to market by focusing on perfect releases. The attitude toward fixing bugs in production can shift, depending on where you are in the product's lifecycle, said Bastin Gerald, founder and CEO of a company that provides goal management software.

The more mature the product, the closer the development team wants to be to zero bugs in production, Gerald said. At that point, the benefit of a stable existing product outweighs the incremental payoff of new features that attract customers.

"After you reach the product market fit, you should focus on stabilizing your product," he said.
