Broken or failing builds in a CI/CD pipeline can erode a team's faith in its own processes. They can also hinder the team's ability to deliver high-quality software efficiently. That's why it's important to identify and fix broken builds.
These types of CI/CD challenges aren't unique to any specific tool. Broken builds can be red flags for larger issues and also signify impediments to current -- and future -- workflows. Luckily, there are best practices to remediate broken builds and avoid failing ones.
Check the credentials
Let's set the stage.
You're working on a new application feature when you get a notification that your most recent commit failed the build. The last things you pushed up were some new tests that worked fine when you ran them locally. What gives?
When CI/CD pipeline builds fail, there are a few common reasons. Often, builds that were working for a while and then suddenly fail have credential problems. For example, the credentials may no longer be valid, or the permissions for those credentials may have changed. If your IT organization has a dedicated DevOps team, they may be best suited to resolve credential and permissions issues. DevOps teams typically have admin access to most services and may be most familiar with the proper permissions needed to perform certain actions.
Troubleshooting build credentials
I'm a DevOps engineer and was recently asked to help diagnose a build failure that stemmed from an authentication issue with the GitHub Packages registry.
This build had worked for many months but suddenly started failing every time. The build receives a GitHub personal access token through an environment variable, so my first step was to check whether that value was still present. Then, I checked the credential responsible for the authentication error. After I logged in to the respective GitHub account, I found the culprit: The personal access token had expired that day.
This example illustrates Occam's razor, the principle that the simplest explanation for a problem or event is usually the most likely one. In this scenario, the simplest explanation was that the credential needed to authenticate wasn't present. That assumption proved false, but the next simplest explanation -- an invalid credential -- proved true.
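The two checks from this anecdote, cheapest first, could be sketched as a small Python helper. This is an illustration under stated assumptions, not the author's actual procedure: the function name, the `PACKAGES_TOKEN` variable name and the idea of recording the token's expiry date when it is created are all hypothetical.

```python
from datetime import date
from typing import Mapping, Optional


def diagnose_credential(
    env: Mapping[str, str],
    var_name: str,
    expires_on: Optional[date],
    today: date,
) -> str:
    """Run the cheapest checks first and return a short diagnosis."""
    token = env.get(var_name, "")
    # Check 1 (simplest explanation): is the credential present at all?
    if not token.strip():
        return f"{var_name} is missing or empty"
    # Check 2 (next simplest): has it reached its recorded expiry date?
    # expires_on is an assumption: a date noted when the token was created.
    if expires_on is not None and today >= expires_on:
        return f"{var_name} expired on {expires_on.isoformat()}"
    return f"{var_name} is present and not known to be expired"
```

In a pipeline step, this might be called with `os.environ` as `env` and `date.today()` as `today`. It only rules out the two simplest explanations; a changed permission on a still-valid token would need a separate check.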
If checking the credentials doesn't reveal the problem, try to reproduce the issue locally and then make changes to fix it. Testing and debugging locally, instead of rerunning the full build after each change, can save valuable time. Some tools, like CircleCI, offer command-line interfaces that use Docker to run CI/CD scripts locally.
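Where no such CLI is available, one rough way to reproduce a failing step locally is to run just that command in a stripped-down environment, so that variables set only on your machine don't mask the failure. The sketch below assumes this approach; `run_step` and the `CI` variable are hypothetical names, and it doesn't replicate a real CI container.

```python
import os
import subprocess
import sys
from typing import Dict, List, Tuple


def run_step(command: List[str], ci_env: Dict[str, str]) -> Tuple[int, str]:
    """Run one pipeline step with a minimal environment and capture its output."""
    # Keep only PATH from the local shell, then add the variables CI would set.
    env = {"PATH": os.environ.get("PATH", "")}
    env.update(ci_env)
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=False
    )
    return result.returncode, result.stdout + result.stderr


# Hypothetical usage, assuming the step runs pytest on one test file:
# run_step([sys.executable, "-m", "pytest", "tests/test_new_feature.py"], {"CI": "true"})
```

A nonzero return code with the same error as the pipeline suggests the failure depends on the environment rather than on the code itself.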
Fix flaky or brittle tests
Automated tests can cause intermittent broken builds. To check for this, review the test reports for more information on why a test is failing. Once you identify a likely cause, try to reproduce the failure by rerunning the test or by manually reproducing it in the application under test.
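Rerunning a suspect test several times is a quick way to tell an intermittent failure from a consistent one. The following is a minimal sketch of that idea; `classify_test` and its `run_once` callback are hypothetical, and `run_once` stands in for whatever actually executes the test and reports pass or fail.

```python
from typing import Callable


def classify_test(run_once: Callable[[], bool], attempts: int = 5) -> str:
    """Rerun a test and report whether its outcome is consistent."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # Each call to run_once() should execute the test once; True means it passed.
    results = [run_once() for _ in range(attempts)]
    if all(results):
        return "passes consistently"
    if not any(results):
        return "fails consistently"
    return f"flaky: failed {results.count(False)} of {attempts} runs"
```

A test that fails on every attempt points to a real defect or a broken test, while a mixed result suggests flakiness, such as timing or test-order dependence. A handful of clean runs doesn't prove a test is stable, so treat "passes consistently" as weak evidence.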
QA professionals are the best equipped to help with failing builds due to tests. Flaky or brittle tests, in particular, can be a CI/CD pipeline challenge. These faulty tests can negatively affect the development team's confidence and potentially make the team expect, or even ignore, future failures. Legitimate failures can go undetected, and bugs may slip through more easily.
To mitigate such issues, involve developers in test design and development efforts. Developers can provide insight into the different test conditions that may ultimately cause tests to fail. In this way, developers can help the QA team build more comprehensive tests.
Alerts can reduce the time a build stays broken. Make sure your system automatically alerts the right people when a build fails to keep downtime to a minimum.
Many chat tools -- like Slack or Microsoft Teams -- integrate with CI/CD systems to facilitate the creation of these alerts. Failed build alerts should be treated as all-hands-on-deck situations because a failing build severely limits the team's ability to deploy new software. Such circumstances not only delay new features, but they also block urgent bug fixes from being deployed to production.
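Most CI/CD systems have built-in chat integrations, but a failed-build alert can also be sent from a pipeline step. The sketch below assumes a Slack-style incoming webhook that accepts a JSON body with a `text` field; the function names, message wording and example URL are hypothetical, and the webhook URL itself would come from your chat tool's configuration.

```python
import json
import urllib.request


def build_alert(pipeline: str, branch: str, build_url: str) -> dict:
    """Build the JSON payload for a failed-build chat message."""
    return {"text": f"Build failed: {pipeline} on {branch}. Details: {build_url}"}


def send_alert(webhook_url: str, payload: dict) -> int:
    """POST the payload to an incoming-webhook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises an exception on network errors and non-2xx responses.
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A pipeline would call `send_alert` only from a step that runs on failure, passing a webhook URL stored as a secret rather than hard-coded in the script.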