Network change management is a process intended to reduce the risk of a failed change. It entails several steps that improve the odds of a successful change, but how does each step work?
Aircraft pilots use well-defined processes to ensure safe flying. Similarly, networking teams can use defined processes to reduce the risk of failed network changes that create unplanned outages. Still, organizations sometimes find that changes don't go as planned, resulting in a network outage. Some failures are due to a process failure, while others are due to nonobvious results of complex configurations.
The network change management process relies on the application of several basic operating principles, such as the following:
- scope determination and risk analysis
- peer review
- pre-deployment testing and validation
- implementation and testing
- documentation updates
Network teams create the change details -- new configurations, device connection information and documentation -- before the change management process begins. A valuable guide for network change management is Cisco's "Change Management: Best Practices" white paper.
1. Scope and risk analysis
The first step in the network change management process should be to evaluate the scope of a proposed change. Determine which services might be affected and who uses those services. The term blast radius is frequently used to describe the scope of effect a change can make, including the possible negative outcomes.
Teams will want to measure the scope in terms of the following two factors:
- the number of endpoints affected by a change; and
- the importance of the services a change might affect.
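One way to estimate the first factor is to walk a dependency map outward from the device being changed. The sketch below is only an illustration: the `DOWNSTREAM` topology and device names are made up, and it assumes a simple tree with no redundant paths, so it overstates the effect in networks where traffic can reroute.

```python
from collections import deque

# Hypothetical dependency map: each device lists what sits directly behind it.
DOWNSTREAM = {
    "core-sw1": ["dist-sw1", "dist-sw2"],
    "dist-sw1": ["access-sw1"],
    "dist-sw2": ["access-sw2"],
    "access-sw1": ["pc-1", "pc-2"],
    "access-sw2": ["pc-3"],
}

def blast_radius(changed: str, downstream: dict) -> set:
    """Return every node reachable behind the changed device (breadth-first)."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(changed)  # the changed device itself is not counted
    return seen

print(sorted(blast_radius("dist-sw1", DOWNSTREAM)))
# ['access-sw1', 'pc-1', 'pc-2']
```

The size of the returned set gives a rough endpoint count; the importance of the affected services still has to be judged separately.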
Once teams identify the scope, they should perform a risk assessment of the change. Is it something that's been done numerous times in the past and is well understood? Is it fully automated, or is there a chance that human error will alter the change in an unexpected way? Is the involved technology well understood, or is there a chance that something unexpected will happen?
The scope of a change figures into the risk. A change to infrastructure on which key business processes run has a greater level of risk to the business than a change to a small branch site.
Network teams can use a risk factor calculator that assigns values to key parameters, with higher values indicating higher risk. To create a risk calculator, average the values from the following example parameters, or search for a calculator on the web:
- Will the effect be visible to customers? (No = 1, Yes = 10)
- How many customers could be affected? (Range of 1 to 10)
- How important are the services within the scope? (Range of 1 to 10)
- Has this change been successfully implemented in the past? (Yes = 1, No = 10)
- Is the change automated? (Range of 1 to 10, depending on the extent of automation)
- Can the change be thoroughly tested prior to implementation? (Yes = 1, No = 10)
- Is the vendor documentation clear and unambiguous? (Range of 1 to 10)
- Is the peer review thorough, and did it surface any potential issues? (Range of 1 to 10)
The greater the risk, the more careful teams need to be during the remainder of the change management process.
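A minimal sketch of such a calculator, assuming every parameter is scored from 1 to 10 so that a higher value means more risk. The parameter key names and the sample scores are illustrative assumptions, not a standard.

```python
def risk_score(scores: dict) -> float:
    """Average the per-parameter scores (each 1-10) into one risk value."""
    if not scores:
        raise ValueError("at least one parameter score is required")
    for name, value in scores.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return sum(scores.values()) / len(scores)

# Hypothetical scores for a routine access control list change.
acl_change = {
    "visible_to_customers": 1,
    "customers_affected": 2,
    "service_importance": 4,
    "not_done_before": 1,
    "lack_of_automation": 3,
    "cannot_be_tested": 1,
    "unclear_vendor_docs": 2,
    "peer_review_concerns": 2,
}

print(risk_score(acl_change))  # 2.0
```

A plain average treats every parameter as equally important; a team that considers customer visibility more critical than documentation quality might prefer a weighted average instead.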
2. Peer review
The next step is to conduct a peer review. While teams can perform this step before the risk analysis, it is better to use the level of risk to drive the thoroughness of a peer review. Ideally, all peer reviews would be comparably thorough, but in practice routine changes -- such as access control list or virtual LAN modifications -- will likely receive cursory reviews. Automated testing and deployment of routine changes can help mitigate the risk of cursory peer reviews.
Internal staff who are familiar with the network conduct most peer reviews. If a change is out of the ordinary, however, it makes sense to have an expert from the equipment vendor conduct the review. The reviews should feed back to the risk analysis phase, potentially updating the technical risk measurements, such as whether testing and documentation are sufficient.
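The feedback from review to risk analysis could be as simple as overwriting the affected parameter scores and recomputing the average. This is a sketch under assumptions: the parameter names and values are hypothetical, and scores use a 1-to-10 scale where higher means riskier.

```python
def apply_review(scores: dict, reviewed: dict) -> dict:
    """Return a copy of the risk scores with the reviewer's values applied."""
    unknown = set(reviewed) - set(scores)
    if unknown:
        raise KeyError(f"review refers to unknown parameters: {sorted(unknown)}")
    return {**scores, **reviewed}

def average(scores: dict) -> float:
    return sum(scores.values()) / len(scores)

# Hypothetical scores before the review.
before = {"cannot_be_tested": 1, "unclear_vendor_docs": 2, "service_importance": 6}

# The reviewer finds that the test plan does not cover the failover path.
after = apply_review(before, {"cannot_be_tested": 7})

print(average(before), average(after))  # 3.0 5.0
```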
3. Pre-deployment testing and validation
Ideally, all changes would go through a pre-deployment testing and validation phase. Automation of low-risk, repetitive tasks and changes can remove the temptation to skip testing for changes that teams perceive as low risk. Of course, the greater the scope and risk, the more important it is to properly test and validate the proposed change.
The prevalence of virtual router and switch OS instances is making it easier to automate the creation of test network topologies without expensive hardware investments. Teams need to build automation to create the virtual network topology and to tear it down when the tests have successfully completed.
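The create-and-tear-down pattern maps naturally onto a context manager. In the sketch below, `VirtualLab` is a hypothetical stand-in for whatever lab or emulation API a team actually uses, and keeping the topology alive after a failed test (for debugging) is a design assumption, not something the process requires.

```python
from contextlib import contextmanager

class VirtualLab:
    """Hypothetical stand-in for a lab API that starts and stops virtual nodes."""
    def __init__(self):
        self.running = set()

    def start(self, node: str) -> None:
        self.running.add(node)

    def stop(self, node: str) -> None:
        self.running.discard(node)

@contextmanager
def lab_topology(lab, nodes, keep_on_failure=True):
    """Start the nodes, hand the lab to the tests, then tear it down."""
    started = []

    def teardown():
        for node in reversed(started):
            lab.stop(node)

    try:
        for node in nodes:
            lab.start(node)
            started.append(node)
        yield lab
    except Exception:
        # Tests (or startup) failed: optionally leave the lab up for inspection.
        if not keep_on_failure:
            teardown()
        raise
    else:
        # Tests completed successfully: release the resources.
        teardown()

lab = VirtualLab()
with lab_topology(lab, ["r1", "r2", "sw1"]) as running_lab:
    print(sorted(running_lab.running))  # ['r1', 'r2', 'sw1']
print(sorted(lab.running))  # []
```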
Pre-deployment testing includes several steps teams should follow to evaluate a proposed change:
- Verify that the test network currently works as intended prior to the change.
- Implement the change in a test infrastructure to confirm that the change results in the desired final state. Teams should use automated processes to avoid human error and to reduce the time to validate the change. If the validation in the test environment fails, determine the reason. Did it fail because the change was incorrect, or was it because the test network doesn't accurately represent the real network?
- Test the backout change process so that it's easy to revert to the previous state if something goes wrong. The backout change should return the network to the starting state, which teams can validate by repeating the first step.
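The three steps above can be sketched as a single rehearsal function. The four callables are placeholders for whatever automation a team has; the function name and the result strings are assumptions made for this illustration.

```python
from typing import Callable

def rehearse_change(
    check_baseline: Callable[[], bool],
    apply_change: Callable[[], None],
    check_desired: Callable[[], bool],
    back_out: Callable[[], None],
) -> str:
    """Run the pre-deployment sequence in a test network and report the outcome."""
    # Step 1: the test network must work as intended before the change.
    if not check_baseline():
        return "baseline-failed"
    # Step 2: apply the change and confirm the desired final state.
    apply_change()
    if not check_desired():
        back_out()
        return "change-failed"
    # Step 3: back out and confirm the network is back at the starting state.
    back_out()
    if not check_baseline():
        return "backout-failed"
    return "passed"

# Toy example: a dictionary stands in for the test network's state.
state = {"vlan": 10}
result = rehearse_change(
    check_baseline=lambda: state["vlan"] == 10,
    apply_change=lambda: state.update(vlan=20),
    check_desired=lambda: state["vlan"] == 20,
    back_out=lambda: state.update(vlan=10),
)
print(result)  # passed
```

A "change-failed" result still leaves the question from the second step open: the change may be wrong, or the test network may not represent the real one.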
4. Implementation and testing
The deployment and post-deployment testing and validation step should follow the same process as the first two steps of pre-deployment testing: verify that the network works as intended, and then implement the change and confirm the desired final state. If teams have done a good job of pre-deployment testing and validation, nothing unexpected should happen. Should the post-change testing detect an unexpected problem, teams should back out the change and verify the restoration of service.
Some network protocols require more time to converge after changes to large networks, requiring the post-change verification process to incorporate delays or convergence tests that pre-deployment testing in a small test environment doesn't need.
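A convergence test can be a simple poll with a timeout that runs before the post-change checks. The sketch below assumes a team-supplied `is_converged` probe (for example, one that compares routing table contents against expectations); the default timeout and interval are arbitrary placeholders.

```python
import time
from typing import Callable

def wait_for_convergence(
    is_converged: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll the probe until it succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_converged():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Toy probe that reports convergence on the third poll.
polls = {"count": 0}

def probe() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3

print(wait_for_convergence(probe, timeout_s=1.0, interval_s=0.0))  # True
```

If the function returns `False`, the post-change verification should be treated as failed rather than skipped.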
More advanced organizations are automating network configuration changes with the goal of migrating to a DevOps culture based on infrastructure as code. The objective is to adopt a continuous integration and continuous deployment (CI/CD) process for testing and deploying low-risk changes.
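As a small illustration of the infrastructure-as-code idea, the intended state can live in version-controlled data from which configuration is generated, so that a pipeline can review, test and deploy it like any other code change. The VLAN data and the rendered stanza format below are illustrative assumptions; real devices and tools vary.

```python
# Intended state, kept in version control (hypothetical example data).
VLANS = [
    {"id": 20, "name": "voice"},
    {"id": 10, "name": "users"},
]

def render_vlans(vlans: list) -> str:
    """Render VLAN definitions into configuration text, sorted by VLAN ID."""
    lines = []
    for vlan in sorted(vlans, key=lambda v: v["id"]):
        if not 1 <= vlan["id"] <= 4094:
            raise ValueError(f"invalid VLAN ID: {vlan['id']}")
        lines.append(f"vlan {vlan['id']}")
        lines.append(f" name {vlan['name']}")
    return "\n".join(lines)

print(render_vlans(VLANS))
```

Because the output is deterministic, a pipeline can diff the rendered text against the previous version and run the validation steps on exactly what will be deployed.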
5. Documentation and network management updates
Ideally, teams create and update documents during the change creation process, enabling them to review the documentation and network management changes along with the details of the change. Once teams have implemented and verified the change, they can incorporate the documentation changes into the network documentation system.
Don't forget to update the network management system as needed. Most network management systems have APIs that enable automated processes to make the changes.
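What such an automated update looks like depends entirely on the product. The sketch below only builds an HTTP request with Python's standard library and does not send it; the base URL, the resource path, the field names and the ticket number are all hypothetical and would need to be replaced with whatever the actual network management system documents.

```python
import json
import urllib.request

# Hypothetical endpoint; consult the NMS vendor's API documentation.
NMS_BASE_URL = "https://nms.example.com/api/v1"

def build_device_update(device_id: str, fields: dict) -> urllib.request.Request:
    """Build (but do not send) a request that updates one device record."""
    body = json.dumps(fields).encode("utf-8")
    return urllib.request.Request(
        url=f"{NMS_BASE_URL}/devices/{device_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )

request = build_device_update(
    "dist-sw1", {"uplink_vlan": 20, "change_ticket": "CHG-0001"}
)
print(request.get_method(), request.full_url)
# Sending it would look like: urllib.request.urlopen(request, timeout=10)
```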
If the change validation step is automated, it can be incorporated into periodic network validation checks. These periodic checks can detect failures in highly redundant and resilient networks, where redundancy can otherwise mask a failed component until a second failure causes an outage. Over time, teams build a library of network validation checks that cover many parts of the network.
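One way to grow such a library is a small registry that each change adds its checks to, with a runner that a scheduler can call periodically. Everything here is a sketch: the registry, the check names and the `LINK_STATE` sample data are assumptions, and real checks would query devices or a monitoring system.

```python
from typing import Callable, Dict, List

CHECKS: Dict[str, Callable[[], bool]] = {}

def validation_check(name: str):
    """Decorator that adds a check function to the library under a name."""
    def register(func: Callable[[], bool]) -> Callable[[], bool]:
        CHECKS[name] = func
        return func
    return register

def run_checks() -> List[str]:
    """Run every registered check and return the names of those that failed."""
    failed = []
    for name, check in CHECKS.items():
        try:
            passed = check()
        except Exception:
            passed = False  # a check that crashes counts as a failure
        if not passed:
            failed.append(name)
    return failed

# Hypothetical sample data: one of two redundant uplinks is down.
LINK_STATE = {"core1-uplink-a": "up", "core1-uplink-b": "down"}

@validation_check("core1 has at least one uplink")
def any_uplink_up() -> bool:
    return any(state == "up" for state in LINK_STATE.values())

@validation_check("core1 has both uplinks up")
def both_uplinks_up() -> bool:
    return all(state == "up" for state in LINK_STATE.values())

print(run_checks())  # ['core1 has both uplinks up']
```

Service is still up in this example, so users would notice nothing; only the redundancy check reveals that the network is one failure away from an outage.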
The principles of good network change management provide direction for reducing unplanned network outages due to failed changes. Teams should create a process that works for their organization and work toward making that process highly efficient.