Imagine a fictitious business-critical network is running smoothly. No critical trouble tickets are open, and all services are operational. A change control board has successfully vetted the day's changes in a meeting.
The network team implements a small routing change, and everything looks good. But, soon after, several high-priority trouble tickets are opened. Is this a coincidence or cause and effect?
The team reverts the change, which clears the problem, providing evidence that the routing change caused the outage. Further analysis shows that the routing change resulted in the accidental isolation of a critical part of the network from the internet.
Similar problems occur daily across networks of all sizes. Change control boards are supposed to detect and prevent incorrect changes, but issues still happen. How can network teams improve the quality of their network changes?
The case for automating pre-change and post-change checks
One option is to use pre-change and post-change network validation to assess whether the network is functioning as desired before and after the change.
The objective is for the network team to prevent an outage by performing a few simple pre-change routing checks. If the pre-change verification doesn't catch the problem, the post-change checks can detect the incorrect routing state and pinpoint the cause immediately, so the team can revert to the prior configuration. This simple process of verifying network state could shorten network outages or avoid them altogether.
While teams can use manual processes to perform pre-change and post-change checks, it makes more sense to automate them. Whether the process is manual or automated, teams must define the expected network state before and after the change. In practice, the post-change state frequently becomes the basis for the pre-change check in the next change cycle.
Automation lets the change process proceed quickly. It also helps teams avoid human errors, like transposing digits or operating on the wrong interface, which are common when working against a change window deadline.
The pre-change process should verify that the desired interface is selected by checking its operational state and assigned address. If it's up and operational, is the right neighbor connected? These steps help teams avoid silly errors and the resulting outages.
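The interface checks described above amount to comparing an expected state with an observed one. The following is a minimal sketch under stated assumptions, not a definitive implementation: the `InterfaceState` fields, the device names and the addresses are hypothetical, and in practice the observed values would come from whatever collection tool the team uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InterfaceState:
    """Hypothetical record of the facts a pre-change check cares about."""
    oper_status: str
    ip_address: str
    neighbor: str


def pre_change_failures(expected: InterfaceState,
                        observed: InterfaceState) -> List[str]:
    """Return one message per field that differs from the expected state."""
    failures = []
    for field in ("oper_status", "ip_address", "neighbor"):
        want = getattr(expected, field)
        got = getattr(observed, field)
        if want != got:
            failures.append(f"{field}: expected {want!r}, observed {got!r}")
    return failures


# Illustrative values only: the interface is up with the right address,
# but the wrong neighbor is connected.
expected = InterfaceState("up", "192.0.2.1/30", "core-rtr-2")
observed = InterfaceState("up", "192.0.2.1/30", "edge-rtr-7")

for message in pre_change_failures(expected, observed):
    print(message)
```

An empty list means the starting state matches what the change plan assumes, and the same output could serve as the evidence presented to the change control board.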
Network teams can use pre-change checks as a validation step for the change control board function. They would present output of the pre-change validation as evidence documenting the desired starting state to the change control board. The change control board would also require that teams present the set of post-change checks that will be performed to verify that the network achieves the desired state after the change.
When a post-change check fails, either the verification data is incorrect or the network is not in the desired state. In both cases, automation can save the collected data and quickly revert the change, restoring the network to its pre-change state. Teams can then analyze the collected data against the desired state, make any needed corrections and reapply the change.
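The apply, validate and revert sequence described above might be sketched as follows. This is an illustration under assumptions rather than a real deployment tool: `run_change` and its callback parameters are hypothetical names, and a plain dictionary stands in for the device so the example runs anywhere.

```python
from typing import Callable, Dict

State = Dict[str, str]


def run_change(apply_change: Callable[[], None],
               revert_change: Callable[[], None],
               collect_state: Callable[[], State],
               desired: State) -> dict:
    """Apply a change, compare the result with the desired state,
    and revert if any desired value is not observed."""
    apply_change()
    observed = collect_state()  # kept for later analysis, even after a revert
    mismatches = {
        key: {"desired": want, "observed": observed.get(key)}
        for key, want in desired.items()
        if observed.get(key) != want
    }
    if mismatches:
        revert_change()
    return {"reverted": bool(mismatches),
            "observed": observed,
            "mismatches": mismatches}


# Stand-in for a real device: a dict holding one piece of routing state.
network = {"default_route": "203.0.113.1"}
desired = {"default_route": "203.0.113.1"}


def bad_change() -> None:
    network["default_route"] = "missing"  # simulates the accidental isolation


def revert() -> None:
    network["default_route"] = "203.0.113.1"


result = run_change(bad_change, revert, lambda: dict(network), desired)
print(result["reverted"], result["mismatches"])
```

Returning the observed snapshot alongside the mismatches keeps the evidence available after the revert, so the team can work out whether the change or the verification data was at fault.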
As teams adopt this process, they'll likely find that many network operational state checks are useful to perform for any change, even if they think the checks don't apply. For example, is it necessary to check Network Time Protocol when making a routing change? Well, if device clocks aren't synchronized, the logging data will be harder to correlate between network devices. Automation makes it painless to perform multiple checks teams wouldn't do in a manual process.
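A clock synchronization check of the kind mentioned above can be very small. The sketch below assumes the per-device clock offsets have already been collected in milliseconds; the device names, the offsets and the 100 ms threshold are illustrative assumptions, not recommended values.

```python
from typing import Dict, List


def unsynchronized_devices(offsets_ms: Dict[str, float],
                           max_offset_ms: float = 100.0) -> List[str]:
    """Return the devices whose clock offset exceeds the threshold,
    in either direction, sorted by name."""
    return sorted(
        device
        for device, offset in offsets_ms.items()
        if abs(offset) > max_offset_ms
    )


# Hypothetical collected data: one device has drifted by several seconds.
offsets = {"core-rtr-1": 3.2, "core-rtr-2": -4.1, "edge-rtr-7": 2750.0}
print(unsynchronized_devices(offsets))
```

Because the check is cheap, it can run with every change, whether or not the change touches time synchronization.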
Periodic state validation
The post-change state can be a useful tool to validate the network's operation on a periodic basis to ensure the network is performing as intended. Let's say a redundant interface fails and the network management system doesn't flag it. The periodic state validation would highlight it, enabling teams to take proactive action.
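Periodic validation of this kind is essentially a comparison between a saved baseline and a fresh snapshot. The following sketch assumes both are flat dictionaries keyed by a hypothetical "device:interface" label; real collected state would be richer, but the comparison works the same way.

```python
from typing import Dict, Optional, Tuple

Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


def state_drift(baseline: Dict[str, str], current: Dict[str, str]) -> Drift:
    """Return every key whose value differs between baseline and current.

    A key missing from one side is reported with None on that side.
    """
    drift: Drift = {}
    for key in sorted(set(baseline) | set(current)):
        before, now = baseline.get(key), current.get(key)
        if before != now:
            drift[key] = (before, now)
    return drift


# Illustrative data: the redundant uplink has failed since the baseline.
baseline = {"core-rtr-1:uplink-a": "up", "core-rtr-1:uplink-b": "up"}
current = {"core-rtr-1:uplink-a": "up", "core-rtr-1:uplink-b": "down"}

print(state_drift(baseline, current))
```

Traffic still flows over the surviving uplink in this scenario, which is why a failure like this can go unnoticed until a comparison against the baseline brings it to the surface.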
When to schedule a validation run
How often to schedule a validation run depends on the network and the business functions it supports. At a minimum, teams should perform a check before the business day starts.
Checks should also be performed before any change windows, regardless of the planned changes. Network state validation is a read-only operation, so teams shouldn't hesitate to run it regularly.
Getting started with network validation
It's not significantly more work to store the current and desired operational state in a format that enables automation to perform the checks. The real work is in the automation platform's data collection and analysis. Fortunately, libraries like pyATS are available for DIY automation, and commercial products can help streamline an implementation. Consulting companies can help teams build systems should they not find a commercial product that meets their needs.
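As an illustration of how little is needed to store state in a form automation can use, the sketch below saves a snapshot as JSON and loads it back. The file name, the keys and the values are hypothetical; a team would substitute whatever state its collection tooling produces.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def save_state(path: Path, state: Dict[str, str]) -> None:
    """Write a state snapshot as sorted, indented JSON so that
    successive snapshots are easy to compare and review."""
    path.write_text(json.dumps(state, indent=2, sort_keys=True))


def load_state(path: Path) -> Dict[str, str]:
    """Read a previously saved snapshot."""
    return json.loads(path.read_text())


# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    baseline_file = Path(tmp) / "baseline.json"
    post_change = {"core-rtr-1:uplink-a": "up", "ntp": "synchronized"}
    save_state(baseline_file, post_change)
    restored = load_state(baseline_file)

print(restored == post_change)
```

A snapshot saved after one change can then be loaded as the expected starting state for the next one.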
In summary, there's no good reason not to use automation for network state validation in daily operations, as well as in change control processes.