Downtime, whether planned or unplanned, challenges both users and IT leaders. Preparing for these inevitable events to mitigate the damage they may cause is a critical, yet often neglected, responsibility.
Unplanned downtime is more difficult to prepare for because it's impossible to anticipate every specific incident that might cause an unexpected outage. Organizations also must take into account the unique challenge a particular outage might present, said Frank Trovato, a research director at Info-Tech Research Group.
"That's why technology giants such as Microsoft, Google and Amazon still have outages, despite extensive resilience built into their environments," he said. It is difficult for even the largest tech companies to prepare for an unplanned outage because the severity and resolution path are not always immediately obvious.
"If you have a 'smoking-hole scenario,' then it's obvious you need to fail over to your disaster recovery site," Trovato said. "If you're dealing with a configuration or design issue, the resolution might be more nuanced and thus take longer to resolve."
Unplanned outages are hard to prepare for, but planned downtime can be scheduled in advance to minimize user impact. Trovato stressed the importance of establishing regular maintenance windows during low-impact time periods -- the second weekend of every month, for example.
"As much as possible, wait for those maintenance windows to implement changes that require downtime," he said. "This enables the business to anticipate the outage and implement appropriate workarounds, if needed."
For systems that provide 24/7 services, Trovato recommends high-availability infrastructures that enable one instance to be down for maintenance while the other continues to operate.
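As an illustration of the recurring maintenance window mentioned above, the following minimal Python sketch computes the second weekend of a given month and checks whether a date falls inside it. The function names are hypothetical, and the sketch assumes that "second weekend" means the weekend beginning on the month's second Saturday; an organization might define its window differently.

```python
import calendar
from datetime import date, timedelta


def second_weekend(year: int, month: int) -> tuple[date, date]:
    """Return (Saturday, Sunday) of the second weekend in the month.

    Assumption: the second weekend starts on the second Saturday.
    """
    first = date(year, month, 1)
    # weekday() counts Monday as 0, so Saturday is 5.
    days_to_first_saturday = (calendar.SATURDAY - first.weekday()) % 7
    second_saturday = first + timedelta(days=days_to_first_saturday + 7)
    return second_saturday, second_saturday + timedelta(days=1)


def in_maintenance_window(day: date) -> bool:
    """True if the given day falls on that month's second weekend."""
    return day in second_weekend(day.year, day.month)
```

A change-scheduling script could call `in_maintenance_window(date.today())` before applying a change that requires downtime and defer the change when the result is `False`.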
It is nearly impossible to anticipate every possible downtime incident, which makes it difficult to mitigate individual risks. Instead, strive to make critical systems as resilient as possible overall.
"For example, when network connectivity to branch offices is absolutely critical, organizations will typically implement redundant network paths, and even redundant providers, to build-in resilience, regardless of what might cause one path to fail," Trovato said.
Watch out for common missteps
The biggest planned downtime error is failing to consult with affected business units in advance.
"Even for a Monday-to-Friday business, certain weekends might still be critical," Trovato said. Examples of this scenario include a batch processing job in progress or staff must work overtime to meet a deadline.
Organizations should not rely solely on a disaster recovery plan to restore service when it's apparent that the downtime incident was actually caused by poor planning, not a natural or human-created catastrophe.
"Even in obvious disaster scenarios, IT staff can trip over each other if they don't have a solid plan that ensures recovery steps are executed in the right order, taking all potential interdependencies into account," Trovato said.
Budgets aren't unlimited, so it's important to understand downtime's potential business impact to plan resilience investments.
"Similarly, not all systems are equally critical," Trovato said. "Group systems into high, medium, and low criticality based on business impact so you can prioritize the time and money you invest in resilience accordingly."