A resilient distributed network environment is essential for delivering standard and high availability along with business continuance and disaster recovery. Business continuation allows users to access storage, services and local or remote servers (physical or virtual) -- regardless of whether they are located at a central data center, co-location facility or with a managed service provider (MSP) or cloud service provider. Business continuation is especially important for remote offices and branch offices (ROBO) that need to access other sites, headquarters and public cloud services where information resources exist.
Hardware and software wide area networking (WAN) technology is the glue that ties ROBOs to headquarters and public cloud services. We tend to take WANs for granted when they work properly -- so much so that they may not get the care and investment needed to make them resilient. After all, if a WAN is working, why spend more or change anything? However, problems arise when things are not running normally or acceptably.
There are many ways to maintain normal operations for ROBOs (and other environments) through WAN resiliency, high availability (HA), business continuance (BC) and disaster recovery (DR). These include technology tools, techniques, best practices, testing and training. In this first part of the series, I will discuss which WAN and diagnostic tools are necessary for business continuance. In part 2, I will go over WAN resiliency best practices and training.
Troubleshooting and diagnostic tools for business continuation
Do you have diagnostic tools that provide insight into individual network components, as well as up and down the entire solution stack? What happens when the LAN appears slow because there are intermittent WAN issues slowing or preventing access to other locations including cloud services? Worse yet, what happens when the WAN connection fails due to an accident, faulty technology (hardware or software) or poor network configuration?
WANs are like any other technology: it is not a question of whether they will fail, but of when, why, where and how. The impact of a WAN link going down is one no organization wants to experience. But just like other technology, WANs and their components can be made more resilient or disaster tolerant by incorporating self-healing hardware, software, network services and configuration choices.
Your troubleshooting and diagnostic tools should go beyond checking whether an individual component or service is functioning; they need to monitor the whole stack. Using diagnostic tools also means having baseline metrics, reporting or other insight into what normal behavior looks like. For example, a network manager needs to determine what is normal vs. abnormal for the network in terms of errors, incidents, retransmissions, lost packets, frames per second, response times, timeouts, latency, bandwidth capacity and I/O per second (IOPS). In addition to quickly finding where the real problem is, you must also know how fast you can fail over to, or bring up, backup network bandwidth services when needed.
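As a rough illustration of the baseline idea, the sketch below compares a new reading against historical samples and flags it when it falls far outside the usual range. The function names, the sample latency values and the three-standard-deviation threshold are all assumptions made for this example; real monitoring tools use their own methods and thresholds.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Return (mean, standard deviation) for a list of historical readings."""
    return mean(samples), stdev(samples)

def is_abnormal(value, baseline, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the mean.

    The threshold of 3.0 is an illustrative assumption, not a recommendation.
    """
    avg, sd = baseline
    if sd == 0:
        # No variation in the history: any different value counts as abnormal.
        return value != avg
    return abs(value - avg) / sd > threshold

# Hypothetical WAN round-trip latency samples, in milliseconds.
latency_history_ms = [41, 39, 40, 42, 38, 40, 41, 39]
baseline = build_baseline(latency_history_ms)

for reading in (41, 95):
    status = "abnormal" if is_abnormal(reading, baseline) else "normal"
    print(f"{reading} ms is {status}")
```

The same comparison could be applied to any of the metrics listed above, such as retransmissions or lost packets, as long as a history of normal readings exists.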
Business continuation technology tools for a resilient WAN
Don't forget about load balancers. Did you know that a summer 2012 AWS outage was the result of a load balancer issue? This means you need to make sure that your load balancers are not a single point of failure (SPOF), along with routers, name servers and other hardware and software components.
To ensure business continuation for your WAN services, you must determine where your SPOFs are. Is it the network service (i.e., WAN circuit) provider, or is it in your load balancer or routers? How about your network software configurations and name servers? If you have eliminated SPOFs in your servers (with redundant network adapters or NICs), in your operating system and hypervisor configurations (with alternate DNS settings) and in your LAN topology configuration, do you also have an active, passive standby or backup network service?
Speaking of networking, do you have primary active or standby bandwidth services with either the same or different carriers? If with different providers, do the network circuits actually traverse diverse routes, or do both use a common network carrier service? On the other hand, if you use a single provider, does it use separate, diverse paths (its own or others') for its redundant or HA services? Also, keep in mind the service-level agreements (SLAs) for those services, along with remediation and remuneration (what the provider will give you for not meeting them).
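One simple way to start answering these questions is to list the components that serve each role and look for any role with only one. The sketch below does that; the role names and component names in the inventory are hypothetical, and counting components is only a first pass, since two devices can still share a hidden dependency such as a power feed.

```python
def find_spofs(inventory):
    """Return the roles that are served by fewer than two distinct components."""
    return sorted(
        role for role, members in inventory.items() if len(set(members)) < 2
    )

# Hypothetical inventory: role -> components currently able to serve it.
inventory = {
    "wan_circuit": ["carrier-a-circuit", "carrier-b-circuit"],
    "load_balancer": ["lb-1"],
    "router": ["rtr-1", "rtr-2"],
    "name_server": ["dns-1"],
}

spofs = find_spofs(inventory)
print("Single points of failure:", spofs)
```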
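The route diversity question can be checked the same way once you know which underlying segments each circuit uses. The following is a minimal sketch, assuming you have obtained that information from your providers; the segment names are invented for illustration.

```python
def shared_segments(primary_path, standby_path):
    """Return the segments that both circuits depend on, in sorted order."""
    return sorted(set(primary_path) & set(standby_path))

# Hypothetical segment lists for two circuits bought from different providers.
primary_path = ["branch-demarc-1", "metro-ring-east", "carrier-x-backbone"]
standby_path = ["branch-demarc-2", "metro-ring-west", "carrier-x-backbone"]

overlap = shared_segments(primary_path, standby_path)
if overlap:
    print("Circuits are not fully diverse; shared segments:", overlap)
else:
    print("No shared segments found")
```

In this made-up case, both circuits ride the same backbone, so a failure there would take down the primary and the standby together despite the two separate contracts.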
To continue reading this tip series, read part 2: Network resiliency best practices for your WAN.