Colocation data center outage response and support
Navigating data center malfunctions when hardware is off premises can be tricky. Organizations must have strong SLAs with their colo provider to ensure appropriate response times.
Data center colocation providers offer space, power, cooling and physical security, but colocation also poses the potential drawback of slower response times during a data center outage. Organizations that use colocation must carefully plan where to store important data and pay attention to service-level agreements to minimize the effects of data center outages at colocation facilities.
Consider an on-premises data center. The company owns the facility and equipment, builds and maintains the infrastructure, employs and allocates the staff, implements policies and procedures, and sets the priorities needed to remediate any outages. When trouble strikes, business leaders know who to call, there is ample notification and the staff can focus on the interests of the business.
For data center colocation contracts, this direct control is ceded to the service provider, who is responsible for troubleshooting and maintaining contact with the organization. But colocation providers are independent businesses that operate in their own business interests -- a harsh reality that some colocation customers discover too late.
What causes colocation data center outages?
At its core, a colocation provider serves as a remote data center, and data center outages can typically be traced to many of the same problems that can affect locally owned enterprise data centers. There are four general categories of disruption: power, people, disaster and connectivity.
- Power. Colocation providers typically implement strong resilience features within their data centers -- for example, backup power systems. Such backup power includes uninterruptible power supplies (UPSes) to power servers and rack-based systems, as well as industrial-grade backup generators that can power the entire facility if utility power is disrupted. However, UPS failures, failed generator startups, poor generator maintenance and other issues with backup power systems can disrupt colocation customers when utility power stops.
- People. Human error is a major cause of data center outages. Misconfigured routers, servers, authentication systems and other hardware and software infrastructure can result in inaccessible systems for customers. Internal and external attacks or other malicious activity -- such as denial-of-service attacks -- can also interfere with the colocation facility and its customers' workloads.
- Disaster. Colocation facilities are usually engaged by organizations that need resilient and reliable alternatives to traditional local data centers, so colocation sites are typically selected to minimize exposure to natural and man-made disasters, including fires, floods, earthquakes, crashes, civil unrest and acts of war. Although prudent site selection should reduce such risks, it's impossible to eliminate them entirely, and unforeseen disasters can disable or destroy a colocation facility.
- Connectivity. Colocation is remote by its nature, and WAN or internet connectivity is critical for colocation providers. Most providers support numerous competing telecom providers and allow customers to engage the services of one or more available telco providers. Telecom infrastructure is also imperfect and not 100% reliable, possibly resulting in loss of connectivity for customers using certain telecom services. In such cases, it's the telecom provider -- not the colocation provider -- that must work to restore services, but the effect on those colocation customers can be just as severe as a fire or flood.
Troubleshooting both on and off premises
Troubleshooting a colocation problem can be particularly challenging because fixing a problem depends first on recognizing that the problem exists, and then on determining whether the colocation provider -- or the business customer -- is responsible for the fault and corrective action.
For example, suppose that an enterprise workload is running in a colocation facility, and the colocation provider only supplies the building, power, cooling and other utilities. If the facility is at fault, as in a power failure, the client businesses would rely on timely communication from the provider, and the colocation provider would be responsible for finding and correcting the utility problem under the terms of the prevailing service-level agreement (SLA). The process could take hours or even days depending on the extent of the problem.
However, the client businesses would still be responsible for all the servers, storage, networking and other business gear deployed to the colocation provider. A failed server, storage subsystem or network switch, or even an application fault such as a software bug, could potentially be responsible for the outage. The business would depend on systems management tools to monitor and report on the status of hardware and software, and the onus would then be on the client business to find and fix the problem, perhaps through a server reboot, server replacement or myriad other potential fixes.
If the client business is, indeed, responsible for the fix, it faces the challenge of doing the work. Troubleshooting and repairing a failed enterprise application might require hands on the actual gear, and it might take hours to deploy staff and perform the physical work involved in a repair. In some cases, the colocation provider might have staff available who can help, for an additional fee.
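The division of responsibility described above can be written down as a small lookup that an incident runbook might consult. The following Python sketch is illustrative only: the `Owner` enum, the component names and the `responsible_party` function are hypothetical, and the categories simply mirror the examples in the text for a provider that supplies only the building, power, cooling and other utilities. A real contract and SLA would define the actual boundaries.

```python
from enum import Enum


class Owner(Enum):
    PROVIDER = "colocation provider"
    CLIENT = "client business"


# Assumed categories, taken from the examples in the text. A real SLA
# would spell out the actual demarcation between provider and client.
FACILITY_COMPONENTS = {"building", "power", "cooling"}
CLIENT_COMPONENTS = {"server", "storage", "network switch", "application"}


def responsible_party(component: str) -> Owner:
    """Return who is expected to fix a fault in the given component."""
    name = component.strip().lower()
    if name in FACILITY_COMPONENTS:
        return Owner.PROVIDER
    if name in CLIENT_COMPONENTS:
        return Owner.CLIENT
    # Unknown components are surfaced rather than guessed at.
    raise ValueError(f"unknown component: {component!r}")


if __name__ == "__main__":
    for item in ("power", "server", "application"):
        print(f"{item}: {responsible_party(item).value}")
```

Raising an error for unrecognized components is a deliberate choice in this sketch: an ambiguous fault is exactly the case where the client should escalate to a person instead of assuming an owner.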
Hosted or managed colocation
In a hosted or managed colocation scenario, the provider supplies the facility as well as the servers, storage, networking and other infrastructure elements that the client business simply rents from the provider. In this case, the provider is completely responsible for the entire infrastructure -- client businesses never touch the provider's infrastructure. If a fault occurs in the colocation facility or computing resources, the provider must handle outage notification and then troubleshoot and remediate the fault within the terms delineated in an SLA. In such a scenario, the client business would often notify the provider of a fault -- application X isn't working -- through an established support channel such as email, a telephone call or a web portal.
If the problem is actually with the client's application rather than the provider's infrastructure -- that is, the colocation is working perfectly, but the client's application suffers a crash or other exception -- the provider has no further obligation or even the means to determine if the application is working or not. The client business must have monitoring in place to track application health, recognize application performance or health problems and be able to identify the problem. When the application fails, the business's IT team might have the option to restart or reboot the application remotely or call upon the provider to assist with corrective action.
Types of colocation data center support
When trouble strikes, organizations must find fast and cost-effective ways to remediate problems, while maintaining data integrity and workload security required by prevailing industry or regulatory compliance standards. There are four general types of support that a client business can use:
- Physical staff. When a client business places its own gear into a colocation facility, the business might elect to use IT staff employed by the business -- not the colocation provider. This helps ensure that IT tasks are performed in the best interests of the business, but getting staff to a distant colocation provider can be time-consuming and costly.
- Remote hands. The colocation provider typically employs IT staff, and the provider's staff can be engaged to assist a client business with a wide range of IT tasks. Such tasks may include physical gear troubleshooting, replacement, configuration and power cycling. Remote hands are generally used on a per-incident or per-request basis, and the hourly cost is simply added to the client's monthly bill.
- Remote management. Modern systems management tools are adept at accessing an array of hardware devices across a network to perform common management tasks. Tools can usually reboot servers, restart applications, migrate virtual machines, and back up and restore data. Remote management is highly effective in managing routine tasks without the need for human IT staff on the colocation site.
- Colocation services. Managed colocation providers typically offer a range of services, such as managed email, that client companies can engage. Some services might be bundled into regular monthly managed colocation fees, and some services -- such as backups -- might carry additional monthly costs. However, the provider can usually be involved in adding new services, changing existing services, or reducing or cancelling unneeded services.
Mitigating uncertainty in a data center colocation setup
Data center colocation providers might introduce additional uncertainty and complexities for organizations. Facilities in remote areas might be subject to geopolitical uncertainty and greater security issues. The provider's desire to manage costs might trim the support staff, potentially lowering its response capabilities. Mergers and acquisitions can also disrupt the provider's management staff, possibly affecting the provider's day-to-day operations and responses to support requests.
Businesses can mitigate these colocation concerns by prudent contingency planning and copious monitoring. Common steps include:
- Workload suitability. Each enterprise application must be evaluated for its suitability in colocation. Not all applications are appropriate for colocation due to regulatory compliance, security, performance or other concerns. Some workloads should simply be kept in-house.
- Rollback or repatriation. Every workload migrated to colocation should have a rollback or repatriation process established to restore the application in the local data center if colocation should fail or prove unsatisfactory for that application.
- Backup and DR. Colocation isn't a guarantee of availability. Vital workloads might require additional colocation investment to establish a backup and disaster recovery framework to ensure application availability while running in colocation -- the colocation provider doesn't offer such services by default.
- Detailed monitoring. A colocation provider's SLA is only as good as the ability to validate the provider's claims. Employ monitoring tools, such as application performance monitoring tools, for vital workloads to track each application's health and performance, as well as the availability of the colocation provider and its resources. Understand the provider's SLA and use monitoring results to validate the provider's adherence to the SLA.
- Get help. The colocation provider will offer a range of help desk or issue ticket resources to ask for support. Client businesses should have a clear picture of the help available, how to request help and how to escalate issues when necessary to drive timely corrective actions.
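As a rough illustration of the detailed monitoring step, the Python sketch below turns a series of up/down probe results into a measured availability percentage and compares it with an SLA target. It is a minimal sketch under stated assumptions: the function names, the 30-day period and the evenly spaced probes are hypothetical, and real SLAs often define availability, measurement windows and exclusions (such as scheduled maintenance) differently.

```python
from typing import Sequence


def measured_availability(probe_results: Sequence[bool]) -> float:
    """Percentage of probes that succeeded.

    Assumes probes were taken at even intervals, so each one
    represents the same slice of time.
    """
    if not probe_results:
        raise ValueError("no probe results to evaluate")
    successes = sum(1 for ok in probe_results if ok)
    return 100.0 * successes / len(probe_results)


def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1.0 - sla_percent / 100.0)


def meets_sla(probe_results: Sequence[bool], sla_percent: float) -> bool:
    """True if measured availability is at or above the SLA target."""
    return measured_availability(probe_results) >= sla_percent


if __name__ == "__main__":
    # Hypothetical data: 1,000 probes with a single failure.
    results = [True] * 999 + [False]
    print(f"measured: {measured_availability(results):.2f}%")
    print(f"meets 99.9% target: {meets_sla(results, 99.9)}")
    print(f"30-day budget at 99.9%: {allowed_downtime_minutes(99.9):.1f} min")
```

Keeping the client's own probe data alongside the provider's reports gives the business independent evidence when it needs to raise an SLA question through the provider's support channel.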
Ultimately, colocation providers are business partners -- not employees -- and the resources and services offered by a colocation provider can't be assumed or taken for granted. Client businesses have a responsibility to manage their own workloads running in colocation and need to be able to work collaboratively with the provider to maintain each workload's availability and performance.