ESXi hosts need vCenter for advanced features and centralized management, although you can perform many management tasks without it. During an incident or outage, however, vCenter is the most important tool in your virtual environment. You must protect vCenter and consider contingency plans for large-scale failures when you design your VMware environment.
Many admins virtualize vCenter in the same environment that vCenter manages. This means that if you have a large issue, such as a storage failure, your primary tool for diagnosing issues could also go offline.
Even when you use redundant network and power connections, you might overlook full protection for vCenter itself. Luckily, you can link multiple vCenter servers together in the event one goes down. High Availability (HA) also provides vCenter protection by backing up vCenter servers and initiating failover, but this doesn't help if the whole virtual environment experiences a failure.
Plan for management clusters
A management cluster is a group of hosts that exists outside of the main production infrastructure and is dedicated strictly to management tools and applications. A management cluster should contain vCenter, Active Directory controllers, a backup print server, and backup domain name system (DNS) and Dynamic Host Configuration Protocol (DHCP) servers. The off-site management cluster is crucial to your data center.
To make a management cluster truly effective, you must connect it to your main production network, but keep it separate so network issues don't affect it. The same goes for storage, which should exist on its own frame, or you can use local shared storage such as vSAN to provide an alternative storage location.
A management cluster with critical tools and services can provide you with the basic functionality to get your other systems back online if you can't prevent a VMware environment failure before it occurs. This shouldn't replace or replicate your existing data center. It should, however, keep vCenter safe in the event of a massive outage.
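One way to check this separation is to look for clusters or datastores that management and production VMs have in common. The following Python sketch works on a hand-written inventory rather than a live vCenter connection. The `VM` class, the `shared_failure_domains` function and every VM, cluster and datastore name are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    cluster: str
    datastore: str
    role: str  # "management" or "production" (labels assumed for this sketch)


def shared_failure_domains(vms):
    """Return the clusters and datastores used by both management and production VMs."""
    mgmt = [vm for vm in vms if vm.role == "management"]
    prod = [vm for vm in vms if vm.role == "production"]
    shared_clusters = {vm.cluster for vm in mgmt} & {vm.cluster for vm in prod}
    shared_datastores = {vm.datastore for vm in mgmt} & {vm.datastore for vm in prod}
    return shared_clusters, shared_datastores


# Hypothetical inventory: dns02 is a management VM left on production hardware.
inventory = [
    VM("vcenter01", "mgmt-cluster", "mgmt-vsan", "management"),
    VM("dc01", "mgmt-cluster", "mgmt-vsan", "management"),
    VM("dns02", "prod-cluster", "prod-array", "management"),
    VM("app01", "prod-cluster", "prod-array", "production"),
]

clusters, datastores = shared_failure_domains(inventory)
print("Shared clusters:", sorted(clusters))
print("Shared datastores:", sorted(datastores))
```

Any name that appears in either result is a place where a single production failure could also take down a management tool. An empty result for both suggests the management cluster stands on its own, at least at the cluster and storage level.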
Right-size to prevent VMware environment failure
The threat of failure affects how large you can make your hosts and how many VMs or containers you can place on them. The bigger your hosts, the greater the effect of an outage, depending on your workload distribution.
VM density also affects HA. The fewer hosts you have for the same number of VMs, the longer the restart takes, because you must restart more VMs at once.
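The arithmetic behind that trade-off is simple enough to sketch. The Python example below assumes VMs are spread evenly across hosts and that a single host fails. The function name `restart_load` and the figure of 200 VMs are illustrative assumptions, and the sketch ignores CPU and memory headroom, which a real sizing exercise must include.

```python
import math


def restart_load(total_vms, hosts):
    """Estimate the effect of losing one host, assuming an even VM spread.

    Returns (VMs that must restart, extra VMs each surviving host absorbs).
    """
    if hosts < 2:
        raise ValueError("HA needs at least two hosts")
    failed = math.ceil(total_vms / hosts)            # VMs on the failed host
    per_survivor = math.ceil(failed / (hosts - 1))   # restarts per remaining host
    return failed, per_survivor


# Same VM count, different host counts (200 VMs is an assumed figure).
for hosts in (4, 8, 16):
    failed, extra = restart_load(200, hosts)
    print(f"{hosts} hosts: {failed} VMs restart, about {extra} more per surviving host")
```

With the same 200 VMs, moving from four hosts to 16 cuts both the number of VMs that restart at once and the extra load each surviving host must take on.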
Keeping workloads separate can increase the effect of host failures. If you mix production workloads with development or testing workloads, a single host failure takes down fewer production VMs. But you must manage more resource pools to ensure production VMs get resource priority. Mixing workloads also creates denser hosts, so you must decide which you value more: less work when your environment runs smoothly or less work in the event of a major failure.
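A quick calculation shows why mixing reduces the production effect of a single host failure. This Python sketch assumes four hosts, 80 production VMs and an even spread across whichever hosts run production. All of these numbers, and the function name `production_vms_hit`, are hypothetical.

```python
import math


def production_vms_hit(prod_vms, hosts_running_prod):
    """Production VMs lost when one production-bearing host fails (even spread assumed)."""
    if hosts_running_prod < 1:
        raise ValueError("at least one host must run production VMs")
    return math.ceil(prod_vms / hosts_running_prod)


PROD_VMS = 80     # assumed production VM count
TOTAL_HOSTS = 4   # assumed cluster size

# Separated: production is confined to half of the hosts.
separated = production_vms_hit(PROD_VMS, TOTAL_HOSTS // 2)
# Mixed: production shares every host with development and testing VMs.
mixed = production_vms_hit(PROD_VMS, TOTAL_HOSTS)

print(f"Separated: {separated} production VMs down per failed production host")
print(f"Mixed: {mixed} production VMs down per failed host")
```

The sketch covers only the failure side of the decision. It says nothing about the resource pool work that a mixed layout adds during normal operation.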
Find compromise in design
A good design requires compromise. What makes sense to you might confuse another admin, and fixes might not always be perfect. You can prevent future VMware environment failures by ensuring other admins understand your thought process.
Document your design process and include not just the decisions you make, but why you made those decisions. This helps others understand the logic behind them. Such details can prevent others from making mistakes. You don't want new staff to upgrade or replace infrastructure and then face the same problems you did.