"Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep." - Scott Adams, The Dilbert...
When VMware administrators talk about mistakes they've made on the job, I always say: if you're not making mistakes, you're not learning.
Some mistakes come from trial and error, others from a lack of knowledge, and some are just silly things we should've known better than to do. But in the end, we become better VMware administrators because of the mistakes we've made.
Here are some of the more memorable VMware administrator mistakes that I've seen, heard about and experienced. I couldn't possibly list them all; they could fill an entire book.
VMware administrator mistake No. 1: Virtual machine renames
This mistake is a classic. In vCenter, it's very easy to rename virtual machines (VMs). Simply right-click on the guest object, select rename and type a new name.
But that process just renames the object pointer in the vCenter database. The directories and files associated with that guest still carry the old name. So it's easy for a VMware administrator, in the midst of quickly cleaning up the data stores, to delete a machine's directory and its files in one click -- especially if he or she doesn't match the guests to the directories. I've seen it happen, and the aftermath isn't pretty.
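A quick audit can catch this drift before cleanup day. Here is a minimal sketch in Python: the inventory data is hypothetical (in practice you'd pull VM names and VMX paths from vCenter with a library such as pyVmomi), but the comparison logic is the point -- flag any guest whose display name no longer matches its on-disk directory.

```python
# Hypothetical inventory snapshot: vCenter display name -> VMX path.
# In a real environment this would come from the vSphere API; the sample
# data below is illustrative only.
inventory = {
    "web-prod-01": "[datastore1] web-prod-01/web-prod-01.vmx",
    "db-prod-02":  "[datastore1] db-old-name/db-old-name.vmx",  # renamed in vCenter only
}

def find_rename_mismatches(inventory):
    """Return (vCenter name, directory) pairs that no longer match."""
    mismatches = []
    for vm_name, vmx_path in inventory.items():
        # "[datastore1] dir/file.vmx" -> "dir"
        directory = vmx_path.split("] ", 1)[1].split("/", 1)[0]
        if directory != vm_name:
            mismatches.append((vm_name, directory))
    return mismatches

print(find_rename_mismatches(inventory))
```

Running a report like this before deleting anything tells you exactly which directories belong to renamed guests, instead of guessing from the folder names alone.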
VMware administrator mistake No. 2: Cramming LUNs
At a conference many years back, I attended a session on a new feature in VMware ESX 3. The presenter created a 100 GB logical unit number (LUN) on a storage area network and presented it to a two-node cluster, which he used for the demo machines.
He had three servers on the LUN, each with 32 GB drives and a shared ISO data store of 2 GB. Now do the math: (32 GB x 3) + 2 GB = 98 GB. With a 100 GB LUN, he had more than enough room, right?
One by one, he fired up all three machines. When the third one booted, all of them displayed the Purple Screen of Death. It seems he forgot about the swap files, which are created at boot. Those files filled the LUN. It was even funnier because he had no idea why it happened, and he tried to start the machines again.
And yes, he was a VMware engineer.
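The arithmetic failure above is easy to sketch. The swap size per VM here is an assumption (2 GB each, standing in for the RAM-sized swap files ESX creates at power-on), but it shows why "at rest" math lies to you:

```python
def lun_usage_gb(disks_gb, iso_gb, swap_gb_per_vm):
    """Total space consumed once every VM is powered on.

    ESX creates a per-VM swap file at power-on (by default sized to the
    VM's configured RAM minus any reservation), so disk math alone
    understates the real footprint.
    """
    return sum(disks_gb) + iso_gb + swap_gb_per_vm * len(disks_gb)

# The demo: three 32 GB disks plus a 2 GB ISO store on a 100 GB LUN.
at_rest = sum([32, 32, 32]) + 2                               # 98 GB -- looks safe
powered_on = lun_usage_gb([32, 32, 32], 2, swap_gb_per_vm=2)  # hypothetical 2 GB RAM per VM
print(at_rest, powered_on)  # 98 vs. 104 -- the 100 GB LUN overflows
```

With any realistic amount of RAM per guest, the third power-on pushes the LUN past 100 GB, which is exactly when the purple screens appeared.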
VMware administrator mistake No. 3: Network names
I once worked as a consultant on a Citrix Systems Inc. project for a small organization. One day I got a call from the organization's storage person, who was in charge of the new virtual environment. He was having problems with vMotion, and Distributed Resource Scheduler (DRS) was generating all kinds of errors. (Did I mention he was a storage guy?)
So I went into vCenter and found that the ESX hosts weren't all set up with the same networking. Each virtual switch had a different name on each host, which is a common mistake when the hosts are not set up at the same time, or when no naming standards are followed (or even exist). vMotion requires that the virtual switch and network names be identical on every host in a DRS cluster.
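Spotting that kind of drift by eye across many hosts is tedious. A minimal sketch, assuming you've already gathered each host's network labels (the host and label names below are made up; in practice you'd query them via pyVmomi or PowerCLI):

```python
# Hypothetical per-host network labels. vMotion and DRS expect the same
# labels to exist on every host in the cluster.
host_networks = {
    "esx01": {"VM Network", "vMotion", "Production"},
    "esx02": {"VM Network", "vMotion", "Prod-LAN"},   # drifted name
    "esx03": {"VM Network", "vMotion", "Production"},
}

def network_drift(host_networks):
    """Return labels that are NOT present on every host in the cluster."""
    all_labels = set().union(*host_networks.values())
    common = set.intersection(*host_networks.values())
    return sorted(all_labels - common)

print(network_drift(host_networks))  # ['Prod-LAN', 'Production']
```

Any label that comes back from a check like this is a vMotion failure waiting to happen: the VM's network exists on the source host but not, under that name, on the destination.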
VMware administrator mistake No. 4: Honeymoons and roles
I know a VMware administrator who had to fix a virtualization issue on his honeymoon in Mexico. Before he left, he decided to lock down the infrastructure by removing people from roles in vCenter.
But he removed the roles from the permissions on the vCenter object itself -- not just on the VMs or the cluster. Because permissions propagate down from that top-level object, the change cut off access for everyone.
For the record, I heard this story from the new bride, who was not at all happy about the interruptions.
VMware administrator mistake No. 5: Network interface card wipeout
I could not wait for VMware's Host Profiles. I heard about it a year before it materialized, and I was chomping at the bit to quickly deploy standardized hosts in an infrastructure with more than 500 hosts. But when I finally used Host Profiles, it all went very wrong, very fast.
I generated a new host profile and tested it on a lab host. It went well, and the host didn't appear to have any issues after I tested a few VMs on it. So I decided to try it in a 16-host cluster in a production environment.
Soon after, vCenter reported that everything had gone well. I was smiling for about five seconds, and then the alarms went off. All my guests and hosts were inaccessible through the network. One of the issues with Host Profiles in ESX is that no matter what the network interface card (NIC) speed settings are on the profiled host, all the hosts provisioned from that profile are set to auto-negotiate by default. (VMware calls it a feature, of course.)
This setting won't work on a network that has every switch port hardcoded to 1000/Full with no failback. (The lab network had auto-negotiate on its ports, so it worked there.) This setting, applied to all the hosts, brought down the whole cluster. I had to manually redo the 14 NICs on each host, which made for a very long day.
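After that day, an audit script would have saved hours. A minimal sketch, again with hypothetical host and NIC data (real values would come from the vSphere API or esxcfg-nics output): list every NIC that isn't hardcoded to the setting the physical switches expect.

```python
# Hypothetical NIC settings after applying a host profile. "auto" means
# auto-negotiate, which clashes with switch ports hardcoded to 1000/Full.
nic_settings = {
    "esx01": {"vmnic0": "auto", "vmnic1": "1000full"},
    "esx02": {"vmnic0": "1000full", "vmnic1": "1000full"},
}

def nics_to_fix(nic_settings, required="1000full"):
    """Return (host, nic) pairs whose speed/duplex setting must be redone."""
    return [(host, nic)
            for host, nics in sorted(nic_settings.items())
            for nic, speed in sorted(nics.items())
            if speed != required]

print(nics_to_fix(nic_settings))  # [('esx01', 'vmnic0')]
```

Run against a 16-host cluster, a report like this turns "redo 14 NICs per host by hand, hunting for the wrong ones" into a checklist.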
VMware makes mistakes too
Remember ESX 3.5 Update 2? Thousands of hosts all over the world came crashing down after that fiasco.
The user base discovered the bug, which VMware didn't readily admit existed. If you installed Update 2, as soon as your clock rolled over to 12:01 a.m. on August 12, 2008, you couldn't vMotion or power on any virtual machines.
VMware finally admitted the problem was caused by a piece of code that expired all the licenses -- code that somehow passed beta testing and quality control. This "time bomb" bug created huge problems, and the only workaround was to disable Network Time Protocol on the servers and set the clock back to August 10, 2008. VMware issued a patch on August 14, but the incident left many customers wary of the product and of the testing done within VMware.
VMware CEO Paul Maritz sent out an email to customers apologizing for the bug, saying it would never happen again.