Resolving VMware ESX problems without pulling the plug
Instead of being driven to desperation and forced to press Restart, you can learn to resolve VMware ESX problems quickly.
Differences between virtual machines and physical servers highlight the unique challenges of resolving virtual machine issues. On a physical server you can always pull the power plug as a last resort before restarting a server. But this strategy may not work on virtual machines, which only have virtual power switches. There are, however, a few toolkits available that either help prevent problems, or make your troubleshooting process easier. I'll discuss several of these in this tip, and give you step-by-step instructions on how to fix various common problems.
VMware Tools
The first set of tools you want to familiarize yourself with is VMware Tools, a set of enhanced drivers and applications that you install in the operating system of each virtual machine (VM). As a best practice, always install VMware Tools to ensure the optimal performance and stability of your VM. Also, double-check that you're running the latest version of VMware Tools after you install any upgrades to ESX (incidentally, some ESX patches also require updates to VMware Tools). A column in the Virtual Machine view of the VMware Infrastructure Client (VI Client) shows the VMware Tools status of every VM: OK, out of date or not installed.
Virtual machine file types
As part of the troubleshooting process, you'll need to understand all the various file types involved with fixing a possible problem. Let's review the files associated with a virtual machine:
- .nvram file – This file contains the CMOS/BIOS for the VM.
- .vmdk files – These are the disk files that are created for each virtual hard drive in your VM. Three different types of files use the vmdk extension:
  - *-flat.vmdk file – This is the actual raw disk file that is created for each virtual hard drive.
  - *.vmdk file – This is the disk descriptor file, which describes the size and geometry of the virtual disk file.
  - *-delta.vmdk file – This is the differential file created when you take a snapshot of a VM (also known as a REDO log).
- .vmx file – This file is the primary configuration file for a virtual machine. When you create a new virtual machine and configure the hardware settings for it that information is stored in this file.
- .vswp file – This is the VM swap file (earlier ESX versions had a per-host swap file). It is created to allow for memory overcommitment on an ESX server.
- .vmss file – This file is created when a VM is put into Suspend (pause) mode and is used to save the suspend state.
- .log file – This is the file that keeps a log of the virtual machine activity and is useful in troubleshooting virtual machine problems.
- .vmxf file – This is a supplemental configuration file in text format for virtual machines that are in a team.
- .vmsd file – This file is used to store metadata and information about snapshots.
- .vmsn file - This is the snapshot state file, which stores the exact running state of a virtual machine at the time you take that snapshot.
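As a quick reference, the list above can be condensed into a small shell function that labels each file in a VM's working directory. This is only a sketch: the function name vm_file_role and the role labels are illustrative and not part of any VMware tool; the patterns simply mirror the extensions described above.

```shell
#!/bin/sh
# Illustrative helper (not a VMware tool): map a file name found in a VM's
# working directory to the role described in the list above.
vm_file_role() {
    case "$1" in
        # The more specific .vmdk patterns must come before the generic one.
        *-flat.vmdk)  echo "raw disk data" ;;
        *-delta.vmdk) echo "snapshot differential (REDO log)" ;;
        *.vmdk)       echo "disk descriptor" ;;
        *.nvram)      echo "BIOS/CMOS settings" ;;
        *.vmxf)       echo "supplemental (team) configuration" ;;
        *.vmx)        echo "primary configuration" ;;
        *.vswp)       echo "VM swap file" ;;
        *.vmss)       echo "suspend state" ;;
        *.vmsd)       echo "snapshot metadata" ;;
        *.vmsn)       echo "snapshot state" ;;
        *.log)        echo "activity log" ;;
        *)            echo "unknown" ;;
    esac
}
```

Run from inside a VM's working directory, a loop such as `for f in *; do echo "$f: $(vm_file_role "$f")"; done` prints one labeled line per file.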
Log files
Once you understand VM file types, you'll want to become very familiar with log files. Log files are the best resource for troubleshooting problems with virtual machines, and they are the first place you should check when problems occur.
The most important file is the vmware.log file. This is the main log file for the VM on the ESX server, and is located in the working directory for the VM. vmware.log is always the current working log for the VM, and older log files are numbered incrementally, for example vmware-1.log.
You should also check /var/log/vmkernel and /var/log/vmware/hostd.log on the ESX host for any errors that may be related to the problem you are experiencing with your VM. Sometimes, restarting the hostd service (service mgmt-vmware restart) on the ESX host will resolve quirky problems with virtual machines. For more common problems, there are more specific techniques that will likely resolve your problem; I'll go over these next.
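A first pass over a VM's logs can be scripted. The sketch below is an assumption-laden starting point rather than a definitive tool: the function name scan_vm_logs is hypothetical, and the default search pattern (error, fail, panic) is only a guess at what is worth flagging in your environment.

```shell
#!/bin/sh
# Illustrative helper: list lines that look like problems in a VM's logs.
# $1 = the VM's working directory, $2 = optional extended regex to search for.
# The default pattern is an assumption; adjust it to the messages you care about.
scan_vm_logs() {
    dir=$1
    pattern=${2:-'error|fail|panic'}
    # vmware.log is the current log; vmware-1.log, vmware-2.log, ... are older.
    # Adding /dev/null forces grep to prefix every match with its file name,
    # even when only one log file exists.
    grep -i -n -E "$pattern" "$dir"/vmware*.log /dev/null
}
```

The same grep options can be pointed at the host logs named above, for example `grep -i -n -E 'error|fail|panic' /var/log/vmkernel /var/log/vmware/hostd.log`.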
Problem: Can't shut down a virtual machine
Let's say you cannot shut down a VM using the VM power controls. You can try command-line methods to manually kill the stuck VM. Several methods are described below. Employ them only as a last resort, short of restarting your ESX host.
- The first option you should always try is the vmware-cmd command, which is the command-line equivalent of the VI Client power controls.
  - Log in to the service console.
  - Type "vmware-cmd -l" to get a list of all VMs and their paths.
  - Check the VM state by typing "vmware-cmd <path to .vmx file> getstate".
  - To forcibly stop the VM, type "vmware-cmd <path to .vmx file> stop hard".
  - Check the VM state again; it should now be off.
  - Type "vmware-cmd <path to .vmx file> start" to power on the VM.
- The second option is to manually kill the VM's process by finding its process identifier (PID) and issuing the kill command to terminate it.
  - Log in to the service console.
  - Type "vmware-cmd -l" to get a list of all VMs and their paths.
  - Check the VM state by typing "vmware-cmd <path to .vmx file> getstate".
  - Type "ps -ef | grep <VM name>". The second column is the PID of the VM's vmkload_app process. You can also type "ps -eaf" to see all running processes.
  - Type "kill -9 <PID>".
  - Check the VM state again; it should now be off.
  - Type "vmware-cmd <path to .vmx file> start" to power on the VM.
- The last option is to use the vm-support command to force the VM to shut down.
  - Log in to the service console.
  - Get the vmid of the VM you want to kill by typing "vm-support -x" or "cat /proc/vmware/vm/*/names".
  - Kill the VM and generate core dumps and logs by typing "vm-support -X <vmid>".
  - You will be asked whether you want to include a screenshot of the VM, send an NMI to the VM and send an abort command to the VM. You must answer yes to the abort question to kill the VM. The entire process takes about 5 to 10 minutes to run and creates a tar archive in the directory where you ran the command.
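The first two methods above can be wrapped in small shell functions. Treat this as a minimal sketch rather than a tested ESX script: the function names and the VMWARE_CMD override are hypothetical, and vm_state assumes that getstate prints a line of the form "getstate() = on". The pids_for_vm helper only prints candidate PIDs; issuing kill -9 is deliberately left as a manual step.

```shell
#!/bin/sh
# Sketch of the first method: hard-stop a VM with vmware-cmd and confirm the result.
# VMWARE_CMD is a hypothetical override so the logic can be dry-run with a stub.
VMWARE_CMD=${VMWARE_CMD:-vmware-cmd}

# Print the power state of the VM whose .vmx path is given as $1.
# Assumption: getstate prints a line of the form "getstate() = on".
vm_state() {
    "$VMWARE_CMD" "$1" getstate | sed -n 's/^getstate() = //p'
}

# Hard-stop the VM unless it is already off; succeed only if it ends up off.
force_stop_vm() {
    vmx=$1
    if [ "$(vm_state "$vmx")" = "off" ]; then
        echo "already off: $vmx"
        return 0
    fi
    "$VMWARE_CMD" "$vmx" stop hard
    [ "$(vm_state "$vmx")" = "off" ]
}

# For the second method: read "ps -ef" output on stdin and print the PIDs
# (second column) of lines that mention the VM, skipping the grep process itself.
pids_for_vm() {
    grep -- "$1" | grep -v grep | awk '{ print $2 }'
}
```

On the service console this would be used as `force_stop_vm <path to .vmx file>` and, if the VM is still running, `ps -ef | pids_for_vm <VM name>` to find the PID to pass to kill -9.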
Problem: Can't power on a virtual machine
Another common problem is that you cannot power on a VM. This can happen if the host server does not have enough resources for the VM to use. For example, if the VM has a memory reservation set and the ESX host does not have enough physical memory to meet the reservation, then it cannot power on the VM. If this happens, you can remove the memory reservation from the VM, migrate the VM to another host with more free physical memory, or free up physical memory on the existing host.
Also, when a VM is powered on, it needs to create a vswp file in the working directory of the VM on the ESX host that is equal in size to the amount of RAM assigned to the VM (minus any memory reservations). If there is not sufficient disk space for this file, you will also not be able to power on the VM. A workaround is to set a memory reservation equal to the amount of RAM assigned to the VM so the vswp file will be 0 bytes in size. It's important, however, to always leave additional disk space on your VMFS volumes for things like logs, swap files and snapshots.
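The swap-file arithmetic is simple enough to check before you power on a VM. The sketch below assumes all sizes are given in MB; the function names are illustrative and not VMware commands, and how you obtain the free space on the volume is left to you.

```shell
#!/bin/sh
# Sketch of the swap-file arithmetic described above. All sizes are in MB.
# Function names are illustrative, not VMware commands.

# Expected .vswp size: RAM assigned to the VM ($1) minus its memory reservation ($2).
vswp_size_mb() {
    echo $(( $1 - $2 ))
}

# Succeed if the .vswp file for a VM with RAM $1 and reservation $2
# would fit in $3 MB of free space on the volume holding its working directory.
vswp_fits() {
    [ "$(vswp_size_mb "$1" "$2")" -le "$3" ]
}
```

For example, a VM with 4096 MB of RAM and a 1024 MB reservation needs a 3072 MB vswp file, so `vswp_fits 4096 1024 3000` fails, while raising the reservation to 4096 MB brings the file down to 0 MB and the check passes even with no free space.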
Problem: Virtual machine encountering boot errors due to OS corruption
If a VM is having problems while booting due to operating system corruption or faulty configuration, a good way to deal with this is to add its virtual disk to another working VM so you can access the drive and make any needed repairs. First, make sure the problem VM is powered off. Next, add an additional drive to a working VM and browse to the problem VM's disk file. Boot the working VM; you can now access the drive of the problem VM to make any changes or corrections. When you are done, remove the drive from the working VM, add it back to the problem VM and try booting it again.
Problem: General virtual machine OS issues
For troubleshooting problems with the VM's operating system, I create a toolkit of ISO files that contain helpful troubleshooting applications that I can quickly mount on a VM's CD-ROM and use (or boot from) to make repairs to a VM. A few of the ISO files I use include:
- Sysinternals utilities - Great utilities for troubleshooting Windows server problems.
- Gparted – A Linux-based disk partition editor.
- Knoppix - A Linux-based live CD with many tools and applications.
- Ultimate Boot CD - A live CD with many system repair and testing tools.
- UBCD4Win - A Windows-based live CD with many system repair and testing tools.
Conclusions
These are just a few of the problems and techniques that you will use when troubleshooting virtual machine problems. The information in this article should help you the next time you experience a problem with a troublesome VM.
ABOUT THE AUTHOR: Eric Siebert is a 25-year IT veteran with experience in programming, networking, telecom and systems administration. He is a guru-status moderator on the VMware community VMTN forums and maintains VMware-land.com, a VI3 information site.