What is a PSOD?
A purple screen of death (PSOD) is a diagnostic screen with white type on a purple background that's displayed when the VMkernel of a VMware ESXi host experiences a critical error, becomes inoperative and terminates any virtual machines (VMs) that are running.
Typically, a PSOD details the memory state at the time of the crash and includes other information, such as the ESX/ESXi version and build, exception type, register dump, what was running on each central processing unit (CPU) at the time of the crash, backtrace, server uptime, error messages and core dump information.
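Because the screen packs several fields into a few lines, it can help to pull them out of a transcribed screenshot programmatically. The following is a minimal Python sketch, not an official parser: the sample text is illustrative and only loosely modeled on a real screen, and the function name `parse_psod`, the regular expressions and the assumption that the error message sits on the line right after the version header are all hypothetical choices made for this example.

```python
import re

# Illustrative text only: a real screen varies by ESXi version and failure type.
SAMPLE_PSOD = """\
VMware ESXi 6.7.0 [Releasebuild-8169922 x86_64]
#PF Exception 14 in world 2098707:vmm0:example IP 0x418012345678 addr 0x0
cpu2:2098707)Code start: 0x418012000000 VMK uptime: 12:04:33:10.512
"""

# Assumed header layout: "VMware ESXi <version> [Releasebuild-<build> ...]"
HEADER_RE = re.compile(
    r"VMware (?P<product>ESXi?) (?P<version>[\d.]+) \[Releasebuild-(?P<build>\d+)"
)
# Assumed uptime layout: "VMK uptime: d:hh:mm:ss.mmm"
UPTIME_RE = re.compile(r"VMK uptime: (?P<uptime>[\d:.]+)")


def parse_psod(text):
    """Extract product, version, build, error line and uptime from PSOD text."""
    info = {"product": None, "version": None, "build": None,
            "error": None, "uptime": None}
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    for index, line in enumerate(lines):
        match = HEADER_RE.search(line)
        if match:
            info["product"] = match.group("product")
            info["version"] = match.group("version")
            info["build"] = int(match.group("build"))
            # Assumption: the error message is the line after the header.
            if index + 1 < len(lines):
                info["error"] = lines[index + 1]
            break

    uptime = UPTIME_RE.search(text)
    if uptime:
        info["uptime"] = uptime.group("uptime")
    return info


if __name__ == "__main__":
    for key, value in parse_psod(SAMPLE_PSOD).items():
        print(f"{key}: {value}")
```

Fields that can't be found are left as `None`, so the sketch degrades quietly on screens that don't match the assumed layout.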
Why does a PSOD happen?
The purple diagnostic screen isn't as prevalent as the notorious blue screen of death -- the informal name for a Windows stop error -- but it can be equally disruptive. In addition to problems with the VMware hypervisor itself, outdated drivers, unstable graphics processing units (GPUs), external hardware and misconfigured settings on a device can also generate a PSOD.
The most common causes of a PSOD include the following:
- Critical kernel errors. If the kernel of a VMware ESXi host experiences a significant error, the PSOD will be displayed. Typically, the first line of the diagnostic message will show the ESXi version along with the build number.
- Hardware problems. Any type of internal or external hardware issue can trigger a PSOD. This can include out-of-band management warnings caused by RAM and CPU issues, nonmaskable interrupts or hardware failures, failed system boards, failed memory modules, and damaged internal riser cards.
- Overheating or overclocking of the PC. A PC that overheats because of overclocking or a failed fan can generate a PSOD. It's advisable not to position the PC so that its vents are blocked, as the resulting heat can lead to an unstable GPU.
- Software bugs. Misconfigured software settings or improper interactions between software components can also cause a PSOD. These can include race conditions and improper or unsupported configuration parameters.
- Outdated drivers. Outdated drivers, especially graphics drivers, can cause a PSOD to appear. Therefore, it's imperative to keep all drivers up to date.
- System upgrades. Sometimes, software upgrades can also cause a PSOD to appear.
What are the consequences of a PSOD?
A PSOD is the visible result of a kernel panic on the host -- once it occurs, the host crashes, and all services and VMs running on the host are terminated. The VMs don't get a chance to shut down gracefully, but are instead powered off abruptly. If, however, the host is part of a high availability cluster, the VMs are automatically restarted on other hosts in the cluster.
A PSOD not only causes an outage while the VMs are unavailable, but the abrupt shutdown can also affect critical applications, such as database servers, backup jobs, message queues and other services. For example, if the host is part of a virtual storage area network (vSAN) cluster, a PSOD will affect the vSAN as well.
How to deal with a PSOD
The diagnostic message displayed by the purple screen of death provides clues about the problem the host has encountered, which can be very helpful during troubleshooting.
The following steps should be taken when trying to deal with a PSOD:
- Take a screenshot. The diagnostic message displayed inside a PSOD contains helpful information regarding the crash and can be used for troubleshooting. ESXi servers are mostly accessed through remote management tools -- such as Dell's Integrated Dell Remote Access Controller, Hewlett Packard Enterprise's Integrated Lights-Out and Cisco's Integrated Management Controller -- which make taking a screenshot easy. If no remote access is available, physically going to the machine and taking a picture is also an option.
- Restart the host. Sometimes, the easiest way to recover from a PSOD is to reboot the server. Performing this step might prevent complicated troubleshooting later, especially if the underlying issue is simple.
- Contact VMware support. To perform a root cause analysis and to expedite the troubleshooting process, contact VMware support, especially if the organization has a support contract.
- Collect the core dump. Once the server is rebooted, collect the core dump. The core dump, or vmkernel-zdump, is a compressed file that contains logs and offers more detailed information than what is seen on the PSOD to help with further troubleshooting. Even if the cause of the PSOD seems obvious, it's best to confirm by analyzing the core dump. The core dump is especially important for hosts that might be configured to automatically reset after a PSOD occurs, in which case no message is displayed.
- Decode the error message. The error message a PSOD produces provides insight into the actual problem. A PSOD can produce many different error messages, such as "COS Error: Oops," "Lost Heartbeat," "Spin count exceeded (iplLock) - possible deadlock" or "Machine Check Exception: Unable to continue." The VMware website lists known VMkernel messages along with their descriptions.
- Check the logs. If the root cause of the PSOD isn't obvious after taking the aforementioned steps, then look for clues inside the host log files, especially for the time interval directly preceding the PSOD. The logs can also show errors related to add-in cards and other components. For example, they might point to a card that needs to be reseated in its Peripheral Component Interconnect Express slot. In enterprise environments, specialized log management tools, including VMware vRealize Log Insight or SolarWinds Security Event Manager, can be used to review the logs.
- Check overclock settings and clean the heat sink. Occasionally, a PSOD is caused by overclocking of a PC, which can change its hardware clock rate, voltage or multiplier, generating more heat and causing the CPU to become unstable. If a PSOD has occurred for this reason, then it's best to use a dedicated device or software to cool the PC. This can include using a cooling pad or specialized cooling software to disperse the heat faster. GPU malfunctions due to excessive heat can also cause a PSOD, so it's best to clean the device's heat sink regularly.
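The log check described in the steps above, narrowing the host logs to the interval directly preceding the PSOD, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes each log entry starts with an ISO 8601 UTC timestamp ending in `Z`, the sample lines are invented, and the names `parse_timestamp` and `lines_before_crash` as well as the 10-minute default window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(line):
    """Return the leading ISO 8601 UTC timestamp of a log line, or None."""
    token = line.split(" ", 1)[0]
    if not token.endswith("Z"):
        return None
    try:
        # Strip the trailing "Z" and attach UTC explicitly.
        return datetime.fromisoformat(token[:-1]).replace(tzinfo=timezone.utc)
    except ValueError:
        return None


def lines_before_crash(lines, crash_time, window=timedelta(minutes=10)):
    """Keep only the log lines stamped within `window` before `crash_time`."""
    start = crash_time - window
    selected = []
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is not None and start <= stamp <= crash_time:
            selected.append(line)
    return selected


# Invented sample lines; real log entries will look different.
SAMPLE_LOG = [
    "2024-03-01T09:40:00.000Z cpu1:2097152)example: routine message",
    "2024-03-01T09:55:12.250Z cpu3:2097200)example: device reported an error",
    "2024-03-01T09:59:58.900Z cpu3:2097200)example: second error before the crash",
    "continuation line without a timestamp",
]

if __name__ == "__main__":
    crash = datetime(2024, 3, 1, 10, 0, 0, tzinfo=timezone.utc)
    for entry in lines_before_crash(SAMPLE_LOG, crash):
        print(entry)
```

Lines without a recognizable timestamp are skipped rather than guessed at, which keeps the filter conservative. A wider window can be passed in when the first pass turns up nothing useful.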
How to prevent a PSOD
At times, diagnosing the root cause of a PSOD can be challenging and frustrating. Therefore, the best defense against a PSOD is to prevent it from happening by taking a few precautionary measures.
The following items can help minimize or mitigate the occurrence of a PSOD:
- Patches. Typical PSOD issues can be resolved by performing regular patch management to ensure all software and apps are updated to their latest versions.
- Drivers and firmware. Faulty drivers can often be the culprit behind a PSOD. It's important to keep both drivers and firmware up to date by regularly checking the vendor's website for new releases. If a vendor has documented that a driver causes PSODs, then that driver should be upgraded as soon as possible.
- VMware's hardware compatibility list (HCL). Ensure that the host server, along with all the other devices and hardware, is on the VMware HCL. This protects against unexpected hardware-related issues, and best of all, VMware support is available when a PSOD occurs due to hardware errors.
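As a rough illustration of the patch management point above, the Python sketch below flags hosts whose installed build is older than a chosen minimum. Everything in it is hypothetical: the host names, the build numbers, the `MIN_BUILD` threshold and the function name `hosts_needing_patch` are invented for the example, and it assumes all hosts run the same ESXi release line, so that a higher build number means a newer patch level.

```python
# Hypothetical inventory: host name -> installed ESXi build number.
HOST_BUILDS = {
    "esx-01.example.com": 8169922,
    "esx-02.example.com": 7967591,
    "esx-03.example.com": 8294253,
}

# Hypothetical minimum build that an organization treats as "patched".
MIN_BUILD = 8169922


def hosts_needing_patch(host_builds, minimum_build):
    """Return the sorted names of hosts running a build below the minimum."""
    return sorted(
        host for host, build in host_builds.items() if build < minimum_build
    )


if __name__ == "__main__":
    for host in hosts_needing_patch(HOST_BUILDS, MIN_BUILD):
        print(f"{host} is below build {MIN_BUILD}")
```

In practice, the inventory would come from whatever management tooling is already in place. The comparison itself stays this simple only while the same-release-line assumption holds.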