Recovering RDM data with VMware ESX troubleshooting
Accidentally deleting important VMware ESX partitions cost our expert raw device mapping data. But a last attempt with an unlikely solution finally saved the day – and the data.
After deleting partitions from his ESX logical unit numbers (LUNs), our virtualization expert had to restore access to the Virtual Machine File System (VMFS) to access data. In part one of this article, he discussed his workaround to restore access to VMFS. But our expert soon discovered he couldn't restore virtual raw device mapping (RDM) in the same way, which also meant potentially losing a massive amount of valuable historical data. The following article features his workaround for that issue.
As I discussed in part one of this series on restoring access to the Virtual Machine File System, Vizioncore vRanger Pro helped me recover the partitions necessary for VMFS functionality. I also tried to restore my raw device mapping (RDM) the same way, but the restore failed with the following message: "Cannot create file."
Next, I tried restoring the image manually to the ESX host using Vizioncore's vcbRestore file. This also failed because the server message block (SMB/CIFS) mount wasn't working correctly. My approach was to use the following command to restore all of the files:
/tmp/vcbrestore -D -I ./VM_4.tvzc -O /dev/stdout | tar -xvf -
Unfortunately, that did not work either. (Incidentally, at this point one option would have been to move the vcbRestore file to a different logical unit number [LUN] and attempt the restore there, but I didn't have enough LUN space.)
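For reference, one way to carry out that restore-to-another-LUN idea would be to copy the backup archive onto a spare VMFS datastore and run the restore from there, so that neither the source files nor the extracted output depend on the SMB mount. This is only a rough sketch; the datastore name spare_lun is hypothetical:
# cp ./VM_4.tvzc* /vmfs/volumes/spare_lun/   # spare_lun is a hypothetical datastore with enough free space
# cd /vmfs/volumes/spare_lun
# /tmp/vcbrestore -D -I ./VM_4.tvzc -O /dev/stdout | tar -xvf -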
Next, I tried to recreate the Linux logical volume manager (LVM) 2 file system from the ESX host's service console. But that attempt also failed once I rebooted the virtual machine.
# fdisk /dev/sdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only, until you decide to write them. After that, of course, the previous content won't be recoverable.
The number of cylinders for this disk is set to 39162.
There is nothing wrong with that, but this is larger than 1024, and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-39162, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-39162, default 39162):
Using default value 39162
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fb
Changed system type of partition 1 to fb (Unknown)
Command (m for help): w
The partition table has been altered!
The reboot put me in a debug mode where I could fix the problem.
Using a blog post for help
In this case, restoring the partition was not enough to fix the RDM problem. I then logged in as root and proceeded to restore the LVM volumes using the steps found at Recovering an LVM physical volume. Those steps, however, did not get me anywhere, because the instructions call for removing the partition first, which was exactly the wrong move in my situation.
Ultimately, I took steps similar to those suggested in the blog post, except I worked with /dev/sdb1 instead of /dev/sdb.
# mount -o remount,rw /
# cp /etc/lvm/backup/VolGroup03 /tmp
# pvcreate -u fvrw-GHKde-hgbf43-JKBdew-rvKLJc-cewbn /dev/sdb1
Physical volume "/dev/sdb1" successfully created
# vgcreate -v VolGroup03 /dev/sdb1
Wiping cache of LVM-capable devices
Adding physical volume '/dev/sdb1' to volume group 'VolGroup03'
Archiving volume group "VolGroup03" metadata (seqno 0).
Creating volume group backup "/etc/lvm/backup/VolGroup03" (seqno 1).
Volume group "VolGroup03" successfully created
# cp /tmp/VolGroup03 /etc/lvm/backup
# vgcfgrestore VolGroup03
Restored volume group VolGroup03
# vgchange -ay
2 logical volume(s) in volume group "VolGroup03" now active
This attempt at recovering the LVM physical volume should have left me with an LVM2 volume containing an intact file system, but it did not. As we learned in part one of this adventure, if a systematic procedure returns abnormal results, something is wrong.
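For anyone retracing these steps, a quick sanity check after vgchange is to scan the logical volumes and attempt a read-only mount; if no recognizable file system turns up, the recovery has not really succeeded. This is only a sketch; the logical volume name LogVol00 and the mount point are assumptions:
# lvscan
# file -s /dev/VolGroup03/LogVol00   # LogVol00 is a hypothetical logical volume name
# mkdir -p /mnt/check
# mount -o ro /dev/VolGroup03/LogVol00 /mnt/check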
More manual restore work
My second attempt at manual restore work was done from the Microsoft Windows VMware Consolidated Backup (VCB) proxy server (which also acts as my Vizioncore vRanger Pro repository). For this restore, I tried using FileZipper, another tool that comes with vcbRestore. As before, this restore failed.
I then tried installing bsdtar from the GNU Win32 repository. That failed, too. At this point, I began to think that the problem did not lie with VMware ESX or the vRanger Pro tools, but rather with the NT file system (NTFS). To find out whether NTFS was indeed the problem, I found space on a physical Linux machine used as a desktop, mounted the backup location from the VCB proxy on the Linux system over CIFS, and used secure copy (scp) to copy the files.
# mkdir /mnt/backup
# mount -t cifs -o username=******* //vcbproxy/backup /mnt/backup
# scp /mnt/backup/VM_4.tvzc* /iet
Incidentally, I used scp rather than other tools because scp is the right choice when copying virtual machine disk files between ESX hosts. For handling sparse files, it works better than the traditional cp command.
Copying the files to the Linux host took the rest of the night, but once that process had been completed, I was ready to try again.
Restore day three: Eureka!
Since the files were now on the Linux host, I tried the vcbrestore command again. Success!
/tmp/vcbrestore -D -I ./VM_4.tvzc -O /dev/stdout | tar -xvf -
Since the last attempt was successful and I now had a safe copy of the -flat.vmdk and -rdm.vmdk files for the troublesome VM, I knew that the culprits behind my previous failures were NTFS and the SMB/CIFS mount. However, since the virtual RDM had been restored as an ordinary VMDK file, a few small changes to its metadata were required before the VM could use it properly. I had to change two lines in the VM_1.vmdk descriptor file that points to VM_1-rdm.vmdk.
createType="vmfsRawDeviceMap"
changes to
createType="vmfs"
and
RW 584888320 VMFSRDM "VM_1-rdm.vmdk"
changes to
RW 584888320 VMFS "VM_1-rdm.vmdk"
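Those two edits are easy enough to make by hand in a text editor, but they can also be scripted. The following is a sketch that assumes the descriptor file name used above and keeps an untouched copy first:
# cp VM_1.vmdk VM_1.vmdk.orig   # keep a copy of the original descriptor
# sed -i -e 's/vmfsRawDeviceMap/vmfs/' -e 's/VMFSRDM/VMFS/' VM_1.vmdk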
I could now run the VM within either VMware Workstation v6.5 or VMware Server v2, and I tested some of my restoration theories. The VM worked wonderfully in VMware Server v2. It would not have worked on ESX, because my VMFS volume had been created with a block size that cannot handle files larger than 256 GB.
Why vRanger Pro failed
At this stage of the process, I finally came to understand how Vizioncore vRanger Pro backs up virtual RDMs, and thanks to a conversation with a Vizioncore senior engineer, I also started to understand its restoration methods. It turns out that the virtual RDM (vRDM) was larger than 256 GB, so I could not restore it as a VMDK to the VMFS. Why? Because when I set up the VMFS, I did not configure it to handle files that large. That is why vRanger Pro failed to get me out of my dilemma: it restores every RDM as a VMDK, which requires enough free space on a LUN whose VMFS allows files larger than 256 GB (at least in my case).
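If you want to check ahead of time whether a datastore can hold a large restored VMDK, you can query the VMFS volume from the service console and look at its file block size; on VMFS3, a 1 MB block size caps individual files at 256 GB. The following is a sketch with a hypothetical datastore name:
# vmkfstools -P /vmfs/volumes/datastore1   # datastore1 is a hypothetical datastore name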
Restoring LVM and ext3 file systems
Attempting to restore the LVM and Linux ext3 file systems was frustrating. After additional Web research, I found a new tool that, simply put, couldn't do what it said it could.
I then started with a fresh -flat.vmdk file for the VM. I tried additional troubleshooting methods that were also to no avail. Nothing seemed to work. More research and still more trials led me back to the same set of Linux commands I referenced earlier.
At the same time, my search for a successful restore method had become increasingly important: I now needed to access some of my archived data for a customer. The clock was ticking.
Sleeping on it
Sometimes, a good night's sleep is all you need to tackle a problem with a fresh perspective. After sleeping on it, I woke up with the idea of using the memory image taken when the backup was created to gain access to the kernel partition and file system structures that had been in memory and working at the time of the backup.
Here's what happened:
- I tried replacing the .vmsn file with the contents of the .vmss file and rebooting the VM in an attempt to revert to the snapshot. (For some background, the .vmss file is created as part of a Vizioncore vRanger Pro backup and serves the same role as the .vmsn file of a snapshot.) This attempt failed, and no errors were reported.
- Next, I tried to use the .vmss file as a suspend file for the VM. This did not work; I kept getting a msg.checkpoint.resume.fail error that had no explanation. A quick Internet search suggested it might be a .vmsd file issue, so I removed the now-unused snapshot dictionary file. That didn't help either.
- I sent an email to the Vizioncore folks to see whether there was a way to use this image. Several hours after sending the email, however, I had not received a response (not that I expected an immediate reply).
- I also contacted my forensics friends at AccessData, and they mentioned trying FTK Imager. FTK Imager found the LVM data, but nothing within the volume. It is a great tool for forensics work, but it did not help with my recovery (it works with VMDKs where the partition is intact).
Nucleus Kernel Linux saves the day
During my Linux ext3 restore research, I rediscovered Nucleus Kernel Linux. Nucleus runs on Windows, so when I first came across it, I ignored it. But since I was now scraping the bottom of the barrel for solutions, I decided to try it. Nucleus doesn't work directly with virtual machine (VM) disk files, so I installed it on a Windows XP SP3 VM that I keep as a helper VM for situations like this, and I attached the virtual RDM to that VM prior to boot.
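For anyone repeating this, attaching an existing virtual disk to a helper VM before boot amounts to a couple of entries in the helper's .vmx file (or the equivalent steps in the management UI). This is only a sketch; the SCSI slot and the path to the edited descriptor are assumptions:
scsi0:1.present = "TRUE"
scsi0:1.fileName = "/path/to/VM_1.vmdk"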
After a long few days of trial and error, I had finally solved my problem. Nucleus found the missing partition and then it found the files on the partition.
After making a purchase and re-running the Nucleus Kernel Linux tool, I was happily copying data off the virtual RDM to the Linux system. Once completed, I recreated the virtual RDM with the proper file system and restored the missing data. (Just in case problems cropped up, I kept a backup copy of the virtual RDM in VMDK form.)
Takeaways
In my struggle to restore access to VMFS and RDM, I learned several lessons. First, never miss the first clue that something is awry -- if you suddenly find more partitions than you expect to see, stop and consider why this has happened instead of immediately deleting them. Remember that powered-on VMs maintain their partition information in memory, so back them up immediately; virtual RDMs, however, read the raw disk rather than the kernel's in-memory data structures, so back up vRDMs using some form of file copy.
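In practice, a file-level vRDM backup can be as simple as copying the data out of the running guest to another system. The following is a sketch that assumes the vRDM is mounted at /data inside the guest and that backuphost is a reachable machine with enough space:
# rsync -a /data/ backuphost:/backup/vrdm-data/   # /data and backuphost are assumptions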
If you want to restore a virtual RDM, you need a VMFS that can handle a file the size of the final VMDK, which I did not have on ESX; luckily, I had enough disk space on my Linux system. Finally, restoring a VMFS that has lost its partition table is trivial, but restoring a virtual or physical RDM that contains a Linux ext3 file system within an LVM2 logical volume is quite a task. Fortunately, Nucleus Kernel Linux can do this for you.
ABOUT THE AUTHOR: Edward L. Haletky is the author of VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers. He recently left Hewlett-Packard Co., where he worked on the virtualization, Linux and high-performance computing teams. Haletky owns AstroArch Consulting Inc. and is a champion and moderator for the VMware Communities Forums.