SUSE Linux Enterprise troubleshooting: Fixing boot problems by repairing a broken initrd

Having a problem booting SUSE Linux Enterprise on your server? It may be that you have a problem with your initial RAM disk. Expert Sander van Vugt troubleshoots the problem with this step-by-step guide to fixing a broken initrd.

Most of the time, your Linux server runs fine, maybe giving you some minor problems here and there. But one day you restart the system and find a critical error in the boot process. A typical case: after you install an updated driver, the initial RAM disk, or initrd, is broken and the server reboots into a "kernel panic" message. Let's find out what we can do to resolve a problem like this.

Fixing a broken initrd

The initrd is a temporary file system that the kernel uses during boot to load the drivers it needs immediately. If the initrd is broken, the boot fails, so to get your system running again you'll need to correct this error.

Step 1: Determining what is wrong

Troubleshooting is a near-science by itself on which I could spend many articles, but I'll try to keep it brief. During the system boot procedure, several phases occur, starting in GRUB, the Linux boot loader. Roughly, these are the following:

  1. GRUB loads the kernel.
  2. GRUB loads the initrd.
  3. The root file system is accessed by the kernel.
  4. The /sbin/init process takes over.
  5. The initial boot stage happens.
  6. The default runlevel is activated.
  7. A login prompt appears.

When a problem occurs, try to pinpoint it to one of these seven phases. In some cases you can tell exactly what happened; more often you can only give a rough indication. In the case of a kernel panic, you can be sure of one thing: GRUB has loaded successfully, and you have not yet reached phase 4 of the boot procedure, where the init process takes over. If a kernel panic occurs immediately after a driver installation, it is often caused by an error in the initrd.

How can we be sure? Sometimes it is quite obvious that the error is in the initrd, because GRUB tells you that it failed to load the file /boot/initrd. In other cases some forensic work is needed, because only a vague driver error message is generated. In the latter case, you have to check whether the failing driver is included in the initrd, because the kernel uses this helper file to load drivers that are needed immediately. On SUSE Linux Enterprise, the file /etc/sysconfig/kernel contains a list of all drivers that should be included in the initrd. When you run the mkinitrd command, these drivers are written to your new initrd. When this happens automatically, for example as part of a driver installation, something could go wrong.

Step 2: Fixing it

If an error occurs in the initrd, you will not be able to boot your server anymore. To fix it, you need the rescue system that is available on the installation DVD. This rescue system loads a complete Linux system from the installation media. The next step is to mount all of your server's Linux file systems from its hard disk. After that, you need to run mkinitrd. You can only do this once the local file systems are all mounted, because the initrd has to be written to the local file systems. However, there is a caveat.

The caveat lies in access to the disk devices, combined with the necessary use of a chroot environment. To start, you need to mount your server's file systems on a temporary mount point like /mnt. Let's say that you have the /boot directory on /dev/sda1 and your / directory on /dev/sda2. To mount them, you need the following two commands:

  1. mount /dev/sda2 /mnt
  2. mount /dev/sda1 /mnt/boot

Since the mkinitrd command wants to write the new initrd to /boot, and the /boot on your hard drive is now at /mnt/boot, you need to change the root directory to /mnt. You can use chroot to do that:

  chroot /mnt

The contents of /mnt now become /, so all path references are OK. But we still have a problem. If you look in the /proc and /dev directories in your new root environment, you'll see that /proc is empty and /dev is as good as empty. Both are dynamically created file systems that are populated at the moment your server boots. This means that they were created under / when the server booted from the rescue media. Now that the new root is /mnt, you cannot access them anymore. We need to fix this.

  1. Type exit to leave the chroot environment. You are now back in the rescue system, with your server's local file systems still mounted under /mnt.
  2. Use mount -t proc none /mnt/proc to make the proc file system available in the /mnt environment.
  3. Use mount -o bind /dev /mnt/dev to make the original /dev, which was filled by the udev process when the rescue system booted, available at /mnt/dev.
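Before going on, it can help to confirm that the two mounts took effect. The following is a minimal sketch, not part of the original procedure: the function name chroot_ready is made up for illustration, and the test for /dev (the directory is not empty) is only a rough indicator, since a /dev directory on disk may already hold a few static entries.

```shell
#!/bin/sh
# chroot_ready is a hypothetical helper, not a SUSE tool.
# $1: mount point of the server's root file system (defaults to /mnt).
# Returns 0 if proc and dev look populated below that mount point.
chroot_ready() {
    target="${1:-/mnt}"

    # A mounted proc file system always provides a "mounts" entry.
    if [ ! -e "$target/proc/mounts" ]; then
        echo "$target/proc does not look like a mounted proc file system" >&2
        return 1
    fi

    # Rough check only: a bind-mounted /dev should contain device entries.
    if [ -z "$(ls -A "$target/dev" 2>/dev/null)" ]; then
        echo "$target/dev is empty or missing" >&2
        return 1
    fi

    return 0
}
```

From the rescue system you would call it as chroot_ready /mnt and continue only if it returns without an error message.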

Now that you have the repair environment in place, you need to check that the line in /etc/sysconfig/kernel that is used to generate a new initrd is as it should be. As long as you are outside the chroot environment, you will find this file at /mnt/etc/sysconfig/kernel. You are looking for the following line:

INITRD_MODULES="ata_piix processor thermal fan jbd ext3 dm_mod edd pciback"

This line will be different on every server, so check that it includes all modules that are necessary to start your server. Your server's documentation will help you with that.
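To make that check less error-prone, the module list can be read with a small shell helper. This is a sketch under assumptions: the function names list_initrd_modules and module_is_listed are invented for this example, and the code expects INITRD_MODULES to sit on a single line in double quotes, as in the line shown above. It inspects the configured list only, not the contents of an existing initrd.

```shell
#!/bin/sh
# Hypothetical helpers for reading INITRD_MODULES; not part of SUSE's tools.

# Print the configured modules, one per line.
# $1: path to the sysconfig file (defaults to /etc/sysconfig/kernel).
list_initrd_modules() {
    conf="${1:-/etc/sysconfig/kernel}"
    # Extract the text between the double quotes, then split on spaces.
    sed -n 's/^INITRD_MODULES="\(.*\)"[[:space:]]*$/\1/p' "$conf" |
        tr -s ' ' '\n'
}

# Succeed if module $1 is listed in the file $2.
module_is_listed() {
    # -F: fixed string, -x: whole line must match, -q: no output.
    list_initrd_modules "$2" | grep -Fqx -- "$1"
}
```

From the rescue system, outside the chroot, a call such as module_is_listed ext3 /mnt/etc/sysconfig/kernel tells you through its exit status whether ext3 is in the list.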

Now under /mnt you have the complete environment that is needed to repair your server, so take the following two steps to fix your server.

  1. Change to /mnt using cd /mnt and make it your new root environment using chroot . (the trailing dot refers to the current directory).
  2. Issue the command mkinitrd to write the new initrd to /boot.

You have now fixed the initrd. Reboot your server and check that everything is working all right.
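
For reference, the whole sequence described above can be condensed into a short list of commands. The sketch below only prints those commands so that you can review them first; it does not run anything. The function name print_repair_commands is made up for this example, the device names are whatever you pass in, and the final line uses chroot /mnt mkinitrd as a one-line equivalent of entering the chroot environment and then running mkinitrd.

```shell
#!/bin/sh
# print_repair_commands is a hypothetical helper for illustration only.
# $1: device holding /        (for example /dev/sda2)
# $2: device holding /boot    (for example /dev/sda1)
# $3: temporary mount point   (defaults to /mnt)
print_repair_commands() {
    if [ "$#" -lt 2 ]; then
        echo "usage: print_repair_commands ROOT_DEV BOOT_DEV [MOUNT_POINT]" >&2
        return 1
    fi
    root_dev="$1"
    boot_dev="$2"
    target="${3:-/mnt}"

    # Mount the server's file systems below the temporary mount point.
    printf 'mount %s %s\n' "$root_dev" "$target"
    printf 'mount %s %s/boot\n' "$boot_dev" "$target"
    # Make proc and the rescue system's /dev visible inside the new root.
    printf 'mount -t proc none %s/proc\n' "$target"
    printf 'mount -o bind /dev %s/dev\n' "$target"
    # Run mkinitrd with the mount point as its root directory.
    printf 'chroot %s mkinitrd\n' "$target"
}
```

With the layout used in this article, print_repair_commands /dev/sda2 /dev/sda1 prints the five commands for /mnt. Check INITRD_MODULES before you run the last one.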

Now you know how to fix your server when the initrd fails. This information is most helpful for boot problems after major system modifications. Keep this information handy so you'll be able to apply a quick fix to your server if and when something goes wrong during an upgrade.

Troubleshooting a server can be cumbersome. Like battling a computer hydra, when you think you've solved one problem, a couple of new ones spring up. Good luck.

About the author:
Sander van Vugt is an author and independent technical trainer who has specialized in Linux since 1994. Van Vugt is also a technical consultant for high-availability (HA) clustering and performance optimization, as well as an expert on SUSE Linux Enterprise Desktop 10 (SLED 10) administration.
