That one odd sheep: A tale of a bad kernel update

Overview

When managing large numbers of servers, we rely on standardisation and orchestration tools to reliably replicate new systems and manage all the others. This is the classic “cattle vs pets” approach: cattle are the systems that simply perform their duty, and if there are any issues you destroy and re-create them. Pets, on the other hand, are the servers you love and care for, with hand-crafted configurations, and having to rebuild one will cause you lots of tears.

Using SaltStack as our orchestration tool allows us to have known states for servers, as well as installation and build procedures where the same result is repeated over and over without human intervention or the human error of a mistyped command.
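As a simplified illustration (the target pattern here is an example only, not our actual configuration), bringing any server back to its defined state is a single, repeatable command:

salt 'web*' state.highstate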

And this usually works, until you find one. odd. Sheep. 

Across thousands of systems set up in the exact same manner, every now and then there will be one which doesn’t quite look like the others and doesn’t behave like the others. In some instances it’s easiest to destroy it and start again, but in others these outliers represent an opportunity to work out what went wrong and whether improvements can be made to prevent future issues.

A failed kernel update

This particular odd sheep was a CentOS 7 based Virtual Machine (VM), running the latest 7.7 release, and it didn’t have any abnormal changes compared to all the other systems. The build process was the same as that used for the others: SaltStack to build the system, deploy apps and manage ongoing changes to ensure consistency across the fleet.

After an update and reboot this particular VM didn’t return, instead dropping to the dracut emergency shell. Odd, as this was the first time I’d ever encountered the dracut emergency shell, and I hadn’t had to deal with a Linux VM failing to boot in many, many years.

This wasn’t good. The initial warnings indicated that it couldn’t find the root or swap partitions. As these were Logical Volume Manager (LVM) based and quite standard, it was very odd that they simply couldn’t be found. Was it disk corruption? Were they marked inactive by mistake?

The first step was to have a look at what LVM was reporting:

lvm lvdisplay

No results. That’s odd, it’s not seeing any logical volumes at all. The next step was to take a look at the physical volumes instead:

lvm pvdisplay

No results again. At this point it started to dawn on me that this wasn’t going to be a quick fix.
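Had the volumes simply been marked inactive, re-activating them from the emergency shell with something along these lines would normally have been enough to let the boot continue:

lvm vgchange -ay

With nothing visible at all, though, there was nothing to activate.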

Ok, as a last resort, let’s simply see all of the drives which have been presented to the OS:

blkid

No results. blkid operates at a very low level, simply listing block device metadata, so the fact that it couldn’t return any data was highly concerning. It meant that the VM itself wasn’t seeing any disks at all once the initramfs had loaded.
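For comparison, on a healthy VM of the same build, blkid would return something along these lines (device names and values are illustrative only):

/dev/vda1: UUID="…" TYPE="xfs"
/dev/vda2: UUID="…" TYPE="LVM2_member"
/dev/mapper/centos-root: UUID="…" TYPE="xfs"
/dev/mapper/centos-swap: UUID="…" TYPE="swap"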

What is initramfs? 

While this greatly simplifies things, initramfs is the successor to the initrd system, providing the minimum possible set of services to allow the root filesystem to be accessed and the boot to continue.

This is of course a very crude simplification, but the initramfs solves the classic chicken and egg problem: to get access to the root filesystem, you need to load the drivers located on the root filesystem … which are required to mount the root filesystem in the first place.

To get around this issue, there’s a tiny, preloaded set of drivers and tools (the initramfs) so that the system can at least mount the root filesystem, load the full set of drivers from there and continue the boot.
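On a CentOS 7 system the kernel and initramfs pairing is easy to see in /boot. Using the kernel and rescue image from this very VM (the older fallback kernel and the kdump image omitted), it looks roughly like this:

ls -1 /boot/vmlinuz-* /boot/initramfs-*
/boot/initramfs-0-rescue-5e8eb9af2347493a99c3b0b496485b3d.img
/boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img
/boot/vmlinuz-0-rescue-5e8eb9af2347493a99c3b0b496485b3d
/boot/vmlinuz-3.10.0-1062.4.1.el7.x86_64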

For this broken instance, clearly something was missing from these drivers, as it couldn’t mount the root filesystem!

As both GRUB and the initial boot stages at least loaded, it was evident that the VM had a disk attached and that this wasn’t a hypervisor fault at play.

Next step – Rescue Mode

Out of the box, CentOS keeps the last two kernels installed so that you have the ability to fall back in case of error … which is exactly the situation we were in. However, in this instance the previous kernel exhibited exactly the same fault.

Thankfully, CentOS 7 has a basic rescue mode built in. This allowed us to boot the system up using a slightly older kernel version and it loaded without issue. At least this gave me some hope, especially since it validated that the filesystem itself was ok and that the system wasn’t a complete write-off. 

To add a bit of further confusion, the rescue mode also uses an initramfs and had no issues. Why did it work and not the other two kernels? It was becoming a case of finding more questions at each step rather than answers!

Rebuilding initramfs

If my assumption was correct and there were drivers missing from the initramfs, then the first step would be to rebuild it. For nearly all modern Linux systems, this is handled via dracut.

Referencing the CentOS 7 Tips and Tricks Wiki, we can force a rebuild using a single line:

dracut -f /boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img

This runs through and finds the required drivers to get root mounted (or at least it should!) and simply returns once complete. Hoping for an easy win, I rebooted the VM only to discover that the same fault existed. Back to the rescue mode again!
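One thing worth noting: without a kernel version argument, dracut builds for the currently running kernel. When rebuilding an image for a different kernel, as is the case when working from rescue mode, the kernel version is normally passed as a second argument, for example:

dracut -f /boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img 3.10.0-1062.4.1.el7.x86_64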

So that I could see more output, this time I used the -v flag to produce verbose output:

dracut -v -f /boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img

The output showed each of the dracut modules determining what was or wasn’t required, so that dracut knew whether or not to include it. Everything in the process showed clean results, skipping modules where not required and including them where they were. Again, no easy wins here to find a simple show-stopper!

Knowing that dracut does a hardware detection in order to optimise what to load, my next step was to ensure that it was picking up the underlying hypervisor (KVM) correctly. A simple command (also used by dracut) to run is:

systemd-detect-virt

This detected kvm correctly, so it should have known the drivers to load. Just in case there was some other odd mismatch, I also tried the -N flag to disable the Host-Only mode which should have added all additional drivers.
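That attempt would have looked something like:

dracut -v -N -f /boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img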

No luck.

In case there was some rare package corruption or the kernel itself had somehow failed to install cleanly, I ensured that the kernel was reinstalled and flushed the yum cache before proceeding:

yum clean all 
yum reinstall kernel-devel
yum reinstall kernel

Still no luck, the rabbit hole was simply getting deeper and deeper without an end in sight.

Determining the drivers required

To verify what drivers have actually been included, I scanned the initramfs image using:

lsinitrd /boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img

This gives a very verbose output of the files and modules contained within the image, allowing us to see exactly what was bundled in. For example, we can see the library files for LVM loaded: 

-r-xr-xr-x   1 root     root        11328 Nov  4 12:23 usr/lib64/device-mapper/libdevmapper-event-lvm2mirror.so 
-r-xr-xr-x   1 root     root        15664 Nov  4 12:23 usr/lib64/device-mapper/libdevmapper-event-lvm2thin.so 
-r-xr-xr-x   1 root     root        15640 Nov  4 12:23 usr/lib64/device-mapper/libdevmapper-event-lvm2vdo.so 
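As a KVM guest, this VM relies on the virtio drivers to see its disks, so the next check was to filter the same image for those (a check along these lines):

lsinitrd /boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img | grep virtio

Nothing.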

So the virtio drivers were nowhere to be seen in the new image. To confirm what was different, I then inspected the rescue initramfs image:

lsinitrd /boot/initramfs-0-rescue-5e8eb9af2347493a99c3b0b496485b3d.img | grep virtio

The virtio drivers were present:

-rw-r--r--   1 root     root         7744 Apr 21  2018 usr/lib/modules/3.10.0-862.el7.x86_64/kernel/drivers/block/virtio_blk.ko.xz 
-rw-r--r--   1 root     root        12944 Apr 21  2018 usr/lib/modules/3.10.0-862.el7.x86_64/kernel/drivers/char/virtio_console.ko.xz 
drwxr-xr-x   2 root     root            0 Oct 21 15:53 usr/lib/modules/3.10.0-862.el7.x86_64/kernel/drivers/gpu/drm/virtio 
-rw-r--r--   1 root     root        23260 Apr 21  2018 usr/lib/modules/3.10.0-862.el7.x86_64/kernel/drivers/gpu/drm/virtio/virtio-gpu.ko.xz 
-rw-r--r--   1 root     root        14296 Apr 21  2018 usr/lib/modules/3.10.0-862.el7.x86_64/kernel/drivers/net/virtio_net.ko.xz 
-rw-r--r--   1 root     root         8176 Apr 21  2018 usr/lib/modules/3.10.0-862.el7.x86_64/kernel/drivers/scsi/virtio_scsi.ko.xz 
drwxr-xr-x   2 root     root            0 Oct 21 15:53 usr/lib/modules/3.10.0-862.el7.x86_64/kernel/drivers/virtio 
-rw-r--r--   1 root     root         4556 Apr 21  2018 usr/lib/modules/3.10.0-862.el7.x86_64/kernel/drivers/virtio/virtio.ko.xz 
-rw-r--r--   1 root     root         9664 Apr 21  2018 usr/lib/modules/3.10.0-862.el7.x86_64/kernel/drivers/virtio/virtio_pci.ko.xz 
-rw-r--r--   1 root     root         8280 Apr 21  2018 usr/lib/modules/3.10.0-862.el7.x86_64/kernel/drivers/virtio/virtio_ring.ko.xz 

My immediate question was “why were they missing? Why weren’t they included, even when disabling the host-only mode for dracut?”

Firstly, let’s verify they exist for the new kernel version:

ls -lah /usr/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/block/virtio_blk.ko.xz
-rw-r--r--. 1 root root 7.7K Oct 19 03:29 /usr/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/block/virtio_blk.ko.xz
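A broader check across all of the virtio modules for this kernel (something along these lines) told the same story:

find /usr/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel -name 'virtio*'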

Each file existed for the new kernel, so there was no reason why dracut shouldn’t be including them.

Further issues present themselves

While I was focused on the KVM drivers themselves, I’d completely overlooked the fact that all of the kernel modules were missing. This was a classic case of looking for a needle in a haystack while using tools to filter the results, and not noticing that the rest of the haystack was missing.

Looking at it again, I ran a simple file count between the two images:

lsinitrd /boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img | grep "usr/lib/modules/" | wc -l

Result: 12

lsinitrd /boot/initramfs-0-rescue-5e8eb9af2347493a99c3b0b496485b3d.img | grep "usr/lib/modules/" | wc -l

Result: 816

Wowsers. While the rescue image should contain more drivers as it’s not host-specific, 800+ files versus 12 makes it clear there’s a definite problem.
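As an aside, if you want to see exactly which files differ rather than just the counts, the two listings can be compared directly; a rough comparison along these lines does the job:

diff <(lsinitrd /boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img | awk '{print $NF}' | sort) <(lsinitrd /boot/initramfs-0-rescue-5e8eb9af2347493a99c3b0b496485b3d.img | awk '{print $NF}' | sort)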

To get a bit more info from the initial build, I copied the image to a temp directory, then extracted the files:

/usr/lib/dracut/skipcpio initramfs-3.10.0-1062.4.1.el7.x86_64.img  | zcat | cpio -ivd
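For completeness, the surrounding steps were roughly as follows (the extraction path matches the listing further below):

mkdir /tmp/extracted-initramfs && cd /tmp/extracted-initramfs
cp /boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img .
# then run the skipcpio | zcat | cpio pipeline above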

This then allowed me to see which kernel modules dracut thought it was loading, by reading the build file:

less usr/lib/dracut/loaded-kernel-modules.txt

Sure enough, there were 53 modules in the list, including the expected virtio drivers:

ablk_helper
aesni_intel
ata_generic
ata_piix
bochs_drm
cdrom
crc32_pclmul
crc32c_intel
crc_t10dif
crct10dif_common
crct10dif_generic
crct10dif_pclmul
cryptd
dm_log
dm_mirror
dm_mod
dm_region_hash
drm
drm_kms_helper
e1000
fb_sys_fops
floppy
gf128mul
ghash_clmulni_intel
glue_helper
i2c_core
i2c_piix4
iosf_mbi
ip_tables
isofs
joydev
libata
libcrc32c
lrw
parport
parport_pc
pata_acpi
pcspkr
ppdev
sd_mod
serio_raw
sr_mod
syscopyarea
sysfillrect
sysimgblt
ttm
virtio
virtio_balloon
virtio_console
virtio_pci
virtio_ring
virtio_scsi
xfs

Trawling through the extracted files, I found one other file which stood out as being zero bytes:

ls -lah /tmp/extracted-initramfs/usr/lib/modules/3.10.0-1062.4.1.el7.x86_64/modules.dep 
-rw-------. 1 root root 0 Nov  4 15:54 /tmp/extracted-initramfs/usr/lib/modules/3.10.0-1062.4.1.el7.x86_64/modules.dep

The modules.dep file is a list of all the kernel module dependencies, generated so that modprobe and similar commands can determine which modules need to be loaded first. Looking at the source for the dracut kernel module, I could see that it references a locally generated modules.dep to determine which files to include, and with that file being blank, it simply wasn’t including any modules at all!
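For context, a modules.dep file is normally generated by depmod, which on CentOS 7 is shipped as part of the kmod package (a detail which becomes relevant shortly). Regenerating the on-disk copy for a given kernel looks something like:

depmod -a 3.10.0-1062.4.1.el7.x86_64

In this case, though, it was the copy that dracut generates within its own build that was coming out empty.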

Finally, we were at least starting to narrow in on the cause.

Finding the root cause

I first checked that the kernel itself had a clean module dependency file, which at 266K was clearly far from blank:

ls -lah /usr/lib/modules/3.10.0-1062.4.1.el7.x86_64/modules.dep 
-rw-r--r--. 1 root root 266K Nov  1 12:13 /usr/lib/modules/3.10.0-1062.4.1.el7.x86_64/modules.dep

At least from a main kernel perspective, it was able to build and generate a list of dependencies.

Our next step was to determine how dracut performs its module dependency check and why it was failing.

Again, to eliminate any weird install issues I reinstalled dracut:

yum reinstall dracut

The cycle of fun continued, with the same error repeating. As there were a dozen or so other packages installed which could be at fault, I installed an additional yum plugin to verify the installed packages and their integrity:

yum install yum-plugin-verify

To run, you can then simply call:

yum verify

Depending on the size of your installation and the speed of your server, this may take several minutes to complete, and it will provide verbose output for any file which doesn’t match the installed package’s integrity data. While this found a few faults, they weren’t related to the kernel modules, just changed timestamps on a few Apache directories.

Out of absolute desperation, and despite an hour of Googling already, I ran one more search for “initramfs has no modules” and found this answer: https://stackoverflow.com/questions/53607020/initramfs-has-no-modules/53671006#53671006

This was for a completely different distro, but at this stage anything was worth a try. At this point I was happy to accept any miracle, no matter how vague.

Reinstallation of kmod (which handles the management of kernel modules) was as simple as:

yum reinstall kmod

Dracut was of course called again to rebuild the image, and all fingers, toes and any other objects I could find were crossed.
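The rebuild and re-check ran along the same lines as before:

dracut -f /boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img
lsinitrd /boot/initramfs-3.10.0-1062.4.1.el7.x86_64.img | grep "usr/lib/modules/" | wc -l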

Success. 

I couldn’t believe it. We now had 819 kernel modules within the image, in line with exactly what was expected. Just so that I could confirm a standard upgrade was going to work, the simplest test was to reinstall the kernel again:

yum reinstall kernel

Rebooting the VM into the new kernel confirmed it was working exactly as expected, just like hundreds of others with the exact same build and environment. The sheep had returned to the herd.

Conclusion

For those who have made it to the end without nodding off, the obvious question here is why didn’t we just restore from a backup? 

The simple answer is the delay between the updates (and the new kernel) being applied and the actual reboot of the VM. Without knowing the root cause, the restore point at which it would work could be days or even weeks back, which would certainly create a significant delta in the data changed on the VM since.

The desire to find the fault and resolve it of course meant that I simply couldn’t leave it alone anyway. Like many in the IT field, the desire to solve the puzzle sometimes overrides the more cost-effective business logic of simply throwing it away and building it again.

While I now have a better understanding of the initramfs build process and managed to fix the problem, the underlying cause of why or where the kmod package had failed is, and will likely remain, a mystery.

Main Photo by Daan Stevens on Unsplash

Tim Butler

With over 20 years’ experience in IT, I have worked with systems scaling to tens of thousands of simultaneous users. My current role involves providing highly available, high performance web and infrastructure solutions for small businesses through to government departments. NGINX Cookbook author.
