Software Troubleshooting - A Case Study

Students often wonder what a system administrator's job is. The question has broader appeal as well: system administration forums seem to have a large share of questions like "I was just made system administrator - what do I do?"

The course you are currently taking covers many (but not all!) of the aspects of the administrator's job: installing, configuring and updating software, managing storage and users, day-to-day operations, and backup. But one very important responsibility of the system administrator is software troubleshooting, and this one is difficult to teach. So you can imagine my delight in having a relatively serious problem during a class session. It was a perfect teachable moment!

Here is a chronology of the experience. I hope that it will give you some idea of how to approach such problems. The only difference between this story and real life is that I did not have a boss breathing down my neck and asking me every 5 minutes when it was going to be fixed!

Note that this incident occurred while the class drive was sda.
  1. I had just led my class through the installation of the RWC Linux distro onto their hard drives. The final step was to reboot and make sure everything worked. They chose the new system from the Legacy Grub multi-boot menu, and were presented with the following error:
    Booting 'lthree'

    root (hd0,10)
    Filesystem type is ext2, partition type 0x83
    kernel --no-mem-option /boot/kernel-2.6.32.full root=/dev/sda11 noacpi nolapic

    Error 2: Bad file or directory type

    Press any key to continue...

  2. My first step was to write down the error and take it at its word. I rebooted from the DVD and checked the info page on grub. Under "Troubleshooting/Stage 2 errors", it explains error 2:
    This error is returned if a file requested is not a regular file,
    but something like a symbolic link, directory, or FIFO.
    
  3. I knew this was not true of the kernel file, but to make sure, I mounted sda11 and ran the stat command on it:
      File: `/boot/kernel-2.6.32.full'
      Size: 3984704         Blocks: 7792       IO Block: 4096   regular file
    Device: 80bh/2059d      Inode: 360345      Links: 1
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2011-01-13 20:34:49.000000000 -0500
    Modify: 2011-01-13 20:34:49.000000000 -0500
    Change: 2011-01-13 20:34:49.000000000 -0500
    
  4. So the file was indeed the correct type. My next step was to try to boot from a different system. I chose another, seemingly identical, Linux system and it booted fine. Moreover, when I mounted sda11 and compared the kernel image that wouldn't boot with the one that did (on sda5), they were identical.

    It was clear that I would get no further with the class that night, and the professor's prerogative allowed me to end class a little early so I could troubleshoot without interruption. I did not have a boss breathing down my neck, but a class waiting on you to fix something can provide a similar sense of pressure.

  5. When I got home, I searched Google for
    grub "Bad file or directory type"
    This returned 29,800 results, but the first was all I needed. In one of the Ubuntu forums, someone described the problem, and the fifth post in the thread included a useful link.

    It explained that legacy grub (version 0.97) assumes that ext2 file systems have 128-byte inodes, whereas mke2fs versions 1.40.5 and later create file systems with 256-byte inodes by default.

    Sometimes searching forums can lead you on wild goose chases. Look for posts that seem to speak with some authority, and ones that include links which claim to explain what causes the problem.
  6. So I checked the man page for mke2fs, and at the bottom it cited the version as 1.41.8. I knew we were running legacy grub, so it seemed like I had the answer. But why did the other system boot flawlessly?
  7. So I e-mailed the person who set up our lab, to ask how the multi-boot systems were installed. It turned out that the new distro was installed on pre-existing file systems, created with an earlier version of mke2fs. This also explained why I had not had the problem at home where I developed the distro: I had built on file systems created using the mke2fs from the previous distro. And I had not had the problem on my office PC because I had installed grub version 2 on it.
  8. There were a number of possible fixes:

    That was the solution which was easiest to implement with the least collateral disturbance.
    I chose to change it in the file name for the kernel image rather than in the "root" directive, in order to emphasize how grub searches for files.
  9. So just before my noon class the next day, and just after another professor was finishing up a different class in the lab, I tried it out. And of course it worked perfectly.
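
The two sanity checks in steps 3 and 4 (is the kernel image really a regular file, and is it identical to the one that boots?) can also be scripted. The following is only a minimal sketch using the Python standard library; the function names are made up for illustration and are not part of any tool mentioned above.

```python
import filecmp
import os
import stat


def is_regular_file(path):
    """Return True if path itself is a regular file (not a link, directory, or FIFO)."""
    # os.lstat does not follow symbolic links, so a symlink is reported as a symlink.
    return stat.S_ISREG(os.lstat(path).st_mode)


def same_contents(path_a, path_b):
    """Return True if the two files are byte-for-byte identical."""
    # shallow=False forces a comparison of the contents, not just of the stat data.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Assuming the two partitions were mounted at the hypothetical mount points /mnt/sda11 and /mnt/sda5, one would call is_regular_file("/mnt/sda11/boot/kernel-2.6.32.full") and same_contents() on the two kernel images. Both calls returning True corresponds to what I saw: the file was fine, so the error message had to be misleading.
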
This experience is a reasonably good illustration of how a system administrator fixes problems. In the vast majority of cases, someone else has beaten you to the problem, and there is some sort of fix available. Very few administrators are ever so close to the bleeding edge that they make the initial discovery of a problem.

But it takes more than searching forums. You have to know your way around the documentation, you have to know all of the commands that can give you the information you need, and you need to have some detailed knowledge of how your system functions.
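
The inode size issue from step 5 is a good example of that kind of detailed knowledge. As a rough sketch (not the method I used at the time), the inode size can be read straight from the ext2 superblock. The offsets below follow the standard ext2 on-disk layout: the superblock starts 1024 bytes into the file system, the magic number 0xEF53 is at byte 56, the revision level at byte 76, and the inode size at byte 88. The function name is hypothetical.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # the ext2 superblock starts 1024 bytes into the file system
EXT2_MAGIC = 0xEF53       # value of the s_magic field for ext2/ext3/ext4


def ext2_inode_size(path):
    """Return the inode size in bytes recorded in the superblock of the file system at path."""
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        sb = f.read(1024)
    if len(sb) < 90:
        raise ValueError("too short to hold an ext2 superblock")
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT2_MAGIC:
        raise ValueError("ext2 magic number not found")
    (rev_level,) = struct.unpack_from("<I", sb, 76)
    if rev_level == 0:
        # Revision 0 file systems always use 128-byte inodes.
        return 128
    (inode_size,) = struct.unpack_from("<H", sb, 88)
    return inode_size
```

Run as root against a partition such as /dev/sda11 (or against an image file), a result of 256 would point to the situation described in step 5, while 128 is what legacy grub expects. The same figure should appear as "Inode size" in the output of tune2fs -l.
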

But now I have a great lesson to teach, and I can also include the various ways to fix (or work around) the problem:

And of course it points out just how handy it is to have a multi-boot system, with an older Linux system available to help in your testing and fixing.


©2011, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.

Please send comments or suggestions to the author.