Software Troubleshooting - A Case Study
Students often wonder what a system administrator's job is. The question has broader appeal as well: many system
administration forums have a large share of questions like "I was just made system administrator - what do I do?"
The course you are currently taking covers many (but not all!) of the aspects of the administrator's
job: installing, configuring and updating software, managing storage and users, day-to-day operations, and backup. But one very
important responsibility of the system administrator is software troubleshooting, and this one is difficult to teach. So you can
imagine my delight in having a relatively serious problem during a class session. It was a perfect teachable moment!
Here is a chronology of the experience. I hope that it will give you some idea of how to approach such problems.
The only difference between this story and real life is that I did not have a boss breathing down my neck and asking me every 5 minutes when it
was going to be fixed!
Note that this incident occurred while the class drive was sda.
- I had just led my class through the installation of the RWC Linux distro onto their hard drives.
The final step was to reboot and make sure everything worked. They chose the new system from the Legacy Grub multi-boot menu, and were
presented with the following error:
Booting 'lthree'
root (hd0,10)
Filesystem type is ext2, partition type 0x83
kernel --no-mem-option /boot/kernel-2.6.32.full root=/dev/sda11 noacpi nolapic
Error 2: Bad file or directory type
Press any key to continue...
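The "root (hd0,10)" line and the "root=/dev/sda11" argument in this output name the same partition in two notations: legacy grub counts both disks and partitions from zero, while Linux numbers partitions from one. The following is a minimal sketch of that translation. The function name is made up for illustration, and the mapping of hd0 to sda is an assumption that holds only when the BIOS disk order matches the order in which Linux names the disks (as it did here, with the class drive as sda).

```python
import re


def grub_legacy_to_linux(grub_device):
    """Translate a legacy grub name such as '(hd0,10)' to a Linux device name.

    Assumption: grub disk N corresponds to the N-th /dev/sdX disk.
    """
    match = re.fullmatch(r"\(hd(\d+),(\d+)\)", grub_device)
    if match is None:
        raise ValueError(f"unrecognized legacy grub device: {grub_device!r}")
    disk = int(match.group(1))
    partition = int(match.group(2))
    if disk > 25:
        raise ValueError("this sketch only handles sda through sdz")
    # grub counts partitions from 0; Linux counts them from 1.
    return f"/dev/sd{chr(ord('a') + disk)}{partition + 1}"
```

With this sketch, "(hd0,10)" translates to "/dev/sda11" and "(hd0,4)" to "/dev/sda5".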
- My first step was to write down the error and take it at its word.
I rebooted from the DVD and checked the info page on grub. Under "Troubleshooting/Stage 2 errors", it explains error 2:
This error is returned if a file requested is not a regular file,
but something like a symbolic link, directory, or FIFO.
- I knew this was not true, but to make sure, I mounted sda11 and used
the stat command on the kernel file:
File: `/boot/kernel-2.6.32.full'
Size: 3984704 Blocks: 7792 IO Block: 4096 regular file
Device: 80bh/2059d Inode: 360345 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2011-01-13 20:34:49.000000000 -0500
Modify: 2011-01-13 20:34:49.000000000 -0500
Change: 2011-01-13 20:34:49.000000000 -0500
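The same check can be scripted, which is handy when several files have to be examined. This is only a sketch, assuming Python is available on the rescue system. The function name is made up for illustration. It uses os.lstat rather than os.stat so that a symbolic link is reported as a link instead of being followed, since a symbolic link is one of the cases the grub documentation names.

```python
import os
import stat


def describe_file_type(path):
    """Return a short label for the type of `path`, without following symlinks."""
    mode = os.lstat(path).st_mode  # lstat looks at the link itself, not its target
    if stat.S_ISREG(mode):
        return "regular file"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISFIFO(mode):
        return "fifo"
    return "other"
```

For a kernel image like the one in the stat output, the function should return "regular file"; any other answer would have made the grub error message believable.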
- So the file was indeed the correct type. My next step was to try to boot from a different system. I chose another,
seemingly identical, Linux system and it booted fine. Moreover, when I mounted sda11 and compared the kernel image that wouldn't
boot with the one that did (on sda5), they were identical.
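The text does not say which tool was used for the comparison. One possible way to do it, sketched here under the assumption that both partitions are mounted, is to hash each image and compare the digests. The function names are made up for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large kernel image need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """Return True if both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If two images compare as identical but only one boots, the difference has to lie outside the file contents, for example in the file system that holds them.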
It was clear that I would get no further with the class that night,
and the professor's prerogative allowed me to end class a little early so I could troubleshoot without interruption.
I did not have a boss breathing down my neck, but a class waiting on you to fix something can provide a similar sense of pressure.
- When I got home, I searched Google for
grub "Bad file or directory type"
This brought 29,800 results, but the first was all I needed. In one of the Ubuntu forums, someone described the problem,
and the fifth post referenced a useful link.
It explained that legacy grub (version 0.97) assumes that ext2 file systems have 128-byte inodes, whereas
mke2fs versions 1.40.5 and later create file systems with 256-byte inodes by default.
Sometimes searching forums can lead you on wild goose chases. Look for posts that seem to speak with some authority, and ones
that include links which claim to explain what causes the problem.
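A claim like this one can also be checked against the file system itself, because the inode size is recorded in the ext2 superblock. The sketch below reads it directly. It assumes the standard ext2 on-disk layout (superblock at byte offset 1024, little-endian fields), and the function name is made up for illustration. Reading a block device such as /dev/sda11 normally requires root. The tune2fs -l command reports the same value on its "Inode size" line.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # the superblock starts 1024 bytes into the file system
EXT2_MAGIC = 0xEF53


def ext2_inode_size(device_or_image):
    """Return the inode size in bytes recorded in an ext2/ext3/ext4 superblock."""
    with open(device_or_image, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        sb = f.read(1024)
    if len(sb) < 90:
        raise ValueError("too short to contain a superblock")
    (magic,) = struct.unpack_from("<H", sb, 56)       # s_magic
    if magic != EXT2_MAGIC:
        raise ValueError("not an ext2/ext3/ext4 file system")
    (rev_level,) = struct.unpack_from("<I", sb, 76)   # s_rev_level
    if rev_level == 0:
        return 128  # revision 0 file systems always use 128-byte inodes
    (inode_size,) = struct.unpack_from("<H", sb, 88)  # s_inode_size
    return inode_size
```

According to the explanation above, a result of 128 means legacy grub can read the file system, while 256 means it will fail in the way described.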
- So I checked the man page for mke2fs, and at the bottom it cited the version as 1.41.8. I knew we were running legacy grub, so
it seemed like I had the answer. But why did the other system boot flawlessly?
- So I e-mailed the person who set up our lab, to ask how the multi-boot systems were installed. It turned out that
the new distro was installed on pre-existing file systems, created with an earlier version of mke2fs. This also explained
why I had not had the problem at home where I developed the distro: I had built on file systems created using the mke2fs from the previous
distro. And I had not had the problem on my office PC because I had installed grub version 2 on it.
- There were a number of possible fixes:
- have the students do the exercise over again, and use the mke2fs option "-I 128" (capital I, which sets the inode size); but that takes too much class time;
- install grub version 2 and reconfigure; but that takes too much time outside of class, during which the lab is unavailable;
- change the grub menu.lst file to make grub use the copy of the kernel image residing on sda5; this entails changing the
"/boot/kernel-2.6.32.full" to "(hd0,4)/kernel-2.6.32.full".
The last option was the easiest to implement and caused the least collateral disturbance.
I chose to change it in the file name for the kernel image rather than in the "root" directive, in order to
emphasize how grub searches for files.
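The menu.lst change can be sketched as a small text transformation. This is only an illustration: the function name and the sample menu are assumptions. The first stanza is reconstructed from the boot output shown earlier, and the second stanza is entirely made up, there only to show that other entries are left alone. In legacy grub, a file name that begins with a device such as "(hd0,4)" is read from that partition regardless of the "root" directive.

```python
SAMPLE_MENU = """\
title lthree
root (hd0,10)
kernel --no-mem-option /boot/kernel-2.6.32.full root=/dev/sda11 noacpi nolapic

title other
root (hd0,6)
kernel /boot/kernel-2.6.32.full root=/dev/sda7
"""


def point_kernel_at(menu_text, title, old_path, new_path):
    """Replace old_path with new_path on the kernel line of one menu entry.

    Only the stanza whose title matches is changed; all other lines are
    returned exactly as they were.
    """
    out = []
    in_stanza = False
    for line in menu_text.splitlines(keepends=True):
        words = line.split()
        if words and words[0] == "title":
            in_stanza = " ".join(words[1:]) == title
        elif in_stanza and words and words[0] == "kernel":
            line = line.replace(old_path, new_path, 1)
        out.append(line)
    return "".join(out)


fixed = point_kernel_at(
    SAMPLE_MENU,
    "lthree",
    "/boot/kernel-2.6.32.full",
    "(hd0,4)/kernel-2.6.32.full",
)
```

Here the "root (hd0,10)" line stays as it is, and only the kernel file name in the "lthree" entry gains an explicit device.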
- So just before my noon class the next day, just after another professor had finished a different class in the lab,
I tried it out. And of course it worked perfectly.
This experience is a reasonably good illustration of how a system administrator fixes problems. In the vast majority of cases,
someone else has beaten you to the problem, and there is some sort of fix available. Very few administrators are ever so
close to the bleeding edge that they make the initial discovery of a problem.
But it takes more than searching forums. You have to know your way around the documentation, you have to know all of the
commands that can give you the information you need, and you need to have some detailed knowledge of how your system functions.
But now I have a great lesson to teach, and I can also include the various ways to fix (or work around) the problem:
- You can use the DVD to boot from the hard drive;
- you can edit the grub menu live and boot the edited version (which is how I tried the fix);
- or you can edit the menu permanently on the partition in question (which is what my students will do).
And of course it points out just how handy it is to have a multi-boot system, with an older Linux system available to
help in your testing and fixing.
©2011, Kenneth R. Koehler. All Rights Reserved. This document may be freely reproduced provided that this copyright notice is included.
Please send comments or suggestions to the author.