When I was working on my operating systems project in university, I stayed up all night for weeks feverishly rebooting my test machine, hoping that THIS time my interrupt handler changes worked. I survived on a diet of raisin bagels, baby carrots, and Dr Pepper, and only left the computer lab to shower and sleep for a few hours before stumbling back in and opening up EMACS again. I loved it.
Julia Evans' blog posts about writing an operating system in Rust at Hacker School are making me miss the days when I thought everything about operating systems was magical. Julia is (a) hilarious, (b) totally honest, (c) incredibly enthusiastic about learning systems programming. (See "What does the Linux kernel even do?", "After 5 days, my OS doesn't crash when I press a key", and "After 6 days, I have problems I don't understand at all.") I'm sure somewhere on Hacker News there is a thread getting upvoted about how Julia is (a) faking it, (b) a bad programmer, (c) really a man, but here in the real world she's making me and a lot of other folks nostalgic for our systems programming days.
Yesterday's post about something mysteriously zeroing out everything about 12K in her binary reminded me of one of my favorite OS debugging stories. Since I'm stuck at home recovering from surgery, I can't tell anyone it unless I write a blog post about it.
In 2001, I got a job maintaining the Linux kernel for the (now defunct) Gemini subarchitecture of the PowerPC. The Gemini was an "embedded" SMP board in a giant grey metal VME cage with a custom BIOS. Getting the board in and out of the chassis required brute strength, profanity, and a certain amount of blood loss. The thing was a beast - loud and power hungry, intended for military planes and tanks where no one noticed a few extra dozen decibels.
The Gemini subarchitecture had not had a maintainer or even been booted in about 6 months of kernel releases. This did not stop a particularly enthusiastic PowerPC developer from tinkering extensively with the Gemini-specific bootloader code, which was totally untestable without the Gemini hardware. With sinking heart, I compiled the latest kernel, tftp'd it to the VME board, and told the BIOS to boot it.
It booted! Wow! What are the chances? Flushed with success, I made some minor cosmetic change and rebooted with the new kernel. Nothing, no happy printk's scrolling down the serial console. Okay, somehow my trivial patch broke something. I booted the old binary. Still nothing. I thought for a while, made some random change, and booted again. It worked! Okay, this time I will reboot right away to make sure it is not a fluke. Reboot. Nothing. I guess it was a fluke. A few dozen reboots later, I went to lunch, came back, and tried again. Success! Reboot. Failure. Great, a non-deterministic bug - my favorite.
Eventually I noticed that the longer the machine had been powered down before I tried to boot, the more likely it was to boot correctly. (I turned the VME cage off whenever possible because of the noise from the fans and the hard disks, which were those old SCSI drives that made a high-pitched whining noise that bored straight through your brain.) I used the BIOS to dump the DRAM (memory) on the machine and noticed that each time I dumped the memory, more and more bits were zeroes instead of ones. Of course I knew intellectually that DRAM loses data when you turned the power off (duh) but I never followed it through to the realization that the memory would gradually turn to zeroes as the electrons trickled out of their tiny holding pens.
So I used the BIOS to zero out the section of memory where I loaded the kernel, and it booted - every time! After that, it didn't take long to figure out that the part of the bootloader code that was supposed to zero out the kernel's BSS section had been broken by our enthusiastic PowerPC developer. The BSS is the part of the binary that contains variables that are initialized to zero at the beginning of the program. To save space, the BSS is not usually stored as a string of zeroes in the binary image, but initialized to zero after the program is loaded but before it starts running. Obviously, it causes problems when variables that are supposed to be zero are something other than zero. I fixed the BSS zeroing code and went on to the next problem.
This bug is an example of what I love about operating systems work. There's no step-by-step algorithm to figure out what's wrong; you can't just turn on the debugger and step through till you see the bug. You have to understand the computer software and hardware from top to bottom to figure out what's going wrong and fix it (and sometimes you need to understand quite a bit of electrical engineering and mathematical logic, too).
Updated to add: Hacker News had a strangely on-topic discussion about this post with lots more great debugging stories. Check it out!