The Case of the Borked HDD

It all started this morning when I went to start my regular Saturday yoga session.

I have a room at home with all my yoga paraphernalia and a desktop computer I use to stream meditative music. After years of service in that room, the computer was displaying hard disk errors when I turned on the display. So I rebooted, or rather tried to. The system just hung after the UEFI BIOS with only a single dot at the upper left of the screen. I’d never seen a symptom like that before, but, based on the disk errors, I decided to start my problem determination right there.

I moved the system into my home lab, connected it to a KVM, and tried to boot it again, with the same result. The BIOS showed that only one of the two 80GB HDDs in the system was being detected. I determined which one it was and removed it for further testing.

I have an external docking station for SATA drives on my primary workstation, so I installed the defective drive in that. The dmesg command showed only a single line, with none of the usual disk status information. It was truly borked.
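For comparison, this is roughly how I check a newly docked drive; a healthy disk logs its model, capacity, and partition table. The device name /dev/sdb and the smartctl call are assumptions for illustration, not from my actual session:

```shell
# A healthy docked drive produces several kernel lines; the dead one gave a single line.
dmesg | grep -i 'sd[a-z]' | tail -n 20      # kernel's view of recently attached disks
lsblk -o NAME,SIZE,MODEL /dev/sdb           # should list the drive and its partitions
smartctl -H /dev/sdb                        # SMART health verdict (needs smartmontools, usually sudo)
```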

At this point I turned back to the system and turned it on, not knowing what to expect. It actually started to boot. It got all the way to a point where it needed access to the /var filesystem — which was located on the defective drive along with /home — and dropped into recovery mode.
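That drop into recovery mode is standard systemd behavior: any fstab entry without the nofail option is treated as required for boot, so a missing /var or /home device halts the boot. A hypothetical fstab sketch (the UUIDs are placeholders, not from my system) shows the distinction:

```ini
# /etc/fstab sketch -- placeholder UUIDs for illustration only
UUID=aaaaaaaa-0000  /      ext4  defaults         1 1
UUID=bbbbbbbb-0000  /var   ext4  defaults,nofail  0 2   # nofail: boot continues if the disk is gone
UUID=cccccccc-0000  /home  ext4  defaults,nofail  0 2
```

Whether you actually want nofail on /var is a judgment call; many services misbehave without it, so dropping to recovery mode can be the safer default.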

I removed the second 80GB HDD and installed a 160GB HDD, reinstalled Fedora, and all is now well.

I tested the working drive to ensure it was still OK, and tossed the defective drive in the bin I use to keep recyclable electronics until I can make the trip to the local collection point.

Thoughts

That was one weird symptom. I’ve never seen a problem with a non-root drive cause the system to hang immediately after the UEFI BIOS. But the clues to the root cause were there on the system console and in the UEFI BIOS.

I always keep old parts that I scrounge from systems that have other problems and aren’t worth fixing. These are usually old computers that friends want me to look at in case they can be repaired. I salvage what I can verify still works — like memory and storage devices — and recycle the rest.

The other thought I had is that it pays to monitor systems while they are working so that you know what a healthy one looks like. That means regular use of tools like dmesg, lsblk, top and all its spin-offs, journalctl, the system console, and the rest. It’s much easier to recognize failure when you know what proper functioning looks like.
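A quick baseline sketch along those lines; run something like this while the system is healthy and keep the output for comparison. The log path is an example and assumes a traditional syslog file exists:

```shell
# Baseline snapshot of a healthy system for later comparison
dmesg | tail -n 50                           # recent kernel messages
journalctl -p err -b --no-pager              # error-priority journal entries from this boot
lsblk -o NAME,SIZE,TYPE,MOUNTPOINTS          # block-device and mount layout
grep -c 'I/O error' /var/log/messages 2>/dev/null || true  # quick disk-error count, if the file exists
```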
