Even seasoned Sysadmins can have epic fails
And this was mine. It was a bit frustrating – well, a lot frustrating. I managed to totally bork my primary workstation while trying to perform some hardware upgrades along with a restructuring of my storage configuration. The story is a bit long and consists of several intersecting events that took place over a period of weeks.
I have been working with computers for over 50 years and using Linux for almost 25. I should have known better.
Installing the first SSD
It started when I began migrating my primary workstation to SSDs. You can read the long story of that here, but this is the short version.
Having noticed that my System76 Oryx Pro laptop, with its SSDs, booted much faster than my primary workstation, I decided to convert at least one of my 4 internal hard drives to SSD.
I had previously purchased an Intel 512GB m.2 NVMe SSD for a customer project that was cancelled. I ran across that SSD while looking through my few remaining hard drives. Did I mention that my laptop boots really, really fast? And my primary workstation did not.
I have also wanted to do a complete Fedora reinstallation for a few months because I have been doing release upgrades since about Fedora 21. Sometimes doing a fresh install to get rid of some of the cruft is a good idea. All things considered, it seemed like a good idea to do the reinstall of Fedora on the SSD.
I installed the SSD in one of the two m.2 slots on my ASUS TUF X299 motherboard and installed Fedora on it, created vg01 to fill the entire device, and placed all of the operating system and application program filesystems on it, /boot, /boot/eufi, / (root), /var, /usr, and /tmp. I chose not to place the swap partition on the SSD because I have enough RAM that the swap partition is almost never used. Also, /home would remain on its own partition on an HDD.
The installation went very smoothly. After this I ran a Bash program I wrote to install and configure various tools and application software. That also went well – and fast – very fast.
And my workstation booted and ran much faster.
Then, a few weeks ago, my primary display, a Dell with 2560×1600 resolution failed. It had started blanking out – going totally dark – for a few seconds and progressing to longer and more frequent blackouts. Until it blacked out and never recovered.
I purchased a new LG 32″ display with a maximum 3840×2160 resolution. The high res failed with my 10 year old graphics adapter so I had to purchase a new Sapphire Radeon 11265-05-20G to drive it. Then I had to reconfigure my desktop and apps to deal with the HiDPI display so I could read everything.
This problem did not directly affect how or why I borked my workstation, but it was one of several things happening at that time.
The second SSD
A few weeks after I performed the initial migration I decided to install another M.2 SSD in the second slot on my motherboard. I wanted to do this to speed access to my /home directory which was still located on an HDD. Also, I could then move swap to the first SSD which still had lots of room and then remove the HDD which would be empty.
I have an APC UPS which tells me how many Watts of power are being consumed and I was surprised at how much difference it made to move from HDD to SDD devices. Although a bit fuzzy, I estimate that I save about 20 (continuous) watts per device, which works out to about 480Watt-hours per day per device.
I moved my home directory to the new SSD which was created as vg02, turned off swap, deleted the old swap volume, and created a new 10GB swap volume on the original SSD on vg01 because there was still plenty of space there.
I had to changed the entry in /etc/fstab to reflect the new locations for those two logical volumes.
/dev/mapper/vg02-home /tmp ext4 discard,defaults 1 2 /dev/mapper/vg01-swap none swap discard,defaults 0 0
I turned swap back on and all was good – until I rebooted. The startup sequence – when systemd takes over – locked up at about 2.6 seconds after starting. A bit of investigation showed that the /etc/defaults/grub local configuration file still contained a reference to the old swap location in the Linux kernel option line.
I changed that line to the following:
GRUB_CMDLINE_LINUX="resume=/dev/mapper/vg01-swap rd.lvm.lv=vg01/root rd.lvm.lv=vg01/swap rd.lvm.lv=vg01/usr"
I then ran the following command to recreate the grub2 configuration file.
# grub2-mkconfig > /boot/grub2/grub.cfg
I rebooted and all was well.
A bit of additional testing resulted in significantly improved times for applications to load data from my home directory which was the whole idea.
The reboot I did of my workstation after making the volume changes is always a part of my testing procedures. Any time I make a change that affects the runtime or startup configuration of the operating system I always perform a reboot to verify that none of my changes have caused problems with boot and startup. In this case it had and I was able to fix it immediately.
You do have a standard testing procedure that you use after making changes – right?
The third SSD
By this time I had one more volume located on a hard drisk that I wanted to move to an SSD to improve performance. I have over 20 virtual machines that I use for testing various Linux distributions and releases. They would still load and run fairly slowly because they were on the HDD. So I purchased a SATA, 2.5″ SDD because I was out of M.2 PCIe slots on my motherboard.
The installation was as easy as any SATA device and I created a logical volume on which I could store my virtual machines. After moving the VMs to the new volume, a little testing showed significantly improved speeds.
So after all of those changes I decided to move some other files around and restore some older ones from an old backup just so I could have them on-line again.
I needed to change the ownership of some of the restored files. I entered the command but mistyped something and I managed to run chown on most of the files in /usr, /var, /bin, and more.
A bit of fussing failed so I reinstalled but then my home directory would prevent me from logging in with a permissions error. Re-copying and changing permissions did not work. So I did another reinstall and intentionally wiped my /home volume. After running my post-install script and restoring from the most recent backup I was up and running again.
I got so caught up in making all these changes that I just neglected to verify the correctness of the command I typed. It happened to me and it can happen to you.
I learned from this, as I do from all of my mistakes. That is all we can do; fix the self-inflicted problem and learn from it so we don’t do it again. At least not any time soon. ;-)