Captain Fix-It
“One of my greatest sysadmin moments happened almost a decade ago. I was adding redundancy to our HP RX8600 server by installing two cells for RISC9000. Although it was only a small change, the next day our DBA reported ‘fuzzy checkpoints,’ and write operations were taking a minute rather than a few seconds.
Our provider's technicians suggested an OS patch, which, as I had suspected it would, resulted in a major failure and a severe outage that left the business down for days. We eventually found a readable backup and restored the system, but the slowness persisted.
I remembered that the rather large cell boards had been slid in through the back of the chassis, dangerously close to the SAN fiber cables. When I fiddled with the cables on a hunch, our DBA burst into the data center exclaiming that performance had bounced back. Now I knew it all boiled down to a problem with the fiber, and I replaced the cable immediately. On close inspection, I could see a barely visible crush spot on the cable, not severe enough to cause errors to be reported, but enough to cause packets to retransmit.
At the end of the day, I had recovered from a potentially disastrous payroll system crash and solved the database performance issue, and everyone got paid on time!”