LUCHS.AT - System Administration - Ghosts Inside the Shell

The fortune cookie collection claims that hardware consists of the parts that can be kicked. This is especially true when something fails. Apart from the physical parts there is often firmware involved, which can fail as well (algorithms are human, they have stress, too). We have two stories for you involving failed hardware.

Using redundant arrays of independent disks (RAIDs) sounds like a good idea: keep plenty of copies of your data and have fewer worries. That's the idea on the surface. Look beneath it and you'll find that mirroring data mirrors the deletion of data equally well. Then there are more complex RAID algorithms that use parity and checksums to reconstruct lost data from redundant information. Complexity is bad: if only the firmware knows where your data is, you probably won't be able to find it in an emergency.
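To make the parity idea concrete, here is a minimal sketch of XOR parity, the principle behind single-parity RAID levels. It is an illustration only: the block contents and the helper name `xor_blocks` are made up for this example, and real controllers add striping layouts and metadata that are not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three hypothetical data blocks belonging to one stripe
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Pretend the second block is lost: XOR the survivors with the parity block
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

The reconstruction only works when exactly one block is missing and all the others are intact. If a surviving block is silently wrong, the "rebuilt" data is wrong as well, and nothing in the arithmetic notices.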
And then there is silent data corruption. A combination of faulty firmware and faulty hardware can destroy your file system(s) without warning. This happened to a logical volume spanning two RAID1 mirrors. There were no errors in the logs of the Linux kernel, the RAID controller or the server BIOS. Instead, the Linux kernel got I/O errors when accessing the RAID1 containers, yet no disk was marked as faulty and no RAID volume was marked as degraded. Finally, the JFS file system on the volume suffered a catastrophic failure and could not use its transaction log after a hard reboot of the locked-up server. A post-mortem analysis of the file system and the hardware yielded no indication of the cause.
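Since neither the controller nor the disks reported anything in a case like this, one way to notice silent corruption is to keep checksums of your own and compare them regularly. The following is a minimal sketch under stated assumptions, not the method used in the incident above: the function names are hypothetical, and it only makes sense for files that are not supposed to change between two runs.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map every regular file below root (relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old, new):
    """Return the files present in both manifests whose digests differ."""
    return sorted(name for name in old.keys() & new.keys() if old[name] != new[name])
```

A run would store the result of `build_manifest()` somewhere off the volume and later pass the stored and the fresh manifest to `changed_files()`. A differing digest is only a hint: it cannot tell corruption apart from a legitimate modification, so the comparison is most useful for archives, backups and other data that should stay static.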

A different case was presented by a GNU/Linux router/firewall system. The hardware was a Mini-ITX board with three network interface cards, 1 GB RAM and crypto-acceleration in the CPU. The system worked flawlessly for over two years until the machine froze spontaneously during operation. The console stayed black, and neither input nor a reset by keyboard was possible. The network interface cards were not reachable either. Logs on the system showed no entries around the time of the freeze. Timestamps on the file system and files with 0 bytes indicated that the crypto-acceleration might have been in use at the time of the failure. After the firewall system was rebooted, a few Netfilter rules stopped working (about 3 out of 500+), including the NAT rules for SIP packets on port 5060/UDP. One NAT rule could be "repaired" by switching the IP address of one server in the DMZ.
After replacing the hardware and loading the same set of rules on a different system, all rules worked again, including the NAT rule in its original form from before the server address was changed.
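With more than 500 rules, finding the handful that have stopped matching is tedious by hand. One possible aid, sketched below, is to read the per-rule counters that `iptables-save -c` prints as a `[packets:bytes]` prefix and list the rules that have not seen a single packet. This is an assumption about how one might investigate, not what was done in the case above; the function name and the sample dump (including the address 192.0.2.10 from the documentation range) are invented for illustration.

```python
import re

# Rule lines in "iptables-save -c" output look like: [packets:bytes] -A CHAIN ...
RULE_LINE = re.compile(r"^\[(\d+):(\d+)\]\s+(-A\s+.*)$")


def rules_without_hits(dump):
    """Return (table, rule) pairs whose packet counter is zero."""
    table = None
    idle = []
    for line in dump.splitlines():
        line = line.strip()
        if line.startswith("*"):
            # "*nat", "*filter", ... marks the start of a table section
            table = line[1:]
            continue
        match = RULE_LINE.match(line)
        if match and int(match.group(1)) == 0:
            idle.append((table, match.group(3)))
    return idle


# Invented sample; on a real system the text would come from "iptables-save -c"
sample = """\
*nat
:PREROUTING ACCEPT [10:600]
[0:0] -A PREROUTING -p udp -m udp --dport 5060 -j DNAT --to-destination 192.0.2.10
[42:2520] -A POSTROUTING -o eth0 -j MASQUERADE
COMMIT
"""

for table, rule in rules_without_hits(sample):
    print(table, rule)
```

A zero counter only says that no packet has matched the rule since the counters were last reset. It is a starting point for a closer look, not proof that the rule is broken.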

The hardware in question still needs to be examined in depth. Regardless of the results, you cannot trust any component of your infrastructure without regular maintenance.

Ghosts Inside the Shell - Hardware Failures