Ghosts Inside the Shell - Hardware Failures

The fortune cookie collection claims that hardware consists of the parts that can be kicked. This is especially true when something fails. Apart from the material, there is often firmware involved, which can also fail (algorithms are human, they have stress, too). We have two stories for you involving failed hardware.
Using redundant arrays of independent disks (RAIDs) sounds like a good idea: keep plenty of copies of your data and have fewer worries. That's the idea on the surface. Below the surface you'll find that mirroring data mirrors the deletion of data equally well. Then there are more complex RAID algorithms that use parity and checksums in order to deduce lost data from the redundant information. Complex is bad, and if only the firmware knows where your data is, you probably won't in an emergency.
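To make the parity idea concrete, here is a minimal sketch of XOR parity, the basic technique behind single-parity RAID levels. It is an illustration only: the function names `xor_blocks` and `reconstruct` are made up for this example, the "disks" are short byte strings, and real controllers add striping, rotating parity and on-disk metadata on top of this.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple per byte position across all blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three hypothetical data "disks" plus one parity block.
data = [b"disk", b"fail", b"oops"]
parity = xor_blocks(data)

# Pretend the second disk died; the remaining two plus parity restore it.
recovered = reconstruct([data[0], data[2]], parity)
assert recovered == data[1]
```

Note what the sketch does not cover: it recovers exactly one lost block, and only if you know which block belongs to which stripe. That layout knowledge is precisely what lives in the firmware.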
A different case was presented by a GNU/Linux router/firewall system. The hardware was a Mini-ITX board with three network interface cards, 1 GB RAM and crypto-acceleration in the CPU. The system worked flawlessly for over two years until the machine froze spontaneously during operation. The console stayed black, and neither input nor a reset by keyboard was possible. The network interface cards were not reachable either. Logs on the system showed no entries around the time of the freeze. Timestamps on the file system and files with 0 bytes indicated that the crypto-acceleration might have been in use at the time of the failure. After rebooting the firewall system, selected Netfilter rules stopped working (about 3 out of 500+), including the NAT rules for SIP packets on port 5060/UDP. One NAT rule could be "repaired" by switching the IP address of one server in the DMZ. The hardware in question still needs to be examined in depth. Regardless of the results, you cannot trust any component of your infrastructure without regular maintenance.
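One small piece of such maintenance can be sketched in code: comparing the rule set you expect with the one that is actually loaded. The example below is an assumption-laden illustration, not the procedure used on the system described here. It assumes both rule sets are available as text in the format written by `iptables-save` (without packet counters); the helper names, interface name and addresses are hypothetical. It only detects rules that are missing or altered, so it would not catch a rule that is loaded correctly but no longer matches traffic.

```python
def parse_rules(dump):
    """Return the set of (table, rule) pairs found in iptables-save style text."""
    rules = set()
    table = None
    for raw in dump.splitlines():
        line = raw.strip()
        if line.startswith("*"):
            # A line such as "*nat" opens a new table section.
            table = line[1:]
        elif line.startswith("-A "):
            # Rule lines start with "-A <chain>"; comments, chain policies
            # and COMMIT lines are ignored.
            rules.add((table, line))
    return rules


def missing_rules(expected_dump, loaded_dump):
    """List rules present in the expected dump but absent from the loaded one."""
    return sorted(parse_rules(expected_dump) - parse_rules(loaded_dump))


# Hypothetical reference rule set, e.g. kept under version control.
EXPECTED = """*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A PREROUTING -i eth0 -p udp -m udp --dport 5060 -j DNAT --to-destination 192.0.2.10
-A POSTROUTING -o eth0 -j MASQUERADE
COMMIT
"""

# Hypothetical dump taken after a reboot, with the SIP rule gone.
LOADED = """*nat
:PREROUTING ACCEPT [12:720]
:POSTROUTING ACCEPT [3:180]
-A POSTROUTING -o eth0 -j MASQUERADE
COMMIT
"""

for table, rule in missing_rules(EXPECTED, LOADED):
    print(f"missing in table {table}: {rule}")
```

Rules are compared as exact strings per table, so differing chain counters are ignored while any change to a rule's arguments shows up as a missing rule.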