We are using two daisy-chained Belkin OmniView PRO 16-Port KVMs. For the most part, they have worked great. They sometimes misread the scroll lock key, so you have to be very careful not to leave a node with scroll lock on. Also, the video signal seems significantly degraded when viewing nodes on the second KVM.
Most of all, the KVMs sometimes mess up the mouse signal when switching between nodes. According to Belkin, this could be caused by our optical mice. It seems to happen only between certain nodes, and since we are only reading mouse input on the two masters, we are lucky that it has not affected anything yet.
We are using SysKonnect SK98-21 Gigabit ethernet cards. The only problem we had with them was a warning reporting a temperature sensor out of range. After much debugging and returning the cards, we discovered a bug in the driver code that reported the same message for temperature and voltage warnings. (This is why open source is so cool!) RedHat 7.1 seems to come with a different driver version that doesn't have the same problem. After the upgrade, only one computer still reported the error (about 1,000 times a day). Swapping out the power supply seemed to reduce the error to a few a day at most.
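Tracking how often a warning like this appears per day makes it easy to see whether a change (such as a new power supply) actually helped. The following is a minimal sketch, assuming a syslog-style log whose lines begin with month and day. The pattern, the sample lines, and the node name are all placeholders invented for illustration; they are not the real driver's message text.

```python
import re
from collections import Counter

# Placeholder pattern: the real driver's wording is not given here,
# so substitute the actual message text from your own logs.
PATTERN = re.compile(r"sensor warning", re.IGNORECASE)

def warnings_per_day(lines, pattern=PATTERN):
    """Count matching lines per (month, day), read from a syslog-style prefix."""
    counts = Counter()
    for line in lines:
        if pattern.search(line):
            parts = line.split()
            if len(parts) >= 2:
                counts[(parts[0], parts[1])] += 1
    return counts

# Invented sample lines, for illustration only.
sample = [
    "Jul  3 10:00:01 nodeA kernel: PLACEHOLDER sensor warning",
    "Jul  3 10:01:13 nodeA kernel: PLACEHOLDER sensor warning",
    "Jul  3 10:02:00 nodeA kernel: PLACEHOLDER link up",
    "Jul  4 09:15:42 nodeA kernel: PLACEHOLDER sensor warning",
]
daily = warnings_per_day(sample)
for (month, day), count in sorted(daily.items()):
    print(month, day, count)
```

In practice the lines would come from the node's system log file rather than a list; comparing the daily counts before and after a hardware swap shows whether the swap made a difference.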
Our cluster is connected to three 2200VA Tripp-Lite SmartPro UPSs. Two UPSs were not quite enough, and it seemed that each UPS required its own circuit.
Although the UPSs can handle heavy loads, they do not provide power as clean as we might prefer. Whatever goes into the UPSs comes out the back unless the UPS is running off the battery. Therefore minor fluctuations in the power go straight through. In addition, rather than simulating the AC wave with a stair-stepped wave when running off the battery, the UPS outputs a nearly square wave.
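To give a rough sense of why the output wave shape matters, the sketch below estimates the total harmonic distortion (THD) of an ideal square wave and of a simple stair-stepped wave. It assumes idealized waveforms, not measurements from our UPSs, and the three-level stepped shape (zero for 1/12 of a period on either side of each zero crossing) is just one illustrative choice.

```python
import math

def thd(wave, n=3600):
    """Estimate THD of one period of wave(phase), with phase in [0, 1).

    Assumes the wave has no DC offset; any offset would be counted as distortion.
    """
    phases = [(k + 0.5) / n for k in range(n)]  # midpoints avoid the step edges
    samples = [wave(p) for p in phases]
    mean_sq = sum(s * s for s in samples) / n
    # Project onto the fundamental sine and cosine components.
    a = 2 / n * sum(s * math.sin(2 * math.pi * p) for s, p in zip(samples, phases))
    b = 2 / n * sum(s * math.cos(2 * math.pi * p) for s, p in zip(samples, phases))
    fund_sq = (a * a + b * b) / 2  # mean square of the fundamental alone
    return math.sqrt(max(mean_sq - fund_sq, 0.0) / fund_sq)

def sine(p):
    return math.sin(2 * math.pi * p)

def square(p):
    return 1.0 if p < 0.5 else -1.0

def stepped(p):
    # Illustrative three-level wave: +1, 0, -1, 0 with short zero dwells.
    if 1 / 12 <= p < 5 / 12:
        return 1.0
    if 7 / 12 <= p < 11 / 12:
        return -1.0
    return 0.0

thd_sine = thd(sine)
thd_square = thd(square)
thd_stepped = thd(stepped)
print(f"sine {thd_sine:.3f}  stepped {thd_stepped:.3f}  square {thd_square:.3f}")
```

Under these assumptions the square wave comes out at roughly 48% THD and the stepped wave at roughly 31%, while a pure sine is essentially zero. Real UPS output will differ, but the ordering illustrates why a nearly square wave is the least clean of the three.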
Our rack-mount cases include several case fans. On several occasions a fan has failed and the case has started beeping. Sometimes this can be solved by pushing the fan back onto its axle or reseating the fan in its mount.
We have only had one hard drive fail on us so far.
We have one haunted node (slave #12). The symptom is always the same: the node will not boot. When the power switch is pressed, the power light flashes on and back off, and the CPU and power supply fans both turn a quarter turn; then nothing. To get this to happen again, you must wait a minute; otherwise there is no response from the computer. The problem occurs only when the computer has been shut down for a while, and nearly every time at that.
This first started showing up a few weeks after building the nodes. We swapped the power supply and removed all drives and cards, to no avail. We swapped the memory and CPU with another node, which still did not help. The only things not replaced were the case and motherboard; it sounded like a bad motherboard. Upon reassembly, however, the node magically started running again (a screwdriver dropped on the motherboard at this point may have helped).
A month later, it happened again, but went away after messing around inside the case.
A few weeks later the same problem cropped up, so we replaced the motherboard, assuming that was the problem. The problem went away.
A month or two later, the same problem recurred. After some finagling, the computer started running again.
A month later, the problem came back and didn't go away as easily. We ended up taking the motherboard and putting it on a block of wood. We plugged in a spare power supply, spare memory, the original CPU, and a new power cord plugged into a different circuit, and used a screwdriver to short the pins instead of a power switch (the screwdriver was not dropped). The problem persisted. We individually tested the memory, CPU, motherboard, power supply, and spare power supply in a separate system. The problem followed the two power supplies, so we replaced them.
Two months later, the same thing happened again. We removed the cards and drives, and that didn't help. We tried a spare power supply, which did fix the problem. However, when we unplugged the original power supply from the hard drive, floppy drive, and case, it suddenly started working too. We put the computer back together just the way it was and it continued working.
No single piece of hardware was present in all of these situations. The best guess offered so far is that some piece of hardware was frying the power supplies. This is a case where it would have been nice to have purchased the computers from a distributor to whom we could return the whole computer, rather than custom building them ourselves.
We initially configured our cluster with Scyld Beowulf, based on RedHat 6.2. However, Scyld was very limited on the slave nodes; it made it difficult to play with routing issues and made LINDA nearly impossible to use. Therefore we upgraded to a pure RedHat 7.1, only to learn that LINDA did not work on RedHat 7.1. Lindaspaces has since released an update to their software.