Date:   2003-10-29 15:20:11
From:   anonymous2
Response to: an important piece is missing.....

Part of the advantage of clusters is that you don't NEED to keep every node running. If one of your 1100 nodes breaks, you have a 1099-node cluster. You simply take it out and either repair or replace it. There's no real 'system' to its reliability, other than redundancy.
    2003-11-25 08:32:03  anonymous2

    From what I understand, it's actually important that each node remain stable. The primary ways of dividing processing time on a supercomputer are nodes and hours. If you grab 10 nodes and are going to be doing computations for the next 10 hours, you'll have a significant issue with one node going down.

    Especially if you then have to piece together a dataset from incomplete data.
    2003-10-29 20:29:10  anonymous2

    That's not what anonymous 1 meant. Statistically, there is always some chance that a bit in RAM flips at random (caused by a stray cosmic ray or whatever). ECC RAM has an extra chip on the memory module that stores check bits, so that single-bit errors can be detected and corrected. The longer the calculation and the more computers involved, the more likely it is that a RAM error will occur. It would be interesting to know how they resolve the problem.
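
    To put rough numbers on "the longer the calculation and the more computers involved", here is a minimal sketch in Python. It assumes memory errors are independent and arrive at a constant rate (a Poisson model). The function name and the RATE value are made up for illustration; the rate is not a measured figure for any real cluster.

```python
import math

def p_at_least_one_error(nodes, hours, errors_per_node_hour):
    """Probability of one or more memory errors during a job, assuming
    independent errors at a constant rate (a Poisson model)."""
    expected = nodes * hours * errors_per_node_hour  # expected error count
    return -math.expm1(-expected)                    # same as 1 - exp(-expected)

# ASSUMED rate, chosen only for illustration; not a measured value.
RATE = 1e-4  # errors per node per hour (hypothetical)

for nodes, hours in [(1, 10), (10, 10), (1100, 10), (1100, 100)]:
    p = p_at_least_one_error(nodes, hours, RATE)
    print(f"{nodes:5d} nodes x {hours:4d} h -> P(at least one error) = {p:.4f}")
```

    Whatever the real rate is, the expected error count scales with nodes times hours, so under this model a job that is nearly always clean on one machine can be quite likely to see an error somewhere on a large cluster.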