Confessions of the World's Largest Switcher
Subject:   an important piece is missing.....
Date:   2003-10-29 15:20:11
From:   anonymous2
Response to: an important piece is missing.....

Part of the advantage of clusters is that you don't NEED to keep every node running. If one of your 1100 nodes breaks, you have a 1099-node cluster. You simply take it out and either repair or replace it. There's no real 'system' to its reliability, other than redundancy.


  • an important piece is missing.....
    2003-11-25 08:32:03  anonymous2

    From what I understand, it's actually important that each node remain stable. The primary ways of dividing processing time on a supercomputer are nodes and hours. If you grab 10 nodes and are going to be doing computations for the next 10 hours, one node going down partway through is a significant issue.

    Especially if you have to then piece together a dataset with incomplete data.
  • an important piece is missing.....
    2003-10-29 20:29:10  anonymous2

    That's not what anonymous 1 meant ... statistically, there is a possibility that an error in RAM will flip a bit at random (caused by a stray cosmic ray or whatever). ECC RAM has an extra chip on the memory module to detect and correct such errors. The longer the calculation and the more computers involved, the more likely it is that a RAM error will occur. It would be interesting to know how they resolve the problem.
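
To put rough numbers on the point made in both replies (more nodes and longer runs mean more exposure), here is a minimal sketch. It assumes events (a flipped bit, or a node going down) arrive independently at a constant rate per node-hour, which is a simplification. The function name and the rate used below are hypothetical placeholders, not measured figures for any real cluster.

```python
import math


def p_at_least_one_event(rate_per_node_hour, nodes, hours):
    """Chance of one or more events during a job.

    Assumes independent events at a constant rate (a Poisson model),
    so P(at least one) = 1 - exp(-rate * nodes * hours).
    """
    exposure = rate_per_node_hour * nodes * hours  # expected number of events
    return -math.expm1(-exposure)  # same as 1 - exp(-exposure), but more accurate for tiny values


# HYPOTHETICAL rate, chosen only for illustration.
ASSUMED_RATE = 1e-5  # events per node per hour

small_job = p_at_least_one_event(ASSUMED_RATE, nodes=10, hours=10)
full_cluster = p_at_least_one_event(ASSUMED_RATE, nodes=1100, hours=10)

print(f"10 nodes, 10 hours:   {small_job:.4%}")
print(f"1100 nodes, 10 hours: {full_cluster:.4%}")
```

Whatever the real rate is, the shape is the same: the risk depends on the product of nodes and hours, so a job spanning the whole cluster is far more exposed than a 10-node job of the same length.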