Confessions of the World's Largest Switcher
Subject:   an important piece is missing.....
Date:   2003-10-29 15:20:11
From:   anonymous2
Response to: an important piece is missing.....

Part of the advantage of clusters is that you don't NEED to keep every node running. If 1 of your 1100 nodes breaks, you have a 1099-node cluster. You simply take it out and either repair or replace it. There's no real 'system' to its reliability, other than redundancy.


  • an important piece is missing.....
    2003-11-25 08:32:03  anonymous2 [View]

    From what I understand, it's actually important that each node remain stable. The primary ways of dividing processing time on a supercomputer are nodes and hours. If you grab 10 nodes and are going to be doing computations for the next 10 hours, you'll have a significant issue with one node going down.

    Especially if you have to then piece together a dataset with incomplete data.
  • an important piece is missing.....
    2003-10-29 20:29:10  anonymous2 [View]

    That's not what anonymous 1 meant ... statistically there is a possibility that there will be an error in RAM that flips a bit randomly (caused by a stray cosmic ray or whatever). ECC RAM has an extra chip on the memory module to compensate for that possibility. The longer the calculation and the more computers involved the more likely that a RAM error will occur. It would be interesting to know how they do resolve the problem.
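
    To put rough numbers on "the longer the calculation and the more computers involved", here is a minimal sketch. It assumes bit flips are independent events arriving at a constant rate per node (a Poisson model). The rate used below is a made-up placeholder, not a measured figure for any real machine, and the function name is only illustrative.

```python
import math

def p_at_least_one_flip(nodes, hours, flips_per_node_hour):
    """Chance of one or more bit flips during a run.

    Assumes flips are independent and arrive at a constant rate
    on every node (a Poisson model).
    """
    expected_flips = nodes * hours * flips_per_node_hour
    # 1 - exp(-x), computed in a numerically stable way
    return -math.expm1(-expected_flips)

# Hypothetical rate chosen only for illustration.
RATE = 1e-5  # flips per node per hour

for nodes, hours in [(1, 10), (10, 10), (1100, 10), (1100, 100)]:
    p = p_at_least_one_flip(nodes, hours, RATE)
    print(f"{nodes:>5} nodes, {hours:>4} h: P(at least one flip) = {p:.4f}")
```

    Whatever the real rate is, the probability only goes up as nodes or hours are added, which is the point being made above.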
    • an important piece is missing.....
      2003-10-29 22:30:39  anonymous2 [View]

      this was mentioned elsewhere - the uni has developed specific fault tolerant software.
      • an important piece is missing.....
        2003-11-05 02:16:55  anonymous2 [View]

        There is no way that external software can always find an intermittent memory fault.
        You can do some things that are acceptable on a workstation that does graphics, but if you do science or manufacturing work where every result is important, you basically have to calculate everything twice on different nodes. This will lower the peak performance by 50%, but it is the only way to know that the result is the right one.

        This is a fantastic system, but organizations where the result has to be correct had better look at a system with ECC.
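
        The "calculate everything twice on different nodes" idea can be sketched as follows. This is only an illustration of duplicate-and-compare, not a description of the software the cluster actually runs. The two node functions are hypothetical stand-ins, and the faulty one simulates a single flipped bit in its result.

```python
def healthy_node(task, data):
    """Stand-in for a node that computes correctly."""
    return task(data)

def faulty_node(task, data):
    """Stand-in for a node whose result has its lowest bit flipped."""
    return task(data) ^ 1

def run_on_two_nodes(task, data, node_a, node_b):
    """Run the same task on two nodes and accept the result only if they agree.

    Every task is executed twice, so useful throughput is halved.
    """
    result_a = node_a(task, data)
    result_b = node_b(task, data)
    if result_a != result_b:
        raise ValueError(f"results disagree: {result_a} vs {result_b}")
    return result_a

values = [1, 2, 3, 4]
print(run_on_two_nodes(sum, values, healthy_node, healthy_node))

try:
    run_on_two_nodes(sum, values, healthy_node, faulty_node)
except ValueError as err:
    print("detected:", err)
```

        Two runs can only show that something went wrong; they cannot say which of the two results is the bad one, so a mismatch means computing again.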
        • an important piece is missing.....
          2003-11-05 13:10:01  anonymous2 [View]

          Wow, anonymous, it's too bad all of those supercomputing guys didn't ask you! You could have set them straight before they wasted all that time and money. I guess you'd better let the folks at know, quick - they must have missed it before now.

          OR, just MAYBE, you have no idea what you're talking about. It's a relatively simple exercise in software development to identify spurious results via multiple iterations. ECC memory won't protect you from processor faults and other glitches anyway, so the software has to be robust enough to allow for bad results even with expensive memory.

          But thanks for playing.
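
          One way to read "identify spurious results via multiple iterations" is a majority vote over repeated runs. The sketch below assumes results can be compared exactly (integers or bit-identical outputs). The function name is hypothetical and nothing here is taken from the cluster's actual software.

```python
from collections import Counter

def majority_result(results):
    """Return the value produced by a strict majority of runs.

    Raises ValueError when no value wins more than half of the runs,
    in which case the computation should be repeated.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no strict majority among results")
    return value

# Three runs of the same computation, one of them corrupted.
print(majority_result([42, 42, 43]))
```

          With three runs a single bad result is outvoted rather than merely detected, at the cost of doing the work three times.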
          • an important piece is missing.....
            2003-11-06 00:03:09  anonymous2 [View]

            Thank you to
            You even prove me right. What does this mean:
            It's a relatively simple exercise in software development to identify spurious results via multiple iterations.

            This means that you have to do everything multiple times to prove your result.

            But it looks nice to have the peak performance to put you on the TOP500 list.
            • ++++++++++an important message++++++++++++
              2003-11-06 08:02:28  anonymous2 [View]

              both Apple and Dr. Varadarajan seem to be clear about what they are doing and it certainly seems like a success story for both.


              • ++++++++++an important message++++++++++++
                2003-11-07 06:50:17  anonymous2 [View]


                • ++++++++++an important message++++++++++++
                  2003-11-07 14:19:48  anonymous2 [View]

                  ....he likes peanuts.
                  • ++++++++++an important message++++++++++++
                    2003-11-18 17:28:40  anonymous2 [View]

                    mmm peanuts
                    • ++++++++++an important message++++++++++++
                      2003-12-14 04:40:11  anonymous2 [View]

                      Think of managing other $5.2M software engineering
                      projects ...

                      mmm peanuts
                      mmm single precision Altivec objects

                      What brand of peanuts could sell good?