Article:   Confessions of the World's Largest Switcher
Subject:   an important piece is missing.....
Date:   2003-10-29 12:38:31
From:   anonymous2
Why doesn't the article include the most interesting part of the story? How can you keep a cluster of over 1,000 non-failsafe computers running? Varadarajan has devised a system to make the cluster reliable, even though it lacks, for one thing, ECC RAM.

Showing messages 1 through 14 of 14.

  • an important piece is missing.....
    2003-10-30 14:22:31  anonymous2

    "Also, message caching and dynamic memory management were added for improved scientific application performance."

    They have software that performs dynamic memory management. It would be nice if Apple brought ECC to at least the Xserves in time, but I'm sure the memory management through software (which was developed at VT) was a big part of the cost savings, and they were never looking at machines with ECC RAM because the software could handle the error correction.
  • an important piece is missing.....
    2003-10-29 22:54:59  anonymous2


    You don't need ECC RAM if the software is designed with data redundancy in mind. If I recall correctly, HP demonstrated a number of years back, on a machine with some 50,000 known hardware problems, that software could be designed to compensate for the problems and return accurate results.

    In the case of VT, I believe they built error correction and fault tolerance into the software, allowing them to forgo expensive ECC RAM-based hardware.
  • an important piece is missing.....
    2003-10-29 15:20:11  anonymous2

    Part of the advantage of clusters is that you don't NEED to keep every node running. If 1 of your 1100 nodes breaks, you have a 1099-node cluster. You simply take it out and either repair or replace it. There's no real 'system' to its reliability, other than redundancy.
    • an important piece is missing.....
      2003-11-25 08:32:03  anonymous2

      From what I understand, it's actually important that each node remain stable. The primary ways of dividing processing time on a supercomputer are nodes and hours. If you grab 10 nodes and are going to be doing computations for the next 10 hours, you'll have a significant issue with one node going down.

      Especially if you have to then piece together a dataset with incomplete data.
    • an important piece is missing.....
      2003-10-29 20:29:10  anonymous2

      That's not what anonymous 1 meant ... statistically, there is a possibility that an error in RAM will flip a bit at random (caused by a stray cosmic ray or whatever). ECC RAM has an extra chip on the memory module to compensate for that possibility. The longer the calculation and the more computers involved, the more likely it is that a RAM error will occur. It would be interesting to know how they resolve the problem.
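To put rough numbers on the point above, here is a minimal sketch of how the chance of at least one memory error grows with node count and run time. It assumes errors are independent and arrive at a constant rate (a Poisson model); the function name and the per-node error rate in the usage lines are hypothetical illustrations, not measured figures for any real machine.

```python
import math


def prob_at_least_one_error(rate_per_node_hour: float, nodes: int, hours: float) -> float:
    """Chance of one or more memory errors during a run.

    Assumes independent errors at a constant rate (Poisson model).
    The rate passed in is an illustrative assumption, not a measurement.
    """
    expected_errors = rate_per_node_hour * nodes * hours
    # 1 - exp(-x), computed via expm1 for accuracy when x is small
    return -math.expm1(-expected_errors)


if __name__ == "__main__":
    hypothetical_rate = 1e-5  # errors per node per hour, made up for illustration
    for nodes in (10, 100, 1100):
        p = prob_at_least_one_error(hypothetical_rate, nodes, hours=10)
        print(f"{nodes:5d} nodes, 10 hours: {p:.4f}")
```

Under this model the probability depends only on the product of rate, nodes, and hours, so a job ten times longer is as exposed as a job on ten times as many nodes.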
      • an important piece is missing.....
        2003-10-29 22:30:39  anonymous2

        This was mentioned elsewhere: the university has developed specific fault-tolerant software.
        • an important piece is missing.....
          2003-11-05 02:16:55  anonymous2

          There is no way that external software can always find an intermittent memory fault.
          You can do some things that are acceptable on a workstation that does graphics, but if you do science or manufacturing work where every result is important, you basically have to calculate everything twice on different nodes. This cuts the peak performance by 50% but is the only way to know that the result is the right one.

          This is a fantastic system, but organizations where the result has to be correct had better look at a system with ECC.
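The "calculate everything twice on different nodes" approach described in this post can be sketched as follows. This is only an illustration of duplicate-and-compare, not the software Virginia Tech actually used; the node functions, the simulated bit flip, and all names are hypothetical.

```python
def healthy_node(task):
    """Hypothetical node that runs the task correctly."""
    return task()


def faulty_node(task):
    """Hypothetical node that flips the lowest bit of an integer result."""
    return task() ^ 1


def checked_result(task, node_a, node_b):
    """Run the same task on two nodes and accept the result only if both agree."""
    a = node_a(task)
    b = node_b(task)
    if a != b:
        # A mismatch shows that something went wrong, but not which node erred.
        raise RuntimeError("results disagree; the task must be recomputed")
    return a


def effective_throughput(peak: float, copies: int = 2) -> float:
    """Useful throughput when every task is executed `copies` times."""
    return peak / copies
```

Running every task twice halves the useful throughput, which is the 50% cost the post refers to, and a mismatch only detects an error without saying which copy is correct.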
          • an important piece is missing.....
            2003-11-05 13:10:01  anonymous2

            Wow, anonymous, it's too bad all of those supercomputing guys didn't ask you! You could have set them straight before they wasted all that time and money. I guess you'd better let the folks at top500.com know, quick - they must have missed it before now.

            OR, just MAYBE, you have no idea what you're talking about. It's a relatively simple exercise in software development to identify spurious results via multiple iterations. ECC memory won't protect you from processor faults and other glitches anyway, so the software has to be robust enough to allow for bad results even with expensive memory.

            But thanks for playing.
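One common way to "identify spurious results via multiple iterations" is a majority vote over repeated runs. The sketch below assumes the task is deterministic, so correct runs agree exactly, and that results are hashable; `majority_result` is a hypothetical helper, not a description of the cluster's actual software.

```python
from collections import Counter


def majority_result(task, runs: int = 3):
    """Run a task several times and return the value that a strict majority agrees on.

    Assumes a deterministic task with hashable results, so that
    fault-free runs produce identical values.
    """
    results = [task() for _ in range(runs)]
    value, count = Counter(results).most_common(1)[0]
    if count <= runs // 2:
        raise RuntimeError("no majority among the runs; result cannot be trusted")
    return value
```

With three runs, a single spurious result is outvoted, at the price of doing the work three times.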
            • an important piece is missing.....
              2003-11-06 00:03:09  anonymous2

              Thank you, too.
              You even prove me right. What does this mean:
              "It's a relatively simple exercise in software development to identify spurious results via multiple iterations."

              It means that you have to do everything multiple times to prove your result.

              But it looks nice to have the peak performance to put you on the TOP500 list.
              • ++++++++++an important message++++++++++++
                2003-11-06 08:02:28  anonymous2

                Both Apple and Dr. Varadarajan seem to be clear about what they are doing, and it certainly seems like a success story for both.

                SO ALL YOU GUYS CAN SHUT THE HELL UP AND GET BACK TO WORK!!

                +++++++++++++++++++++++++++++++++
                • ++++++++++an important message++++++++++++
                  2003-11-07 06:50:17  anonymous2

                  If I KNEW WHAT YOU ALL ARE TALKING ABOUT,
                  I WOULD NEVER SAY ANYTHING

                  BRAINS
                  • ++++++++++an important message++++++++++++
                    2003-11-07 14:19:48  anonymous2

                    ....he likes peanuts.
                    • ++++++++++an important message++++++++++++
                      2003-11-18 17:28:40  anonymous2

                      mmm peanuts
                      • ++++++++++an important message++++++++++++
                        2003-12-14 04:40:11  anonymous2

                        Think of managing other $5.2M software engineering
                        projects ...

                        mmm peanuts
                        mmm single precision Altivec objects

                        What brand of peanuts could sell good?