Confessions of the World's Largest Switcher
Subject:   an important piece is missing.....
Date:   2003-10-29 12:38:31
From:   anonymous2
Why doesn't the article include the most interesting part of the story? How can you keep a cluster of over 1,000 non-failsafe computers running? Varadarajan has devised a system to make the cluster reliable, even though, for one thing, it doesn't have ECC RAM.
Showing messages 1 through 3 of 3.

  • an important piece is missing.....
    2003-10-30 14:22:31  anonymous2

    "Also, message caching and dynamic memory management were added for improved scientific application performance."

    They have software that performs dynamic memory management. It would be nice if Apple brought ECC to at least the Xserves in time, but I'm sure the software memory management (which was developed at VT) was a big part of the cost savings, and they were never looking at machines with ECC RAM because the software could handle the error correction.
  • an important piece is missing.....
    2003-10-29 22:54:59  anonymous2

    You don't need ECC RAM if the software is designed with data redundancy in mind. If I recall correctly, HP demonstrated a number of years back, on a machine with some 50,000 known hardware problems, that software could be designed to compensate for those problems and still return accurate results.

    In the case of VT, I believe they built error correction and fault tolerance into the software, allowing them to forgo expensive ECC RAM-based hardware.
  • an important piece is missing.....
    2003-10-29 15:20:11  anonymous2

    Part of the advantage of clusters is that you don't NEED to keep every node running. If one of your 1,100 nodes breaks, you have a 1,099-node cluster. You simply take it out and either repair or replace it. There's no real 'system' to its reliability, other than redundancy.
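
For anyone wondering what "software designed with data redundancy in mind" could look like, here is a minimal sketch, assuming the simplest possible scheme: run the same task on several nodes and accept only the answer a strict majority agrees on. This is not a description of what VT actually built (the thread doesn't say); `healthy_node`, `faulty_node`, `majority_vote`, and `run_redundantly` are hypothetical names made up for illustration.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of replicas, or raise."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among replica results")
    return value


def run_redundantly(task, replicas):
    """Run `task` on every replica (a callable standing in for a node) and vote."""
    return majority_vote([replica(task) for replica in replicas])


# Hypothetical nodes: two healthy, one whose result has a flipped bit.
def healthy_node(task):
    return sum(task)


def faulty_node(task):
    return sum(task) ^ 0b100  # simulate a single-bit memory error


result = run_redundantly([1, 2, 3, 4], [healthy_node, faulty_node, healthy_node])
```

The cost is obvious: every task runs three times. Real systems would more likely use checksums or checkpoint-and-restart to catch errors more cheaply, but the principle is the same: detect a bad result in software rather than rely on the hardware to prevent it.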