North America's Fastest Linux Cluster Constructed

SeanAhern writes "LinuxWorld reports that 'A Linux cluster deployed at Lawrence Livermore National Laboratory and codenamed 'Thunder' yesterday delivered 19.94 teraflops of sustained performance, making it the most powerful computer in North America - and the second fastest on Earth.'" Thunder sports 4,096 Itanium 2 processors in 1,024 nodes, some big iron by any standard.
This discussion has been archived. No new comments can be posted.

  • Google Cache (Score:2, Informative)

    by nadolph ( 661727 ) on Thursday May 13, 2004 @11:01PM (#9147364) Homepage
    http://www.google.ca/search?sourceid=navclient&ie=UTF-8&oe=UTF-8&q=cache:http%3A%2F%2Fwww%2Ellnl%2Egov%2Flinux%2Fthunder%2F
  • Re:Whoa. (Score:4, Informative)

    by TravisWatkins ( 746905 ) on Thursday May 13, 2004 @11:02PM (#9147373) Homepage
    That would be the Earth Simulator [jamstec.go.jp] in Japan.
  • Re:vs google (Score:5, Informative)

    by complete loony ( 663508 ) <Jeremy@Lakeman.gmail@com> on Thursday May 13, 2004 @11:20PM (#9147495)
    Google has lots of little (in comparison only) jobs that have to process heaps of data. Google's cluster(s) wouldn't perform well on the Top 500 list, since they concentrate on raw data-processing power rather than on link speed, which is the main factor in performance for supercomputers.

    The GFS article that appeared a while back said they used standard 100 Mbit Ethernet; that is not going to get you a good score in any supercomputer benchmark.
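    A toy model of why this is so (all the numbers here are my own rough assumptions, not from the article): in a tightly coupled job, every timestep pays for its message exchange on top of its computation, so a slow interconnect dominates the step time.

    ```python
    # Toy model (illustrative numbers only) of why link speed dominates
    # tightly coupled supercomputer workloads: each timestep does some
    # computation, then an exchange costing latency + bytes/bandwidth.

    def step_time(compute_s, msg_bytes, latency_s, bandwidth_Bps):
        """Time for one timestep: compute plus one message exchange."""
        return compute_s + latency_s + msg_bytes / bandwidth_Bps

    compute = 1e-3          # 1 ms of computation per step (assumed)
    msg = 100_000           # 100 kB exchanged per step (assumed)

    # ~100 Mbit Ethernet: high latency, low bandwidth
    eth = step_time(compute, msg, latency_s=100e-6, bandwidth_Bps=12.5e6)
    # Quadrics-class interconnect: a few microseconds, ~300 MB/s (ballpark)
    fast = step_time(compute, msg, latency_s=5e-6, bandwidth_Bps=300e6)

    print(f"Ethernet step:     {eth*1e3:.2f} ms")
    print(f"Fast interconnect: {fast*1e3:.2f} ms")
    ```

    With these made-up numbers the Ethernet node spends most of each step waiting on the wire, which is exactly why Google's design wouldn't benchmark well here even though it's great at what it does.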

  • Wow (Score:2, Informative)

    by 0xC0FFEE ( 763100 ) on Thursday May 13, 2004 @11:23PM (#9147509)
    Here's a picture: http://doc.quadrics.com/quadrics/QuadricsHome.nsf/DisplayPages/3A912204F260613680256DD9005122C7
  • by Anonymous Coward on Thursday May 13, 2004 @11:34PM (#9147589)
    Next time you cut and paste [google.com.au] from someone else's blog, perhaps take the time to restore the paragraph tags? Smells like plagiarism to me...
  • by damiam ( 409504 ) on Thursday May 13, 2004 @11:43PM (#9147659)
    Sorry to burst your bubble, but Itanium isn't x86.
  • by sapbasisnerd ( 729448 ) on Thursday May 13, 2004 @11:49PM (#9147681)
    Had to decide to reply to this or mod it down, decided to reply.

    That's a wildly inaccurate summary of the landscape of RDBMS clustering technology.

    Problem is, that's not what we are talking about here.

    So the answer to your question at this end is almost certainly "none of the above", or probably more correctly "some bits of all of the above". Functionally, most of the kind of stuff you do here doesn't need shared concurrent access to the same data files; for simplicity of implementation, however, they probably run GPFS anyway so that all nodes can see all files.

  • by skdffff ( 140618 ) on Thursday May 13, 2004 @11:53PM (#9147710)
    There are basically two types of clusters - HA (High Availability) and HPC (High Performance Computing). They're both called "clusters" (which confuses some people) but are designed for completely different purposes. You're talking about variations of the first type, while the cluster in the article is an HPC cluster.
  • by tap ( 18562 ) on Thursday May 13, 2004 @11:57PM (#9147734) Homepage
    Do you have any kind of benchmark where the Itanium smokes the Opteron? The Itanium does have greater memory bandwidth, but not by a lot. If you look at the SPEC benchmarks, it can be faster on some of them, but not by a lot. However, the Itanium is a lot more expensive!

    Compared to a Xeon or AthlonMP cluster, the Itanium fared poorly in price/performance. The only reason to use Itaniums was if you needed 64 bits for more than 4GB of memory, or needed high single-CPU performance for a poorly parallelized application. (Of course, if your application parallelizes poorly, a cluster is probably a bad choice to begin with.) Then the Opteron came out and changed all that. It's 64 bits, it's fast, and it's a fraction of the price of the Itanium2.

    I just purchased a new Beowulf cluster. The decision was between Xeons and Opterons. The Opterons had better price/performance, but the Xeons would fit in better with our existing Pentium3 Beowulf, other ia32 servers, and existing software. In the end, we went with Opterons. Itanium2 was never even in contention. One look at the price and performance of an Itanium2 system was all it took to cross it off the list.
  • Re:"Most" powerful (Score:5, Informative)

    by tap ( 18562 ) on Friday May 14, 2004 @12:13AM (#9147824) Homepage
    I think you've got that backwards; Quadrics is the performance leader, not the price/performance leader. Myrinet, SCI, and Infiniband all beat it in price/performance. Quadrics is faster, and scales to more nodes, than the others.

    According to Quadrics latest price list, the cards are $1200 each, $913 per port for a 64 node switch, and $185-$265 for a cable. That's $2300/node.

    Myrinet cards are $595, the switch is $400 per port for 64 nodes, and the cables are ~$50. That's $1050/node.

    Quadrics' price for a 1024-node interconnect is $4,176,094. That's hardly chump change. The bandwidth is about 10x higher than gigabit ethernet, and the latency about 100x lower.
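    Sanity-checking the arithmetic above (taking the cable at the midpoint of the quoted $185-$265 range):

    ```python
    # Per-node interconnect cost at 64 nodes, from the figures quoted above.
    quadrics_per_node = 1200 + 913 + (185 + 265) // 2   # card + port + cable
    myrinet_per_node = 595 + 400 + 50
    print(quadrics_per_node, myrinet_per_node)          # 2338 1045

    # At 1024 nodes the quoted Quadrics total works out to more per node,
    # presumably because bigger switches cost more per port:
    print(round(4_176_094 / 1024))                      # 4078
    ```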
  • by Roydd McWilson ( 730636 ) on Friday May 14, 2004 @12:16AM (#9147846) Journal
    GCC? On Itanium? Optimized quite well? Whatever. Check out Trimaran [trimaran.org] for the HP/Illinois/NYU compilers which basically inspired Itanium.
  • by Anonymous Coward on Friday May 14, 2004 @12:21AM (#9147875)
    Google's cluster isn't a computational cluster.

    You have several types of clusters; each is designed to do a specific task, although you can easily mix-n-match for different purposes.

    1. Server clusters. Bunches of machines running together, providing services that complement each other.

    For example, you have a file server that is mirrored to another that is hooked up to a different part of a LAN/WAN backbone in order to improve service. Lots of databases are clustered like this.

    2. High availability clusters.

    You have machines that are backups of other machines. If one machine fails, a backup is activated instantly and replaces the failed machine without ANY loss in services.

    Sort of like a RAID hard drive setup. Hot-swappable computers, that sort of thing.

    Google is the first two types. It has several clusters with nodes. Each node is made up of a few computers; if a node fails, a backup can take over instantly, giving the techs time to correctly fix the issue. The computers each take some of the burden, too, so it seems they would have to be running mega-machines to provide the performance when in reality they just run a bunch of PC-style computers.

    3. Computational clusters. Clusters that are designed to pool their resources to create a single big computer that is used to process large amounts of data and intense mathematical functions.

    Two types of these are Beowulf clusters and OpenMosix clusters.

    An OpenMosix cluster is easy to set up if you're a little bit familiar with Linux, and there are even Knoppix cluster CD-ROMs so you can build one quickly and easily.

    Beowulf is used for big number crunching, and programs that use it are generally written to run on a specific cluster, although libraries and tools are portable.

    It's used a lot in astronomy, for example. 10-12 PCs in a college lab can make a nice number-crunching machine.

    There are some clusters that do all three; lots can do only one or two of the types easily. Different types can complement each other.
  • by Yenya ( 12004 ) on Friday May 14, 2004 @12:56AM (#9148066) Homepage Journal
    The problems of the Opteron against the Itanium2 are:
    • You cannot order a bigger L2 cache (the Itanium2 can have 6MB).
    • For "randomly branched" code you need as short a pipeline as possible. This is the reason the Athlon outperformed the PentiumIV at the same clock speed. The Itanium2 has a 6-stage pipeline, while the Opteron has a 20-stage one, IIRC.
    OTOH, for full performance you need a _much_ more finely-tuned compiler for VLIW CPUs such as the Itanium2 than for a generic CISC or RISC CPU.
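    A back-of-the-envelope model of the pipeline-depth point (the branch frequency and mispredict rate are assumed, not measured): every mispredicted branch flushes the pipeline, costing roughly one cycle per stage.

    ```python
    # Rough model (assumed numbers) of why pipeline depth hurts
    # "randomly branched" code: a mispredicted branch flushes the
    # pipeline, costing about one cycle per stage.

    def penalty_cycles_per_instr(branch_frac, mispredict_rate, depth):
        return branch_frac * mispredict_rate * depth

    branches = 0.20      # ~1 in 5 instructions is a branch (assumed)
    miss = 0.10          # 10% mispredicted on branchy code (assumed)

    short_pipe = penalty_cycles_per_instr(branches, miss, depth=6)
    long_pipe = penalty_cycles_per_instr(branches, miss, depth=20)
    print(f"6-stage penalty:  {short_pipe:.2f} extra cycles/instruction")
    print(f"20-stage penalty: {long_pipe:.2f} extra cycles/instruction")
    ```

    With these made-up rates the deeper pipeline pays more than three times the mispredict penalty per instruction, which is the shape of the effect the parent describes.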
  • by identity0 ( 77976 ) on Friday May 14, 2004 @01:05AM (#9148112) Journal
    I am not an expert, but in general, the Opteron seems to be targeted more for the workstation/server market than the supercomputer market. It's not like they really need x86 backwards-compatibility in the supercomputer field, so the Opteron doesn't seem to be optimized for that market. I think Intel may have made IA-64 with supercomputers in mind more than AMD did with x86-64.

    Some reps from SGI came to my LUG [golum.org] the other day, and talked about their clusters and supercomputers. The guy doing the Q&A said that he personally liked the Opterons and x86-64, and that the Opterons were fast, but for what SGI does they preferred Itanium. The Opterons have their memory controller embedded in the chip itself, which is great for 1 or 2 or even 8 processors. However, when you go up to a 512-processor single-system-image supercomputer like SGI's Altix, a lot of the memory controller work is done in the switches or otherwise off-chip. Itanium allowed for more flexibility in how they did memory controllers, because it doesn't have an on-chip one.

    There were some other reasons too, like having more registers, etc., that made SGI choose Itanium over Opteron. I don't know how applicable they are to this situation, as this doesn't seem to be an SSI supercomputer.
  • by fupeg ( 653970 ) on Friday May 14, 2004 @01:17AM (#9148167)
    Try any from SPEC, for example [spec.org]. Maybe you're thinking about x86, because otherwise the Itanium2 is way out of the Opteron's league (as well as price range, but that is beside the point.)
  • by slamb ( 119285 ) * on Friday May 14, 2004 @01:23AM (#9148207) Homepage
    Some of the coolest features of the Itanium are also some of the reasons why a lot of people don't want to use it. The EPIC ISA, for example. It was designed (along with the physical hardware) to expose a lot of the internal workings of the processor to the user. But rather than recompile and re-optimize their code, people would rather bitch about migration. That's fine for workstations and servers, but in an HPC environment, you want the nifty features, you want to occasionally hand-tune code segments in assembler, etc.

    I just coded some IA-64 assembly and from what I've seen, this comment is dead-on. They've got a lot of interesting features:

    • Speculation. The idea is to do memory fetches far in advance, to avoid waiting for the (much slower) memory system. You can do a LD.S operation that tells the machine something like "I might want the value from this memory address in a few instructions." It fetches it from memory, if it's in a good mood. If the address is paged out, it doesn't get it. (Instead, it sets a NaT (not a thing) bit to tell you nothing useful is there.) Later, you do a CHK.S. If it turns out that the speculative load failed, it jumps to some "recovery" code which gets it for real.
    • Lots of registers. 128 general-purpose 64-bit registers. Floating point registers. Some specialized ones, I think.
    • EPIC. (Explicitly Parallel Instruction Computing.) It has different types of instructions, aimed at different execution units. In the current incarnation, there are two sets of these in each processor. You give it bundles of three instructions, more broadly divided into groups. Instructions in a group don't depend on any earlier results calculated by the group, so they can be executed in parallel.
    • Rotating registers. This lets you make different iterations of the same loop work with different registers, to take advantage of EPIC more fully.
    • Predicated instructions. There are a bunch (16? 64? don't remember) of predicate bits, set by the CMP instruction and the like. Every instruction has an associated predicate. (p0 is hardcoded to true, so you normally don't notice.) So you can do conditional execution without jumping. More efficient, especially if it's just a few instructions that differ.

    If you just have a simple sequence of operations, each dependent on the one before, you can't really take advantage of these capabilities. (My code was like this. Even though performance wasn't my reason for writing assembly, it was a little disappointing that I couldn't play with the new toys.) If you're expecting these features to make Word start faster, you'll probably be disappointed.

    But if you're doing intensive computations in a tight loop, you can do amazing things. If you can get all the execution units working simultaneously, it will fly. And the features like rotating registers are designed to make that possible. You need a very good compiler or a very smart person to hand-tune it. You may need to recompile to tune if your memory latency changes (affecting how many iterations to run at once) or they come out with a new chip with more sets of execution units. But in a situation like this, none of that is a problem. They'll have applications designed to run as fast as possible on this machine. They may never be run anywhere else.
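    A small C sketch of the branchless style that predication rewards (the function and numbers are just illustrative): both sides of the conditional are expressed as a select, so a compiler targeting IA-64 can emit predicated instructions, and an x86 compiler a conditional move, instead of a jump that might mispredict.

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* A conditional written so a compiler can use predication (or a
     * conditional move) instead of a branch: the predicate selects the
     * result, and no control flow is needed in the loop body. */
    static int clamp_to_zero(int x)
    {
        int negative = x < 0;          /* the "predicate" */
        return negative ? 0 : x;       /* select, not jump */
    }

    int main(void)
    {
        int data[] = {3, -7, 0, 42, -1};
        int sum = 0;
        for (int i = 0; i < 5; i++)
            sum += clamp_to_zero(data[i]);   /* 3 + 0 + 0 + 42 + 0 */
        printf("%d\n", sum);
        assert(sum == 45);
        return 0;
    }
    ```

    On IA-64 each predicated instruction simply does nothing when its predicate bit is false, so short conditionals like this execute with no branches at all.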

  • Re:LLNL's usefulness (Score:3, Informative)

    by slamb ( 119285 ) * on Friday May 14, 2004 @02:01AM (#9148362) Homepage
    Why no hydrogen cars? Well, it could have something to do with hydrogen being a net-loss fuel; it takes more energy to make than it provides.

    That's thermodynamics. It's true for any fuel. It's even true for oil and nuclear energy - the difference being only that the energy wasn't put in during our lifetime. (And in the case of nuclear, that the pre-existing energy is all but inexhaustible.)

  • by tap ( 18562 ) on Friday May 14, 2004 @02:26AM (#9148455) Homepage
    Ok, checked them again. The best 1.5 GHz Itanium2 SPECfp2000 score is 2148, while the Opteron 248's is 1691. That's 27% faster. I'd hardly call that smoked.

    The Opteron 248 is $670 on pricewatch, while the 1.5 GHz It2 is $5200! The motherboards are like $1400 vs $400.

    You have to keep in mind that this isn't a single machine, it's a cluster. You could take the money spent on an Itanium2 cluster and buy an Opteron cluster with five times as many processors. I am well aware that one does not get perfect scaling. But if you are running something on a cluster in the first place, I have a hard time imagining something that is faster with one fifth as many 27%-faster processors. Yes, there are codes that would be faster on 1000 Itanium2s than on 5000 Opterons, but you would never run these on a cluster, because they would be faster still on a shared-memory system.
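    Working those numbers:

    ```python
    # SPECfp2000 scores and pricewatch CPU prices quoted in this thread.
    speedup = 2148 / 1691            # Itanium2 1.5 GHz vs Opteron 248
    price_ratio = 5200 / 670         # CPU list prices

    print(f"{(speedup - 1) * 100:.0f}% faster")      # 27% faster
    print(f"{price_ratio:.1f}x the price")           # 7.8x the price
    ```

    So per dollar spent on CPUs, the Opteron delivers several times the SPECfp throughput, which is the whole price/performance argument in one line.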
  • by joib ( 70841 ) on Friday May 14, 2004 @04:23AM (#9148858)

    There is a limit to how much you can effectively parallelize many problems. If that limit is 1, then you need a Cray or something.


    Well, Crays are also parallel computers, so they won't help you much in this situation. Some Crays do have vector processors, but that is also a sort of parallelism. It's just that you use that parallelism through tuned BLAS libraries or with a vectorizing compiler (e.g. Fortran 95, HPF and such things), instead of doing it manually with MPI or threads or something like that. So if your problem is totally serial, a vector processor won't help you either.


    (Or you can just take the google route and let it fail and replace the whole box. But that really requires your whole application to be written to accommodate it.)


    Not necessarily. Most supercomputers are not used to run a single job taking months; rather, they run lots of smaller and shorter jobs. On the p690 cluster where I do my stuff, I (and apparently most users) mostly run jobs using about 8-16 CPUs, with a runtime of a few hours to a day. If a node fails, the jobs executing on that node fail too. It's no big deal, just resubmit the job to the queue when you get around to it.

    Of course, if you're programming one of the very few and far between applications that has a runtime of months, you certainly want to save intermediate results once in a while. Not only to guard against hardware failure, but also so that the user can check the intermediate result and see if the app is still on the right track. It would be quite a bummer to use months of CPU time only to realize the entire thing is wasted because you specified the initial values wrong... :-)
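    A minimal checkpoint/restart sketch of the idea (the state layout, filename, and step counts are hypothetical, just to show the shape of it):

    ```python
    # Minimal checkpoint/restart sketch for a long-running job: dump the
    # state periodically, so a node failure (or a bad initial guess you
    # only spot later) costs you only the time since the last checkpoint.
    import os
    import pickle

    CKPT = "job.ckpt"

    def load_state():
        if os.path.exists(CKPT):
            with open(CKPT, "rb") as f:
                return pickle.load(f)          # resume where we left off
        return {"step": 0, "value": 0.0}       # fresh start

    def save_state(state):
        tmp = CKPT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CKPT)                  # atomic rename on POSIX

    state = load_state()
    while state["step"] < 100:
        state["value"] += state["step"] * 0.5  # stand-in for real work
        state["step"] += 1
        if state["step"] % 10 == 0:
            save_state(state)                  # checkpoint every 10 steps

    print(state["step"], state["value"])
    ```

    Writing to a temp file and renaming means a crash mid-checkpoint can't corrupt the last good one; real codes usually do the same thing with their own restart-file formats.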
