Fermi's 2000 Node Beowulf Cluster
Our eagle eye Chris DiBona is stuck in Chicago this week at Comdex, but he did yell to us that Fermi Labs has announced a 2000 Node Beowulf cluster. If anyone can send me some concrete data, I'd appreciate it.
Hmmm (Score:2)
Mysterious Fermi 2000 node cluster..... (Score:1)
I would say it isn't even parallel processing.
Yes, they have 2000 nodes (Score:2)
The trouble is there's this bug, see, and the cluster thinks there's only 1900 nodes for some reason. Any ideas?
-----------------------------
Computers are useless. They can only give answers.
What's the network infrastructure? (Score:2)
What's the network infrastructure? (Score:1)
Nil benefit.. (Score:1)
Heck, you'd want them at most 2-3 hops away, and I can't see how this would be possible. That'd be like twenty 100-port hubs, for cryin' out loud..
Nil benefit.. (Score:1)
Can PVM do this type of thing?
Look at floating point DSPs (Score:1)
That is exactly what they are all about.
Even an older model, say a TMS320C3x, can do a 32-bit FP multiply _and_ FP add in a single cycle. That's basically 25-50 MFLOPS, or >1 MFLOPS/$, on a rather old 50 MHz chip.
SCSI does this today (Score:1)
10-20 Gb/s is easy. I once worked on a project where we connected side-by-side PCs together with SCSI. It worked like a charm.
>You could do hard drives and memory that are directly accessed through this same bus.
Bingo! It was what USB, FireWire, etc. are supposed to be. That is, intelligent, fast peripherals and plenty of bandwidth.
(last time I poked around) Hmmm (Score:1)
I wonder.... (Score:1)
----------------
"Great spirits have always encountered violent opposition from mediocre minds." - Albert Einstein
time for some infrastructure integration (Score:2)
Why not locate the facilities in more extreme northern/southern/high-altitude climes, and put all that heat to good use? The model waste treatment plant in Seattle uses its own methane to power the waste treatment systems, and even sells electricity back to the city. Apply the same idea to a computing facility -- if you know the average heat production of a system (over a large enough population, the variances between systems wouldn't be that great, even as you perpetually upgraded individual systems), it wouldn't be that hard to design it into a facility HVAC plan and make very efficient use of it.
Quest for the Holy Grail. (Score:1)
Beowulf certainly does not. You want a myriad of nodes to act as one single REALLY FAST computer, which, I dare say, is the Holy Grail of clustering. If you discover a good way of doing this, then the computing world will knock down your door.
SMP systems will be able to split your threads between processors, yielding a significant speedup, but SMP systems are shared memory devices. Clusters are not, so communication happens over the network.
To take advantage of a cluster, apps have to be custom written to coordinate over separate machines connected by a network. The kernel provides virtually no abstraction here... the application author has to fully specify the coordination between components of his task.
I believe CORBA is meant to help in cases such as this, by providing services with a layer of network transparency. I'm not too familiar with the beast, but it seems like one could write a CORBA app that would, say, perform an FFT calculation on a chunk of data and return the result. Then, you could have a cluster of identical machines networked together, all with this service. You would have a "master" node that you were doing your work on. Whenever you needed to perform an FFT on a vector of data, your application would break up the vector and send requests out to all available systems for an FFT "service", and collect the results when the nodes return their answers.
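A rough sketch of that scatter/compute/gather pattern in C, using MPI rather than CORBA (do_fft() here is just a stub standing in for the real computation, and a real distributed FFT would also need communication between stages, which this glosses over):

```c
/* Sketch of the scatter/compute/gather pattern: the "master" (rank 0)
 * breaks a vector into chunks, every node crunches its own chunk, and
 * the master collects the results. Compile with mpicc, run via mpirun. */
#include <mpi.h>
#include <stdlib.h>

#define N 1048576                       /* total vector length (assumed divisible) */

/* stand-in for the real per-chunk computation */
static void do_fft(double *buf, int len) { (void)buf; (void)len; }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;
    double *data = NULL;
    double *local = malloc(chunk * sizeof(double));

    if (rank == 0) {                    /* master holds the full vector */
        data = malloc(N * sizeof(double));
        /* ... fill data with the vector to transform ... */
    }

    /* break up the vector: each node gets its piece */
    MPI_Scatter(data, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    do_fft(local, chunk);               /* every node works in parallel */

    /* collect the partial results back on the master */
    MPI_Gather(local, chunk, MPI_DOUBLE, data, chunk, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```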
There may be modern approaches to distributed computing that are more elegant than the above...this is not an area in which I conduct research. In general, though, I think that the problem is far from solved, and still an area of active research.
sorry that I couldn't provide you with a more satisfying answer,
--Lenny
//"You can't prove anything about a program written in C or FORTRAN.
It's really just Peek and Poke with some syntactic sugar."
I don't think it's ever been a real consideration (Score:1)
My (albeit limited) experience with supercomputers is that you just have to deal with the heat and power issues. My high school, Thomas Jefferson [tjhsst.edu], has an old ETA10 (top of the line in 1988 or so, about as fast as a PII 450 now).
They're building a replacement Beowulf cluster out of Celeron 450A's.
x86 scales up poorly because of power requirements (Score:2)
PPC might be a good challenger to x86 for this market, and ARM is an even better contender from a power consumption standpoint, but those new PlayStation 2's will supposedly support FireWire and very fast single and double precision floating point. Sony has said that they will support a Linux software development kit for the PS2 (though probably a cross compiler on x86 with a cartridge2PC dongle)... but if Linux gets ported to the PS2 hardware directly (and Sony provides a memory upgrade kit), I could see these in scientific and image rendering machine rooms all over the world. If not the PS2, then something like the ARM-based NetWinder, which consumes an astonishing 14 watts of power per unit... unfortunately the NetWinder has very poor floating point performance.
What we need is a new machine which is small, consumes low power, and does floating point like mad. If the PS2 draws less than 20 watts per unit, it would really fit that bill!
TOP500 (Score:1)
--
Hmmm (Score:1)
Tera: 2^40 ~10^12
Peta: 2^50 ~10^15
Exa: 2^60 ~10^18
Where did you get your sig quote ??? (Score:1)
time for some infrastructure integration (Score:1)
The ACC at the UW was heated/cooled by the water cooling system for the IBM & Cyber mainframes (if anyone still remembers those machines there).
When they pulled those out, it was a big question for a few months how to keep people using that building without HVAC (but they finally did get a system in)... Seattle may be pretty temperate most of the year, but 90 degrees in Seattle is pretty harsh. Add to this the big thermal mass of a concrete building, with 15 or so hours of sun exposure on it...
Too bad we can't get a politician heat regeneration system to capture all the hot air and butt gas that they emit...
x86 scales up poorly because of power requirements (Score:2)
Seems like a good time for this question. (Score:1)
For rough estimates, Watts = Volts x Amps. Then you pay by the kilowatt-hour... 1000 watts for one hour. There are equations to change watts into cooling load as well. In other words, this 10KW load will need to have xx tons of cooling capacity to keep it happy.
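For instance, here's what those conversions look like in C; the per-node wattage, node count, and electricity rate below are assumptions for illustration, not Fermi's numbers:

```c
/* Back-of-the-envelope power and cooling numbers for a room of nodes.
 * The per-node figures are made-up examples, not anyone's actual specs. */
#include <stdio.h>

int main(void) {
    double volts = 120.0, amps = 0.83;      /* assumed ~100 W per node */
    double nodes = 100.0;
    double rate  = 0.08;                    /* assumed $/kWh */

    double watts = volts * amps * nodes;    /* Watts = Volts x Amps */
    double kwh_per_day = watts / 1000.0 * 24.0;

    /* 1 W = 3.412 BTU/hr; 1 ton of cooling = 12,000 BTU/hr */
    double tons = watts * 3.412 / 12000.0;

    printf("load: %.1f kW, %.0f kWh/day ($%.2f/day), %.1f tons of cooling\n",
           watts / 1000.0, kwh_per_day, kwh_per_day * rate, tons);
    return 0;
}
```

At those assumed figures, 100 nodes is about a 10 kW load and roughly 2.8 tons of cooling.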
Hmmm (Score:1)
For something like this, you'd build it in some sort of tree configuration.
My guess is probably 100 megabit, because a 2000 node Myrinet would be super expensive....
Nil benefit.. (Score:1)
It's also possible to use master/slave in a configuration with multiple masters - have one "master" master, and several lower-level masters, maybe one for every 500 hosts, or whatever. Same theory as the proxy servers for RC5...
-Erik
Nil benefit.. (Score:1)
-Erik
Probably not Beowulf but 2000 node farm likely (Score:1)
It all depends on your application. Some of the codes they run on the machines on the top500 will scale well to this machine, others will not.
-Erik
Hmmm (summary of Fermi's computing architecture) (Score:2)
BTW, I saw another poster mention 1 TB of data a second. Realize that there are 3 levels of "triggers" designed to isolate interesting events (1 event = 1 proton-antiproton collision). All 1 TB doesn't reach the computing cluster (the 2000 nodes are probably the Level 3 trigger system and/or batch reconstruction). From my brief perusal of Fermilab's Farms page [fnal.gov], it seems that they will use several I/O PC's connected via fast or gigabit ethernet to a gigabit ethernet switch, which will be connected to the farm. The switch will also be connected to the central mass storage system.
-- Bob
x86 scales up poorly because of power requirements (Score:1)
The ARM10, due in the middle of this year, has a floating point unit designed to give 600 MFLOPS, so next year's NetWinders should be ideal for clustering.
See :
http://www.arm.com/Pro+Peripherals/Cores/ARM10/
Power considerations at Fermi? (Score:1)
Not Very Useful (Score:1)
Also, the machines ARE Intel (as I posted earlier), PII's to be exact. Why? They're cheaper. Cost matters - even if you are the US government. The analysis programs are very small, so what's really needed in this situation is just a lot of processor time.
2000 seems a bit high (Score:2)
Still, it's a lot. They run a custom version of PVM that processes part of the 1 terabyte of data that is collected every second.
Linux clusters in particle physics... (Score:2)
Fermi Lab. To find out a bit more on how clusters are used in particle physics, see http://hp-linux.cern.ch/.
Not Very Useful (Score:1)
I am assuming that the processors are Alphas, just because it is easier. You could seal all of the cases and run a non-conductive fluid through all of them. Also, it would be possible to simply connect the busses of all of these computers together with something that could handle upwards of 10-20 gigabits/s. You could do hard drives and memory that are directly accessed through this same bus. Unfortunately, what I just described is a Cray, not a Beowulf cluster. (Something resembling a T3E, except the bus speed on those is 128 GB/s.)
PC's are great machines, but they only scale to a certain point. Beowulf also has its own major limitations, such as a lack of distributed shared memory, and the fact that it is unable to run programs that weren't specifically compiled for it. The ultimate distributed solution would be able to distribute anything that had forks or threads (although I don't want to imagine what kind of bandwidth would be required to handle a program forking up to 2048 instances).
just my $.02
you make some bad assumptions (Score:1)
Oh, and BTW, I have PVM up and running on my Linux boxes, which are set up as a mini-Beowulf cluster.
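For the curious, here's a minimal sketch of what master-side PVM code looks like; the "worker" executable name is a placeholder for your own program, and you'd link against -lpvm3:

```c
/* Minimal PVM3 master: spawns worker tasks across the cluster, sends
 * each an integer, and collects the replies. Assumes a compiled
 * "worker" binary is installed on every node. */
#include <stdio.h>
#include "pvm3.h"

#define NWORKERS 4
#define MSGTAG   1

int main(void) {
    int tids[NWORKERS];
    int i, n, result;

    n = pvm_spawn("worker", NULL, PvmTaskDefault, "", NWORKERS, tids);
    if (n < NWORKERS)
        fprintf(stderr, "only spawned %d of %d workers\n", n, NWORKERS);

    for (i = 0; i < n; i++) {           /* hand each worker a unit of work */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&i, 1, 1);
        pvm_send(tids[i], MSGTAG);
    }
    for (i = 0; i < n; i++) {           /* gather the answers */
        pvm_recv(-1, MSGTAG);
        pvm_upkint(&result, 1, 1);
        printf("got %d back\n", result);
    }
    pvm_exit();
    return 0;
}
```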
(last time I poked around) Hmmm (Score:1)
Of course it would be more complicated than that; you probably cannot get 2000 switched interfaces on one backplane, so you would need multiple switches, probably connected by gigabit ethernet.
Boxes on the same switch could communicate much faster than 3 Mbps; off the switch would be much slower. Does Beowulf have some way to control machine 'affinity'?
-Steve
What's the network infrastructure? (Score:1)
I'm sure Fermi knows how to do this. They've been running several RS/6000 CPU farms for a few years. The smallest one I saw (about five years ago or so) was 128 CPUs. The cabling didn't look that bad. Although scaling it up by a factor of 20 might make things a tad hairier.
Probably not Beowulf but 2000 node farm likely (Score:3)
queuing system to send events to processors as they become available.
High energy theory calculations can use the fine-grained parallelism of Beowulf, but I doubt they would try to build a cluster as big as 2000 nodes.
No supporting information on site (Score:1)
You can search for yourself at www.fnal.gov [fnal.gov].
Depends on the application, look at setiathome (Score:1)
They're doing something like the equivalent of 15 *years* of Pentium CPU time *each day* now, and growing..
TA
(done) time for some infrastructure integration (Score:1)
TA
Clustering World's smallest webservers (Score:1)
How do I do this? (Score:1)
You could replace the compile command in your makefile (via a variable replacement or rule) with rsh roundrobin gcc whatever, or maybe ssh, but I think the keys would interfere unless the machines all shared the same host key.
roundrobin would be a DNS entry that automatically gave you a random server.
The machines would all have to share a filesystem via NFS or something, though, and you'd want a minimum of 100 megabit ethernet for that, cuz compiling is fairly disk intensive as well.
Warning: This is just off the top of my head, I haven't actually tested this method, there might be some complications, but they shouldn't be hard to work around. (e.g. if rsh roundrobin doesn't work, replace it with a script that passes the command off to another machine)
If you want help, email me at echo 'LeBleu@NOSPAM@prefer.net' | sed -e 's/@NOSPAM//'
I would of course be better able to test this method if you gave me a couple of PII machines to play with^H^H^H^H^H^H^H^H^Hexperiment on.
Oh, and if that's not a satisfactory enough solution, you could always hire me to modify GNU make to add a distributed option to it, hourly rate negotiable. ;)
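For what it's worth, here's a hypothetical sketch in C of what a round-robin wrapper like the above could look like. The host names and counter-file path are made up, it assumes the NFS-shared filesystem mentioned above, and note that rsh starts in your remote home directory, so a real version would want to cd to the build directory first:

```c
/* Hypothetical "rrsh": round-robins a command across a fixed host list
 * via rsh. Point make at it with something like: CC = ./rrsh gcc */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
    static const char *hosts[] = { "node1", "node2", "node3", "node4" };
    const int nhosts = 4;
    const char *counter = "/tmp/.rr_counter";   /* remembers whose turn it is */
    int n = 0;

    FILE *f = fopen(counter, "r");
    if (f) { fscanf(f, "%d", &n); fclose(f); }
    f = fopen(counter, "w");
    if (f) { fprintf(f, "%d\n", (n + 1) % nhosts); fclose(f); }

    /* build the argument list: rsh <host> <original command...> */
    char **args = malloc((argc + 2) * sizeof(char *));
    args[0] = (char *)"rsh";
    args[1] = (char *)hosts[n % nhosts];
    for (int i = 1; i < argc; i++)
        args[i + 1] = argv[i];
    args[argc + 1] = NULL;

    execvp("rsh", args);
    perror("rsh");                              /* only reached on failure */
    return 1;
}
```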
Where did you get your sig quote ??? (Score:1)
Seems like a good time for this question. (Score:2)
Forget that nonsense! Woooooo Beowulf! (Score:1)
hummm, do you mean a 2000 node Beowulf cluster is not a Beowulf cluster?!!?!
Hmmm (Score:1)
What's the network infrastructure? (Score:1)
Fast ethernet to a non-blocking network switch (not router) can easily approximate the advantages of a number of them with low cost and complexity. This has been used in some of the larger Beowulf clusters to date, such as Avalon.
PSX2 (Score:1)
Hmmm (summary of Fermi's computing architecture) (Score:1)
Actually, the data is typically crunched for hours.
The stream off Level 3 is ~20 MB a second.
I work at Fermi (Score:2)
1) All of the work I have been seeing is going to be done on farms. As stated elsewhere on slashdot the problems being solved here don't really get much help from many CPUs working on the same data set concurrently. The problems are shipped out essentially an event at a time and the analysis (e.g. event reconstruction) is sent back an event at a time. These reconstructed data sets are then used for lots of analysis processes but the worst part of the workload is the reconstruction, which is where the farms are used.
2) There will be lots of machines in the farm, but I'm not sure it will be 2000 machines; that seems pretty high. In all likelihood these boxes will be Linux based dual processor machines.
3) For cooperative computing there are lots of multi-processor SGI boxes (O2K's).
On the 'interesting bit of trivia' side, the data volumes we are talking about are on the order of petabytes. Pretty interesting challenges! It's kind of fun when your thumbnail information approaches the limitations of most relational databases.
What's the network infrastructure? (Score:1)
Playstation Particle Physics? (Score:1)
Program Parallelism? PVM and MPI (Score:2)
Probably not Beowulf but 2000 node farm likely (Score:1)
Lattice QCD? (Score:1)
I don't know what is referred to in the announcement, but I do know that FermiLab is planning to use a PC cluster for *theoretical* lattice QCD calculations (as opposed to the direct analysis of the experimental data from the particle collider, which is the other possible need for such a cluster).
They seem confident that they can overcome the communications problems, which aren't *huge* for LQCD - most, but not all, of the time only nearest neighbour communications are needed - but also cannot be neglected.
To be competitive today, they would have to be able to generate on the order of 500 GFLOPS (the current cutting-edge lattice QCD computer is probably the QCDSP at Brookhaven National Lab, which runs at 600 GFLOPS, using a dedicated machine with 12,000 processor nodes - the individual CPUs are very low powered though, and don't run Linux).
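For a flavour of what that nearest-neighbour communication looks like, here is a sketch in C with MPI; a 1-D decomposition with one double per site is a big simplification of a real lattice QCD code, which works on 4-D lattices of SU(3) matrices:

```c
/* Sketch of nearest-neighbour ("halo") exchange on a ring of nodes:
 * each node holds a slab of the lattice and swaps boundary sites with
 * its left and right neighbours before each update sweep. */
#include <mpi.h>

#define LOCAL 64                 /* real sites per node; +2 halo sites */

int main(int argc, char **argv) {
    double site[LOCAL + 2];      /* site[0] and site[LOCAL+1] are halos */
    int rank, size, left, right;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left  = (rank - 1 + size) % size;   /* periodic boundaries */
    right = (rank + 1) % size;

    /* ... initialize site[1..LOCAL] ... */

    /* send my last real site right, receive my left halo */
    MPI_Sendrecv(&site[LOCAL], 1, MPI_DOUBLE, right, 0,
                 &site[0],     1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &st);
    /* send my first real site left, receive my right halo */
    MPI_Sendrecv(&site[1],         1, MPI_DOUBLE, left,  1,
                 &site[LOCAL + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, &st);

    /* ... update site[1..LOCAL] using the halo values, and repeat ... */

    MPI_Finalize();
    return 0;
}
```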
supporting information? (Score:1)
Forget that nonsense! Woooooo Beowulf! (Score:1)
Use MOSIX. http://www.mosix.cs.huji.ac.il/ (Score:1)
does this count as a "Beowulf" cluster: (Score:1)
Mysterious Fermi 2000 node cluster..... (Score:1)
LSF != beowulf (Score:1)
re: Quest for the Holy Grail. (Score:1)
I just finished a course in distributed systems and this looks like it will be a big thing. I guess a number of the telcos are already looking into this stuff.
Program Parallelism? (Score:1)
Seems like a good time for this question. (Score:1)
Program Parallelism? (Score:3)
I work with a research group at my school which develops an architecture and a programming language designed to let you explicitly specify which functions you want run in parallel, and where. The language is, for lack of a better description, rather complicated relative to C, which it's based on. We have ports for SMP machines, Beowulf clusters, SP2's, etc., but this is the only distributed multithreaded implementation I've been exposed to, and since ours is not public, I was wondering what everybody else was using. Pthreads can't be used for a distributed system, can they? Thanks for any info.
--
Yes, they have 2000 nodes (Score:1)
Hmmm (Score:1)
at CERN, so they're probably going for that.
But, as I'm an ATM person, how about using ATM switches here? I think that would be an exciting project - your point about scalability hits the nail right on the head. 40 Gb/s capacity non-blocking ATM switches are here! (ps. that's backplane capacity, not port speed)
I also see exciting work in constructing specific protocols over ATM to 'tie together' clusters, rather than relying on simply chucking bandwidth at it a la gig ethernet.
But the real way to connect together high performance computers is GSN, which is being used in high energy physics labs and other big data centers.
http://www.gsn.org
800 MBYTES per second. Mmmm....
How do I do this? (Score:1)
I want a poor man's SMP system.
When I use GNU Make, I can use the "make -j" option and it starts *MANY* sub processes all at once. I want the computer I'm on to automatically send those 'jobs' over to some group of hosts on the network automagically - in effect, automatically remote-compile.
Say I have 10 dual Pentium machines running Linux in a *LOCKED* closet. I want them to appear like one virtual CPU. I don't need shared memory across the network.
I'm willing to run TWO different network cables: CABLE 1, the normal public fully functional Ethernet interface; CABLE 2, a "server only" backbone. The idea being, the server cable is "SECURE" - there are minimal, if any, safety checks and minimal fluff in the protocol used to talk between the servers (i.e. this keeps the server-to-server connection *FAST*). The other cable is what the rest of the world sees. Servers could use this for process control and file transfer. (Hell, I'd even add a 2nd or 3rd Ethernet card just for files.)
How do I do this? Beowulf does not seem to be this. I've heard that Sun has this... I'm looking for a Linux solution.
Mysterious Fermi 2000 node cluster..... (Score:3)
Don't get your panties in a bunch - there was obviously a typo in the person's email, probably due to the '0' key sticking.
Yes, I gave a tour to several students from my alma mater in Stillwater, MN.
No, I didn't show them a 600+ node cluster of old 486's (I can't even think of one 486 on site - they are there, I just don't know about them). I did show them the 20 node Run II prototype farm, the 10 node SAM farm and the production 37 node farm. We don't do Beowulf clusters. We do compute farms. Why? They are 2 totally different things. A Beowulf cluster is specifically designed to analyze a little bit of data with a lot of message passing between processors and nodes. This is for tightly bound compute processes. A farm is a cluster specifically designed to crunch huge amounts of data with absolutely no message passing between processes, CPU's, or nodes.
Quick physics lesson on what Fermilab does:
We take protons and anti-protons and accelerate them to darn near the speed of light. Then we collide them in the Tevatron at particular points on the ring that have detectors (currently CDF and D0 (D-Zero)). Each detector has about a million individual detectors, and the amount of data that is actually produced is about a million megabytes of data per second... no, we don't keep it all! The 1st level trigger throws most of the data away 'cause it just isn't interesting. The second level trigger throws some more away. The 3rd level trigger, which will be a 128 node, dual CPU Linux farm, will throw some more away, but it will pass something like 20-250 MB/sec on to the tape library. This number is dependent on the budget allocated for tapes. Roughly, we'll be saving 1.5PB/year (that's Peta Bytes, or 10^15) to tape. After a while, the physicists will want to take a better look at all that data. How? Linux farms. You see, the data that is taken, say, at 2:30:43.8238 on April 23, 2001 will have nothing to do with the data that is taken at 2:30:43.8239 on April 23, 2001, so it's safe to treat that as one individual data set. By doing so, we can put it on a single processor and let that processor grind on it for a while before spitting out the answer. Well, so the question is, should we send out one single data set to a worker node in the farm, let it analyze it, then send it another, or should we dump a lot of data off to the worker node and let it grind for a long time before writing the finished data back to tape? It appears that it's best just to send a big ole chunk (~15GB) of data off to a node, and let it grind for some large (~24 hours) amount of time.
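(A quick back-of-the-envelope check on those figures; the sustained average rate here is my assumption, picked from the quoted 20-250 MB/sec range:)

```c
/* Sanity check: ~47 MB/sec sustained for a year is about 1.5 PB. */
#include <stdio.h>

int main(void) {
    double mb_per_sec   = 47.0;                  /* assumed average rate */
    double sec_per_year = 365.25 * 24 * 3600;    /* ~3.16e7 seconds */
    double pb_per_year  = mb_per_sec * sec_per_year / 1e9;  /* 1 PB = 1e9 MB */
    printf("%.2f PB/year\n", pb_per_year);       /* -> 1.48 PB/year */
    return 0;
}
```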
The "plan" for this year is to purchase about 200 (that's hundred) dual P-II/III machines, depending on our budget. The year after that the "plan" is to purchase about 400 machines, and the year after that (when Run II starts in the Tevatron) the "plan" is to purchase another 400 machines. So, let's do some math - (200+400+400)*2 = 2000 _processors_, but only a thousand nodes.
Yes, the cabling is a problem.
Hope that clears up some confusion.
Cheers,
Dan
How do I do this? (Score:1)
Echelon? (Score:1)
Is Echelon alive?
http://www.wired.com/news/news/politics/story/1