Linux Software

Fermi's 2000 Node Beowulf Cluster

Our eagle eye Chris DiBona is stuck in Chicago this week at Comdex, but he did yell to us that Fermi Labs has announced a 2000 Node Beowulf cluster. If anyone can send me some concrete data, I'd appreciate it.
This discussion has been archived. No new comments can be posted.

  • Anyone know what they are using for the backplane fabric? I wouldn't imagine switched 100Mb ethernet would work too well for that many nodes. I wonder if they are using Myrinet or gigabit ethernet.
  • Posted by Jim Fromm:

    I would say it isn't even parallel processing.
  • Posted by !ErrorBookmarkNotDefined:

    The trouble is there's this bug, see, and the cluster thinks there's only 1900 nodes for some reason. Any ideas?

    -----------------------------
    Computers are useless. They can only give answers.
  • I'd be most interested in knowing how they'll be hooking those 2000 nodes together. That could get really expensive. Sounds like a cabling nightmare! :-)
  • Err, you've never looked at a cluster before, have you.. A 9 node cluster usually uses 27 network cards, 3 for each machine.. 2 to other machines, one to a gigabit router.. It's not all that simple.. In order to utilize these systems, you need to be able to communicate with them in less than one or two hops..

  • I can't really see this working all that much better than 512 machines due to communications bottlenecks. The machines need to be able to talk to the host as they finish things, and 2000 machines going into one 'master' computer would be much too much bandwidth for the master to handle I would think..

    Heck, you'd at most want them 2-3 hops away, and I can't see how this would be possible. That'd be like 20 100 port hubs for cryin out loud..
  • I've never seen any PVM code to do this sort of thing. I guess it COULD be done, but I'm not sure how real time you could look at the data. I guess if you're recording it, and it doesn't all need to come back time critical, this could be done.. Granted, I haven't done much myself, so I could be wrong..

    Can PVM do this type of thing?
  • >What we need is a new machine which is small, consumes low power, and does floating point like mad.

    That is exactly what they are all about.

    Even an older model, say a TMS320C3x can do a 32 bit fp multiply _and_ fp add in a single cycle. That's basically 25-50Mflops or >1Mflop/$ on a rather old 50MHz chip.
  • >connect the busses of all of these computers together with something that could handle upwards of 10-20 gigabits

    10-20Gbs is easy. I once worked on a project where we connected side-by-side PCs together with SCSI. It worked like a charm.

    >You could do hard drives and memory that are directly accessed through this same bus.

    Bingo! It was as USB, firewire, etc. are supposed to be. That is, intelligent, fast peripherals and plenty of bandwidth.


  • Well, the Cisco Catalyst 6500 series can scale to up to 256Gbit on the backplane. The thing is that you wouldn't want all 2000 machines in a single broadcast domain anyhow. Which would mean that you'd make like a spoke and hub arrangement: Perhaps 100 nodes in a broadcast domain, with a gigabit uplink. High end switches can do almost 300 100M ports, so you'd put 2 to 3 VLANs on a single switch, with 2 Gig uplinks to a core switch. 10 such switches would do you well... The numbers are kinda staggering though. 200 ports on a cat5500 is like 250k, and you need 10, and you'd need a 6500 class switch to concentrate all those gig uplinks. That's another 300k, so we're up to 2.8Mil. And then about 2000 computers. At 1k per computer, that's another 2 million. 6 million, and you might need more on racks, and I dunno how much that scale of UPS or generator would cost... But then Extreme Linux is just 50 bucks. NT would be more like another million. And they said that Linux has a higher TCO? heheheheheheheheh
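For what it's worth, the itemised figures in the comment above can be tallied in a few lines of Python. This is only a sketch: the prices are the poster's rough 1999 guesses, not quotes, and the variable names are made up for illustration.

```python
# Rough cost tally using the figures quoted in the comment above.
# All prices are the poster's ballpark guesses, not vendor quotes.
EDGE_SWITCH_COST = 250_000      # ~200 ports on a Catalyst 5500-class switch
EDGE_SWITCH_COUNT = 10          # 10 switches x 200 ports = 2000 ports
CORE_SWITCH_COST = 300_000      # 6500-class switch for the gigabit uplinks
NODE_COST = 1_000               # per computer
NODE_COUNT = 2_000

network_cost = EDGE_SWITCH_COST * EDGE_SWITCH_COUNT + CORE_SWITCH_COST
node_cost = NODE_COST * NODE_COUNT
total_cost = network_cost + node_cost

print(f"network: ${network_cost:,}")
print(f"nodes:   ${node_cost:,}")
print(f"total:   ${total_cost:,}")
```

The listed items come to about $4.8 million, so the "6 million" in the comment presumably leaves headroom for racks, UPS and cooling.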
  • ...if I could get a job working nights at Fermi. You know, cleaning the floors, emptying wastebaskets, play a little Quake. Naahhh, probably not. Oh well.

    ----------------

    "Great spirits have always encountered violent opposition from mediocre minds." - Albert Einstein
  • With that volume of hardware & wattage, it probably makes sense to invest in some good heat exchange systems in the facility and toss the traditional HVAC.

    Why not locate the facilities in more extreme northern/southern/high altitude climes, and put all that heat to good use? The model waste treatment plant in Seattle uses its own methane to power the waste treatment systems, and even sells electricity back to the city. Apply the same idea to a computing facility -- If you know the average heat production of a system (over a large enough population the variances between systems wouldn't be that great even as you perpetually upgraded individual systems), it wouldn't be that hard to design it into a facility HVAC plan and make very efficient use of it.
  • >How do I do this? Beowulf does not seem to be this.

    Beowulf certainly does not. You want a myriad of nodes to act as one single REALLY FAST computer which, I dare say, is the Holy Grail of clustering. If you discover a good way of doing this, then the computing world will knock down your door. :)

    SMP systems will be able to split your threads between processors, yielding a significant speedup, but SMP systems are shared memory devices. Clusters are not, so communication happens over the network.

    To take advantage of a cluster, apps have to be custom written to coordinate over separate machines, connected by a network. The kernel provides virtually no abstraction here...the application author has to fully specify the coordination between components of his task.

    I believe CORBA is meant to help in cases such as this, by providing services with a layer of network transparency. I'm not too familiar with the beast, but it seems like one could write a CORBA app that would, say, perform a FFT calculation on a chunk of data and return the result. Then, you could have a cluster of identical machines networked together, all with this service. You would have a "master" node that you were doing your work on. Whenever you needed to perform a FFT on a vector of data, your application would break up the vector and send requests out to all available systems for a FFT "service", and collect the results when the nodes return their answers.

    There may be modern approaches to distributed computing that are more elegant than the above...this is not an area in which I conduct research. In general, though, I think that the problem is far from solved, and still an area of active research.

    sorry that I couldn't provide you with a more satisfying answer,

    --Lenny

    //"You can't prove anything about a program written in C or FORTRAN.
    It's really just Peek and Poke with some syntactic sugar."
  • Yes, that many Intel boxes draw quite a bit and warm things up pretty good, but compared to what? An ARM would be a bad choice -- poor FP performance.

    My (albeit limited) experience with supercomputers is that you just have to deal with the heat and power issues. My high school, Thomas Jefferson [tjhsst.edu], has an old ETA10 (top of the line in 1988 or so, about as fast as a PII 450 now :). It's got its own frikkin' air conditioning unit in a special room!

    They're building a replacement Beowulf cluster out of Celeron 450a's :)
  • Wow. That must consume something like 250KW of power for CPU alone, never mind air conditioning power requirements. We have a ~300 machine Linux/Solaris/Irix cluster in my department and I can tell you the AC and UPS support costs are pretty serious.

    PPC might be a good challenger to x86 for this market, ARM is an even better contender from a power consumption standpoint, but those new PlayStation 2's will supposedly support firewire and very fast single and double precision floating point. Sony has said that they will support a Linux software development kit for the PS2 (though probably a cross compiler on x86 with a cartridge2PC dongle)... but if Linux gets ported to the PS2 hardware directly (and Sony provides a memory upgrade kit), I could see these in scientific and image rendering machine rooms all over the world. If not the PS2, then something like the ARM based Netwinder which consumes an astonishing 14 Watts of power per unit... unfortunately the Netwinder has very poor floating point performance.

    What we need is a new machine which is small, consumes low power, and does floating point like mad. If the PS2 draws less than 20Watts per unit, it would really fit that bill!
  • maybe we'll see it in the TOP500 [top500.org] list? next update is 10-12 June 1999, hummm, too short...
    --
  • That would be a thousand TB!

    Tera: 2^40 ~10^12
    Peta: 2^50 ~10^15
    Exa: 2^60 ~10^18
  • If it's legit (not bogus) and documented, then we have the ultimate argument against MS FUD. That might be worth looking at.
  • Umm... cogeneration, regeneration & collection and redistribution of otherwise lost energy are not new techniques.

    The ACC at the UW was heated/cooled by the water cooling system for the IBM & Cyber mainframes (if anyone still remembers those machines there).

    When they pulled those out, it was a big deal for a few months how to continue having people using that building without HVAC (but they finally did get a system in)... Seattle may be pretty temperate most of the year, but 90 degrees in Seattle is pretty harsh. Add to this a big thermal heat mass that a concrete building is, with 15 or so hours of sun exposure on it...

    Too bad we can't get a politician heat regeneration system to capture all the hot air and butt gas that they emit...

  • You would be right except for one thing. Next to a superconducting supercooled particle accelerator, a few KW doesn't seem like a very big deal anymore.

  • No. The PS rating is the maximum sustainable power that can be supplied by that unit. The only way to really know what you are using is to put a power meter on the line and see what current you are drawing. One of the advantages of higher end UPS units is the power meter. I've seen HP boxes that were rated as needing 60 amps (that's around 7KW) humming along with 10 amps. Startup power is of course greater.

    For rough estimates, Watts = Volts X Amps. Then you pay by the Kilowatt hour...1000 watts for one hour. There are equations to change watts into cooling load as well. In other words, this 10KW load will need to have xx Tons of cooling capacity to keep it happy.
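The rule of thumb above can be written out as a small Python sketch. The 120 V supply and unity power factor are assumptions made for illustration (the comment gives only amps and an approximate kW figure), and the function names are hypothetical. The conversion used is the standard one, 1 ton of refrigeration = 12,000 BTU/h, or roughly 3.517 kW.

```python
# Back-of-envelope power and cooling estimate.
# Assumptions (not from the comment): 120 V supply, power factor of 1.
# 1 ton of refrigeration = 12,000 BTU/h, roughly 3.517 kW of heat removed.
KW_PER_TON = 3.517

def watts(volts, amps):
    """Watts = Volts x Amps (ignoring power factor)."""
    return volts * amps

def cooling_tons(load_watts):
    """Tons of cooling needed to carry away load_watts of heat."""
    return load_watts / 1000.0 / KW_PER_TON

rated = watts(120, 60)      # nameplate rating: 60 A
measured = watts(120, 10)   # what the meter actually shows: 10 A

print(f"rated {rated} W, measured {measured} W")
print(f"10 kW load needs about {cooling_tons(10_000):.1f} tons of cooling")
```

Under those assumptions the nameplate works out to 7.2 kW, in line with the "around 7KW" quoted, while the measured draw is a sixth of that.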
  • by epaulson ( 7983 )
    If 100meg ethernet wouldn't scale, then neither would myrinet or gigabit.

    For something like this, you'd build it in some sort of tree configuration.

    My guess is probably 100megabit, because a 2000 node myrinet would be super expensive....
  • Master/Slave is not the only way to do parallel computation. Many problems lend themselves very easily to be parallelized over a grid (like ocean models, for example) - then they only need to communicate with their neighbors. Oftentimes these sorts of problems can very easily scale to work efficiently on large clusters.

    It's also possible to use master/slave in a configuration with multiple masters - have one "master" master, and several lower-level masters, maybe one for every 500 hosts, or whatever. Same theory as the proxy servers for RC5...

    -Erik
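As a rough illustration of the "only talk to your neighbors" point above, here is a toy Python sketch of a 1-D grid split into chunks, where each chunk needs just one edge cell from each adjacent chunk (often called a halo exchange). It runs in a single process; the function names are invented for this example, and a real code would use PVM or MPI to move the edge cells between machines. It assumes there are at least as many cells as chunks.

```python
# Toy 1-D grid smoothing split across "nodes", each needing only its
# neighbours' edge cells. Runs in one process; on a real cluster each
# chunk would live on its own machine.

def smooth(cells):
    """One 3-point averaging step on the interior; the two end cells stay fixed."""
    out = list(cells)
    for i in range(1, len(cells) - 1):
        out[i] = (cells[i - 1] + cells[i] + cells[i + 1]) / 3.0
    return out

def split(cells, n_chunks):
    """Cut the grid into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(cells), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        end = start + size + (1 if k < extra else 0)
        chunks.append(cells[start:end])
        start = end
    return chunks

def smooth_distributed(cells, n_chunks):
    chunks = split(cells, n_chunks)
    result = []
    for k, chunk in enumerate(chunks):
        # "Receive" one halo cell from each neighbour, if there is one.
        left = [chunks[k - 1][-1]] if k > 0 else []
        right = [chunks[k + 1][0]] if k < n_chunks - 1 else []
        updated = smooth(left + chunk + right)
        # Drop the halo cells again before collecting the result.
        result.extend(updated[len(left):len(updated) - len(right)])
    return result

grid = [float(i * i) for i in range(10)]
print(smooth_distributed(grid, 3) == smooth(grid))
```

Each chunk exchanges two numbers per step no matter how large it is, which is why this kind of problem tends to scale well.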
  • Yes, PVM can do everything I described. It's not hard.

    -Erik
  • No, that's not true.

    It all depends on your application. Some of the codes they run on the machines on the top500 will scale well to this machine, others will not.

    -Erik
  • The use of these "farms" is a little unusual compared to other "supercomputers". The ratio of computation to communication is large. Basically each computer is given one proton-proton collision and all its associated data to play with. It crunches this for a few seconds, and sends the results back for storage on tape. Nodes do not communicate with each other, so a massive network structure is not needed. Still, I know it's not going to be straight ethernet. ;)

    BTW, I saw another poster mention 1 Tb of data a second. Realize that there are 3 levels of "triggers" designed to isolate interesting events (1 event = 1 proton-antiproton collision). All 1 Tb doesn't reach the computing cluster (The 2000 nodes are probably the Level 3 trigger system and/or batch reconstruction). From my brief perusal of Fermilab's Farms page [fnal.gov] it seems that they will use several I/O PC's connected via fast or gigabit ethernet to a gigabit ethernet switch, which will be connected to the farm. The switch will also be connected to the central mass storage system.

    -- Bob

  • Current Netwinders are no good at fp because Strongarm has no fpu.
    The Arm10, due in the middle of this year, has one designed to give 600 MFlOps so next year's Netwinders should be ideal for clustering.
    See :
    http://www.arm.com/Pro+Peripherals/Cores/ARM10/
  • Well, maybe in a normal workplace, this would actually be considered. But being Fermi accelerates particles to near the speed of light, using ungodly amounts of power (something like $15 million worth a year), it isn't an issue.
  • This system works for Fermi because they have different needs. As I was told, each node is given 4-6 gigs of data, and then 8 hours later, it sends the results back. All the limitations of a typical beowulf cluster you mentioned are valid, but this isn't a typical beowulf cluster.

    Also, the machines ARE intel (as I posted earlier), PII's to be exact. Why? They're cheaper. Cost matters - even if you are the US government. The analysis programs are very small, so what's really needed in this situation is just a lot of processor time.

  • From my tour of Fermi on Friday, I recall seeing a 600+ cpu custom cluster, which is a low 486 class machine, and a ~37 node dual PII 450 cluster, as well as some SGI stuff. I think I was told they would be purchasing a few hundred dual PII's, meant for clustering, each year for the next few years. I don't remember the specific numbers, but I know for a fact they are intel machines - because they are the biggest bang for the buck.
    Still, it's a lot. They run a custom version of PVM, that processes part of the 1 terabyte of data that is collected every second.
  • We at CERN do basically the same as the guys in
    Fermi Lab. To find out a bit more on how clusters
    are used in particle physics see http://hp-linux.cern.ch/.
  • A cluster of this makeup/size isn't going to be all that great. Some people have suggested gigabit to connect the nodes together. This doesn't even come close to the speeds, unless you are putting A LOT of ports in each node. Someone already mentioned that you only want the computers a few hops away, so you are going to want a lot of 20+ port hubs. This also won't do you any good, since hubs are still broadcast. You are going to need probably a large router to handle this. Your networking hw for this could easily make it over $1 million. After you do this, you will then have to find a new bus, since PCI simply isn't going to be fast enough to handle this. Your next problem is that the interprocessor bandwidth just isn't very good on a x86 machine, and neither is memory access speed. To top it all off, you are going to have to find some way to cool everything.

    I am assuming that the processors are Alphas, just because it is easier. You could seal all of the cases and run a non-conductive fluid through all of them. Also, it would be possible to simply connect the busses of all of these computers together with something that could handle upwards of 10-20 gigabits/s. You could do hard drives and memory that are directly accessed through this same bus. Unfortunately, what I just described is a Cray, not a Beowulf cluster. ( something resembling a T3E, except the bus speed on those is 128 GB/s )

    PC's are great machines, but they only scale to a certain point. Beowulf also has its own major limitations, such as a lack of distributed memory, and the fact that it is unable to run programs that weren't specifically compiled for it. The ultimate distributed solution would be able to distribute anything that had forks or threads ( although I don't want to imagine what kind of bandwidth would be required to handle a program forking up to 2048 instances )

    just my $.02
  • I never suggested anything master/slave, I was talking about techniques for making it possible for each node to reach another node in a reasonable time. A master/slave relationship would imply a tree sort of structure.
    Oh, and BTW, I have PVM up and running on my Linux boxes, which are setup as a mini-Beowulf cluster.
  • Last time I looked at switch specs the backplanes were running at over 6Gigbits/second, and that was maybe a year ago. 6Gbps/2k nodes = 3 Mbps/node. Can beowulf work with that?
    Of course it would be more complicated than that, you probably cannot get 2000 switched interfaces on one backplane, so you would need multiple switches, probably connected by Gigabit ethernet.
    Boxes on the same switch could communicate much faster than 3Mbps, off the switch would be much slower. Does beowulf have some way to control machine 'affinity'?

    -Steve
  • I'm sure Fermi knows how to do this. They've been running several RS/6000 CPU farms for a few years. The smallest one I saw (about five years ago or so) was 128 CPUs. The cabling didn't look that bad. Although scaling it up by a factor of 20 might make things a tad hairier.

  • by bquark ( 13175 ) on Monday April 19, 1999 @02:44PM (#1926606)
    Typical high energy physics experiments do not need the fine grained parallelism supplied by systems like Beowulf. An interaction is recorded with 5Kbytes to 100Kbytes of information which is called an event. All of the data for one event is sent to one processor. The next event is taken and sent to the next processor. There is no need for direct communication between those processors, so the network topology is simple. The new experiments could use 2000 nodes profitably. The compute time for many types of jobs is large compared to the time to get data into the processors. It may take seconds to finish one event so bandwidth into one machine only needs to be 100's of kbytes per second. What is needed is a queuing system to send events to processors as they become available.

    High energy theory calculations can use the fine grained parallelism of Beowulf but I doubt they would try to build a cluster as big as 2000 nodes.
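The queuing idea described above (hand each event to whichever processor is free, with no talk between processors) can be sketched in Python roughly as follows. This is not Fermilab's software: threads stand in for farm nodes, and `reconstruct`, `run_farm` and the event layout are made-up placeholders.

```python
# Toy event farm: a shared queue hands events to whichever worker is free.
# Threads stand in for farm nodes; reconstruct() is a made-up placeholder
# for the real per-event analysis.
import queue
import threading

def reconstruct(event):
    """Hypothetical per-event work: here just sum the event's readings."""
    return sum(event["hits"])

def run_farm(events, n_workers=4):
    todo = queue.Queue()
    for ev in events:
        todo.put(ev)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                ev = todo.get_nowait()   # grab the next event, if any is left
            except queue.Empty:
                return                   # queue drained: this worker is done
            value = reconstruct(ev)      # no communication with other workers
            with lock:
                results[ev["id"]] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

events = [{"id": i, "hits": [i, i + 1, i + 2]} for i in range(10)]
results = run_farm(events)
print(sorted(results.items()))
```

Because the only shared pieces are the work queue and the results store, adding workers needs no change to the per-event code, which fits the claim that 2000 nodes could be used profitably.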
  • A reasonably thorough look at the site does not support this story, not to say it isn't true.

    You can search for yourself at www.fnal.gov [fnal.gov].

  • You can't just say that the network would be a bottleneck, that's totally dependent on the application. Look at setiathome, they will be processing enormous amounts of data in the end but the CPU/data rate is still very big so the slow Internet is more than fast enough. 340 KB of data takes half an hour on a fast MIPS CPU, one hour twenty on a 400MHz Pentium-II.
    They're doing something like the equivalent of 15 *years* of Pentium CPU time *each day* now, and growing..
    TA
  • A satellite groundstation I know in northern Sweden (I don't have a map here but let's say it's around 67-68 deg. northern latitude) heats the building with the heat that the computer room coolers generate. However, many years ago when they were installing the thing only one of the two huge coolers (each is as big as a ship's engine) was working.. it got so hot in the computer room we had to keep all the 3-meter tall windows wide open, with -35 C (-31 F I think) outside and the temperature breakers still tripped :-)
    TA
  • Has anyone thought about Clustering those DIMM-PC's that the world's smallest web server runs on (previous slashdot article)??? They're the size of like an SDRAM chip, and go into a DIMM socket, so if you made a sufficient backplane and slapped several hundred, or several thousand, they'd never have to deal with network cable, ethernet cards or anything, and they're like $400 a piece with like 16 megs of RAM and a 16 meg flash rom, and a 486-sx 66. Granted they're not that powerful by themselves, but you'd have power in numbers, you could probably fit a hundred of those on a motherboard sized backplane. And 100 of those would run circles around a top of the line machine today, and it'd be in one box. I was looking at that world's smallest web server thing and it got me thinking. The advantage of those over regular PCs is size, 2000 desktop machines would take several rooms, whereas 2000 of those, would be 2000 cubic inches...not that much room at all, but it'd cost ~$800,000. Hmmm if they're using 2000 machines, that's got to cost some major flow, would it even be worth it?
  • You could replace the compile command in your makefile (via a variable replacement or rule) with rsh roundrobin gcc whatever, or maybe ssh but I think the keys would interfere unless the machines all shared the same host key.

    roundrobin would be a DNS entry that automatically gave you a random server.

    The machines would all have to share filesystem via NFS or something though, and you'd want a minimum of 100 megabit ethernet for that, cuz compiling is fairly disk intensive as well.

    Warning: This is just off the top of my head, I haven't actually tested this method, there might be some complications, but they shouldn't be hard to work around. (e.g. if rsh roundrobin doesn't work, replace it with a script that passes the command off to another machine)

    If you want help, email me at echo 'LeBleu@NOSPAM@prefer.net' | sed -e 's/@NOSPAM//'

    I would of course be better able to test this method if you gave me a couple of PII machines to play with^H^H^H^H^H^H^H^H^Hexperiment on.

    Oh, and if that's not a satisfactory enough solution, you could always hire me to modify GNU make to add a distributed option to it, hourly rate negotiable. ;)
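A rough Python sketch of the rotation idea above. Note the swap: the comment proposes a DNS entry that hands back a random server, whereas this does the rotation on the client side. The host and file names are hypothetical, and the commands are only built and printed, not run; actually running them would need working rsh access and a shared filesystem, as the comment says.

```python
# Sketch: spread compile jobs over a list of hosts in round-robin order.
# Host names and file names are hypothetical; the commands are only built,
# not executed.
import itertools

def assign_jobs(sources, hosts, compiler="gcc"):
    """Pair each source file with the next host in rotation."""
    rotation = itertools.cycle(hosts)
    jobs = []
    for src in sources:
        host = next(rotation)
        obj = src.rsplit(".", 1)[0] + ".o"
        jobs.append(["rsh", host, compiler, "-c", src, "-o", obj])
    return jobs

jobs = assign_jobs(["a.c", "b.c", "c.c"], ["node1", "node2"])
for job in jobs:
    print(" ".join(job))
```

In a makefile the same effect would come from replacing the compile command with a small wrapper script, which is the fallback the comment suggests if `rsh roundrobin` itself does not work.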

  • I got it from my Computer Science book, "A Computer Science Tapestry" by Owen Astrachan. I sure hope Prof. Astrachan didn't make it up! It's in the context of Bill Gates describing going through the trash of the CS building at Harvard, reading the source to their operating systems (if Bill Gates rooting for OS source in a trash-pile is not a perfect image, I don't know what is).
  • This reminds me of a question that I've had but haven't gotten answered. If I have a 300W power supply, does that mean the computer will be drawing 300W constantly or will it just draw what it needs? I ask because, when building a server or some other type of computer that would be up 24-7 but not necessarily saturated, it would be a pretty serious concern if you could maybe get by with 250W, saving about 50MW/yr (I've got to be doing that math wrong, but still a lot). So, who knows?
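Taking the question's hypothetical at face value (a machine that really did draw 50 W less, around the clock), the yearly figure can be checked with a short Python sketch. Energy is power times time, so the natural unit is kilowatt-hours rather than "MW/yr"; the names below are made up for the example.

```python
# Energy saved by drawing 50 W less around the clock for a year.
HOURS_PER_YEAR = 24 * 365          # 8760 h, ignoring leap years

def annual_kwh(watts):
    """Energy used in a year at a constant draw of `watts`."""
    return watts * HOURS_PER_YEAR / 1000.0

saving_kwh = annual_kwh(300) - annual_kwh(250)
print(f"{saving_kwh:.0f} kWh per year")
```

So the saving would be a few hundred kilowatt-hours a year per machine, not megawatts.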
  • Why don't they just use a Beowulf cluster instead!?

    hummm, do you mean a 2000 node Beowulf cluster is not a Beowulf cluster?!!?!

  • by Gumber ( 17306 )
    It all depends on the application, doesn't it. If most intranode communication is local then switched 100MBit could work just fine. Global communication is another story.
  • You overgeneralize. There are all sorts of topologies for clusters.

    Fast ethernet to a non-blocking network switch (not router) can easily approximate the advantages of a number of them with low cost and complexity. This has been used in some of the larger beowulf clusters to date, such as Avalon.
  • by Gumber ( 17306 )
    Whether the specs are real or not, the graphical output of the things is truly jawdropping. The unbelievable screen captures people have circulated actually understate the effect.
  • It crunches this for a few seconds, and sends the results back for storage on tape.

    Actually the data is crunched for hours typically.

    The stream off level 3 is ~20MB a sec.

  • I work on the data access systems at Fermi and here is some skinny.

    1) All of the work I have been seeing is going to be done on farms. As stated elsewhere on slashdot the problems being solved here don't really get much help from many CPUs working on the same data set concurrently. The problems are shipped out essentially an event at a time and the analysis (e.g. event reconstruction) is sent back an event at a time. These reconstructed data sets are then used for lots of analysis processes but the worst part of the workload is the reconstruction, which is where the farms are used.

    2) There will be lots of machines in the farm, but I'm not sure it will be 2000 machines, that seems pretty high. In all likelihood these boxes will be Linux based dual processor machines.

    3) For cooperative computing there are lots of multi-processor SGI boxes (O2K's).

    On the 'interesting bit of trivia' side, the data volumes we are talking about are on order of petabytes. Pretty interesting challenges! It's kind of fun when your thumbnail information approaches the limitations of most relational databases.
  • Yes and no; I don't see how it would be any harder than wiring up 2,000 workstations in a large office building. There are engineering firms that do this sort of thing as a matter of routine. I'd hate to see the bill, though.
  • How would you like to be the Director who was expected to sign the purchase req. for 2000 Playstations? I'd expect he would receive a quick visit from his local friendly GAO auditor.
  • Most programs written for Beowulf clusters use PVM or MPI. PVM is older, most new codes use MPI. At least, all the ones I've written use MPI.
  • So if I understand correctly, this machine will be very useful for FERMI, if it is made to run the software they use over at top500 [top500.org], it would give worse results say than a ~300 nodes beowulf cluster?
  • While I'm not sure which cluster
    is referred to in the announcement,
    I do know that FermiLab is
    planning to use a pc cluster
    for *Theoretical* Lattice QCD
    calculations (as opposed to the
    direct analysis of the experimental
    data from the particle collider,
    which is the other possible need
    for such a cluster).

    They seem confident that they can
    overcome the communications problems,
    which aren't *huge* for LQCD - most,
    but not all, of the time only nearest
    neighbour communications are needed -
    but also cannot be neglected.


    To be competitive today, they would
    have to be able to generate of the
    order of 500 GFLOPS (the current
    cutting edge lattice QCD computer is
    probably the QCDSP at Brookhaven
    National Lab, which runs at 600 GFLOPS,
    using a dedicated machine with
    12,000 processor nodes - the individual
    cpu's are very low powered though,
    and don't run linux).
  • I don't have more time to keep wading, but check out www-hppc.fnal.gov/farms/r2farms.html and thereabouts
  • Hmmm...a Beowulf cluster is just a cluster running either PVM or MPI. The only new definition is that Beowulfs are made of commodity components. I think the term itself is used very loosely....if my solaris network runs PVM and MPI then why isn't it a beowulf?!
  • Last I read MOSIX wasn't open sourced. I wouldn't trust anything on linux which isn't open sourced, especially kernel modules. What is there to stop binaries from taking over my computer (and privacy) when I'm not watching?
  • I feel the term is used very loosely. From what I can tell all beowulf clusters tend to run PVM and MPI and some load balancing software. But the definition of Beowulf itself specifies that the cluster has got to be made up of commodity components. I just call my cluster a cluster. Words/names like "Beowulf" or "Supercomputer" confuses definitions. Many supercomputer purists do not rate Beowulfs in the same league mainly due to the architecture....the term "distributed computing" comes to mind.
  • I hate definitions but isn't this VERY coarse grain parallel processing???!!!
  • I stand corrected but there is no such software known as "Beowulf", it is just a specification. The packages which parallelize the cluster are PVM and MPI.
  • CORBA could certainly do this. CORBA is a middleware solution and acts much like a software bus overtop of the network. But like you said, the apps would have to be written.

    I just finished a course in distributed systems and this looks like it will be a big thing. I guess a number of the telcos are already looking into this stuff.
  • One good choice is Cilk [mit.edu]. It is freely available and easy to use: all you have to do is insert a few keywords into your C code, and function calls can be spawned onto another processor. The code itself is not dependent on the machine specifics (i.e. number of processors) because that is all handled by the Cilk runtime system. The main version is for SMPs and there is also a distributed version.
  • It only uses what power it needs, and a bit that is lost converting AC to DC power. Keep big power supplies so the voltage doesn't drop in your equipment.

  • by Kastagir ( 33415 ) on Monday April 19, 1999 @02:01PM (#1926635)
    I admit I'm not too familiar with everything, but I was wondering where the parallelism/load balancing in the programs they run on massive clusters like this comes from?

    I work with a research group at my school which develops an architecture and a programming language designed to explicitly tell the program what functions you want run in parallel and where. The language is, for lack of a better description, rather complicated relative to C, which it's based on. We have ports for SMP machines, Beowulf clusters, SP2's etc, but this is the only distributed multithreaded implementation I've been exposed to and since ours is not public, I was wondering what everybody else was using. Pthreads can't be used for a distributed system can they? Thanks for any info.

    --
  • Hmmm, not off the top of my head. They should hire a shitload of cobol programmers to bill them for 30,000 man hours, then upgrade the bioses. That should get them above the 2000 mark.
  • Well, I gather Gigabit Ethernet will be big
    at CERN, so they're probably going for that.

    But, as I'm an ATM person, how about using ATM
    switches here? I think that would be an exciting
    project - your point about scalability hits
    the nail right on the head.
    40 Gbs capacity non-blocking ATM switches are here! (ps. that's backplane capacity, not port speed)
    I also see exciting work in constructing specific
    protocols over ATM to 'tie together' clusters,
    rather than relying on simply chucking bandwidth
    at it a la gig ethernet.


    But the real way to connect together high performance computers is GSN,
    which is being used in high energy physics labs,
    and other big data centers.
    http://www.gsn.org
    800 MBYTES per second. Mmmm....
  • I don't want to do massive number crunching, just normal software development environment. I want a
    poor man's SMP system.

    When I use GNU Make, I can use the "make -j" option and it starts *MANY* sub processes all at once. I want the computer I'm on to automatically
    send those 'jobs' over to some group of hosts on
    the network auto-majically, in effect - automatically remote-compile.

    Say I have 10 dual pentium machines running linux in a *LOCKED* closet. I want them to appear like one virtual CPU. I don't need shared memory across
    the network.

    I'm willing to run TWO different network cables, CABLE 1 - the normal public fully functional ethernet interface. CABLE 2 a "Server Only" backbone. The idea being, the server cable is "SECURE" there is minimal, if any safety checks
    and minimal fluff in the protocol used to talk between the servers (ie: this keeps
    the server-2-server connection *FAST*) The other cable is what the rest of the world sees. Servers
    could use this for process control and file transfer. (Hell I'd even add a 2nd or 3rd ethernet card just for files)

    How do I do this? Beowulf does not seem to be this.
    I've heard that Sun has this... I'm looking for
    a Linux solution.
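    One way to picture what is being asked for here is a dispatcher that hands each compile command to the next machine in the closet. The Python below is only a rough sketch under stated assumptions: the host names (node01, node02) and the build directory are hypothetical, the source tree is assumed to be visible at the same path on every host (for example over NFS), and remote execution is assumed to go through plain ssh. It only builds the command lines; it does not run them.

```python
import itertools
import shlex


def assign_jobs(commands, hosts):
    """Pair each compile command with a host, cycling through the hosts in order."""
    if not hosts:
        raise ValueError("need at least one build host")
    # zip stops when the (finite) command list runs out.
    return list(zip(itertools.cycle(hosts), commands))


def ssh_argv(host, command, workdir):
    """Build the argv for running one command on a remote host via ssh.

    Assumes `workdir` exists at the same path on every host (hypothetical
    shared filesystem); the command string itself is passed through as-is.
    """
    remote = f"cd {shlex.quote(workdir)} && {command}"
    return ["ssh", host, remote]


if __name__ == "__main__":
    hosts = ["node01", "node02"]  # hypothetical machines in the closet
    commands = ["gcc -c a.c", "gcc -c b.c", "gcc -c c.c"]
    for host, command in assign_jobs(commands, hosts):
        print(ssh_argv(host, command, "/home/build/src"))
```

    Round-robin is the simplest possible policy; a real setup would also want to account for how busy each host is, which this sketch ignores.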




  • by Dan Yocum ( 37773 ) on Monday April 19, 1999 @08:44PM (#1926639) Journal
    Hey Everyone,


    Don't get your panties in a bunch - there was obviously a typo in the person's email, probably due to the '0' key sticking.


    Yes, I gave a tour to several students from my alma mater in Stillwater, MN.


    No, I didn't show them a 600+ node cluster of old 486's (I can't even think of one 486 on site - they are there, I just don't know about them). I did show them the 20 node Run II prototype farm, the 10 node SAM farm and the production 37 node farm. We don't do Beowulf clusters. We do compute farms. Why? They are 2 totally different things. A Beowulf cluster is specifically designed to analyze a little bit of data with a lot of message passing between processors and nodes. This is for tightly bound compute processes. A farm is a cluster specifically designed to crunch huge amounts of data with absolutely no message passing between processes, CPUs, or nodes.
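    To make the farm versus Beowulf distinction above a bit more concrete, here is a small Python sketch. It is an illustration only, not Fermilab code: `analyze_event` is a made-up stand-in for a real analysis, and the 3-point smoothing is just a convenient example of a tightly coupled calculation. The farm-style function never looks at another worker's data, while the partitioned smoothing has to fetch edge values from its neighbors, which is where the message passing comes from.

```python
def analyze_event(hits):
    """Stand-in analysis (hypothetical): total of one event's hit values."""
    return sum(hits)


def run_farm(events):
    """Farm style: each event is handled entirely on its own, so no worker
    ever needs data held by another worker."""
    return [analyze_event(hits) for hits in events]


def smooth(values):
    """Reference 3-point average over the whole array (edge values repeat)."""
    padded = [values[0]] + values + [values[-1]]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3
            for i in range(1, len(padded) - 1)]


def smooth_partitioned(partitions):
    """Same 3-point average with the array split across workers.

    Each partition must obtain one edge value from each neighbor before it
    can update; every such fetch is counted as one message. Assumes every
    partition is non-empty.
    """
    messages = 0
    results = []
    for idx, part in enumerate(partitions):
        if idx > 0:
            left = partitions[idx - 1][-1]   # value "received" from left neighbor
            messages += 1
        else:
            left = part[0]
        if idx < len(partitions) - 1:
            right = partitions[idx + 1][0]   # value "received" from right neighbor
            messages += 1
        else:
            right = part[-1]
        padded = [left] + part + [right]
        results.append([(padded[i - 1] + padded[i] + padded[i + 1]) / 3
                        for i in range(1, len(padded) - 1)])
    return results, messages
```

    In the coupled case the message count grows with the number of partitions and repeats on every update step, so interconnect latency matters; in the farm case it stays at zero no matter how many workers are added.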


    Quick physics lesson on what Fermilab does:


    We take protons and anti-protons and accelerate them to darn near the speed of light. Then we collide them in the Tevatron at particular points on the ring that have detectors (currently CDF and D0 (D-Zero)). Each detector has about a million individual detectors and the amount of data that is actually produced is about a million megabytes per second... no, we don't keep it all! The 1st level trigger throws most of the data away 'cause it just isn't interesting. The 2nd level trigger throws some more away. The 3rd level trigger, which will be a 128 node, dual CPU Linux farm, will throw some more away, but it will pass something like 20-250 MB/sec on to the tape library. This number is dependent on the budget allocated for tapes. Roughly, we'll be saving 1.5 PB/year (that's petabytes, or 10^15 bytes) to tape.


    After a while, the physicists will want to take a better look at all that data. How? Linux farms. You see, the data that is taken, say, at 2:30:43.8238 on April 23, 2001 will have nothing to do with the data that is taken at 2:30:43.8239 on April 23, 2001, so it's safe to treat that as one individual data set. By doing so, we can put it on a single processor and let that processor grind on it for a while before spitting out the answer.


    So the question is: should we send one single data set to a worker node in the farm, let it analyze it, then send it another, or should we dump A LOT of data off to the worker node and let it grind for a long time before writing the finished data back to tape? It appears that it's best just to send a big ole chunk (~15GB) of data off to a node, and let it grind for some large (~24 hours) amount of time.
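    For a feel of the numbers quoted above, here is a back-of-the-envelope calculation in Python. It is only a sketch under stated assumptions: decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes), a 365-day year, data written evenly through the year, and each worker handling exactly one ~15 GB chunk per ~24 hours. The function names are made up for this example, and the worker count is an inference from those rounded figures, not a number from the post.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600


def average_rate(bytes_per_year):
    """Average bytes per second if the yearly volume is written evenly."""
    return bytes_per_year / SECONDS_PER_YEAR


def workers_to_keep_up(bytes_per_year, chunk_bytes, chunk_seconds):
    """Workers needed so that reprocessing keeps pace with the average
    recording rate, assuming one chunk in flight per worker."""
    per_worker = chunk_bytes / chunk_seconds
    return math.ceil(average_rate(bytes_per_year) / per_worker)


if __name__ == "__main__":
    tape_rate = average_rate(1.5e15)                      # 1.5 PB/year
    workers = workers_to_keep_up(1.5e15, 15e9, SECONDS_PER_DAY)
    print(f"average rate to tape: {tape_rate / 1e6:.1f} MB/s")
    print(f"workers needed at 15 GB per 24 h each: {workers}")
```

    Under these assumptions the average works out to about 47.6 MB/s, which sits inside the quoted 20-250 MB/sec range, and roughly 274 workers would be needed to reprocess data as fast as it is recorded.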


    The "plan" for this year is to purchase about 200 (that's hundred) dual P-II/III machines, depending on our budget. The year after that the "plan" is to purchase about 400 machines, and the year after that (when Run II starts in the Tevatron) the "plan" is to purchase another 400 machines. So, let's do some math - (200+400+400)*2 = 2000 _processors_, but only a thousand nodes.


    Yes, the cabling is a problem. ;)


    Hope that clears up some confusion.


    Cheers,


    Dan

  • I saw a link at 32bitsonline.com if I recall correctly, that offered something named ppmake. It was designed to spawn different compilers on different machines; I think it worked in a beowulf environment (did I spell beowulf properly? hmm). As for running traffic between two servers, my guess is that you could just ifconfig a second network card on each machine, and route add the direct route from one machine to the other using the secondary network cards. Of course, I have never done this, so YMMV. :)
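    The second-network-card idea above can be sketched as a small Python helper that prints the `ifconfig` and `route add` commands each machine would need. This is an untested illustration, in the same spirit as the suggestion itself: the host names, the 192.168.10.0/24 private subnet, and the `eth1` interface name are all assumptions, and on many systems the explicit host routes would be redundant once the interface has an address on the subnet.

```python
import ipaddress


def backbone_commands(hostnames, subnet="192.168.10.0/24", iface="eth1"):
    """Return {hostname: [shell commands]} for a private server-only network.

    Addresses are handed out in order from the (assumed) private subnet, and
    each host gets a direct host route to every peer over the second card.
    """
    net = ipaddress.ip_network(subnet)
    if len(hostnames) > net.num_addresses - 2:
        raise ValueError("subnet too small for this many hosts")
    addrs = dict(zip(hostnames, (str(a) for a in net.hosts())))
    plan = {}
    for name in hostnames:
        cmds = [f"ifconfig {iface} {addrs[name]} netmask {net.netmask} up"]
        for peer in hostnames:
            if peer != name:
                cmds.append(f"route add -host {addrs[peer]} dev {iface}")
        plan[name] = cmds
    return plan


if __name__ == "__main__":
    for host, cmds in backbone_commands(["node01", "node02", "node03"]).items():
        print(host)
        for cmd in cmds:
            print("   ", cmd)
```

    The helper only generates text; the commands would still have to be run as root on each machine, and as the poster says, YMMV.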
  • Petabytes being processed ...
    Is Echelon alive?
    http://www.wired.com/news/news/politics/story/15864.html
