disCERNing Data Analysis
technodummy writes: "Wired is reporting how CERN is driving the Linux-based, EU-funded DataGRID project. And no, they say, it's nothing like Seti@Home. The project's description on its site is: 'The objective is to enable next generation scientific exploration which requires intensive computation and analysis of shared large-scale databases, from hundreds of TeraBytes to PetaBytes, across widely distributed scientific communities.'" If you're interested in this, check out the Fermilab work with Linux NetworX, as well as the all-powerful Google search on the Fermi Collider Linux project. As jamie points out, "Colliders produce *amazing* amounts of data in *amazingly* short time periods... on the order of 'here's a gigabyte, you have 10 milliseconds to pull whatever's valuable out of it before the next gigabyte arrives.'"
Re:Storage to the rescue (Score:4, Informative)
The problem is that there's way too much data to write to any storage medium and analyze later. The bandwidth makes hard drives look like tiny, tiny straws. When they throw the switch and the protons or whatever start smacking into each other, they get many collisions in a row: several, maybe dozens, every millisecond (depending on collider circumference, I imagine). The huge array of detectors around the collision point streams out big chunks of data for each collision. The first line of defense is a network of computers that get handed each collision, or pieces of it, in round-robin order or something. Their job is to sift through X megabytes very quickly and decide whether there's anything "interesting" in this collision that warrants being remembered. If no flags go up, the data just gets dropped on the floor.
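To make that first-pass filtering concrete, here's a rough Python sketch of the idea; the function, the threshold, and the fake events are all made up for illustration and have nothing to do with any real DAQ code:

import os

def looks_interesting(event: bytes) -> bool:
    # Stand-in for the real physics criteria (high transverse momentum,
    # missing energy, lepton candidates, ...). This toy cut just sums the
    # first kilobyte and keeps roughly half of random events.
    return sum(event[:1024]) > 131_000

def first_line_of_defense(event_stream):
    # Sift through each chunk very quickly; keep it only if a flag goes up,
    # otherwise the data is simply dropped on the floor.
    for event in event_stream:
        if looks_interesting(event):
            yield event

# Ten fake 2 MB "collisions" standing in for the detector read-out.
fake_events = (os.urandom(2_000_000) for _ in range(10))
kept = list(first_line_of_defense(fake_events))
print(f"kept {len(kept)} of 10 events")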
The datagrid described in the article is, as far as I can tell, set up to process data after that "first line of defense" -- even after dropping the majority of the bits on the floor, there is still a prodigious amount that has to be sifted through, just to check that the Higgs didn't leave a track or something. That's a different sort of engineering project.
My point was just that, yes, the amount of data involved here really is amazingly large.
Re:EU funding (Score:5, Informative)
Government-funded work, in the EU, the US and internationally, actually drives changes in the IT industry a lot more than most people realise (or perhaps would care to admit).
For chrissakes, the web itself came out of a CERN project! Many other web standards also originated in EU-funded projects, for instance JPEG and MPEG. So the most common formats on the web for text (HTML), images (JPEG), and video (MPEG) all owe something to funding from the EU.
And of course the Internet itself comes from US government-funded projects. Even commonly used business processes have resulted from government-funded work (project management methodologies).
Both Americans and Europeans like to bitch about the inefficiencies of their governments, but the fact of the matter is that if you look at the history of IT, more fundamental innovations have come from government-funded work than from industry. Of course Bill Gates, Larry Ellison etc. don't want you to think that, but that's the way it is.
Re:Storage to the rescue (Score:4, Informative)
1 GB per 10 ms comes out to 100 GB per second. After 24 hours of experimentation, you find yourself with 8.6 million gigabytes. Hard drives are cheap, but not THAT cheap. And even if you had LOTS of 100 GB hard drives, you'd still need to find a place to PUT 86 thousand of them.
Every 24 hours.
After 1 week's worth of data collection, you have 600 thousand 100 GB hard drives of data.
This is why 'store now, analyze later' is not much of an option for collision data. You have to take that 100 GB of data per second and first filter it, asking: 'Which of these collisions might be interesting to look at? Which ones produced the particles we are trying to study?'
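For what it's worth, here is the same arithmetic spelled out in Python (nothing experiment-specific, just the numbers above):

# 1 GB every 10 ms, accumulated over a day and a week.
rate_gb_per_s = 1 / 0.010                 # 100 GB/s
gb_per_day = rate_gb_per_s * 86_400       # 8,640,000 GB, i.e. ~8.6 million GB
drives_per_day = gb_per_day / 100         # 86,400 hundred-GB drives per day
drives_per_week = drives_per_day * 7      # ~605,000 drives per week

print(rate_gb_per_s, gb_per_day, drives_per_day, drives_per_week)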
-sam
Re:sheer quantity of data (Score:2, Informative)
Chunks of data are perhaps 0.5 MB
Re:sheer quantity of data (Score:2, Informative)
http://setiathome.ssl.berkeley.edu/totals.html [berkeley.edu]

Users: 3,383,619 total (1,872 in the last 24 hours)
Results received: 399,604,453
Total CPU time: 799,230.603 years
Floating-point operations: 1.142642e+21 (29.64 TeraFLOPs/sec)
Average CPU time per work unit: 17 hr 31 min 13.7 sec
A few Corrections (Score:4, Informative)
That said, D0 is heavily involved with the GRID project and has what is arguably one of the first production GRID applications, called SAM. This system essentially manages all of our data files around the entire globe and allows any member to run an analysis job on a selected set of data files. SAM then handles the task of getting those files to the machine where the job is running, using whatever means is required (rcp, or fetching them from a tape store). SAM also allows remote institutes to add data to the store; this is used primarily by large farms of remote Linux boxes which run event simulations. We are also currently working on integrating SAM into our desktop Linux cluster, which will allow us to use the incredibly cheap disk and CPU available for Linux machines. For more details you can consult the following web pages:
http://www-d0.fnal.gov/ - the D0 homepage
http://d0db.fnal.gov/sam - the SAM homepage
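As a very loose illustration of the workflow described above (in Python, with invented names; this is not the real SAM API), an analysis job over a SAM-managed dataset amounts to something like:

def deliver(filename: str) -> str:
    # Stand-in for SAM's file delivery: in reality it would copy the file
    # from a nearby cache, rcp it from another site, or stage it from the
    # tape store, whatever is required. Here it just fabricates a path.
    return f"/scratch/{filename}"

def run_analysis(dataset: list[str], analyze) -> None:
    # The user picks a set of logical file names; the system worries about
    # where the bytes actually live and how to get them to this machine.
    for filename in dataset:
        local_path = deliver(filename)
        analyze(local_path)

# Example: "analyze" is just print here; a real job would read the events.
run_analysis(["run1234_evts.raw", "run1235_evts.raw"], analyze=print)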
From the ATLAS TDR... (Score:3, Informative)
So I went and found the ATLAS Technical Design Report, which gives all the numbers:
The final data rate is expected to be about 1 PB/year (1 PB = 10^15 B = 10^9 MB). The LHC collider will probably run for about 25 years, and there will be at least two experiments (and maybe up to four) running for most of that time.
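Taking those numbers at face value, the implied lifetime total is easy to work out (a back-of-the-envelope calculation, not a figure from the TDR):

rate_pb_per_year = 1          # ~1 PB/year per experiment
years = 25                    # expected LHC lifetime
low, high = 2, 4              # number of experiments running

print(rate_pb_per_year * years * low,   # 50 PB
      rate_pb_per_year * years * high)  # 100 PB over the machine's lifetime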
Re:distributed computing (Score:3, Informative)
Well, 100 GB per second is the raw data rate, as read out (heavily parallel) from the detector, i.e. the data rate the DAQ (data acquisition) system has to keep up with. That's pretty difficult really, but done completely in hardware: the readout chips have relatively large on-chip buffers for each read-out channel. MOST OF THIS DATA IS DISCARDED RIGHT AWAY by the so-called Level 1 Trigger, whose purpose is to throw away the most obviously uninteresting collisions.
Since the data rate after L1 is still WAY too large to be stored in full, another trigger, unimaginatively called the Level 2 Trigger, sorts out even more crap. Since the data rate is lower than at L1, L2 can use more sophisticated algorithms to figure out which event is crap and which is an ever-famous Higgs [web.cern.ch] decay.
One more trigger, Level 3 (you guessed it), is used to even further reduce the amount of data, again with more sophisticated means.
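Schematically, the three levels chain together like this; a rough Python sketch with made-up cuts (the real triggers live largely in hardware and are far more sophisticated), just to show how each level sees fewer events and can therefore spend more time per event:

import random

def level1(event) -> bool:   # fast, crude, hardware-like decision
    return event["energy"] > 50
def level2(event) -> bool:   # slower, uses partial reconstruction
    return event["tracks"] >= 2
def level3(event) -> bool:   # slowest, near-offline-quality software
    return event["energy"] * event["tracks"] > 200

def trigger_chain(events):
    for event in events:
        if level1(event) and level2(event) and level3(event):
            yield event      # only these survive to permanent storage

events = ({"energy": random.uniform(0, 100), "tracks": random.randint(0, 10)}
          for _ in range(100_000))
print(sum(1 for _ in trigger_chain(events)), "events kept out of 100,000")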
Still, the required bandwidth is quite impressive. At CDF II [fnal.gov], the data rate after Level 3 will be about 75 events per second, at half a meg each, summing up to 30-40 MB per second (roughly a third of a Gbit Ethernet link), which are all reconstructed [uni-karlsruhe.de] right away. Note that for the LHC [web.cern.ch] experiments (CMS, ATLAS) the amount of data is more than an order of magnitude larger than for CDF and D0 (at Fermilab [fnal.gov]).
The LHC data will be spread all over the world, using a multi-tier architecture with CERN being Tier 0, national computing centers being Tier 1, universities being Tier 2, etc. No national computing center will be able to store ALL the data, so the idea is that, e.g., your Higgs search will be conducted on the US Tier 1 center, B physics on the German Tier 1 center, and so on. Obviously not only US scientists will search for the Higgs, so others will also submit analysis jobs to the US Tier 1, and vice versa. To get this working, the GRID [gridcomputing.org] is being designed. A current implementation is GLOBUS [globus.org].
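A crude sketch of that routing idea (all center and dataset names below are invented; the point is only that the job travels to the data, not the other way round):

# Which Tier 1 center holds which dataset (purely hypothetical examples).
TIER1_DATASETS = {
    "us-tier1": {"higgs-candidates"},
    "de-tier1": {"b-physics"},
    "uk-tier1": {"top-quark"},
}

def submit(job_name: str, dataset: str) -> str:
    # Route the analysis job to a center that already stores the dataset,
    # instead of shipping petabytes of data to the user.
    for center, datasets in TIER1_DATASETS.items():
        if dataset in datasets:
            return f"job '{job_name}' queued at {center}"
    raise LookupError(f"no Tier 1 center holds {dataset}")

print(submit("my-higgs-search", "higgs-candidates"))   # -> queued at us-tier1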
Having said this, it is important to note that right now, the GRID is nowhere near this goal. Submitting jobs in this "fire and forget" way is not possible yet. There is a shitload of problems yet to solve; the most important ones are trust and horsepower.
Trust: you must allow complete strangers to utilize your multi-million dollar cluster, and they haven't even signed a terms-of-use form.
Horsepower: everybody expects to get more CPU cycles out of the GRID than he/she contributes. Obviously, this will not work. (Although load levelling might improve the overall performance.)