disCERNing Data Analysis 82
technodummy writes: "Wired is reporting how CERN is driving the Linux-based, EU-funded DataGRID project. And no, they say, it's nothing like Seti@Home. The project describes its objective as:
'The objective is to enable next generation scientific exploration which requires intensive computation and analysis of shared large-scale databases, from hundreds of TeraBytes to PetaBytes, across widely distributed scientific communities.'" If you're interested in this, check out the Fermilab work with LinuxNetworkX data, as well as the all-powerful Google search on the Fermi Collider Linux project. As jamie points out, "Colliders produce *amazing* amounts of data in *amazingly* short time periods... on the order of 'here's a gigabyte, you have 10 milliseconds to pull whatever's valuable out of it before the next gigabyte arrives'."
Technology transfer (Score:2, Funny)
Re:Anyone know (Score:1)
EU funding (Score:2, Flamebait)
Has anyone actually seen an IT-related EU project that achieved something? The company I work for has been involved in two EU project proposals so far, and nothing came of either of them -- though each consumed a large amount of university resources over its three failed applications.
Re:EU funding (Score:5, Informative)
Government-funded work, in the EU, US and internationally, actually drives changes in the IT industry a lot more than most people realise (or perhaps would care to admit).
For Christ's sake, the web itself came out of a CERN project! Many other web standards also originated in EU-funded projects, for instance JPEG and MPEG. So the most common formats on the web for text (HTML), images (JPEG), and video (MPEG) all owe something to EU funding.
And of course the Internet itself comes from US government-funded projects. Even commonly used business processes have resulted from government-funded work (project management methodologies, for example).
Both Americans and Europeans like to bitch about the inefficiencies of their governments, but the fact of the matter is that if you look at the history of IT, more fundamental innovations come from government-funded work than from industry. Of course Bill Gates, Larry Ellison etc. don't want you to think that, but that's the way it is.
Re:EU funding (Score:1)
Government-funded work isn't the same as government work.
Re:EU funding (Score:1)
Government-funded work isn't the same as government work.
?????
Government work is work done by employees of the government; government-funded work is work done by people the government is giving money to. Pretty close, in my line of work. Also consider that in Europe, most researchers are paid directly by the government, as employees -- unlike in America.
Re:And you forget... (Score:2, Informative)
Re:And you forget... (Score:1)
Re:EU funding (Score:3, Interesting)
Perhaps you are expecting the wrong results.
I have been involved in a couple of large EU funded projects, and have spoken to the project managers about the aims and motives of the projects.
One principal point is that just because a new successful product/standard/format whatever does not arise from a project, does not mean that it has been a failure.
The EU is made up of lots of different countries with lots of different types of people speaking different languages and with different working mentalities. This is a major competitive disadvantage for us compared to a country like the US. If a company in San Francisco wants to work with a company in New York, there aren't many barriers to them doing that. In the EU, there are lots of barriers. One of the main aims of EU-funded projects (and the EU in general) is to break down these barriers by getting different companies and universities working together across the EU. If new technologies come out of these projects, so much the better, but that's not necessarily the principal aim.
sheer quantity of data (Score:3, Interesting)
Does anyone have an idea of how much data SETI@home has processed? That would certainly be useful as a yardstick of sorts.
Re:sheer quantity of data (Score:2, Informative)
Chunks of data are perhaps 0.5 MB
Re:sheer quantity of data (Score:2, Informative)
http://setiathome.ssl.berkeley.edu/totals.html [berkeley.edu]
Totals:
Users: 3,383,619
Results received: 399,604,453
Total CPU time: 799,230.603 years
Floating point operations: 1.142642e+21 (29.64 TeraFLOPs/sec)
Average CPU time per work unit: 17 hr 31 min 13.7 sec
Re:sheer quantity of data (Score:1)
Yow!
distributed computing (Score:5, Interesting)
Let's see: 1 GB in 10 ms works out to 100 GB per second. How recently did gigabit Ethernet come about? And what would the average user's bandwidth be? I would guess much less, but let's assume 100 KB per second.
So you have 107,374,182,400 bytes of data per second. Your users can take 102,400 bytes per second each. Even if everyone were connected directly to your network (no delays or bottlenecks... ha!) you would still need 1,048,576 users (that is over 1 million).
And this is not taking into account sending any data BACK to the source, or actual computation time on the users' machines.
-sam
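Redoing the arithmetic above as a quick Python sketch (the 100 KB per second per user figure is, as noted, just an assumption):

# Back-of-envelope check of the numbers above.
data_rate = 1 * 2**30 / 0.010        # 1 GiB every 10 ms -> bytes per second
per_user = 100 * 2**10               # assumed per-user bandwidth, bytes per second

users_needed = data_rate / per_user
print(f"raw data rate: {data_rate:,.0f} bytes/s")    # 107,374,182,400
print(f"users needed : {users_needed:,.0f}")         # 1,048,576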
Re:distributed computing (Score:5, Informative)
Re:distributed computing (Score:2)
Re:distributed computing (Score:3, Informative)
Well, 100 GB per second is the raw data rate, as read out (heavily in parallel) from the detector, i.e. the data rate the DAQ (Data AQuisition) system has to keep up with. That's pretty difficult really, but it's done completely in hardware: the readout chips have relatively large on-chip buffers for each read-out channel. MOST OF THIS DATA IS DISCARDED RIGHT AWAY by the so-called Level 1 Trigger, whose purpose is to throw away the most obviously uninteresting collisions.
Since the data rate after L1 is still WAY too large to be stored in full, another trigger, unimaginatively called the Level 2 Trigger, sorts out even more crap. Since the data rate is lower than for L1, L2 can use more sophisticated algorithms to figure out which event is crap and which is an ever-famous Higgs [web.cern.ch] decay.
One more trigger, Level 3 (you guessed it), is used to reduce the amount of data even further, again with more sophisticated means.
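Just to illustrate the idea (not the real algorithms -- those are far more elaborate, and L1 runs in custom hardware), here's a toy sketch of such a cascade in Python, with made-up cuts and a made-up event model:

import random

# Toy three-level trigger: each level sees fewer events than the previous one,
# so it can afford a slower, more sophisticated decision.
def level1(e):                # crude, hardware-speed cut
    return e["energy"] > 20
def level2(e):                # somewhat smarter
    return e["energy"] > 50 and e["tracks"] >= 2
def level3(e):                # most sophisticated, runs on a processor farm
    return e["energy"] > 100 and e["tracks"] >= 4

events = [{"energy": random.expovariate(1 / 30), "tracks": random.randint(0, 6)}
          for _ in range(100_000)]

for trig in (level1, level2, level3):
    events = [e for e in events if trig(e)]
    print(f"{trig.__name__}: {len(events):6d} events survive")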
Still, the required bandwidth is quite impressive. At CDF II [fnal.gov], the data rate after Level 3 will be about 75 events per second, at half a meg each, summing up to 30-40 MB per second (well enough to saturate Gbit ethernet), which are all reconstructed [uni-karlsruhe.de] right away. Note that for the LHC [web.cern.ch] experiments (CMS, ATLAS) the amount of data is more than an order of magnitude larger than for CDF and D0 (at Fermilab [fnal.gov]).
The LHC data will be spread all over the world, using a multi-tier architecture with CERN being Tier 0, and national computing centers as Tier 1 centers, universities being Tier 2, etc. No national computing center will be able to store ALL data, so the idea is that e.g. your Higgs search will be conducted on the U.S. Tier 1 center, B physics on the German Tier 1 center and so on. Obviously not only US scientists will search for the Higgs, so others will also submit analysis jobs on the US Tier 1 and vice versa. To get this working, the GRID [gridcomputing.org] is designed. A current implementation is GLOBUS [globus.org].
Having said this, it is important to note that right now, the GRID is nowhere near this goal. Submitting jobs in this "fire and forget" way is not possible yet. There is a shitload of problems yet to solve, the most important ones being trust and horsepower.
Trust: you must allow complete strangers to utilize your multi-million dollar cluster, and they haven't even signed a terms-of-use form.
Horsepower: everybody expects to get more CPU cycles out of the GRID than he/she contributes. Obviously, this will not work. (Although the load levelling might improve the overall performance.)
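To make the tier picture above a bit more concrete, here is a minimal sketch of the routing idea: a job goes to whichever centre holds the dataset it needs. The centre names and the catalogue are invented for illustration and have nothing to do with the real GRID/GLOBUS interfaces:

# Hypothetical catalogue mapping datasets to the Tier 1 centre that stores them.
TIER1_DATASETS = {
    "higgs_search": "US Tier 1",
    "b_physics": "German Tier 1",
}

def submit_job(dataset, analysis):
    """Route an analysis job to the centre holding its data ('fire and forget')."""
    centre = TIER1_DATASETS.get(dataset)
    if centre is None:
        raise ValueError(f"no Tier 1 centre holds dataset {dataset!r}")
    print(f"submitting {analysis!r} on {dataset!r} to the {centre}")

submit_job("higgs_search", "my_higgs_analysis")    # runs at the US Tier 1
submit_job("b_physics", "my_bs_mixing_analysis")   # runs at the German Tier 1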
Here's how you do it... (Score:2, Funny)
Storage to the rescue (Score:2, Insightful)
...or just write it all as it comes in and analyze it later. That's how most other science takes place. Since when is scientific analysis "real-time"?
In general, the scientific process does not require conclusions during an experiment. I think CERN should cite a different reason for this project; there are many valid ones.
Re:Storage to the rescue (Score:3, Informative)
Re:Storage to the rescue (Score:4, Informative)
The problem is that there's way too much data to write to any storage medium to analyze later. The bandwidth makes hard drives look like tiny, tiny straws. When they throw the switch and the protons or whatever start smacking into each other, they get many collisions in a row, several every millisecond, maybe dozens every millisecond (depending on collider circumference I imagine). The huge array of detectors around the collision point stream out big chunks of data for each collision. The first line of defense is a network of computers that get handed each collision, or parts of it broken down, in round-robin order or something. Their job is to sift through X megabytes very quickly to decide whether there's anything "interesting" in this collision that warrants being remembered. If no flags go up, the data just gets dropped on the floor.
The datagrid described in the article is, as far as I can tell, set up to process data after that "first line of defense" -- even after dropping the majority of the bits on the floor, there is still a prodigious amount that has to be sifted through, just to check that the Higgs didn't leave a track or something. That's a different sort of engineering project.
My point was just that, yes, the amount of data involved here really is amazingly large.
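If it helps, here is a rough sketch of that first line of defense, assuming round-robin dispatch to a small farm; the chunk format and the "interesting" test are placeholders I made up:

import random
from itertools import cycle

NUM_NODES = 8

def is_interesting(chunk):
    # Placeholder verdict; the real decision is a fast physics-level filter.
    return max(chunk) > 250

def dispatch(collisions):
    """Hand each collision's chunk to the next farm node in round-robin order."""
    nodes = cycle(range(NUM_NODES))
    kept = []
    for chunk in collisions:
        node = next(nodes)               # which node examines this chunk
        if is_interesting(chunk):
            kept.append((node, chunk))   # a flag went up: remember it
        # otherwise the chunk is simply dropped on the floor
    return kept

fake_chunks = [[random.randint(0, 255) for _ in range(16)] for _ in range(1000)]
print(f"kept {len(dispatch(fake_chunks))} of {len(fake_chunks)} collisions")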
I was there! (Score:3, Funny)
I can see it already.
*** jamie(~who@gives.a.fl.us [mailto]) joined #slashdot ... u know linux and all ... hold on a sec ... can't find it though ... CERN is driving the Linux-based, EU funded, DataGRID project. ... like you know here's a gig fill it and you've got a split second to pull the goods outtathere ... now i gotta work this
<CmdrTaco> lookin' for cyber msg me
<Hem0s> Hey jamie
* KatzAWAY is now away [logger:on]
<jamie> hey hemos
<Hem0s> whazzup?
<jamie> oh got this gr8 link here but got no access to the backend right now. can u help me out?
<Hem0s> sure thing.. what you got?
<CmdrTaco> jamie a/s/l?
<jamie> i found this link about this grid computing whatsimagigger and i just thought it's cool
<Hem0s> u uh
<CmdrTaco> jamie a/s/l?
<jamie> shut up taco
<Hem0s> so what's the link?
<timothy> boooooring
<jamie> i found it while zapping through wired somehow my browser crashed on me again can u go find it?
<Hem0s> sure
<CmdrTaco> timothy a/s/l?
*** CmdrTaco (rob@home [mailto]) Quit (Connection reset by peer)
<jamie> gotta tell you i LOVE that post you did on OpenGL a minute ago
<Hem0s> thx
<jamie> it's there somewhere
*** CmdrTaco (rob@home [mailto]) joined #slashdot
*** bill{Taco} sets mode: +b CmdrTaco
<jamie> ok lemme try again
<Hem0s> hurry jamie i already fired up mah mozilla dont know how long she stays put
<CmdrTaco> lookin for a good time? msg me
*** KatzAWAY left #slashdot
<jamie> here it is
<jamie> The objective is to enable next generation scientific exploration which requires intensive computation and analysis of shared large-scale databases, from hundreds of TeraBytes to PetaBytes, across widely distributed scientific communities.'
<Hem0s> great stuff... lemme copy'npaste here..
<jamie> somethin bout amazing amounts of stuff in short timed periods
<Hem0s> you don't mind if i edit this a bit don't you
<jamie>gotta go bye!
<Hem0s> you don't mind if i redo this a bit don't you?
*** jamie left #slashdot (gotta reboot bye)
<CmdrTaco> lookin for cyber. msg me
<Hem0s> great
*** michael sets mode: +ms
*** You were kicked by michael (spyin on us?)
Re:Storage to the rescue (Score:1)
When they throw the switch and the protons or whatever start smacking into each other, they get many collisions in a row, several every millisecond, maybe dozens every millisecond (depending on collider circumference I imagine).
Yup, 40,000 every millisecond in the case of the LHC! Actually the size of the collider doesn't really limit the collision rate, since there's no problem with having more than one "bunch" of protons (or whatever) going round the ring at once.
Re:Storage to the rescue (Score:1)
Consider it 10ms to distill the Gigabyte into the useful 100MB that (you hope) may mean something in a few years.
-Paul
Re:Storage to the rescue (Score:4, Informative)
1 GB per 10 ms comes out to 100 GB per second. After 24 hours of experimentation, you find yourself with 8.6 million gigabytes. Hard drives are cheap, but not THAT cheap. And even if you had LOTS of 100 GB hard drives, you would still need to find a place to PUT 86 thousand of them.
Every 24 hours.
After one week's worth of data collection, you have 600 thousand 100 GB hard drives of data.
This is why 'store now, analyze later' is not a great option for collision data. You have to take that 100 GB of data per second and first filter it, asking: 'Which of these collisions might be interesting to look at? Which ones produced the particles we are trying to study?'
-sam
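The same numbers as a quick sketch (using the 100 GB/s raw rate and 100 GB drives assumed above):

rate_gb_per_s = 100                      # raw rate assumed above
drive_gb = 100                           # capacity of one drive
seconds_per_day = 24 * 60 * 60

gb_per_day = rate_gb_per_s * seconds_per_day
drives_per_day = gb_per_day / drive_gb
print(f"{gb_per_day:,.0f} GB per day")               # 8,640,000 GB
print(f"{drives_per_day:,.0f} drives per day")       # 86,400 drives
print(f"{drives_per_day * 7:,.0f} drives per week")  # 604,800 drives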
Re:Storage to the rescue (Score:1)
For instance, one possible application of this technology would be the ability to modify the beam in some way (flux, pulse, polarization etc.) in real time during the experiment. Say that a high number of a certain desirable event is observed. It might be interesting to try to modify the beam quickly to see what effect that causes.
Heck- you could even set up some sort of feedback algorithm to maximize the number of events in real time, and that would be incredibly useful for people struggling to dig a signal out of a high noise level.
In principle this sort of high-density data acquisition and rapid analysis could have applications in a number of fields well beyond experimental particle physics.
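As a toy sketch of the feedback idea: nudge one beam parameter, and keep the change only when the observed event rate improves. The "polarization" knob, the rate model and the step size are all invented; real beam control is far more involved:

import random

def observed_rate(polarization):
    # Stand-in for the measured rate of the desirable events; in reality this
    # number would come from the online trigger/DAQ, not from a formula.
    return -(polarization - 0.7) ** 2 + random.gauss(0, 1e-3)

def tune(steps=200, step_size=0.05):
    knob, best = 0.0, observed_rate(0.0)
    for _ in range(steps):
        trial = knob + random.choice([-1, 1]) * step_size
        rate = observed_rate(trial)
        if rate > best:                  # keep the change only if the rate improved
            knob, best = trial, rate
    return knob

print(f"feedback loop settled near polarization = {tune():.2f}")  # roughly 0.7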
Re:Storage to the rescue (Score:1)
The full rate (real time, unfiltered, non zero-suppressed) for one of these detectors is *much* higher (40 million collisions/second * 10**7 channels of readout electronics). That's the real-time problem, and that is nothing to do with the Grid.
Re:Storage to the rescue (Score:2, Insightful)
Re:Storage to the rescue (Score:1)
That's an unbelievably inefficient solution you have proposed. The problem is that we are studying quantum mechanical effects, so everything we do is extremely statistical. Most of the time when an event happens in our detector, it is some trivial, well understood event. If we didn't somehow ignore some (not all!) of these events, we wouldn't have enough space left over for interesting events that involve rare processes.
But how is this science? Because it is extremely reproducible. We take enough unbiased data to understand the effect of the more biased triggers. Then the biased triggers give us lots of rare events to study.
Keep in mind that our analysis still happens offline; we just have to work very hard to extract a signal while we are taking data.
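The "ignore some (not all!)" part is usually done with prescales: record every rare-process candidate, but only one in N of the common, well-understood events, so an unbiased sample is still kept for comparison. A sketch, with a made-up prescale factor:

import random

PRESCALE = 1000        # keep 1 in 1000 common events (made-up factor)
_common_seen = 0

def keep(event):
    global _common_seen
    if event["rare"]:
        return True                        # rare processes are always recorded
    _common_seen += 1
    return _common_seen % PRESCALE == 0    # unbiased subsample of common events

events = [{"rare": random.random() < 1e-4} for _ in range(1_000_000)]
recorded = [e for e in events if keep(e)]
print(f"recorded {len(recorded):,} of {len(events):,} events")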
This wont work very well (Score:1, Interesting)
I know broadband is getting more accepted, but I don't think real-time is going to work on this kind of scale. SETI is successful because anyone can run it (even if it is slow) and there's competition to get the most work units done. Without something to keep people interested, no one is going to run anything from CERN. And without the ability for a broad range of people to run a client or something, there won't be enough people anyway.
Hard drive space is cheap (compared to a super-collider), so why can't they store all these petabytes of data? When the project gets more successful, they'll be able to actually analyse all the extra data they've got. I mean, if you're going to spend that much money on a collider, you might as well get as much info as you can from it.
good luck,
sopwath
Re:This wont work very well (Score:1)
Grid computing? (Score:3, Insightful)
Yes, if you can't invent an idea, rename it, and maybe you'll get some credit. What the hell, it's worked before [slashdot.org].
Oh well. More power to them. It looks like a great opportunity for the world to learn that Linux is a powerful tool [extremelinux.org].
Re:Grid computing? (Score:2)
-sam
Re:Grid computing? (Score:1)
Re:Grid computing? (Score:1)
The main difference from existing distributed computing projects is that data storage is distributed as well as data processing; hence the investment in super-fast networks that people talk about.
Mind you, personally I don't see why we don't just put all the computers in the same room, and save all that investment in fibre...
Re:Grid computing? (Score:2)
In that article it says:
"One example of how this is not done is SETI," said Ellis, referring to the popular screensaver program beloved by millions of home and work computer users. The program processes chunks of satellite data for the Search for Extra Terrestrial Intelligence project.
"It's not real-time, and it's not online," said Brian Coghlan of Trinity College Dublin, an Irish participant in DataGRID. "You go to SETI and laboriously download data and then laboriously send it back."
With DataGRID, they're talking about a network that can do real-time processing of petabytes of data -- a barely imaginable amount of information. One petabyte is a quadrillion bytes -- equal to all the information that could be held on 125,000 PC hard drives."
SETI data can be delayed. If you don't get online for a while, your data is held back from the grid. Doesn't that make it different?
Solid State Niche (Score:2, Interesting)
Re:Solid State Niche (Score:1, Interesting)
It's a little-known fact... (Score:2, Interesting)
Bring on the pixie dust!
(source [losangelesalmanac.com])
You stole my partytrick! (redundant) (Score:1)
http://slashdot.org/comments.pl?sid=23464&cid=2
900,720 km^2
The United States of America is 9,372,143 km^2
Alaska is 1,518,800 km^2
Texas is 692,405 km^2
Arizona is 295,024 km^2
The Atlantic Ocean is 82,362,000 km^2
Europe is 10,360,000 km^2
Denmark (my home country) is a measly 43,069 km^2
Great Britain is 244,044 km^2
Germany is 356,733 km^2
France is 547,026 km^2
The Pacific Ocean is 181,300,000 km^2
Australia is 7,686,810 km^2
Greenland (the largest island in the world) is 2,175,600 km^2
Re:You stole my partytrick! (redundant) (Score:1)
Customization of PCs (Score:2)
I'm sure that a custom system could be designed and built for the problem on the cheap (using off-the-shelf products and parts), and the cost could be spread around the various colliders around the world.
Heck, it would make for a good DARPA grant -- hint, hint.
Also, thinking about the amount of data generated, I'm sure that the collectors have some sort of system to buffer all that data (an ungodly amount of RAM, anyone?) which is then sent down the wire to storage over multiple NICs.
I also don't think that colliders are run 24/7, as someone else suggested/wrote.
Henry
Re:Customization of PCs (Score:2)
I also don't think that colliders are run 24/7, as someone else suggested/wrote.
Actually, they are :-) They are run 24/7 for months at a time, then taken down for a few days/weeks for minor repairs, swap outs, and minor upgrades, then they go back up. And they do this for a few years on end. Then they go down for major overhauls and upgrades, and hopefully a few more runs.
Re:Customization of PCs (Score:1)
Tell that to the grad students who are there at 2AM running experiments. I had a summer job at TRIUMF in Vancouver, and I can assure you that they don't just unplug the cyclotron at 5PM and go home.
I had to run a few shifts on one experiment that was shooting muons into cryogenic solids. Typical sequence:
- Collect data for 30 minutes
- Adjust temperature setpoint on the controller
- Go down to the experimental floor (involving various safety interlocks to ensure that the beamline is shut off before you open the door).
- Turn a liquid-helium valve just a little bit (this was the coarse temperature adjustment; the electric heater was the fine adjustment).
- Go back up to the data-collection room and press "start" (assuming you turned the valve by the right amount; if not go down and try again)
- Repeat until the sun comes up and they let you go home.
Particle physicists tend to be very good at Solitaire and Minesweeper.
Similar work here in the US: HENP, NEES, etc. (Score:1)
There is a great deal of activity here in the US w.r.t. the transfer of large amounts of data via advanced networks. Internet2 [internet2.edu] is working with the international physics community from the US side, for example through the HENP Networking Working Group [bnl.gov] (High Energy and Nuclear Physics). Additionally, there is work with the National Earthquake Engineering Simulation Grid [neesgrid.org]. NEES is going to be collecting similar amounts of information from earthquake simulation experiments.
Some of the most interesting work is being done by those involved with the End to End Performance Initiative [internet2.edu]. These folks are trying to figure out what it takes to support the data transfer rates that will soon be necessary.
It continues to amaze me that it is now possible to use a network to transfer data to a disk/array faster than the disk/array can process it. I believe that many have pointed out that hardware (in terms of Moore's law and data acquisition/processing) is not keeping up with the rate of data creation. But that is probably a bit obvious to most of us.
A few Corrections (Score:4, Informative)
However, that said, D0 is heavily involved with the GRID project and has what is arguably one of the first production GRID applications, called SAM. This system essentially manages all of our data files around the entire globe and allows any member to run an analysis job on a selected set of data files. SAM then handles the task of getting those files to the machine where the job is running, using whatever means is required (rcp, or fetching them from a tape store). SAM also allows remote institutes to add data to the store, which is used primarily by large farms of remote Linux boxes running event simulations. We are also currently working on integrating SAM into our desktop Linux cluster, which will allow us to use the incredibly cheap disk and CPU available for Linux machines. For more details you can consult the following web pages:
http://www-d0.fnal.gov/ - the D0 homepage
http://d0db.fnal.gov/sam - the SAM homepage
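I haven't used SAM, so the following is only a guess at the shape of the "get the file to the job by whatever means is required" behaviour described above; the catalogue, names and calls are hypothetical, not SAM's real interface:

import shutil
import subprocess
from pathlib import Path

# Hypothetical catalogue: file name -> (location kind, source). Purely illustrative.
CATALOGUE = {
    "run1234_events.dat": ("local", "/data/run1234_events.dat"),
    "run5678_events.dat": ("remote", "d0node07:/data/run5678_events.dat"),
    "run9012_events.dat": ("tape", "tape://store/run9012_events.dat"),
}

def stage_from_tape(source, dest):
    raise NotImplementedError("tape staging is site-specific")

def deliver(name, workdir="scratch"):
    """Make the requested file available locally, whatever that takes."""
    kind, source = CATALOGUE[name]
    dest = Path(workdir) / name
    dest.parent.mkdir(parents=True, exist_ok=True)
    if kind == "local":
        shutil.copy(source, dest)
    elif kind == "remote":
        subprocess.run(["rcp", source, str(dest)], check=True)  # as mentioned in the post
    else:
        stage_from_tape(source, dest)
    return dest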
Uhm, kinda funny (Score:1)
Tested under Netscape 6.2 only...
Virtual science (Score:3, Interesting)
This reminds me of an astronomy-related story I saw yesterday [yahoo.com]. Some projects are generating more data than the people doing the projects can handle.
Re:Virtual science (Score:1)
But I'm certain that not-necessarily-modern pattern-recognition software can handle the bulk of the satellite data, and publicly available speech recognition software, piped to publicly available grammar analyzers, can handle rudimentary analysis of radio and, to a lesser extent, telephone conversations.
That doesn't cover data conveyed with insinuation, or non-standard modem connection speeds, but maybe the government has already paid to have that done.
Has anyone considered looking at publicly available information (like the CIA's allotment of the US budget) and working out how much R&D that could fund?
From the ATLAS TDR... (Score:3, Informative)
So I went and found the ATLAS Technical Design Report, which gives all the numbers:
The final data rate is expected to be about 1 PB/year (1 PB = 10^15 B = 10^9 MB). The LHC collider will probably run for about 25 years, and there will be at least two experiments (and maybe up to four) running for most of that time.
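For scale, multiplying those figures out (using the comment's own range of two to four experiments):

pb_per_year_per_experiment = 1      # from the TDR figure quoted above
years = 25
experiments = (2, 4)                # "at least two ... maybe up to four"

low, high = (pb_per_year_per_experiment * years * n for n in experiments)
print(f"roughly {low}-{high} PB over the lifetime of the LHC")   # 50-100 PB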
A similar project: GriPhyN (Score:1)
You can read about it at: www.griphyn.org [griphyn.org]
Buried on the web site is the original proposal [ufl.edu] they made, and it gives you some idea of the amount of data we're working with.
Some approximate statistics from the paper:
SDSS gets data at 8MB/s, 10TB/year.
LIGO will get data at 10MB/s, 250TB/year.
CMS will get data at 100MB/s, 5 Petabytes per year.
Work has already been done with simulated data for CMS, and a demo of virtual data (which may be pre-calculated, or calculated on demand) for CMS was shown at the Supercomputing 2001 conference last week. They used Condor clusters from a few different sites. I'm not sure which sites made it into the final demo, but it may have included U. Florida, Argonne, and U. Wisconsin.
Info... (Score:1)
Fun facts of the Fermilab PA:
700 scientists and engineers work there.
1000 giant superconducting magnets.
$10 million in annual electricity bills.
15 miles of pipes to carry the liquid helium to the magnets.
It's an execute-only world out there (Score:1)
Basically what it comes down to is that most people (even GNU/Linux users) want to download and run the program, and MAYBE poke at the code a little. But take over actual maintainership (even if it's next to no actual work)? Fuggedaboutit!
Storing the data isn't the only problem... (Score:1)
One of the things DataGrid is designed to do is to give researchers easy access to the data they need.
It's kind of like a distributed data store with a tree-like structure: the collider feeds data to national centres, they feed data to regional centres, regional centres feed data to local research groups, and the researchers analyse the data.
What's more interesting, is what happens when these researchers start to exchange their results... terabytes of data flying around in all directions, not just downstream.
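A minimal sketch of that tree-like flow; the node names are invented placeholders for the national, regional and group-level centres:

# Invented tier tree: data fans out from the collider toward research groups.
TREE = {
    "collider": ["national centre A", "national centre B"],
    "national centre A": ["regional centre A1", "regional centre A2"],
    "national centre B": ["regional centre B1"],
    "regional centre A1": ["research group X"],
    "regional centre A2": [],
    "regional centre B1": ["research group Y"],
    "research group X": [],
    "research group Y": [],
}

def push_down(node, dataset, depth=0):
    """Replicate a dataset from one node to everything below it in the tree."""
    print("  " * depth + f"{node} <- {dataset}")
    for child in TREE.get(node, []):
        push_down(child, dataset, depth + 1)

push_down("collider", "reconstructed events, run 42")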
As for Grid Computing, yes -- most of the technology isn't new, but then again neither was the World Wide Web. The Web was successful because it took existing good ideas, added a killer application (Mosaic), and proved to be useful to fields other than the one it was developed for.
The problem is that "grid" computing is being used to describe a number of distinctly different things: distributed data stores, clustered supercomputers, run-anywhere computing resources, commodity computing...
See the GlobalGridForum pages at: http://www.gridforum.org [gridforum.org]
for more details about Grid research and projects across the world.