New Linux Petabyte-Scale Distributed File System 132

Posted by samzenpus
from the check-it-out dept.
An anonymous reader writes "A recent addition to Linux's impressive selection of file systems is Ceph, a distributed file system that incorporates replication and fault tolerance while maintaining POSIX compatibility. Explore the architecture of Ceph and learn how it provides fault tolerance and simplifies the management of massive amounts of data."

  • but in soviet Russia file systems Distribute you
  • History (Score:4, Informative)

    by Alcoholic Dali (1024937) on Wednesday May 05, 2010 @08:22PM (#32106480)
    Ceph was designed by Sage Weil (of WebRing fame), who is also one of the founders of DreamHost. They will likely be using it internally soon, if they aren't already. http://en.wikipedia.org/wiki/DreamHost [wikipedia.org]
  • Look at Google and Facebook, arguably among the top users of massive databases. They have petabytes upon petabytes of data stored and are constantly growing. But what happens if they lose some data?

    Nothing. They can always go back and regenerate that data. It's just a matter of time.

    So at this large scale, it doesn't make any sense at all to focus on data integrity beyond making sure that fopen() and fread() don't return garbage. It's the smaller databases that contain critical information that need data in

    • by CoderJoe (97563) * on Wednesday May 05, 2010 @08:36PM (#32106618)

      Google's BigFile/BigTable architecture is a distributed filesystem. If a node goes down, the data that was on that node gets copied to other nodes to keep the replication count up.

      Facebook is using Apache Cassandra, which adopts a similar design.
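
To make the "keep the replication count up" idea concrete, here is a minimal sketch of re-replication after a node failure. It is not how Google's systems or Cassandra actually implement it; the function name `rereplicate`, the node names, and the replica count of 3 are all assumptions for illustration.

```python
import random


def rereplicate(placement, live_nodes, replicas=3, rng=None):
    """Restore each chunk's replica count after some nodes fail.

    placement:  dict mapping chunk id -> set of node names holding a copy
    live_nodes: set of node names that are still reachable
    Returns a new placement that only references live nodes.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch deterministic
    repaired = {}
    for chunk, holders in placement.items():
        survivors = holders & live_nodes
        if not survivors:
            # Every copy was on a failed node: nothing left to copy from.
            raise RuntimeError(f"chunk {chunk} lost: no surviving replica")
        # Nodes that are alive and do not already hold this chunk.
        candidates = sorted(live_nodes - survivors)
        missing = min(replicas - len(survivors), len(candidates))
        new_holders = rng.sample(candidates, missing) if missing > 0 else []
        repaired[chunk] = survivors | set(new_holders)
    return repaired


# Hypothetical cluster: node n2 has gone down.
placement = {"a": {"n1", "n2", "n3"}, "b": {"n2", "n3", "n4"}}
live = {"n1", "n3", "n4", "n5"}
repaired = rereplicate(placement, live)
for chunk, holders in sorted(repaired.items()):
    print(chunk, sorted(holders))
```

Data is only lost if every replica of a chunk sits on nodes that fail before the copy completes, which is why the replica count and placement policy matter more than the reliability of any single node.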

      • Re: (Score:2, Interesting)

        by Anonymous Coward

        Yes, but Google's file system makes no attempt to implement either the POSIX standard or the Linux VFS. It's highly specialized to only deal with the types of loads that Google sees. As a general solution, its worth is debatable.

        • by CoderJoe (97563) *

          Yes, but Google's file system makes no attempt to implement either the POSIX standard or the Linux VFS. It's highly specialized to only deal with the types of loads that Google sees. As a general solution, its worth is debatable.

          But that is not what the original question was about. The original question was about sites like Google or Facebook using anything like a distributed file system to keep from losing data.

      • by Lennie (16154)

        Facebook uses MySQL/memcached; Cassandra is only used for systems running the statistical analysis.

    • by CoderJoe (97563) *

      Oh, and I forgot about Amazon Dynamo.

      • by Per Wigren (5315)
        ..and the pretty amazing open source distributed multi-master no-single-point-of-failure database Riak [basho.com].
    • by jdhutchins (559010) on Wednesday May 05, 2010 @09:19PM (#32106888)

      While Google may be able to go ahead and re-index websites if it loses that data, "regenerating" Gmail and Google Docs stuff isn't quite so easy, and even small amounts of data loss would kill those applications (especially among paid users).

    • by morgan_greywolf (835522) on Wednesday May 05, 2010 @09:25PM (#32106916) Homepage Journal

      Nothing. They can always go back and regenerate that data. It's just a matter of time.

      You just contradicted yourself. You're right; it's just a matter of time. Only thing is, this is the Internet. How long to recreate that data? Weeks? Months? Years? 6 months is an eternity on the Net.

      If all the accounts and stories were lost on Slashdot due to a massive database failure, how many people would come back, creating a new account and so forth? How long would it take before there was enough content and accounts to make it interesting again? Now realize that Slashdot is a drop in the bucket compared to Google.

      • If that were to happen, I'd finally be able to get a low UID!
        • by tjones (1282)

          Why? Is there something special about those?

          • Re: (Score:3, Funny)

            by ae1294 (1547521)

            Why? Is there something special about those?

            You must be new here!

          • Nope (Score:3, Informative)

            by avm (660)

            Nothing special at all. It only means Taco used sequential instead of randomised integers for user ids, which in turn can be viewed as a very loose chronology of user registrations.

            In other words, no.

    • by ProfMobius (1313701) on Wednesday May 05, 2010 @09:44PM (#32107028)
      First, Facebook & Google data cannot be regenerated, as they are personal things, like emails, messages, posts, etc.

      Second, you have other sectors producing large amounts of data besides your favourite networking website. One example is the LHC. It is going to produce terabytes of data per DAY (15 petabytes per year). Another is space telescopes. Those data can't just be 'regenerated'. One day's worth of data is incredibly expensive to produce.

      Distributed file systems are already there, and people use them. Maybe not on your level of computer usage.

      When you don't know what you are talking about, I think it is better to just keep quiet.

      • by tehcyder (746570)

        When you don't know what you are talking about, I think it is better to just keep quiet.

        That would reduce the number of posts on slashdot by about 99%.

    • by gilboad (986599)

      Why do you assume that:
      A: PB storage is very rare and only used by several large organizations.
      B: PB storage is used to house generated data that can easily be replaced.

      - Gilboa

    • by drsmithy (35869)

      Nothing. They can always go back and regenerate that data. It's just a matter of time.

      No, they can't. This is a really, really important distinction to make. They cannot "regenerate" the data. They *might* (perhaps even "probably") be able to "recopy" the data, *assuming the original source is still available*.

  • by Meshach (578918) on Wednesday May 05, 2010 @08:30PM (#32106554)
    The headline in the Ceph wiki [newdream.net]: Ceph is under heavy development, and is not yet suitable for any uses other than benchmarking and review.
    • Re: (Score:1, Redundant)

      by EdIII (1114411)

      Thanks. I was about to download it to service my rather large storage requirements for porn, but it seems too risky now.

    • Yep and they are using btrfs for the underlying filesystem which is also not at the production use stage.

      For me this is quite a coincidence: I just spent all of yesterday reading up on fault-tolerant distributed file systems, and Ceph seemed quite promising until I realised they are also waiting on kernel 2.6.34, as it has their patches merged.

      For anyone who knows more about this stuff, I was quite interested in xtreemfs as it seems to allow you to add nodes anywhere on the internet and it will deal
      • by atamido (1020905)

        Yep and they are using btrfs for the underlying filesystem which is also not at the production use stage.

        Would you clarify what the difference between Ceph and BTRFS is? From the description I thought that is what BTRFS and ZFS were supposed to be.

  • by AdmiralXyz (1378985) on Wednesday May 05, 2010 @08:31PM (#32106568)
    "It took a lot of work, but this latest Linux patch enables support for multi-petabyte file organization and storage!"
    "Do you have support for smooth, full-screen Flash video yet?"
    "No, but who uses that?"
    • "Do you have support for smooth, full-screen Flash video yet?"

      Frankly, that's Adobe's fault, not ours.

      • by Hurricane78 (562437) <deleted&slashdot,org> on Wednesday May 05, 2010 @10:18PM (#32107254)

        Yes it is ours. If “ours” means: Us idiots who made Flash dominant in the first place, by using it in any way.
        It always takes two. The ass doing it, and the idiot letting him do it. That guy with the narrow mustache from the 40s would agree to that: “What luck for rulers that men do not think.” ^^

      • Re: (Score:2, Insightful)

        Frankly, that's Adobe's fault, not ours.

        It could be our fault if you wanted it to be:
        http://www.gnu.org/software/gnash/ [gnu.org]
        http://swfdec.freedesktop.org/wiki/ [freedesktop.org]

      • by tehcyder (746570)
        Nice way to miss the point.
    • Re: (Score:2, Redundant)

      by FauxPasIII (75900)

      At least link to the comic you're totally not ripping from. ;)

      http://xkcd.com/619/ [xkcd.com]

      • Re: (Score:3, Interesting)

        by iknowcss (937215)
        Actually, I'm glad that he didn't link to it. I swear, every other story on Slashdot has some comment with a link to XKCD. Hey, we get the jokes. All of us read XKCD. You don't link to a video of Yakov Smirnoff every time you make a Soviet Russia joke, do you?
    • by glwtta (532858)
      This may come as a shock, but Linux has more useful applications than "dicking around on youtube".
      • by jedidiah (1196)

      Even so, this whole argument is mindless nonsense. Adobe only recently offered partial acceleration support, even for Windows.

        The idea that any variant of Flash is any better than any other (or worse) is just Lemming nonsense.

      • Yes, but users of OSs that don't can't understand why anyone would use an OS that doesn't.

    • by evilviper (135110) on Thursday May 06, 2010 @01:26AM (#32108298) Journal

      "Do you have support for smooth, full-screen Flash video yet?"

      A) Yes, I do. MPlayer will play any Flash videos, with a bare minimum of resources, and fully supports multiple video output methods, like xv and gl.

      The PROBLEM is that Flash videos aren't directly available anywhere... You have to parse through a SWF video player object to even determine where to FIND the URL of the actual FLV or MP4 file. And add to that extremely aggressive plugin detection scripts on many sites, which will refuse to even embed the SWF if you happen to have an unknown VERSION of the flash player. Unfortunately, I've mentioned this before, and got several interested replies, but nobody has thus far written a browser plug-in that will masquerade as Flash 10, and understand just enough SWF to find the URLs, and either present them to the users, or automatically pass them to MPlayer. A sad, sad failing, to be sure, since

      B) I (and many, many others) care VASTLY more about Linux's support for massive storage arrays than we do for its support of Flash, and other user-level fluff. My servers never need to visit YouTube... But booting from a hard drive more than 2 terabytes??? Don't expect Windows to let you do that, without very specialized hardware (EFI firmware). Linux, however, can do it out of the box with many common distros.
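
The 2-terabyte figure presumably comes from the classic MBR partition table, which stores sector addresses in 32-bit fields; GPT, the partitioning scheme that goes with EFI firmware, uses 64-bit fields and lifts that ceiling. A quick back-of-the-envelope check, assuming traditional 512-byte logical sectors (the constant names are just for illustration):

```python
SECTOR_BYTES = 512   # traditional logical sector size (assumption)
LBA_BITS = 32        # width of the sector-address fields in an MBR partition entry

# Largest span addressable with 32-bit sector numbers.
mbr_limit = (2 ** LBA_BITS) * SECTOR_BYTES

print(mbr_limit)            # 2199023255552 bytes
print(mbr_limit / 2 ** 40)  # 2.0 (TiB)
```

With 4096-byte logical sectors the same 32-bit fields would reach 16 TiB, which is why the limit is usually quoted for 512-byte sectors.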

  • I'm not really sure how much a petabyte is. Could someone please translate to Natalie Portmans? or Station wagons full of congresses? or Rods to the Hogshead?

    • I think I'll stick with ZFS. It's a million times better, give or take.

    • by fatalwall (873645)

      Don't quote me on it, as I'm too tired to look it up, but I believe a petabyte is 1000 terabytes... and last I checked that's like billions of rods of hogsheads' worth of Natalie Portmans being used as station wagons full of congresses.

    • Re: (Score:3, Informative)

      by SlothDead (1251206)

      Tera -> Tetra -> 4 -> 1000^4
      Peta -> Penta (like Pentagram) -> 5 -> 1000^5
      Exa -> Hexa (like Hexagon) -> 6 -> 1000^6
      Zetta -> Setta (like 7 in many languages) -> 7 -> 1000^7
      Yotta -> Otta -> 8 -> 1000^8

      Or use 1024 if you don't like IEEE/IEC norms...
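
The mnemonic above maps directly onto arithmetic. A small sketch, where the helper name `bytes_for` is made up for illustration: decimal (SI) prefixes are powers of 1000, and the binary variants (formally kibi, mebi, ... pebi under IEC naming) are the same powers of 1024.

```python
# Prefixes in order, so that list position + 1 is the exponent.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def bytes_for(prefix: str, binary: bool = False) -> int:
    """Return the number of bytes in one <prefix>byte."""
    power = PREFIXES.index(prefix) + 1   # tera -> 4, peta -> 5, ...
    base = 1024 if binary else 1000
    return base ** power


petabyte = bytes_for("peta")               # 1000^5
pebibyte = bytes_for("peta", binary=True)  # 1024^5

print(petabyte)                       # 1000000000000000
print(pebibyte)                       # 1125899906842624
print(petabyte // bytes_for("tera"))  # 1000
```

At the peta scale the two conventions differ by about 12.6%, so it is worth stating which one a capacity figure uses.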

  • I see far too many layers upon layers there, which to me always smells like the inner-platform anti-pattern [wikipedia.org] that an “enterprise consultant” would produce.
    But maybe I’m just misunderstanding things and that number of layers is needed for large installations. Is anyone here actually administering such large storage systems and has read the article? It would be interesting to hear from someone with daily experience in this.

    Also, I could not find any mention of any ZFS-like scrubbing going on. Which

    • Re: (Score:2, Informative)

      by Anonymous Coward

      Did I miss it, or did they really forget that crucial part?

      You missed it. There is a scrubbing mechanism in ceph.

      • by Lennie (16154)

        Also, it uses BTRFS as the local filesystem, which does quite a few checks as well.

  • Linux® (Score:3, Insightful)

    by The Yuckinator (898499) on Wednesday May 05, 2010 @10:41PM (#32107378)

    The first word in the article summary is "Linux®"

    Does that look weird to anyone else? I realize it's technically correct for the registered trademark symbol to be there, but somehow it just doesn't seem right.

  • I am not really familiar with Ceph, and after going through the pain of learning more about glusterfs (http://www.gluster.org/) only to find that gluster was not quite ready for primetime (this was about 6 months ago; it may have changed), I am a bit skeptical. Does anyone know the main differences between Ceph and glusterfs (besides that glusterfs can run in userspace)?
    • by perlchild (582235)

      Ceph reminds me more of Coda than glusterfs. Anyone remember coda?

      • by Troy Baer (1395)
        I remember that the guys who originally wrote Coda basically abandoned it and moved on to doing Lustre...
