Distributed Storage Systems for Linux? 52
elambrecht asks: "We've got a _lot_ of data we'd like to archive and make sure it is accessible via the web 24/7. We've been using a NetApp for this, but that solution is just waaaay too expensive to scale. We want to move to using a cluster of Linux boxes that redundantly store and serve up the data. What are the best packages out there for this? GFS? MogileFS?"
Our crystal ball is fuzzy! (Score:5, Insightful)
If you can afford NetApp, why not stick with NetApp? A bunch of Linux boxes is not a storage solution. Indeed, what does Linux have to do with anything? We're talking storage here. What are you planning to do - put in 200 of them with internal SATA drives? Yeah, that'll be a lot cheaper to maintain...
I'm not shilling for NetApp, but if you really have "a lot" of data to put "on the web" "24/7" then you need some kind of real storage solution like a NetApp or one of their competitors.
Now go away and please take Cliff with you.
We use OpenAFS (Score:4, Insightful)
We've moved to using Linux-based OpenAFS servers. A high quality 3U box (qsol.com [qsol.com]) loaded with 16x 300GB ATA drives costs about $8.5K and provides us about 3.5TB (2 drives for parity, 2 drives for hot-swap). That works out to $2.5K/TB. If your risk tolerance is higher than mine, you can bring that up to $8K/5.5TB (about $1.5K/TB). We really want 99.999% availability, so just to be safe, we keep a 100% redundant read-only copy on a second machine (AFS supports this beautifully, including automatic fail-over).
OpenAFS has a couple of features that make it better than NFS (client-side cache, for instance), but it also has a few drawbacks, like no files >2GB.
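The poster's cost-per-TB figures check out; here is the arithmetic as a quick sanity check (all prices and capacities are the poster's, not independently verified):

```python
# Sanity-check the $/TB figures for the 16 x 300GB 3U box described above.
box_cost = 8500.0        # quoted price in dollars
usable_tb_safe = 3.5     # with 2 drives for parity and 2 hot-spares
usable_tb_risky = 5.5    # with less capacity reserved for redundancy

cost_per_tb_safe = box_cost / usable_tb_safe    # ~ $2.4K/TB
cost_per_tb_risky = box_cost / usable_tb_risky  # ~ $1.5K/TB
print(round(cost_per_tb_safe), round(cost_per_tb_risky))
```

Note that the fully redundant setup the poster describes (a 100% read-only replica on a second machine) doubles these per-TB numbers again.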
Re:Our crystal ball is fuzzy! (Score:2, Insightful)
Hey man, don't tell that to Google.
Re:Centera (Score:3, Insightful)
Ugh. It's just not true. Most applications built to work with Centera include functionality to migrate data in and out of the system, just as most applications built to work with tape can both put data on and get it back. The difference is that tape sucks and Centera doesn't.
It can't scale beyond a 42U rack enclosure.
Also not true. I have worked extensively with a 3-rack install with about 50 TB of data on it. I believe all versions of Centera since the very first are capable of scaling to 4 racks, and some are capable of going to 8 racks. Lots of customers have 2-rack installs. Raw storage on the currently shipping nodes is over 1 TB per node, and you can put 32 nodes in a rack. Do the math: a 4-rack Centera is quite big even after taking mirroring or CPP into account.
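Doing that math with the figures from the post (1 TB per node, 32 nodes per rack, and a naive 2x reduction for mirroring - the real overhead for mirroring or CPP will differ):

```python
# Back-of-the-envelope capacity for a 4-rack Centera, per the post's figures.
tb_per_node = 1      # "over 1 TB per node" on shipping hardware
nodes_per_rack = 32
racks = 4

raw_tb = tb_per_node * nodes_per_rack * racks  # 128 TB raw
usable_tb = raw_tb / 2                         # ~64 TB assuming simple mirroring
print(raw_tb, usable_tb)
```

Even with the protection overhead, that is well past the "can't scale beyond one 42U rack" claim.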
It's a bunch of little servers striped together to form a big NAS with a metadata controller in the middle.
No. No. No.
It IS a bunch of little servers but no they are not "striped together", and no they don't form a NAS. There is no "metadata controller" and there certainly isn't one in the middle. It is a storage cluster that has features specifically designed to store fixed content. Centera is not a simple Linux hack to make a bunch of boxes look like a storage cluster. It's a robust, flexible, well thought out piece of clustering software that is built on top of a Linux base.
Centera hardware is good stuff too. It has redundant externally facing servers (access nodes) so that if one fails, applications can keep working. Both back end switches are linked to every node so everything has redundant data paths. Data is stored in such a way that no data is unavailable if any single node fails or goes offline for any reason.
It's easy to dismiss Centera because it's so different from the standard storage systems whose basic interfaces really haven't changed in 3+ decades. It's not a block device. It's not a filesystem. It's not a mountable share. It's a storage cluster with functionality specifically designed to manage fixed content. It is accessed only through a client-side API that talks to the cluster over IP. It isn't easy to wrap your head around.
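To give a feel for the model: Centera is content-addressed storage, so an application hands the cluster a blob and gets back an address derived from the content, rather than choosing a path on a mounted filesystem. This is a hypothetical toy sketch of that idea in Python, not the actual EMC client SDK:

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: callers get back an address, not a path."""
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself, so identical
        # fixed content always maps to the same address (free dedup).
        addr = hashlib.sha256(data).hexdigest()
        self._blobs[addr] = data
        return addr

    def get(self, addr: str) -> bytes:
        return self._blobs[addr]

store = ContentStore()
addr = store.put(b"fixed content never changes")
print(store.get(addr) == b"fixed content never changes")  # True
```

The real system layers replication, access nodes, and retention policies on top, but this is why there is no "mount point" to speak of: the API and the address are the whole interface.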