Ask Slashdot: Distributed Filesystems for Linux?
"So far the two (three?) solutions that had the most promise are: AFS or Arla, and Coda.
The reasons against: AFS is commercial and I don't want to pay $15,000 in licenses just for a convenience to me. Arla still appears to be extremely alpha quality, even for a Linux hacker used to seeing major parts of his kernel labeled "alpha" or "beta". I had Coda up and running for a couple of days before I ran into a fairly severe flaw in the fundamental design that showed it to be inappropriate for what I want it to do. (But Coda is still the coolest thing since individually-wrapped cheese slices, and if you don't need to worry about that little problem, it's cooler than sex.)
I've found lots of references to the "GFS" project, which is not at all what I want, and here and there mentions of other projects such as "DFS", "xFS" and a distributed filesystem for Beowulf clusters, but I could dig up no further details, URLs, or - most importantly - code.
I don't need RAID, redundancy, failover, or anything like that. I only need to take these extra machines on my home network and make all their extra disk space look like a single volume on the network. Support for Linux as a client is, obviously, essential, but I also have Windows, BeOS, *BSD and Solaris machines on my network, so clients for those would be appreciated but not necessary. Since this is just for me at home, (yes, I've got all that crap on my network at home - so I'm a little crazy) I'd rather stick with free software. Is there anything that can do this? "
If not, then it sounds like it would be an interesting project to work on. The ability to harness the spare disk space across a private network can only be a good thing.
Arla is _good_ (Score:1)
What about an LVM (Score:1)
Re:NFS on Linux (Score:1)
The NFS implementation in linux 2.2 has been completely reworked since 2.0, and is now much improved.
Re:check the web censor articles.. (Score:1)
Intermezzo link (Score:1)
Re:xFS, Frangipani (Score:1)
Intermezzo (Score:2)
nfs btween solaris and linux (Score:2)
Re:FreeBSD NFS is pretty bad (Score:1)
Yeah, it was mentioned on the linux-kernel list (Score:2)
PVFS (Score:3)
Personally, I like the "adventure" of Coda, but haven't tried setting it up in a few months. Now that my roommates have agreed to be guinea pigs for the Windows client, I figure I'll set it up behind my NAT box and play with it again. It's overkill for everything but a big installation, but I still think it's kind of fun. The thought that terrifies me is working with a multi-GB datafile or such over Coda -- but since my roommates will probably be more interested in playing Dopewars [ox.ac.uk] and moving around small files on a FE network, I'm going ahead with the grand master plan anyways. Besides, I have a laser printer and a burning desire to experience the frustrations of Samba...
Re:Arla is _good_ (Score:1)
The Arla client is quite nice, though.
Re:AFS Baby! (Score:1)
Yes, there have been changes in the VFS between 2.2.3 and 2.2.12, but they don't really break binary compatibility.
Re:The Charon Filesystem (Score:1)
Also, how are you handling namespaces? Are you going the AFS/Echo route and putting together a global namespace, or going the NFS route and punting on the whole thing?
Re:AFS Baby! (Score:2)
Transarc's AFS 3.5 includes both a Linux server and client. I'm using it right now, in fact.
Before the 3.5 release, the AFS clients and servers for Linux were a third-party effort by AFS licensees at MIT and CMU. Now, the third-party client is still in use for Linux 2.0 machines, and 2.2 machines can use the official Transarc client.
Also, Transarc is "in control" of AFS, and is owned by IBM. And, yes, for a while Transarc/IBM had no interest in pushing AFS -- they wanted their AFS customers to move over to DFS. Unfortunately for them, very few people wanted to use DFS. The AFS support issue is getting better now, though. IBM seems to be realizing that it really should give its customers what they want.
Re:For your setup (Score:1)
The ugly side of distributed filesystems... (Score:2)
In particular, Coda is really cool, and the RVM facility is particularly neat stuff.
Unfortunately, there's a downside to making use of all that "extra disk space": it may diminish the overall reliability of the whole system, and a not particularly unusual outcome is for reliability to degrade to the "reliability of the weakest system on the net."
The "ideal" situation would be to be able to "somehow publish" that extra space, and allocate it perhaps as follows:
I'm not proposing that the parameters here are necessarily "religious doctrine;" the point is that it is important to distribute some backups (analogous to RAID) such that if a drive goes bad, the rest of the system doesn't have to suffer.
Re:xFS, Frangipani (Score:1)
Daniel
Styx boatman? (Score:1)
george
Re:Your requirement doesn't sound too useful (Score:2)
I'm not the original poster, but I've got a similar problem and would be interested in a solution along the lines he proposes.
For a start, you can't seriously be advocating that spare blocks from a variety of machines be used to provide unique bits and pieces of storage for virtual files distributed across those machines, I hope. This would make the availability and reliability of those files extremely low, ie. as low as the weakest link in the system.
He can seriously be suggesting that. I am. I have a small LAN in my home. I'm not running a bank. Furthermore, none of my local machines has ever gone down. Either I'm lucky or I don't run NT.
Secondly, what happens when one of the contributing filestores requires more space, but can't use it because it's been allocated to one of those distributed files? You could no longer just delete something from the machine concerned without going through that hypothetical distributed filestore manager, because it would be the only party that would know whether the item in question is part of a distributed file and hence whether it can be deleted. (This assumes that it creates real files in the local filespace for allocating to distributed files, which it would have to do otherwise the space it allocates would evaporate if the distributing daemon died.) In other words, *all* of your storage becomes dependent on this new manager, slows to a crawl, and probably loses a lot of the reliability of your native filesystem to boot. No, no, no
First of all, the amount of free space on most of my machines is fairly constant. My webserver, for example, doesn't suddenly up and decide to create a few dozen extra megabytes of files for no reason. What happens when a machine runs out of room? The same thing that happens when a machine NOT using this system runs out of room. A system not using a DFS, a system unconnected to a network at all, can run out of room. So your "problem" is one that's not unique to (and in fact has nothing to do with) this question. As always, one must be aware of how much disk space is required for a system, and not provide less than that. For all but one of my systems, these numbers are fairly constant and very well determined. I'd be more than happy to just dedicate half a gig of disk space my mailserver has never used and never will, plus nearly another half gig my webserver never touches, to such a scheme. This will introduce no problems that everyone hasn't faced before when setting up partitions. Just make sure the space you dedicate for local use is sufficient, same thing you do every time you partition a drive. There's no problem here that we didn't already face without this scheme being proposed.
If the new distributed filesystem manager actually *does* make space on one machine as requested, it would clearly have to push out the data onto some other machine to compensate. If you think about it, the policy issues in this area are "interesting". (Aka "horrid".)
The original poster quite clearly said he was doing this in his home. I've found policy issues on my own home LAN remarkably easily to resolve and completely uncontroversial.
Finally, since the first point (unavailability caused by one machine going down) makes the idea completely untenable in most cases, you'd have to be talking about a system in which blocks are allocated in multiple places for each virtual file block. That's great, but notice that such a scheme is *not* storage-efficient, yet your requirement is based on not wanting to waste storage space!!!
You've obviously completely missed the original poster's point. You'd be an idiot to suggest allocating blocks in multiple places. That's a completely inappropriate suggestion, considering the original goals. Having each block in only one place, far from being untenable, is in fact exactly what is called for, and is every bit as reliable as is required under the circumstances.
No, I don't think you've thought this requirement through.
No, it's more like you didn't read the original question very carefully. But that's par for the course. A couple of people have suggested using network block devices and 'md'. (The opinions are that this would be slow, but in my book a slow drive is better than no drive at all.) Nearly everyone else has gone on and on about NFS issues that are admittedly interesting but have almost nothing at all to do with the original poster's problem. We're quite clearly NOT looking for something like Coda or Arla or anything like that. Right off the bat, if you think keeping the same data at multiple locations is a good idea, you're profoundly confused about what the problem is. If you're worried about what happens if a machine goes down, or the new virtual drive fills up, or one of the local drives fills up, you're worrying about things completely unrelated to the problem. Perhaps you should understand the requirements before you decide to criticize them.
--
Re:nfs btween solaris and linux (Score:1)
They make a huge difference to the stability of linux's nfs interoperability. The standard kernel nfs is horribly broken. I've used linux as an nfs server to linux, solaris x86 and MACH clients, and it works very reliably once you use knfsd. (but i've only tested with a couple of clients, not 60).
see ftp://ftp.varesearch.com/pub/support/hjl/knfsd/
Re:MD and net block device (Score:1)
NBD limitations (was: Re:A bit of a Kludge....) (Score:1)
Currently, (x86) NBD is limited to devices of size (2^31)-1 bytes in length. I've been meaning to fix this, but (whine) haven't had the time.
Also, nbd uses both a client and a server process. If either one dies, you're left with a filesystem that fails on all i/o-ops, no way to umount it, and no way to "reconnect".
Other than that, you should be fine. I've built raid-5 filesystems (for fun) over nbd with fair performance.
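For reference, the basic nbd plumbing being described here can be sketched as follows. Every hostname, port, and path below is invented, and the exact invocations vary between nbd userland versions, so treat this as a shape rather than a recipe:

```shell
# Sketch only: needs root, the nbd kernel module, and the nbd userland tools.
# On the machine donating space, create a backing file and serve it on a port:
dd if=/dev/zero of=/export/spare.img bs=1M count=500
nbd-server 2000 /export/spare.img
# On the consuming machine, bind that export to a local block device:
nbd-client donorbox 2000 /dev/nbd0
# /dev/nbd0 can then be used like any other block device, e.g. as one
# component of an md RAID array, as the parent post describes.
```

Note the failure mode described above: if either the server or the client process dies, I/O on the device fails until everything is torn down and rebuilt.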
My apologies (Score:2)
If you keep up correctly with every interesting project that goes on here, and all the personalities involved, I'm impressed.
For your setup (Score:5)
For home network purposes, where a few users are unlikely to overwhelm the server, use NFS. It's easy, it's well supported across OSs, its performance may not be incredible but nothing you're likely to do will strain it. Even if you're moving huge files around, you're not going to have 10 people moving huge files around simultaneously.
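A minimal NFS setup of the kind suggested above can be sketched like this; the hostnames, paths, and address range are invented, and the exact daemon names depend on your distribution:

```shell
# Sketch only: needs root and the knfsd userland (nfs-utils).
# On the server, declare the directory to share in /etc/exports:
#   /export/spare   192.168.1.0/255.255.255.0(rw)
# then tell the running NFS daemons about it:
exportfs -a
# On each client, mount the share wherever it should appear:
mount -t nfs server:/export/spare /mnt/spare
```

This gives you one mount per donating machine rather than a single pooled volume, which is exactly the limitation other posters raise below.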
Actually, there's one more fun option to consider: Inter-Mezzo, a distributed fs written in Perl in a few weeks by the creator of Coda, Peter Braam. It's small, and it's pretty quick (the speed-critical parts are in C).
NFS (Score:1)
Free, fairly easy to implement.
You will have to plan a boot-up sequence for all your machines if you want to automount these file-systems.
A better plan is to run a script by hand after all the machines are started up. Timing is everything!
Why can't you use NFS? (Score:1)
CP
Re:The Charon Filesystem (Score:1)
--
Re:NFS on Linux (Score:2)
between two 2.0.36 boxen. Now the server is 2.3.15 and the client is 2.2.10 and I have no more "magical dropouts". The performance is acceptable too. The server is a secondary/experimental box that has a chunk of spare disk space that can be used elsewhere.
Availability even worse than of weakest link (Score:1)
The probability of the file being accessible would equal not the availability of the weakest link, but the product of the availabilities of every link and machine holding a part of the virtual file, each of which is less than or equal to 1.0. It's the classic MTBF calculation, and the result is uniformly bad.
Which is of course why systems that do this kind of thing typically feature lots of redundancy and caching, which takes us back to the last point about storage efficiency.
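The arithmetic can be illustrated with a quick awk one-liner; the machine count and availability figures below are invented purely for illustration:

```shell
# A file striped across N machines with no redundancy is readable only
# when ALL N are up, so its availability is the product of each machine's.
# Numbers here are hypothetical.
awk 'BEGIN { printf "five 99%%-available machines: %.4f\n", 0.99^5 }'
awk 'BEGIN { printf "five 95%%-available machines: %.4f\n", 0.95^5 }'
```

Even with quite reliable machines, the combined figure falls below the worst single machine, which is the point being made above.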
It won't work without multiple redundancy (Score:1)
Files in Medley would have atrocious availability if it weren't for the extensive caching and redundancy, ie. if it weren't for a "prime directive" to trade efficiency for availability.
The goal of using spare storage for something is a good one, but not if the party in question is seeking efficient use of storage! Good idea, wrong requirement.
Might suit some, but not generally (Score:1)
Your solution doesn't create a single filestore, but several separate ones, ie. the path determines where a particular file resides. Nor does it distribute virtual storage across multiple machines, so it would not work at all for storing a few large files like database tablespaces. And of course the availability of any given file depends on which machine it resides on, which is in turn a function of which machines are currently up, so in your system file availability is under coarse manual control.
That may be acceptable, who knows --- horses for courses. I'm not saying that your system is not good for you, it clearly is, but it doesn't seem to meet the stated requirements of the poster nor is it a general solution. Some of the other proposals made here come much closer to being generally useful. However, that is at the cost of not using storage efficiently because they need the heavy caching and redundant distribution to provide the availability gain without which distributed storage is an untenable nightmare.
NFS on free BSDs more compatible with Solaris? (Score:1)
One way forward that might help you and which might provide some useful feedback is to try the NFS implementation on one of the free BSDs instead of Linux's. It's supposed to be a faster implementation than that of Linux anyway, although I haven't tried it myself so treat that with a pinch of salt.
It might well be more compatible with Solaris because of Sun's origins in BSD territory.
Complexity and reliability in Charon (Score:1)
It's going to come down to control and isolation of complexity, which I presume the designers have been well aware of and focussed on. How does Charon tackle the issue, by which I mean, what's the fault-decoupling strategy?
Hmmm, maybe that's a subject for the kernel dev lists rather than Slashdot.
Re:The Charon Filesystem (Score:1)
Off to the kernel dev lists for information before criticising!
Righto, seems useful (Score:1)
I can see how that would be useful in certain cases, namely when files are not huge and when their availability is not too critical so that single-point storage is acceptable. In particular, your examples of MP3 files would seem to provide a perfect match to the properties of this OS/2 solution.
And it *does* match the original poster's requirement too!
Your requirement doesn't sound too useful (Score:3)
For a start, you can't seriously be advocating that spare blocks from a variety of machines be used to provide unique bits and pieces of storage for virtual files distributed across those machines, I hope. This would make the availability and reliability of those files extremely low, ie. as low as the weakest link in the system.
Secondly, what happens when one of the contributing filestores requires more space, but can't use it because it's been allocated to one of those distributed files? You could no longer just delete something from the machine concerned without going through that hypothetical distributed filestore manager, because it would be the only party that would know whether the item in question is part of a distributed file and hence whether it can be deleted. (This assumes that it creates real files in the local filespace for allocating to distributed files, which it would have to do otherwise the space it allocates would evaporate if the distributing daemon died.) In other words, *all* of your storage becomes dependent on this new manager, slows to a crawl, and probably loses a lot of the reliability of your native filesystem to boot. No, no, no
If the new distributed filesystem manager actually *does* make space on one machine as requested, it would clearly have to push out the data onto some other machine to compensate. If you think about it, the policy issues in this area are "interesting". (Aka "horrid".)
Finally, since the first point (unavailability caused by one machine going down) makes the idea completely untenable in most cases, you'd have to be talking about a system in which blocks are allocated in multiple places for each virtual file block. That's great, but notice that such a scheme is *not* storage-efficient, yet your requirement is based on not wanting to waste storage space!!!
No, I don't think you've thought this requirement through.
Re:AFS Baby! (Score:1)
---
"'Is not a quine' is not a quine" is a quine.
Re:AFS Baby! (Score:2)
Not only that, but every machine that boots NT boots Red Hat 6.0 as well
and each of those have AFS as well
its all part of UMBC's
Universal Computing Environment
So we are not at all stuck with microsoft. Eventually, i believe umbc will migrate all the irix's (which have novell and OS-X and solaris in there) to linux, because SGI is abandoning Irix for linux...
My $0.02
Mike
(mshobe1@nospam.umbc.edu)
AMD (Score:1)
umm...no (Score:1)
Re:NIS+automount+NFS (Score:1)
ahh, but if you are using any boxen w/Alphalinux on it, you may run into compilation problems (I have). It compiles *FINE* on i386, but not on Alpha... Just a note..
... (Score:2)
Maybe in a few months I'll know enough C to do it myself. ;^)
--
Re:Here's a plan (Score:1)
Re:xFS, Frangipani (Score:1)
xFS, Frangipani (Score:3)
A bit of a Kludge.... (Score:2)
This isn't terribly efficient or portable, but it might work.
Re:Yeah, it was mentioned on the linux-kernel list (Score:2)
Links to filesystems (Score:2)
Re:Arla is _good_ (Score:1)
Re:The Charon Filesystem (Score:1)
Does this mean you will be using an authentication scheme like Kerberos?
Re:The Charon Filesystem (Score:1)
Haakon
NFS on Linux (Score:1)
Not a flame, just repeating what I saw on usenet and linux-kernel.
Re:NFS on free BSDs more compatible with Solaris? (Score:2)
Also note that you can still run most of your Linux applications on a NetBSD or FreeBSD box; I run Linux Communicator and RealAudio/Video player, for example. So you might lose no functionality at all by moving to a BSD in this particular scenario.
cjs
Re:The Charon Filesystem (Score:1)
Re:The Charon Filesystem (Score:1)
Somewhat unrelated question about filesystems (Score:2)
--
grappler
even better solution... (Score:1)
Would this work? (Score:1)
Then... on a single linux box, use the loopback device to turn the exported files into loop devices. Then use raid-0 to 'stripe' across all the mounted, imported loopback devices. Create your filesystem, and voila!
Don't have the resources here to try it, but it sounds like it might work.
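The recipe described above might look something like this in practice. Every host, path, and size here is invented, and mdadm stands in for whatever RAID tooling your kernel's md driver ships with, so this is a sketch rather than a tested procedure:

```shell
# Sketch only: needs root plus NFS, loop, and md support in the kernel.
# 1. Each remote box creates a big empty file and NFS-exports its directory.
dd if=/dev/zero of=/export/spare.img bs=1M count=500
# 2. The assembling box mounts the exports and loop-attaches each image file.
mount box1:/export /mnt/box1 && losetup /dev/loop0 /mnt/box1/spare.img
mount box2:/export /mnt/box2 && losetup /dev/loop1 /mnt/box2/spare.img
# 3. Stripe (raid-0) across the loop devices, make a filesystem, mount it.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/loop0 /dev/loop1
mke2fs /dev/md0
mount /dev/md0 /bigdisk
```

Note that losing any one NFS server takes out the whole striped volume, which is the availability trade-off discussed elsewhere in this thread.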
NBD + Raid? (Score:2)
Re:Your requirement doesn't sound too useful (Score:1)
See my post here [slashdot.org] for my admittedly iffy solution that does not have any of the problems you point out.
--sean
Re:My kludgy solution (Score:1)
--sean
Re:My kludgy solution (Score:2)
However, i've just grabbed myself a copy of intermezzo [inter-mezzo.org], and it looks like it might be able to do everything i wanted and more. I hope to somehow get my 240 disc cdrom changer [kubikjukebox.com] into the mix so it appears as a single drive instead of 2 drives and a serially controllable robot.
--sean
Re:Might suit some, but not generally (Score:2)
I'll give you that it does nothing for spanning large files, but it also does not incur the problems with spanning files. If a box goes down, you only lose access to the files on that box. If you say 'stuff this' and decide you don't need the virtual FS anymore, you just stop accessing stuff through z:. Your files are still there, in the native OS's FS. Things that ended up located on an OS2 box are accessible on that OS2 box, and the same goes for things on the linux boxes (or fbsd boxes, win32 boxes, be boxes.. whatever. anything that can export NFS can join the VFS)
As for file availability depending on the state of the machine it is on-- of course. If you want otherwise, you will have to sacrifice storage space, which is contrary to the point.
I don't propose that this is 'generally useful'. It has the negative of sending ALL files through the tvfs box before they make it to their destination. That's not very efficient. However, as far as i can see, it does provide a basic solution to the original poster's problem, assuming he would not mind sticking an os2 box in the corner.
On an aside, there is another possible tool for this kind of problem: avfs (A Virtual File System) [inf.bme.hu].
Avfs was originally designed so you could access
Recently it's gained the ability to load user-written extensions, mainly for other archive formats. I imagine it could be hacked to do the job of TVFS in my setup.
--sean
My kludgy solution (Score:4)
Here's the details:
1) all the boxes export their spare space as nfs mounts.
2) a nifty IFS (installable file system) from IBM's EWS (employee-written software) program called Toronto Virtual File System is installed on one of the os2 boxes (we'll call this box os2tvfs)
3) os2tvfs mounts all the exported drives
4) with tvfs, all the mounted NFS drives are mounted into a tvfs drive (z: in my setup)
5) os2tvfs exports z:
6) any box that wants to access the big-virtual-volume mounts os2tvfs:/z:/
So how's it work? Lets go through an example:
box1 exports d:\, a 10 gig ide drive on an os2 system
d:\ contains a bunch of stuff, for this example we'll focus on "d:\mp3s\foo.mp3"
box2 exports
box2 has a file on it located at "/s1/mp3s/bar.mp3"
box1 then mounts os2tvfs:/z:/ as v:\
on box1, a directory listing of v:\mp3s\ contains both foo.mp3 and bar.mp3. if i copy baz.mp3 to v:\mp3s, it ends up as box2:/s1/mp3s/baz.mp3, as long as there is enough free space on box2:/bfi1/ for it, because i assigned a higher write priority to that volume when i mounted it with TVFS (it's a scsi drive- might as well use it up first). It shows up as os2tvfs:/z:/mp3s/baz.mp3.
Of course, this solution is kinda bad because it creates a ton of extra network traffic, but it was the only one i could find that did what i wanted.
--sean
LDAP/CODA rather than Re:NIS+automount+NFS (Score:3)
Linux's NFS still has problems. If you need NFS use BSD (BTW, before someone mods me down for that comment, I use linux. NFS is just not a good idea in general).
NIS is a nice idea, poorly implemented, with a lot of problems with security.
You said that there was a problem w/ CODA, and that would be what I normally use rather than NFS. There are a lot of good suggestions posted here.
For distributing information ala NIS, try taking a look at LDAP instead. I have been implementing it at a few client sites, and it works much better than NIS. (There are plugins that let GLIBC and PAM use LDAP transparently, and you can even emulate NIS).
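As a rough sketch of what the glibc/PAM plugin route looks like (the hostname and directory layout below are examples, and assume the nss_ldap/pam_ldap modules are installed):

```shell
# Config fragment, not a script; all names here are hypothetical.
# /etc/nsswitch.conf -- consult local files first, then the directory:
#   passwd: files ldap
#   group:  files ldap
#   shadow: files ldap
# /etc/ldap.conf -- point the NSS/PAM modules at the directory server:
#   host ldap.example.home
#   base dc=example,dc=home
```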
I would definitely kill for something that could transparently create a single large namespace/disk space over a network, but with disk space so cheap, you are probably better off going and buying a 16gb IDE drive. cheep cheep....
network block device (Score:1)
I haven't tried any of this, so it may very well be crazy.
Re:CODA? (Score:1)
Moderators sometimes suck.
Re:The Charon Filesystem (Score:1)
I hope we won't have to wait as long as with Transmeta before seeing the first alpha release.
You could also find a cool website like the Transmeta one (which is one of the few websites I have viewed in totality).
Re:NFS on Linux (Score:1)
I do occasionally have problems getting the blasted thing to accept incoming requests, but by & large that's a simple config thing *somewhere*.
A related question.. (Score:1)
Something I'm not sure about, though: I know that with NFS, everything stays on the server and is dished-up for all to see.
However, the impression I got with Coda was that you have a server into which you put clients, all of whom are saying 'I've got this share available'. So the file stays on the client, and other clients' requests go via the server to the client.
Is this a right understanding of the model?
Re:NFS on Linux (Score:1)
If you want speed, the best thing I've encountered is ncp (check freshmeat) - it's basically tar over the network, you just stick a server in the destination directory, and push stuff over - gets quite a high bandwidth utilisation.
Not one filesystem (Score:1)
I think he is looking for something that combines the disk space into one filesystem. NFS might bring it into one tree, but it would be a bunch of directories. You would have to watch how much you used in each directory. One directory will fill up while another has tons of space.
Re:can you share a single /usr/local ???? (Score:1)
I have worked at many sites where this is done ( although I am now at a site that considers
Re:The Charon Filesystem (Score:1)
If I recall correctly, there is no problem with having patented code GPLed - so long as the license to use the patent is free. To quote the GPL's Preamble (from Version 2):
Also, if you GPL your code, it is still copyrighted - you wrote the code, and no-one can take that credit away from you (legally). What "copyleft" actually means is a clever use of copyright law to allow everyone to use the code for free, and keep it free, as opposed to the traditional restrictions usually applied.
I recommend reading the GPL, or at least the Preamble, anyway. You could also read the GNU Manifesto. They explain these issues much more clearly than I do.
Mango (was Re:Windows 2020) (Score:2)
It caches at the file block level instead of at the file level (like Coda), so it's not (as) affected by things like tailing a large file. It has some other features which could be good or bad (redundancy, file block migration, etc.) depending on your particular application.
The downside? It only runs on windows, and I suspect that they keep their cache coherence and access protocols under fairly heavy wrap. Now, if someone could come up with a good argument for how they could make money developing a similar product for Linux/*BSD... *drool*. Their basic technology is applicable most places, but their implementation right now is as a Windows disk driver device.
www.mango.com [mango.com]
Re:Windows 2020 (Score:1)
Re:Mango (was Re:Windows 2020) (Score:1)
Re:The Charon Filesystem (Score:1)
I don't usually respond to such petty comments, but I can't resist pointing out that I didn't mention what I'm working on now. I don't believe in peddling vapor.
Re:As with everything, "it depends" (Score:1)
Say no more, say no more.
Your answers to my earlier questions actually lead me to believe that yours is a relatively easy case. I can see three ways to go:
Re:The Charon Filesystem (Score:4)
By what measures and for what workloads? Such claims are meaningless without describing the environment, and are the realm of marketroids (particularly the MS kind) not scientists or engineers.
I find it most odd that you would tout the system's distributed nature and then compare it only against local FSes. How well does it perform in sharing situations, either locally or through slow WAN links? What level of coherency does it guarantee? How is failure recovery (a very tricky issue for a DFS) handled? How about disconnected operation?
To be perfectly blunt, the lack of even an attempt to address these sorts of crucial issues makes me wonder whether the part about Charon being distributed is "part of the plan" that hasn't actually been implemented (or even designed) yet. The DFS literature is littered with papers about systems that would supposedly blow everything else away, but that never actually got implemented. I've been there, I've done that, and the sad truth is that the realities of implementing a usable DFS - i.e. one that isn't pathologically ill-behaved in at least one of the areas alluded to above - generally shred naive ideals of superfast coolness.
>Only changed disk blocks and metadata are replicated, as opposed to entire files (and only on close)
If this is really what you meant to say, it's great performance but has dire implications for recoverability. This only strengthens my suspicion that you haven't really climbed into the mud pit in earnest yet.
As with everything, "it depends" (Score:4)
I'm also curious what you found lacking in GFS. I have my own different ideas about "how things should be done" but perhaps explaining why you consider it inappropriate will shed some light on your needs.
As far as practical advice goes, I think most of the relevant products and approaches have been mentioned; I don't promise to have secret knowledge of any "magic bullets". DFS technology is an area where I feel we're still looking for the right answers (sometimes even the right questions). That's why I enjoy working on DFSes, but it does mean that there's a large element of "choose your poison" in evaluating current offerings.
Re:My kludgy solution (Score:1)
Re:The Charon Filesystem (Score:1)
Here's a plan (Score:3)
This is something I've been thinking about for a while. I might give it a go once my current project (the user-mode kernel port [mv.com]) settles down.
This is my current thinking on cfs (cluster fs):
All members of the cluster share a filesystem, which potentially uses all the available storage on the cluster (although you might want to keep stuff like your home directory on a separate device that you don't share with the cluster).
Files are duplicated on multiple machines for speed and redundancy. Files will tend to be located on the machines that are accessing them, so most I/O is local.
cfs will just be the networking part. Local storage will be handled by a local fs (like ext2). cfs metadata will be stored in local files with funky names (which are made invisible by cfs anyway)
There are multiple levels of membership in a cluster. Primary members can read and write everything. Secondary members can only read. They can have read copies of files locally, but they can't hand those out to other machines. Machines wanting to read a file have to go to a primary member for a copy. This is for sysadmins who don't trust their users not to become root, modify files (like /etc/passwd) behind the back of cfs, and then hand the new /etc/passwd out to everybody else.
Machines can be members of multiple clusters. /etc might come from a cluster that everyone is a member of, /bin might come from a cluster of machines of the same architecture, /projects might come from a third cluster, etc.
Files can be marked "local" which means that they permanently live on that machine, override whatever file comes from the cluster, and aren't shared with the cluster. This would be useful for config files which are only relevant to your machine, or your email directory.
A machine's /dev would be mapped into the cluster filesystem as /dev/aa.bb.cc.dd/ rather than being marked local. This gives transparent access to every device in the cluster.
A machine which is writing a file is designated the file's owner. While writes are in progress, all reads have to go to that machine. Once the writes have stopped, the machine remains the owner, but it can start spreading the new data around the cluster. It can also designate secondary owners, who would come into play if the primary owner crashes. One of them would become the new owner. If it turns out that the old owner had changes which it didn't manage to propagate and the new owner made changes, then my current thinking is that this is brought to the attention of a human, who straightens things out. If this is not acceptable for a particular file for some reason, then that file can be marked in such a way that accesses to it hang or fail until the owner comes back.
hmm... (Score:1)
Re:The Charon Filesystem (Score:1)
Re:NFS on free BSDs more compatible with Solaris? (Score:1)
NIS+automount+NFS (Score:3)
*** Proven iconoclast, aspiring bohemian. ***
Re:The Charon Filesystem (Score:2)
> can claim a patent anyway (prior art). Also has
> linus accepted it ? Without that your are
> condemmed into staying in the domain of
> continous upgrade patches which means you either
> struggle to keep ahead of the kernel or you
> become obsolete. and what do you mean by
> copyrighted ? GPLed code can never be
> copyrighted only copylefted
First, you don't know that someone is stealing your prior art until after they already have a patent. At which point it is virtually too late, since getting patents overturned is extremely hard.
Second, they said they were planning this to be a cross-platform file system, which means that whether or not Linus officially supports it isn't going to kill the project any more than the head of Sun officially not supporting it will. Besides, Linus will probably accept it if it is good. From the sounds of the project, it is too early for him to have either accepted it or rejected it.
And finally, GPLed code can very much be copyrighted. Legally, copyleft is recognized only as a type of license applied to copyrighted material. Without copyrights, there is no GPL, only public domain. In fact, it is copyright that makes the GPL binding and effective.
Actually, by default any item you create is legally copyrighted. If I remember correctly, if you don't mark any copyright on it, then the best you can hope for in the case of theft is a cease and desist. If you mark it copyrighted, you can collect damages up to a certain limit (damages meaning that you can collect any earnings made directly from your work), and if you register that copyright, you can then collect punitive damages.
Re:The Charon Filesystem (Score:1)
We'll post full benchmark results when we're ready, along with the methodologies, hardware and source code used.
>This only strengthens my suspicion that you haven't really climbed into the mud pit in earnest yet
Patience. Just because you didn't succeed...
Re:For your setup (Score:2)
http://www.inter-mezzo.org [inter-mezzo.org]
A filesystem written in Perl, of all things...
Re:The Charon Filesystem (Score:2)
He's the guy who takes the dead across the river styx.
You *can* patent, copyright and GPL all at once, BTW. As an example, look at the SQUID license (for copyright+GPL):
I don't know of anything that's been both patented and GPLd, but CAST encryption is close. Its creators patented it, but made it available for all uses, anywhere, for free. That keeps someone else (not going to mention any specific companies or professions here) from attempting to patent it later, pre-empting any stupid legal battles. We're doing the same thing.
We've not approached Linus for his blessing. It's a little early for that. Don't look for it in the 2.4 kernel.
The Charon Filesystem (Score:5)
We're writing a new distributed filesystem called Charon. It will be patented ("patent pending"), copyrighted, and GPLed. It's a true 64-bit, journaled filesystem that supports exabyte-plus file and volume sizes, sophisticated access control lists, per-directory quotas, distributed zero-knowledge protocol authentication, encryption, replication, named streams and indices (see BeFS, ReiserFS -- although we don't use B-trees of any type). It's in alpha stage right now, and full of debug code, but is already faster than Ext2fs, and way ahead of XFS and NTFS. We will be porting it to Solaris and NT after development on Linux is complete.
Unlike replication in Coda, AFS, DFS, etc., every Charon server is a full read-write replicant. Only changed disk blocks and metadata are replicated, as opposed to entire files (and only on close) as in Coda. Charon clients are partial replicants -- they use the local file system as cache and rely on their home server(s) for token management and authentication. The system also supports hierarchical failover and replication.
Because of the way it is designed, it also supports a very nice feature for GUIs and web servers -- a very fast built-in file types database that provides a single repository for mime type, friendly name, icon(s), description, extension, and other information. Sort of like the Windows registry, but much less stupid and much higher-performance.
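To make the file-types idea concrete, here's a toy model of that kind of database: one lookup returns everything a GUI or web server wants to know about a file. Charon's actual on-disk format and field names aren't public, so the records below are entirely made up for illustration.

```python
# Hypothetical file-types records: extension -> everything a GUI needs.
FILE_TYPES = {
    ".html": {"mime": "text/html", "name": "HTML Document",
              "icon": "html.xpm", "description": "Hypertext markup"},
    ".mp3":  {"mime": "audio/mpeg", "name": "MPEG Audio",
              "icon": "sound.xpm", "description": "Compressed audio stream"},
}

GENERIC = {"mime": "application/octet-stream", "name": "Unknown",
           "icon": "generic.xpm", "description": ""}

def file_type(path):
    """Return the type record for a path, falling back to a generic one."""
    for ext, record in FILE_TYPES.items():
        if path.endswith(ext):
            return record
    return GENERIC
```

The win over per-application databases (mime.types, .mailcap, desktop icon themes) is that there's exactly one place to look, and the filesystem can keep it fast and consistent.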
Stay tuned! This isn't vaporware.
Re:hmm... (Score:2)
Also, with NFS being built into the kernel, it should be the fastest way going...
Re:hmm... (Score:2)
Distributed in this manner looks like it should make a large difference in speed for multiple clients.. if you have only a node or two that are clients though, it wouldn't help much and NFS would be simpler, but if you have 20+ clients this looks like it has the possibility of being really cool..
Thanks to whoever posted the link to xFS...
Mango's Medley could be ported (Score:3)
Problems:
(1) It has only been written for Windows. But not that hard to port.
(2) More serious: they initially did it on Windows because that's where they saw a larger potential customer base. But my friend, last we spoke, said that despite the practicality of the product (and winning best of show 2 or 3 years ago at Comdex) they still haven't had any substantial sales. So a port isn't likely to happen. The best would be if they opened up the source for Linux (they still have a patent on the Windows version, so it probably wouldn't be a problem), but I have no clue if they would ever consider that. Regardless, somebody needs to write such a system for Linux/BSD. Probably wouldn't even be that hard.
You want distributed over a number of platforms... (Score:2)
Re:A bit of a Kludge.... (Score:2)
While I don't want to use the "Gathered" disk space from all these machines for any sort of "mission critical" purpose, I would like it to at least be fairly stable under average loads.
-=-=-=-=-
Exactly!!! (Score:2)
Coda *is* a blast. My favorite part is the disconnected operation. My least favorite part is the local caching. If you want to copy a 2GB file to your local machine, that means you need 2GB of free space to hold the file *plus* 2GB of free space to hold the locally-cached copy.
I guess for an essentially academic project it's kinda cool, and for other situations -- like as the back-end for a cluster of web servers -- it would work really well, but for me, where I *am* reading/writing big files around a lot, that part of it really sucks.
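The disk-space complaint above is simple arithmetic, but worth spelling out since it's what makes Coda's model hurt for big files: the client needs room for the destination copy *plus* the locally-cached copy, so roughly twice the file size free before the copy even starts. A trivial sketch (the function name is mine, not Coda's):

```python
def space_needed_coda(file_bytes):
    """Rough free space a Coda client needs to copy a file locally:
    the file itself plus the locally-cached copy."""
    return 2 * file_bytes

GB = 1024 ** 3
# The 2GB file from the post needs ~4GB free on the client.
print(space_needed_coda(2 * GB) // GB, "GB")
```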
-=-=-=-=-