Ext4 Advances As Interim Step To Btrfs
Heise.de's Kernel Log has a look at the ext4 filesystem as Linus Torvalds has integrated a large collection of patches for it into the kernel main branch. "This signals that with the next kernel version 2.6.28, the successor to ext3 will finally leave behind its 'hot' development phase." The article notes that ext4 developer Theodore Ts'o (tytso) is in favor of ultimately moving Linux to a modern, "next-generation" file system. His preferred choice is btrfs, and Heise notes an email Ts'o sent to the Linux Kernel Mailing List a week back positioning ext4 as a bridge to btrfs.
BTRFS? REALLY? (Score:5, Interesting)
Couldn't they come up with a better name than "BuTteR FaSe?" I know I can't be the only one who read it like that. Call it anything but that.
Re:BTRFS? REALLY? (Score:4, Funny)
"Butter Fase" was probably intended as "Butter Face."
Sounds like "But Her Face" as in: She has a great body, but her face...
Re:BTRFS? REALLY? (Score:5, Insightful)
Why not? It's a good analogy for FOSS after all. Great software, robust and all, but her face...
Re:BTRFS? REALLY? (Score:5, Funny)
Good, strong file-bearing hips!
Re:BTRFS? REALLY? (Score:5, Funny)
You're right. BTRFS is really silly. I recommend that the shortened form be ButtFS.
Re: (Score:2, Redundant)
I think it reads more like "Bit Rot" filesystem, perfect for 20-year-old EPROM chips.
Re:BTRFS? REALLY? (Score:5, Insightful)
"Couldn't they come up with a better name than "BuTteR FaSe?" I know I can't be the only one who read it like that. Call it anything but that."
I read it as:
BeTteR FileSystem
I guess we'll have to part ways :P
Re:Back when there was only fat16, ntfs, ext2 used (Score:5, Informative)
I hope you're joking.
ext2 is nice and simple, but it's neither fast nor reliable. It uses a linear search to find directory entries, which means it's very slow on large directories, like Maildir mailboxes. It doesn't do tail packing, which means it wastes space and is slower with small files. And it's not reliable because without a journal it needs a fsck after a bad shutdown, which takes ages on a modern disk and recovers less cleanly than a journal would.
Just search for benchmarks, something like reiserfs beats ext2 by huge margins when it comes to important workloads such as a mail server.
There are very good reasons why distributions generally go with ext3, or one of the other filesystems. I haven't seen ext2 as the default option for the root FS in a very long time.
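To put toy code behind the linear-search point above (a sketch of the idea only; names invented, not ext2's actual directory code):

    /* Each lookup scans every entry, so populating a Maildir with n
     * messages costs O(n^2) name comparisons in total. Hashed/indexed
     * directories (ext3's htree, reiserfs) avoid exactly this. */
    #include <string.h>

    struct toy_dirent {
        const char *name;
        unsigned long inode;
    };

    unsigned long linear_lookup(const struct toy_dirent *dir, size_t n,
                                const char *name)
    {
        for (size_t i = 0; i < n; i++)   /* O(n) on every single lookup */
            if (strcmp(dir[i].name, name) == 0)
                return dir[i].inode;
        return 0; /* not found */
    }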
Re:Back when there was only fat16, ntfs, ext2 used (Score:4, Funny)
Just search for benchmarks, something like reiserfs beats ext2 by huge margins when it comes to important workloads such as a mail server.
Hell, it probably beats it to death.
Re:Back when there was only fat16, ntfs, ext2 used (Score:5, Interesting)
> Just search for benchmarks, something like reiserfs beats ext2 by huge margins
You mean like these ones [netnation.com] where ext2 beats reiserfs in most cases and is at least as fast in the others?
> I hope you're joking. ext2 is nice and simple, but it's neither fast not reliable.
> It uses a linear search to find directory entries, which means it's very slow on
> large directories, like Maildir mailboxes.
Believe it or not, the world does not revolve around huge mail servers. Some of us actually run Linux on a desktop, and so don't really care about how well an fs handles a million maildir mailboxes. Latency is the most important criterion, and reiserfs is just too complicated to deliver it, as well as being a largely fringe fs. Especially now with Hans gone, it will only become more fringe.
> It doesn't do tail packing which means it wastes space and is slower with small files.
Yup, I'd like to have efficient small file handling. But really, it is better to avoid having many small files in the first place. Use compressed archives to store such things; it's quite a bit more efficient, and does not require exotic file systems which most normal people (i.e. your customers) will not use.
> It's not reliable because without a journal it needs a fsck after a bad shutdown
I used to do that, and then I got a UPS instead and switched back to pure ext2. The performance hit from journalling is simply too high to tolerate. A decent UPS (pretty much anything made by APC) will prevent the crashes in the first place, solving the problem completely and without any unnecessary overhead. With UPS prices being as low as they are, there is no excuse for not having one, so I think that journalling will become obsolete in the near future.
Re:Back when there was only fat16, ntfs, ext2 used (Score:5, Insightful)
so I think that journalling will become obsolete in the near future.
I bet in 1992 you were still thinking color TVs wouldn't last either...
Look, a UPS is a great thing. I run one myself. Heck, with more and more people switching to laptops, a lot of people are running a "UPS" without even realizing it. The simple fact, though, is that modern processors and disks are so fast that the minimal speed impact of journaling is barely noticeable. Journaling is certainly not worth giving up for some marginal speed gains.
I mean, we're talking about a world where people will give up tons of speed in their computer just to make the WINDOWS WOBBLE when you move them, or to make teddy bears wave at them from the system tray. Do you honestly believe that they're going to risk having their files corrupted in an unexpected power outage for a fraction of a percent increase in meaningful speed?
Re:Back when there was only fat16, ntfs, ext2 used (Score:5, Insightful)
Look at the bottom of the page. That's from 2003, for kernel 2.6.0. A lot of code has changed since then.
I'm not sure what exactly you mean by this. Latency is mostly influenced by the hard disk. And on a desktop the disk shouldn't be a bottleneck anyway.
Except there are lots and lots of those files in a modern Linux system: config files, icon files, and small libraries, for instance. Additionally, many files are searched for in different paths, making fast directory search important.
Just as a RAID is not a backup, a UPS isn't a disk journal. One of these days you'll get a long outage, or the power cable will turn out to fit badly into the power supply, or you'll have a kernel panic, or the UPS won't switch to battery fast enough, etc. And then, after several minutes of fsck, something important might end up broken.
If the journal causes you a noticeable slowdown you probably aren't a typical user. In typical usage the disk should be mostly idle after boot.
I don't see a point in going forward insanely fast without brakes. I'll take the safety. I have a UPS on every computer, and still have a journalled FS, because there were times when the UPS was of no help. Like yesterday, when I upgraded my laptop's RAM, booted it, and found that with more than 2GB of RAM, the BIOS maps the video RAM above 4GB. The video card showed its displeasure with that state of affairs by corrupting the display and locking up. I had no choice but to power-cycle the box.
Re:Back when there was only fat16, ntfs, ext2 used (Score:5, Insightful)
Yeah, because systems never kernel panic, or crash for any other reason than power outages... Wake me up after you've been waiting for fsck to finish on your 1TB drive and it's been running for the last 72 hours.
Whether or not you've had a system shutdown uncleanly in the past, you certainly will at some time in the future, so why not just use ext3 and save yourself the headache of a 3 day long fsck?
It's also painfully obvious that you've never worked as a sysadmin before. You try explaining to your manager that the reason why your company's server will take 3 days to come back online is that you wanted to save a few microseconds of latency when users were accessing files...
All hardware can fail, including UPSes. (Score:5, Insightful)
I used to do that, and then I got a UPS instead and switched back to pure ext2. The performance hit from journalling is simply too high to tolerate. A decent UPS (pretty much anything made by APC) will prevent the crashes in the first place, solving the problem completely and without any unnecessary overhead. With UPS prices being as low as they are, there is no excuse for not having one, so I think that journalling will become obsolete in the near future.
Our industrial UPS (which is orders of magnitude more reliable than any APC product ever made) recently exploded, burnt, and shorted out the entire building's power. It spiked thousands of volts through the protected equipment and destroyed a half-dozen servers. The fire was fierce enough to cause our FM-200 system (a halon equivalent) to dump, which put out the fire before the main battery bank was breached.
This was the first time I've ever seen a UPS bigger than a Chrysler fail, but I've seen dozens of failures from those crappy little APC units. At one time I had a stack of burnt-out ones in my basement (I used to salvage the batteries for cash).
If your disaster survivability plan depends on any single piece of hardware never failing, it's no good. Offsite backup is your friend.
Re: (Score:3, Insightful)
A decent UPS (pretty much anything made by APC) will prevent the crashes in the first place, solving the problem completely and without any unnecessary overhead. With UPS prices being as low as they are, there is no excuse for not having one, so I think that journalling will become obsolete in the near future.
While a UPS is certainly a must, it does not protect you from hardware faults completely. Ever have a cap burn out on your motherboard, or lightning strike through your network?
Or the most irritating one of all, get a static shock through the keyboard that resets the system?
Re: (Score:3, Interesting)
``Believe it or not, the world does not revolve around huge mail servers. Some of us actually run Linux on a desktop, and so don't really care about how well an fs handles a million maildir mailboxes.''
What if I have large Maildir mailboxes on my desktop system? Or anything else that puts many files in a single directory? Just because _you_ don't need that case to be fast doesn't mean it isn't a good idea to have it be fast, anyway.
``Latency is the most important criteria, and reiserfs is just too complicat
Re:BTRFS? REALLY? (Score:5, Funny)
I read it as BeaterFS and wondered if it was too soon for ReiserFS jokes.
buttfsck!! (Score:5, Funny)
ButterFS (Score:3, Funny)
I can't believe it's not better.
BTRFS? (Score:5, Funny)
So it incorporates compression by vowel omission? Brllnt!
Re: (Score:2)
They also omitted the double L. So you should have said "brlnt!"
Why not ZFS? (Score:5, Interesting)
Unless ZFS has patent issues, why not just work on having ZFS as Linux's standard FS, after ext3?
ZFS offers a lot of capabilities, from no need to worry about an LVM layer, to snapshotting, to excellent error detection, even encryption and compression hooks.
Re:Why not ZFS? (Score:5, Insightful)
Also important: he might be more focused due to not being in prison for first-degree murder.
Re:Why not ZFS? (Score:5, Funny)
Yep, BeaTeR FS is a kinder, gentler alternative to Reiser FS.
Reiser has time and no need to work (Score:4, Funny)
They feed him. They put a roof over his head.
They even bathe him.
He might as well devote himself to filesystems.
Re:Why not ZFS? (Score:5, Interesting)
Huh. One of the interesting things about Reiser4 from an end-user perspective was Hans Reiser's plans for file metadata. From what I can find about btrfs, it currently doesn't even support normal extended attributes. There was also talk about making it easy for developers to extend the filesystem with plugins that could add e.g. compression schemes.
I can't really recognize anything from Hans Reiser's ramblings in the btrfs documentation that isn't a standard file system improvement already seen in e.g. ZFS. Does anyone have any specific examples of the ZFS-leapfrogging features referred to?
Re:Why not ZFS? (Score:5, Funny)
Huh. One of the interesting things about Reiser4 from an end-user perspective was Hans Reiser's plans for file metadata.
No, the most interesting feature of ReiserFS is this one [wikipedia.org] (look to the far right).
--
ReiserFS: It puts the "stab" in "/etc/fstab".
Re: (Score:3, Informative)
Which is why it got edited out. Note the "oldid" bit in the URL.
Re:Why not ZFS? (Score:5, Informative)
One of the differences I can find between btrfs and ZFS is that ZFS explicitly avoided [opensolaris.org] a fsck utility, while btrfs is explicitly designed with features that make fsck even more powerful than it is on usual filesystems like ext3. In btrfs, data structures have "back references", and fsck can be used while the filesystem is mounted.
IMO, this is a btrfs advantage. ZFS has checksums and will find errors, but it will only be able to self-heal them in a redundant configuration. On a single disk, ZFS will find the error thanks to checksums but will not be able to recover your data. Since ZFS was mainly designed for systems that will use redundant configurations, that may make sense there, but desktops are never going to do such things. IMO the ZFS people were a bit elitist here - "let's build a filesystem so good that we won't need a fsck". But in the real world you _are_ going to need a fsck util. Only in exceptional and very rare cases, but you're going to need it.
Of course that doesn't make ZFS a bad filesystem, but it's an advantage for btrfs and Linux.
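For the curious, the back-reference idea sketches roughly like this - invented names and a heavily simplified layout, not btrfs's real on-disk structures:

    /* A forward pointer says "inode X, offset Y lives in extent E";
     * the extent record additionally remembers who points at it. A
     * checker can walk the metadata in both directions and flag any
     * pointer whose partner is missing - even on a mounted fs. */
    struct backref {
        unsigned long long owner_inode; /* which file uses this extent */
        unsigned long long file_offset; /* where in that file */
    };

    struct extent_record {
        unsigned long long disk_block;
        unsigned long long length;
        struct backref ref;             /* "who points at me" */
    };

    /* One direction of the consistency check. */
    int backref_ok(const struct extent_record *e,
                   unsigned long long inode, unsigned long long offset)
    {
        return e->ref.owner_inode == inode && e->ref.file_offset == offset;
    }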
Re: (Score:3, Interesting)
I find this checksumming and self-healing interesting, but the real question is what do you actually do to really solve it? With ZFS
Re:Why not ZFS? (Score:4, Insightful)
If a filesystem detects errors it is helping me (at least) there. No matter what creates them.
I do not think SSDs will solve storage problems: there will be flaky adapters and other IF chips/firmware, etc.
Re: (Score:3, Informative)
How do those "back references" recover your data in case of a corrupted sector? Honest question, I do not know btrfs.
AFAIK ZFS has no fsck because there is no failure case where it would really help.
Back references could help you reconstruct the file system tree during fsck, but if random data is getting corrupted, you're not going to get it back without redundancy (or forward error correction, I suppose, but that amounts to the same thing).
I can't think of many scenarios where the only kind of data corruption I'm worried about is corruption to file system metadata (which is incidentally all journaling is supposed to protect you from), but who knows.
Re:Why not ZFS? (Score:5, Informative)
The ZFS developers specifically wanted the open-sourced code to be under a GPL-incompatible license, hence it has been released under the CDDL (there was an interview with the Sun open source rep; can someone provide info/links about this?). So ZFS cannot be part of the kernel, but there is a FUSE port of ZFS, and according to http://en.wikipedia.org/wiki/ZFS#Linux Sun is investigating a Linux port, so there may be something good coming.
Re:Why not ZFS? (Score:5, Informative)
Rather, the GPL is incompatible with anything else that can't be re-licensed as GPL, and that includes GPL v2 and v3, which can't even be mixed among themselves. Maybe we should first clear up that mess, right?
ZFS is present in both Mac OS X and FreeBSD, thank you! They have no license issues whatsoever.
Re:Why not ZFS? (Score:5, Informative)
> Rather, GPL is incompatible with anything else that can't be re-licensed as GPL, and
> that includes GPL v2 and v3, which can't even be mixed among themselves.
Saying that GPLv2 and GPLv3 "can't even be mixed among themselves" is wrong and misleading.
Section 14 of GPLv2 specifically deals with the problem of later versions of the licence and sets out the options. A copyright holder can choose to allow work to be used with later versions, such as GPLv3, or can choose not to. There are also more complex options. The licence itself doesn't force the choice one way or the other.
Matt
Re: (Score:3, Informative)
That piece of text isn't part of the license itself, it's part of a separate standard notice that states that the software is copyrighted and gives permission to redistribute or modify it under the terms of the GPL. It could just as easily have said "either version 2 of the License, or (at your option) any license you want in exchange for buying Linus a beer" and still be under version 2 of the G
Re:Why not ZFS? (Score:5, Informative)
Rather, the GPL is incompatible with anything else that can't be re-licensed as GPL, and that includes GPL v2 and v3, which can't even be mixed among themselves. Maybe we should first clear up that mess, right?
With a copyleft license, you intend to secure certain rights to the end user over the work as a whole. That is the very essence of what the GPL tries to do compared to non-copyleft open source licenses or the LGPL (which only covers the parts consisting of LGPL code), not any sort of "flaw" or "mess". Licenses work so that you must simultaneously fulfill all of them, so the GPL denies combining GPL code with code that denies end users the four freedoms the FSF professes. That is the intention by design, but then there is some collateral damage as well-intended licenses are rendered GPL-incompatible over details, since the GPL (or any copyleft license) couldn't allow open-ended arbitrary restrictions without losing all meaning. The GPLv2 was particularly flawed in this area since it was made fairly long ago with this not much in mind, and in the GPLv3 they did a lot of work to improve compatibility, leading to section 7, which among other things says:
Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or authors of the material; or
e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors.
That vastly improves compatibility with the licenses the GPL wants to be compatible with, so collateral damage is reduced to a minimum. It's still very easy to write a license, even a free software license, that isn't GPL-compatible, though. If you look at the reason the CDDL and the GPL are incompatible, it's that the CDDL's copyleft conditions and the GPL's copyleft conditions clash, because they both try to do the same thing. It's almost impossible to write two copyleft licenses where one (or both) doesn't see the other as adding "additional restrictions" on the end user. Even the GPL can't escape that when it tries to improve itself, unless you have the "or later" clause. Then again, there's no reason such a license should have to be revised often - it took 16 years before releasing version three, and it'll probably be longer until the next time it's needed.
Re: (Score:3, Insightful)
A FUSE ZFS guarantees it will never be the "default" filesystem anyway. BTRFS has a good shot at being your / in a couple of years.
Re:Why not ZFS? (Score:4, Interesting)
I'm not sure that I see why, unless you're simply bored with the older filesystems. Something as critical as this should not be driven by what is trendy at any given moment. If one has no need for particular advanced bells or whistles, there is no need to use them.
For instance, since for historical and security reasons I keep
I still use ReiserFS3 on most of my other partitions, since I don't have any intention of changing the filesystem until I change the drives. ReiserFS is still a good choice for my purposes anyway.
Re: (Score:3, Insightful)
The GPL is restrictive as it is /because/ it ensures freedom for users. It is the /developers/ that the GPL bugs.
Bring on the GPL, I say! Boo to Sun for being anti-users.
/Mike
Re: (Score:3, Informative)
I don't know about that either. There are consistent reports that ZFS is slower than Ext3 on many common workloads. Also reports of instability.
While I do respect some of the engineering achievements in ZFS, I do not consider it to be the last word in filesystem design, or even the best filesystem for many applications. I also have doubts about the wisdom of some of the design decisions, such as inhaling the LVM into the filesystem, using 128 byte block pointers, and making a distinction between filesyst
Re:Why not ZFS? (Score:5, Insightful)
...and that's its biggest problem. ZFS duplicates a lot of functionality that belongs outside of a filesystem. All of the above can already be done using any Linux filesystem, so why keep around a second copy of all that code, implementing those features for just a single filesystem?
ReiserFS was (is) in a similar situation, where it also duplicates a lot of functionality that doesn't belong in the filesystem. Not only does this make it harder to maintain, but it makes a lot of features filesystem specific that shouldn't be.
Re:Why not ZFS? (Score:5, Informative)
It wouldn't be possible to duplicate RAID-Z with LVM. Other features of ZFS are very handy, but RAID-Z is by far my favorite. Same storage density as RAID 5 but without the horrible write performance. RAID-Z uses copy-on-write to avoid RAID 5's required read for every non-cached write.
Being able to create filesystems just as easily as creating directories is quite handy as well, though. IIRC, the filesystem sizes in ZFS are controlled by a quota-style system. That is much simpler than shrinking an LV (if your filesystem supports shrinking), then adding a new LV, and then creating a filesystem. I don't know about you, but I am always a bit nervous when I have to resize an LV.
You're both right. (Score:5, Interesting)
ZFS duplicates a lot of functionality that belongs outside of a filesystem.
Very true.
It wouldn't be possible to duplicate RAID-Z with LVM.
Also true.
And the features which could be duplicated couldn't be done nearly as well without a little more knowledge of the filesystem.
The real problem here is that we're finding out that generic block devices aren't enough to do everything we want to do outside the filesystem itself. Or, if they are, it's incredibly clumsy. Trivial example: If I want a copy-on-write snapshot, I have to set aside (ahead of time) some fixed amount of space that it can expand into. If I guess high, I waste space. If I guess low, I have to either expand it (somehow, if that's even possible) or lose my snapshot.
A filesystem which natively implemented COW could also trivially implement snapshots which take up exactly as much space as the differences between the increments. But because of the way the Linux VFS is structured, this kind of functionality would have to live inside each individual filesystem, duplicated across all of them. Best case, it'd be like ext3's JBD, as a kind of shared library.
A humble proposal: We need another layer, between the block layer and the filesystem layer -- call it an extent layer -- which is simply concerned with allocating some amount of space, and (perhaps) assigning it a unique ID. Filesystems could sit above this layer and implement whatever crazy optimizations or semantics they want -- linear vs btree vs whatever for directories, POSIX vs SQL, whatever.
The extent layer itself would only be concerned with allocating extents of some requested size, and actually storing the data. But this would be enough information to effectively handle mirroring, striping, snapshotting, copy-on-write, etc.
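A rough sketch of what such an interface might look like - entirely hypothetical, no such layer exists in the kernel, and every name here is made up:

    /* The extent layer hands out uniquely-identified chunks of space
     * and moves bytes; mirroring, striping, COW and snapshots would be
     * implemented once, down here, for every filesystem built on top. */
    typedef unsigned long long extent_id_t;

    struct extent_layer_ops {
        /* Allocate 'bytes' of space; return a unique handle. */
        int (*alloc)(unsigned long long bytes, extent_id_t *out);
        int (*read)(extent_id_t id, unsigned long long off,
                    void *buf, unsigned long long len);
        int (*write)(extent_id_t id, unsigned long long off,
                     const void *buf, unsigned long long len);
        /* COW clone: the copy shares blocks until one side is written. */
        int (*clone)(extent_id_t src, extent_id_t *out);
        int (*free)(extent_id_t id);
    };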
It wouldn't be universal -- I've said nothing about the on-disk format, and, indeed, some filesystems exist on Linux solely for that purpose -- vfat, ntfs, udf, etc. Those filesystems could be done pretty much exactly the way they're done now. After all, the existence of a block layer in no way implies that every filesystem must be tied to a block device (see proc, sys, fuse, etc.)
But I think it would work very well for filesystems which did choose to implement it. I think it would provide the best of ZFS and LVM.
I haven't actually been seriously following filesystem development for years, so maybe this is already done. Or maybe it's a bad idea. If not, hopefully some kernel developers are reading this.
Re: (Score:3, Interesting)
That still only covers one deficiency of LVM snapshots. LVM snapshots are read-only and intended to be temporary. I'm also pretty sure you can't snapshot a snapshot, which wouldn't be at all helpful with a read-only snapshot anyway.
Re: (Score:3, Insightful)
RAID-Z uses copy-on-write to avoid RAID 5's required read for every non-cached write.
Of course, the very same copy-on-write will also result in massive file fragmentation, carefully smearing your dbf files over the entire platters and making your SAN caches useless, over time resulting in horrible read performance.
ZFS is certainly a huge improvement for anyone used to ufs and disksuite, but I have to say that using it in the real world it's not all it's cracked up to be. A more layered approach would have made
Re:Why not ZFS? (Score:4, Interesting)
If you want good database performance you probably want as little file system overhead as possible between your database and the disk. I wouldn't have expected ZFS to be the most efficient place to store a database.
I would have to imagine your SAN is just doing uninformed readaheads. That would be a very good way to fill up a cache with useless data if you are reading from a fragmented file system. :)
This issue with copy-on-write will be entirely sidestepped in a few years by flash storage's lightning-fast seek times and smarter caching. IIRC, isn't the reason that zfs-fuse uses so damn much RAM that ZFS has its own caching logic built in? If the file system knows where all the blocks in a file are, it can do readaheads on its own.
I don't have enough of my own real world experience with ZFS to argue with your experience. In fact, what I know of how ZFS works makes me believe that it can cause exactly the problems of which you speak.
However, I think there are still a ton of workloads that wouldn't be impacted by these problems. I also believe that a large percentage of those workloads could benefit greatly from some of the features ZFS brings to the table.
RAID-Z is nice when you need write performance but can't afford the drives for RAID 10. I can think of plenty of times when it would have been nice to have a writable snapshot to chroot into.
Hell, I would even love to have ZFS on my laptop for snapshotting and cloning. It also seems like ZFS send/recv would make for much more efficient backups of my laptop than rsync buys me.
I think we are getting some pretty neat new features out of our file systems by blurring the lines between the layers. I wouldn't be surprised if we stumble upon a few more neat ideas before we're through.
There is still quite a bit of improvement to make even before we have to make the file system aware of what is inside our files. :)
Re:Why not ZFS? (Score:4, Interesting)
The closest thing to RAID-Z in the Linux kernel is the RAID 5 DM. If you want to write a 4k block to some random location that isn't currently fully cached, the DM has to read 1 stripe from each disk in the array, make the 4k change, recompute the parity, and then flush that stripe back to each disk. The default stripe size is 64k. That means if you have 4 drives you would be performing a 256k read and a 256k write just to change a single 4k block. Of course, that is the worst case. Best case is that you overwrite the entire stripe with a fresh 256k block of data.
ZFS and RAID-Z get around that problem by just writing the changed blocks to an unused part of the disk. Once the write is complete it just moves the pointer to the new block location. This is copy-on-write, and this is where the performance boost comes in over RAID 5. With RAID-Z you should never be required to read the whole stripe to do a write.
RAID-Z also allows for dynamic stripe sizing. That helps get more optimal efficiency on small files and large files.
The dynamic stripes aren't terribly important, but if you could figure out a way to do the copy-on-write without the filesystem having very fine-grained control of and knowledge about the underlying array, we would all love to hear about it :).
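To make the read-modify-write math above concrete, here's the textbook RAID 5 parity update in toy form (not the actual md/dm code). Even this cheap "subtractive" variant still needs to read the old data and old parity before it can write anything - exactly the round trip that RAID-Z's copy-on-write full-stripe writes avoid:

    /* P' = P ^ D_old ^ D_new: parity is the XOR of the data chunks, so
     * changing one block requires knowing its old contents. Two reads
     * (old data, old parity) precede the two writes. */
    #include <stddef.h>

    void raid5_parity_update(unsigned char *parity,
                             const unsigned char *old_data,
                             const unsigned char *new_data,
                             size_t len)
    {
        for (size_t i = 0; i < len; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }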
Re:Why not ZFS? (Score:5, Insightful)
The things you think belong outside of a filesystem only 'belong' there because that's what years of narrow-minded development have taught you. Look at it this way: /everything/ related to file storage is managed by ZFS. What could be more convenient than that? Because of this, ZFS can do things much faster and much more reliably than any combo of LVM with a filesystem. Why chain together tools yourself, and manually think about things you really shouldn't be thinking about, when you can have a good filesystem take care of it for you?
ZFS is easier to maintain, from a user's perspective (and that's the job of development: to make usage easier, not ever the other way round).
Re:Why not ZFS? (Score:4, Insightful)
Why chain together tools yourself, and manually think about things you really shouldn't be thinking about, when you can have a good filesystem take care of it for you.
Because that's the Unix way - build small components (applications) and chain them together to create something out of the parts. I mean, why have ls and grep when you can have lsgrepsortfind? Really, the point is to have small, easily maintained apps that each do one thing well, rather than one app that does everything - possibly well, but more usually poorly, since it's difficult to maintain and ensure it works properly. Not to mention the bloat when it replicates functionality already provided.
This may not be the best model for a critical component like a filesystem, but on the other hand, reliability of a filesystem is paramount, so keeping it as small as possible is probably a good idea.
Re: (Score:3, Informative)
I've been able to do this forever with LVM snapshots under Linux.
Re:Why not ZFS? (Score:5, Informative)
I don't know about the patents, but the current major obstacle is the license. ZFS, as part of the OpenSolaris kernel, is available under the CDDL. The CDDL is incompatible with the GPL, ruling out ZFS inclusion directly in the Linux kernel. Sun has hinted that they could dual license the Solaris kernel under CDDL and GPL, but that hasn't happened yet. Small parts of the ZFS filesystem code have been GPLed so they could be added to grub to support booting ZFS root filesystems.
There is a userspace port of the ZFS code and utilities which avoids the license problem by using FUSE to separate the filesystem code into a separate process: ZFS-FUSE [blogspot.com].
If Sun were to ever dual-license ZFS, the ZFS-FUSE codebase would be a good place to start for porting the code to direct kernel inclusion. (Note: Sun, via their subsidiary Cluster File Systems, now employs the author of ZFS-FUSE to use his port as an optional backend for the Lustre file system.)
Re:Why not ZFS? (Score:4, Informative)
Sun has some patents on ZFS; the CDDL grants a license to these patents if you're deriving from the original ZFS source, but then you can't link it to linux.
FWIW, I doubt ZFS-FUSE would be a good place to start - FUSE is totally different from Linux's actual vfs layer, after all.
Re:Why not ZFS? (Score:4, Interesting)
Come back when ZFS has decent filesystem maintenance tools.
And don't give me that 'ZFS doesn't need a fsck' crap. SGI tried to pull that with XFS, and it didn't work. Filesystem (at least metadata) corruption will happen, and once it does, ZFS doesn't have the tools to fix it.
Mart
Re: (Score:3, Informative)
I'm confused: if we ask people why not run ZFS using FUSE, they reply because it's slow (I'm assuming it's possible to load ZFS at boot time using an initrd). And if we ask people which is better monolithic or microkernel, they reply microkernel. But ZFS using FUSE would be like a microkernel driver, so which is it?
Re: (Score:3, Informative)
No, it wouldn't. A microkernel loads modules into the kernel space. You're talking about running in user space. So when an application makes a system call, the kernel has to relay it through the FUSE layer into user space. So there's an extra layer consuming time. On top of that, kernel space isn't generally swapped out, but user space can be. Obviously it should never happen, but wouldn't it suck if your disk driver was swapped out?
See the diagram at the bottom of this page: http://fuse.sourceforge.net/ [sourceforge.net]
Also
Re:Why not ZFS? (Score:5, Interesting)
A microkernel loads modules into the kernel space.
No, that's the opposite of a microkernel. A microkernel loads its modules (then often called "servers") into user space. If the kernel and its drivers etc. run in the same address space (as is the case with, e.g., Linux), then we're talking about a monolithic kernel, even if it can dynamically load modules.
Re: (Score:3, Interesting)
In these systems, ALL the programs are run in one address space. Does it make the whole OS (not just the kernel) monolithic or what?...
Re: (Score:3, Interesting)
Yes, IIRC Windows NT uses rings 0 and 3. However, the problem would not be made better by having more rings; the performance cost is in the transition between rings, with nothing special about the rings themselves. E.g., progressing from ring 10 to ring 9 would be as expensive as going from ring 0 to 1, or from ring 0 to ring 100.
Re:Ring 1 and 2? (Score:4, Interesting)
Not exactly. To effectively change the actual permissions that the protection rings allow, stacks, segment registers, I/O permission bitmaps, and page tables (among other things) have to be changed. Generally this means reading values from memory into caches, which is slow. Probably the slowest of them all is the TLB: invalidating all of the cached page translations is godawful slow, and is necessary if each separate user space has a truly private address space and not simply a chunk out of the entire virtual address space. Even for operating systems that partition the virtual address space into regions for each user process, the local descriptor (or equivalent) table for segment access needs to be reloaded. This has to happen for every cross-privilege-level call. It is *much* faster to simply call another kernel-mode function (push some stuff on the stack, change the instruction pointer, and you're done) without messing with caches.
In fact, it would be even faster not to separate the kernel and user-space processes at all, and instead use formal verification or a virtual machine (which really just means a smaller instruction set that's easier to verify) to prove that no user process could ever mess with the kernel or other processes. Virtual machines for languages are essentially at this stage today; they implement what would constitute a kernel as the run-time portions of the virtual machine, running the virtualized software in the same address space. There have been some attacks based on virtual machine weaknesses or memory corruption that break the protection model by changing data structures so that they violate the security model. This can happen in OSes that use hardware protection as well; there are just fewer places in memory where random changes can cause problems (just the page tables and other security paraphernalia), making it less likely.
Re:Why not ZFS? (Score:4, Informative)
I'm definitely in the layered-design-is-good, ZFS-is-an-abomination camp. But I do have to point out that mlockall would keep a userspace filesystem server from being swapped out, and with realtime priority, the process could even have some guaranteed CPU time. Userspace isn't that bad.
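A minimal sketch of that hardening, assuming a Linux daemon running with the needed privileges (CAP_IPC_LOCK and CAP_SYS_NICE); the function name is made up:

    #include <sched.h>
    #include <sys/mman.h>

    /* Pin every current and future page in RAM and ask for realtime
     * scheduling, so a userspace fs server is neither swapped out nor
     * starved of CPU. Returns 0 on success, -1 on failure. */
    int harden_fs_daemon(void)
    {
        struct sched_param sp = { .sched_priority = 50 };

        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            return -1;
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
            return -1;
        return 0;
    }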
What I'd like (Score:5, Interesting)
I would like transparent, administrator-controlled versioning. Modified a Word document and saved it in place? root can go back and get the old version (and, alternatively, the user can; root could disable this functionality).
The pieces are in place and it's doable; someone just needs to program it.
Re:What I'd like (Score:5, Interesting)
So, you want a Versioning file system [wikipedia.org]? Just make sure you never let that run on /var.
OSS is like capitalism: If you see a need, then make it and distribute it.
Re:What I'd like (Score:5, Interesting)
That leads to space-bloat.
What I'd like are files with expiration dates. When I make up some twiddly chart or download some funny video, I keep it because I'll probably want it tomorrow or next week, but then I tend to forget to delete it later. It would be really cool if creating a user data file prompted you with a simple dialog specifying how long you want it. Common options like 1 Week, 1 Month, 6 Months, 2 Years, Forever would do most of the time, and an option to choose a custom date would cover the rest. When a file expired, it would be placed in some kind of pseudo-Trash Bin that could be reviewed and emptied when you want more space.
I'd also love something tag-based instead of hierarchy-based. For example, I store photos by Year > Month > Event, but sometimes I want to make another category for photos of a specific person. This means I either make duplicates or have to dig around to find things. If I could tag them with dates (which should actually be auto-generated from the EXIF data), event, place, and people, I could then just browse for files with a particular tag.
Come to think of it, these ideas are both somewhat akin to how a human brain stores stuff.
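The expiry half could be prototyped today with extended attributes and a nightly sweep, no new filesystem required. A sketch, assuming Linux and a filesystem mounted with user_xattr (the attribute name is invented):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/xattr.h>
    #include <time.h>

    /* Stamp a file: set_expiry("funny-video.avi", time(NULL) + 7*24*3600); */
    int set_expiry(const char *path, time_t when)
    {
        char buf[32];
        snprintf(buf, sizeof buf, "%lld", (long long)when);
        return setxattr(path, "user.expires", buf, strlen(buf), 0);
    }

    /* A cron job would walk the tree and move expired files into a
     * pseudo-Trash directory instead of deleting them outright. */
    int is_expired(const char *path)
    {
        char buf[32] = {0};
        if (getxattr(path, "user.expires", buf, sizeof buf - 1) <= 0)
            return 0;   /* no stamp: keep forever */
        return atoll(buf) < (long long)time(NULL);
    }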
Re: (Score:2, Informative)
wayback [sourceforge.net], copyfs [n0x.org], and ext3cow [ext3cow.com] are all fairly stable versioning filesystems for Linux. I'm not sure if they let you stop non-root users from getting old versions, but I don't see why you'd want people to have to ask an admin to get old versions of their files anyway.
what's a "next generation" file system? (Score:3, Interesting)
Something like ZFS immediately comes to mind... but is there some generally accepted definition of what makes a file system "next generation"? TFA doesn't say, and I hate to diminish anyone's efforts here, but the new features in ext4 (according to Wikipedia) aren't much to write home about: higher-precision timestamps, larger volumes, larger directories, faster fscking. These may be worthy accomplishments, but they are incremental improvements, not anything new. Or did I miss something?
Re: (Score:2)
So, something like HAMMER [kerneltrap.org], then?
A HAMMER filesystem can be mounted with an as-of date to access a snapshot of the system. Snapshots do not have to be explicitly taken but are instead based on the retention policy you specify for any given HAMMER filesystem. It is also possible to access individual files or directories (and their contents) using an as-of extension on the file name.
Released and stable in DragonFlyBSD 2.0, and obviously BSD licensed.
Re: (Score:2)
Did you just undo your own modding?
I can't believe... (Score:5, Funny)
Butter FS? Are you kidding me?
Here is your first official list of jokes. Please contribute.
1. You're still running ext4? I can't believe it's not ButterFS!
2. But will it run on toast?
3. Will fsck be renamed to butterknife?
4. If your system overheats will your filesystem melt?
5. If you use ButterFS too much, will it turn into FAT?
6. If you leave ButterFS on your volume too long, will your hard drive start to reek?
7. Will the next version of ButterFS be called GoatButterFS, just like the next version of Leopard is Snow Leopard?
8. "Tough" notebooks will never have their hard drives formatted with ButterFS, because if you dropped them, they would always land hard drive down.
9. When you submit your dead ButterFS hard drive to a data recovery centre, will they have an intern lick it to get the data off instead of putting it under a read head?
These are getting kind of desperate -- your turn now.
Honestly, what is it with FOSS and crappy names? (looking at you, gimp)
Re: (Score:2, Funny)
Honestly, what is it with FOSS and crappy names? (looking at you, gimp)
All the good ones are trademarked. And it's The Gimp, to you, mister!
Re: (Score:3, Funny)
When your hard drive fails and you hear those awful noises, you can say it's churning butter.
Re:I can't believe... (Score:5, Funny)
These are getting kind of desperate -- your turn now.
Yeah, you're spreading yourself a bit thin.
Re: (Score:3, Funny)
You said yourself that this was getting desperate.
Butters' FS! (Score:3, Funny)
B-tree based Filesystem (Score:2)
I saw that and couldn't help but think, are they trying to make a filesystem based on the B-tree concept?
Re: (Score:3, Interesting)
Not to be confused with binary tree [wikipedia.org].
-metric
Btrfs = Bit torrent file system? (Score:2)
If you want a blazingly fast file system.... (Score:3, Informative)
Then look no further than NSS [wikipedia.org] (Novell Storage Services).
It is open source; you get the full source if you download SLES.
It has more of the desired features [wikipedia.org] than anything else on the block right now.
This should be the default file system for Linux. It has years of very heavy-duty R&D behind it; it is pretty much completely debugged and ready to rock.
Re:If you want a blazingly fast file system.... (Score:5, Interesting)
Max Volume Size: 8 TiB.
That's not enough. Given that 1TB storage devices are on the market now, that could become outdated quite quickly. You'd be foolish to adopt that sort of filesystem unless you were absolutely positive that you'd never upgrade (unlikely).
Honestly, ZFS seems like it's the holy grail of filesystems. There are a few small issues that might need to be worked out, though it seems as close to "ideal" as you'd ever be able to get.
Re:If you want a blazingly fast file system.... (Score:4, Interesting)
Well, it looks interesting feature-wise, but they seem to be explicitly targeting SuSE - which is a no-go for most people.
From a glance at the docs (hey, at least they have docs, that's a plus) it also seems like it's tied to specific versions of EVMS and other parts of the kernel, thus if you don't run a "blessed, certified" SuSE kernel with all the nasty patches then you're on your own.
Just google for "debian|gentoo|redhat|... novell nss filesystem". Apparently nobody has even tried to run NSS on another distro, or at least nobody wrote about it.
I, for one, would only touch this on a blackbox, vendor-supported appliance but never consider it for a production server of my own (none of which run SuSE).
If they worked towards integrating it into the mainline kernel, now that would be nice.
when ext4 is feature complete it will be the #3 fs (Score:4, Interesting)
I'd like to know why Ted Ts'o and others are working on ext4. Even when ext4 is feature-complete, it will be the #3 filesystem in Linux in terms of features and scalability, behind xfs and jfs. I'd like to know what grudge Ted Ts'o and others have against xfs and jfs, because they basically won't even acknowledge those filesystems.
btrfs does have some nice-looking features; it's basically a GPL rewrite of ZFS.
The weakness with Linux is in the LVM or EVMS layer. They both suck in that they are not enterprise-ready (i.e. multi-TB filesystems, 100+ MB/s sustained read/write): they cause unexplained IO hiccups, lockups, and kernel panics. LVM/EVMS certainly work fine for Joe Blow's HTPC, or a paltry 100GB database, but they fall down under serious load.
This is the problem with open source. Certain areas, like filesystem development, attract all the developers, while other areas like LVM/EVMS are seen as busting rocks, and nobody wants to work on them. The result is that we get a plethora of second-rate filesystems (i.e. ext4) and a buggy LVM/EVMS layer that nobody wants to work on.
Re:when ext4 is feature complete it will be the #3 (Score:5, Interesting)
The weakness with Linux is in the LVM or EVMS layer. They both suck in that they are not enterprise-ready (i.e. multi-TB filesystems, 100+ MB/s sustained read/write): they cause unexplained IO hiccups, lockups, and kernel panics. LVM/EVMS certainly work fine for Joe Blow's HTPC, or a paltry 100GB database, but they fall down under serious load.
LVM has been rock-solid for me with one ~7TB and two 2TB ext3 filesystems (24 500GB disks) over the course of a year and a half. No problems migrating extents all over the place when I needed to swap disks in and out. Almost identical to HP-UX in functionality, but without the sizing constraints.
But when I tried xfs for kicks, I found out that a 7TB filesystem means you need 7GB of RAM to fsck it - impossible on a 32-bit system. I also had a week where it all went in the shitter because I ran free space down to zero and started getting OS panics and data corruption.
I'm definitely considering jfs for the next generation; my main complaint with ext3 has been ridiculously slow deletes and fscks - problems which, from what I've read, don't exist with jfs.
What about Tux3 (Score:3, Interesting)
Re: (Score:2)
No, that's one of the listed features, actually!
* Extent based file storage (2^64 max file size)
* Space efficient packing of small files
* Space efficient indexed directories
* Dynamic inode allocation
* Won't kill your family
* Writable snapshots
* Subvolumes (separate internal filesystem roots)
No (Score:2)
Re: (Score:3, Funny)
So you're saying someone should run a defrag on these filesystem projects?