Ext4 Advances As Interim Step To Btrfs

Heise.de's Kernel Log has a look at the ext4 filesystem as Linus Torvalds has integrated a large collection of patches for it into the kernel main branch. "This signals that with the next kernel version 2.6.28, the successor to ext3 will finally leave behind its 'hot' development phase." The article notes that ext4 developer Theodore Ts'o (tytso) is in favor of ultimately moving Linux to a modern, "next-generation" file system. His preferred choice is btrfs, and Heise notes an email Ts'o sent to the Linux Kernel Mailing List a week back positioning ext4 as a bridge to btrfs.
  • BTRFS? REALLY? (Score:5, Interesting)

    by erroneus ( 253617 ) on Monday October 20, 2008 @12:00AM (#25437223) Homepage

    Couldn't they come up with a better name than "BuTteR FaSe?" I know I can't be the only one who read it like that. Call it anything but that.

  • Why not ZFS? (Score:5, Interesting)

    by mlts ( 1038732 ) * on Monday October 20, 2008 @12:06AM (#25437263)

    Unless ZFS has patent issues, why not just work on having ZFS as Linux's standard FS, after ext3?

    ZFS offers a lot of capabilities, from no need to worry about an LVM layer, to snapshotting, to excellent error detection, even encryption and compression hooks.

  • What I'd like (Score:5, Interesting)

    by grasshoppa ( 657393 ) on Monday October 20, 2008 @12:09AM (#25437283) Homepage

    I would like transparent, administrator-controlled versioning. Modified a Word document and saved it in place? root can go back and get the old version (and, alternatively, the user can; root could disable this functionality).

    The pieces are in place and it's doable; someone just needs to program it.

  • Re:What I'd like (Score:5, Interesting)

    by corsec67 ( 627446 ) on Monday October 20, 2008 @12:20AM (#25437355) Homepage Journal

    So, you want a Versioning file system [wikipedia.org]? Just make sure you never let that run on /var.

    OSS is like capitalism: If you see a need, then make it and distribute it.

  • Re:What I'd like (Score:5, Interesting)

    by bendodge ( 998616 ) <bendodge AT bsgprogrammers DOT com> on Monday October 20, 2008 @12:33AM (#25437441) Homepage Journal

    That leads to space-bloat.

    What I'd like are files with expiration dates. When I make up some twiddly chart or download some funny video, I keep it because I'll probably want it tomorrow or next week, but then I tend to forget to delete it later. It would be really cool if creating a user data file prompted you with a simple dialog asking how long you want to keep it. Common options like 1 Week, 1 Month, 6 Months, 2 Years, and Forever would cover most cases, and an option to choose a custom date would cover the rest. When a file expired, it would be placed in some kind of pseudo-Trash Bin that could be reviewed and emptied when you want more space.
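    Something like this could even be hacked together today with xattrs and a sweep script. A rough, untested sketch; the "user.expires" attribute name and the trash location are inventions for illustration, not anything that exists:

    ```python
    import os, shutil, time

    TRASH = os.path.expanduser("~/.expired-trash")

    def set_expiry(path, days):
        """Tag a file with an expiration date `days` from now."""
        deadline = str(time.time() + days * 86400)
        os.setxattr(path, "user.expires", deadline.encode())

    def sweep(root):
        """Move every expired file under `root` into the pseudo-Trash Bin."""
        os.makedirs(TRASH, exist_ok=True)
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    deadline = float(os.getxattr(path, "user.expires").decode())
                except OSError:
                    continue          # file was never given an expiry date
                if time.time() > deadline:
                    shutil.move(path, os.path.join(TRASH, name))

    # example usage:
    # set_expiry("funny-video.avi", days=7)   # the "1 Week" dialog option
    # sweep(os.path.expanduser("~"))
    ```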

    I'd also love something tag-based instead of hierarchy-based. For example, I store photos by Year > Month > Event, but sometimes I want to make another category for photos of a specific person. This means I either make duplicates or have to dig around to find things. If I could tag them with dates (that should actually be auto-generated from the EXIF), event, place, and people I could then just browse for files with a particular tag.

    Come to think of it, these ideas are both somewhat akin to how a human brain stores stuff.

  • by seanadams.com ( 463190 ) * on Monday October 20, 2008 @12:46AM (#25437507) Homepage

    Something like ZFS immediately comes to mind... but is there some generally accepted definition of what makes a file system "next generation"? TFA doesn't say, and I hate to diminish anyone's efforts here, but the new features in ext4 (according to wikipedia) aren't much to write home about: higher precision time stamps, larger volumes, larger directories, faster fscking. These may be worthy accomplishments but they are incremental improvements, not anything new. Or did I miss something?

  • You're both right. (Score:5, Interesting)

    by SanityInAnarchy ( 655584 ) <ninja@slaphack.com> on Monday October 20, 2008 @01:30AM (#25437741) Journal

    ZFS duplicates a lot of functionality that belongs outside of a filesystem.

    Very true.

    It wouldn't be possible to duplicate RAID-Z with LVM.

    Also true.

    And the features which could be duplicated, couldn't be done nearly as well without a little more knowledge of the filesystem.

    The real problem here is that we're finding out that generic block devices aren't enough to do everything we want to do outside the filesystem itself. Or, if they are, it's incredibly clumsy. Trivial example: If I want a copy-on-write snapshot, I have to set aside (ahead of time) some fixed amount of space that it can expand into. If I guess high, I waste space. If I guess low, I have to either expand it (somehow, if that's even possible) or lose my snapshot.

    A filesystem which natively implemented COW could also trivially implement snapshots which take up exactly as much space as the differences between the increments. But because of the way the Linux VFS is structured, this kind of functionality would have to live inside each individual filesystem, duplicated across all of them. Best case, it'd be like ext3's JBD, as a kind of shared library.
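    To make that concrete, here's a toy model of why COW snapshots only cost the deltas. Deliberately naive (a real filesystem does this with on-disk trees, not in-memory dicts), but the pointer trick is the whole idea:

    ```python
    class CowStore:
        """Toy copy-on-write block store: a snapshot is a frozen copy of the
        block *pointer* table, so it consumes space only for blocks written
        after the snapshot was taken."""

        def __init__(self):
            self.blocks = {}    # physical storage: block_id -> data
            self.table = {}     # logical view: block_number -> block_id
            self.next_id = 0

        def write(self, n, data):
            # Never overwrite in place: allocate fresh, then repoint.
            self.blocks[self.next_id] = data
            self.table[n] = self.next_id
            self.next_id += 1

        def snapshot(self):
            # Copies pointers only; zero data is duplicated.
            return dict(self.table)

        def read(self, n, table=None):
            table = self.table if table is None else table
            return self.blocks[table[n]]

    s = CowStore()
    s.write(0, "v1")
    snap = s.snapshot()                  # costs one pointer table, no data
    s.write(0, "v2")                     # old block survives because snap points to it
    assert s.read(0) == "v2" and s.read(0, snap) == "v1"
    ```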

    A humble proposal: We need another layer, between the block layer and the filesystem layer -- call it an extent layer -- which is simply concerned with allocating some amount of space, and (perhaps) assigning it a unique ID. Filesystems could sit above this layer and implement whatever crazy optimizations or semantics they want -- linear vs btree vs whatever for directories, POSIX vs SQL, whatever.

    The extent layer itself would only be concerned with allocating extents of some requested size, and actually storing the data. But this would be enough information to effectively handle mirroring, striping, snapshotting, copy-on-write, etc.

    It wouldn't be universal -- I've said nothing about the on-disk format, and, indeed, some filesystems exist on Linux solely for that purpose -- vfat, ntfs, udf, etc. Those filesystems could be done pretty much exactly the way they're done now. After all, the existence of a block layer in no way implies that every filesystem must be tied to a block device (see proc, sys, fuse, etc.)

    But I think it would work very well for filesystems which did choose to implement it. I think it would provide the best of ZFS and LVM.

    I haven't actually been seriously following filesystem development for years, so maybe this is already done. Or maybe it's a bad idea. If not, hopefully some kernel developers are reading this.

  • Re:Why not ZFS? (Score:5, Interesting)

    by GrievousMistake ( 880829 ) on Monday October 20, 2008 @01:49AM (#25437829)

    Huh. One of the interesting things about Reiser4 from an end-user perspective was Hans Reiser's plans for file metadata. From what I can find about btrfs, it currently doesn't even support normal extended attributes. There was also talk about making it easy for developers to extend the filesystem with plugins that could add e.g. compression schemes.
    I can't really recognize anything from Hans Reiser's ramblings in the btrfs documentation that isn't a standard file system improvement already seen in e.g. ZFS. Does anyone have any specific examples of the ZFS-leapfrogging features referred to?

  • Re:What I'd like (Score:2, Interesting)

    by EvanED ( 569694 ) <{evaned} {at} {gmail.com}> on Monday October 20, 2008 @02:26AM (#25437989)

    How does the filesystem know when to create a new version? Should every byte ever written to the file be construed as a new version? If so, how does the admin figure out which precise version, out of the literally billions that would be created, is the right one?

    True, you may not be able to get it perfect, but you can get something far more useful than nothing.

    For instance, many programs that work on small files (the kind you'd most want to version) don't keep the file open, and instead open the file, write to it, and close it each time you save. (Some will move the file to a backup name (e.g. file~), create a new file, and write to that. This is in part so you at least have the previous version in the ~ file, and in part to compensate for non-ACID file systems, since that way there is always at least one complete copy of either the old or the new data at any given time.) So creating a version when a process calls fclose() is a reasonable thing to do.
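    In Python-ish pseudocode the dance looks like this (a sketch; real editors also handle errors, permissions, and metadata):

    ```python
    import os

    def save(path, data):
        """The editor save dance described above. At every instant at least
        one complete copy exists: the old contents in path~ or the new
        contents in path."""
        if os.path.exists(path):
            os.replace(path, path + "~")   # previous version survives as file~
        with open(path, "w") as f:         # create the file anew
            f.write(data)
            f.flush()
            os.fsync(f.fileno())           # force the new contents to disk
        # a versioning filesystem could snapshot here, on close()
    ```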

    Sure, it won't work for programs that keep the file open and update it by seeking around and writing, but it will work for the vast, vast majority of the cases that at least I personally would want.

    And how do you reasonably prune that wasted space?

    What you see as wasted space I see as space going to a pretty darn good use.

    As for pruning, you'd have to be fairly clever. But you could create policies that specify how long to keep old versions, how many versions to keep in a certain time period, etc. You could also pay attention to how often a file is opened, how often old versions of files are opened, etc. There's a paper on a file system called Elephant written for FreeBSD where they discuss some ideas on how to do this.

    There's also a hypothesis, which I at least would agree with, that things recently saved are much more likely to be useful. If you remember the "last lecture" guy Randy Pausch, he did another talk about time management in which I think he told a story about an experiment where the goal was to clean up the lab. People were too hesitant to throw things out because "I might need it later," so they set up a rotation of the trash bins. Things you threw out would stick around for a week, which meant you could still safely retrieve them; but if you didn't need something within a week, you almost certainly never would, so throwing it out was basically safe. It helped a lot with cleanliness, since people actually threw things away. (He said the biggest problems were when the janitors emptied trash bins at the wrong time.)

    Finally, you could restrict the versioning by the file size, so for instance it would only store past versions for files under a certain size, etc. If you set it to 200K or something that would cover almost all of the files that I would really like versioning on, and yet keep the extra space relatively low.
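    A rough sketch of such a policy; every threshold here is an arbitrary illustration, not anything taken from the Elephant paper:

    ```python
    import time

    MAX_VERSIONED_SIZE = 200 * 1024    # only keep versions of files under 200K
    KEEP_ALL_FOR = 7 * 86400           # keep every version for a week...
    KEEP_DAILY_FOR = 180 * 86400       # ...then thin to one per day, then drop

    def should_version(size_bytes):
        """The file-size cutoff described above."""
        return size_bytes <= MAX_VERSIONED_SIZE

    def prune(versions, now=None):
        """versions: list of (timestamp, blob_id) pairs, newest first.
        Returns the subset worth keeping under the policy."""
        now = now or time.time()
        keep, last_day = [], None
        for ts, blob in versions:
            age = now - ts
            if age <= KEEP_ALL_FOR:
                keep.append((ts, blob))        # recent: keep everything
            elif age <= KEEP_DAILY_FOR:
                day = int(ts // 86400)
                if day != last_day:            # older: first version per day
                    keep.append((ts, blob))
                    last_day = day
            # older than KEEP_DAILY_FOR: silently dropped
        return keep
    ```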

    No, what you really want is version control software.

    That may be what you want, but it's not what I want.

    At the very least, it ensures that each commit was deliberate, and represents a valid state.

    This is also a downside: it means you can't see anything but valid states.

    Personally, I would like it if things like text editors and word processors saved the entire edit history of documents, persistently. You could use a scroll bar to go through the history, saves would be marked with small tick marks, and deliberate commits would be marked with larger tick marks.

  • by ZeekWatson ( 188017 ) on Monday October 20, 2008 @02:29AM (#25437999)

    I'd like to know why Ted Ts'o and others are working on ext4. Even when ext4 is feature complete it will be the #3 filesystem in Linux in terms of features and scalability, behind XFS and JFS. I'd like to know what grudge Ted Ts'o and others have against XFS and JFS, because they basically won't even acknowledge those filesystems.

    btrfs does have some nice-looking features; it's basically a GPL rewrite of ZFS.

    The weakness with Linux is in the LVM or EVMS layer. They both suck in that they are not enterprise-ready (i.e. multi-TB filesystems, 100+ MB/s sustained read/write): they cause unexplained I/O hiccups, lockups, and kernel panics. LVM/EVMS certainly work fine for Joe Blow's HTPC or a paltry 100GB database, but they fall down under serious load.

    This is the problem with open source. Certain areas, like filesystem development, attract all the developers, while other areas like LVM/EVMS are seen as busting rocks, and nobody wants to work on them. The result is that we get a plethora of second-rate filesystems (i.e. ext4) and a buggy LVM/EVMS layer that nobody wants to work on.

  • Re:Why not ZFS? (Score:4, Interesting)

    by mvdwege ( 243851 ) <mvdwege@mail.com> on Monday October 20, 2008 @02:52AM (#25438091) Homepage Journal

    Come back when ZFS has decent filesystem maintenance tools.

    And don't give me that 'ZFS doesn't need a fsck' crap. SGI tried to pull that with XFS, and it didn't work. Filesystem (at least metadata) corruption will happen, and once it does, ZFS doesn't have the tools to fix it.

    Mart

  • by moosesocks ( 264553 ) on Monday October 20, 2008 @02:52AM (#25438097) Homepage

    Max Volume Size: 8 TiB.

    That's not enough. Given that 1TB storage devices are on the market now, that could become outdated quite quickly. You'd be foolish to adopt that sort of filesystem, unless you were absolutely positive that you'd never upgrade (unlikely).

    Honestly, ZFS seems like it's the holy grail of filesystems. There are a few small issues that might need to be worked out, though it seems as close to "ideal" as you'd ever be able to get.

  • Re:Why not ZFS? (Score:5, Interesting)

    by adrianwn ( 1262452 ) on Monday October 20, 2008 @02:53AM (#25438103)

    A microkernel loads modules into the kernel space.

    No, that's the opposite of a microkernel. A microkernel loads its modules (then often called "servers") into user space. If the kernel and its drivers etc. run in the same address space (as is the case with, e.g., Linux), then we're talking about a monolithic kernel, even if it can dynamically load modules.

  • Re:Why not ZFS? (Score:4, Interesting)

    by BrokenHalo ( 565198 ) on Monday October 20, 2008 @03:38AM (#25438253)
    not to belittle ext3 and ext2 for that matter, but their time is beginning to pass, and something new needs to replace it.

    I'm not sure that I see why, unless you're simply bored with the older filesystems. Something as critical as this should not be driven by what is trendy at any given moment. If one has no need for particular advanced bells or whistles, there is no need to use them.

    For instance, since for historical and security reasons I keep /boot on its own separate partition, mounted read-only, it makes sense not to have anything trying to write to a journal there, so ext2 is still a very good choice. As the partition is tiny (only 20MB), it takes a fraction of a second to run e2fsck over it when required, so there is nothing to be gained by journalling it anyway.

    I still use ReiserFS3 on most of my other partitions, since I don't have any intention of changing the filesystem until I change the drives. ReiserFS is still a good choice for my purposes anyway.
  • by hitchhacker ( 122525 ) on Monday October 20, 2008 @03:57AM (#25438321) Homepage
    B-Tree [wikipedia.org]:

    Not to be confused with binary tree [wikipedia.org].

    -metric

  • by Kent Recal ( 714863 ) on Monday October 20, 2008 @04:14AM (#25438403)

    Well, it looks interesting feature-wise but they seem to be explicitly targeting SuSE - which is a no-go for most people.
    From a glance at the docs (hey, at least they have docs, that's a plus) it also seems like it's tied to specific versions of EVMS and other parts of the kernel, thus if you don't run a "blessed, certified" SuSE kernel with all the nasty patches then you're on your own.

    Just google for "debian|gentoo|redhat|... novell nss filesystem". Apparently nobody even tried to run NSS on another distro, or at least didn't write about it.

    I, for one, would only touch this on a blackbox, vendor-supported appliance but never consider it for a production server of my own (none of which run SuSE).
    If they worked towards integrating it into the mainline kernel, now that would be nice.

  • Re:Ring 1 and 2? (Score:3, Interesting)

    by Anonymous Coward on Monday October 20, 2008 @04:29AM (#25438443)

    Yes, IIRC Windows NT uses rings 0 and 3. However, the problem would not be made better by having more rings; the performance cost is in the transition between rings, not anything special about the rings themselves. E.g. progressing from a hypothetical ring 10 to ring 9 would be as expensive as going from ring 0 to 1, or from ring 0 to ring 100.

  • by Anonymous Coward on Monday October 20, 2008 @05:09AM (#25438597)

    The catch is that coming up with such a layer is kind of tricky. That's probably why Sun didn't bother - they intended to only ever implement one filesystem using it, and had an interest in not making those extra features available in the other supported filesystems.

    Another intermediate layer that sits between the block device and the filesystem, along with some kind of support from the VFS for the new features, would probably be enough to allow implementing a ZFS-like filesystem in Linux. The big ones are that it has to be able to interact better with LVM, and has to be able to handle COW semantics (that'd be a fantastic feature to expose to userspace). With that as a base, most of the features are either already there (in LVM, or Linux software RAID) but need a bit of improvement (like the page cache), or can be done in the filesystem itself (checksumming).

  • Re:Why not ZFS? (Score:4, Interesting)

    by Wonko ( 15033 ) <thehead@patshead.com> on Monday October 20, 2008 @05:38AM (#25438699) Homepage Journal

    I often hear that claim but never see any support of that claim.

    The closest thing to RAID-Z in the Linux kernel is the RAID 5 DM. If you want to write a 4k block to some random location that isn't currently fully cached, the DM has to read one stripe from each disk in the array, make the 4k change, recompute the parity, and then flush that stripe back to each disk. The default stripe size is 64k. That means if you have 4 drives you would be performing a 256k read and a 256k write just to change a single 4k block. Of course, that is the worst case. Best case, you overwrite the entire stripe with a fresh 256k block of data.
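    The arithmetic, for anyone who wants to play with the numbers; this follows the worst case described above (in practice md can sometimes get away with reading only the old data block and old parity):

    ```python
    def rmw_traffic(disks, chunk_kb=64, write_kb=4):
        """Read-modify-write cost when the whole stripe must be cycled."""
        stripe_kb = disks * chunk_kb       # one chunk from every drive
        read_kb = stripe_kb                # fetch the stripe
        written_kb = stripe_kb             # write it back, parity recomputed
        return read_kb, written_kb, (read_kb + written_kb) / write_kb

    r, w, amp = rmw_traffic(disks=4)
    print(f"{r}k read + {w}k written to change 4k: {amp:.0f}x amplification")
    # -> 256k read + 256k written to change 4k: 128x amplification
    ```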

    ZFS and RAID-Z get around that problem by just writing the changed blocks to an unused part of the disk. Once the write is complete it just moves the pointer to the new block location. This is copy-on-write, and this is where the performance boost comes in over RAID 5. With RAID-Z you should never be required to read the whole stripe to do a write.

    RAID-Z also allows for dynamic stripe sizing. That helps get more optimal efficiency on small files and large files.

    The dynamic stripes aren't terribly important, but if you could figure out a way to do the copy-on-write without the filesystem having very fine-grained control of, and knowledge of, the underlying array, we would all love to hear about it :).

  • What about Tux3 (Score:3, Interesting)

    by obi ( 118631 ) on Monday October 20, 2008 @05:41AM (#25438713)
    While btrfs looks quite cool, I'm even more interested to see whether http://tux3.org/ [tux3.org] will go anywhere. Let's hope both will materialise and mature soon.
  • by Wonko ( 15033 ) <thehead@patshead.com> on Monday October 20, 2008 @06:01AM (#25438785) Homepage Journal

    Trivial example: If I want a copy-on-write snapshot, I have to set aside (ahead of time) some fixed amount of space that it can expand into. If I guess high, I waste space. If I guess low, I have to either expand it (somehow, if that's even possible) or lose my snapshot.

    That still only covers one deficiency of LVM snapshots. LVM snapshots are read-only and intended to be temporary. I'm also pretty sure you can't snapshot a snapshot, which wouldn't be at all helpful with a read-only snapshot anyway.

    A humble proposal: We need another layer, between the block layer and the filesystem layer -- call it an extent layer -- which is simply concerned with allocating some amount of space, and (perhaps) assigning it a unique ID. Filesystems could sit above this layer and implement whatever crazy optimizations or semantics they want -- linear vs btree vs whatever for directories, POSIX vs SQL, whatever.

    We'd never be able to get it right and it would probably be more likely to get in the way. We seem to be learning that we can do much niftier things by tightly coupling what used to be very separate layers.

    I haven't actually been seriously following filesystem development for years, so maybe this is already done. Or maybe it's a bad idea. If not, hopefully some kernel developers are reading this.

    I don't really believe it is a bad idea. I do think it would have to be too heavy of a layer, though. It would have to track which file systems own each extent, and if you want to come close to matching RAID-Z you are going to need to be able to return very small extents (LVM defaults to 4MB, IIRC). If a file system is going to be requesting 4k extents you're going to have a lot of overhead in storing the extent ownership and size information. You're also going to have a lot of overhead in checking who owns each extent on any given read or write. I can think of ways to optimize that a bit, but I imagine it'll still have a significant space+performance impact.
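    Back-of-the-envelope numbers for that concern (the 32 bytes of tracking metadata per extent is purely a guess for illustration):

    ```python
    TB = 1024 ** 4
    EXTENT = 4 * 1024     # 4k extents, as discussed above
    META = 32             # assumed bytes per extent: owner, size, location

    extents = TB // EXTENT             # extents on a 1TB volume
    overhead = extents * META
    print(f"{extents:,} extents -> {overhead / 1024**3:.0f} GiB of metadata")
    # -> 268,435,456 extents -> 8 GiB of metadata
    ```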

  • by Jah-Wren Ryel ( 80510 ) on Monday October 20, 2008 @06:09AM (#25438815)

    The weakness with Linux is in the LVM or EVMS layer. They both suck in that they are not enterprise-ready (i.e. multi-TB filesystems, 100+ MB/s sustained read/write): they cause unexplained I/O hiccups, lockups, and kernel panics. LVM/EVMS certainly work fine for Joe Blow's HTPC or a paltry 100GB database, but they fall down under serious load.

    LVM has been rock-solid for me with a ~7TB and two 2TB ext3 filesystems (24 500GB disks) over the course of a year and a half. No problems migrating extents all over the place when I needed to swap disks in and out. Almost identical to HP-UX in functionality, but without the sizing constraints.

    But when I tried xfs for kicks, I found out that a 7TB filesystem means you need 7GB of RAM to fsck it, which is impossible on a 32-bit system. I also had a week where it all went in the shitter because I ran free space down to zero and started getting OS panics and data corruption.

    I'm definitely considering jfs for the next generation; my main complaint with ext3 has been ridiculously slow deletes and fscks, problems which, from what I have read, don't exist with jfs.

  • Re:Why not ZFS? (Score:4, Interesting)

    by Wonko ( 15033 ) <thehead@patshead.com> on Monday October 20, 2008 @06:32AM (#25438923) Homepage Journal

    Of course, the very same copy-on-write will also result in massive file fragmentation, carefully smearing your dbf files over the entire platters, making your SAN caches useless. Over time resulting in horrible read performance.

    If you want good database performance you probably want as little file system overhead as possible between your database and the disk. I wouldn't have expected ZFS to be the most efficient place to store a database.

    I would have to imagine your SAN is just doing uninformed readaheads. That would be a very good way to fill up a cache with useless data if you are reading from a fragmented file system. :)

    This issue with copy-on-write will be entirely sidestepped in a few years by flash storage's lightning-fast seek times and smarter caching. IIRC, isn't the reason that zfs-fuse uses so damn much RAM that ZFS has its own caching logic built in? If the file system knows where all the blocks in a file are, it can do readaheads on its own.

    ZFS is certainly a huge improvement for anyone used to ufs and disksuite, but I have to say that using it in the real world it's not all it's cracked up to be.

    I don't have enough of my own real world experience with ZFS to argue with your experience. In fact, what I know of how ZFS works makes me believe that it can cause exactly the problems of which you speak.

    However, I think there are still a ton of workloads that wouldn't be impacted by these problems. I also believe that a large percentage of those workloads could benefit greatly from some of the features ZFS brings to the table.

    RAID-Z is nice when you need write performance but can't afford the drives for RAID 10. I can think of plenty of times when it would have been nice to have a writable snapshot to chroot into.

    Hell, I would even love to have ZFS on my laptop for snapshotting and cloning. It also seems like ZFS send/recv would make for much more efficient backups of my laptop than rsync buys me.

    Mixing together the features of various layers is, imo, no matter how tempting, simply the wrong approach. Proceed further along that road and you get to record based filesystems or even more special-purpose variants. I mean, there are even more optimizations that you can do if you know the _contents_ of the files.

    I think we are getting some pretty neat new features out of our file systems by blurring the lines between the layers. I wouldn't be surprised if we stumble upon a few more neat ideas before we're through.

    There is still quite a bit of improvement to make even before we have to make the file system aware of what is inside our files. :)

  • by standbypowerguy ( 698339 ) on Monday October 20, 2008 @06:44AM (#25438957) Homepage
    Jail is supposed to be punitive & reflective, not fun or interesting. There are plenty of worthwhile jobs in prison... laundry, cook, librarian, janitor, license plate stamper, etc.
  • Re:Why not ZFS? (Score:3, Interesting)

    by segedunum ( 883035 ) on Monday October 20, 2008 @08:24AM (#25439363)

    ZFS has checksums and will find errors, but will only be able to self-heal them in a redundant configuration. On a single disk, ZFS will find the error thanks to checksums but will not be able to recover your data. Since ZFS was mainly designed for systems that use redundant configurations, it may make sense there, but desktops are never going to do such things.

    I find this checksumming and self-healing interesting, but the real question is what you actually do to really solve it. With ZFS, an awful lot of people over at OpenSolaris get excited about detecting 'bit rot', but answers are a bit thin on the ground when you ask what can be done about it or what some of the errors actually mean. Yeah, you're a bit less likely to get data loss, but you can only really avoid it if you have redundancy. Also, most of the problems ZFS has detected that I have seen were, at a best guess, probably caused by a Solaris device driver doing something no one knew about. The filesystem can't help you there, no matter how advanced it is.
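    The detect-versus-heal distinction is easy to sketch (a toy model, not how ZFS actually stores anything):

    ```python
    import hashlib

    def checksum(data):
        return hashlib.sha256(data).digest()

    def read_block(copies, stored_sum):
        """copies: the same logical block as read from each redundant device."""
        for i, data in enumerate(copies):
            if checksum(data) == stored_sum:
                for j in range(len(copies)):
                    if j != i and checksum(copies[j]) != stored_sum:
                        copies[j] = data          # self-heal the bad copy
                return data
        raise IOError("all copies corrupt: detected, but nothing to heal from")

    good = b"important data"
    s = checksum(good)
    mirror = [good, b"bit-rotted junk"]
    assert read_block(mirror, s) == good and mirror[1] == good   # healed

    # On a single disk there is no second copy:
    # read_block([b"bit-rotted junk"], s) raises -- detection without recovery.
    ```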

    The problem is our current storage technology, and more needs to be done where the problems occur - within disk drives themselves. I'm hoping SSDs will end up giving us a better fundamental starting point when it comes to storage.

  • by Chemisor ( 97276 ) on Monday October 20, 2008 @09:34AM (#25439931)

    > Just search for benchmarks, something like reiserfs beats ext2 by huge margins

    You mean like these ones [netnation.com] where ext2 beats reiserfs in most cases and is at least as fast in the others?

    > I hope you're joking. ext2 is nice and simple, but it's neither fast nor reliable.
    > It uses a linear search to find directory entries, which means it's very slow on
    > large directories, like Maildir mailboxes.

    Believe it or not, the world does not revolve around huge mail servers. Some of us actually run Linux on a desktop, and so don't really care how well an fs handles a million maildir mailboxes. Latency is the most important criterion, and reiserfs is just too complicated to deliver it, as well as being a largely fringe fs. Especially now, with Hans gone, it will become even more fringe.

    > It doesn't do tail packing which means it wastes space and is slower with small files.

    Yup, I'd like to have efficient small file handling. But really, it is better to avoid having many small files in the first place. Use compressed archives to store such things; it's quite a bit more efficient, and does not require exotic file systems which most normal people (i.e. your customers) will not use.

    > It's not reliable because without a journal it needs a fsck after a bad shutdown

    I used to do that, and then I got a UPS instead and switched back to pure ext2. The performance hit from journalling is simply too high to tolerate. A decent UPS (pretty much anything made by APC) will prevent the crashes in the first place, solving the problem completely and without any unnecessary overhead. With UPS prices being as low as they are, there is no excuse for not having one, so I think that journalling will become obsolete in some near future.

  • by RAMMS+EIN ( 578166 ) on Monday October 20, 2008 @12:27PM (#25442347) Homepage Journal

    ``Believe it or not, the world does not revolve around huge mail servers. Some of us actually run Linux on a desktop, and so don't really care about how well an fs handles a million maildir mailboxes.''

    What if I have large Maildir mailboxes on my desktop system? Or anything else that puts many files in a single directory? Just because _you_ don't need that case to be fast doesn't mean it isn't a good idea to have it be fast, anyway.

    ``Latency is the most important criteria, and reiserfs is just too complicated to deliver it''

    Excuse me? Do you have any numbers to back up that claim? Because I'm having a hard time taking it at face value.

    ``as well as being a largely fringe fs''

    A filesystem that has been included in the mainline Linux kernel for several years, is offered as a prominent choice during installation of various distros, used to be the default fs on some distros, and is widely used by people who make conscious and informed choices about which filesystem to use. But yes, if you want to call it a "fringe fs", go right ahead.

    ``Especially now with Hans gone, it would become even more fringe.''

    This, unfortunately, is all too true. ReiserFS still is a great filesystem in terms of reliability and performance, from tiny files to huge ones, under a wide range of scenarios. Reiser4 was going to be even better: faster and more flexible and extensible, with fast arbitrary attributes and a lot of other goodness. But it never made it into the mainline kernel, and, with Hans Reiser in jail, the future doesn't seem bright for Reiser4. On the other hand, there are various new contenders: ZFS, btrfs, and ext4, just to name a few. None of them seem to be quite there yet, but hey, neither was Reiser4.

    ``Yup, I'd like to have efficient small file handling. But really, it is better to avoid having many small files in the first place. Use compressed archives to store such things; it's quite a bit more efficient''

    Kindly point me at this compressed archive format that lets me fetch files (small and large) by name and other attributes more efficiently than Reiser4 or even ReiserFS. Then please point out how I can use this as I would a filesystem: so that the good old Unix software can access the files. And remember: I need random access to the file contents, and I need to be able to add, remove, write, etc. files. And if any operation is interrupted suddenly and unexpectedly, the integrity of my tree needs to be preserved. Bonus points for full data integrity preservation.

    ``The performance hit from journalling is simply too high to tolerate.''

    Performance hit from journalling? And you're using ext2 to avoid it? Your usage patterns must be very different from mine. True, ext2 running in async mode (i.e. no consistency guarantee at all) is faster than ext3 with journalling, which guarantees consistency. On the other hand, with ReiserFS, I can have journalling, guaranteed consistency of at least the filesystem structure, and better performance. Plus, for some strange reason, ext3 seems to lose a lot of files on my systems during normal operation (although they can be recovered by running fsck). Among the three, ReiserFS is the clear winner for me. I am not disputing that you may be seeing other data, but let's at least conclude that ext2 is _not_ faster than all journalled filesystems for everyone, and that the performance hit of journalling, if any, is not "too high to tolerate" for everyone.

    ``With UPS prices being as low as they are, there is no excuse for not having one, so I think that journalling will become obsolete in some near future.''

    I think smart people realize that having a UPS is no guarantee that your system will never fail in the middle of a write. So a method to bring the system back to a consistent state is needed in any case. Let's also realize that journalling isn't only for recovery. It is one way to implement transactions, and transactions are useful for more than recovery alone; for example, they can be used to ensure consistency of data.

  • Re:Ring 1 and 2? (Score:4, Interesting)

    by DamnStupidElf ( 649844 ) <Fingolfin@linuxmail.org> on Monday October 20, 2008 @02:57PM (#25444601)

    Not exactly. To effectively change the actual permissions that the protection rings allow, stacks, segment registers, I/O permission bitmaps, and page tables (among other things) have to be changed. Generally this means reading values from memory into caches, which is slow. Probably the slowest of them all is the page-translation cache (the TLB). Invalidating the entire TLB is godawful slow, and is necessary if each separate user space has a truly private address space rather than simply a chunk out of the entire virtual address space. Even for operating systems that partition the virtual address space into regions for each user process, the local descriptor table (or equivalent) for segment access needs to be reloaded. This has to happen for every cross-privilege-level call. It is *much* faster to simply call another kernel-mode function (push some stuff on the stack, change the instruction pointer, and you're done) without messing with caches.

    In fact, it would be even faster not to separate the kernel and user-space processes at all, and instead use formal verification or a virtual machine (which really just means a smaller instruction set that's easier to verify) to prove that no user process could ever mess with the kernel or other processes. Virtual machines for languages are essentially at this stage today; they implement what would constitute a kernel as the run-time portions of the virtual machine, running the virtualized software in the same address space. There have been some attacks based on virtual-machine weaknesses or memory corruption that break the protection model by changing data structures so that they violate the security model. This can happen in OSes that use hardware protection as well; there are just fewer places in memory where random changes can cause problems (just the page tables and other security paraphernalia), making it less likely.

  • by Anonymous Coward on Monday October 20, 2008 @05:33PM (#25446507)

    deserved to fsck.btrfs /

    Maybe the GPP intended a read-only mount, since a diversified filesystem setup is the whole point of stability and performance.
    Notice how ext2 is fastest, but nobody remembered to mount / read-only ("ro") and put /usr on a filesystem chosen for its particular advantages. Tell me, who keeps the base system directories constantly in read-write mode, as though they just can't decide what software they want on their computer? Some people have already decided and installed their applications and libraries; we aren't shuffling everything around like mad XP, OSX, and muVista. Tell me, who is faster in read-only? Now consider why one needs a journal anywhere outside /home (better to symlink that to the actual location, /usr/local/home or /usr/home). What keeps people from remounting the root fs read-write only for maintenance, to move system binaries and libraries, and then remounting read-only when done?

  • Re:Why not ZFS? (Score:3, Interesting)

    by harry666t ( 1062422 ) <harry666t@nospAM.gmail.com> on Tuesday October 21, 2008 @11:41AM (#25454523)
    What about kernels written in type-safe languages? (Singularity, all the Java OSs)

    In these systems, ALL the programs are run in one address space. Does it make the whole OS (not just the kernel) monolithic or what?...
  • Re:Why not ZFS? (Score:2, Interesting)

    by adrianwn ( 1262452 ) on Tuesday October 21, 2008 @02:32PM (#25457235)
    Obviously the common definition of "microkernel" does not apply to SAS (Single Address Space) systems. The difference between Singularity and Linux is that in Linux all the modules logically belong to the kernel, while they are logically separated in Singularity: in Linux all data structures can potentially be accessed by every module; this is not the case in Singularity. Hence you can call Singularity a microkernel system, even though everything runs in the same address space.
