
Kernel Hackers On Ext3/4 After 2.6.29 Release

Posted by timothy
from the good-things-come-from-certain-clashes dept.
microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints about what seems to be a long-standing performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Kleen, Theodore Ts'o, and of course Linus Torvalds have all participated. The discussion may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"

  • by Anonymous Coward on Wednesday March 25, 2009 @08:47AM (#27328043)

    Quote from Linus:

    "...the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."

    In the interests of fairness... it should be fairly easy to track down the person or group of people who did this. Code commits in the Linux world seem to be pretty well documented.

      How about ASKING them rather than calling them morons?

    (note: they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus)


  • by pla (258480) on Wednesday March 25, 2009 @08:47AM (#27328051) Journal
    FTA: "if you write your data _first_, you're never going to see corruption at all"

    Agreed, but I think this still misses the point - Computers go down unexpectedly. Period.

    Once upon a time, we all seemed to understand that, and considered writeback behavior (when rarely available) always a dangerous option only for use in non-production systems and with a good UPS connected. And now? We have writeback FS caching enabled by silent default, sometimes without even a way to disable it!

    Yes, it gives a huge performance boost... But performance without reliability means absolutely nothing. Eventually every computer will go down without enough warning to flush the write buffers.
  • by Anonymous Coward on Wednesday March 25, 2009 @09:09AM (#27328307)

    Torvalds knows exactly who it is, and most people following the discussion probably know it, too.
    Also, there has been a fairly public discussion including a statement by the responsible person in question.

    Not saying the name is Torvalds' attempt at saving grace. Similar to a parent of two children saying, "I don't know who made the mess, but when I come back, it had better be cleaned up."

    Yes, Mr. Torvalds is fairly outspoken.

  • by Anonymous Coward on Wednesday March 25, 2009 @09:11AM (#27328345)

    Yes! This is the whole point. I am not a filesystem guy either. I don't even know that much about filesystems. But imagine you write a program with some common data storage. Imagine part of that common data is a pointer to some kind of matrix or whatever. Does anybody think it is a good idea to set that pointer first, and then initialize the data later?

    Sure, a really robust program should be able to somehow recover from corrupt data. But that doesn't mean you can just switch your brain off when writing the data.
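
    The pointer analogy above can be sketched in a few lines. This is only a toy illustration, not filesystem code: the `Store` class, its `blocks` dictionary, and its `current` field are hypothetical names invented for the example, with `current` standing in for the metadata pointer.

```python
class Store:
    """Toy shared store: `current` plays the role of the metadata pointer."""

    def __init__(self):
        self.blocks = {}      # block id -> contents (the "data")
        self.current = None   # id of the live block (the "pointer")

    def publish_safely(self, block_id, contents):
        # Data first: the block is complete before anything refers to it.
        self.blocks[block_id] = contents
        # Pointer second: an interruption between the two steps leaves
        # the old pointer in place, never a dangling one.
        self.current = block_id

    def read(self):
        # Raises KeyError if the pointer was set before the data existed.
        return self.blocks[self.current]


store = Store()
store.publish_safely("m1", [[1, 0], [0, 1]])
print(store.read())
```

    Reversing the two assignments in `publish_safely` would open a window in which `read()` fails, which is the in-memory version of the complaint about writeback ordering.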

  • by morgan_greywolf (835522) on Wednesday March 25, 2009 @09:29AM (#27328575) Homepage Journal

    Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.

    It's common sense! Duh. Write data first, pointers to data second. If the system goes down, you're far less likely to lose anything. That's obvious. Those who think this is somehow not obvious don't have the right mentality to be writing kernel code.

    I think the problem is Ted Ts'o has had a slight 'works for me' attitude about it:

    All I can tell you is that *I* don't run into them, even when I was
    using ext3 and before I got an SSD in my laptop. I don't understand
    why; maybe because I don't get really nice toys like systems with
    32G's of memory. Or maybe it's because I don't use icecream (whatever
    that is).

  • by Anonymous Coward on Wednesday March 25, 2009 @09:34AM (#27328629)

    Well this is just my meta comment. I'll be writing my real comment later...

    You forgot to include a link to the comment you'll be writing later.

  • by houghi (78078) on Wednesday March 25, 2009 @09:35AM (#27328649)

    Knowing the humor that Linus has, it could be himself.

  • by Blackknight (25168) on Wednesday March 25, 2009 @09:37AM (#27328689) Homepage

    Solaris 10 with ZFS, if you actually care about your data.

  • by Colin Smith (2679) on Wednesday March 25, 2009 @09:41AM (#27328739)

    Doesn't ext3 work in exactly the way mentioned? AIUI ordered data mode is the default.

    From the FAQ:

    "mount -o data=ordered"
                    Only journals metadata changes, but data updates are flushed to
                    disk before any transactions commit. Data writes are not atomic
                    but this mode still guarantees that after a crash, files will
                    never contain stale data blocks from old files.

    "mount -o data=writeback"
                    Only journals metadata changes, and data updates are entirely
                    left to the normal "sync" process. After a crash, files will
                    contain stale data blocks from old files: this mode is
                    exactly equivalent to running ext2 with a very fast fsck on reboot.

    So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...
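
    To see which of these modes a mounted filesystem reports, one option is to read the options field of its /proc/mounts entry. The sketch below only parses a line in that format; the sample line is hypothetical, and whether a given kernel lists the data= option when the default is in effect is not something this example assumes.

```python
def journal_mode(mounts_line):
    """Return the data= journalling mode from one /proc/mounts-style line.

    Returns None when no data= option is listed in the line.
    """
    fields = mounts_line.split()
    # Field layout: device, mount point, fs type, options, dump, pass.
    options = fields[3].split(",")
    for opt in options:
        if opt.startswith("data="):
            return opt[len("data="):]
    return None


# Hypothetical sample line, not taken from a real system.
sample = "/dev/sda1 / ext3 rw,relatime,errors=remount-ro,data=ordered 0 0"
print(journal_mode(sample))
```

    A None result means only that the option was absent from the line, not that journalling is off.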

  • by Hatta (162192) on Wednesday March 25, 2009 @09:55AM (#27328947) Journal

    In that case, you've removed or overwritten the data on disk, but now the metadata is invalid.

    i.e. You truncated a file to 0 bytes, and wrote the data.

    Why on earth would you do that? Write the new data, update the metadata, THEN remove the old file.

  • Re:A UPS (Score:3, Insightful)

    by ledow (319597) on Wednesday March 25, 2009 @10:15AM (#27329225) Homepage

    Yeah, I have to second this... all the journalling filesystems in the world can't compete with a bog-standard, home-based UPS. You just need to make ABSOLUTELY sure that the system shuts down when the battery STARTS going (don't try to be fancy and keep it running for the full battery lifetime) and that the system WILL shut down, no questions asked.

    A UPS costs, what, £50 for a cheap, home-based one? Batteries might cost you £20 a year or so on average (and probably a lot less if you just need "shutdown safely" rather than "carry on running"). You don't need it to give a lot of power (run ONLY the base unit off it... anything else and you could hit overloads, etc... you *won't* be operating the PC when it's on battery, you just want it to shut down and, optionally, give you a beep or two when it has shut down successfully), or for very long at all. You just need a fail-safe way of detecting when the power is out so that you can shut down safely. You also want to check that your cabling is good (nothing more embarrassing than having a UPS and then pulling the wrong cable out).

    Above and beyond that, filesystem and/or data corruption is one of those things that are almost guaranteed to happen unless you put a lot of effort into preventing it (battery-backed RAID controllers, filesystems with slow-but-sure settings, integrity checking etc.). Make it easy on yourself - use a UPS to stop the problem happening ever, rather than try to have something that *might* clean up nicely if it does happen. Even Google don't bother with journalling - if a PC loses power, it's rebuilt from an image. It's not worth faffing about to see if/when/how a filesystem can be repaired, just ensure you have adequate backups and try to stop it happening in the first place.

  • by SpinyNorman (33776) on Wednesday March 25, 2009 @10:19AM (#27329273)

    fsync() (sync all pending driver buffers to disk) certainly has a major performance cost, but sometimes you do want to know that your data actually made it to disk - that's an entirely different issue from journalling and data/meta-data order of writes which is about making sure the file system is recoverable to some consistent state in the event of a crash.

    I think sometimes programmers do fsync() when they really want fflush() (flush library buffers to driver) which is about program behavior ("I want this data written to disk real-soon-now", not hanging around in the library buffer indefinitely) rather than a data-on-disk guarantee.

    IMO telling programmers to flatly avoid fsync is almost as bad as having a borked meta-data/data write order - programmers should be educated about what fsync does and when they really want/need it and when they don't. I'll also bet that if the file systems supported transactions (all-or-nothing journalling of a sequence of writes to disk), maybe via an ioctl(), that many people would be using that instead.
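
    The fflush()/fsync() distinction above has a direct counterpart in Python, where `file.flush()` empties the user-space buffer and `os.fsync()` asks the kernel to commit the file to the storage device. A minimal sketch, using a throwaway temporary directory (the file name is arbitrary):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as f:
    f.write("important data\n")
    # flush(): move bytes from the library buffer into the kernel.
    # Other processes can now read the data, but it may not be on disk.
    f.flush()
    # os.fsync(): ask the kernel to push this file's data to the
    # storage device. This is the call with the real performance cost.
    os.fsync(f.fileno())

with open(path) as f:
    content = f.read()
print(repr(content))
```

    Dropping the `os.fsync()` line keeps the "written real-soon-now" behavior without the wait for the device, which is the cheaper option the comment suggests many programs actually want.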

  • by Skuto (171945) on Wednesday March 25, 2009 @10:29AM (#27329373) Homepage

    I agree that the who-dun-it part is irrelevant. I disagree on the "SO f***d" part. We have three filesystems that write the journal prior to the data. Basically, we know the issue, and a similar fix can be shared amongst the three affected filesystems.

    I would be very surprised if the fix can be shared between the filesystems. At least the most serious among those involved, XFS, sits on a complete intermediate compatibility layer that makes Linux look like IRIX.

    Linux filesystems are seriously in a bad state. You simply cannot pick a good one. Either you get one that does not actively kill your data (ext3 ordered/journal) or you pick one which actually gives decent performance (anything besides ext3).

    Obviously, we should have both. It's not like that is impossible. But it's surprising how long those problems lasted. It's not like filesystems are a MINOR part of the entire OS.

    Probably part of the reason is that we have JFS, XFS, ext3/4, reiser3/4, tux3, btrfs... Filesystem developers suffer very heavily from NIH syndrome. Instead of one good one we have 8 that "almost" work.

    But almost is not good for something so essential. This is not the kind of choice that is good. It's time one filesystem wins, gets fixed, and the rest is left dead.

  • by linuxrocks123 (905424) on Wednesday March 25, 2009 @10:30AM (#27329391) Homepage Journal

    Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking. All the filesystem can normally promise in the event of a crash is that the metadata will describe a valid filesystem somewhere between the last returned synchronization call and the state at the event of the crash. If you need more than that -- and you really, probably don't -- you have to do special things, such as running an OS that never, ever, ever crashes and putting a special capacitor in the system so the OS can flush everything to disk before the computer loses power in an outage.

  • by Skuto (171945) on Wednesday March 25, 2009 @10:33AM (#27329433) Homepage

    So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...

    The thread starts with someone having serious performance problems exactly because ext3 ordered mode is so slow in some circumstances...

    Like when you fsync().

  • Re:Linus (Score:1, Insightful)

    by Anonymous Coward on Wednesday March 25, 2009 @10:48AM (#27329625)

    Oh come off it. You must be an American, because in America excessive gentleness and tenderness in dealing with even the most outrageous and inexcusable problems seems to be the present cultural norm.

    Linus, perhaps, is a taskmaster and perfectionist. The Linux OS is his baby and any major difficulties will ultimately be a bad reflection on him alone.

    It is not inappropriate to sometimes rudely castigate one's associates. It is a kind of shaming game that is intended to inspire better performance. I recall that during the Intel ethernet fiasco involving the e1000e driver, Torvalds was equally brusque toward the Intel developers for their "stupid" oversights.

    What we need is more, and not less, of such an aggressive attitude. A real man can take it. Indeed, real men will welcome it, because the end result, in spite of any hurt feelings, is an overall higher quality of craftsmanship.

  • by Skuto (171945) on Wednesday March 25, 2009 @11:08AM (#27329873) Homepage

    fsync() (sync all pending driver buffers to disk) certainly has a major performance cost, but sometimes you do want to know that your data actually made it to disk - that's an entirely different issue from journalling and data/meta-data order of writes which is about making sure the file system is recoverable to some consistent state in the event of a crash.

    The two issues are very closely related, not "an entirely different issue". What the apps want is not "put this data on the disk, NOW", but "put this data on the disk sometime, but do NOT kill the old data until that is done".

    Applications don't want to be sure that the new version is on disk. They want to be sure that SOME version is on disk after a crash. This is exactly what some people can't seem to understand.

    fsync() ensures the first at a huge performance cost. rename() + ext3 ordered gives you the latter. The problem is that ext4 breaks this BECAUSE of the journal ordering. The "consistent state" is broken for application data.

    I'll also bet that if the file systems supported transactions (all-or-nothing journalling of a sequence of writes to disk), maybe via an ioctl(), that many people would be using that instead.

    Yes. But they are assuming this exists and the API is called rename() :)
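
    The write-then-rename idiom under discussion looks roughly like the sketch below. The helper name `replace_atomically` and its `durable` flag are invented for this example. The rename gives readers either the old file or the new one; whether the same holds across a crash without the fsync() depends on the filesystem and mount options, which is exactly what the thread is arguing about.

```python
import os
import tempfile


def replace_atomically(path, data, durable=True):
    """Write `data` to a temp file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # so it is created in the target's directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            if durable:
                # Force the new contents to disk before the rename, so
                # the name never refers to data that was not written.
                os.fsync(tmp.fileno())
        # rename() semantics: readers see the old file or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the old file untouched and clean up the temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "settings.conf")
replace_atomically(target, b"version=1\n")
replace_atomically(target, b"version=2\n")
with open(target, "rb") as f:
    result = f.read()
print(result)
```

    Calling it with `durable=False` is the variant that relies on the filesystem to order data before the rename, i.e. the assumption that ext3 ordered mode satisfied in practice.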

  • by Anonymous Coward on Wednesday March 25, 2009 @11:47AM (#27330375)

    Umm... If this was Microsoft's filesystem, we wouldn't be following a conversation between the filesystem developer and the lead kernel developer. And no matter how curious or knowledgeable we were, no one outside of Redmond would know the details.

    We would be privileged to know about any issue at all, and any knowledge of it would be filtered through Microsoft marketing and thousands of paid and unpaid Microsoft apologists (like yourself); the developers themselves would be gagged by NDAs (I'm not even going to talk about the fact that we are all able to customize the kernel, filesystem, and even the applications causing the problems for our own requirements).

    If it were Microsoft's filesystem, we likely wouldn't be having this discussion at all.

  • by cream wobbly (1102689) on Wednesday March 25, 2009 @11:48AM (#27330385)

    Indeed, it's just name calling.

    Given a choice, I'd employ the more mundane developer rather than the brilliant kid with the mouth.

    For exactly the same reasons, I wouldn't want to work with him. I've had to work with loudmouths in the past, and the abuse is wearying. It saps your creativity, because you don't want to risk triggering another outburst; but they come anyway. Each time, I've left with renewed energy.

    Right now, I have a job where everyone treats everyone else with respect, and not without justification. Sometimes there are tensions, but nothing like the childish bullying described above.

  • by SpinyNorman (33776) on Wednesday March 25, 2009 @12:05PM (#27330629)

    (1) Never point to a structure before it has been initialized

    Which surely includes writing data before meta-data (and write the data someplace other than where the old meta-data is pointing), which is what Linus was saying.

  • Re:Linus (Score:3, Insightful)

    by clarkn0va (807617) on Wednesday March 25, 2009 @12:14PM (#27330785) Homepage

    What we need is more, and not less, of such an aggressive attitude. A real man can take it.

    That depends if you're trying to construct a team of "real men" or a team of skilled developers.

    People sometimes confuse the idea or the act with the person associated with it. If I propose a stupid idea or commit a stupid act, then by all means call me out and tell me that it's stupid and why. But save the ad hominem attacks. Calling somebody a moron accomplishes no good thing, and doing it in public is an extremely quick and effective way of destroying team morale.

  • Re:Linus (Score:3, Insightful)

    by moderatorrater (1095745) on Wednesday March 25, 2009 @12:19PM (#27330877)
    I think it's more a matter of dealing with divas all day. It's pretty clear that the two sides of this issue are the side with technical people convinced that the correctness of the journaling system overcomes any difficulties with integrity, and people who think that integrity should be paramount. For most users, disk integrity IS the number one priority. It seems to me that this is a case of some people not being able to see that they're wrong.

    In a corporation, it's as simple as saying, "do it our way or hit the street." With Linux development the leaders don't have that power, so they may replace it with forcefulness. Besides, the honesty is kind of refreshing. Linus lays out a clear argument and only then starts insulting the other person. He's being brutal, but he's giving them more information than a more polite person might.
  • by Cassini2 (956052) on Wednesday March 25, 2009 @12:32PM (#27331079)

    When you have less than 64K of RAM, and a processor that barely has a modern memory management unit, then some of these "extras" like Copy-On-Write appear as advanced features. Additionally, when your computer costs $500,000, you tend not to scrimp on stuff like a UPS.

    Economics have changed much since the early days of UNIX. Many of the file system design principles still remain the same. Assumptions need to change with the times. Reasonable historical assumptions were:
    - Every UNIX machine has a UPS.
    - Production servers run UNIX. What's this Linux you are talking about?
    - Disk space is expensive. No one will pay for unused disk space.
    - RAM is expensive. As such, it can be quickly flushed to disk.
    - No one has enough disk space, RAM, or disk bandwidth to experience a random fault rate of 1 part in 1 quadrillion (1E-15).
    Times have changed. Linux is used on heavy servers now. UNIX (with deference to AIX and Solaris) is almost gone from the marketplace. RAM and disk space are cheap, so cheap that random data errors can be a big issue. A UPS can cost more than a hard drive, and sometimes more than the computer it is attached to. Disk capacities are huge.
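
    To put the 1E-15 figure in perspective, here is a rough calculation. It assumes the rate is per bit read and that errors are independent, neither of which the comment states, so treat the numbers as order-of-magnitude only.

```python
def expected_bit_errors(bytes_read, error_rate_per_bit=1e-15):
    """Expected number of bad bits when reading `bytes_read` bytes.

    Assumes an independent per-bit error rate (an assumption made
    for this example, not a measured figure).
    """
    return bytes_read * 8 * error_rate_per_bit


TB = 10**12  # decimal terabyte, as used by disk vendors
for size_tb in (1, 125, 1000):
    print(size_tb, "TB read ->", expected_bit_errors(size_tb * TB), "expected errors")
```

    Under those assumptions a single pass over 1 TB expects roughly 0.008 bad bits, and about 125 TB of reads expects one, which is a volume a busy array reaches quickly.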

    Unfortunately, the file system designers haven't kept pace. The Ext4 bug was detected, reproduced, and ultimately solved for a group of desktop Ubuntu users. Linux is used in cheap embedded applications, like home NAS servers, applications that don't have a UPS. Linux isn't just a server O/S anymore. The way to design and optimize a file system needs to change too.

    Additionally, even for servers, the times have changed, and this affects file systems. It used to be that accepting data loss was OK, since you would need to rebuild a server after a failure. Today, the disk arrays are so large, that if you attempted to restore the data from backups, it would take hours (sometimes days.) As such, capabilities like "snapshots" are becoming very important to servers. Server disk storage is increasingly bandwidth limited, and not disk size limited. Today, it is possible to have 1 TB of data on a single disk, while being unable to use that disk space effectively. Under many workloads, the users are capable of changing the data faster than a backup program can copy the data off the disk. In such a case, without a snapshot capability, it is impossible to make a valid backup.
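
    The bandwidth argument above is simple arithmetic. The sketch below uses an assumed sustained throughput of 100 MB/s, a figure chosen for illustration and not taken from the comment; the function names are likewise made up.

```python
def full_copy_hours(size_bytes, throughput_bytes_per_s):
    """Hours needed to copy `size_bytes` at a constant throughput."""
    return size_bytes / throughput_bytes_per_s / 3600


def can_backup_catch_up(change_rate, copy_rate):
    """A plain copy only converges if it outruns the rate of change."""
    return copy_rate > change_rate


TB = 10**12
MB = 10**6
hours = full_copy_hours(1 * TB, 100 * MB)   # assumed 100 MB/s
print(round(hours, 2), "hours to copy 1 TB")
print(can_backup_catch_up(change_rate=120 * MB, copy_rate=100 * MB))
```

    When the second check is false, a copy taken without a snapshot never reflects a single point in time, which is the case the comment describes.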

  • by Anonymous Coward on Wednesday March 25, 2009 @01:19PM (#27331917)

    The situation you describe doesn't occur with a journaled filesystem. The journal does not roll back; it is a to-do list. The metadata update is added to the journal first, so even if the data is written before the actual metadata update, the metadata update is not lost. After the crash, the journal ensures that the new metadata becomes the current state of the filesystem.

    The interesting case (the one which triggered this whole discussion) is when the metadata update is performed without the corresponding data update. This happens when data is not journaled and the filesystem doesn't ensure that metadata updates related to unwritten data are discarded. The described behavior is more likely in Ext4 because of the longer data write delay, but it exists just the same in Ext3.
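
    The "metadata without data" case can be shown with a toy crash model. This is not how ext3 or ext4 is implemented: it ignores the journal entirely and only tracks which of two writes reached the disk before power was lost. All names are hypothetical.

```python
def run_until_crash(order, crash_after):
    """Apply the first `crash_after` steps of an update, then 'crash'.

    The disk starts with file "f" pointing at block 1 ("old").
    The update wants "f" to point at block 2 ("new").
    """
    data = {1: "old"}   # block number -> contents on disk
    meta = {"f": 1}     # file name -> block number on disk

    steps = {
        "data": lambda: data.__setitem__(2, "new"),
        "meta": lambda: meta.__setitem__("f", 2),
    }
    for name in order[:crash_after]:
        steps[name]()

    # What a reader finds after reboot; None models a block that
    # holds stale or empty contents because it was never written.
    return data.get(meta["f"])


ORDERED = ("data", "meta")     # data reaches disk before metadata
WRITEBACK = ("meta", "data")   # metadata may reach disk first

for label, order in (("ordered", ORDERED), ("writeback", WRITEBACK)):
    outcomes = [run_until_crash(order, n) for n in range(3)]
    print(label, outcomes)
```

    In this model the ordered sequence always yields either the old or the new contents, while the writeback sequence has one crash point where the file refers to a block that was never written.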

  • by mmontour (2208) on Wednesday March 25, 2009 @02:56PM (#27333513)

    Some of us have discovered the 'shutdown' command. [...]Anyhow, I suggest you use it occasionally. Then perhaps you can only fsck when something bad has happened.

    Don't be too smug - a "shutdown" doesn't always guarantee a clean startup. I remember a bug (hopefully fixed now) where "shutdown" was completing so quickly that it powered off the computer while data was still sitting in the hard drive's volatile write cache. Even though the OS had unmounted the filesystem, the on-disk blocks were still dirty.

    p.s. If any OS/kernel developers are listening - how about implementing a standard API through which drive write-caches can be flushed+disabled whenever a system starts a shutdown procedure, gets a signal that the UPS is running on battery power, or otherwise concludes that it is in a state where a temporarily-increased risk of data loss justifies slowing down I/O?

  • by jabuzz (182671) on Wednesday March 25, 2009 @03:40PM (#27334101) Homepage

    ZFS is production ready my ass. ZFS will be production ready when I can take a disk out of the filesystem, when I can set quotas, when it supports HSM, and when it supports clustering.

    Finally it will be production ready when it has a decade of hardening in the real world.

    In the meantime both JFS and XFS offer better alternatives, and for me only GPFS (which admittedly is closed source but does run under Linux) ticks all the boxes.

    The crazy thing is that ext4 offers nothing that we don't get with XFS or JFS, and if RedHat would stop pussyfooting about and support either one (and I don't care which), the whole ext? line could die.

    The ext2/3 line had a place and a time, and that place and time has long gone. It needs to die...

  • Re:ZFS (Score:4, Insightful)

    by Mr.Ned (79679) on Wednesday March 25, 2009 @04:55PM (#27335051)

    FreeBSD has ZFS. My understanding is that while ZFS is a good filesystem, it isn't without issues. It doesn't work well on 32-bit architectures because of the memory requirements, isn't reliable enough to host a swap partition, and can't be used as a boot partition when part of a pool. FreeBSD publishes a rundown of known problems.

    On the other hand, the new filesystems in the Linux kernel - ext4 and btrfs - are taking the lessons learned from ZFS. I'm excited about next-generation filesystems, and I don't think ZFS is the only way to go.

  • by Anonymous Coward on Wednesday March 25, 2009 @06:27PM (#27336033)

    Yes, but in this case ext3 and ext4 keep (convenient, fast) consistency of the filesystem at the cost of worse behavior regarding the user experience (and user data).
