Data Storage GUI KDE Software Linux

Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4 830

Posted by timothy
from the heavy-trade-off dept.
cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T : Headline clarified to dispel the impression that this was a fault in Ext4.

  • Not a bug (Score:5, Informative)

    by casualsax3 (875131) on Wednesday March 11, 2009 @05:06PM (#27157149)
    It's a consequence of not writing software properly. Relevant links from later in the same comment thread, for those who might otherwise miss them:

    https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45 [launchpad.net]

    https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 [launchpad.net]

    • Bull (Score:4, Insightful)

      by Jane Q. Public (1010737) on Wednesday March 11, 2009 @05:16PM (#27157285)
      Blaming it on the applications is a cop-out. The filesystem is flawed, plain and simple. The journal should not be written so far in advance of the records actually being stored. That is a recipe for disaster, no matter how much you try to explain it away.
      • Re:Bull (Score:5, Funny)

        by Lord Ender (156273) on Wednesday March 11, 2009 @05:34PM (#27157609) Homepage

        In fact, there is no such thing as an OS bug! All good programmers should re-implement essential and basic operating system features in their user applications whenever they run into so-called "OS bugs." If you question this, you must be a bad programmer, obviously.

      • Re:Bull (Score:5, Insightful)

        by wild_berry (448019) * on Wednesday March 11, 2009 @05:36PM (#27157629) Journal

        The journal isn't being written before the data. Nothing is written for periods of 45 to 120 seconds, so that writes can be batched into efficient lumps. The journal is there to make sure that the data on disk makes sense if a crash occurs.

        If your system crashes after a write hasn't hit the disk, you lose either way. Ext3 was set to write at most 5 seconds later. Ext4 is looser than that, but with associated performance benefits.

        • Re: (Score:3, Insightful)

          by Dahamma (304068)

          Oh great... basing ext4 performance gains on caching writes in the OS for 2 minutes just means they will focus their optimizations in ways that will suck even worse than ext3 does for applications that can't afford the risk of enabling write caching...

      • Re:Bull (Score:5, Informative)

        by Anonymous Coward on Wednesday March 11, 2009 @05:36PM (#27157635)

        This is NOT a bug. Read the POSIX documents.

        Filesystem metadata and file contents are NOT required to be synchronous, and a sync is needed to ensure they are synchronised.

        It's just down to retarded programmers who assume they can truncate/rename files and any data pending writes will magically meet up a-la ext3 (which has a mount option which does not sync automatically btw).

        RTFPS (Read The Fine POSIX Spec).

      • Re:Bull (Score:5, Insightful)

        by Eugenia Loli (250395) on Wednesday March 11, 2009 @05:44PM (#27157761) Homepage Journal

        Rewriting the same file over and over is known to be risky. The proper sequence is to create a new file, sync, rename the new file on top of the old one, and optionally sync again. In other words, app developers must be more careful about what they do, rather than put all the blame on the filesystems. There is only so much an fs can do to avoid such brouhahas. Many other filesystems behave similarly to ext4, btw.
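
        The sequence described above (new file, sync, rename over the old one) might look roughly like the following. This is only a sketch: the helper names write_all and replace_file and the ".tmp" suffix are made up for illustration, and error handling is kept to the minimum.

        ```c
        #include <assert.h>
        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        /* Write all len bytes, retrying on EINTR and continuing after short writes. */
        static int write_all(int fd, const char *buf, size_t len)
        {
            while (len > 0) {
                ssize_t n = write(fd, buf, len);
                if (n < 0) {
                    if (errno == EINTR)
                        continue;
                    return -1;
                }
                buf += n;
                len -= (size_t)n;
            }
            return 0;
        }

        /* Replace `path` with `len` bytes from `buf` by way of a temporary file.
         * The ".tmp" suffix is an arbitrary choice for this sketch. */
        int replace_file(const char *path, const char *buf, size_t len)
        {
            char tmp[4096];
            assert(path != NULL && buf != NULL);
            if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
                return -1;

            int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0)
                return -1;
            /* Get the new contents onto disk before the rename can expose them. */
            if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
                close(fd);
                unlink(tmp);
                return -1;
            }
            if (close(fd) < 0) {
                unlink(tmp);
                return -1;
            }
            /* rename() swaps the name over in one step: old file or new file. */
            if (rename(tmp, path) < 0) {
                unlink(tmp);
                return -1;
            }
            return 0;
        }
        ```

        The "optionally sync" step would be an fsync() on a descriptor for the containing directory, so that the rename itself is on disk; it is left out here to keep the sketch short.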

      • Re:Bull (Score:5, Informative)

        by pc486 (86611) on Wednesday March 11, 2009 @05:50PM (#27157841) Homepage

        Ext3 doesn't write out immediately either. If the system crashes within the commit interval, you'll lose whatever data was written during that interval. That's only 5 seconds of data if you're lucky, much more data if you're unlucky. Ext4 simply made that commit interval and backend behavior different than what applications were expecting.

        All modern fs drivers, including ext3 and NTFS, do not write immediately to disk. If they did then system performance would really slow down to almost unbearable speeds (only about 100 syncs/sec on standard consumer magnetic drives). And sometimes the sync call will not occur since some hardware fakes syncs (RAID controllers often do this).

        POSIX doesn't define flushing behavior when writing and closing files. If your application needs data to be in NV memory, use fsync. If it doesn't care, good. If it does care and it doesn't sync, it's a bad application and is flawed, plain and simple.

    • Re:Not a bug (Score:5, Insightful)

      by mbkennel (97636) on Wednesday March 11, 2009 @05:19PM (#27157323)

      I disagree. "Writing software properly" apparently means taking on a huge burden for simple operations.

      Quoting Ts'o:

      "The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, and that it uses fdatawrite() instead of fsync() to guarantee that data is written on disk. If sqllite had been properly written so that it grabbed new space for its database storage in chunks of 16k or 64k, and released space when it was no longer needed in similar large chunks via truncate(), and if it used fdatasync() instead of fsync(), the performance problems with FireFox 3 wouldn't have taken place."

      In other words, if the programmer took on the burden of tons of work and complexity in order to replicate lots of the functionality of the file system and make it not the file system's problem, then it wouldn't be my problem.
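
      For concreteness, the approach in the quote (grab space in large chunks, then use fdatasync() rather than fsync()) might look something like this. It is a rough sketch, not how sqlite actually works: the struct store type, the store_append name and the 64 KiB chunk size are all made up for illustration. The idea, as I read the quote, is that when the space is already allocated a write changes less metadata, so fdatasync() has less to flush.

      ```c
      #include <assert.h>
      #include <fcntl.h>
      #include <sys/types.h>
      #include <unistd.h>

      #define CHUNK (64 * 1024)   /* grow the file 64 KiB at a time */

      /* Hypothetical append-only store; the caller opens fd and zeroes the counters. */
      struct store {
          int   fd;
          off_t used;        /* bytes of real data written so far */
          off_t allocated;   /* bytes reserved on disk so far */
      };

      /* Append one record, reserving space in CHUNK-sized steps as needed. */
      int store_append(struct store *s, const void *rec, size_t len)
      {
          assert(s != NULL && rec != NULL);
          while (s->used + (off_t)len > s->allocated) {
              /* posix_fallocate() returns 0 on success, an error number otherwise. */
              if (posix_fallocate(s->fd, s->allocated, CHUNK) != 0)
                  return -1;
              s->allocated += CHUNK;
          }
          if (pwrite(s->fd, rec, len, s->used) != (ssize_t)len)
              return -1;
          s->used += (off_t)len;
          /* Flush the data, plus only the metadata needed to read it back. */
          return fdatasync(s->fd);
      }
      ```

      Note how much bookkeeping the caller now carries (used versus allocated bytes) that the filesystem would otherwise track by itself.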

      I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.

      File systems are nice. That's what Unix is about.

      I don't think programmers ought to be required to treat them like a pouty flake: "in some cases, depending on the whims of the kernel and entirely invisible moods, or the way the disk is mounted that you have no control over, stuff might or might not work."

      • Re:Not a bug (Score:5, Interesting)

        by Qzukk (229616) on Wednesday March 11, 2009 @05:29PM (#27157501) Journal

        I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.

        It is perfectly OK to read and write thousands of tiny files. Unless the system is going to crash while you're doing it and you somehow want the magic computer fairy to make sure that the files are still there when you reboot it. In that case, you're going to have to always write every single block out to the disk, and slow everything else down to make sure no process gets an "unreasonable" expectation that their data is safe until the drive catches up.

        Fortunately his patches will include an option to turn the magic computer fairy off.

      • Re: (Score:3, Insightful)

        by Hatta (162192)

        The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries

        Translation: "Our filesystem is so fucked up, even SQL is better."

        WTF is this guy thinking? UNIX has used hundreds of tiny dotfiles for configuration for years and it's always worked well. If this filesystem can't handle it, it's not ready for production. Why not just keep ALL your files

        • Re:Not a bug (Score:5, Insightful)

          by xenocide2 (231786) on Wednesday March 11, 2009 @06:23PM (#27158283) Homepage

          UNIX filesystems have used tiny files for years and they've had data loss under certain conditions. My favorite example is the XFS that would journal just enough to give you a consistent filesystem full of binary nulls on power failure. This behavior was even documented in their FAQ with the reply "If it hurts, don't do it."

          Filesystems are a balancing act. If you want high performance, you want write caching to allow the system to flush writes in parallel while you go on computing, or make another overlapping write that could be merged. If you want high data security, you call fsync and the OS does its best possible job to write to disk before returning (modulo hard drives that lie to you). Or you open the damn file with O_SYNC.
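
          As a minimal sketch of the O_SYNC route (the function name append_sync is made up for illustration), each write() on a descriptor opened this way should return only once the kernel has handed the data to the device. Whether the drive then really has it on the platter is a separate matter.

          ```c
          #include <assert.h>
          #include <fcntl.h>
          #include <string.h>
          #include <unistd.h>

          /* Append one line to `path` through a descriptor opened with O_SYNC. */
          int append_sync(const char *path, const char *line)
          {
              assert(path != NULL && line != NULL);
              int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
              if (fd < 0)
                  return -1;
              size_t len = strlen(line);
              ssize_t n = write(fd, line, len);
              int rc = (n == (ssize_t)len) ? 0 : -1;   /* a short write counts as failure here */
              if (close(fd) < 0)
                  rc = -1;
              return rc;
          }
          ```

          The price is that every write waits for the disk, which is exactly the performance side of the balancing act.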

          What he's suggesting is that the POSIX API allows either option to programmers, who often don't know there's even a choice to be had. So he recommends having the few people who do know the API inside and out focus on system libraries like libsqlite, and having dumbass programmers use those instead. You and he may not be so far apart, except his solution still allows hard-nosed engineers access to low level syscalls, at the price of shooting their foot off.

      • Re:Not a bug (Score:5, Informative)

        by Anonymous Coward on Wednesday March 11, 2009 @05:37PM (#27157655)

        Quoting Ts'o:

        "The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, ...

        Linux reinvents windows registry?
        Who knows what they will come up with next.

      • Re: (Score:3, Insightful)

        by GigsVT (208848)

        Instead, the answer is to use a proper small database like sqllite for application registries

        Yeah, linux should totally put in a Windows style registry. What the fuck is this guy on.

        • by Profane MuthaFucka (574406) <busheatskok@gmail.com> on Wednesday March 11, 2009 @06:01PM (#27158005) Homepage Journal

          That would be smart, but only if the SQL database is encrypted too. It's theoretically possible to read a registry with an editor, and we can't have that. Also, we need a checksum on the registry. If the checksum is bad, we have to overwrite the registry with zeroes. Registries are monolithic, and we have to make sure that either it's good data, or NONE of it is good data. Otherwise the user would get confused.

          I am so excited about this that I'm going to start working on it just as soon as I get done rewriting all my userspace tools in TCL.

      • Re:Not a bug (Score:5, Insightful)

        by Logic and Reason (952833) on Wednesday March 11, 2009 @05:42PM (#27157717) Homepage

        I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.

        To paraphrase https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 [launchpad.net] : You certainly can use tons of tiny files, but if you want to guarantee your data will still be there after a crash, you need to use fsync. And if that causes performance problems, then perhaps you should rethink how your application is doing things.

        • Re: (Score:3, Insightful)

          by gweihir (88907)

          Indeed. And that is what the suggestion about using a database was all about. You still can use all the tiny files. And there are better options than syncing for reliability. For example, rename the file to backup and then write a new file. The backup will still be there and can be used for automated recovery. Come to think of it, any decent text editor does it that way.

          Truncating critical files without backup is just incredibly bad design.

      • Re:Not a bug (Score:5, Insightful)

        by davecb (6526) * <davec-b@rogers.com> on Wednesday March 11, 2009 @05:46PM (#27157797) Homepage Journal

        It seems exceedingly odd that issuing a write for a non-zero-sized file and having it delayed causes the file to become zero-size before the new data is written.

        Generally when one is trying to maintain correctness one allocates space, places the data into it and only then links the space into place (paraphrased from Barry Dwyer's "One more time - how to update a master file", Communications of the ACM, January 1981).

        I'd be inclined to delay the metadata update until after the data was written, as Mr. Ts'o notes was done in ext3. That's certainly what I did back in the days of CP/M, writing DSA-formatted floppies (;-))

        --dave

        • Re: (Score:3, Informative)

          by PhilHibbs (4537)

          It seems exceedingly odd that issuing a write for a non-zero-sized file and having it delayed causes the file to become zero-size before the new data is written.

          But you never create and write to a file as a single operation, there's always one function call to create the file and return a handle to it, and then another function call to write the data using the handle. The first operation writes data to the directory, which is itself a file that already exists, the second allocates some space for the file, writes to it, and updates the directory. Having the file system spot what your application is trying to do and reversing the order of the operations would be... t

      • Re:Not a bug (Score:5, Informative)

        by OeLeWaPpErKe (412765) on Wednesday March 11, 2009 @05:51PM (#27157867) Homepage

        Let's not forget that the only consequence of delayed allocation is the write-out delay changing. Instead of data being "guaranteed" on disk in 5 seconds, that becomes 60 seconds.

        Oh dear God, someone inform the president ! Data that is NEVER guaranteed to be on disk according to spec is only guaranteed on disk after 60 seconds.

        You should not write your application to depend on filesystem-specific behavior. You should write it to the standard, and that means fsync(). Without a call to fsync there is no guarantee; look it up in the documentation (man 2 write).

        The rest of what Ted Ts'o is saying is optimization, speeding up the boot time for gnome/kde; it is not necessary for correct operation.

        Please don't FUD.

        You know I'll look up the docs for you :

        (quote from man 2 write)

        NOTES
                      A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee
                      that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.

                      If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has
                      been written, the call succeeds, and returns the number of bytes written.

        That brings up another point, almost nobody is ready for the second remark either (write might return after a partial write, necessitating a second call)

        So the normal case for a "reliable write" would be this code :

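        // needs <unistd.h> and <errno.h>
        const char *p = (const char *)&data;  // next byte still to be written
        size_t left = sizeof(data);           // number of bytes still to be written
        while (left > 0) {
                ssize_t r = write(fd, p, left);
                if (r < 0) {
                        if (errno == EINTR) continue;  // interrupted before any byte was written: retry
                        break;                         // real error, handled below
                }
                p += r;     // partial write: skip what was written and go around again
                left -= r;
        }
        if (left > 0) { // error handling code, at the very least looking at EIO, ENOSPC and EPIPE for network sockets
        }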

        and *NOT*

        write(fd, data, sizeof(data)); // will probably work

        Just because programmers continuously use the second method (just check a few sf.net projects) doesn't make it the right method (and as there is *NO* way to fix write to make that call reliable in all cases you're going to have to shut up about it eventually)

        Hell, even firefox doesn't check for either EIO or ENOSPC and certainly doesn't handle either of them gracefully, at least not for downloads.

      • Re: (Score:3, Insightful)

        by caerwyn (38056)

        I disagree. "Writing software properly" apparently means taking on a huge burden for simple operations.

        No. Writing software properly means calling fsync() if you need a data guarantee.

        Pretty sure no one in their right mind would call using fsync() barriers a "huge burden". There are an enormous number of programs out there that do this correctly already.

        And then there are some that don't. Those have problems. They're bugs. They need to be fixed. Fixing bugs is not a "huge burden", it's a necessary task.

      • Re:Not a bug (Score:4, Insightful)

        by QuasiEvil (74356) on Wednesday March 11, 2009 @06:27PM (#27158325)

        In other words, if the programmer took on the burden of tons of work and complexity in order to replicate lots of the functionality of the file system and make it not the file system's problem, then it wouldn't be my problem.

        I couldn't agree more. A filesystem *is* a database, people. It's a sort of hierarchical one, but a database nonetheless.

        It shouldn't care if there's some mini-SQL thing app sitting on top providing another speed hit and layer of complexity or just a bunch of apps making hundreds of f{read|write|open|close|sync}() calls against hundreds of files. Hundreds of files, while cluttered, is very simple and easily debugged/fixed when something gets trashed. Some sort of obfuscated database cannot be fixed with mere vi. (Emacs, maybe, but only because it probably has 17 database repair modules built in, right next to the 87 kitchen sinks that are also included.)

        I do rather agree that it's not a bug. An unclean shutdown is an unclean shutdown, and Ts'o is right - there's not a defined behaviour. Ext4 is better at speed, but less safe in an unstable environment. Ext3 is safer, but less speedy. It's all just trade-offs, folks. Pick one appropriate to your use. (Which is why, when I install Jaunty, I'll be using Ext3.)

    • Re: (Score:3, Insightful)

      by idontgno (624372)

      lol.

      It's a consequence of a filesystem that makes bad assumptions about file size.

      I suppose in your world, you open a single file the size of the entire filesystem and just do seek()s within it?

      It's a bug. A filesystem which does not responsibly handle any file of any size between 0 bytes and MAXFILESIZE is bugged.

      Deal with it and join the rest of us in reality.

      • Re:Not a bug (Score:5, Insightful)

        by TerranFury (726743) on Wednesday March 11, 2009 @05:30PM (#27157523)
        Ummm... it deals correctly with files of any size. It just loses recent data if your system crashes before it has flushed what it's got in RAM to disk. That's the case for pretty much any filesystem; it's just a matter of degree, and how "recent" is recent.
        • Re:Not a bug (Score:5, Insightful)

          by fireman sam (662213) on Wednesday March 11, 2009 @05:37PM (#27157657) Homepage Journal

          The benefit of journaling file systems is that after the crash you still have a file system that works. How many folks remember when Windows would crash, resulting in a HDD that was so corrupted the OS wouldn't start. Same with ext2.

          If these folks don't like asynchronous writes, they can edit their fstab (or whatever) to have the sync option so all their writes will be synchronous and the world will be a happy place.

          Note that they will also have to suffer a slower system, and a possibly shortened lifetime of their HDD, but at least their configuration files will be safe.

        • Re:Not a bug (Score:5, Informative)

          by davecb (6526) * <davec-b@rogers.com> on Wednesday March 11, 2009 @05:49PM (#27157825) Homepage Journal

          Er, actually it removes the previous data, then waits to replace it for long enough that the probability of noticing the disappearance approaches unity on flaky hardware (;-))

          --dave

        • Re:Not a bug (Score:5, Informative)

          by Jurily (900488) <[jurily] [at] [gmail.com]> on Wednesday March 11, 2009 @06:04PM (#27158059)

          It just loses recent data if your system crashes before it has flushed what it's got in RAM to disk.

          No, that's the bug. It loses ALL data. You get 0 byte files on reboot.

    • Re: (Score:3, Interesting)

      by jgarra23 (1109651)
      Talk about doublespeak! "Not a bug" vs. "It's a consequence of not writing software properly." Reminds me of that Family Guy episode where Stewie says, "it's not that I want to kill Lois... it's that I don't... want... her... to... live... anymore."
  • Don't worry (Score:5, Funny)

    by sakdoctor (1087155) on Wednesday March 11, 2009 @05:06PM (#27157155) Homepage

    Don't worry guys, I read the summary this time, and it only affects the German version of ext4.

  • pr0n (Score:5, Funny)

    by Quintilian (1496723) on Wednesday March 11, 2009 @05:11PM (#27157223)
    Real reason for the bug report: Someone's angry and wants his porn back.
  • by gweihir (88907) on Wednesday March 11, 2009 @05:16PM (#27157289)

    The problem here is that delaying writes speeds up things greatly but has this possible side-effect. For a shorter commit time, simply stay with ext3. You can also mount your filesystems "sync" for a dramatic performance hit, but no write delay at all.

    Anyways, with modern filesystems data does not go to disk immediately, unless you take additional measures, like a call to fsync. This should be well known to anybody that develops software and is really not a surprise. It has been done like that on server OSes for a very long time. Also note that there is no loss of data older than the write delay period and this only happens on a system crash or power-failure.

    Bottom line: Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.

    • by girlintraining (1395911) on Wednesday March 11, 2009 @05:27PM (#27157467)

      Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.

      You're right, there really is nothing to see here. Or rather, there's nothing left. As the article says, a large number of configuration files are opened and written to as KDE starts up. If KDE crashes and takes the OS with it (as it apparently does), those configuration files may be truncated or deleted entirely -- the commands to re-create and write them having never been sync'd to disk. As the startup of KDE takes longer than the write delay, it's entirely possible for this to seriously screw with the user.

      The two problems are:

      1. Bad application development. Don't delete and then re-create the same file. Use atomic operations that ensure that files you are reading/writing to/from will always be consistent. This can't be done by the Operating System, whatever the four color glossy told you.

      2. Bad Operating System development. If an application kills the kernel, it's usually the kernel's fault (drivers and other code operating in privileged space are obviously not the kernel's fault) -- and this appears to be a crash initiated from code running in user space. Bad kernel, no cookie for you.

      • by gweihir (88907) on Wednesday March 11, 2009 @05:45PM (#27157769)

        I agree on both counts. Some comments

        1) The right sequence of events is this: Rename old file to backup name (atomic). Write new file, sync new file and then delete the backup file. It is however better for anything critical to keep the backup. In any case an application should offer to recover from the backup if the main file is missing or broken. To this end, add a clear end-mark that makes it possible to check whether the file was written completely. Nothing new or exciting, just stuff any good software developer knows.

        2) Yes, a kernel should not crash. Occasionally it happens nonetheless. It is important to notice that ext4 is blameless in the whole mess (unless it causes the crash).
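
        The sequence in point 1 can be sketched roughly as follows. Everything named here is made up for illustration: the "#END" marker, the ".bak" suffix, and the functions save_with_backup and load_with_recovery. The sketch keeps the backup rather than deleting it, and it assumes text files small enough to read into one buffer.

        ```c
        #include <assert.h>
        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        #define END_MARK "#END\n"   /* hypothetical end-of-file marker */

        /* Read a whole file into buf (NUL-terminated). Returns 0 only if the file
         * exists and ends with END_MARK, i.e. it looks completely written. */
        static int read_complete(const char *path, char *buf, size_t cap)
        {
            int fd = open(path, O_RDONLY);
            if (fd < 0)
                return -1;
            size_t used = 0;
            ssize_t n;
            while (used < cap - 1 && (n = read(fd, buf + used, cap - 1 - used)) > 0)
                used += (size_t)n;
            close(fd);
            buf[used] = '\0';
            size_t m = strlen(END_MARK);
            return (used >= m && memcmp(buf + used - m, END_MARK, m) == 0) ? 0 : -1;
        }

        /* Save: move the old file aside, then write body + END_MARK and fsync. */
        int save_with_backup(const char *path, const char *body)
        {
            char bak[4096];
            assert(path != NULL && body != NULL);
            if (snprintf(bak, sizeof bak, "%s.bak", path) >= (int)sizeof bak)
                return -1;
            if (rename(path, bak) < 0 && errno != ENOENT)   /* no old file is fine */
                return -1;
            int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
            if (fd < 0)
                return -1;
            int ok = write(fd, body, strlen(body)) == (ssize_t)strlen(body)
                  && write(fd, END_MARK, strlen(END_MARK)) == (ssize_t)strlen(END_MARK)
                  && fsync(fd) == 0;
            if (close(fd) < 0)
                ok = 0;
            return ok ? 0 : -1;   /* the backup is kept either way */
        }

        /* Load: prefer the main file; fall back to the backup if the main file
         * is missing or incomplete. Returns 0 (main), 1 (backup) or -1. */
        int load_with_recovery(const char *path, char *buf, size_t cap)
        {
            char bak[4096];
            assert(path != NULL && buf != NULL && cap > 0);
            if (read_complete(path, buf, cap) == 0)
                return 0;
            if (snprintf(bak, sizeof bak, "%s.bak", path) >= (int)sizeof bak)
                return -1;
            return read_complete(bak, buf, cap) == 0 ? 1 : -1;
        }
        ```

        One caveat with this sketch: saving again while the main file is broken would move the broken file over the good backup, so an application would want to run the recovery step before its next save.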

    • Translation (Score:4, Insightful)

      by microbee (682094) on Wednesday March 11, 2009 @05:50PM (#27157851)

      We use techniques that show great performance so people can see we beat ext3 and other filesystems.

      Oh shit, as a tradeoff we lose more data in case of a crash. But it's not our fault.

      Honestly, you cannot eat your cake and have it too.

  • Classic tradeoff (Score:5, Insightful)

    by Otterley (29945) on Wednesday March 11, 2009 @05:26PM (#27157445)

    It's amazing how fast a filesystem can be if it makes no guarantees that your data will actually be on disk when the application writes it.

    Anyone who assumes modern filesystems are synchronous by default is deluded. If you need to guarantee your data is actually on disk, open the file with O_SYNC semantics. Otherwise, you take your chances.

    Moreover, there's no assertion that the filesystem was corrupt as a result of the crash. That would be a far more serious concern.

    • Re:Classic tradeoff (Score:4, Informative)

      by imsabbel (611519) on Wednesday March 11, 2009 @05:36PM (#27157643)

      It's even WORSE than just being asynchronous:

      EXT4 reproducibly delays write ops, but commits journal updates concerning this write.

    • Re: (Score:3, Interesting)

      Even if you use O_SYNC, or fsync() there is no guarantee that the data are safely stored on disk.

      You also have to disable HDD caching, e.g., using
        hdparm -W0 /dev/hda1

      • Re: (Score:3, Insightful)

        by gweihir (88907)


        Even if you use O_SYNC, or fsync() there is no guarantee that the data are safely stored on disk.

        You also have to disable HDD caching, e.g., using
            hdparm -W0 /dev/hda1

        Well, yes, but unless you have an extreme write pattern, the disk will not take long to flush to platter. And this will only result in data loss on power failure. If that is really a concern, get an UPS.

  • by microbee (682094) on Wednesday March 11, 2009 @05:28PM (#27157479)

    So, POSIX never guarantees your data is safe unless you do fsync(). So, ext3 was not 100% safer either. So, it's the applications' fault that they truncate files before writing.

    But it doesn't matter what POSIX says. It doesn't matter where the fault lies. To the users, a system either works or not, as a whole.

    EXT4 aims to replace EXT3 and become the next gen de-facto filesystem on Linux desktop. So it has to compete with EXT3 in all regards; not just performance, but data integrity and reliability as well. If in the common scenarios people lose data on EXT4 but not EXT3, the blame is on EXT4. Period.

    It's the same thing that a kernel does. You have to COPE with crappy hardware and user applications, because that's your job.

    • by caerwyn (38056) on Wednesday March 11, 2009 @06:02PM (#27158023)

      This is the attitude that has the web stuck with IE.

      There's a standard out there called POSIX. It's just like an HTML or CSS standard. If everyone pays attention to it, everything works better. If you fail to pay attention to it for your bit (writing files or writing web pages), it's not *my* fault if my conforming implementation (implementing the writing or the rendering) doesn't magically fix your bugs.

      • by microbee (682094) on Wednesday March 11, 2009 @06:50PM (#27158659)

        Apparently, you don't know real life.

        Does POSIX tell you what happens if your OS crashes? That's right, it says "undefined". Oops, sorry, it's too hard a problem and we'll just leave it to you OS implementers.

        Asking everyone to use fsync() to ensure their data is not lost is insane. Nobody wants to pay that kind of performance penalty unless the data is very critical.

        Normal applications have a reasonable expectation that the OS doesn't crash, or doesn't crash too often for this to be a big problem. However, shit happens, and people scream loud if their data is lost BEYOND reasonable expectations.

        Forget POSIX. It's irrelevant in the real world. It's exactly this pragmatic attitude that brought Linux to its current state.

        • by caerwyn (38056) on Wednesday March 11, 2009 @07:52PM (#27159577)

          Apparently, you don't know how to *deal* with real life.

          POSIX *does* tell you what happens if your OS crashes. It says "as an application developer, you cannot rely on things in this instance." It also provides mechanisms for successfully dealing with this scenario.

          As for fsync() being a performance issue, you can't have your cake and eat it too. If you don't want to pay a performance penalty, you can lose data. Ext4 simply only imparts that penalty to those applications that say they need it, and thereby gives a performance boost to others who are, due to their code, effectively saying "I don't particularly care about this data" - or more specifically, "I can accept a loss risk with this data."

          Normal applications have a reasonable expectation that the OS doesn't crash, yes. And usually it doesn't. Out of all the installs out there... how often is this happening? Not very. They've made a performance-reliability tradeoff, and as with any risk... sometimes it's the bad outcome that occurs. If they don't want that to happen, they need to take steps to reduce that risk- and the correct way to do that has always been available in the API.

          As for forgetting POSIX... it's the basis of all unix cross-platform code. It's what allows code to run on linux, BSD, Solaris, MacOS X, embedded platforms, etc, without (mostly) caring which one they're on. It's *highly* relevant to the real world because it's the API that most programs not written for windows are written to. Pull up a man page for system calls and you'll see the POSIX standard referenced- that's where they all came from.

          Saying "Forget POSIX. It's irrelevant in the real world." is like people saying a few years ago "Forget CSS standards. It's irrelevant in the real world." And you know what? That's the attitude that's dying out in the web as everything moves toward standards compliance. So it is in this case with the filesystem.

    • Re: (Score:3, Insightful)

      by somenickname (1270442)

      "The machine crashed" isn't a common situation. In fact, it's a very, very rare situation.

  • by rpp3po (641313) on Wednesday March 11, 2009 @05:28PM (#27157485)
    There are several excuses circulating: 1. This is not a bug, 2. It's the apps' fault, 3. all modern filesystems are at risk.
    This is all a bunch of BS! Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).
    ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!
    • by Anonymous Coward on Wednesday March 11, 2009 @06:11PM (#27158159)

      Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).

      You seem to misunderstand that's *exactly* what is happening.

      KDE is *DELETING* all of its config files, then writing them back out again in two operations.

      Three states now exist, the 'old old' state, where the original file existed, the 'old' state, where it is empty, and the 'new' state where it is full again.

      The problem is getting caught between step #2 and step #3, which on ext3 was mostly mitigated by the write delay being only 5 seconds.

      KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.

      • Re: (Score:3, Insightful)

        by rpp3po (641313)
        That's not true. KDE is not "*DELETING*" any of its files. It's just opening them with the O_TRUNC flag (expressing an intent to overwrite their contents). That's perfectly safe on a copy-on-write filesystem (such as ZFS) but not on ext4. So calling all "modern" filesystems at risk is pure ignorance. Ext4 could delay content deletion of open files until write time and write both within a single transaction.
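For readers following the O_TRUNC argument, here is a minimal sketch of the window being disputed. It assumes a POSIX-style system; the function name `size_after_truncating_open` is made up for illustration and is not taken from KDE or any library. It only shows that the file is already empty as soon as the truncating open returns, before the application has written anything back.

```python
import os


def size_after_truncating_open(path):
    """Open an existing file with O_TRUNC and report its size before any write."""
    # O_TRUNC discards the old contents at open time, not at write time.
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    try:
        # Nothing has been written yet, so this is the "empty" intermediate state.
        return os.fstat(fd).st_size
    finally:
        os.close(fd)
```

Whether a crash inside that window leaves the old contents, an empty file, or the new contents on disk depends on the filesystem, which is the point the posters above disagree about.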
      • by Tadu (141809) on Wednesday March 11, 2009 @07:22PM (#27159143)

        KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.

        Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. While that may be correct from a POSIX lawyer point of view, it is still heavily undesirable.

        • rename and fsync (Score:4, Insightful)

          by DragonHawk (21256) on Wednesday March 11, 2009 @11:34PM (#27161797) Homepage Journal

          "Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. "

          Two things are happening:
          (1) KDE is writing a new inode.
          (2) KDE is renaming the directory entry for the inode, replacing an existing inode in the process.

          KDE never calls fsync(2), so the data from step one is not committed to be on disk. Thus, KDE is atomically replacing the old file with an uncommitted file. If the system crashes before it gets around to writing the data, too bad.

          EXT4 isn't "broken" for doing this, as endless people have pointed out. The spec says if you don't call fsync(2) you're taking your chances. In this case, you gambled and lost.

          KDE isn't "broken" for doing this unless KDE promised never to leave the disk in an inconsistent state during a crash. That's a hard promise to keep, so I doubt KDE ever made it.

          A system crash means loss of data not committed to disk. A system crash frequently means loss of lots of other things, too. Unsaved application data in memory which never even made it to write(2). Process state. Service availability. Jobs. Money. System crashes are bad; this should not be news.

          The database suggestion some are making comes from the fact that if you want on-disk consistency *and* good performance, you have to do a lot of implementation work, and do things like batching your updates into calls to write(2) and fsync(2). Otherwise, performance will stink. This is a big part of what databases do.

          As someone else suggested, it's perfectly easy to make writes atomic in most filesystems. Mount with the "sync" option. Write performance will absolutely suck, but you get a never-loses-uncommitted-data filesystem.
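The sequence described in this comment (write a new file, fsync it, then rename it over the old one) might look roughly like the sketch below. It is an illustration under stated assumptions, not what KDE actually does: `atomic_write` is a hypothetical helper name, the temporary file is assumed to live in the same directory as the target so that the rename stays on one filesystem, and the final directory fsync assumes a POSIX system where a directory can be opened read-only.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes) via write, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target (hypothetical ".tmp-" prefix).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to commit the data to disk
        # Atomically swap the directory entry: readers see old or new, never empty.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Also fsync the directory so the rename itself is committed.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The fsync before the rename is the step the comment says KDE skips. Note that `mkstemp` creates the file with owner-only permissions, so a real implementation would probably want to copy the original file's mode as well.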

    • Re: (Score:3, Informative)

      by macshit (157376)

      ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!

      I read the FA, and it actually really does look like the applications are simply using stupidly risky practices:

      These applications are truncating the file before writing (i.e., opening with O_TRUNC), and then assuming that the truncation and any following write are atomic. That's obviously not true -- what happens if your system is very busy (not surprising in the startup flurry which is apparently where this stuff happens), the process doesn't get scheduled for a while after the truncate (but before the

  • by dltaylor (7510) on Wednesday March 11, 2009 @05:40PM (#27157693)

    When I write data to a file (either through a descriptor or FILE *), I expect it to be stored on media at the earliest practical moment. That doesn't mean "immediately", but 150 seconds is brain-damaged. Regardless of how many reads are pending, writes must be scheduled, at least in proportion to the overall file system activity, or you might as well run on a ramdisk.

    While reading/writing a flurry of small files at application startup is sloppy from a system performance point of view, data loss is not the application developers' fault, it's the file system designers'.

    BTW, I write drivers for a living, have tuned SMD drive format for performance, and written microkernels, so this is not a developer's rant.

  • by ivoras (455934) <(ivoras) (at) (fer.hr)> on Wednesday March 11, 2009 @06:07PM (#27158105) Homepage

    *No* modern, desktop-usable file systems today guarantee new files to be there if the power goes out unless the application specifically requests it with O_SYNC, fsync() and similar techniques (and then only "within reason" - most actually guarantee only that the file system will recover itself, not the data). It is universally true - for UFS (the Unix file system), ext2/3, JFS, XFS, ZFS, ReiserFS, NTFS, everything. This can only be a topic for inexperienced developers that don't know the assumptions behind the systems they use.

    The same is true for data ordering - only by separating the writes with fsync() can one piece of data be forced to be written before another.

    This is an issue of great sensitivity for databases. See for example:

    That there exist reasonably reliable databases is a testament that it *can* be done, with enough understanding and effort, but is not something that's automagically available.
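The ordering point in this comment can be sketched as follows. This is an illustration only, assuming a POSIX-style system: `write_all` and `write_in_order` are hypothetical helper names, and the example appends to two files so that the first is fsynced before the write to the second is even issued.

```python
import os


def write_all(fd, data):
    """Write every byte of `data` to `fd`, retrying on short writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def append_and_sync(path, data):
    """Append `data` to `path` and fsync before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        write_all(fd, data)
        os.fsync(fd)  # do not return until the kernel reports the data committed
    finally:
        os.close(fd)


def write_in_order(first_path, first_data, second_path, second_data):
    """Force `first_data` to be committed before `second_data` is written."""
    append_and_sync(first_path, first_data)    # e.g. a journal record
    append_and_sync(second_path, second_data)  # e.g. the data it describes
```

Without the fsync between the two writes, the kernel is free to flush them in either order. A newly created file may additionally need its directory fsynced before its name is guaranteed to survive a crash, which this sketch leaves out.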

