
Ext4 Data Losses Explained, Worked Around

ddfall writes "H-Online has a follow-up on the Ext4 file system — Last week's news about data loss with the Linux Ext4 file system is explained and new solutions have been provided by Ted Ts'o to allow Ext4 to behave more like Ext3."
  • by morgan_greywolf ( 835522 ) on Thursday March 19, 2009 @02:03PM (#27258933) Homepage Journal

    No, we don't salute them. If you ask me, no matter what Ted Ts'o says about it complying with the POSIX standard, sorry, but it's a bug if it causes known, popular applications to seriously break, IMHO.

    Broken is broken, whether we're talking about Ted Ts'o or Microsoft.

  • Re:LOL: Bug Report (Score:1, Interesting)

    by berend botje ( 1401731 ) on Thursday March 19, 2009 @02:15PM (#27259147)
    Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.

    Mr Ts'o is mistaken about this. When he introduces optimisation features that other filesystems (Reiser, for example) have already tried and undone because they don't work, he is not fit to write filing systems. First learn how others did it, then do it better.

    With Ext4 now proven unstable, the only viable new filesystem is ZFS. Or just stick with ext3 or UFS.
  • Bad POSIX (Score:5, Interesting)

    by Skapare ( 16644 ) on Thursday March 19, 2009 @03:01PM (#27259823) Homepage

    Ext4, on the other hand, has another mechanism: delayed block allocation. After a file has been closed, up to a minute may elapse before data blocks on the disk are actually allocated. Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. If the system crashes during this time, the rename() operation may already be committed in the journal, even though the new file still contains no data. The result is that after a crash the file is empty: both the old and the new data have been lost.

    Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.

    If that is true, then to the extent that is true, POSIX is "broken". Related changes to a file system really need to take place in an orderly way. Creating a file, writing its data, and renaming it are related. Letting the latter change persist while the former change is lost is just wrong. Does POSIX really require this behavior, or just allow it? If it requires it, then IMHO, POSIX is indeed broken. And if POSIX is broken, then companies like Microsoft are vindicated in their non-conformance.

  • Re:LOL: Bug Report (Score:4, Interesting)

    by causality ( 777677 ) on Thursday March 19, 2009 @03:04PM (#27259871)

    Disadvantages: You risk data loss with 95% of the apps you use on a daily basis. This will persist until the apps are rewritten to force data commits at appropriate times, but hopefully not frequently enough to eat up all the performance improvements and more.

    For those of us who are not so familiar with the data loss issues surrounding EXT4, can someone please explain this? The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?" I.e. if I ask OpenOffice to save a file, it should do that the exact same way whether I ask it to save that file to an ext2 partition, an ext3 partition, a reiserfs partition, etc. What would make ext4 an exception? Isn't abstraction of lower-level filesystem details a good thing?

  • Re:LOL: Bug Report (Score:5, Interesting)

    by swillden ( 191260 ) <shawn-ds@willden.org> on Thursday March 19, 2009 @03:24PM (#27260213) Homepage Journal

    The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?"

    They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.

    The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync(). The problem is with applications that assumed they could ignore what the specification said just because it always seemed to work okay on the file systems they tested with.
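The point about fsync() can be illustrated with a minimal Python sketch. The helper name `write_durably` and the file name in the demo are hypothetical and do not come from the thread; `os.fsync` is Python's wrapper around the POSIX fsync() call discussed above.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and ask the kernel to push it to stable storage.

    Hypothetical helper for illustration only.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush this file to the device


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    demo_path = os.path.join(demo_dir, "settings.conf")
    write_durably(demo_path, b"theme=dark\n")
    with open(demo_path, "rb") as f:
        print(f.read())
```

Without the `os.fsync` call, the write normally still completes, but the data may sit in the page cache for a while, which is the window the thread is arguing about.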

  • by ChienAndalu ( 1293930 ) on Thursday March 19, 2009 @03:33PM (#27260337)

    Ext4 *is* better [phoronix.com], and probably because it benefits from the wiggle room provided by the specifications. The question is whether you accept the tradeoff between performance and security. I choose performance, because my system doesn't crash that often.

  • by Anonymous Coward on Thursday March 19, 2009 @03:47PM (#27260537)

    behaves precisely as demanded by the POSIX standard

    Application developers reasonably expect

    Apples and oranges. POSIX != "what app developers reasonably expect".

    Of course you have a point insofar as that just pointing to POSIX and saying it's a correct implementation of the spec is not enough, but let's be clear here that one of these things is not like the other.

  • Re:LOL: Bug Report (Score:5, Interesting)

    by MikeBabcock ( 65886 ) <mtb-slashdot@mikebabcock.ca> on Thursday March 19, 2009 @04:15PM (#27260927) Homepage Journal

    The POSIX standard is just fine. The problem is application assumptions that aren't up to snuff.

    Read the qmail source code sometime. Every time the author wants to assure himself that data has been written to the disk, he calls fsync.

    If you don't, you risk losing data. Plain and simple.

  • Re:LOL: Bug Report (Score:5, Interesting)

    by causality ( 777677 ) on Thursday March 19, 2009 @04:16PM (#27260941)

    So, in principle, the filesystem could just throw away the data unless the application explicitly calls fsync?
    This seems to be a slight bit of... hmmm... stupid?

    From the explanations I received and some reading I've done, I don't think the data is just getting "thrown away" so that isn't really a valid question. The issue seems to be that unless fsync is called, the changes requested by the application may happen in a sequence that is other than what the application programmer expected. The example I saw in this discussion involved first writing data to a file and then renaming it soon afterwards. If I understand this correctly, the application is assuming that the rename cannot possibly happen before the writing of the data is done even though the specification has no such requirement. If the application needs this to happen in the order in which it was requested, it needs to write the data, then call fsync, then rename the file. You could probably fill a library with what I don't know about low-level filesystem details, so please correct me if I have misunderstood this.
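The write, fsync, rename ordering described in this paragraph could be sketched as follows. This is an illustration under assumptions, not code from any application mentioned in the thread: the helper name `atomic_save` and the `.tmp-` prefix are made up, and `os.replace` is used as Python's portable spelling of rename() over an existing file.

```python
import os
import tempfile


def atomic_save(path: str, data: bytes) -> None:
    """Replace path with data so that a crash leaves the old or the new version.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory as the target because
    # rename() is only atomic within a single file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # data is flushed before the rename happens
        os.replace(tmp_path, path)  # rename() over the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Note that `tempfile.mkstemp` creates the file with owner-only permissions, so a real application would probably also want to copy the mode of the file it replaces.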

    The example I found in the Wikipedia entry on ext4 [wikipedia.org] was different. That one involved data loss because the application updates/overwrites an existing file and does not call fsync and then the system crashes. The Wiki article states that this leads to undefined behavior (which, afaik, is correct per the spec). The article also states that a typical result is that the file was set to zero-length in preparation for being overwritten but because of the crash, the new data was never written so it remains zero-length, causing the loss of the old version of the file. Under ext3 you would usually find either the old version of the file or the new version.
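The overwrite-in-place pattern from the Wikipedia example can be made visible with a small sketch. The function name `sizes_during_overwrite` is hypothetical. It shows only that O_TRUNC empties the file at open() time, before any new data exists; it does not simulate a crash.

```python
import os
import tempfile


def sizes_during_overwrite(path: str, new_data: bytes) -> tuple:
    """Overwrite path in place and report its size at two moments.

    Returns (size right after open with O_TRUNC, size after the write).
    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        size_after_open = os.fstat(fd).st_size   # old contents are gone here
        os.write(fd, new_data)
        size_after_write = os.fstat(fd).st_size
    finally:
        os.close(fd)
    return size_after_open, size_after_write


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "document.txt")
    with open(demo_path, "wb") as f:
        f.write(b"old contents\n")
    print(sizes_during_overwrite(demo_path, b"new contents\n"))
```

Between the two measurements the old data has already been discarded and the new data has not yet been flushed, so a crash in that window can leave the zero-length file the comment describes.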

    What I don't understand and hope that a more knowledgeable person could explain is why this can't be done a slightly different way. This is where I can apply reason to come up with something that sounds preferable to me but I simply don't have the background knowledge of filesystems to understand the "why". If the overwrite of the file is delayed, why isn't the truncation of the file to zero-length also delayed? That is, instead of doing it this way:

    Step 1: Truncate file length to zero in preparation of overwriting it.
    Step 2: Delay the writing of the new data for performance reasons.
    Step 3: After the delay has elapsed, actually write the data to the disk.

    Why can't it be done this way instead?

    Step 1: Delay the truncation of the file length to zero in preparation of overwriting it.
    Step 2: Delay the writing of the new data.
    Step 3: After the delay has elapsed, set the file length to zero and immediately write the new data, as a single operation if that is possible, or as one operation immediately followed by the other.

    That way if there is a crash, you'd still get either the old version or the new one and not a zero-length file where data used to be. The only disadvantage I can see is that this might continue to enable developers to make assumptions that are not found in the standard because the buggy behavior ext4 is now exposing may continue to work. If there's no technical reason why it cannot be done that way, perhaps the bad precedent alone is a good reason to either not handle it this way or to change the spec.

  • Bollocks (Score:3, Interesting)

    by Colin Smith ( 2679 ) on Thursday March 19, 2009 @05:30PM (#27261939)

    A filesystem is not a Database Management System. Its purpose is to store files. If you want transactions, use a DBMS. There are plenty out there which use fsync correctly. Try SQLite.
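As a rough sketch of the SQLite suggestion, the following uses Python's standard `sqlite3` module to store settings transactionally. The table name `settings` and the helper names `save_settings` and `load_settings` are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile


def save_settings(db_path: str, settings: dict) -> None:
    """Store all key/value pairs in one transaction (all or nothing)."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "CREATE TABLE IF NOT EXISTS settings "
                "(key TEXT PRIMARY KEY, value TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
                settings.items(),
            )
    finally:
        conn.close()


def load_settings(db_path: str) -> dict:
    """Read every stored key/value pair back as a dict."""
    conn = sqlite3.connect(db_path)
    try:
        return dict(conn.execute("SELECT key, value FROM settings"))
    finally:
        conn.close()


if __name__ == "__main__":
    demo_db = os.path.join(tempfile.mkdtemp(), "settings.db")
    save_settings(demo_db, {"theme": "dark", "font": "mono"})
    print(load_settings(demo_db))
```

If any row in a batch fails, the `with conn:` block rolls the whole batch back, so readers see either the previous settings or the complete new set.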


  • by tkinnun0 ( 756022 ) on Thursday March 19, 2009 @05:34PM (#27261975)
    If the filesystem is a few percent faster but then your disk sits idle half of the time and then you have a crash and lose a file that takes two hours to recreate, have you actually gained any performance?
  • Re:LOL: Bug Report (Score:5, Interesting)

    by Cassini2 ( 956052 ) on Thursday March 19, 2009 @06:22PM (#27262517)

    PS2 Those computer users saying an fsync will kill performance need to get cluebat applied to them by the nearest programmer.
    1st. There will be no fsyncs of config files at startup once the KDE startup is fixed.

    KDE isn't fixed right now. Additionally, KDE is not the only application that generates lots of write activity. I work with real-time systems, and write performance on data collection systems is important.

    2nd. fsyncs on modern filesystems are pretty fast, ext3 is the rare exception to that norm; this will not be noticeable when you apply a settings change.

    I did some benchmarks on the ext3 file system, the ext4 system without the patch, and the ext4 system with the patch. Code that followed the open(), write(), close() sequence was 76% faster than the code with fsync(). Code that followed the open(), write(), close(), rename() sequence was 28% faster than code that followed the open(), write(), fsync(), close(), rename() sequence. Additionally, the benchmarks were not significantly affected by which file system was used (ext3, ext4, or ext4 patched). You can look up the spreadsheet and the discussion at the launchpad discussion. [launchpad.net]
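A comparison of this kind could be set up roughly as below. This is not the benchmark from the launchpad discussion; the function name `time_saves`, the file naming scheme, and the payload size are all assumptions, and results will depend heavily on the hardware and file system.

```python
import os
import tempfile
import time


def time_saves(directory: str, count: int, payload: bytes, use_fsync: bool) -> float:
    """Time `count` open/write/close cycles, optionally with fsync before close.

    Hypothetical helper for illustration only. Returns elapsed seconds.
    """
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"sample-{i}.dat")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, payload)  # small payloads are written in one call
            if use_fsync:
                os.fsync(fd)
        finally:
            os.close(fd)
    return time.perf_counter() - start


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    data = b"x" * 4096
    plain = time_saves(demo_dir, 200, data, use_fsync=False)
    synced = time_saves(demo_dir, 200, data, use_fsync=True)
    print(f"without fsync: {plain:.4f}s, with fsync: {synced:.4f}s")
```

A temporary directory on tmpfs would hide the cost of fsync entirely, so a meaningful run needs `demo_dir` to point at the file system under test.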

    3rd. These types of programming errors are not the norm; I've graded first and second year computer science classes and each of the three major mistakes made would have lost you 20-30% of your score for the assignment.

    Major Linux file backup utilities, like tar, gzip, and rsync, don't use fsync as part of normal operations. Of the three, only tar uses fsync, and only when verifying that data is physically written to disk. In that situation, it writes the data, calls fsync, calls ioctl(FDFLUSH), and then reads the data back. Strictly speaking, that is the only way to make sure the file is written to disk, and is readable.

    Finally, as Theodore Ts'o has pointed out, if you really want to make sure the file is saved to disk, you also have to fsync() the directory. I have never seen anyone do that as part of a normal file save. Most C programming textbooks simply have fopen, fwrite, fclose as the recommended way to save files. Calling fsync this often is unusual for most C programmers.
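Syncing the directory as well as the file might look like the sketch below. It assumes a Linux or similar POSIX system where a directory can be opened read-only and passed to fsync(); the helper names `fsync_directory` and `save_file_and_directory_entry` are hypothetical.

```python
import os
import tempfile


def fsync_directory(dir_path: str) -> None:
    """Flush a directory's entries (such as a newly created name) to disk."""
    fd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def save_file_and_directory_entry(path: str, data: bytes) -> None:
    """Write and fsync the file, then fsync the directory that contains it."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # file contents
    fsync_directory(os.path.dirname(os.path.abspath(path)))  # directory entry


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "report.txt")
    save_file_and_directory_entry(demo_path, b"saved\n")
    print(os.path.exists(demo_path))
```

The first fsync covers the file's data; the second covers the directory entry that makes the file reachable by name after a crash.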

    I would hate to be in your programming class. You're enforcing programming standards that aren't followed by key Linux utilities, aren't in most textbooks, and aren't portable to non-Linux file systems.

    If you require your students to fsync() the file and the directory, as part of a normal assignment, you are requiring them to do things that aren't done by any Linux utility out there. Further, if you are that paranoid, you better follow the example from the tar utility, and after the fsync completes, read all the data back to verify it was successfully written.

  • Re:LOL: Bug Report (Score:4, Interesting)

    by spitzak ( 4019 ) on Thursday March 19, 2009 @07:49PM (#27263361) Homepage

    Yes I would like that as well. It would remove the annoying need to figure out a temp filename and to do the rename.

    One suggestion was to add a new flag to open. I think it might also work to change O_CREAT|O_TRUNC|O_WRONLY to work this way, as I believe this behavior is exactly what any program using that is assuming.

    f = creat(filename) would result in an open file that is completely hidden to any process. Anybody else attempting to open filename will either get the old file or no file. This should be easy to implement as the result should be similar to unlinking an already-opened file.

    close(f) would then atomically rename the hidden file to filename. Anything that already has filename open would keep seeing the old file, anything that opens it afterwards will see the new file.

    If the program crashes without closing the file then the hidden file goes away with no side effects. It might also be useful to have a call that does this, so a program could abandon a write. Not sure what call to use for that.

    Calling fsync(f) would act like close() and force the rename, so after fsync it is exactly like current creat().
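The semantics proposed in this comment can be approximated in user space, which may make them easier to picture. The class below is only an emulation under assumptions: the name `HiddenUntilClose`, the `abandon` method, and the `.hidden-` prefix are invented here, the temporary file is visible under its temporary name rather than truly hidden, and a process crash would leave that file behind instead of making it vanish. The fsync before the rename is also a choice made in this sketch, not part of the proposal.

```python
import os
import tempfile


class HiddenUntilClose:
    """Write to a temporary file; publish it under the real name on close().

    Hypothetical class for illustration only.
    """

    def __init__(self, filename: str) -> None:
        self._target = os.path.abspath(filename)
        fd, self._tmp_path = tempfile.mkstemp(
            dir=os.path.dirname(self._target), prefix=".hidden-"
        )
        self._file = os.fdopen(fd, "wb")
        self._done = False

    def write(self, data: bytes) -> int:
        return self._file.write(data)

    def close(self) -> None:
        """Flush the data, then atomically rename over the target name."""
        if self._done:
            return
        self._file.flush()
        os.fsync(self._file.fileno())
        self._file.close()
        os.replace(self._tmp_path, self._target)
        self._done = True

    def abandon(self) -> None:
        """Discard the new data; the target keeps its old contents."""
        if self._done:
            return
        self._file.close()
        os.unlink(self._tmp_path)
        self._done = True
```

Readers that open the target before `close()` keep seeing the old file, and readers that open it afterwards see the new one, which matches the behaviour the comment asks the kernel to provide directly.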
