Ext4 Data Losses Explained, Worked Around

ddfall writes "H-Online has a follow-up on the Ext4 file system: last week's news about data loss with the Linux Ext4 file system is explained, and Ted Ts'o has provided new options that allow Ext4 to behave more like Ext3."
  • by canadiangoose ( 606308 ) <(moc.liamg) (ta) (mahargjd)> on Thursday March 19, 2009 @02:32PM (#27259381)
    If you mount your ext4 partitions with nodelalloc you should be fine. You will of course no longer benefit from the performance enhancements that delayed allocation brings, but at least you'll have all of your freaking data. I'm running Debian on Linux 2.6.29-rc8-git4, and so far my limited testing has shown this to be very effective.
  • by iYk6 ( 1425255 ) on Thursday March 19, 2009 @02:34PM (#27259411)

    Someone above says that the POSIX standard is fine, but that ext4 violates it. Here is his quote:
    "When applications want to overwrite an existing file with new or changed data [...] they first create a temporary file for the new data and then rename it with the system call - rename("

    It seems that ext4 renames the file first, and then writes the file up to 60 seconds later.
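
    For reference, the write-and-rename idiom in question looks roughly like this in C (a minimal sketch; file names and contents are illustrative and error handling is omitted):

    #include <stdio.h>

    int main(void)
    {
        /* Write the new version to a temporary file first. */
        FILE *f = fopen("config.tmp", "w");
        if (!f)
            return 1;
        fputs("new settings\n", f);
        fclose(f);
        /* rename() atomically replaces the old file with the new one.
           The complaint is that ext4 may commit this rename before the
           data above, leaving a zero-length file after a crash. */
        return rename("config.tmp", "config") ? 1 : 0;
    }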

  • by ManWithIceCream ( 1503883 ) on Thursday March 19, 2009 @02:47PM (#27259607)

    We let our own off with heinous mistakes, while professionals who do the same thing we hang simply because they dared to ask to be paid for their effort. Lame.

    Is Ted Ts'o not a professional? Does he not get paid? Ts'o is employed by the Linux Foundation, on leave from IBM. Free Software does not mean volunteer-made software!

  • by Kjella ( 173770 ) on Thursday March 19, 2009 @02:48PM (#27259631) Homepage

    Fixed code:
    fwrite()
    fsync() - sync this file before close
    fclose()
    rename()

    Either you're a troll or an idiot; since you're AC'ing, I guess I got trolled. This will sync immediately and kill performance and battery life, since every block must be confirmed written before the process can continue. What you need to fix this is a delayed rename that happens after the delayed write.

    Problem:
    fwrite()
    fclose()
    rename()
    *ACTUAL RENAME*
    *TIME PASSES* <-- crash happens here = lose old file
    *ACTUAL WRITE*

    Real solution:
    fwrite()
    fclose()
    rename()
    *TIME PASSES* <-- crash happens here = keep old file
    *ACTUAL WRITE*
    *ACTUAL RENAME*
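
    In C, the fsync()-based "fixed code" above corresponds to something like this (a minimal sketch with illustrative names; it is the safe-but-slow variant, since fsync() blocks until the data reaches the disk):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *f = fopen("config.tmp", "w");
        if (!f)
            return 1;
        fputs("new settings\n", f);
        fflush(f);        /* push stdio buffers into the kernel */
        fsync(fileno(f)); /* block until the data is on stable storage */
        fclose(f);
        /* Now either the old or the new file survives a crash. */
        return rename("config.tmp", "config") ? 1 : 0;
    }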

  • by ChienAndalu ( 1293930 ) on Thursday March 19, 2009 @02:49PM (#27259637)

    As explained in the article - he hasn't made a mistake. The behaviour of ext4 is perfectly compatible with the POSIX standard.

    man fsync

  • Re:LOL: Bug Report (Score:2, Informative)

    by larry bagina ( 561269 ) on Thursday March 19, 2009 @03:08PM (#27259933) Journal
    My one experience with XFS involved the partition being corrupted beyond recoverability within 15 minutes. Too bad; in theory XFS is great.

    Anyhow, ZFS is RAID, LVM, and a filesystem rolled into one, so keeping the patch up to date with Linux changes could be a bit of work.

  • by dshadowwolf ( 1132457 ) <dshadowwolf&gmail,com> on Thursday March 19, 2009 @03:11PM (#27259975)

    And you don't get it... The truth is that Ext4 was writing the journal out before any changes took place. This means that when a crash happens between the metadata write and the actual data write, a replay of the journal will cause data loss.

    Other filesystems with delayed allocation solve this by not writing the journal before the actual data commits happen. The fix that TFA is talking about introduces this to Ext4.

  • by david_thornley ( 598059 ) on Thursday March 19, 2009 @03:15PM (#27260073)

    In which case the standard sucks, big time, and finding a loophole that trashes normal expected behavior should not be cause for rejoicing.

    There needs to be a way to write a file such that either the old or the new is preserved. Agreed on this?

    Now, in a file system that's going to run real well, there needs to be a way to delay writes in order to batch them. Agreed on this?

    We have two reasonable demands here. Pick one, because that's all you're going to get.

    Currently, in order to keep either the old or new file, it's necessary to write the new file right now. This is the standard behavior, and it trashes performance. Alternatively, the writes can be batched up for later, for good performance, and we run the risk of losing both old and new versions of a file.

    In other words, in order to get safe behavior out of the heavily optimized file system, it's necessary to trash the performance.

    What we need is a way to do the rewrite-rename thing so that it can be safely delayed, letting the file system batch up a lot of writes in a really fancy optimized way while still writing the new file fully before renaming it. There's no obvious reason to me why the file system can't keep track of this and guarantee the order. It may not be required by the standard, but that's no excuse for not implementing it.

  • by BigBuckHunter ( 722855 ) on Thursday March 19, 2009 @03:21PM (#27260159)

    There needs to be a solution that supports write-replace without spinning up the disk drive.

    How do you intend on writing to the disk drive... without spinning it up? Is this not what you're asking? If this is indeed your question, the answer is already "by using a battery backed cache".

    BBH

  • Re:LOL: Bug Report (Score:3, Informative)

    by PitaBred ( 632671 ) <slashdot&pitabred,dyndns,org> on Thursday March 19, 2009 @03:25PM (#27260233) Homepage
    Basically, the spec was written one way, but the actual behavior was slightly different. Even though the standard didn't guarantee that data would be written promptly, most filesystems did it anyway. When EXT4 stopped writing things immediately to improve performance, the applications that depended on filesystems writing data ASAP (even though it wasn't required behavior) started risking data loss if the system crashed before the data was explicitly written.

    The mechanism (fsync) has been around for ages; it's just that most apps didn't use it when they should have, because there wasn't a "need" to until EXT4. Other systems like XFS behave the same way, but they are less popular and tend to be run by people who know what behavior to expect.
  • Re:LOL: Bug Report (Score:1, Informative)

    by Anonymous Coward on Thursday March 19, 2009 @03:34PM (#27260345)

    My reading is that applications have been relying on an undocumented feature of the old filesystems instead of syncing in an fs-independent way. Ext4 removed this "feature" and exposed the already-existing dependence of these applications. Thus, to be fs-independent, applications should call fsync to force data to be physically written to disk.

    The problem is they weren't. Instead they are relying on an (undocumented) feature of ext2/3 to do the fsync for them.

  • Re:LOL: Bug Report (Score:1, Informative)

    by Anonymous Coward on Thursday March 19, 2009 @03:35PM (#27260361)

    Even though the standard didn't guarantee something to be written, most filesystems did it anyway

    No they didn't - ext3 was quite atypical. Even on Windoze, NTFS requires a fsync. (Mind you, vista introduces a transactional API like reiser was on about for linux before he turned all murdery...)

  • by diegocgteleline.es ( 653730 ) on Thursday March 19, 2009 @03:46PM (#27260507)

    "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.

    It turns out that all modern operating systems work exactly like that. In ALL of them you need to use explicit synchronization (fsync and friends) to get a notification that your data has really been written to disk (and that's all you get, a notification, because the system could oops before fsync finishes). You can also mount your filesystem as "sync", which sucks.

    Journaling and COW/transaction-based filesystems like ZFS only guarantee integrity, not that your data is safe. It turns out that Ext3 has the same problem; it's just that the window is smaller (5 seconds). And I wouldn't bet that HFS and ZFS don't have the same problem (btrfs is COW and transaction-based, like ZFS, and has the same problem).

    Welcome to the real world...

  • by Anonymous Coward on Thursday March 19, 2009 @03:47PM (#27260531)

    This would never happen with Microsoft; they're all for crippling their OS just so it can stay backwards compatible with broken applications.

  • Re:LOL: Bug Report (Score:1, Informative)

    by Anonymous Coward on Thursday March 19, 2009 @03:52PM (#27260599)

    When was it that you tried XFS? It did have problems originally, but it got very stable several years ago.

    ZFS is just a filesystem with lots of features. Hell, it runs in userspace via FUSE. There is nothing magical or difficult about it.

  • Re:No kidding (Score:3, Informative)

    by SanityInAnarchy ( 655584 ) <ninja@slaphack.com> on Thursday March 19, 2009 @04:00PM (#27260705) Journal

    The issue that FS authors, well any authors of any system programs/tools/etc need to understand is that your tool being usable is the #1 important thing.

    Part of usability is performance. This is a significant performance improvement.

    So, if you do something that really screws that over, well then you probably did it wrong. Doesn't matter if you fully documented it, doesn't matter if it technically "follows the spec"; what matters is that it isn't usable.

    The real problem here is that application developers were relying on a hack that happened to work on ext3, but not everywhere else.

    Let me ask you this -- should XFS change the way it behaves because of this? EXT4 is doing exactly what XFS has done for over a decade.

    I mean I could write a spec for a file system that says "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss.

    No, that's actually precisely what the spec says, with one exception: You can guarantee it to be written to disk by calling fsync.

    I'd give these guys more credit if I was aware of any other major OS/FS combo that did shit like this, but I'm not.

    Only because you haven't looked.

    In fact, there's a mount option to turn this behavior on in ext3.

    The "bad design" goes deeper than that.

  • by Cassini2 ( 956052 ) on Thursday March 19, 2009 @04:17PM (#27260951)

    Calling fsync() excessively completely trashes system performance and usability. Essentially, operating systems have write-back caches to speed code execution. fsync() defeats the write-back cache by writing data out immediately and making your program wait while the flush happens. Modern computers can do activities that involve rapidly touching hundreds of files per second. Forcing each write to use an fsync() slows things down dramatically, and makes for a poor user experience.

    To make matters worse, from a technical point of view, strict POSIX-safe operation requires calling fsync() on the file and then fsync() on the containing directory. I have never seen a piece of normal application code that fsyncs the containing directory. Even common Linux utilities like rsync and gzip don't use fsync anymore. tar uses fsync in one special case: for file verification before calling ioctl(FDFLUSH). The comment in the tar source is instructive:

    /* Verifying an archive is meant to check if the physical media got it
       correctly, so try to defeat clever in-memory buffering pertaining to
       this particular media. On Linux, for example, the floppy drive would
       not even be accessed for the whole verification.

       The code was using fsync only when the ioctl is unavailable, but
       Marty Leisner says that the ioctl does not work when not preceded by
       fsync. So, until we know better, or maybe to please Marty, let's do
       it the unbelievable way :-). */

    #if HAVE_FSYNC
      fsync (archive);
    #endif
    #ifdef FDFLUSH
      ioctl (archive, FDFLUSH);
    #endif

    In general, application writers are interested in making sure the file is readable. Unless you are really determined, and willing to go through the file verification like in the tar command, fsync() does little to guarantee a file will be readable at a later date. Under modern file systems, there are so many reasons why a file may become unreadable, and so few of them are fixed with fsync(), that one has to ask: Why bother with fsync()?

    In fact, there are so few good reasons to use fsync(), that many applications have completely given up on fsync(). fsync() is disabled on Apple Macs running OSX. If you run NFS, fsync() will probably flush your data to the network, but not to the hard disk. If you are running a PC with a modern hard drive, the hard drive probably has a write back cache. As such, fsync() doesn't guarantee your data is physically on the disk. fsync() is disabled in laptop mode.

    For most applications, using fsync() will only slow down your C code. It is useful for certain applications, like databases. Many other programming languages have no equivalent to fsync(). For most programs, fsync() is an infrequently used call, and is primarily used in special purpose libraries like databases.
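
    For reference, the strictly-correct sequence mentioned above, fsync() on the file followed by fsync() on its containing directory, would look roughly like this (a sketch; the paths are illustrative and error handling is omitted):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("somedir/config", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;
        write(fd, "new settings\n", 13);
        fsync(fd);   /* flush the file's data and inode */
        close(fd);

        /* Strict durability also requires flushing the directory entry. */
        int dirfd = open("somedir", O_RDONLY | O_DIRECTORY);
        if (dirfd < 0)
            return 1;
        fsync(dirfd);
        close(dirfd);
        return 0;
    }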

  • by Wodin ( 33658 ) on Thursday March 19, 2009 @04:18PM (#27260969)

    If power is lost at the right time, the same results would happen.

    The right time being the hundredths of a second between the commit of the file data and the commit of the directory data, not 60 seconds.

    No, not "hundredths of a second". Five seconds. Or 30 if you're using laptop mode.
    https://bugs.launchpad.net/ubuntu/jaunty/+source/ecryptfs-utils/+bug/317781/comments/54 [launchpad.net]

  • Re:LOL: Bug Report (Score:3, Informative)

    by MikeBabcock ( 65886 ) <mtb-slashdot@mikebabcock.ca> on Thursday March 19, 2009 @04:20PM (#27260993) Homepage Journal

    You don't risk any data loss, ever, if you shut down your system properly. The system will sync the data to disk as expected and everything will be peachy. You risk data loss if you lose power or otherwise shut down at an inopportune time and the data hasn't been sync'd to disk yet.

    That is to say, 99% of people who use their computers properly won't have a problem.

    Also note, the software you use should be doing something like:

    loop: write some data, write some more data, finish writing data, fsync the data.

    The problem here is that the program is doing the "writing" part and because of how caching and delayed writes work (without which, your computer would crawl), the data isn't written to disk _yet_ but will be, eventually.

    Old software assumed the data would be written soon. With Ext4 it's possible it won't be written until much, much later, for performance and power benefits.

    PS you can just open a terminal window and type "sync" at any time to flush the data to disk on your system. I'm sure someone could write a tray icon that does the same in 30 seconds.

  • Re:Dunno (Score:3, Informative)

    by MikeBabcock ( 65886 ) <mtb-slashdot@mikebabcock.ca> on Thursday March 19, 2009 @04:31PM (#27261131) Homepage Journal

    Without write-back (that is, delaying writes and keeping them in a cache), you lose elevator sorting. No elevator sorting makes heavy drive usage ridiculously slower than with it.

    You can't re-sort and organize your disk activity without the ability to delay the data in a pool.

    The difference between EXT3 and EXT4 is not whether the data gets written immediately -- neither do that. The difference is how long they wait. EXT4 finally gives major power preservation by delaying writes until necessary so my laptop hard drive doesn't spin up for brief moments of unnecessary disk activity all the time.

    You want your data written synchronously? Just mount your filesystem with 'sync' and it's all done for you. No problem, no bug.

    "mount -o remount,sync /dev/sda1 /" all done.

  • Re:No kidding (Score:5, Informative)

    by Tacvek ( 948259 ) on Thursday March 19, 2009 @04:40PM (#27261275) Journal

    I don't think you have it right.

    On Ext3 with "data=ordered" (a default mount option), if one writes a file and then renames it, ext3 will not commit the rename until after the file's data has been written to disk.

    Therefore if an application that wants to change a file uses the common pattern of writing to a temporary file and then renaming (the renaming is atomic on journaling file systems), if the system crashes at any point, when it reboots the file is guaranteed to be either the old version or the new version.

    With Ext4, if you write a file and then rename it, the rename can happen before the write. Thus if the computer crashes between the rename and the write, on reboot the result will be a zero byte file.

    The fact that the new version of the file may be lost is not the issue. The issue is that both versions of the file may be lost.

    The end result is the write and rename method of ensuring atomic updates to files does not work under Ext4.

    A new mount option that forces the rename to come after the data is written to disk is being added. Once that is available, the problem will be gone if you use that mount option. Hopefully it will be made a default mount option.

  • Re:LOL: Bug Report (Score:5, Informative)

    by zenyu ( 248067 ) on Thursday March 19, 2009 @04:47PM (#27261367)

    They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.

    Yup, and the problem has existed with KDE startup for years. I remember the startup files getting trashed when Mandrake first came out and I tried KDE for long enough to get hooked, and it's happened to me a few times a year ever since with every filesystem I've used. I just make my own backups of the .kde directory and fix this manually when it happens; I'm pretty good at this restore by now. Hopefully this bug in KDE will get fixed now that it is causing the KDE project such great embarrassment. I had a silent wish that Ts'o would increase the default commit interval to 10 minutes when the first defenders of the KDE bug started squawking, but he was too gracious for that.

    PS I use a lot of experimental graphics drivers for work, hence lockups during startup are common enough that I probably see this KDE bug more than most KDE users. But they really violate every rule of using config files: 1st. open with minimum permission needed, in this case read only, unless a write is absolutely necessary. 2nd. only update a file when it needs updating. 3rd. when updating a config file make a copy, commit it to disk, and then replace the original, making sure file permissions and ownership are unchanged, then commit the rename if necessary.

    PS2 Those computer users saying an fsync will kill performance need to have a cluebat applied to them by the nearest programmer. 1st. There will be no fsyncs of config files at startup once the KDE startup is fixed. 2nd. fsyncs on modern filesystems are pretty fast; ext3 is the rare exception to that norm, and this will be unnoticeable when you apply a settings change. 3rd. These types of programming errors are not the norm; I've graded first and second year computer science classes, and each of the three major mistakes made here would have lost you 20-30% of your score for the assignment.

  • Re:LOL: Bug Report (Score:4, Informative)

    by DragonWriter ( 970822 ) on Thursday March 19, 2009 @04:47PM (#27261373)

    Is writing a new file, and then renaming it over an existing file really a 'typical workload'???

    It's a fairly typical way of trying to achieve something loosely approximating transactional behavior with respect to updates to the file in question, without relying on transactional file system semantics.

  • by Tacvek ( 948259 ) on Thursday March 19, 2009 @04:52PM (#27261439) Journal

    The Ext3 5 seconds thing is true, but that is not the important difference.

    On Ext3, with the default mount options, if one writes a file to disk and then renames it, the write is guaranteed to come before the rename. This can be used to ensure atomic updates to files, by writing a temporary copy of the file with the desired changes and then renaming it over the original.

    On Ext4, if one writes a file to the disk, and then renames the file, the rename can happen first. The result of this is that it is not possible to ensure atomic updates to files unless one uses fsync between the writing and the renaming. However, that would hurt performance, since fsync will force the file to be committed to disk right now, when all that is really important is that it is committed to disk before the rename is.

    Thankfully the Ext4 module will be gaining a new mount option that will ensure that a file is written to disk before the renaming occurs. This mount option should have no real impact on performance, but will ensure the atomic update idiom that works on Ext3 will also work on Ext4.

  • by nusuth ( 520833 ) <oooo_0000us&yahoo,com> on Thursday March 19, 2009 @04:53PM (#27261443) Homepage
    Application developers reasonably expect that writes to the disk which happen far apart in time will happen in order. If I write to a file and then rename the file, I expect that the rename will not complete significantly before the write. Certainly not 60 seconds before the write.

    That sounds like a reasonable assumption, but it is certainly not reasonable to write code that depends on it. 60 seconds is an eternity for a computer, but so is a second. Therefore the fact that 60 seconds is much longer than what you would expect has no bearing on the situation. If your applications depend on frequent data writes, they will have exactly the same file-zeroing problem regardless of the actual amount of delay. You can't know that a crash will happen at least, say, 0.06 seconds after a write and rename, so you will still be losing files on crashes, only 1000 times less frequently with a 0.06 sec delay instead of 60. Considering how many times the problematic idiom may be used in 0.06 seconds, and how many computers are running Linux, that is still an unacceptable way to write programs.

    It seems dead obvious, at least to me, that the update of the directory entry should be deferred until after ext4 flushes that part of the file written prior to the change in the directory entry.

    Ensuring rename happens after write is fundamentally different from not ensuring it but writing data frequently enough that it often happens that way. This is also exactly what has been done with ext3's ordered mode and what is being proposed for fixing ext4.

  • Kirk McKusick spent a lot of time working out the right order to write metadata and file data in FFS and the resulting file system, FFS with Soft Updates, gets high performance and high reliability... even after a crash.

  • Re:No kidding (Score:1, Informative)

    by blazerw ( 47739 ) on Thursday March 19, 2009 @05:53PM (#27262207)

    This is an excellent description of the issue. However, if the application writers, in this instance KDE, had synced after messing with extremely important files, then the issue wouldn't have occurred.

    The real issue is this: should the filesystem itself have to figure out whether it's dealing with important files or not, or should the application tell the filesystem the files are important by forcing the updates to be written? Since the former is impossible, the filesystem would have to treat ALL files as important and thus never be able to do the cool things Ext4 can do that decrease wear on SSDs, save battery power, save disk space and speed things up.

  • Re:LOL: Bug Report (Score:4, Informative)

    by grumbel ( 592662 ) <grumbel+slashdot@gmail.com> on Thursday March 19, 2009 @06:08PM (#27262365) Homepage

    3. If you have important data that if not written to the hard drive will cause catastrophic failure, then you use the part of the API that forces that write.

    You completely missed the point. The new data isn't important; it could be lost and nobody would care. The troublesome part is that you lose the old data too. If you lost the last 5 minutes of changes in your KDE config, that would be a non-issue. What actually happens is that you lose not just the last few changes but your complete config: it ends up as 0-byte files, which is a state the filesystem never had.

  • by stevied ( 169 ) * on Thursday March 19, 2009 @06:10PM (#27262383)

    The "workaround" is understanding how the platform you're targeting actually works rather than making guesses. fsync() and even fdatasync() have been around for ages and are documented. *NIX directories have always just been more or less lists of (name,inode_no) tuples, which is why hard links are part of the platform. There isn't really any magical connection between an inode and the directories it happens to be listed in.

    Ted knows this stuff inside and out and is almost ridiculously reasonable compared to many people I've met with his level of expertise. The patches to enable the actual workaround were available pretty much at the same time the awareness of this bug hit the mainstream. Given the flak he was taking, the fact that he expressed his opinions about the way some of the userspace software may or may not have been behaving doesn't seem unreasonable.

    The answer here is (1) roll out the workaround so nobody is horribly surprised when the latest distros ship with ext4, and (2) for developers to _listen_ to the guy who knows what he's talking about and fix their apps, ideally by providing some standard functions in the GNOME / KDE / etc. libs to handle the common situation, thus allowing the full performance advantages to be extracted from all the hard work that's been put into ext4 (and other file systems.)

    There are a relatively small number of people in the world who are worth listening to when they say something. Take a lesson from a guy with a 3 digit UID (sorry to pull rank, but sometimes it has to be done!), and let me tell you that Ted Ts'o is one of them.

  • Re:LOL: Bug Report (Score:3, Informative)

    by spitzak ( 4019 ) on Thursday March 19, 2009 @06:23PM (#27262535) Homepage

    Is writing a new file, and then renaming it over an existing file really a 'typical workload'???

    YES!!!!!!

  • Re:LOL: Bug Report (Score:3, Informative)

    by spitzak ( 4019 ) on Thursday March 19, 2009 @06:41PM (#27262717) Homepage

    ARRGH! This has nothing to do with the data being written "soon".

    The problem with EXT4 is that people expect the data to be written before the rename!

    Fsync() is not the solution. We don't want it written now. It is ok if the data and rename are delayed until next week, as long as the rename happens after the data is in the file!

  • by Yokaze ( 70883 ) on Thursday March 19, 2009 @06:42PM (#27262731)

    It is not about losing the data from the pending write; it is about losing data already written, because the operations complete in a different order than they were issued.

  • Re:LOL: Bug Report (Score:3, Informative)

    by Sparr0 ( 451780 ) <sparr0@gmail.com> on Thursday March 19, 2009 @07:33PM (#27263223) Homepage Journal

    No, both of those are, implicitly, expected to be world readable, and at least usually for software that any user can run (to some degree of success). /root is the only place for root to put a local application (or any other files) that he doesn't want a user to be able to see at all.

  • by tytso ( 63275 ) * on Thursday March 19, 2009 @11:04PM (#27264493) Homepage

    It's really depressing that there are so many clueless comments in Slashdot --- but I guess I shouldn't be surprised.

    Patches to work around buggy applications which don't call fsync() have been around since long before this issue got slashdotted, and before the Ubuntu Launchpad page got slammed with comments. In fact, I commented very early in the Ubuntu log that patches that detected the buggy applications and implicitly forced the disk blocks to disk were already available. Since then, both Fedora and Ubuntu are shipping with these workaround patches.

    And yet, people are still saying that ext4 is broken, and will never work, and that I'm saying all of this so that I don't have to change my code, etc. --- when in fact I created the patches to work around the broken applications *first*, and only then started trying to advocate that people fix their d*mn broken applications.

    If you want to make your applications such that they are only safe on Linux and ext3/ext4, be my guest. The workaround patches are all you need for ext4. The fixes have been queued for 2.6.30 as soon as its merge window opens (probably in a week or so), and Fedora and Ubuntu have already merged them into their kernels for their beta releases which will be released in April/May. They will slow down filesystem performance in a few rare cases for properly written applications, so if you have a system that is reliable, and runs on a UPS, you can turn off the workaround patches with a mount option.

    Applications that rely on this behaviour won't necessarily work well on other operating systems, and on other filesystems. But if you only care about Linux and ext3/ext4 file systems, you don't have to change anything. I will still reserve the right to call them broken, though.

  • Re:No kidding (Score:3, Informative)

    by AvitarX ( 172628 ) <me&brandywinehundred,org> on Thursday March 19, 2009 @11:48PM (#27264785) Journal

    But if the application syncs the file, the new data is written to disk.

    This wastes time and performance, and for most files is unneeded.

    There are not only "important" and "unimportant" files, there are also "typical" files.

    We don't want to lose them, but who cares if recent changes are lost.

    Take for example a KDE config file. I am willing to risk all changes made to it since boot (I generally leave my computer off at night, so this is 12 or so hours). I do not want to lose all of my changes since install (this is 10,000 hours).

    The method of writing a temporary file and then renaming prevents the second from happening (in EXT3, XFS now, ReiserFS now, and soon EXT4) while still allowing for very aggressive write caching.

    EXT4 currently allows the second to happen, unless a disk write is forced, which prevents either scenario.

    The loss of the file already synced to disk potentially years ago is the issue, not the loss of the relatively recent data.

    EXT4 has essentially removed the option for having "typical" files, and forces them to be treated as "important".

    So everything becomes either "every change forces a write" or "we don't care about this at all" (cache, for example). The typical stuff, where any single change is not so critical (in the rare event of a crash) but the file itself is sure nice to have, gets elevated to an "important" file that does all of those bad things you describe and eliminates the ability to cache writes.

  • Re:LOL: Bug Report (Score:4, Informative)

    by Eskarel ( 565631 ) on Friday March 20, 2009 @04:53AM (#27265927)
    I did flip read and write, long day.
  • by mr3038 ( 121693 ) on Friday March 20, 2009 @08:08AM (#27266671)

    POSIX specifies that closing a file does not force it to permanent storage. To get that, you MUST call fsync() [manpagez.com].

    So the required code to write a new file safely is:

    1. f = fopen(...)
    2. fwrite(..., f)
    3. fflush(f); fsync(fileno(f)) (fsync takes a file descriptor, not a FILE*)
    4. fclose(f)

    There is no performance problem because fsync() syncs only the requested file. However, that's just the theory... use EXT3 and you'll quickly learn that fsync() effectively syncs the whole filesystem: it doesn't matter which file you ask it to sync, it will always sync the whole filesystem! Obviously that is going to be really slow.

    Because of this, way too many software developers have dropped the fsync() call to make their software usable (that is, not too slow) with EXT3. The correct fix is to change all the broken software, even though in the process that will make EXT3 unusable because of its slow fsync() performance. After that, EXT3 will either be fixed or abandoned. An alternative choice is to use fdatasync() instead of fsync() if the features of fdatasync() are enough. If I've understood correctly, EXT3 is able to do fdatasync() with acceptable performance.

    If any piece of software is writing to disk without using either fsync() or fdatasync() it's basically telling the system: the file I'm writing is not important, try to store it if you don't have better things to do.
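
    Put together, an fdatasync()-based safe replace might look like the following (a minimal sketch; file names and contents are illustrative and error handling is mostly omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;
        write(fd, "new settings\n", 13);
        fdatasync(fd); /* flush the data blocks; skip non-essential metadata */
        close(fd);
        /* Only after the data is on disk, rename over the old file. */
        return rename("config.tmp", "config") ? 1 : 0;
    }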

"What man has done, man can aspire to do." -- Jerry Pournelle, about space flight

Working...