
Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4

Posted by timothy
from the heavy-trade-off dept.
cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T : Headline clarified to dispel the impression that this was a fault in Ext4.

  • Not a bug (Score:5, Informative)

    by casualsax3 (875131) on Wednesday March 11, 2009 @05:06PM (#27157149)
    It's a consequence of not writing software properly. Relevant links from later in the same comment thread, for those who might otherwise miss them:

    https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45 [launchpad.net]

    https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 [launchpad.net]

  • Re:Bull (Score:5, Informative)

    by Anonymous Coward on Wednesday March 11, 2009 @05:36PM (#27157635)

    This is NOT a bug. Read the POSIX documents.

    Filesystem metadata and file contents are NOT required to be synchronous, and a sync is needed to ensure they are synchronised.

    It's just down to retarded programmers who assume they can truncate/rename files and any data pending writes will magically meet up a-la ext3 (which has a mount option which does not sync automatically btw).

    RTFPS (Read The Fine POSIX Spec).

  • Re:Classic tradeoff (Score:4, Informative)

    by imsabbel (611519) on Wednesday March 11, 2009 @05:36PM (#27157643)

    It's even WORSE than just being asynchronous:

    EXT4 reproducibly delays write ops, but commits journal updates concerning this write.

  • Re:Not a bug (Score:5, Informative)

    by Anonymous Coward on Wednesday March 11, 2009 @05:37PM (#27157655)

    Quoting Ts'o:

    "The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, ...

    Linux reinvents windows registry?
    Who knows what they will come up with next.

  • Re:Not a bug (Score:5, Informative)

    by davecb (6526) * <davec-b@rogers.com> on Wednesday March 11, 2009 @05:49PM (#27157825) Homepage Journal

    Er, actually it removes the previous data, then waits to replace it for long enough that the probability of noticing the disappearance approaches unity on flaky hardware (;-))

    --dave

  • Re:Bull (Score:5, Informative)

    by pc486 (86611) on Wednesday March 11, 2009 @05:50PM (#27157841) Homepage

    Ext3 doesn't write out immediately either. If the system crashes within the commit interval, you'll lose whatever data was written during that interval. That's only 5 seconds of data if you're lucky, much more data if you're unlucky. Ext4 simply made that commit interval and backend behavior different than what applications were expecting.

    All modern fs drivers, including ext3 and NTFS, do not write immediately to disk. If they did then system performance would really slow down to almost unbearable speeds (only about 100 syncs/sec on standard consumer magnetic drives). And sometimes the sync call will not occur since some hardware fakes syncs (RAID controllers often do this).

    POSIX doesn't define flushing behavior when writing and closing files. If your applications needs data to be in NV memory, use fsync. If it doesn't care, good. If it does care and it doesn't sync, it's a bad application and is flawed, plain and simple.

  • Re:Not a bug (Score:5, Informative)

    by OeLeWaPpErKe (412765) on Wednesday March 11, 2009 @05:51PM (#27157867) Homepage

    Let's not forget that the only consequence of delayed allocation is the write-out delay changing. Instead of data being "guaranteed" on disk in 5 seconds, that becomes 60 seconds.

    Oh dear God, someone inform the president ! Data that is NEVER guaranteed to be on disk according to spec is only guaranteed on disk after 60 seconds.

    You should not write your application to depend on filesystem-specific behavior. You should write it to the standard, and that means fsync(). No call to fsync, no guarantee; look it up in the documentation (man 2 write).

    The rest of what Ted Ts'o is saying is optimization, speeding up the boot time for gnome/kde; it is not necessary for correct operation.

    Please don't FUD.

    You know I'll look up the docs for you :

    (quote from man 2 write)

    NOTES
        A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.

        If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has been written, the call succeeds, and returns the number of bytes written.

    That brings up another point, almost nobody is ready for the second remark either (write might return after a partial write, necessitating a second call)

    So the normal case for a "reliable write" would be this code :

    size_t written = 0;
    ssize_t r = write(fd, &data, sizeof(data));
    while (r >= 0 && written + (size_t)r < sizeof(data)) {
            written += (size_t)r;
            // resume after the bytes already written, not from the start of data
            r = write(fd, (const char *)&data + written, sizeof(data) - written);
    }
    if (r < 0) { // error handling code, at the very least looking at EIO, ENOSPC and EPIPE for network sockets
    }

    and *NOT*

    write(fd, &data, sizeof(data)); // will probably work

    Just because programmers continuously use the second method (just check a few sf.net projects) doesn't make it the right method (and as there is *NO* way to fix write to make that call reliable in all cases you're going to have to shut up about it eventually)

    Hell, even firefox doesn't check for either EIO or ENOSPC and certainly doesn't handle either of them gracefully, at least not for downloads.

  • by Anonymous Coward on Wednesday March 11, 2009 @05:57PM (#27157953)

    > Delayed writes should lose at most any data between commit and actual write to disk.

    And that's exactly what ext4 does.

    Application decides to update some file:
    1) Reads the file
    2) Modifies the buffer as needed
    3) Truncates the file
    4) Writes the buffer to the file

    Now, if the filesystem commit happens right between 3 and 4, the truncation hits the disk, but the new content does not (yet). If a crash happens before the next commit, all that remains is the truncated file.

  • Re:Not a bug (Score:5, Informative)

    by Jurily (900488) <[jurily] [at] [gmail.com]> on Wednesday March 11, 2009 @06:04PM (#27158059)

    It just loses recent data if your system crashes before it has flushed what it's got in RAM to disk.

    No, that's the bug. It loses ALL data. You get 0 byte files on reboot.

  • by caerwyn (38056) on Wednesday March 11, 2009 @06:11PM (#27158147)

    Nothing- except that it's not in the spec.

    POSIX is like a contract. KDE is breaking the contract and then whining about it to ext4- which isn't breaking the contract. Just as in a court, KDE here doesn't have much of a leg to stand on.

  • by Anonymous Coward on Wednesday March 11, 2009 @06:11PM (#27158159)

    Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).

    You seem to misunderstand that's *exactly* what is happening.

    KDE is *DELETING* all of its config files, then writing them back out again in two operations.

    Three states now exist, the 'old old' state, where the original file existed, the 'old' state, where it is empty, and the 'new' state where it is full again.

    The problem is getting caught between step #2 and step #3, which on ext3 was mostly mitigated by the write delay being only 5 seconds.

    KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.

  • by gweihir (88907) on Wednesday March 11, 2009 @06:12PM (#27158161)

    Whats wrong with "After a file is closed, its synced to disk"?!?

    What, you want people to have to delay/stagger/coordinate their file closes in order to avoid overloading the filesystem? That is the wrong approach. close() just means that the application is done with the file. The sync calls are not a joke; they are there precisely because close() already has entirely sensible but different semantics. Anybody who wants close also to sync can code it that way without problem. Anybody else probably does not want this behaviour in the first place.

    This is not hidden in any way. A simple "man close" not only warns of this, it also refers the reader to the fsync call. Anybody getting bitten by this did not do their homework.

  • Re:Not a bug (Score:5, Informative)

    by caerwyn (38056) on Wednesday March 11, 2009 @06:16PM (#27158213)

    You're right. The correct thing to do is to *always* call fsync() when you need a data guarantee, *regardless* of which FS you're on. The fact that not doing it in the past hasn't caused problems isn't the point; those calls are the correct way of handling things.
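
A minimal C sketch of "call fsync() when you need a data guarantee", assuming a POSIX system. The helper name `append_durably` is made up for illustration and is not from any library; it appends a record and reports success only after fsync() has returned without error.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: append len bytes to path and return 0 only after
 * fsync() has succeeded. Returns -1 with errno set on failure. */
int append_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry */
            goto fail;
        }
        p += n;                    /* partial write: continue after it */
        len -= (size_t)n;
    }

    /* write() succeeding only means the kernel has the data;
     * fsync() asks for it to be pushed to the storage device. */
    if (fsync(fd) < 0)
        goto fail;
    return close(fd);

fail:
    {
        int saved = errno;         /* keep the original error for the caller */
        close(fd);
        errno = saved;
    }
    return -1;
}
```

Whether the bytes really reach the platters still depends on the drive and controller honouring the flush, which this code cannot check.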

  • Re:Bull (Score:1, Informative)

    by Anonymous Coward on Wednesday March 11, 2009 @06:18PM (#27158227)

    You are quite simply wrong. The GP states the correct POSIX behaviour. If anything is a flaw it is a flaw in POSIX, *not* the filesystem.

    This kind of crap coupled with the recent Active Directory question where the Slashdot community proved that it does not know what the hell group policies do is the reason that GNU/Linux/GNOME/KDE will not get a (significant) share of the enterprise desktop - Linux fucking weenies who don't know jack.

  • Re:Not a bug (Score:5, Informative)

    by dmiller (581) <djm&mindrot,org> on Wednesday March 11, 2009 @06:21PM (#27158249) Homepage
    You are doing it wrong; permanently failing on recoverable EINTR and EAGAIN errors. See here [openbsd.org] for how to do it right.
  • man 2 fsync (Score:5, Informative)

    by Nicolas MONNET (4727) <nicoaltiva@gmaiGINSBERGl.com minus poet> on Wednesday March 11, 2009 @06:23PM (#27158277) Journal

    The filesystem doesn't guarantee anything is written until you've called fsync and it has returned.

  • Re:Not a bug (Score:3, Informative)

    by Kaboom13 (235759) <kaboom108@@@bellsouth...net> on Wednesday March 11, 2009 @06:25PM (#27158309)

    The point of a journal is to allow the file system to return to a defined state in case the unexpected happens. This keeps the whole file system from being fucked by a crash or sudden data loss. It's better to know you lost some data than to have the filesystem in a state where some data is corrupt but you have no way to tell where or what it is. The situation here is that ext4 has increased the timeframe between commits. This increases performance at the cost of losing more data if a crash happens. Total crashes are pretty rare these days (unless you run some really shitty code) and UPSes are inexpensive. Hell, my XP system has Blue Screened once over the last two years, and it was directly related to a beta nvidia driver.

    If your system is likely to crash or lose power, don't use ext4.

  • Re:Not a bug (Score:2, Informative)

    by OeLeWaPpErKe (412765) on Wednesday March 11, 2009 @06:28PM (#27158335) Homepage

    You're only partially right. EAGAIN cannot occur unless I asked for it first (and modified my error catching accordingly).

    But you're right about EINTR causing unwarranted disruption. I should ignore that one in the while loop.
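
A sketch of what the loop looks like with EINTR ignored, assuming a blocking descriptor (so EAGAIN is not expected, as noted above). The name `write_all` is hypothetical; this is not the OpenBSD code linked in the parent.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes are out.
 * Retries on EINTR; any other error returns -1 with errno set. */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* signal arrived before any byte was written */
            return -1;         /* EIO, ENOSPC, EPIPE, ...: caller decides */
        }
        p += n;                /* short write: resume where it stopped */
        left -= (size_t)n;
    }
    return (ssize_t)len;
}
```

A return of -1 can still mean that some bytes were written before the error, so the caller should treat the file contents as unknown in that case.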

  • by macshit (157376) <miles.gnu@org> on Wednesday March 11, 2009 @06:36PM (#27158435) Homepage

    ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior is a shame!

    I read the FA, and it actually really does look like the applications are simply using stupidly risky practices:

    These applications are truncating the file before writing (i.e., opening with O_TRUNC), and then assuming that the truncation and any following write are atomic. That's obviously not true -- what happens if your system is very busy (not surprising in the startup flurry which is apparently where this stuff happens), the process doesn't get scheduled for a while after the truncate (but before the write), and the system happens to crash in that interval?

    I'm as lazy as they get, but even I know enough not to do that kind of crap...

    There's probably some way the FS could finesse this issue -- e.g., don't actually schedule truncation until you see the first write or close -- but it would be a workaround for buggy applications, not a FS bugfix.

  • Re:Not a bug (Score:3, Informative)

    by PhilHibbs (4537) <snarks@gmail.com> on Wednesday March 11, 2009 @06:37PM (#27158451) Homepage Journal

    It seems exceedingly odd that issuing a write for a non-zero-sized file and having it delayed causes the file to become zero-size before the new data is written.

    But you never create and write to a file as a single operation, there's always one function call to create the file and return a handle to it, and then another function call to write the data using the handle. The first operation writes data to the directory, which is itself a file that already exists, the second allocates some space for the file, writes to it, and updates the directory. Having the file system spot what your application is trying to do and reversing the order of the operations would be... tricky.

  • by Bronster (13157) <slashdot@brong.net> on Wednesday March 11, 2009 @06:44PM (#27158563) Homepage

    mount -o sync. Enjoy your slow returns and strictly ordered writes.

  • Re:Not a bug (Score:3, Informative)

    by gweihir (88907) on Wednesday March 11, 2009 @06:59PM (#27158781)

    The point of having a rock-solid filesystem is to have a rock-solid filesystem. Any filesystem that crashes and loses data is bad. What is the point of a journal again? To enforce someone's idea of how an API should be coded to, or to reduce data loss?

    ext4 did not crash. Ext4 also did not lose any data it claimed to have gotten to disk. However, unless you want the filesystem slower by a factor of 10x....100x, you have to delay writes. And that means your data is only reliably on disk after an fsync. Any good developer knows that.

    Incidentally, the journal serves to avoid filesystem corruption on crash, nothing else. And no other claim was ever made by the developers.

  • Re:Bull (Score:5, Informative)

    by LWATCDR (28044) on Wednesday March 11, 2009 @07:10PM (#27158975) Homepage Journal

    It isn't a flaw. It is documented and the programmers didn't follow the docs. There is a specific call named fsync that flushes the buffers to prevent the problem.
    In fact here is a link to that call http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html [opengroup.org]

    Yes, if we had a perfect world we would have instant IO, but we do not. The flaw is in the application, plain and simple.
    They didn't use the API properly and it really is just that simple.

  • Re:Bull (Score:2, Informative)

    by Anonymous Coward on Wednesday March 11, 2009 @07:11PM (#27158991)

    Right... that way a single error can brick the whole system at once.

  • by Tadu (141809) on Wednesday March 11, 2009 @07:22PM (#27159143)

    KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.

    Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. While that may be correct from a POSIX lawyer point of view, it is still heavily undesirable.

  • Mod parent up (Score:3, Informative)

    by betterunixthanunix (980855) on Wednesday March 11, 2009 @07:36PM (#27159321)
    Much as I love to fall back on the "POSIX says that this could be the case so it is OK that it is the case" excuse, it really does not fly in this case. POSIX doesn't allow this sort of behavior because it is a "good" thing to do; it allows it because there are systems where this is an OK thing to do -- systems intended to manage databases, systems that are heavily verified and have backup power supplies, etc. This is not appropriate for desktop systems -- desktop systems have to be robust against all kinds of stupid situations, like sudden power losses, users hitting the reset button because an application is hanging, and so forth. EXT4 should not be used in a desktop system if it can cause data loss when the unexpected happens, regardless of the technical merits of writing to small configuration files.
  • by Qzukk (229616) on Wednesday March 11, 2009 @07:52PM (#27159573) Journal

    change their applications because a new version of the file system breaks their stuff is madness

    Their applications were already broken, committing everything every 5 seconds* regardless of what the applications had wanted was the workaround in ext3, but I guess it's only madness when street-makers demand that you drive with round wheels, not when you demand that street-makers accommodate your square ones.

    * Unless you increased the commit time to reduce power usage (eg laptop_mode)

  • Re:Not a bug (Score:4, Informative)

    by Ed Avis (5917) <ed@membled.com> on Wednesday March 11, 2009 @08:01PM (#27159689) Homepage

    YES!! That is EXACTLY what I expect every modern file system to do.

    Your expectation is quite reasonable. When the application writes something to disk, it should be there on disk, right? The way the article is presented makes it sound like a horrible bug in ext4 that it doesn't do this. But believe it or not, almost no filesystem provides this guarantee by default. ext3 doesn't (in the default mode), nor does ext2, nor a typical implementation of FAT or NTFS or the Minix filesystem or whatever.

    For decades now it has been an accepted trade-off that the filesystem can hold back disk writes and do them later, giving better disk performance at the expense of losing data if there is a crash. Losing file data is bad but losing metadata is even worse, since corrupt filesystem metadata can trash the contents of many files and requires a lengthy fsck on startup. So journalling filesystems, as typically configured, keep a journal for metadata so it's not corrupted even if the power gets cut at the most inconvenient moment. But they don't extend the same care to file contents, because it would be too slow. You can enable it by setting the data=journal parameter in ext3 (and I guess ext4 too) but this isn't the default.

    It is certainly a bit unfair that the filesystem takes such pains with its own bookkeeping information but doesn't bother to be so careful about user data. But as I said, it's a known tradeoff to get better performance. If you want to be sure your file has reached disk you need to fsync(). This sucks, but it's the Unix way, and has been so for like, forever. So it's not a bug in ext4 - just bad luck and perhaps a misunderstanding between kernel and userspace about what guarantees the filesystem provides.

    As SSDs replace rotating storage, there is less need to buffer writes (certainly the need to minimize seek time goes away, and that's the biggest reason), so we might see this whole situation resolved within a few years. Perhaps in 2015, when the system call returns, you can be sure that the data is written. Until that longed-for day, bear in mind that your filesystem is permitted to temporarily lie to you about what has been written, and call fsync() if you are paranoid.

  • by drolli (522659) on Wednesday March 11, 2009 @08:08PM (#27159769) Journal

    Citing from the message Ts'o posted:

    ----
    So, what is the problem. POSIX fundamentally says that what happens if the system is not shutdown cleanly is undefined. If you want to force things to be stored on disk, you must use fsync() or fdatasync(). There may be performance problems with this, which is what happened with FireFox 3.0[1] --- but that's why POSIX doesn't require that things be synched to disk as soon as the file is closed.
    ----

    And indeed, the NOTES section of "man -S2 close" explicitly notes what is not mentioned in the other sections. Up to this day I also lived under the assumption that a close implies an fsync. Now I have to change my programs where it matters. All the idiots who scream here that the OS is doing something wrong: no, it's not. AFAIU it's following the defined behaviour, which is what I expect an OS to do. It should NOT try to magically guess where I forgot to fsync my files.

  • Re:Bull (Score:3, Informative)

    by LWATCDR (28044) on Wednesday March 11, 2009 @08:12PM (#27159833) Homepage Journal

    Just use fsync()
    Problem solved. Read the POSIX docs, or the libc docs, and you will never run into this problem.

  • Re:Not a bug (Score:5, Informative)

    by Qzukk (229616) on Wednesday March 11, 2009 @08:49PM (#27160295) Journal

    A file system should take my data buffer, and after saying "Ok, I got it"

    There's your problem, you didn't even bother to ask if it got it, you just threw a ton of data into the file descriptor and closed it, now didn't you. And you want me on thedailywtf?

    But let's back up here, because there's more than just people too lazy to call fsync() in order to ask the file system to write the data to the disk and say "Ok, I got it".

    All that stuff about creating a backup copy and doing this and that, has to happen inside the file system.

    The filesystem does exactly what you tell it to do. If you don't want it to make a zero byte file, then DON'T USE O_TRUNC OR *truncate() TO EMPTY YOUR FILE. Make a new file, fill it up, rename it over the other file. Don't assume that in just a few instructions, you're going to be filling it back up with new data, because those instructions may never arrive.

    You don't like it? Try and convince people that (open file, erase all the data in it, do some stuff, write some data, do some more stuff, write some more data, write data to disk, close file) should be an uninterruptible atomic operation. You want a versioning filesystem? Take your pick [wikipedia.org].
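
The "make a new file, fill it up, rename it over the other file" advice can be sketched in C roughly as follows. The helper name `replace_file` and the ".tmp" suffix are made up for illustration, and the fsync() before rename() is an added assumption based on the rest of this thread (without it, a crash can leave the new name pointing at a file whose data never reached the disk).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace the contents of path with buf so that after
 * a crash either the old or the new contents are found, never an empty file.
 * Returns 0 on success, -1 with errno set on failure. */
int replace_file(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* O_TRUNC is fine here: it only ever empties the scratch file. */
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        len -= (size_t)w;
    }

    if (fsync(fd) < 0)             /* data on disk before the name changes */
        goto fail;
    if (close(fd) < 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;
    if (rename(tmp, path) < 0)     /* atomically swap new file into place */
        goto fail;
    return 0;

fail:
    {
        int saved = errno;
        if (fd >= 0)
            close(fd);
        unlink(tmp);               /* the original file is left untouched */
        errno = saved;
    }
    return -1;
}
```

This sketch does not fsync() the containing directory, so after a crash the rename itself may not have happened yet; in that case the old file is still intact.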

  • Re:Bull (Score:3, Informative)

    by amirulbahr (1216502) on Wednesday March 11, 2009 @08:58PM (#27160391)

    Who modded this up? Jane Q. Public is completely clueless on this topic, but she manages to sound like she has an idea to fellow clueless moderators. She should be called out for the karma whoring ignoramus she is.

    Some choice quotes from her on this thread.

    Delayed allocation is like leading a moving target when shooting.

    BadAnalogyGuy would be proud. Probably also worth mentioning that without delayed allocation, the system would be unbearably slow.

    The longer you delay allocation after writing the journal (and Ext4 seems to take this to extremes), the more chance there is of something -- almost anything really -- going wrong

    A kernel crash or power outage is certainly something that could go wrong. Modern journalling file-systems handle this gracefully by making sure the file-system is in a consistent state when it comes back up.

    The filesystem is flawed, plain and simple.

    You'll realize why that one is a gem when you read her next quote. As the discussion continues, she begins to realize how far off the mark she is and begins to correct...

    It most definitely is a filesystem limitation. That is different from saying that it's the filesystem's fault.

    Still off the mark, but perhaps she is beginning to figure out what a file system should offer and what the issue being discussed is.

    If an application that reads and writes lots of small files fails under Ext4, then it is Ext4's fault, not the application. An application should be able to read and write lots of small files if it wants... I can think of a great many practical examples.

    Go ahead and do that. But if you want to make sure your data is written, in case of a kernel crash or power outage, then you had better understand what is going on at the FS level.

    As a user of a high-level language, I should not be expected to know the disk I/O API in a given OS. That is for the authors of the compiler or interpreter.

    No, but you should understand the API of the language you are dealing with. Since when does a compiler handle disk I/O anyway? As for your interpreter, it is free to call fsync whenever it wants, but what has that got to do with the FS again?

    Excuse the heck out of me, but the issue being discussed was a failure that was NOT due to a power loss or other such system problem. It was a crash caused by this very issue. If it were simply missing data due to power loss or some such, there would be no point in this discussion at all.

    The purpose of this quote is to demonstrate that she both has no regard for TFA and also has no idea what this issue being discussed is. I encourage anyone looking to give her mod points actually RTFA and also do a bit of background reading on file systems and in particular delayed writes.

    My point was and still is: if the data is not flushed to disk yet, it should either be accessible from the buffer, or not at all.

    This sentence alone deserves a -1 Huh? If you do a write, and it is successful, then you can do a read on the same file and it will return what you wrote, whether or not it had been flushed to disk. This is the way it is supposed to work. Think about it for like 10 seconds and you'll begin to get it.
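
To make that concrete, here is a small C sketch, assuming a POSIX system. The function name `readback_matches` is hypothetical. It writes through one descriptor, never calls fsync(), and reads the same bytes back through a second descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if data written without any sync call can be
 * read back immediately, 0 if it differs, -1 on an I/O error. */
int readback_matches(const char *path, const char *msg)
{
    assert(path != NULL && msg != NULL);

    char got[256];
    size_t len = strlen(msg);
    if (len > sizeof got)
        return -1;

    int wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (wfd < 0)
        return -1;
    if (write(wfd, msg, len) != (ssize_t)len) {
        close(wfd);
        return -1;
    }

    /* No fsync() here: the data may exist only in the kernel's cache. */
    int rfd = open(path, O_RDONLY);
    if (rfd < 0) {
        close(wfd);
        return -1;
    }
    ssize_t n = read(rfd, got, sizeof got);
    close(rfd);
    close(wfd);
    if (n < 0)
        return -1;

    return n == (ssize_t)len && memcmp(got, msg, len) == 0;
}
```

The read is served from the kernel's view of the file, so it says nothing about what would survive a crash.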

    not supposed to have to worry about OS-specific details

    WE ARE TALKING ABOUT UNEXPECTED KERNEL CRASHES AND POWER OUTAGES. If you care about that situation then you should get a clue before you start coding. If not, then what is the problem, or was it fault... er, sorry limitation?

    One should not have to know about syncing to do something like a few simple file writes

    And one doesn't need to if she is not concerned with the rare possibility that the system CRASHES OR LOSES POWER in the next few minutes.

    Anyway, I've never called out another poster like this before and now I feel dirty.

  • Re:Not a bug (Score:3, Informative)

    by shutdown -p now (807394) on Wednesday March 11, 2009 @09:07PM (#27160497) Journal

    The problem with that is that you have to use fsync() for each and every file descriptor you have, and for lots of small files, this is very slow (because if you're syncing after every 10-byte write, you might as well have no caching). What's needed is a way to write those files in a batch, close them all, and then say "now sync all of that".

    To the best of my knowledge, though, Windows has the same problem - its fsync analog, FlushFileBuffers, also applies to a single file handle only (you can flush all writes for the volume, but only if you're an admin).

  • Re:amirulbahr: (Score:3, Informative)

    by amirulbahr (1216502) on Wednesday March 11, 2009 @09:32PM (#27160735)
    I assure you it is you who has mis-understood the situation. From the bug report referenced in the summary:

    Today, I was experimenting with some BIOS settings that made the system crash right after loading the desktop. After a clean reboot pretty much any file written to by any application (during the previous boot) was 0 bytes. For example Plasma and some of the KDE core config files were reset. Also some of my MySQL databases were killed...

    My EXT4 partitions all use the default settings with no performance tweaks. Barriers on, extents on, ordered data mode..

    I used Ext3 for 2 years and I never had any problems after power losses or system crashes.

    The crash was not caused by ext4 but by something else. The file system was in a consistent state because of the journal. Some data had not yet been written to disk, because of the delayed write and was thus lost.

    Maybe you need to take a break, or have a coffee, or get some sleep or something. But you really are way off and posting way too much on this topic that you are not well informed of.

    This is not a bug, not a flaw, not a limitation. You can write and then read regardless of whether or not actual disk commits take place. The file system takes care of that for you. If you're doing file I/O, and you want to call yourself half-way competent, then you should have some clue about the possibility that the underlying file-system will be doing delayed writes. If you are writing critical applications for which this may cause issues, then you might decide to throw in some fsync calls (or their equivalent on whatever platform you are using).

    I know you have learnt something today. Glad to help out.

  • by swillden (191260) <shawn-ds@willden.org> on Wednesday March 11, 2009 @09:35PM (#27160759) Homepage Journal

    It most definitely is a filesystem limitation.

    No, it's not. The file system is perfectly capable of making sure all your writes hit the disk as soon as possible.

    Just mount it with the 'sync' option.

    If you want the significant performance benefits of delayed writes, however, you should not use 'sync' and accept that, with Ext4, write() works the way the documentation says it does.

  • Re:Not a bug (Score:2, Informative)

    by GXTi (635121) <gxti@partiallystapled.com> on Wednesday March 11, 2009 @09:47PM (#27160871) Homepage

    and after saying "Ok, I got it", *guarantee*, that I can turn off the system in that very moment, without losing data or corrupting the file system in any way.

    Which is precisely what fsync does, and is precisely what these developers didn't use. The filesystem knows better than you do how to get all the data it has to write onto the platters as fast as possible so if you need something specific like "it's important that this data get written now, so I'll wait for you to finish", you have to ask. Otherwise your apps would run a great deal slower since every little write (even a single byte!) would have to wait for the OS to say "OK, it's on disk". And if you really want that, there are flags you can use, e.g. O_SYNC. But you don't.
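
For completeness, a sketch of the O_SYNC route in C, assuming a POSIX system. The wrapper name `open_synchronous` is made up for illustration; the flags themselves are standard open(2) flags.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open path for appending so that each write() returns
 * only after the data has been handed to the storage device, roughly as if
 * every write() were followed by fsync(). Returns a descriptor or -1. */
int open_synchronous(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
}
```

Every write() on such a descriptor pays the full cost of a sync, which is exactly the slowdown described above, so it only makes sense for small amounts of critical data.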

  • Re:Bull (Score:3, Informative)

    by amirulbahr (1216502) on Thursday March 12, 2009 @12:30AM (#27162231)
    They are referring to the case when the system isn't shut down cleanly. This means a kernel crash or a power outage. What is your point exactly? Seriously, and I really am doing my best to hold back on the personal insults (even when you say something as annoying as "And calm down !!"), what is so difficult that you fail to comprehend what the real issue being discussed here is?
  • by greg1104 (461138) <gsmith@gregsmith.com> on Thursday March 12, 2009 @01:23AM (#27162563) Homepage

    If your battery-backed RAID controller ever fakes a fsync it is fundamentally broken or misconfigured. When the cache is filled with a write backlog and you try to write something else, that write will block until there is free space. Same as any other write cache that fills up.

    When cache space is available to cache the write again, the data goes into there, and then a fsync request after it can then return success.

  • Re:Bull (Score:2, Informative)

    by mysidia (191772) on Thursday March 12, 2009 @01:31AM (#27162615)

    5 seconds might reduce the probability of problems, but it doesn't make the assumption a non-bug.

    That's like saying if my code has a buffer overflow in it, but if it's only by 5 bytes, everything's ok, whereas if it's by 150 bytes, I should panic...

    One way to test if your argument makes sense is to extend it to absurdity.

    And the result has absolutely no bearing on the issue. Extending 5 seconds to infinity is nothing like extending 5 seconds to 150 seconds.

    If this was done, the FS would (sooner or later) have to ignore fsync totally and re-assert control of commits in order to achieve any reasonable performance.

    On some systems you may actually find this to be the case. On certain kernels, with certain hard drives that had a write cache, sync() would not force the drive itself to flush its own cache; data could sit in there for minutes, to be lost in the event of an untimely power failure.

    Most applications handle this reasonably; maintain transactional integrity, and sync() when it is critical that a write finish on a timely basis, and in event of a crash, revert to the last 'good' state.

    Transactional database software like PostgreSQL is exceptional at this, and it does use sync.

    If you have a lot of critical data, the right place to put it is in a DBM, that will handle and manage syncing correctly and optimally for the OS.

    If you have small amounts of critical data, then you write them to flatfiles, and sync. The small size of the files, and the small number of writes you do to them will make performance a non-issue.

    Maintaining integrity of critical data requires a lot more than a good filesystem, and the ability to ensure data is sync'ed to disk.

    That is because even 5 seconds is non-zero, which is all the time in the world if, at that moment, the files on disk are in a state that would be corrupt or inconsistent should the system crash.

    Filesystems don't and never did totally relieve application developers of having to worry about what might (or might not) be written to disk by the OS.

    Certainly it's unreasonable for them to make particular assumptions about exactly how long it takes, since there are so many filesystems available, including some unusual ones like NFS.

    (void)sleep(5); after a write is not, and never was, a substitute for fsync() for assuring data is written before writing more.

  • Re:rename and fsync (Score:3, Informative)

    by QuoteMstr (55051) <dan.colascione@gmail.com> on Thursday March 12, 2009 @01:36AM (#27162649)

    Telling application developers to use a database is bullshit. The filesystem is a database, albeit not a relational one. An open-write-close-rename sequence merely asks for atomicity without durability, something that's perfectly reasonable. As other posters have mentioned in vain, all the application wants is for either the old version of a file or the entire new version to appear on a reboot. It doesn't care at the instant of the rename whether that replacement has been recorded on disk, just that eventually, when the filesystem does record that replacement, it's recorded atomically.

    You might want the open-write-fsync-close-rename behavior for a mailserver, in which you must acknowledge receipt (i.e., you need durability), but asking for that same durability in a multi-file configuration setup is just stupidly degrading performance.

    open-write-close-rename is saying something fundamentally different from open-write-fsync-close-rename, and it's perfectly reasonable for a filesystem to act sanely in response to both kinds of request.
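
    To make the two sequences concrete, here is a minimal Python sketch. The function name replace_file and its durable flag are invented for this example. With durable=False it performs open-write-close-rename; with durable=True it adds the fsync before the close. Whether the non-durable variant leaves either the old or the new file after a crash is exactly the filesystem behaviour being argued about in this thread, so the code only expresses the request; it does not guarantee the outcome.

```python
import os
import tempfile


def replace_file(path, data, durable=False):
    """Replace path with data via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory, so the final rename never crosses a filesystem boundary.
    # Note: mkstemp creates the file with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            if durable:
                tmp.flush()             # push Python's buffer to the kernel
                os.fsync(tmp.fileno())  # wait until the kernel has written it
        # os.replace is rename(2) on POSIX: readers see old or new, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "settingsrc")
        replace_file(target, b"theme=old\n")                 # atomicity only
        replace_file(target, b"theme=new\n", durable=True)   # plus durability
```

    A mail server acknowledging receipt would call it with durable=True; a desktop saving many small configuration files is the case where the cheaper default is being asked for.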

  • Hiding behind POSIX (Score:3, Informative)

    by antientropic (447787) on Thursday March 12, 2009 @04:10AM (#27163455)

    All the Idiots who scream here that the OS is doing something wrong: no, it's not.

    This is called "hiding behind the standard" (a disease very common among kernel developers). Just because the standard doesn't specify behaviour in a certain situation doesn't mean that any behaviour is equally okay. In this case, ext4's behaviour very much hurts the robustness of the system, which is rather important in unreliable environments like laptops.

    In this case, what KDE does is certainly not unreasonable (and its developers are certainly not "idiots"). It doesn't overwrite configuration files in place, which would be bad even in the absence of system crashes, as doing it that way is not atomic. Instead it creates a new temporary file, writes the new contents, then renames the temporary file to the old one. This is an atomic operation on Unix: you either see the old contents or the new contents, but nothing in between. Now, the problem is that in case of a crash, ext4 gives you the worst possible outcome by reordering the operations: it will "recover" the rename for you, but not the actual write of the new data. So you end up with a 0-byte file - far from atomic. POSIX of course allows this, but POSIX allows just about anything: that doesn't mean it's reasonable. The only guaranteed solution - use an fsync/fdatasync - is something that almost nobody does because the performance is horrible (ext3 in fact will write the entire journal, IIRC, when doing an fsync() on a single file - this really hurt Firefox 3 performance [mozilla.org]). So the KDE developers can be excused for not doing that.

    It's the job of a modern filesystem to ensure robustness and performance. If you don't use an fsync, you should expect that there is a time window during which transactions might become undone (not the end of the world for configuration files), but they should never be reordered. For instance, this is how Berkeley DB works if you disable fsync: it guarantees ACI but not ACID. For many desktop applications, that's good enough. Destroying every file that has been updated since the last fsync isn't. And your users aren't going to be impressed by the argument that POSIX allows it.

  • Re:Not a bug (Score:3, Informative)

    by EsbenMoseHansen (731150) on Thursday March 12, 2009 @05:02AM (#27163747) Homepage

    No. Writing software properly means calling fsync() if you need a data guarantee.

    But neither Gnome nor KDE needs this. What they need is that the file in question is either left in the old state or in the new state. The problem is that ext4 rushes in to complete the truncation, but lazily after 1-2 minutes (!) writes the actual data. That is quite broken, in my opinion. The obvious solution would be to bundle the truncation with writing out the data.

    Pretty sure no one in their right mind would call using fsync() barriers a "huge burden". There are an enormous number of programs out there that do this correctly already.

    And then there are some that don't. Those have problems. They're bugs. They need to be fixed. Fixing bugs is not a "huge burden", it's a necessary task.

    In KDE's case, it would be as simple as reverting a patch. The fsync() calls were removed because of the bugs associated with them, including killing laptop batteries. Dig through kde-core-devel for the gory details. The code in question is posted elsewhere.

    The bug is in ext4, like it was in XFS --- where it was finally fixed. And it looks like ext4 has introduced a hack to sort of fix this problem there, too.

  • Re:Bull (Score:1, Informative)

    by Anonymous Coward on Thursday March 12, 2009 @05:42AM (#27163935)

    open-write-close-rename already asks for atomic but asynchronous rename under all sane systems

    I'm not sure what you're saying here. Are you arguing that such a sequence should be treated specially by the OS? Why?

    XFS and ext4 break that perfectly sane sequence of operations

    It isn't sane. It's like replacing your tires with the engine running and your kid sitting behind the wheel. Sure it might work 9 out of 10 times, until your kid switches the car into gear.

    KDE (and Gnome) are truncating critical system files without a backup available. How is that sane? Sure they will immediately rewrite the file, but who will guarantee that the system will not crash between the truncate and the write?

    And finally, they aren't doing open-write-close-rename. They're doing truncate-write-close. What they should be doing is create-write-close-sync-rename, i.e. do not overwrite the old config file before the new content is safely stored on disk. And I think the reason that they did not go the correct way (assuming they were aware of the issue) is because the "safe" way sucked performance-wise. Well duh, if you write hundreds of 50-byte files, performance will suck, unless you skip safety protocol.
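
    The window in a truncate-write-close rewrite can be seen even without a crash. The Python sketch below uses a made-up helper, sizes_during_truncating_rewrite, that reopens an existing file with O_TRUNC and records its size before and after the write. It is only an illustration of the gap between the two steps, not a reproduction of what KDE or Gnome actually do.

```python
import os
import tempfile


def sizes_during_truncating_rewrite(path, data):
    """Rewrite an existing file in place; return its size after each step."""
    sizes = []
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)  # old contents are gone here
    try:
        sizes.append(os.fstat(fd).st_size)  # the file is empty at this point
        os.write(fd, data)                  # new contents, still only in cache
        sizes.append(os.fstat(fd).st_size)
    finally:
        os.close(fd)
    return sizes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        rc = os.path.join(d, "apprc")
        with open(rc, "wb") as f:
            f.write(b"old settings\n")
        print(sizes_during_truncating_rewrite(rc, b"new settings\n"))
```

    Between the first and second measurement there is no copy of the old settings anywhere, which is why writing to a separate file and renaming it into place is the safer pattern.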

  • by Eunuchswear (210685) on Thursday March 12, 2009 @06:56AM (#27164355) Journal

    People don't fsync() all the time because it's SLOW. Not just a little slow, but RTFS's bug report for the link to the Firefox 3 bug due to performing 8 syncs per page load: if there's any IO going on, firefox ground to a halt to wait its turn to ensure that your bookmarks and history and cookies and everything else were really, really written to disk.

    Well, it has to be said that fsync() on ext3 is slow because of an ext3 bug - fsync() is the same as sync() on ext3.

  • by toby (759) on Thursday March 12, 2009 @10:39AM (#27166707) Homepage Journal

    Re: "backup old file and write a new one" - A transactional copy-on-write filesystem such as Sun's ZFS [sun.com] is doing almost the same job, transparently.

    I have little doubt that copy-on-write will eventually supersede overwrite-and-pray filesystems. The wins are numerous, including cheap snapshotting, etc, etc. Install OpenSolaris [opensolaris.org] and give ZFS a try today!
