Data Storage GUI KDE Software Linux

Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4 830

Posted by timothy
from the heavy-trade-off dept.
cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T : Headline clarified to dispel the impression that this was a fault in Ext4.

  • Re:Not a bug (Score:3, Interesting)

    by jgarra23 (1109651) on Wednesday March 11, 2009 @04:26PM (#27157433)
    Talk about doublespeak! "Not a bug" vs. "It's a consequence of not writing software properly." Reminds me of that FG episode where Stewie says, "it's not that I want to kill Lois... it's that I don't... want... her... to... live... anymore."
  • by microbee (682094) on Wednesday March 11, 2009 @04:28PM (#27157479)

    So, POSIX never guarantees your data is safe unless you do fsync(). So, ext3 was not 100% safer either. So, it's the applications' fault that they truncate files before writing.

    But it doesn't matter what POSIX says. It doesn't matter where the fault lies. To the users, a system either works or not, as a whole.

    EXT4 aims to replace EXT3 and become the next-gen de facto filesystem on the Linux desktop. So it has to compete with EXT3 in all regards; not just performance, but data integrity and reliability as well. If in the common scenarios people lose data on EXT4 but not EXT3, the blame is on EXT4. Period.

    It's the same thing that a kernel does. You have to COPE with crappy hardware and user applications, because that's your job.

  • by rpp3po (641313) on Wednesday March 11, 2009 @04:28PM (#27157485)
    There are several excuses circulating: 1. This is not a bug, 2. It's the apps' fault, 3. all modern filesystems are at risk.
    This is all a bunch of BS! Delayed writes should lose at most any data between commit and actual write to disk. Ext4 loses the complete files (even their content before the write).
    ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!
  • Re:Not a bug (Score:5, Interesting)

    by Qzukk (229616) on Wednesday March 11, 2009 @04:29PM (#27157501) Journal

    I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.

    It is perfectly OK to read and write thousands of tiny files. Unless the system is going to crash while you're doing it and you somehow want the magic computer fairy to make sure that the files are still there when you reboot it. In that case, you're going to have to always write every single block out to the disk, and slow everything else down to make sure no process gets an "unreasonable" expectation that their data is safe until the drive catches up.

    Fortunately his patches will include an option to turn the magic computer fairy off.

  • by dltaylor (7510) on Wednesday March 11, 2009 @04:40PM (#27157693)

    When I write data to a file (either through a descriptor or FILE *), I expect it to be stored on media at the earliest practical moment. That doesn't mean "immediately", but 150 seconds is brain-damaged. Regardless of how many reads are pending, writes must be scheduled, at least in proportion to the overall file system activity, or you might as well run on a ramdisk.

    While reading/writing a flurry of small files at application startup is sloppy from a system performance point of view, data loss is not the application developers' fault, it's the file system designers'.

    BTW, I write drivers for a living, have tuned SMD drive format for performance, and written microkernels, so this is not a developer's rant.

  • Re:Classic tradeoff (Score:3, Interesting)

    by slashdotmsiriv (922939) on Wednesday March 11, 2009 @04:44PM (#27157759)

    Even if you use O_SYNC, or fsync() there is no guarantee that the data are safely stored on disk.

    You also have to disable HDD caching, e.g., using
      hdparm -W0 /dev/hda1
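
    A minimal sketch of the fsync() half of this, in Python for brevity. The name `write_durably` is made up for illustration. Note that os.fsync() only asks the kernel to flush its own cache; a drive with write caching enabled may still be holding the data in its buffer, which is the poster's point.

```python
import os

def write_durably(path: str, data: bytes) -> None:
    """Write data and ask the kernel to push it out to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())   # ask the kernel to flush its cache to the device
```

    Skipping the flush() before os.fsync() is a common slip: the bytes would still be sitting in the process, so the kernel would have nothing to sync.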

  • Re:Bull (Score:2, Interesting)

    by Jane Q. Public (1010737) on Wednesday March 11, 2009 @04:48PM (#27157821)
    That does not make it any less of a filesystem limitation. While it is true that a well-written app should be aware of potential timing issues, all the application itself should ever suffer is delays in the I/O. Anything else is a flaw. Other FSs may share the flaw, but it is still a flaw.
  • Re:Not a bug (Score:3, Interesting)

    by PIBM (588930) on Wednesday March 11, 2009 @05:05PM (#27158083) Homepage

    That's your filesystem definition. Even there, I can guarantee you it can't be built, thus, from your point of view, no file system will ever not be bugged.

    How come?

    I open a file
    I write one byte
    I close the file

    Data is not on disk BECAUSE IT WAS FULL and you failed to plan for intercepting errors / warnings.

    Filesystems need to be used according to their specifications, not the way you'd want them to work.
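
    The open/write/close case above, sketched with every step checked. This is Python rather than C, and `save_checked` is a hypothetical helper, but the calls map one-to-one onto the POSIX ones. A full disk shows up as an OSError with errno ENOSPC, and it may only surface at fsync() or close() time.

```python
import errno
import os

def save_checked(path: str, data: bytes) -> bool:
    """Return True only if open, write, fsync and close all succeeded."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as exc:
        print(f"open failed: {exc}")
        return False
    ok = True
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)                       # deferred errors can surface here
    except OSError as exc:
        reason = "disk full" if exc.errno == errno.ENOSPC else str(exc)
        print(f"write failed: {reason}")
        ok = False
    finally:
        try:
            os.close(fd)                   # close can report errors too
        except OSError as exc:
            print(f"close failed: {exc}")
            ok = False
    return ok
```

    On Linux, pointing this at /dev/full is a convenient way to exercise the disk-full branch without actually filling a disk.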

  • Re:Bull (Score:1, Interesting)

    by dedazo (737510) on Wednesday March 11, 2009 @05:15PM (#27158195) Journal

    If I want asynchronous or lazy writes to the disk, I'll code that myself. The filesystem should be hitting the metal about 0.001 microseconds after I call write() or whatever the function is. I cannot believe this is actually being spun as a "feature" that application developers should code against. It's just mind boggling.

  • Re:Actually, no. (Score:3, Interesting)

    by TheRaven64 (641858) on Wednesday March 11, 2009 @05:32PM (#27158387) Journal

    As a user of a framework that doesn't suck, I don't have to worry about this problem. When I need to write a file in such a way that the entire operation either succeeds, or the entire operation fails (a common requirement), the framework I use provides a flag that I can set on the write operation to do all of the write/rename juggling that needs to happen, according to POSIX, to make it work. As such, my code will work happily on any filesystem that doesn't break the spec.

    If you are using a high-level language with a low-level framework, you might want to reconsider your approach.

  • by hattig (47930) on Wednesday March 11, 2009 @05:47PM (#27158607) Journal

    Bah. Maybe all computers should come with a single-cell battery, for a couple of minutes of backup power.

    As soon as power fails to the system and it resorts to battery, all calls to write() should also call fsync(), even if that slows the system down.

    Never mind an option that implicitly calls fsync() if it hasn't been called in the past 3 seconds, for a minimal performance hit. If you have a specific application that doesn't want fsync() then you can disable that feature, but clearly on a consumer box, no UPS, potentially dodgy hardware and drivers, it makes sense. 150 seconds without a sync, just dumping into a buffer for writing ... sheesh.

  • Re:Bad defaults (Score:3, Interesting)

    by 0123456 (636235) on Wednesday March 11, 2009 @07:55PM (#27160351)

    The old defaults were: 5 seconds in ext3, in NTFS metadata is always and data flushed asap but with no guarantees. In practice, people don't lose huge amounts of work.

    Actually, I've lost multi-gigabyte files on NTFS; in one particular case I left IE downloading a game installer overnight, heard it beep around 8am to tell me it had completed, and then the power went out a couple of hours later before I got up. The file system was magically 'consistent' after the power came back and it rebooted, but it achieved that by deleting over two gigabytes of my data.

    Modern file systems may be a bit faster than FAT32, but they're shit when it comes to reliably storing data.

    In this case, yes, the KDE developers are retarded, but if the ext4 developers want ext4 to become the default filesystem for Linux, they need to make it work with retarded developers. 'But POSIX says we can do this' is worthless if it loses large amounts of user data; heck, you can easily guarantee 'file system consistency' by simply reformatting the disk on every reboot, but your users would be pretty damn pissed.

  • Re:Bull (Score:3, Interesting)

    by phantomlord (38815) <slashdot@kr w t e c h . com> on Wednesday March 11, 2009 @08:27PM (#27160667) Journal
    I just bought a new laptop that, unfortunately, came pre-installed with Vista. I spent the better part of the day creating settings by hand, tweaking this and that, to get things set up how I wanted them to be. I don't know of any handy way to copy my XP registry over from my old laptop to Vista on the new laptop (I could be wrong, I don't use Windows for anything of importance so I haven't taken the time to learn all the power user tricks). That's to say nothing of all my application settings that were lost since they were written to the registry in my old laptop.

    I installed Linux on it as well. You know what it took to copy over all of my settings and data?
    cd /home
    cp -a /mnt/nfs/home/user .

    <sarcasm>That registry sure does make everything so much easier...</sarcasm> and that cp works even across different architectures, Linux distributions, etc.
  • Re:man 2 fsync (Score:3, Interesting)

    by setagllib (753300) on Wednesday March 11, 2009 @08:46PM (#27160861)

    No, disk caching is now considered the default. Nothing is written until the disk decides it is time, and this is completely up to the disk. It doesn't even have to occur in the same order the writes were issued in, especially with TCQ.

  • Re:Bull (Score:5, Interesting)

    by vadim_t (324782) on Wednesday March 11, 2009 @08:54PM (#27160973) Homepage

    That's all very well, but in most cases, the machine is a single-user desktop, and it's sitting *idle* during the 5-40 seconds in which the filesystem writes are being batched up. It seems to me that the writes should be sent to disk immediately, unless the disk is already busy, in which case they may be delayed for (at most) 40 seconds, while other reads take place ahead of them.

    Mount your filesystem with the "sync" option, that should do what you want I guess. Performance will be bad though.

    There are only two ways to do this: either you do it completely synchronously, and get a guarantee of the write being done when the application is done writing, or you have a delay of arbitrary length. If you have a delay, even if it's 1ms, and you care about the possibility of something going wrong at that moment, the application has to deal with the possibility. Reducing the delay only makes it less likely, but given enough time it'll happen.

    Even doing it fully synchronously you can run into problems. A file can be half written (it's written by the block, after all), and of those 40 files, perhaps one references data in another.

    Point being, if such things are really a problem for the application, the application must do things correctly, by writing to temporary files, renaming, and writing in the right sequence so that even if something is interrupted in the middle the data on disk still makes sense.
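
    The temporary-file-and-rename sequence described above can be sketched roughly like this. It is Python rather than C, and `atomic_write` is a made-up name, but os.fsync() and os.replace() are thin wrappers over fsync() and rename().

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Replace path so a reader sees the old or the new contents, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())    # data reaches the device before the rename
        os.replace(tmp_path, path)    # atomic rename over the old file
    except BaseException:
        try:
            os.unlink(tmp_path)       # do not leave the temp file behind
        except FileNotFoundError:
            pass
        raise
```

    One caveat of this sketch: mkstemp() creates the file with mode 0600, so the replacement does not inherit the old file's permissions or ownership. Real code usually copies those over before the rename.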

    Even if the FS does like you want and starts writing immediately, that won't save you from the fact that it has no clue how your file is internally structured, and will perform writes in fs-sized blocks. So your 10K sized file can be interrupted in the middle and get cut off at 4K in size after a crash. If your application then goes and chokes on that, there's no way the FS can fix that for you.
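
    One way an application can avoid choking on a file that got cut off: store the payload length and a checksum up front and verify both on load. This is only a sketch; the record layout and the names `pack_record` and `unpack_record` are invented for illustration.

```python
import struct
import zlib

HEADER = struct.Struct("<II")   # payload length, CRC32 of payload

def pack_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def unpack_record(blob: bytes):
    """Return the payload, or None if the blob is truncated or corrupt."""
    if len(blob) < HEADER.size:
        return None
    length, checksum = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != checksum:
        return None
    return payload
```

    With this, the 10K file that comes back as 4K after a crash is rejected on load, and the application can fall back to defaults or a backup copy.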

    Also, with a modern SATA disk supporting Native Command Queuing, the OS should immediately write the data to the disk's buffer, and the disk's firmware gets to decide about re-ordering.

    NCQ doesn't take care of half that's needed for safe writing to disk. Two problems for a start:

    1. Your hard disk doesn't know about your filesystem's structure. Unless told otherwise, the HDD will happily reorder writes and update ondisk data first, journal second, leading to disk corruption. The hard disk can't magically figure out what's the right way to write the data so that it remains consistent, only the OS and the application can ensure that.

    2. NCQ is limited to 32 commands anyway, the OS has to do handling on its own anyhow.

    As for the argument about using sqlite - why have yet another abstraction? After all, the filesystem is already a sort of database!

    Because it's a simpler abstraction. If you're not willing to learn or deal with the POSIX semantics, such as fsync and rename, and checking the return code of every system call, you can use something like sqlite that does it internally and saves you the effort, and returns one unique value that tells you whether the whole update worked or not.
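
    For illustration, here is roughly what that looks like with Python's bundled sqlite3 module. The `settings` table and the `save_settings` helper are assumptions for the example, not anything KDE actually uses.

```python
import sqlite3

def save_settings(db_path: str, settings: dict) -> bool:
    """Apply all settings in one transaction; report a single success flag."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS settings "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
                settings.items(),
            )
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```

    If any row fails, none of the rows in that call are applied, so the caller only has one result to check.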

  • by moonbender (547943) <moonbender@NOsPaM.gmail.com> on Wednesday March 11, 2009 @08:57PM (#27161019)

    The rename system call guarantees that at any point in time the name will refer to either the old or the new file. I'm not sure you really need the sync step. I haven't read the spec in that kind of detail, but if that sync step is really necessary I'd call that a design flaw. The file system may delay the write of the file as well as the rename, but it shouldn't perform the rename until the file has actually been written.

    As I understand it, that is EXACTLY what happens. The move/relinking is committed, but the data isn't. If true, a real case of WTF. The relinking should only be executed AFTER the data has been committed to the drive.

  • Re:Bull (Score:3, Interesting)

    by slamb (119285) * on Wednesday March 11, 2009 @09:25PM (#27161235) Homepage

    RTFPS (Read The Fine POSIX Spec).

    I've RTFPS (well, not quite - the Single Unix Specification; where do I find the Fine POSIX Spec free online?).

    I am...dissatisfied with this answer because POSIX appears to provide so few guarantees that applications basically have to assume more than it promises to get anything done. The Linux documentation doesn't appear to promise anything more. For instance,

    • If I create a new file and fsync it, am I guaranteed that it hit disk? (Hint: on Linux this isn't true according to the #ifdef linux block of this file [collab.net]. It says I must fsync the directory, and nothing in POSIX even says it's possible to open() or fsync() a directory; you have to use opendir().)
    • If I overwrite or append just a few bytes of an existing file and lose power before calling fdatasync(), what is guaranteed about the contents of the file? If you say "nothing", the only safe approach to updating anything is to write a complete replacement for the file, fsync() it (but pay attention to the special Linux case described above), and rename() it into place. Of course, that's a pretty significant performance hit and basically screws over any reasonable way of implementing shadow paging or write-ahead logging.
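
    A sketch of the Linux special case from the first bullet: sync the new file, then open its parent directory and sync that too. `create_durably` is a hypothetical name. Opening a directory read-only and calling fsync() on it works on Linux, but as noted above it is not something POSIX promises, so treat it as platform-specific.

```python
import os

def create_durably(path: str, data: bytes) -> None:
    """Create a new file, then flush both the file and its directory entry."""
    with open(path, "xb") as f:      # "x" fails if the file already exists
        f.write(data)
        f.flush()
        os.fsync(f.fileno())         # file contents and inode

    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)             # the new name in the parent directory
    finally:
        os.close(dir_fd)
```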

    So...where is the specification that describes the filesystem's behavior in a useful way?

  • Re:Bull (Score:3, Interesting)

    by slamb (119285) * on Wednesday March 11, 2009 @09:36PM (#27161299) Homepage
    To clarify my own question:

    If I overwrite or append just a few bytes of an existing file and lose power before calling fdatasync(), what is guaranteed about the contents of the file?

    I'd like to know which of the unmodified bytes are guaranteed to be preserved. None of them? All of them? Ones not in the same block as new bytes? (And what's a block? Is it st_blksize, or is it possible that block size varies within the file or changes over time?)
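
    For anyone who wants to see what their system reports, a small sketch (the `block_sizes` helper is made up). POSIX describes st_blksize only as a preferred I/O block size for the object, and says it may vary from file to file on some filesystems, so none of these numbers is a promise about what survives a crash.

```python
import os

def block_sizes(path: str) -> dict:
    """Collect the block-size values the OS reports for a file."""
    st = os.stat(path)
    vfs = os.statvfs(path)
    return {
        "st_blksize": st.st_blksize,   # preferred I/O size hint for this file
        "f_bsize": vfs.f_bsize,        # filesystem block size
        "f_frsize": vfs.f_frsize,      # fragment size
    }
```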

"I've seen it. It's rubbish." -- Marvin the Paranoid Android
