Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4
cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T : Headline clarified to dispel the impression that this was a fault in Ext4.
Re:Not a bug (Score:3, Interesting)
Theory doesn't matter; practice does (Score:3, Interesting)
So, POSIX never guarantees your data is safe unless you call fsync(). So ext3 was never 100% safe either. So it's the applications' fault for truncating files before writing.
But it doesn't matter what POSIX says. It doesn't matter where the fault lies. To the users, a system either works or it doesn't, as a whole.
EXT4 aims to replace EXT3 and become the next-generation de facto filesystem on the Linux desktop. So it has to compete with EXT3 in all regards; not just performance, but data integrity and reliability as well. If in common scenarios people lose data on EXT4 but not EXT3, the blame is on EXT4. Period.
It's the same thing that a kernel does. You have to COPE with crappy hardware and user applications, because that's your job.
Excuses are false. This is a severe flaw. (Score:3, Interesting)
This is all a bunch of BS! Delayed writes should lose at most the data written between the last commit and the crash. Ext4 loses complete files (even their content from before the write).
ZFS can do it: it writes the whole transaction to disk or rolls back in case of a crash, so why not ext4? These lame excuses that this is totally "expected" behavior are a shame!
Re:Not a bug (Score:5, Interesting)
I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
It is perfectly OK to read and write thousands of tiny files. Unless the system is going to crash while you're doing it and you somehow want the magic computer fairy to make sure that the files are still there when you reboot it. In that case, you're going to have to always write every single block out to the disk, and slow everything else down to make sure no process gets an "unreasonable" expectation that their data is safe until the drive catches up.
Fortunately his patches will include an option to turn the magic computer fairy off.
not mounted sync,dirsync? (Score:5, Interesting)
When I write data to a file (either through a descriptor or a FILE *), I expect it to be stored on media at the earliest practical moment. That doesn't mean "immediately", but 150 seconds is brain-damaged. Regardless of how many reads are pending, writes must be scheduled, at least in proportion to the overall filesystem activity, or you might as well run on a ramdisk.
While reading/writing a flurry of small files at application startup is sloppy from a system performance point of view, data loss is not the application developers' fault, it's the file system designers'.
BTW, I write drivers for a living, have tuned SMD drive format for performance, and written microkernels, so this is not a developer's rant.
Re:Classic tradeoff (Score:3, Interesting)
Even if you use O_SYNC, or fsync() there is no guarantee that the data are safely stored on disk.
You also have to disable the drive's write cache, e.g. for /dev/hda1:
hdparm -W0 /dev/hda1
Re:Bull (Score:2, Interesting)
Re:Not a bug (Score:3, Interesting)
That's your filesystem definition. Even so, I can guarantee you it can't be built; thus, by your definition, no filesystem will ever be bug-free.
How come?
I open a file
I write one byte
I close the file
Data is not on disk BECAUSE THE DISK WAS FULL and you failed to check for errors and warnings.
Filesystems need to be used according to their specifications, not the way you'd like them to work.
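To make the point concrete, here is a minimal Python sketch (the function name careful_write is mine, not from any post above) of open/write/close done with the error checking the poster is talking about, so that a full disk surfaces as an error instead of silent data loss:

```python
import os
import tempfile

def careful_write(path, data):
    """Open, write, close -- but check every step, so ENOSPC and
    friends surface instead of silently losing data."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(data):
            # os.write may accept fewer bytes than requested; loop until done.
            written += os.write(fd, data[written:])
        os.fsync(fd)  # push the data to the device; raises OSError on failure
    finally:
        os.close(fd)  # a failing close can also signal a lost write
    return written
```

If the disk is full, this raises OSError (ENOSPC) rather than pretending the byte was stored.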
Re:Bull (Score:1, Interesting)
If I want asynchronous or lazy writes to the disk, I'll code that myself. The filesystem should be hitting the metal about 0.001 microseconds after I call write() or whatever the function is. I cannot believe this is actually being spun as a "feature" that application developers should code against. It's just mind boggling.
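For what it's worth, POSIX already offers roughly this behavior: opening with O_SYNC makes each write() complete through to the device before returning. A small Python sketch, assuming a platform such as Linux that exposes os.O_SYNC (the file path is made up):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sync_demo")

# O_SYNC: each write() returns only after the data (and the metadata
# needed to read it back) has been handed to the storage device.
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
os.write(fd, b"hit the metal\n")
os.close(fd)
```

Performance suffers accordingly, which is exactly the trade-off the delayed-write defaults are avoiding.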
Re:Actually, no. (Score:3, Interesting)
As a user of a framework that doesn't suck, I don't have to worry about this problem. When I need to write a file in such a way that the entire operation either succeeds, or the entire operation fails (a common requirement), the framework I use provides a flag that I can set on the write operation to do all of the write/rename juggling that needs to happen, according to POSIX, to make it work. As such, my code will work happily on any filesystem that doesn't break the spec.
If you are using a high-level language with a low-level framework, you might want to reconsider your approach.
Things that should be improved ... (Score:2, Interesting)
Bah. Maybe all computers should come with a single-cell battery, for a couple of minutes of backup power.
As soon as power fails to the system and it resorts to battery, all calls to write() should also call fsync(), even if that slows the system down.
How about an option that implicitly calls fsync() if it hasn't been called in the past 3 seconds, for a minimal performance hit? If you have a specific application that doesn't want fsync(), you can disable that feature; but on a consumer box with no UPS and potentially dodgy hardware and drivers, it clearly makes sense. 150 seconds without a sync, just dumping into a buffer for writing ... sheesh.
Re:Bad defaults (Score:3, Interesting)
The old defaults were: 5 seconds in ext3; in NTFS, metadata is always journaled and data is flushed ASAP, but with no guarantees. In practice, people don't lose huge amounts of work.
Actually, I've lost multi-gigabyte files on NTFS; in one particular case I left IE downloading a game installer overnight, heard it beep around 8am to tell me it had completed, and then the power went out a couple of hours later before I got up. The file system was magically 'consistent' after the power came back and it rebooted, but it achieved that by deleting over two gigabytes of my data.
Modern file systems may be a bit faster than FAT32, but they're shit when it comes to reliably storing data.
In this case, yes, the KDE developers are retarded, but if the ext4 developers want ext4 to become the default filesystem for Linux, they need to make it work with retarded developers. 'But POSIX says we can do this' is worthless if it loses large amounts of user data; heck, you can easily guarantee 'file system consistency' by simply reformatting the disk on every reboot, but your users would be pretty damn pissed.
Re:Bull (Score:3, Interesting)
I installed Linux on it as well. You know what it took to copy over all of my settings and data?
cd
cp -a
<sarcasm>That registry sure does make everything so much easier...</sarcasm> and that cp works even across different architectures, Linux distributions, etc.
Re:man 2 fsync (Score:3, Interesting)
No, disk caching is now considered the default. Nothing is written until the disk decides it is time, and that is entirely up to the drive. Writes don't even have to occur in the order they were issued, especially with TCQ.
Re:Bull (Score:5, Interesting)
Mount your filesystem with the "sync" option, that should do what you want I guess. Performance will be bad though.
There are only two ways to do this: either you do it completely synchronously, and get a guarantee of the write being done when the application is done writing, or you have a delay of arbitrary length. If you have a delay, even if it's 1ms, and you care about the possibility of something going wrong at that moment, the application has to deal with the possibility. Reducing the delay only makes it less likely, but given enough time it'll happen.
Even doing it fully synchronously you can run into problems. A file can be half written (it's written by the block, after all), and of those 40 files, perhaps one references data in another.
Point being, if such things are really a problem for the application, the application must do things correctly, by writing to temporary files, renaming, and writing in the right sequence so that even if something is interrupted in the middle the data on disk still makes sense.
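That write-to-a-temp-file-then-rename sequence can be sketched in a few lines of Python; atomic_replace is a made-up name, and the fsync before the rename is the step this whole thread is about:

```python
import os
import tempfile

def atomic_replace(path, data):
    """Crash-safe replacement: after a crash the file at `path` holds
    either the old contents or the new ones, never a truncated mix."""
    directory = os.path.dirname(path) or "."
    # The temp file must live on the same filesystem for rename() to be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, data)
        os.fsync(fd)  # ensure the new contents reach disk first...
    finally:
        os.close(fd)
    os.rename(tmp, path)  # ...then atomically swap the name over
```

Skip the fsync and a filesystem that commits the rename before the data can leave you with the zero-length files the bug report describes.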
Even if the FS does what you want and starts writing immediately, that won't save you from the fact that it has no clue how your file is internally structured, and will perform writes in filesystem-sized blocks. So your 10K file can be interrupted in the middle and get cut off at 4K after a crash. If your application then goes and chokes on that, there's no way the FS can fix that for you.
NCQ doesn't take care of half that's needed for safe writing to disk. Two problems for a start:
1. Your hard disk doesn't know about your filesystem's structure. Unless told otherwise, the HDD will happily reorder writes and update on-disk data first, journal second, leading to disk corruption. The hard disk can't magically figure out the right way to write the data so that it remains consistent; only the OS and the application can ensure that.
2. NCQ is limited to 32 commands anyway, so the OS has to do its own queue handling anyhow.
Because it's a simpler abstraction. If you're not willing to learn or deal with the POSIX semantics, such as fsync and rename, and to check the return code of every system call, you can use something like SQLite, which does it all internally, saves you the effort, and returns a single value that tells you whether the whole update worked or not.
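As a sketch of that suggestion, SQLite hides the whole fsync/journal dance behind a transaction: either every statement in the block below lands on disk, or none does (the database file name is made up):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "settings.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT)")

# The `with` block is one transaction: both rows commit together,
# or (on an exception, or a crash mid-commit) neither does.
with conn:
    conn.execute("INSERT INTO config VALUES ('theme', 'dark')")
    conn.execute("INSERT INTO config VALUES ('font', 'mono')")
conn.close()
```

That single commit replaces the dozens of small config-file writes that got KDE into trouble here.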
Re:Works as expected... (Score:3, Interesting)
The rename system call guarantees that at any point in time the name will refer to either the old or the new file. I'm not sure you really need the sync step. I haven't read the spec in that kind of detail, but if that sync step is really necessary I'd call that a design flaw. The file system may delay the write of the file as well as the rename, but it shouldn't perform the rename until the file has actually been written.
As I understand it, that is EXACTLY what happens. The move/relinking is committed, but the data isn't. If true, a real case of WTF. The relinking should only be executed AFTER the data has been committed to the drive.
Re:Bull (Score:3, Interesting)
I've RTFPS (well, not quite - the Single Unix Specification; where do I find the Fine POSIX Spec free online?).
I am...dissatisfied with this answer because POSIX appears to provide so few guarantees that applications basically have to assume more than it promises to get anything done. The Linux documentation doesn't appear to promise anything more. For instance,
So...where is the specification that describes the filesystem's behavior in a useful way?
Re:Bull (Score:3, Interesting)
I'd like to know which of the unmodified bytes are guaranteed to be preserved. None of them? All of them? Ones not in the same block as new bytes? (And what's a block? Is it st_blksize, or is it possible that block size varies within the file or changes over time?)