Apps That Rely On Ext3's Commit Interval May Lose Data In Ext4 830
cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T : Headline clarified to dispel the impression that this was a fault in Ext4.
Bull (Score:4, Insightful)
Works as expected... (Score:5, Insightful)
The problem here is that delaying writes speeds things up greatly but has this possible side-effect. For a shorter commit time, simply stay with ext3. You can also mount your filesystems "sync" for a dramatic performance hit, but no write delay at all.
Anyway, with modern filesystems data does not go to disk immediately unless you take additional measures, like a call to fsync. This should be well known to anybody who develops software and is really not a surprise. It has been done like that on server OSes for a very long time. Also note that there is no loss of data older than the write delay period, and this only happens on a system crash or power failure.
Bottom line: Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.
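For the curious, the "additional measure" mentioned above is a single call. A minimal sketch using Python's wrappers around the same POSIX calls (the filename and data are illustrative):

```python
import os

def write_durably(path, data):
    """Write data and force it to stable storage before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # flush the page cache to disk now, instead of
                      # waiting out the filesystem's commit interval
    finally:
        os.close(fd)

write_durably("settings.tmp", b"key=value\n")
```

Without the fsync, the write is only in memory until the filesystem's next commit; with it, the call does not return until the kernel has pushed the data to the device.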
Re:Not a bug (Score:5, Insightful)
I disagree. "Writing software properly" apparently means taking on a huge burden for simple operations.
Quoting T'so:
"The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, and that it uses fdatawrite() instead of fsync() to guarantee that data is written on disk. If sqllite had been properly written so that it grabbed new space for its database storage in chunks of 16k or 64k, and released space when it was no longer needed in similar large chunks via truncate(), and if it used fdatasync() instead of fsync(), the performance problems with FireFox 3 wouldn't have taken place."
In other words, if the programmer took on the burden of tons of work and complexity in order to replicate lots of the functionality of the file system and make it not the file system's problem, then it wouldn't be my problem.
I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
File systems are nice. That's what Unix is about.
I don't think programmers ought to be required to treat them like a pouty flake: "in some cases, depending on the whims of the kernel and entirely invisible moods, or the way the disk is mounted that you have no control over, stuff might or might not work."
Re:If in other "modern" filesystems.... (Score:4, Insightful)
Re:Not a bug (Score:3, Insightful)
lol.
It's a consequence of a filesystem that makes bad assumptions about file size.
I suppose in your world, you open a single file the size of the entire filesystem and just do seek()s within it?
It's a bug. A filesystem which does not responsibly handle any file of any size between 0 bytes and MAXFILESIZE is bugged.
Deal with it and join the rest of us in reality.
Classic tradeoff (Score:5, Insightful)
It's amazing how fast a filesystem can be if it makes no guarantees that your data will actually be on disk when the application writes it.
Anyone who assumes modern filesystems are synchronous by default is deluded. If you need to guarantee your data is actually on disk, open the file with O_SYNC semantics. Otherwise, you take your chances.
Moreover, there's no assertion that the filesystem was corrupt as a result of the crash. That would be a far more serious concern.
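A sketch of the O_SYNC approach mentioned above, again via Python's wrappers around the POSIX flags (filename is illustrative):

```python
import os

# Every write() on this descriptor blocks until the data (and the metadata
# needed to retrieve it) has reached stable storage -- no fsync() required.
fd = os.open("audit.log",
             os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC, 0o644)
os.write(fd, b"entry 1\n")
os.write(fd, b"entry 2\n")
os.close(fd)
```

The tradeoff is that every write pays the full disk-latency cost, which is why it is reserved for data you genuinely cannot lose.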
Re:Exactly (Score:5, Insightful)
Re:Works as expected... (Score:5, Insightful)
Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.
You're right, there really is nothing to see here. Or rather, there's nothing left. As the article says, a large number of configuration files are opened and written to as KDE starts up. If KDE crashes and takes the OS with it (as it apparently does), those configuration files may be truncated or deleted entirely -- the commands to re-create and write them having never been sync'd to disk. As the startup of KDE takes longer than the write delay, it's entirely possible for this to seriously screw with the user.
The two problems are:
1. Bad application development. Don't delete and then re-create the same file. Use atomic operations that ensure that files you are reading/writing to/from will always be consistent. This can't be done by the Operating System, whatever the four color glossy told you.
2. Bad Operating System development. If an application kills the kernel, it's usually the kernel's fault (crashes in drivers and other code operating in privileged space are obviously not the kernel's fault) -- and this appears to be a crash initiated from code running in user space. Bad kernel, no cookie for you.
Re:Not a bug (Score:5, Insightful)
Re:If in other "modern" filesystems.... (Score:4, Insightful)
I'll take "I didn't lose my data" over "ext4 runs 1.5x faster than ext3," thank you. What use is performance if I have to be absolutely certain the system won't crash, or else lose my (very-high-performance-filesystem) data?
Also, ext4 is touted as having additional reliability checks to keep up with scalability, etc. -- not as trading reliability for performance.
Reliability
As file systems scale to the massive sizes possible with ext4, greater reliability concerns will certainly follow. Ext4 includes numerous self-protection and self-healing mechanisms to address this.
(from Anatomy of ext4 [ibm.com])
I can only imagine the response if tests were done on Windows 7 beta that showed a crash after this or that resulted in loss of data. :)
Re:Not a bug (Score:3, Insightful)
Translation: "Our filesystem is so fucked up, even SQL is better."
WTF is this guy thinking? UNIX has used hundreds of tiny dotfiles for configuration for years and it's always worked well. If this filesystem can't handle it, it's not ready for production. Why not just keep ALL your files in an SQL database and cut out the filesystem entirely?
Re:Exactly (Score:5, Insightful)
The problem is not the many small files, but the missing disk sync. The many small files just make the issue more obvious.
True, with ext4 this is more likely to cause problems, but any delayed write can cause this type of issue when no explicit flush-to-disk is done. And let's face it: fsync/fdatasync are not really a secret to any competent developer.
What however is a mistake, and a bad one, is making ext4 the default filesystem at this time. I say give it another half year, for exactly this type of problem.
Re:Bull (Score:5, Insightful)
The journal isn't being written before the data. Nothing is written for periods of 45-120 seconds, so as to batch up the writing into efficient lumps. The journal is there to make sure that the data on disk makes sense if a crash occurs.
If your system crashes after a write hasn't hit the disk, you lose either way. Ext3 was set to write at most 5 seconds later. Ext4 is looser than that, but with associated performance benefits.
Re:Not a bug (Score:5, Insightful)
The benefit of journaling file systems is that after the crash you still have a file system that works. How many folks remember when Windows would crash, resulting in an HDD so corrupted the OS wouldn't start? Same with ext2.
If these folks don't like asynchronous writes, they can edit their fstab (or whatever) to have the sync option so all their writes will be synchronous and the world will be a happy place.
Note that they will also have to suffer a slower system, and possibly a shortened lifetime of their HDD, but at least their configuration files will be safe.
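The fstab change being described is a single mount option; a sketch (device and mount point are illustrative):

```
# /etc/fstab -- the "sync" option forces synchronous writes on this filesystem
/dev/sda2  /home  ext4  defaults,sync  0  2
```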
Re:If in other "modern" filesystems.... (Score:3, Insightful)
Re:Not a bug (Score:3, Insightful)
Instead, the answer is to use a proper small database like sqllite for application registries
Yeah, linux should totally put in a Windows style registry. What the fuck is this guy on.
Re:Not a bug (Score:5, Insightful)
I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
To paraphrase https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 [launchpad.net] : You certainly can use tons of tiny files, but if you want to guarantee your data will still be there after a crash, you need to use fsync. And if that causes performance problems, then perhaps you should rethink how your application is doing things.
Re:Bull (Score:1, Insightful)
The longer you delay allocation after writing the journal (and Ext4 seems to take this to extremes), the more chance there is of something -- almost anything, really -- going wrong between the time the journal is written and the files being written. And here is just such a case of something changing state (whether it should or not) between those times. You may call it an anomaly, but a competent engineer would have to expect this to occur.
Re:Bull (Score:5, Insightful)
Rewriting the same file over and over is known to be risky. The proper sequence is to create a new file, sync, rename the new file on top of the old one, and optionally sync again. In other words, app developers must be more careful about what they do, not put all the blame on the filesystem. There's only so much an fs can do to avoid such brouhahas. Many other filesystems have behavior similar to ext4's, btw.
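That sequence can be sketched with Python's POSIX wrappers (names are illustrative); the final directory sync is the optional step that makes the rename itself durable:

```python
import os

def atomic_replace(path, data):
    """Replace a file so readers always see either the old or the new version."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # new file's data is on disk before the swap
    finally:
        os.close(fd)
    os.rename(tmp, path)      # atomic: no window where the file is missing
    # optional: sync the directory so the rename survives a crash too
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

atomic_replace("config.ini", b"[ui]\ntheme=dark\n")
```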
Re:Works as expected... (Score:4, Insightful)
I agree on both counts. Some comments:
1) The right sequence of events is this: Rename the old file to a backup name (atomic). Write the new file, sync the new file, and then delete the backup file. For anything critical, it is better to keep the backup. In any case, an application should offer to recover from the backup if the main file is missing or broken. To this end, add a clear end-mark that allows checking whether the file was written completely. Nothing new or exciting, just stuff any good software developer knows.
2) Yes, a kernel should not crash. Occasionally it happens nonetheless. It is important to notice that ext4 is blameless in the whole mess (unless it causes the crash).
Re:Not a bug (Score:5, Insightful)
As an application developer, the last thing I want to worry about is whether or not the fraking filesystem is going to persist my data to disk.
As an application developer, you are expected to know what the API does, in order to use it correctly. What Ext4 is doing is 100% respectful of the spec.
Re:Not a bug (Score:5, Insightful)
It seems exceedingly odd that issuing a write for a non-zero-sized file and having it delayed causes the file to become zero-size before the new data is written.
Generally, when one is trying to maintain correctness, one allocates space, places the data into it, and only then links the space into place (paraphrased from Barry Dwyer's "One more time - how to update a master file", Communications of the ACM, January 1981).
I'd be inclined to delay the metadata update until after the data was written, as Mr. Ts'o notes was done in ext3. That's certainly what I did back in the days of CP/M, writing DSA-formatted floppies (;-))
--dave
Re:Not a bug (Score:2, Insightful)
Beyond that, he's essentially advocating the Windows Registry. He's a very smart person, but Unix is about dot files. If you take them away, you take away the "Unixness" of the machine. I don't care if a filesystem isn't pleased by hundreds or thousands of tiny config files. That's how the machine works. Make your filesystem handle it.
Cordially,
An ext4 user.
Translation (Score:4, Insightful)
We use techniques that show great performance so people can see we beat ext3 and other filesystems.
Oh shit, as a tradeoff we lose more data in case of a crash. But it's not our fault.
Honestly, you cannot eat your cake and have it too.
Re:Bull (Score:2, Insightful)
The problem is KDE not doing syncs and not keeping backups when updating critical files. Any competent implementor will try to keep updates of critical files to a minimum, and if they have to be done, do them carefully. Seems to me the KDE folks have to learn a basic lesson in robustness now.
Re:Not a bug (Score:2, Insightful)
Re:Not a bug (Score:5, Insightful)
No. It's not.
If what you say is true there would be no need for the fsync() function (and related ones).
Read the standards if you want. The filesystem is only bugged if it loses recent data under conditions where the application has asked it to guarantee that the data is safe. If the app hasn't asked for any such guarantee by calling fsync() or the like, the filesystem is free to do as it likes.
Re:Not a bug (Score:2, Insightful)
You're wrong, and so are most comments here.
When you open() a file on the filesystem, write() one byte to it, and close() that file, you haven't really guaranteed crap on any normal filesystem, unless you're using a very strange filesystem or non-standard mount options that force every action to happen synchronously.
If a crash happens between close() and the filesystem flushing data to disk, you will lose data. If you want to prevent this happening, you must either use calls like fsync() or fdatasync() (or many other mechanisms that act similarly), or use mount options that make all calls synchronous.
The only reason this has become a big blow-up issue with ext4 is that while other filesystems generally would sync the data shortly anyway, ext4 does not. Everyone has been relying on bad assumptions about filesystem behavior and getting by on the fact that "usually" the situation was resolved "somewhat quickly". ext4 does not resolve these things quickly, in the name of efficiency and performance. There was never a guarantee under any filesystem of things getting done (to disk) quickly unless you explicitly ask for it.
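To make the distinction concrete, a sketch via Python's POSIX wrappers (Linux-specific availability of fdatasync; filename is illustrative): close() alone guarantees nothing, while fdatasync() flushes the file's data without forcing out non-essential metadata such as timestamps.

```python
import os

fd = os.open("cache.dat", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"payload")
# close() alone would leave the data in the page cache until the next commit.
# fdatasync() flushes the data itself (skipping non-essential metadata),
# which is all most applications actually need:
os.fdatasync(fd)
os.close(fd)
```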
Re:Classic tradeoff (Score:3, Insightful)
Even if you use O_SYNC or fsync(), there is no guarantee that the data is safely stored on disk.
You also have to disable the HDD's write cache, e.g., hdparm -W0 /dev/hda
Well, yes, but unless you have an extreme write pattern, the disk will not take long to flush to the platter. And this will only result in data loss on power failure. If that is really a concern, get a UPS.
Re:Theory doesn't matter; practice does (Score:5, Insightful)
This is the attitude that has the web stuck with IE.
There's a standard out there called POSIX. It's just like an HTML or CSS standard. If everyone pays attention to it, everything works better. If you fail to pay attention to it for your bit (writing files or writing web pages), it's not *my* fault if my conforming implementation (implementing the writing or the rendering) doesn't magically fix your bugs.
Alarmist and ignorant article - not a "problem" (Score:5, Insightful)
*No* modern, desktop-usable file system today guarantees new files will be there if the power goes out, except if the application specifically requests it with O_SYNC, fsync() and similar techniques (and then only "within reason" -- at most they guarantee that the file system will recover itself, not the data). It is universally true -- for UFS (the Unix file system), ext2/3, JFS, XFS, ZFS, reiserfs, NTFS, everything. This can only be a topic for inexperienced developers who don't know the assumptions behind the systems they use.
The same is true for data ordering -- only by separating the writes with fsync() can one piece of data be forced to be written before another.
This is an issue of great sensitivity for databases. See for example:
That there exist reasonably reliable databases is a testament that it *can* be done, with enough understanding and effort, but is not something that's automagically available.
Re:Not a bug (Score:2, Insightful)
Filesystems that cannot handle thousands of tiny files efficiently are completely broken. I think the Linux filesystem people have been complete idiots for years for not considering this use case to be worth it. Too many big iron database vendors whispering in their ears apparently.
I want to be able to use the filesystem to appropriately name and reference my data. I do not want to have to rely on some completely different set of tools to actually see what data I have stored on my filesystem. If that's the case, I'll just use LVM for my 'filesystem' and use something vaguely decent to actually hold my data and use those tools instead of the Unix filesystem tools.
Now, those applications that are broken because they are written incorrectly should be re-written so they are correct and, coincidentally, god-awful slow on ext4. Then maybe the designers of ext4 will get a clue and actually write a filesystem, instead of a glorified version of LVM with a fancy hierarchical namespace for partitions instead of the flat one LVM has.
Re:Not a bug (Score:3, Insightful)
I disagree. "Writing software properly" apparently means taking on a huge burden for simple operations.
No. Writing software properly means calling fsync() if you need a data guarantee.
Pretty sure no one in their right mind would call using fsync() barriers a "huge burden". There are an enormous number of programs out there that do this correctly already.
And then there are some that don't. Those have problems. They're bugs. They need to be fixed. Fixing bugs is not a "huge burden", it's a necessary task.
Re:Not a bug (Score:2, Insightful)
So the way I understand these comments, the file system has been written to be very fast by delaying certain operations, and it succeeds, except that in case of a crash your hard drive is in a very undesirable state. Programmers can do something about this, but the consequence is that performance drops through the floor. So the file system is fast with unsafe applications, and dead slow with safe ones. Nice.
Re:Actually, no. (Score:4, Insightful)
If those high level wrappers do not exist, then do not blame the API developers for you not knowing how they work.
Re:Theory doesn't matter; practice does (Score:3, Insightful)
"The machine crashed" isn't a common situation. In fact, it's a very, very rare situation.
Re:Bull (Score:5, Insightful)
Bullshit. It is not a filesystem limitation. POSIX tells you what you can expect from file system calls. Data committed to disk as soon as an fwrite or fclose returns is not something you can or should expect. (And this is true of every OS I've used in the last 20 years.)
A great many crap programmers think APIs ought to do what they'd like them to. But APIs don't. At best they do what they are specified to do.
Re:Not a bug (Score:5, Insightful)
UNIX filesystems have used tiny files for years, and they've had data loss under certain conditions. My favorite example is XFS, which would journal just enough to give you a consistent filesystem full of binary nulls on power failure. This behavior was even documented in their FAQ with the reply "If it hurts, don't do it."
Filesystems are a balancing act. If you want high performance, you want write caching to allow the system to flush writes in parallel while you go on computing, or make another overlapping write that could be merged. If you want high data security, you call fsync and the OS does its best possible job to write to disk before returning (modulo hard drives that lie to you). Or you open the damn file with O_SYNC.
What he's suggesting is that the POSIX API allows either option to programmers, who often don't know there's even a choice to be had. So he recommends having the few people who do know the API inside and out concentrate on system libraries like SQLite, and having dumbass programmers use those instead. You and he may not be so far apart, except his solution still allows hard-nosed engineers access to low-level syscalls, at the price of shooting their foot off.
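The library route might look like this sketch using SQLite from Python (the table and keys are made up): one transaction covers many small updates, so the sync cost is paid once instead of per file.

```python
import sqlite3

con = sqlite3.connect("registry.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)")
with con:  # one transaction: all rows become durable together, or none do
    con.execute("INSERT OR REPLACE INTO settings VALUES (?, ?)",
                ("theme", "dark"))
    con.execute("INSERT OR REPLACE INTO settings VALUES (?, ?)",
                ("font", "mono"))
row = con.execute(
    "SELECT value FROM settings WHERE key = 'theme'").fetchone()
con.close()
```

The library, not each application, is then responsible for getting the sync calls right.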
Re:Not a bug (Score:4, Insightful)
In other words, if the programmer took on the burden of tons of work and complexity in order to replicate lots of the functionality of the file system and make it not the file system's problem, then it wouldn't be my problem.
I couldn't agree more. A filesystem *is* a database, people. It's a sort of hierarchical one, but a database nonetheless.
It shouldn't care if there's some mini-SQL app sitting on top, providing another speed hit and layer of complexity, or just a bunch of apps making hundreds of f{read|write|open|close|sync}() calls against hundreds of files. Hundreds of files, while cluttered, are very simple and easily debugged/fixed when something gets trashed. Some sort of obfuscated database cannot be fixed with mere vi. (Emacs, maybe, but only because it probably has 17 database repair modules built in, right next to the 87 kitchen sinks that are also included.)
I do rather agree that it's not a bug. An unclean shutdown is an unclean shutdown, and Ts'o is right - there's not a defined behaviour. Ext4 is better at speed, but less safe in an unstable environment. Ext3 is safer, but less speedy. It's all just trade-offs, folks. Pick one appropriate to your use. (Which is why, when I install Jaunty, I'll be using Ext3.)
Re:Excuses are false. This is a severe flaw. (Score:3, Insightful)
Re:Bull (Score:5, Insightful)
The filesystem should be hitting the metal about 0.001 microseconds after I call write() or whatever the function is.
If that's the behavior you expect, then you need to be running your apps under an OS like DOS, not POSIX or Windows (which both clearly specify that this is *not* how they function).
To Anonymous Coward: (Score:2, Insightful)
That still does not make it any less of a filesystem limitation! Are we speaking the same language?
Re:Exactly (Score:5, Insightful)
"And lets face it: fsync/fdatasync are not really a secret to any competent developer."
I disagree. Users of high-level languages (especially those that are cross-platform) are not necessarily aware of this situation, and arguably should not need to be.
And I disagree with your disagreement. This is something any competent developer has to know. There are fundamental limits in practical computing. This is one. It cannot be hidden without dramatic negative effects on performance. It is not a platform-specific problem. It is not a language-specific problem. It is not a hidden issue. A simple "man close" will already tell you about it. Any decent OS course will cover the issue.
I reiterate: Any good developer knows about write-buffering and knows at least that extra measures have to be taken to ensure data is on disk. Those that do not are simply not good developers.
Re:Bull (Score:5, Insightful)
Does anyone else think that 150 seconds is a bit over the top in terms of writing to disk?
I could understand one or two seconds as you speculate more data might come that needs to be written.
5 seconds is a bit iffy, as with ext3.
150 seconds? That's surely a bug.
Re:Bull (Score:3, Insightful)
Why should synchronous writes be the default ? Programmers are already too lazy and/or stupid to add a simple fsync() where needed, why should we all drop what we're doing, make the slowest option the default, and then have to jump through hoops to make things workable again ?
If asynchronous writes are the biggest of your problems, you need to find yourself a new career. One that hopefully doesn't require meticulous attention to detail.
Re:Bull (Score:1, Insightful)
You mean unlike those Windows "Admins" who tell me how great the Windows "Event Manager" (log files + viewer) is, and say "Ha, I bet Linux does not have such a great tool!"? I had to explain to them that Unix had those features before Windows even existed. One of them told me I was talking shit.
Then he tried to enter "some obscure command" into the "black command window", that someone told him would create a VPN. What he meant were a few routing commands at the shell.
And this is not rare. Rather, it is the normal case with "Windows Admins".
Of course we got our POSIX ACLs and security labels. And PaX, RSbac, SElinux, GRsecurity, LDAP, PAM. And whatever the fuck you want.
And of course, "Active Directory" is -- again -- just a fancy name for a bad copy of those technologies, that existed in Linux/Unix for years before they were "invented" by Microsoft.
So I ask you: Who does not know jack?
Re:Theory doesn't matter; practice does (Score:4, Insightful)
Apparently, you don't know real life.
Does POSIX tell you what happens if your OS crashes? That's right, it says "undefined". Oops, sorry, it's too hard a problem and we'll just leave it to you OS implementers.
Asking everyone to use fsync() to ensure their data is not lost is insane. Nobody wants to pay that kind of performance penalty unless the data is very critical.
Normal applications have a reasonable expectation that the OS doesn't crash, or doesn't crash often enough for this to be a big problem. However, shit happens, and people scream loudly if their data is lost beyond reasonable expectations.
Forget POSIX. It's irrelevant in the real world. It's exactly this pragmatic attitude that brought Linux to its current state.
Re:Not a bug (Score:3, Insightful)
Indeed. And that is what the suggestion about using a database was all about. You can still use all the tiny files. And there are better options than syncing for reliability. For example, rename the file to a backup and then write a new file. The backup will still be there and can be used for automated recovery. Come to think of it, any decent text editor does it that way.
Truncating critical files without a backup is just incredibly bad design.
Re:Bull (Score:1, Insightful)
Further, it appears that in the name of "efficiency", for a given execution thread Ext4 does not queue disk I/O calls chronologically, as it should. (I.e., it does not delay calls for data that has not yet been flushed to disk.) That is a design decision and most definitely has to do with the filesystem.
Could I write a better one? Likely not, but that is irrelevant. I do not manufacture automobiles either but I know when mine is not working the way it should.
Re:Bull (Score:3, Insightful)
Re:Not a bug (Score:4, Insightful)
You're welcome to write lots of little files. It will just be slow if you sync them all, or unsafe if you don't.
Same way a database will tell you to wrap lots of actions in a single transaction if you don't want the cost of a full commit after each action.
Except the filesystem API doesn't have any way to say "commit these 500 little files in a single transaction", unfortunately.
Annoyingly, it also doesn't have "unlink this directory and the files inside it in a single transaction", because unlink performance blows goats.
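Lacking such a transaction call, the closest POSIX approximation is a per-file fsync plus one sync of the containing directory; a sketch with illustrative names:

```python
import os

def sync_batch(dirpath, files):
    """Write several small files and make them (and their dir entries) durable."""
    for name, data in files.items():
        fd = os.open(os.path.join(dirpath, name),
                     os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)      # one flush per file -- there is no way to batch these
        finally:
            os.close(fd)
    dfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dfd)         # the directory entries become durable together
    finally:
        os.close(dfd)

os.makedirs("conf.d", exist_ok=True)
sync_batch("conf.d", {"a.conf": b"1\n", "b.conf": b"2\n"})
```

This is still N+1 flushes, not one transaction, which is exactly the cost being complained about.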
Comment removed (Score:5, Insightful)
Re:Bull (Score:3, Insightful)
Oh great... basing ext4 performance gains on caching writes in the OS for 2 minutes just means they will focus their optimizations in ways that will suck even worse than ext3 does for applications that can't afford the risk of enabling write caching...
Re:Actually, no. (Score:3, Insightful)
Re:Exactly (Score:3, Insightful)
You are an idiot. The design of the POSIX API dictates that fsync (or equivalent) is required to ensure data is flushed to disk. This has been true forever. If an abstraction in an i/o library is not using the API correctly, it is the fault of the library.
You are correct that the user of the abstraction should not care, but you are putting the blame in the wrong place. The whole point of using an abstraction is to hide details such as this. If the library author is too stupid to learn the API he is abstracting that is HIS fault.
Re:Bull (Score:4, Insightful)
It's not a KDE issue. It's not a Gnome issue.
It's a file system risk issue, and it affects everything running on the box.
The ext4 developers have decided it's OK to increase the risk window by 3000% -- to two and a half minutes -- in an attempt to gain a little performance. (Damn little performance.)
With ext3 the risk window was 5 seconds. Now it's 150 seconds.
It's ridiculous to move what should be a low-level data-integrity function out of the file system and inflict it on user-land code.
Re:Bull (Score:5, Insightful)
Ahh yes, I love developers like you. You assume your app is the only one running, and it must have full access to the entire IO bandwidth an HDD can provide.
And then an antivirus program updates while Firefox is starting and a video is transcoding, and your program either slows to a crawl or crashes after 30 seconds of not receiving or being able to write any data.
Recently I was playing Left4Dead when one of my HDDs in my RAID array died in a very audible way. All the drives spun down, then 3 of them came back online. IOPS went to zero for over 60 seconds. No data in or out to those devices!
Interestingly, Ventrilo kept running fine. Left4Dead completely froze, but a minute or so after the 3 drives came back online, it unfroze. (CPU catching up?) All the while I was freaking out on Ventrilo, much to my friends' amusement.
Pretty much everything else crashed, except for Portable Firefox... uTorrent crashed, but first it left corrupted files all over - appearing as undeletable folders, which require a format to remove.
Time for a disk wipe. Thank you, shitty developers! Next time, use the API properly, and if you must have it written to disk, sync it immediately after you write!
Re:Bull... (Score:2, Insightful)
Optimize the reads all you want, but those writes better damn well happen before the calls that say data is written return.
And this is where most of the confusion comes from. There is a difference between a logical write and a physical write. When your write call completes, it means the logical write has completed. It says nothing about the physical write. Depending on file system semantics, your physical write may have already completed too -- or may complete shortly after. If you must ensure the physical write is complete, then you must explicitly ensure it in code -- otherwise the physical write can only be assumed. And this is where the less informed are confused by their own poor expectations and ignorance. Unless they actually follow their write with some sort of file system synchronization call, they have no right whatsoever to assume the data will still be there in the face of a system crash. It's a very poor coder who falls into that trap.
Good programmers know this and have known it for tens of years. Good database programmers know this. Good file system developers know this. Those that are outraged by their own ignorance are either not programmers or are not good programmers.
And lastly, I'll point out -- which is exactly why Ts'o pointed it out -- use a solution whose foundation is built by coders who already understand the proper way to ensure data is safe on the file system; for example, use a database. While I don't consider the use of a database to be an ideal solution here, it does a wonderful job of highlighting the crappy design both KDE and GNOME have used to store configuration data -- and how unconcerned they are about data loss and data corruption. If the developers of KDE and GNOME don't give a crap about your configuration data, then how on earth can you possibly be upset at the file system for doing what it's supposed to do?
In short, both KDE and GNOME need to give a crap about how, when, and why they write configuration data. Since they don't care about data integrity, you now know who you should be angry at. Here's a hint, and it doesn't have anything to do with the file system.
Re:Not a bug (Score:3, Insightful)
The whole bit you quoted about SQLite was about optimization, not correctness.
The KDE and GNOME developers would be OK using the current file structure to save data, so long as they had bothered to call fsync().
The problem is that the KDE developers were skipping step "d", presumably because they felt it slowed down the application too much. Fortunately(?) for them, with ext3 in its default configuration, it happened to not matter too much that they were skipping an important step.
The part you quoted was merely discussing a potential way to store lots of isolated bits of data without the overhead of calling fsync() constantly.
Re:Not a bug (Score:4, Insightful)
The idiocy is in expecting the FS to do something it was never asked to do. There is one way to commit data to disk in POSIX systems. That function has existed for well over 20 years - probably going on 35 now, but I don't know my Unix history well enough to be sure.
I think the problem is that more and more people believe themselves to be good programmers when they really do not understand what they are doing. Truncating and then rewriting critical files is a very bad idea to begin with. The way you do it is to rename the old file to a backup and write to a new file. Also have a procedure in place to recover from the backup if the main file is broken, and maybe even checksum the main file. In addition, only write when you have to. That is robust design, not the amateur-level truncation the KDE folks seem to be doing routinely.
Re:Bull (Score:3, Insightful)
Back when 10MB HDDs with 100ms access times were prevalent and floppies were all the rage, buffered I/O was a good idea. If I find that an application is somehow overwhelming my 3.0Gb/s SATA bus and 10,000 RPM hard drive, I'll be sure to turn this "feature" on.
Use the "sync" option on a mount some day and be surprised. Synchronous I/O is dog-slow.
Re:Bull (Score:1, Insightful)
If you want to be really picky it's a limitation of the POSIX API. I don't just mean that POSIX allows this behaviour, but that this behaviour is necessary to get reasonable performance while conforming to the API.
Filesystem transactions (as seen in Windows 2008 and Vista) provide a much better balance of performance and data integrity than the POSIX API.
Could I write a better one? Likely not, but that is irrelevant.
Nobody could. The API isn't part of the filesystem, but it limits what can be done in the filesystem. It isn't possible to create a filesystem that offers the guarantees you want, uses the existing API and performs even half as well as existing filesystems.
You might think I'm being pedantic but this is absolutely not a problem with ext4. It's fundamental to the way Linux interfaces with filesystems. If you want to program on Linux (or any UNIX) you have to deal with it. Or work to get it changed :)
Re:To Anonymous Coward: (Score:3, Insightful)
It isn't a file system limitation. And here is why.
1. The POSIX standard specifies that writes may be delayed. Every modern file system may delay writes.
2. The POSIX standard then gives you a way to flush the buffer at a time of the program's choosing. It is called fsync(). If the programmer had called that well-documented function, all would have been well.
You get the best performance possible, and you can ensure the file is flushed before you do something else.
The file system didn't cause this bug. The POSIX spec didn't cause this bug. The programmer who didn't use the tools as documented caused his own bug.
Re:Works as expected... (Score:5, Insightful)
That's an improvement, but it can be made even safer by skipping the delete step. Once the new file is created just rename it on top of the original. The rename system call guarantees that at any point in time the name will refer to either the old or the new file. I'm not sure you really need the sync step. I haven't read the spec in that kind of detail, but if that sync step is really necessary I'd call that a design flaw. The file system may delay the write of the file as well as the rename, but it shouldn't perform the rename until the file has actually been written.
Re:Theory doesn't matter; practice does (Score:4, Insightful)
Apparently, you don't know how to *deal* with real life.
POSIX *does* tell you what happens if your OS crashes. It says "as an application developer, you cannot rely on things in this instance." It also provides mechanisms for successfully dealing with this scenario.
As for fsync() being a performance issue: you can't have your cake and eat it too. If you don't want to pay a performance penalty, you can lose data. Ext4 simply imparts that penalty only to those applications that say they need it, and thereby gives a performance boost to the others which are, by their code, effectively saying "I don't particularly care about this data" - or more specifically, "I can accept a loss risk with this data."
Normal applications have a reasonable expectation that the OS doesn't crash, yes. And usually it doesn't. Out of all the installs out there... how often is this happening? Not very. They've made a performance-reliability tradeoff, and as with any risk... sometimes it's the bad outcome that occurs. If they don't want that to happen, they need to take steps to reduce that risk- and the correct way to do that has always been available in the API.
As for forgetting POSIX... it's the basis of all unix cross-platform code. It's what allows code to run on linux, BSD, Solaris, MacOS X, embedded platforms, etc, without (mostly) caring which one they're on. It's *highly* relevant to the real world because it's the API that most programs not written for windows are written to. Pull up a man page for system calls and you'll see the POSIX standard referenced- that's where they all came from.
Saying "Forget POSIX. It's irrelevant in the real world." is like people saying a few years ago "Forget CSS standards. It's irrelevant in the real world." And you know what? That's the attitude that's dying out in the web as everything moves toward standards compliance. So it is in this case with the filesystem.
Re:Bull (Score:5, Insightful)
It's not going to happen immediately in any case. Some optimizations can only be done if you introduce a delay, and once it's introduced you have to deal with the fact that there is a delay. Just because it's one second instead of a minute doesn't mean your computer can't crash at precisely the wrong moment.
While I'm not an expert in filesystems, I'd expect writing a single file to take at least 4 writes: the inode, the data, an update to the directory the file is in, and a bitmap to record space allocation. If there's a journal, add a write for the journal. Each of those will require a seek, since all of these things live in different places on the disk in most filesystems.
So your 40 small files just turned into 200-250 seeks, which at 8ms each will take 1.6 to 2 seconds to complete.
Now let's suppose we can batch things up. We need to write the inode and data for each file, but can do just one seek for the directory (the same for all), and the bitmap and journal can each be updated in one operation. Now we're down to 2 writes per file, giving 80 seeks, plus 3 for metadata - 83 seeks in all, which can be done in about 0.7 seconds.
But what if we do delayed allocation and create all the inodes and write all the data as one large contiguous area? We're now down to 5 writes total, with a seek time of 40ms. The time needed to write the data itself can probably be disregarded, since modern disks easily write at 50MB/s, and those 40 files with metadata probably amount to less than 32K.
And with some optimization, we just reduced the time it takes to write your 40 files to just 2% of the unoptimized time.
You're not going to get this sort of improvement without some sort of delay. If you insist on a per-file write you'll get really, really awful performance on the sort of workload you're using as an example. And you can even see it in practice, just boot a DOS box, and do benchmarks with and without smartdrv. Running something like a virus scanner should show a huge difference in the presence of a cache.
Re:Bull (Score:3, Insightful)
That application developers don't always get to choose what filesystem their application is being run on would be my guess.
Disk caching is a good thing (well, at the moment; if/when SSDs become large enough and cheap enough to replace regular old spinning disks for speed-dependent applications, it probably won't be all that useful). It makes everything faster and more efficient. That said, 2.5 seconds is an absolutely huge amount of time in computer terms; even on a really slow PC these days, that's thousands of operations being executed before any attempt is even made to write the data to disk. It's a huge and unnecessary risk. Average latency on normal hard drives now is easily below 5ms; queueing up for 30 times that to try and make things more efficient is just stupid.
Re:Bull (Score:4, Insightful)
It is a KDE issue. Only userland knows which data is critical. Only userland knows whether data can be backed up or not. The OS cannot ensure full data integrity without a massive negative performance impact, however much you may wish for it. So what the OS does is give you a way to tell it which data needs to be on disk now, and which data should be on disk in a while if nothing goes wrong.
There really is no other way of doing it. Unless you think fundamentally defective code is acceptable if the risk of getting hit is a bit smaller?
Re:Bull (Score:2, Insightful)
only userland knows WHICH data is critical.
dude, ALL data is critical.
no, this is a serious implementation stoopidity in ext4, et.al.
blame the victim. eeesh. data rape is still rape.
and saying programs should be calling fsync is absurd.
i'm old enough to remember when programmers were admonished
to NOT call fsync, or it would "slow down the system."
sync/flushing data already written by userland standard i/o calls
should never be a userland responsibility.
[shaking head...]
Re:Bull (Score:5, Insightful)
No. That is why we have fsync().
No file system will promise you data integrity with a power failure. That is why you should run with a UPS.
You cannot depend on the write delay time. What happens if you get a really fast processor and, say, a really slow drive? Unless you are building software that only runs on ONE set of hardware, you just cannot do that.
This is a bug that was always in KDE and they got lucky up till now.
Re:Bull (Score:5, Insightful)
dude, ALL data is critical.
If you really think that, then you should leave the era of modern disk access and mount all your partitions with the "sync" option. Then none of your software will have to think about syncing. Of course, all file access will be so slow that nobody will want to work with that system either.
Hmm. I wonder why "sync" is not a default mount option?
Exactly. (Score:5, Insightful)
People keep making arguments about the spec, but this seems like a case of throwing the baby out with the bathwater. The spec is intended to serve the interest of robustness, not the other way around; demolishing robustness and then citing the spec is forgetting why there is a spec in the first place.
Yes, you can design something that's intentionally brain-dead, but still true to spec as a kind of intellectual exercise about extremes, but in the real world, the idea should be the opposite:
Stay true to the spec and try to robustly handle as many contingencies as is possible. Both developers should do this, filesystem and application, not "just" one or the other.
It's not enough just to be true to spec; the idea is to get something that works as well, not jump through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes.
It's the bad outcomes that we're trying to mitigate by having a spec in the first place!
So my point: what exactly is wrong with meeting the spec and trying to prevent serious problems by other coders from affecting your own code? I thought this was a basic part of coding: even if someone else is an idiot programmer, that doesn't make it okay to let the whole system fall down. Or did we all miss the part where we went for protected memory access and pre-emptive multitasking? Hell, if everybody had just been a great programmer, none of that would have been needed.
The point is to have a working system by following the spec and to try to clean up behind other programmers when they don't as much as possible within your own spec-compliant code. The point is not simply to "meet spec" and the actual utility of the system or vulnerability to the mistakes of others be damned.
Re:Not a bug (Score:5, Insightful)
It's called "gconf", and it's worse than that. It's no longer abandonware lurking at the heart of gnome but it's still a nightmare.
Re:Not a bug (Score:1, Insightful)
To be fair, the idea of KDE using a consolidated database is quite different from the idea of every single program on the system using the same consolidated database.
Re:To Anonymous Coward: (Score:3, Insightful)
Re:Not a bug (Score:4, Insightful)
Close, but no cigar. The data we need kept safe is the data already on the disk: if you don't flush, you get to keep the old version that's already there.
That's an interesting interpretation of fsync(), but unfortunately one that's not supported by the POSIX spec. Nowhere does it say that the system cannot flush the data you've written so far without an explicit fsync() call. If you're unlucky enough that this happens after you've truncated the file but before you've written anything into it - well, too bad. As I understand it, ext3 could also exhibit this behavior; it was simply harder to reproduce because the implicit flushes were much more frequent.
Anyway, this post [slashdot.org] seems to explain what's actually going on there in the (very specific) case of KDE.
Re:Bull (Score:3, Insightful)
Except on Linux you must sync the parent directory as well [collab.net]. None of this behavior is usefully documented anywhere, so it's upsetting when kernel developers tell application developers they're doing it wrong.
Re:Bull (Score:3, Insightful)
One way to test whether your argument makes sense is to extend it to absurdity.
What if the FS NEVER wrote anything until a fsync was called?
All applications would then have to add these calls.
The net effect would be uncontrolled write management at the application level, with no hope of I/O management or optimization at the FS/OS level.
Is this what you propose? Is this technically correct? Be careful what you wish for.
If this was done, the FS would (sooner or later) have to ignore fsync totally and re-assert control of commits in order to achieve any reasonable performance.
So you see, I believe you are recommending something that is not in the best interests of the OS or the users in the long run (however technically correct it might be at the moment). This functionality really does belong at the OS/FS level. I could go further and say it would be nice if it could be done at the hardware level. If disk drives could manage this by themselves it would be great: a write would get sent to the disk immediately, and the drive would cache as needed, but never more than it could flush using stored power after external power fails.
rename and fsync (Score:4, Insightful)
"Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. "
Two things are happening:
(1) KDE is writing a new inode.
(2) KDE is renaming the directory entry for the inode, replacing an existing inode in the process.
KDE never calls fsync(2), so the data from step one is not guaranteed to be on disk. Thus KDE is atomically replacing the old file with an uncommitted file. If the system crashes before the kernel gets around to writing the data, too bad.
EXT4 isn't "broken" for doing this, as endless people have pointed out. The spec says if you don't call fsync(2) you're taking your chances. In this case, you gambled and lost.
KDE isn't "broken" for doing this unless KDE promised never to leave the disk in an inconsistent state during a crash. That's a hard promise to keep, so I doubt KDE ever made it.
A system crash means loss of data not committed to disk. A system crash frequently means loss of lots of other things, too. Unsaved application data in memory which never even made it to write(2). Process state. Service availability. Jobs. Money. System crashes are bad; this should not be news.
The database suggestion some are making comes from the fact that if you want on-disk consistency *and* good performance, you have to do a lot of implementation work, and do things like batching your updates into calls to write(2) and fsync(2). Otherwise, performance will stink. This is a big part of what databases do.
As someone else suggested, it's perfectly easy to make writes atomic in most filesystems. Mount with the "sync" option. Write performance will absolutely suck, but you get a never-loses-uncommitted-data filesystem.
Re:Not a bug (Score:1, Insightful)
You need to actually read the bug report and the FA before you comment. The problem isn't that the first operation truncates the file and then the later operations never make it to the disk. The problem is that the metadata operations make it to the disk but the data operations, even though they came first, don't. That's why writing to a new file and renaming it to replace the old file is not sufficient. You have to fsync() before you rename the file to ensure that the data is actually there. Otherwise a crash might occur and you end up without data because the new file (with zero length) replaced your old file.
Re:Bull (Score:3, Insightful)
Re:Bull (Score:4, Insightful)
Data that userland applications WRITES TO DISK is critical. If the filesystem takes its sweet time about actually doing the write, it's not the application's fault. And no, calling fsync() or fdatasync() constantly is no good, because that really does make your performance poor.
Comment removed (Score:5, Insightful)
Re:Don't write to files and your app will be fine (Score:1, Insightful)
Just don't expect that when you issue a file write command that the file system will ensure that the file will be written.
Glad to know that someone's reading the manpages, since you aren't. Go back and read your write() and close() manpages, then come back and tell us that write() is supposed to ensure that the file will be written.
Now, remount your filesystem -o sync, and come back and tell us WHY write() does not ensure that the file will be written by default.
Re:Bull (Score:2, Insightful)
Or to be more precise, POSIX lays out the bare fucking minimum for a half-sane system. It's a set of requirements, not a golden holy tome!
This is Slashdot, so here's a car analogy. POSIX is the law that says what's street-legal. A car needs two headlights, two tail-lights, emissions below a certain point, and so on. Both a base-model Chevy Aveo and a Ferrari are street legal, but I'd rather drive the Ferrari. Why? Because it makes guarantees that go beyond street legality.
Now say you drove a Ferrari and the air conditioning malfunctioned. Imagine how angry you'd be if the Ferrari dealership said, "Nope, sorry. We're not going to fix this. Your car is still street legal, so you should have just gotten used to driving without the air conditioning."
You know what I'd say? Fuck you.
You can guess what I'm saying about filesystems that break the perfectly reasonable open-write-close-rename sequence.
Re:Bull (Score:3, Insightful)
> I agree. This is not way to treat startup-critical configuration files.
This is bull. Most files are critical to someone. This would mean that most processes that write data must use fsync.
Are you arguing that cp should use fsync for every file it copies? In that case, you'd better tell the maintainers of coreutils-7.1, because copy_internal (used by cp.c) does not. (And you'll be laughed at.)
So, right, now, on ext4, the sequence:
> cp /disk1/file1.data /disk2/file1.data
wait a few seconds
> rm /disk1/file1.data
crash
will probably cause the file to be lost. That you choose to blame it on cp is funny, but most of the rest of the world will blame it on ext4.
Re:Exactly. (Score:4, Insightful)
``It's not enough just to be true to spec;''
Yes, it is. That way, you get what the spec says you get.
It can even be argued that doing better than the spec is dangerous. After all, that is what got us this riot: things doing more than the spec said, people relying on that, and then getting angry when another implementation of the spec didn't have the same additional features.
You can only assume that you get what the spec says you get. If you assume more, it's your problem if your assumptions are wrong. If you want more than the spec gives you, you either need to implement it yourself or get a new spec implemented.
``the idea is to get something that works as well, not jump through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes.''
I don't think anyone jumped through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes. I think they jumped through hoops to get the best possible performance, while still being conformant to the spec. If this breaks applications that rely on behavior that isn't in the spec, it's because those applications are buggy.
``It's the bad outcomes that we're trying to mitigate by having a spec in the first place!''
I agree completely. But we seem to differ in how this is supposed to work.
I say that specifications can be used to avoid bad results by specifying exactly what can be relied on. Everything that is not in the specification is unspecified and thus cannot be relied on. Knowing this helps you write better software, because you know what you can assume, and what you have to write code for.
You seem to be saying that having a specification means we want to avoid bad results, so whoever implements the specification must do their best to avoid bad results, no matter what it says in the specification. I find that completely unreasonable.
Re:Exactly. (Score:3, Insightful)
And is that the woosh from what actually went wrong going over your head?
This is definitely an FS problem (Score:2, Insightful)
If the guys writing the FS can't figure out how to properly write a cache that's not the problem of the application writers.
If I save a file via an OS call and the OS tells me it didn't fail, then the OS is broken if I can't immediately reread that file.
Data loss from write caching is not a new problem either. Guess this year's crop of programmers can't figure out how to use google to find out about past problems or they just figure they're smarter than everyone else that came before them.
Re:Not a bug (Score:3, Insightful)
Unfortunately that is case #2 as described here:
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 [launchpad.net]
rename(2) is not guaranteed to be atomic. There are now some patches that get ext4 to perform what most people expect #2 to do. I got bitten by #2 not working correctly on MacOS X some time back, I just googled and found this:
http://www.weirdnet.nl/apple/rename.html [weirdnet.nl]
Ever since that time I have been using fsync in my code when I needed it. You just get into a world of hurt when you expect #2 to work right under every OS and fs and set of mount options because it doesn't.
Re:Exactly. (Score:3, Insightful)
The spec in any design is the final authority.
For example, if the spec for a bridge crossing a river says that the bridge ought to hold 20 tons of weight, then it must do at least that. If the bridge collapses because you put 20 tons on it and a speck of dust landed on top, then it doesn't matter - it still held to spec. If you were able to get 40 tons on before it collapsed, all the better, but you were only ever guaranteed by the spec (and thus the designers) 20 tons.
If the spec for an engine said it could handle 8k RPM and it blew up at 8001 RPM, it was in spec. If you managed to get it to 9k RPM great, but you were only guaranteed 8k RPM.
That doesn't mean you don't build tolerance into the spec - e.g. 8k RPM +/- 5% - or try to exceed it where it makes sense, e.g. delivering 25 tons to ensure you have 20 tons and some leeway for safety. (After all, stupid is as stupid does.)
However, you can't fault the designers or engineers when the device lives up to spec and breaks because you (as the user) tried to exceed the spec and it failed.
Same goes for software. If the software spec says "provides A at rate B" then you better expect that and nothing more. If you need something different, then find a device (or API or file system, etc) that meets your requirements.
Pushing something beyond spec is not the problem of the spec designers - but of the users of the spec that expect it to exceed the spec.
And, btw, specs that are supposedly "minimum" standards are still specs just the same. They guarantee a certain minimum that (with software) allows portability; if you want to do better, you still need to find another spec that supports what you want to do. For example, POSIX guarantees portability between Unix and Unix-like OSes; but if you want to do better than POSIX, then you use the Linux POSIX spec or the Solaris POSIX spec (or the BSD POSIX spec, etc.). You get what you want, but at the cost of some portability. Failing to do that is the failure of the user of the spec, not the writers of the spec.
And just to be clear - by "user of the spec" I do not mean the people implementing the spec, but the people using the software (or device) that implements the spec. In this case, not the implementors of ext3 or ext4, but the implementors of the software going beyond the ext3/4 spec to do something else.
Furthermore, the spec exists as a measurement so you can tell when you've completed your job. If the spec says 10 tons and you get 11 tons, you've finished the job; if you're only getting 9.99 tons, you're not done. If you get 10.00001 tons, you're done. If it says 10 tons +/- 5%, then you might be done at 9.99 tons, but you really should go for 10 tons + 5% just to be safe. Either way, once you've met the spec you're done. That doesn't mean you don't try to improve the spec and then make a better product; but there's no guarantee that will happen - the spec is the spec, and that's all you have to do. It's all you agreed to do to start with. (Think of it like a contract.)