cooper writes "Heise Open posted news about a bug report for the upcoming Ubuntu 9.04 (Jaunty Jackalope) which describes a massive data loss problem when using Ext4 (German version): A crash occurring shortly after the KDE 4 desktop files had been loaded results in the loss of all of the data that had been created, including many KDE configuration files." The article mentions that similar losses can come from some other modern filesystems, too. Update: 03/11 21:30 GMT by T: Headline clarified to dispel the impression that this was a fault in Ext4.
I disagree. "Writing software properly" apparently means taking on a huge burden for simple operations.
Quoting Ts'o:
"The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, and that it uses fdatawrite() instead of fsync() to guarantee that data is written on disk. If sqllite had been properly written so that it grabbed new space for its database storage in chunks of 16k or 64k, and released space when it was no longer needed in similar large chunks via truncate(), and if it used fdatasync() instead of fsync(), the performance problems with FireFox 3 wouldn't have taken place."
In other words, if the programmer took on the burden of tons of work and complexity in order to replicate lots of the functionality of the file system and make it not the file system's problem, then it wouldn't be my problem.
I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
File systems are nice. That's what Unix is about.
I don't think programmers ought to be required to treat them like a pouty flake: "in some cases, depending on the whims of the kernel and entirely invisible moods, or the way the disk is mounted that you have no control over, stuff might or might not work."
I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
It is perfectly OK to read and write thousands of tiny files. Unless the system is going to crash while you're doing it and you somehow want the magic computer fairy to make sure that the files are still there when you reboot it. In that case, you're going to have to always write every single block out to the disk, and slow everything else down to make sure no process gets an "unreasonable" expectation that its data is safe until the drive catches up.
Fortunately his patches will include an option to turn the magic computer fairy off.
A file system should take my data buffer, and after saying "Ok, I got it"
There's your problem, you didn't even bother to ask if it got it, you just threw a ton of data into the file descriptor and closed it, now didn't you. And you want me on thedailywtf?
But let's back up here, because there's more than just people too lazy to call fsync() in order to ask the file system to write the data to the disk and say "Ok, I got it".
All that stuff about creating a backup copy and doing this and that, has to happen inside the file system.
The filesystem does exactly what you tell it to do. If you don't want it to make a zero byte file, then DON'T USE O_TRUNC OR *truncate() TO EMPTY YOUR FILE. Make a new file, fill it up, rename it over the other file. Don't assume that in just a few instructions, you're going to be filling it back up with new data, because those instructions may never arrive.
You don't like it? Try and convince people that (open file, erase all the data in it, do some stuff, write some data, do some more stuff, write some more data, write data to disk, close file) should be an uninterruptible atomic operation. You want a versioning filesystem? Take your pick [wikipedia.org].
by Anonymous Coward
on Wednesday March 11 2009, @04:37PM (#27157655)
Quoting Ts'o:
"The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks,...
Linux reinvents windows registry? Who knows what they will come up with next.
I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
To paraphrase https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 [launchpad.net] : You certainly can use tons of tiny files, but if you want to guarantee your data will still be there after a crash, you need to use fsync. And if that causes performance problems, then perhaps you should rethink how your application is doing things.
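For what "use fsync" means in practice, here is a minimal sketch for one small new file, assuming a POSIX system. The helper name save_new_file and the 0644 mode are made up for illustration; nothing here comes from the bug report itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create `path` as a new file holding `len` bytes from
 * `buf`, and return 0 only once fsync() has reported the data as written.
 * Returns -1 with errno set on any failure. */
int save_new_file(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    /* O_EXCL: refuse to touch an existing file instead of truncating it. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* nothing written yet: retry */
            goto fail;
        }
        p += n;                         /* partial write: keep going */
        len -= (size_t)n;
    }

    /* write() only hands the data to the kernel; fsync() is the request
     * to get it onto the disk, and it can fail (EIO, ENOSPC, ...). */
    if (fsync(fd) < 0)
        goto fail;
    return close(fd);

fail:
    {
        int saved = errno;              /* keep the original error code */
        close(fd);
        errno = saved;
    }
    return -1;
}
```

A caller would treat anything other than 0 as "not saved". Note that on Linux the new directory entry may additionally need an fsync() on the containing directory; that step is left out of this sketch.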
It seems exceedingly odd that issuing a write for a non-zero-sized file and having it delayed causes the file to become zero-size before the new data is written.
Generally when one is trying to maintain correctness one allocates space, places the data into it and only then links the space into place (paraphrased from Barry Dwyer's "One more time - how to update a master file", Communications of the ACM, January 1981).
I'd be inclined to delay the metadata update until after the data was written, as Mr. Ts'o notes was done in ext3. That's certainly what I did back in the days of CP/M, writing DSA-formatted floppies (;-))
Let's not forget that the only consequence of delayed allocation is the write-out delay changing. Instead of data being "guaranteed" on disk in 5 seconds, that becomes 60 seconds.
Oh dear God, someone inform the president! Data that is NEVER guaranteed to be on disk according to spec is only guaranteed on disk after 60 seconds.
You should not write your application to depend on filesystem-specific behavior. You should write it to the standard, and that means fsync(). No call to fsync, no guarantee; look it up in the documentation (man 2 write).
The rest of what Ted Ts'o is saying is optimization, speeding up the boot time for gnome/kde; it is not necessary for correct operation.
Please don't FUD.
You know what, I'll look up the docs for you:
(quote from man 2 write)
NOTES
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has been written, the call succeeds, and returns the number of bytes written.
That brings up another point: almost nobody is ready for the second remark either (write might return after a partial write, necessitating a second call).
So the normal case for a "reliable write" would be this code:
    /* needs <errno.h> and <unistd.h> */
    size_t written = 0;
    while (written < sizeof(data)) {
        /* resume after the bytes already written, not from the start */
        ssize_t r = write(fd, (const char *)&data + written, sizeof(data) - written);
        if (r < 0) {
            if (errno == EINTR)
                continue; /* interrupted before any byte was written: retry */
            /* error handling code, at the very least looking at EIO, ENOSPC and EPIPE for network sockets */
            break;
        }
        written += (size_t)r;
    }
and *NOT*
    write(fd, &data, sizeof(data)); // will probably work
Just because programmers continuously use the second method (just check a few sf.net projects) doesn't make it the right method (and as there is *NO* way to fix write to make that call reliable in all cases you're going to have to shut up about it eventually)
Hell, even firefox doesn't check for either EIO or ENOSPC and certainly doesn't handle either of them gracefully, at least not for downloads.
That would be smart, but only if the SQL database is encrypted too. It's theoretically possible to read a registry with an editor, and we can't have that. Also, we need a checksum on the registry. If the checksum is bad, we have to overwrite the registry with zeroes. Registries are monolithic, and we have to make sure that either it's good data, or NONE of it is good data. Otherwise the user would get confused.
I am so excited about this that I'm going to start working on it just as soon as I get done rewriting all my userspace tools in TCL.
You're right. The correct thing to do is to *always* call fsync() when you need a data guarantee, *regardless* of which FS you're on. The fact that not doing it in the past hasn't caused problems isn't the problem- those calls are the correct way of handling things.
UNIX filesystems have used tiny files for years and they've had data loss under certain conditions. My favorite example is the XFS that would journal just enough to give you a consistent filesystem full of binary nulls on power failure. This behavior was even documented in their FAQ with the reply "If it hurts, don't do it."
Filesystems are a balancing act. If you want high performance, you want write caching to allow the system to flush writes in parallel while you go on computing, or make another overlapping write that could be merged. If you want high data security, you call fsync and the OS does its best possible job to write to disk before returning (modulo hard drives that lie to you). Or you open the damn file with O_SYNC.
What he's suggesting is that the POSIX API allows either option to programmers, who often don't know there's even a choice to be had. So he recommends that the few people who do know the API inside and out focus on system libraries like libsqlite, and that dumbass programmers use those instead. You and he may not be so far apart, except his solution still allows hard-nosed engineers access to low-level syscalls, at the price of shooting their foot off.
Ummm... it deals correctly with files of any size. It just loses recent data if your system crashes before it has flushed what it's got in RAM to disk. That's the case for pretty much any filesystem; it's just a matter of degree, and how "recent" is recent.
The benefit of journaling file systems is that after the crash you still have a file system that works. How many folks remember when Windows would crash, resulting in a HDD that was so corrupted the OS wouldn't start? Same with ext2.
If these folks don't like asynchronous writes, they can edit their fstab (or whatever) to have the sync option so all their writes will be synchronous and the world will be a happy place.
Note that they will also have to suffer a slower system, and possibly a shortened lifetime of their HDD, but at least their configuration files will be safe.
Er, actually it removes the previous data, then waits to replace it for long enough that the probability of noticing the disappearance approaches unity on flaky hardware (;-))
If what you say is true there would be no need for the fsync() function (and related ones).
Read the standards if you want. The filesystem is only bugged if it loses recent data under conditions where the application has asked it to guarantee that the data is safe. If the app hasn't asked for any such guarantee by calling fsync() or the like, the filesystem is free to do as it likes.
In fact, there is no such thing as an OS bug! All good programmers should re-implement essential and basic operating system features in their user applications whenever they run into so-called "OS bugs." If you question this, you must be a bad programmer, obviously.
People keep making arguments about the spec, but this seems like a case of throwing the baby out with the bathwater. The spec is intended to serve the interest of robustness, not the other way around; demolishing robustness and then citing the spec is forgetting why there is a spec in the first place.
Yes, you can design something that's intentionally brain-dead, but still true to spec as a kind of intellectual exercise about extremes, but in the real world, the idea should be the opposite:
Stay true to the spec and try to robustly handle as many contingencies as is possible. Both developers should do this, filesystem and application, not "just" one or the other.
It's not enough just to be true to spec; the idea is to get something that works as well, not jump through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes.
It's the bad outcomes that we're trying to mitigate by having a spec in the first place!
So my point: what exactly is wrong with meeting the spec and trying to prevent serious problems by other coders from affecting your own code? I thought this was a basic part of coding: even if someone else is an idiot programmer, that doesn't make it okay to let the whole system fall down. Or did we all miss the part where we went for protected memory access and pre-emptive multitasking? Hell, if everybody had just been a great programmer, none of that would have been needed.
The point is to have a working system by following the spec and to try to clean up behind other programmers when they don't as much as possible within your own spec-compliant code. The point is not simply to "meet spec" and the actual utility of the system or vulnerability to the mistakes of others be damned.
The journal isn't being written before the data. Nothing is written for periods between 45-120 seconds so as to batch up the writing to efficient lumps. The journal is there to make sure that the data on disk makes sense if a crash occurs.
If your system crashes after a write hasn't hit the disk, you lose either way. Ext3 was set to write at most 5 seconds later. Ext4 is looser than that, but with associated performance benefits.
The filesystem should be hitting the metal about 0.001 microseconds after I call write() or whatever the function is.
If that's the behavior you expect, then you need to be running your apps under an OS like DOS, not POSIX or Windows (which both clearly specify that this is *not* how they function).
Ahh yes, I love developers like you. You assume your app is the only one running, and it must have full access to the entire IO bandwidth an HDD can provide.
And then an antivirus program updates while Firefox is starting and a video is transcoding, and your program either slows to a crawl or crashes after 30 seconds of not receiving or being able to write any data.
Recently I was playing Left4Dead when one of my HDDs in my RAID array died in a very audible way. All the drives spun down, then 3 of them came back online. IOPS went to zero for over 60 seconds. No data in or out to those devices!
Interestingly, Ventrilo kept running fine. Left4Dead completely froze, but a minute or so after the 3 drives came back online, it unfroze. (CPU catching up?) All the while I was freaking out on Ventrilo, much to my friends' amusement.
Pretty much everything else crashed, except for Portable Firefox... uTorrent crashed, but first it left corrupted files all over - appearing as undeletable folders, which require a format to remove.
Time for a disk wipe. Thank you, shitty developers! Next time, use the API properly, and if you must have it written to disk, sync it immediately after you write!
It's not going to happen immediately in any case. Some optimizations can only be done if you introduce a delay, and once it's introduced you have to deal with the fact that there's a delay. Just because it's one second instead of a minute doesn't mean your computer can't crash at precisely the wrong moment.
While I'm not an expert in filesystems, I'd expect writing a single file to be at least 4 writes: inode, data, update the directory the file is in, and a bitmap to show space allocation. If there's a journal add a write for the journal. Each of those will require a seek due to all of these things being in different places on the disk in most filesystems.
So your 40 small files just turned into 160-200 seeks, which at 8ms each will take about 1.3 to 1.6 seconds to complete.
Now let's suppose we can batch things up. We need to write the inode and data for each file, and can do just one seek for the directory (the same for all), and the bitmap and journal can be updated in one operation. Now we're down to 2 writes per file, giving 80 seeks, plus 3 for metadata, giving 83 seeks, which can be done in 0.6 seconds.
But what if we do delayed allocation and create all the inodes and write all the data as one large contiguous area? We're now down to 5 writes total, with a seek time of 40ms. The time needed to write the data can probably be disregarded, since modern disks easily write at 50MB/s, and those 40 files with metadata probably amount to less than 32K.
And with some optimization, we just reduced the time it takes to write your 40 files to just 2% of the unoptimized time.
You're not going to get this sort of improvement without some sort of delay. If you insist on a per-file write you'll get really, really awful performance on the sort of workload you're using as an example. And you can even see it in practice, just boot a DOS box, and do benchmarks with and without smartdrv. Running something like a virus scanner should show a huge difference in the presence of a cache.
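The estimates in this post are plain seek-count arithmetic, which can be sketched as follows. This assumes the post's 40 files and 8 ms per seek, and takes its four to five writes per file at face value (that gives 160 to 200 seeks for the unbatched case). The function and constant names are made up for illustration.

```c
#include <assert.h>

enum { FILES = 40 };                /* number of small files in the example */
static const double SEEK_MS = 8.0;  /* assumed average seek time, in ms */

/* Unbatched: one seek per write (inode, data, directory, bitmap and,
 * with a journal, one more), repeated for every file. */
int seeks_unbatched(int files, int writes_per_file)
{
    assert(files >= 0 && writes_per_file >= 0);
    return files * writes_per_file;
}

/* Batched: inode and data per file, plus one write each for the shared
 * directory, the bitmap and the journal. */
int seeks_batched(int files)
{
    assert(files >= 0);
    return 2 * files + 3;
}

/* Delayed allocation: everything laid out contiguously, five writes total. */
int seeks_delayed_alloc(void)
{
    return 5;
}

/* Total seek time in milliseconds for a given number of seeks. */
double seek_time_ms(int seeks)
{
    assert(seeks >= 0);
    return seeks * SEEK_MS;
}
```

With these inputs the unbatched case comes to 1.28 to 1.6 seconds, the batched case to 664 ms, and delayed allocation to 40 ms, which is 2.5% of the 1.6 second figure.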
That's all very well, but in most cases, the machine is a single-user desktop, and it's sitting *idle* during the 5-40 seconds in which the filesystem writes are being batched up. It seems to me that the writes should be sent to disk immediately, unless the disk is already busy, in which case they may be delayed for (at most) 40 seconds, while other reads take place ahead of them.
Mount your filesystem with the "sync" option, that should do what you want I guess. Performance will be bad though.
There are only two ways to do this: either you do it completely synchronously, and get a guarantee of the write being done when the application is done writing, or you have a delay of arbitrary length. If you have a delay, even if it's 1ms, and you care about the possibility of something going wrong at that moment, the application has to deal with the possibility. Reducing the delay only makes it less likely, but given enough time it'll happen.
Even doing it fully synchronously you can run into problems. A file can be half written (it's written by the block, after all), and of those 40 files, perhaps one references data in another.
Point being, if such things are really a problem for the application, the application must do things correctly, by writing to temporary files, renaming, and writing in the right sequence so that even if something is interrupted in the middle the data on disk still makes sense.
Even if the FS does like you want and starts writing immediately, that won't save you from the fact that it has no clue how your file is internally structured, and will perform writes in fs-sized blocks. So your 10K sized file can be interrupted in the middle and get cut off at 4K in size after a crash. If your application then goes and chokes on that, there's no way the FS can fix that for you.
Also, with a modern SATA disk supporting Native Command Queuing, the OS should immediately write the data to the disk's buffer, and the disk's firmware gets to decide about re-ordering.
NCQ doesn't take care of half that's needed for safe writing to disk. Two problems for a start:
1. Your hard disk doesn't know about your filesystem's structure. Unless told otherwise, the HDD will happily reorder writes and update ondisk data first, journal second, leading to disk corruption. The hard disk can't magically figure out what's the right way to write the data so that it remains consistent, only the OS and the application can ensure that.
2. NCQ is limited to 32 commands anyway, the OS has to do handling on its own anyhow.
As for the argument about using sqlite - why have yet another abstraction? After all, the filesystem is already a sort of database!
Because it's a simpler abstraction. If you're not willing to learn or deal with the POSIX semantics, such as fsync and rename, and checking the return code of every system call, you can use something like sqlite that does it internally and saves you the effort, and returns one unique value that tells you whether the whole update worked or not.
Wish I had mod points for you AC as I agree with you. 150 seconds is 2.5 minutes! I don't know of any file system, let alone a RAID controller, that waits that long to commit the data.
If this is a feature and not a bug, better be sure your computer is connected to a UPS. Damn!
No. That is why we have fsync(). No file system will promise you data integrity with a power failure. That is why you should run with a UPS. You can not depend on the write delay time. What happens if you get a really fast processor and say a really slow drive? Unless you are building software that only runs on ONE set of hardware you just can not do that. This is a bug that was always in KDE and they got lucky up till now.
If you really think that, then you should leave the era of modern disk access and mount all your partitions with the "sync" option. Then none of your software will have to think about syncing. Of course all file access will be so slow that nobody will want to work with that system either.
Hmm. I wonder why "sync" is not a default mount option?
by Anonymous Coward
on Wednesday March 11 2009, @04:36PM (#27157635)
This is NOT a bug. Read the POSIX documents.
Filesystem metadata and file contents are NOT required to be synchronous, and a sync is needed to ensure they are synchronised.
It's just down to retarded programmers who assume they can truncate/rename files and any data pending writes will magically meet up a-la ext3 (which has a mount option which does not sync automatically btw).
Rewriting the same file over and over is known for being risky. The proper sequence is to create a new file, sync, rename the new file on top of the old one, optionally sync. In other words, app developers must be more careful about what they are doing, not put all the blame on the filesystems. There's only so much an fs can do to avoid such brouhahas. Many other filesystems behave similarly to ext4, btw.
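The sequence described here (new file, sync, rename over the old one, optionally sync) could look roughly like the sketch below on a POSIX system. The helper name replace_file and its parameters are hypothetical, and the final step syncs the containing directory, which is one reading of "optionally sync".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `buf`.
 * `tmp` is a scratch name on the same filesystem (for example the target
 * name plus ".tmp"), and `dir` is the directory containing both.
 * Returns 0 on success, -1 on failure. */
int replace_file(const char *dir, const char *path, const char *tmp,
                 const void *buf, size_t len)
{
    assert(dir != NULL && path != NULL && tmp != NULL && buf != NULL);

    /* 1. Create a new file next to the old one and fill it up. */
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;                   /* interrupted: retry */
        if (n < 0)
            goto fail;
        p += n;                         /* partial write: keep going */
        len -= (size_t)n;
    }

    /* 2. Sync: the new contents must be on disk before the name changes. */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }

    /* 3. Rename on top of the old file; there is no delete step. */
    if (rename(tmp, path) < 0) {
        unlink(tmp);
        return -1;
    }

    /* 4. Optionally sync the directory so the rename itself is durable. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;

fail:
    close(fd);
    unlink(tmp);                        /* never leave a half-written temp */
    return -1;
}
```

The old file is never truncated or deleted, so a crash at any point leaves the name pointing at either the complete old contents or the complete new ones.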
by Anonymous Coward
on Wednesday March 11 2009, @05:21PM (#27158245)
Bullshit. It is not a filesystem limitation. POSIX tells you what you can expect from file system calls. Data committed to disk as soon as an fwrite or fclose returns is not something you can or should expect. (And this is true of every OS I've used in the last 20 years.)
A great many crap programmers think APIs ought to do what they'd like them to. But APIs don't. At best they do what they are specified to do.
It isn't a flaw. It is documented and the programmers didn't follow the docs. There is a specific command called fsync to flush the buffers to prevent the problem. In fact here is a link to that call http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html [opengroup.org]
Yes, if we had a perfect world we would have instant I/O, but we do not. The flaw is in the application, plain and simple. They didn't use the API properly, and it really is just that simple.
Ext3 doesn't write out immediately either. If the system crashes within the commit interval, you'll lose whatever data was written during that interval. That's only 5 seconds of data if you're lucky, much more data if you're unlucky. Ext4 simply made that commit interval and backend behavior different than what applications were expecting.
All modern fs drivers, including ext3 and NTFS, do not write immediately to disk. If they did then system performance would really slow down to almost unbearable speeds (only about 100 syncs/sec on standard consumer magnetic drives). And sometimes the sync call will not occur since some hardware fakes syncs (RAID controllers often do this).
POSIX doesn't define flushing behavior when writing and closing files. If your application needs data to be in NV memory, use fsync. If it doesn't care, good. If it does care and it doesn't sync, it's a bad application and is flawed, plain and simple.
As an application developer, the last thing I want to worry about is whether or not the fraking filesystem is going to persist my data to disk.
As an application developer, you are expected to know what the API does, in order to use it correctly. What Ext4 is doing is 100% respectful of the spec.
The problem here is that delaying writes speeds up things greatly but has this possible side-effect. For a shorter commit time, simply stay with ext3. You can also mount your filesystems "sync" for a dramatic performance hit, but no write delay at all.
Anyway, with modern filesystems data does not go to disk immediately, unless you take additional measures, like a call to fsync. This should be well known to anybody who develops software and is really not a surprise. It has been done like that on server OSes for a very long time. Also note that there is no loss of data older than the write delay period and this only happens on a system crash or power failure.
Bottom line: Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.
Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.
You're right, there really is nothing to see here. Or rather, there's nothing left. As the article says, a large number of configuration files are opened and written to as KDE starts up. If KDE crashes and takes the OS with it (as it apparently does), those configuration files may be truncated or deleted entirely -- the commands to re-create and write them having never been sync'd to disk. As the startup of KDE takes longer than the write delay, it's entirely possible for this to seriously screw with the user.
The two problems are:
1. Bad application development. Don't delete and then re-create the same file. Use atomic operations that ensure that files you are reading/writing to/from will always be consistent. This can't be done by the Operating System, whatever the four color glossy told you.
2. Bad Operating System development. If an application kills the kernel, it's usually the kernel's fault (drivers and other code operating in privileged space are obviously not the kernel's fault) -- and this appears to be a crash initiated from code running in user space. Bad kernel, no cookie for you.
Write new file into a temp file, sync, whatever you need to do. When you're done, delete original and rename the temp to the original's name.
That's an improvement, but it can be made even safer by skipping the delete step. Once the new file is created just rename it on top of the original. The rename system call guarantees that at any point in time the name will refer to either the old or the new file. I'm not sure you really need the sync step. I haven't read the spec in that kind of detail, but if that sync step is really necessary I'd call that a design flaw. The file system may delay the write of the file as well as the rename, but it shouldn't perform the rename until the file has actually been written.
Meh, this is crap that happens only when the system crashes, and is pretty much unavoidable if you're doing a lot of caching in memory -- which, coincidentally, is what you need to do to maximize performance. This doesn't sound like the filesystem's "fault" or the application's "fault;" it's just the way things are. Everybody knows that if you don't cleanly unmount, most bets are off.
The problem is not the many small files, but the missing disk sync. The many small files just make the issue more obvious.
True, with ext4 this is more likely to cause problems, but any delayed write can cause this type of issue when no explicit flush-to-disk is done. And let's face it: fsync/fdatasync are not really a secret to any competent developer.
What however is a mistake, and a bad one, is making ext4 the default filesystem at this time. I say give it another half year, for exactly this type of problem.
"And lets face it: fsync/fdatasync are not really a secret to any competent developer."
I disagree. Users of high-level languages (especially those that are cross-platform) are not necessarily aware of this situation, and arguably should not need to be.
And I disagree with your disagreement. This is something any competent developer has to know. There are fundamental limits in practical computing. This is one. It cannot be hidden without dramatic negative effects on performance. It is not a platform-specific problem. It is not a language-specific problem. It is not a hidden issue. A simple "man close" will already tell you about it. Any decent OS course will cover the issue.
I reiterate: Any good developer knows about write-buffering and knows at least that extra measures have to be taken to ensure data is on disk. Those that do not are simply not good developers.
It's amazing how fast a filesystem can be if it makes no guarantees that your data will actually be on disk when the application writes it.
Anyone who assumes modern filesystems are synchronous by default is deluded. If you need to guarantee your data is actually on disk, open the file with O_SYNC semantics. Otherwise, you take your chances.
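As a rough illustration of the O_SYNC route, assuming a POSIX system: the helper name append_sync and the 0644 mode are hypothetical. The only difference from an ordinary buffered append is the extra open() flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: append `len` bytes to `path` with synchronous-write
 * semantics. With O_SYNC, each successful write() returns only after the
 * kernel reports the data as written to the underlying device, so no
 * separate fsync() call is made here. Returns 0 on success, -1 on failure. */
int append_sync(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted: retry */
            int saved = errno;      /* keep the write error for the caller */
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;                     /* partial write: keep going */
        len -= (size_t)n;
    }
    return close(fd);
}
```

Every write pays the full cost of reaching the disk, which is the performance trade-off the thread keeps coming back to.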
Moreover, there's no assertion that the filesystem was corrupt as a result of the crash. That would be a far more serious concern.
When I write data to a file (either through a descriptor or FILE *), I expect it to be stored on media at the earliest practical moment. That doesn't mean "immediately", but 150 seconds is brain-damaged. Regardless of how many reads are pending, writes must be scheduled, at least in proportion of the overall file system activity, or you might as well run on a ramdisk.
While reading/writing a flurry of small files at application startup is sloppy from a system performance point of view, data loss is not the application developers' fault, it's the file system designers'.
BTW, I write drivers for a living, have tuned SMD drive format for performance, and written microkernels, so this is not a developer's rant.
*No* modern, desktop-usable file systems today guarantee new files to be there if the power goes out unless the application specifically requests it with O_SYNC, fsync() and similar techniques (and then only "within reason": most actually guarantee only that the file system will recover itself, not the data). It is universally true for UFS (the Unix file system), ext2/3, JFS, XFS, ZFS, ReiserFS, NTFS, everything. This can only be a topic for inexperienced developers who don't know the assumptions behind the systems they use.
The same is true for data ordering - only by separating the writes with fsync() can one piece of data be forced to be written before another.
This is an issue of great sensitivity for databases. See for example:
That there exist reasonably reliable databases is a testament that it *can* be done, with enough understanding and effort, but is not something that's automagically available.
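The ordering point can be sketched like this, assuming a POSIX system. The record layout (a payload followed by a one-byte 'C' commit marker) and the helper names are invented for illustration; real databases use more elaborate formats.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write all `len` bytes, retrying after EINTR and partial writes. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: append a payload to `path`, then a one-byte commit
 * marker. The fsync() between the two writes is what keeps the marker from
 * reaching the disk before the payload it vouches for.
 * Returns 0 on success, -1 on failure. */
int append_committed(const char *path, const void *payload, size_t len)
{
    assert(path != NULL && payload != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    int ok = write_all(fd, payload, len) == 0
          && fsync(fd) == 0             /* payload must be on disk first ... */
          && write_all(fd, "C", 1) == 0
          && fsync(fd) == 0;            /* ... and only then the marker */

    if (close(fd) < 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

After a crash, a reader that finds a payload without its marker can treat the record as never written.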
This is the attitude that has the web stuck with IE.
There's a standard out there called POSIX. It's just like an HTML or CSS standard. If everyone pays attention to it, everything works better. If you fail to pay attention to it for your bit (writing files or writing web pages), it's not *my* fault if my conforming implementation (implementing the writing or the rendering) doesn't magically fix your bugs.
KDE is *broken* to delete a file and expect it to still be there if it crashes before the write.
Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. While that may be correct from a POSIX lawyer point of view, it is still heavily undesirable.
Not a bug (Score:5, Informative)
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45 [launchpad.net]
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 [launchpad.net]
Re:Not a bug (Score:5, Insightful)
I disagree. "Writing software properly" apparently means taking on a huge burden for simple operations.
Quoting T'so:
"The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, and that it uses fdatawrite() instead of fsync() to guarantee that data is written on disk. If sqllite had been properly written so that it grabbed new space for its database storage in chunks of 16k or 64k, and released space when it was no longer needed in similar large chunks via truncate(), and if it used fdatasync() instead of fsync(), the performance problems with FireFox 3 wouldn't have taken place."
In other words, if the programmer took on the burden of tons of work and complexity in order to replicate lots of the functionality of the file system and make it not the file system's problem, then it wouldn't be my problem.
I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
File systems are nice. That's what Unix is about.
I don't think programmers ought to be required to treat them like a pouty flake: "in some cases, depending on the whims of the kernel and entirely invisible moods, or the way the disk is mounted that you have no control over, stuff might or might not work."
Parent
Re:Not a bug (Score:5, Interesting)
I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
It is perfectly OK to read and write thousands of tiny files. Unless the system is going to crash while you're doing it and you somehow want the magic computer fairy to make sure that the files are still there when you reboot it. In that case, you're going to have to always write every single block out to the disk, and slow everything else down to make sure no process gets an "unreasonable" expectation that its data is safe until the drive catches up.
Fortunately his patches will include an option to turn the magic computer fairy off.
Parent
Re:Not a bug (Score:5, Informative)
A file system should take my data buffer, and after saying "Ok, I got it"
There's your problem, you didn't even bother to ask if it got it, you just threw a ton of data into the file descriptor and closed it, now didn't you. And you want me on thedailywtf?
But let's back up here, because there's more than just people too lazy to call fsync() in order to ask the file system to write the data to the disk and say "Ok, I got it".
All that stuff about creating a backup copy and doing this and that, has to happen inside the file system.
The filesystem does exactly what you tell it to do. If you don't want it to make a zero byte file, then DON'T USE O_TRUNC OR *truncate() TO EMPTY YOUR FILE. Make a new file, fill it up, rename it over the other file. Don't assume that in just a few instructions, you're going to be filling it back up with new data, because those instructions may never arrive.
You don't like it? Try and convince people that (open file, erase all the data in it, do some stuff, write some data, do some more stuff, write some more data, write data to disk, close file) should be an uninterruptible atomic operation. You want a versioning filesystem? Take your pick [wikipedia.org].
Parent
Re:Not a bug (Score:5, Informative)
Quoting T'so:
"The final solution, is we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqllite for application registries, but fixed up so that it allocates and releases space for its database in chunks, ...
Linux reinvents windows registry?
Who knows what they will come up with next.
Parent
Re:Not a bug (Score:5, Insightful)
It's called "gconf", and it's worse than that. It's no longer abandonware lurking at the heart of gnome but it's still a nightmare.
Parent
Re:Not a bug (Score:5, Insightful)
I personally think it should be perfectly OK to read and write hundreds of tiny files. Even thousands.
To paraphrase https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54 [launchpad.net] : You certainly can use tons of tiny files, but if you want to guarantee your data will still be there after a crash, you need to use fsync. And if that causes performance problems, then perhaps you should rethink how your application is doing things.
Parent
Re:Not a bug (Score:5, Insightful)
It seems exceedingly odd that issuing a write for a non-zero-sized file and having it delayed causes the file to become zero-size before the new data is written.
Generally when one is trying to maintain correctness one allocates space, places the data into it and only then links the space into place (paraphrased from Barry Dwyer's "One more time - how to update a master file", Communications of the ACM, January 1981).
I'd be inclined to delay the metadata update until after the data was written, as Mr. Tso notes was done in ext3. That's certainly what I did back in the days of CP/M, writing DSA-formatted floppies (;-))
--dave
Parent
Re:Not a bug (Score:5, Informative)
Let's not forget that the only consequence of delayed allocation is the write-out delay changing. Instead of data being "guaranteed" on disk in 5 seconds, that becomes 60 seconds.
Oh dear God, someone inform the president ! Data that is NEVER guaranteed to be on disk according to spec is only guaranteed on disk after 60 seconds.
You should not write your applications to depend on filesystem-specific behavior. You should write them to the standard, and that means fsync(). No call to fsync, no guarantee; look it up in the documentation (man 2 write).
The rest of what Ted T'so is saying is optimization, speeding up the boot time for gnome/kde, it is not necessary for correct workings.
Please don't FUD.
You know I'll look up the docs for you :
(quote from man 2 write)
NOTES
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has been written, the call succeeds, and returns the number of bytes written.
That brings up another point, almost nobody is ready for the second remark either (write might return after a partial write, necessitating a second call)
So the normal case for a "reliable write" would be this code :
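#include <errno.h>
#include <unistd.h>
size_t written = 0;
while (written < sizeof(data)) {
    ssize_t r = write(fd, (const char *)&data + written, sizeof(data) - written);
    if (r < 0) {
        if (errno == EINTR)
            continue; // interrupted before any bytes were written, so retry
        // error handling code, at the very least looking at EIO, ENOSPC and EPIPE for network sockets
        break;
    }
    written += (size_t)r; // partial write: carry on from where it stopped
}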
and *NOT*
write(fd, data, sizeof(data)); // will probably work
Just because programmers continuously use the second method (just check a few sf.net projects) doesn't make it the right method (and as there is *NO* way to fix write to make that call reliable in all cases you're going to have to shut up about it eventually)
Hell, even firefox doesn't check for either EIO or ENOSPC and certainly doesn't handle either of them gracefully, at least not for downloads.
Parent
Re:Not a bug (Score:5, Informative)
Parent
Re:Not a bug (Score:5, Funny)
That would be smart, but only if the SQL database is encrypted too. It's theoretically possible to read a registry with an editor, and we can't have that. Also, we need a checksum on the registry. If the checksum is bad, we have to overwrite the registry with zeroes. Registries are monolithic, and we have to make sure that either it's good data, or NONE of it is good data. Otherwise the user would get confused.
I am so excited about this that I'm going to start working on it just as soon as I get done rewriting all my userspace tools in TCL.
Parent
Re:Not a bug (Score:5, Informative)
You're right. The correct thing to do is to *always* call fsync() when you need a data guarantee, *regardless* of which FS you're on. The fact that skipping it hasn't caused problems in the past is beside the point - those calls are the correct way of handling things.
Parent
Re:Not a bug (Score:5, Insightful)
UNIX filesystems have used tiny files for years and they've had data loss under certain conditions. My favorite example is the XFS that would journal just enough to give you a consistent filesystem full of binary nulls on power failure. This behavior was even documented in their FAQ with the reply "If it hurts, don't do it."
Filesystems are a balancing act. If you want high performance, you want write caching to allow the system to flush writes in parallel while you go on computing, or make another overlapping write that could be merged. If you want high data security, you call fsync and the OS does its best possible job to write to disk before returning (modulo hard drives that lie to you). Or you open the damn file with O_SYNC.
What he's suggesting is that the POSIX API allows either option to programmers, who often don't know there's even a choice to be had. So he recommends that the few people who do know the API inside and out focus on system libraries like libsqlite, and that dumbass programmers use those instead. You and he may not be so far apart, except his solution still allows hard-nosed engineers access to low level syscalls, at the price of shooting their foot off.
Parent
Re:Not a bug (Score:5, Insightful)
Parent
Re:Not a bug (Score:5, Insightful)
The benefit of journaling file systems is that after the crash you still have a file system that works. How many folks remember when Windows would crash, resulting in a HDD that was so corrupted the OS wouldn't start. Same with ext2.
If these folks don't like asynchronous writes, they can edit their fstab (or whatever) to have the sync option so all their writes will be synchronous and the world will be a happy place.
Note that they will also have to suffer a slower system, and possibly a shortened lifetime of their HDD, but at least their configuration files will be safe.
Parent
Re:Not a bug (Score:5, Informative)
Er, actually it removes the previous data, then waits to replace it for long enough that the probability of noticing the disappearance approaches unity on flaky hardware (;-))
--dave
Parent
Re:Not a bug (Score:5, Informative)
It just loses recent data if your system crashes before it has flushed what it's got in RAM to disk.
No, that's the bug. It loses ALL data. You get 0 byte files on reboot.
Parent
Re:Not a bug (Score:5, Insightful)
No. It's not.
If what you say is true there would be no need for the fsync() function (and related ones).
Read the standards if you want. The filesystem is only bugged if it loses recent data under conditions where the application has asked it to guarantee that the data is safe. If the app hasn't asked for any such guarantee by calling fsync() or the like, the filesystem is free to do as it likes.
Parent
Re:Bull (Score:5, Funny)
In fact, there is no such thing as an OS bug! All good programmers should re-implement essential and basic operating system features in their user applications whenever they run into so-called "OS bugs." If you question this, you must be a bad programmer, obviously.
Parent
Exactly. (Score:5, Insightful)
People keep making arguments about the spec, but this seems like a case of throwing the baby out with the bathwater. The spec is intended to serve the interest of robustness, not the other way around; demolishing robustness and then citing the spec is forgetting why there is a spec in the first place.
Yes, you can design something that's intentionally brain-dead, but still true to spec as a kind of intellectual exercise about extremes, but in the real world, the idea should be the opposite:
Stay true to the spec and try to robustly handle as many contingencies as is possible. Both developers should do this, filesystem and application, not "just" one or the other.
It's not enough just to be true to spec; the idea is to get something that works as well, not jump through hoops to cleverly demonstrate that the spec does not protect against all possible bad outcomes.
It's the bad outcomes that we're trying to mitigate by having a spec in the first place!
So my point: what exactly is wrong with meeting the spec and trying to prevent serious problems by other coders from affecting your own code? I thought this was a basic part of coding: even if someone else is an idiot programmer, that doesn't make it okay to let the whole system fall down. Or did we all miss the part where we went for protected memory access and pre-emptive multitasking? Hell, if everybody had just been a great programmer, none of that would have been needed.
The point is to have a working system by following the spec and to try to clean up behind other programmers when they don't as much as possible within your own spec-compliant code. The point is not simply to "meet spec" and the actual utility of the system or vulnerability to the mistakes of others be damned.
Parent
Re:Bull (Score:5, Insightful)
The journal isn't being written before the data. Nothing is written for periods between 45-120 seconds so as to batch up the writing to efficient lumps. The journal is there to make sure that the data on disk makes sense if a crash occurs.
If your system crashes after a write hasn't hit the disk, you lose either way. Ext3 was set to write at most 5 seconds later. Ext4 is looser than that, but with associated performance benefits.
Parent
man 2 fsync (Score:5, Informative)
The filesystem doesn't guarantee anything is written until you've called fsync and it has returned.
Parent
Re:Bull (Score:5, Insightful)
The filesystem should be hitting the metal about 0.001 microseconds after I call write() or whatever the function is.
If that's the behavior you expect, then you need to be running your apps under an OS like DOS, not POSIX or Windows (which both clearly specify that this is *not* how they function).
Parent
Re:Bull (Score:5, Insightful)
Ahh yes, I love developers like you. You assume your app is the only one running, and it must have full access to the entire IO bandwidth an HDD can provide.
And then an antivirus program updates while Firefox is starting and a video is transcoding, and your program either slows to a crawl or crashes after 30 seconds of not receiving or being able to write any data.
Recently I was playing Left4Dead when one of my HDDs in my RAID array died in a very audible way. All the drives spun down, then 3 of them came back online. IOPS went to zero for over 60 seconds. No data in or out to those devices!
Interestingly, Ventrilo kept running fine. Left4Dead completely froze, but a minute or so after the 3 drives came back online, it unfroze. (CPU catching up?) All the while I was freaking out on Ventrilo, much to my friends' amusement.
Pretty much everything else crashed, except for Portable Firefox... uTorrent crashed, but first it left corrupted files all over - appearing as undeletable folders, which require a format to remove.
Time for a disk wipe. Thank you, shitty developers! Next time, use the API properly, and if you must have it written to disk, sync it immediately after you write!
Parent
Re:Bull (Score:5, Insightful)
It's not going to happen immediately in any case. Some optimizations can only be done if you introduce a delay, and once introduced you have to deal with the fact that there's a delay. Just because it's one second instead of a minute doesn't mean your computer can't crash at precisely the wrong moment.
While I'm not an expert in filesystems, I'd expect writing a single file to be at least 4 writes: inode, data, update the directory the file is in, and a bitmap to show space allocation. If there's a journal add a write for the journal. Each of those will require a seek due to all of these things being in different places on the disk in most filesystems.
So your 40 small files just turned into 200-250 seeks, which at 8ms each will take 1.6 to 2 seconds to complete.
Now let's suppose we can batch things up. We need to write the inode and data for each file, and can do just one seek for the directory (the same for all), and the bitmap and journal can be updated in one operation. Now we're down to 2 writes per file, giving 80 seeks, plus 3 for metadata, giving 83 seeks, which can be done in 0.6 seconds.
But what if we do delayed allocation and create all the inodes and write all the data as one large contiguous area? We're now down to 5 writes total, with a seek time of 40ms. The time needed to write the data can probably be disregarded, since modern disks easily write at 50MB/s, and those 40 files with metadata probably amount to less than 32K.
And with some optimization, we just reduced the time it takes to write your 40 files to just 2% of the unoptimized time.
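For anyone who wants to check the arithmetic, here is a rough sketch of the three cases in C. It assumes the same flat 8 ms per seek and 40 files as above, and takes five writes per file for the unoptimized case (the four listed plus one journal write); the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed flat cost per seek, as in the estimate above. */
#define SEEK_MS 8.0

/* Total seek time in milliseconds for a given number of seeks. */
double seek_time_ms(int seeks)
{
    assert(seeks >= 0);
    return seeks * SEEK_MS;
}

/* Prints the seek count and time for the three write strategies. */
void print_seek_estimates(int files)
{
    int per_file   = files * 5;     /* inode, data, directory, bitmap, journal */
    int batched    = files * 2 + 3; /* inode + data per file, 3 shared metadata writes */
    int contiguous = 5;             /* delayed allocation, one contiguous area */

    printf("per-file:   %3d seeks, %5.0f ms\n", per_file, seek_time_ms(per_file));
    printf("batched:    %3d seeks, %5.0f ms\n", batched, seek_time_ms(batched));
    printf("contiguous: %3d seeks, %5.0f ms\n", contiguous, seek_time_ms(contiguous));
}
```

Calling print_seek_estimates(40) gives the three rows for the 40-file example. Real disks won't have a flat seek cost, so treat the numbers as order-of-magnitude only.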
You're not going to get this sort of improvement without some sort of delay. If you insist on a per-file write you'll get really, really awful performance on the sort of workload you're using as an example. And you can even see it in practice, just boot a DOS box, and do benchmarks with and without smartdrv. Running something like a virus scanner should show a huge difference in the presence of a cache.
Parent
Re:Bull (Score:5, Interesting)
Mount your filesystem with the "sync" option, that should do what you want I guess. Performance will be bad though.
There are only two ways to do this: either you do it completely synchronously, and get a guarantee of the write being done when the application is done writing, or you have a delay of arbitrary length. If you have a delay, even if it's 1ms, and you care about the possibility of something going wrong at that moment, the application has to deal with the possibility. Reducing the delay only makes it less likely, but given enough time it'll happen.
Even doing it fully synchronously you can run into problems. A file can be half written (it's written by the block, after all), and of those 40 files, perhaps one references data in another.
Point being, if such things are really a problem for the application, the application must do things correctly, by writing to temporary files, renaming, and writing in the right sequence so that even if something is interrupted in the middle the data on disk still makes sense.
Even if the FS does like you want and starts writing immediately, that won't save you from the fact that it has no clue how your file is internally structured, and will perform writes in fs-sized blocks. So your 10K sized file can be interrupted in the middle and get cut off at 4K in size after a crash. If your application then goes and chokes on that, there's no way the FS can fix that for you.
NCQ doesn't take care of half that's needed for safe writing to disk. Two problems for a start:
1. Your hard disk doesn't know about your filesystem's structure. Unless told otherwise, the HDD will happily reorder writes and update ondisk data first, journal second, leading to disk corruption. The hard disk can't magically figure out what's the right way to write the data so that it remains consistent, only the OS and the application can ensure that.
2. NCQ is limited to 32 commands anyway, the OS has to do handling on its own anyhow.
Because it's a simpler abstraction. If you're not willing to learn or deal with the POSIX semantics, such as fsync and rename, and checking the return code of every system call, you can use something like sqlite that does it internally and saves you the effort, and returns one unique value that tells you whether the whole update worked or not.
Parent
Re:Bull (Score:5, Insightful)
Does anyone else think that 150 seconds is a bit over the top in terms of writing to disk?
I could understand one or two seconds as you speculate more data might come that needs to be written.
5 seconds is a bit iffy, as with ext3.
150 seconds? That's surely a bug.
Parent
Re:Bull (Score:5, Insightful)
Wish I had mod points for you AC as I agree with you. 150 seconds is 2.5 minutes! I don't know of any file system, let alone a RAID controller, that waits that long to commit the data.
If this is a feature and not a bug, better be sure your computer is connected to a UPS. Damn!
Parent
Re:Bull (Score:5, Insightful)
No. That is why we have fsync().
No file system will promise you data integrity with a power failure. That is why you should run with a UPS.
You can not depend on the write delay time. What happens if you get a really fast processor and say a really slow drive? Unless you are building software that only runs on ONE set of hardware you just can not do that.
This is a bug that was always in KDE and they got lucky up till now.
Parent
Re:Bull (Score:5, Insightful)
dude, ALL data is critical.
If you really think that, then you should leave the era of modern disk access and mount all your partitions with the "sync" option. Then none of your software will have to think about syncing. Of course all file access will be so slow that nobody will want to work with that system either.
Hmm. I wonder why "sync" is not a default mount option?
Parent
Re:Bull (Score:5, Informative)
This is NOT a bug. Read the POSIX documents.
Filesystem metadata and file contents are NOT required to be synchronous and a sync is needed to ensure they are synchronised.
It's just down to retarded programmers who assume they can truncate/rename files and any data pending writes will magically meet up a-la ext3 (which has a mount option which does not sync automatically btw).
RTFPS (Read The Fine POSIX Spec).
Parent
Re:Bull (Score:5, Insightful)
Rewriting the same file over and over is known to be risky. The proper sequence is to create a new file, sync, rename the new file on top of the old one, optionally sync. In other words, app developers must be more careful about what they do, not put all the blame on the filesystems. There's only so much that an fs can do to avoid such brouhahas. Many other filesystems have similar behavior to ext4, btw.
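A minimal sketch of that sequence in C, assuming the temporary file lives in the same directory as the target (so rename() stays on one filesystem). The names write_all, replace_file and tmp_path are hypothetical, not from any library.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write all len bytes, retrying on short writes and EINTR. Returns 0 or -1. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any byte was written */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Replace path with new contents: create new file, write, fsync, rename.
 * tmp_path is a caller-chosen scratch name in the same directory. */
int replace_file(const char *path, const char *tmp_path,
                 const void *buf, size_t len)
{
    assert(path != NULL && tmp_path != NULL);

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* The data must be on disk before the name is switched over. */
    if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
        int saved = errno;
        close(fd);
        unlink(tmp_path);
        errno = saved;
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }
    /* The old file is never truncated; the name points at old or new. */
    if (rename(tmp_path, path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

The "optionally sync" step is left out here. On Linux it would usually mean opening the containing directory and calling fsync() on that descriptor so the rename itself is durable.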
Parent
Re:Bull (Score:5, Insightful)
Bullshit. It is not a filesystem limitation. POSIX tells you what you can expect from file system calls. Data committed to disk as soon as an fwrite or fclose returns is not something you can or should expect. (And this is true of every OS I've used in the last 20 years.)
A great many crap programmers think APIs ought to do what they'd like them to. But APIs don't. At best they do what they are specified to do.
Parent
Re:Bull (Score:5, Informative)
It isn't a flaw. It is documented and the programmers didn't follow the docs. There is a specific command called fsync to flush the buffers to prevent the problem.
In fact here is a link to that call http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html [opengroup.org]
Yes if we had a perfect world we would have instant IO but we do not. The flaw is in the application, plain and simple.
They didn't use the api properly and it really is just that simple.
Parent
Re:To Anonymous Coward: (Score:5, Informative)
mount -o sync. Enjoy your slow returns and strictly ordered writes.
Parent
Re:Bull (Score:5, Informative)
Ext3 doesn't write out immediately either. If the system crashes within the commit interval, you'll lose whatever data was written during that interval. That's only 5 seconds of data if you're lucky, much more data if you're unlucky. Ext4 simply made that commit interval and backend behavior different than what applications were expecting.
All modern fs drivers, including ext3 and NTFS, do not write immediately to disk. If they did then system performance would really slow down to almost unbearable speeds (only about 100 syncs/sec on standard consumer magnetic drives). And sometimes the sync call will not occur since some hardware fakes syncs (RAID controllers often do this).
POSIX doesn't define flushing behavior when writing and closing files. If your application needs data to be in NV memory, use fsync. If it doesn't care, good. If it does care and it doesn't sync, it's a bad application and is flawed, plain and simple.
Parent
Re:Not a bug (Score:5, Insightful)
As an application developer, the last thing I want to worry about is whether or not the fraking filesystem is going to persist my data to disk.
As an application developer, you are expected to know what the API does, in order to use it correctly. What Ext4 is doing is 100% respectful of the spec.
Parent
Don't worry (Score:5, Funny)
Don't worry guys, I read the summary this time, and it only affects the German version of ext4.
pr0n (Score:5, Funny)
Works as expected... (Score:5, Insightful)
The problem here is that delaying writes speeds up things greatly but has this possible side-effect. For a shorter commit time, simply stay with ext3. You can also mount your filesystems "sync" for a dramatic performance hit, but no write delay at all.
Anyways, with modern filesystems data does not go to disk immediately, unless you take additional measures, like a call to fsync. This should be well known to anybody that develops software and is really not a surprise. It has been done like that on server OSes for a very long time. Also note that there is no loss of data older than the write delay period and this only happens on a system crash or power-failure.
Bottom line: Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.
Re:Works as expected... (Score:5, Insightful)
Nothing to see here, except a few people that do not understand technology and are now complaining that their expectations are not met.
You're right, there really is nothing to see here. Or rather, there's nothing left. As the article says, a large number of configuration files are opened and written to as KDE starts up. If KDE crashes and takes the OS with it (as it apparently does), those configuration files may be truncated or deleted entirely -- the commands to re-create and write them having never been sync'd to disk. As the startup of KDE takes longer than the write delay, it's entirely possible for this to seriously screw with the user.
The two problems are:
1. Bad application development. Don't delete and then re-create the same file. Use atomic operations that ensure that files you are reading/writing to/from will always be consistent. This can't be done by the Operating System, whatever the four color glossy told you.
2. Bad Operating System development. If an application kills the kernel, it's usually the kernel's fault (drivers and other code operating in privileged space are obviously not the kernel's fault) -- and this appears to be a crash initiated from code running in user space. Bad kernel, no cookie for you.
Parent
Re:Works as expected... (Score:5, Insightful)
That's an improvement, but it can be made even safer by skipping the delete step. Once the new file is created just rename it on top of the original. The rename system call guarantees that at any point in time the name will refer to either the old or the new file. I'm not sure you really need the sync step. I haven't read the spec in that kind of detail, but if that sync step is really necessary I'd call that a design flaw. The file system may delay the write of the file as well as the rename, but it shouldn't perform the rename until the file has actually been written.
Parent
Re:Exactly (Score:5, Insightful)
Parent
Re:Exactly (Score:5, Insightful)
The problem is not the many small files, but the missing disk sync. The many small files just make the issue more obvious.
True, with ext4 this is more likely to cause problems, but any delayed write can cause this type of issue when no explicit flush-to-disk is done. And let's face it: fsync/fdatasync are not really a secret to any competent developer.
What however is a mistake, and a bad one, is making ext4 the default filesystem at this time. I say give it another half year, for exactly this type of problem.
Parent
Re:Exactly (Score:5, Insightful)
"And lets face it: fsync/fdatasync are not really a secret to any competent developer."
I disagree. Users of high-level languages (especially those that are cross-platform) are not necessarily aware of this situation, and arguably should not need to be.
And I disagree with your disagreement. This is something any competent developer has to know. There are fundamental limits in practical computing. This is one. It cannot be hidden without dramatic negative effects on performance. It is not a platform-specific problem. It is not a language-specific problem. It is not a hidden issue. A simple "man close" will already tell you about it. Any decent OS course will cover the issue.
I reiterate: Any good developer knows about write-buffering and knows at least that extra measures have to be taken to ensure data is on disk. Those that do not are simply not good developers.
Parent
Classic tradeoff (Score:5, Insightful)
It's amazing how fast a filesystem can be if it makes no guarantees that your data will actually be on disk when the application writes it.
Anyone who assumes modern filesystems are synchronous by default is deluded. If you need to guarantee your data is actually on disk, open the file with O_SYNC semantics. Otherwise, you take your chances.
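For illustration, a small sketch of the O_SYNC route in C. The function name append_record_sync is made up; the flags and calls are the standard POSIX ones. With O_SYNC each write() should not return until the data has been handed to stable storage, as far as the drive reports honestly.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Append one record to path with synchronous write semantics.
 * Returns 0 on success, -1 on error or a short write. */
int append_record_sync(const char *path, const void *buf, size_t len)
{
    assert(path != NULL);

    /* O_SYNC: write() blocks until the data reaches the device. */
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    int ok = (n >= 0 && (size_t)n == len);

    if (close(fd) < 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

The price is that every write pays the full latency of the disk, which is the tradeoff this whole thread is about. An fsync() after a batch of ordinary writes is usually the cheaper way to get the same guarantee.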
Moreover, there's no assertion that the filesystem was corrupt as a result of the crash. That would be a far more serious concern.
not mounted sync,dirsync? (Score:5, Interesting)
When I write data to a file (either through a descriptor or FILE *), I expect it to be stored on media at the earliest practical moment. That doesn't mean "immediately", but 150 seconds is brain-damaged. Regardless of how many reads are pending, writes must be scheduled, at least in proportion of the overall file system activity, or you might as well run on a ramdisk.
While reading/writing a flurry of small files at application startup is sloppy from a system performance point of view, data loss is not the application developers' fault, it's the file system designers'.
BTW, I write drivers for a living, have tuned SMD drive format for performance, and written microkernels, so this is not a developer's rant.
Alarmist and ignorant article - not a "problem" (Score:5, Insightful)
*No* modern, desktop-usable file systems today guarantee new files to be there if the power goes out except if the application specifically requests it with O_SYNC, fsync() and similar techniques (and then only "within reason" - actually most guarantee only that the file system will recover itself, not the data). It is universally true - for UFS (the Unix file system), ext2/3, JFS, XFS, ZFS, ReiserFS, NTFS, everything. This can only be a topic for inexperienced developers that don't know the assumptions behind the systems they use.
The same is true for data ordering - only by separating the writes with fsync() can one piece of data be forced to be written before another.
This is an issue of great sensitivity for databases. See for example:
That there exist reasonably reliable databases is a testament that it *can* be done, with enough understanding and effort, but is not something that's automagically available.
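To make the ordering point concrete, here is a rough sketch in C of "data first, commit marker second", with an fsync() between the two writes. The two-file layout and the name write_then_commit are assumptions for illustration only; real databases use far more elaborate schemes.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write buf to data_path, force it to disk, and only then write a
 * one-byte commit marker to commit_path. Returns 0 on success, -1 on error. */
int write_then_commit(const char *data_path, const char *commit_path,
                      const void *buf, size_t len)
{
    assert(data_path != NULL && commit_path != NULL);
    int rc = -1;

    int dfd = open(data_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (dfd < 0)
        return -1;

    ssize_t n = write(dfd, buf, len);
    /* The fsync() here is what orders the data before the marker. */
    if (n >= 0 && (size_t)n == len && fsync(dfd) == 0) {
        int cfd = open(commit_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (cfd >= 0) {
            if (write(cfd, "C", 1) == 1 && fsync(cfd) == 0)
                rc = 0;
            if (close(cfd) < 0)
                rc = -1;
        }
    }
    if (close(dfd) < 0)
        rc = -1;
    return rc;
}
```

After a crash, a reader that finds the marker can assume the data write was flushed first; without the first fsync() the filesystem would be free to put the marker on disk before the data.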
Re:Theory doesn't matter; practice does (Score:5, Insightful)
This is the attitude that has the web stuck with IE.
There's a standard out there called POSIX. It's just like an HTML or CSS standard. If everyone pays attention to it, everything works better. If you fail to pay attention to it for your bit (writing files or writing web pages), it's not *my* fault if my conforming implementation (implementing the writing or the rendering) doesn't magically fix your bugs.
Parent
Re:Excuses are false. This is a severe flaw. (Score:5, Informative)
Nope, it writes a new file and then renames it over the old file, as rename() says it is an atomic operation - you either have the old file or the new file. What happens with ext4 is that you get the new file except for its data. While that may be correct from a POSIX lawyer point of view, it is still heavily undesirable.
Parent