
Ext4 Data Losses Explained, Worked Around

Posted by timothy
from the you-did-back-up-right dept.
ddfall writes "H-Online has a follow-up on the Ext4 file system — last week's news about data loss with the Linux Ext4 file system is explained, and new solutions have been provided by Ted Ts'o to allow Ext4 to behave more like Ext3."
  • by morgan_greywolf (835522) on Thursday March 19, 2009 @12:52PM (#27258767) Homepage Journal

    FTFA, this is the problem:

    Ext4, on the other hand, has another mechanism: delayed block allocation. After a file has been closed, up to a minute may elapse before data blocks on the disk are actually allocated. Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. If the system crashes during this time, the rename() operation may already be committed in the journal, even though the new file still contains no data. The result is that after a crash the file is empty: both the old and the new data have been lost.

    And now my question: Why did the Ext4 developers make the same mistakes Reiser and XFS both made (and later corrected) years ago? Before you get to write any filesystem code, you should have to study how other people have done it, including all the change history. Seriously.

    Those who fail to learn the lessons of [change] history are doomed to repeat it.
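The write-replace idiom at issue can be sketched as follows (a minimal sketch with hypothetical filenames, not code from any actual application):

```python
import os

def write_replace(path, data):
    """Write-replace: write a temp file, then rename() it over the original.
    Each step looks safe, but under ext4's delayed allocation the rename()
    can reach the journal up to ~60 s before the temp file's data blocks
    are allocated, so a crash in that window leaves `path` empty."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)        # after close(), data may still sit in the page cache
    os.rename(tmp, path)     # metadata-only change; may be committed first

write_replace("settings.conf", "new contents\n")
```

On ext3's default data=ordered mode the equivalent window is at most a few seconds, which is why this pattern seemed safe in practice.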

  • by Spazmania (174582) on Thursday March 19, 2009 @12:53PM (#27258781) Homepage

    Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations.

    I couldn't disagree more:

    When applications want to overwrite an existing file with new or changed data [...] they first create a temporary file for the new data and then rename it with the system call - rename(). [...] Delayed block allocation allows the filing system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until [up to 60 seconds later].

    Application developers reasonably expect that writes to the disk which happen far apart in time will happen in order. If I write to a file and then rename the file, I expect that the rename will not complete significantly before the write. Certainly not 60 seconds before the write. It seems dead obvious, at least to me, that the update of the directory entry should be deferred until after ext4 flushes that part of the file written prior to the change in the directory entry.

  • by girlintraining (1395911) on Thursday March 19, 2009 @12:54PM (#27258809)

    Short version: "We're sorry we changed something that worked and everyone was used to, but hey -- it's compliant with a standard." If this were Microsoft, we'd give them a healthy helping of humble pie, but because it's Linux and the magic word "POSIX" gets used, I'm sure we'll forgive them for it. The workaround is laughable -- "call fsync(), and then wait(), wait(), wait(), for the Wizard to see you." How about writing a filesystem that actually does journaling in a reliable fashion, instead of finger-pointing after the user loses data due to your snazzy new optimization and saying "The developer did it! It wasn't us, honest." Microsoft does it and we tar and feather them, but when the guys making the "latest and greatest" Linux feature do it, we salute them?

    We let our own off with heinous mistakes, while professionals who do the same thing we hang simply because they dared to ask to be paid for their effort. Lame.

  • Re:LOL: Bug Report (Score:5, Insightful)

    by Z00L00K (682162) on Thursday March 19, 2009 @12:58PM (#27258863) Homepage

    This is the problem with new features - users have problems using them until they fully understand and appreciate the advantages and disadvantages.

    And also consider - ext4 is relatively new, so it will improve over time. If you want stability stick to ext3 or ext2. If you want a really stupid filesystem go FAT and prepare for a patent attack.

  • I sit just me? (Score:3, Insightful)

    by IMarvinTPA (104941) <`moc.APTnivraMI' `ta' `APTnivraMI'> on Thursday March 19, 2009 @01:04PM (#27258961) Homepage Journal

    I sit just me, or would you expect that the change would only be committed once the data was written to disk under all circumstances?
    To me, it sounds like somebody screwed up a part of the POSIX specification. I should look for the line that says "During a crash, lose the user's recently changed file data and wipe out the old data too."

    IMarv

  • by victim (30647) on Thursday March 19, 2009 @01:05PM (#27258973)

    The workaround (flushing everything to disk before the rename) is a disaster for laptops or anything else which might wish to spin down a disk drive.

    The write-replace idiom is used when a program is updating a file and can tolerate the update being lost in a crash, but wants either the old or the new version to be intact and uncorrupted. The proposed sync solution accomplishes this, but at the cost of spinning up the drive and writing the blocks at each write-replace. How often does your browser update a file while you surf? Every cache entry? Every history entry? What about your music player? Desktop manager? All of these will spin up your disk drive.

    Hiding behind POSIX is not the solution. There needs to be a solution that supports write-replace without spinning up the disk drive.

    The ext4 people have kindly illuminated the problem. Now it is time to define a solution. Maybe it will be some sort of barrier logic, maybe a new kind of sync syscall. But it needs to be done.
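The workaround the parent objects to is, roughly, an fsync() forced in before the rename (a sketch; the filename is hypothetical):

```python
import os

def write_replace_durable(path, data):
    """Write-replace with an explicit fsync() before the rename. This closes
    the zero-length-file window, but it forces the data to disk immediately,
    spinning up a sleeping drive on every single update."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
        f.flush()               # drain user-space (stdio) buffers
        os.fsync(f.fileno())    # block until the kernel pushes data to disk
    os.rename(tmp, path)        # now the rename can never beat the data

write_replace_durable("history.dat", "visited example.org\n")
```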

  • Dunno (Score:5, Insightful)

    by Shivetya (243324) on Thursday March 19, 2009 @01:05PM (#27258983) Homepage Journal

    but if you want a write later file system shouldn't it be restricted to hardware that can preserve it?

    I understand that doing writes immediately when requested leads to performance degradation but that is why business systems which defer writes to disk only do so when the hardware can guarantee it. In other words, we have a battery backed cache, if the battery is low or nearing end of life the cache is turned off and all writes are made when the data changes.

    Trying to make performance gains to overcome limitations of the hardware never wins out.

  • Re:I sit just me? (Score:1, Insightful)

    by Anonymous Coward on Thursday March 19, 2009 @01:07PM (#27258991)

    aye, standards aren't perfect. if part of one doesn't make sense, that part should be avoided and an updated standard created to address the issue. what somebody decided years back isn't always the best solution.

  • by Samschnooks (1415697) on Thursday March 19, 2009 @01:10PM (#27259051)
    Speaking as someone who has developed commercial OS code (OS/2), I always assumed that the person before me understood what they were doing, because if you didn't, you'd spend all your time researching how the 'wheel' was invented. Also, aside from this very rare occurrence, it is pretty arrogant to think that your predecessors were incompetent or, to be generous, ignorant.

    This problem is just something that slipped through the cracks and I'm sure the originator of this bug is kicking himself in the ass for being so "stupid".

  • Re:LOL: Bug Report (Score:5, Insightful)

    by Anonymous Coward on Thursday March 19, 2009 @01:12PM (#27259073)

    Rubbish. Sorry, if the syncs were implicit, app developers would just be demanding a way to turn them off most of the time because they were killing performance.

  • Re:LOL: Bug Report (Score:5, Insightful)

    by von_rick (944421) on Thursday March 19, 2009 @01:15PM (#27259141) Homepage

    And also consider - ext4 is relatively new, so it will improve over time. If you want stability stick to ext3 or ext2.

    QFT

    The filesystem was first released toward the end of December 2008. The Linux distros that incorporated it offered it as an option, but the default for /root and /home was always EXT3.

    In addition, this problem is not a week old like the article states. People have been discussing this problem on forums ever since mid-January, when the benchmarks for EXT4 were published and several people decided to try it out to see how it fares. I have been using EXT4 for my /root partition since January. Fortunately I haven't had any data loss, but if I do end up losing some data, I'd understand that since I have been using a brand new file-system which has not been thoroughly tested by users, nor has it been used on any servers that I know of.

  • by GMFTatsujin (239569) on Thursday March 19, 2009 @01:20PM (#27259211) Homepage

    If the issue is drive spin-up, how have the new generation of flash drives been taken into account? It seems to me that rotational drives are on their way out.

    That doesn't do anything for the contemporary generations of laptop, but what would the ramifications be for later ones?

  • by dotancohen (1015143) on Thursday March 19, 2009 @01:22PM (#27259249) Homepage

    Before you get to write any filesystem code, you should have to study how other people have done it...

    No. Being innovative means being original, and that means taking new and different paths. Once you have seen somebody else's path, it is difficult to go out on your own original path. That is why there are alpha and beta stages to a project, so that watchful eyes can find the mistakes that you will undoubtedly make, even those that have been made before you.

  • by Dan667 (564390) on Thursday March 19, 2009 @01:32PM (#27259383)
    I believe a major difference is that Microsoft would just deny there was a problem at all. If they did acknowledge it, they certainly would not detail what it is.
  • No kidding (Score:5, Insightful)

    by Sycraft-fu (314770) on Thursday March 19, 2009 @01:36PM (#27259445)

    All the stuff with Ext4 strikes me as amazingly arrogant, and ignorant of the past. The issue that FS authors -- indeed, authors of any system programs/tools/etc -- need to understand is that your tool being usable is the #1 important thing. In the case of a file system, that means that it reliably stores data on the drive. So, if you do something that really screws that over, well then you probably did it wrong. Doesn't matter if you fully documented it, doesn't matter if it technically "follows the spec"; what matters is that it isn't usable.

    I mean I could write a spec for a file system that says "No write is guaranteed to be written to disk until the OS is shut down, everything can be cached in RAM for an indefinite amount of time." However that'd be real flaky and lead to data loss. That makes my FS useless. Doesn't matter if it is well documented, what matters is that the damn thing loses data on a regular basis.

    I'd give these guys more credit if I was aware of any other major OS/FS combo that did shit like this, but I'm not. Linux/Ext3 doesn't, Windows/NTFS doesn't, OS-X/HFS+ doesn't, Solaris/ZFS doesn't, etc. Well that tells me something. That says that the way they are doing things isn't a good idea. If it is causing problems AND it is something else nobody else does, then probably you ought not do it.

    This is just bad design, in my opinion.

  • by Evanisincontrol (830057) on Thursday March 19, 2009 @01:39PM (#27259481)

    Standing on the shoulders of giants is usually the best way to make progress.

    Sure, if the only direction you want to go is the direction that the giant is already moving. Doesn't help you get anywhere else, though.

  • Re:LOL: Bug Report (Score:5, Insightful)

    by try_anything (880404) on Thursday March 19, 2009 @01:48PM (#27259617)

    This is the problem with new features - users have problems using them until they fully understand and appreciate the advantages and disadvantages.

    Advantages: Filesystem benchmarks improve. Real performance... I guess that improves, too. Does anybody know?

    Disadvantages: You risk data loss with 95% of the apps you use on a daily basis. This will persist until the apps are rewritten to force data commits at appropriate times, but hopefully not frequently enough to eat up all the performance improvements and more.

    Ext4 might be great for servers (where crucial data is stored in databases, which are presumably written by storage experts who read the Posix spec), but what is the rationale for using it on the desktop? Ext4 has been coming for years, and everyone assumed it was the natural successor to ext3 for *all* contexts where ext3 is used, including desktops. I hope distros don't start using or recommending ext4 by default until they figure out how to configure it for safe usage on the desktop. (That will happen long before the apps are rewritten.) Filesystem benchmarks be damned.

  • by TheMMaster (527904) <hp@@@tmm...cx> on Thursday March 19, 2009 @01:48PM (#27259629)

    Actually, no.

    Microsoft runs a proprietary show where they 'set the standard' themselves. Which basically means 'there is no standard except how we do it'.
    Linux, however, tries to adhere to standards. When it turns out that something doesn't adhere to standards, it gets fixed.

    Another problem is that most users of proprietary software on their proprietary OS don't have the sources to the software they use. So if the OS fixes something that was previously broken, but the software version used is 'no longer supported', the 'fix' in the OS breaks the user's software, and the user has no way of fixing it.

    THIS is why a) microsoft can't ever truly fix something and b) why using proprietary software screws over the user.

    Or would you rather have OSS software do the same as proprietary software vendors and work around problems forever without ever fixing them? Saw that shiny 'run in IE7 mode' button in IE8? That's what you'll get...

  • by Hatta (162192) on Thursday March 19, 2009 @01:51PM (#27259667) Journal

    If this were Microsoft, we'd give them a healthy helping of humble pie, but because it's Linux and the magic word "POSIX" gets used, I'm sure we'll forgive them for it.

    You must be reading a different slashdot than I am. The popular opinion I see is that this is very bad design. If the spec allows this behavior, it's time to revisit the spec.

  • Re:LOL: Bug Report (Score:1, Insightful)

    by Anonymous Coward on Thursday March 19, 2009 @01:58PM (#27259795)

    ZFS isn't all that viable for Linux users. ZFS-FUSE is too slow.

    With that said, I think someone should just go ahead and put ZFS in the Linux kernel and release a patch only. This will get around the GPL issues. All it would mean is that you couldn't redistribute a kernel binary or source with ZFS stuff in it. Anyone wanting ZFS would have to patch and compile their own kernel, not that big a deal. If it's internal use only then GPL is compatible with the ZFS license.

    Personally I have lost a lot of data with all the ext filesystems (and Reiser3 too). I still use it on OS and boot partitions but all my important big data partitions are XFS. I have run for years on failing hardware with XFS. I have never lost data with XFS except for the sectors that were physically damaged and even then I never lost anything important. XFS has been fairly bulletproof for me, whereas I have lost entire ext2/3 partitions due to corruption that wasn't even a hardware failure.

  • by DragonWriter (970822) on Thursday March 19, 2009 @02:00PM (#27259815)

    I'm a hobbyist, and I don't program system level stuff, essentially, at all anymore, but way back when I did do C programming on Linux (~10 years ago), ISTR that this (from Ts'o in TFA) was advice you couldn't go anywhere without getting hit repeatedly over the head with:

    if an application wants to ensure that data have actually been written to disk, it must call the function fsync() before closing the file.

    Is this really something that is often missed in serious applications?
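The advice quoted from TFA corresponds to this pattern (a minimal sketch; the filename is hypothetical):

```python
import os

def write_durably(path, data):
    # Per the quoted advice: flush user-space buffers, then fsync() before
    # the file is closed, so the data is on disk rather than only in cache.
    with open(path, "w") as f:
        f.write(data)
        f.flush()               # stdio-level flush to the kernel
        os.fsync(f.fileno())    # kernel-level flush to the device

write_durably("example.txt", "hello\n")
```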

  • by Anonymous Coward on Thursday March 19, 2009 @02:05PM (#27259887)

    You disagree with his interpretation of the spec?

    Well then, show us the relevant part of the spec that says things should happen in order.

    It doesn't say that? It says instead to use fsync()?

    Blame the FS all you people want, but the fact remains that the application writers screwed up big time; their code is not robust and will probably fail again in the future. Even with Ext3, the code was a ticking time bomb. If power is lost at the right time, the same results would happen.

    Sure, it would be nice to have a FS that fixed the poorly made code people write, but that does not remove the blame from the application writers; it simply adds some to the FS writers for taking what was a good desktop FS and trying to turn it into a server FS. Desktop FSs need to deal with poor application code and with frequent power losses, but poor code is still poor code.

  • Re:LOL: Bug Report (Score:3, Insightful)

    by shentino (1139071) on Thursday March 19, 2009 @02:08PM (#27259939)

    Ext4 is still alpha-ish, and declared as such.

    Any *user* who trusts production data to an experimental filesystem is already too stupid to have the right to gripe about losing said data.

  • Re:No kidding (Score:3, Insightful)

    by mr_mischief (456295) on Thursday March 19, 2009 @02:11PM (#27259999) Journal

    It does store data reliably on the drive when that data has been properly synchronized by the application's author. The data that is lost is what had been sent to a filehandle but not yet synchronized when the system loses power or crashes.

    The FS isn't the problem, but it is exposing problems in applications. If you need your FS to be a safety net for such applications, nobody is taking ext3 away just because ext4 is available. If you want the higher performance of ext4, buy a damn UPS already.

  • Easier Fix (Score:4, Insightful)

    by maz2331 (1104901) on Thursday March 19, 2009 @02:11PM (#27260001)

    Why not just make the actual "flushing" process work primarily on memory cache data - including any "renames", "deletes", etc.?

    If any "writes" are pending, then the other operations should be done in the strict order in which they were requested. There should be no pattern possible where cache and file metadata can be out of sync with one another.

  • Re:No kidding (Score:2, Insightful)

    by SIR_Taco (467460) on Thursday March 19, 2009 @02:39PM (#27260413) Homepage

    what matters is that the damn thing loses data on a regular basis.

    I guess I don't really understand what you mean by regular basis, or maybe you just like feeding quarters into the FUD machine. Maybe you live in a place where power failures are very common and/or you like to randomly hit the reset/power buttons. Or maybe you're just not pedaling hard enough to keep your computer from going into black/brown-out status.
    The fact is that you will not lose data on a regular basis unless you have severe power problems. This is a performance boost based on the assumption that power outages and bone-headed users are not commonplace. Take that as you will; I'm not one to suggest that any distro accept this as its default FS. However, it does have its place, and many people welcome it.

    Just my two cents.

  • Re:LOL: Bug Report (Score:1, Insightful)

    by Anonymous Coward on Thursday March 19, 2009 @02:39PM (#27260419)

    Part of the problem, as I understand it, was that ext3 performed horribly if you did do things according to the spec (i.e. fsync wrote everything pending, not just the file descriptor you gave it).

    I think it would be a great idea to use it for desktops, as it might force applications to be written correctly; those that are really worried about it can put off upgrading to a new Ubuntu until the dust settles.

    I don't expect my OS to crash often enough for it to be a concern anyway, and the places where it's really important (document-based apps like emacs/vi/ooffice) had better have been using fsync already.

    Losing a max of 2 minutes of recent data changes for extra performance, and only when an app isn't written to spec -- I think I can live with that.

  • POSIX (Score:4, Insightful)

    by 200_success (623160) on Thursday March 19, 2009 @02:41PM (#27260439)
    If I had wanted POSIX-compliant behavior, I could have gotten Windows NT! (Windows was just POSIX-compliant enough to be certified, but the POSIX implementation was so half-assed that it was unusable in practice.) Just because Ext4 complies with the minimum requirements of the spec doesn't make it right, especially if it trashes your data.
  • Re:LOL: Bug Report (Score:5, Insightful)

    by causality (777677) on Thursday March 19, 2009 @02:46PM (#27260521)

    The first question that came to mind when I read that is "why would the average application need to concern itself with filesystem details?"

    They don't. Applications just need to concern themselves with the details of the APIs they use, and the guarantees those APIs do or don't provide.

    The POSIX file APIs specify quite clearly that there is no guarantee that your data is on the disk until you call fsync(). The problem is with applications that assumed they could ignore what the specification said just because it always seemed to work okay on the file systems they tested with.

    Thanks for explaining that. In that case, I salute Mr. Ts'o and others for telling the truth and not caving in to pressure when they are in fact correctly following the specification. Too often people who are correct don't have the fortitude to value that more than immediate convenience, so this is a refreshing thing to see. Perhaps this will become the sort of history with which developers are expected to be familiar.

    I imagine it will take a lot of work, but at least with Free Software this can be fixed. That's definitely what should happen, anyway. There are times when things just go wrong no matter how correct your effort was; in those cases, it makes sense to just deal with the problem in the most hassle-free manner possible. This, however, is not one of those times. Thinking that you can selectively adhere to a standard and then claim that you are compliant with that standard is just the sort of thing that really should cause problems. Correcting the applications that made faulty assumptions is therefore the right way to deal with this, daunting and inconvenient though that may be.

    Removing this delayed-allocation feature from ext4 or placing limits on it that are not required by the POSIX standard is definitely the wrong way to deal with this. To do so would surely invite more of the same. It would only encourage developers to believe that the standards aren't really important, that they'll just be "bailed out" if they fail to implement them. You don't need any sort of programming or system design expertise to understand that, just an understanding of how human beings operate and what they do with precedents that are set.

  • Re:LOL: Bug Report (Score:3, Insightful)

    by AigariusDebian (721386) <.gro.naibed. .ta. .suiragia.> on Thursday March 19, 2009 @02:48PM (#27260561) Homepage

    You have a separate partition for /root ? How large can the home folder of the root user be?

  • Re:LOL: Bug Report (Score:5, Insightful)

    by AigariusDebian (721386) <.gro.naibed. .ta. .suiragia.> on Thursday March 19, 2009 @02:53PM (#27260613) Homepage

    1) Modern filesystems are expected to behave better than POSIX demands.

    2) POSIX does not cover what should happen in a system crash at all.

    3) The issue is not about saving data, but the atomicity of updates so that either the new data or the old data would be saved at all times.

    4) fsync is not a solution, because it forces the operation to complete *now*, which is counterproductive to write performance, cache coherence, and laptop battery life, and causes excessive SSD wear, among other problems.

    We don't need reliable data-on-disk-now, we need reliable old-or-new data without using a sledgehammer of fsync.

  • A few percent performance difference will be easily wiped away when the filesystem erases an important file that one time a year when a snowstorm knocks your power out.

  • Re:Bad POSIX (Score:1, Insightful)

    by Anonymous Coward on Thursday March 19, 2009 @03:02PM (#27260735)

    Does POSIX really require this behavior, or just allow it?

    Exactly: Can is not Should.

    The Internet Protocol standard makes no guarantee, by definition, that packets will be received. Indeed, it explicitly references the fact that packet loss should be expected. But while a router which simply drops all the packets it receives might technically be standards-conformant in that respect, only an idiot would think that such behavior is acceptable.

    Similarly, most of the error messages and warnings we have come to expect from a modern compiler are not required by the C standard. So while a compiler which doesn't give such error messages isn't technically broken, it is reasonable for a user to think it's worthless and expect to use a compiler which works like all the other ones do.

    Standard writers are only human. With enough ingenuity, you can take any standard and produce an implementation which while technically conformant is horribly worthless or even hazardous to the end user.

    I'll finish off with a quote from another "standard" (RFC 1958): "Be strict when sending and tolerant when receiving." While not part of the POSIX specification, it's a good principle to follow. While it might not technically be the filesystem's fault that the application is not strictly conformant to the POSIX specification, that doesn't mean the filesystem should shrug its shoulders and say "meh". When possible, the filesystem should be reasonably tolerant of erring applications, so that the system doesn't choke and die unexpectedly.

    Standing around with your fingers in your ears singing "la, la, la, we're standard conformant and we can't hear you, la, la, la" is never acceptable behavior.

  • Re:LOL: Bug Report (Score:2, Insightful)

    by mrwolf007 (1116997) on Thursday March 19, 2009 @03:13PM (#27260891)
    Absolutely correct.
    And that's the way it should be done.
    Stability by default, increased performance by request.
    Let's be realistic: how many applications benefit from this delayed write? Not many, I'd guess. Now, on the other hand, if you have an extremely I/O-heavy app, disable the auto syncs and do it manually.
  • by SanityInAnarchy (655584) <ninja@slaphack.com> on Thursday March 19, 2009 @03:23PM (#27261031) Journal

    The right time being the hundredths of a second between the commit of the file data and the commit of the directory data, not 60 seconds.

    And if you fsync'd, the right time would be zero, on either ext3 or ext4. Or XFS, for that matter.

    If I fsync after every write, I can get reliability in ext2.

    No you can't. Reliability in ext2 would force you to sync not just your file, but whole directory structures -- and even then, you'd only be safe until something else starts writing.

    I put up with the performance hit from ext3 and ext4 because I want the reliability in the filesystem instead of having to build it into every part of every application.

    Too late.

    All the journaling guarantees is that if you lose power, you won't have to fsck -- you'll get a filesystem which is internally consistent. Oh, and it also guarantees that you won't see circular directory entries, or an entire directory falling off the face of the planet, and other nastiness.

    Whether it's consistent with respect to your application is completely outside the scope of the FS journaling, and is the responsibility of your application. Put it in a library, use a database, whatever -- but it's not the filesystem's fault that you failed to read the spec, nor is it very smart of you to code to ext3 instead of POSIX.
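The point about syncing whole directory structures can be made concrete: a rename is a change to the directory, not the file, so making it durable also requires an fsync() on the directory itself (a sketch of the general POSIX technique):

```python
import os

def fsync_dir(dirpath):
    """Open the directory read-only to obtain a descriptor, then fsync it.
    Without this, a durably written file can still be lost if the directory
    entry pointing at it never reaches the disk."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

fsync_dir(".")
```

(fsync on a directory descriptor works on Linux; portability to other systems varies.)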

  • Re:No kidding (Score:3, Insightful)

    by JumboMessiah (316083) on Thursday March 19, 2009 @04:00PM (#27261547)

    I just posted this in the wrong thread. Synopsis:

    I made a lot of money back in the 90's repairing NTFS installs. The similarity between NTFS back then and EXT4 is that they are/were young file systems.

    Give Ted and company a break. Let him get the damn thing fixed up (I have plenty of faith in Ted). Hell, I even remember losing an EXT3 file system back when it was fresh out of the gate. And I'm sure there's plenty who could say the same for all those you listed, including ZFS.

    And your comment about extended data caching -- is your memory short? Remember "laptop mode", specifically set up this way to keep the hard drive from having to spin up...

  • Re:LOL: Bug Report (Score:5, Insightful)

    by ultranova (717540) on Thursday March 19, 2009 @04:15PM (#27261747)

    Solution: an update to the code to behave as idiot application programmers require with a simple mount option.

    The application programmers aren't at fault here, the POSIX spec is. A filesystem is essentially a hierarchical database, yet POSIX doesn't include a way to make atomic updates to it. The only tool provided is fsync, which kills performance if used. And even with fsync some things - such as rewriting a configuration file - are either outright impossible or complex and fragile.

    The real solution is to come up with a transactional API for filesystem. Until that's done, problems like this will persist. Calling fsync - which forces a disk write - or playing around with temporary files isn't reasonable when all you want to do is make sure that the file will be updated properly or left alone.

    The alternative is to have every program call fsync constantly, which not only kills performance, but ironically enough also negates some of Ext4's advantages, such as delayed block allocation, since it essentially disables write caching. And it doesn't work if you are doing more complex things, such as, say, mass renaming files in a directory; you have no way of ensuring that either they are all renamed, or none are.
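The mass-rename case is easy to illustrate: each rename() is individually atomic, but POSIX offers no way to make the batch atomic as a whole (a sketch with hypothetical filenames):

```python
import os

def mass_rename(pairs):
    """Rename many files, one at a time. A crash partway through leaves
    some files renamed and others not; there is no all-or-nothing
    guarantee, and no amount of fsync() changes that."""
    for old, new in pairs:
        os.rename(old, new)

# e.g. renumbering a photo collection:
# mass_rename([("img1.jpg", "trip-001.jpg"), ("img2.jpg", "trip-002.jpg")])
```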

  • Re:LOL: Bug Report (Score:5, Insightful)

    by blazerw (47739) on Thursday March 19, 2009 @04:31PM (#27261947)

    1) Modern filesystems are expected to behave better than POSIX demands.

    2) POSIX does not cover what should happen in a system crash at all.

    3) The issue is not about saving data, but the atomicity of updates so that either the new data or the old data would be saved at all times.

    4) fsync is not a solution, because it forces the operation to complete *now*, which is counterproductive to write performance, cache coherence, and laptop battery life, and causes excessive SSD wear, among other problems.

    We don't need reliable data-on-disk-now, we need reliable old-or-new data without using a sledgehammer of fsync.

    1. POSIX is an API. It tries not to force the filesystem into being anything at all. So, for instance, you can write a filesystem that waits to do writes more efficiently to cut down on the wear of SSDs.
    2. Ext3 has a max 5 second delay. That means this bug exists in Ext3 as well.
    3. If you have important data that if not written to the hard drive will cause catastrophic failure, then you use the part of the API that forces that write.
    4. Atomicity does not guarantee the filesystem be synchronized with cache. It means that during the update no other process can alter the affected file and that after the update the change will be seen by all other processes.

    We don't need a filesystem that sledgehammers each and every byte of data to the hard drive just in case there is a crash. What we DO need is a filesystem that can flexibly handle important data when told it is important, and less important data very efficiently.

    What you are asking is for the filesystem to be some kind of sentient, all-knowing being that can tell whether data is important or not and then write important data immediately and non-important data efficiently. I think it's a little better to have the application be the one that knows when it's dealing with important data.

  • Re:LOL: Bug Report (Score:3, Insightful)

    by Foolhardy (664051) <csmith32NO@SPAMgmail.com> on Thursday March 19, 2009 @04:42PM (#27262071)
    It sounds like the correct solution is for the file system to implement transactional semantics. That is what the applications need and were incidentally getting, despite it not being in the spec.

    Why isn't this being considered as the solution? Other major OSes have implemented basic atomic transactions in their filesystems successfully; why not Linux?
  • Re:Bollocks (Score:1, Insightful)

    by Anonymous Coward on Thursday March 19, 2009 @05:12PM (#27262421)

    You are entirely incorrect, sir. A file system IS a Database Management System. It is not a Relational Database Management System, but its sole purpose is to store and access data in an organized fashion, creating, if you will, a base of data.

  • Re:LOL: Bug Report (Score:5, Insightful)

    by somenickname (1270442) on Thursday March 19, 2009 @05:17PM (#27262477)

    fsyncs have other nasty side effects besides performance. For example, in Firefox 3, places.sqlite is fsynced after every page load. For a laptop user, this behavior is unacceptable, as it prevents the disks from staying spun down (not to mention the infuriating whine created by spinning the disk up after every, or nearly every, page load). The use of fsync in Firefox 3 has actually caused some people (myself included) to mount ~/.mozilla as tmpfs and just write a cron job to write changed files back to disk once every 10 minutes.

    So, while I'm all for applications using fsync when it's really needed, the last thing I'd like to see is every application on the planet sprinkling its code with fsync "just to be sure".

  • Re:LOL: Bug Report (Score:3, Insightful)

    by ChaosDiscord (4913) * on Thursday March 19, 2009 @05:23PM (#27262523) Homepage Journal

    Glossing over some details, what is happening is closer to this:

    The goal is to replace config with a new version. The programmer is essentially doing this:

    • 1. Create config.new. (Should be empty, because it's new)
    • 2. Write the new contents into config.new
    • 3. Move config.new onto config

    The goal is that when you replace config, you're replacing it with a guaranteed complete version, config.new. Assuming it happens in this order (and that step 3 is atomic; it happens or doesn't, never partially) if you crash midway through, you'll either end up with the old config or the new config, but never a partial config. Unfortunately the operating system tries to speed things up, and for a variety of good reasons delaying step 2 makes sense. Doing so is allowed by the standards specifically for these good reasons. So what actually happens is this:

    • 1. Create config.new. (Should be empty, because it's new)
    • 3. Move config.new onto config
    • 2. Write the new contents into config.new (which is actually config now, so it works)

    This works fine... unless something happens between steps 3 and 2. If we stop there, we have a new, empty file in place of "config." With ext4, the window between 3 and 2 could be as long as a minute, a window during which you can lose data.

    The correct solution is for the program, not the operating system, to take care with files it cares about:

    • 1. Create config.new. (Should be empty, because it's new)
    • 2a. Write the new contents into config.new
    • 2b. Wait until the contents are on disk. ("fsync")
    • 3. Move config.new onto config

    Now it's not possible to move 2a after 3, so you're guaranteed safe behavior. But you lose the speed benefits of reordering. For data you care about, this is a good idea. For data you don't care about (Your web browser cache leaps to mind), it's overkill and makes you slower.

    ext3 (and the new ext4 option) essentially adds 2b automatically. It's good in that it's safer for everyone involved, but it's bad in that everyone takes a speed hit, even in cases where speed is more important than safety.
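    The safe sequence described above can be sketched in C roughly as follows. This is a minimal illustration of the "write temp file, fsync, rename" pattern, not code from any particular program; the file name `config.new` and the helper name `atomic_replace` are made up for the example:

    ```c
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>

    /* Replace `path` with `data` so that after a crash you see either
     * the complete old contents or the complete new contents, never a
     * truncated mix. Returns 0 on success, -1 on error. */
    int atomic_replace(const char *path, const char *data, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.new", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        /* Step 2a: write the new contents into the temporary file. */
        ssize_t n = write(fd, data, len);
        if (n < 0 || (size_t)n != len) { close(fd); unlink(tmp); return -1; }

        /* Step 2b: force the data blocks to disk before renaming, so the
         * rename can never be reordered ahead of the data write. */
        if (fsync(fd) < 0) { close(fd); unlink(tmp); return -1; }
        if (close(fd) < 0) { unlink(tmp); return -1; }

        /* Step 3: atomically move the complete file into place. */
        return rename(tmp, path);
    }

    int main(void)
    {
        const char *cfg = "color=blue\n";
        if (atomic_replace("config", cfg, strlen(cfg)) != 0) {
            perror("atomic_replace");
            return 1;
        }
        return 0;
    }
    ```

    Dropping the fsync() call gives you the fast-but-risky ordering from the second list; keeping it gives you the safe ordering at the cost of waiting for the disk.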

  • Re:LOL: Bug Report (Score:5, Insightful)

    by spitzak (4019) on Thursday March 19, 2009 @05:30PM (#27262591) Homepage

    You don't understand the problem.

    You are wrong when you say EXT3 has this problem. It does not have it. If the EXT3 system crashes during those 5 seconds, you either get the old file or the new one. For EXT4, if it crashes, you can get a zero-length file, with both the old and new data lost.

    The long delay is irrelevant and is confusing people about this bug. In fact the long delay is very nice in EXT4 as it means it is much more efficient and will use less power. I don't really mind if a crash during this time means I lose the new version of a file. But PLEASE don't lose the old one as well!!! That is inexcusable, and I don't care if the delay is .1 second.

  • by ChaosDiscord (4913) * on Thursday March 19, 2009 @05:34PM (#27262637) Homepage Journal

    "The workaround is laughable -- 'call fsync(), and then wait(), wait(), wait(), for the Wizard to see you.'"

    The "workaround" has been the standard for decades! Twenty years ago when I was learning programming I was warned: Until you call fsync(), you have no guarantee that your data has landed on disk. If you want to be sure the data is on the disk, call fsync(). While it's a complication for application developers, the benefit is that it allows filesystem developers to make the filesystem faster. That ext3 in its default configuration happened to work as erroneously expected has always been a happy coincidence, not something to rely on.

    You might as well complain about the "workaround" that you have to shut down your computer properly instead of yanking the cord out of the wall, since it never used to lose data when you did that.

  • by mbessey (304651) on Thursday March 19, 2009 @05:49PM (#27262825) Homepage Journal

    There's a ton of software out there that uses the "write to new file with temporary name, then rename it to the final name" pattern, much of it written before Ext4 (or Ext3, or Ext) was designed, and rather a lot of it written before most of the folks on the Linux Kernel mailing list were even out of elementary school. This is a well-established method for reliably updating files, and it works, or fails gracefully, on almost every filesystem implementation from 1976 to the present day - except for Ext4.

    Claiming that otherwise-portable software ought to include Linux-specific (not to mention Ext4-specific!) code to avoid massive data loss seems a bit backward.

  • Re:LOL: Bug Report (Score:4, Insightful)

    by Eskarel (565631) on Thursday March 19, 2009 @07:34PM (#27263697)

    This is actually even stupider for flash drives. There is essentially zero seek time on a flash drive, so, in theory, it shouldn't really matter how much you write at any given time (since the only delay should be how long it takes to actually write the cell).

    In addition, presuming reasonable wear-leveling algorithms (which should be implemented in the device controller, not in any sort of software), every bit of math I've seen says that for any realistic amount of data writes, flash drives will last substantially longer than any current physical drives (last I saw, it was about 30 years if you wrote every sector on the disk once a day, scaling down as writes increase). Even writing six times the volume of the drive per day, that's 5 years, which is a fairly long time for consumer-grade physical drives, and unlike a physical drive, even if you can no longer write it, you can still read it, so you can just clone it over to a new drive.

    File systems will definitely have to change for flash drives, but delaying writes probably isn't going to be the way to do it, especially since there's no need to do so.

  • Re:LOL: Bug Report (Score:3, Insightful)

    by DiegoBravo (324012) on Friday March 20, 2009 @12:52AM (#27265329) Journal

    For many (most?) Unix admins, /root is just a nicer way to say "/ filesystem" or "root filesystem". The path /root for the root user's home directory is popular on Linux, but I never saw it in the Unixes I've used (though I don't know whether that custom is a Linux invention).

  • Re:Dunno (Score:3, Insightful)

    by mr3038 (121693) on Friday March 20, 2009 @08:04AM (#27267067)

    ext3 also delays writes. The bug is that ext4 does not delay renames to happen after the writes. Instead, renames happen immediately, and guess what, they spin your hard drive up; then you get to wait 60 seconds until the real data starts to be written. Oh, and if you lose power or crash during those 60 seconds, you lose all data - new and old. Oh, and your common desktop programs do that cycle several times a minute.

    Excuse my language, but why the fuck are those "common desktop programs" writing and renaming files several times a minute? I understand that files are written if I change any settings, but this is something different. Perhaps there should be some special filesystem that is designed to freeze the whole system for 1 second for every write() any application does. Such a filesystem could be used for application testing. That way it would be immediately obvious if any program is writing too much stuff without a good reason.

    Ext4 is doing exactly the right thing, because it never actually needs to write any of those files to the disk. Because those files are constantly replaced with new versions, there's no point trying to save any of them unless the application asks for it. To do that, the application should call fsync(). Otherwise, the FS has no obligation to write anything, in any given order, to the disk until the FS is unmounted. A high-performance FS with enough cache would not write anything to disk until fsync() unless the CPU and disk have nothing else to do (and even then, only because it probably improves the performance of a possible future fsync() or unmount).
