Kernel Hackers On Ext3/4 After 2.6.29 Release
microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints about what seems to be a long-standing performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Kleen, Theodore Ts'o, and of course Linus Torvalds have all participated. It may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"
Safest mkfs/mount options? (Score:4, Interesting)
If I were to set up a new home spare-part server using software RAID-5 and LVM today, with kernel 2.6.28 or 2.6.29, and I really care about not losing important data in a power outage or system crash but still want reasonable performance (not running with -o sync), what would be my best choice of filesystem (EXT4 or XFS), mkfs options, and mount options?
Re:I would go further than Linus on this one... (Score:4, Interesting)
Am I right believing that the new data is written elsewhere and then the metadata is updated in place to point to the new data? I don't know much about filesystems..
Re:I would go further than Linus on this one... (Score:4, Interesting)
This is a potential problem when you are overwriting existing bytes or removing data.
In that case, you've removed or overwritten the data on disk, but now the metadata is invalid.
E.g., you truncated a file to 0 bytes and wrote the new data.
You started re-using those bytes for a new file that another process is creating.
Suddenly you are in a state where your metadata on disk is inconsistent, and you crash before that write completes.
Now you boot back up. You're on ext3, so you only journal metadata, and that's the only thing you can revert. Unfortunately, there's really nothing to roll back, since you haven't written any metadata yet.
Instead of having a 0-byte file, you have a file that appears to be the size it was before you truncated it, but the contents are silently corrupt and contain program B's data.
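One common way for an application to stay out of the truncate-then-rewrite window described above is to never overwrite the file in place: write the new contents to a temporary file, fsync it, then rename it over the old name. A minimal sketch in Python, assuming a POSIX system where rename within one directory is atomic; `atomic_replace` is a hypothetical helper name, not something from the thread or the kernel.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the file at `path` with `data` without truncating it in place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # so that the final rename is a single atomic operation.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to put the data on disk
        os.replace(tmp_path, path)  # old name now points at the new data
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

After a crash, a reader should see either the complete old file or the complete new one. The cost is the fsync() call, which is exactly the performance complaint about ext3 raised elsewhere in this thread.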
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:5, Interesting)
Well, some Linux filesystem developers (and some fanboys) have been chastising other (higher-performance) filesystems for not providing the guarantees that ext3 ordered mode provides.
Application developers hence were indirectly educated to not use fsync(), because apparently a filesystem giving anything other than the ext3 ordered mode guarantees is just unreasonable, and ext3 fsync() performance really sucks. (The reason why you don't actually *want* what fsync implies has been explained in the previous ext4 data-loss posts).
Some of those developers are now complaining that their "new" filesystem (designed to do away with the bad performance of the old one) is disliked by users who are losing data due to applications being encouraged to be written in a bad way, and telling the developers that they now should add fsync() anyway (instead of fixing the actual problem with the filesystem).
Moreover, they are complaining that the application developers are "weird" because of expecting to be able to write many files to the filesystem and not having them *needlessly* corrupted. IMAGINE THAT!
As an aside joke, the "next generation" btrfs which was supposed to solve all problems has ordered mode by default, but it's an ordered mode that will erase your data in exactly the same way as ext4 does.
Honestly, the state of filesystems in Linux is SO f***d that just blaming whoever added writeback mode is irrelevant.
Re:Safest mkfs/mount options? (Score:2, Interesting)
Ext3 with an ordered (default) style journal.
I believe XFS has a similar option, and Ext4 will with the next kernel, but for a home type system Ext3 should meet all of your needs, and Linux utilities still know it best.
Of course you should probably use RAID-10 too; with disk space so cheap, it is well worth it. Using the "far" disk layout, you get very fast reads, and though it penalizes writes (vs RAID 0) in theory, the benchmarks I have seen show that penalty to be smaller than the theory predicts.
As for mkfs, large inodes probably, and when mounting use noatime.
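To check which journal mode and atime setting a mounted filesystem reports, you can read its line in /proc/mounts. A rough sketch in Python: the sample line, `/dev/md0`, and `/srv` are made up for illustration, and the function names are hypothetical. Note the assumption that a missing data= option means the filesystem default is in effect; whether defaults get listed depends on the kernel and filesystem.

```python
def parse_mount_line(line):
    """Split one /proc/mounts-style line into its fields and options."""
    fields = line.split()
    device, mount_point, fstype, options = fields[:4]
    opts = {}
    for item in options.split(","):
        key, _, value = item.partition("=")
        opts[key] = value or True  # flags like "noatime" carry no value
    return {
        "device": device,
        "mount_point": mount_point,
        "fstype": fstype,
        "options": opts,
    }


def journal_summary(line):
    """Report the journal data mode and noatime flag for one mount line."""
    entry = parse_mount_line(line)
    opts = entry["options"]
    return {
        "mount_point": entry["mount_point"],
        "fstype": entry["fstype"],
        # None means no data= option was listed (assumed: default mode).
        "data_mode": opts.get("data"),
        "noatime": "noatime" in opts,
    }


# Hypothetical example line, in the format used by /proc/mounts.
SAMPLE = "/dev/md0 /srv ext3 rw,noatime,data=ordered 0 0"
print(journal_summary(SAMPLE))
```

On a real box you would feed it the lines of /proc/mounts instead of SAMPLE.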
For some anti-RAID-5 propaganda:
http://www.baarf.com/ [baarf.com]
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:3, Interesting)
hm. Similar to a parent of two children ranting at them without taking time to think first. Calling them morons is just going to get them growing up to be dysfunctional at best. No wonder the world has a dim view of the "geek" community.
It seems to me that, as usual, the issue is not as clear cut as it first appears [slashdot.org]
Re:Data - metadata ordering: softupdates (Score:1, Interesting)
Regardless of what Linus says, soft-updates with strong ordering also do metadata updates before data updates, and also keep track of ordering *within* metadata.
Maybe I misinterpret something here but doesn't that sound like the exact opposite of what you claim:
Block Allocation. When a new block or fragment is allocated for a file,
the new block pointer (whether in the inode or an indirect block) should not
be written to stable storage until after the block has been initialized.
So first initialize the data, then update the pointer in the metadata. If I am not totally mistaken that is exactly what Linus argues for.
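The quoted rule can be checked with a toy model: take the two writes in each order, "crash" after every possible prefix, and see whether the disk ever holds a pointer to an uninitialized block. This is only an illustrative sketch in Python with made-up names; it models two abstract writes, not how soft updates or ext3 are actually implemented.

```python
def crash_states(writes):
    """Yield the on-disk state for a crash after each prefix of `writes`."""
    for k in range(len(writes) + 1):
        disk = {"block_initialized": False, "pointer_set": False}
        for key in writes[:k]:
            disk[key] = True  # this write reached stable storage
        yield disk


def is_consistent(disk):
    """A pointer to an uninitialized block would expose stale contents."""
    return disk["block_initialized"] or not disk["pointer_set"]


def always_consistent(writes):
    """True if no crash point leaves a pointer to an uninitialized block."""
    return all(is_consistent(disk) for disk in crash_states(writes))


DATA_FIRST = ["block_initialized", "pointer_set"]
POINTER_FIRST = ["pointer_set", "block_initialized"]

print("data first:", always_consistent(DATA_FIRST))
print("pointer first:", always_consistent(POINTER_FIRST))
```

Writing the data first is safe at every crash point in this model; writing the pointer first has one crash point where the file references a block that still holds someone else's old contents.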
Re:lkml.org server is slashdotted. (Score:4, Interesting)
You forgot to include a link to the comment you'll be writing later.
Maybe the power failed in the middle of him writing his comment?
Don't worry...it'll appear in some other Slashdot thread until CmdrTaco does a fsck.
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:4, Interesting)
I agree. What we need is a mechanism for an application to indicate to the OS what kind of data is being written (in terms of criticality/persistence/etc). If it is the GIMP swapfile, chances are you can optimize differently for performance than if it is a file containing InnoDB tables.
Right now app developers are having to be concerned with low-level assumptions about how data is being written at the cache level, and that is not appropriate.
I got burned by this when my mythtv backend kept losing chunks of video when the disk was busy. Turns out the app developers had a tiny buffer in RAM, which they'd write out to disk, and then do an fsync every few seconds. So, if two videos were being recorded, the disk was constantly thrashing between two huge video files while also busy doing whatever else the system was supposed to be doing. When I got rid of the fsyncs and upped the buffer a little, all the issues went away. When I record video to disk, I don't care if, when the system goes down, I lose the last 20 seconds in addition to the next 5 minutes of the show during the reboot. This is just bad app design, but it highlights the problems when applications start messing with low-level details like the cache.
Linux filesystems just aren't optimal. I think that everybody is more interested in experimenting with new concepts in file storage, and they're not as interested in just getting files reliably stored to disk. Sure, most of this is volunteer-driven, so I can't exactly put a gun to somebody's head to tell them that no, they need to do the boring work before investing in new ideas. However, it would be nice if things "just worked".
We need a gradual level of tiers ranging from a database that does its own journaling and needs to know that data is fully written to disk to an application swapfile that if it never hits the disk isn't a big deal (granted, such an app should just use kernel swap, but that is another issue). The OS can then decide how to prioritize actual disk IO so that in the event of a crash chances are the highest priority data is saved and nothing is actually corrupted.
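No such tiered interface exists in the thread's kernels as far as this comment knows, but the idea can be approximated on the application side with the sync calls that do exist. A hedged sketch in Python: the `Durability` enum and `write_with_hint` are hypothetical names, the three tiers are one possible split, and the mapping onto fdatasync/fsync is an assumption about what each tier should mean.

```python
import enum
import os


class Durability(enum.Enum):
    SCRATCH = 0    # fine to lose, e.g. an application swapfile
    IMPORTANT = 1  # file contents should reach the disk
    CRITICAL = 2   # contents and directory entry should reach the disk


# fdatasync is not available on every platform; fall back to fsync.
_fdatasync = getattr(os, "fdatasync", os.fsync)


def write_with_hint(path, data, durability):
    """Write `data` to `path`, syncing as much as the tier asks for."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # hand the bytes to the kernel page cache
        if durability is Durability.IMPORTANT:
            _fdatasync(f.fileno())  # data, plus metadata needed to read it
        elif durability is Durability.CRITICAL:
            os.fsync(f.fileno())    # data and file metadata
    if durability is Durability.CRITICAL:
        # Also sync the directory so a newly created name is durable.
        directory = os.path.dirname(os.path.abspath(path))
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The SCRATCH tier leaves everything to the page cache, which is roughly what the video recorder above ended up wanting; a database would use CRITICAL.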
And I agree completely regarding transaction support. That would really help.
Re:ZFS (Score:4, Interesting)
It's similar (at least, a lot more similar than any other Linux filesystem), but less mature.
In defense of the LK team on the whole ZFS issue, I understand that part of the reason they didn't pursue some ZFS-like features years ago was because of patents. Now that SUN has open-sourced (though not in a GPL-compatible way) ZFS and is defending that against Network Appliance in a lawsuit, the way looks a lot clearer for Btrfs and company to proceed.
Actually, on that thought, the IBM acquisition of SUN should get NetApp to drop that lawsuit. Going up against SUN in a MAD patent dispute is a bit risky, but (as SCO discovered) aggressive IP lawsuits against IBM come in right behind "land war in Asia".
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:3, Interesting)
...btrfs is starting from the ground up rather than try to fight those camped on their domains and won't play ball,.... So why don't you stop talking shit, or come up with specific cases to back up your claims.
Didn't you just do that for me?
Things like XFS or JFS are badly maintained and supported because they are too complex and were lumped in from other systems. This is a problem if, for example, XFS is the only serious option for really big volumes.
Reiser3 receives no more improvements, Reiser4 is dead. That doesn't leave much besides ext3. Funnily, ext3 has been catching up in performance just because the other FS are dead. Ok, maybe funny isn't the right word...
Unlike other OSes, Linux has several filesystems to choose from for whatever the users' needs are, and new ones will appear from other proprietary systems at a later date. You think NTFS or HFS+ is any better?
Choice is fine when all choices are good. When all choices have serious and different issues, that just means effort has been wasted.
As for NTFS: At least from the application side you know which problems will hit you and which ones not.
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:3, Interesting)
Torvalds knows exactly who it is and most people following the discussion will probably know it, too....
Yes, Mr. Torvalds is fairly outspoken.
Yes, and the folks in that conversation are very thick-skinned and are used to such statements; it's just the way they communicate. Having Linus call you a moron is nothing. (and he's probably right) ;)
How many times have I looked at my own code and asked, "What MORON came up with this junk?"