Slashdot
Kernel Hackers On Ext3/4 After 2.6.29 Release
Posted by
timothy
on Wed Mar 25, 2009 08:18 AM
from the good-things-come-from-certain-clashes dept.
microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints about what seems to be a long-time performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Kleen, Theodore Ts'o, and of course Linus Torvalds have all participated. It may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"
Related Stories
Hardware: The Hairy State of Linux Filesystems 187 comments
RazvanM writes "Do the OSes really shrink? Perhaps the user space (MySQL, CUPS) is getting slimmer, but how about the internals? Using as a metric the number of external calls between the filesystem modules and the rest of the Linux kernel I argue that this is not the case. The evidence is a graph that shows the evolution of 15 filesystems from 2.6.11 to 2.6.28 along with the current state (2.6.28) for 24 filesystems. Some filesystems that stand out are: nfs for leading in both number of calls and speed of growth; ext4 and fuse for their above-average speed of growth and 9p for its roller coaster path."
Ext4 Data Losses Explained, Worked Around 421 comments
ddfall writes "H-Online has a follow-up on the Ext4 file system — Last week's news about data loss with the Linux Ext4 file system is explained and new solutions have been provided by Ted Ts'o to allow Ext4 to behave more like Ext3."
Linux Kernel 2.6.29 Released 265 comments
diegocgteleline.es writes "Linus Torvalds has released Linux 2.6.29. The new features include the inclusion of kernel graphic modesetting, WiMAX, access point Wi-Fi support, inclusion of squashfs and a preliminary version of btrfs, a more scalable version of RCU, eCryptfs filename encryption, ext4 no journal mode, OCFS2 metadata checksums, improvements to the memory controller, support for filesystem freeze, and other features. Here is the full list of changes."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Idiotic (Score:5, Informative)
Mirror for the thread:
http://thread.gmane.org/gmane.linux.kernel/811167/focus=811699 [gmane.org]
lkml.org server is slashdotted. (Score:5, Funny)
this is what I get from http://lkml.org/lkml/2009/3/24/460 [lkml.org]:
"The server is taking too long to respond; please wait a minute or 2 and try again."
Considering that there is only one comment on this slashdot thread, that means that most people will comment without actually reading TFA.
Like me... :-)
Re:lkml.org server is slashdotted. (Score:5, Funny)
Parent
Re:lkml.org server is slashdotted. (Score:5, Insightful)
Well this is just my meta comment. I'll be writing my real comment later...
You forgot to include a link to the comment you'll be writing later.
Parent
Re:lkml.org server is slashdotted. (Score:5, Insightful)
Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle and that the only reason for journaling was to alleviate the delay caused by fscking. All the filesystem can normally promise in the event of a crash is that the metadata will describe a valid filesystem somewhere between the last returned synchronization call and the state at the event of the crash. If you need more than that -- and you really, probably don't -- you have to do special things, such as running an OS that never, ever, ever crashes and putting a special capacitor in the system so the OS can flush everything to disk before the computer loses power in an outage.
Parent
Re:lkml.org server is slashdotted. (Score:5, Informative)
On-disk state must always be consistent. That was the point of journaling, so that you do not have to do a fsck to get to a consistent state. You write to a journal what you are planning to do, then you do it, then you activate it and mark it done in the journal. At any point in time, if power is lost, the filesystem is in a consistent state: either the state before the operation or the state after the operation. You might get some half-written blocks, but that is perfectly fine, because they are not referenced in the directory structure until the final activation step is written to disk, and those half-written blocks are still considered empty by the filesystem.
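The sequence described above (log the intent, write the data, activate it, mark it done) can be sketched as a toy in-memory model. This is only an illustration of the comment's description, not how ext3's journal actually works; `ToyDisk`, `write_file`, `recover`, and the `crash_after` parameter are all hypothetical names invented for the sketch.

```python
class Crash(Exception):
    """Simulated power loss (hypothetical, for illustration only)."""


class ToyDisk:
    """In-memory stand-in for a disk; not a real filesystem."""

    def __init__(self):
        self.blocks = {}     # block number -> bytes; unreferenced blocks count as free
        self.directory = {}  # file name -> block number
        self.journal = []    # intent records, oldest first


def write_file(disk, name, new_block, data, crash_after=None):
    """Write `data` for `name` into `new_block`, optionally 'crashing' after a step."""
    # Step 1: record in the journal what we are about to do.
    record = {"state": "intent", "name": name,
              "new": new_block, "old": disk.directory.get(name)}
    disk.journal.append(record)
    if crash_after == 1:
        raise Crash()
    # Step 2: write the data block; nothing references it yet.
    disk.blocks[new_block] = data
    if crash_after == 2:
        raise Crash()
    # Step 3: activate it by pointing the directory entry at the new block.
    disk.directory[name] = new_block
    if crash_after == 3:
        raise Crash()
    # Step 4: mark the operation as done in the journal.
    record["state"] = "done"


def recover(disk):
    """Roll back every operation that was not marked done, newest first."""
    for record in reversed(disk.journal):
        if record["state"] != "intent":
            continue
        if record["old"] is None:
            disk.directory.pop(record["name"], None)
        else:
            disk.directory[record["name"]] = record["old"]
        disk.blocks.pop(record["new"], None)  # the half-written block is free again
    disk.journal.clear()
```

Whichever step the simulated crash hits, `recover` leaves the directory pointing at either the old block or the new one, never at a half-written block.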
Parent
Let me guess... (Score:5, Funny)
Re:Let me guess... (Score:5, Funny)
Parent
OK, then... *WHO* is the official ext3 "moron"? (Score:5, Insightful)
Quote from Linus:
"...the idiotic ext3 writeback behavior. It literally does everything the wrong way around - writing data later than the metadata that points to it. Whoever came up with that solution was a moron. No ifs, buts, or maybes about it."
In the interests of fairness... it should be fairly easy to track down the person or group of people who did this. Code commits in the Linux world seem to be pretty well documented.
How about ASKING them rather than calling them morons?
(note: they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus)
TDz.
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:5, Insightful)
Torvalds knows exactly who it is and most people following the discussion will probably know it, too.
Also, there has been a fairly public discussion including a statement by the responsible person in question.
Not saying the name is Torvalds' attempt at saving grace. Similar to a parent of two children saying, I don't know who made the mess, but when I come back, it had better be cleaned up.
Yes, Mr. Torvalds is fairly outspoken.
Parent
Saving grace (Score:5, Funny)
Is the person responsible going to pull a classic political step-down where they resign "in order to spend more time with their family"?
Maybe it was Hans Reiser? Sure the guy is locked up in San Quentin, but nobody knows how to hack a filesystem to bits better than Reiser. Bada ba ching! Thank you, thank you... I'll be here all night.
Parent
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:5, Interesting)
Well, some Linux filesystem developers (and some fanboys) have been chastising other (higher-performance) filesystems for not providing the guarantees that ext3 ordered mode provides.
Application developers hence were indirectly educated to not use fsync(), because apparently a filesystem giving anything other than the ext3 ordered mode guarantees is just unreasonable, and ext3 fsync() performance really sucks. (The reason why you don't actually *want* what fsync implies has been explained in the previous ext4 data-loss posts).
Some of those developers are now complaining that their "new" filesystem (designed to do away with the bad performance of the old one) is disliked by users who are losing data due to applications being encouraged to be written in a bad way, and telling the developers that they now should add fsync() anyway (instead of fixing the actual problem with the filesystem).
Moreover, they are complaining that the application developers are "weird" because of expecting to be able to write many files to the filesystem and not having them *needlessly* corrupted. IMAGINE THAT!
As an aside joke, the "next generation" btrfs which was supposed to solve all problems has ordered mode by default, but it's an ordered mode that will erase your data in exactly the same way as ext4 does.
Honestly, the state of filesystems in Linux is SO f***d that just blaming whoever added writeback mode is irrelevant.
Parent
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:5, Funny)
Yep, we urgently need some kind of killer FS for Linux...
Oh, wait...
Parent
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:5, Insightful)
fsync() (sync all pending driver buffers to disk) certainly has a major performance cost, but sometimes you do want to know that your data actually made it to disk - that's an entirely different issue from journalling and data/meta-data order of writes which is about making sure the file system is recoverable to some consistent state in the event of a crash.
I think sometimes programmers do fsync() when they really want fflush() (flush library buffers to driver) which is about program behavior ("I want this data written to disk real-soon-now", not hanging around in the library buffer indefinitely) rather than a data-on-disk guarantee.
IMO telling programmers to flatly avoid fsync is almost as bad as having a borked meta-data/data write order - programmers should be educated about what fsync does and when they really want/need it and when they don't. I'll also bet that if the file systems supported transactions (all-or-nothing journalling of a sequence of writes to disk), maybe via an ioctl(), that many people would be using that instead.
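To make the flush-versus-fsync distinction concrete, here is a minimal Python sketch. `write_durably` is a hypothetical helper name; `file.flush()` plays the role of C's `fflush()` (library buffer to kernel) and `os.fsync()` wraps the `fsync()` system call (kernel buffers to the storage device).

```python
import os


def write_durably(path, data):
    """Write bytes to `path` and wait until the kernel reports them on storage."""
    with open(path, "wb") as f:
        f.write(data)         # data may still sit in Python's userspace buffer
        f.flush()             # like fflush(): hand the bytes to the kernel
        os.fsync(f.fileno())  # like fsync(): ask the kernel to push them to the device
```

Dropping the `os.fsync` line gives the "real-soon-now" behaviour: the data has reached the kernel, but there is no guarantee yet that it has reached the disk. Note that a drive with its own write cache enabled may still be holding the data after `fsync` returns.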
Parent
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:5, Funny)
they may very well BE morons, but at least give them a chance to respond before being pilloried by Linus
He's following Ext3 writeback semantics. You'll have to wait for a patch to fix his behaviour.
Parent
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:5, Insightful)
Knowing the humor that Linus has, it could be himself.
Parent
Um. This doesn't make sense. (Score:5, Insightful)
Doesn't ext3 work in exactly the way mentioned? AIUI ordered data mode is the default.
from the FAQ: http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html [sapienti-sat.org]
"mount -o data=ordered"
Only journals metadata changes, but data updates are flushed to
disk before any transactions commit. Data writes are not atomic
but this mode still guarantees that after a crash, files will
never contain stale data blocks from old files.
"mount -o data=writeback"
Only journals metadata changes, and data updates are entirely
left to the normal "sync" process. After a crash, files
may contain stale data blocks from old files: this mode is
exactly equivalent to running ext2 with a very fast fsck on reboot.
So, switching writeback mode to write the data first would simply be using ordered data mode, which is the default...
Parent
Re:OK, then... *WHO* is the official ext3 "moron"? (Score:5, Informative)
ext3 was merged to the mainline kernel in 2001. Git was created in 2005. I wouldn't trust any authorship evidence in a git repo for code predating the repo.
The journalling behavior of ext3 was probably decided by Stephen Tweedie [wikipedia.org]
Parent
Data - metadata ordering: softupdates (Score:5, Informative)
Somebody's going to mention it so here it is: there was a BSD unix research project that ended as the soft-updates implementation (currently present in all modern free BSDs). It deals precisely with the ordering of metadata and data writes. The paper is here: http://www.ece.cmu.edu/~ganger/papers/softupdates.pdf [cmu.edu]. Regardless of what Linus says, soft-updates with strong ordering also do metadata updates before data updates, and also keep track of ordering *within* metadata. It has proven to be very resilient (up to hardware problems).
Here's an excerpt:
We refer to this requirement as an update dependency, because safely writing the directory entry depends on first writing the inode. The ordering constraints map onto three simple rules: (1) Never point to a structure before it has been initialized (e.g., an inode must be initialized before a directory entry references it). (2) Never reuse a resource before nullifying all previous pointers to it (e.g., an inode's pointer to a data block must be nullified before that disk block may be reallocated for a new inode). (3) Never reset the last pointer to a live resource before a new pointer has been set (e.g., when renaming a file, do not remove the old name for an inode until after the new name has been written). The metadata update problem can be addressed with several mechanisms. The remainder of this section discusses previous approaches and the characteristics of an ideal solution.
There's some quote about this... something about those who don't know unix and about reinventing stuff, right :P ?
Re:Data - metadata ordering: softupdates (Score:5, Informative)
It has proven to be very resilient (up to hardware problems).
No it hasn't, which is why it has been removed from NetBSD and replaced by a journaled filesystem. I've also heard grumblings from OpenBSD people about corrupted filesystems with softdep enabled.
Parent
Re:Slow performance (Score:5, Funny)
Well, they had to switch the lkml server to ext3 because posts kept getting killed and cut into pieces with their old filesystem and the admins just kept saying "Well, they must've gone to Russia."
Parent
Re:I would go further than Linus on this one... (Score:5, Informative)
You are confusing writeback caching with ext3/4's writeback option, which is simply something different.
The problem with all the ext3/ext4 discussions has been the ORDER in which things get written, not whether they are cached or not. (Hence the existence of an "ordered" mode)
You want new data written first, and the references to that new data updated later, and most definitely NOT the other way around.
Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.
Parent
Re:I would go further than Linus on this one... (Score:5, Informative)
Here's what Linus had to say, and I think he hit the nail on the head:
The point is, if you write your metadata earlier (say, every 5 sec) and
the real data later (say, every 30 sec), you're actually MORE LIKELY to
see corrupt files than if you try to write them together.
And if you write your data _first_, you're never going to see corruption
at all.
This is why I absolutely _detest_ the idiotic ext3 writeback behavior. It
literally does everything the wrong way around - writing data later than
the metadata that points to it. Whoever came up with that solution was a
moron. No ifs, buts, or maybes about it.
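At the application level, the same "data first, then the pointer to it" ordering is usually done by writing a temporary file, syncing it, and only then renaming it over the old name. The sketch below assumes a POSIX system; `atomic_replace` and the `.tmp` suffix are hypothetical choices for the example, not anything from the kernel thread.

```python
import os


def atomic_replace(path, data):
    """Replace the contents of `path` so readers see the old or the new file, never a mix."""
    tmp = path + ".tmp"  # hypothetical temp name; a real tool might use tempfile.mkstemp
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # the new data reaches the disk first...
    os.replace(tmp, path)     # ...and only then does the name point at it
    # Sync the containing directory so the rename itself is on disk (POSIX only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping the `os.fsync` on the temporary file is the shortcut that ext3's ordered mode happened to tolerate and that delayed allocation in ext4 exposed.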
Parent
Re:Safest mkfs/mount options? (Score:5, Informative)
You could also look into Sun's RAID-z:
http://en.wikipedia.org/wiki/Non-standard_RAID_levels#RAID-Z [wikipedia.org]
Parent
Re:Safest mkfs/mount options? (Score:5, Informative)
My advice:
- Make regular backups; you'll need them eventually. Keep some off-site.
- ext3 filesystem, default "data=ordered" journal
- Disable the on-drive write-cache with 'hdparm'
- "dirsync" mount option
- Consider a "relatime" or "noatime" mount option to increase performance (depending on whether or not you use applications that care about atime)
- If you don't want the performance hit from disabling the on-drive write-cache, add a UPS and set up software to shut down your system cleanly when the power fails. You are still vulnerable to power-supply failures etc. even if you have a UPS.
- Schedule regular "smartctl" scans to detect low-level drive failures
- Schedule regular RAID parity checks (triggered through a "/sys/.../sync_action" node) to look for inconsistencies. I have a software-RAID1 mirror and I've found problems here a few times (one of which was that 'grub' had written to only one of the disks of the md device for my /boot partition).
- Periodically compare the current filesystem contents against one of your old backups. Make sure that the only files that are different are ones that you expected to be different.
If you decide to use ext4 or XFS most of the above points will still apply. I don't have any experience with ext4 yet so I can't say how well it compares to ext3 in terms of data-preservation.
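A quick way to check which of the mount options above are actually in effect is to parse a line in the format used by /proc/mounts. This is a rough sketch: `check_mount` is a hypothetical helper, the sample line in the usage note is made up, and some kernels do not list a default `data=` mode at all, so `None` only means "not stated".

```python
def check_mount(mount_line):
    """Summarize the journal and atime options found in one /proc/mounts-style line."""
    fields = mount_line.split()
    if len(fields) < 4:
        raise ValueError("expected: device mountpoint fstype options ...")
    mount_point, fs_type, options = fields[1], fields[2], fields[3].split(",")
    # Pick out the data= journal mode if the kernel reports it.
    data_mode = next(
        (opt.split("=", 1)[1] for opt in options if opt.startswith("data=")), None
    )
    if "noatime" in options:
        atime = "noatime"
    elif "relatime" in options:
        atime = "relatime"
    else:
        atime = "default"
    return {
        "mount_point": mount_point,
        "fs_type": fs_type,
        "data_mode": data_mode,
        "dirsync": "dirsync" in options,
        "atime": atime,
    }
```

For example, `check_mount("/dev/md0 / ext3 rw,relatime,dirsync,data=ordered 0 0")` reports `data_mode` as `"ordered"` with `dirsync` set; feeding it each line of /proc/mounts on a live system is left to the reader.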
Parent