Kernel Hackers On Ext3/4 After 2.6.29 Release 316

Posted by timothy on Wednesday March 25, 2009 @08:18AM from the good-things-come-from-certain-clashes dept.

microbee writes "Following the Linux kernel 2.6.29 release, several famous kernel hackers have raised complaints about what seems to be a long-standing performance problem related to ext3. Alan Cox, Ingo Molnar, Andrew Morton, Andi Kleen, Theodore Ts'o, and of course Linus Torvalds have all participated. It may shed some light on the status of Linux filesystems. For example, Linus Torvalds commented on the corruption caused by writeback mode, calling it 'idiotic.'"

This discussion has been archived. No new comments can be posted.


Idiotic (Score:5, Informative)

by baadger ( 764884 ) writes: on Wednesday March 25, 2009 @08:27AM (#27327829)

Mirror for the thread:
http://thread.gmane.org/gmane.linux.kernel/811167/focus=811699 [gmane.org]

Re:Let me guess... (Score:4, Informative)

by Anonymous Coward writes: on Wednesday March 25, 2009 @08:53AM (#27328117)

According to Netcraft, yes. Ubuntu. [netcraft.com]

Wait, this is Slashdot... I need a cliche... uh...

Netcraft confirms it: that server is dying?

Re:I would go further than Linus on this one... (Score:5, Informative)

by Skuto ( 171945 ) writes: on Wednesday March 25, 2009 @08:59AM (#27328177) Homepage

You are confusing writeback caching with ext3/4's writeback option, which is simply something different.
The problem with all the ext3/ext4 discussions has been the ORDER in which things get written, not whether they are cached or not. (Hence the existence of an "ordered" mode.)
You want new data written first, and the references to that new data updated later, and most definitely NOT the other way around.
Linus seems to understand this much better than the people writing the filesystems, which is quite ironic.
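From the application side, the "new data first, references later" ordering is usually approximated with the write-a-temporary-file-then-rename idiom. A minimal sketch in Python, assuming a POSIX system; the helper name `replace_file_safely` is made up for illustration and does not come from the thread:

```python
import os
import tempfile

def replace_file_safely(path, data):
    """Write the new data first, then switch the name over to it."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # ask the kernel to push the data out first
        os.replace(tmp_path, path)   # only then update the reference (the name)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

After a crash, a reader of `path` should find either the complete old contents or the complete new contents, which is the property the ordering argument is about. How strongly that holds still depends on the filesystem and the drive's write cache.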

Re:OK, then... *WHO* is the official ext3 "moron"? (Score:3, Informative)

by morgan_greywolf ( 835522 ) writes: on Wednesday March 25, 2009 @09:03AM (#27328229) Homepage Journal

Most likely Ted Ts'o, based on the git commit logs [kernel.org]. I say most likely because someone more familiar with the kernel git repo than myself should probably confirm or deny this statement.

Re:OK, then... *WHO* is the official ext3 "moron"? (Score:2, Informative)

by morgan_greywolf ( 835522 ) writes: on Wednesday March 25, 2009 @09:16AM (#27328407) Homepage Journal

I can see you've never written any filesystem drivers ;). It's not quite that simple, but more or less that's the type of change you'd make.

Re:OK, then... *WHO* is the official ext3 "moron"? (Score:5, Informative)

by 644bd346996 ( 1012333 ) writes: on Wednesday March 25, 2009 @09:24AM (#27328489)

ext3 was merged to the mainline kernel in 2001. Git was created in 2005. I wouldn't trust any authorship evidence in a git repo for code predating the repo.
The journalling behavior of ext3 was probably decided by Stephen Tweedie [wikipedia.org]

Re:OK, then... *WHO* is the official ext3 "moron"? (Score:2, Informative)

by morgan_greywolf ( 835522 ) writes: on Wednesday March 25, 2009 @09:34AM (#27328637) Homepage Journal

Right, but this problem doesn't go back to 2001.

Re:Safest mkfs/mount options? (Score:5, Informative)

by remmelt ( 837671 ) writes: on Wednesday March 25, 2009 @09:36AM (#27328665) Homepage

You could also look into Sun's RAID-z:
http://en.wikipedia.org/wiki/Non-standard_RAID_levels#RAID-Z [wikipedia.org]

Re:I would go further than Linus on this one... (Score:2, Informative)

by AvitarX ( 172628 ) writes: <meNO@SPAMbrandywinehundred.org> on Wednesday March 25, 2009 @09:37AM (#27328685) Journal

It is by default, using the ordered journal type in Ext3.
It is not an option yet in Ext4, and for now may not be the default, but an option to be set at mount time.
Currently in Ext4, the metadata in the journal is updated first, then the data is written.
When software assumes that it can send commands and have them take effect in the order sent, this becomes problematic: without costly immediate writes there is a risk of losing even very old data, because the file's metadata gets updated before the data has been written to its new place.

Re:Safest mkfs/mount options? (Score:3, Informative)

by larry bagina ( 561269 ) writes: on Wednesday March 25, 2009 @09:41AM (#27328729) Journal

with lvm, you can easily try out the various file systems (don't forget jfs!). Personally, I've found linux XFS to corrupt itself beyond repair, so I use ext3.

Re:Safest mkfs/mount options? (Score:5, Informative)

by mmontour ( 2208 ) writes: <mail@mmontour.net> on Wednesday March 25, 2009 @09:53AM (#27328927)

My advice:
- Make regular backups; you'll need them eventually. Keep some off-site.
- ext3 filesystem, default "data=ordered" journal
- Disable the on-drive write-cache with 'hdparm'
- "dirsync" mount option
- Consider a "relatime" or "noatime" mount option to increase performance (depending on whether or not you use applications that care about atime)
- If you don't want the performance hit from disabling the on-drive write-cache, add a UPS and set up software to shut down your system cleanly when the power fails. You are still vulnerable to power-supply failures etc. even if you have a UPS.
- Schedule regular "smartctl" scans to detect low-level drive failures
- Schedule regular RAID parity checks (triggered through a "/sys/.../sync_action" node) to look for inconsistencies. I have a software-RAID1 mirror and I've found problems here a few times (one of which was that 'grub' had written to only one of the disks of the md device for my /boot partition).
- Periodically compare the current filesystem contents against one of your old backups. Make sure that the only files that are different are ones that you expected to be different.
If you decide to use ext4 or XFS most of the above points will still apply. I don't have any experience with ext4 yet so I can't say how well it compares to ext3 in terms of data-preservation.
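To check which of the mount options above are actually in effect, the option list can be read out of /proc/mounts. A small sketch; the function name and the sample line are hypothetical, and note that a default such as "data=ordered" is not necessarily listed explicitly on every kernel:

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field layout: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    return None

# Hypothetical sample line; on a real system use open("/proc/mounts").read().
sample = "/dev/md0 /home ext3 rw,noatime,dirsync,data=ordered 0 0\n"
opts = mount_options(sample, "/home")
print(opts)
```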

ZFS (Score:4, Informative)

by chudnall ( 514856 ) writes: on Wednesday March 25, 2009 @09:57AM (#27328995) Homepage Journal

Linux seriously needs to find a workaround to its licensing squabbles [blogspot.com] and find a way to get a rock-solid ZFS in the kernel. Right now, ZFS on OpenSolaris [opensolaris.org] is simply wonderful, and this is what I am deploying for file service at all my customer sites now. The scary thing about file system corruption is that it is often silent, and can go on for a long time, until your system crashes, and you find that all of your backups are also crap. I've replaced a couple of linux servers (and more than a couple of Windows servers) after filesystem and disk corruption compounded by naive RAID implementations (RAID[1-5] without end-to-end checksumming can make your data *less* safe), and my customers couldn't be happier. Having hourly snapshots [dzone.com] and a fast in-kernel CIFS server fully integrated with ZFS ACLs [sun.com] (and with support for NTFS-style mixed case naming) is just icing on the cake. Now if only I could have an OpenSolaris desktop with all the nice linux userland apps available. Oh wait, I can! [nexenta.org]

Re:OK, then... *WHO* is the official ext3 "moron"? (Score:3, Informative)

by BigBuckHunter ( 722855 ) writes: on Wednesday March 25, 2009 @10:06AM (#27329121)

Honestly, the state of filesystems in Linux is SO f***d that just blaming whoever added writeback mode is irrelevant.
I agree that the who-dun-it part is irrelevant. I disagree on the "SO f***d" part. We have three filesystems that write the journal prior to the data. Basically, we know the issue, and a similar fix can be shared amongst the three affected filesystems. We've had far more "f***d" situations than this (think etherbrick-1000) where hardware was being destroyed without a good understanding of what was happening. Everything will work out as it seems to have everyone's attention.

BBH

Data - metadata ordering: softupdates (Score:5, Informative)

by ivoras ( 455934 ) writes: <ivoras&fer,hr> on Wednesday March 25, 2009 @10:14AM (#27329215) Homepage

Somebody's going to mention it so here it is: there was a BSD unix research project that ended as the soft-updates implementation (currently present in all modern free BSDs). It deals precisely with the ordering of metadata and data writes. The paper is here: http://www.ece.cmu.edu/~ganger/papers/softupdates.pdf [cmu.edu]. Regardless of what Linus says, soft-updates with strong ordering also do metadata updates before data updates, and also keep track of ordering *within* metadata. It has proven to be very resilient (up to hardware problems).
Here's an excerpt:
We refer to this requirement as an update dependency, because safely writing the directory entry depends on first writing the inode. The ordering constraints map onto three simple rules: (1) Never point to a structure before it has been initialized (e.g., an inode must be initialized before a directory entry references it). (2) Never reuse a resource before nullifying all previous pointers to it (e.g., an inode's pointer to a data block must be nullified before that disk block may be reallocated for a new inode). (3) Never reset the last pointer to a live resource before a new pointer has been set (e.g., when renaming a file, do not remove the old name for an inode until after the new name has been written). The metadata update problem can be addressed with several mechanisms. The remainder of this section discusses previous approaches and the characteristics of an ideal solution.
There's some quote about this... something about those who don't know unix and about reinventing stuff, right :P ?
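The "update dependency" idea in the excerpt amounts to ordering writes so that every prerequisite lands first. A rough sketch using Python's standard `graphlib` (3.9+); the step names are paraphrased from the excerpt's examples for rules 1 and 3, and nothing here reflects how the real soft-updates code is structured:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each key may only be written after everything in its set has been written.
depends_on = {
    "write directory entry": {"initialize inode"},   # rule 1
    "remove old name": {"write new name"},           # rule 3
}

# static_order() yields the steps so that prerequisites always come first.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

Any order the sorter returns is acceptable as long as each prerequisite precedes the step that depends on it; independent steps may come out in either relative order.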

Re:I would go further than Linus on this one... (Score:5, Informative)

by Spazmania ( 174582 ) writes: on Wednesday March 25, 2009 @10:15AM (#27329231) Homepage

Here's what Linus had to say, and I think he hit the nail on the head:
The point is, if you write your metadata earlier (say, every 5 sec) and
the real data later (say, every 30 sec), you're actually MORE LIKELY to
see corrupt files than if you try to write them together.
And if you write your data _first_, you're never going to see corruption
at all.
This is why I absolutely _detest_ the idiotic ext3 writeback behavior. It
literally does everything the wrong way around - writing data later than
the metadata that points to it. Whoever came up with that solution was a
moron. No ifs, buts, or maybes about it.
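A toy model of the timing argument in that quote, assuming a single write at t=0 and fixed flush times. The function name and the state labels are invented for illustration:

```python
def state_after_crash(crash_time, metadata_flush=5, data_flush=30):
    """Classify what a crash at crash_time (seconds after the write) leaves on disk."""
    metadata_on_disk = crash_time >= metadata_flush
    data_on_disk = crash_time >= data_flush
    if metadata_on_disk and not data_on_disk:
        return "metadata points at unwritten data"
    if data_on_disk and not metadata_on_disk:
        return "data written but not yet referenced (old file intact)"
    if metadata_on_disk and data_on_disk:
        return "new file complete"
    return "old file intact"

# Metadata at 5 s, data at 30 s: a crash at 10 s hits the bad window.
print(state_after_crash(10))
# Data first (5 s), metadata later (30 s): the same crash is harmless.
print(state_after_crash(10, metadata_flush=30, data_flush=5))
```

In this model the bad state exists only when the metadata flush precedes the data flush, and the window is as wide as the gap between the two.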

Re:lkml.org server is slashdotted. (Score:5, Informative)

by AigariusDebian ( 721386 ) writes: <aigarius&debian,org> on Wednesday March 25, 2009 @10:45AM (#27329577) Homepage

On-disk state must always be consistent. That was the point of journaling, so that you do not have to do a fsck to get to a consistent state. You write to a journal what you are planning to do, then you do it, then you activate it and mark it done in the journal. At any point in time, if power is lost, the filesystem is in a consistent state: either the state before the operation or the state after the operation. You might get some half-written blocks, but that is perfectly fine, because they are not referenced in the directory structure until the final activation step is written to disk, and those half-written blocks are still considered empty by the filesystem.
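The sequence described here can be played through as a toy simulation. This is only a sketch: the in-memory `disk` dictionary, the four steps, and the assumption that each step is atomic are all simplifications made up for illustration, not how ext3 lays out its journal:

```python
def run_until_crash(steps_completed, old=b"old", new=b"new"):
    """Apply the first steps_completed steps of a journaled update, then 'crash'."""
    disk = {
        "blocks": {0: old},          # block 0 holds the current contents
        "directory": {"file": 0},    # the name points at block 0
        "journal": [],
    }
    steps = [
        lambda: disk["journal"].append("intent: file -> block 1"),
        lambda: disk["blocks"].update({1: new}),         # unreferenced until activation
        lambda: disk["directory"].update({"file": 1}),   # activation step
        lambda: disk["journal"].append("done"),
    ]
    for step in steps[:steps_completed]:
        step()
    # What a reader sees after the crash: whatever the directory references.
    return disk["blocks"][disk["directory"]["file"]]

# Crash after 0, 1, 2, 3 or 4 completed steps.
visible = [run_until_crash(n) for n in range(5)]
print(visible)
```

Whichever step the crash interrupts, the name resolves to either the old contents or the new contents, never to a block that was not fully written before being referenced.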

Re:I would go further than Linus on this one... (Score:3, Informative)

by Rich0 ( 548339 ) writes: on Wednesday March 25, 2009 @11:06AM (#27329857) Homepage

This is more of a response to the 5 other replies to this comment - but rather than post it 5 times I'll just stick it here...
What everybody else has proposed is the obvious solution, which is essentially copy-on-write. When you modify a block, you write a new block and then deallocate the old block. This is the way ZFS works, and it will also be used in btrfs. Aside from the obvious reliability improvement, it also can allow better optimization in RAID-5 configurations, as if you always flush an entire stripe you don't need to do a read-before-write to update the checksum data. The algorithm is also very amenable to snapshotting - you just hold off on deallocating the old blocks. In fact, snapshots perform better than normal writes since there are fewer steps (of course you do waste disk space - but you usually don't keep snapshots around forever).
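A toy illustration of the copy-on-write and snapshot behaviour described above. The class `CowStore` and its methods are hypothetical and say nothing about how ZFS or btrfs are implemented:

```python
class CowStore:
    """Toy copy-on-write store: a modification never overwrites a block in place."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.table = {}       # logical name -> block id (the live view)
        self.snapshots = []   # saved copies of the table
        self._next_id = 0

    def write(self, name, data):
        new_id = self._next_id
        self._next_id += 1
        self.blocks[new_id] = data          # 1. write the new block
        old_id = self.table.get(name)
        self.table[name] = new_id           # 2. repoint the reference
        # 3. free the old block unless a snapshot still references it
        if old_id is not None and not any(old_id in s.values() for s in self.snapshots):
            del self.blocks[old_id]

    def snapshot(self):
        """Record the current table; returns the snapshot's index."""
        self.snapshots.append(dict(self.table))
        return len(self.snapshots) - 1

    def read(self, name, snapshot=None):
        table = self.table if snapshot is None else self.snapshots[snapshot]
        return self.blocks[table[name]]

store = CowStore()
store.write("a", b"v1")
snap = store.snapshot()
store.write("a", b"v2")   # the v1 block is kept because the snapshot references it
```

Taking a snapshot only copies the table, and a later write simply skips the deallocation step, which is the "fewer steps" point made above.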

Re:Data - metadata ordering: softupdates (Score:5, Informative)

by LizardKing ( 5245 ) writes: on Wednesday March 25, 2009 @11:14AM (#27329943)

It has proven to be very resilient (up to hardware problems).
No it hasn't, which is why it has been removed from NetBSD and replaced by a journaled filesystem. I've also heard grumblings from OpenBSD people about corrupted filesystems with softdep enabled.

Use fadvise (Score:3, Informative)

by Chemisor ( 97276 ) writes: on Wednesday March 25, 2009 @11:14AM (#27329953)

> We need a gradual level of tiers ranging from a database that does its own journaling
> and needs to know that data is fully written to disk to an application swapfile that if
> it never hits the disk isn't a big deal (granted, such an app should just use kernel swap,
> but that is another issue).
Actually there already is a syscall for telling the kernel how the file will be used.
posix_fadvise (int fd, off_t offset, off_t len, int advice)
POSIX_FADV_DONTNEED sounds like what you would use for your swapfile case.
I don't know if the kernel actually does anything with this information, but it looks like
this would be a good place to implement any new interfaces for what you are suggesting.
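For reference, the same call is exposed in Python as `os.posix_fadvise` on platforms that provide it. A minimal sketch of issuing the hint on a scratch file; the advice is only a hint, and what the kernel does with it is not specified here:

```python
import os
import tempfile

# Scratch file standing in for the "application swapfile" from the quoted comment.
with tempfile.NamedTemporaryFile() as scratch:
    scratch.write(b"\0" * 4096)
    scratch.flush()
    if hasattr(os, "posix_fadvise"):
        # offset=0 and len=0 mean "from the start to the end of the file".
        os.posix_fadvise(scratch.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        advised = True
    else:
        # Not every platform offers posix_fadvise.
        advised = False
```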

Re:A UPS (Score:3, Informative)

by swilver ( 617741 ) writes: on Wednesday March 25, 2009 @11:33AM (#27330195)

UPS are nice, and I use one too. It won't protect you from kernel crashes or direct hardware failures. It would still result in corrupted discs if some filesystem decided it did not yet have to write that 2 GB of cached data. Ext3 in ordered mode is still much preferred.

Re:lkml.org server is slashdotted. (Score:3, Informative)

by gclef ( 96311 ) writes: on Wednesday March 25, 2009 @11:37AM (#27330247)

Actually, he has a valid point: the user doesn't give a damn about whether their disk's metadata is consistent. They care about their actual data. If a filesystem is sacrificing user data consistency in favor of metadata consistency, then it's made the wrong tradeoff.

Re:lkml.org server is slashdotted. (Score:4, Informative)

by Anonymous Coward writes: on Wednesday March 25, 2009 @11:39AM (#27330279)

No, you're the one who's clueless.
The issue (as Linus said) isn't that the journalling is providing data integrity, it's that doing the journalling the wrong way causes *MORE* data loss.
Basically, you're sacrificing data integrity for speed, when you don't need to.
Perhaps you should work on your reading comprehension.

Re:Data - metadata ordering: softupdates (Score:1, Informative)

by Anonymous Coward writes: on Wednesday March 25, 2009 @12:55PM (#27331509)

It is quite stable in FreeBSD; might have been an error in the port to NetBSD and OpenBSD?
I know Kirk (McKusick) had to work really hard to get it properly stable on FreeBSD.

Integrity vs. consistency. (Score:5, Informative)

by WebCowboy ( 196209 ) writes: on Wednesday March 25, 2009 @01:22PM (#27331963)

Actually, Linus was, as he sometimes is, completely clueless. He's unaware of the fact that filesystem journaling was *NEVER* intended to give better data integrity guarantees than an ext2-crash-fsck cycle
Linus is not clueless in this case. I think it is a case of you misinterpreting the issue he was discussing.
Journaling is, as you say NOT about data integrity/prevention of data loss. That is what RAID and UPSes are for. However, it IS about data CONSISTENCY. Even if a file is overwritten, truncated or otherwise corrupted in a system failure (i.e. loss of data integrity) the journal is supposed to accurately describe things like "file X is Y bytes in length and resides in blocks 1,2,3...." (data/metadata consistency). Why would you update that information before you are sure the data was actually changed? A consistent journal is the WHOLE REASON why you can "alleviate the delay caused by fscking".
Linus rightly pointed out, with a degree of tact that Theo de Raadt would be proud of, that writing meta-data before the actual data is committed to disk is a colossally stupid idea. If the journal doesn't accurately describe the actual data on the drive then what is the point of the journal? In fact, it can be LESS than useless if you implicitly trust the inconsistent journal and have borked data that is never brought to your attention.

Re:Data - metadata ordering: softupdates (Score:1, Informative)

by Anonymous Coward writes: on Wednesday March 25, 2009 @02:42PM (#27333281)

It's still present in 4.0.1 which is the latest release and, as usual, I have not heard *any* OS related grumblings from OpenBSD people, ever.

Re:OK, then... *WHO* is the official ext3 "moron"? (Score:3, Informative)

by mmontour ( 2208 ) writes: <mail@mmontour.net> on Wednesday March 25, 2009 @04:51PM (#27335005)

fsync() is for flushing *all* data to disk. That's often the wrong thing to do! If the application just needs to flush its own writes to disk, or even just one specific write, and not incur the HUGE performance hit of fsync(), it shouldn't need to call fsync().
sync() is for flushing *all* data to disk.
fsync() and the related fdatasync() operate on a single file descriptor. There is also a finer-grained, non-portable "sync_file_range()" introduced in kernel 2.6.17 (according to the man page).
fsync() is the correct function call for an application to use when it wants to flush its writes (for a particular fd) to disk. It is unfortunate if the implementation cannot do so without having to also flush unrelated writes to disk, but that's beyond the control of a userspace application.
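The distinction can be shown with Python's thin wrappers around the same calls. A sketch only; the helper name `append_durably` and the log file are made up for illustration:

```python
import os
import tempfile

def append_durably(path, record):
    """Append one record and flush only this file descriptor to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)        # per-descriptor flush: this file's data and metadata
        # os.fdatasync(fd)  # related call; may skip metadata not needed to read the data
    finally:
        os.close(fd)

# os.sync() is the system-wide counterpart: it requests that *all* dirty data be written.

log_path = os.path.join(tempfile.mkdtemp(), "app.log")
append_durably(log_path, b"first\n")
append_durably(log_path, b"second\n")
```

Whether the per-descriptor call ends up flushing unrelated writes as well is a property of the filesystem, as noted above, and cannot be controlled from this code.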
