Bug Data Storage Linux

EXT4 Data Corruption Bug Hits Linux Kernel

Posted by Soulskill
from the plenty-of-time-to-fix dept.
An anonymous reader writes "An EXT4 file-system data corruption issue has reached the stable Linux kernel. The latest Linux 3.4, 3.5, 3.6 stable kernels have an EXT4 file-system bug described as an apparent serious progressive ext4 data corruption bug. Kernel developers have found and bisected the kernel issue but are still working on a proper fix for the stable Linux kernel. The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often."

  • by Anonymous Coward on Wednesday October 24, 2012 @03:33PM (#41755929)

    I know he'd never do anything to harm me or my data.

    • Re: (Score:2, Funny)

      by Anonymous Coward

      Or your wife?

    • Re: (Score:2, Funny)

      by localhost8080 (819098)
      yeah, reiser 4 has some killer features
    • by psm321 (450181)

      I know you're making a joke about the person, but I've had many corruption issues with ReiserFS. Granted, this was in its earlier days, but after it had been declared stable for use. I gave up on it after the problems, so no idea if later versions improved.

  • The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often.

    We're talking about Linux users here...move along.

    • Re: (Score:2, Troll)

      by vistapwns (1103935)
      What is it about Linux users' jokes that reminds me of the Iraqi Information Minister? ;)
    • by starless (60879)

      Even though my linux desktop machine runs for long periods without needing rebooting, there are exceptions:
      My several year old Pioneer television runs linux. It crashes and reboots if I change HD channels more than 5 or 6 times.
      My roku box needs to be rebooted from time to time.
      So does my android phone.

      • by RR (64484)

        Even though my linux desktop machine runs for long periods without needing rebooting, there are exceptions: My several year old Pioneer television runs linux. It crashes and reboots if I change HD channels more than 5 or 6 times. My roku box needs to be rebooted from time to time. So does my android phone.

        All those are also unlikely to be running EXT4. They store the system on flash and use SquashFS, JFFS2, or YAFFS2. The ones that use eMMC might use EXT4, but Samsung just donated F2FS for that use.

        Also, they tend to use very old kernels.

  • by K. S. Kyosuke (729550) on Wednesday October 24, 2012 @03:36PM (#41755963)

    The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often.

    They're trying to boost the average uptime of all installations by making people keep their machines turned on. It's just a continuation of the uptime war waged with the BSD folks!

  • Brilliant. Well, it certainly worries this Linux developer -- although I mostly rely on pre-3.0 kernels. Wasn't there a rule on Slashdot about mirroring articles before posting links to them?
    • Not that I've ever remembered. It was oft suggested in comments, but most websites are nearly Slashdot-proof these days. Kind of surprised that lkml is so sluggish under the load.

  • by dacut (243842) on Wednesday October 24, 2012 @03:38PM (#41756001)

    From Ted Ts'o's commentary, it's an optimization ("jbd2: don't write superblock when if its empty") gone awry:

    The reason why the problem happens rarely is that the effect of the buggy commit is that if the journal's starting block is zero, we fail to truncate the journal when we unmount the file system. This can happen if we mount and then unmount the file system fairly quickly, before the log has a chance to wrap.

    Basically, this optimization has the side effect of not updating the transaction log in this rare case. You can end up replaying old transactions after new ones, which will scramble metadata blocks. Given the rather unique conditions needed to hit this one, I'm not going to lose any sleep over any servers running without Ted's fix (though I'll certainly apply it once RedHat releases the patch).
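A toy model of the failure mode described above (stale journal records replayed over newer metadata), written in Python. It is only a sketch: the `replay` function, the dict standing in for the disk, and the block contents are all invented for illustration and are not taken from the jbd2 code.

```python
def replay(disk, journal):
    """Apply journal records in order; each record is (block_number, contents)."""
    for block, contents in journal:
        disk[block] = contents
    return disk


# Hypothetical state: block 7 was journaled as "v1", later rewritten as "v2",
# and "v2" reached the disk. A clean unmount should leave the journal empty.
disk = {7: "inode table v2"}

# In this model the journal was never marked empty, so the old record survives.
stale_journal = [(7, "inode table v1")]

# The next mount sees a non-empty journal and replays it over the newer block.
replay(disk, stale_journal)
print(disk[7])  # prints "inode table v1"
```

The point of the model is only the ordering problem: replay is harmless when the journal holds the newest writes, and destructive when it holds records older than what is already on disk.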

    • by Tough Love (215404) on Wednesday October 24, 2012 @03:58PM (#41756273)

      It means you could get an incorrect replay after a crash and end up needing to do a fsck. Good thing Ext2/3/4 fsck is awesome. Of course, having no replay bug will be much better. Note: the bug was introduced this October 8th. You are not running this kernel on your server or workstation unless you are a dev, it hasn't filtered through to distros yet.

      • by NotBorg (829820)

        You are not running this kernel on your server or workstation unless you are a dev, it hasn't filtered through to distros yet.

        I'm a crazy, badass rebel that uses Arch Linux for my workstation. Living wild and dangerous, I recklessly shut down my heathen ext4 computer every night. I feel like I'm that evil mayhem guy on the Allstate commercials. RECALCULATING!

      • by Bradmont (513167)
        > it hasn't filtered through to distros yet.

        FTA:
        > Linux 3.4, 3.5, 3.6 stable kernels

        I'm running Ubuntu 12.10 stock kernel:
        % uname -r
        3.5.0-17-generic
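A small Python sketch of the check being done by eye above: does a `uname -r` string belong to one of the series named in the summary (3.4, 3.5, 3.6)? The function name `in_named_series` is made up for this example, and a match only says the kernel is in one of those series, not that it actually carries the offending commit.

```python
import platform
import re

# Series named in the summary; membership does not prove the commit is present.
NAMED_SERIES = {(3, 4), (3, 5), (3, 6)}


def in_named_series(release):
    """Return True if a `uname -r` style string starts with 3.4, 3.5 or 3.6."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) in NAMED_SERIES


print(in_named_series("3.5.0-17-generic"))  # the release string quoted above
print(in_named_series(platform.release()))  # the machine running this script
```

To know whether a given kernel really contains the change, the distribution's changelog or the git history of its kernel tree is the thing to check; the version string alone is a hint at best.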
      • Note: the bug was introduced this October 8th.

        Probably one of the more informative comments here.
        • by fatphil (181876) on Wednesday October 24, 2012 @06:29PM (#41758439) Homepage
          $ git show eeecef0af5e
          commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3
          Author: Eric Sandeen <sandeen@redhat.com>
          Date: Sat Aug 18 22:29:40 2012 -0400

                  jbd2: don't write superblock when if its empty
          • by fatphil (181876)
            That's Linus' tree. This is Greg's:

            linux-stable$ git show 14b4ed22a6
            commit 14b4ed22a6b5fc1549504336131be4f5f6ba1bf4
            Author: Eric Sandeen <sandeen@redhat.com>
            Date: Sat Aug 18 22:29:40 2012 -0400

                    jbd2: don't write superblock when if its empty

                    commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3 upstream.
      • Re: (Score:2, Insightful)

        by Anonymous Coward

        The offending commit is present in both Ubuntu's 12.10 and 13.04 generic kernels, though the package versions are in the proposed repositories.

  • by Bovius (1243040) on Wednesday October 24, 2012 @03:43PM (#41756077)

    ...and too deep. It awoke a being of segfaults and kernel panics.

  • At first I had mixed feelings of slight disappointment and concern, especially because it is the default filesystem in several distros, (including Android) [wikipedia.org]. Although, after some second thoughts, I have come to the following conclusions:

    1) It is part of the game: continuous development toward improvement (most of the time) and new features implies some pitfalls. So far, the benefits [wikipedia.org] are much larger than the costs.

    2) Despite the fact developers are still working on a fix, I wouldn't be surprised if it
    • by compro01 (777531)

      This bug is only 10 days old. It's rather unlikely this has percolated down to anything important, much less Android, which still runs 3.0.31 from May.

      • Re:Part of the game (Score:4, Informative)

        by fatphil (181876) on Wednesday October 24, 2012 @06:33PM (#41758489) Homepage
        It is *not* 10 days old.

        linux-stable$ git show 14b4ed22a6
        commit 14b4ed22a6b5fc1549504336131be4f5f6ba1bf4
        Author: Eric Sandeen <sandeen@redhat.com>
        Date: Sat Aug 18 22:29:40 2012 -0400

                jbd2: don't write superblock when if its empty

                commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3 upstream.

                This sequence:

                # truncate --size=1g fsfile
                # mkfs.ext4 -F fsfile
                # mount -o loop,ro fsfile /mnt
                # umount /mnt
                # dmesg | tail

                results in an IO error when unmounting the RO filesystem:

                [ 318.020828] Buffer I/O error on device loop1, logical block 196608
                [ 318.027024] lost page write due to I/O error on loop1
                [ 318.032088] JBD2: Error -5 detected when updating journal superblock for loop1-8.

                This was a regression introduced by commit 24bcc89c7e7c: "jbd2: split
                updating of journal superblock and marking journal empty".

                Signed-off-by: Eric Sandeen <sandeen@redhat.com>
                Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
                Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

        diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
        index e149b99..484b8d1 100644
        --- a/fs/jbd2/journal.c
        +++ b/fs/jbd2/journal.c
        @@ -1354,6 +1354,11 @@ static void jbd2_mark_journal_empty(journal_t *journal)

                 BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
                 read_lock(&journal->j_state_lock);
        +        /* Is it already empty? */
        +        if (sb->s_start == 0) {
        +                read_unlock(&journal->j_state_lock);
        +                return;
        +        }
                 jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n",
                           journal->j_tail_sequence);
  • What term do we get to use for ext4 now? It's unfortunate that Theodore Ts'o is actually a pretty decent guy instead of being a murderer (and a jerk). So there aren't any obviously negative terms that come to mind.

    But clearly, something needs to be done along these lines, as well as a legion of people who forever more claim that ext4 corrupts your data and you should never use it and stick with ext3 instead.

  • Summary is wrong (Score:5, Informative)

    by DrJimbo (594231) on Wednesday October 24, 2012 @04:05PM (#41756397)

    The EXT4 file-system can experience data loss if the file-system is remounted (or the system rebooted) too often.

    This is wrong. The problem occurs when the fs is unmounted too *soon*. Twice in a row. The bug only appears if the journal buffer does not wrap. You only get catastrophic results if this happens twice in a row.

    • Re:Summary is wrong (Score:5, Interesting)

      by Anonymous Coward on Wednesday October 24, 2012 @04:27PM (#41756669)

      This appears to be untrue. My latest tests suggest that it happens if a single unclean umount happens while the fs is mounted in 3.6.3. (At least, I saw corruption in /var after a single boot, followed by a rescue boot into 3.6.1 and fsck: every filesystem that had journal replay invoked also had corruption.)

        -- N., original reporter, not much enjoying his fifteen minutes of fame since it comes with happy fun filesystem corruption attached: captcha is 'contrite', how appropriate

      • by DrJimbo (594231)

        I suspect that unclean umounts may trigger the bug too but that does not contradict anything I said. I did not say there was no corruption when you hit the bug once, I said there was catastrophic corruption when you hit it twice in a row. If a bug can be triggered by a clean umount, it is not very surprising if it also gets triggered by an unclean umount.

        Your experience seems to confirm my correction. It is not about how *often* you mount, it is about how you umount. This is a non-trivial distincti

  • ... can we get the words "stable", "linux", and "kernel" into a single summary? I like this game.

  • by Panaflex (13191) <convivialdingo AT yahoo DOT com> on Wednesday October 24, 2012 @04:47PM (#41756937)

    They're mounting it wrong!

    When you mount your disks, you need to be sure of proper head alignment. Make sure she's spun up properly as well, otherwise the disks could be surprised and jump away causing a crash. Lastly, my geek friends, mounting too often can cause burning friction which can destroy data and cause irritation and discomfort.

    • by isorox (205688) on Thursday October 25, 2012 @11:26AM (#41765259) Homepage Journal

      Lastly, my geek friends, mounting too often can cause burning friction which can destroy data and cause irritation and discomfort.

      I never had a problem with frequent mounting, however I have now found a side effect from a mount I performed last year. A child-process was forked into existence shortly after the mount, and now we find we're continuously receiving interrupts from the process, which has affected pretty much every aspect of system administration.

      I find that performing the mount is occasionally possible, but having to umount to give resources to deal with the child process (which often core dumps, and needs a lot of user interaction), before ejecting can lead to frustration and cold showers.

      Most of the time my team is simply trying to run sleep whenever we can.

  • by freman (843586) on Wednesday October 24, 2012 @06:21PM (#41758321)

    People reboot linux?

  • by tytso (63275) on Wednesday October 24, 2012 @09:42PM (#41760179) Homepage

    I have a Google+ post where I've posted my latest updates to this still-developing story:

    https://plus.google.com/117091380454742934025/posts/Wcc5tMiCgq7 [google.com]

    Also, I will note that before I send any pull request to Linus, I have run a very extensive set of file system regression tests, using the standard xfstests suite of tests (originally developed by SGI to test xfs, and now used by all of the major file system authors). So for example, my development laptop, which I am currently using to post this note, is currently running v3.6.3 with the ext4 patches which I have pushed to Linus for the 3.7 kernel. Why am I willing to do this? Specifically because I've run a very large set of automated regression tests on a very regular basis, and certainly before pushing the latest set of patches to Linus. So while it is no guarantee of 100% perfection, I and many other kernel developers *are* willing to eat our own dogfood.

  • What does this mean for the various versions of Android using ext4? I think I just flashed my tablet to use ext4 (ugh)... I really don't want corruption on my tablet...
    • by MtHuurne (602934)

      Android is unaffected: the bug was introduced after Linux 3.6 and no Android kernel is anywhere near that recent.

  • by anonieuweling (536832) on Thursday October 25, 2012 @05:02AM (#41762065)
    The more recent patch at http://marc.info/?l=linux-kernel&m=135105626207228&w=2 [marc.info] fixes stuff.
