Bug Linux

Bad Lockup Bug Plagues Linux 257

jones_supa (887896) writes "A hard-to-track system lockup bug seems to have appeared in the last couple of Linux kernel releases. Dave Jones of Red Hat was the first to report frequent lockups with 3.18. Later he found that the issue is present in 3.17 too. The problem was first suspected to be related to Xen. A patch dating back to 2005 was pushed for Xen to fix a vmalloc_fault() path that was similar to what Dave reported. The patch had a comment that read "the line below does not always work. Needs investigating!" But it looks like this issue was never properly investigated. Because the bug is so hard to track down, testers might be finding multiple but similar bugs within the kernel. Linus even suggested taking a look at the watchdog code. He also concluded that the Xen bug is a different issue. The bug hunt continues on the Linux Kernel Mailing List."
  • by bruce_the_loon ( 856617 ) on Saturday November 29, 2014 @12:42PM (#48485645) Homepage

    The last mail in the thread, dated the 26th of November, explains that the Xen bug was a Xen bug and that the lockup was something different and traceable once the chap experiencing the bug managed to get a kernel backtrace.

  • by Anonymous Coward on Saturday November 29, 2014 @01:36PM (#48485973)

    So it may be a "bad" lockup bug in the sense that nobody knows exactly what causes it, but it's not "bad" in the sense that people should worry overly.

    Why?

    Dave Jones sees it only under insane loads (CPU loads of 150+) running a stress tester that is designed to do crazy things (trinity). And he can reproduce it on only one of his machines, and even there it takes hours. And it happens on a debug kernel that has DEBUG_PAGEALLOC and other explicit (and complex) debug code enabled. And even then the bug is a "Hmm. We made no progress in the last 21 seconds", rather than anything stranger.

    In other words, it's "bad" in the sense that any unknown behavior is bad, but it's unknown mainly because it's so hard to trigger. Nobody other than core developers should really care. And those developers do care, so it's not like it's worrisome there either. It just takes longer to figure out because the usual "bisect it" approach isn't very easy when it can take a day to reproduce.

    • by kesuki ( 321456 )

      the answer is simple: grandma's cell phone goes dead when she is done talking to all her grandkids. sheesh, it reminds me of the 'delete file number 23 if parsed in japanese' bug or the 'we can't do the math on gregorian calendar because the line was in pascal and the unix machine doesn't get to it until the end of unix time' time loop bug. and yes i am crazy but i just took my meds an hour ago, and while i may not be 100% sure the related bugs are correctly allocated i can tell you i experienced every single o

    • by drolli ( 522659 )

      I care.

      I updated my kernel to 3.17 and the machine locks up every few days (not when stress testing, but when web surfing). No trace, no panic, nothing (which coincides with what was described in the thread).

  • Bug name (Score:5, Funny)

    by Lost Race ( 681080 ) on Saturday November 29, 2014 @02:46PM (#48486367)
    Since every bug this year needs to have a catchy name for the headlines, I propose we call this one "Davy Jones' Lockup [wikipedia.org]."
  • If you're using Xen - which is a virtualization package. I've never run across Xen in the wild - in fact only at one job interview did they actually use Xen.
