AMD Open Source Linux Hardware Technology

AMD Confirms Linux 'Performance Marginality Problem' On Ryzen (phoronix.com) 120

An anonymous reader writes: Ryzen customers experiencing segmentation faults under Linux when firing off many compilation processes have now had their problem officially acknowledged by AMD. The company describes it as a "performance marginality problem" affecting some Ryzen customers and only on Linux. AMD confirmed Threadripper and Epyc processors are unaffected; they will be dealing with the issue on a customer-by-customer basis, and their future consumer products will see better Linux testing/validation. Ryzen customers believed to be affected by the problem can contact AMD Customer Care. Michael Larabel writes via Phoronix: "With the Ryzen segmentation faults on Linux they are found to occur with many, parallel compilation workloads in particular -- certainly not the workloads most Linux users will be firing off on a frequent basis unless intentionally running scripts like ryzen-test/kill-ryzen. As I've previously written, my Ryzen Linux boxes have been working out great except in cases of intentional torture testing with these heavy parallel compilation tasks. [AMD's] analysis has also found that these Ryzen segmentation faults aren't isolated to a particular motherboard vendor or the like, contrary to rumors/noise online due to the complexity of the problem."
This discussion has been archived. No new comments can be posted.

  • Just like FDIV (Score:3, Insightful)

    by Anonymous Coward on Monday August 07, 2017 @07:28PM (#54960135)

    Will only affect a few people, so we aren't replacing any CPUs. Way to hand Intel the business, AMD!

    • Except it doesn't apply to Threadripper, Epyc, or Ryzen Pro. And it doesn't affect all of normal Ryzen either. So the entirety of the market they're handing to Intel is those buying personal systems who run large amounts of parallel compilation workloads and who don't feel like RMAing till they get a chip without the defect.

      • Re:Just like FDIV (Score:4, Insightful)

        by arglebargle_xiv ( 2212710 ) on Tuesday August 08, 2017 @04:11AM (#54962751)

        Except it doesn't apply to Threadripper, Epyc, or Ryzen Pro.

        We don't even know if it's an AMD problem, it could be any one of a number of previously-unnoticed Linux issues that happen to show up on Ryzen (note that the text says "may also affect other Unix-like operating systems", not "exists under FreeBSD as well", so currently it's pure speculation that it extends past Linux). We'll have to wait and see what further investigation turns up...

        • by Gr8Apes ( 679165 )

          Except it doesn't apply to Threadripper, Epyc, or Ryzen Pro.

          We don't even know if it's an AMD problem, it could be any one of a number of previously-unnoticed Linux issues that happen to show up on Ryzen (note that the text says "may also affect other Unix-like operating systems", not "exists under FreeBSD as well", so currently it's pure speculation that it extends past Linux). We'll have to wait and see what further investigation turns up...

          That's more than a little interesting. Wonder if it affects NetBSD? If both FreeBSD and NetBSD are free of this error, I may have my next system.

  • oblig (Score:5, Informative)

    by Anonymous Coward on Monday August 07, 2017 @07:38PM (#54960209)

    certainly not the workloads most Linux users will be firing off on a frequent basis

    I run Gentoo you insensitive clod!

    • Re:oblig (Score:5, Insightful)

      by Misagon ( 1135 ) on Tuesday August 08, 2017 @02:55AM (#54962489)

      How was the parent modded as "Funny"?
      This is definitely not funny. Some users of source-based distros such as Gentoo have encountered the bug on a fairly regular basis when compiling the distro, which is a required step of installing it.

  • by Anonymous Coward

    Multi-threaded performance was the main advantage that Ryzen had over Intel. Single-threaded is still Intel's game, and now you are telling me that I can't run a make -j on all my cores?

    • by F.Ultra ( 1673484 ) on Monday August 07, 2017 @08:07PM (#54960397)
      Well you can (run make -j ), just be prepared to rerun it if/when it segfaults... For most people so far, the segfault only shows up after running "make clean && make -jX" a few times, so a single make of even a large project should probably work most of the time. It will be interesting to see if/when AMD will be able to fix it, and in particular why Windows does not seem to suffer from it.
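
The "make clean && make -jX a few times" procedure described above can be automated. What follows is only a minimal sketch, not the ryzen-test/kill-ryzen script mentioned in the summary: the function names `run_once` and `stress` and the marker strings are assumptions chosen for illustration, and a crash marker in the build output is treated as a hint, not proof, of the hardware problem.

```python
import subprocess

# Assumed marker strings: text a compiler or shell typically prints when a process segfaults.
CRASH_MARKERS = ("Segmentation fault", "internal compiler error")

def run_once(cmd):
    """Run one command; return (exit_code, crashed).

    crashed is True if the command was killed by a signal (negative exit code
    on POSIX) or failed with a crash marker in its output.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    crashed = proc.returncode < 0 or (
        proc.returncode != 0 and any(marker in output for marker in CRASH_MARKERS)
    )
    return proc.returncode, crashed

def stress(clean_cmd, build_cmd, rounds):
    """Repeat clean + build; return the indices of rounds whose build looked like a crash."""
    crashed_rounds = []
    for i in range(rounds):
        subprocess.run(clean_cmd, capture_output=True, text=True)
        _, crashed = run_once(build_cmd)
        if crashed:
            crashed_rounds.append(i)
    return crashed_rounds

# Hypothetical usage inside a source tree (the job count is an assumption):
#   stress(["make", "clean"], ["make", "-j16"], rounds=10)
```

An ordinary compile error also makes the build exit non-zero, which is why the sketch only counts a round when a crash marker or a signal is involved.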
      • Maybe someone can put on the table the key differences between Linux and Windows thread scheduling, because surely it's in there somewhere.
        • by Anonymous Coward

          I suspect there are more users compiling under Linux. I run 16-thread parallel compiles daily under Win10 and see a lot of internal compiler errors on my 1700X, errors my FX-8370 never displayed on 8-thread builds. There's a good chance they simply haven't had any reports for Windows yet.

      • by rew ( 6140 )

        Wait!

        What is happening is that the CPU will mis-execute some instruction so that some "data" becomes invalid. When a compiler is running such data is often a pointer and the wrong pointer often results in a segfault.

        But especially while we don't know what's going on exactly, this could also corrupt data. i.e. give the wrong results in a computation, or result in a bad binary when the running program is a compiler.

        So you're suggesting I trust the resulting binaries when the compilation doesn't segfault? Even
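
One way to probe the silent-corruption worry raised above is to build the same source twice and compare the outputs byte for byte. This is a minimal sketch under a strong assumption: the build must be deterministic (no embedded timestamps or randomized paths), otherwise differences mean nothing. The names `tree_digest` and `differing_files` are hypothetical helpers, not part of any existing tool.

```python
import hashlib
from pathlib import Path

def tree_digest(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def differing_files(build_a, build_b):
    """Return sorted relative paths that are missing from one tree or differ in content."""
    a, b = tree_digest(build_a), tree_digest(build_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Identical trees do not prove the CPU is sound, since both builds could in principle be corrupted the same way; a mismatch on a deterministic build, however, would be a strong sign that something went wrong without a segfault.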

    • Comment removed based on user account deletion
    • Multi-threaded performance was the main advantage that Ryzen had over Intel.

      This type of processor bug can typically be fixed with a microcode patch. Mainly a matter of getting sufficient engineering resources on it and isolating the cause. The publicity certainly helps that process, as does the extensive community testing.

  • I do not envy the crew assigned to tracking that bug down.
  • by Jodka ( 520060 ) on Monday August 07, 2017 @07:53PM (#54960303)

    It is not like the CPU is testing for that particular combination of conditions alone and conditionally segfaulting. Really, there is a flaw in the CPU design which so far has only been demonstrated to exhibit itself under those conditions. That is much more worrying than the summary leads us to believe.

    I like AMD and Ryzen is a good bargain compared to Intel. It will be my next CPU purchase, though I am holding out until they fix the bug. But I don't like the way they are minimizing the impact.

         

    • by Kjella ( 173770 )

      It is not like the CPU is testing for that particular combination of conditions alone and conditionally segfaulting. Really, there is a flaw in the CPU design which so far has only been demonstrated to exhibit itself under those conditions. That is much more worrying than the summary leads us to believe.

      Well, given that RMAs have worked for some people and not for others, and that the crashes are non-deterministic, it seems to be down to production variation: some chips get unstable and corrupt data if hammered a particular way. Most likely there'll be some microcode update to stagger the problematic sequence and a new stepping increasing the safety margin to fix it properly. Still not good news for AMD, since those who can't easily verify their results will stay away until the scope of the problem

    • by tlhIngan ( 30335 )

      It is not like the CPU is testing for that particular combination of conditions alone and conditionally segfaulting. Really, there is a flaw in the CPU design which so far has only been demonstrated to exhibit itself under those conditions. That is much more worrying than the summary leads us to believe.

      Well, think of a modern CPU as a collection of execution units, In most CPUs, execution units overlap in functionality - a complex instruction may issue several loads (memory to CPU) and stores (CPU to memor

    • by AmiMoJo ( 196126 ) on Tuesday August 08, 2017 @06:56AM (#54963443) Homepage Journal

      All modern CPUs run microcode that is updated on boot by the BIOS. So fixing this will just be a microcode update, i.e. a BIOS update. AMD has been quite good at getting vendors to ship such updates for their motherboards and systems, but if for some reason they don't you could load it via a driver under Linux too.

      • by rew ( 6140 ) <r.e.wolff@BitWizard.nl> on Tuesday August 08, 2017 @02:34PM (#54968023) Homepage

        There MUST be some things in hardware to execute anything. While they (the chip manufacturers) have surprised me in the past, not all bugs CAN be fixed with a microcode update.

        A long, long time ago, people wrote "self modifying code". Say for doing bit-operations on parts of the screen buffer, you might pass 1 for AND 2 for OR and 3 for XOR. The function could then place the AND/OR/XOR opcode in the middle of the doit loop and then perform the loop.... So one day the manufacturer guarantees that the new machine will execute everything the old one did. Bad move. Turns out the new machine is faster because it prefetches instructions. By the time the code has determined the opcode for inside the loop, the loop (with the last AND/OR/XOR opcode in place) has already been prefetched. This prefetching is at the core of why the machine is fast. Implemented in hardware. Can you fix that with a microcode update? Apparently in the case at hand (PR1ME9955): yes.

        But I can easily see it happen that either you disable the whole prefetching stuff (slow everything down enormously) or you need say an extra comparator ("Is the store happening near my PC, possibly near my prefetch queue?") to allow for "normal" cases to use the prefetch queue, but this special case to flush the queue only when necessary. In any case, the microcode was updated and stuff worked properly again.

  • Don't worry... (Score:5, Insightful)

    by Chris Katko ( 2923353 ) on Monday August 07, 2017 @07:56PM (#54960317)

    ..the faults only happen for people with massive parallel loads.

    You know... the main reason people buy the CPUs.

    • Well, they do say Threadripper and Epyc are unaffected, and I think those chips are a different stepping than the initial batch of Ryzen chips, so the problem may already be fixed. It may be possible to fix the others with a firmware update, though who knows how long that will take to roll out, depending on what else AMD is working on and their other priorities.
    • by arth1 ( 260657 )

      Yeah. I was contemplating getting a Ryzen for my new PC internals, which are due for changing out now.
      But if I can look forward to crap like this, it's not even an option - even if it were free, I wouldn't use it.

  • by iggymanz ( 596061 ) on Monday August 07, 2017 @08:18PM (#54960463)

    never mind my load type today, what about 2 years from now? why would I spend money on something that *might* segfault and for which the vendor isn't going to provide a solution to *everyone*. case by case basis my ass, that's the sign of a tech hardware vendor which should be shunned.

    • and assumed AMD would release a microcode fix (as they usually do) you would realize neither company has been making solid chips for at least 15 if not 20 years, and as they have tried to squeeze every ounce of performance out of, and every optimization into each chip, they've made design compromises that often don't show up until real world workloads.

      Personally I am pretty sure the AMD segfaults could be handled by either retuning, or disabling that nice little 'neural network' frontend, and I am not entir

    • by epine ( 68316 )

      why would I spend money on something that *might* segfault and for which the vendor isn't going to provide a solution to *everyone*

      You're dreaming if you don't think you run a similar risk with Intel. The only difference here is the proximal news cycle.

      Tomorrow's Market Probably Won't Look Anything Like Today [nytimes.com]

      The recency bias is pretty simple. Because it's easier, we're inclined to use our recent experience as the baseline for what will happen in the future. In many situations, this bias works just fine, b

      • by lucm ( 889690 )

        Intel has been rock-solid since forever. AMD has been unreliable since forever. If you think this will change today, you're kidding yourself.

        AMD makes gadgets for overclocking enthusiasts and gamers on a budget. There's nothing wrong with that, and they've kept Intel on their toes which is a good thing. But it's not the same class of product unless your focus is only on net gigahertz per dollar.

        Being surprised by this kind of problem is like being surprised that Windows phones home or that HP is fucking you

      • My intel processors don't segfault under heavy load. That includes compiler load at home and virtual machine load at my employer. Why would I risk changing that?

  • by Anonymous Coward

    And could still wind up being a Linux fault, though the various Intel errata have this sort of fault showing up a number of times, with multi-byte ops crossing page boundaries or the ilk, so no reason to single out Linux yet. Windows does so much structure-padding everywhere by default it's much less likely to occur there. This is where the ops pipeline dump comes in handy if it's deep enough.

  • Phoronix FAIL (Score:5, Insightful)

    by Anonymous Coward on Monday August 07, 2017 @08:33PM (#54960543)

    Phoronix: "certainly not the workloads most Linux users will be firing off on a frequent basis"

    Bullshit. Anyone who does video encoding will easily max out a Ryzen. Anyone who builds software for a living will max out a Ryzen. In fact, just about anybody who needs more computing power than a Chromebook will max out a Ryzen.

    AMD you fucked up big time. Bigly.

    And Phoronix, who are you to say what people should be doing with their machines? People paid for this computational hardware and should expect it to perform as advertised.

    • Re: (Score:3, Informative)

      by 0123456 ( 636235 )

      Not to mention that one of the reasons we want more cores in our desktop machines is to speed up C++ compiles by compiling more files in parallel.

    • Video encoding makes heavy use of the SIMD units of the processor, which is a different type of load than compiling, which makes heavy use of the conventional integer logic part of the processor.

    • "Bullshit. Anyone who does video encoding will easily max out a Ryzen. Anyone who builds software for a living will max out a Ryzen. In fact, just about anybody who needs more computing power than a Chromebook will max out a Ryzen."

      In other words, Phoronix is 100% correct.

  • So far so good (Score:5, Informative)

    by I'm just joshin ( 633449 ) on Monday August 07, 2017 @09:37PM (#54960909)

    Anecdote here...

    Ryzen 1700 w/ 64GB running Proxmox and 6 virtual machines - 1 Debian, 1 Gentoo (build machine), 1 pfSense, and 3 Windows.

    Been rock solid doing full world builds on Gentoo, PCI passthrough of a GTX 1070 card to one of the Windows VMs (gaming actually works well), and has only been rebooted once since getting it going. Uptime of 24 days.

    No segfaults.

    It is amazingly fast & quiet. Quite the upgrade from my I7-3770K.

    • by jon3k ( 691256 )

      PCI passthrough of a GTX 1070 card to one of the Windows VMs (gaming actually works well),

      I'm currently building a Ryzen linux box (parts are literally sitting on the desk beside me) and I've been following the PCIe pass-through intermittently, mostly Wendell and Level1Techs. Can you share some details on how you got everything working and issues you've run in to?

  • what causes the problem or the exact circumstances it happens under.
  • by Orgasmatron ( 8103 ) on Monday August 07, 2017 @09:59PM (#54961003)

    Not (necessarily) a big deal. CPUs have bugs. The kernel, the compilers and the standard libraries are all stuffed full of workarounds for various CPU errors. They are called "errata" and pretty much every CPU has them. (One could argue that corrigendum would be a more appropriate word for them.) Intel has had some big ones; the most memorable (off the top of my head) were F00F and FDIV. The 286 was so riddled with bugs that everyone gave up trying to write a protected mode kernel and just waited for the 386.

    Basically, they'll figure out what is causing the error and how to avoid it. If the workaround is easy, like "have the compiler reorder some instructions", a few patches will go out and life goes on, no big deal.

    If the workaround is less easy, like "don't utilize all cores", or "bump the clock multiplier down to overcome a thermal or electrical issue", that is a much bigger deal. If you don't meet marketing numbers, your choices are refund or replace. Intel spent a half billion dollars replacing CPUs because of the FDIV bug, even though they calculated that most people would never encounter it and it was relatively easy to patch around (but the patch would have been a drag on FPU performance - and marketing again had made promises).

    • by Misagon ( 1135 )

      The first bug report with a test case that reproduced the bug was submitted to AMD in April, and they have only now acknowledged the bug.

      And how long would we have to wait for a microcode update?

  • by Anonymous Coward

    It seems that Ryzen's hyperthreading, on Linux, under very rare circumstances, can cause memory errors. And Intel is spending millions flooding every tech forum and tech site with shill propaganda declaring this to be the 'end of the world'.

    But Intel would like you to forget that its first two generations of hyperthreading were so broken, you had to switch it off altogether to do any serious work.

    Hyperthreading needs scheduling to be sane and sympathetic. So no issues on the vastly better coded Windows. Sadl

    • by Anonymous Coward

      It seems that Ryzen's hyperthreading, on Linux, under very rare circumstances, can cause memory errors. And Intel is spending millions flooding every tech forum and tech site with shill propaganda declaring this to be the 'end of the world'.

      But Intel would like you to forget that its first two generations of hyperthreading were so broken, you had to switch it off altogether to do any serious work.

      Hyperthreading needs scheduling to be sane and sympathetic. So no issues on the vastly better coded Windows. Sadly Linux is a joke from a software stability POV. So two threads on one core with inter-dependencies have many possibilities to cause bugs.

      I once had Windows crash rarely when launching video. Turned out that I had a driver (emulating a DVD ROM) that failed to prevent its IRQ driver from 'paging out' under memory 'pressure'. And for some reason playing video had a real chance of grabbing the memory used by the interrupt code. The bug was 100% the fault of the IRQ code. And when I tracked it down, turned out there was a driver update that fixed the very bug.

      Seems the Linux bug on Ryzen is the same sort of thing. One thread, apparently, has to be an interrupt. The compile load has to be so very taxing, the entire system RAM is under constant load. And I bet my bottom dollar the hopeless Linux coder has failed to flag the interrupt handling code as 'non-paging'. Or the Linux scheduler screws up ring zero ultra-priority interrupt handlers, and lets them 'time out' under pressure.

      Before you say "but Intel works"- WRONG. The person (sponsored by Intel) flooding forums with this 'bug' and the script to trigger it had to change the script code over and over again when users discovered it was triggering the same errors on Intel systems as well. What we know for REAL (as opposed to this fake news) is that certain compile workloads on Intel and AMD cause memory issues if hyperthreading is on. And the reason is certain to be bad linux coding.

      If versions 1, 2, 3, 4, 5 and 6 of the workload script crashed both Intel and AMD, and version 7 so far (so it's claimed) only affects some Ryzen chips, well the problem is clearly not unique to Ryzen.

      PS again the people responsible for banging on about the issue are sponsored by Intel- and Intel has a very large active bounty for anyone who can 'prove' faults in Ryzen.

      You seem like someone who has been paid (most likely by MS) to badmouth Linux.

    • by Anonymous Coward

      Hyperthreading needs scheduling to be sane and sympathetic. So no issues on the vastly better coded Windows. Sadly Linux is a joke from a software stability POV.

      The problem has been reproduced on Windows, using WSL. Also FreeBSD and DragonFlyBSD are affected.

    • by rew ( 6140 )

      Just FYI: On Linux IRQ handlers can never be paged out on a very fundamental level.

      You might think it's useful, but the thinking is that it just MIGHT be the IRQ (kernel memory) for the "get it back from disk" part. So in general stuff like that is never paged out.
      In modern systems you'll probably use maybe 3-10 MB of memory for kernel code. Even if you have little main memory (1 GB), that's still only about 1% at most. So no reason at all to change this policy.

  • Only on Linux (Score:2, Interesting)

    by Khyber ( 864651 )

    That tells me someone's code is fucked up, not that AMD's processors are screwed. Ain't happening on my Hackintosh, ain't happening on my Windows box.

    Did someone let Grsecurity do the SMT kernel code?

    • You could just as easily argue that the fact that Linux works fine on other Ryzen processors, AMD's older processors and Intel's processors, and only segfaults on these specific Ryzen models, tells you that it's these processors that are broken, not Linux.

      Of course -- and I shouldn't really have to explain this on Slashdot of all places, but neither of these observations actually tell you where the problem is. Doing that involves doing some investigation, and the fact that AMD appear to be accepting blame s

      • by Khyber ( 864651 )

        I have an actual background in hardware and software troubleshooting. This is very clearly the sign of bad code, not bad hardware. Testing for similar problems under both Windows and my Hackintosh boot partitions, using software compilation tools on a high thread count. Oh, BTW, since Windows 10 has a SMT Scheduling problem with Ryzen (but only Windows 10, Windows 7 is unaffected) again this tells me that it's clearly in the newer software implementations, not hardware, as I'm unable to trigger the SMT bug

      • Comment removed based on user account deletion
  • What _else_ would people buy such CPUs for then, if not for "massive workloads"?

    Also, somehow I'm feeling considerable distrust that the OS should be able to somehow 'fix' this. Probably by turning off features until it runs at a fraction of the speed, my guess is...

    Anyway, yesterday I already sent out an email saying "don't buy Ryzen". First time I've ever done that, so well done, AMD.

    • What _else_ would people buy such CPUs for then, if not for "massive workloads"?

      Because "shiny" - why do most people buy new processors?
