Linux 6.1 Will Make It A Bit Easier To Help Spot Faulty CPUs (phoronix.com)
An anonymous reader shares a report: While mostly of benefit to server administrators with large fleets of hardware, Linux 6.1 aims to make it easier to spot problematic CPUs/cores by reporting the likely socket and core when a segmentation fault occurs, which can help reveal a trend if the same CPU/core routinely turns out to be the one causing problems. Queued up now in TIP's x86/cpu branch for the Linux 6.1 merge window in October is a patch to print the likely CPU at segmentation fault time. Printing the likely CPU core and socket when a seg fault occurs can be beneficial if seg faults routinely happen on the same CPU package or particular core.
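To make the "spotting trends" idea concrete, here is a minimal sketch that tallies segfault reports per socket and core from kernel log lines. It assumes the message ends with a suffix of the form "likely on CPU N (core N, socket N)"; that format, the function name tally_segfaults and the sample log lines are illustrative assumptions, so check them against real logs from a 6.1+ kernel before relying on them.

```python
import re
from collections import Counter

# ASSUMPTION: the kernel appends "likely on CPU <n> (core <n>, socket <n>)"
# to its segfault message. Verify this against your own dmesg output.
LIKELY_CPU = re.compile(
    r"segfault at .*likely on CPU (\d+) \(core (\d+), socket (\d+)\)"
)


def tally_segfaults(log_lines):
    """Count segfault reports per (socket, core) pair."""
    counts = Counter()
    for line in log_lines:
        match = LIKELY_CPU.search(line)
        if match:
            _cpu, core, socket = (int(group) for group in match.groups())
            counts[(socket, core)] += 1
    return counts


# Hypothetical log lines, made up for illustration only.
sample_log = [
    "a.out[101]: segfault at 0 ip 0000000000401136 sp 00007ffc0000 error 6 "
    "in a.out[401000+1000] likely on CPU 3 (core 3, socket 0)",
    "b.out[202]: segfault at 8 ip 0000000000401200 sp 00007ffc1000 error 4 "
    "in b.out[401000+1000] likely on CPU 3 (core 3, socket 0)",
    "c.out[303]: segfault at 0 ip 0000000000401300 sp 00007ffc2000 error 6 "
    "in c.out[401000+1000] likely on CPU 21 (core 5, socket 1)",
    "usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]

segfaults_by_core = tally_segfaults(sample_log)
```

Fed with the output of something like dmesg or journalctl -k across a fleet, a tally like this is only a hint: most segfaults are ordinary software bugs, so a single report says little and only a persistent skew toward one core is interesting.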
Re: (Score:2)
if the fault is subtle would we even notice until it was too late?
Re:Faulty cpu? (Score:5, Informative)
> if the fault is subtle would we even notice until it was too late?
Nope. Early Ryzen CPUs have a bug that just results in corrupted data. Good luck proving to AMD that you have one. A guy on GitHub wrote the ryzen-test suite, which is good at identifying them, but AMD doesn't provide a testing tool.
Y'know, just stand behind your product is all we ask. They know the bad batch range pre-errata, but we don't, and they know it corrupts customers' data. A bunch of Redditors did a good job bisecting the lots without any help from AMD.
Lessons learned.
Re: Faulty cpu? (Score:2)
Could you provide more information?
I've been following this thread on Bugzilla for years.
https://bugzilla.kernel.org/sh... [kernel.org]
Re:Faulty cpu? (Score:5, Informative)
Re: (Score:2)
Nope, and that's part of why it's reporting the *suspected* core. By the time it gets to the error handler the execution flow may have jumped cores, threads, process boundaries and all sorts of shenanigans.
But the best of the worst possible scenarios is usually all we can hope for.
Re: Faulty cpu? (Score:3)
Next step would be to disable a faulty core. Just a soft disable in the kernel could be enough.
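For what it's worth, Linux already exposes a soft disable through the CPU hotplug files in sysfs: writing 0 to /sys/devices/system/cpu/cpuN/online takes logical CPU N offline. A minimal sketch, assuming root privileges and a kernel built with CPU hotplug support; the helper name set_cpu_online and its root parameter are made up for illustration.

```python
from pathlib import Path

# Standard sysfs location of the per-CPU hotplug control files on Linux.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def set_cpu_online(cpu, online, root=SYSFS_CPU_ROOT):
    """Write "1" or "0" to cpu<N>/online and return the path that was written.

    `root` is overridable (a hypothetical convenience) so the helper can be
    tried against a scratch directory instead of the real sysfs tree.
    """
    control = Path(root) / f"cpu{cpu}" / "online"
    if not control.exists():
        # Some CPUs (often cpu0 on x86) have no "online" file and cannot
        # be taken offline this way.
        raise FileNotFoundError(f"no hotplug control file at {control}")
    control.write_text("1" if online else "0")
    return control
```

Calling set_cpu_online(13, False) as root would take logical CPU 13 offline until it is re-enabled or the machine reboots. The setting does not survive a reboot, and a bad physical core with SMT enabled shows up as more than one logical CPU, so each sibling thread would need the same treatment.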
Re: Faulty cpu? (Score:5, Informative)
The problem with disabling bad hardware in software is lack of persistence across OS installations.
Notably, Windows 10 removes any Bad Ram settings you applied to the BCD configuration every time the OS upgrades itself.
Re: (Score:2)
That's true, but if you have a server with 64 cores and one is bad then disabling that core until you can swap the CPU would have a marginal impact on the overall performance.
Computer systems today are getting very complex, and persistence across OS installations often isn't really important for most users; reliability is what matters most.
In large corporations the computer models are changed so frequently that you don't really have a "standard computer", just a huge mix of models f
Re: (Score:3, Insightful)
Back in the 90s, when I was regularly building/updating my own systems, I came across one. I bought the motherboard, RAM and CPU from a vendor I'd trusted for a few years by that point. None of the tech guys thought the CPU was causing the graphical anomalies. I literally replaced the RAM twice, the motherboard at least once, and then I was like, screw it, give me a new CPU, and boom, problem went away.
Re: (Score:3)
As TFA says, more of use to server farms than individuals. But it's worth thinking of the difference between having two HDDs and Backblaze, where they have thousands, and can publish failure statistics. And of course, on a large scale, such statistics are of value to people building server farms. If there's a 1% chance of a CPU model giving 75% more segfaults than average, it's good to know. If a particular CPU out of 1000 is giving twice as many segfaults as average running the same code, it's good to know
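The "twice as many segfaults as average" check can be sketched in a few lines. This assumes you already have per-CPU segfault counts for machines running the same code; the function name flag_outliers, the factor parameter and the sample counts are all hypothetical.

```python
from statistics import mean


def flag_outliers(counts, factor=2.0):
    """Return the keys whose count is at least `factor` times the fleet mean.

    `counts` maps a CPU identifier to its segfault count. CPUs with zero
    segfaults must be present with a 0, or the mean will be skewed upward.
    """
    if not counts:
        return []
    average = mean(counts.values())
    if average == 0:
        return []
    return sorted(key for key, n in counts.items() if n >= factor * average)


# Made-up counts for illustration only.
fleet_counts = {
    "host1/cpu0": 1,
    "host1/cpu1": 1,
    "host2/cpu0": 1,
    "host2/cpu1": 9,
}

suspects = flag_outliers(fleet_counts)
```

Note that the outlier itself pulls the mean up, so on a small fleet a median-based threshold may be a better choice; on thousands of CPUs the difference should matter less.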
Faulty CPUs... (Score:3)
mostly irrelevant anecdote (Score:4, Interesting)
I had an OptiPlex with a hexa-core 8th-gen i5 that would only boot an OS if 3 cores were disabled in the BIOS. It took the Dell tech 2 trips to try and fix it, and eventually they just shipped it back to TX and "fixed" it there and shipped it back.
Likely? (Score:2)
Why doesn't the kernel know exactly?
Re: (Score:3)