Samsung Finds, Fixes Bug In Linux Trim Code
New submitter Mokki writes: After many complaints that Samsung SSDs corrupted data when used with Linux, Samsung found out that the bug was in the Linux kernel and submitted a patch to fix it. It turns out that kernels without the final fix can corrupt data if the system is using linux md raid with raid0 or raid10 and issues trim/discard commands (either fstrim or by the filesystem itself). The vendor of the drive did not matter and the previous blacklisting of Samsung drives for broken queued trim support can be most likely lifted after further tests. According to this post the bug has been around for a long time.
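For readers wondering why RAID0 and RAID10 are the affected levels: on a striped array, one discard request covering a logical range has to be split into per-disk pieces, and any arithmetic slip in that split sends the discard to blocks that still hold live data. The sketch below is a toy model of that mapping, not the kernel's md code; the function name `split_discard` and the simple equal-sized-member layout are assumptions made for illustration.

```python
def split_discard(start, length, chunk, ndisks):
    """Map a logical discard range onto per-disk pieces for a toy RAID0 layout.

    start, length and chunk are in sectors. Returns a list of
    (disk_index, offset_on_disk, sector_count) tuples.
    Assumes equal-sized members and plain round-robin striping.
    """
    pieces = []
    pos = start
    end = start + length
    while pos < end:
        chunk_index, within = divmod(pos, chunk)
        disk = chunk_index % ndisks        # which member holds this chunk
        stripe = chunk_index // ndisks     # how many chunks precede it on that member
        count = min(chunk - within, end - pos)
        pieces.append((disk, stripe * chunk + within, count))
        pos += count
    return pieces


# An 8-sector discard starting at logical sector 6, with 4-sector chunks on 2 disks
print(split_discard(6, 8, 4, 2))
```

Each piece must land exactly where the filesystem's freed blocks live on that member; an offset that is off by even one chunk would discard someone else's data, regardless of who made the drive.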
awkward! (Score:4, Insightful)
Well, that's gotta be embarrassing for everyone bashing Samsung over this. I remember reading some rather strong opinions about who was at fault.
Re: (Score:2, Interesting)
I'd be interested to see if anyone has apologized. Doing so is exceedingly rare on internet forums.
Re: (Score:2, Insightful)
If the kernel devs and Linus don't apologize, they're all a bunch of self-absorbed shitlords and should be smacked off the face of this planet.
Re: (Score:2)
Accepting the patch *was* the apology.
Re:awkward! (Score:5, Informative)
Re: (Score:2)
Linus needs to apologize for his devs going "Not my fucking fault!" when in fact it WAS their fault.
https://blog.algolia.com/when-... [algolia.com]
Here's the company that found the actual problem and pinpointed it.
Re: (Score:2)
That said, Linus never apologises for his own outright abusive comments and actions. There's no way he's going to apologise on behalf of someone else, especially when there's some truth to the kernel developers' comments - there are known bugs
Re: (Score:2, Insightful)
Even more so for the kernel developers that blacklisted the Samsung drives.
These developers should probably be banned from kernel development or at least banned from making decisions regarding functionality.
Creating code with a bug is human, not doubting your own code and blaming somebody else is stupid.
fairly common to blacklist devices (Score:2)
hardware firmware is commonly buggy. Device drivers often have to work around buggy hardware, so blacklisting devices for various functionality is not at all unusual.
If the code seems to work with other devices and breaks with a new device, then the first instinct is going to be to assume the new device is doing something wrong.
Re:fairly common to blacklist devices (Score:5, Insightful)
hardware firmware is commonly buggy. Device drivers often have to work around buggy hardware, so blacklisting devices for various functionality is not at all unusual.
If the code seems to work with other devices and breaks with a new device, then the first instinct is going to be to assume the new device is doing something wrong.
Another way of seeing things is that even if the bug is in the kernel, blacklisting still prevents damage to data on said vendor's hardware. When it comes to data corruption, the first thing to do is limit the damage, no matter who is at fault. Afterwards, you can work together to try to isolate the source of the problem. Having unhappy users and customers is never good, unless you are the competition.
Re: (Score:2, Informative)
It's the fact that they put the boot in to Samsung, claiming that their TRIM implementation was broken. They then stopped looking at their own code and had to wait for Samsung to fix their bug.
Re:fairly common to blacklist devices (Score:5, Informative)
Sorry, that's incorrect.
There's a bug on MD raid0 and raid10. In Linux.
There is a data destroyer bug in SAMSUNG NCQ TRIM firmware. Which is *blacklisted*, so that it uses the non-ncq trim.
See? You're an idiot and everyone but you actually knew what they were complaining about. The samsung firmware is buggy crap that destroys data on NCQ TRIM, and the Linux kernel had a data destroyer bug in RAID0/RAID10 + TRIM that was fixed by a samsung engineer.
The samsung firmware is still broken, the linux kernel has been fixed, and you're still a useless idiot.
Re:awkward! (Score:5, Insightful)
The firmware bug of Samsung drives, a very severe one actually, was confirmed by Samsung. The RAID 0 issue is a totally different one, hardly affecting anyone.
So yes, the severe issue was a bug on Samsung's side, while the very rare RAID 0 bug is a Linux kernel one.
Re: (Score:2)
Re:awkward! (Score:4, Informative)
Re: (Score:2)
Windows can't trigger the bug because it doesn't use that feature.
Re: (Score:2)
Bullshit.
Re: (Score:3)
The AC was sorta half right. It is not uncommon for hardware to break the standard so that it works with Windows. That sort of thing is becoming less common but it's hardly unknown.
Re: (Score:2)
Re: (Score:3)
Re: (Score:2)
How many paying customers does Linux have? Massive testing is expensive. There are more likely to be issues with Linux than with Microsoft Windows, because everybody tests on Windows. That doesn't mean that Microsoft itself tests better than Linux devs, and indeed we find that Microsoft puts out lots of bugs.
Thank You (Score:2)
Thank You Samsung!
While our company CAD workstations don't run Linux, all of them do run on Samsung SSDs.
Bravo (Score:5, Interesting)
Nice to see vendors working together to improve Linux.
Re:Bravo (Score:5, Insightful)
There was definitely some self-interest there.
Samsung can't have people saying their SSDs corrupt data when it's not them doing it.
Re:Bravo (Score:5, Interesting)
Sure there was self-interest. Still I think they deserve a lot of credit here. Rather than the typical "It's not my code" response from a developer who is sure the problem is elsewhere (rightly or wrongly) they actually found and fixed the problem. That is good behavior!
Re:Bravo (Score:4, Insightful)
Of course, this is only possible when the "other person's" code is Free Software. If this had been a problem in Windows/OSX that Microsoft/Apple was refusing to fix, there's little Samsung could have done about it.
Re: (Score:3)
Sure it was good behavior.
But it was borne entirely out of the Linux people saying "OMG, teh Samsung is teh sux0r".
I do give them a lot of credit. More than the people who apparently insisted it was the fault of Samsung in the first place.
Re: (Score:2)
Rather than the typical "It's not my code" response from a developer who is sure the problem is elsewhere (rightly or wrongly)
Except that's exactly what happened (on the Linux side).
Re: (Score:2)
Nice to see vendors working together to improve Linux.
Well, Samsung had some SSDs to sell. It's part of the open source philosophy: you scratch your own itch, and everyone benefits.
Still, the problem is that we don't arrive at a well-rounded result. Fixing some things here and there is not deep QA. After stories like this I always get cold chills imagining what else broken is there.
Re: (Score:2)
Re: Bravo (Score:5, Interesting)
Yeah, the outcome is great. I just wonder why they waited more than a year to look into it. Maybe this will set a good example for the industry that with a little bit of effort you can take care of your customers and sell more product.
If this were the 80's and a hard drive vendor had more than two reports of data loss under, say VMS, there would have been engineers on a plane to DEC by morning to get it solved by the coming weekend.
Now we have thousands of users with reports and millions of units sold, and a wealthy vendor, and it's all crickets, leaving some kernel hackers to half-ass a blacklist. It's not like this is BeOS - there are millions of servers running in the target market. I don't mean to absolve the bad troubleshooting by kernel devs, but want to know what drove the apathy at Samsung (and other vendors behaving poorly). It's obviously not profit motive.
Re: Bravo (Score:5, Informative)
I take some of that back. It seems the real credit for digging in goes to these guys [algolia.com]. Samsung came in a month ago after they were provided a test suite, and gets credit for finding the kernel code path that caused the problem. An Oracle engineer provided a more-correct patch.
Re: (Score:2)
"If this were the 80's and a hard drive vendor had more than two reports of data loss under, say VMS, there would have been engineers on a plane to DEC by morning to get it solved by the coming weekend."
Hard disks were way more expensive in the 80s, and they sold in lower numbers. So it makes economic sense to do hands-on damage control.
Crying wolf (Score:5, Informative)
Re:Crying wolf (Score:5, Informative)
Re:Crying wolf (Score:5, Insightful)
The point however is that in a closed source system, Samsung could not have found and fixed the bug themselves.
Re:Crying wolf (Score:4, Insightful)
Is that really the point, though?
Vendors of products affected by bugs in closed source software collaborate all the time. It's usually in their mutual interests, and it has been going on forever. Just look at the extraordinary lengths Microsoft used to go to in order to maintain compatibility of Windows with older applications.
On the other hand, the existence of this issue in the first place, the fact that other vendors whose products may also have been affected did not act as Samsung did, and particularly the denial and active yet unjustified blacklisting of Samsung products by the people running the project with the real fault are indictments of that project, no matter how open it claims to be or how big and famous it is.
This whole affair does not look good for Linux, and more importantly, it does not reflect well on the people currently running development of Linux.
a bit too harsh (Score:2)
Bugs happen. If you've got code that seems to work and then you investigate and it doesn't work on one particular brand of drive, it would be a reasonable suspicion that there is something funny with those drives.
Given the fact that multiple Samsung drive models were failing but multiple Intel drive models were *not* failing under the same test (from the linked article), the developers could be forgiven in suspecting there was something wonky going on with the Samsung drives.
Re: (Score:2)
Yes, bugs happen, and yes, sometimes diagnosing hardware compatibility issues is tricky. But if I see a potential data loss bug in software I develop, I don't start making judgements about where it comes from -- and I definitely don't start pointing the finger at other people and denying anything is wrong with my own code -- until I've identified the root cause of the problem.
The issue here isn't really that a bug happened, even though the bug was serious. It's the way it was handled that is the greater cau
Re: (Score:3)
Bugs happen. If you've got code that seems to work and then you investigate and it doesn't work on one particular brand of drive, it would be a reasonable suspicion that there is something funny with those drives.
It's hard to evaluate exactly what went on here. If you read the original report of the discovery (which I did last month and is still the first link in TFS), you see this explanation:
Poking around in the source code of the kernel looking for the trim related code, we came to the trim blacklist. This blacklist configures a specific behavior for certain SSD drives and identifies the drives based on the regexp of the model name. Our working SSDs were explicitly allowed full operation of the TRIM but some of the SSDs of our affected manufacturer were limited. Our affected drives did not match any pattern so they were implicitly allowed full operation.
In other words, they didn't know what was going on. Then they happened upon some code in the Linux kernel that explicitly blacklisted certain model segments from certain manufacturers. So, at some point someone made the assumption that this must be related to certain models from certain manufacturers, based on code in the L
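To make the quoted description concrete, here is a minimal sketch of how a model-name quirk table of that kind can work: each entry pairs a pattern with a set of restrictions, and a drive whose model string matches nothing gets no restrictions at all, which is the "implicitly allowed full operation" case from the quote. The patterns, flag names and the function `quirks_for` are hypothetical, and shell-style globs are used here as an assumption rather than as a statement of how the kernel table is written.

```python
from fnmatch import fnmatchcase

# Hypothetical quirk table: (model-name glob, restrictions to apply).
# These entries are illustrative only and are not copied from the kernel.
QUIRKS = [
    ("ExampleVendor SSD 8*", {"no_queued_trim"}),
    ("OtherVendor_M5*", {"no_queued_trim"}),
]


def quirks_for(model):
    """Return the set of restrictions that apply to a drive model string."""
    flags = set()
    for pattern, restrictions in QUIRKS:
        if fnmatchcase(model, pattern):
            flags |= restrictions
    return flags


print(quirks_for("ExampleVendor SSD 840"))  # restricted
print(quirks_for("Unlisted Drive 123"))     # empty set: nothing restricted
```

The design point is that the absence of a match says nothing about whether a drive is safe; it only says nobody has added an entry for it yet.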
Re: (Score:2)
Hindsight really has nothing to do with it. If they didn't know for sure what the cause was, there was no need to call it at all. You can mark an issue not reproducible in a bug tracker without actively blaming someone else for a mistake they never made.
Re:Crying wolf (Score:5, Informative)
Re: (Score:2)
Re: (Score:2)
The point however is that in a closed source system, Samsung could not have found and fixed the bug themselves.
Says who? If a similar bug happened with Samsung SSD drives connected to Macintosh computers, Samsung as a highly esteemed supplier of parts would most likely be given any help needed to fix the problem. They can't just download the software, but one phone call from the right person at Samsung to the right person at Apple would fix that.
Just another case.... (Score:5, Insightful)
This is just another case of "Not My Problem" syndrome that too many techs get into. They think their code/tools/systems/whatever must be perfect, and others are the ones fucking up. Samsung drives went on a blacklist for issuing the commands to them due to this bug? "WALP, LINUX IS PERFECT, MUST BE THE HARDWARE GUYS, even though their devices perform perfectly on other OSes" - and instead now we're left with a bug in Linux that corrupts data until the patch can make its way through the distro channels and be pushed out to end users.
Re: (Score:2)
You should take a look at the "black list" before you try to figure that question out.
The list includes other brands of drive as well as Samsung...
Re: (Score:2)
Apparently it's quite normal to have software work around hw defects.
Re: (Score:2)
How many software engineers does it take to change a lightbulb? None it's an electrical problem.
How many electrical engineers does it take to change a lightbulb? None we'll just work around it in software.
Re: (Score:2)
This is just another case of "Not My Problem" syndrome that too many techs get into.
No, it's a case of everyone jumping to conclusions.
Samsung drives went on a blacklist for issuing the commands to them due to this bug?
No, they went on the queued TRIM blacklist due to a different bug. This bug was an unrelated serial TRIM bug when used in conjunction with RAID.
Re: (Score:3)
It certainly is an indicator. I think you mean to say "is not conclusive evidence."
But then again, disastrous ACPI implementations are not conclusive evidence that a whole different type of device is at fault.
Your reasoning falls into the very trap GP was pointi
Re:Just another case.... (Score:5, Interesting)
Devices working perfectly in other OSes is no indicator that the device is not at fault. Witness the vast amount of crap laptop hardware, whose disastrous ACPI implementations only worked because their Windows drivers were chock-full of workarounds.
Back when I was writing Windows drivers for plugin cards, there were certain motherboards that we'd detect and switch the motherboard bus to the slowest possible speed, because the chipset was a heap of junk that didn't work properly at higher speeds. Anyone who said 'but it works on Windows!' clearly had no idea that it only worked because we'd intentionally turned off most of the features.
Re:Just another case.... (Score:5, Interesting)
We did workarounds on the ATA bus spec for known hardware bugs in older VIA chipsets. These were silicon bugs, not chipset firmware so they couldn't be fixed afterwards with patches and there were millions of these boards out there. Declaring our devices (CD-ROM and DVD-ROM drives) wouldn't work with these boards was not going to happen for sales reasons so our code included a lockup-recovery function that was invoked when the rare bug conditions were met and the IDE bus froze. The average user never noticed these lockups and we didn't tell them about them.
Out-of-spec bugs like this were well-known in the industry and workarounds were easy to produce as long as you had access to a few million bucks worth of test equipment and a good team of professional engineers with decades of experience, not something that's common in the Linux world.
Re: (Score:2)
In a perfect world
We don't live in such a world. If we want our computers to work properly today, these workarounds have to be taken into account.
Re:Just another case.... (Score:4, Interesting)
A pro-Linux bias on Slashdot is not exactly a surprise, but an equally accurate headline on another forum might have read "Critical bug in Linux corrupts data on SSDs", and the subtitle "Linux maintainers deny serious fault, blame innocent parties for data loss" would probably have been fair too.
don't remember any denial (Score:2)
More like an assumption that the bug was in the driver because they hadn't noticed issues on other drives.
Re: (Score:2)
If you look at the SSD blacklist it's HUGE, and not just filled with Samsung drives.
Re: (Score:2)
A complete myth. At least these days.
Slashdot has several bags of crazy, all competing with one another at various times.
There's Windows fanbois, Linux fanbois, and Apple fanbois. Over the years the ratio of those has swung back and forth, these days I'd say on balance you'd be hard pressed to say there's a strong bias one way or another.
At various times it's been chic to tend more to one or another, now it seems like Slashdot has grown enough that there's at least 30 diffe
Re: (Score:2)
Re: (Score:2)
Windows doesn't yet support queued TRIM, it still uses the legacy serial TRIM.
Queued TRIM is serial as well... :) Everything is serial in the SATA bus.
With "serial TRIM" you probably mean "blocking TRIM" (it requires other operations to be halted and command queue flushed before it can be performed).
Re: (Score:2)
Comment removed (Score:4, Funny)
Re: (Score:2)
Enough money to afford SSDs but not enough to afford something better than dial-up.
I bet you have a 20MHz CPU with 64GB of RAM, too.
Re:not the case in my situation (Score:4, Informative)
Re: (Score:2)
If you have 64GB of RAM, you can cache the entire SSD. Then you won't have to issue TRIM commands!
My SSD is 1 TB. The other one is 256 GB. SSDs today are a lot larger than you seem to realize.
Re: (Score:2)
Re:not the case in my situation (Score:5, Funny)
"But.. what does my cell phone carrier have to do with anything?"
Re: (Score:2)
Easy : it's the same samsung in it !
Vote with your wallet (Score:4, Interesting)
Vote with your wallet, my next SSD will be a samsung.
Re: (Score:2)
Same problem, only spun around.
I'll buy whatever fits my job requirements. Prior to this discovery, that certainly wouldn't have been Samsung. Now? They get to be considered along with all the other vendors.
Re: (Score:2)
Just curious; what were your reasons not to consider them before? In what way didn't they fit your job requirements?
Re: (Score:2)
The reports of their data loss.
Re: (Score:2)
grasshopper said the data loss. I would have said the firmware issues that lead to performance problems with their EVO line of SSDs.
Re: (Score:2)
the 840 evo speed issues... (Score:2)
There were issues with the 840 EVO losing significant speed after it had been in use for a while. There was eventually (after much complaining from customers) a "fix" released that helped but didn't actually completely resolve the issue.
Re: (Score:2)
Why would you use consumer level drives in a business?
Clients really don't need an SSD as they are mostly limited by network speed more than anything, and servers shouldn't be touching that shit.
That said, the 840 EVOs put me off upgrading my laptop, but I just went with a 1TB 850 instead, which doesn't have any of those problems.
Every manufacturer has problems with certain models, it's inevitable. But make sure you're using the right use case, evaluate properly, and disregard things that couldn't have aff
Re: (Score:2)
I had an SSD fail recently (two weeks ago?) and while searching for a replacement found the Samsung TRIM issues, so I didn't buy one. I got some cheap replacement for the time being.
When this new one inevitably fails prematurely, I will look again at Samsung models.
+1 (Score:2)
:thumbsup:
Good work (Score:2)
Apology (Score:3)
Oh bugger (Score:2)
I'm running Linux on a RAID-0 SSD array.
I guess I should turn off fstrim until there's a backport of the fix to Fedora?
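For anyone in the same position, a first step is simply confirming whether any md array on the box is one of the affected levels. The sketch below parses text in the usual `/proc/mdstat` layout and reports arrays whose level is raid0 or raid10; the function name `affected_arrays` and the sample text are made up for illustration, and real mdstat output can contain states this simple parser does not anticipate.

```python
def affected_arrays(mdstat_text):
    """Return {array_name: level} for md arrays at level raid0 or raid10."""
    hits = {}
    for line in mdstat_text.splitlines():
        name, sep, rest = line.partition(" : ")
        if not sep or not name.startswith("md"):
            continue  # skip "Personalities", block counts and blank lines
        for token in rest.split():
            if token in ("raid0", "raid10"):
                hits[name] = token
    return hits


# Made-up sample in the usual /proc/mdstat layout
SAMPLE = """Personalities : [raid0] [raid1]
md0 : active raid0 sdb1[1] sda1[0]
      976508928 blocks super 1.2 512k chunks
md1 : active raid1 sdd1[1] sdc1[0]
      488254464 blocks super 1.2 [2/2] [UU]
unused devices: <none>
"""

print(affected_arrays(SAMPLE))
```

On a real system you would pass in the contents of `/proc/mdstat` instead of the sample; if the result is non-empty, holding off on discards until a fixed kernel arrives is the cautious choice.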
How was this recreated before the bug existed? (Score:4, Insightful)
Blame NAND Flash Memory... (Score:2)
While an apology is due, this sort of problem is inevitable given the nature of the technology. TRIM on NAND is a crutch for a technology that is poorly suited to data storage. Transforming NAND into a usable storage device requires heroic efforts on the part of the vendor, and it is hard to blame them for the bugs. Likewise, it is hard to blame Linux developers for their heroic efforts to work around the extensive deficiencies of NAND flash. Trusting in cheap commodity devices that don't even claim to prot
Re:Why did it only happen on Samsung's SSDs? (Score:5, Insightful)
Confirmation bias. It was happening with other brands, but for one reason or another, people focused in on Samsung as the culprit, and once that happened, there was no getting out of it.
Re: (Score:2)
Excellent question. My first guesses would be that either the Samsung SSDs were doing something a bit out-of-specs, or the Samsung SSDs have something that's missing from other SSDs.
Re: (Score:2)
Re: (Score:2)
I'm not familiar with all the flash-related technologies currently in use, what's your opinion on the Intel SSDs?
Re: (Score:2)
See SSD life endurance test https://techreport.com/review/... [techreport.com].
Anyway, SSDs fail completely differently from hard drives. Most just vanish, some corrupt massively, and others go into a final one-chance read-only mode (select Intel consumer models).
Tested backups are a necessity here.
Re: (Score:2)
If you want to buy the cheapest (price/performance) consumer SSD out there then yes you buy Samsung or Crucial.
Intel prices their consumer stuff higher because they want fatter margins.
Re: (Score:2)
Most (not all) Intel drives are higher priced because they use SLC memory. Prior to 2014, I believe all Intel drives were SLC.
Re: (Score:2)
Re: (Score:3)
From TFS: "The vendor of the drive did not matter and the previous blacklisting of Samsung drives for broken queued trim support can be most likely lifted after further tests."
If the vendor of the drive does not matter in testing, then there is no relevant difference in specification compliance or other "somethings." It's pur
Re: (Score:2)
it could affect all drives equally (Score:2)
But it doesn't have to. If a drive were to implement TRIM by doing absolutely nothing (which is completely within spec) then it wouldn't show the problem, but it doesn't mean the drive is better than another or the other drive has a fault.
It's quite possible that the way IBM implements TRIM is just a little different. Perhaps they defer it for a few ms or something. So the bug is occurring over and over but it doesn't show itself with corruption.
Yes, assuming that because you can reproduce it on Samsung dri
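The point about a do-nothing TRIM can be shown with a toy model. Below, two imaginary drives receive the same misdirected trim: one ignores the hint entirely, the other really drops the blocks. Only the second shows damage, even though the host-side mistake is identical. The class `ToyDrive` and the helper `misdirected_trim_damage` are invented for this illustration and do not model any real firmware.

```python
class ToyDrive:
    """Imaginary drive whose TRIM either does nothing or really drops blocks."""

    def __init__(self, nblocks, trim_zeroes):
        self.blocks = [f"data{i}" for i in range(nblocks)]
        self.trim_zeroes = trim_zeroes

    def trim(self, start, count):
        if not self.trim_zeroes:
            return  # hint ignored; old contents stay readable
        for i in range(start, start + count):
            self.blocks[i] = None


def misdirected_trim_damage(drive, intended_start, actual_start, count):
    """Send a trim to the wrong place; list blocks lost outside the intended range."""
    before = list(drive.blocks)
    drive.trim(actual_start, count)
    intended = set(range(intended_start, intended_start + count))
    return [
        i
        for i, (old, new) in enumerate(zip(before, drive.blocks))
        if old != new and i not in intended
    ]


lazy = ToyDrive(8, trim_zeroes=False)
eager = ToyDrive(8, trim_zeroes=True)
print(misdirected_trim_damage(lazy, 0, 4, 2))   # no visible damage
print(misdirected_trim_damage(eager, 0, 4, 2))  # blocks 4 and 5 are gone
```

So a test that passes on one brand and fails on another can still be exposing a host-side bug; the drive that "passes" may just be the one that acts on TRIM less eagerly.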
Re: (Score:2)
Excellent question. My first guesses would be that either the Samsung SSDs were doing something a bit out-of-specs, or the Samsung SSDs have something that's missing from other SSDs.
Knowing the industry the way it is, it is just as likely that Samsung were the only ones who implemented the spec faithfully without some dodgy firmware workaround.
Sometimes the "broken" device is the only one actually working properly.
Re: (Score:2)
People just complained about Samsung drives more,
the article said some Intel drives not affected (Score:2)
The linked article pointed out that five models of Samsung SSD were affected, three models of Intel SSD were not. So there were at least some drives that didn't seem to be affected by the bug. (Presumably just due to luck/usage-pattern/etc.)
Re: (Score:3)
Perhaps competitive prices coupled with perceived quality (and good experience on other platforms) led to these drives being selected by more knowledgeable or performance oriented people.
These drives then got pushed harder or in ways more likely to expose the bugs, leading to a perception that they were unreliable under Linux.
Re: (Score:2)
Because there are two different bugs at issue here. There was a bug in the Linux kernel which Samsung fixed; and some of their drives have broken queued TRIM support. Summary makes a mess of it.
Re: (Score:2)
When I worked on a military base a while back, there was a young female in the group whose last name was Trim. I never made a comment on it until the last couple days I was going to be there, and only in response to her making a remark like "some guys snicker" when hearing her name. I told her it was one of the first thought in my mind months earlier, but couldn't say anything.
Could be worse though. In World War II, there was an Admiral Kuntz. He has a road and access gate named after him at Pearl Harbor. I
Re: (Score:2)
Doesn't matter much - this is why many Samsungs were mistakenly blacklisted, thinking it was a problem with the drive.
Unless you're running RAID0 or similar, it's not going to bite you. Not at all sure why anyone runs RAID0, to be honest, and certainly not with SSDs, but there you go. RAID10 is affected, I believe, but with 8 drives I'm not sure what you'd get from RAID10 that RAID 5 wouldn't have done better for you anyway.