Bug Data Storage Linux

Samsung Finds, Fixes Bug In Linux Trim Code

New submitter Mokki writes: After many complaints that Samsung SSDs corrupted data when used with Linux, Samsung found that the bug was in the Linux kernel and submitted a patch to fix it. It turns out that kernels without the final fix can corrupt data if the system is using Linux md RAID with raid0 or raid10 and issues trim/discard commands (either via fstrim or by the filesystem itself). The vendor of the drive did not matter, and the previous blacklisting of Samsung drives for broken queued trim support can most likely be lifted after further tests. According to this post, the bug has been around for a long time.
This discussion has been archived. No new comments can be posted.

  • awkward! (Score:4, Insightful)

    by Anonymous Coward on Thursday July 30, 2015 @02:18PM (#50216413)

    Well, that's gotta be embarrassing for everyone bashing Samsung over this. I remember reading some rather strong opinions about who was at fault.

    • Re:awkward! (Score:2, Interesting)

      by Anonymous Coward on Thursday July 30, 2015 @02:38PM (#50216625)

      I'd be interested to see if anyone has apologized. Doing so is exceedingly rare on internet forums.

    • Re:awkward! (Score:2, Insightful)

      by mwvdlee ( 775178 ) on Thursday July 30, 2015 @03:14PM (#50216935) Homepage

      Even more so for the kernel developers that blacklisted the Samsung drives.
      These developers should probably be banned from kernel development, or at least banned from making decisions regarding functionality.
      Creating code with a bug is human; not doubting your own code and blaming somebody else is stupid.

      • by Chirs ( 87576 ) on Thursday July 30, 2015 @03:44PM (#50217207)

        hardware firmware is commonly buggy. Device drivers often have to work around buggy hardware, so blacklisting devices for various functionality is not at all unusual.

        If the code seems to work with other devices and breaks with a new device, then the first instinct is going to be to assume the new device is doing something wrong.

        • by Midnight Thunder ( 17205 ) on Thursday July 30, 2015 @03:51PM (#50217299) Homepage Journal

          hardware firmware is commonly buggy. Device drivers often have to work around buggy hardware, so blacklisting devices for various functionality is not at all unusual.

          If the code seems to work with other devices and breaks with a new device, then the first instinct is going to be to assume the new device is doing something wrong.

          Another way of seeing things is that even if the bug is in the kernel, blacklisting still prevents damage to data on said vendor's hardware. When it comes to data corruption, the first thing to do is limit damage, no matter who is at fault. Afterwards, you can work together to try to isolate the source of the problem. Having unhappy users and customers is never good, unless you are the competition.

          • by AmiMoJo ( 196126 ) on Thursday July 30, 2015 @05:22PM (#50218261) Homepage Journal

            It's the fact that they put the boot in to Samsung, claiming that their TRIM implementation was broken. They then stopped looking at their own code and had to wait for Samsung to fix their bug.

            • by Anonymous Coward on Thursday July 30, 2015 @09:26PM (#50219813)

              Sorry, that's incorrect.

              There's a bug on MD raid0 and raid10. In Linux.

              There is a data destroyer bug in SAMSUNG NCQ TRIM firmware. Which is *blacklisted*, so that it uses the non-ncq trim.

              See? You're an idiot and everyone but you actually knew what they were complaining about. The samsung firmware is buggy crap that destroys data on NCQ TRIM, and the Linux kernel had a data destroyer bug in RAID0/RAID10 + TRIM that was fixed by a samsung engineer.

              The samsung firmware is still broken, the linux kernel has been fixed, and you're still a useless idiot.

    • Re:awkward! (Score:5, Insightful)

      by Anonymous Coward on Thursday July 30, 2015 @04:20PM (#50217647)

      The firmware bug of Samsung drives, a very severe one actually, was confirmed by Samsung. The RAID 0 issue is a totally different one, hardly affecting anyone.

      So yes, the severe issue was a bug on Samsung's side, while the very rare RAID 0 bug is a Linux kernel one.

  • by Anonymous Coward on Thursday July 30, 2015 @02:24PM (#50216495)

    Thank You Samsung!
    While our company CAD workstations don't run Linux, all of them do run on Samsung SSDs.

  • Bravo (Score:5, Interesting)

    by Virtucon ( 127420 ) on Thursday July 30, 2015 @02:25PM (#50216503)

    Nice to see vendors working together to improve Linux.

    • Re:Bravo (Score:5, Insightful)

      by gstoddart ( 321705 ) on Thursday July 30, 2015 @02:29PM (#50216547) Homepage

      After many complaints that Samsung SSDs corrupted data when used with Linux

      There was definitely some self-interest there.

      Samsung can't have people saying their SSDs corrupt data when it's not them doing it.

      • Re:Bravo (Score:5, Interesting)

        by DarkOx ( 621550 ) on Thursday July 30, 2015 @03:02PM (#50216825) Journal

        Sure, there was self-interest. Still, I think they deserve a lot of credit here. Rather than the typical "It's not my code" response from a developer who is sure the problem is elsewhere (rightly or wrongly), they actually found and fixed the problem. That is good behavior!

        • Re:Bravo (Score:4, Insightful)

          by Anonymous Coward on Thursday July 30, 2015 @03:18PM (#50216957)

          Of course, this is only possible when the "other person's" code is Free Software. If this had been a problem in Windows/OSX that Microsoft/Apple was refusing to fix, there's little Samsung could have done about it.

        • by gstoddart ( 321705 ) on Thursday July 30, 2015 @03:24PM (#50217015) Homepage

          Sure it was good behavior.

          But it was born entirely out of the Linux people saying "OMG, teh Samsung is teh sux0r".

          I do give them a lot of credit. More than the people who apparently insisted it was the fault of Samsung in the first place.

        • by Yunzil ( 181064 ) on Thursday July 30, 2015 @06:34PM (#50218775) Homepage

          Rather than the typical "It's not my code" response from a developer who is sure the problem is elsewhere (rightly or wrongly)

          Except that's exactly what happened (on the Linux side).

    • by jones_supa ( 887896 ) on Thursday July 30, 2015 @02:55PM (#50216749)

      Nice to see vendors working together to improve Linux.

      Well, Samsung had some SSDs to sell. It's part of the open source philosophy: you scratch your own itch, and everyone benefits.

      Still, the problem is that we don't arrive at a well-rounded result. Fixing some things here and there is not deep QA. After stories like this I always get cold chills imagining what else is broken in there.

    • Re: Bravo (Score:5, Interesting)

      by bill_mcgonigle ( 4333 ) * on Thursday July 30, 2015 @03:50PM (#50217285) Homepage Journal

      Yeah, the outcome is great. I just wonder why they waited more than a year to look into it. Maybe this will set a good example for the industry that with a little bit of effort you can take care of your customers and sell more product.

      If this were the 80's and a hard drive vendor had more than two reports of data loss under, say VMS, there would have been engineers on a plane to DEC by morning to get it solved by the coming weekend.

      Now we have thousands of users with reports and millions of units sold, and a wealthy vendor, and it's all crickets, leaving some kernel hackers to half-ass a blacklist. It's not like this is BeOS - there are millions of servers running in the target market. I don't mean to absolve the bad troubleshooting by kernel devs, but want to know what drove the apathy at Samsung (and other vendors behaving poorly). It's obviously not profit motive.

      • Re: Bravo (Score:5, Informative)

        by bill_mcgonigle ( 4333 ) * on Thursday July 30, 2015 @04:16PM (#50217595) Homepage Journal

        I take some of that back. It seems the real credit for digging in goes to these guys [algolia.com]. Samsung came in a month ago after they were provided a test suite and then gets credit for finding the kernel code path that caused the problem. An Oracle engineer provided a more-correct patch.

      • by aNonnyMouseCowered ( 2693969 ) on Thursday July 30, 2015 @09:26PM (#50219819)

        "If this were the 80's and a hard drive vendor had more than two reports of data loss under, say VMS, there would have been engineers on a plane to DEC by morning to get it solved by the coming weekend."

        Hard disks were way more expensive in the 80s, and they sold in lower numbers. So it makes economic sense to do hands-on damage control.

  • Crying wolf (Score:5, Informative)

    by Sponge Bath ( 413667 ) on Thursday July 30, 2015 @02:30PM (#50216551)
    When Apple updated OS X to allow TRIM on non-Apple supplied SSDs, forums were flooded with people claiming you should never use Samsung because they were fundamentally broken with regards to TRIM. Their "proof" was that corruption happened on Linux and they would not be swayed by the thought that maybe the problem was with Linux.
    • Re:Crying wolf (Score:5, Informative)

      by GigaplexNZ ( 1233886 ) on Thursday July 30, 2015 @09:50PM (#50219919)
      That really depends on whether OS X uses serial or queued TRIM. The Samsung drives work fine with serial TRIM, but are still broken with queued TRIM. The bug that Algolia reported and Samsung fixed in the kernel was a serial TRIM issue in the Linux kernel with RAID, which is unrelated to the queued TRIM firmware issues.
  • by darkain ( 749283 ) on Thursday July 30, 2015 @02:33PM (#50216575) Homepage

    This is just another case of "Not My Problem" syndrome that too many techs get into. They think their code/tools/systems/whatever must be perfect, and others are the ones fucking up. Samsung drives went on a blacklist for issuing the commands to them due to this bug? "WALP, LINUX IS PERFECT, MUST BE THE HARDWARE GUYS, even though their devices perform perfectly on other OSes" - and instead now we're left with a bug in Linux that corrupts data until the patch can make its way through the distro channels and get pushed out to end users.

    • by LWATCDR ( 28044 ) on Thursday July 30, 2015 @03:16PM (#50216943) Homepage Journal

      You should take a look at the "black list" before you try to figure that question out.
      The list includes other brands of drive as well as Samsung...

    • by thegarbz ( 1787294 ) on Thursday July 30, 2015 @09:41PM (#50219893)

      How many software engineers does it take to change a lightbulb? None, it's an electrical problem.
      How many electrical engineers does it take to change a lightbulb? None, we'll just work around it in software.

    • by GigaplexNZ ( 1233886 ) on Thursday July 30, 2015 @09:52PM (#50219923)

      This is just another case of "Not My Problem" syndrome that too many techs get into.

      No, it's a case of everyone jumping to conclusions.

      Samsung drives went on a blacklist for issuing the commands to them due to this bug?

      No, they went on the queued TRIM blacklist due to a different bug. This bug was an unrelated serial TRIM bug when used in conjunction with RAID.

  • by account_deleted ( 4530225 ) on Thursday July 30, 2015 @02:33PM (#50216577)
    Comment removed based on user account deletion
  • by jwkane ( 180726 ) on Thursday July 30, 2015 @02:38PM (#50216633) Homepage

    Vote with your wallet. My next SSD will be a Samsung.

  • by Dishwasha ( 125561 ) on Thursday July 30, 2015 @02:48PM (#50216699)

    :thumbsup:

  • by Kuruk ( 631552 ) on Thursday July 30, 2015 @02:56PM (#50216771)
    Hats off to Samsung for finding and even fixing the problem.
  • by JustAnotherOldGuy ( 4145623 ) on Thursday July 30, 2015 @03:29PM (#50217051) Journal
    On behalf of all internet users everywhere, whether in this specific space-time continuum or not, I would like to formally apologize to Samsung for all of the totally unwarranted bashing they took over this issue. And I would also like to express my gratitude to them for finding a bug, fixing it, and posting a fix. Good job.
  • by metamatic ( 202216 ) on Thursday July 30, 2015 @06:01PM (#50218589) Homepage Journal

    I'm running Linux on a RAID-0 SSD array.

    I guess I should turn off fstrim until there's a backport of the fix to Fedora?
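
For anyone in the same position, here is a minimal sketch of how one might list the md arrays that match the levels named in the story (raid0 and raid10). It assumes the usual /proc/mdstat layout of `mdN : active <level> <members>`; the function name and the sample text are made up for illustration, and it says nothing about whether a given kernel already carries the fix.

```python
# Levels the story summary names as affected when trim/discard is issued.
AFFECTED_LEVELS = {"raid0", "raid10"}

# Hypothetical /proc/mdstat contents, used only to illustrate the parsing.
SAMPLE_MDSTAT = """\
Personalities : [raid0] [raid1]
md0 : active raid0 sdb1[1] sda1[0]
      1953260544 blocks super 1.2 512k chunks

md1 : active raid1 sdd1[1] sdc1[0]
      976630464 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""


def affected_md_arrays(mdstat_text):
    """Return the names of md arrays whose level is raid0 or raid10."""
    found = []
    for line in mdstat_text.splitlines():
        tokens = line.split()
        # Array lines look like: "md0 : active raid0 sdb1[1] sda1[0]"
        if len(tokens) >= 3 and tokens[0].startswith("md") and tokens[1] == ":":
            if AFFECTED_LEVELS.intersection(tokens[2:]):
                found.append(tokens[0])
    return found


print(affected_md_arrays(SAMPLE_MDSTAT))
```

On a live system you would pass in the contents of /proc/mdstat instead of the sample. Whether to pause fstrim on the arrays it reports is still a judgment call until the distro ships the patch.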

  • by godamntheman ( 989491 ) on Thursday July 30, 2015 @08:18PM (#50219453)
    Something doesn't add up ... The fix for this was an oversight in a relatively new "bio_split()" routine that merged in with the immutable bio vector patch set for Linux kernel 3.15. The Algolia blog referenced in the Samsung patch claims it was able to replicate the discard issue using kernels 3.2, 3.10, and 3.14, before the bug existed. What gives?
  • by KonoWatakushi ( 910213 ) on Thursday July 30, 2015 @10:18PM (#50220003)

    While an apology is due, this sort of problem is inevitable given the nature of the technology. TRIM on NAND is a crutch for a technology that is poorly suited to data storage. Transforming NAND into a usable storage device requires heroic efforts on the part of the vendor, and it is hard to blame them for the bugs. Likewise, it is hard to blame Linux developers for their heroic efforts to work around the extensive deficiencies of NAND flash. Trusting in cheap commodity devices that don't even claim to protect against power loss is ill-advised.

    Using TRIM as a band-aid for the performance woes of over-filled NAND devices is just asking for trouble. It has long been known that filling up filesystems leads to terrible performance, and the same applies to NAND drives. It is irresponsible of the vendors to provision the drives with insufficient reserved space, but one can compensate by setting aside an empty partition covering 5% of the space. It is much safer to disable TRIM and under-provision the drive, and it achieves the same effect of limiting write-amplification, without having to worry about bugs trimming away live data.

    The only place where TRIM really makes sense is in the context of virtualization. Recovering space in sparse virtual disk images has real benefit, and operating system vendors have a lot more incentive and ability to make it work properly.
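
The 5% suggestion above comes down to simple arithmetic. A rough sketch, assuming a hypothetical 256 GB drive, 1 MiB partition alignment, and a reserve that is simply left unpartitioned and never written; the function name and defaults are illustrative, not any tool's API.

```python
MIB = 1024 * 1024


def reserve_plan(drive_bytes, reserve_percent=5, align=MIB):
    """Split a drive into (usable_bytes, reserve_bytes).

    The usable region is rounded down to the alignment, so the reserve
    is never smaller than the requested percentage.
    """
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in [0, 100)")
    # Integer math avoids floating-point rounding on large byte counts.
    usable = drive_bytes * (100 - reserve_percent) // 100
    usable -= usable % align
    return usable, drive_bytes - usable


# Hypothetical drive size: 256 GB as marketed (decimal gigabytes).
drive = 256 * 1000**3
usable, reserve = reserve_plan(drive)
print(f"partition up to {usable // MIB} MiB, leave {reserve // MIB} MiB untouched")
```

The point of rounding the usable region down rather than up is that any alignment slack ends up in the reserve, not in the partition.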
