Data Storage Software Education Linux

The Many Paths To Data Corruption

Posted by Zonk
from the luke-alan-cox-is-your-guru dept.
Runnin'Scared writes "Linux guru Alan Cox has a writeup on KernelTrap in which he talks about all the possible ways for data to get corrupted when being written to or read from a hard disk drive. This includes much of the information applicable to all operating systems. He prefaces his comments noting that the details are entirely device specific, then dives right into a fascinating and somewhat disturbing path tracing data from the drive, through the cable, into the bus, main memory and CPU cache. He also discusses the transfer of data via TCP and cautions, 'unfortunately lots of high performance people use checksum offload which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic but in the market speed sells not correctness.'"
This discussion has been archived. No new comments can be posted.

  • by eln (21727) * on Friday September 14, 2007 @05:50PM (#20609727) Homepage
    The most common way for young data to be corrupted is to be saved on a block that once contained pornographic data. As we all know, deleting data alone is not sufficient, as that will only remove the pointer to the data while leaving the block containing it undisturbed. This allows a young piece of data to easily see the old porn data as it is being written to that block. For this reason, it is imperative that you keep all pornographic data on separate physical drives.

    In addition, you should never access young data and pornographic data in the same session, as the young impressionable data may get corrupted by the pornographic data if they exist in RAM at the same time.

    Data corruption is a serious problem in computing today, and it is imperative that we take steps to stop our young innocent data from being corrupted.
    • Re: (Score:3, Funny)

      by king-manic (409855)
      In addition, you should never access young data and pornographic data in the same session, as the young impressionable data may get corrupted by the pornographic data if they exist in RAM at the same time.

      indeed, young pornographic data is disturbing. Fortunately there is a legal social firewall of 18.
      • by rts008 (812749)
        From 'eln:21727' "The most common way for young data to be corrupted is to be saved on a block that once contained pornographic data."

        Unfortunately, as to your "Fortunately there is a legal social firewall of 18.", it depends on which block in which city you are cruising as to whether or not they may be at least/over 14, much less 18.

        At least that's what a traveling friend of mine told me....honest!
    • Boy, your expert on this subject! I wonder if the FBI is watching you.
      • by dwater (72834)
        > Boy, your expert on this subject!

        Who's "Boy", how you know Boy is an expert, and what makes Boy the poster's?
    • Re: (Score:1, Insightful)

      by Anonymous Coward
      Funny, but there's a bit of truth in it too. If data corruption happens in the filesystem, it can cause files to become interlinked or point to "erased" data, which might be a surprise that you don't want if you keep porn on the same harddisk as data which is going to be published.
    • by legirons (809082)
      "The most common way for young data to be corrupted is to be saved on a block that once contained pornographic data."

      Whatever you do, don't dilute the data 200 times by zeroing 9/10 of the data each time, otherwise your drive will be full of porn ;)
  • Paul Cylon (Score:3, Funny)

    by HTH NE1 (675604) on Friday September 14, 2007 @05:52PM (#20609765)
    There must be 50 ways to lose your data.
  • benchmarks (Score:5, Insightful)

    by larien (5608) on Friday September 14, 2007 @05:57PM (#20609829) Homepage Journal
    As Alan Cox alluded to, there are benchmarks for data transfers, web performance, etc, etc, etc, but none for data integrity; it's kind of assumed, even if it perhaps shouldn't be. It also reminds me of various cluster software which will happily crash a node rather than risk data corruption (Sun Cluster & Oracle RAC both do this). What do you *really* want? Lightning fast performance, or the comfort of knowing that your data is intact & correct? For something like a rendering farm, you can probably tolerate a pixel or two being the wrong shade. If you're dealing with money, you want the data to be 100% correct, otherwise there's a world of hurt waiting to happen...
    • Re:benchmarks (Score:5, Interesting)

      by dgatwood (11270) on Friday September 14, 2007 @06:25PM (#20610161) Journal

      I've concluded that nobody cares about data integrity. That's sad, I know, but I have yet to see product manufacturers sued into oblivion for building fundamentally defective devices, and that's really what it would take to improve things, IMHO.

      My favorite piece of hardware was a chip that was used in a bunch of 5-in-1 and 7-in-1 media card readers about four years ago. It was complete garbage, and only worked correctly on Windows. Mac OS X would use transfer sizes that the chip claimed to support, but the chip returned copies of block 0 instead of the first block in every transaction over a certain size. Linux supposedly also had problems with it. This was while reading, so no data was lost, but a lot of people who checked the "erase pictures after import" button in iPhoto were very unhappy.

      Unfortunately, there was nothing the computer could do to catch the problem, as the data was in fact copied in from the device exactly as it presented it, and no amount of verification could determine that there was a problem because it would consistently report the same wrong data.... Fortunately, there are unerase tools available for recovering photos from flash cards. Anyway, I made it a point to periodically look for people posting about that device on message boards and tell them how to work around it by imaging the entire flash card with dd bs=512 until they could buy a new flash card reader.
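A minimal Python sketch of the same workaround, for readers who want to see what "imaging with bs=512" amounts to: copy the device in 512-byte reads into an image file. The function name and paths are hypothetical, and whether small reads actually reach the card reader as small transfers depends on the OS and the device node used, which is an assumption here.

```python
def image_device(src_path, dst_path, block_size=512):
    """Copy src_path to dst_path using reads of at most block_size bytes.

    Roughly what `dd bs=512` does. Opening with buffering=0 keeps Python
    from batching the small reads into larger ones on its side.
    """
    copied = 0
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:  # empty read means end of device/file
                break
            dst.write(block)
            copied += len(block)
    return copied
```

On a real card the source would be the raw device node, and the resulting image can then be mounted or fed to a photo recovery tool.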

      In the end, I moved to a FireWire reader and I no longer trust USB for anything unless there's no other alternative (iPod, iPhone, and disks attached to an Airport Base Station). While that makes me somewhat more comfortable than dealing with USB, there have been a few nasty issues even with FireWire devices. For example, there was an Oxford 922 firmware bug about three years back that wiped hard drives if a read or write attempt was made after a spindown request timed out or something. I'm not sure about the precise details.

      And then, there is the Seagate hard drive that mysteriously will only boot my TiVo about one time out of every twenty (but works flawlessly when attached to a FW/ATA bridge chipset). I don't have an ATA bus analyzer to see what's going on, but it makes me very uncomfortable to see such compatibility problems with supposedly standardized modern drives. And don't get me started on the number of dead hard drives I have lying around....

      If my life has taught me anything about technology, it is this: if you really care about data, back it up regularly and frequently, store your backups in another city, ensure that those backups are never all simultaneously in the same place or on the same electrical grid as the original, and never throw away any of the old backups. If it isn't worth the effort to do that, then the data must not really be important.

      • by unitron (5733)

        And then, there is the Seagate hard drive that mysteriously will only boot my TiVo about one time out of every twenty (but works flawlessly when attached to a FW/ATA bridge chipset).

        And then there's my 80 Gig Western Digital that was very flakey (as soon as the warranty was up) in BX chipset (or equivalent) motherboard PCs, but I used it to replace the original drive in a Series 1 stand alone Philips Tivo and it's been working flawlessly in it for about a year now. Before you blame WD, I'm writing this on a BX chipset PC that's been running another WD 80 Gig that's almost identical (came off the assembly line a few months earlier) and it's been working fine since before I got the ne

      • by IvyKing (732111)

        In the end, I moved to a FireWire reader and I no longer trust USB for anything unless there's no other alternative (iPod, iPhone, and disks attached to an Airport Base Station). While that makes me somewhat more comfortable than dealing with USB, there have been a few nasty issues even with FireWire devices.

        I don't recall seeing anything with regards to FireWire vs USB that would give FireWire an advantage in data integrity (though I may be missing some finer points about the respective specs). OTOH, I have seen specs (one of the LaCie RAID in a box drives) that give a 10 to 20% performance advantage to FW despite the 'lower' peak speeds - one reason is that FW uses separate pairs for xmit and rcv.

        • by dgatwood (11270)

          There's no technical reason for FW drives to be more reliable. The limited number of FireWire silicon vendors, however, does mean that each one is likely to get more scrutiny than the much larger number of USB silicon vendors, IMHO.

        • Isn't Firewire "peer-peer" rather than the "client-server" of USB (and in fact can function without a PC host)? Maybe this allows certain Firewire devices to do more of the donkey work themselves instead of offloading it to the PC "server"? /me shrugs
          • by dgatwood (11270)

            Close. The FireWire controller is smarter, which allows it to do a lot more work without the CPU being involved. That's the reason FireWire performs faster than USB 2.0 despite being a slower bus.

            Of course, the new USB 3.0 will bump USB speed way up. I'm not holding my breath, though. Considering what USB 2.0 does to the CPU load when the bus is under heavy use, I'd expect that Intel had better increase the number of CPU cores eightfold right now to try to get ahead of the game.... :-)


      • then the data must not really be important

        Yep, that's it: loads of useless data, produced by a society barely able to perform some relatively weak techno tricks while completely failing to solve basic issues. Something is wrong in this biometric cash flow production model.

    • by BSAtHome (455370)
      To paraphrase a RFC: Good, Fast, Cheap; pick two, you can't have all three.
  • End-to-end (Score:5, Informative)

    by Intron (870560) on Friday September 14, 2007 @05:59PM (#20609847)
    Some enterprise server systems use end-to-end protection, meaning the data block is longer. If you write 512 bytes of data + 12 bytes or so of check data and carry that through all of the layers, it can prevent the data corruption from going undiscovered. The check data usually includes the block's address, so that data written with correct CRC but in the wrong place will also be discovered. It is bad enough to have data corrupted by a hardware failure, much worse not to detect it.
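A minimal sketch of the idea described above, assuming a 12-byte trailer made of an 8-byte block address plus a 32-bit CRC. Real end-to-end formats differ in layout and in the check code used, and the function names here are hypothetical.

```python
import struct
import zlib

SECTOR_SIZE = 512
TRAILER = struct.Struct("<QI")  # 8-byte block address + 4-byte CRC = 12 bytes


def protect(lba, data):
    """Return the 512-byte block with a 12-byte check trailer appended."""
    if len(data) != SECTOR_SIZE:
        raise ValueError("expected a 512-byte block")
    # The CRC covers the address as well as the payload.
    crc = zlib.crc32(struct.pack("<Q", lba) + data)
    return data + TRAILER.pack(lba, crc)


def check(expected_lba, stored):
    """True only if the block is intact AND was written for expected_lba."""
    if len(stored) != SECTOR_SIZE + TRAILER.size:
        return False
    data, trailer = stored[:SECTOR_SIZE], stored[SECTOR_SIZE:]
    lba, crc = TRAILER.unpack(trailer)
    if lba != expected_lba:
        return False  # valid CRC, wrong place: a misdirected write
    return zlib.crc32(struct.pack("<Q", lba) + data) == crc
```

Because the address travels with the data, a block that is internally consistent but sits at the wrong location still fails the check, which is the misdirected-write case mentioned above.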
  • Hello ZFS (Score:5, Informative)

    by Wesley Felter (138342) <wesley@felter.org> on Friday September 14, 2007 @06:09PM (#20609971) Homepage
    ZFS's end-to-end checksums detect many of these types of corruption; as long as ZFS itself, the CPU, and RAM are working correctly, no other errors can corrupt ZFS data.

    I am looking forward to the day when all RAM has ECC and all filesystems have checksums.
    • No (Score:4, Interesting)

      by ElMiguel (117685) on Friday September 14, 2007 @06:26PM (#20610171)

      as long as ZFS itself, the CPU, and RAM are working correctly, no other errors can corrupt ZFS data.

      Sorry, but that is absurd. Nothing can absolutely protect against data errors (even if they only happen in the hard disk). For example, errors can corrupt ZFS data in a way that turns out to have the same checksum. Or errors can corrupt both the data and the checksum so they match each other.

      This is ECC 101 really.

      • For example, errors can corrupt ZFS data in a way that turns out to have the same checksum. Or errors can corrupt both the data and the checksum so they match each other.

        You can use SHA as the checksum algorithm; the chance of undetected corruption is infinitesimal.
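A rough sketch of what "use SHA as the checksum" means in practice, assuming SHA-256 and assuming a random corruption produces an effectively random digest. The helper names are hypothetical, and this is an illustration of the idea rather than how ZFS lays out its checksums.

```python
import hashlib


def checksum(block):
    """SHA-256 digest stored alongside (or above) the block."""
    return hashlib.sha256(block).digest()


def is_intact(block, expected_digest):
    """Recompute the digest on read and compare with the stored one."""
    return hashlib.sha256(block).digest() == expected_digest


def accidental_match_chance(digest_bits):
    """Chance that a random corruption yields the same n-bit digest,
    under the assumption that digests are uniformly distributed."""
    return 2.0 ** -digest_bits
```

Under that assumption a 16-bit check lets about one random corruption in 65,536 slip through, while for a 256-bit digest the figure is 2^-256, which is the sense in which the chance is "infinitesimal" rather than zero.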
        • Re:No (Score:5, Funny)

          by Slashcrap (869349) on Friday September 14, 2007 @07:00PM (#20610501)
          Or errors can corrupt both the data and the checksum so they match each other.

          This is about as likely as simultaneously winning every current national and regional lottery on the planet. And then doing it again next week.

          And if we're talking about a 512 bit hash then it's possible that a new planet full of lotteries will spontaneously emerge from the quantum vacuum. And you'll win all those too.
          • Re: (Score:2, Funny)

            by TruthfulLiar (927336)
            > And if we're talking about a 512 bit hash then it's possible that a new planet full of lotteries will spontaneously emerge from the quantum vacuum. And you'll win all those too.

            If this happens, be sure to keep the money from the quantum vacuum lotteries in a separate account, or it will annihilate with your real money.
            • I'm amazed that none of you have ever heard of the Girlfriend Money experiment: when a girlfriend (especially your own) looks at a certain amount of money, she'll cause the collapse of the money's superposition. This _always_ results in the money disappearing both completely and instantaneously. ;o
      • by E-Lad (1262) on Friday September 14, 2007 @08:39PM (#20611497) Homepage
        Give this blog entry a read:
        http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta [sun.com]

        And you'll understand :)
    • Re:Hello ZFS (Score:4, Informative)

      by harrkev (623093) <kfmsdNO@SPAMharrelsonfamily.org> on Friday September 14, 2007 @06:52PM (#20610409) Homepage

      I am looking forward to the day when all RAM has ECC and all filesystems have checksums.
      Not gonna happen. The problem is that ECC memory costs more, simply because there is 12.5% more memory. Most people are going to go for as cheap as possible.

      But, ECC is available. If it is important to you, pay for it.
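The 12.5% figure matches the usual layout of 8 check bits stored alongside every 64 data bits. As a toy illustration of how extra bits let hardware correct a flipped bit, here is a Hamming(7,4) sketch. It is a much smaller code than the one ECC modules actually use, and the function names are hypothetical.

```python
from typing import List


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    code = [0] * 8  # index 0 unused so indices match bit positions
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def decode(bits: List[int]) -> int:
    """Correct at most one flipped bit and return the 4 data bits."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    if syndrome:
        code[syndrome] ^= 1  # the syndrome is the position of the bad bit
    return code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
```

Flipping any single bit of an encoded nibble still decodes to the original value; two flipped bits would be miscorrected by this toy code, which is why real memory ECC adds a further parity bit to at least detect double errors.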
      • Intel or AMD could force ECC adoption if they wanted to; the increase in cost would be easily hidden by Moore's Law.
      • by drsmithy (35869)

        Not gonna happen. The problem is that ECC memory costs more, simply because there is 12.5% more memory. Most people are going to go for as cheap as possible.

        It'll happen for the same reason RAID5 on certain types of arrays will be obsolete in 4 - 5 years. Eventually memory sizes are going to get so big that the statistical probability of a memory error will effectively guarantee they happen too frequently to ignore.

        • by QuoteMstr (55051)
          Obsolete? What would you replace it with then?
          • Re: (Score:3, Interesting)

            by drsmithy (35869)

            Obsolete? What would you replace it with then?

            RAID6. Then a while after that, "RAID7" (or whatever they call triple-parity).

            In ca. 4-5 years[0], the combination of big drives (2TB+) and raw read error rates (roughly once every 12TB or so) will mean that during a rebuild of 6+ disk RAID5 arrays after a single drive failure, a second "drive failure" (probably just a single bad sector, but the end result is basically the same) will be - statistically speaking - pretty much guaranteed. RAID5 will be obsel
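The arithmetic behind that estimate can be sketched as follows, assuming read errors are independent and arrive at the quoted average of one per 12 TB read (a Poisson model), and that a rebuild reads every surviving disk in full. The function name and the model are illustrative assumptions, not figures from a drive datasheet.

```python
import math


def rebuild_error_probability(n_disks, disk_tb, tb_per_error=12.0):
    """Chance of at least one unrecoverable read error during a RAID5 rebuild.

    After one disk fails, the remaining n_disks - 1 disks must be read
    completely to reconstruct the missing one.
    """
    data_read_tb = (n_disks - 1) * disk_tb
    expected_errors = data_read_tb / tb_per_error
    return 1.0 - math.exp(-expected_errors)
```

For a 6-disk array of 2 TB drives this simple model gives somewhat better than even odds of hitting an error mid-rebuild, and the figure keeps climbing as disks are added or get larger.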

      • by renoX (11677)
        >The problem is that ECC memory costs more, simply because there is 12.5% more memory.

        The big issue is that ECC memory doesn't cost only 12.5% more than regular memory, otherwise you'd see that lots of knowledgeable (or correctly guided) people would buy ECC.
      • ECC may be available
        ( I always build systems with http://www.crucial.com/ [crucial.com] ECC RAM,
        and no I'm nae affiliated,
        but they're the ONLY brand who've never once proven flaky,
        in my experience. . . )

        but the problem is that almost no motherboards support ECC.

        Gigabyte's GC-RAMDISK and GO-RAMDISK ( up-to 4GB ) hardware-ram-drive without ECC *support*,
        is typical of this idiocy:
        The only way to make the things trustworthy is to run a RAID5 or RAID6 array of 'em,
        and that gets bloody expensive
        ( though the s

    • Just so everybody knows, ZFS is available for Linux as a FUSE module. It's easy to get it working, and lots of fun to tinker with. I have it set up right now in a test configuration with an old 80 gig drive, and an 11 gig drive. 91 gigs total, in external USB enclosures. And I created files on an NFS server the same size as each of the drives, and told ZFS to use those files as mirrors. On a 100 megabit link. And surprisingly enough, it's actually not too slow to use!

      But the reason I have it set up is not to
    • Now, as far as I know, there are many schemes for correcting and detecting errors. Some, like FEC, fix infrequent, scattered errors. Others, like turbocodes, fix sizeable blocks of errors. This leads to two questions: what is the benefit in using plain CRCs any more? And since disks are block-based not streamed, wouldn't block-based error-correction be more suitable for the disk?
      • Now, as far as I know, there are many schemes for correcting and detecting errors. Some, like FEC, fix infrequent, scattered errors. Others, like turbocodes, fix sizeable blocks of errors. This leads to two questions: what is the benefit in using plain CRCs any more?

        CRCs are only used for detecting errors. Once you've detected a bad disk block, you can use replication (RAID 1), parity (RAID 4/5/Z), or some more advanced FEC (RAID 3/6/Z2) to correct the error. The benefit of CRCs is that you can read only th
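A small sketch of the detect-then-correct split described here, using a CRC to spot the bad copy and simple replication (the RAID 1 case) to repair it. The in-memory list of copies and the function name are hypothetical stand-ins for real mirrored disks.

```python
import zlib


def read_with_repair(copies, expected_crc):
    """Return a copy of the block whose CRC matches, healing bad mirrors.

    copies is a mutable list of bytes objects, one per mirror.
    Returns None when no mirror holds an intact copy.
    """
    good = next((c for c in copies if zlib.crc32(c) == expected_crc), None)
    if good is None:
        return None  # detected, but nothing left to correct from
    for i, c in enumerate(copies):
        if zlib.crc32(c) != expected_crc:
            copies[i] = good  # overwrite the damaged mirror with good data
    return good
```

The CRC itself never fixes anything; it only tells the reader which copy to distrust, and the redundancy does the repair.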
  • by jdigital (84195) on Friday September 14, 2007 @06:10PM (#20609977) Homepage
    I think I suffered from a series of Type III errors (rtfa). After merging lots of poorly maintained backups of my /home file system I decided to write a little script to look for duplicate files (using file size as a first indicator, then md5 for ties). The script would identify duplicates and move files around into a more orderly structure based on type, etc. After doing this I noticed that a small number of my mp3s now contain chunks of other songs in them. My script was only working with whole files, so I have no idea how this happened. When I refer back to the original copies of the mp3s the files are uncorrupted.

    Of course, no one believes me. But maybe this presentation is on to something. Or perhaps I did something in a bonehead fashion totally unrelated.
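For reference, a duplicate finder of the kind described (file size first, then MD5 to break ties) might look like the sketch below. This is not the poster's script; the function names are hypothetical, and it only reports duplicate groups without moving anything.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root that share both size and MD5."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

If such a script also moves files, re-hashing each file at its destination and comparing against the digest taken before the move is a cheap way to catch the kind of silent corruption reported above.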
    • by jdigital (84195)
      Of course the fa that I was referring to is here [web.cern.ch]. Much more informative than AC's post if I may say...
    • I have experienced having small chunks of other songs inside mp3 files on my mp3 player. Of course being a cheap player I assumed it was the player... I have a few problems when writing to it from Linux. I shall look more closely now!

      Perhaps the FAT filesystem is interpreted differently on the player to how Linux expects it to be? (Or VV)
    • by mikolas (223480)
      I have had the same kind of problems a few times. I store all my stuff on a server with RAID5, but there have been a couple of times when transferring music from the server (via SMB) to MP3 player (via USB) has corrupted files. I never solved the problem as the original files were intact so I did not go through the effort. However, after reading the article I just might do something about it as I got a bit more worried about the data integrity of my lifetime personal file collection that I store on the serv
    • by Reziac (43301) *
      Oh, I believe you... I'm reminded of a "HD crash" a friend suffered. Long story short, I wound up doing file reconstruction, and from the pieces (almost always some multiple of 4k) of files stuck inside other files, concluded that there was probably nothing wrong with the HD, but rather, the RAID controller was writing intact data but to random locations -- like it had got its "which HD this data belongs on" count off by one. And there was evidence that the corruption started long before anyone noticed the
  • MySQL? (Score:5, Funny)

    by Jason Earl (1894) on Friday September 14, 2007 @06:15PM (#20610059) Homepage Journal

    I was expecting an article on using MySQL in production.

  • That'll get fixed lickety split.

     
  • That is nothing compared to the actual storage technology. Attempting to recover data packed at a density of 1 GB/sq.in. from a disk spinning at 10,000 revolutions per minute where the actual data is stored in a micron thin layer of rust on the surface of the disk is manifestly impossible.

  • by ozzee (612196) on Friday September 14, 2007 @06:52PM (#20610403)

    Ah - this is the bane of computer technology.

    One time I remember writing some code and it was very fast and almost always correct. The guy I was working with exclaimed "I can give you the wrong answer in zero seconds" and I shut up and did it the slower way that was right every time.

    This mentality of speed at the cost of correctness is prevalent. For example, I can't understand why people don't spend the extra money on ECC memory *ALL THE TIME*. One failure over the lifetime of the computer and you have paid for your RAM. I have assembled many computers and unfortunately there have been a number of times where ECC memory was not an option. In almost every case where I have used ECC memory, the computer was noticeably more stable. Case in point, the most recent machine that I built has never (as far as I know) crashed and I've thrown some really nasty workloads its way. On the other hand, a couple of notebooks I have have crashed more often than I care to remember and there is no ECC option. Not to mention the ridicule I get for suggesting that people invest the extra $30 for a "non server" machine. Go figure. Suggesting that stability is the realm of "server" machines, and inferring that end user machines should be relegated to a realm of lowered standards of reliability, makes very little sense to me, especially when the investment of $30 to $40 is absolutely minuscule if it prevents a single failure. What I think (see lawsuit coming on) is that memory manufacturers will sell marginal-quality products to the non ECC crowd because there is no way of validating memory quality.

    I think there needs to be a significant change in the marketing of products to ensure that metrics of data integrity play a more significant role in decision making. It won't happen until the consumer demands it and I can't see that happening any time soon. Maybe, hopefully, I am wrong.

    • by Anonymous Coward
      I can't understand why people don't spend the extra money on ECC memory *ALL THE TIME*. One failure over the lifetime of the computer and you have paid for your RAM.

      I do understand it. They live in the real world, where computers are fallible, no matter how much you spend on data integrity. It's a matter of diminishing return. Computers without ECC are mostly stable and when they're not, they typically exhibit problems on a higher level. I've had faulty RAM once. Only one bit was unstable and only one test
      • by ozzee (612196)
        They live in the real world, where computers are fallible ...

        Computers are machines and don't need to be designed to be fallible. ECC is a small insurance policy to avoid problems exactly like the one you described. How much time did you spend on burning CD's that were no good, or running various memtests, not to mention the possible corrupted data you ended up saving and other unknown consequences ? Had you bought ECC RAM, your problem would have been corrected or more than likely detected not to menti

        • Re: (Score:1, Interesting)

          by Anonymous Coward
          Computers are machines and don't need to be designed to be fallible. ECC is a small insurance policy to avoid problems exactly like the one you described. How much time did you spend on burning CD's that were no good, or running various memtests

          That's beside the point. Computers ARE fallible, with or without ECC RAM. That you think they could be perfect (infallible) is testament to the already low rate of hardware defects which harm data integrity. It's good enough. I've experienced and located infrequent d
          • by ozzee (612196)
            Computers ARE fallible, with or without ECC RAM.

            Yes. They are, but considerably less fallible with ECC. Remember, "I can give you the wrong answer in zero seconds." There's no point in computing at all unless there is a very high degree of confidence in computational results. Software is just as fallible as hardware but again, I can, and do, make considerable effort in making it less fallible.

            I am the default sysadmin for a family member's business. There was a time where the system was fraught with

    • This mentality of speed at the cost of correctness is prevalent...

      I used to sell firewalls. People always wanted to know how fast it would work (most were good up to around 100Mbps, when most people had 2Mbps pipes at most); very few people asked detailed questions about what security policies it could enforce, or the correctness and security of the firewall device itself.

      Everyone knew they needed something, very few had a clue about selecting a good product, speed they understood, network securi

  • I remember a long time ago that cosmic rays (actually the ElectroMagnetic Field disruption they caused) created some of those errors.
    • by Intron (870560)
      The typical energy of a cosmic ray is around 300 MeV. Interestingly, around mid 1990's the feature size of SRAM cells got small enough that a 300 MeV event could flip the state. This means that the cache memory now needs ECC as well as main memory, but I don't see that happening in too many CPUs. Reference:

      http://www.srim.org/SER/SERTrends.htm [srim.org]

  • by cdrguru (88047) on Friday September 14, 2007 @06:59PM (#20610491) Homepage
    It amazes me how much has been lost over the years towards the "consumerization" of computers.

    Large mainframe systems have had data integrity problems solved for a long, long time. It is today unthinkable that any hardware issues or OS issues could corrupt data on IBM mainframe systems and operating systems.

    Personal computers, on the other hand, have none of the protections that have been present since the 1970s on mainframes. Yes, corruption can occur anywhere in the path from the CPU to the physical disk itself or during a read operation. There is no checking, period. And not only are failures unlikely to be quickly detected but they cannot be diagnosed to isolate the problem. All you can do is try throwing parts at the problem, replacing functional units like the disk drive or controller. These days, there is no separate controller - it's on the motherboard - so your "functional unit" can almost be considered to be the computer.

    How often is data corrupted on a personal computer? It is clear it doesn't happen all that often, but in the last forty years or so we have actually gone backwards in our ability to detect and diagnose such problems. Nearly all businesses today are using personal computers to at least display information if not actually maintain and process it. What assurance do you have that corruption is not taking place? None, really.

    A lot of businesses have few, if any, checks that would point out problems that could cost thousands of dollars because of a changed digit. In the right place, such changes could lead to penalties, interest and possible loss of a key customer.

    Why have we gone backwards in this area when compared to a mainframe system of forty years ago? Certainly software has gotten more complex but basic issues of data integrity have fallen by the wayside. Much of this was done in hardware previously. It could be done cheaply in firmware and software today with minimal cost and minimal overhead. But it is not done.
    • by glwtta (532858)
      Yeah, go figure, cheap stuff is built to lower standards than really high-end stuff.

      A lot of businesses have few, if any, checks that would point out problems that could cost thousands of dollars because of a changed digit.

      I would think it's extremely unlikely that such random corruption would happen on some byte somewhere which actually gets interpreted as a meaningful digit; much more likely to either corrupt some format or produce some noticeable garbage somewhere (not "wrong-yet-meaningful" data).
    • by suv4x4 (956391) on Friday September 14, 2007 @09:16PM (#20611821)
      Why have we gone backwards in this area when compared to a mainframe system of forty years ago?

      For the same reason why experienced car drivers crash in ridiculous situations: they are too sure of themselves.

      The industry is so huge, that the separate areas of our computers just accept the rest is a magic box that should magically operate as is written in the spec. Screwups don't happen too often, and when they happen they are not detectable, hence no one woke up to it.

      That said, don't feel bad, we're not going downwards. It just so happened speed and flashy graphics will play an important role for another couple of years. Then after we max this out, the industry will seek to improve another parameter of their products, and sooner or later we'll hit back the data integrity issue :D

      Look at hard disks: does the casual consumer need more than 500 GB? So now we see the advent of faster hybrid (flash+disk) storage devices, or pure flash memory devices.

      So we've tackled storage size, we're about to tackle storage speed. And when it's fast enough, what's next? Encryption and additional integrity checks. Something for the bullet list of features...
    • by hxnwix (652290)

      There is no checking, period.

      I'm sorry, but for even the crappiest PC clones, you've been wrong since 1998. Even the worst commodity disk interfaces have had checking since then: UDMA uses a 16 bit CRC for data; SATA uses 32 bit CRC for commands as well. Most servers and workstations have had ECC memory for even longer. Furthermore, if you cared at all about your data, you already had end-to-end CRC/ECC.

      Yeah, mainframes are neat, but they don't save you from end-to-end integrity checking unless you really don't give a damn about it

    • by tonkdude (806199)
      Actually even today, a mainframe running OS390 and CICS still has problems with disk corruption. If a machine crash or power outage happens when a file is being extended via a CA split, the only way to recover the file is from backup and then forward applying the CICS journals.
  • by DigiShaman (671371) on Friday September 14, 2007 @07:00PM (#20610499) Homepage
    It's well known that ECC and other forms of error correction are found at all levels of software and hardware. For example, a hard drive has its own internal error correction, while the file system it's formatted with may have another. Also worth mentioning are the CPU, serial buses, network adapters (both the physical IEEE 802.x connection and the TCP/IP stack) and other forms of software error correction.

    Basically, the modern computer has various hardware and software layers of error correction, stacked on top of each other or at least working on their own.

    We do have a weak link with desktops regarding RAM, however. While modern workstations and servers are generally installed with ECC RAM, our desktops are not. Also worth mentioning, most custom-built clone PCs are for the desktop market. This has become a huge problem given that the voltage and timing requirements don't leave much room for tolerance. The fact that memory density has been going up only makes the chances of "bit flips" even worse. I can't tell you how many times I've run into data corruption due to improper RAM settings. Running a few passes with Memtest86+ will reveal this nasty issue. Hell, even Windows Vista now includes a utility to check for faulty RAM read/write issues; that's how big the problem has become in the industry. As such, the desktop market severely needs to embrace ECC RAM like the server and workstation market. These days, to not use ECC is asking for trouble. And yes, you would take a 1 to 2% performance hit, but so what; data integrity is more important.

    Note: The newer Intel P965 chipset does not support ECC memory while their older 965x does. Crying shame too, given the P965 has been designed for Core 2 Duo and Quad Core CPUs.
    • by IvyKing (732111)

      We do have a weak link with desktops regarding RAM, however. While modern workstations and servers are generally installed with ECC RAM, our desktops are not.

      The major failing of the original Apple Xserve 'supercomputer' cluster was the lack of ECC - ISTR an estimate of a memory error every few hours (estimate made by Del Cecchi on comp.arch), which would severely limit the kinds of problems that could be solved on the system. I also remember the original systems being replaced a few months later with Xserves that had ECC.

      And yes, you would take a 1 to 2% performance hit, but so what; Data integrity is more impor[t]ant.

      A 1 to 2% performance hit is less costly than having to do multiple runs to make sure the data didn't get munged.

      Note: The newer Intel P965 chipset does not support ECC memory while their older 965x does. Crying shame too, given the P965 has been designed for Core 2 Duo and Quad Core CPUs.

      You're right about the

      • Re: (Score:2, Insightful)

        by KonoWatakushi (910213)

        You're right about the crying shame - what you have is a high end games machine. Perhaps AMD still has a chance if their chipsets support ECC RAM.

        The nice thing about AMD is that with the integrated memory controller, you don't need support in the chipset. I'm not sure about Semprons, but all of the Athlons support ECC memory. The thing you have to watch out for is BIOS/motherboard support. If the vendor doesn't include the necessary traces on the board or the configuration settings in the BIOS, it won't

    • The issue with chipsets is about market segmentation rather than newer=better. P965 is a "mainstream desktop" chipset, while say, a 975X is a "performance desktop" and/or "workstation" chipset and so supports ECC. The performance hit isn't a factor, but the price hit for the extra logic apparently is.
      • Re: (Score:1, Informative)

        by Anonymous Coward
        Sad given that ECC logic is so simple it's basically FREE.

        What's worse? It IS free!
        Motherboard chips (e.g. south bridge, north bridge) are generally limited in size NOT by the transistors inside but by the number of IO connections. There's silicon to burn, so to speak, and therefore plenty of room to add features like this.

        How do I know this? Oh wait, my company made them.... We never had to worry about state-of-the-art process technology because it wasn't worth it. We could afford to be several genera
    • Re: (Score:1, Informative)

      by Anonymous Coward
      Note: The newer Intel P965 chipset does not support ECC memory while their older 965x does. Crying shame too, given the P965 has been designed for Core 2 Duo and Quad Core CPUs.

      You meant 975x, not 965x. The successor of 975x is X38 (Bearlake-X) chipset supporting ECC DRAM. It should debut this month.
  • Doesn't checksum offload mean that the functionality gets offloaded to another device, like say an expensive NIC, and thus removes that overhead from the CPU?
    • Re: (Score:1, Interesting)

      by Anonymous Coward
      Correct. However, there are two problems. Firstly, it's not an expensive NIC these days - virtually all Gigabit Ethernet chips do at least some kind of TCP offload, and if these chips miscompute the checksum (or don't detect the error) due to being a cheap chip, you're worse off than doing it in software.

      Also, these don't protect against errors on the card or PCI bus. (If the data was corrupted on the card or on the bus after the checksum validation but before it got to system RAM for any reason, this corr
    • by shird (566377)
      Yes, but then there is the risk the data gets corrupted between the NIC and CPU. Doing the checksum at the CPU checks the integrity of the data at the end-point, rather than on its way to the CPU.
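
A minimal sketch of the end-point check described above, assuming a made-up framing in which the sender appends a CRC-32 to each payload and the receiver verifies it in software once the bytes are already in host memory. The names `frame` and `unframe` are hypothetical and not part of any real protocol.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def unframe(message: bytes) -> bytes:
    """Verify the trailing CRC-32 on the CPU and return the payload."""
    if len(message) < 4:
        raise ValueError("message too short to carry a CRC-32")
    payload = message[:-4]
    (expected,) = struct.unpack(">I", message[-4:])
    if zlib.crc32(payload) != expected:
        raise ValueError("CRC mismatch: data changed before reaching the CPU")
    return payload
```

Because the check runs after the data has crossed the NIC and the bus, it also covers corruption that a checksum verified on the card would miss.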
  • Can anyone point me toward some information on the hit to CPU and I/O throughput for scrubbing?
    • by feld (980784)
      I was wondering the same thing... I don't have scrubbing enabled on my Opteron workstation. I should run a memory benchmark or two and turn it on to see how it compares.
    • by Detritus (11846)
      Scrubbing for RAM is an insignificant amount of overhead. All it involves is doing periodic read/write cycles on each memory location to detect and correct errors. This can be done as a low-priority kernel task or as part of the timer interrupt-service-routine.
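
A toy illustration of that read/correct/write-back cycle. It assumes a Hamming(7,4) code on 4-bit values purely to keep the sketch short; real ECC memory typically uses a wider SEC-DED code (for example 72 bits protecting 64) and does the work in the memory controller. All function names here are hypothetical.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword."""
    d0, d1, d2, d3 = ((nibble >> i) & 1 for i in range(4))
    p1 = d0 ^ d1 ^ d3
    p2 = d0 ^ d2 ^ d3
    p4 = d1 ^ d2 ^ d3
    # Bit index i holds code position i + 1: p1 p2 d0 p4 d1 d2 d3
    bits = [p1, p2, d0, p4, d1, d2, d3]
    return sum(bit << i for i, bit in enumerate(bits))


def syndrome(word: int) -> int:
    """XOR of the 1-based positions of all set bits; 0 means no error seen."""
    s = 0
    for i in range(7):
        if (word >> i) & 1:
            s ^= i + 1
    return s


def decode(word: int) -> int:
    """Extract the data bits stored at positions 3, 5, 6 and 7."""
    return (
        ((word >> 2) & 1)
        | (((word >> 4) & 1) << 1)
        | (((word >> 5) & 1) << 2)
        | (((word >> 6) & 1) << 3)
    )


def scrub(memory: list) -> int:
    """Walk every word, fix single-bit errors in place, return the fix count."""
    corrected = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:
            memory[addr] = word ^ (1 << (s - 1))  # flip the faulty bit back
            corrected += 1
    return corrected
```

This code only corrects one flipped bit per word; two flips in the same word would be miscorrected, which is why scrubbing often enough to catch errors one at a time matters.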
      • If a system were operating in an environment where a failure was more likely, is it desirable to increase the frequency of access to a given memory location? It seems reasonable that this would be the case. I am looking at an application that could be exposed to a higher level of cosmic rays than would be normal for ground-based workstations.
        • by Detritus (11846)
          If you want to be thorough about it, you need to determine the acceptable probability of an uncorrectable error in the memory system, the rate at which errors occur, and the scrub rate needed to meet or exceed your reliability target. If you scrub when the system is idle, you will probably find that the scrub rate is much higher than the minimum rate needed to meet your reliability target. In really hostile environments, you may need a stronger ECC and/or a different memory organization.
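
A rough back-of-the-envelope version of that calculation. It rests on assumptions not stated in the thread: bit upsets arrive independently (Poisson) at a fixed rate per bit, the ECC corrects exactly one error per word, each scrub pass clears accumulated single-bit errors, and a word counts as lost once it collects two or more upsets between passes. The function names and the 72-bit default are illustrative.

```python
import math


def uncorrectable_rate(bit_upsets_per_hour, words, scrub_interval_h, bits_per_word=72):
    """Expected multi-bit (uncorrectable) word errors per hour under the model."""
    # Mean number of upsets hitting one word during one scrub interval.
    mu = bits_per_word * bit_upsets_per_hour * scrub_interval_h
    if mu < 1e-3:
        # Series for 1 - exp(-mu) * (1 + mu); avoids floating-point cancellation.
        p_word = mu * mu / 2 - mu ** 3 / 3
    else:
        p_word = 1 - math.exp(-mu) * (1 + mu)
    return words * p_word / scrub_interval_h


def max_scrub_interval(target_rate, bit_upsets_per_hour, words, bits_per_word=72):
    """Longest scrub interval (hours) meeting the target, small-mu approximation."""
    per_word = bits_per_word * bit_upsets_per_hour
    return 2 * target_rate / (words * per_word ** 2)
```

In this model the uncorrectable rate grows roughly linearly with the scrub interval, so halving the interval halves the expected rate; a harsher environment raises the per-bit rate, and the required interval shrinks with its square.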
  • Timely article ... (Score:3, Interesting)

    by ScrewMaster (602015) on Friday September 14, 2007 @07:52PM (#20611027)
    As I sit here having just finished restoring NTLDR to my RAID 0 drive after the thing failed to boot. I compared the original file and the replacement, and they were off by ONE BIT.
    • by dotgain (630123)
      A while ago I had an AMD K6-2 which couldn't gunzip one of the XFree86 tarballs (invalid compressed data - CRC error). I left memtest running over 24 hours which showed nothing, copying the file onto another machine (using the K6 as a fileserver) and gunzipping it there worked. I eventually bumped into someone with the same mobo and same problem, and figured binning the mobo was the fix.

      To be honest, most of the comments about ECC RAM here have convinced me that it's worth it just for more peace of mind

  • HEY. (Score:3, Funny)

    by yoyhed (651244) on Friday September 14, 2007 @09:06PM (#20611745)
    TFA doesn't list ALL the possible ways data can be corrupted. It fails to mention the scenario of Dark Data (an evil mirror of your data, happens more commonly with RAID 1) corrupting your data with Phazon. In this case, the only way to repair the corruption is to send your data on a quest to different hard drives across the world (nay, the GALAXY) to destroy the Seeds that spread the corruption.
  • I previously had a Shuttle desktop machine running Windows XP. One day I started noticing that when I copied files to a network file server, about 1 out of 20 or so would get corrupted, with larger files getting corrupted more often than smaller ones. Copying them to the local IDE hard drive caused no problems, and other machines did not have problems copying files to the same file server. I spent a lot of time swapping networking cards, etc. and not getting anywhere, until I plugged in a USB drive and no
    • by kg261 (990379)
      And I have seen this happen on the IDE as well. In my case, the fan for the bridge chip had failed causing a bit error on disk writes every few hundred megabytes. This went on for I do not know how many months before I actually did a file copy and CMP to find the errors. Ethernet and other ports were fine.
  • I was an early "adopter" of the Internet... and when I was on a slow dial-up line, even with checksums being done on-the-fly via hardware, and packets being re-sent willy-nilly due to insufficient transmission integrity, my data seemed to get corrupted almost as often as not.

    Today, with these "unreliable" hard drives, and (apparently, if we believe the post) less hardware checking being done, I very, very seldom receive data, or retrieve data from storage, that is detectably corrupted. My CRCs almost inva
    • by Detritus (11846)
      The free market doesn't always produce socially desirable results. Manufacturers can also get trapped in a race to the bottom. Just look at the current quality of floppy disk drives and their media. I can remember when they actually worked.
      • That is still the free market at work. The quality has gone down because there is no pressure to keep it up. NOBODY uses floppy disks anymore. There is no market, so there is no market pressure.
        • by Detritus (11846)
          A case of chicken and egg. Many people stopped using them because the quality was so bad. There was a market failure in that even if you were willing to pay more, you couldn't buy stuff that worked reliably. Anyone interested in producing a quality product had left the market.
  • by Terje Mathisen (128806) on Saturday September 15, 2007 @04:42AM (#20614351)
    We have 500+ servers worldwide, many of which contain the same program install images, which by definition should be identical:

    One master, all the others are copies.

    Starting maybe 15 years ago, when these directory structures were in the single-digit GB range, we started noticing strange errors, and after running full block-by-block compares between the master and several slave servers we determined that we had end-to-end error rates of about 1 in 10 GB.

    Initially we solved this by doubling the network load, i.e. always doing a full verify after every copy, but later on we found that keeping the same hw, but using sw packet checksums, was sufficient to stop this particular error mechanism.

    One of the errors we saw was a data block where a single byte was repeated, overwriting the real data byte that should have followed it. This is almost certainly caused by a timing glitch which over-/under-runs a hardware FIFO. Having 32-bit CRCs on all Ethernet packets as well as 16-bit TCP checksums doesn't help if the path across the PCI bus is unprotected and the TCP checksum has been verified on the network card itself.
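
A small simulation of that failure mode, offered as a sketch only. It computes a 16-bit ones'-complement checksum in the style of RFC 1071 over the payload alone (real TCP also covers a pseudo-header and the TCP header), and `fifo_glitch` is a hypothetical stand-in for the repeated-byte fault.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum over the data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def fifo_glitch(data: bytes, pos: int) -> bytes:
    """Repeat byte `pos` over byte `pos + 1`, as in the fault described above."""
    corrupted = bytearray(data)
    corrupted[pos + 1] = corrupted[pos]
    return bytes(corrupted)


payload = bytes(range(64))
wire_checksum = internet_checksum(payload)

# Check done "on the card": the data is still intact, so it passes.
nic_ok = internet_checksum(payload) == wire_checksum

# The glitch happens after that check, on the way to host memory.
in_ram = fifo_glitch(payload, 10)

# Only a software check on the host sees the damaged copy.
host_ok = internet_checksum(in_ram) == wire_checksum
```

Here `nic_ok` is true while `host_ok` is false, which is the whole argument for verifying in software at the end of the path.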

    Since then our largest volume sizes have increased into the 100 TB range, and I do expect that we now have other silent failure mechanisms: Basically, any time/location when data isn't explicitly covered by end-to-end verification is a silent failure waiting to happen. On disk volumes we try to protect against this by using file systems which can protect against lost writes as well as misplaced writes (i.e. the disk reports writing block 1000, but in reality it wrote to block 1064 on the next cylinder).
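
One way such protection can work, sketched with toy classes: keep each block's checksum somewhere other than the block itself, so a stale or clobbered block no longer matches on read. ZFS, for instance, stores checksums in the parent block pointers; everything below (`ToyDisk`, `CheckedStore`, the `redirect` fault injection) is hypothetical and only illustrates the idea.

```python
import hashlib


class ToyDisk:
    """In-memory stand-in for a disk that can silently misdirect a write."""

    def __init__(self, blocks: int):
        self.blocks = [bytes(16)] * blocks
        self.redirect = {}  # fault injection: intended block -> actual block

    def write(self, lba: int, data: bytes) -> None:
        self.blocks[self.redirect.get(lba, lba)] = data

    def read(self, lba: int) -> bytes:
        return self.blocks[lba]


class CheckedStore:
    """Keeps each block's digest apart from the block it describes."""

    def __init__(self, disk: ToyDisk):
        self.disk = disk
        self.digests = {}

    def write(self, lba: int, data: bytes) -> None:
        self.digests[lba] = hashlib.sha256(data).digest()
        self.disk.write(lba, data)

    def read(self, lba: int) -> bytes:
        data = self.disk.read(lba)
        if hashlib.sha256(data).digest() != self.digests.get(lba):
            raise OSError(f"block {lba}: checksum mismatch (lost or misplaced write?)")
        return data
```

With a write to block 1000 redirected to 1064, both blocks fail verification: 1000 still holds stale data, and 1064 holds data that does not match its own recorded digest.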

    NetApp's WAFL is good, but I expect Sun's ZFS to do an equally good job at a significantly lower cost.

    Terje
    • NetApp's WAFL is good, but I expect Sun's ZFS to do an equally good job at a significantly lower cost.

      Hard to say for certain, as comparing WAFL and ZFS ignores the often overlooked but additional integration that you get with a Filer as opposed to a more general purpose system running ZFS, which is still pretty green when it comes to this sort of stuff:

      http://mail.opensolaris.org/pipermail/zfs-discuss/2006-November/036124.html [opensolaris.org].

      It's been my experience that data corruption typically occurs in RAM (ECC), at HBA, ca
  • I came across a NIC with faulty offloading. It was at a customer's site, and it took a month to diagnose.

    The only way I found out was with an Ethereal trace at each end - I could see that every 0x2000 bytes there was a corruption. We turned TCP segment offloading off, and it worked fine because the maximum packet size was 1536 bytes - before 0x2000.
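
A sketch of how that kind of periodic pattern could be spotted automatically, assuming the sent and received payloads have already been extracted from the two traces as byte strings (no capture-file parsing is shown). Both helper names are hypothetical.

```python
from functools import reduce
from math import gcd


def mismatch_offsets(sent: bytes, received: bytes) -> list:
    """Byte offsets at which the two payloads differ."""
    return [i for i, (a, b) in enumerate(zip(sent, received)) if a != b]


def common_stride(offsets: list) -> int:
    """GCD of the gaps between consecutive offsets; 0 if fewer than two."""
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    return reduce(gcd, gaps, 0)
```

A stride that comes out as a round power of two, such as 0x2000, points at a buffer or segment boundary in the hardware rather than random noise on the wire.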
  • The perils of RAM just seem to be one of those open secrets. Apparently even Microsoft has tried pushing for ECC RAM [eetimes.com] in all machines (including desktops) as memory errors have risen to the top 10 causes of system crashes according to their crash analysis.

    Earlier this decade I was living with strange, random crashes when booting Linux that would seemingly only occur when booting from cold (but not every time!). It was only years later when running a memtest on someone else's sticks (which turned out to
  • Refusing to implement integrity checks at every level is data mismanagement.

    The filesystem should provide this.

    Linux people have been denying for years that hardware will cause data corruption. Therefore they can deny their own responsibility in detecting and correcting it.

    It is everyone's responsibility to make OS people aware of how often hardware causes data corruption.

    http://www.storagetruth.org/index.php/2006/data-corruption-happens-easily/ [storagetruth.org]

