
 



Data Storage Software Education Linux

The Many Paths To Data Corruption

Runnin'Scared writes "Linux guru Alan Cox has a writeup on KernelTrap in which he talks about all the possible ways for data to get corrupted when being written to or read from a hard disk drive. This includes much of the information applicable to all operating systems. He prefaces his comments noting that the details are entirely device specific, then dives right into a fascinating and somewhat disturbing path tracing data from the drive, through the cable, into the bus, main memory and CPU cache. He also discusses the transfer of data via TCP and cautions, 'unfortunately lots of high performance people use checksum offload which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic but in the market speed sells not correctness.'"
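To make the checksum point concrete: the TCP checksum is a 16-bit ones'-complement sum, so it is a fairly weak check even when it is computed end to end. The sketch below is not from Alan Cox's writeup; it is a minimal illustration in the style of RFC 1071, with made-up payloads, showing that swapping two 16-bit words leaves the checksum unchanged while a CRC-32 over the same bytes notices the difference.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Add each big-endian 16-bit word to the running sum.
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical payloads: the same two 16-bit words in a different order.
sent = b"ABCD"
received = b"CDAB"

same_sum = internet_checksum(sent) == internet_checksum(received)
same_crc = zlib.crc32(sent) == zlib.crc32(received)
print("ones'-complement sum matches:", same_sum)
print("CRC-32 matches:", same_crc)
```

If the card computes and verifies this sum itself, anything that goes wrong between the card and main memory is not covered by it at all, which is one reason an application-level check over the data is still worth having.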
This discussion has been archived. No new comments can be posted.

  • benchmarks (Score:5, Insightful)

    by larien ( 5608 ) on Friday September 14, 2007 @05:57PM (#20609829) Homepage Journal
    As Alan Cox alluded to, there are benchmarks for data transfers, web performance, etc, etc, etc, but none for data integrity; it's kind of assumed, even if it perhaps shouldn't be. It also reminds me of various cluster software which will happily crash a node rather than risk data corruption (Sun Cluster & Oracle RAC both do this). What do you really want? Lightning fast performance, or the comfort of knowing that your data is intact & correct? For something like a rendering farm, you can probably tolerate a pixel or two being the wrong shade. If you're dealing with money, you want the data to be 100% correct, otherwise there's a world of hurt waiting to happen...
  • Re:Paul Cylon (Score:2, Insightful)

    by HTH NE1 ( 675604 ) on Friday September 14, 2007 @06:04PM (#20609909)
    Ah well. Perhaps I should have been a bit cleverer and said, "There must be 110010 ways to lose your data."
  • Re:Hah (Score:4, Insightful)

    by Cajun Hell ( 725246 ) on Friday September 14, 2007 @06:46PM (#20610371) Homepage Journal
    Sometimes I think we're lucky this stuff works at all.
  • by cdrguru ( 88047 ) on Friday September 14, 2007 @06:59PM (#20610491) Homepage
    It amazes me how much has been lost over the years towards the "consumerization" of computers.

    Large mainframe systems have had data integrity problems solved for a long, long time. It is today unthinkable that any hardware issues or OS issues could corrupt data on IBM mainframe systems and operating systems.

    Personal computers, on the other hand, have none of the protections that have been present since the 1970s on mainframes. Yes, corruption can occur anywhere in the path from the CPU to the physical disk itself or during a read operation. There is no checking, period. And not only are failures unlikely to be quickly detected but they cannot be diagnosed to isolate the problem. All you can do is try throwing parts at the problem, replacing functional units like the disk drive or controller. These days, there is no separate controller (it's on the motherboard), so your "functional unit" can almost be considered to be the computer.

    How often is data corrupted on a personal computer? It is clear it doesn't happen all that often, but in the last forty years or so we have actually gone backwards in our ability to detect and diagnose such problems. Nearly all businesses today are using personal computers to at least display information if not actually maintain and process it. What assurance do you have that corruption is not taking place? None, really.

    A lot of businesses have few, if any, checks that would point out problems that could cost thousands of dollars because of a changed digit. In the right place, such changes could lead to penalties, interest and possible loss of a key customer.

    Why have we gone backwards in this area when compared to a mainframe system of forty years ago? Certainly software has gotten more complex but basic issues of data integrity have fallen by the wayside. Much of this was done in hardware previously. It could be done cheaply in firmware and software today with minimal cost and minimal overhead. But it is not done.
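As a rough idea of what a cheap software-level check could look like, here is a minimal sketch, not anything the poster described. It stores a SHA-256 digest in front of each record and verifies it on read. The names `seal` and `unseal` and the record contents are hypothetical, and a real system would also need to decide what to do once corruption is detected.

```python
import hashlib

# Length in bytes of a SHA-256 digest (32).
DIGEST_LEN = hashlib.sha256().digest_size


def seal(payload: bytes) -> bytes:
    """Return the payload prefixed with its SHA-256 digest."""
    return hashlib.sha256(payload).digest() + payload


def unseal(blob: bytes) -> bytes:
    """Verify the digest prefix and return the payload, or raise ValueError."""
    digest, payload = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: record is corrupted")
    return payload


# Hypothetical record; flip one bit to simulate corruption somewhere in the path.
blob = seal(b"invoice=1042;amount=1999.00")
damaged = bytearray(blob)
damaged[-4] ^= 0x01

try:
    unseal(bytes(damaged))
except ValueError as err:
    print("detected:", err)
```

This only detects the changed digit; it cannot repair it or say which component was at fault, which is the diagnosis problem described above.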
  • by Anonymous Coward on Friday September 14, 2007 @07:24PM (#20610733)
    Funny, but there's a bit of truth in it too. If data corruption happens in the filesystem, it can cause files to become interlinked or point to "erased" data, which might be a surprise that you don't want if you keep porn on the same hard disk as data which is going to be published.
  • by Anonymous Coward on Friday September 14, 2007 @07:51PM (#20611011)
    I can't understand why people don't spend the extra money on ECC memory *ALL THE TIME*. One failure over the lifetime of the computer and you have paid for your RAM.

    I do understand it. They live in the real world, where computers are fallible, no matter how much you spend on data integrity. It's a matter of diminishing return. Computers without ECC are mostly stable and when they're not, they typically exhibit problems on a higher level. I've had faulty RAM once. Only one bit was unstable and only one test of the many Memtest routines triggered the defect. Even a fault that small caused problems with every other verified CD burning. Given that lots of other reasons can cause data integrity violations, many of which can't be avoided because they're rooted in the imperfections of human nature, it is more effective to have procedures in place to deal with problems than to avoid them 100%.
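For readers wondering what ECC buys you against a single unstable bit like that, here is a toy sketch of the idea using a Hamming(7,4) code. This is an illustration only: real ECC memory uses wider codes over 64-bit words that also detect double-bit errors, and the function names here are made up for the example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data_bits, error_position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based position of the bad bit.
    position = s1 + 2 * s2 + 4 * s4
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate one unstable bit in "memory" (position 5)
recovered, position = hamming74_decode(stored)
print("recovered:", recovered, "bad bit position:", position)
```

Note that this plain version silently miscorrects if two bits flip at once, which is why memory ECC adds an extra overall parity bit.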
  • by KonoWatakushi ( 910213 ) on Saturday September 15, 2007 @03:36AM (#20614049)

    You're right about the crying shame - what you have is a high end games machine. Perhaps AMD still has a chance if their chipsets support ECC RAM.

    The nice thing about AMD is that with the integrated memory controller, you don't need support in the chipset. I'm not sure about Semprons, but all of the Athlons support ECC memory. The thing you have to watch out for is BIOS/motherboard support. If the vendor doesn't include the necessary traces on the board or the configuration settings in the BIOS, it won't work. It is worth noting that unbuffered ECC RAM will work in non-ECC boards, but without actually using the ECC bits, so you have to make sure that the board explicitly supports ECC, and is not merely compatible.

    It is a shame though, and however nice a chip the Core2 is, AMD is the obvious choice if you care about your data.
