The Many Paths To Data Corruption

Runnin'Scared writes "Linux guru Alan Cox has a writeup on KernelTrap in which he talks about all the possible ways for data to get corrupted when being written to or read from a hard disk drive. Much of the information applies to all operating systems. He prefaces his comments by noting that the details are entirely device specific, then dives right into a fascinating and somewhat disturbing path, tracing data from the drive, through the cable, into the bus, main memory, and the CPU cache. He also discusses the transfer of data via TCP and cautions, 'unfortunately lots of high performance people use checksum offload which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic but in the market speed sells not correctness.'"
  • by jdigital ( 84195 ) on Friday September 14, 2007 @06:10PM (#20609977) Homepage
    I think I suffered from a series of Type III errors (rtfa). After merging lots of poorly maintained backups of my /home file system, I decided to write a little script to look for duplicate files (using file size as a first indicator, then MD5 for ties; the approach is sketched below). The script would identify duplicates and move files around into a more orderly structure based on type, etc. After doing this I noticed that a small number of my MP3s now contained chunks of other songs. The script only ever worked with whole files, so I have no idea how this happened. When I go back to the original copies of the MP3s, the files are uncorrupted.

    Of course, no one believes me. But maybe this presentation is on to something. Or perhaps I did something boneheaded that's totally unrelated.
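
    A minimal sketch of the size-then-MD5 duplicate-detection approach the parent describes, assuming a `root` directory to scan; it only reports groups of duplicates rather than moving anything:

    ```python
    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(root):
        """Group files by size, then confirm suspected duplicates with an MD5 digest."""
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

        by_digest = defaultdict(list)
        for paths in by_size.values():
            if len(paths) < 2:
                continue                      # unique size, cannot be a duplicate
            for path in paths:
                h = hashlib.md5()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                by_digest[h.hexdigest()].append(path)
        return [group for group in by_digest.values() if len(group) > 1]
    ```

    Note that a script like this only ever reads and moves whole files, which is what makes corruption appearing mid-file so puzzling.
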
  • Re:benchmarks (Score:5, Interesting)

    by dgatwood ( 11270 ) on Friday September 14, 2007 @06:25PM (#20610161) Homepage Journal

    I've concluded that nobody cares about data integrity. That's sad, I know, but I have yet to see product manufacturers sued into oblivion for building fundamentally defective devices, and that's really what it would take to improve things, IMHO.

    My favorite piece of hardware was a chip that was used in a bunch of 5-in-1 and 7-in-1 media card readers about four years ago. It was complete garbage, and only worked correctly on Windows. Mac OS X would use transfer sizes that the chip claimed to support, but the chip returned copies of block 0 instead of the first block in every transaction over a certain size. Linux supposedly also had problems with it. This was while reading, so no data was lost, but a lot of people who checked the "erase pictures after import" button in iPhoto were very unhappy.

    Unfortunately, there was nothing the computer could do to catch the problem: the data was copied in from the device exactly as the device presented it, and no amount of verification could detect a problem because the chip would consistently report the same wrong data... Fortunately, there are unerase tools available for recovering photos from flash cards. Anyway, I made it a point to periodically look for people posting about that device on message boards and tell them how to work around it by imaging the entire flash card with dd bs=512 (the small-transfer idea is sketched after this comment) until they could buy a new flash card reader.

    In the end, I moved to a FireWire reader and I no longer trust USB for anything unless there's no other alternative (iPod, iPhone, and disks attached to an Airport Base Station). While that makes me somewhat more comfortable than dealing with USB, there have been a few nasty issues even with FireWire devices. For example, there was an Oxford 922 firmware bug about three years back that wiped hard drives if a read or write attempt was made after a spindown request timed out or something. I'm not sure about the precise details.

    And then, there is the Seagate hard drive that mysteriously will only boot my TiVo about one time out of every twenty (but works flawlessly when attached to a FW/ATA bridge chipset). I don't have an ATA bus analyzer to see what's going on, but it makes me very uncomfortable to see such compatibility problems with supposedly standardized modern drives. And don't get me started on the number of dead hard drives I have lying around....

    If my life has taught me anything about technology, it is this: if you really care about data, back it up regularly and frequently, store your backups in another city, ensure that those backups are never all simultaneously in the same place or on the same electrical grid as the original, and never throw away any of the old backups. If it isn't worth the effort to do that, then the data must not really be important.
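
    For reference, a hedged sketch of the dd bs=512 workaround mentioned above: copying the card device to an image file in small fixed-size reads so the buggy large-transfer path is never exercised. The device and output paths are placeholders, and reading a raw block device normally requires root:

    ```python
    def image_device(device="/dev/sdb", output="card.img", block_size=512):
        """Copy a block device to an image file in small fixed-size reads,
        roughly what `dd if=/dev/sdb of=card.img bs=512` does."""
        with open(device, "rb") as src, open(output, "wb") as dst:
            while True:
                chunk = src.read(block_size)   # small reads sidestep the large-transfer bug
                if not chunk:
                    break
                dst.write(chunk)
    ```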

  • No (Score:4, Interesting)

    by ElMiguel ( 117685 ) on Friday September 14, 2007 @06:26PM (#20610171)

    as long as ZFS itself, the CPU, and RAM are working correctly, no other errors can corrupt ZFS data.

    Sorry, but that is absurd. Nothing can absolutely protect against data errors (even if they happen only on the hard disk). For example, errors can corrupt ZFS data in a way that happens to produce the same checksum. Or errors can corrupt both the data and the checksum so that they match each other.

    This is ECC 101 really.
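
    To make the parent's point concrete, here is a small illustration (using CRC-32 from the standard library, not ZFS's much longer checksums) of two distinct blocks sharing a checksum; by the birthday bound this takes only about 2^16 random blocks to find. With a 256-bit checksum the same event is astronomically unlikely, which is why the disagreement is really about "absolute" versus "practical" protection:

    ```python
    import os
    import zlib

    def find_crc32_collision():
        """Generate random 64-byte blocks until two distinct ones share a CRC-32."""
        seen = {}
        while True:
            block = os.urandom(64)
            crc = zlib.crc32(block)
            if crc in seen and seen[crc] != block:
                return seen[crc], block       # two different blocks, same checksum
            seen[crc] = block

    a, b = find_crc32_collision()
    assert a != b and zlib.crc32(a) == zlib.crc32(b)
    ```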

  • by ozzee ( 612196 ) on Friday September 14, 2007 @06:52PM (#20610403)

    Ah - this is the bane of computer technology.

    I remember once writing some code that was very fast and almost always correct. The guy I was working with exclaimed, "I can give you the wrong answer in zero seconds," so I shut up and did it the slower way that was right every time.

    This mentality of speed at the cost of correctness is prevalent. For example, I can't understand why people don't spend the extra money on ECC memory *ALL THE TIME*. One failure over the lifetime of the computer and you have paid for your RAM. I have assembled many computers, and unfortunately there have been a number of times where ECC memory was not an option. In almost every case where I have used ECC memory, the computer was noticeably more stable. Case in point: the most recent machine I built has never (as far as I know) crashed, and I've thrown some really nasty workloads its way. On the other hand, a couple of notebooks I own have crashed more often than I care to remember, and there is no ECC option for them. Not to mention the ridicule I get for suggesting that people invest the extra $30 in a "non server" machine. Go figure. Suggesting that stability is the realm of "server" machines, and that end-user machines should be relegated to a lowered standard of reliability, makes very little sense to me, especially when the investment of $30 to $40 is absolutely minuscule if it prevents a single failure. What I think (see lawsuit coming on) is that memory manufacturers will sell marginal-quality products to the non-ECC crowd because there is no way of validating memory quality.

    I think there needs to be a significant change in the marketing of products to ensure that metrics of data integrity play a more significant role in decision making. It won't happen until the consumer demands it and I can't see that happening any time soon. Maybe, hopefully, I am wrong.
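
    For readers unfamiliar with what ECC memory actually buys you, here is a toy sketch of the underlying idea, single-error correction with a Hamming(7,4) code. Real ECC DIMMs use a wider SECDED code over 64-bit words, so this is only illustrative:

    ```python
    # Hamming(7,4): protect 4 data bits with 3 parity bits; any single flipped
    # bit in the 7-bit codeword can be located and corrected.
    def encode(d):
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]      # codeword positions 1..7

    def decode(c):
        c = list(c)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]           # parity check over positions 1,3,5,7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]           # parity check over positions 2,3,6,7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]           # parity check over positions 4,5,6,7
        syndrome = s1 + 2 * s2 + 4 * s3          # 0 = clean, else 1-based error position
        if syndrome:
            c[syndrome - 1] ^= 1                 # correct the single flipped bit
        return [c[2], c[4], c[5], c[6]]

    word = [1, 0, 1, 1]
    cw = encode(word)
    cw[5] ^= 1                                   # simulate a single bit flip
    assert decode(cw) == word                    # the flip is detected and corrected
    ```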

  • Comment removed (Score:4, Interesting)

    by account_deleted ( 4530225 ) on Friday September 14, 2007 @07:00PM (#20610499)
    Comment removed based on user account deletion
  • by skeptictank ( 841287 ) on Friday September 14, 2007 @07:18PM (#20610675)
    Can anyone point me toward some information on the hit to CPU and I/O throughput for scrubbing?
  • Re:checksum offload (Score:1, Interesting)

    by Anonymous Coward on Friday September 14, 2007 @07:40PM (#20610879)
    Correct. However, there are two problems. First, it's not just expensive NICs these days: virtually all Gigabit Ethernet chips do at least some kind of TCP offload, and if one of these chips miscomputes the checksum (or fails to detect an error) because it's a cheap chip, you're worse off than doing the check in software.

    Second, checksum offload doesn't protect against errors on the card or the PCI bus. (If the data is corrupted on the card or on the bus after the checksum validation but before it reaches system RAM, for any reason, the corruption will not be detected. But if the checksum validation happens in software after the data has been written to RAM, the corruption will be detected by the OS. The OS will assume it's a network transmission error rather than a bad network card, but, in TCP, it will arrange for a retransmission of the data.)
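
    For reference, the software check that offload skips is just the 16-bit ones'-complement sum from RFC 1071; a minimal sketch of computing it over a packet's bytes:

    ```python
    def inet_checksum(data: bytes) -> int:
        """RFC 1071 ones'-complement checksum used by IP, TCP, and UDP."""
        if len(data) % 2:
            data += b"\x00"                            # pad to an even byte count
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]
            total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
        return ~total & 0xFFFF
    ```
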
  • Timely article ... (Score:3, Interesting)

    by ScrewMaster ( 602015 ) on Friday September 14, 2007 @07:52PM (#20611027)
    I sit here having just finished restoring NTLDR to my RAID 0 array after the thing failed to boot. I compared the original file and the replacement, and they were off by ONE BIT.
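
    A hedged sketch of how one might pinpoint such a single-bit difference between the original and restored copies; the paths are placeholders, and it assumes both files are the same length and fit in memory:

    ```python
    def first_bit_difference(path_a, path_b):
        """Return (byte_offset, bit_index) of the first differing bit, or None if equal."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            a, b = fa.read(), fb.read()
        for offset, (x, y) in enumerate(zip(a, b)):
            if x != y:
                bit = (x ^ y).bit_length() - 1    # most significant differing bit in the byte
                return offset, bit
        return None
    ```
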
  • by suv4x4 ( 956391 ) on Friday September 14, 2007 @09:16PM (#20611821)
    Why have we gone backwards in this area when compared to a mainframe system of forty years ago?

    For the same reason why experienced car drivers crash in ridiculous situations: they are too sure of themselves.

    The industry is so huge that people working in each separate area of our computers just accept that the rest is a magic box which magically operates as written in the spec. Screwups don't happen too often, and when they do happen they are often not detectable, hence no one has woken up to the problem.

    That said, don't feel bad; we're not going downhill. It just so happens that speed and flashy graphics will play the leading role for another couple of years. Once we max those out, the industry will look to improve another parameter of its products, and sooner or later we'll come back to the data integrity issue :D

    Look at hard disks: does the casual consumer need more than 500 GB? So now we see the advent of faster hybrid (flash+disk) storage devices, or pure flash memory devices.

    So we've tackled storage size, and we're about to tackle storage speed. When it's fast enough, what's next? Encryption and additional integrity checks. Something for the bullet list of features...
  • by Anonymous Coward on Friday September 14, 2007 @10:01PM (#20612163)
    Computers are machines and don't need to be designed to be fallible. ECC is a small insurance policy to avoid problems exactly like the one you described. How much time did you spend burning CDs that were no good, or running various memtests?

    That's beside the point. Computers ARE fallible, with or without ECC RAM. That you think they could be perfect (infallible) is a testament to the already low rate of hardware defects that harm data integrity. It's good enough. I've experienced and located infrequent defects in almost every conceivable component of a computer system. An ECC error does not mean that the RAM is faulty. It could be caused by an aging capacitor, by a badly designed mainboard, or by a bunch of other things. An error just tells you that something is wrong. You still have to look for the cause.

    If I could validate every data path in my computer for up to a 20% premium, I would too. Unfortunately that is impossible, and not just because 20% is too small a premium to expect perfection. A stray particle from our sun could flip a bit in the processor and you'd be none the wiser. A seldom-triggered off-by-one error in your favorite software could cause mistakes just as catastrophic as a flipped bit in main memory, and it wouldn't be caught by ECC RAM or any other available automatic integrity check. I'm not equating human fallibility to hardware problems. I'm explaining that, at the current rate of faults in RAM modules, faulty RAM is not the most common problem, which is precisely why it's rarely diagnosed correctly on the first try. That makes it a type of error which people don't want to pay money to avoid, as long as it can be found somehow. It turns out that it is surprisingly easy to detect, too, because RAM rarely sits unused for very long, so even spurious defects show up at higher levels frequently enough to be noticed quickly. People have to be on the lookout for other defects and user errors all the time anyway; they don't need to do anything extra to know that something is wrong when bad RAM is the cause. It just shows up at a different level.

    It is much more important to have working high-level checks; otherwise you're going to miss lots of flaws. That's why mission-critical systems run data through redundant systems with different implementations by different people and compare the results. A "whole system parity check", if you will. RAID is designed with the same philosophy: cheap, possibly faulty hardware is used, and errors are detected at a higher level and corrected if possible. Real-world systems just place the checks much closer to the user, or even beyond the user, where laws allow for the correction of mistakes after the fact. A flipped bit in the exponent of a financial transaction does not mean you lose a lot of money. It means you end up having to correct that error. But the real world gives you that opportunity, so you're fine with saving money by not trying to achieve infallibility.
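
    A minimal sketch of the "detect and correct at a higher level" idea as RAID 4/5 applies it: a single XOR parity block lets any one missing data block be rebuilt. This is only the parity arithmetic, not a real RAID implementation:

    ```python
    from functools import reduce

    def parity_block(blocks):
        """XOR parity across equal-length data blocks (the RAID 4/5 idea)."""
        return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

    def reconstruct(surviving_blocks, parity):
        """Rebuild the single missing block from the survivors plus the parity block."""
        return parity_block(surviving_blocks + [parity])

    blocks = [b"AAAA", b"BBBB", b"CCCC"]
    p = parity_block(blocks)
    assert reconstruct([blocks[0], blocks[2]], p) == blocks[1]   # lost block recovered
    ```
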
  • Re:Hello ZFS (Score:3, Interesting)

    by drsmithy ( 35869 ) <drsmithy&gmail,com> on Saturday September 15, 2007 @12:44AM (#20613101)

    Obsolete? What would you replace it with then?

    RAID6. Then a while after that, "RAID7" (or whatever they call triple-parity).

    In ca. 4-5 years[0], the combination of big drives (2TB+) and raw read error rates (roughly one error every 12TB or so read) will mean that during a rebuild of a 6+ disk RAID5 array after a single drive failure, a second "drive failure" (probably just a single bad sector, but the end result is basically the same) will be - statistically speaking - pretty much guaranteed (a rough calculation is sketched after this comment). RAID5 will be obsolete because it won't protect you from array failures (every single-disk failure will become a double-disk failure). RAID6 will only give you the same protection as RAID5 gives today (because you will be vulnerable to a third drive failing during the rebuild, in addition to the second), and "RAID7" will be needed to protect you from "triple disk failures".

    On a more positive note, with current error rates, RAID10 should last until ca. 10TB drives before SATA array elements have to be "triple mirrored" (although this is far enough down the track that I expect the basic assumptions here to have changed). "Enterprise" hardware also has (much) longer to go, because the read error rate is better and the drives are typically (much) smaller.

    (Even today, IMHO, anyone using drives bigger than 250G in 6+ disk arrays without either RAID6 or RAID10 is crazy.)

    [0]This is actually being pretty generous. It's certain we'll see 2TB drives well before then, but I'm taking a timeframe where they will be "common" rather than "high end".
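
    A rough back-of-the-envelope sketch of the claim above. The commonly quoted consumer unrecoverable-read-error rate of one error per 10^14 bits read (about 12 TB) and the independence of errors are the usual simplifying assumptions, not measured figures:

    ```python
    def rebuild_failure_probability(drive_tb, drives_total, ure_rate_bits=1e14):
        """Probability of hitting at least one unrecoverable read error while
        rebuilding a single-parity array: every surviving drive is read in full."""
        bits_read = (drives_total - 1) * drive_tb * 1e12 * 8
        return 1 - (1 - 1 / ure_rate_bits) ** bits_read

    # Six 2 TB consumer drives in RAID5: roughly a coin flip that the rebuild
    # hits a read error somewhere.
    print(rebuild_failure_probability(2.0, 6))   # ~0.55
    ```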

  • by Terje Mathisen ( 128806 ) on Saturday September 15, 2007 @04:42AM (#20614351)
    We have 500+ servers worldwide, and many of them contain the same program install images, which by definition should be identical:

    One master, all the others are copies.

    Starting maybe 15 years ago, when these directory structures were in the single-digit GB range, we started noticing strange errors, and after running full block-by-block compares between the master and several slave servers we determined that we had end-to-end error rates of about 1 in 10 GB.

    Initially we solved this by doubling the network load, i.e. always doing a full verify after every copy (the copy-then-verify idea is sketched at the end of this comment), but later we found that keeping the same hardware and switching to software packet checksums was sufficient to stop this particular error mechanism.

    One of the errors we saw was a data block where a single byte was repeated, overwriting the real data byte that should have followed it. This is almost certainly caused by a timing glitch which over-/under-runs a hardware FIFO. Having 32-bit CRCs on all Ethernet packets as well as 16-bit TCP checksums doesn't help if the path across the PCI bus is unprotected and the TCP checksum has been verified on the network card itself.

    Since then our largest volume sizes have increased into the 100 TB range, and I do expect that we now have other silent failure mechanisms. Basically, any time or place where data isn't explicitly covered by end-to-end verification is a silent failure waiting to happen. On disk volumes we try to protect against this by using file systems which can protect against lost writes as well as misplaced writes (i.e. the disk reports writing block 1000, but in reality it wrote to block 1064 on the next cylinder).

    NetApp's WAFL is good, but I expect Sun's ZFS to do an equally good job at a significantly lower cost.

    Terje
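
    A minimal sketch of the copy-then-verify discipline described above (the version that doubles the read cost); the paths are placeholders, and it uses SHA-256 rather than whatever the original tooling used:

    ```python
    import hashlib
    import shutil

    def copy_with_verify(src, dst, chunk_size=1 << 20):
        """Copy a file, then re-read both ends and compare digests end to end."""
        def digest(path):
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    h.update(chunk)
            return h.hexdigest()

        shutil.copyfile(src, dst)
        if digest(src) != digest(dst):
            raise IOError(f"verification failed: {src} -> {dst}")
    ```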
