e1000e Bug Squashed — Linux Kernel Patch Released
Posted by
Soulskill
on Fri Oct 03, 2008 10:01 PM
from the good-news-everyone dept.
ruphus13 writes "As mentioned earlier, there was a kernel bug in the alpha/beta version of the Linux kernel (up to 2.6.27 rc7), which was corrupting (and rendering useless) the EEPROM/NVM of adapters. Thankfully, a patch is now out that prevents writing to the EEPROM once the driver is loaded, and this follows a patch released by Intel earlier in the week. From the article: 'The Intel team is currently working on narrowing down the details of how and why these chipsets were affected. They also plan on releasing patches shortly to restore the EEPROM on any adapters that have been affected, via saved images using ethtool -e or from identical systems.' This is good news as we move towards a production release!"
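For anyone who did save an image beforehand, here is a minimal sketch of checking a card's current EEPROM contents against that saved copy. It assumes both images already exist as raw byte dumps of equal length (for example captured with ethtool's raw dump mode); the function name `diff_images` and the sample bytes are made up for illustration and are not part of any Intel or kernel tooling.

```python
def diff_images(saved: bytes, current: bytes):
    """Return (offset, saved_byte, current_byte) for every byte that differs."""
    if len(saved) != len(current):
        raise ValueError("images differ in length; cannot compare byte for byte")
    return [
        (offset, old, new)
        for offset, (old, new) in enumerate(zip(saved, current))
        if old != new
    ]


if __name__ == "__main__":
    # Hypothetical sample data, not a real e1000e image.
    saved = bytes([0x00, 0x1B, 0x21, 0xFF])
    current = bytes([0x00, 0xFF, 0x21, 0xFF])
    for offset, old, new in diff_images(saved, current):
        print(f"offset {offset:#06x}: saved {old:#04x}, now {new:#04x}")
```

An empty result means the two dumps are identical; any entries point at the offsets worth inspecting before attempting a restore.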
Related Stories
OpenSUSE Beta Can Brick Intel e1000e Network Cards 129 comments
An anonymous reader writes "Some Intel cards don't just not work with the new OpenSUSE beta, they can get bricked as well. Check your hardware before you install!" The only card mentioned as affected is the Intel e1000e, and it's not just OpenSUSE for which this card is a problem, according to this short article: "Bug reports for Fedora 9 and 10 and Linux Kernel 2.6.27rc1 match the symptoms reported by SUSE users."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
News? (Score:3, Insightful)
I know this is News For Nerds and all that, but isn't this a tad specific?
An alpha/beta of the most recent linux kernel patch had a bug fixed, and it hits the front page?
Don't get me wrong, I'm glad they found it, but this is kinda the point of debug cycles.. If we start reporting every bug squashed in all the major open source projects out there this is going to go downhill fast.. (of course, it's possible some may think that the idle. is only a step above..)
--Q
Re: (Score:2)
(of course, it's possible some may think that the idle. is only a step above..)
Or a step below...
Re:News? (Score:5, Insightful)
Re:News? (Score:5, Insightful)
Try Erasing the BIOS on the main board and you will be more accurate in your comparison.
This bug actually flashed the firmware for the network controller and hosed access to it in some unexplained sort of way. That is noteworthy because of how rare it is. If it was simply hosing something that was readily diagnosable and more common, like a boot sector or something, then it would be different. It isn't often that software is associated with hardware damage, either purposefully or accidentally.
BTW, I know there are recovery methods for a hosed BIOS. That isn't the point. Simply installing an operating system shouldn't hose it, nor should it hose hardware either. Imagine all the people who just thought their card was broken or something and went for a refund under warranty, or the bad name Intel or Linux received for the "faulty shipment of devices" or the ability to break a device. This is something that would work in Windows; load Linux in a dual-boot setup, and it would stop working in both Windows and Linux without any errors or indication that the card was even capable of being seen by the mainboard.
Re: (Score:3)
It was even more fun. Once the card was hosed, not only would it not work, but it required a bit of hacking to get it recognized enough to attempt a re-flash (assuming you had an image of the correct contents to flash in).
The exact cause was mysterious as well since it didn't happen to everyone, nor was it predictable if or when it would happen.
Re:News? (Score:5, Informative)
An alpha/beta of the most recent linux kernel patch had a bug fixed, and it hits the front page?
They have not fixed the bug that caused the e1000e ethernet cards to get bricked. This is at least a two part bug. The EEPROM should not have been writable and Something Is Happening to cause bad writes to happen. What that "Something" is, no one knows yet, though it appears they are getting close.
Linus is an absolute, total anal retentive with regards to fixing bugs by understanding and fixing the root cause[1], not just papering over it. This papers over it for the moment, because the bug hasn't been isolated yet, but it allows more people to participate because the side effects were really nasty - this was a true bricking of the ethernet card.
This stage isn't newsworthy for Slashdot.[2] It must be a slow news day.
[1] This is a Good Thing.
[2] Nor will the real bug fix when it comes. A bug is found, a bug is fixed. Life goes on.
Re:News? (Score:5, Interesting)
I know this is News For Nerds and all that, but isn't this a tad specific?
That's what sections are for. See the little Tux Icon over there? We all care about Linux. Besides, it's a VERY IMPORTANT BUG. A showstopper, so to speak. And keep in mind that a lot of people in here are kernel freaks. They want to test-drive the latest versions of the kernel. And one of the reasons why people keep coming here (and not to digg) is precisely for this kind of news.
Thanks, ruphus13.
Good News Everyone! (Score:5, Funny)
Hardware or Software Problem? (Score:5, Interesting)
Linus isn't very happy with Intel here:
http://lkml.org/lkml/2008/9/29/368
On Mon, 29 Sep 2008, Arjan van de Ven wrote:
>
> we have a patch to save/restore now, in final testing stages
> (obviously we want to be really careful with this)
Btw, the _real_ bug is clearly in the hardware design that allows you to
brick those things without apparently even having a lock bit.
I'm hoping Intel doesn't treat this as just a software bug. Some hw
designer should be thinking hard about which orifice they put their head
up in.
It used to be that you could fry some monitors by feeding them
out-of-range signals. The _monitors_ got fixed.
Linus
Re:Hardware or Software Problem? (Score:5, Insightful)
I remember having a motherboard with a jumper that had to be specially set to update the BIOS. The smart way was to power down, open the case and pull the jumper so that you could flash the EEPROM. Then, of course, once that was done, reverse the procedure for safety. I always regarded anybody who left the jumper off for the rare convenience as fools who deserved anything that might happen.
Re: (Score:3, Informative)
Re: (Score:3)
Given the cost of EEPROM space, I think the better answer is to double the size. One half is readable, one writable, at any point in time. To update, you write, turn off, flip the jumper across to the other side (or, heck, just use a physical switch) and you're done. Bricking isn't absolutely impossible (you could write a damaged image to one half which wipes the other when it boots), but essentially infeasible.
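A rough sketch of that two-bank idea, as a toy Python model rather than anything a real NIC implements. The class name `DualBankEeprom` and its methods are invented for illustration, and the "jumper" is simulated by an index that only `flip_jumper` can change.

```python
class DualBankEeprom:
    """Toy model: one bank is readable, the other writable, never both."""

    def __init__(self, factory_image: bytes):
        # Bank 0 starts out holding the known-good image; bank 1 is blank.
        self._banks = [bytes(factory_image), b""]
        self._active = 0  # position of the (simulated) physical jumper

    def read(self) -> bytes:
        """Reads always come from the active bank."""
        return self._banks[self._active]

    def write(self, image: bytes) -> None:
        """Writes can only ever land in the inactive bank."""
        self._banks[1 - self._active] = bytes(image)

    def flip_jumper(self) -> None:
        """Stands in for the manual step; software has no way to call this."""
        self._active = 1 - self._active


if __name__ == "__main__":
    rom = DualBankEeprom(b"known-good image")
    rom.write(b"stray garbage")   # a buggy driver scribbles on the EEPROM
    print(rom.read())             # the active bank is untouched
    rom.flip_jumper()             # user deliberately switches banks
    print(rom.read())             # now the new contents are live
```

The point of the design is that a bad write costs nothing until a human flips the jumper, and flipping it back restores the previous contents.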
Re: (Score:3, Informative)
It is not uncommon to require a set of magic numbers to be written before writing to protected memory. The magic numbers and/or access pattern is designed so that no simple or likely hardware failure will allow unprotected access. Small discrete or integrated EEPROMs often have this functionality built in.
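As a hedged illustration of the magic-number scheme, here is a small Python simulation. The unlock bytes, the class name `ProtectedEeprom`, and the relock-after-one-write behavior are all assumptions chosen for the example; real parts define their own sequences and rules.

```python
class ProtectedEeprom:
    """Toy model of an EEPROM that ignores writes unless unlocked first."""

    # Hypothetical unlock sequence; not taken from any real datasheet.
    UNLOCK_SEQUENCE = (0xAA, 0x55, 0xA0)

    def __init__(self, size: int):
        self.data = bytearray(size)
        self._progress = 0  # number of unlock bytes matched so far

    @property
    def unlocked(self) -> bool:
        return self._progress == len(self.UNLOCK_SEQUENCE)

    def command(self, value: int) -> None:
        """Feed one unlock byte; any out-of-order byte resets the sequence."""
        if self.unlocked or value != self.UNLOCK_SEQUENCE[self._progress]:
            self._progress = 0
        else:
            self._progress += 1

    def write(self, address: int, value: int) -> bool:
        """Store one byte if unlocked, then relock. Returns True on success."""
        if not self.unlocked:
            return False  # stray write is silently dropped
        self.data[address] = value
        self._progress = 0  # every write needs a fresh unlock
        return True


if __name__ == "__main__":
    rom = ProtectedEeprom(16)
    print(rom.write(0, 0x12))     # False: no unlock sequence was sent
    for byte in ProtectedEeprom.UNLOCK_SEQUENCE:
        rom.command(byte)
    print(rom.write(0, 0x12))     # True: the write goes through once
    print(rom.write(1, 0x34))     # False: the part has relocked
```

A single flipped bit or a misdirected write cannot produce the whole sequence in order, which is why this kind of guard protects against the accidental corruption discussed in this thread.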
So, we put the workaround in _hardware_? (Score:5, Insightful)
Linus has a very good analogy here -- in fact, I love the fact that on the rare occasions I have to set modelines myself, I can pretty much put whatever I want, knowing that if it doesn't work, I can just ctrl+alt+backspace and try again.
But the conclusion does bother me: We're basically saying that all software is buggy, or that we're incapable of preventing this kind of thing from happening (in software). This is true of most modern OS designs -- monolithic kernels do make it possible for pretty much any driver to accidentally ruin any other driver's day.
The proposed workaround, then, is to prevent that memory from being written -- and to prevent this in hardware, for no other reason than to avoid having to write it into every kernel that might potentially allow buggy code to run in Ring 0.
I don't like either solution. Hardware shouldn't be brickable from software, or at least, not so easily. But software shouldn't need hardware to coddle it, either -- why is the SSD in this laptop emulating a hard disk?
Re:So, we put the workaround in _hardware_? (Score:5, Insightful)
Yes, because as long as the hardware can be bricked by software, it remains an exploit that can be used by malicious software writers.
Speaking of the fried monitors, back in the day a college I worked at got a virus that fried 2 monitors before I got smart and put a Hercules monochrome card in it and cleaned it up.
So, yes, while it can (and should) be worked around in Linux, it should also be fixed in hardware, if possible.
More recently than that (Score:2)
Supposedly with pre-multisync monitors (say, your average early-'90s monitor, like my old Tandy VGM-340) if you weren't careful about what X modelines you used you could fry your monitor.
Great! (Score:3, Funny)
e1000 been broken a while (Score:4, Insightful)
Re:e1000 been broken a while (Score:5, Informative)
3Com used to be that way too. I'm not exactly sure what it was, but the 3c905s rocked and would move data quite a bit faster than any other card at the time. I know they had full-blown data processors on the cards, but I assume the others would too. I used to go to computer shows just to pick them up for $10-$20 used, because they had the same effect on data performance as you would see in rendering going from an S3 Trident video adapter to a GeForce video card. I became seriously convinced at a LAN party, with an AMD Athlon 800 system running Windows 98SE with 256 MB of memory, when we had to pull a 100 meg file from a file server to get the updates in sync for a game. I started pulling the file last because I was helping others find it, I was on the tail end of the third tier of uplinked switches, and I had the file installed while others were still transferring it. The funny part is that people with their brand new Windows XP 1.4 and 1.8 gig plus systems were still slower, and the only thing I can attribute it to is the NIC.
Intel caught up with 3Com in this aspect, and despite my older fascination with 3Com, I'm actually an Intel fan in this one respect now.
Root cause still unknown? (Score:5, Interesting)
Yes, they released a patch so that the NVM can't be overwritten after the e1000e driver is loaded. But from what I can tell, they still don't know what is/was responsible for the overwriting.
FWIW, I'm almost positive that modern CPUs have debug traps for this exact sort of thing...you can trap arbitrary I/O writes via SMM or something...obviously I'm not in the debug loop, but I don't see why this has been so hard to figure out...
Re: (Score:2)
Re:Root cause still unknown? (Score:5, Interesting)
So the thing is, there is more than just a simple "eeprom write interface" on these chips.
Most of the time the EEPROM attached to the NIC is a cheap, small serial EEPROM part, usually just a few kb.. maybe 32 or 64kb. It contains mostly things like a bit of bootstrapping, a few "permanent" settings like the MAC address, and the PXE ROM.
And that's where the problems come in. This serial interface is usually an afterthought, and if there is noise on that bus, bits can flip. Or if something bad happens in the NIC code, you could accidentally write when you meant to read.
Usually this is recoverable, but I haven't looked into this specific corruption situation. I've had to deal with this kind of thing before. It's not fun.
Flashing NIC eeproms isn't something a normal end-user does all the time. 99% of the time it's written at the factory, stuffed on the board, and forgotten about.
Re: (Score:3, Interesting)
Otherwise what's the point of testing them? Sure they won't brick your card, but you can't get very useful feedback.
So in a nutshell... (Score:2, Funny)
Re: (Score:2)
huh?