Was the Airbus A320 Recall Caused By Cosmic Rays? (bbc.com)
What triggered that Airbus emergency software recall? The BBC reports that Airbus's initial investigation into an aircraft's sudden drop in altitude linked it "to a malfunction in one of the aircraft's computers that controls moving parts on the aircraft's wings and tail." But that malfunction "seems to have been triggered by cosmic radiation bombarding the Earth on the day of the flight..."
The BBC believes radiation from space "could become a growing problem as ever more microchips run our lives." What Airbus says occurred on that JetBlue flight from Cancun to New Jersey was a phenomenon called a single-event upset, or bit flip. As the BBC has previously reported, these computer errors occur when high-speed subatomic particles from outer space, such as protons, smash into atoms in our planet's atmosphere. This can cause a cascade of particles to rain down through our atmosphere, like throwing marbles across a table. In rare cases, those fast-moving neutrons can strike computer electronics and disrupt tiny bits of data stored in the computer's memory, switching that bit — often represented as a 0 or 1 — from one state to another. "That can cause your electronics to behave in ways you weren't expecting," says Matthew Owens, professor of space physics at the University of Reading in the UK. Satellites are particularly affected by this phenomenon, he says. "For space hardware we see this quite frequently."
This is because the neutron flux — a measure of neutron radiation — rises the higher up in the atmosphere you go, increasing the chance of a strike hitting sensitive parts of the computer equipment on board. Aircraft are more vulnerable to this problem than computer equipment on the ground, although bit flips do occur at ground level, too. The increasing reliance of computers in fly-by-wire systems in aircraft, which use electronics rather than mechanical systems to control the plane in the air, also mean the risk posed by bit flips when they do occur is higher... Airbus told the BBC that it tested multiple scenarios when attempting to determine what happened to the 30 October 2025 JetBlue flight. In this case also, the company ruled out various possibilities except that of a bit flip. It is hard to attribute the incident to this for sure, however, because careering neutrons leave no trace of their activity behind, says Owens...
[Airbus's software update] works by inducing "rapid refreshing of the corrupted parameter so it has no time to have effect on the flight controls", Airbus says. This is, in essence, a way of continually sanitising computer data on these aircraft to try and ensure that any errors don't end up actually impacting a flight... As computer chips have become smaller, they have also become more vulnerable to bit flips because the energy required to corrupt tiny packets of data has got lower over time. Plus, more and more microchips are being loaded into products and vehicles, potentially increasing the chance that a bit flip could cause havoc. If nothing else, the JetBlue incident will focus minds across many industries on the risk posed to our modern, microchip-dependent lives from cosmic radiation that originates far beyond our planet.
Airbus said their analysis revealed "intense solar radiation" could corrupt data "critical to the functioning of flight control." But that explanation "has left some space weather scientists scratching their heads," adds the BBC.
Space.com explains: Solar radiation levels on Oct. 30 were unremarkable and nowhere near levels that could affect aircraft electronics, Clive Dyer, a space weather and radiation expert at University of Surrey in the U.K., told Space.com. Instead, Dyer, who has studied effects of solar radiation on aircraft electronics for decades, thinks the onboard computer of the affected jet could have been struck by a cosmic ray, a stream of high-energy particles from a distant star explosion that may have travelled millions of years before reaching Earth. "[Cosmic rays] can interact with modern microelectronics and change the state of a circuit," Dyer said. "They can cause a simple bit flip, like a 0 to 1 or 1 to 0. They can mess up information and make things go wrong. But they can cause hardware failures too, when they induce a current in an electronic device and burn it out."
Re: Why was the older version better? (Score:3)
They didn't take it out per se, they just forgot to include it. Probably some make flag omitted or something simple.
Re: Why was the older version better? (Score:5, Insightful)
They don't really know what caused the glitch.
The cosmic ray hypothesis is just a conjecture.
So, they're rolling back to the previous version until they can figure it out.
If they're doing memory scrubbing, they might want to bump up the frequency.
If they aren't using semiconductors made with depleted boron, they should be.
Re: (Score:3)
First, cosmic rays are what you blame when you can't find the bug. And second, if cosmic rays really are to blame, then they should have rolled back to the previous version of the sun.
Re: (Score:2)
> they should have rolled back to the previous version of the sun.
Not just the sun - the universe.
The sun emits solar 'wind' (formed of charged particles), which can indeed affect electronics, but it typically has to get through our magnetic field first (so satellites are vulnerable). Aircraft are less protected, but still plenty protected against the majority of it. The sun does also emit cosmic rays, but in relatively small quantities.
Many cosmic rays come from much further out than that - potentially
Re: (Score:2)
The number of cosmic rays hitting earth is inversely proportional to solar activity. In years close to solar max, where we are now, the cosmic ray flux is at a minimum.
This happens because increased solar wind at solar max blocks some cosmic ray particles coming from outside the solar system.
Re: (Score:2)
cosmic rays are what you blame when you can't find the bug
Sometimes you can prove it was a bit flip [caused by cosmic ray, local radioactivity or a glitch in the Matrix], you just need to find the exact bit. A friend of mine managed to do exactly that after an error in his monthly accounting software. He proved you could only get the resulting sum if you flipped the Nth bit of a certain value during the summation. It took him a while and he had written the software himself.
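A rough sketch of that kind of hunt, for anyone curious. This is not the friend's actual method; the function name, the ledger values and the assumption that amounts are stored as non-negative integers (say, cents) are all made up for illustration. It simply tries every single-bit flip of every operand and keeps the ones that explain the wrong total.

```python
def find_single_bit_flips(values, observed_sum, width=32):
    """Return (index, bit) pairs where flipping that one bit of values[index]
    would turn the correct sum into observed_sum."""
    delta = observed_sum - sum(values)
    hits = []
    for i, v in enumerate(values):
        for bit in range(width):
            # Flipping a bit changes the value by +2**bit (bit was 0)
            # or -2**bit (bit was 1).
            change = (v ^ (1 << bit)) - v
            if change == delta:
                hits.append((i, bit))
    return hits


# Hypothetical ledger entries in cents; corrupt bit 13 of the third entry.
amounts = [1250, 9900, 4075, 320]
corrupted = amounts[2] ^ (1 << 13)
bad_total = sum(amounts) - amounts[2] + corrupted

candidates = find_single_bit_flips(amounts, bad_total)
print(candidates)
```

Note that the search pins down the bit position but can leave several candidate operands, so you still need other evidence to say which value was hit.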
Re: (Score:2)
Re: Why was the older version better? (Score:4, Insightful)
And second, if cosmic rays really are to blame, then they should have rolled back to the previous version of the sun.
You're assuming a lot. The software rollback may very well have to do with changes in error detection and correction routines. Hell here's a super oversimplified example: When you update your BIOS on a server there's a good chance you come out the other side with ECC turned off.
This isn't unreasonable. I've experienced a large compressor shutdown costing many millions of dollars thanks to a firmware update on a safety system from Honeywell which had a bug in error detection and handling which caused a simple random single hardware fault to escalate to a redundant failure that shouldn't have occurred. Honeywell withdrew the update globally and we were advised to roll back. This kind of shit happens.
Re: (Score:2)
You do realize there is a lot more exposure to electromagnetic radiation at 40,000 feet than there is at 1000 feet, yeah?
And you know that we've observed solar bursts causing bit-flips in RAM and SSD? Like, a lot?
So a solar flare causing additional EM and ionizing radiation, and increased exposure to it due to higher altitude might increase the probability of getting a few bit-flips in the systems, yeah?
No, they don't know for sure. But they have operational records that show higher exposure does cause th
Re: (Score:2)
It is recommended to run ECC RAM with any large fileserver or NAS for this exact reason. ZFS explicitly recommends it.
Re: (Score:2)
Which is all well and good if you're talking about something you can easily replace.
Do realize that when someone spends $130M on a passenger jet, they expect to get a couple decades service out of it, and they aren't going to be ripping out the electronics every other year to upgrade like it's a PC - they upgrade things when they will either get more service revenue out of it (i.e. there's an RoI to be had), or if it's required on an airworthiness directive for safety.
Some of these systems were created befo
Re: Why was the older version better? (Score:2)
Cosmic rays don't originate in the solar system, aka they have nothing to do with the sun.
Re: (Score:3)
If they aren't using semiconductors made with depleted boron, they should be.
No they should not. They should spend their money focusing on designs that are inherently resilient to soft errors rather than spending a fortune on buying hardened silicon to address a singular cause of a potential error. Boron-11 silicon is predominantly used in the medical imaging, space, and nuclear industry where equipment is expected to be continuously bombarded with high levels of radiation. Flights just don't qualify for that level of mitigation requirement in the silicon manufacture.
Re: (Score:2)
They don't really know what caused the glitch.
The cosmic ray hypothesis is just a conjecture.
So, they're rolling back to the previous version until they can figure it out.
This is called "being careful". They could just have done what Boeing does and risked a few 100 dead but avoided that costly "recall". Instead they determined the possible causes and eliminated the most likely ones, and those include an unknown software fault. They currently are not finding that fault and hence they think it may have been a rare but possible event like a bit flip.
Overthinking it... (Score:2)
Re:Overthinking it... (Score:5, Interesting)
Their developers are supposed to be very competent and careful, but mostly because of culture and the application of development processes that consider lots of potential errors. The default assurance guidance documents (don't call them standards, for rather pedantic reasons) are ED-79 (for Europe because we're talking about Airbus, jointly published as ARP4754 in the US) for aircraft and system design, ARP4761/ED-135 for the accompanying safety analyses, DO-178/ED-12 for software development and DO-254/ED-80 for hardware development. DO-254 gets augmented by AC 20-152A to clarify a number of points. Regulators who certify the system or aircraft also have guidance about what level of involvement they should have in the development process, based on lots of factors, but with most of them boiling down to prior experience of the developers.
You can read online about the objectives in those documents, but flight control systems have potentially catastrophic failure effects, so they need to be developed to DAL A. For transport category aircraft, per AC 25.1309-1B, a catastrophic effect should occur no more often than once per billion operational hours. Catastrophic effects must not result from any single failure; there must be redundancy in the aircraft or system. Normally, the fault tree analysis can only ignore an event if it's two or three orders of magnitude less likely than the overall objective.
Cosmic rays normally cause more than one single-event upset per 10 trillion hours of operation, so normally there should be hardware and software mechanisms to avoid effects from them. In hardware, it might be ECC plus redundant processors with a voting mechanism. For software, it might be what DO-178 calls multiple version dissimilar software independence.
I don't know Airbus itself, and one always has the chance of something like the Boeing 737 MAX MCAS. But typically, companies and regulators do expect these systems to be extremely reliable because the developers are professional and honest: not necessarily super-competent, but super-careful about applying good development practices, having independence in development processes as well as the product, and checking their work with process and quality assurance teams who know what to look for and what to expect.
Re: (Score:3)
super-careful about applying good development practices
That works, until bean counting MBAs are allowed to control what should be an engineering process. In the case of the 737-MAX it was because the MBAs that run Boeing see programmers as a fungible input like aluminum, so any old programming team will do if the price is right. In that case the programming team which won the low bid normally worked in the financial industry.
process and quality assurance teams who know what to look for
Those guys were too expensive for Boeing's management, they've all been laid off years ago.
Re: (Score:2)
Makes you think if you only want to fly Airbus from now on.
Well you're more than welcome to fly on Boeing...
A funny scary thing (Score:4, Informative)
The issue is seemingly far more widespread than people realize. My memory is otherwise 100% stable because I've run a 24-hour MemTest86 loop at least a couple of times and it didn't find any errors. However, it's important to note that sometimes it actually detects a single error, but it's not reproducible.
Re: (Score:3)
Try running a one-week memtest86 run, then?
I used to have similar problems (with 4x32 GB sticks), but they went away when I replaced my RAM. Those kinds of problems can also be caused by voltage fluctuations, either from the input power or from load (and memtest86 isn't good at increasing CPU or GPU load) -- even without overclocking. It could be cosmic rays, but it could also be much more local causes.
Re: (Score:2)
Re:A funny scary thing (Score:4, Informative)
Unless you are at the North or South Pole or on top of one of the highest mountains, you are unlikely to be getting an average of one SEU per week in one computer due to cosmic rays. I would attribute most of the errors you see to other causes: marginal timing compatibility, power glitches, an overburdened fan, a leaky microwave nearby, several of these in combination, etc. Cosmic rays sound cool, but most bit flips have more boring causes.
In my case, I saw a lot more errors when I was running compute-intensive jobs: read files, decompress them, run a domain specific compression to text, generate SHA-256, compress using a general purpose compression, in parallel on 24 cores. The location of errors was random like in your system, but the correlation with processor load convinced me it wasn't caused by cosmic rays.
Re: (Score:2)
Re: A funny scary thing (Score:3)
Re: (Score:2)
Yeah, cosmic rays, that'll be it (Score:3)
No chance of it being slightly out-of-spec RAM that was sold anyway, or perhaps issues with the MB or power supply, no sir, it's cosmic rays!
Re: (Score:2)
At least once a week
That is not cosmic rays. Are you sure your nextdoor neighbour isn't running a secret nuclear reactor?
Yes, bit flips from cosmic rays happen. If you were to say once or twice a year then I'd blame it on a bit flip (that's about in line with what Google's study estimates a server with large amounts of memory would have), but if you were getting errors daily then it's time to replace your RAM. If it's seemingly random across the memory channels then new CPU/Motherboard.
Re: (Score:2)
*once per week, not per day.
In any case my server with 2x 32GB sticks in it registers a hardware error slightly less than once a year (last one I see was in September 2024) and it's not like I live in a hardened bunker.
Re: (Score:2)
We're at sea level, and in datacenters with lots of shit on the roof, so maybe we're doing a little better than someone's house, but 1/week is 100% not the FSM fucking with your bits. That's memory or bus timings or voltage on the razor's edge or something.
Re: (Score:1)
>1/week is 100% not the FSM fucking with your bits. That's memory or bus timings or voltage on the razor's edge or something.
His Almighty Noodliness is known to use memory, bus timings, or voltage that's on the razor's edge to fsck with one's bits.
For what it's worth, it could also be Ceiling Cat or Basement Cat making the mischief.
Re: (Score:2)
Ceiling Cat
All hail
Re: (Score:2)
We have hundreds of servers and don't see that kind of MCE log frequency across all of them combined lol
You're A Funny Scary Thing (Score:3)
At least once a week, the Linux kernel displays this message:
Unless your Linux machine is in space, it is because you have bad memory!
When you see this error, you replace the fucking memory and then you don't see it anymore.
I have hundreds of Linux systems under my management and this error never occurs. Is that because they are shielded from neutrinos in special lead and water lined bunkers? Nope, they just don't have bad memory chips in them.
Let's put more stuff up there! (Score:2)
Re: (Score:3)
I doubt it could make the slop any worse. Might even improve it with a bit of extra random dither occasionally!
Re: (Score:2)
Yup, I'm sure they haven't thought about this issue at all when considering putting computers in space.
Re: (Score:3)
Until they were finally grounded the Space Shuttles used 486 CPUs, mostly because the large die size minimized the issue of flipped bits.
Re: (Score:2)
The avionics package on the orbiter in fact consisted of 8086s. You may be thinking about Hubble.
Re: (Score:2)
Oops, you're right. Anyway, big die size = minimal bit flips.
Re: (Score:2)
Bigger and slower RAM cells also have larger potentials, harder to flip. Buses and gates have higher voltages, harder to flip.
That being said- small die fast chips can be made reliable in space- but it's much more expensive than just using something really old.
Re: (Score:1)
>Until they were finally grounded the Space Shuttles used [very old, large-die] CPUs, mostly because the large die size minimized the issue of flipped bits.
If my memory is correct (pun intended), Space Shuttles also had 5 flight computer systems for redundancy.
Re: (Score:2)
Yes, but then until it crashed because of completely not cosmic ray related issues, Ingenuity used off the shelf computers, and on the ISS you will find bog standard modern computers for the astronauts to process data on.
Re: (Score:2)
For processing data, that's fine. Run the analysis of your test results twice, if they match you're probably fine. On the other hand IIRC the systems that actually maintain attitude and other critical functions are military-type hardened systems (they weren't that much more expensive at the time, unless it was the Pentagram purchasing them).
No ECC? (Score:1)
Re: No ECC? (Score:2)
That was my thought, but I don't really know much about it...but I did think that this sort of thing is exactly what ECC memory is for...
Re: No ECC? (Score:2)
Do we know the bit flip happened in memory and not elsewhere?
Either way, these systems should have triple redundancy for these signal corruption cases. Only accept input that two sources agree on.
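A minimal sketch of that "two sources must agree" rule, purely illustrative and not a description of how Airbus's flight computers actually vote. The function name and the `tolerance` parameter are hypothetical.

```python
def vote_2oo3(a, b, c, tolerance=0.0):
    """Return a reading that at least two of the three channels agree on
    (within tolerance), or None if no two channels agree."""
    def agree(x, y):
        return abs(x - y) <= tolerance

    if agree(a, b) or agree(a, c):
        return a
    if agree(b, c):
        return b
    return None  # no majority: caller must treat the input as invalid


print(vote_2oo3(5.0, 5.0, 9.0))   # one corrupted channel is outvoted
print(vote_2oo3(1.0, 2.0, 3.0))   # total disagreement yields None
```

The obvious limit is that a voter like this only helps when failures are independent: if two channels return the same wrong value, it happily picks the wrong value.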
Re: (Score:2)
Airbus does have triple redundancy in all of their fly by wire aircraft, but it can happen (and has actually happened at least once) that two sources return the same defective data.
Re: (Score:3)
The chances of 2 separate cosmic ray events flipping the exact same bits in program code or its data at the exact same time to cause the computers to return the same defective result is so infinitesimally small that it can be discounted as a realistic scenario.
If this was a cosmic ray then it clearly affected part of the avionics that didn't have triple redundancy. Perhaps they should be looking at that.
Re: (Score:2)
You couldn't do it in a googol ages of the universe.
Re: (Score:2)
As I said, it did happen before.
https://en.wikipedia.org/wiki/... [wikipedia.org]
A bit flip separately might not be a big deal. A bit flip coupled with unexpected hardware or software limitations can break things that seem impossible to break.
Re: (Score:2)
A bit flip separately might not be a big deal. A bit flip coupled with unexpected hardware or software limitations can break things that seem impossible to break.
So you're not proposing two equal bit flips on 2 computers, you're describing a bit flip's end result mirroring that of a defective piece of hardware that would have been a compared-against value - I can buy that.
Re:No ECC? (Score:4, Interesting)
Consumer grade memory just takes bit flips, but ECCs do exist. Do you mean to tell me they don't use them at Airbus? -dk
This is an embedded system in a high reliability environment. The way these things work is keep-it-simple to an absurd level. I bet you this is some dinky 8-bit RISC CPU that's built on a crazy big process node, and the production QC trace on it will be insane. On these sorts of systems, if you want ECC, you add it to the firmware, but only in the areas you need it, and only after a thorough analysis of (a) the problem it is solving (b) the amount of ECC required to solve that problem (c) the best algorithm to meet the identified objectives. There are many ways to do ECC - including just duplicating variables n number of times - which has the advantage of being very easy to implement and formally verify while being less efficient at RAM utilisation vs a Hamming Code, but even that depends on the statistics of your error conditions.
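To make the "just duplicate the variable n times" option concrete, here is a minimal sketch with n = 3 and a bitwise majority vote on read. The class name, the 32-bit width and the `scrub` method are assumptions for illustration only, not anything known about the actual avionics firmware.

```python
class TriplicatedWord:
    """Keeps three copies of a 32-bit value and reads by bitwise majority."""

    MASK = 0xFFFFFFFF

    def __init__(self, value):
        self.copies = [value & self.MASK] * 3

    def write(self, value):
        self.copies = [value & self.MASK] * 3

    def read(self):
        a, b, c = self.copies
        # Each result bit is whatever at least two of the three copies hold.
        return (a & b) | (a & c) | (b & c)

    def scrub(self):
        """Rewrite all copies with the voted value; return how many differed."""
        good = self.read()
        repaired = sum(1 for x in self.copies if x != good)
        self.copies = [good] * 3
        return repaired


w = TriplicatedWord(0x1234)
w.copies[1] ^= 1 << 7          # simulate a single-event upset in one copy
print(hex(w.read()))           # still reads the original value
print(w.scrub())               # number of copies that had to be repaired
```

This survives one flipped copy per bit position at the cost of three times the RAM, which is the trade-off against a Hamming code mentioned above. Calling `scrub` periodically keeps a second, later flip from landing on an already-damaged bit position.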
The point is that, sure, they could add some generic hardware ECC, but that ECC can fail (if there are too many bit flips, if the ECC logic itself gets bit flipped, or there is a design error for a particular input sequence, etc etc). Maybe you win out overall, maybe you don't - the problem is that you'd have to run a complete analysis to know. That means you have to now add ECC hardware failure modes to pieces of software that did not need ECC before. I mean, sure, maybe you win, but maybe you make it worse, and have to develop extra software to deal with the new hardware failure modes. Whatever the outcome you'll have to do a boat load more documentation to make sure.
I bet you it took them less than a day to identify a fix for the code and update it. It would have then been thousands of hours of work to update all the documentation and thoroughly verify the new code against all the other requirements on the system.
If you want a good example of how quickly these supposedly simple systems can get complicated, look into the CAN bus CRC bug. This fault is present on EVERY system that uses the CAN bus (basically any vehicle since the 1990s). It is an extremely subtle bug involving the error detection system that is obvious once you're shown it, but the very smart people who designed it, along with thousands of engineers who worked with it, didn't spot it for around a decade. Even worse when they developed CAN 2.0 they tried to fix the bug, and didn't even get that right.
Re: (Score:3)
If you want a good example of how quickly these supposedly simple systems can get complicated, look into the CAN bus CRC bug.
It's not simple to figure out what you're talking about, a search doesn't return anything obvious through the flurry of marketing content.
This fault is present on EVERY system that uses the CAN bus
It applies to every CAN standard? There's like five of them.
basically any vehicle since the 1990s
Since after the 1990s, you mean? While there were a few CAN vehicles in the 1990s, it didn't really become popular until the 2000s because the interface chips were still relatively expensive.
Re: (Score:3)
Search for 'Multi-Bit Error Vulnerabilities in the Controller Area Network Protocol'. (It's a thesis by Eushiuan Tran)
This issue is quite subtle, but essentially, the fact that the CRC is applied before bit-stuffing means that a single bit error can cascade into multiple errors that exceed the detection limit for the CRC. The potential for this is fortunately rare, but it's like having holes in your bullet proof vest.
This is why CAN FD (apologies, I said 2.0 in the previous message) includes the stuff bits i
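The cascade is easier to see with a toy model. The sketch below implements only the classic CAN stuffing rule (after five identical bits the transmitter inserts one complementary bit, and the receiver drops it again); it is not a full CAN frame and there is no CRC in it. The function names and the example payload are made up for illustration.

```python
def stuff(bits):
    """Insert a complementary bit after every run of five identical bits."""
    out, run_bit, run_len = [], None, 0
    for b in bits:
        out.append(b)
        run_bit, run_len = (b, run_len + 1) if b == run_bit else (b, 1)
        if run_len == 5:
            out.append(1 - b)
            run_bit, run_len = 1 - b, 1   # the stuff bit starts a new run
    return out


def destuff(bits):
    """Drop the bit that follows every run of five identical bits."""
    out, run_bit, run_len, i = [], None, 0, 0
    while i < len(bits):
        b = bits[i]
        out.append(b)
        run_bit, run_len = (b, run_len + 1) if b == run_bit else (b, 1)
        i += 1
        if run_len == 5 and i < len(bits):
            if bits[i] == b:
                raise ValueError("stuff error: six identical bits in a row")
            run_bit, run_len = bits[i], 1
            i += 1
    return out


payload = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
wire = stuff(payload)          # no run of five here, so nothing is inserted
wire[4] ^= 1                   # a single bit error on the bus
received = destuff(wire)       # receiver now sees a run of five and drops a real bit
diffs = sum(a != b for a, b in zip(payload, received))
print(len(payload), len(received), diffs)
```

One flipped bit on the wire makes the receiver discard a genuine data bit as if it were a stuff bit, so everything after it is shifted by one position. From the CRC's point of view that is several wrong bits, not one.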
Another gadget added to the list of forbidden item (Score:5, Funny)
I can't bring a ton of shampoo, nor a pair of scissors. Certain laptops or batteries. Now, it's looking like my homemade cosmic ray simulator won't be making it onboard with me...
Re: (Score:2)
I can't bring a ton of shampoo, nor a pair of scissors. Certain laptops or batteries. Now, it's looking like my homemade cosmic ray simulator won't be making it onboard with me...
LAG restrictions have been lessened or even gotten rid of in Australia and the UK. Air travel is not as bad outside the US.
Batteries are becoming a problem for airlines because people are entitled fuckwits and won't follow basic instructions (I MUST charge my phone no matter what people tell me) as they keep bringing damaged batteries on board which conflagrate. So they're getting banned at the insistence of airlines rather than governments.
Risk of updating software (Score:1)
solar flares (Score:2)
Wasn't there a ton of solar flare activity causing auroras? That's more likely the cause than cosmic rays.
Re: (Score:2)
I wouldn't expect the PR flacks who write press releases know the difference.
Neutron flux (Score:1)
Not great, not terrible.
Soon we won't have to worry... (Score:2)
Cosmic Bit Flip Bullshit (Score:2)
This cosmic bit flip thing stinks of bullshit. Especially when a cosmic physics problem is somehow solved by a software reversion.
Seems like bad code to me.
Also, if the bit flip is possible, then it's a design error for failing to use ECC RAM.
I are so smart.
Job ambitions (Score:2)
because careering neutrons leave no trace of their activity behind
It's always this. Neutrons are "the little MBAs" of the subatomic world, and they chew through role after role so quickly that it can be dizzying to trace. Compounding the issue is that most subatomic particles don't take the time to fill out their LinkedIn profiles.
Why does this sound like what AI let loose will do (Score:1)
Where did I put that COSMAC 1802? (Score:2)
I'm sorry officer! I wasn't speeding! (Score:2)
Some cosmic rays interfered with my electronic speedometer. It told me I was driving exactly the speed limit. Honest!