
Was the Airbus A320 Recall Caused By Cosmic Rays? (bbc.com) 75

What triggered that Airbus emergency software recall? The BBC reports that Airbus's initial investigation into an aircraft's sudden drop in altitude linked it "to a malfunction in one of the aircraft's computers that controls moving parts on the aircraft's wings and tail." But that malfunction "seems to have been triggered by cosmic radiation bombarding the Earth on the day of the flight..."

The BBC believes radiation from space "could become a growing problem as ever more microchips run our lives." What Airbus says occurred on that JetBlue flight from Cancun to New Jersey was a phenomenon called a single-event upset, or bit flip. As the BBC has previously reported, these computer errors occur when high-speed subatomic particles from outer space, such as protons, smash into atoms in our planet's atmosphere. This can cause a cascade of particles to rain down through our atmosphere, like throwing marbles across a table. In rare cases, those fast-moving neutrons can strike computer electronics and disrupt tiny bits of data stored in the computer's memory, switching that bit — often represented as a 0 or 1 — from one state to another. "That can cause your electronics to behave in ways you weren't expecting," says Matthew Owens, professor of space physics at the University of Reading in the UK. Satellites are particularly affected by this phenomenon, he says. "For space hardware we see this quite frequently."
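The effect of a single flipped bit is easy to demonstrate. In this hypothetical sketch (illustrative only, not Airbus's actual data format), one upset in the exponent field of a 32-bit float turns a plausible flight parameter into a wildly different number:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its 32-bit IEEE-754 encoding flipped."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", value))
    as_int ^= 1 << bit                  # the single-event upset
    (flipped,) = struct.unpack(">f", struct.pack(">I", as_int))
    return flipped

altitude = 35000.0                      # hypothetical parameter, in feet
corrupted = flip_bit(altitude, 30)      # upset the exponent's top bit:
                                        # 35000.0 collapses to a near-zero value
```

Flipping the same bit again restores the original value exactly, which is why the same XOR operation models both the corruption and its detection by comparison against a redundant copy.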

This is because the neutron flux — a measure of neutron radiation — rises the higher up in the atmosphere you go, increasing the chance of a strike hitting sensitive parts of the computer equipment on board. Aircraft are more vulnerable to this problem than computer equipment on the ground, although bit flips do occur at ground level, too. The increasing reliance on computers in fly-by-wire systems in aircraft, which use electronics rather than mechanical systems to control the plane in the air, also means the risk posed by bit flips when they do occur is higher... Airbus told the BBC that it tested multiple scenarios when attempting to determine what happened to the 30 October 2025 JetBlue flight. In this case also, the company ruled out various possibilities except that of a bit flip. It is hard to attribute the incident to this for sure, however, because careering neutrons leave no trace of their activity behind, says Owens...

[Airbus's software update] works by inducing "rapid refreshing of the corrupted parameter so it has no time to have effect on the flight controls", Airbus says. This is, in essence, a way of continually sanitising computer data on these aircraft to try and ensure that any errors don't end up actually impacting a flight... As computer chips have become smaller, they have also become more vulnerable to bit flips because the energy required to corrupt tiny packets of data has got lower over time. Plus, more and more microchips are being loaded into products and vehicles, potentially increasing the chance that a bit flip could cause havoc. If nothing else, the JetBlue incident will focus minds across many industries on the risk posed to our modern, microchip-dependent lives from cosmic radiation that originates far beyond our planet.
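The "rapid refreshing" idea is essentially memory scrubbing: periodically verify a working copy of the data and rewrite it from a protected reference before a corrupted bit can be consumed. A minimal sketch of the concept (hypothetical; not Airbus's actual implementation):

```python
import zlib

class ScrubbedBlock:
    """Keep a working RAM copy of a parameter block alongside a reference
    copy and its CRC; a periodic scrub pass repairs any corruption."""

    def __init__(self, data: bytes):
        self._reference = bytes(data)   # stand-in for a protected/ROM copy
        self._crc = zlib.crc32(self._reference)
        self.working = bytearray(data)  # the copy an upset might corrupt

    def scrub(self) -> bool:
        """Run every cycle: if the working copy no longer matches its CRC,
        rewrite it from the reference. Returns True if a repair was made."""
        if zlib.crc32(bytes(self.working)) != self._crc:
            self.working[:] = self._reference
            return True
        return False
```

Run often enough, the scrub interval is shorter than the gap between a flip occurring and the moment the flight-control logic reads the parameter, so the error "has no time to have effect".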

Airbus said their analysis revealed "intense solar radiation" could corrupt data "critical to the functioning of flight control." But that explanation "has left some space weather scientists scratching their heads," adds the BBC.

Space.com explains: Solar radiation levels on Oct. 30 were unremarkable and nowhere near levels that could affect aircraft electronics, Clive Dyer, a space weather and radiation expert at University of Surrey in the U.K., told Space.com. Instead, Dyer, who has studied effects of solar radiation on aircraft electronics for decades, thinks the onboard computer of the affected jet could have been struck by a cosmic ray, a stream of high-energy particles from a distant star explosion that may have travelled millions of years before reaching Earth. "[Cosmic rays] can interact with modern microelectronics and change the state of a circuit," Dyer said. "They can cause a simple bit flip, like a 0 to 1 or 1 to 0. They can mess up information and make things go wrong. But they can cause hardware failures too, when they induce a current in an electronic device and burn it out."
  • So the likelihood of a memory/wordsize handling bug that might lead to heap overrides is less likely than cosmic radiation bitflip? Amazing! These companies must really employ the most professional and honest engineers on the planet. Makes you think if you only want to fly Airbus from now on. ...I'd say their cosmic radiation comment is more concerning than the recall itself.
    • by Entrope ( 68843 ) on Monday December 08, 2025 @07:39AM (#65842951) Homepage

      Their developers are supposed to be very competent and careful, but mostly because of culture and the application of development processes that consider lots of potential errors. The default assurance guidance documents (don't call them standards, for rather pedantic reasons) are ED-79 (for Europe because we're talking about Airbus, jointly published as ARP4754 in the US) for aircraft and system design, ARP4761/ED-135 for the accompanying safety analyses, DO-178/ED-12 for software development and DO-254/ED-80 for hardware development. DO-254 gets augmented by AC 20-152A to clarify a number of points. Regulators who certify the system or aircraft also have guidance about what level of involvement they should have in the development process, based on lots of factors, but with most of them boiling down to prior experience of the developers.

      You can read online about the objectives in those documents, but flight control systems have potentially catastrophic failure effects, so they need to be developed to DAL A. For transport category aircraft, per AC 25.1309-1B, a catastrophic effect should occur no more often than once per billion operational hours. Catastrophic effects must not result from any single failure; there must be redundancy in the aircraft or system. Normally, the fault tree analysis can only ignore an event if it's two or three orders of magnitude less likely than the overall objective.

      Cosmic rays normally cause more than one single-event upset per 10 trillion hours of operation, so normally there should be hardware and software mechanisms to avoid effects from them. In hardware, it might be ECC plus redundant processors with a voting mechanism. For software, it might be what DO-178 calls multiple version dissimilar software independence.

      I don't know Airbus itself, and one always has the chance of something like the Boeing 737 MAX MCAS. But typically, companies and regulators do expect these systems to be extremely reliable because the developers are professional and honest: not necessarily super-competent, but super-careful about applying good development practices, having independence in development processes as well as the product, and checking their work with process and quality assurance teams who know what to look for and what to expect.

      • by cusco ( 717999 )

        super-careful about applying good development practices

        That works, until bean counting MBAs are allowed to control what should be an engineering process. In the case of the 737-MAX it was because the MBAs that run Boeing see programmers as a fungible input like aluminum, so any old programming team will do if the price is right. In that case the programming team which won the low bid normally worked in the financial industry.

        process and quality assurance teams who know what to look for

        Those guys were too expensive for Boeing's management, they've all been laid off years ago.

    • Makes you think if you only want to fly Airbus from now on.

      Well you're more than welcome to fly on Boeing...

  • A funny scary thing (Score:4, Informative)

    by Artem S. Tashkinov ( 764309 ) on Monday December 08, 2025 @06:40AM (#65842873) Homepage
    For decades, people have dismissed bit flips caused by cosmic rays, but here's what I've been dealing with: I have four 16 GB sticks of DDR4 RAM running at stock without overclocking or anything. At least once a week, the Linux kernel displays this message:

    mce: [Hardware Error]: Machine check events logged
    mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 19: 9460eb40d5040348
    mce: [Hardware Error]: TSC 0
    mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1750853778 SOCKET 0 APIC 2 microcode a20102d

    The issue is seemingly far more widespread than people realize. My memory is otherwise 100% stable because I've run a 24-hour MemTest86 loop at least a couple of times and it didn't find any errors. However, it's important to note that sometimes it actually detects a single error, but it's not reproducible.

    • by Entrope ( 68843 )

      Try running a one-week memtest86 run, then?

      I used to have similar problems (with 4x32 GB sticks), but they went away when I replaced my RAM. Those kinds of problems can also be caused by voltage fluctuations, either from the input power or from load (and memtest86 isn't good at increasing CPU or GPU load) -- even without overclocking. It could be cosmic rays, but it could also be much more local causes.

      • Like I said, MemTest86 sometimes detects a single bit error, but consecutive runs don't. And no errors have ever been the same (they haven't even been detected in the same bank). To me, this strongly suggests external causes.
        • by Entrope ( 68843 ) on Monday December 08, 2025 @08:10AM (#65842985) Homepage

          Unless you are at the North or South Pole or on top of one of the highest mountains, you are unlikely to be getting an average of one SEU per week in one computer due to cosmic rays. I would attribute most of the errors you see to other causes: marginal timing compatibility, power glitches, an overburdened fan, a leaky microwave nearby, several of these in combination, etc. Cosmic rays sound cool, but most bit flips have more boring causes.

          In my case, I saw a lot more errors when I was running compute-intensive jobs: read files, decompress them, run a domain specific compression to text, generate SHA-256, compress using a general purpose compression, in parallel on 24 cores. The location of errors was random like in your system, but the correlation with processor load convinced me it wasn't caused by cosmic rays.

          • I recently read the research on this exact topic, and the error rate I'm getting (one bit flip per week) is consistent with the expected rate caused by cosmic rays. However, you might be right, and I could be overlooking something, like my microwave oven, which is placed in the adjacent room just behind the wall. I never thought about it, but timing-wise, bit flips happen outside the window of when I use my oven (or maybe they are just detected later).
            • Well there is a way to test this hypothesis. Placing the computer a couple dozen feet underground and repeating memtest a few times should do it. This is an interesting case and I look forward to an update in due course.
        • Sounds like some bad ram. I run machines for years (granted 32G not 64) with no memory errors. I think I've seen one in decades of owning multiple machines.
    • No chance of it being slightly out-of-spec RAM that was sold anyway, or perhaps issues with the MB or power supply, no sir, it's cosmic rays!

    • At least once a week

      That is not cosmic rays. Are you sure your next-door neighbour isn't running a secret nuclear reactor?

      Yes, bit flips from cosmic rays happen. If you were to say once or twice a year, then I'd blame it on a bit flip (that's about in line with what Google's study estimates a server with large amounts of memory would have), but if you were getting errors daily then it's time to replace your RAM. If it's seemingly random across the memory channels, then new CPU/motherboard.

      • *once per week, not per day.

        In any case my server with 2x 32GB sticks in it registers a hardware error slightly less than once a year (the last one I saw was in September 2024) and it's not like I live in a hardened bunker.

        • In our datacenters, our per-server MCE anomalies are, averaged, ~0.2/yr, across ~150 servers.
          We're at sea level, and in datacenters with lots of shit on the roof, so maybe we're doing a little better than someone's house, but 1/week is 100% not the FSM fucking with your bits. That's memory or bus timings or voltage on the razor's edge or something.
          • by davidwr ( 791652 )

            >1/week is 100% not the FSM fucking with your bits. That's memory or bus timings or voltage on the razor's edge or something.

            His Almighty Noodliness is known to use memory, bus timings, or voltage that's on the razor's edge to fsck with one's bits.

            For what it's worth, it could also be Ceiling Cat or Basement Cat making the mischief.

    • Na, that's a problem in the bus or the memory.
      We have hundreds of servers and don't see that kind of MCE log frequency across all of them combined lol
    • At least once a week, the Linux kernel displays this message:

      Unless your Linux machine is in space, it is because you have bad memory!

      When you see this error, you replace the fucking memory and then you don't see it anymore.

      I have hundreds of Linux systems under my management and this error never occurs. Is that because they are shielded from neutrinos in special lead and water lined bunkers? Nope, they just don't have bad memory chips in them.

  • And now they want to put AI computers in space. What could possibly go wrong?
    • by Viol8 ( 599362 )

      I doubt it could make the slop any worse. Might even improve it with a bit of extra random dither occasionally!

    • Yup, I'm sure they haven't thought about this issue at all when considering putting computers in space.

      • by cusco ( 717999 )

        Until they were finally grounded the Space Shuttles used 486 CPUs, mostly because the large die size minimized the issue of flipped bits.

        • 486! hah!
          The avionics package on the orbiter in fact consisted of 8086s. You may be thinking about Hubble.
          • by cusco ( 717999 )

            Oops, you're right. Anyway, big die size = minimal bit flips.

            • Everything about it helps.
              Bigger and slower RAM cells also have larger potentials, harder to flip. Buses and gates have higher voltages, harder to flip.

              That being said- small die fast chips can be made reliable in space- but it's much more expensive than just using something really old.
        • by davidwr ( 791652 )

          >Until they were finally grounded the Space Shuttles used [very old, large-die] CPUs, mostly because the large die size minimized the issue of flipped bits.

          If my memory is correct (pun intended), Space Shuttles also had 5 flight computer systems for redundancy.

        • Yes, but until it crashed because of completely non-cosmic-ray-related issues, Ingenuity used off-the-shelf computers, and on the ISS you will find bog-standard modern computers for the astronauts to process data on.

          • by cusco ( 717999 )

            For processing data, that's fine. Run the analysis of your test results twice, if they match you're probably fine. On the other hand IIRC the systems that actually maintain attitude and other critical functions are military-type hardened systems (they weren't that much more expensive at the time, unless it was the Pentagram purchasing them).

  • Consumer grade memory just takes bit flips, but ECCs do exist. Do you mean to tell me they don't use them at Airbus? -dk
    • That was my thought, but I don't really know much about it...but I did think that this sort of thing is exactly what ECC memory is for...

    • Do we know the bit flip happened in memory and not elsewhere?
      Either way, these systems should have triple redundancy for these signal corruption cases. Only accept input that two sources agree on.
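That 2-out-of-3 scheme is classic triple modular redundancy. A minimal sketch of the voter (illustrative only; real avionics voters also handle timing, staleness and channel health):

```python
def vote(a, b, c):
    """2-out-of-3 majority voter: accept a value only when at least two
    independent channels agree; otherwise flag a fault."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("no two channels agree: channel fault detected")
```

A single bit flip corrupts at most one channel, so the two healthy channels outvote it; the harder failure mode, discussed downthread, is two channels producing the same wrong value.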

      • Airbus does have triple redundancy in all of their fly by wire aircraft, but it can happen (and has actually happened at least once) that two sources return the same defective data.

        • by Viol8 ( 599362 )

          The chances of 2 separate cosmic ray events flipping the exact same bits in program code or its data at the exact same time, causing the computers to return the same defective result, are so infinitesimally small that it can be discounted as a realistic scenario.

          If this was a cosmic ray then it clearly affected part of the avionics that didn't have triple redundancy. Perhaps they should be looking at that.

        • That is statistically impossible in a bit flip scenario.
          You couldn't do it in a googol ages of the universe.
          • As I said, it did happen before.
            https://en.wikipedia.org/wiki/... [wikipedia.org]
            A bit flip separately might not be a big deal. A bit flip coupled with unexpected hardware or software limitations can break things that seem impossible to break.

              A bit flip separately might not be a big deal. A bit flip coupled with unexpected hardware or software limitations can break things that seem impossible to break.

              So you're not proposing two equal bit flips on 2 computers; you're describing a bit flip whose end result mirrors that of a defective piece of hardware that would have been a compared-against value. I can buy that.

    • Re:No ECC? (Score:4, Interesting)

      by monkeyxpress ( 4016725 ) on Monday December 08, 2025 @07:59AM (#65842975)

      Consumer grade memory just takes bit flips, but ECCs do exist. Do you mean to tell me they don't use them at Airbus? -dk

      This is an embedded system in a high reliability environment. The way these things work is keep-it-simple to an absurd level. I bet you this is some dinky 8-bit RISC CPU that's built on a crazy big process node, and the production QC trace on it will be insane. On these sorts of systems, if you want ECC, you add it to the firmware, but only in the areas you need it, and only after a thorough analysis of (a) the problem it is solving (b) the amount of ECC required to solve that problem (c) the best algorithm to meet the identified objectives. There are many ways to do ECC - including just duplicating variables n number of times - which has the advantage of being very easy to implement and formally verify while being less efficient at RAM utilisation vs a Hamming Code, but even that depends on the statistics of your error conditions.
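For comparison, the Hamming-code option mentioned above fits in a few lines: a Hamming(7,4) code stores 4 data bits in 7, and the 3-bit syndrome directly names the position of any single flipped bit (illustrative sketch, not avionics code):

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword (positions 1, 2, 4 are parity)."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(c):
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based error position, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

The trade-off the comment describes in RAM terms: triplicating a variable costs 3x storage for single-error protection, while Hamming(7,4) costs 1.75x but is harder to implement and formally verify.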

      The point is that, sure, they could add some generic hardware ECC, but that ECC can fail (if there are too many bit flips, if the ECC logic itself gets bit flipped, or there is a design error for a particular input sequence, etc etc). Maybe you win out overall, maybe you don't - the problem is that you'd have to run a complete analysis to know. That means you now have to add ECC hardware failure modes to pieces of software that did not need ECC before. I mean, sure, maybe you win, but maybe you make it worse, and have to develop extra software to deal with the new hardware failure modes. Whatever the outcome, you'll have to do a boatload more documentation to make sure.

      I bet you it took them less than a day to identify a fix for the code and update it. It would have then been thousands of hours of work to update all the documentation and thoroughly verify the new code against all the other requirements on the system.

      If you want a good example of how quickly these supposedly simple systems can get complicated, look into the CAN bus CRC bug. This fault is present on EVERY system that uses the CAN bus (basically any vehicle since the 1990s). It is an extremely subtle bug involving the error detection system that is obvious once you're shown it, but the very smart people who designed it, along with thousands of engineers who worked with it, didn't spot it for around a decade. Even worse, when they developed CAN 2.0, they tried to fix the bug and didn't even get that right.

      • If you want a good example of how quickly these supposedly simple systems can get complicated, look into the CAN bus CRC bug.

        It's not simple to figure out what you're talking about; a search doesn't return anything obvious through the flurry of marketing content.

        This fault is present on EVERY system that uses the CAN bus

        It applies to every CAN standard? There's like five of them.

        basically any vehicle since the 1990s

        Since after the 1990s, you mean? While there were a few CAN vehicles in the 1990s, it didn't really become popular until the 2000s because the interface chips were still relatively expensive.

        • Search for 'Multi-Bit Error Vulnerabilities in the Controller Area Network Protocol'. (It's a thesis by Eushiuan Tran)

          This issue is quite subtle, but essentially: because the CRC is applied before bit-stuffing, a single bit error on the wire can cascade into multiple errors that exceed the detection limit of the CRC. The potential for this is fortunately rare, but it's like having holes in your bulletproof vest.

          This is why CAN FD (apologies, I said 2.0 in the previous message) includes the stuff bits in the CRC calculation.
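The cascade is easy to reproduce. In CAN, a stuff bit of opposite polarity is inserted after every run of five identical bits; this sketch of de/stuffing (simplified from the real protocol, which only stuffs certain frame fields) shows how one channel error can turn into a multi-bit discrepancy after destuffing:

```python
def stuff(bits):
    """CAN-style bit stuffing: insert a complementary bit after every
    run of five identical bits."""
    out, run = [], 0
    for b in bits:
        run = run + 1 if out and b == out[-1] else 1
        out.append(b)
        if run == 5:
            out.append(1 - b)       # stuff bit
            run = 1
    return out

def destuff(bits):
    """Receiver side: after five identical bits, discard the next bit."""
    out, run, prev, expect_stuff = [], 0, None, False
    for b in bits:
        if expect_stuff:
            expect_stuff = False
            prev, run = b, 1        # discarded stuff bit starts a new run
            continue
        out.append(b)
        run = run + 1 if b == prev else 1
        prev = b
        if run == 5:
            expect_stuff = True
    return out

payload = [1, 1, 1, 1, 1, 0, 1, 0]
wire = stuff(payload)               # stuff bit inserted after the five 1s
wire[4] ^= 1                        # one single-bit channel error...
received = destuff(wire)            # ...and the destuffed frame now differs
                                    # from the payload in several positions
```

Because the flipped bit breaks up the run of five, the receiver no longer discards the stuff bit, so every subsequent bit lands in the wrong position: a burst of apparent errors from a single upset, which is exactly what defeats a CRC computed over the unstuffed payload.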

  • I can't bring a ton of shampoo, nor a pair of scissors. Certain laptops or batteries. Now, it's looking like my homemade cosmic ray simulator won't be making it onboard with me...

    • by mjwx ( 966435 )

      I can't bring a ton of shampoo, nor a pair of scissors. Certain laptops or batteries. Now, it's looking like my homemade cosmic ray simulator won't be making it onboard with me...

      LAG restrictions have been lessened or even gotten rid of in Australia and the UK. Air travel is not as bad outside the US.

      Batteries are becoming a problem for airlines because people are entitled fuckwits and won't follow basic instructions (I MUST charge my phone no matter what people tell me) as they keep bringing damaged batteries on board which conflagrate. So they're getting banned at the insistence of airlines rather than governments.

  • Nice write up. I was already vaguely aware of the amount of rigor around avionics development, which is why I was surprised to see how fast this update was rolled out -- that with all that rigor they still got the change out quickly. Hopefully someone thought about the relative risks of a quick update introducing an actual bug vs. a repeat of a (theoretical) bit flip.
  • Wasn't there a ton of solar flare activity causing auroras? That's more likely the cause than cosmic rays.

  • Not great, not terrible.

  • about cosmic rays. Starlink is building a shield to protect Earth from all cosmic rays.
  • This cosmic bit flip thing stinks of bullshit. Especially when a cosmic physics problem is somehow solved by a software reversion.

    Seems like bad code to me.

    Also, if the bit flip is possible, then it's a design error for failing to use ECC RAM.

    I are so smart.

  • because careering neutrons leave no trace of their activity behind

    It's always this. Neutrons are "the little MBAs" of the subatomic world, and they chew through role after role so quickly that it can be dizzying to trace. Compounding the issue is that most subatomic particles don't take the time to fill out their LinkedIn profiles.

  • I used to program FADECs (digital jet engine controls) back in the '90s. The computer had a watchdog background process to immediately reset it to a known state if even the slightest abnormality was detected. From what I can tell, they have NFI what happened and are using a PR magic 8-ball until they find the root cause in their overly complicated, bloated, poorly documented software/hardware stack. Where is Richard Feynman when you need him... if I were an active trader, I would short Airbus
  • Some had silicon on sapphire.
  • Some cosmic rays interfered with my electronic speedometer. It told me I was driving exactly the speed limit. Honest!
