573

Once again I was in a design review, and encountered the claim that the probability of a particular scenario was "less than the risk of cosmic rays" affecting the program, and it occurred to me that I didn't have the faintest idea what that probability is.

"Since 2-128 is 1 out of 340282366920938463463374607431768211456, I think we're justified in taking our chances here, even if these computations are off by a factor of a few billion... We're way more at risk for cosmic rays to screw us up, I believe."

Is this programmer correct? What is the probability of a cosmic ray hitting a computer and affecting the execution of the program?

Robert Harvey
  • 168,684
  • 43
  • 314
  • 475
Mark Harrison
  • 267,774
  • 112
  • 308
  • 434
  • 2
    It's finite, but infinitesimally small – ChrisF Apr 05 '10 at 20:23
  • 2
    I usually use "less than the probability that you'll win the lottery" for this sort of thing - shuts people right up. – MusiGenesis Apr 05 '10 at 20:26
  • 45
    *"Winning Lotteries: What is the probability they will affect a program?"* – kennytm Apr 05 '10 at 20:29
  • 28
    It depends in part on where the program is being executed and how well it's shielded. On Earth, the cosmic ray flux is much lower than in deep space, or even near Earth orbit. The Hubble Space Telescope, for instance, produces raw images that are riddled with cosmic ray traces. – Adam Hollidge Apr 05 '10 at 20:30
  • 5
    It happens every time Elvis uses the system. – Brian Rasmussen Apr 05 '10 at 20:35
  • 2
    @KennyTM, that might be: *Winning Lotteries: what is the probability it has on whether or not I care about the program being affected*! – Mark Harrison Apr 05 '10 at 20:37
  • 1
    I read this and thought, "this *has* to be a joke... alongside the 'how do I control a program that has become sentient' post." You can't imagine my surprise in finding out that cosmic rays actually can cause a computer error... /begins wrapping everything including constants in try/catch blocks – Carson Myers Apr 05 '10 at 20:37
  • 1
    @Carson: What if your `try/catch` block errors due to cosmic ray? :p – kennytm Apr 05 '10 at 20:39
  • 4
    OK - I revise my original comment to be "It's finite, but *not* infinitesimally small" – ChrisF Apr 05 '10 at 20:40
  • 91
    Does this mean that from now on, when someone next asks about `finally` blocks, we'll have to qualify it with "always executes *except* if the program exits, *or* if it gets hit with a cosmic ray"? – skaffman Apr 05 '10 at 20:42
  • 2
    I think so. And @KennyTM, you're right. I suppose I will have to either use a cosmic ray disclaimer in my software or give up and work at McDonalds. Oh god, what if a cosmic ray hits the till! – Carson Myers Apr 05 '10 at 20:56
  • @Adam I would imagine that most of the cosmic ray traces in the HST images are from cosmic protons and shower debris hitting the detector elements rather than flipping bits in the digital logic, though that can certainly happen. – dmckee --- ex-moderator kitten Apr 05 '10 at 21:08
  • 4
    Various bits of computer logic used in particle physics triggers are subjected to very high levels of ionizing radiation. Some equipment seems to be more sensitive than other bits, but I've worked on experiments where the board crash rate correlated pretty well with the radiation level in the hall. One set of 64 MB, 68040 VME crates seemed to be particularly sensitive. We'd get one crash every few board-days at typical running intensities (millions of times the sea-level cosmic-ray background level). – dmckee --- ex-moderator kitten Apr 05 '10 at 21:14
  • 75
    Working on a prototype particle detector years ago, I programmed it to print "ouch!" every time it was hit by a cosmic ray. Good times... – Beta Apr 05 '10 at 21:42
  • 8
    One of the most interesting questions I've read here in a while. A real eye-opener. Count on me to re-open. – Agnel Kurian Apr 06 '10 at 06:08
  • 1
    "Probability is less than x", true, but you already know what the probability is - 1 out of 340282366920938463463374607431768211456, so you don't really need to know x. Nevertheless, it's an interesting question. – Daniel Daranas Apr 06 '10 at 07:31
  • @dmckee -- Right. I was only commenting on the relative preponderance of cosmic rays in space vs. on the surface of Earth, not bit-flips vs. detector traces. The implicit assumption in most responses is that the software is Earth-bound, but the question didn't specify. – Adam Hollidge Apr 06 '10 at 14:04
  • 2
    Although the question is interesting, this is not a programming question (something that could be answered with code). It is software development related, yes, but it doesn't quite qualify for Stack Overflow in my opinion. – OscarRyz Apr 06 '10 at 15:29
  • 1
    I have not voted either way on the open/close issue here. This is fundamentally a hardware issue, but it does go to requirements in the case of very high reliablity systems and/or high-altitude and space-borne platforms. It also leads to a discussion of error recovery and robustness in a programming context, but *that* should go on a blog. – dmckee --- ex-moderator kitten Apr 06 '10 at 18:00
  • 2
    @dmckee... since the developer made the claim with regards to a software method being statistically more reliable than X, and since this is a claim we have all heard for many years regarding software methods, I thought it was appropriate to address as a software design issue. In addition, the fact that ECC memory will correct for single-bit errors on a chip, so that there would have to be two cosmic ray influenced transient memory errors on the same chip, may lead to the conclusion that many "less likely than cosmic ray errors" will prove to be false. – Mark Harrison Apr 06 '10 at 18:47
  • 3
    +1: interesting if somewhat eclectic question. – Juliet Apr 07 '10 at 17:41
  • @MusiGenesis "less than the probability that you'll win the lottery" is rather weak, lottery odds tend to be in the 2^-24 ballpark. – starblue Apr 07 '10 at 20:08
  • @starblue: unless the lottery is rigged, of course. And I can't say it isn't, since neither I nor anyone I know has ever won it. :) – MusiGenesis Apr 08 '10 at 00:49
  • @MusiGenesis *Way* less than the probability that I will kill you. – Mateen Ulhaq Mar 04 '11 at 01:35
  • @muntoo: unfortunately, I can't really estimate that probability, other than to say it's somewhere between 0 and 1. – MusiGenesis Mar 04 '11 at 01:40
  • 1
    The stuff I encounter on SO. It all just blows my mind. A good blow-up though ;) – bad_keypoints Aug 22 '12 at 06:02
  • 7
    There exists a CosmicRayInteruptionException? :D ;) – Milindu Sanoj Kumarage May 13 '15 at 16:18
  • Err... I was just about to design an editor based on [this technology](https://xkcd.com/378/)... ;-) – anishsane May 14 '15 at 03:54

15 Answers

317

From Wikipedia:

Studies by IBM in the 1990s suggest that computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month.[15]

This means a probability of 3.7 × 10⁻⁹ per byte per month, or 1.4 × 10⁻¹⁵ per byte per second. If your program runs for 1 minute and occupies 20 MB of RAM, then the failure probability would be

    1 - (1 - 1.4 × 10⁻¹⁵)^(60 × 20 × 1024²) ≈ 1.8 × 10⁻⁶    a.k.a. "5 nines"

Error checking can help to reduce the aftermath of a failure. Also, because of the more compact size of chips (as commented by Joe), the failure rate could be different from what it was 20 years ago.
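
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the same estimate (the 20 MB / 1 minute figures are just the example values above, and the whole thing assumes the IBM rate and independent bit flips):

    # IBM figure: ~1 cosmic-ray-induced error per 256 MB of RAM per month.
    SECONDS_PER_MONTH = 30 * 24 * 3600

    p_byte_month = 1 / (256 * 1024**2)                # ~3.7e-9 per byte per month
    p_byte_second = p_byte_month / SECONDS_PER_MONTH  # ~1.4e-15 per byte per second

    ram_bytes = 20 * 1024**2    # 20 MB resident
    runtime_s = 60              # 1 minute

    # Probability that at least one of ram_bytes * runtime_s byte-seconds is hit.
    p_failure = 1 - (1 - p_byte_second) ** (ram_bytes * runtime_s)
    print(f"{p_failure:.2e}")   # ~1.8e-06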

kennytm
  • 469,458
  • 94
  • 1,022
  • 977
  • 4
    Improved error checking? Back when that study was published, most personal computers had a parity bit on each byte of memory. Now error control circuitry on memory systems is generally found only on server-level machines (as far as I know), and not even on all server machines. However, when there is error circuitry on memory systems today, it's generally ECC instead of just parity. – Michael Burr Apr 05 '10 at 20:45
  • 13
    More importantly, the chip feature size for CPUs in 1995 was around 0.35 µm or 350nm. It's now 1/10th that size at 35nm. – Joe Koberg Apr 05 '10 at 21:02
  • 22
    Is it possible that instead of reducing risk, decreased size would increase risk since it would take less energy to change the state of each bit? – Robert Apr 05 '10 at 21:59
  • @Robert: why would it take less energy? Anyway, the energy of a cosmic ray is so high that this factor isn't important, I think. – kennytm Apr 05 '10 at 23:08
  • 67
    Reduced size definitely increases risk. Hardened processors for space vehicles use very large feature sizes to avoid cosmic ray effects. – Joe Koberg Apr 05 '10 at 23:27
  • 25
    Not just cosmic rays, radioactive isotopes in the materials used in the chip are a much bigger problem. Makers go to huge lengths to make sure the silicon, solder, encapsulation etc doesn't contain any alpha or beta emitters. – Martin Beckett Apr 06 '10 at 03:12
  • 2
    And then there's the fact that the chip sizes actually _grow_, despite the fact that feature sizes shrink. I suppose that with bigger chips and cosmic rays it's the same as with bigger sails and wind? – sbi Apr 06 '10 at 14:17
  • 1
    It's sad to see more error tolerant processors (such as SPARC, et al.) go by the wayside. They have all kinds of nifty self-correcting mechanisms built in for such things. Oh well, it seems like the x86 architecture is finally noticing this issue and is starting to design for it too. – Brian Knoblauch Apr 06 '10 at 17:27
  • 18
    Wow! This means that about 1 byte in my PC gets corrupted every two days. – Stefan Monov Sep 26 '10 at 07:10
  • 7
    One thought I have about the risk to user machines is that, on s typical user machine, the objects taking up the most RAM, and therefore the objects most likely to be damaged by cosmic rays, are large media entities like sound, video, and images. For these objects, s flipped bit will have no discernable effect on the computer's performance or what the user sees. So the probability that a bit is flipped is much greater than the probability that a cosmic ray causes dangerous behavior. – Kevin May 13 '15 at 17:51
  • Also, spaceships or satellites calculate everything twice and then compare the calculation results, because cosmic rays are heavier up there. Important parts get calculated three or more times simultaneously in case one of them breaks down – BlueWizard May 15 '15 at 10:12
  • My survey paper confirms what @JoeKoberg says https://www.academia.edu/12046032/A_Survey_of_Techniques_for_Modeling_and_Improving_Reliability_of_Computing_Systems – user984260 Jun 26 '15 at 12:11
  • 5
    "5 nines" would typically mean 0.99999 not 0.00001 – Piotr Falkowski Jul 30 '18 at 17:21
  • 1
    @Kevin, a single flipped bit can actually render a JPG image into partial corruption. https://en.wikipedia.org/wiki/Data_degradation#Visual_example – Sam Sirry Sep 23 '19 at 13:08
  • I would certainly never call `0.0000018` "5 nines", indeed, it has no nines whatsoever. – Apollys supports Monica Jan 22 '20 at 18:22
  • @Kevin I'm 5 years late but I love how you wrote "s flipped bit" to demonstrate how even two flipped bits ('a'->'s') might not make much difference on the readability of text. – JiK Jan 23 '20 at 13:23
  • So today's servers with 32 GB of RAM have a probability of 4% of bit flips PER DAY? That's why you should use ECC RAM. – Klaus Jul 23 '20 at 10:03
97

Apparently, not insignificant. From this New Scientist article, a quote from an Intel patent application:

"Cosmic ray induced computer crashes have occurred and are expected to increase with frequency as devices (for example, transistors) decrease in size in chips. This problem is projected to become a major limiter of computer reliability in the next decade. "

You can read the full patent here.

ire_and_curses
  • 64,177
  • 22
  • 110
  • 139
  • 9
    Why do they increase with a decrease in size of the chip? Surely a smaller object is less likely to be hit by a ray (i.e. compare throwing a tennis ball at a wall, to throwing it at a stamp) – Jonathan. May 13 '15 at 16:28
  • 11
    Because as the size of components shrinks, they become much more sensitive to cosmic ray hits. – ire_and_curses May 13 '15 at 17:14
  • 6
    Yes, smaller equals less likely to be hit, but more likely that the hit will affect the state. – John Hascall May 13 '15 at 18:09
  • 2
    @ire_and_curses [citation needed] – Anko May 13 '15 at 20:15
  • 10
    @Anko - It's kind-of obvious. As a given component gets smaller, it needs less voltage and less charge to set a bit. That makes it more sensitive to being blasted with energy from outer space. However, here's a citation for you: [*As LSI memory devices become smaller, they become more sensitive to nuclear-radiation-induced soft fails.*](http://www.pld.ttu.ee/IAF0030/curtis.pdf) – ire_and_curses May 13 '15 at 21:18
  • @Jonathan They take up the same space on the die, so the area is roughly the same, just broken up into smaller bits, which, as has been stated, have a higher susceptibility to this sort of thing. One could imagine, though, that the per-bit likelihood has been greatly reduced, but that the per-chip risk has increased. – wedstrom May 10 '16 at 22:40
  • It's like a battle cruiser going through a hail of ping pong balls, compared to a tiny bird trying to get through them - would you rather be the cruiser or the bird? The cruiser has no chance of avoiding being hit, but... – Cato Aug 30 '16 at 10:20
  • it's now the end of the decade! – yekanchi May 20 '20 at 22:08
66

Note: this answer is not about physics, but about silent memory errors with non-ECC memory modules. Some of the errors may come from outer space, and some from the inner space of the desktop.

There are several studies of ECC memory failures on large server farms like CERN clusters and Google datacenters. Server-class hardware with ECC can detect and correct all single bit errors, and detect many multi-bit errors.

We can assume that there are lots of non-ECC desktops (and non-ECC mobile smartphones). If we check the papers for ECC-correctable error rates (single bit flips), we can estimate the silent memory corruption rate on non-ECC memory.

  • Large-scale CERN 2007 study "Data integrity": vendors declare a "Bit Error Rate of 10⁻¹² for their memory modules", while "an observed error rate is 4 orders of magnitude lower than expected". For data-intensive tasks (8 GB/s of memory reading) this means that a single bit flip may occur every minute (at the vendors' 10⁻¹² BER) or once in two days (at a 10⁻¹⁶ BER).

  • The 2009 Google paper "DRAM Errors in the Wild: A Large-Scale Field Study" says that there can be up to 25,000-75,000 one-bit FIT per Mbit (failures in time per billion hours), which by my calculations (see the sketch below) works out to 1-5 bit errors per hour for 8 GB of RAM. The paper says the same: "mean correctable error rates of 2000–6000 per GB per year".

  • The 2012 Sandia report "Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing": "double bit flips were deemed unlikely", but at ORNL's dense Cray XT5 they occur "at a rate of one per day for 75,000+ DIMMs" even with ECC. And single-bit error rates should be higher.

So, if the program has a large dataset (several GB), or a high memory read or write rate (GB/s or more), and it runs for several hours, then we can expect up to several silent bit flips on desktop hardware. This rate is not detectable by memtest, even when the DRAM modules are good.
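
To sanity-check the FIT arithmetic quoted above, here is a minimal Python sketch (the 25,000-75,000 FIT/Mbit range is the Google figure; 8 GB is just the example size):

    # FIT = failures per 10^9 device-hours; the Google rates are per Mbit of DRAM.
    def errors_per_hour(fit_per_mbit, ram_gib):
        mbits = ram_gib * 1024 * 8            # GiB -> Mbit
        return fit_per_mbit * mbits / 1e9

    for fit in (25_000, 75_000):
        print(f"{fit} FIT/Mbit on 8 GiB: {errors_per_hour(fit, 8):.1f} errors/hour")
    # -> roughly 1.6 and 4.9 correctable errors per hour, i.e. the "1-5 per hour" above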

Long cluster runs on thousands of non-ECC PCs, like BOINC internet-wide grid computing, will always have errors from memory bit flips and also from silent disk and network errors.

And for bigger machines (10 thousand servers), even with ECC protection from single-bit errors, as we see in Sandia's 2012 report, there can be double-bit flips every day, so you will have no chance to run a full-size parallel program for several days (without regular checkpointing and restarting from the last good checkpoint in case of a double error). The huge machines will also get bit flips in their caches and CPU registers (both architectural registers and internal chip flip-flops, e.g. in the ALU datapath), because not all of them are protected by ECC.

PS: Things will be much worse if the DRAM module is bad. For example, I installed new DRAM into a laptop, which died several weeks later. It started to give a lot of memory errors. What I got: the laptop hangs, Linux reboots, runs fsck, finds errors on the root filesystem, and says that it wants to reboot after correcting the errors. But on every subsequent reboot (I did around 5-6 of them) errors were still found on the root filesystem.

osgx
  • 80,853
  • 42
  • 303
  • 470
  • 11
    Additional material from BH 2011: "Bitsquatting. DNS hijacking without exploitation" https://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinaburg_Bitsquatting_WP.pdf lists modern multi-GB DRAMs as having around 10000-30000 FIT/Mbit (less than 100 hours between errors for every 128 MB). The paper also lists articles which conclude that [most soft errors](http://en.wikipedia.org/wiki/Soft_error) are from radiation, almost all cases from cosmic rays, and some cases from alpha emitters inside the PC. The BH authors ran an experiment and got 50000 accesses to domains that have 1 bit changed from popular sites – osgx May 14 '14 at 10:45
  • Kudos for adding more recent studies here. Given the dynamics of SO voting and how they are accumulated over time, it is alas difficult to have an up-to-date presentation on this topic stand out (here). – Fizz Jan 12 '15 at 14:06
  • We had a similar problem. We didn't do any exact study, but we had quite a few crash dumps with visible bit flips. We checked those bit flips and it turned out they were in the code section. We compared with what should be there, and it did not look like deliberate modification (i.e. the resulting instructions did not make much sense). In the end we had a simple application that compared crash dumps against (archived) released versions and filtered out such cases. Interestingly, I think most of such cases were coming from Iran, Arabia and I think one more country from South America (don't remember now). – GiM Aug 17 '17 at 06:25
  • 2
    In Google's paper it looks more like a case that some RAM is bad: _About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year. Our per-DIMM rates of correctable errors translate to an average of 25,000–75,000 FIT (failures in time per billion hours of operation) per Mbit and a median FIT range of 778 – 25,000 per Mbit (median for DIMMs with errors), while previous studies report 200-5,000 FIT per Mbit. The number of correctable errors per DIMM is highly variable, with some DIMMs experiencing a huge number of errors, compared to others._ – vartec Sep 18 '17 at 15:52
31

Wikipedia cites a study by IBM in the 90s suggesting that "computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month." Unfortunately the citation was to an article in Scientific American, which didn't give any further references. Personally, I find that number to be very high, but perhaps most memory errors induced by cosmic rays don't cause any actual or noticeable problems.

On the other hand, people talking about probabilities when it comes to software scenarios typically have no clue what they are talking about.

JesperE
  • 59,843
  • 19
  • 133
  • 192
  • I guess they should be more clear about "one cosmic-ray-induced error"... If I had to guess, I would say one flipped bit over an array of 256MB ram per month. – wtaniguchi Apr 05 '10 at 20:33
  • 7
    The probability of a bit being flipped must be multiplied by the probability of the bit having a noticeable effect on the program. I'm guessing the second probability is a lot lower than you think. – Mark Ransom Apr 05 '10 at 21:18
  • 2
    @Mark: Typical computer programs don't have that kind of fault-tolerance built-in. A single-bit error in the program code will more likely than not crash the program, if the broken code is executed. – Robert Harvey Apr 06 '10 at 03:05
  • 76
    Yes, but most of the memory contains data, where the flip won't be that visiblp. – zoul Apr 06 '10 at 05:01
  • 2
    @Robert Harvey, not only will most of the program be data, but much of the actual program will be executed rarely if ever. Think about how tough it is to get 100% code coverage for testing. Also some instruction changes might be very subtle. Combine all those and the probabilities start getting very low. – Mark Ransom Apr 06 '10 at 13:51
  • 35
    @zoul. lol at 'visiblp', but if e=1100101 and p=1110000 then you're the unfortunate victim of *3* bit flips! – PaulG Apr 08 '10 at 08:47
  • 10
    @Paul: or *one* finger blip. – mpen May 09 '10 at 07:29
  • 1
    If you want a more competent presentation with full citations see Ray Heald's ["How Cosmic Rays Cause Computer Downtime"](http://www.ewh.ieee.org/r6/scv/rl/articles/ser-050323-talk-ref.pdf). – Fizz Jan 12 '15 at 14:01
  • It sounds a bit too high, of course finding a PC to run for a month with 10 x 256 Mb blocks of data is not actually that hard to figure out, all you need is a regular pattern in the data that can be quickly checked (each byte being zero is a pattern). You could even check for errors at shutdown, then kick it off again during run-time. If you get no errors after 3 months of up-time, BS can then be called. At 10^-9 – Cato Aug 30 '16 at 10:31
31

Well, cosmic rays apparently caused the electronics in Toyota cars to malfunction, so I would say that the probability is very high :)

Are cosmic rays really causing Toyota woes?

Community
  • 1
  • 1
Kevin Crowell
  • 9,476
  • 4
  • 32
  • 51
  • 24
    "Federal regulators are studying whether sudden acceleration in Toyotas is linked to cosmic rays." This is why you should never give federal regulators power over your life. –  Apr 05 '10 at 20:33
  • 13
    I guess the theory here is that cosmic rays are flipping bits in older brains causing them to malfunction and press the wrong pedal. – Knox Apr 05 '10 at 20:43
  • 16
    "Apparently"? I'd say that's a wild guess at this point. My own wild guess is that this phenomenon is a result of that old nightmare of embedded systems (actually most complex computer systems) - the race condition. – Michael Burr Apr 05 '10 at 20:49
  • 7
    @Knox: Get out your old tinfoil hat, it *is* useful! –  Apr 06 '10 at 05:31
  • @Kevin: Comments are appropriate for jokes, not answers. This does not even attempt to answer the question. –  Apr 06 '10 at 06:03
  • @Roger Providing possible evidence, no matter how far-fetched it may be, does not help answer the question? – Kevin Crowell Apr 06 '10 at 06:40
  • 3
    It may not be a joke. I've seen some seriously weird stuff like that happen before. Not as rare as most people think. – Brian Knoblauch Apr 06 '10 at 17:32
  • 1
    @Roger: There's quite a tradition of humorous answers being well taken and up-voted on SO. (Heck, there's even been a tradition of humorous _questions_. Sadly, this has been stopped by the closing police.) – sbi Apr 07 '10 at 09:10
  • @Brian: The OP's now-deleted comment (along the lines of "it is a very relevant joke!") indicates the spirit in which it was intended. @sbi: There's a stronger tradition and convention for jokes in comments, and I find it mildly offensive to post such noise answers on questions seriously asked in good faith. I'll willingly downvote any such "not useful" answers, but this one didn't even try to answer the question. –  Apr 07 '10 at 09:17
27

With ECC you can correct the 1-bit errors caused by cosmic rays. In order to avoid the 10% of cases where cosmic rays result in 2-bit errors, the ECC cells are typically interleaved over chips so that no two cells are next to each other. A cosmic ray event which affects two cells will therefore result in two correctable 1-bit errors.

Sun states: (Part No. 816-5053-10 April 2002)

Generally speaking, cosmic ray soft errors occur in DRAM memory at a rate of ~10 to 100 FIT/MB (1 FIT = 1 device fail in 1 billion hours). So a system with 10 GB of memory should show an ECC event every 1,000 to 10,000 hours, and a system with 100 GB would show an event every 100 to 1,000 hours. However, this is a rough estimation that will change as a function of the effects outlined above.
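
To turn Sun's figures into concrete numbers, here is a minimal Python sketch (10 GB and 100 GB are the sizes from the quote; the FIT range is Sun's estimate, not a measurement of any particular system):

    # 1 FIT = 1 failure per 10^9 device-hours; Sun quotes ~10-100 FIT per MB of DRAM.
    def hours_between_ecc_events(fit_per_mb, ram_gb):
        total_fit = fit_per_mb * ram_gb * 1024        # GB -> MB
        return 1e9 / total_fit

    for ram_gb in (10, 100):
        worst = hours_between_ecc_events(100, ram_gb)   # pessimistic rate
        best = hours_between_ecc_events(10, ram_gb)     # optimistic rate
        print(f"{ram_gb} GB: one ECC event every {worst:,.0f} to {best:,.0f} hours")
    # -> ~1,000-10,000 hours for 10 GB and ~100-1,000 hours for 100 GB, matching the quote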

eckes
  • 9,350
  • 1
  • 52
  • 65
17

Memory errors are real, and ECC memory does help. Correctly implemented ECC memory will correct single bit errors and detect double bit errors (halting the system if such an error is detected.) You can see this from how regularly people complain about what seems to be a software problem that is resolved by running Memtest86 and discovering bad memory. Of course a transient failure caused by a cosmic ray is different to a consistently failing piece of memory, but it is relevant to the broader question of how much you should trust your memory to operate correctly.
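
Real DIMM ECC is a SECDED code computed over 64-bit words, but a minimal Hamming(7,4) sketch in Python (textbook bit layout, not any particular memory controller's) shows how a single flipped bit is located and corrected:

    def hamming74_encode(d1, d2, d3, d4):
        """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
        p1 = d1 ^ d2 ^ d4          # parity over positions 1,3,5,7
        p2 = d1 ^ d3 ^ d4          # parity over positions 2,3,6,7
        p3 = d2 ^ d3 ^ d4          # parity over positions 4,5,6,7
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_correct(code):
        """Recompute the parities; the syndrome is the 1-based position of a flipped bit."""
        c = list(code)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s3
        if syndrome:
            c[syndrome - 1] ^= 1   # flip the corrupted bit back
        return c[2], c[4], c[5], c[6]       # the recovered data bits

    word = hamming74_encode(1, 0, 1, 1)
    word[5] ^= 1                            # simulate a cosmic-ray bit flip
    assert hamming74_correct(word) == (1, 0, 1, 1)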

An analysis based on a 20 MB resident size might be appropriate for trivial applications, but large systems routinely have multiple servers with large main memories.

Interesting link: http://cr.yp.to/hardware/ecc.html

The Corsair link in the page unfortunately seems to be dead.

Peter Mortensen
  • 28,342
  • 21
  • 95
  • 123
janm
  • 16,989
  • 1
  • 40
  • 60
  • Cosmic ray bitflips may not be uniformly distributed, particularly if we include solar storms under the "cosmic ray events"-umbrella. If you got two or more bitflips within the same byte, the typical ECC won't be able to correct the error. – tobixen Nov 15 '17 at 09:48
  • @tobixen Detecting a double bit error is better than continuing to run with bad data. The next step after ECC is Chipkill with DIMM mirroring ... – janm Nov 16 '17 at 05:33
15

This is a real issue, and that is why ECC memory is used in servers and embedded systems. And why flying systems are different from ground-based ones.

For example, note that Intel parts destined for "embedded" applications tend to add ECC to the spec sheet. A Bay Trail for a tablet lacks it, since it would make the memory a bit more expensive and possibly slower. And if a tablet crashes a program every once in a blue moon, the user does not care much; the software itself is far less reliable than the HW anyway. But for SKUs intended for use in industrial machinery and automotive, ECC is mandatory, since there we expect the SW to be far more reliable, and errors from random upsets would be a real issue.

Systems certified to IEC 61508 and similar standards usually have both boot-up tests that check that all RAM is functional (no bits stuck at zero or one), as well as error handling at runtime that tries to recover from errors detected by ECC, and often also memory scrubber tasks that go through and read and write memory continuously to make sure that any errors that occur get noticed.
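
As a toy illustration only (a real IEC 61508 boot-up test runs against the physical RAM, below any OS or language runtime, which a Python script obviously cannot reach), the walking-ones pattern check looks roughly like this:

    def walking_ones_test(buf: bytearray) -> bool:
        """Write each single-bit pattern to every byte and read it back.
        A bit stuck at 0 or 1 in the buffer makes some pattern fail to read back."""
        for pattern in (1 << bit for bit in range(8)):     # 0x01, 0x02, ..., 0x80
            for i in range(len(buf)):
                buf[i] = pattern
            if any(byte != pattern for byte in buf):
                return False
        return True

    print(walking_ones_test(bytearray(1024)))   # True unless the buffer misbehaves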

But for mainstream PC software? Not a big deal. For a long-lived server? Use ECC and a fault handler. If an uncorrectable error kills the kernel, so be it. Or you go paranoid and use a redundant system with lock-step execution so that if one core gets corrupted, the other one can take over while the first core reboots.

jakobengblom2
  • 4,806
  • 2
  • 23
  • 30
  • Cosmic ray bitflips may not be uniformly distributed, particularly if we include solar storms under the "cosmic ray events"-umbrella. A sudden burst may cause several bitflips within a byte, and ECC-algorithms won't be able to correct a failure. – tobixen Nov 15 '17 at 09:47
12

If a program is life-critical (it will kill someone if it fails), it needs to be written in such a way that it will either fail-safe, or recover automatically from such a failure. All other programs, YMMV.

Toyotas are a case in point. Say what you will about a throttle cable, but it is not software.

See also http://en.wikipedia.org/wiki/Therac-25

Robert Harvey
  • 168,684
  • 43
  • 314
  • 475
  • Nevermind the software for throttles. The sensors and wiring for the throttles are the weak point. My Mitsubishi throttle position sensor failed into a random number generator... No unintended acceleration, but it sure didn't do anything good for the fuel mixture! – Brian Knoblauch Apr 06 '10 at 17:31
  • 3
    @Brian: Good software would have figured out that the data points were discontinuous, and concluded that the data was bad. – Robert Harvey Apr 06 '10 at 19:19
  • ..and then what... Good data is required. Knowing it's bad doesn't help any. Not something you can magically work around. – Brian Knoblauch Apr 06 '10 at 19:50
  • 4
    @Brian: Well, for one thing, you can take corrective action based on the knowledge that your data is bad. You can stop accelerating, for instance. – Robert Harvey Apr 06 '10 at 20:29
  • Yes, you can (and should) checksum data, best end-to-end. However this only reduces the chance of corruption. Imagine your "is this valid" instruction gets the bit corrupted in memory or a CPU register just when you want to branch to the error handler. – eckes Sep 26 '13 at 02:25
11

I once programmed devices which were to fly in space, and then you (supposedly; no one ever showed me any paper about it, but it was said to be common knowledge in the business) could expect cosmic rays to induce errors all the time.

erikkallen
  • 31,744
  • 12
  • 81
  • 116
  • 8
    Above the atmosphere two things happen: 1) the total flux is higher 2) much more of it comes in the form of heavy, very energetic particles (with enough energy to flip a bit packed into a small space). – dmckee --- ex-moderator kitten Apr 05 '10 at 21:22
  • With regard to references, there are books (e.g., https://books.google.com/books?hl=en&lr=&id=Er5_rzW0q3MC), conferences (e.g., http://www.radecs2015.org , http://www.seemapld.org , and others), and papers galore on this topic. Cosmic rays are not a joke in aerospace. They are one of the key reasons that many spacecraft use rad hardened computers, most of which have the processing power of a modern smart toaster oven. – David Hammen Sep 28 '15 at 09:42
10

"cosmic ray events" are considered to have a uniform distribution in many of the answers here, this may not always be true (i.e. supernovas). Although "cosmic rays" by definition (at least according to Wikipedia) comes from outer space, I think it's fair to also include local solar storms (aka coronal mass ejection) under the same umbrella. I believe it could cause several bits to flip within a short timespan, potentially enough to corrupt even ECC-enabled memory.

It's well-known that solar storms can cause quite some havoc with electric systems (like the Quebec power outage in March 1989). It's quite likely that computer systems can also be affected.

Some 10 years ago I was sitting right next to another guy, each of us with our own laptop, during a period of quite "stormy" solar weather (sitting in the arctic, we could observe this indirectly: lots of aurora borealis to be seen). Suddenly, in the very same instant, both our laptops crashed. He was running OS X, and I was running Linux. Neither of us was used to our laptops crashing; it's a quite rare thing on Linux and OS X. Common software bugs can more or less be ruled out since we were running different OSes (and it didn't happen during a leap second). I've come to attribute that event to "cosmic radiation".

Later, "cosmic radiation" has become an internal joke at my workplace. Whenever something happens with our servers and we cannot find any explanation for it, we jokingly attribute the fault to "cosmic radiation". :-)

tobixen
  • 3,334
  • 1
  • 15
  • 20
7

More often, noise can corrupt data. Checksums are used to combat this on many levels; in a data cable there is typically a parity bit that travels alongside the data. This greatly reduces the probability of corruption. Then on parsing levels, nonsense data is typically ignored, so even if some corruption did get past the parity bit or other checksums, it would in most cases be ignored.
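
As a small illustration of the parity idea (even parity over a single byte; the framing of any real link is of course more involved):

    def even_parity_bit(byte: int) -> int:
        """Parity bit chosen so the 9 transmitted bits contain an even number of ones."""
        return bin(byte).count("1") % 2

    data = 0b0110_1001
    parity = even_parity_bit(data)

    corrupted = data ^ 0b0000_0100                  # simulate one bit flipped in transit
    assert even_parity_bit(data) == parity          # clean transfer passes the check
    assert even_parity_bit(corrupted) != parity     # a single-bit error is detected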

Also, some components are electrically shielded to block out noise (probably not cosmic rays I guess).

But in the end, as the other answerers have said, there is the occasional bit or byte that gets scrambled, and it's left up to chance whether that's a significant byte or not. Best case scenario, a cosmic ray scrambles one of the empty bits and has absolutely no effect, or crashes the computer (this is a good thing, because the computer is kept from doing harm); but worst case, well, I'm sure you can imagine.

Ricket
  • 31,028
  • 28
  • 106
  • 137
  • Cosmic ray bitflips may not be uniformly distributed, particularly if we include solar storms under the "cosmic ray events"-umbrella. If you got two bitflips within the same byte, the parity bit check will fail. Several bitflips, and ECC-algorithms probably won't be able to correct a failure. – tobixen Nov 15 '17 at 09:46
6

I have experienced this - it's not rare for cosmic rays to flip one bit, but it's very unlikely that a person ever observes it.

I was working on a compression tool for an installer in 2004. My test data was some Adobe installation files of about 500 MB or more decompressed.

After a tedious compression run, and a decompression run to test integrity, FC /B showed one byte different.

Within that one byte the MSB had flipped. I also flipped, worrying that I had a crazy bug that would only surface under very specific conditions - I didn't even know where to start looking.

But something told me to run the test again. I ran it and it passed. I set up a script to run the test 5 times overnight. In the morning all 5 had passed.

So that was definitely a cosmic ray bit flip.
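
For anyone who wants to automate that kind of check instead of eyeballing FC /B output, here is a minimal Python sketch (the file names are hypothetical; it assumes two equal-length files) that reports each differing byte and which bits flipped:

    def diff_bits(path_a: str, path_b: str):
        """Compare two equal-length files and report offset, byte values and flipped bits."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            a, b = fa.read(), fb.read()
        for offset, (x, y) in enumerate(zip(a, b)):
            if x != y:
                flipped = [bit for bit in range(8) if (x ^ y) & (1 << bit)]
                print(f"offset {offset:#x}: {x:#04x} vs {y:#04x}, flipped bit(s) {flipped}")

    # diff_bits("original.bin", "roundtripped.bin")   # hypothetical file names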

rep_movsd
  • 6,277
  • 4
  • 26
  • 31
  • Definitely? Couldn't it have been an uninitialized variable that never got a bad initial value in the subsequent tests? – doug65536 Jan 06 '16 at 21:34
  • I always compile with W3 or W4 on VS - Also Rational Purify, there were no bugs of that sort. – rep_movsd Jan 06 '16 at 21:49
  • Ah, sorry I didn't know that those compiler options and Rational Purify were utterly infallible. =) – doug65536 Jan 25 '16 at 06:51
  • Considering that the code was then put into production and compressed and uncompressed hundreds of GB properly, there was no sign of a similar bug. – rep_movsd Jan 25 '16 at 14:51
4

You might want to have a look at Fault Tolerant hardware as well.

For example, Stratus Technology builds Wintel servers called ftServer which have 2 or 3 "mainboards" running in lock-step, comparing the results of the computations. (This is also done in space vehicles sometimes.)

The Stratus servers evolved from custom chipset to lockstep on the backplane.

A very similar (but software) system is the VMWare Fault Tolerance lockstep based on the Hypervisor.
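
The underlying idea is just redundant execution plus comparison or majority voting. A minimal Python sketch of the voting step (a toy model, nothing like the real Stratus or VMware machinery):

    from collections import Counter

    def vote(*results):
        """Majority vote over redundant computations; raise if no majority exists."""
        winner, count = Counter(results).most_common(1)[0]
        if count < 2:
            raise RuntimeError("replicas disagree with no majority")
        return winner

    def compute(x):
        return x * x          # stand-in for the real computation

    # Run the "same" computation three times and keep the majority answer,
    # so a single corrupted replica is outvoted.
    print(vote(compute(7), compute(7), compute(7)))   # 49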

eckes
  • 9,350
  • 1
  • 52
  • 65
4

As a data point, this just happened on our build:

02:13:00,465 WARN  - In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ostream:133:
02:13:00,465 WARN  - /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/locale:3180:65: error: use of undeclared identifier '_'
02:13:00,465 WARN  - for (unsigned __i = 1; __i < __trailing_sign->size(); ++_^i, ++__b)
02:13:00,465 WARN  - ^
02:13:00,465 WARN  - /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/locale:3180:67: error: use of undeclared identifier 'i'
02:13:00,465 WARN  - for (unsigned __i = 1; __i < __trailing_sign->size(); ++_^i, ++__b)
02:13:00,465 WARN  - ^

That looks very strongly like a bit flip happening during a compile, in a very significant place in a source file by chance.

I'm not necessarily saying this was a "cosmic ray", but the symptom matches.

dascandy
  • 6,916
  • 1
  • 25
  • 49