Japan Network

Cosmic Rays Causing 30,000 Network Malfunctions in Japan Each Year (mainichi.jp) 71

Cosmic rays are causing an estimated 30,000 to 40,000 malfunctions in domestic network communication devices in Japan every year, a Japanese telecom giant found recently. From a report: Most so-called "soft errors," or temporary malfunctions, in the network hardware of Nippon Telegraph and Telephone Corp. are automatically corrected via safety devices, but experts said in some cases they may have led to disruptions. It is the first time the actual scale of soft errors in domestic information infrastructures has become evident. Soft errors occur when the data in an electronic device is corrupted after neutrons, produced when cosmic rays hit oxygen and nitrogen in the earth's atmosphere, collide with the semiconductors within the equipment. Cases of soft errors have increased as electronic devices with small and high-performance semiconductors have become more common. Temporary malfunctions have sometimes led to computers and phones freezing, and have been regarded as the cause of some plane accidents abroad. Masanori Hashimoto, professor at Osaka University's Graduate School of Information Science and Technology and an expert in soft errors, said the malfunctions have actually affected other network communication devices and electrical machineries at factories in and outside Japan.
This discussion has been archived. No new comments can be posted.

  • Oblig... (Score:5, Funny)

    by LordHighExecutioner ( 4245243 ) on Tuesday April 06, 2021 @04:09PM (#61244144)
    ...xkcd quote [xkcd.com]
  • I assume the presumed connection to Godzilla is still under investigation.
    • He's got employment as an ambassador for Japan now. If he was up to it, it'd be a problem for everyone else, not them.

    • Apropos of random stuff, the data in the article is a bit suspect, they used neutron bombardment to estimate 100 errors per day in its servers, but what you're mostly affected by is gammas, and in particular if you've got neutrons flowing through your IT gear and aren't Los Alamos that's a cause for serious concern. So the "30,000" figure is more or less made up, if you want actual data they'd need to publish FIT data or similar.
  • I recall back when a megabyte machine was a big deal, this issue was much discussed in some circles. Which is why, I gather, pretty much all server class machines have ECC memory. Random bit decay, we called it. Nice to know it hasn't been forgotten. The change from synchronous error reporting by the CPU likely obscured some of the problems this caused. I stopped paying attention a long time ago. But then, I only did internals and was never much of an applications hacker.

    • I recall back when a megabyte machine was a big deal, this issue was much discussed in some circles. Which is why, I gather, pretty much all server class machines have ECC memory. Random bit decay, we called it. Nice to know it hasn't been forgotten. The change from synchronous error reporting by the CPU likely obscured some of the problems this caused. I stopped paying attention a long time ago. But then, I only did internals and was never much of an applications hacker.

      When the conversion of RAM from core to semiconductor technology was starting at DEC, in the late 1970s, the hardware engineers worried that cosmic rays would cause memory errors, since semiconductor memory is more sensitive to cosmic rays than are ferrite cores. As a result, the first PDP-10 memories using semiconductors also had ECC.

      • Re:even older (Score:4, Interesting)

        by AmiMoJo ( 196126 ) on Tuesday April 06, 2021 @06:14PM (#61244590) Homepage Journal

        DDR6 is going to be all ECC. Memory sizes are so large now it's pretty much a requirement.

        It will still come in two types though. One that does ECC internally, one that exposes the ECC to the CPU.

        • DDR6 is going to be all ECC. Memory sizes are so large now it's pretty much a requirement.

          It will still come in two types though. One that does ECC internally, one that exposes the ECC to the CPU.

          On the one that does ECC internally, how does it report soft and hard ECC errors?

          • by AmiMoJo ( 196126 )

            It doesn't report errors, it just silently corrects them internally. To the computer it looks like non-ECC RAM.

            • It doesn't report errors, it just silently corrects them internally. To the computer it looks like non-ECC RAM.

              Correcting but not reporting errors is a bad design--you don't know your memory is getting flakey until it fails hard, so you don't know you need to stock a spare. Does it also not report hard ECC errors, but just return bad data? It would be better to stop, freezing the CPU. At least that way you would know you have a problem.

              • by AmiMoJo ( 196126 )

                Consider that at the moment your computer just carries on completely unaware of the error in most cases. So it's clearly better than the status quo, although not as good as proper ECC memory that reports faults.

                I do wonder why they even bothered with self-correcting mode... Probably at Intel's request so they can segment the market by continuing to offer ECC and non-ECC CPUs. Ryzen supports ECC and while it's not qualified on many motherboards it does work just fine in most cases.

                • Consider that at the moment your computer just carries on completely unaware of the error in most cases. So it's clearly better than the status quo, although not as good as proper ECC memory that reports faults.

                  I do wonder why they even bothered with self-correcting mode... Probably at Intel's request so they can segment the market by continuing to offer ECC and non-ECC CPUs. Ryzen supports ECC and while it's not qualified on many motherboards it does work just fine in most cases.

                  Depending on what they do in the case of an uncorrectable ECC error, it could be better than the status quo.

                  I suspect the motivation for adding ECC without error reporting is to improve reliability. With the reporting omitted that will be hard to measure. I wonder if somebody will come out with a hack to convert non-reporting ECC to reporting ECC, thus avoiding the attempt by Intel to segment the market. My desktop has a motherboard that supports ECC RAM, but I bought non-ECC RAM because it costs half as

                  • by AmiMoJo ( 196126 )

                    The issue with Intel is that the CPUs don't support ECC. Or rather they do, it's just disabled on the consumer ones. You have to pay extra for a model that is identical other than having the ECC feature unlocked.

                    AMD support ECC on most, if not all Ryzen parts. Many motherboard manufacturers don't test ECC RAM so it's technically not qualified and you are on your own with it, but it usually works just fine.

                    • The issue with Intel is that the CPUs don't support ECC. Or rather they do, it's just disabled on the consumer ones. You have to pay extra for a model that is identical other than having the ECC feature unlocked.

                      AMD support ECC on most, if not all Ryzen parts. Many motherboard manufacturers don't test ECC RAM so it's technically not qualified and you are on your own with it, but it usually works just fine.

                      The issue isn't just with Intel. When I priced RAM, I found that ECC memory costs twice as much as non-ECC memory. Maybe the memory manufacturers are taking a lesson from Intel.

        • So it's just all parity bits and no data bits?
        • by Bengie ( 1121981 )
          DDR5 is already ECC by spec, but only for the storage, not the bus. But most bit errors happen in the memory, not the communications.
      • There's a talk online somewhere by a guy who writes radiation-hardened code, it was at a conference in Australia a few years ago. The slides are pages and pages of the most paranoid code you've ever seen, he did a demo where he zapped a cellphone with uranium to demonstrate what faults can do. So you can mitigate it in software, it's just software that looks like nothing else on earth.
        • There's a talk online somewhere by a guy who writes radiation-hardened code, it was at a conference in Australia a few years ago. The slides are pages and pages of the most paranoid code you've ever seen, he did a demo where he zapped a cellphone with uranium to demonstrate what faults can do. So you can mitigate it in software, it's just software that looks like nothing else on earth.

          It is possible, though very difficult, to write code that can deal with certain kinds of hardware faults. IBM's OS/360, for example, would abend a user job if certain kinds of faults, known as "program damage", happened while the user job was running. Other kinds of faults, known as "system damage", would bring down the whole computer.

          Even with the most paranoid of code, there are limits to the kinds of hardware faults that software can recover from. If one percent of your memory fetches get ECC Uncorrec

          • Not necessarily, you can create extremely fault-tolerant code that works even in the presence of hardware faults, like the talk I mentioned. Another one is the control systems used in the French SACEM train control, which has multiple data flows that cross-check each other. Or TMR programming, standard in high-radiation environments. You can certainly get essentially zero faults (or at least reduced to a negligible level) via software-only techniques, it just takes a lot of careful programming. Seeing b
            • Not necessarily, you can create extremely fault-tolerant code that works even in the presence of hardware faults, like the talk I mentioned. Another one is the control systems used in the French SACEM train control, which has multiple data flows that cross-check each other. Or TMR programming, standard in high-radiation environments. You can certainly get essentially zero faults (or at least reduced to a negligible level) via software-only techniques, it just takes a lot of careful programming. Seeing both TMR and self-checking code in action is impressive, you can randomly flip bits while it's running and it just keeps on going.

              NASA does a good job of dealing with hardware faults in software, but there are some problems that are utterly out of reach of software. A good example of software recovery is the NEAR Shoemaker mission to the asteroid Eros, which you can read about at this URL: http://near.jhuapl.edu/anom/Ho... [jhuapl.edu] . An example of an unrecoverable problem is the loss of Mars Climate Orbiter, at this URL: https://spectrum.ieee.org/aero... [ieee.org] . I liked this paragraph:

              In an analysis done by the spacecraft's builders at Lockheed-Ma
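The TMR (triple modular redundancy) idea mentioned in this subthread can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `vote` and the sample values are hypothetical, and real radiation-tolerant systems combine voting with scrubbing, self-checking control flow, and hardware support.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of the same word.

    Each output bit takes the value held by at least two of the inputs,
    so a flipped bit in any single copy is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


stored = 0b1011_0010                 # hypothetical value kept in triplicate
copies = [stored, stored, stored]
copies[1] ^= 0b0000_1000             # simulate a single-bit upset in one copy

recovered = vote(*copies)
assert recovered == stored           # the damaged copy is outvoted
copies = [recovered] * 3             # "scrub": rewrite all three copies
```

Voting only helps while at most one copy is wrong in any given bit position, which is why the copies are rewritten after each vote: an upset left in place would combine with the next one.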

  • flip a bit (Score:5, Interesting)

    by Camel Pilot ( 78781 ) on Tuesday April 06, 2021 @04:38PM (#61244286) Homepage Journal

    I remember having a Perl script that had done a specific function for 20 years without fail, and then one day it quit working and spat out an error on invocation. A quick inspection revealed that in the text of the code an "h" had been flipped to a "g", so a function call to "hello_gps" had become "gello_gps".

    Good thing it didn't flip a bit in some gnarly RegEx :)

    • That's really fascinating. I've had times where I swore something like that happened, but because it was most likely some transient in volatile memory I couldn't confirm it.

      Of course I've had corrupted files, but I've always chalked that up to a bug in the file system, OS, or the software that wrote the file. Uninitialized variables do stuff that looks like that. They're some of the worst bugs, right up there with using free'd memory. You feel miserable when tracking them down, and you feel like Sherlo

      • I once had a case where I received a dump file (Windows). After analyzing the place where the program crashed, I realized that it was "impossible" - there is no way that the program could have crashed where it was because the assembly statement before it did a test against the value and should have branched. I figured it was a hardware error of some sort - maybe it could have been a flipped bit? It was the only time in my career where I was confident it was a hardware problem and not a software bug.

        Of cours

    • by sjames ( 1099 )

      WAAY back in the dark ages, I had a PC sorting huge (for the time) data files using a home grown sort routine. After flawless operation, suddenly it produced output horribly out of order. Since it was batch processing, I could exactly repeat the run as often as I wanted. It ran flawlessly every time after.

    • In ASCII, g is 0x67; h is 0x68. But going from "hello_gps()" to "jello_gps()" would be a single bit-flip (0x68 -> 0x6a). And would probably make your location jiggle.
      • by Camel Pilot ( 78781 ) on Tuesday April 06, 2021 @08:32PM (#61245080) Homepage Journal

        I was just going from memory, as this happened back in 2005. Should've known this was Slashdot and someone was going to check me on it... So I searched the official maintenance log and found the actual entry:

        "Nightingale investigated the problem and after finding no issue with the format of the state file, proceeded to examine the place in the cams.pl code where the error was occurring. Nightingale discovered that the cams.pl file had been mysteriously modified. A single bit was changed causing a variable name to be changed, e.g. State_ref => Stade_ref. ... The source code change was an ASCII character change from "t" to "d".

        This is a change of a single bit in the character byte. The change in source code was not made by human editing. Troxel believes that the Piquin hard drive may have been impacted by the "Oh My God Particle". The source code change was an ASCII character change from "t" to "d". This is a change of a single bit in the character byte."
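The character pairs discussed in this subthread are easy to check by XORing the ASCII codes and counting the set bits. A minimal sketch (the helper name `flipped_bits` is made up for illustration):

```python
def flipped_bits(a: str, b: str) -> int:
    """Count the bits that differ between the code points of two characters."""
    return bin(ord(a) ^ ord(b)).count("1")


for before, after in [("h", "g"), ("h", "j"), ("t", "d")]:
    print(f"{before!r} (0x{ord(before):02x}) -> {after!r} (0x{ord(after):02x}): "
          f"{flipped_bits(before, after)} bit(s) differ")
```

With these inputs, "h" to "g" differs in four bits, while "h" to "j" and "t" to "d" each differ in exactly one, which matches the log entry's description of a single-bit change.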

        • I'm curious, don't HDDs employ basic error correction, or does this only get invoked if it physically can't read a sector?

          If not it's an even stronger use case for a check summing filesystem.
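As a rough illustration of what a checksumming filesystem buys you, here is a sketch that records a CRC-32 when a block is written and verifies it on read. The function names and the sample line of Perl are hypothetical, and real filesystems use their own checksum algorithms and on-disk layouts.

```python
import zlib


def write_block(data: bytes):
    """Return the data together with the CRC-32 recorded at write time."""
    return data, zlib.crc32(data)


def read_block(data: bytes, recorded_crc: int) -> bytes:
    """Return the data only if it still matches the recorded checksum."""
    if zlib.crc32(data) != recorded_crc:
        raise IOError("checksum mismatch: block changed after it was written")
    return data


block, crc = write_block(b"my $State_ref = load_state();")
assert read_block(block, crc) == block          # an intact block reads back fine

damaged = block.replace(b"State_ref", b"Stade_ref")  # the single-bit 't' -> 'd' change
try:
    read_block(damaged, crc)
except IOError as err:
    print("detected:", err)
```

This only catches changes that happen after the checksum was computed. A bit flipped in RAM before the write would be checksummed along with everything else.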

          • by ebvwfbw ( 864834 )

            I think it's to do with disk technology. Used to be we had MFM drives that seemed to be solid. Then RLL drives that were flakey. Now we're well beyond that. Once in a while I compare a USB spindle 4TB disk to my desktop machine. I do a rsync to it every week and tell it to delete the deleted crap. Much to my surprise there are differences. The USB that is kept in a commercial grade steel safe (700+ Lbs) turns out to be right. Things like a picture. When it's messed up on the desktop it's obvious. I get lik

  • ... should be shielded but no one really wants to pay for it. Let's be honest, our species will simply ride along until a calamity forces a change in how we do things, because it's easy to take the low cost + error prone route as long as those errors aren't severely disruptive.

    • ... should be shielded but no one really wants to pay for it.

      Shielding makes things bigger and heavier, would you really like to go back to 'phones like this ? [alamy.com]

      • It means most people would stop using smartphones and by association social media too.

        So the answer to your question is: fuck yes.

    • by tomhath ( 637240 )
      Spend lots of money hoping to avoid a once in a million years event, only to find you overlooked the thing that got you. Or just get on with your life and deal with a problem if/when it happens.
    • by sjames ( 1099 )

      It's always a game of statistics. No amount of shielding will be enough to completely eliminate the problem, so error detection and/or correction will always be needed. At some point spontaneous decay in the shielding becomes a potential source of bit flips. By no amount of shielding, I mean no amount that isn't larger than the Earth itself.

    • I've looked into the shielding that would be required, as we had a real problem with bit-flips in FPGAs causing storage devices to spit errors and crash. The necessary shielding wasn't just annoyingly expensive, it was completely infeasible from cost, logistics, and mechanical perspectives. It's much better to simply check for it with various hardware and software based methods, and correct it when it happens (and when you run things at large enough scale, it happens several times a month... and that's ju

  • Nonsense (Score:5, Interesting)

    by belthize ( 990217 ) on Tuesday April 06, 2021 @04:57PM (#61244372)

    This is not the first time this has become evident. There have been papers written on it. You can calculate the expected incident rate based on packet throughput. As others have said, companies like DEC and IBM spent a great deal of effort on the general question 50 years ago, and Cisco had several papers specific to network impacts in the late 90's.

    This is why all such networking equipment, memory and filesystems are fairly resilient to the effect, because there's fuck all you can do about it (at least cost effectively).

    • Re: (Score:2, Informative)

      by Anonymous Coward
      Yea, the problem is that the size of lithography has shrunk so much, that the circuits are more likely to get messed with. The rate is really bad now where your average computer will experience a bit flip basically every day. https://www.macobserver.com/co... [macobserver.com] And there is something you can do. Current ECC memory uses a 9th bit to correct. As error rates go up, you can go to 2 parity bits and that will increase resiliency greatly.
      • Yeah, that's pretty much what I meant about resiliency. There's nothing practical you can do to stop the effect so you have to build in resiliency into the system, for memory that's ECC, for networks that's things like TCP checksums, RAID parity bits etc.

        Resilient systems are much more practical than error proof systems.
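To make the "extra check bits buy resiliency" point concrete, here is a sketch of the textbook Hamming(7,4) code, which adds three parity bits to four data bits and can correct any single flipped bit. It is a teaching example only: ECC DIMMs commonly use wider codes with 8 check bits per 64 data bits, and the function names here are hypothetical.

```python
def encode(d):
    """Encode 4 data bits as a 7-bit codeword with parity at positions 1, 2 and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(c):
    """Correct at most one flipped bit; return (data bits, corrected position or 0)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
codeword = encode(word)
codeword[5] ^= 1                      # simulate an upset at position 6
data, fixed_at = decode(codeword)
assert data == word and fixed_at == 6
```

Two flipped bits in the same codeword are miscorrected by this code, which is the situation the wider "detect double errors" variants and the multi-bit upset worries elsewhere in the thread are about.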

      • As error rates go up, you can go to 2 parity bits and that will increase resiliency greatly.

        Fuck everything, we're doing five parity bits. - Samsung

    • by AmiMoJo ( 196126 )

      In Japan the only internet service you can order is fibre, and the lowest speed is 2Gbps symmetrical. In many places the baseline is 10Gbps.

      That's cheap consumer broadband. The hardware they give you is consumer grade. They just supply a modem, you have to get your own router. Of course very few consumer routers have a 10Gbps port and even fewer can actually route packets that fast.

      • In Japan the only internet service you can order is fibre, and the lowest speed is 2Gbps symmetrical. In many places the baseline is 10Gbps.

        Wow, all that and tentacle hentai p0rn? Time to move to Japan!

  • Always ruining my preconceived notion on how reality works.
  • Not a new phenomenon (Score:4, Interesting)

    by dsgrntlxmply ( 610492 ) on Tuesday April 06, 2021 @05:46PM (#61244518)

    We had an embedded system with maybe 16KB of (if I recall correctly) Intel 2107 DRAM (4096 x 1 bit) deployed mid-to-late 1977. We had parity on the memory, and were observing higher than expected rates of parity error crashes. Trade newspapers brought word of the May and Woods paper (Apr. 1978) on alpha particle upsets from radioactive materials in ceramic chip packages. Their reported failure rates, applied across our device population, explained most of the parity crashes that we had been observing.

    Around 2013 in a telecom application, we (another company) had functional failures that could be traced to single event upsets causing persistent bit flips in hardware routing tables. The failure rate was low, but the limited number of cases observed, seemed to have higher rates of occurrence at higher altitude sites. Third semester physics problem: how can muons make it to ground level (under justifiable assumptions stated in the problem)? Answer: muons at 0.98c are relativistic: their decay "clock" runs slower than earth observer frame by a factor of 5. A sufficient number reach the ground, to cause trouble.

    Bonus points: help your kid build a cloud chamber. My dad did. It was fun and spooky, and I had great, if sometimes painful, adventures with the Model T ignition coil. Cosmic ray observation, quite literally on the kitchen table.
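The muon problem above works out as follows. This is a back-of-the-envelope sketch: the 0.98c speed comes from the comment, the mean lifetime at rest is the standard value of about 2.2 microseconds, and the 15 km production altitude is an assumption chosen for illustration. Energy loss in the atmosphere is ignored.

```python
import math

C = 299_792_458.0          # speed of light, m/s
MUON_LIFETIME = 2.197e-6   # mean muon lifetime at rest, s
BETA = 0.98                # muon speed as a fraction of c (from the comment)
ALTITUDE = 15_000.0        # ASSUMED production altitude, m

# Lorentz factor: how much slower the muon's decay "clock" runs in the Earth frame.
gamma = 1.0 / math.sqrt(1.0 - BETA ** 2)


def surviving_fraction(dilation: float) -> float:
    """Fraction of muons still undecayed after travelling ALTITUDE metres."""
    mean_range = BETA * C * MUON_LIFETIME * dilation
    return math.exp(-ALTITUDE / mean_range)


print(f"gamma = {gamma:.2f}")
print(f"surviving fraction without time dilation: {surviving_fraction(1.0):.1e}")
print(f"surviving fraction with time dilation:    {surviving_fraction(gamma):.1e}")
```

The Lorentz factor comes out at roughly 5, as stated. Under these assumptions the surviving fraction is many orders of magnitude larger with time dilation than without it, which is the whole answer to the homework problem.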

    • As processors and memory shrink more, down to that ultimate single molecule level, a cosmic ray can disrupt a bit much more easily than with larger more primitive chips and bits. I wonder how much of this is a side effect of Moore's law in action. Unintended consequences, maybe?

    • by AmiMoJo ( 196126 )

      This is why defensive programming is a really good idea.

      Some coding practices actually discourage stuff that would protect against this kind of thing. For example -Wall with clang (and probably GCC, I didn't check) warns if you use default in a switch statement on an enum. The theory is that you should explicitly cover every possible enumerated value, but what happens if a bit randomly flips? Default is the best way to handle that, as well as a load of other errors that are not uncommon with communication st

      • Some coding practices actually discourage stuff that would protect against this kind of thing. For example -Wall with clang (and probably GCC, I didn't check) warns if you use default in a switch statement on an enum. The theory is that you should explicitly cover every possible enumerated value, but what happens if a bit randomly flips? Default is the best way to handle that, as well as a load of other errors that are not uncommon with communication stuff.

        Defensive programming is fine for detecting problems, yet there are boundaries in responding to them that, when crossed, become counterproductive.

        Where people get into trouble with "defensive" programming is acting as if the goal is not to crash when in fact the goal is correct operation. This means maintaining a level of brittleness in execution where the default response to the unexpected outcome is to stop digging.

        Generally the resolution to your questions is to apply runtime assertions for gremlins. Warnings
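The two positions in this exchange (always handle the unexpected value, but respond by stopping rather than guessing) can be combined. The comments are about C switch statements; the sketch below shows the same idea in Python, with a hypothetical `LinkState` enum and `decode_state` helper invented for illustration.

```python
from enum import IntEnum


class LinkState(IntEnum):     # hypothetical states for illustration
    DOWN = 0
    UP = 1
    TESTING = 2


def decode_state(raw: int) -> LinkState:
    """Map a raw integer to a known state, failing loudly on anything else.

    The except branch plays the role of a switch statement's default case:
    it catches values no valid program path could have produced, such as
    one damaged by a bit flip, and stops instead of carrying on.
    """
    try:
        return LinkState(raw)
    except ValueError:
        raise RuntimeError(f"corrupt link state {raw:#x}, refusing to continue") from None


assert decode_state(1) is LinkState.UP

try:
    decode_state(1 ^ 0x40)    # the same value with one bit flipped
except RuntimeError as err:
    print("stopped:", err)
```

Whether to halt, restart, or fall back to a safe state is a design decision for the system in question. The sketch only shows that detection and "keep going regardless" are separate choices.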

  • I knew those cosmic rays were a problem. They also sap and impurify all of my precious bodily fluids.

  • The smaller the feature size, the more it is susceptible to both thermal and cosmic ray upsets, but the less cross sectional area there is for either on any one transistor or memory location. It's a dance.

    Lower operating voltages are probably a bigger driver. P=fV**2 is a tempting little equation isn't it?

  • by argee ( 1327877 ) on Tuesday April 06, 2021 @09:20PM (#61245210)

    I tried to explain it to her. Went to an online Flower shop to get her a bouquet, and due to the Chinese Neutron bug, an exotic porn site popped up. For some reason, she wouldn't believe me.

  • This sounds about right. I remember it was estimated in the 1980's that a Cray XMP made an undetected error every 2 minutes or so because of cosmic rays. Cosmic rays are a particularly tricky thing to guard against. With steady, random errors you can add extra parity bits and trap when things go wrong, but cosmic rays give you a shower of ionising particles at the same time, and roughly in the same place, so there is a significant chance that it can flip a bit and the corresponding parity bit. Back then, wh
  • cosmic ray error
    your morganstanley BTC bag is now ...
    empty
    thank you, please come again - well , i guess that should be covered by the 10k + nodes, so what CAN it do ?
