IBM

The IBM Mainframe: How It Runs and Why It Survives (arstechnica.com) 138

Slashdot reader AndrewZX quotes Ars Technica: Mainframe computers are often seen as ancient machines—practically dinosaurs. But mainframes, which are purpose-built to process enormous amounts of data, are still extremely relevant today. If they're dinosaurs, they're T-Rexes, and desktops and server computers are puny mammals to be trodden underfoot.

It's estimated that there are 10,000 mainframes in use today. They're used almost exclusively by the largest companies in the world, including two-thirds of Fortune 500 companies, 45 of the world's top 50 banks, eight of the top 10 insurers, seven of the top 10 global retailers, and eight of the top 10 telecommunications companies. And most of those mainframes come from IBM.

In this explainer, we'll look at the IBM mainframe computer -- what it is, how it works, and why it's still going strong after over 50 years.

"Today's mainframe can have up to 240 server-grade CPUs, 40TB of error-correcting RAM, and many petabytes of redundant flash-based secondary storage. They're designed to process large amounts of critical data while maintaining a 99.999 percent uptime -- that's a bit over five minutes' worth of outage per year..."

"RAM, CPUs, and disks are all hot-swappable, so if a component fails, it can be pulled and replaced without requiring the mainframe to be powered down."
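Those availability figures are easy to sanity-check. A quick sketch (assuming a 365.25-day year) converts an availability percentage into an annual downtime budget:

```python
# Convert an availability percentage into an allowed annual downtime budget.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # Julian year; an assumption for this sketch

def downtime_minutes(availability_percent):
    """Minutes of outage permitted per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {downtime_minutes(pct):.1f} min/year")
```

Five nines works out to about 5.3 minutes per year, matching the "bit over five minutes" in the summary.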
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Tractors (Score:5, Insightful)

    by OricAtmos48K ( 979353 ) on Sunday July 30, 2023 @09:45AM (#63725444)
    Why we use a slow, noisy vehicle in the fields ... It is built for the job
    • by ls671 ( 1122017 )

      I have owned many tractors but I have always only dreamt about owning a mainframe, especially for IO tasks :(

      • Well,
        I would start with dreaming about the power plant you will need to operate one :P

      • If you fancy running a mainframe-class operating system at home, you can run a version of ICL's "George" O/S on a Raspberry Pi!

        https://en.wikipedia.org/wiki/... [wikipedia.org]
        https://www.george3.co.uk/ [george3.co.uk]
        https://www.rs-online.com/desi... [rs-online.com]

        I worked for years on ICL machines (mostly 2900 series and above) and found VME to be a simply wonderful operating system (George was a forerunner of VME). Some of the concepts are still years ahead of their time (file generations) and the S3 programming language is one of the best languages

    • And why do we? Because the story people have in their heads says to use them. Because their "daddy" did. Because thinking is for other people.

  • by david.emery ( 127135 ) on Sunday July 30, 2023 @09:57AM (#63725452)

    There's a huge jump going from 4x9 to 5x9 reliability (something a lot of requirements writers don't appreciate when they levy such a requirement.) For the systems I worked on, that was "no more than 1 reboot every year." Now rebooting has gotten faster on a lot of systems, but it's still something you can't do often in a 5x9 environment. Thus the ability to fix hardware and software while the system is running is a huge consideration. For the systems I worked on, that usually meant hot-standbys (but then the set of 'system faults' also included power failures or other situations external to the box.)

    And the big difference between 4x9 and 5x9 in that regard was something I'd use to educate requirements writers and managers about the impact of just a single digit in a set of requirements or performance contract. It's also the kind of requirement that is pervasive across a system, where hardware, systems software, applications software and operations/procedures all had to work together to meet the need. (I still don't see how you achieve that kind of pervasive requirement using "just pick a requirement written on a 3x5 card to implement this week" Agile techniques.)

    • by mmdurrant ( 638055 ) on Sunday July 30, 2023 @10:37AM (#63725516)
      I couldn't help but address your sloppy thinking on the agile thing.
      First, we don't specify stories in terms of requirements; we specify them in terms of features that satisfy requirements. Even granting your perception, however, this can still be done in an iterative manner.
      The larger story would be "As a consumer of the app, I need 5 9s uptime so I can ensure critical services are always available to users." This would have acceptance criteria, and the subsequent tasks would be requirement/validation based, answering the questions of how much downtime is needed to replace a NIC or apply a kernel security upgrade, or the other varying knowledge required to ensure the system meets spec.
      Ultimately, Agile is a system intended to reduce risk to business by ensuring R&D spends minimal time on work that isn't the business's top priority. It's suited to software development more than systems development, but as long as a person doesn't become too dogmatic in their thinking, a lot of the ideas are applicable to any work that requires a team.
      • by quantaman ( 517394 ) on Sunday July 30, 2023 @01:33PM (#63725834)

        Ultimately, Agile is a system intended to reduce risk to business by ensuring R&D spends minimal time on work that isn't the business's top priority. It's suited to software development more than systems development, but as long as a person doesn't become too dogmatic in their thinking, a lot of the ideas are applicable to any work that requires a team.

        Agile works very well for software development because it's easy to compartmentalize and refactoring is fairly cheap and easy as well.

        It's less well suited to systems development as interfaces become more important and refactoring is really expensive.

        On large industrial projects waterfall is still king. Discovering a new requirement part way through the project might mean sending someone out into the field to manually update dozens of field devices. Or a change in a control protocol might require another round of site acceptance testing: sending an observer to every single remotely controlled device and validating that commands are carried out.

        Agile is a philosophy, and just like every other philosophy it's brilliant at some aspects and terrible at others.

        Where exactly a mainframe falls along that line is hard to say. But if it involves a lot of custom hardware I'd be very, very cautious about going agile and discovering part way through that a bunch of already fabbed chips don't meet the new requirements.

        • Try saying that to SpaceX...
    • Five nines are difficult and expensive if you try to achieve high availability without redundancy. If you can't reboot more than once a year, that is a system design not made for high availability. Requiring five nines might just be a way of saying "no single points of failure", i.e. a way of requiring a type of system design: fault tolerance, not avoidance.
      • by david.emery ( 127135 ) on Sunday July 30, 2023 @11:14AM (#63725584)

        Redundancy isn't a complete solution, for a couple of reasons:
        1. Switchover is really hard. Maintaining a hot standby and -transactional consistency- requires both the primary and the backup(s) to complete a transaction before final commit. There have been lots of examples where the switchover system (hardware and software) didn't quite get it right, with the result that either the standby was unable to take over, or the inconsistencies between primary and standby caused downstream problems.

        And that's assuming a single system has the authoritative data, versus some distributed system designs that want to survive network partitioning. Turns out there's no general solution to the 'partitioned database replication problem', where "solution" is "an algorithm to take the full set of updates to multiple independent copies, jam them together, to produce a single consistent system." (There are solutions you can do that take advantage of domain knowledge.)

        2. It doesn't always handle "Bohrbugs" when the same fault impacts all of the copies. That includes -zero day malware-. (And as Nancy Leveson showed, 'n-version' doesn't necessarily fix problems, because even independent groups of designers tend to make the same mistakes.)

        But when the -system- requirement is for 5x9 or more stringent, redundancy is almost always part of the solution. The questions then get into "how much, and at what cost?"
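The switchover difficulty described above is essentially the two-phase-commit problem: the primary must not acknowledge a transaction until every standby has durably staged it, and an abort must reach all copies. A toy sketch of that idea (every class and method name here is illustrative, not any real product's API):

```python
class Replica:
    """A toy replica that can stage (prepare) and then commit transactions."""
    def __init__(self, name):
        self.name = name
        self.committed = []   # durably applied transactions
        self.prepared = {}    # staged but not yet committed

    def prepare(self, txn_id, data):
        # Durably stage the transaction and vote "yes".
        self.prepared[txn_id] = data
        return True

    def commit(self, txn_id):
        self.committed.append(self.prepared.pop(txn_id))

def replicated_commit(txn_id, data, replicas):
    """Commit only if every replica (primary and standbys) prepares.
    If any vote fails, abort everywhere so no copy diverges."""
    if all(r.prepare(txn_id, data) for r in replicas):
        for r in replicas:
            r.commit(txn_id)
        return True
    for r in replicas:
        r.prepared.pop(txn_id, None)  # abort the staged transaction
    return False

primary, standby = Replica("primary"), Replica("standby")
replicated_commit(1, {"debit": 100}, [primary, standby])
print(primary.committed == standby.committed)  # True: copies stay consistent
```

The hard parts the comment points to (coordinator crashes mid-protocol, standbys that voted yes but never hear the outcome) are exactly what this sketch omits.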

        • by jythie ( 914043 ) on Sunday July 30, 2023 @01:02PM (#63725756)
          -transactional consistency- is the bit a lot of people tend to forget. One company I worked for had two database systems: one that was really modern and fast, and another for cases where we could not afford to lose transactions. The designs and priorities were night and day.
          • It's difficult for me to see how anyone could "forget" something that fundamental.
            • by david.emery ( 127135 ) on Sunday July 30, 2023 @01:21PM (#63725812)

              Most so-called "system engineers" are appallingly ignorant of stuff like this. And so are a large number of IT people (as opposed to those trained in software engineering or complex computer science.)

              Fortunately, on my last big program, all 3 of my (Army LtCol) bosses had advanced degrees in computer science, so we could have intelligent conversations about stuff like this without their eyes glazing over. Most of the other people around us, even though they were involved in software-intensive system design, had no clue. (True for military, civil service, and contractors, both SETA support to the government and the prime contractor.) But that often made me the 'lone voice crying in the wilderness' in technical review meetings.

            • Because they're more interested in talking about how it's webscale and does sharding.

              https://youtu.be/b2F-DItXtZs [youtu.be]

            • by sjames ( 1099 )

              The kind of engineers that find things like that to be unforgettable cost more...

        • 5x9 usually is a SERVICE SLO, not a machine SLO. You usually run the service on many machines.
          • by Pieroxy ( 222434 )

            Not usually, only for the services that require no transactional consistency.

            Hard to ensure transactional consistency when your service runs on many machines.

            • For that you need an eventually consistent database backend. They exist.
              • by Pieroxy ( 222434 ) on Sunday July 30, 2023 @05:15PM (#63726166) Homepage

                An eventually consistent database has *nothing* to do with a transactional system. I don't want my bank account to eventually show me my correct balance. I want it to be correct every time.

                Let me know if you ever work for a bank, and let me know which bank so that I'm sure to avoid it.

                • There are of course special cases that require additional care. But the answer is never "do everything on one machine." At least, not the best answer. You still need to be able to be simultaneously correct and redundant.
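The disagreement above can be made concrete. A hypothetical sketch of what naive last-writer-wins reconciliation does to a bank balance when two replicas accept concurrent withdrawals:

```python
# Hypothetical sketch: a bank balance replicated on two nodes that accept
# writes independently and reconcile later ("eventually consistent").
balance = 100

# Two concurrent withdrawals, each validated against a stale local copy:
replica_a = balance - 80   # node A: sees 100, allows an 80 withdrawal
replica_b = balance - 50   # node B: also sees 100, allows a 50 withdrawal

# Last-writer-wins reconciliation keeps one update and silently drops the other:
merged = replica_b  # suppose B's write arrived last
print(merged)  # 50 -- yet 130 was handed out against a 100 balance

# A transactionally consistent system serializes the two withdrawals and
# rejects the second one for insufficient funds.
```

Eventually consistent stores avoid this with domain-specific merge logic (e.g. keeping the ledger of operations rather than the balance), which is the "additional care" the reply alludes to.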
        • by castrox ( 630511 )

          How cool. Thanks a lot for sharing, David Emery.

    • by tlhIngan ( 30335 ) <slashdot&worf,net> on Sunday July 30, 2023 @04:42PM (#63726092)

      Well, 4x9 is barely 1 reboot a year either - that's around 53 minutes of downtime per year. 5x9 is about 5 minutes of downtime.

      If you have ever turned on a modern x86 server, you know it can take 5 minutes just to get through the BIOS. Heck, you can be sitting there for a minute before you even see anything on the screen (your only feedback that something is happening, besides the little power LED, is the fans making more noise than a jet taking off). Then you sit there as the BIOS runs the self-tests on everything.

      And yes, it can get old - if you're trying to recover data off one and having to reboot between multiple environments you can be sitting there waiting for the BIOS to even present to you the boot options menu.

      The PC on your desktop - the desktop or laptop you do work on, has been way optimized for boot time - these days you can be in and out of the BIOS within seconds, but for a server PC, not so much.

      The mainframe computer? Reboots are generally scheduled months in advance because it can take hours to shut down and come back up again. A lot of it is in the self-tests - you want to make sure every bit of RAM is working properly and keep making sure it works, and you want to make sure each disk is working just fine, or, if it isn't, provide enough warning that the data on that disk has been auto-migrated elsewhere so all someone needs to do is swap the disk out.

      522 bytes per sector seems about right to give you a standard ECC system for detecting early drive failures - you read the disk, and you have ECC to check whether the sector was read correctly or not. (This is on top of the ECC that the drive already does - but remember there is a possibility of an undetectable bit error every TB or so transferred with modern hard drives.)
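Those extra bytes per sector buy end-to-end error detection on top of the drive's internal ECC. A minimal illustration, using a CRC-32 as a stand-in for a real ECC code (which can also correct errors, not just detect them):

```python
import struct
import zlib

SECTOR = 512

def write_sector(data: bytes) -> bytes:
    """Store a 512-byte sector with a 4-byte CRC appended (516 bytes here;
    real mainframe formats reserve more and can correct, not just detect)."""
    assert len(data) == SECTOR
    return data + struct.pack(">I", zlib.crc32(data))

def read_sector(stored: bytes) -> bytes:
    """Return the payload, raising if the sector was corrupted on the media."""
    data, (crc,) = stored[:SECTOR], struct.unpack(">I", stored[SECTOR:])
    if zlib.crc32(data) != crc:
        raise IOError("sector CRC mismatch: media error detected")
    return data

good = write_sector(b"\x42" * SECTOR)
assert read_sector(good) == b"\x42" * SECTOR

bad = bytes([good[0] ^ 1]) + good[1:]   # flip one bit of the payload
try:
    read_sector(bad)
except IOError:
    print("corruption detected")
```

A drive that reports these failures early is what allows the auto-migration the comment describes: data moves off the suspect disk before it dies outright.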

    • (I still don't see how you achieve that kind of pervasive requirement using "just pick a requirement written on a 3x5 card to implement this week" Agile techniques.)
      Obviously the same way you do it otherwise. What should agile project management change? Obviously nothing.

    • by mjwx ( 966435 )

      There's a huge jump going from 4x9 to 5x9 reliability (something a lot of requirements writers don't appreciate when they levy such a requirement.) For the systems I worked on, that was "no more than 1 reboot every year." Now rebooting has gotten faster on a lot of systems, but it's still something you can't do often in a 5x9 environment. Thus the ability to fix hardware and software while the system is running is a huge consideration. For the systems I worked on, that usually meant hot-standbys (but then the set of 'system faults' also included power failures or other situations external to the box.)

      And the big difference between 4x9 and 5x9 in that regard was something I'd use to educate requirements writers and managers about the impact of just a single digit in a set of requirements or performance contract. It's also the kind of requirement that is pervasive across a system, where hardware, systems software, applications software and operations/procedures all had to work together to meet the need. (I still don't see how you achieve that kind of pervasive requirement using "just pick a requirement written on a 3x5 card to implement this week" Agile techniques.)

      You're never going to have five 9s on a single point of failure. Redundancy, especially live redundancy, is key to continuous uptime. If you're running a POS (Point Of Sale) system for a chain, you really want a minimum of 3 nodes; that way one can be taken offline for maintenance and you can still suffer a single node outage without suffering a complete outage, sometimes even going as far as to have a 4th as a hot spare.

      We used to use mainframes for POS applications because they used to be the only system
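The three-node arithmetic above is easy to check. Assuming independent node failures (a strong assumption: shared power, network, or software faults break it), the availability of n redundant nodes, each available a fraction a of the time, is 1 - (1 - a)^n:

```python
def combined_availability(a, n):
    """Probability that at least one of n nodes is up, given each node has
    availability a and failures are independent (the strong assumption)."""
    return 1 - (1 - a) ** n

# Three 99.9%-available POS nodes, any one of which can carry the load:
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.999, n):.9f}")
```

On paper, two 99.9% nodes already exceed five nines; in practice, correlated failures and imperfect failover are why the thread above treats switchover as the hard part.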

  • 4.5GHz z15 CPU speed (Score:5, Interesting)

    by Ecuador ( 740021 ) on Sunday July 30, 2023 @10:04AM (#63725460) Homepage

    When I did a compute cloud performance comparison [dev.to] earlier this year, I included the IBM Cloud's z/Architecture servers with the latest z15 @ 4.5 GHz out of curiosity. Turns out the cores performed similarly to 3GHz Ampere Altra cores on my custom (perl/C) benchmark suite.
    Obviously, given how expensive they are, there's no point in using them for anything other than your System/360 software from decades ago, but still it was interesting to see how fast the top of the line is for general computing.

    • by gtall ( 79522 ) on Sunday July 30, 2023 @10:27AM (#63725506)

      One point of using them is that they have incredibly redundant capabilities. As the summary mentioned, the components are hot-swappable. So there is a point to using them for other than your System/360 software from decades ago. Speed is important, but it isn't everything, and frequently not even the overriding concern.

      • by Ecuador ( 740021 )

        I was mainly referring to the IBM Cloud: you don't really care how each cloud has built its redundancy as long as they offer similar reliability, and you don't want to pay several times more for slower IBM servers when your workloads don't need them.
        As for building your own clusters, there are similarly redundant / hot-swappable high end servers from other manufacturers too, it's not like IBM has the only solution.

        • by ls671 ( 1122017 )

          As for building your own clusters, there are similarly redundant / hot-swappable high end servers from other manufacturers too, it's not like IBM has the only solution.

          They often suck at IO compared to a mainframe, though.

    • by vbdasc ( 146051 )

      This reminds me of my own "tests" I was doing many years ago as a system operator with too much free time, naively comparing the performance of an IBM 4361 mini-mainframe to a 80486 PC... forgetting that a 80486 PC could never do the same things the mainframe was doing, even if the System/370 code was somehow rewritten to use x86 instructions. The raw CPU performance is but one of the things that are important in getting your work done.

    • by bws111 ( 1216812 )

      So you actually think that they spun up an LPAR that had a CPU dedicated to your task? Here's a hint - they didn't. You were sharing that resource with dozens, maybe hundreds of other users. Your benchmark is completely invalid.

  • by cormandy ( 513901 ) on Sunday July 30, 2023 @10:15AM (#63725478)
    • Hey can someone get the attention of a moderator / editor to alert them to the fact that they didn't actually link to the article? See parent of this comment for link. (facepalm)

  • But they are irritating in all of the places I've interfaced with them.

    AFAIHBT it's relatively simple to make web interfaces to your mainframe apps, but all the ones I've had to use had to be accessed through a 3270 terminal or emulator. I did a lot of terminal swapping when I worked for the county of Santa Cruz. Now I'm using a 3270 emulator to access a really atrociously user-unfriendly system elsewhere... which I also did when working for IBM itself.

  • by MikeDataLink ( 536925 ) on Sunday July 30, 2023 @10:38AM (#63725520) Homepage Journal

    They survive because the cost of re-coding and re-implementing legacy applications is more than just buying a new mainframe every 5-7 years.

    Why spend $2M re-writing your sales app when you can just keep it and buy a $300K mainframe?

    • by bsolar ( 1176767 ) on Sunday July 30, 2023 @10:49AM (#63725536)

      They survive because the cost of re-coding and re-implementing legacy applications is more than just buying a new mainframe every 5-7 years.

      Why spend $2M re-writing your sales app when you can just keep it and buy a $300K mainframe?

      $2M? More like $200M... A corporation which invested heavily in a mainframe as a core component of its IT ecosystem would require a staggering investment of time and resources over literally decades to completely migrate off it. It might require migrating dozens if not hundreds of applications and services, and often that migration would require significant rewrites or adaptations.

    • by mendax ( 114116 ) on Sunday July 30, 2023 @01:30PM (#63725828)

      They survive because the cost of re-coding and re-implementing legacy applications is more than just buying a new mainframe every 5-7 years. Why spend $2M re-writing your sales app when you can just keep it and buy a $300K mainframe?

      Indeed! In fact, I think it is this backward compatibility, both in hardware and software, that makes these beasts so amazing, even more than their redundancy and reliability. It is a pretty amazing bit of both hardware and software engineering to enable such compatibility, especially when one takes into account the fact that the earliest apps for these systems were written nearly 60 years ago.

    • by jezwel ( 2451108 )

      They survive because the cost of re-coding and re-implementing legacy applications is more than just buying a new mainframe every 5-7 years.

      Why spend $2M re-writing your sales app when you can just keep it and buy a $300K mainframe?

      Where can we get these cheap mainframes? Last I looked, we replaced our mainframe this year to the tune of several million dollars, and the systems development costs on it are now being measured in fractions of a billion.

  • It's a 'legacy reptile'

  • by jfdavis668 ( 1414919 ) on Sunday July 30, 2023 @11:03AM (#63725568)
    We used to have Tandem NonStop mini-computers at our various sites, solely to ensure 24-7 system access. Over time though, the requirements at the sites changed and we didn't need that kind of reliability. Fewer people used them, internet access was much more reliable, overnight use dropped to zero. But the people we supported loved them, even as the systems that ran on them became hopelessly obsolete. They just loved having hardware they could always rely on. But the cost became too much. Once we deployed replacement software which worked great, everyone forgot about the NonStops and we quickly dumped them. We just didn't require that kind of reliability any more.
  • It's been awhile since my mainframe days, but during Y2K there was a grand total of 1 system programmer [ibm.com] using SMP/E [wikipedia.org] to administer the company's IBM mainframe that was used by thousands of employees. Compare that to today's hordes of DevOps engineers spinning up thousands of containers using their own individual images and Rube Goldberg pipeline processes.

    From the article: "A medium-sized bank may use a mainframe to run 50 or more separate financial applications and supporting processes and employ th

    • by keltor ( 99721 ) *
      Unless you are just running legacy shit on the mainframe, you are going to have developers and "application admins" galore, but sysadmins - some places really would just have 1. My linux admins take care of our Z series, but realistically only two of them REALLY know how to do stuff in an emergency (excluding our senior devs and architects who probably could do this stuff if the sysadmins both died.) OS/2 is a much bigger problem as we simply cannot seem to retain staff to handle them and in banking and i
  • We run a maxed out multi-frame Z16 JUST to run SAP HANA because otherwise the overnight jobs take too long sometimes. Even now we have to evaluate any changes super carefully or it will not be ready at 5am and people will freak out about reports. We're in no way married to the hardware, it's just the only way to do it and have all the support aspects so we can yell at someone when things don't work out.
  • It's because they can't (or daren't) replace the COBOL systems running on them, since nobody understands what those systems do.
  • No Diesel, No Broccoli

  • > TodayÃ(TM)s mainframe can have up to 240 server-grade CPUs, 40TB of error-correcting RAM, and many petabytes of redundant flash-based secondary storage. TheyÃ(TM)re designed to process large amounts of critical data while maintaining a 99.999 percent uptimeÃ"thatÃ(TM)s a bit over five minutes' worth of outage per year...

    I wonder if Slashdot ran on those, would it be able to deal with Unicode?

  • and insert characters under mask. //DD ...

  • I dunno if AS/400s could be considered mainframes, but this discussion brings to mind a fun situation my wife encountered about 25 years ago. Her company had tried to drop AS/400 support for their product, but as often happens, one customer was willing to pay lots of money to keep it going, but not enough to pay for an actual AS/400. Instead, in the lab, they ran an AS/400 VM under AIX. My wife's company convinced their customer to begin the transition process to AIX, so the customer decided that the first
