The IBM Mainframe: How It Runs and Why It Survives (arstechnica.com) 138
Slashdot reader AndrewZX quotes Ars Technica: Mainframe computers are often seen as ancient machines—practically dinosaurs. But mainframes, which are purpose-built to process enormous amounts of data, are still extremely relevant today. If they're dinosaurs, they're T-Rexes, and desktops and server computers are puny mammals to be trodden underfoot.
It's estimated that there are 10,000 mainframes in use today. They're used almost exclusively by the largest companies in the world, including two-thirds of Fortune 500 companies, 45 of the world's top 50 banks, eight of the top 10 insurers, seven of the top 10 global retailers, and eight of the top 10 telecommunications companies. And most of those mainframes come from IBM.
In this explainer, we'll look at the IBM mainframe computer -- what it is, how it works, and why it's still going strong after over 50 years.
"Today's mainframe can have up to 240 server-grade CPUs, 40TB of error-correcting RAM, and many petabytes of redundant flash-based secondary storage. They're designed to process large amounts of critical data while maintaining a 99.999 percent uptime -- that's a bit over five minutes' worth of outage per year..."
"RAM, CPUs, and disks are all hot-swappable, so if a component fails, it can be pulled and replaced without requiring the mainframe to be powered down."
Tractors (Score:5, Insightful)
Re: (Score:2)
I have owned many tractors, but I have always only dreamt about owning a mainframe, especially for I/O tasks :(
Re: (Score:2)
Well, :P
I would start by dreaming about the power plant you will need to operate one.
Re: (Score:3)
If you fancy running a mainframe-class operating system at home, you can run a version of ICL's "George" O/S on a Raspberry Pi!
https://en.wikipedia.org/wiki/... [wikipedia.org]
https://www.george3.co.uk/ [george3.co.uk]
https://www.rs-online.com/desi... [rs-online.com]
I worked for years on ICL machines (mostly 2900 series and above) and found VME to be a simply wonderful operating system (George was a forerunner of VME). Some of the concepts are still years ahead of their time (file generations) and the S3 programming language is one of the best language
Re: (Score:2)
I know all of that since I have worked on mainframes. It's just a fantasy dream of having my own mainframe in my basement. It would run my electricity bill up as well, so it would not be practical to run it 24/7 like I do with other devices. I came close to buying an old one just for the kick of it, but as you said, I don't need one.
Re: (Score:2)
Some of the ideas would be nice in a personal machine, like VMs.
Re: Tractors (Score:2)
IIUC, the BigCo I work for is running a mission-critical mainframe app in some kind of virtualization. On what iron, I dunno. Not sure if commodity cloud hardware can emulate that.
Re: Tractors (Score:2)
None of that makes any difference. If you're arguing sector size, you've missed the point.
Re: (Score:2)
And why do we? Because the story people have in their heads says to use them. Because their "daddy" did. Because thinking is for other people.
5x9 reliability: "one reboot every year" (Score:5, Informative)
There's a huge jump going from 4x9 to 5x9 reliability (something a lot of requirements writers don't appreciate when they levy such a requirement.) For the systems I worked on, that was "no more than 1 reboot every year." Now rebooting has gotten faster on a lot of systems, but it's still something you can't do often in a 5x9 environment. Thus the ability to fix hardware and software while the system is running is a huge consideration. For the systems I worked on, that usually meant hot-standbys (but then the set of 'system faults' also included power failures or other situations external to the box.)
And the big difference between 4x9 and 5x9 in that regard was something I'd use to educate requirements writers and managers about the impact of just a single digit in a set of requirements or performance contract. It's also the kind of requirement that is pervasive across a system, where hardware, systems software, applications software and operations/procedures all had to work together to meet the need. (I still don't see how you achieve that kind of pervasive requirement using "just pick a requirement written on a 3x5 card to implement this week" Agile techniques.)
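The arithmetic behind those nines is worth spelling out once - a quick back-of-the-envelope sketch in Python (assuming a 365.25-day year):

```python
# Downtime budget per year for each "number of nines" of uptime.
MINUTES_PER_YEAR = 365.25 * 24 * 60   # ~525,960

for nines in (3, 4, 5, 6):
    downtime = MINUTES_PER_YEAR * 10 ** -nines
    print(f"{nines} nines: {downtime:7.2f} minutes/year")

# 3 nines:  525.96 minutes/year
# 4 nines:   52.60 minutes/year
# 5 nines:    5.26 minutes/year   <- barely one leisurely reboot
# 6 nines:    0.53 minutes/year
```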
Re: 5x9 reliability: "one reboot every year" (Score:5, Interesting)
First, we don't specify stories in terms of requirements; we specify them in terms of features that satisfy requirements. Imagining your perception to be true, however, this can still be done in an iterative manner.
The larger story would be "As a consumer of the app, I need 5 9s uptime so I can ensure critical services are always available to users." This would have acceptance criteria, and the subsequent tasks would be requirement/validation based, answering questions like how much downtime is needed to replace a NIC or apply a kernel security upgrade, or the other varying knowledge required to ensure the system meets spec.
Ultimately, Agile is a system intended to reduce risk to the business by ensuring R&D spends minimal time on work that isn't the business's top priority. It's suited to software development more than systems development, but as long as a person doesn't become too dogmatic in their thinking, a lot of the ideas are applicable to any work that requires a team.
Re: 5x9 reliability: "one reboot every year" (Score:5, Insightful)
Ultimately, Agile is a system intended to reduce risk to the business by ensuring R&D spends minimal time on work that isn't the business's top priority. It's suited to software development more than systems development, but as long as a person doesn't become too dogmatic in their thinking, a lot of the ideas are applicable to any work that requires a team.
Agile works very well for software development because it's easy to compartmentalize and refactoring is fairly cheap and easy as well.
It's less well suited to systems development as interfaces become more important and refactoring is really expensive.
On large industrial projects, waterfall is still king. Discovering a new requirement partway through the project might mean sending someone out into the field to manually update dozens of field devices. Or a change in a control protocol might require another round of site acceptance testing: sending an observer to every single remotely controlled device and validating that commands are carried out.
Agile is a philosophy, and just like every other philosophy it's brilliant at some aspects and terrible at others.
Where exactly a mainframe falls along that line is hard to say. But if it involves a lot of custom hardware I'd be very, very cautious about going agile and discovering part way through that a bunch of already fabbed chips don't meet the new requirements.
Re: 5x9 reliability: "one reboot every year" (Score:2)
Re: (Score:2)
Try to say that to SpaceX..
Explain.
Re: (Score:2)
On large industrial projects waterfall is still king.
Actually it is not. All Agile Methods were developed/discovered in industrial production processes.
You do not even know what waterfall is.
The grandparent talked about the difference between a 4x9 versus a 5x9 system, aka 99.99% uptime versus 99.999%.
If you try to design and implement a system in waterfall style, and the system by its sheer size needs four years to be built: you make an error in year one but discover it at the end of year three. Only God knows how much work was wasted and already paid for but has to be redone, probably at the same or even higher cost than the first time.
It's about tradeoffs. Waterfall is a lot more effort, and it means that missed requirements/bad designs are a bit more expensive to fix, but it means that missed requirements and bad designs are way less common.
In software, the waterfall overhead isn't worth it. But when missed requirements and bad designs become really expensive then you need something like waterfall.
very cautious about going agile and discovering part way through that a bunch of already fabbed chips don't meet the new requirements.
Ah, and here we are. How the funk should that be possible in an agile project? THAT IS WATERFALL, not agile. Dumbass.
Every damn agile method on the planet says pretty clearly: a part of a thing is done - and only then - when all its requirements are implemented and tested and verified and approved!
The whole idea of Agile is that requirement discovery happens throughout the project.
That's great for software, terrible for the example I la
Re: (Score:2)
IMHO, the problem is with Agile (TM, pat. pend. some restrictions apply, offer not valid on the blue moon or alternate Tuesdays...) vs. agile.
The formalized Agile, like most such methodologies, is an attempt to create a rigid, rules-driven procedure that duplicates what a unicorn team of well-qualified and experienced developers does, only with a less qualified and experienced team (and generally less qualified management as well). Like most such methodologies, it tends to quickly devolve into a religious debate and
Re: (Score:2)
Re:5x9 reliability: "one reboot every year" (Score:5, Informative)
Redundancy isn't a complete solution, for a couple of reasons:
1. Switchover is really hard. Maintaining a hot standby and *transactional consistency* requires both the primary and the backup(s) to complete a transaction before final commit. There have been lots of examples where the switchover system (hardware and software) didn't quite get it right, with the result that either the standby was unable to take over, or the inconsistencies between primary and standby caused downstream problems.
And that's assuming a single system has the authoritative data, versus some distributed system designs that want to survive network partitioning. Turns out there's no general solution to the 'partitioned database replication problem', where "solution" means "an algorithm to take the full set of updates to multiple independent copies and jam them together to produce a single consistent system." (There are solutions that take advantage of domain knowledge.)
2. It doesn't always handle "Bohrbugs," where the same fault impacts all of the copies. That includes *zero-day malware*. (And as Nancy Leveson showed, 'n-version' programming doesn't necessarily fix this, because even independent groups of designers tend to make the same mistakes.)
But when the *system* requirement is 5x9 or more stringent, redundancy is almost always part of the solution. The questions then become "how much, and at what cost?"
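For readers who want the shape of point 1 in code: below is a minimal, hypothetical two-phase-commit sketch of "both the primary and the backup(s) must complete a transaction before final commit." All class and function names are invented for illustration; real switchover systems add timeouts, durable logs, and recovery for a coordinator crash between the two phases, which is exactly where the "didn't quite get it right" failures live.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.committed = {}   # visible state
        self.staged = {}      # prepared-but-uncommitted transactions

    def prepare(self, txid, update):
        # Phase 1: stage the update durably and vote yes.
        self.staged[txid] = update
        return True

    def commit(self, txid):
        # Phase 2: make the staged update visible.
        self.committed.update(self.staged.pop(txid))

    def abort(self, txid):
        self.staged.pop(txid, None)

def replicated_commit(txid, update, replicas):
    # Acknowledge the transaction only if *every* copy prepared it.
    if all(r.prepare(txid, update) for r in replicas):
        for r in replicas:
            r.commit(txid)
        return True
    for r in replicas:
        r.abort(txid)
    return False

primary, standby = Replica("primary"), Replica("standby")
replicated_commit("tx1", {"balance": 100}, [primary, standby])
assert primary.committed == standby.committed  # consistent switchover state
```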
Re:5x9 reliability: "one reboot every year" (Score:5, Informative)
Re: (Score:2)
Re:5x9 reliability: "one reboot every year" (Score:5, Insightful)
Most so-called "system engineers" are appallingly ignorant of stuff like this. And so are a large number of IT people (as opposed to those trained in software engineering or complex computer science.)
Fortunately, on my last big program, all 3 of my (Army LtCol) bosses had advanced degrees in computer science, so we could have intelligent conversations about stuff like this without their eyes glazing over. Most of the other people around us, even though they were involved in software-intensive system design, had no clue. (True for military, civil service, and contractors, both SETA support to the government and the prime contractor.) But that often made me the 'lone voice crying in the wilderness' in technical review meetings.
Re: (Score:2)
Because they're more interested in talking about how it's webscale and does sharding.
https://youtu.be/b2F-DItXtZs [youtu.be]
Re: (Score:2)
The kind of engineers that find things like that to be unforgettable cost more...
Re: 5x9 reliability: "one reboot every year" (Score:2)
Re: (Score:2)
Not usually, only for the services that require no transactional consistency.
Hard to ensure transactional consistency when your service runs on many machines.
Re: 5x9 reliability: "one reboot every year" (Score:2)
Re: 5x9 reliability: "one reboot every year" (Score:5, Insightful)
An eventually consistent database has *nothing* to do with a transactional system. I don't want my bank account to eventually show me my correct balance. I want it to be correct every time.
Let me know if you ever work for a bank, and let me know which bank so that I'm sure to avoid it.
Re: 5x9 reliability: "one reboot every year" (Score:2)
Re: (Score:2)
I did use some and wrote one, so trust me, I know what they are. However, they are no replacement for a transactional system, at least for some operations, so claiming you can distribute everything with an eventually consistent backend is forgetting about my bank account balance being accurate at all times. That won't happen.
Re: (Score:3)
Moreover, from Wikipedia: "Eventual consistency is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value".
So that's very precisely what I wrote. It is guaranteed to be consistent, *at some point* which doesn't work well for my bank account balance.
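A toy illustration of that window, using an invented two-replica store: the write is acknowledged by one replica and propagated later, so a reader on the other replica sees a stale balance until the queue drains.

```python
class Account:
    def __init__(self):
        self.balance = 100

replica_a, replica_b = Account(), Account()
replication_queue = []          # drained "eventually", not immediately

# A deposit is acknowledged by replica A alone.
replica_a.balance += 50
replication_queue.append(replica_a.balance)

print(replica_a.balance)        # 150 -- the writer sees the new value
print(replica_b.balance)        # 100 -- a reader on B sees a stale balance

# Eventually the queue drains and the replicas converge...
for value in replication_queue:
    replica_b.balance = value
assert replica_a.balance == replica_b.balance
# ...but "eventually" is exactly the window a bank balance can't tolerate.
```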
Re: (Score:2)
How cool. Thanks a lot for sharing, David Emery.
Re:5x9 reliability: "one reboot every year" (Score:4, Interesting)
Well, 4x9 is barely 1 reboot a year either - that's around 53 minutes of downtime per year. 5x9 is about 5 minutes of downtime.
If you have ever turned on a modern x86 server, it can take 5 minutes just to get through the BIOS. Heck, you can be sitting there for a minute before you even see anything on the screen (your only feedback that something is happening, besides the little power LED, is the fans making more noise than a jet taking off). Then you sit there as the BIOS runs the self-tests on everything.
And yes, it can get old - if you're trying to recover data off one and having to reboot between multiple environments you can be sitting there waiting for the BIOS to even present to you the boot options menu.
The PC on your desk - the desktop or laptop you do work on - has been heavily optimized for boot time; these days you can be in and out of the BIOS within seconds. For a server PC, not so much.
The mainframe computer? Reboots are generally scheduled months in advance because it can take hours to shut down and come back up again. A lot of it is in the self-tests - you want to verify that every bit of RAM is working properly and keep making sure it works, and you want to make sure each disk is working just fine or, if it isn't, get enough warning that the data on that disk can be auto-migrated elsewhere, so all someone needs to do is swap the disk out.
522 bytes per sector seems about right to give you a standard ECC scheme to detect early drive failures - you read the disk, and you have ECC to check whether the sector was read correctly or not. (This is on top of the ECC that the drive already does - but remember there is a possibility of an undetectable bit error every TB or so transferred with modern hard drives.)
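For illustration only, here's a toy version of that read-and-verify loop, with an invented layout of 512 data bytes plus a 4-byte checksum in the sector's spare space. A bare CRC only detects errors; real drive ECC is a correcting code, so treat this strictly as a sketch of the bookkeeping.

```python
import zlib

SECTOR_DATA = 512  # payload bytes; the rest of the sector holds check data

def write_sector(data: bytes) -> bytes:
    # Append a 4-byte CRC in the sector's out-of-band space.
    assert len(data) == SECTOR_DATA
    return data + zlib.crc32(data).to_bytes(4, "big")

def read_sector(raw: bytes, error_log: list) -> bytes:
    data, stored = raw[:SECTOR_DATA], raw[SECTOR_DATA:]
    if zlib.crc32(data).to_bytes(4, "big") != stored:
        error_log.append("sector failed check")  # past a threshold: remap it
        raise IOError("unrecoverable sector")
    return data

log = []
sector = write_sector(bytes(SECTOR_DATA))
assert read_sector(sector, log) == bytes(SECTOR_DATA)
```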
Re: (Score:2)
(I still don't see how you achieve that kind of pervasive requirement using "just pick a requirement written on a 3x5 card to implement this week" Agile techniques.)
Obviously the same way you do it. What would agile project management change about that? Obviously nothing.
Re: (Score:2)
There's a huge jump going from 4x9 to 5x9 reliability (something a lot of requirements writers don't appreciate when they levy such a requirement.) For the systems I worked on, that was "no more than 1 reboot every year." Now rebooting has gotten faster on a lot of systems, but it's still something you can't do often in a 5x9 environment. Thus the ability to fix hardware and software while the system is running is a huge consideration. For the systems I worked on, that usually meant hot-standbys (but then the set of 'system faults' also included power failures or other situations external to the box.)
And the big difference between 4x9 and 5x9 in that regard was something I'd use to educate requirements writers and managers about the impact of just a single digit in a set of requirements or performance contract. It's also the kind of requirement that is pervasive across a system, where hardware, systems software, applications software and operations/procedures all had to work together to meet the need. (I still don't see how you achieve that kind of pervasive requirement using "just pick a requirement written on a 3x5 card to implement this week" Agile techniques.)
You're never going to have five 9s on a single point of failure. Redundancy, especially live redundancy, is key to continuous uptime. If you're running a POS (Point Of Sale) system for a chain, you really want a minimum of 3 nodes; that way one can be taken offline for maintenance and you can still suffer a single-node outage without a complete outage. Some setups even go as far as having a 4th node as a hot spare.
We used to use mainframes for POS applications because they used to be the only system
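The parent's three-node math is easy to check under the (optimistic) assumption of independent node failures - the same assumption that external events like power outages break:

```python
# Availability of n redundant nodes when the service survives with
# at least one node up: 1 - (1 - a)^n. Assumes independent failures,
# which shared power/network events violate in practice.
def system_availability(node_availability: float, nodes: int) -> float:
    return 1 - (1 - node_availability) ** nodes

print(f"{system_availability(0.999, 1):.9f}")  # 0.999000000 -- one node
print(f"{system_availability(0.999, 3):.9f}")  # 0.999999999 -- three nodes
```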
4.5GHz z15 CPU speed (Score:5, Interesting)
When I did a compute cloud performance comparison [dev.to] earlier this year, I included the IBM Cloud's z/Architecture servers with the latest z15 @ 4.5 GHz out of curiosity. Turns out the cores performed similarly to 3GHz Ampere Altra cores on my custom (perl/C) benchmark suite.
Obviously, given how expensive they are, there's no point in using them for anything other than your System/360 software from decades ago, but still it was interesting to see how fast the top of the line is for general computing.
Re:4.5GHz z15 CPU speed (Score:5, Insightful)
One point of using them is that they have incredibly redundant capabilities. As the preview mentioned, the components are hot-swappable. So there is a point to using them for things other than your System/360 software from decades ago. Speed is important, but it isn't everything, and frequently it's not even the overriding concern.
Re: (Score:3)
I was mainly referring to the IBM Cloud; you don't really care how each cloud has built its redundancy, as long as they offer similar reliability and you don't have to pay several times more for slower IBM servers when your workloads don't need them.
As for building your own clusters, there are similarly redundant / hot-swappable high-end servers from other manufacturers too; it's not like IBM has the only solution.
Re: (Score:2)
As for building your own clusters, there are similarly redundant / hot-swappable high-end servers from other manufacturers too; it's not like IBM has the only solution.
They often suck at I/O compared to a mainframe, though.
Re: (Score:2)
This reminds me of my own "tests" I was doing many years ago as a system operator with too much free time, naively comparing the performance of an IBM 4361 mini-mainframe to an 80486 PC... forgetting that an 80486 PC could never do the same things the mainframe was doing, even if the System/370 code were somehow rewritten to use x86 instructions. Raw CPU performance is but one of the things that matter in getting your work done.
Re: (Score:2)
So you actually think that they spun up an LPAR that had a CPU dedicated to your task? Here's a hint - they didn't. You were sharing that resource with dozens, maybe hundreds of other users. Your benchmark is completely invalid.
Link to article (Score:3)
Re: (Score:2)
Hey can someone get the attention of a moderator / editor to alert them to the fact that they didn't actually link to the article? See parent of this comment for link. (facepalm)
They are everywhere (Score:2)
But they are irritating in all of the places I've interfaced with them.
AFAIHBT it's relatively simple to make web interfaces to your mainframe apps, but all the ones I've had to use had to be accessed through a 3270 terminal or emulator. I did a lot of terminal swapping when I worked for the county of Santa Cruz. Now I'm using a 3270 emulator to access a really atrociously user-unfriendly system elsewhere... which I also did when working for IBM itself.
Legacy Applications... (Score:5, Insightful)
They survive because the cost of re-coding and re-implementing legacy applications is more than just buying a new mainframe every 5-7 years.
Why spend $2M re-writing your sales app when you can just keep it and buy a $300K mainframe?
Re:Legacy Applications... (Score:5, Informative)
They survive because the cost of re-coding and re-implementing legacy applications is more than just buying a new mainframe every 5-7 years.
Why spend $2M re-writing your sales app when you can just keep it and buy a $300K mainframe?
2M? More like 200... A corporation that invested heavily in a mainframe as a core component of its IT ecosystem would require a staggering investment of time and resources over literally decades to completely migrate off it. It might require migrating dozens if not hundreds of applications and services, and often that migration would require significant rewrites or adaptations.
Re:Legacy Applications... (Score:4, Insightful)
Indeed! In fact, I think it is this backward compatibility, both in hardware and software, that makes these beasts so amazing, even more than their redundancy and reliability. It is a pretty amazing bit of both hardware and software engineering to enable such compatibility, especially when one takes into account the fact that the earliest apps for these systems were written nearly 60 years ago.
Re: (Score:2)
They survive because the cost of re-coding and re-implementing legacy applications is more than just buying a new mainframe every 5-7 years.
Why spend $2M re-writing your sales app when you can just keep it and buy a $300K mainframe?
Where can we get these cheap mainframes? Last I looked, we replaced our mainframe this year to the tune of several million dollars, and the systems development costs on it were now being measured in fractions of billions.
NOT A Dinosaur (Score:2)
It's a 'legacy reptile'
Re: (Score:2)
Re: (Score:2)
perhaps a Silurian.
Used to use Tandem NonStop (Score:5, Insightful)
So... basically a different type of supercomputer (Score:2)
Re: (Score:2)
Thousands of support personnel? Uh, no. (Score:2)
It's been a while since my mainframe days, but during Y2K there was a grand total of 1 system programmer [ibm.com] using SMP/E [wikipedia.org] to administer the company's IBM mainframe, which was used by thousands of employees. Compare that to today's hordes of DevOps engineers spinning up thousands of containers using their own individual images and Rube Goldberg pipeline processes.
From the article: "A medium-sized bank may use a mainframe to run 50 or more separate financial applications and supporting processes and employ th
Re: (Score:2)
Really big SAP HANA Instances (Score:2)
how it survives (Score:2)
Re: how it survives (Score:2)
Says the person who's never seen any really bad COBOL. With many languages, subroutine entry and exit points are fairly recognizable. But COBOL is a different beast entirely. You could have a nice long sequence of paragraphs being used as a subroutine with one perform statement and later see another perform statement use just a subset of that same sequence of paragraphs. Then of course, you have idiot programmers who violate the rules of the language and jump out of a sequence of paragraphs being performed
Re: (Score:2)
They run on Diesel (Score:2)
No Diesel, No Broccoli
Use for Slashdot? (Score:2)
> TodayÃ(TM)s mainframe can have up to 240 server-grade CPUs, 40TB of error-correcting RAM, and many petabytes of redundant flash-based secondary storage. TheyÃ(TM)re designed to process large amounts of critical data while maintaining a 99.999 percent uptimeÃ"thatÃ(TM)s a bit over five minutes' worth of outage per year...
I wonder if Slashdot ran on those, would it be able to deal with Unicode?
Branch and link (Score:2)
and insert characters under mask. //DD ...
Virtualize everything (Score:2)
I dunno if AS/400s could be considered mainframes, but this discussion brings to mind a fun situation my wife encountered about 25 years ago. Her company had tried to drop AS/400 support for their product, but as often happens, one customer was willing to pay lots of money to keep it going, but not enough to pay for an actual AS/400. Instead, in the lab, they ran an AS/400 VM under AIX. My wife's company convinced their customer to begin the transition process to AIX, so the customer decided that the first
Re: (Score:3)
You do realize that no one at IBM from over 50 years ago still works there, yes? Are you suggesting those are company policies? If so, please cite your evidence. If not, you are just making a silly point that you read an article about IBM history.
Re: They were very efficient at eugenics and genoc (Score:2, Informative)
Re: (Score:2)
Re: (Score:2)
It's called a joke, you humourless cunt.
You forgot the /sarcasm tag...since many on /. nowadays are insulted by humor when they are not warned about it being present in the post.
Re:5 9's (Score:4, Insightful)
From Wikipedia on ECC:
"Error correction code memory (ECC memory) is a type of computer data storage that uses an error correction code[a] (ECC) to detect and correct n-bit data corruption which occurs in memory. ECC memory is used in most computers where data corruption cannot be tolerated, like industrial control applications, critical databases, and infrastructural memory caches. "
I presume the rest of your screed is bollocks as well since you cannot even get ECC correct.
Re:5 9's (Score:5, Informative)
I'm so terribly sorry that you are confused on this.
Typical ECC memory stores 8 extra check bits for every 64 bits of data. With "this much parity" it is very possible to detect a single-bit flip in either the data or the parity and correct it. This is why ECC is so attractive to servers and other "mission critical" systems, precisely because it CAN (to a limited extent) correct damaged data. It's very common for high-end servers to keep counters on the number of corrections performed on each DIMM and, when the count finally exceeds a threshold, throw a warning light.
The next step past that is keeping a spare DIMM on hand and being able to copy the memory from a failing DIMM onto a good one. And/or hooks down in the operating system to get the kernel to start relocating pages. That can be tricky; lots of times some key kernel pages are difficult to relocate... which is when it's nice to have hardware that can turn on a spare at the hardware level.
There's a whole subset of math, called Hamming codes, that is all about how to "expand" data via "more parity bits" or some other scheme to not only detect errors but correct them. They are used very VERY commonly in radio protocols (WiFi and cellular) where bit corruptions are very frequent, and sending "extra" data and allowing the receiver to regenerate the lost data is more efficient than trying to signal for and do complete retransmits.
Also, old spinning-rust disk drives rely on Hamming-style codes A LOT. Like... A LOT A LOT. Like: the data nearly never reads correctly off the platter; it needs to be recovered on read. And if the percentage of corrected bits exceeds a threshold, the sector is rewritten, and if it exceeds an even higher threshold, the sector is relocated.
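For anyone who wants to see the trick concretely, here is the smallest classic member of that family, a Hamming(7,4) code: 3 check bits protect 4 data bits, and the syndrome points directly at a single flipped bit. Server SECDED ECC is the same construction scaled up (8 check bits per 64 data bits, plus detection of double-bit errors).

```python
# Hamming(7,4): check bits sit at positions 1, 2, and 4 of a 7-bit word.
def encode(d):                       # d = [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # positions 1..7

def correct(c):                      # c = 7-bit codeword, possibly damaged
    syndrome = 0
    for pos in range(1, 8):          # XOR of the positions of all set bits
        if c[pos - 1]:
            syndrome ^= pos
    if syndrome:                     # nonzero syndrome names the flipped bit
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]  # recovered data bits

word = encode([1, 0, 1, 1])
word[4] ^= 1                         # cosmic ray flips one bit
assert correct(word) == [1, 0, 1, 1]
```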
Re: (Score:2)
Re:5 9's (Score:5, Informative)
I did mention that if you use a RAID-like structure you can store redundancy information to recreate the data, but there's a huge penalty in access speed, memory density, and cost.
Anything else you need explained to you?
Turns out IBM does exactly that. https://www.ibm.com/community/... [ibm.com]
The IBM zEnterprise system introduced a new and innovative redundant array of independent memory (RAIM) subsystem design as a standard feature on all zEnterprise servers.
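Reduced to its essence, the RAID-like idea behind RAIM can be sketched with plain XOR parity, assuming the commonly described four-data-channels-plus-one-parity-channel layout (IBM's actual code is more sophisticated than a bare XOR):

```python
# Four data channels plus one XOR-parity channel: any single failed
# channel's contents are the XOR of the surviving channels.
data_channels = [0b1010, 0b0111, 0b1100, 0b0001]
parity_channel = 0
for word in data_channels:
    parity_channel ^= word

failed = 2                                  # say channel 2 dies
rebuilt = parity_channel
for i, word in enumerate(data_channels):
    if i != failed:
        rebuilt ^= word
assert rebuilt == data_channels[failed]     # lost channel reconstructed
```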
Re: (Score:2)
Anything else you need explained to you?
Yeah, what does the first C stand for? And when you answer realise that this isn't a fancy marketing name but a description of what the thing actually does.
NOTHING can CREATE DATA from GARBAGE.
Jesus, if that were true you wouldn't even be able to read this post, much less enjoy many of the world's technological wonders of today.
Re:5 9's (Score:4, Informative)
Re: (Score:2)
Re: (Score:2)
Instead of spending an astonishing amount of time explaining to us what you think ECC is doing, you should have spent a little time researching what server-grade ECC actually does, no?
Re: (Score:2)
It does not support Unicode now because it once did: people used writing-direction change codes to somehow smuggle malicious stuff into the postings.
I never researched how that works.
Instead of blacklisting certain code sequences, they removed Unicode completely again.
Why they are not able to add it back, no idea; it seems they do not see the "business reason."
Pretty annoying that a Mac user cannot use umlauts but a Windows user can :P
Re: (Score:2)
I strongly suspect that the real reason Slashdot (still) doesn't have Unicode support is because there is nobody left there who understands the Slashdot codebase well enough to implement a change of that magnitude... either they've already tried to do it, and failed, or they're afraid to try.
Re: (Score:2)
Slashdot is owned by an SEO company that scoops up strong tech media brands that are undervalued/underappreciated and uses them to promote its clients. It's not ideal, but they're keeping the lights on. Slashdot has been used as a PR platform for cmdrtaco's pals (take a look at the sites you can add to the sidebar and then look at who owns them, where they went to school, etc.), Cory Doctorow's pals, and now purely commercial interests for hire, so it's not totally unprecedented
Re: (Score:2)
More than 50% of mainframe workload is now "new workload", meaning Linux and/or Java. It's time for the "legacy only" meme to die.
Re:5 9's (Score:5, Informative)
If it has multiple CPUs it's not a mainframe.
IBM introduced the 3090 model 200 system in 1985, which included dual processors. They continued ramping up the CPU count from there.
Are you saying that's when they stopped making mainframes?
nobody out there is using memory in a RAID-style configuration (e.g. give up 1/n of your storage and n^x of your access speed so if there's a memory error you can recover the missing data)
Um, this is pretty much *exactly* what an ECC memory chip does internally.
Re: (Score:2)
They had a dual processor version of the S/360 which is the genesis of mainframes.
Re:5 9's (Score:5, Informative)
You really are misguided. Mainframes have had separate CPUs for things like terminals and I/O since their creation. That is their entire point: offload as much as possible from the main CPU, which shouldn't concern itself with where the data is located on the storage. It sends that task to the storage controller and waits for an interrupt. This is a good overview of ECC and the transaction processes. https://www.ibm.com/community/... [ibm.com]
Re:5 9's (Score:5, Informative)
ECC is a misnomer. It identifies when storage has failed, and doesn't have the ability to "correct" for that. Even the simplest parity systems wouldn't allow regenerating data from non-data.
This is categorically false.
It can be done with RAID but nobody out there is using memory in a RAID-style configuration
You are completely incorrect.
There are two kinds of parity memory, false parity and true parity. Early PCs used to have no parity (they had just as many RAM chips as they needed) and then it became the fashion to have false parity (just enough RAM to know if you had a memory error and throw an exception instead of just pretending nothing happened and compounding errors) and now you have two options when you buy DIMMs, false parity or true parity. All the normal DIMMs have false parity, if you buy RAM labeled as ECC then it's true parity. True parity memory has enough additional cells to actually regenerate the lost information.
What's more, this is not new technology. It's not even new to be able to have true parity memory on a PC, that has been possible for most of their history (It's been possible to get standardized true parity memory for PCs at least as long as we've been using DIMMs, I don't honestly recall if it was possible with SIPPs or SIMMs) but moreover it was the standard for Unix workstations way back in the way back. sun4m had parity memory, for example. I specifically remember getting console messages on a SS20 (which was loaded up with 4 ~120MHz HyperSPARC modules and running Cadence, Magic, etc.) about corrected memory errors. Then I ordered replacement hardware, and I got replacement modules cross-shipped. Downtime, just a few minutes to shut down, slap in the modules, and boot.
Most of us don't want to pay the various penalties for having ECC memory in our PCs. It limits your maximum bus speed, and it costs substantially more since there's actually more RAM on the modules. But almost anyone with a Ryzen processor in a desktop has at least one or two officially supported options available to them, and there are generally lots of other modules that will work.
I for one am running 32GB of budget-ass non-ECC RAM, and have never had ECC RAM in a PC because I'm cheap. I've even run non-parity RAM when I had the option for the same reason. But next time I actually build a PC from scratch I'm going ECC. These memory sizes are significant enough that the likelihood of an error is also significant.
Re: (Score:2)
These memory sizes are significant enough that the likelihood of an error is also significant.
There is a false level of concern here. Why do you need 32GB of RAM? Is it because you use large mathematical datasets, or because of superfluous fluff? Which bit is at risk of being flipped, one in actual use, or something that is on standby?
When I do banking on a phone with 1GB of RAM, or a PC with 64GB, the risk is the same. The critical transactions don't span more memory, there's just more memory available in the system. You're right in that a system reading and writing 64GB is more likely to encounter
Re: (Score:2)
Well, mostly right.
The reason why bigger memories are more prone to bit errors is simple and twofold.
Bit errors occur because cosmic radiation passes through your chip, flipping a bit (or several).
Or from simple radioactive decay in one of the components in your computer.
The more memory you have, the bigger the attack surface.
However, you are (half) right: transaction-wise, as in a bank transaction, only the tiny part of memory where that process sits and holds its data is relevant.
Re: (Score:2)
Which bit is at risk of being flipped, one in actual use, or something that is on standby?
One thing worth noting is that even memory on "standby" is susceptible. For example, Windows maintains a list of "zero pages" which is memory that's already been zeroed by the low-priority (actually, lowest) "zero-page thread". This allows the OS to return zeroed memory without delay if an app requests it. If one of the bits in a zero page was flipped to a one, this violates the guarantee that all bits are zero leading to any number of bad behaviors.
The odds of it impacting something actually important are no higher than they were 10 years ago
Disagree, simply because apps use so much more goddamne
Re: (Score:2)
Re: (Score:2)
It's not easy to find ECC DDR5 memory, except for the on-die type that most vendors are friendly enough to mention and which doesn't help when filtering for ECC.
Up until very recently DDR5 RDIMMs were really hard to find. Today they are readily available and the price isn't terrible at something like $200 for a 64GB stick.
Anyway I've only been able to find 4800 memory
I don't understand the whole memory speed / overclocking scene. If your application is truly bandwidth limited the solution is buying a system with more memory channels not screwing around with parameters that have no real world impact separate from increasing likelihood of a system crash.
Re: (Score:2)
I tried overclocking 20+ years ago; it was fun when you could get your cheap hardware to run as fast as hardware twice the price. Which couldn't be overclocked due to already running at the max system
Re: 5 9's (Score:2)
Increasing memory speed matters, but a lot less than linearly because computers are synchronous. If you are going to use lots of ram you get much more benefit by spreading it out across more channels than by increasing the speed.
Re: (Score:2)
Re: (Score:2)
I for one am running 32GB of budget-ass non-ECC RAM, and have never had ECC RAM in a PC because I'm cheap. I've even run non-parity RAM when I had the option for the same reason. But next time I actually build a PC from scratch I'm going ECC. These memory sizes are significant enough that the likelihood of an error is also significant.
Starting with DDR5, they all have (on-die) ECC. There is still market segmentation to get the memory controller in the loop, which is downright evil, but it's still better than nothing.
Re: (Score:3)
There's no such thing as a mainframe anymore. Distributed hardware has overtaken that segment of the market. Even if you put it all in one big metal box it's still a distributed hardware solution, not a mainframe.
IBM would disagree as they still sell mainframes. [ibm.com]
The criteria of "you can replace parts without powering the whole thing down" has NOTHING to do with uptime and reliability. If the system is unable to process the workload it is DOWN HARD regardless of if the power light is green or red or blue.
What are you talking about? If you do not need to bring a system down to replace critical parts like a CPU, that increases uptime and reliability. In a mainframe, the workload is not down BECAUSE there is redundancy. Not sure what you are smoking.
Is that any better or worse for a distributed system sold by IBM vs a cloud empire on AMZ, Goog, Azure, Oracle? No. In fact it's less reliable.
Er what? Just last month AWS was down 2 hours [apnews.com]. All of the services you mentioned have suffered major outages in the last several years. If any of them was down more than 10 minutes in 2 years (and they have), they a
Re: (Score:2)
error correcting memory doesn't exist. ECC is a misnomer. It identifies when storage has failed, and doesn't have the ability to "correct" for that. Even the simplest parity systems wouldn't allow regenerating data from non-data. It can be done with RAID but nobody out there is using memory in a RAID-style configuration (e.g. give up 1/n of your storage and n^x of your access speed so if there's a memory error you can recover the missing data) Ridiculous to even suggest that.
My friend, you are WRONG.
ECC memory exists, and has existed, for decades. Most use Hamming code to correct single bit errors in a memory word and detect double bit errors. Most server class computers include an ECC error log in hardware to track corrected errors. If a RAM device (i.e. DIMM) encounters errors (single bit, which were corrected), they are logged and when the count reaches a threshold it is reported through a system monitoring interface. This alerts the user that a RAM device is "flaky"
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
IBM mainframes, from day one of System/360, were multiprocessor complexes. Device controllers, multiplexor channels, and selector channels all had internal microprograms
If you use this argument to counter your parent's assertion that mainframes don't have multiple CPUs (which is wrong, of course), then your logic is wrong too. By that logic, an IBM PC/XT with an 8087 numeric coprocessor is a dual-processor system too... which it isn't. Yes, it's true that System/360 machines by design have multiple channels, which are actually additional processors, but they're not CPUs.
Yet, IBM marketed and sold some dual-CPU System/360 model 65s in the 1960s, which proves that your parent
Re:5 9's (Score:4, Interesting)
Re: (Score:2)
Re: (Score:2)