There's a third option: combine the best of both worlds. Use powerful, superscalar cores with shared memory, as powerful as you can reasonably make them, and then run clusters of those in parallel.
Which is of course what is already being done, but whether that's the best approach remains to be seen. Communication is always the bottleneck in HPC systems, and many processors on chip with a fast interconnect seems to do very well, at least for Picochip (though it is a DSP chip, I think it's a valid comparison).
Well, there's your problem. Many real world applications can only be programmed that way.
Examples? It's just a different model, it's doesn't prevent you solving any problem.
The ClearSpeed 192-core CSX700 is on the market, but nobody is buying it
Yeah, that was a shame. The trouble is that HPC-specific chips are just going to get steamrolled on the price point by commodity (x86) hardware. But what about the other three that are selling like hotcakes?
It's more efficient to have instructions which take maybe a bit more bits, but on average they don't really take that much more and have microcode on-die to handle them
Well that would be true, but the really complex x86 instructions are rarely used, so you're not really adding much in the way of code density, and you have to add a lot of hardware complexity to decode it. Not only that, more complex instructions mean bigger pipelines which mean bigger branch penalties.
they're forced to do so because they reach the limits of a single core
Well yes, but you might as well have argued that nobody wanted to make faster cores but they're limited by current clock speeds... The fact is that you can no longer make cores faster and bigger, you have to go parallel. Even the intel researcher in the article is saying the shared memory concept needs to be abandoned to scale up.
Essentially there are two approaches to the problem of performance now. Both use parallelism. The first (Nehalem's) is to have a 'powerful' superscalar core with lots of branch prediction and out-of-order logic to run instructions from the same process in parallel. It results in a few, high performance cores that won't scale horizontally (memory bottleneck)
The second is to have explicit hardware-supported parallelism with many many simple RISC or MISC cores on an on-chip network. It's simply false to say that small message passing cores have failed. I've already given examples of ones currently on the market (Clearspeed, Picochip, XMOS, and Icera to an extent). It's a model that has been shown time and time again to be extremely scalable, in fact it was done with the Transputer in the late 80s/early 90s. The only reason it's taking off now is because it's the only way forward as we hit the power wall, and shared memory/superscalar can't scale as fast to compete. The reason things like the Transputer didn't take off in mainstream (i.e. desktop) applications is because they were completely steamrolled by what x86 had to offer: an economy of scale, the option to "keep programming like you've always done", and most importantly backwards compatability. In fact they did rather well in i/o control for things such as robotics, and XMOS continues to do well in that space.
The "coherency problem" isn't even part of a message passing architecture because the state is distributed amongst the parallel processes. You just don't program a massively parallel architecture in the same way as a shared memory one.
That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.
It's scaled that way until now. We've hit a power wall in the last few years: as you increase the number of transistors on chip it gets more difficult to distribute a faster clock synchronously, so you increase the power, which is why Nehalem is so power hungry, and why you haven't seen clock speeds really increase since the P4. In any case, we're talking about parallelism, not just "increasing the clock speed" which isn't even a viable approach anymore.
When you said "Compact" I assumed you meant the instruction set itself was compact rather than the average length- I was talking about the hardware needed to decode, not necessarily code density. Even so, x86 is nothing special when it comes to density, especially considered against things like ARM's Thumb-2.
If you take look at Nehalem's pipeline, there's a significant chunk of it simply dedicated to translating x86 instructions into RISC uops, which is only there for backwards compatability. The inner workings of the chip don't even see x86 instructions.
Sure you can do everything the same with shared memory and channel comms, but if you have a multi-node system, you're going to be doing channel communcation anyway. You also have to consider that memory speed is a bottleneck that just won't go away, and for massive parallelism on-chip networks are just faster. In fact, Intel's QPI and AMD's HyperTransport are examples of on-chip network- they provide a NUMA on Nehalem and whatever AMD have these days. Indeed, in the article, it says
Mattson has argued that a better approach would be to eliminate cache coherency and instead allow cores to pass messages among one another.
The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores. x86 is stuck with larger cores because of all the translation and prediction it's required to do to be both backwards compatible and reasonably well-performing. If you're scaling horizontally like that, you want the simplest core possible, which is why this chip only has 48 cores, and Clearspeed's 2-year-old CSX700 had 192.
There's also no reason to throw away an ISA that has proven to be extremely scalable and very successful, just because it's ancient or it looks ugly.
Uh, scalable? Not really... The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086- which is over THIRTY YEARS OLD.
The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.
Whoa nelly. compact? I'm not sure where you got that idea, but it's called CISC and not RISC for a reason! if you think x86 is compact, you might be interested to find out that you can have a fifteen byte instruction In fact, on the i7 line, the instructions are so complex it's not even worth writing a "real" decoder- they're translated in real-time into a RISC instruction set! If Intel would just abandon x86, they could reduce their cores by something like 50%!
The low number of registers _IS_ a problem. The only reason there are only four is because of backwards compatability. It definitely is a problem for scalability, one cannot simply rely on a shared memory architecture to scale vertically indefinitely, you just use too much power as a die size increases, and memory just doesn't scale up as fast as the number of transistors on a CPU.
A far better approach is to have a decent model of parallelism (CSP, Pi-calculus, Ambient calculus) underlying the architecture and to provide a simple architecture with primitives supporting features of these calculi, such as channel communication. There are plenty of startups doing things like this, not just Intel, and they've already products in the market- though not desktop processors. Picochip and Icera to name just a couple, not to mention things like GPGPU (Fermi, etc.)
Really, the way to go is small, simple, low power cores with on-chip networks which can scale up MUCH better than just the old intel method of "More transistors, increase clock speed, bigger cache".
I assume they are using Base 10.
Hey, actually maybe they use base 2.
That's what he said.
So you're saying that it would take just 11 posts on Twitter to kill someone?
Or a single ethernet frame!
Finally it produces a neutron flux
Hah, yeah right! You're going to have to make up some more believable sounding sciencey words before we fall for THAT one. Why don't you just go reroute the flux capacitor through the deflector to invert a tachyon pulse while you're at it?
In computational complexity theory, the complexity class NP-complete (abbreviated NP-C or NPC, with NP standing for nondeterministic polynomial time) is a class of problems having two properties
- Any given solution to the problem can be verified quickly (in polynomial time); the set of problems with this property is called NP.
- If the problem can be solved quickly (in polynomial time), then so can every problem in NP.
Anyway, this article is about solving the problem in parallel with bacteria (which is totally cool, don't get me wrong.) It's not a faster algorithm, although I suppose you could argue that massively parallelizing it IS a faster solution.
Writing software is more fun than working.