
Comment: Re:Instruction set... (Score 1) 326

by kohaku (#34305596) Attached to: Intel Talks 1000-Core Processors
Well sure, but that's why x86 is still around. It doesn't mean it's better, and it doesn't mean it's going to scale better.
As for multiple processes accessing the same drive, you'd handle it the same way as we do now: with a filesystem layer serving all those processes. That they're running concurrently has no bearing.

Comment: Re:Instruction set... (Score 1) 326

by kohaku (#34305236) Attached to: Intel Talks 1000-Core Processors
What layer? Their decoder is the translation, and although it doesn't take up 50%, it's not a trivial amount of space. And it's not only space but pipeline depth: an x86 instruction spends five pipeline stages just being decoded, whereas the equivalent A8 decode is only three stages. Branch penalties on x86 are nasty, which is why there's so much logic (caching decoded instructions, branch statistics, etc.) dedicated to alleviating the problem.

Comment: Re:Instruction set... (Score 2, Interesting) 326

by kohaku (#34304912) Attached to: Intel Talks 1000-Core Processors

There's a third option: combine the best of both worlds. Use powerful, superscalar cores with shared memory, as powerful as you can reasonably make them, and then run clusters of those in parallel.

Which is of course what is already being done, but whether that's the best approach remains to be seen. Communication is always the bottleneck in HPC systems, and many processors on a chip with a fast interconnect seem to do very well, at least for Picochip (and though that's a DSP chip, I think it's a valid comparison).

Well, there's your problem. Many real world applications can only be programmed that way.

Examples? It's just a different model; it doesn't prevent you from solving any problem.

The ClearSpeed 192-core CSX700 is on the market, but nobody is buying it

Yeah, that was a shame. The trouble is that HPC-specific chips are just going to get steamrolled on the price point by commodity (x86) hardware. But what about the other three that are selling like hotcakes?

Comment: Re:Instruction set... (Score 1) 326

by kohaku (#34304842) Attached to: Intel Talks 1000-Core Processors
I'm not arguing against the design of the processors- a RISC core is probably the best way to implement them. My point is that if you already have a RISC core, the original (x86) instruction set needs to be ditched because decoding it wastes space. The worst case example was supposed to indicate the complexity of the instruction set and its decoding, not the code density. If you're talking about ARM's Thumb/Thumb2, that's a little different- Thumb2 is trivially easy to decode, whereas x86 definitely isn't. I would argue that x86 isn't particularly more dense than a RISC equivalent (let's take Thumb2), since the very long, complex instructions are very infrequently used (although I can't find any statistics). Also, many x86 instructions take many cycles to complete, meaning even more potential pipeline slowdowns.

Comment: Re:RISC has downsides... (Score 2, Informative) 326

by kohaku (#34304794) Attached to: Intel Talks 1000-Core Processors

It's more efficient to have instructions which take maybe a bit more bits, but on average they don't really take that much more and have microcode on-die to handle them

Well that would be true, but the really complex x86 instructions are rarely used, so you're not really adding much in the way of code density, and you have to add a lot of hardware complexity to decode it. Not only that, more complex instructions mean bigger pipelines which mean bigger branch penalties.

Comment: Re:Instruction set... (Score 3, Interesting) 326

by kohaku (#34304752) Attached to: Intel Talks 1000-Core Processors

they're forced to do so because they reach the limits of a single core

Well yes, but you might as well have argued that nobody wanted to make faster cores but they're limited by current clock speeds... The fact is that you can no longer make cores faster and bigger; you have to go parallel. Even the Intel researcher in the article is saying the shared memory concept needs to be abandoned to scale up.
Essentially there are two approaches to the problem of performance now, and both use parallelism. The first (Nehalem's) is to have a 'powerful' superscalar core with lots of branch prediction and out-of-order logic to run instructions from the same process in parallel. It results in a few high-performance cores that won't scale horizontally (memory bottleneck).
The second is to have explicit hardware-supported parallelism with many, many simple RISC or MISC cores on an on-chip network. It's simply false to say that small message-passing cores have failed. I've already given examples of ones currently on the market (ClearSpeed, Picochip, XMOS, and Icera to an extent). It's a model that has been shown time and time again to be extremely scalable; in fact it was done with the Transputer in the late 80s/early 90s. The only reason it's taking off now is that it's the only way forward as we hit the power wall, and shared-memory/superscalar designs can't scale fast enough to compete. The reason things like the Transputer didn't take off in mainstream (i.e. desktop) applications is that they were completely steamrolled by what x86 had to offer: an economy of scale, the option to "keep programming like you've always done", and most importantly backwards compatibility. In fact they did rather well in I/O control for things such as robotics, and XMOS continues to do well in that space.
The "coherency problem" isn't even part of a message passing architecture because the state is distributed amongst the parallel processes. You just don't program a massively parallel architecture in the same way as a shared memory one.
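To make the distributed-state point concrete, here's a toy sketch in Python (entirely hypothetical, using threads and queues as stand-ins for hardware channels): each "core" owns its state outright, and the only sharing happens through explicit messages, so there is simply nothing for a coherency protocol to keep coherent.

```python
import queue
import threading

def worker(inbox, outbox):
    """A 'core' that owns its state; sharing happens only via channels."""
    local_total = 0                  # private state: no coherency needed
    while True:
        msg = inbox.get()
        if msg is None:              # sentinel: report result and stop
            outbox.put(local_total)
            return
        local_total += msg           # purely local update

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()

for value in range(10):
    inbox.put(value)                 # communicate by sending messages...
inbox.put(None)
t.join()

result = outbox.get()                # ...not by reading shared memory
print(result)                        # 0+1+...+9 = 45
```

Scale that pattern up to hundreds of workers and you get the programming model these chips assume, minus locks and cache lines.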

Comment: Re:Instruction set... (Score 3, Interesting) 326

by kohaku (#34304436) Attached to: Intel Talks 1000-Core Processors

That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.

It's scaled that way until now, but we've hit a power wall in the last few years: as you increase the number of transistors on a chip, it gets harder to distribute a faster clock synchronously without burning more power, which is why Nehalem is so power hungry and why clock speeds haven't really increased since the P4. In any case, we're talking about parallelism, not just "increasing the clock speed", which isn't even a viable approach anymore.
When you said "compact" I assumed you meant the instruction set itself was compact rather than the average instruction length; I was talking about the hardware needed to decode it, not code density. Even so, x86 is nothing special when it comes to density, especially compared with something like ARM's Thumb-2.
If you take a look at Nehalem's pipeline, there's a significant chunk of it dedicated simply to translating x86 instructions into RISC uops, and it's only there for backwards compatibility. The inner workings of the chip don't even see x86 instructions.
Sure, you can do everything with shared memory that you can with channel comms, but if you have a multi-node system you're going to be doing channel communication anyway. You also have to consider that memory speed is a bottleneck that just won't go away, and for massive parallelism on-chip networks are simply faster. In fact, Intel's QPI and AMD's HyperTransport are examples of this kind of network: they provide NUMA on Nehalem and whatever AMD have these days. Indeed, the article says

Mattson has argued that a better approach would be to eliminate cache coherency and instead allow cores to pass messages among one another.

The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores. x86 is stuck with larger cores because of all the translation and prediction it's required to do to be both backwards compatible and reasonably well-performing. If you're scaling horizontally like that, you want the simplest core possible, which is why this chip has only 48 cores while ClearSpeed's two-year-old CSX700 has 192.

Comment: Re:Instruction set... (Score 4, Insightful) 326

by kohaku (#34304088) Attached to: Intel Talks 1000-Core Processors

There's also no reason to throw away an ISA that has proven to be extremely scalable and very successful, just because it's ancient or it looks ugly.

Uh, scalable? Not really... The only reason x86 is still around (i.e. successful) is that it's been pretty much backwards compatible since the 8086, which is over THIRTY YEARS OLD.

The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.

Whoa, Nelly. Compact? I'm not sure where you got that idea, but it's called CISC and not RISC for a reason! If you think x86 is compact, you might be interested to find out that a single instruction can be up to fifteen bytes long. In fact, on the i7 line, the instructions are so complex it's not even worth executing them directly: they're translated in real time into an internal RISC instruction set! If Intel would just abandon x86, they could shrink their cores considerably.
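To illustrate why variable-length encodings hurt the decoder (using a toy encoding I've made up, not real x86 or ARM): with fixed-width instructions, the address of instruction N is a single multiply, whereas with variable-length instructions you can't even locate instruction N without decoding every instruction before it.

```python
# Toy encoding, purely illustrative: in the variable-length stream,
# the first byte of each instruction holds that instruction's length.
var_stream = bytes([1, 3, 0, 0, 2, 0, 15] + [0] * 14)

def nth_instruction_offset_variable(stream, n):
    """Sequential decode: each length depends on the bytes before it."""
    offset = 0
    for _ in range(n):
        offset += stream[offset]   # read length byte, skip the instruction
    return offset

def nth_instruction_offset_fixed(n, width=4):
    """Fixed-width RISC: one multiply, no decoding of earlier instructions."""
    return n * width

print(nth_instruction_offset_variable(var_stream, 3))  # 6
print(nth_instruction_offset_fixed(3))                 # 12
```

Real decoders speculate on instruction boundaries in parallel to hide this, but that speculation is exactly the extra hardware being complained about here.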
The low number of registers _IS_ a problem, and the only reason there are so few is backwards compatibility. It definitely is a problem for scalability: you cannot rely on a shared-memory architecture to scale vertically indefinitely, because you use too much power as die size increases, and memory just doesn't scale up as fast as the number of transistors on a CPU.
A far better approach is to have a decent model of parallelism (CSP, pi-calculus, ambient calculus) underlying the architecture, and to provide a simple architecture with primitives supporting features of these calculi, such as channel communication. There are plenty of startups doing things like this, not just Intel, and they already have products on the market, though not desktop processors: Picochip and Icera to name just a couple, not to mention GPGPU (Fermi, etc.)
Really, the way to go is small, simple, low-power cores with on-chip networks, which can scale up MUCH better than the old Intel method of "more transistors, higher clock speed, bigger cache".

Comment: Re:Summary is overrated (Score 3, Interesting) 135

by kohaku (#28825459) Attached to: Bacterial Computer Solves Hamiltonian Path Problem
Why is the GP modded over the parent? "Simply another NP-complete problem" and "not a special case" are just wrong. As the Wikipedia text below states, solving one NP-complete problem faster means they are ALL solvable faster. Come on, Slashdot! Computational complexity 101!

In computational complexity theory, the complexity class NP-complete (abbreviated NP-C or NPC, with NP standing for nondeterministic polynomial time) is a class of problems having two properties

  • Any given solution to the problem can be verified quickly (in polynomial time); the set of problems with this property is called NP.
  • If the problem can be solved quickly (in polynomial time), then so can every problem in NP.

Anyway, this article is about solving the problem in parallel with bacteria (which is totally cool, don't get me wrong). It's not a faster algorithm, although I suppose you could argue that massively parallelizing it IS a faster solution.
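For anyone who wants to see the scale of the problem, here's a hypothetical brute-force sketch in Python: verifying any single candidate path is cheap (that's the polynomial-time verification from the definition quoted above), but there are n! candidates, which is what the bacteria are effectively exploring in parallel.

```python
from itertools import permutations

def hamiltonian_path_exists(n, edges):
    """Brute force over every vertex ordering; the bacteria explore
    these candidates in parallel, we do it sequentially."""
    edge_set = set(edges) | {(b, a) for a, b in edges}  # undirected
    for order in permutations(range(n)):                # n! candidates
        # Verifying one candidate is cheap: O(n) edge lookups.
        if all((order[i], order[i + 1]) in edge_set for i in range(n - 1)):
            return True
    return False

# Toy 3-node graphs, purely illustrative:
print(hamiltonian_path_exists(3, [(0, 1), (1, 2)]))  # True  (path 0-1-2)
print(hamiltonian_path_exists(3, [(0, 1)]))          # False (node 2 unreachable)
```

The verify step being fast while the search space explodes factorially is exactly the NP gap the parent comments are arguing about.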
