Inside Intel's Next Generation Microarchitecture

Posted by CowboyNeal
from the tiny-big-ideas dept.
Overly Critical Guy writes "Ars Technica has the technical scoop on Intel's next-generation Core chips. As other architectures move away from out-of-order execution, the from-scratch Core fully adopts it, optimizing as much code as possible in silicon, and relies on transistor size decreases--Moore's Law--for scalability."
This discussion has been archived. No new comments can be posted.

  • by dlakelan (43245) <dlakelan.street-artists@org> on Thursday April 06, 2006 @11:13PM (#15081991) Homepage
    Out of order execution is where special silicon on the processor tries to figure out the best way to run your code by reordering the instructions to use more of the processor features at once.

    In order execution doesn't require all that special silicon and therefore frees up die space.
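To make the difference concrete, here is a minimal toy model, not a description of any real core. It assumes a machine that issues at most one instruction per cycle, an unlimited out-of-order window, and made-up latencies; the names `Instr`, `in_order_cycles`, `out_of_order_cycles` and `PROGRAM` are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    deps: tuple    # indices of EARLIER instructions whose results are needed
    latency: int   # cycles from issue until the result is available (assumed)


def in_order_cycles(program):
    """Issue strictly in program order, one instruction per cycle at most."""
    done = []        # done[i] = cycle at which instruction i's result is ready
    next_issue = 0   # earliest cycle at which the next instruction may issue
    for ins in program:
        ready = max((done[d] for d in ins.deps), default=0)
        issue = max(next_issue, ready)   # a stalled instruction blocks all later ones
        done.append(issue + ins.latency)
        next_issue = issue + 1
    return max(done, default=0)


def out_of_order_cycles(program):
    """Each cycle, issue the oldest instruction whose inputs are ready."""
    done = {}
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        for i in pending:
            if all(d in done and done[d] <= cycle for d in program[i].deps):
                done[i] = cycle + program[i].latency
                pending.remove(i)   # safe: we stop iterating right after
                break
        cycle += 1
    return max(done.values(), default=0)


# Two independent load/add pairs; the 3-cycle load latency is an assumption.
PROGRAM = [
    Instr("load r1", (), 3),
    Instr("add  r2 = r1 + 1", (0,), 1),
    Instr("load r3", (), 3),
    Instr("add  r4 = r3 + 1", (2,), 1),
]

print(in_order_cycles(PROGRAM), out_of_order_cycles(PROGRAM))
```

For this four-instruction program the in-order model needs 8 cycles and the out-of-order model 5, because the second load is issued while the first add is still waiting for its operand. An in-order design could recover the same schedule if the compiler had interleaved the two pairs ahead of time.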

    So one approach is to try to make your one processor as efficient as possible at executing instructions.

    Another approach is to make your processor relatively simple, and get lots of them on the die so you can have many threads at once.

    I personally prefer the multiple cores, because I think there is plenty of room for parallelism in software. However, this guy is basically claiming that Intel is trying to get both: more cores and smarter cores. They're relying on Moore's law to shrink the size of their out-of-order execution logic so that they can get more smart cores on a die.

  • by John_Booty (149925) <johnbootyNO@SPAMbootyproject.org> on Friday April 07, 2006 @01:03AM (#15082176) Homepage
    It's a philosophical difference. Should we optimize code at run-time (like an OOOE processor) or rely on the compiler to optimize code at compile time (the IOE approach)?

    The good thing about in-order execution is that it keeps the actual silicon simple and uses fewer transistors. This keeps costs down, and engineers have more die space to "spend" on other features, such as more cores or more cache.

    The bad thing about in-order execution is that your compiled, highly-optimized-for-a-specific-CPU code will only really perform its best on one particular CPU. And that's assuming the compiler does its job well. Imagine a world where AthlonXPs, P4s, P-Ms, and Athlon64s were all highly in-order CPUs. Each piece of software out there in the wild would run on all of them but would only reach peak performance on one of them.

    (Unless developers released multiple binaries or the source code itself. While we'd HAVE source code for everything in an ideal world, that just isn't the case for a lot of performance-critical software out there, such as games and commercial multimedia software.)

    As a programmer, I like the idea of out-of-order execution and the concept of runtime optimization. Programmers are typically the limiting factor in any software development project. You want those guys (and girls) worrying about efficient, maintainable, and correct code... not CPU specifics.

    I'd love to hear some facts on the relative performance benefits of runtime versus compile-time optimization. I know that some optimizations can only be achieved at runtime, and some can only be achieved at compile time because they require analysis too complex to tackle in real time.
  • Re:Israel (Score:3, Informative)

    by pchan- (118053) on Friday April 07, 2006 @01:42AM (#15082348) Journal
    Intel Israel has been a strong development center for Intel for quite some time now. Traditionally, new chips were designed in the U.S., and the designs were then sent to Israel to be made more power-efficient or faster. This situation got turned on its head. The American design team came up with the disaster known as the Netburst architecture (the highest clock P4 chips). Meanwhile, the Israel team was optimizing the Pentium-M (P3 and up) architecture and got its performance close to that of the Netburst chips at a lower clock rate and lower power consumption.

    Now Intel's top of the line chip was getting trounced by AMD's offering in both performance and power consumption, and further, AMD was announcing dual core chips years before Intel had planned to release any. In a way, Intel got lucky. They couldn't extend the Netburst architecture much more: the massively long pipelines on it made it terrible at executing general purpose code, and even hyperthreading didn't help it. It was generating massive amounts of heat at the frequency it was running and needed a huge cache. It was not ready for dual-cores. But the Pentium-M was.

    AMD's move to dual core saved Intel from competing in the megahertz race, just when the payoff from cranking the clock was starting to run out. They could now move from advertising clock rate to advertising dual cores. The Israel design team delivered the Core-Duo chip, and fast. Notice how these appeared in laptops first? That's what the Israel team was experienced with.

    Expect the Israel team to continue developing this line of processors, with the American developers going back to the drawing board for the next generation product.
  • by acidblood (247709) <.ten.ppced. .ta. .oiced.> on Friday April 07, 2006 @01:45AM (#15082356) Homepage
    Be careful when you speak of parallelism.

    Some software simply doesn't parallelize well. Processors like Cell and Niagara will take a very ugly beating from Core-based processors in that case.

    Then there's coarse-grained parallelism: tasks operating independently with modest requirements to communicate between themselves. For these workloads, cache sharing probably guarantees scalability. Going even further, there are embarrassingly parallel tasks which need almost no communication between different processes -- such is the case of many server workloads, where each incoming user spawns a new process, which is assigned to a different core each time, keeping all the cores full. This type of parallelism ensures that multicore (even when taken to the extreme, as in Sun's Niagara) will succeed in the server space. The desktop equivalent is multitasking, which can't justify the move to multicore alone.
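The embarrassingly parallel server case can be sketched in a few lines. This is only an illustration under stated assumptions: `handle_request`, `serve` and `RESPONSES` are hypothetical names, and the "work" per user is a stand-in computation. Threads are used to keep the sketch self-contained; for CPU-bound work in CPython a process pool would usually be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_request(user_id: int) -> str:
    # Each request reads only its own input: no shared state, no locks,
    # and no communication with any other request.
    return f"user {user_id}: {sum(range(user_id + 1))}"


def serve(user_ids, workers=4):
    # Independent requests are handed to whichever worker is free.
    # pool.map returns results in input order, whatever order they finish in.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_request, user_ids))


RESPONSES = serve(range(5))
print(RESPONSES)
```

Because no request depends on another, adding workers needs no change to `handle_request` itself, which is the property that makes this kind of workload scale across cores.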

    Now for fine-grained parallelism. Say the evaluation of an expression a = b + c + d + e. You could evaluate b + c and d + e in parallel, then add those together. The architecture best suited for this type of parallelism is the superscalar processor (with out-of-order execution to help extract extra parallelism). Multicore is powerless to exploit this sort of parallelism because of the overhead. Let's see:
    • There needs to be some sort of synchronization (a way for a core to signal the other that the computation is done);
    • The fastest way cores can communicate is through cache sharing -- L1 cache is fairly fast, say a couple of cycles to read and write, but I believe no shipping design implements shared L1 cache, only shared L2 cache;
    • An instruction has to go through the entire pipeline, from decode to write-back, before the result shows up in cache, whereas in a superscalar processor there exist bypass mechanisms which make available the result of a computation in the next cycle, regardless of pipeline length.

    Essentially, putting synchronization aside for the moment (which is really the most expensive part of this), it takes a few dozen cycles to compute a result in one core and forward it to another. Also, if this were done on a large scale, the communication channel between cores would become clogged with synchronization data. Hence it is completely impractical to exploit any sort of fine-grained parallelism in a multicore setting. Contrast this with superscalar processors, which have execution units and data buses especially tailored to exploit this sort of fine-grained parallelism.
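The a = b + c + d + e example can be checked with a small sketch that measures how many dependent additions lie on the longest path of an expression tree. The `Add` class and the `depth` and `count_adds` helpers are hypothetical names for this illustration; the mapping from depth to cycles assumes one-cycle adds with next-cycle bypass and at least two adders.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[str, Add]   # a leaf is just a variable name


def depth(e: Expr) -> int:
    """Number of dependent additions on the longest path (the critical path)."""
    if isinstance(e, str):
        return 0
    return 1 + max(depth(e.left), depth(e.right))


def count_adds(e: Expr) -> int:
    """Total number of additions, regardless of how they are grouped."""
    if isinstance(e, str):
        return 0
    return 1 + count_adds(e.left) + count_adds(e.right)


chain = Add(Add(Add("b", "c"), "d"), "e")        # ((b + c) + d) + e
balanced = Add(Add("b", "c"), Add("d", "e"))     # (b + c) + (d + e)

print(count_adds(chain), depth(chain))
print(count_adds(balanced), depth(balanced))
```

Both groupings perform three additions, but the chain has a critical path of three and the balanced form of two. Under the assumptions above, that saving is a single cycle, which is tiny next to the few dozen cycles the comment estimates for forwarding a result between cores.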

    Unfortunately, this sort of fine-grained parallelism is the easiest to exploit in software, and mature compiler technology exists to take advantage of it. To fully exploit the power of multicore processors, the cooperation of programmers will be required, and for the most part they don't seem interested (can you picture a VB codemonkey writing correct multithreaded code?). I hope this changes as new generations of programmers are brought up on multicore processors and multithreaded programming environments, but the transition is going to be turbulent.

    Straying a bit off-topic... Personally, I don't think multicore is the way to go. It creates an artificial separation of resources. For example, I can have 2 arithmetic units per core, so 4 arithmetic units on a die, but if the thread running on core 1 could issue 4 parallel arithmetic instructions while the thread running on core 2 could issue none, both of core 1's arithmetic units would be busy on that cycle, leaving 2 instructions for the next cycle, while core 2's units would sit idle, despite the availability of instructions from core 1 just a few millimeters away. The same reasoning is valid for caches, and we see most multicore designs moving to shared caches, because it's the most efficient solution, even if it takes more work. It is only natural to extend this idea to the sharing of all resources on the chip. This is accomplished by putting them all in one big core and adding multicore functional

  • by Kupek (75469) on Friday April 07, 2006 @12:57PM (#15085448)
    OOOE has amazing potential
    Which has been realized for about the past 20 years. Exploiting Instruction Level Parallelism (which requires an out-of-order-execution processor) has gotten us to where we are today. We're reaching the limits of what ILP can buy us, so the solution is to put more cores on a chip.

    It may be possible to integrate OOOE into a multicore.
    It is possible, and every single Intel multicore chip has done it. Same with IBM's Power5s. For general-purpose multicore processors, that is the norm.
