Inside Intel's Next Generation Microarchitecture

Posted by CowboyNeal
from the tiny-big-ideas dept.
Overly Critical Guy writes "Ars Technica has the technical scoop on Intel's next-generation Core chips. As other architectures move away from out-of-order execution, the from-scratch Core fully adopts it, optimizing as much code as possible in silicon, and relies on transistor size decreases--Moore's Law--for scalability."

  • by willith (218835) on Thursday April 06, 2006 @10:41PM (#15081855) Homepage
    Do we get two front page articles because the Core Duo has two cores? Goodie!!
  • It even links to the same article...
  • by Sqwubbsy (723014) on Thursday April 06, 2006 @10:47PM (#15081888) Homepage Journal
    Ok, so I know I'm going to get a lot of AMD people agreeing with me and a lot of Intel people outright ripping me to shreds. But I'm going to speak my thoughts come hell or high water and you can choose to be a yes-man (or woman) with nothing to add to the conversation or just beat me with a stick.

    I believe that AMD had this technology before Intel ever started in on it. Yes, I know it wasn't really commercially available on PCs but it was there. And I would also like to point out a nifty little agreement between IBM and AMD that certainly gives them aid in the development of chips. Let's face it, IBM's got research money coming out of their ears and I'm glad to see AMD benefit off it and vice versa. I think that these two points alone show that AMD has had more time to refine the multicore technology and deliver a superior product.

    As a disclaimer, I cannot say I've had the ability to try an Intel dual core but I'm just ever so happy with my AMD processor that I don't see why I should.

    There's a nice little chart in the article but I like AMD's explanation along with their pdf a bit better. As you can see, AMD is no longer too concerned with dual core but has moved on to targeting multi core.

    Do I want to see Intel evaporate? No way. I want to see these two companies go head to head and drive prices down. You may mistake me for an AMD fanboi but I simply was in agony in high school when Pentium 100s cost an arm and a leg. Then AMD slowly climbed the ranks to be a major competitor with Intel--and thank god for that! Now Intel actually has to price their chips competitively and I never want that to change. I will now support the underdog even if Intel drops below AMD just to ensure stiff competition. You can call me a young idealist about capitalism!

    I understand this article also tackles execution types and I must admit I'm not too up to speed on that. It's entirely possible that OOOE could beat out the execution scheme that AMD has going but I wouldn't know enough to comment on it. I remember that there used to be a lot of buzz about IA-64's OOOE processing used on Itanium. But I'm not sure that was too popular among programmers.

    The article presents a compelling argument for OOOE. And I think that with a tri-core or higher processor, we could really start to see a big increase in sales using OOOE. Think about it, a lot of IA-64 code comes to a point where the instruction stalls as it waits for data to be computed (in most cases, a branch). If there are enough cores to compute both branches from the conditional (and a third core to evaluate the conditional) then where is the slowdown? This will only break down on a switch style statement or when several if-thens follow each other successively.

    In any case, it's going to be a while before I switch back to Intel. AMD has won me over for the time being.
    • What gives? (Score:1, Funny)

      by Sqwubbsy (723014)
      If the editors can post a dupe story, why can't I post a dupe comment?
      The mods gotta loosen up a little. Sheesh.
    • There's no way you could do branch prediction and processing on multiple cores, the latency would be too high for branches of a realistic size.
      • 'no way' sounds a bit far fetched, but it does seem that Intel's idea of using massive pipelines to aid in certain calculations bombed a bit. Nothing wrong with trying new things though. Branch prediction does seem a waste on anything but a multithread/multicore processor, unless you're running calculations in spare processor cycles - but for most apps where performance matters, how likely is that to happen? You may as well try to predict every single possible thing the user is going to do next while they a
        • The massive pipelines work great, just not on things with lots of branches. This was a known issue to Intel, and was considered to be a worthwhile risk, as the expectation was that the CPU would scale to the high GHz. That the processor tops out at ~4GHz means that your gain to loss ratio of what the pipeline depth gets you has changed (or more accurately failed to improve as anticipated).

          All that said, there are several applications where the Intel Architecture whips AMD's, the top two being:
          MS Office and s
      • Really? Is it not weird, then, that Sun's octuple-core T1 processor outclassed the competition by at least 2:1, and normally closer to 3:1, in the last SPECweb round?
    • I'm neither an Intel or an AMD fan. I generally dislike Intel due to the retarded Netburst architecture and many of their business practices, but on a purely technical standpoint I seriously think they're onto something with their next generation Core. I do think you're talking out of your ass, or just pasted an old comment from a different article. You may feel free to tell me I'm talking out of my own ass as well.

      I believe that AMD had this technology before Intel ever started in on it.

      What technology a
  • by LordRPI (583454) on Thursday April 06, 2006 @10:48PM (#15081893)
    Each core can be in two places at once!
  • It's like a landmark-- "Surf until you get to a geekish news site with anti-Microsoft bent and a couple dupes on the front page. When you get there, you're on Slashdot."
  • by Anonymous Coward
    Guys I have a great idea, let's all point out the fact that this article is a dupe!

    Seriously this is gonna be so cool, slashdot will never be the same again!

  • Dupe articles with identical links? Meh. Bring it on. When are we getting dupes with identical summaries?
  • Israel (Score:1, Interesting)

    by Anonymous Coward
    So apparently Intel had to go to Israel to find computer engineers to design their flagship architecture for the next 5+ years. With a population of only 7 million how is it that so many brilliant chip designers are in Israel?
    • by kfg (145172)
      With a population of only 7 million how is it that so many brilliant chip designers are in Israel?

      So many of them came from Levittown, LI.

    • [flamebait]
      During the Middle Ages, while gentiles pushed their smart sons into the priesthood and celibacy, the smart Jews became rabbis and had lotsa kids.

      The Izzies have had to become really smart because they're surrounded by people who'd like nothing better than to push them into the sea. As a matter of fact, when they got military gear from the States, the manufacturers often came back and asked them exactly *what* they did with the electronics; it might have had to do with the 88-2 kill r
      • They also, Have A Lot Of Math Education Per Capita. We, OTOH here in the States, have enough to allow people to push the <Big Mac Meal> <Med Coke> <Total> <10.00> buttons and hope to get the right change. In my high school (in state A) we were taught all the math basics (you know, trig, pre-cal, nothing too hard), including how to use a scientific calculator (such as the simple TI line), and where I am now in Uni (almost ten years later, in state B), people in my math classes canno
        • Re:Israel (Score:2, Insightful)

          by jawtheshark (198669) *
          I read that the main problem in the US is that science/math is considered unsexy. Most students want to go into business or law, because that's where the money is made. I guess it is a result of being an extremely capitalist society.

          One odd thing is that the US imports many scientists with attractive grants, resulting in an exodus of European scientists (probably from other countries too, I just know Europe). Of course, since September 11, getting a visa has become hard and thus less scient

      • "it might have had to do with the 88-2 kill ratio over the Bekaa Valley in the early 80s."

        For comparison, the US Navy lost 2 planes to Syrian SAMs in just one raid in '83.
    • Re:Israel (Score:3, Informative)

      by pchan- (118053)
      Intel Israel has been a strong development center for Intel for quite some time now. Traditionally, new chips have been designed in the U.S., and then the designs were sent to Israel for making them more power-efficient or improving performance. This situation got turned on its head. The American design team came up with the disaster known as the Netburst architecture (the highest clock P4 chips). Meanwhile, the Israel team was optimizing the Pentium-M (P3 and up) architecture and got its performanc
      • AMD was announcing dual core chips years before Intel had planned to release any.

        Is this an attempt to prove the saying that if a lie is often repeated, it becomes true?

        Intel First to Ship Dual core

        I don't care how you spin it, your statement was a lie bordering on AMD fanboyism.
    • Because Jews are really smart. No, seriously. Why do you think they're so rich? Studies have actually shown that there is a sub-population of Jews that gets Nobel Prizes vastly out of proportion with their numbers.
  • Bite me twice.
  • Since this is a dupe (Score:4, Interesting)

    by TubeSteak (669689) on Thursday April 06, 2006 @11:04PM (#15081954) Journal
    Can someone summarize nicely and neatly, the practical difference(s) between out-of-order and in-order executions?

    Why is it important that Intel is embracing OOOE while everyone else is moving away?
    • by dlakelan (43245) <dlakelan@s[ ]et- ... g ['tre' in gap]> on Thursday April 06, 2006 @11:13PM (#15081991) Homepage
      Out of order execution is where special silicon on the processor tries to figure out the best way to run your code by reordering the instructions to use more of the processor features at once.

      In order execution doesn't require all that special silicon and therefore frees up die space.

      So one approach is to try to make your one processor as efficient as possible at executing instructions.

      Another approach is to make your processor relatively simple, and get lots of them on the die so you can have many threads at once.

      I personally prefer the multiple cores, because I think there is plenty of room for parallelism in software. However, this guy is basically claiming that Intel is trying to get both, more cores and smarter cores. They're relying on Moore's law to shrink the size of their out of order execution logic so that they can get more smart cores on die.

      • In software like video/audio processing, then yes, there is a veritable orgasm of parallelizable code. For most single programs a user wants to execute, the max speedup you can expect to see is about 1.2 with 2 cores versus one. (I'll spare you the computation, it's rather long, I had to do it in my Advanced Computer Architecture class and again in my High Performance Architecture class).

        Also don't forget that with multiple cores you're introducing a host of new problems such as scheduling, cache coherency,
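For what it's worth, a two-core speedup of about 1.2 is what Amdahl's law gives when roughly a third of a program's run time can be spread across cores. A minimal sketch in Python; the one-third parallel fraction is an assumed value chosen to reproduce the 1.2 figure, not the computation from the poster's class:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the run time parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumption: one third of the run time is parallelizable.
assumed_parallel_fraction = 1.0 / 3.0

for cores in (1, 2, 4, 8):
    print(cores, round(amdahl_speedup(assumed_parallel_fraction, cores), 3))
```

With the same assumed fraction, going to 4 or 8 cores only reaches about 1.33 and 1.41, so the serial part dominates quickly.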

        • Before, when copying 40 GB of movies/tv/pr0n from your friend's removable HDD, your computer would tank, practically deadlocked.

          I believe disk transfers are mostly done using DMA, so the processor isn't really executing a loop for copying data (check your CPU usage during a copy)... the deadlocking, I think, probably has more to do with the IO interface being choked.

          You are right about the amount of available parallelism though, architects/designers simply dont know of any good way to use all the real estate o

          • Yeah my example was a bad choice, I was more thinking of IDE being such a load on the CPU. I'm sure any of us can think of a good example of doing something that hogs your CPU but would be almost unnoticeable in a dual core environment.

            And yes, dumping the pipeline due to page faults or cache misses is a big deal. Miss penalties are a huge deal in any system. Most nowadays just go do something else if a program faults (assuming there's something else to do).

        • Hmm, if nothing else, this is kind of where the Cell concept comes in. Is it not supposed to actually dedicate one of the cores specifically to the control of the others? Sounds like a good idea to me. Anyway, what I'm thinking (and I've touched base on this in another thread) is that with multicore being pushed so hard these days programmers might buckle down and actually program better. See, OOOE probably requires the chip to do most of the breaking up. I mean, if you break up the code first-hand, wh
          • See, OOOE probably requires the chip to do most of the breaking up. I mean, if you break up the code first-hand, why do you need the chip to have a smart way to do it for you?

            This doesn't really help, unless you give the programmer direct access to the cpu cache, as OOOE helps with cache misses on the cpu, so rather than halting the processing while it fetches the needed data, it tries to continue running, while it gets the needed data to continue on the instruction it was working on.

            and btw, giving the pro
            • I'm not really talking about direct access to the cache or anything like that. Just better design of the code so that it splits more things to begin with. With OOOE this isn't absolutely necessary (though it can't hurt to try to write it concentrating on writing code that you can be relatively positive will do well in OOOE) but with multithreading it does, admittedly, become necessary. While you can't directly control what the chip will be doing, you can control what you are sending to it to begin with
        • I think you mean "veritable orgy," not "veritable orgasm," unless of course you're processing the tail end of a porn video.
      • by acidblood (247709) <decio@dec[ ]net ['pp.' in gap]> on Friday April 07, 2006 @01:45AM (#15082356) Homepage
        Be careful when you speak of parallelism.

        Some software simply doesn't parallelize well. Processors like Cell and Niagara will take a very ugly beating from Core architecture based processors in that case.

        Then there's coarse-grained parallelism, tasks operating independently with modest requirements to communicate between themselves. For these workloads, cache sharing probably guarantees scalability. Going even further, there's embarrassingly parallel tasks which need almost no communication between different processes -- such is the case of many server workloads, where each incoming user spawns a new process, which is assigned to a different core each time, keeping all the cores full. This type of parallelism ensures that multicore (even when taken to the extreme, as in Sun's Niagara) will succeed in the server space. The desktop equivalent is multitasking, which can't justify the move to multicore alone.

        Now for fine-grained parallelism. Say the evaluation of an expression a = b + c + d + e. You could evaluate b + c and d + e in parallel, then add those together. The architecture best suited for this type of parallelism is the superscalar processor (with out-of-order execution to help extract extra parallelism). Multicore is powerless to exploit this sort of parallelism because of the overhead. Let's see:
        • There needs to be some sort of synchronization (a way for a core to signal the other that the computation is done);
        • The fastest way cores can communicate is through cache sharing -- L1 cache is fairly fast, say a couple of cycles to read and write, but I believe no shipping design implements shared L1 cache, only shared L2 cache;
        • An instruction has to go through the entire pipeline, from decode to write-back, before the result shows up in cache, whereas in a superscalar processor there exist bypass mechanisms which make available the result of a computation in the next cycle, regardless of pipeline length.

        Essentially, putting synchronization aside for the moment (which is really the most expensive part of this), it takes a few dozen cycles to compute a result in one core and forward it to another. Also, if this were done on a large scale, the communication channel between cores would become clogged with synchronization data. Hence it is completely impractical to exploit any sort of fine-grained parallelism in a multicore setting. Contrast this with superscalar processors, which have execution units and data buses especially tailored to exploit this sort of fine-grained parallelism.
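The a = b + c + d + e example can be made concrete by counting dependent steps. A minimal sketch in Python, using a hypothetical tuple-based expression tree and assuming one cycle per add:

```python
# Hypothetical representation: a leaf is a variable name (str),
# an add is a tuple ("+", left, right). One add is assumed to take one cycle.

def critical_path(expr) -> int:
    """Longest chain of dependent adds (cycles needed with unlimited adders)."""
    if isinstance(expr, str):
        return 0
    _, left, right = expr
    return 1 + max(critical_path(left), critical_path(right))


def count_adds(expr) -> int:
    """Total number of adds (cycles needed with a single adder)."""
    if isinstance(expr, str):
        return 0
    _, left, right = expr
    return 1 + count_adds(left) + count_adds(right)


chained = ("+", ("+", ("+", "b", "c"), "d"), "e")    # ((b + c) + d) + e
balanced = ("+", ("+", "b", "c"), ("+", "d", "e"))   # (b + c) + (d + e)

print(count_adds(chained), critical_path(chained))    # 3 adds, chain of 3
print(count_adds(balanced), critical_path(balanced))  # 3 adds, chain of 2
```

Reassociating the sum shortens the dependent chain from 3 adds to 2, which a superscalar core with two adders can use directly. Saving that one cycle by shipping d + e to another core would only pay off if the round trip cost less than a cycle, against the poster's estimate of a few dozen.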

        Unfortunately, this sort of fine-grained parallelism is the easiest to exploit in software, and mature compiler technology exists to take advantage of it. To fully exploit the power of multicore processors, the cooperation of programmers will be required, and for the most part they don't seem interested (can you picture a VB codemonkey writing correct multithreaded code?) I hope this changes as new generations of programmers are brought up on multicore processors and multithreaded programming environments, but the transition is going to be turbulent.

        Straying a bit off-topic... Personally, I don't think multicore is the way to go. It creates an artificial separation of resources: i.e. I can have 2 arithmetic units per core, so 4 arithmetic units on a die, but if the thread running on core 1 could issue 4 parallel arithmetic instructions while the thread running on core 2 could issue none, both of core 1's arithmetic units would be busy on that cycle, leaving 2 instructions for the next cycle, while core 2's units would sit idle, despite the availability of instructions from core 1 just a few millimeters away. The same reasoning is valid for caches and we see most multicore designs moving to shared caches, because it's the most efficient solution, even if it takes more work. It is only natural to extend this idea to the sharing of all resources on the chip. This is accomplished by putting them all in one big core and adding multicore functional

        • It is only natural to extend this idea to the sharing of all resources on the chip. This is accomplished by putting them all in one big core and adding multicore functionality via symmetric multi-threading (SMT), a.k.a. hyperthreading. The secret is designing a processor for SMT from the start, not bolting it on a processor designed for single-threading as happened with the P4. I strongly believe that such a design would outperform any strict-separation multicore design with a similar transistor budget.
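The arithmetic-unit example from the parent post (2 units per core, one thread with 4 ready instructions and one with none) can be checked with a toy model. A minimal sketch in Python, assuming every ready operation is independent and takes one cycle; the function names are made up for illustration:

```python
import math


def cycles_partitioned(ready_per_thread, units_per_core):
    """Each thread may only use the arithmetic units of its own core."""
    return max(math.ceil(n / units_per_core) for n in ready_per_thread)


def cycles_shared(ready_per_thread, total_units):
    """All threads draw from one shared pool of arithmetic units."""
    return math.ceil(sum(ready_per_thread) / total_units)


ready = [4, 0]  # thread 1 has 4 independent ops ready, thread 2 has none
print(cycles_partitioned(ready, units_per_core=2))  # 2
print(cycles_shared(ready, total_units=4))          # 1
```

The model ignores everything that makes a wide shared core expensive to build, which is the objection raised in the replies.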


          • Submision changed n^2 complexities to n complexities.
            It's register rename, choosing which instruction goes next, etc... increasing n^2 when the core changes.
            • Your sig, while perhaps being factually correct, is extremely misleading.

              High risk in medical terminology means a statistically significant risk higher than average. This means that 1 in 6 babies have a risk that is outside the margin of error. Most likely this means that 1 in 6 babies have a 1 percent chance of brain damage. So roughly 0.16 percent of babies actually have some form of brain damage that can be attributed to coal pollution.

              I do agree that 1 out of every 600 babies damaged by pollution is

          • Also when the traveling of information across a die takes more than 10 cycles you need to have smaller structures, it will increase latencies of instructions.

            Not sure what you mean here, but if you're talking about my estimate of the costs of exchanging information between cores, remember that this is due to the lack of bypass structures between cores, the need for explicit synchronization code, and the rather inefficient method of sharing data through the cache. Once hardware is dedicated to it, even in la

            • Well, the problem you need to fix is called physics. The RC delay increases with process scaling.
              Basically, in every process generation you have to reduce the length of each wire by 0.7 or have half as many wires in order to keep the delay per mm the same. Since RC delay increases when scaling wires smaller, the latency of moving data around increases all the time.
              Your transistor budget may go up, but the area that you can reach per cycle at a reasonable clock speed goes down.

              Here's a hint, even in a good condition
        • The poster obviously hasn't designed any CPUs, nor does he know about the physics related to semiconductor design.
          He's a programmer who doesn't need to think about those things.
          n^2 or n^3 algorithms (in terms of power and area) are used in most parts of the core. So when the guy recommends that in the next generation, instead of having 4 cores, we have a single core, he is suggesting that we have one core which is twice as wide as one of those 4 cores.
          Large fraction of code is pointer chasing, large fraction of code has ILP equal or
      • Out of order execution, and also on-chip cache, help in speeding up programs like Windows, and also speeding up other programs that are compiled from high level languages like C or Basic.

        Neither feature improves the speed of assembly language programs. Out of order execution does not assist code that has been written to run fast.
        On-chip cache does not help such code as much as plain old on-chip memory would.

        Therefore Intel's and AMD's focus on on-chip complexity is to favor Windows Benchmark programs.

        The fa
    • An out of order execution executes out of order and an in order execution executes in order.

      Get with the program. Sheesh
    • The article summary is strange. Nobody should be surprised by Intel's decision to base their next generation of CPUs on out-of-order execution. They've been doing that ever since the Pentium Pro. Outside of the Itanium, Intel has never gotten away from "embracing OOOE". I have no idea why they even brought up the subject of in-order and out-of-order execution.
    • by John_Booty (149925) <johnbooty AT bootyproject DOT org> on Friday April 07, 2006 @01:03AM (#15082176) Homepage
      It's a philosophical difference. Should we optimize code at run-time (like an OOOE processor) or rely on the compiler to optimize code at compile time (the IOE approach)?

      The good thing about in-order execution is that it keeps the actual silicon simple and uses fewer transistors. This keeps costs down and engineers have more die space to "spend" on other features, such as more cores or more cache.

      The bad thing about in-order execution is that your compiled, highly-optimized-for-a-specific-CPU code will only really perform its best on one particular CPU. And that's assuming the compiler does its job well. Imagine a world where AthlonXPs, P4s, P-Ms, and Athlon64s were all highly in-order CPUs. Each piece of software out there in the wild would run on all of them but would only reach peak performance on one of them.

      (Unless developers released multiple binaries or the source code itself. While we'd HAVE source code for everything in an ideal world, that just isn't the case for a lot of performance-critical software out there such as games and commercial multimedia software.)

      As a programmer, I like the idea of out-of-order execution and the concept of runtime optimization. Programmers are typically the limiting factor in any software development project. You want those guys (and girls) worrying about efficient, maintainable, and correct code... not CPU specifics.

      I'd love to hear some facts on the relative performance benefits of runtime/compiletime optimization. I know that some optimizations can only be achieved at runtime and some can only be achieved at compiletime because they require analysis too complex to tackle in realtime.
      • The bad thing about in-order execution is that your compiled, highly-optimized-for-a-specific-CPU code will only really perform its best on one particular CPU. And that's assuming the compiler does its job well. Imagine a world where AthlonXPs, P4s, P-Ms, and Athlon64s were all highly in-order CPUs. Each piece of software out there in the wild would run on all of them but would only reach peak performance on one of them.

        That may have mattered in previous iterations of CPU hardware, but haven't the last f

        • That may have mattered in previous iterations of CPU hardware, but haven't the last few generations of AMD & Intel CPUs used the same instruction sets?

          You can have two processors that implement the exact same instruction set, yet have entirely different performance characteristics.

          Of course, this happens even with complex out-of-order cores. With simpler, in-order cores, the difference really grows. You need to tightly couple your code (typically via compiler optimizations, unless you're hand-coding a

        • When I learned to code, I was taught that multiplication was expensive, and shifting was cheap. If at all possible, I should replace power-of-two multiplications with shifts. In some cases, it was even better to replace constant multiplications with sequences of shifts and adds. This was so common that (when I checked a year ago), GCC output shift/add sequences for all constant multiplications.

          The Athlon, while instruction-set compatible with previous CPUs, had two multipliers on chip and only one shift
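The shift/add rewrite described above looks like this in practice. A minimal sketch in Python, where the helper names are hypothetical and the general routine is a textbook binary decomposition, not necessarily what GCC emits:

```python
def times_10_shift_add(x: int) -> int:
    """x * 10 rewritten as x * 8 + x * 2, using shifts instead of a multiply."""
    return (x << 3) + (x << 1)


def times_constant_shift_add(x: int, constant: int) -> int:
    """Multiply by a non-negative constant with one shift and add per set bit."""
    result = 0
    bit = 0
    while constant:
        if constant & 1:
            result += x << bit
        constant >>= 1
        bit += 1
    return result


for x in (0, 1, 7, -3, 12345):
    assert times_10_shift_add(x) == x * 10
    assert times_constant_shift_add(x, 10) == x * 10

print(times_10_shift_add(7))  # 70
```

Whether this beats a plain multiply depends on how many shifters and multipliers the target core has, which is the point the comment is making about the Athlon.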

      • BEGIN RANT *sighs* If only programmers today were concerned with efficient, correct, and maintainable code. In reality the lazy/money factor usually wins out nowadays. That is why you see 10 billion frameworks out there and every project uses a handful of them.

        Usually said programmers sell out efficient code claiming that the framework has been tested and worked on by a lot of people, blah blah blah. The truth is that two good programmers will churn out roughly the same number of bugs per 1000 lines of code
      • That's where a project like LLVM comes in. Platform-neutral binaries via LLVM bytecode, and full processor-specific link-time native compilation+optimization when a binary is installed. Alternatively, you can JIT the bytecode at runtime. Developers just distribute LLVM bytecode binaries, and the installers/users do the rest. I think the LLVM approach is the future.
        • Wow, that sounds fascinating. Sounds like that achieves the best of all worlds with minimal drawbacks.

          I'd seen the odd reference to LLVM in the past, but I'd never seen a succinct description of its benefits until now. Thanks for the informative reply.
      • Imagine a world where AthlonXPs, P4s, P-Ms, and Athlon64s were all highly in-order CPUs. Each piece of software out there in the wild would run on all of them but would only reach peak performance on one of them.

        Not really. The best case for any in-order processor is to have dependent instructions as far apart from each other as possible. From this state, no amount of re-ordering instructions by an OoO processor will give any performance benefit. Similarly, no in-order pipeline will be particularl

        • Not really. The best case for any in-order processor is to have dependent instructions as far apart from each other as possible. in-order pipeline will be particularly disadvantaged by this.

          You're assuming that the definition of "dependent instructions" is the same for every in-order processor sharing the same instruction set. I think that's a highly suspect assumption!

          Different theoretical in-order x86 CPUs would surely differ in terms of execution units and other factors.
        • The fundamental problem is that the compiler doesn't know at compile time exactly what the dependencies will be at runtime. Branches and memory operations, which are extremely common in most software, create dependencies that the compiler cannot analyze at compile-time, but the processor can analyze at run-time. In the real world, in-order versus out-of-order isn't just a matter of code scheduling, but fundamentally limits the types of code you can run at high speed.
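A small illustration of a dependency that only exists at run time: whether the load below depends on the preceding store is decided by the index values, which a compiler generally cannot see. A sketch in Python; the function and its arguments are hypothetical:

```python
def store_then_load(memory, store_index, load_index, value):
    """Store one element, then load another and add 1 to it.

    The load depends on the store only when the two indices collide,
    and that is known only once the index values exist at run time.
    """
    memory[store_index] = value      # store
    return memory[load_index] + 1    # load, then use the loaded value


# Same code, two different dependency patterns:
print(store_then_load([0, 0, 0], 1, 1, 5))  # indices collide: load sees the store
print(store_then_load([0, 0, 0], 1, 2, 5))  # indices differ: load is independent
```

Hardware that tracks addresses as they are computed can let the second case run ahead, while a static schedule has to assume the first case unless it can prove otherwise.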
      • "The bad thing about in-order execution is that your compiled, highly-optimized-for-a-specific-CPU code will only really perform its best on one particular CPU. And that's assuming the compiler does its job well.
        (Unless developers released multiple binaries or the source code itself. While we'd HAVE source code for everything in an ideal world, that just isn't the case for a lot of performance-critical software out there such as games and commercial multimedia software.)"

        This isn't an issue that couldn't

        • Actually you didn't miss the binary build distribution, sorry about that. The idea of releasing the source-code for every application seemed like non-sense. Binary build distribution is the status quo, and therefore doesn't even present a challenge in my mind. Which is why I assumed you missed this answer. Again, my apologies for misreading you.
          • In the above quote, you placed the burden of efficiency on the compiler, and on this quote, you placed the burden of efficiency on the programmer. Which is responsible for the optimization of the resulting binary, the compiler or the language?

            You certainly made a great point here, though. To be honest, I'm not sure of the answer. I was banking on it being "both".

            I'm going on various (admittedly secondhand) things I've heard about Xbox360/PS3 development along with several whitepapers I've read. Creating c
    • OOOE breaks up the instruction stream execution order so that as many execution units are busy as possible, thus maximizing performance. While this is done, the hardware checks data dependencies between instructions so that the correct results are still produced. For example, if there is an integer add followed by a fp multiply and then a branch, it could theoretically execute all three in parallel assuming enough execution units are available. But then lots of problems come up such as if the fp multiply gener
    • At a technical level, the difference between OOO and IO is thus: an OOO Processor can issue, via a structure called a reservation station, instructions in an order other than what is in the code stream. So say the CPU decodes instructions A, B, C, and D, in that order. These instructions go into a reservation station. Instructions in this structure sit there until all its source operands are available. That means if A, B, C, and D enter the RS, but B and C's operands are available before A and D's, B and C
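The A, B, C, D example can be played out in a toy model. A minimal sketch in Python, assuming made-up operand-ready cycles and a 2-wide machine; it models only the issue decision, not a real reservation station:

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    ready_at: int  # assumed cycle at which all source operands are available


def issue_out_of_order(instrs, width=2):
    """Each cycle, issue up to `width` waiting instructions whose operands are ready."""
    waiting = list(instrs)
    schedule = []  # list of (cycle, instruction name)
    cycle = 0
    while waiting:
        ready = [i for i in waiting if i.ready_at <= cycle][:width]
        for instr in ready:
            schedule.append((cycle, instr.name))
            waiting.remove(instr)
        cycle += 1
    return schedule


def issue_in_order(instrs, width=2):
    """Each cycle, issue up to `width` instructions, never passing a stalled one."""
    waiting = list(instrs)
    schedule = []
    cycle = 0
    while waiting:
        issued = 0
        while waiting and issued < width and waiting[0].ready_at <= cycle:
            schedule.append((cycle, waiting.pop(0).name))
            issued += 1
        cycle += 1
    return schedule


# Assumption: B and C have their operands at cycle 0, A and D at cycle 3.
program = [Instr("A", 3), Instr("B", 0), Instr("C", 0), Instr("D", 3)]
print(issue_out_of_order(program))  # B and C go first, then A and D
print(issue_in_order(program))      # everything waits behind A
```

In this model the out-of-order machine finishes issuing at cycle 3 and the in-order one at cycle 4, because B and C sit behind the stalled A.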
  • by lordsid (629982) on Thursday April 06, 2006 @11:08PM (#15081967)
    The real problem with dupes isn't the fact that there are the same two articles on the front page, nor the whines that come from it, or even the witty banter chiding the mods.

    If I see an article I've already read at the top of the page I QUIT READING.

    This has happened to me several times over the number of years I've read this site. Then I end up coming back and realizing it was a dupe and that I missed several interesting articles in between.

    • You say that under the assumption that you are in the majority that realise it's a dupe (you are in a minority). This article was probably posted for those that missed the previous article, and seeing the comments about dupes leads people to actually click on the link.

      I'm one of those people that just read summaries, and decide not to click on the link because it doesn't interest me. Seeing people say "dupe" leads me to think this article was worth posting twice.

      Or Ars wasn't pleased with the ad-clicks from the p
  • by Gothmolly (148874) on Thursday April 06, 2006 @11:25PM (#15082044)
    Wasn't the Achilles heel of the P4 and Itanium crappy code, that caused a pipeline stall on their very long pipes? Every time someone pointed out that AMD didn't have this problem, an Intel fanboy would reply that "with better compilers" you could avoid conditions where you'd have to flush the pipeline, thus maintaining execution speed.
    Well, those "better compilers" don't seem to be falling from the sky, and AMD is beating Intel in work/MHz because of it.
    Is Intel finally deciding "screw it, we'll make the CPU so smart, that even the crappiest compiled code will run smoothly" ?
    • This was a problem with the Itanium, not the P4. The problem with the P4 was that the pipeline was very long and wide. A P4 could have 150 (from memory) instructions in-flight at once. On average, every 7th instruction is a branch. Every branch that is incorrectly predicted causes a pipeline flush (i.e. 150 instructions, at various stages of execution, are ignored). With a prediction rate of 95%, this means you will have an incorrect prediction every 20 branches. Since 20 branches means roughly 140 in
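      The arithmetic above can be written out as a quick sketch. The variable names are mine; the figures (a branch every 7th instruction, 95% prediction accuracy, about 150 instructions in flight, quoted from memory) are the ones given in the comment, not measured values.

```python
from fractions import Fraction

# Figures taken from the comment above (the in-flight count is "from memory").
instructions_per_branch = 7             # on average, every 7th instruction is a branch
prediction_rate = Fraction(95, 100)     # 95% of branches predicted correctly
in_flight = 150                         # instructions in the pipeline at once

# One misprediction every 1 / (1 - rate) branches.
branches_per_mispredict = 1 / (1 - prediction_rate)

# Convert that to instructions executed between mispredictions.
instructions_per_mispredict = branches_per_mispredict * instructions_per_branch

print("branches between mispredictions:    ", branches_per_mispredict)
print("instructions between mispredictions:", instructions_per_mispredict)
print("instructions in flight:             ", in_flight)
```

      That gives 20 branches, or roughly 140 instructions, between mispredictions, which is the same order as the number of instructions in flight.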
  • I just want a planet with two cores now.
  • by TheSHAD0W (258774)
    Does this mean we're not going to be seeing mid-ten-digit clock rates any more? That was one thing that really annoyed me about the P4; a 2 GHz P4 was NOT more than twice as fast as an 850 MHz P3. It meant one couldn't compare CPUs with each other any more.
    • It does mean that the long-pipeline + high clock strategy of Netburst will be abandoned. Presler and Dempsey are the last of that ill-fated breed.

      However, Conroe has been announced to hit speeds as high as 3 GHz (or higher) for Intel's next Extreme Edition part. We may see speeds that high for the server version of Conroe (Woodcrest) as well.
    • Not likely on the Intel side, but IBM has made good progress with the Power6. They have managed to keep the pipeline at 13 stages, while clocking it at 4-5GHz. This is in contrast to the P4 with its 31 stage pipeline and much higher power consumption. It seems that Intel has given up prematurely, or perhaps their process technology/ISA are not as amenable to such optimizations.

      Now, frequency isn't everything, but performance scaling is nearly linear if you hold the pipeline depth constant. (And scale

    • Re:GHz (Score:2, Interesting)

      by jawtheshark (198669) *
      That was one thing that really annoyed me about the P4; a 2 GHz P4 was NOT more than twice as fast as an 850 MHz P3. It meant one couldn't compare CPUs with each other any more.

      You never could do that in the first place. Within a CPU family, it used to be possible. (With Intel's naming scheme today, I can't do it anymore either!) Compare a P-III 500MHz to a P-III 1GHz and you knew that the latter was approximately twice as fast. A 2GHz AMD Athlon XP was approximately twice as fast as a 1GHz AMD Ath

  • by Nazo-San (926029) on Friday April 07, 2006 @01:08AM (#15082217)
    I just thought it should be stated for the record. Moore's law isn't a definite fact that cannot be disproven. It has been working so well up to now and will for a while yet that it is rather easy to seriously call it a law, but, we shouldn't forget that, in the end, there are physical limitations. I don't know how much longer we have until we reach them though. It could be five years, it could be twenty. It is there though and eventually we will hit that point to where transistors will get no smaller no matter what kind of technology you throw at it. At that point, a new method must be put into place to continue growth. This is why I personally like reading Slashdot so much for articles on things like quantum computing and the like. Those may be pipe dreams perhaps, but, the point is, they are alternate methods that may have hope someday of becoming truly powerful and useful. Perhaps the eventual successor to the current system will arise soon? Let's keep an eye out for it with open minds though.

    Anyway, I do understand a bit about how it all works. OOOE has amazing potential, but, in the end the fact remains that you can only optimize things so much. The idea there is actually to kind of break up instructions in such a way that you can actually kind of multi-thread a task not originally designed for multi-tasking. A neat idea I must say, with definite potential. However, honestly, in the end the fact remains that you will run into a lot of instructions that it can't figure out how to break up or which actually can't be broken up to begin with. If they continue to run with this technology, they will improve upon both situations, but, in the end, the nature of machine instructions leads me to believe that this idea may not take them far to be brutally honest.

    Let's not forget that one of the biggest competitors in the processors that focus on SIMD is kind of fading now. Apple is going to x86 architecture with all their might (and I must say I'm impressed at how smoothly they are switching -- it's actually exciting most Apple fans rather than upsetting them) and I think I read they no longer will even be producing anything with PowerPC style chips, which I suppose isn't good for the people who make them (maybe they wanted to move on to something else anyway?) At this point it's looking like it's more and more just the mobile devices who benefit from this style of chip, which is primarily just due to the fact that between their lack of need for higher speeds and overall design to use what they have efficiently, they use very little power and do what they do well in a segment like that.

    Multi-threading, however, is a viable solution today and in the future as well. It just makes sense really. You start to run into the limitations as to how fast the processor is going to run, how many transistors you can squeeze on there at once, power and heat limitations, etc, however, if you stop at those limits and simply add more processors handling things, you don't really have to design the code all THAT well to take advantage of it and keep the growth continuing in its own way. I can definitely see multicore having a promising future with a lot of potential for growth because even when you hit size limitations for a single core you can still squeeze more in there. Plus, I wonder if multicore couldn't work in a multi-processor setup? If it can't today, won't it in the future? Who knows, there are limits on how far you can go with multi-core, but, those limits are further away than single core by far and I really feel like they are more promising than relying on smart execution on a single core running around the same speed. In the end, a well designed program will be splitting up instructions on a SMP/multicore system much like the OOOE will try to do. While the OOOE may be somewhat better at poorly designed programs (ignoring for a moment the advantages that multithreading provides to a multitasking OS since even on a minimal setup a bunch of other stuff is running in the background) overa
    • No way Jose, SIMD isn't going out of style at all. What do you think the SPEs of the Cell processor do best? SIMD. What did Intel put a LOT of resources into in its new Core architecture, theoretically doubling the speed of this part? SIMD. What is it that makes it possible for a PII300 to decode DivX, or a P4 3GHz/Athlon64 2GHz able to decode video in HDTV resolutions? SIMD. It wouldn't stand a chance with just regular scalar instructions. MMX/SSE2 are essential.
      • You misunderstand me. I mean major processors fully relying on this sort of method such as the PowerPC. Of the things you mentioned, only the Cell is actually a truly modern thing (which, btw, I hear is basically a PowerPC style chip.) Instructions like MMX are definitely useful, but, does the processor rely on them almost exclusively? You see, the almost pure SIMD processors run far slower and rely on getting a lot of stuff done at once while the x86 architecture we're so used to runs blazing fast and g
    • Google search for PowerPC shows that they are IBM chips. Just thought you might like to know. They're not exactly looking to get out of the processor market, and I am even under the impression that they use PPC in their datacenter style servers, etc. Just so's use-all knows, s'all 'm sayin'.

      Plus, I wonder if multicore couldn't work in a multi-processor setup?

      Well, I work for a major computer manufacturer (think top 3, they also make very nice printers [market leaders you might say]) and the Enterprise

      • The initial "Google search" in the above should be Google search []
      • Only thing I'm really in any disagreement at all about is the popularity of the PowerPC processors today (not even a year ago, but, specifically today.)

        It sounds like you're advocating the oft-mentioned point that games are the main thing that will benefit. Well, this is true, but, there are some business or non-gaming oriented things where people will see the differences as well, and these shouldn't be discounted either. Firstly, we're going to need those things like MMX I guess. MS is determined that o
        • Didn't mean to give the impression that I thought that the multi setups were better for games, I don't game that often [wait for collective slashdot sigh to dissipate]. I personally would rather see better multi-threaded application support, however, IIRC, the big programs out there are: Adobe, Autodesk (Engin minor) and SAP, etc. So now we're left with a bunch of not-necessary-that-they-run-at-all programs which may or may not be multithreaded, and programs like Word or Excel that it really wouldn't be
    • OOOE has amazing potential
      Which has been realized for about the past 20 years. Exploiting Instruction Level Parallelism (which requires an out-of-order-execution processor) has gotten us to where we are today. We're reaching the limits of what ILP can buy us, so the solution is to put more cores on a chip.

      It may be possible to integrate OOOE into a multicore.
      It is possible, and every single Intel multicore chip has done it. Same with IBM's Power5s. For general-purpose multicore processors, that is the norm.
  • by hobotron (891379) on Friday April 07, 2006 @01:55AM (#15082399)

    Alright, mod me offtopic, but /. could just use the beta tags: if "dupe" shows up after a certain number of tags (or however they calculate it), have the story minimize to the non-popular story size that's in between main stories. I don't want dupes deleted, but this would be a simple solution that would get them out of the limelight.

    • Here's the real problem: I'll bet that Arstechnica pays Slashdot to have their article linked from the main page. Doing what you suggest would remove the article summary from the main page, and Slashdot would lose a revenue source.
  • The Merom/Conroe/Woodcrest cores are NOT based on the PM, aka Banias/Dothan/Yonah. Any even cursory look at the architectures, from pipeline depth to functional units will show they are totally different.

    Who keeps perpetuating this stupidity, and when have we as a culture lost the ability to look past shiny things shown to us by guys in lab coats? The cores rock, the previous cores rock, they are not the same.

    Just because Merom is more like the PM than the P4 means all of squat.
