Is SMT In Your Future?

Dean Kent writes "Simultaneous MultiThreading is a method of getting greater throughput from a processor by essentially implementing multi-tasking within a single CPU. Information on Compaq's plans for SMT in the EV8 can be found in this article and this article. Also, there is some speculation that Intel's Foster CPU (based upon the Willamette core) will also have SMT, and that the P4 may even have the circuitry already included, as discussed briefly in forums."
  • by Anonymous Coward
    This has already been done, about 5 years ago, by Tera Computing. See http://www.tera.com -- they recently renamed themselves Cray, Inc. They had the Multi-Threaded Architecture (MTA). See their old news bulletins from about 1998/1999. Supposedly, one MTA processor could handle 128 instruction streams (threads) in what can be viewed as a virtual processor implementation. The real problem is writing a compiler to take advantage of this. They sold a four MTA processor system to SDSC (San Diego Supercomputer Center) awhile back. See the following for a RealAudio streaming explanation of it. http://www.cray.com/products/systems/craymta/video.html
  • by Anonymous Coward
    It's harder to create a bus system for eight separate processors.
  • Generally, the argument is one of utilization. The goal of an SMT processor is to be as efficient as possible. Think throughput rather than latency.

    With an SMP, each thread has resources dedicated to it: caches, function units, etc. In an SMT system these are shared dynamically across threads. Theoretically, each thread uses just as many resources as it needs for its level of instruction-level parallelism. So instead of each processor using, say, 2 integer units out of four available, you now have 8 integer units being used at 90% capacity by multiple threads.

    Note that these threads need not all be from the same program, either. SMT works great in a multiprogrammed environment.

    Due to its fast context switching, we're going to see some...interesting applications of threading. Check out the MICRO/ISCA/PACT, etc. papers on Dynamic Multithreading, Polypath architectures and Simultaneous Subordinate Microthreading (all of which, BTW, increase the performance of single-threaded applications). Wild stuff is on the horizon.

    --

  • IMHO it's not worth it. That kind of work often requires serious rethinking of data structures which generally affects every part of the program. Eventually you end up rewriting the whole program. Programmer time is too expensive for that. Better to properly design things for readability and maintainability than to try to get 5% more performance out of it.

    This is why we have compilers and hardware. Compilers already do a fair amount of program transformation. Often the programmer, in a quest for "optimization," screws something up for the compiler, usually by doing "fast" pointer manipulation or using global variables.

    --

  • The cores are actually able to execute in different contexts as well, not just within the same context as with SMT.
    Why is this a limitation of SMT? SMTs have been simulated with multiprogrammed workloads for years. I'm honestly curious, as I may well be missing something obvious.

    Do you have a reference for the paper? I know several folks who'd be interested. Where did the 20% improvement of the MSC come from?

    I can only assume the MSC is an abbreviation for the MultiScalar architecture (MSC == MultiScalar Computer?) that came out of Wisconsin. Is this correct?

    --

  • Heh...you've asked your way into a very complex environment.

    There are many factors that affect the ILP available in "typical" programs. The two most important limiting factors are the memory subsystem and the branch predictor. Anywhere from 30%-60% (depending on architecture) of your dynamic instructions are memory operations. When these miss in the cache, it takes a long time to service them. This backs everything up in the instruction window. O-O-O tries to get around this by issuing independent instructions to the core. The problem is, either no such instructions are available or they also block on a memory operation.

    On the I-fetch side, the branch predictor is responsible for feeding the "right" stuff into the instruction queue. If a prediction is incorrect, the processor generally has to blow away everything in progress and start over on the right path. With the deeper pipelines we're seeing, this is only going to get more expensive. Even a 90% correct predictor incurs a huge penalty because the processor sees a branch every 5-8 instructions or so. The multiplicative factors ensure that accuracy diminishes quickly. No one has yet come up with a good multiway branch predictor.
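
    To put rough numbers on that (my arithmetic, not the poster's): with 90% per-branch accuracy and, say, a branch every 6 instructions, the odds that an entire instruction window holds only correct-path work decay geometrically. A quick sketch in C:

        #include <math.h>
        #include <stdio.h>

        int main(void) {
            const double acc = 0.90;  /* per-branch accuracy (from the post) */
            const double per = 6.0;   /* assumed instructions per branch */
            for (int n = 16; n <= 256; n *= 2)
                printf("%3d-instruction window: all correct-path %4.1f%% of the time\n",
                       n, 100.0 * pow(acc, n / per));
            return 0;
        }

    That works out to roughly 75% at 16 instructions but only about 10% at 128 - which is why deeper speculation demands much better predictors.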

    So on one level, the hardware is to blame because it doesn't work right. Not only do memory and branches choke off the available ILP, the machine can't look "far away" to discover distant parallelism. Instructions after a function call, for example, are often independent of those before the call, but there is no way the processor can fetch that far ahead.

    Enter the compiler. It is the compiler's job to schedule instructions such that the processor can "see" the parallelism. Unfortunately, this is very hard to do. Mostly this is due to the static nature of the compiler -- while the compiler can look far ahead (theoretically at the whole program, in fact), it doesn't know what will happen at runtime. The hardware has the advantage of (eventually) knowing the "right" path of execution. A compiler generally cannot schedule instructions above a branch because it doesn't know whether it is valid to execute them. In fact, the validity changes with each dynamic pass through the code.

    We're seeing some of these limitations being lifted with dynamic translation and recompilation. Unfortunately, you're now saddling the compiler with the limitations of the hardware: limited lookahead. It is too expensive to do a "really good" job of optimization at runtime. Still, there is some improvement to be had here.

    To sum up, the blame lies neither solely with the hardware nor with the software. There is a complex interplay here that is only now beginning to be understood.

    --

  • I can easily see an engineering workstation making good use of SMT. In my environment, I'd love to fire off 4+ compilation threads to my SMT processor. Right now we have to use expensive quad Xeons to do it. SMT is cheap multiprocessing for the masses.

    --

  • Erm...huh? I've not heard of it. Can you provide a reference?

    Perhaps he's referring to the G5 processor? IBM's S/390 line has used multiple execution cores for a while, but not for throughput. They use them for verification and reliability. One core checks the other and if one fails, the processor shuts down and its work is transferred to another node in the SMP system.

    --

  • For optimal performance, the compiler has to generate instructions to release registers when their values are not needed anymore.

    Reference? I don't recall reading anything about "register freeing" instructions wrt. SMT. A compiler "releases" a register by redefining its value. They've done that for years. :)

    It's true that a machine must hold onto a physical register until a redefinition of the corresponding logical register is committed, but this isn't a problem in "traditional" O-O-O architectures, where the number of physical registers is adjusted to eliminate any difficulties this might imply. Register-caching architectures need to worry about stuff like this, but it has nothing to do with SMT per se.

    --

  • Eliminating it absolutely is the goal of SMT! SMT works on the principle that when a thread blocks due to a cache miss, mispredicted branch, etc. it can execute from another thread. Don't think in terms of heavyweight threads, synchronization and SMP.

    An SMT doesn't really "context switch" in the traditional sense of the word. The reason it is "simultaneous" is that all threads are executing at the same time, like an SMP. It only "context switches" in the sense of fetching "more" from whatever threads are not currently blocked (or predicted to be blocked).

    An MTA (multi-threaded architecture, of which SMT is a variation) really does "context switch," but it is fast enough that it can be done every few instructions. This is a very fine grain of threading, more akin to instruction-level parallelism than SMP.

    With an SMT/MTA, the job of the OS (like on an SMP) is to schedule runnable threads on the processor. Beyond that the hardware takes care of deciding what to fetch and from where.

    --

  • Well, not directly, though it's similar. It's closest to 'CMT' (coarse multi-threaded) in the article. Each microengine has 4 threads, and it switches to another thread on a memory access. If CMT can get a 166MHz processor with 6 cores to keep up with Gb line rate packet forwarding, I can't wait to see what SMT can do to the Alpha at 2GHz.
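
    For scale (my own numbers, assuming worst-case 64-byte packets - not from the post):

        #include <stdio.h>

        int main(void) {
            const double clock_hz = 166e6;   /* per microengine (from the post) */
            const int    cores    = 6;       /* microengines (from the post) */
            const double line_bps = 1e9;     /* Gb line rate (from the post) */
            const double pkt_bits = 64 * 8;  /* assumed minimum-size packets */

            double pps = line_bps / pkt_bits;
            printf("%.1fM packets/s -> ~%.0f cycles of total budget per packet\n",
                   pps / 1e6, clock_hz * cores / pps);
            return 0;
        }

    That's about 510 cycles per packet across all six microengines - a single memory stall eats a big chunk of that, which is why switching threads on every memory access pays off here.
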
  • This essentially comes from using the same processor features you need for other performance gains (register renaming, out-of-order execution) more fully, with a small amount of extra state.

    The first thing you do is get the OS out of the way by making the processor able to handle multiple threads at once without involving OS code in context switching. This is a definite win, because context switches done by the OS kill the pipeline and often kill the cache. If the processor is in charge of all of this, it can do better, because it is closer to what is going on.

    After that, the only innovation is to use the out-of-order-execution support on the whole set of things the processor is doing, instead of just one thread.

    Considering what they're already doing, this isn't a lot of new complexity, but it should help performance significantly. The main problem with out-of-order stuff has generally been that there isn't really all that much implicit parallelism within a thread. Adding other threads which do not, in general, have any dependencies gives the instruction scheduler much more to work with.
  • I imagine there would be a similar problem to what Intel is facing right now with the P4. I don't claim to be a compiler technology guru, but I imagine the EV8 will only be very fast when code was compiled specifically for it (then again, which other processors isn't this true for?)

    The Transmeta Crusoe?

  • When you talk about server rooms, instead of ye old Pentium box doing some printer sharing, 250 watts becomes small.

    Absolute balderdash. In your data centre, power usage is /more/ of a concern than for a PC at home or in a small office. You have to be careful and make sure you don't draw more power from a rack distribution box than it is rated for.

    Then you have to make sure that the peak power drawn by distribution boxes attached to one of your UPS line doesn't exceed what it is rated for.

    Then you have to make sure the total peak draw on your UPS is below its spec.

    Then you have to make sure you have enough mains 3 phase to feed your UPS at peak output.

    Then you have to make sure that your generators can supply the UPS at peak.

    So if equipment such as servers were to suddenly jump in power requirement from ~700W peak to maybe 2.5kW peak, this would poke a huge hole in your power planning.
  • the AS/400 has been a true/full 64 bit machine since 1995

    Actually, the AS/400 is just a new generation of the S/38, which was 64-bit back at least as far as the early 80s.
  • The problem with 8 CPUs on one die is that if one CPU has one tiny flaw, you have to chuck the whole die.

    WRONG!

    With 8 CPUs on a die, you can deactivate the damaged CPU. This is often done with level 2 cache and I believe it's even done with memory. Sell it as a 7 CPU chip? 4?

    Obviously this can't be done with LCD displays since you can see which pixels would be deactivated.

  • As long as your game engine can feed the video card at 30fps and you can do real-time video (de)/compression, what next?

    more polygons per frame, of course... and higher resolution images.... hardware NURBS.... real-time ray tracing.... don't worry, there are plenty of ways to use the extra CPU cycles!

  • A lot of those "uneducated American beer-swilling, pork rind-munching citizens" are either in Congress, or working for the INS, and ensure that the rest of the UABSPRM citizens have their jobs "protected" by keeping them dang furriners out. At least, that's all one can conclude if you've ever dealt with any of these people.

  • Ummm...Dean Kent is the guy who runs Real World Technologies. Though why the actual author didn't submit it, I don't know.
  • True, although register renaming eliminates some of the "data-dependency" problem he mentioned.

  • I could look up the reference, but the *star series (Northstar, etc.) of processors used in the AS/400 and RS6K boxes are not SMT. It's more vertical multithreading, as the comp.arch guys like to call it.
    The deal is that the processor executes instructions from one stream until it takes a cache miss, at which point it takes a small context switch overhead (something like 6 cycles) and starts executing another thread until that thread takes a cache miss. IBM's implementation is more a method of hiding memory latency than a way to increase IPC. Of course, hiding memory latency causes the average system-global IPC to go up, but it also slows down the thread-specific IPC. There are also issues with the TLB, cache, etc. being shared between both threads, which can cause TLB thrashing - a nasty performance problem. The result is that the performance improvement isn't nearly what the theory says it can be, but it is a nice, cheap way to gain a small but still significant performance improvement.
  • Ah, there's a reference below.
  • Similar story with the same links is posted on the Ars Technica [arstechnica.com] frontpage today, except Hannibal of Ars is the one that did the research and came up with these articles at Realworldtech.com [realworldtech.com]. At least "Dean Kent" managed to read a few of the links a bit before submitting this info as his own. Perhaps next time you can submit a link to the rightful author's page rather than bypassing it and claiming his hard work as your own.
  • Check out the Octium IV [e-com-con.com] - it needs its own Hydrogen Enhanced Ductless Cooling System just to handle its 9000 parallel SOUs!

  • Being a nerd on the internet, my future involves more SMUT than SMT...

  • You could heat a swimming pool with that.

    Actually, 2kW is just the power consumption of one or two cooking plates. So we can expect that the properly networked kitchen is only a short time away, although it probably won't be the fridge which has the most computing power...

  • Let's take a look at the projected power consumption of the 21464/EV8:

    250W / 1.2V = 208.3 A

    I wonder how they're going to push that kind of current through the processor core and how they want to cool that baby - considering that the die size has only increased by a factor of 1.5 since the 21064, but the power consumption has increased by a factor of 8.3, so the cooling will have to be 5.6 times as efficient as for the 21064.

    Will the 21464 have to be submerged in liquid nitrogen to avoid death by spontaneous evaporation?

  • OK, here's an idea I had about 5 years ago, but was told by a friend who knows more about CPU design than I do that it would not work. If this turns out to be a good idea, I'm going to be pissed. :-) My understanding is that one of the big things limiting how much parallelism you can get is data dependencies. A given program or thread tends to be basically computing one thing at a time, and so things tend to depend on earlier things. To address this, my idea was to put 2 sets of registers (general registers, stack pointer, program counter, paging control stuff, etc) on the CPU. Set 1 is the main set, which the CPU normally executes using. Whenever the instruction stream for set 1 is not parallelizable enough, the CPU could take instructions from set 2. Since set 1 and set 2 would be referring to different processes, there would be no data dependencies between them.
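
    For what it's worth, here's a toy sketch in C of that two-register-set idea (names and fields are mine, purely illustrative): when set 1 stalls or can't fill the issue slots, the core draws from set 2 instead of idling.

        #include <stdio.h>

        /* Toy model, not a real design: two architected contexts on one core. */
        typedef struct {
            unsigned long regs[32];   /* general registers */
            unsigned long pc, sp;     /* program counter, stack pointer */
            unsigned long ptbr;       /* paging control state */
            int stalled;              /* blocked on memory, or no issuable work */
        } context_t;

        static context_t ctx[2];
        static int primary = 0;

        /* Per issue slot: prefer set 1; when it can't proceed, draw from
         * set 2. Since the sets belong to different processes, their
         * instructions can't depend on each other. */
        static int pick_issue_source(void) {
            if (!ctx[primary].stalled)
                return primary;
            if (!ctx[1 - primary].stalled)
                return 1 - primary;
            return -1;                /* both stalled: pipeline bubble */
        }

        int main(void) {
            ctx[0].stalled = 1;       /* set 1 waiting on a cache miss */
            printf("issue from register set %d\n", pick_issue_source() + 1);
            return 0;
        }
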
  • On top of all that, to get the best performance from SMT processors you need very smart compilers that are able to find parallelizable code and generate the binary for such. With MSC this isn't a problem. It'll run multi-threaded code simultaneously, but it'll also run multiple processes or any combination of both processes and threads simultaneously without help from smart compilers.

    Actually, I believe you're thinking of VLIW (aka EPIC) architectures. They need special compilers because parallelism is expressed in the instruction stream itself rather than discovered by the scheduling logic in the CPU. (Essentially, the compiler does some or all of the work of instruction scheduling.)

    SMT does not have this problem. It needs support in the OS (to present OS-level processes/threads as CPU-level threads), not the compiler.

    [BTW, I work for Compaq on the Araña (aka EV8) project that this article is about.]

  • I work for Compaq on the Araña (aka EV8) project that this article is about.

    Floating around our offices there's this comic somebody in the Alpha group drew years ago about the somewhat laughable concept of an Alpha-based portable. It shows the "Alpha notebook/backyard barbecue" crunching numbers and grilling steaks at the same time!

    [There actually was an Alpha notebook back in '96 [unixpac.com.au], but AFAIK there wasn't much demand for it and it died fairly quickly. Who would want to run VMS on a laptop anyway?]

  • The major microprocessor developers are all pursuing one of the following architectural paths:

    • SMT. Compaq is definitely doing this with the Alpha. There are rumors that IBM is working on it with their PowerPC line, although they may or may not have working prototypes.
    • VLIW: parallelism between instructions is discovered by the compiler and explicitly specified in the machine code. This is the way Intel and HP are going. [Don't be fooled by their new acronym EPIC, it's the same thing as VLIW. The only reason they don't want to say "VLIW" is that companies such as Multiflow that pursued VLIW in the '80s all died a flaming death when the technology proved unworkable, at least at that time.]
    • MCU: multiple simple processor cores in one package (possibly on one die, possibly as a multi-chip module). IBM has working prototypes of MCU PowerPC implementations. It's rumored that a research group in Compaq (but outside the Alpha microprocessor group) is developing an Alpha version.

    The reason for pursuing these is really a matter of differences in underlying philosophy. SMT is based on the philosophy that throughput is more important than single-stream performance. VLIW is based on the belief that single-stream performance is most important. MCU is based on the notion that time-to-market is key.

  • Unfortunately, I don't have the paper anymore and it has been over a year since I read it. That recount is what I remember. There was a lot more, it was a most interesting paper. I held onto it for a long time, but misplaced it in a move 6 months ago. I'll email Prof. Berger and see if I can get another copy to post.

    Ryan Earl
    Student of Computer Science
    University of Texas
  • Doh, maybe if I spelled his damn name right. Dr. Burger not Dr. Berger.

    "Billion Transistor Architectures" in PDF format [utexas.edu].

    And here's his homepage [utexas.edu] with other articles you should find interesting. He's the hauss; one of the best professors I've ever had the pleasure of taking. The architecture is called CMP = Chip MultiProcessor.

    Ryan Earl
    Student of Computer Science
    University of Texas
  • In part two, Paul DeMone states:

    The IBM PowerPC RS64, also known as Northstar, is rumored to incorporate two way coarse grained multithreading capability, although it is not utilized in some product lines.

    The Northstar processor does in fact incorporate CMT. According to http://www.as400.ibm.com/beyondtech/arch_nstar_perf.htm, "Emphasis was placed on ... increasing the processor utilization by alternately executing two independent instruction streams." A similar page targeted at the RS/6000 audience (http://www.rs6000.ibm.com/resource/technology/nstar.html) makes no mention of multithreading, so it appears that this feature is not used in that product line.

    Also, "Northstar" refers to the A50 (AS/400) / RS64-II (RS/6000) processor. An older processor, the A35/RS64-I "Apache" did not have hardware multithreading.

  • by Anonymous Coward
    Intel's Foster will not have SMT, although it may have multiple CPU cores on die.

    The problem with processors today is that they stall, often for hundreds (or thousands) of CPU cycles waiting for memory (or device registers). Adding more processors helps some, as other programs can run on the other processors, but they can also stall for hundreds of cycles on every cache miss.

    What SMT does is allow another thread to use the execution units on THIS processor during those hundreds or thousands of cycles that the processor is waiting on a cache line. SMT is an enormous win, especially for programs that have forced cache misses (such as reading memory that was DMAed from a device, as is the case for network packets and disk blocks). Forced cache misses are why OS kernels don't scale with the hardware as much as applications. They are also the reason larger caches don't continue to buy more performance. (They also occur on SMPs when different processors modify the same cache line.)

    There are downsides to SMT, of course. First, it increases cache pressure (more working sets need to be kept in cache at a time). However, now the larger caches offer more benefit. And even if the cache isn't large enough, SMT allows the processor to maximally utilize it. Second, SMT increases the processor's complexity. Fortunately, most of the circuitry required for SMT is already required for out-of-order execution. Interestingly, that circuitry was removed from Intel's IA64 processors. Third, since the processors can't issue enough simultaneous bus transactions now, they would be further starved with SMT. This can be solved by increasing the number of outstanding requests that can be supported (which would be done for any reasonable SMT implementation).

    Explicit prefetching can also be beneficial at improving the execution of a single thread, but current processors do not allow enough outstanding memory transactions (typically 4-8; 4 on Intel, 8 on some RISC) for prefetching to be very useful. Rambus memory is so "slow" on x86 because the processor can't issue enough requests to the memory controller to keep the pipeline full. Think of the memory bus as a highway: we can engineer it to have more lanes, or increase the speed limit, but we need to increase the number of "cars" on the road to see the performance advantage. Being stuck with 4 is pathetic.
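
    Back-of-the-envelope version of that point (line size and latency are my assumptions): sustainable bandwidth is just outstanding requests * line size / latency, so the cap on outstanding misses caps bandwidth no matter how wide or fast the bus is.

        #include <stdio.h>

        int main(void) {
            const double line_bytes = 64.0;    /* assumed cache line size */
            const double latency_s  = 150e-9;  /* assumed round-trip latency */
            for (int misses = 4; misses <= 32; misses *= 2)
                printf("%2d outstanding misses -> %5.0f MB/s ceiling\n",
                       misses, misses * line_bytes / latency_s / 1e6);
            return 0;
        }

    With only 4 outstanding misses that's a ceiling around 1.7 GB/s, regardless of the memory technology behind it.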

    I hope the Alpha kicks some serious butt.

  • First the IBM 4 way die. Then the 52 way die some guy was raising venture capital to build. Now Compaq's 8 way die. With Intel dropping out of SMP support and AMD never seriously moving SMP support out the door, it looks like everyone who's tried SMP has gotten so little interest as to focus on higher clock speeds instead. Sounds like Compaq is getting ready to unload some Alpha engineers, and proposing SMP Alphas on a single die under the name "SMT" is their last attempt at career recovery.
  • The cores are actually able to execute in different contexts as well, not just within the same context as with SMT.

    Who says the threads all need to be from the same context? I asked a Compaq guy about that 14 months ago and he said they could be from different VM contexts, and (and this was a surprise to me) loads/stores in different MMU contexts can be done in one cycle.

    I don't recall if he was on the EV8 team, but he was in their CPU design department (his focus was on heat though).

  • Unfortunately, I don't have the paper anymore and it has been over a year since I read it. That recount is what I remember.

    My guess is that the paper and Compaq have slightly different definitions of SMT. I assume the paper chose one the author thought was interesting, or easy to evaluate, or easy to implement, or most constructive to evaluate, and Compaq chose one that would give good value for the design and transistor investment.

    Given that very little existing software could use all-in-one-MMU-context SMT (multithreaded programs only, and only CPU bound ones would take much advantage), and pretty much any CPU bound server workload (anything with more than one process) could take advantage of SMT with multiple MMU contexts.... of course that assumes the implementation cost isn't too horrific, but given that they picked it...

    Anyway, if you do get another copy of the paper I would love to see it. Even if it doesn't exactly address Compaq's SMT, it sounds interesting. I can't find it with Google, but maybe if you remember the title of the paper, or any authors other than Prof. Burger?

  • Why not put 8 processors behind one on-chip cache with write-back? That ends up looking like a single CPU off-chip. Aren't crossbars for explicitly-parallel algorithms?

    Thanks

    Bruce

  • Do you really have to chuck the whole die? Why not just blow the fuse to disable that CPU and sell a die with 7 CPUs instead of 8? My understanding is that memory yield-increasing technology works this way - they have "spare" rows that are production-time programmed to replace defective rows.

    Thanks

    Bruce

  • This crossed my mind as well. A good reason to favor SMT over SMP is that an SMP system wastes a lot of clock cycles on memory access. Just because a second CPU is still crunching while the first waits for data doesn't make the problem go away; in fact, it provides an opportunity for both CPUs to be wasting clock cycles. SMT improves processor utilization rather than decreasing it as SMP systems do. My only question is where I can buy a quad-CPU SMT system ;)
  • The problem with 8 CPUs on one die is that if one CPU has one tiny flaw, you have to chuck the whole die.
    Ah, but what if you had 32 CPUs visible on one die, and you actually manufactured 34? Then the "one or two tiny flaws" can be worked around by disabling each damaged CPU at a very low level, replacing it with one of the spares.

    This could result in an extremely fast, relatively cheap SMP machine. Yields would actually be much higher than for normal CPUs, even though the chips would be bigger.

    Each CPU could have its own level 1 cache, but they could share a big level 2 cache, and all the inter-CPU communication would be on the single chunk of silicon -- very fast!

    Hmmm. How about a Beowulf cluster of those! (duck).


    Torrey Hoffman (Azog)
  • The problem with 8 CPUs on one die is that if one CPU has one tiny flaw, you have to chuck the whole die. Then you have to charge customers for the one they buy and the one (or several) that you threw away making the one they bought. I'm blanking on the technical term for this, but it's a huge problem for LCD screens (which is why they're damn expensive). If you make 8 separate cheap CPUs, it would be a lot more cost effective on the CPU end. Then, as the other reply points out, the mobo/bus gets a whole lot more complicated.

    -B
  • Remember, these aren't your run of the mill desktop CPUs; they are designed for servers that take a considerable load, and can out-horsepower pretty much anything at home. When you talk about server rooms, instead of ye old Pentium box doing some printer sharing, 250 watts becomes small.
  • As someone with a lot of hands-on experience with network processors (although not the Intel one), I must add that all of the network processors I have investigated have many coprocessors for doing operations like lookups, queueing, and so forth. The idea is that the actual processor core does not have to do a whole lot.

    One cannot compare a network processor to a general purpose processor since all of the NPs I've looked at are very specific to one application, networking.

    For example, one could not run SETI@Home calculations on a network processor, nor could they run Linux, as their memory architectures are often limited with most of the program memory residing on-chip and/or in very fast SRAM. Right now the largest high-speed SRAM chip available is around 4MB. It becomes impractical to add more than 16 MB of SRAM due to loading of the bus (assuming 64-bit). At 166MHz it is even worse.

    As for multiple contexts, many of the network processors can switch between contexts very quickly, but also remember that NP cores do not have many of the things a general purpose processor has. There's no paging or fancy memory management, nor is there floating point.
  • Huh? An ordinary out-of-order execution processor with multiple functional units already exploits instruction level parallelism (ILP).

    Isn't the whole point of SMT specifically because there usually isn't enough ILP in a single thread to keep the CPU busy...so you expose additional thread-level parallelism to the out-of-order execution engine to hopefully keep things humming?

    If you think SMT is different, please explain!
  • The whole premise of SMT is that even with a modern out-of-order execution CPU, there are hardware resources sitting idle that other threads could use... Is this pretty much unavoidable when executing typical compiled C/C++ code, or is this just because the compiler is doing a lousy job of generating code that gives the CPU enough opportunity to reorder instructions to keep the execution units busy?
  • I'm sure this'll soon sound like Gates' "640KB should be enough for anyone", but it does seem that for every day stuff processors are getting close to as fast they need be anyway. As long as your game engine can feed the video card at 30fps and you can do real-time video (de)/compression, what next? There doesn't seem to be much on the immediate horizon (like true AI) that'll really demand a lot more CPU power for average Joe.
  • Hmmm.. it'd be interesting to benchmark different versions of the same code explicitly written to have fewer sequential dependencies, and see if there's actually a noticeable speed-up. I guess you'd also have to be careful to do this on the scale of the processor's pipeline.

    Do you have any idea if this type of code rewriting/reordering can actually be effective?
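
    (One concrete case where it can be: the textbook example is a reduction. A sketch in C; note that the four-accumulator version changes floating-point rounding, which is exactly why compilers won't reassociate it for you by default.)

        /* one accumulator: every add waits on the previous one */
        double sum1(const double *a, int n) {
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += a[i];
            return s;
        }

        /* four independent chains an out-of-order core can overlap */
        double sum4(const double *a, int n) {
            double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
            int i;
            for (i = 0; i + 3 < n; i += 4) {
                s0 += a[i];     s1 += a[i + 1];
                s2 += a[i + 2]; s3 += a[i + 3];
            }
            for (; i < n; i++)          /* leftover elements */
                s0 += a[i];
            return (s0 + s1) + (s2 + s3);
        }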

  • 5 years ago roughly what you describe was already being implemented by someone as a research project. So I don't know why you were pissed, it's been a known idea for quite a while.


  • Latency to main memory is only one of many problems you're trying to solve. The Tera MTA solves only that problem; SMT solves more.


  • From reading the datasheet, the IXP1200 has nothing to do with SMT.

    And you don't need a new benchmark for multi-threaded processors; current benchmarks generally cover both single process performance and workloads. For workstations, these benchmarks are SPEC2000 and SPECrate2000...


  • If you think it looks cool, you're the only person on the planet who does. It made me retch the first time I saw it.


  • There are many real workloads that can use more than 4 units. Many numerical programs, for example. SMT is a great way of running these programs really fast while doing as well as several current CPUs on workloads which can't use as many functional units.

  • The guy you're responding to was talking about on-chip multithreading, which is basically what EV8 does. What you're talking about is register renaming, which is completely different (although related). And BTW, all modern superscalar CPUs do register renaming, including the PPC.
  • It's unavoidable for two main reasons. The biggest reason is that there is a fundamentally limited amount of parallelism available in a single thread of C/C++ code. The programmer expects the code to execute serially, so you only get parallelism if you're lucky.

    The other reason is that to tolerate high memory latencies and other delays while keeping the processor busy, you need to have a really big instruction window. But that is very expensive to build and doesn't really scale, and furthermore requires lots of very accurate branch prediction; but programs have a certain amount of inherent unpredictability.
  • The OS context switching overhead is actually insignificant; eliminating or reducing it is not a goal of SMT. Context switch overhead only matters when you're doing a lot of it due to inter-process communication, and SMT doesn't help you there.
  • If I recall correctly, MTA is not a *Simultaneous* multithreading architecture; it only retires instructions from one thread in any given cycle. A true SMT machine can process and retire instructions from different threads in the same cycle.
  • Most SMT designs dynamically allocate physical registers to different threads "on the fly". For optimal performance, the compiler has to generate instructions to release registers when their values are not needed anymore. Thus you get significantly better performance by recompiling your software to the new architecture, regardless of how much backward compatibility they build into the chip. This suggests that people with access to the source code for their software are going to be happy campers.
  • The point is that making OS-level context switches faster is not the goal of SMT.

    PS, I was an intern at DEC SRC while Eggers was there on sabbatical helping design the EV8.
  • colohan@cs.cmu.edu :-)

    He interned at IBM and is working on speculative multithreading with Todd Mowry.
  • I don't have a reference, I just remember people talking about it while I was at DEC. Maybe it turned out not to be significant.

    However, it does seem logical to me that it could be a problem. Imagine you have two SMT threads, one which wants lots of physical registers and one which wants only a few. You'd have to tie up a bunch of physical registers to back up all the logical registers for the second thread, even though they're not really needed. How much more elegant to have the compiler insert instructions to say "I'm not going to need the value in this register".
  • Maybe it suffices to just build a gigantic register file, but I got the impression that was harder than you seem to think it is.
  • I don't know too much about the history of SMT, but I do know that there was quite a bit of research on it at UW's CSE department. My OS professor, Hank Levy, is working with Compaq on the SMT processor, I believe.

    A link to his SMT page is here: http://www.cs.washington.edu/research/smt [washington.edu]

    Since I'm not really qualified to say much about SMT, I recommend that those who are interested visit the link above and read some of the research. I attended Prof. Levy's lectures on SMT and it sounded very interesting.

    One very interesting note I'd like to make is that SMT is a way of keeping today's superscalar out-of-order architecture and pumping it with the benefits of running multiple threads without a context switch. VLIW machines rely on the COMPILER to organize and arrange machine code to take advantage of the parallelism inside the VLIW architecture. Of course, the problem with VLIW is that you live and die by the compiler. Not only that, but because the scheduling is static for VLIW, subtle changes in the architecture could result in the code no longer running at optimal scheduling.

    SMT allows the processor to execute multiple threads "simultaneously" (ie without requiring a context switch). You get maximum utilization of your functional units because a math-hungry thread can run alongside a "light" thread. As others have pointed out, this helps increase utilization especially with today's long latencies for a cache miss. And, because the processor does this dynamically, you can achieve close to optimal utilization across different running scenarios, and across multiple iterations of the architecture.

    Please correct me if I made mistakes, either through mis-understanding or lack of proof-reading.

  • what you describe is a form of threaded architecture - it's not a new idea (certainly it's been in the literature for way more than 5 yrs - in other forms it was in a variety of IO processors in the 60s) - the stuff being described in these articles is a more tightly coupled sort of threading where an out-of-order CPU can use register renaming etc to implement the multiple register sets.

    Having no data dependence isn't necessarily a good thing - it tends to lead to needing caches and TLBs that are twice as big, or having the existing caches/TLBs thrash - some SMT schemes assume compilers that do things like generate speculative threads and share data and address mappings closely in order not to choke.

  • If the processor itself is dealing with thread-local state, wouldn't you include more than one prefetch queue/pipeline, and match available pipelines to working threads just like any other register set or other thread-local stuff?

    Ummm ... maybe, maybe not .... in an out of order, register renaming CPU like an Athlon/Pentium/etc, 'pipelines' are pretty amorphous; apart from the prefetch there's basically just a bunch of instructions waiting for chances to get done - you may have even speculatively gone down both sides of a conditional branch and intend to toss some of them depending on the branch being resolved (or even speculatively guessed at the results of a load and gone down that path ....). Expanding this to SMT is a pretty simple process - you just expand the size of the 'name' that you rename things to and tag loads/stores to use a particular TLB mapping.

    Now ifetch (and as a result decode) is a harder problem - ports into icaches are expensive - running 4 caches with associated decoders is possible. But remember the idea here is to use existing hardware that's unused some portion of the time - not to make the whole design 4 times larger, so more likely you're going to do something like provide some back pressure to the decoder logic giving information about how many micro-ops are waiting for each thread and use that to interleave fetch and decode from various threads.

    Now IMHO the conditions that make SMT viable are somewhat transient - they may make sense for a particular architecture one year, and maybe not next year - depends on a lot of confluence of technologies (for example I still think RISC to CISC transition made sense mostly because of the level of integration available at the time and the sudden speed up of ifetch bandwidth over core) - apart from the super-computer (everything's a memory access) crowd SMT may be a passing fad - not worth breaking your ISA for or creating a new one with SMT as its raison d'etre (ie add a few primitives, don't go crazy).

    (note to patent lawyers - I'm "skilled in the art" I find all the above is obvious)

  • The first thing I'd like to say is that you obviously need a fair bit of smacking around. Just because a chip won't do well on benchmarks doesn't mean dick. You're right - on benchmarks, this chip won't do too great. So, you can be happy with your P4 which shows great on the benchmarks, and gets 40fps on the next-gen games; I'll be happy on my badly-benchmarking EV8 which gets 120fps.

    Anyways, now, to answer your valid concern :)

    From an adoption standpoint(ie: how well your CPU will sell), putting 8 or more CPUs in one die isn't a great thing. How many operating systems do you know run well on 8 or more processors? However, almost every OS today uses multiple threads/processes which will benefit from this architecture.

    Of course, we're talking about an Alpha here, which basically runs Unix(and the various flavours thereof). When Linux gets ported to this processor, I imagine it'll perform stellarly. That's why I want Linux to succeed, actually :) If games start getting written for Linux, natively, then I'll be able to run nice 3D games on kick-ass non-x86 hardware :)

    Dave

    Barclay family motto:
    Aut agere aut mori.
    (Either action or death.)
  • I imagine there would be a similar problem to what Intel is facing right now with the P4. I don't claim to be a compiler technology guru, but I imagine the EV8 will only be very fast when code was compiled specifically for it (then again, which other processors isn't this true for? :).

    There is also going to be some needed kernel support too. Since the threads need to be distinguishable to the EV8, the kernel will have to name them (looks like two bits would do it, and it makes sense to me. But maybe they'll use just one, or use three or four for some headroom).

    Actually, you're right, I was speaking specifically of this hypothetical EV8; but only because I imagine it'll be a while before it comes out. In that time, I hope Linux becomes mainstream enough that some nice high-quality (read: nice graphics, not necessarily 3D; Diablo was great) games will be ported/written to it/for it. That way, with a nicely updated GCC and an EV8-aware Linux kernel, those games would just scream :)

    I haven't played a real game in about a year now - it's all old hat. There's just not a whole lot more you can do with today's processors. I hope the EV8 inspires someone :)

    Dave

    Barclay family motto:
    Aut agere aut mori.
    (Either action or death.)
  • I didn't mean to imply that "Hannibal" had written the articles at RWT, I was just giving him props for coming across those articles in his research on the Alpha and passing them along to his readers. At any rate, props are due to both of you.
  • Geez I'm dumb...sorry Dean Kent. I'll take my flame-thrower elsewhere.
  • Latency. 8 simple CPUs running SMP would have astronomical latency for cache misses, and would probably sit idle most of the time. A single CPU (on the die) running SMT utilizes the latency time by performing both paths of a branch before the branch condition (and possible cache hit/miss) is resolved. Essentially, the CPU running SMT is more effective in a memory latency environment, an ever growing problem with these new 1+GHz chips.
  • Not knowing anything about modern processor design, maybe this is naive, but...

    If the processor itself is dealing with thread-local state, wouldn't you include more than one prefetch queue/pipeline, and match available pipelines to working threads just like any other register set or other thread-local stuff?

    "In this thread, I am executing clock two of a floating pt divide, have this suite of values in the registers, and have prefetched sixty probable future instructions." Multiply by four for a 4xSMT processor.

    It's still using a single fetch mechanism to feed stuff to the pipelines, but one stalled thread doesn't waste the whole clock cycle that could have been spent fetching for another thread's pipeline.

    Like I said, I once was able to grok the actual transistor design of a 6502, but nothing more modern than that.

  • They didn't name the first Pentium the 586 because Intel couldn't trademark '586' - you of course cannot trademark numbers. Mainly they were pissy that companies like AMD and Cyrix were spitting out 486 clones and calling them a 486. They didn't want any more of that, so they called it the Pentium (from the prefix penta-, meaning 5, in case you were too stupid to figure that out). Besides, the flaw the original Pentium had regarding incorrect answers to arithmetic was not integer based but floating point, so the original Pentiums, even with the screwed FPU, would still spit out 100 + 486 = 586, because that's not a floating point calculation.
  • I think all you've done is move the problem onto the chip (which may well be an advantage, but the basic problem remains).

    Crossbars are for anywhere you have more than 1 source and/or more than one destination and you want to have multiple flows going at the same time -- imagine a bus architecture as an old thinwire ethernet and a crossbar architecture as an ethernet switch.

    That is, CPU A can be fetching from memory bank 2 at the same time as CPU C is writing to bank 3:

    A--[_4x4__]--0
    B--[_cross]--1
    C--[__bar_]--2
    D--[switch]--3

  • Not to mention that any respectable 8 cpu system today would use something like a crossbar switch ($$$) instead of a lame old bus design.

  • From the article: "Can the EV8 execute threads from different processes simultaneously? (i.e. threads with different address spaces). That hasn't been disclosed but the simple answer is, it would probably be easy to permit but it wouldn't be desirable in practice because it could thrash the TLBs."

    The cache bottleneck strikes again. The thing is intended to support multiple threads in the same address space. If the threads are in different address spaces, or doing drastically different things, the load on the cache goes up, apparently to unacceptable levels. This has a number of implications, most of them bad.

    The obvious application, just running lots of processes as if it were a big symmetrical multiprocessor, isn't what this is optimized for, apparently. What these people seem to have in mind are multithreaded applications where all the threads are doing roughly the same thing. SGI used to have a parallelizing compiler for their multiprocessor MIPS machines, so you could run one application faster, and this machine would be ideal for that. But that's a feature for the large-scale number-crunching community, and you just can't sell that many machines to that crowd.

    Graphics code would benefit, but the graphics pipeline is moving to special-purpose hardware (mostly from nVidia) which has much higher performance on that specific problem.

    I think if this is to be useful, it has to be as general-purpose as a shared-memory multiprocessor. Historically, parallel machines less general than an SMP machine have been market flops. This idea may fly if they can squeeze in enough cache to remove the single-address-space restriction. Then it makes sense for big server farms.

  • One of the major advantages of the TERA is that when a memory access operation stalls it can switch to a different thread which is not waiting for memory. This effectively eliminates memory bandwidth problems (a major problem with big supercomputers) as long as you can get enough threads going at one time.

    SDSC has one which is the coolest looking computer in the bunch - blue and kinda wavy shaped. And we all know that that's what really matters most.
  • One of the problems is data consistency among the different processors' caches... I believe. Sometimes it's hard to avoid the ping-pong effect between the different threads. Also, threaded applications normally use the same memory, which means the data is replicated among the caches, which exercises the consistency mechanisms.

    --ricardo

  • Paragraph title from first link:

    Alpha EV8 (Part 1): Simultaneous Multi-Threat

    I thought the threat part was Microsoft's job?

  • I know it's not sexy, but check out the AS/400 sometime. It uses PowerPC processors that are 64 bits, some of which implement what IBM calls Hardware Multithreading. Hardware Multithreading isn't as radical as SMT, but it is a good step.

    The CPU has two register files, each of which is called a "thread." All of the architected registers are duplicated, so it is like having two processes/executable units cached on the processor. When one is executing, if it stalls on memory the processor context switches to the other. The context switch is a hard boundary - only one thread can execute at a time.

    This isn't as fine grained as SMT, but it is easy to implement and it provides pretty good bang for the buck. It improves throughput, not speed. The throughput improves because the processor can try to do more work on the other thread while the first one is stalled. Some deadlock prevention stuff is thrown in, as well as some operating system tweaks to make it run better.

    There is a published paper - it's in the IEEE archives, from 1997 or 1998.

    This is relevant because it's been out for a few years (1998), it's commercially available, and thousands of AS/400 customers are using it today. (And it works so well, most don't even know that it is there.)
  • So is it more power effective to get a box of these or a box of PIIIs? (In Million Floating Point Operations per Joule)
  • I'm so frikin sick of hearing about how the Transmeta CPU can morph into any architecture on the planet. This is just patently false. Even if you were just joking, there are many out there that believe this crap.
  • What you're proposing is somewhat similar to the Piranha architecture that the Alpha folks have experimented with. But aside from that, the main advantage of SMT is that it allows a single complex CPU to offer as much implicit parallelism as it can, and then, during periods when few execution units are in use, it can make use of explicit parallelism. It's the best of both worlds, in other words.
  • by profesor ( 1499 ) on Thursday December 28, 2000 @10:46AM (#1415991)
    Intel's IXP1200 [intel.com] network processor already does something like this. It's a very spiffy little processor - one of these running at 166MHz can route IP packets at a Gb/s. Though you need to explicitly code your microcode this way, it's definitely not hidden from you like the Alpha chip will hide it.

    Sounds like it may be time to make a new benchmark to cover multi-threaded processors.
  • by Greg Lindahl ( 37568 ) on Thursday December 28, 2000 @10:25AM (#1415992) Homepage

    Just because benchmarks are single threaded doesn't mean that they can't benefit from multiple execution units. Typical chips today (Pentium III, Alpha) have a lot more than 1 execution unit, and get a benefit from it most of the time.

    The benefit of SMT over N smaller CPUs is flexibility: A program that can use the entire chip at once is damn fast, or several programs can share it.

  • by roca ( 43122 ) on Thursday December 28, 2000 @12:03PM (#1415993) Homepage
    I've been told by an architecture grad student friend of mine, who should know, that IBM has an AS/400 system using a PPC-like core that does SMT.
  • by barracg8 ( 61682 ) on Thursday December 28, 2000 @01:35PM (#1415994)
    I'm involved in a project involving an SMT (well, more CMT) processor design. We do not yet have any silicon, but are getting good results in simulation.

    We have a simulator that can be set to simulate a processor with any number of closely coupled cores, and any number of threads per core. We get good results with an 8 core * 4 thread setup (up to 32-way parallel in total).

    Using some basic automatic parallelization on a piece of code designed to run in a single thread, we have generated up to a 26X speedup, 8 core * 4 threads versus 1 core * 1 thread.

    The advantage of SMT over a normal processor is that it makes use of clock cycles that would otherwise be wasted, e.g. waiting for a cache fill. If your architecture spends half of its time stalled, and you can make use of these cycles by adding SMT, then you can increase your processor performance very efficiently.

    SMT basically requires you to duplicate all of the processor's registers n times (n = #threads), + a little extra hardware ('little', relative to duplicating the entire core). So for ((1 * core) + (2 * registers) + SMT hardware) you are getting the performance of ((2 * core) + (2 * registers)). Good bang per buck ratio, when you count up the transistors.

    But SMT naturally gives you diminishing returns for each thread you add - the whole point is that each new thread is using up wasted cycles - and once you reach ~4 threads there are very few cycles left over. At this point, if you have room left over on the die, you may as well start thinking about SMP on the same die.

    Surprised the article didn't mention SMT & AMD. Check out this link [chip-architect.com].

  • by jmaessen ( 75854 ) on Thursday December 28, 2000 @12:22PM (#1415995)
    A lot of people are assuming that multiple processors can be put on the same die for "equal or less cost". This simply isn't true.

    Sharing the cache is hard

    Cache is the vast majority of chip area in a modern processor; as others have pointed out, it's obvious that multiple processors should share a cache. However, this is difficult. The problem is that every load/store unit from every processor must share the same cache bandwidth.

    Thus, for a 2-way chip with only a shared cache, per-processor cache bandwidth---and that's the best possible case---is cut in half.

    We can work around this by using various tricks up to and including multiported caches---but most of these tricks increase latency (lowering maximum clock speed) or require much more circuitry in the caches (we were sharing the cache because it was so big, remember?).

    It makes much more sense to share the circuitry that feeds into the cache.

    Those are the superscalar execution units! Thus, SMT.

    Utilization

    Instead of keeping half the execution units busy, we attempt to keep them all busy. Extrapolating very roughly from Figure 2 [realworldtech.com] we can expect to issue about half as many instructions as we have issue slots (actually less if we have a lot of execution units). The basic idea is we can cut the number of empty issue slots in half each time we add a new thread. Further, instructions from separate threads do not need to be checked for resource overlaps---this circuitry is the main source of complexity in a modern processor.
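
    Working out that halving claim (assuming the ~50% starting point above; my arithmetic, not the author's):

        #include <stdio.h>

        int main(void) {
            double empty = 0.5;  /* assumed: half the slots idle w/ 1 thread */
            for (int t = 1; t <= 4; t++, empty /= 2)
                printf("%d thread(s): ~%2.0f%% of issue slots filled\n",
                       t, 100.0 * (1.0 - empty));
            return 0;
        }

    That's 50%, 75%, 88%, 94% - diminishing returns per thread, which matches the observation elsewhere in this discussion that beyond ~4 threads there are few idle cycles left to harvest.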

    What's happening now has been predicted for a long time. The extra resources (a bigger register set, TLB, extra fetch units) required for multithreading are now cheaper than the extra resources you'd need (mostly pipeline overlap logic) to get a similar increase in single-threaded performance.

    SMT easier than SMP?

    Moving thread parallelism into the processor is actually easier for the compiler and programmer; the weak memory models implied by cache coherence models aren't an issue when threads share exactly the same memory subsystem.

    To get an idea for how hard it is to really understand weak memory models, consider Java (which actually tries to explain the problem to programmers---in every other language you're on your own). Numerous examples of code in the JDK and elsewhere contain an idiom---double-checked locking---which is wrong [umd.edu] on weakly-ordered architectures. What's this mean? Your "portable" Java code will break mysteriously when you move it to a fast SMP. Alternatively, you will need to run your code in a special "cripple mode" which is extremely slow.
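
    (The trap isn't Java-specific. Here's a sketch of the same idiom in C with pthreads - widget and widget_init are made-up names: on a weakly-ordered machine the store to instance can become visible before the stores made by widget_init, so another CPU can see a non-NULL pointer to a half-initialized object. Alpha famously needs a barrier even on the reader side.)

        #include <pthread.h>
        #include <stdlib.h>

        typedef struct { int ready; } widget;               /* stand-in type */
        static void widget_init(widget *w) { w->ready = 1; }

        static widget *instance;                            /* shared, starts NULL */
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        widget *get_instance(void) {
            if (instance == NULL) {               /* 1st check: no lock, no barrier */
                pthread_mutex_lock(&lock);
                if (instance == NULL) {           /* 2nd check: under the lock */
                    widget *w = malloc(sizeof *w);
                    widget_init(w);
                    instance = w;                 /* BUG: can be reordered before
                                                     the init stores above */
                }
                pthread_mutex_unlock(&lock);
            }
            return instance;
        }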

    From a programmer's perspective, SMT (as opposed to SMP) architectures will be a godsend.

  • by El ( 94934 ) on Thursday December 28, 2000 @10:22AM (#1415996)
    Rather than run multiple simultaneous threads on a single massively complicated CPU with 8 instruction units, why not simply put 8 very simple CPUs on the same die (at equal or less cost) and just run SMP? Why is SMT considered a "win", when most benchmarks are single-threaded anyway? Seems like we're moving in the direction of complexity for complexity's sake here...
  • Many of you may not be familiar with TERA, the seattle based super computer company that bought CRAY from SGI and then renamed themselves CRAY for marketing reasons.

    Tera's home-brew supercomputers used what they called the TERA MTA - Multi-threaded architecture processors. You could get a 4 proc MTA machine that would significantly outperform much larger super computers.

    Essentially the MTA cpu has knowledge of 128 virtual threads of execution inside of it. AFAICT, the point of the MTA design, and apparently of this one, is to minimize the penalty for branches, context switches, etc, wherever possible by putting fine grained execution knowledge in the CPU itself.

    Given that superscaling has reached its limit and superpipelining is getting nastier and nastier, this might be a good way to go. Apparently Tera gets great numbers with their MTA stuff.
  • by rjh3 ( 99390 ) on Thursday December 28, 2000 @11:48AM (#1415998)
    There have been a variety of real-world experiences with multi-threaded CPUs. Two of the more interesting are:

    The Denelcor HEP. Only a few were made, and this dates way back to 1985, but it was a really neat multi-threaded CPU. It ran a variant of Unix, and had some reasonable extensions to adapt Fortran (even now probably the most popular number-crunching language) to the multi-threaded CPU world.

    The Alewife project at MIT. A variety of interesting ideas. Nothing ever really made it past the prototypes, to my knowledge. The concepts of operation are fun to examine.

    These are an interesting complement to the SMP approach.

  • by pallen ( 111618 ) on Thursday December 28, 2000 @10:52AM (#1415999)
    I know it's slightly OT, but the article says each of these babies will consume 250 watts. That's obscene. People run 8-processor boxes of these as well, so that's 2 kW just for the processors. You could heat a swimming pool with that.
    --------
    Make something idiot proof and someone will make a better idiot.
  • by john@iastate.edu ( 113202 ) on Thursday December 28, 2000 @10:29AM (#1416000) Homepage
    Right, but the major stumbling block to just throwing more execution units at a CPU and letting the CPU and/or compiler schedule them is that after about 4 units or so you run out of work you can schedule from a single thread (the old "every fifth instruction is a branch" bugaboo).

    So DEC's idea was, hell, grab some work from some other thread and do that.

    Pretty cool, IMO.

  • by aanantha ( 186040 ) <ahilan_anantha@yahoo.com> on Thursday December 28, 2000 @11:41AM (#1416001)
    Having a 4-way SMT single CPU is a lot cheaper than 4 separate processors. Basically, you can think of it as a bridge between SMP and a single CPU. There aren't enough applications out there that exploit SMP for most people to want to spend money on multiple processors. And because so few people use multiprocessors, there haven't been enough application developers willing to make their code multithreaded. A typical catch-22 situation. But supporting multiple register sets on a single CPU doesn't cost all that much, and there are already multiple functional units on superscalar processors.

    So that brings us to a second reason. Wide-issue superscalar processors end up using very little of that issue width most of the time. You just can't get enough parallelism out of single-threaded applications. SMT offers the ability to use that wasted issue width by scheduling different threads onto the otherwise idle functional units.

    A third benefit to SMT is that it drives the industry in the right direction. Writing code to take advantage of SMT is basically the same as for SMP: you want to find ways to break your application into separate threads (see the sketch at the end of this comment). If SMT becomes a common feature on CPUs, then perhaps we'll have lots more SMP-friendly code. There will also be greater incentive to write efficient parallelizing compilers. And since CMP is more efficient than SMT at high levels of parallelism, in the future people will probably move from SMT to CMP.

    It's true that SMT doesn't help with standard single-threaded benchmarks. That's probably what's delayed the industry in adopting it. But the industry is finding out that it's running out of ways to speed up processors. Increasing the clock rate isn't enough, because memory latency becomes a greater bottleneck. So increased parallelism becomes more and more crucial.
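
    A minimal sketch of that kind of thread-level decomposition (a made-up example; the same structure benefits SMP and SMT alike):

        // Split a reduction across two threads. On an SMP the threads land
        // on separate CPUs; on an SMT they share one core's idle issue slots.
        public class SumHalves extends Thread {
            static int[] data = new int[1000000];
            int lo, hi;
            long sum = 0;

            SumHalves(int lo, int hi) { this.lo = lo; this.hi = hi; }

            public void run() {
                for (int i = lo; i < hi; i++) sum += data[i];
            }

            public static void main(String[] args) throws InterruptedException {
                java.util.Arrays.fill(data, 1);   // something to add up
                SumHalves a = new SumHalves(0, data.length / 2);
                SumHalves b = new SumHalves(data.length / 2, data.length);
                a.start(); b.start();
                a.join(); b.join();
                System.out.println("total = " + (a.sum + b.sum));
            }
        }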
  • by Mtgman ( 195502 ) on Thursday December 28, 2000 @10:36AM (#1416002)
    I intend for SMT to be an integral part of my daily life from here on. I just can't spend a day without thinking of all those beautiful babes doing all those naughty things... Oh, and you've done your usual crappy job of editing, there's supposed to be a U in that word.

    Steven
  • Add a heatsink as a grill and give it a 10 degree tilt with a fat collection tray ... get a half-witted aging sports star to endorse it and a funky name ...

    " ... grandmas want them! college kids want them! ... " etc.

  • by tietokone-olmi ( 26595 ) on Thursday December 28, 2000 @11:13AM (#1416004)

    Actually, that's pretty much what the Pentium Pro (and hence the P2, P3, Celeron, Celeron2 and P4) does - only there it's done using "virtual registers", which means that the register "eax" can map to a completely different physical register if the instruction scheduler needs it to.

    For example, you could write your code like this:

    mov ebx, Pointer   ; load the pointer
    mov ecx, [ebx]     ; load an index (32-bit ecx; 16-bit cx isn't a legal index here)
    mov eax, [ebx+ecx] ; base+index fetch
    mov Pointer2, eax  ; store the result

    (now I'm pretty sure that's not the best way to do it - it's just an example, ok?)
    Now, if you have another multi-instruction operation after this one that's going to use any of the registers above, the CPU will see in the decoding phase that eax has received a completely new value (one that doesn't depend on its old contents) and will assign a different physical register to "eax" until it's overwritten again. (This is also the reason why xor reg,reg is not the preferred way of clearing a register on the ppro and up.) Same for ebx and ecx and the other regs.

    By the time the CPU is finished decoding these instructions (about 1 and 1/3 cycles for the ppro through p3, and perhaps 1 cycle for the p4 if its rumored 4-1-1-1 decoders pan out), the reorder buffer (which receives the decoded instructions, also called micro-ops or uops) will have filled up with previously decoded instructions and will be able to put as many uops into the execution "ports" as possible (3 per cycle on the ppro through p3; not sure about the p4).

    This, of course, assumes that the code is organised so that the decoders can feed the reorder buffer with more than 3 micro-ops per decoding cycle, so that there's something to reorder. But this will, for the most part, take care of that data-dependency problem.

    Personally, I prefer a big explicit register set (a la PowerPC: 32 int regs + 32 fp regs) so that the CPU won't have to schedule instructions for me...

    (all this information, except for the p4 decoder uop-max series, comes from the excellent pentopt.txt [agner.org] file.)

  • by Greg Lindahl ( 37568 ) on Thursday December 28, 2000 @11:43AM (#1416005) Homepage

    The Tera MTA requires a compiler to multi-thread all processes. You only get 1 functional unit (and huge latencies == terrible speed) if your program can't be transformed by the compiler.

    SMT, in contrast, can work on programs which can't be multi-threaded by a compiler. It works on "instruction level parallelism" (ILP), a much finer grain than the parallelism a compiler can find and exploit with another thread.
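
    For a feel of what that fine-grained ILP looks like at the source level, a trivial made-up fragment:

        class IlpDemo {
            // The first two statements are independent, so an out-of-order or
            // SMT core can issue them in the same cycle; the last two form a
            // dependent chain and must execute one after the other.
            static int demo(int x, int y, int u, int v) {
                int a = x * y;   // independent of b
                int b = u + v;   // independent of a
                int c = a + b;   // needs both a and b
                return c + 1;    // needs c
            }
        }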

  • by taniwha ( 70410 ) on Thursday December 28, 2000 @11:28AM (#1416006) Homepage Journal
    The problem you're trying to solve is the long latency to main memory, and the fact that the CPU sits idle for long periods while it waits. Basically, if you've gone to the trouble of building a cool out-of-order CPU with register renaming, scoreboards, etc., then setting it up with an extra PC and the hardware to manage an extra thread is (theoretically) relatively easy - doing it for something like an x86, with state up the wazoo, is probably rather harder.

    Having gone down the route of doing a paper design for an SMT, I know that one of the real problems with SMT in traditionally piped (i.e. non-out-of-order) CPUs is that with today's deep pipelining the cost of thread switches is really high - often to the point of being useless.

    The alternative (SMP) is good for other reasons - you can potentially reduce the size of the synchronous clock domains on a die, and design time may be lower (build one core, lay it out 8 times). The downsides have to do with memory architectures (crossbars, buses, cache paths, etc.)

  • by Heretic2 ( 117767 ) on Thursday December 28, 2000 @12:06PM (#1416007)
    While taking Systems Architecture at UT, I read a very good paper by Dr. Burger that he wrote while he was at Wisconsin. They simulated three different billion-transistor architectures:
    • Massively parallel/pipelined, a la today's processors
    • SMT
    • Multiple simple cores on one die
    The MSC (I forgot the real abbreviation, but that's what I'm going to call it) architecture had 4 simple, identical cores. Each core was somewhere between a Pentium and a K6 in terms of complexity--lean on scheduling logic, heavy on execution hardware--each with an independent, decent-sized L1 cache. The MSC chip had a large on-die L2 cache, quad-ported or oct-ported so that all processing cores could access it quickly and simultaneously, and a fat on-die L3 cache to boot. It also contained a special context-caching mechanism.

    The cores are actually able to execute in different contexts as well, not just within the same context as with SMT. This opens up parallelization across more than one process.

    One of the more interesting problems in a billion-transistor chip is wire delay. With processes so small that a billion transistors fit on a moderate-sized die, the clock rate is so high that the wire delay from one side of the chip to the other can be over 100 clock cycles! So locality of information becomes extremely important. With multiple simple processing cores, all the logic for a pipeline is close together. The data is readily available in L1 cache. The scheduling logic is mostly handled outside the cores; all they have to do is crunch numbers within their context as fast as possible. They don't have to worry about sending and receiving signals across the chip and the resultant delay, so everything is local and fast.

    Additionally, it's the least complex chip to design. Only one processing core needs to be designed and tested, since it's duplicated 4 times, and the core itself is much simpler than in other designs. The scheduling logic is all much simpler and easier to test. Most of the die space is devoted to localized caches and execution units, not scheduling logic.

    In the benchmarks, the SMT and MSC processors vastly outperformed a conventional, massively pipelined/parallel billion-transistor processor. And the MSC outperformed the SMT processor by an additional 20+% on average.

    On top of all that, to get the best performance from SMT processors you need very smart compilers that are able to find parallelizable code and generate binaries accordingly. With MSC this isn't a problem: it'll run multi-threaded code simultaneously, but it'll also run multiple processes, or any combination of processes and threads, simultaneously without help from smart compilers.

    Ryan Earl
    Student of Computer Science
    University of Texas
