Transmeta Hardware

NVIDIA's 64-bit Tegra K1: The Ghost of Transmeta Rides Again, Out of Order

MojoKid (1002251) writes Ever since Nvidia unveiled its 64-bit Project Denver CPU at CES last year, there's been discussion over what the core might be and what kind of performance it would offer. Physically, the chip is huge, more than 2x the size of the Cortex-A15 that powers the 32-bit version of Tegra K1. Now we know a bit more about the core, and it's like nothing you'd expect. It is, however, somewhat similar to the designs we've seen in the past from the vanished CPU manufacturer Transmeta. When it designed Project Denver, Nvidia chose to step away from the out-of-order execution engine that typifies virtually all high-end ARM and x86 processors. In an OoOE design, the CPU itself is responsible for deciding which code should be executed at any given cycle. OoOE chips tend to be much faster than their in-order counterparts, but the additional silicon burns power and takes up die area. What Nvidia has developed instead is an in-order architecture that relies on a dynamic optimization program (running on one of the two CPUs) to calculate and optimize the most efficient way to execute code. The optimized routines are then stored in a special 128MB buffer of main memory. The advantage of decoding and storing the most optimized execution path is that the chip doesn't have to decode the same instructions again; it can simply grab that information from memory. Furthermore, this kind of approach may pay dividends on tablets, where users tend to run a small subset of applications. Once Denver sees you run Facebook or Candy Crush a few times, it's got the code optimized and waiting. There's no need to keep decoding it for execution over and over.
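As a rough mental model (not NVIDIA's actual implementation), the scheme works like a translation cache: optimize a hot code region once, park the optimized native form in the 128MB buffer, and reuse it on later runs instead of re-decoding. A minimal sketch in Python, with every name hypothetical:

    # Illustrative sketch of a dynamic-optimization cache in the spirit of
    # Denver / Transmeta code morphing. All names are hypothetical; the real
    # hardware translates ARM instructions into native micro-ops, not Python.

    class OptimizationCache:
        """Caches 'optimized' translations of code regions, keyed by address."""

        def __init__(self, capacity_bytes=128 * 1024 * 1024):  # 128MB buffer in main memory
            self.capacity_bytes = capacity_bytes
            self.used_bytes = 0
            self.translations = {}  # region start address -> translated code

        def get_or_translate(self, region_addr, arm_code):
            # Hot path: translation already exists, reuse it without re-decoding.
            if region_addr in self.translations:
                return self.translations[region_addr]
            # Cold path: run the (expensive) optimizer once and store the result.
            native = self._optimize(arm_code)
            size = sum(len(i) for i in native)
            if self.used_bytes + size <= self.capacity_bytes:
                self.translations[region_addr] = native
                self.used_bytes += size
            return native

        def _optimize(self, arm_code):
            # Stand-in for the dynamic optimizer (scheduling, fusing, hoisting).
            return tuple(instr.upper() for instr in arm_code)  # placeholder transform

    cache = OptimizationCache()
    hot_loop = ["ldr r0, [r1]", "add r0, r0, #1", "str r0, [r1]"]
    cache.get_or_translate(0x40000000, hot_loop)  # first run pays the translation cost
    cache.get_or_translate(0x40000000, hot_loop)  # later runs hit the cache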
  • by IamTheRealMike (537420) on Tuesday August 12, 2014 @05:47AM (#47653499) Homepage

    Although I know only a little about CPU design, this sounds like one of the most revolutionary design changes in many years. The question in my mind is how well it will work. The CPU can use information at runtime that a static analyser running on a separate core might not have ahead of time, most obviously branch prediction information. OOO CPUs can speculatively execute multiple branches at once and then discard the version that didn't happen, and they can re-order code depending on what it's actually doing, including things like self-modifying code and code that's generated on the fly by JIT compilers. On the other hand, if the external optimiser CPU can do a good job, it stands to reason that the resulting CPU should be faster and use way less power. Very interesting research, even if it doesn't pan out.
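    To make the branch-prediction point concrete, here's a toy two-bit saturating predictor of the sort an OOO core maintains per branch at run time - live outcome history that a purely ahead-of-time optimiser simply doesn't have. (Illustrative Python only, not any real design:)

        # Toy two-bit saturating branch predictor: the kind of per-branch runtime
        # state an OoO core accumulates and a static optimiser never sees.
        # Purely illustrative; real predictors are far more elaborate.

        class TwoBitPredictor:
            # Counter states: 0,1 predict not-taken; 2,3 predict taken.
            def __init__(self):
                self.counters = {}  # branch address -> 2-bit counter

            def predict(self, branch_addr):
                return self.counters.get(branch_addr, 1) >= 2

            def update(self, branch_addr, taken):
                c = self.counters.get(branch_addr, 1)
                self.counters[branch_addr] = min(c + 1, 3) if taken else max(c - 1, 0)

        predictor = TwoBitPredictor()
        outcomes = [True, True, False, True, True, True]  # only knowable at run time
        correct = 0
        for taken in outcomes:
            correct += predictor.predict(0x1000) == taken
            predictor.update(0x1000, taken)
        print(correct, "of", len(outcomes), "predicted correctly")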

  • by Anonymous Coward on Tuesday August 12, 2014 @07:15AM (#47653679)

    I think the entire point of having 7 micro-ops in flight at any point in time, combined with the large L1 caches and the 128MB micro-op instruction cache, is to mitigate this, in much the same fashion that the sheer number of warps (blocks of threads) in PTX mitigates in-order execution of threads and branch divergence.

    Based on their technical press release, AArch64/ARMv8 instructions come in, and at some point the decoder decides it has enough to begin optimization into the native ISA of the underlying chip, at which point it likely generates micro-ops for the code that presumably place loads appropriately early so that stalls should be non-existent or minimal once approaching a branch. By the looks of their insanely large L1 I-cache (128KiB), this core will be reading quite a large chunk of code ahead of itself (consuming entire branches, and pushing past them I assume, to pre-fetch and run any post-branch code it can while waiting for loads) to aid in this process.
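    A much-simplified sketch of that kind of load hoisting - move each load as early as its register dependences allow, so the distance between the load and its first use helps cover the latency (illustrative Python, not Nvidia's actual optimizer):

        # Simplified load-hoisting pass: within a basic block, move each load as
        # early as its register dependences allow, so dependent instructions end
        # up farther from the load. Illustrative only; not NVIDIA's optimizer.

        def hoist_loads(block):
            # block: list of (op, dests, srcs) tuples in program order.
            scheduled = []
            for op, dests, srcs in block:
                pos = len(scheduled)
                if op == "load":
                    # Walk backwards past instructions we don't depend on and
                    # that don't conflict with our destination register.
                    while pos > 0:
                        p_op, p_dests, p_srcs = scheduled[pos - 1]
                        if set(p_dests) & set(srcs):                   # true dependence (RAW)
                            break
                        if set(dests) & (set(p_srcs) | set(p_dests)):  # WAR / WAW
                            break
                        if p_op in ("store", "branch"):                # keep memory/control order
                            break
                        pos -= 1
                scheduled.insert(pos, (op, dests, srcs))
            return scheduled

        block = [
            ("add",  ["r2"], ["r0", "r1"]),
            ("mul",  ["r3"], ["r2", "r2"]),
            ("load", ["r4"], ["r5"]),        # independent of r2/r3: can be hoisted
            ("add",  ["r6"], ["r4", "r3"]),  # first use of the loaded value
        ]
        for instr in hoist_loads(block):
            print(instr)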

    The classic case with in-order designs is of course the one where the optimization process can't possibly do anything between a load and a dependent branch - either due to a lack of registers to do anything else, a lack of execution pipes to do anything else, or there literally being nothing else to do (predictably) until the load or branch has taken place. Depending on the memory controller and DDR latency, you're typically looking at 7-12 cycles on your typical phone/tablet SoC for a DDR block load into the L2 cache and into a register. This seems like it may be clocked higher than a Cortex-A15 though, so let's assume it'll be even worse on Denver.

    This is where their 'aggressive HW prefetcher' comes into play, I assume, combined with the 128KiB I-cache prefetching and the analysis/optimization engine. Denver has a relatively big (64KiB) L1 D-cache as well (for comparison, the Cortex-A15 - which is also a large ARM core - has a 32KiB L1 D-cache per core). I would fully expect a large part of that cache to be dedicated to filling idle memory-controller cycles with speculative loads - educated "stabs in the dark" at what loads are coming up in the code - in the hope of getting some right and mitigating the in-order branching/loading issues further.
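    A toy stride prefetcher in that spirit - watch the demand addresses, detect a constant stride, and speculatively fetch a couple of lines ahead (illustrative Python, nothing to do with the real hardware):

        # Toy stride prefetcher: if recent demand addresses advance by a constant
        # stride, speculatively fetch the next few lines during idle memory cycles.
        # Purely illustrative; real hardware prefetchers track many streams at once.

        class StridePrefetcher:
            def __init__(self, degree=2):
                self.last_addr = None
                self.last_stride = None
                self.degree = degree  # how many lines ahead to prefetch

            def access(self, addr):
                prefetches = []
                if self.last_addr is not None:
                    stride = addr - self.last_addr
                    if stride != 0 and stride == self.last_stride:
                        # Confirmed stream: take a few stabs in the dark ahead of it.
                        prefetches = [addr + stride * i for i in range(1, self.degree + 1)]
                    self.last_stride = stride
                self.last_addr = addr
                return prefetches

        pf = StridePrefetcher()
        for addr in (0x1000, 0x1040, 0x1080, 0x10C0):  # walking 64-byte cache lines
            issued = pf.access(addr)
            if issued:
                print(hex(addr), "->", [hex(a) for a in issued])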

    It looks to me like they've taken the practical experience of their GPGPU work over the years and applied it to a larger, more complex CPU core to try to achieve above-par single-core performance - but instead of going for massively parallel super-scalar SIMT (which clearly doesn't map to a single thread of execution), they've gone for 7-way MIMT and a big analysis engine (logic and caches) to try to turn single-threaded code into partially super-scalar code.

    This is indeed radically different from typical OoO designs, in that those designs waste their extra pipelines running code that ultimately doesn't need to be executed in order to mitigate branching performance issues (by running all branches, when only one of their results matters) - whereas Denver decided "hey, let's take the branch hit - but spend EVERY ONE of our pipelines executing code that matters - because in real-world scenarios we know there's a low degree of parallelism which we can run super-scalar, and we know that with a bit more knowledge we can predict and mitigate the branching issues anyway!"

    Hats off, I hope it works well for them - but only time will tell how it works in the real world.

    Fingers crossed - this is exactly the kind of out-of-the-box thinking we need to spark more hardware innovation. Imagine this does work well: how are AMD/ARM/IBM/Intel/IT going to respond when their single-core performance is sub-par? We saw the ping-pong of performance battles between AMD and Intel in previous years; Intel has dominated for the last 5 years or so, unchallenged - and has ultimately stagnated in the past 3 years.

  • by shanipribadi (1669414) on Tuesday August 12, 2014 @08:24AM (#47653925)
    2 GiB = 2 * 2^30 bytes = 2,147,483,648 bytes
    128 MB = 128 * 10^6 bytes = 128,000,000 bytes
    2 GiB - 128 MB = 2,019,483,648 bytes
    2,019,483,648 bytes > 2 GB (2,000,000,000 bytes)

    Who's the stupid fucker now?
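    (For what it's worth, the arithmetic above checks out; a quick verification in Python:)

        # GiB is binary (2^30); MB and GB are decimal (10^6 and 10^9).
        GiB = 2 ** 30
        MB = 10 ** 6
        GB = 10 ** 9

        remaining = 2 * GiB - 128 * MB
        print(remaining)           # 2019483648
        print(remaining > 2 * GB)  # True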
  • by Megol (3135005) on Tuesday August 12, 2014 @08:35AM (#47653989)

    Out-of-order execution can actually only do one thing: cope with varying latency of operations. For most normal instructions, a LIW/VLIW/explicitly scheduled processor (yes, there are some that aren't a *LIW type) can in most cases do better than the dynamic scheduler. Where OoO execution really shines is in hiding L1 cache misses, and in some cases even L2 cache misses; there, statically scheduled code has a hard time adapting to hit/miss patterns.
    The standard technique for statically scheduled architectures is to move loads up as far as possible so that L1 misses can at least partially be hidden by executing independent code, often using specialized non-faulting load instructions that can fail "softly" and be handled by special code paths. The problem with doing things like that is that fine-grained handling isn't really possible due to code explosion.

    But it is fully possible to do partial OoO execution just for memory operations and maybe that's what Nvidia is doing. Maybe not.

  • by Theovon (109752) on Tuesday August 12, 2014 @08:51AM (#47654085)

    I'm an expert on CPU architecture. (I have a PhD in this area.)

    The idea of offloading instruction scheduling to the compiler is not new. This was very much in mind when Intel designed Itanium, although it was a very important concept for in-order processors long before that. For most instruction sequences, latencies are predictable, so you can order instructions to improve throughput (reduce stalls). So it seems like a good idea to let the compiler do the work once and save on hardware. Except for one major monkey wrench:

    Memory load instructions

    Cache misses and therefore access latencies are effectively unpredictable. Sure, if you have a workload with a high cache hit rate, you can make assumptions about the L1D load latency and schedule instructions accordingly. That works okay. Until you have a workload with a lot of cache misses. Then in-order designs fall on their faces. Why? Because a load miss is often followed by many instructions that are not dependent on the load, but only an out-of-order processor can continue on ahead and actually execute some of those instructions while the load is being serviced. Moreover, OOO designs can queue up multiple load misses, overlapping their stall time, and they can get many more instructions already decoded and waiting in instruction queues, shortening their effective latency when they finally do start executing. Also, OOO processors can schedule dynamically around dynamic instruction sequences (i.e. flow control making the exact sequence of instructions unknown at compile time).

    One Sun engineer talking about Rock described modern software workloads as races between long memory stalls. Depending on the memory footprint, a workload could spend more than half its time waiting on what is otherwise a low-probability event. The processors blast through hundreds of instructions where the code has a high cache hit rate, and then they encounter a last-level cache miss and stall out completely for hundreds of cycles (generally not on the load itself but on the first instruction dependent on the load, which always comes up pretty soon after). This pattern repeats over and over again, and the only way to deal with it is to hide as much of that stall as possible.

    With an OOO design, an L1 miss/L2 hit can be effectively and dynamically hidden by the instruction window. L2 (or in any case the last level) misses are hundreds of cycles, but an OOO design can continue to fetch and execute instructions during that memory stall, hiding a lot of (although not all of) that stall. Although it's good for optimizing poorly-ordered sequences of predictable instructions, OOO is more than anything else a solution to the variable memory latency problem. In modern systems, memory latencies are variable and very high, making OOO a massive win on throughput.
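    A back-of-envelope model of that stall-hiding effect, with completely made-up numbers just to show the shape of the argument:

        # Back-of-envelope model of memory-stall hiding. Every number here is
        # invented purely to illustrate the argument, not measured on any CPU.

        instructions = 250      # instructions executed between last-level misses
        base_cpi = 1.0          # cycles per instruction when not stalled
        miss_penalty = 300      # cycles for a last-level cache miss
        hidden_by_ooo = 0.4     # fraction of miss latency an OoO window hides (assumed)

        in_order = instructions * base_cpi + miss_penalty
        ooo = instructions * base_cpi + miss_penalty * (1 - hidden_by_ooo)

        print(f"in-order: {in_order:.0f} cycles, "
              f"{miss_penalty / in_order:.0%} of the time stalled")
        print(f"OoO     : {ooo:.0f} cycles, "
              f"{miss_penalty * (1 - hidden_by_ooo) / ooo:.0%} of the time stalled")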

    Now, think about idle power and its impact on energy usage. When an in-order CPU stalls on memory, it's still burning power while it waits; an OOO processor, meanwhile, is still getting work done. As the idle proportion of total power increases, the usefulness of the extra die area for OOO increases, because, especially for interactive workloads, there is more frequent opportunity for the CPU to get its job done a lot sooner and then drop into a low-power, low-leakage state.

    So, back to the topic at hand: What they propose is basically static scheduling (by the compiler), except JIT. Very little information useful to instruction scheduling is going to be available JUST BEFORE execution time that is not available much earlier. What you'll basically get is some weak statistical information about which loads are more likely to stall than others, so that you can resequence the instructions dependent on the loads that are expected to stall. As a result, you may get a small improvement in throughput. What you don't get is the ability to handle unexpected stalls, overlapped stalls, or the ability to run ahead and execute only SOME of the instructions that follow the load. Those things are really what give OOO its advantage.
