Nvidia Working on a CPU+GPU Combo

Posted by Zonk
from the that-will-keep-them-out-of-trouble-for-a-while dept.
Max Romantschuk writes "Nvidia is apparently working on an x86 CPU with integrated graphics. The target market seems to be OEMs, but what other prospects could a solution like this have? Given recent developments with projects like Folding@Home's GPU client, you can't help but wonder about the possibilities of a CPU with an integrated GPU. Things like video encoding and decoding, audio processing, and other applications could benefit a lot from a low-latency CPU+GPU combo. What if you could put multiple chips like these in one machine? With AMD+ATI and Intel's own integrated graphics, will basic GPU functionality eventually be integrated into all CPUs? Will dedicated graphics cards become a niche product for enthusiasts and pros, as audio cards largely already have?" The article is from the Inquirer, so a dash of salt might make this more palatable.
  • by TheRaven64 (641858) on Friday October 20, 2006 @12:55PM (#16517931) Journal
    It's not just floating point. Originally, CPUs did integer ops and comparisons/branches. Some of the things that were external chips and are now found on (some) CPU dies include:
    1. Memory Management Units. Even among microcomputers, some (old m68k machines) used an off-chip MMU, and some, like the 8086, had no MMU at all.
    2. Floating Point Units. The 80486 was the first x86 chip to put one of these on-die.
    3. SIMD units. Formerly only found in high-end machines as dedicated chips, now on a lot of CPUs.
    4. DSPs. Again, formerly dedicated hardware, now found on-die in a few of TI's ARM-based cores.
    A GPU these days is very programmable. It's basically a highly parallel stream processor. Integrating it onto the CPU makes a lot of sense.
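    For a rough feel of the kind of data-parallel work those on-die SIMD units (and, at much larger scale, a GPU's stream processors) handle, here is a minimal C sketch using SSE intrinsics; the function name and the scaling operation are invented for the example:

        /* Scale an array of floats using the on-die SIMD unit.
         * Assumes an SSE-capable x86 and a compiler that provides <xmmintrin.h>.
         * A GPU does the same style of data-parallel work, just across far
         * more elements at once. */
        #include <stddef.h>
        #include <xmmintrin.h>

        void scale_floats(float *data, size_t n, float factor)
        {
            __m128 f = _mm_set1_ps(factor);                /* broadcast factor into all 4 lanes */
            size_t i = 0;
            for (; i + 4 <= n; i += 4) {
                __m128 v = _mm_loadu_ps(data + i);         /* load 4 floats */
                _mm_storeu_ps(data + i, _mm_mul_ps(v, f)); /* multiply and store 4 at once */
            }
            for (; i < n; ++i)                             /* scalar tail for leftovers */
                data[i] *= factor;
        }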
  • Re:A cyclic process? (Score:5, Informative)

    by shizzle (686334) on Friday October 20, 2006 @12:56PM (#16517943)
    Yup, the idea is pushing 40 years old now, and came out of the earliest work on graphics processors. The term "wheel of reincarnation" comes from "On the Design of Display Processors", T. H. Myer and I. E. Sutherland, Communications of the ACM, Vol. 11, No. 6, June 1968.

    http://www.cap-lore.com/Hardware/Wheel.html [cap-lore.com]

  • Re:A cyclic process? (Score:4, Informative)

    by levork (160540) on Friday October 20, 2006 @12:58PM (#16517981) Homepage
    This is known as the wheel of reincarnation [catb.org], and has come up several times in the last forty years of graphics hardware.
  • by Do You Smell That (932346) on Friday October 20, 2006 @01:12PM (#16518179)
    What I don't understand is that I thought GPUs were made to offload a lot of graphics computation from the CPU. So why are we merging them again? Isn't a GPU supposed to be an auxiliary CPU just for graphics? I'm so confused.
    You're partially right. GPUs were made to execute the algorithms developed for graphically intensive programs directly in silicon... thus avoiding the need to run that code as software within an operating system, which entails LOTS of overhead. Doing the work directly on dedicated hardware (with entirely different processor designs optimized for graphical computations) makes it possible to execute A LOT more calculations per second. You can really see the difference if you, for instance, run DirectX on two nearly identical video cards: one with hardware-based DirectX, the other running it in software.

    Moving the GPU right up next to the CPU will let data flow between the two a lot faster than it currently does over a bus... they can finally get rid of the bottlenecks that have been around since the two were separated.
  • by DeathPenguin (449875) * on Friday October 20, 2006 @01:23PM (#16518345)
    When I saw this headline I immediately thought of this article [wired.com], an interview with Jen-Hsun Huang (CEO: nVidia) by Wired dated July '02. In it, the intention of overthrowing Intel is made quite clear, and ironically enough they even mention the speculation from a time when it was rumored that nVidia and AMD would merge.

    It's actually a very good article for those interested in nVidia's history and Huang's mentality. Paul Otellini [intel.com] ought to be afraid. Very afraid.
  • by LWATCDR (28044) on Friday October 20, 2006 @01:45PM (#16518629) Homepage Journal
    I am an old-school programmer, so I tend to use ints a lot. The sad truth is that floats using SSE are as fast as, and sometimes faster than, the old tricks we used to avoid floats!
    Yes, we live in an upside-down world where floats are sometimes faster than ints.
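    To make the parent's point concrete, here is a minimal C sketch contrasting the classic 16.16 fixed-point multiply trick with the plain float version; the format and function names are just for the example, and on a modern FPU/SSE unit the float version is usually at least as fast:

        #include <stdint.h>

        /* Old-school trick: 16.16 fixed-point multiply to avoid floats. */
        static inline int32_t fx_mul(int32_t a, int32_t b)
        {
            return (int32_t)(((int64_t)a * b) >> 16);   /* keep the fractional bits */
        }

        /* Today: just multiply floats and let the FPU/SSE unit do the work. */
        static inline float fl_mul(float a, float b)
        {
            return a * b;
        }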
     
  • by daVinci1980 (73174) on Friday October 20, 2006 @01:48PM (#16518675) Homepage
    Does it often matter whether a pixel has position (542,396) or (542.0518434,395.97862456)?

    Yes. It absolutely matters. It makes a huge difference in image quality.

    It matters when we go to sample textures, it matters when we enable AA, it matters.
  • by arth1 (260657) on Friday October 20, 2006 @02:09PM (#16518995) Homepage Journal
    Yes. It absolutely matters. It makes a huge difference in image quality.

    No, it doesn't. Note that I said pixel, not coordinate.
    The coordinates should be as accurate as possible, but having a pixel more accurate than twice the resolution of the display serves very little purpose.
  • by Anonymous Coward on Friday October 20, 2006 @02:21PM (#16519169)

    A guy from Intel recently presented at a seminar at my university. He is working with a group that is pushing for a CPU architecture that looks kind of like a GPU, if you look at it from a very high level (and perhaps with your eyes squinted just a bit).

    The unofficial title of his talk was 'the war going on inside your PC'. He argued that the designs of future CPUs and GPUs will eventually converge, with future architectures comprising a sea of small, efficient, but tightly interconnected processors (no superscalar), and that it is basically a race to see who gets there first: the CPU manufacturers or the GPU manufacturers.

    One of his main points was that, with increased compiler effort, many computational workloads can potentially be made to run on such a tiled architecture of simple processors. Graphics rendering itself went through that shift: it was originally better suited to a typical CPU, where control-dependent loads run efficiently, but it has been reworked into the kind of workload that can leverage the 'tiles of simple processors' found in a graphics card today. When a workload cannot be mapped to the tiled architecture, you just slap a superscalar processor in the corner of the die (like Nvidia seems to be doing) to take care of those corner cases.

    So, we will likely be seeing a lot more of this in the future. Especially now that AMD and ATI are together.

    (More details on the abstract of the presentation I mentioned can be found here [utexas.edu])

Why are multiprocessor units suddenly so popular, when e.g. the Voodoo graphics cards failed with theirs? I remember 3dfx being ridiculed and ending up in the performance backwaters with their 2-4-8(-16) multiprocessor cards, but it seems there are now engineering reasons why multiple processors are coming into favour?

    Multiple processors (CPU, GPU or otherwise) are a way to add more 'cycles' with current technology. This has the advantage of getting more out of current designs and manufacturing technology, but it comes at the cost of increased complexity in both the supporting hardware and the software.

    Making a single-core implementation faster is always the more efficient way to add processing capacity, but it becomes very impractical beyond a certain point due to power and heat considerations. (Where that point lies depends on the state of technology at any given moment, but in the end it is limited by the physical size of molecules, at least as far as current technology goes.)

    So multiple processors are not directly better from an engineering point of view; rather, they are a way to overcome the speed limits of current technology, provided you can deal with the extra complexity. (Moving much of the hardware complexity into the chip itself, as AMD and Intel are doing now, removes the burden from system-board designers, but the complexity itself is still there, especially on the software side.)

    With regards to 3dfx, it seems to me that:
    1. They failed to manage the additional complexity
    2. As their competition showed, the limits of the technology at that time were much higher than what 3dfx managed, which indicates there were problems with either their design or their manufacturing technology, or more likely both.

  • by stevenm86 (780116) on Friday October 20, 2006 @02:38PM (#16519475)
    That's sort of the point of building them on the same die. You can't just run a wire to it, as it would be quite slow. Wires tend to have parasitic inductances and capacitances, so the setup and hold times on the lines would be too large to provide a benefit.
  • Not so much (Score:5, Informative)

    by Sycraft-fu (314770) on Friday October 20, 2006 @02:56PM (#16519745)
    System RAM is SLOW compared to GPU RAM. PCIe actually allows very high-speed access to system RAM, but the RAM itself is too slow for GPUs. That's one of the reasons graphics cards carry so little RAM: they use higher-speed and thus more expensive parts. The speed also brings cooling and signal issues, which makes it impractical (or perhaps impossible) to simply put the RAM in add-on slots to allow for upgrades.

    Even as fast as it is, it's still slower than the GPU would really like.

    What you've suggested is already done by low end accelerators like the Intel GMA 950. Works ok, but as I said, slow.

    Unless you are willing to start dropping serious amounts of cash on system RAM, we'll be needing to stick with dedicated video RAM here for some time.
  • by Intron (870560) on Friday October 20, 2006 @02:58PM (#16519787)
    Typically, unimplemented instructions cause an exception. The operation can then be emulated in software.
  • by Sycraft-fu (314770) on Friday October 20, 2006 @03:05PM (#16519905)
    1) Processors are wicked fast at floating point these days. Have a look sometime at the benchmarks a modern chip can turn in using SSE2. Integer doesn't inherently mean faster, and chips these days have badass FPUs.

    2) For many things, it DOES make a difference. You might ask why we need more than 24-bit (or 32-bit, if you count the alpha channel) integer colour; after all, that's enough for a final image to look highly realistic. That's fine for the final image, but you don't want to do the computation like that. Why? Rounding errors. With iterative things like shaders, doing the math in integer adds up to nasty errors, which means nasty colours and jaggies. There's a reason pro software works in 128-bit FP (32 bits per colour channel) and why cards are now going that way as well. (A small sketch of the effect follows below.)

    3) In modern games, everything is handled on the GPU anyhow. The CPU sends over the data and the GPU does all the transform, lighting, texturing and rasterizing; the CPU really is responsible for very little. With vertex shaders the GPU even handles a good deal of the animation these days. The reason is that it's not only more functional but waaaaay faster. You can spend all the time you like making a nicely optimised integer T&L path on the CPU, and the GPU will still blow it away. You actually find that some older games run slower than new ones because they rely on the CPU for initial rendering phases like T&L before handing things off, whereas newer games let the GPU handle it and thus run faster even at higher detail.
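    As a rough illustration of the rounding problem described in point 2, here is a small C sketch; the blend weights and iteration count are invented for the example, and the only point is how 8-bit integer math drifts while float math doesn't:

        #include <stdio.h>

        int main(void)
        {
            unsigned char c8 = 200;             /* one 8-bit colour channel */
            float         cf = 200.0f / 255.0f; /* same channel kept in float */

            /* Blend toward the same target 100 times in a row. */
            for (int i = 0; i < 100; ++i) {
                c8 = (unsigned char)((c8 * 90) / 100 + 13);  /* truncates every pass */
                cf = cf * 0.9f + 12.75f / 255.0f;            /* same blend, no truncation */
            }
            printf("8-bit result: %d   float result: %.2f\n", c8, cf * 255.0f);
            return 0;
        }

    The integer version settles a couple of levels away from where the float version ends up; pile enough passes like that into a shader and you get visible artifacts.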
  • by joto (134244) on Friday October 20, 2006 @03:25PM (#16520173)
    There are a couple of strategies:
    1. Write a specialized program that will only run on a single computer, the one the programmer owns, since everything is specialized and optimized for his/her hardware. If other people need to run the program, write a new one, or at the very least use different compiler options.
    2. Don't use non-portable features. Always go for the lowest common denominator.
    3. Manually test for the presence of the coprocessor before each FPU instruction, and branch to an emulator if the FPU doesn't exist.
    4. Same as above, but the tests are inserted automatically by the compiler.
    5. Test for the presence of the coprocessor at the start of program execution. If the FPU doesn't exist, dynamically replace all FPU instructions with branches to emulator routines.
    6. Same as above, but done automatically by the OS program loader.
    7. Make it mandatory for CPUs either to support the FPU instructions (with a coprocessor if necessary) or to issue some sort of trap/interrupt that software such as the OS kernel/libc can use to run an emulator routine instead.

    I believe the last option (option 7) is what the x86/x87 CPU/FPU combo actually used. That's why there is a coprocessor escape prefix in front of the FPU instructions; they are not just unused opcodes.

    Option 5 (and sometimes even 3) is commonly used for MMX/3dNOW/SSE/SSE2/SSE3/whatever instructions (a sketch of that approach follows below).

    Unless they *really* need nonportable features, most programmers tend to go with option 2.
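    A rough sketch of what the run-time-test options (3 and 5) look like in practice today, using GCC/Clang's __builtin_cpu_supports() for the probe; the functions and their names are invented for the example:

        #include <xmmintrin.h>   /* SSE intrinsics */

        /* Portable fallback: the lowest-common-denominator path (option 2 style). */
        static void add_plain(const float *a, const float *b, float *out, int n)
        {
            for (int i = 0; i < n; ++i)
                out[i] = a[i] + b[i];
        }

        /* SSE path: 4 floats per instruction. */
        __attribute__((target("sse")))
        static void add_sse(const float *a, const float *b, float *out, int n)
        {
            int i = 0;
            for (; i + 4 <= n; i += 4)
                _mm_storeu_ps(out + i, _mm_add_ps(_mm_loadu_ps(a + i),
                                                  _mm_loadu_ps(b + i)));
            for (; i < n; ++i)
                out[i] = a[i] + b[i];
        }

        /* Probed once at startup; the rest of the program just calls add_floats(). */
        static void (*add_floats)(const float *, const float *, float *, int) = add_plain;

        void init_dispatch(void)
        {
            if (__builtin_cpu_supports("sse"))
                add_floats = add_sse;
        }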

  • by Sycraft-fu (314770) on Friday October 20, 2006 @03:34PM (#16520287)
    Yes, the SYSTEMS Tom used for testing have normal-speed system RAM. Duh. The graphics cards, however, have much faster RAM. For example, my system at home has DDR2-667 RAM. That's spec'd to run at 333MHz, which is 667MHz in DDR-RAM speak. My graphics card, a 7800GT, on the other hand has RAM clocked at 600MHz, or 1200MHz in RAM speak.

    Not a small difference, really. My system RAM is rated to somewhere around 10GB/second max bandwidth (it gets like 6 in actuality). The graphics card? 54GB/sec.

    Video cards have fast RAM subsystems. They use fast, expensive chips and they have controllers designed for blazing fast (and exclusive) access. You can't just throw normal, slow, system RAM at it and expect it to perform the same.
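    For anyone wondering where figures like that come from, the back-of-the-envelope math is just bus width times transfer rate. A small C sketch; the 256-bit GPU bus width is an assumption about a typical card of that class, not a measured spec of the parent's machine:

        #include <stdio.h>

        int main(void)
        {
            /* Peak bandwidth (GB/s) = (bus width in bytes) * (transfers per second) / 1e9 */
            double sys_gb = (2 * 64 / 8.0) * 667e6  / 1e9;  /* dual-channel DDR2-667: ~10.7 GB/s */
            double gpu_gb = (256 / 8.0)    * 1200e6 / 1e9;  /* assumed 256-bit bus at 1200 MT/s: ~38.4 GB/s */

            printf("system RAM peak: %.1f GB/s\n", sys_gb);
            printf("GPU RAM peak:    %.1f GB/s\n", gpu_gb);
            return 0;
        }

    Wider buses or faster memory clocks push the GPU figure higher still, which is where numbers past 50GB/sec come from; either way, the gap to system RAM is large.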
  • by MojoStan (776183) on Friday October 20, 2006 @03:47PM (#16520465)
    ... if you combine them on the same die with a large shared cache and the on-chip memory controller... you can see where I'm going with this. Think of it as a separate CPU, just printed on the same silicon wafer. That means you only need 1 fan to cool it and you can lose a lot of heat producing power management circuitry on the video card.

    Obviously this is not going to be ideal for high end gaming rigs; but it will improve the quality of integrated video chipsets on lower end and even mid range PCs.

    Do you remember how Intel tried to do this with their (code name) Timna [geek.com] processor in 2000? Timna was supposed to be a low cost solution that integrated a CPU, GPU, and memory controller on the same silicon wafer. The CPU was a Celeron CPU (Pentium III based), the GPU was based on Intel's new i740, and the memory controller used RAMBUS (yes, RAMBUS) memory. At the same time, Intel was also developing the first chipset with integrated graphics (i810 chipset) and the first RAMBUS chipset (i820 chipset). RAMBUS was supposed to be the successor to PC100 SDRAM.

    When Timna was initially finished, RAMBUS was still so expensive that Timna's release had to be delayed so that a (PC100-to-RAMBUS) memory translator could be added. Those of us who followed chipsets back then know how badly RAMBUS and memory translators bombed. The integrated RAMBUS memory controller had to be the biggest reason Timna was cancelled. This might also be a reason Intel doesn't integrate a memory controller onto their current CPUs.

    Interestingly, Timna was the first project of Intel's new Israeli design team. Not a great start, but their second project was pretty darned good (Pentium M/Centrino) [anandtech.com].

  • by 644bd346996 (1012333) on Friday October 20, 2006 @04:13PM (#16520801)
    Having the GPU on the same chip or die as the CPU would reduce the latency by several orders of magnitude and allow a much higher clock for the bus between the two. The memory access could also be improved dramatically, depending on how it was implemented.

    I think the first example of this integration we see will use the HyperTransport bus and a single package with CPU and GPU on different dies, though fabbed on the same process. This could be done with an existing AMD socket and motherboard.

    Before this happens, though, I think we will see graphics cards on HTX slots. For those who do not know, HTX slots were introduced in a recent revision of the HyperTransport standard. They allow an add-in card to communicate with the CPU with much lower latency and higher bandwidth than PCIe, and no controller in between. The add-in card could even have another CPU on it, and the performance would be comparable to current AMD SMP systems. A GPU on an HTX card could have its own RAM, and be able to access system RAM much faster than PCIe allows. The neat thing is that with HT, the CPU would probably be able to use the graphics RAM as though it were system RAM.

    Note that Nvidia is a member of the HyperTransport Consortium due to their chipset business, and they could easily have HTX cards in their labs right now.
  • by daVinci1980 (73174) on Friday October 20, 2006 @06:28PM (#16522767) Homepage

    You'd be mistaken [virginia.edu]. See the slide on Texture Mapping.

    Perspective divide is performed before texture sampling. This is necessary to get proper texture step sizes, for correct sampling of the texture onto the pixel.

    Fractional pixel locations are also used in antialiasing.
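    For the curious, the usual way this is implemented is to interpolate u/w, v/w and 1/w linearly in screen space and divide back per pixel. A minimal C sketch of that idea; the struct and function names are invented for the example:

        /* Perspective-correct texture coordinates across a span:
         * linearly interpolate u/w, v/w and 1/w, then divide per pixel. */
        typedef struct { float u_over_w, v_over_w, one_over_w; } SpanEnd;

        /* t runs from 0 at one end of the span to 1 at the other. */
        static void texcoords_at(SpanEnd a, SpanEnd b, float t, float *u, float *v)
        {
            float uw = a.u_over_w   + t * (b.u_over_w   - a.u_over_w);
            float vw = a.v_over_w   + t * (b.v_over_w   - a.v_over_w);
            float w1 = a.one_over_w + t * (b.one_over_w - a.one_over_w);

            *u = uw / w1;   /* the per-pixel perspective divide mentioned above */
            *v = vw / w1;
        }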
