Sun Microsystems Technology

Sun To Release 8-Core Niagara 2 Processor

Posted by CowboyNeal
from the screw-everything-we're-going-eight-cores dept.
An anonymous reader writes "Sun Microsystems is set to announce its eight-core Niagara 2 processor next week. Each core supports eight threads, so the chip handles 64 simultaneous threads, making it the centerpiece of Sun's 'Throughput Computing' effort. Along with having more cores than the quads from Intel and AMD, the Niagara 2 has dual on-chip 10G Ethernet ports with cryptographic capability. Sun doesn't get much processor press, because the chips are used only in its own CoolThreads servers, but Niagara 2 will probably be the fastest processor out there when it's released, other than perhaps the also little-known 4-GHz IBM Power 6."
This discussion has been archived. No new comments can be posted.

Sun To Release 8-Core Niagara 2 Processor

Comments Filter:
  • by dread (3500) on Friday August 03, 2007 @04:47AM (#20098545)
    Correct. At my last employer we found this out the hard way. Most servers were getting great performance but the one that actually did some (and it wasn't much really) FP work was horrible. This should really remedy that problem.

    On the other hand, Sun still suffers from the fact that ATCA is getting more and more mindshare in the telco arena, which has been one of their major cash cows. It will be really interesting to see how that pans out in the end.
  • by Eukariote (881204) on Friday August 03, 2007 @04:48AM (#20098549)

    Along with having more cores than the quads from Intel and AMD...
    What quad from Intel/AMD? Intel is selling two dual cores on a cracker. The "quad" bit is just marketing, the actual silicon chips are pure dual core designs that have to talk across the front side bus just as in a two-socket server. And AMD has so far only been previewing their quads, you can't buy them yet.
  • by Cheesey (70139) on Friday August 03, 2007 @05:44AM (#20098779)
    High-speed CPUs are all limited by a bottleneck - getting data on and off chip. Putting the Ethernet controllers on chip helps to offset this.

    In the future, it is likely that all the wired buses in your motherboard will be replaced by an internal Ethernet-like network. We are already seeing a trend towards simpler and faster interconnects such as SATA. The next step is to use Ethernet-style connections for every chip-to-chip link, and within the chips themselves too. If this seems unlikely, consider that your PC's memory bus is already basically a network connection. The device at one end (CPU) is in a different clock domain to the device at the other (memory), and data is sent in packets (called bursts) to offset the latency of setting up a transfer.
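    The burst point can be made concrete with a little arithmetic. A minimal sketch, using made-up setup and per-word costs (not figures from any real memory part):

```python
# Rough model of why DRAM bursts amortize transfer-setup latency.
# The cycle counts here are illustrative, not from any datasheet.
SETUP_CYCLES = 20    # cost to set up the transfer (open a row, etc.)
CYCLES_PER_WORD = 1  # streaming cost once the burst is running

def effective_cycles_per_word(burst_len):
    """Average cycles per word when words are fetched in bursts."""
    return (SETUP_CYCLES + CYCLES_PER_WORD * burst_len) / burst_len

print(effective_cycles_per_word(1))   # 21.0 -- every word pays full setup
print(effective_cycles_per_word(8))   # 3.5  -- setup amortized over the burst
```

    The longer the burst, the closer the effective cost gets to the streaming rate, which is why packetized transfers tolerate the clock-domain crossing well.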
  • Re:Trust me... (Score:5, Informative)

    by LarsWestergren (9033) on Friday August 03, 2007 @05:45AM (#20098791) Homepage Journal
    ...If they put THESE under the GPL, along with the T1, they'd be getting more press than they could imagine.

    http://www.opensparc.net/ [opensparc.net]

    They are openly discussing making the Niagara 2 available as open source as well, but note that there are some roadblocks such as the US government's restrictions [opensparc.net] on crypto technology.

  • by zeromemory (742402) on Friday August 03, 2007 @05:58AM (#20098835) Homepage
    Sun donated one of the original T2000 (based on the original 8-core, 4-thread/core Niagara processor) systems to a campus organization where I'm a volunteer system administrator, so I think I have quite a bit of experience with this processor. Here's my take on the Niagara2, based upon my experiences with the Niagara1:
    • No, this is not going to be the 'fastest' processor out there; it's designed primarily for workloads that don't require floating-point calculations (web servers, mail, etc.), so it's not going to be the go-to processor for places like rendering farms. In fact, floating-point performance on the Niagara1 was so terrible that Sun included a special cryptographic accelerator to help with SSL performance (the primary consumer of floating-point calculations on most web servers).
    • This processor architecture absolutely rocks for the purpose it was intended, though. It consumes very little power, but handles service loads amazingly well. We also have a Sun v40z (8-core Opteron server) that would barely be able to keep up with our T2000 (and that's saying a lot), yet our T2000 draws only a little more than half as much power as our v40z (2.6A @ 120VAC compared to 4.6A @ 120VAC).
    • The inclusion of 10GbE support is going to be absolutely essential and will help make servers based upon the Niagara2 stand-out compared to servers from competing vendors. Why is 10GbE so important? I mean, we already have GbE, and most places barely have an infrastructure for that in place, right? The answer is SAN. 10GbE is going to be necessary if you're going to be using iSCSI to consolidate storage and deliver reasonable performance, and most places are heading in that direction, especially the target market for these systems.
    • Solaris Logical Domains (not to be confused with Sun Containers or Zones) is a hardware-based virtualization technology that was packaged with the Niagara1 and will probably be included with the Niagara2. Using Logical Domains, you can create independent virtual servers running different operating systems and divide hardware resources up between them, down to the individual CPU thread and PCI Express bus leaf level. Unlike software virtualization solutions, your virtual servers are never dependent on any single privileged instance (global zone, dom0, etc.). This technology is making hardware virtualization a possibility for many places.

    I think the Niagara is a pretty solid design, but it's not the processor to end all processors. For service workloads, I don't think you can get a better processor, but you probably don't want one of these processors in your workstation. Sun Microsystems is also headed in the right direction, establishing an open community around these processors and Solaris.
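    As a quick check on the power numbers quoted above, converting the stated supply currents to watts with P = V × I (assuming unity power factor, which is a simplification):

```python
# Quick sanity check on the quoted power draw (P = V * I), assuming
# the stated currents and unity power factor (a simplification).
VOLTS = 120
t2000_watts = VOLTS * 2.6   # Sun T2000 at 2.6A
v40z_watts = VOLTS * 4.6    # Sun v40z at 4.6A

print(round(t2000_watts), round(v40z_watts))   # 312 552
print(round(t2000_watts / v40z_watts, 2))      # 0.57 -- "a little more than half"
```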
  • by Alioth (221270) <no@spam> on Friday August 03, 2007 @06:09AM (#20098873) Journal
    The floating point performance of the new processor should be like night and day compared to the old one you had: the old one apparently only has 1 FPU for the entire device - the new one has an FPU per core.
  • Re:yes but ... (Score:2, Informative)

    by Anonymous Coward on Friday August 03, 2007 @06:34AM (#20098983)

    why doesn't Sun Microsystems make laptops

    They do. Ultra 3 Mobile [sun.com].

    There are also the units from Tadpole [tadpole.com], and I'm sure there are others.

  • Re:Trust me... (Score:3, Informative)

    by wild_berry (448019) on Friday August 03, 2007 @07:08AM (#20099119) Journal
    using them as a graphics processor on a PC

    Good enough for raster graphics, not so good for vector graphics or 3D due to there being only 8 FPUs on the die, with only twice the floating point throughput of the terrible-at-floating-point T1. Unless you swap some of that throughput for soft floating point.
  • Re:Smokin'... (Score:1, Informative)

    by Anonymous Coward on Friday August 03, 2007 @07:15AM (#20099149)
    Actually, it doesn't dissipate that much heat. One of the purposes of the CoolThreads idea and the Niagara chip family is actually to reduce power demands and heat dissipation thus reducing the need for cooling and therefore saving even more power.

    The Niagara 1 (T1), for example, consumes only around 70W with 8 cores running 4 threads each. This is comparable to the single- and dual-core chips used in desktop computers today.
  • by DisKurzion (662299) on Friday August 03, 2007 @07:29AM (#20099205)
    I'm going to have to agree with the coward on this one. You don't have a clue. You won't see Sun stuff on the desktops. Sun boxes have their place: The high-performance market. Where I work (hint: Feds), we have multiple Sun boxes set up, which run our virtual servers. If there's one thing that you can never get enough of in this kind of setup, it's multiple threading and RAM. The integrated networking is also a huge boost, since that's the last major bottleneck before hitting the clients.

    He wasn't trying to say that Sun deserves more press. Sure, small businesses and even many large businesses don't require that kind of power. But the coward was right: Sun provides good quality at (relatively) dirt cheap prices. That's why they make this kind of thing.

    You try running 5+ heavily used virtual servers (Each running a component of Oracle) on one Intel or AMD box. Let me know how that goes for you.

    PS - Solaris kicks ass.
  • Re:Trust me... (Score:5, Informative)

    by TheRaven64 (641858) on Friday August 03, 2007 @07:30AM (#20099207) Journal

    If they used these a bit more aggressively - such as using them as a graphics processor on a PC - they'd be getting some amazing press
    A modern GPU is fairly similar in design to the T2, but there are a few key differences:
    • The T2 is mainly focussed on integer ops with only one floating point pipeline per core. A GPU typically is close to 100% floating point pipelines, and doesn't bother with integer arithmetic.
    • The T2 uses multiple contexts to hide memory latency, mostly caused by incorrectly predicted branches. A GPU typically doesn't bother much with branch prediction, since it runs code that is very light on conditional branches (on average, branches happen every 7 ops in general purpose code. In GPU code, they happen every few hundred).
    • GPUs usually focus on 4-way vector instructions, since most of their data is of this form (RGBA colours, XYZW vertexes). The T2 only has scalar instructions.
    I posted in my journal recently suggesting that it would be easier to produce a modern GPU than an older card, since modern GPUs have much less application-specific logic and do more in software, relying on just having lots of cores / pipelines to give speed.
  • Re:Interesting (Score:3, Informative)

    by TheRaven64 (641858) on Friday August 03, 2007 @07:35AM (#20099253) Journal
    The T2 has one huge advantage over anything in the POWER line, which is that SPARC is the only non-x86 instruction set supported by HiPE (the High Performance Erlang runtime). This is significantly faster than the runtime used on other platforms.

    Probably not applicable to any of the projects you're working on, but anyone writing Erlang code should check out the benchmarks from R11 running on the T1.

  • by TheRaven64 (641858) on Friday August 03, 2007 @08:01AM (#20099377) Journal
    Note that this is per core, and not per context. With eight contexts per core, it's still going to be a bottleneck if your code is more than 1/8th floating point calculations. On the other hand, a big part of the performance problem came from register copying from the individual cores to the FPU and back on the T1, and this should be fixed with the T2. It's still not going to be a great floating point chip, but it should be a bit better.
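    The 1/8th figure is just counting: eight hardware contexts share one FPU, so the FPU saturates once the combined floating-point issue rate reaches one op per cycle. A toy model, with illustrative numbers only:

```python
# Toy model: 8 hardware threads share one FPU that retires one FP op
# per cycle. If each thread issues one instruction per cycle and a
# fraction fp_frac of those are FP, the FPU saturates once demand >= 1.
def fpu_demand(threads, fp_frac):
    """FP ops per cycle requested of the shared FPU."""
    return threads * fp_frac

print(fpu_demand(8, 0.10))   # 0.8 -> the FPU keeps up
print(fpu_demand(8, 0.125))  # 1.0 -> exactly at the 1/8th break-even point
print(fpu_demand(8, 0.25))   # 2.0 -> the FPU is a 2x bottleneck
```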
  • Re:Trust me... (Score:3, Informative)

    by afidel (530433) on Friday August 03, 2007 @08:14AM (#20099461)
    There are tons of research chips made from the OpenSparc designs and Simply RISC [opensparc.net] claims to have an embedded processor made from a single core T1 design.
  • by ricegf (1059658) on Friday August 03, 2007 @08:18AM (#20099489) Journal

    Linspire (back in the day - I've been on Ubuntu for quite a while now) worked this way. IIRC you had to hold down a key to rescan for hardware, otherwise it assumed nothing changed and booted very briskly. I'm surprised it didn't catch on with more popular distros.

    Also, I thought http://www.linuxbios.org/Welcome_to_LinuxBIOS [linuxbios.org] would get through POST and to the payload in just a couple of seconds.

  • by Anonymous Coward on Friday August 03, 2007 @08:21AM (#20099521)
    It has a cryptographic unit per core too. The PDF prezo linked by the page below says that bandwidth of the 8 crypto units is enough to run the on-chip 10 GbE ports encrypted. Sounds like an opportunity for some interesting applications -- VPN, SSL, SAN/NAS encryption, anyone?

    All that and the 64 threads run at 84 watts maximum (not TDP).

    http://sun.systemnews.com/articles/108/3/hw/17688 [systemnews.com]
  • by Anonymous Coward on Friday August 03, 2007 @08:49AM (#20099759)
    I think he's trying to refer to SerDes.

    The T2 has a SerDes switch connecting processor cache, RAM, PCIe buses,
    and the 10 Gb Ethernet interfaces, all on-chip.

    Impressive bandwidth, over 1 Tb/second total if I read correctly.
    See the chart on page 16 of the PDF on the page below:

    http://sun.systemnews.com/articles/108/3/hw/17688 [systemnews.com]

    Lots of tech details in this Sun presentation too, including power (84W max
    for 8 cores/64 threads, including Ethernet and RAM controllers).
  • by Anonymous Coward on Friday August 03, 2007 @09:02AM (#20099869)
    The workloads they excel at aren't as general as you'd like to think. I had a fair amount of hands-on experience with some T1s; they would only ever beat a dual opteron if you could get over 8 threads working on a problem and it wasn't I/O bound. Otherwise it was at best a tie, and usually a run-of-the-mill dual opteron could pretty easily hang with it.


    Now that's not to say that there aren't server loads like that. Specifically with Java app servers, it's not uncommon to have hundreds of threads doing lightweight tasks.


    I like the direction they are moving in, though. In a few generations, I'm sure they will either have competitive chips across the board or stop making chips altogether, and energy efficiency is a long-overdue area of focus.

  • by TheLink (130905) on Friday August 03, 2007 @09:47AM (#20100413) Journal
    While I'm sure your 20k T1 outperformed your 100k v880, that does not show that the T1 is a better choice than an Intel/AMD system.

    Sun's stuff is slower than IBM's POWER line, and it is nowhere in the same league as IBM mainframes, and IBM mainframes are not in the same league as real nonstop computing clusters.

    Mainframes = very good uptimes, but you have _scheduled_ downtimes.
    Stuff like OpenVMS or Tandem = uptimes of _decades_ possible, don't even need scheduled downtimes where you turn everything off, you can run while replacing the hardware. With the Tandem stuff you even have CPUs running the same thing at the same time for real redundancy. Only thing is HP seems to be burying VMS and Tandem.

    Sun? They didn't even have hardware instruction retry till Fujitsu SPARC. For many years it was pretty embarrassing that the really high end SPARCs were Fujitsu rather than Sun - the fastest SPARC systems till just a few years ago were all Fujitsu PRIMEPOWER (I haven't bothered checking recently; the last I recall, Sun started using Fujitsu stuff for their high end systems).

    Sun got where they were by making relatively cheap Unix RISC workstations and they provided servers for areas where reliability and availability didn't really matter as much as the real "high end" stuff. They caught the internet wave for "cheap" webservers etc and made a lot of money then.

    The problem now with Sun is, they get blasted at the low end by x86, and at the high end they pale in comparison to IBM's stuff.

    For "normal" webserver/db/internet/corporate stuff it's x86.
    In the HPC arena it's x86 (for scale out), and IBM (for scale up).
    So where does Sun fit?

    Just go google for benchmarks of T1 vs Intel vs AMD. The T1 doesn't even do that well for performance/power consumption when compared to the Intel woodcrest CPU: http://www.anandtech.com/printarticle.aspx?i=2772 [anandtech.com]

    For Sun's sake, their Niagara 2 had better be orders of magnitude better than their T1; if not, it'll be out of date even before it's released.

    Don't get me wrong, I'll be happy if Sun succeeds, but they've fallen way behind.
  • Re:Trust me... (Score:3, Informative)

    by CryoPenguin (242131) on Friday August 03, 2007 @10:07AM (#20100663)
    You're looking for the Open Graphics Project [opengraphics.org]. But hardware is hard to design and expensive to fab, you're not going to get an Xtreme3D Graphics Accelerator competitive with the latest from NVIDIA or ATI.
  • by ancientt (569920) <ancientt@yahoo.com> on Friday August 03, 2007 @10:16AM (#20100771) Homepage Journal

    They sent me a Sunfire last year telling me it had capabilities it didn't. I sent it back. If $21,500 is the price tag, then this would outperform what I was considering at approximately the same price. I doubt these have the capabilities I was looking for (I want Windows in Xen), but the price isn't going to be its limiting factor.

    By the way, if they're still running it, the Try and Buy program is every bit as good as it sounds. They shipped me a server for the asking, I tested it, and sent it back, all on their dime (except my time.) If I'd talked to salespeople honest enough to say "I don't know" rather than "of course" I wouldn't have had it sent in the first place and might be running one of their servers right now. Maybe next year.

  • Re:Trust me... (Score:3, Informative)

    by allenw (33234) on Friday August 03, 2007 @10:19AM (#20100803) Homepage Journal
    Linux is already running and certified [ubuntu.com] for Niagara.
  • by ajs (35943) <ajs@@@ajs...com> on Friday August 03, 2007 @10:44AM (#20101213) Homepage Journal

    most people just don't care about them. Tom's Hardware is not going to be reviewing them for the enthusiast market, for example, they are waaaaay out of that range. Same shit with the Power 6. Great chip, coming never to a desktop near you. These specialised high end products are just not of mass interest if for no other reason than price.
    Which Sun has never cared about. They sell to the high performance market, which (outside of trivially distributed applications like serving Web pages or rendering CG) NEEDS the kind of horsepower that Suns can crank out.

    Even if that cheap server breaks right at the end of its warranty, it is still a money saver, a big one.
    This is almost always a red herring. Price of a system is rarely a company's most significant cost (within an order of magnitude) when you're dealing with high performance computing. It's the people and the data vendor relationships that usually cost you the bulk of your outlay. Hardware of just about any sort is fairly cheap by comparison.

    I understand the market for enterprise systems, I also understand that it is small.
    Hrm... small? Well, not really. It's got a low number of players, but those players have a voracious appetite for processing power.

    Don't get me wrong. Most of a large financial house doesn't need a Sun server. However, those who do (e.g. quants) really do. Same goes for government, biotech, military contractors, etc.
  • Re:Trust me... (Score:3, Informative)

    by mhall119 (1035984) on Friday August 03, 2007 @10:58AM (#20101477) Homepage Journal
    I believe Motorola makes SPARC-compatible processors; not sure if they're based on OpenSPARC or if they licensed it from Sun.
  • by the eric conspiracy (20178) on Friday August 03, 2007 @11:10AM (#20101711)
    The threads don't execute simultaneously anyway. They are there so that work can continue during waits, e.g. on memory. An FPU per core should be enough.

    http://www.sun.com/2003-1014/feature/ [sun.com]

  • Re:Trust me... (Score:2, Informative)

    by Spy Hunter (317220) on Friday August 03, 2007 @03:05PM (#20105433) Journal
    I'm sure it would be difficult to mold this chip into a GPU, but I'd like to point out that NVidia's GeForce 8 series actually does bother (a bit) with integer arithmetic, and is actually a scalar architecture, with no 4-way vector instructions (though it does pair multiple functional units with one instruction decoder, each functional unit executes a different thread).

    The biggest differences between Sun's chip and the 8 series are probably in the memory architecture. The 8 series has a ginormous memory bandwidth and many specialized ways to access it, all wired up in a highly optimized pattern: z-buffering hardware, framebuffer blending hardware, many read-only texture sampling units (which are powerful processors in their own right with dedicated caches), and local programmer-managed read-only and read-write memories for each core.
  • by grigori (676336) on Friday August 03, 2007 @04:30PM (#20106711)
    The guy above is confused partly because Sun reused the word "thread" to mean "hardware thread", while most people think of a "software thread" as in Java or pthread_create().

    Hardware threads are virtual CPUs sharing resources on a core so that work can proceed when a thread stalls on a cache miss - the hardware switches to a new thread in a SINGLE CLOCK. Not expensive. Time-slicing between many software threads on a few CPUs can be expensive, but having many hardware threads to run those threads makes the problem tiny. In fact, this design is a great way to make multithreaded applications run really fast.
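    The latency-hiding argument can be sketched with a crude utilization model (the miss latency and miss interval below are illustrative, not T1/T2 measurements):

```python
# Crude model of latency hiding with hardware threads. Each thread
# stalls for MISS_LATENCY cycles after every MISS_INTERVAL useful
# cycles; with enough hardware threads the core overlaps the stalls
# and stays busy. Numbers are illustrative, not measurements.
MISS_LATENCY = 100    # cycles a thread is stalled on a cache miss
MISS_INTERVAL = 25    # useful cycles a thread runs between misses

def core_utilization(hw_threads):
    """Fraction of cycles the core can issue useful work."""
    # One thread keeps the core busy MISS_INTERVAL cycles out of every
    # (MISS_INTERVAL + MISS_LATENCY); extra threads overlap their stalls.
    per_thread = MISS_INTERVAL / (MISS_INTERVAL + MISS_LATENCY)
    return min(1.0, hw_threads * per_thread)

print(core_utilization(1))  # 0.2 -- a single thread is mostly stalled
print(core_utilization(8))  # 1.0 -- eight threads fully hide the stalls
```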

  • by imroy (755) <imroykun@gmail.com> on Friday August 03, 2007 @05:11PM (#20107227) Homepage Journal

    Please note that it doesn't *run* 64 threads simultaneously. It *manages* 8 threads per core -- but each core has only two integer units, one load-store unit, and one floating point unit. At best a core can have ops from four different threads executing simultaneously, but this will be a very rare case (when int, int, float, and load/store ops line up in the same cycle). Most often each core will be able to simultaneously execute instructions from just one or two threads -- which is still excellent for 84W!

    Quite true. For more info on the way its cores work, see the UltraSPARC T1 [wikipedia.org] article on Wikipedia (which I have edited quite a bit). Each core is a barrel processor [wikipedia.org], meaning each stage in the pipeline is handling an instruction from a different thread. This adds complexity, but in exchange it means that branch mis-prediction is no longer a problem - any branch instruction has already been through the execute stage and the Program Counter modified before the next instruction of the thread gets fetched.

    The other big advantage with the multi-threaded UltraSPARC T1/T2 design is that it has high throughput. While a single-threaded CPU has to wait on cache misses, the T1/T2 just continues chugging along with its remaining threads. It's switching threads on every clock cycle, so each thread gets only 1/8th of the 'power' of each core. But because it's doing something on every single clock cycle, it can do a lot of work - as long as the work is multi-threaded. That's its weakness.
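    The barrel idea above can be sketched as a round-robin issue loop, picking one instruction from the next ready thread each cycle (a toy simulation, not the actual pipeline):

```python
from collections import deque

# Toy barrel-processor issue loop: each cycle, issue one instruction
# from the next ready thread in round-robin order. Threads with no
# ready work (e.g. stalled on a cache miss) are simply skipped.
def barrel_issue(threads, cycles):
    """threads: dict name -> deque of instructions. Returns the issue trace."""
    order = deque(threads)
    trace = []
    for _ in range(cycles):
        for _ in range(len(order)):
            name = order[0]
            order.rotate(-1)          # next cycle starts with the next thread
            if threads[name]:         # this thread has an instruction ready
                trace.append((name, threads[name].popleft()))
                break                 # one instruction issued this cycle
    return trace

t = {"t0": deque(["add", "ld"]), "t1": deque(["mul"]), "t2": deque(["st"])}
print(barrel_issue(t, 4))
# [('t0', 'add'), ('t1', 'mul'), ('t2', 'st'), ('t0', 'ld')]
```

    Because consecutive pipeline stages hold instructions from different threads, a branch has resolved long before the same thread's next fetch, which is the mis-prediction point made above.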
