

Sun Microsystems

Sun's New MAJC Architecture

GFD writes "EETimes has a nice overview of Sun's new MAJC architecture. Combines multiple processors on one chip with VLIW and on chip multi threading. " I've been seeing some information about thsi floating around, but EETimes has done a nice job summarizing the chip itself.


  • The SMP part is kinda boring. Yes. The fun is how you have multiple threads on a chip. Imagine that you have a deep pipeline, and there is a stall of some kind (you need a value from memory, f.ex).
    This would normally mean a wasted chance to use the ALU. On a threaded chip, there is another instruction stream (with its own decoding pipeline and everything), so instead of stalling, the CPU switches decoding pipelines and keeps the ALUs busy. Eventually this thread will stall, or time out or something, and the other thread gets a chance to run. Cool!

    The point is that you replicate everything (I think) but the ALUs (so each thread has its own register file and what not), so context switches are instantaneous. Tera has a (vaporware?) supercomputer that switches after every instruction.

    Ok, so you think this seems like a lot of duplicated silicon. Consider this: Much of the silicon on a chip is devoted to not stalling the pipeline (register renaming, forwarding, speculation). If stalls become free, we can simplify the circuitry a lot, and crank up the clock speed A LOT (simpler circuitry => smaller die => higher speed). Cool! Double cool! No stalls and a faster clock. Think of all the technologies on a chip these days.

    Ok, things are not entirely simple (hands up those who thought they were), but that's the gist of it. For example, we'll need hardware support for locks and communication, probably.

    So why haven't we seen these chips before, if the news is old? I think some of the previous posters got it right -- it's only recently that we've started getting applications that are almost trivially parallelisable. One poster mentioned communication overhead; by providing annotations that these 10 threads want to be on the same CPU, we can be assured that they'll have very quick interthread communication.
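
    To make the stall-hiding idea concrete, here is a toy cycle count. It is only a sketch, not a model of MAJC or any real chip: it assumes one shared issue slot, free context switches, and a fixed-priority pick among threads that are not stalled. The function names and the 10-cycle load latency are made up for illustration.

```python
def cycles_single_context(programs):
    """Run the programs one after another; every stall cycle is wasted."""
    return sum(sum(program) for program in programs)


def cycles_multithreaded(programs):
    """Issue one instruction per cycle from the first thread that is not stalled."""
    next_op = [0] * len(programs)   # per-thread program counter
    ready_at = [0] * len(programs)  # cycle at which each thread may issue again
    cycle = 0
    while any(pc < len(prog) for pc, prog in zip(next_op, programs)):
        for t, prog in enumerate(programs):
            if next_op[t] < len(prog) and ready_at[t] <= cycle:
                ready_at[t] = cycle + prog[next_op[t]]
                next_op[t] += 1
                break  # the shared issue slot is used up for this cycle
        cycle += 1
    return max(ready_at, default=0)


# Latency of each instruction: 1 = ALU op, 10 = load that misses to memory
# (both numbers are illustrative assumptions).
program = [1, 1, 10, 1, 1, 10]
single = cycles_single_context([program, program])
threaded = cycles_multithreaded([program, program])
print(single, threaded)
```

    With these made-up latencies the two threads take 48 cycles back to back but 27 when the second thread fills the first one's stalls. There are still idle cycles, since two threads aren't enough to cover a 10-cycle miss.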

  • After the SPARC debacle, where Sun first encouraged others to use the SPARC then pulled the rug out from those who did, I'll wait until somebody else does such a processor before getting excited.
  • The most useful (and boring) way to look at this chip is to think of it as an ordinary 2-processor symmetric multiprocessor. Run a normal SMP OS on it, no problem. Same benefits, same disadvantages.

    No need to complicate this debate with statements like `this will be *useless* unless we solve the dynamic autothreading problem of single apps!'. SMP is useful for running unrelated threads/apps concurrently today. Tomorrow's single chip form will be just as useful.

    The only complication I see this chip introducing is that, because the two instruction streams share the same resource units (FPUs, barrel shifters, ALUs), heavy use of resources by one of the threads will slow the other thread down. Traditional SMPs don't have this problem: *all* resources are duplicated.

    That's fine by me. There are some pretty neat scheduling advantages to that restriction. For example, if the chip is designed right, one could schedule thread #1 to get 75% of the instruction microcycles, while #2 gets 25%, and whatever microcycles each can't use the other gets. Alternatively, one could schedule thread #1 to get 100% of the microcycles while #2 gets whatever of those microcycles #1 wasn't able to use.

    Ahhhhhh heaven. True realtime scheduling at the instruction level, no overhead. Who could ask for more? No wonder this chip is directed at the embedded market. They could really use this feature.

    A bit of history: I believe this idea was floated at Xerox PARC, at the time those folks were inventing mice and desktops and shared printers.
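
    The 75%/25% and "100% plus leftovers" microcycle policies described above can be sketched as a slot-granting loop. This is a guess at how such a policy might behave, not a description of the actual chip: `grant_slots`, the repeating owner pattern, and the per-cycle `ready` flags are all hypothetical, and it is limited to two threads.

```python
def grant_slots(owner_pattern, ready):
    """Decide which of two threads issues in each cycle.

    owner_pattern: thread ids that own successive cycles, repeated cyclically.
    ready[t][c]:   True if thread t has an instruction it could issue in cycle c.
    Returns one entry per cycle: the thread that issued, or None if both stalled.
    """
    grants = []
    for c in range(len(ready[0])):
        owner = owner_pattern[c % len(owner_pattern)]
        other = 1 - owner
        if ready[owner][c]:
            grants.append(owner)
        elif ready[other][c]:
            grants.append(other)  # owner can't use its slot, so the other gets it
        else:
            grants.append(None)
    return grants


always = [True] * 8
stalled_first_half = [False] * 4 + [True] * 4

# Thread 0 owns three slots in four, thread 1 owns the fourth.
share_75_25 = grant_slots([0, 0, 0, 1], [always, always])

# Thread 0 owns every slot; thread 1 only runs while thread 0 is stalled.
leftovers = grant_slots([0], [stalled_first_half, always])
```

    The sketch treats readiness as given per cycle; a real design would have to derive it from pipeline state, and a thread that loses a slot would still be ready on the next one.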
  • Haven't you kept up with the news? The national speed limit has been abolished. Besides, not everyone drives on public roads. :)
  • This is not a new opinion, but I thought I'd add some more contemplation.

    A multi-threading chip would be useless in a non-multi-threading environment (e.g. DOS). This sort of chip is responding to the prevalence of multi-threading multitasking operating systems and applications. Since the BeOS is the furthest anyone's gone to making an OS multithread and multitask, to my mind this chip would run the BeOS like teflon-coated lightning.

    Think about it - the best way to get maximum performance out of this architecture is to have lots of 'small' threads, to have as many threads available for immediate execution should one stall. If there's any OS out there that is more comprehensively thread oriented (which leads to more application threading) it must be proprietary.

    But enough of that pipe-dream - let's get back to reality. Be won't dedicate engineering efforts to a new chip whose market is unproven without backup. PC owners won't abandon their investment just for some pretty new architecture which is basically incompatible (as far as we know) with their existing hardware. And the whole thing will stagnate because no-one can start a market big enough to get the software backing (the Hardware-Software Paradox).

    It's on the wish-list somewhere, but I'm not selling the Celeron just yet.
  • Indeed, it was Ebay management's fault. But it is hard to fault them. They have a huge farm of E10000s, but no hot backup in case the sh*t hits the fan. Not only that, in order to keep up with the exponential increase in business they would have to shell out over a million dollars apiece for more hardware. Is business going to keep growing at 300%/yr? Should they buy 16 more Starfires and the storage to go with them? How much would this cost? 50 million? How much profit have they made? There are a lot of tough questions for management to answer. They (the CEO and executives) have made at least one mistake so far. Hopefully they (and others) will learn.

    The moral: With an internet company, you have to spend the money and "bet the farm".

  • GEOS has been doing preemptive multithreading & multitasking on PCs since '91 or so. Most apps (like 99%) run with the UI in one thread and the processing in another. Makes the system really responsive and seem faster than it is (not that it isn't already a million times faster than Windows). More threads are extremely easy to create if needed.
  • by B1FF ( 73807 )
    *&^%$#@! anyway, you knew what he meant, didn't you?

    1 SIMP0 TH1GHZ!!!!!!
    P33PL3 USU4LLY KN0W WHUT 1 MEAN T00!!!!!!!!%!%!%!!!!!
    BUT S0MET1M35 THAY SAY, 'B1FF SUX!!!` WHYYYYY???????
    ------ ------ ------
    ALL HA1L B1FF, TH3 M05T 31337 D00D!!!!!1
  • Duh.... Which police state do you live in? Speed
    limits do not apply to private property. Period.

    I can drive as fast as I wish on _My_ property.

  • Yes, I know Sun isn't the only one with this approach - I said that in my post.

    I have no idea if Sun plan to do redundancy checking with multiple pipelines with the UltraSparc-V. (some IBM, and other, chips do this...) They might do it as an option, but I would currently guess they're doing it mostly for performance.

    EBay's reliability problems are mostly related to poor management decisions (it seems) rather than EBay's (or Sun's or Oracle's) techs. Doing the above kind of checking wouldn't have helped EBay either. It doesn't matter what OS you use, if you have a screwed up setup, you'll get problems. And you'll be surprised/horrified at just how long it can take screwed setups to be fixed if the site's already gone live. (I know from experience. and no, it wasn't my screwed up setup.)

  • So why haven't we seen these chips before, if the news is old?

    For the simple reason that, though a multithreaded machine is in the aggregate faster, though it makes more efficient use of chip resources, though it promotes fast context switching at the microinstruction level, any single thread will run *slower* on such a chip than on a chip optimized to run a single thread. This kills benchmark results for all the typical highly publicized benchmarks.

    Until now, no one wanted to run a chip which had a lower benchmark rating. Nowadays, there is a greater appreciation for multiprocessing, and, due to the high performance of today's chips, the single-thread benchmark race is finally loosening its grip on the minds of purchasing agents and of the computing public.

  • by Mr Z ( 6791 ) on Thursday August 19, 1999 @08:06PM (#1736747) Homepage Journal

    Three words come to mind: HIT, NAIL, and HEAD. :-)

    To give an example from a paper I'm a coauthor on (being presented at ICSPAT'99), consider a JPEG decoder. Here's a quick overview of the bulk of a JPEG decoder:

    • for each 8x8 block
      • decode the Huffman code for the block
      • Perform inverse-quantization on the block
      • Perform the IDCT on the block
      • Write the block to the correct plane in the image

    On a deeply pipelined / highly parallel processor, this is horribly inefficient, because each task is very small when applied to only one block, whereas switching between tasks is quite expensive. But that's exactly what a lot of JPEG decoders do (including the Independent JPEG Group's decoder). The decoder is a lot easier to write that way, but is not nearly as efficient as it could be.

    Instead, you want to batch things up as much as possible:

    • For all chunks of the encoded JPEG, do
      • Read a chunk of encoded JPEG
      • Decode the Huffman code for as many 8x8 blocks as possible
      • Inverse-quantize all of these blocks.
      • Perform IDCT on all of these blocks.
      • Write all of these blocks out to the image.

    Now, you can make massive gains in efficiency due to better instruction cache locality, better parallelism across loop iterations due to the fact you're actually looping quite a bit now, and so on. (The wins are rather dramatic on a DSP which relies on programmed DMAs to move data on and off chip.)
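
    The two loop orders above can be sketched side by side. This is not a JPEG decoder: the four stage functions are placeholder arithmetic standing in for Huffman decoding, inverse quantization, IDCT and the image write, and the whole input is treated as a single chunk. The only point is that both orders give the same output while the batched one changes stage far less often, which is a rough stand-in for the instruction-cache behaviour mentioned above.

```python
def run_stage(name, fn, block, trace):
    """Apply one stage to one block and record which stage ran."""
    trace.append(name)
    return fn(block)


# Placeholder stages (assumptions for illustration, not real JPEG math).
STAGES = [
    ("huffman", lambda b: list(b)),
    ("dequantize", lambda b: [x * 2 for x in b]),
    ("idct", lambda b: [x + 1 for x in b]),
    ("write", lambda b: tuple(b)),
]


def decode_per_block(blocks, trace):
    """Run all four stages on one block before touching the next block."""
    out = []
    for block in blocks:
        for name, fn in STAGES:
            block = run_stage(name, fn, block, trace)
        out.append(block)
    return out


def decode_batched(blocks, trace):
    """Run each stage over every block before moving to the next stage."""
    for name, fn in STAGES:
        blocks = [run_stage(name, fn, b, trace) for b in blocks]
    return blocks


def stage_switches(trace):
    """Count how often consecutive calls belong to different stages."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a != b)


blocks = [[i] * 64 for i in range(100)]  # 100 fake 8x8 blocks
trace_a, trace_b = [], []
out_a = decode_per_block(blocks, trace_a)
out_b = decode_batched(blocks, trace_b)
print(stage_switches(trace_a), stage_switches(trace_b))
```

    For 100 blocks the per-block order switches stage 399 times and the batched order 3 times; the real payoff would depend on code size, cache size and DMA setup, none of which this models.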

    What's nice about a system with parallel processing units (whether multiprocessor or multithreaded) is that each stage in the pipeline can become another parallel-executing thread. Indeed, that was one common way to program the TMS320C80 family DSPs, which had 2 or 4 DSPs on one chip, alongside a fairly strong RISC CPU ... all on one die! The DSPs would be organized as a pipeline, communicating through a "crossbar" to shared on-chip SRAM. The RISC CPU would coordinate tasks and issue commands to the DSPs. It was really quite cool.
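
    A minimal sketch of "each stage becomes its own thread", using ordinary Python threads and queues rather than anything resembling the TMS320C80's crossbar and shared SRAM. `run_pipeline`, `stage_worker` and the sentinel object are hypothetical names for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel that tells each stage the input is finished


def stage_worker(fn, inbox, outbox):
    """Pull items from inbox, apply fn, push results to outbox until _DONE."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # pass the shutdown signal down the pipeline
            return
        outbox.put(fn(item))


def run_pipeline(fns, items):
    """Run one thread per stage, connected by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    workers = [
        threading.Thread(target=stage_worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for worker in workers:
        worker.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


doubled_plus_one = run_pipeline([lambda x: x * 2, lambda x: x + 1], range(5))
```

    Because each stage has a single worker and the queues are FIFO, results come out in input order. In standard CPython the global interpreter lock keeps CPU-bound stages from truly running at once, so this shows the structure only, not the speedup.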


  • I want an Alpha 667. That's the fascist linux chip I know of.

  • The idea of switching threads in hardware to get around cache misses has already been done by Tera; they have a machine at the San Diego Supercomputer Center. Dunno if it's a single chip design, I didn't have a screwdriver handy when I visited SDSC. Tera claims some pretty impressive performance numbers.

    FYI - I don't work for Tera or SDSC.
  • by MrEd ( 60684 )
    Hemos, you're doing a good job in cranking out the stories... But trust me, there's time for a spellcheck!
  • I can imagine Beowulf would really rock with this... hee hee hee hee hee... oh the fun we could have!
  • by ChrisRijk ( 1818 ) on Thursday August 19, 1999 @01:48PM (#1736758)
    (I sent the following to a different forum over a day ago - before the EETimes article appeared. This is a word for word copy...)

    MAJC home page. See the docs home page - introduction, and a "community" page.

    They haven't really released enough details (on their website) just yet, but it does look interesting. One of the more obviously different attitudes the specification takes is highly customisable implementations - you design a variation targeted at a particular application, whatever that might be - graphics accelerator, MP3 player/decoder, MPEG2/DVD decoder, or a more general purpose chip. Since it is mostly being targeted at embedded applications this is not surprising though.

    Some other interesting aspects include:

    'Support' for JIT/access-time compilers - not only does this help Java, but it is meant to make backwards compatibility with older versions quite simple. This seems a bit like what Transmeta (which was co-founded by an ex-Sun guy, btw) are doing.

    Hardware support for ultra-fast thread switching - so fast that if one thread stalls waiting for DRAM access (which can take up to 100 clock cycles), you can switch to another thread rather than go idle. On many current OSs threads will be switched if the current one has to do some slow I/O say (ie read from disc) - so this is quite an improvement.
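
    A back-of-the-envelope way to see how many threads it takes to cover a stall like that: if every thread alternates between some cycles of useful work and one memory stall, and switching is free, the shared execution units are busy for roughly `n * work / (work + stall)` of the time, capped at 100%. The 100-cycle stall is the figure quoted above; the 25 cycles of work between misses and the function names are assumptions for illustration.

```python
import math


def alu_utilization(n_threads, work_cycles, stall_cycles):
    """Fraction of cycles doing useful work, assuming free thread switches."""
    return min(1.0, n_threads * work_cycles / (work_cycles + stall_cycles))


def threads_to_hide_stall(work_cycles, stall_cycles):
    """Smallest thread count for which the model above reaches 100%."""
    return math.ceil((work_cycles + stall_cycles) / work_cycles)


one_thread = alu_utilization(1, work_cycles=25, stall_cycles=100)
four_threads = alu_utilization(4, work_cycles=25, stall_cycles=100)
needed = threads_to_hide_stall(work_cycles=25, stall_cycles=100)
```

    Under those assumptions a single thread keeps the chip about 20% busy and it takes five threads to hide the stall completely. Real workloads miss the cache at irregular intervals, so treat this as a rough bound rather than a prediction.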

    A more general approach to improving parallelism - you can have more than one CPU core in a single physical chip, which might or might not share their 1st level caches. (Read this Microprocessor Report article for some background on this.) IBM are apparently going to do a version of the PowerPC G4 which has 2 CPUs on one chip, and I kinda suspect Sun might be planning something similar for their UltraSparc-V.

    I'm not sure how Sun plan to make money off the design. It seems pretty likely they might do something like their "community source" model - you can get the design for free, but if you want to use it commercially you pay a license. ARM is doing well just licensing their CPU designs. I'd imagine Sun using it to 'assist' their servers as add-on boards for doing heavy multi-media/3D graphics stuff - can you say "render farm"? Also, since Sun like selling their servers, they'd be happy for people to make lots of little, cheap devices that connect to nice big Sun servers.

    Like the original poster said, IEEE Micro will probably have some interesting stuff, but it seems Sun aren't releasing all the details yet - looks like we'll have to wait until the Microprocessor Forum in October. I liked the article (written by the Sun engineers) about the UltraSparc-III - not only was it interesting (and I like Sun's approach), it helped me figure out the inherent problem with the IA-64 architecture...

  • "The holy grail in the industry is breaking a single-thread application into multiple threads."

    I'm really surprised by this statement. It seems like a lot of the CPU-bound things that people run these days are easily multi-threadable, e.g. games, raytracing, image processing, even some aspects of compilers. Obviously people can just as easily name apps that aren't parallelizable, but there's already plenty of code out there just dying to run on more than one processor.

    What kinds of commonly-run CPU-bound apps aren't threadable, which are giving these guys so much grief?

  • by MrEd ( 60684 )
    BeOS is already pervasively multithreaded, unlike almost any other OS out there. Its nature makes debugging your apps a pain in the ass, but allows a 95% increase in processing power if you add a second CPU. Or so I've been told.

    This chip would seem to take the pressure off the OS, and hence the programmers. *whew*

  • Some of the Sun material talks about Java applications with tens or hundreds of threads each. However, in the applications I write and the applications I've seen, most of those threads exist to provide nonblocking behaviour for various purposes, and it's hardly ever the case that 2 (or more) threads are runnable at once. The problem is that for many tasks it's just really hard to parallelize them into multiple threads, and to do it right. (One problem is that the thread model of concurrency just sucks, but that's another rant.)

    So here's my shameless plug for CMU research: what we need is hardware support to make it easier to write threaded programs. One approach is thread-level data speculation. In this system, one thread executes normally while other threads execute speculatively, basically assuming that the parallel execution will be safe and correct. The processor is responsible for detecting conflicts between threads that mean the optimistic parallel execution is not correct. When there is a conflict, the speculating thread that caused the conflict is killed and its speculative state is thrown away. It's not as hard to do this as you might think; it seems possible to do it by adding some tags to the data caches on each processor.

    See here for more:

    The Stanford Hydra project does something like this too, BTW.
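
    A software toy can show the bookkeeping behind thread-level data speculation. It is emphatically not how the CMU or Hydra hardware works (those proposals track conflicts in the caches); it just runs every task against the same starting state as if in parallel, commits in program order, and squashes and re-runs any task that read a location an earlier task wrote. `execute`, `run_speculatively` and the three tasks are hypothetical names for illustration.

```python
def execute(task, state):
    """Run a task against state, recording its read set and buffered writes."""
    reads, writes = set(), {}

    def read(addr):
        if addr in writes:       # a task sees its own earlier writes
            return writes[addr]
        reads.add(addr)
        return state[addr]

    def write(addr, value):
        writes[addr] = value     # buffered; not visible to others until commit

    task(read, write)
    return reads, writes


def run_speculatively(tasks, memory):
    """Commit tasks in order; return how many speculative runs were squashed."""
    snapshot = dict(memory)
    attempts = [execute(task, snapshot) for task in tasks]  # optimistic runs
    written = set()   # addresses committed since the snapshot was taken
    squashed = 0
    for task, (reads, writes) in zip(tasks, attempts):
        if reads & written:
            # The task read stale data: throw its work away and re-run it
            # against the committed state.
            squashed += 1
            reads, writes = execute(task, memory)
        memory.update(writes)
        written |= set(writes)
    return squashed


def bump_a(read, write):
    write("a", read("a") + 1)

def double_b_into_c(read, write):
    write("c", read("b") * 2)

def a_plus_100_into_d(read, write):
    write("d", read("a") + 100)   # depends on bump_a's result


memory = {"a": 1, "b": 10, "c": 0, "d": 0}
squashes = run_speculatively([bump_a, double_b_into_c, a_plus_100_into_d], memory)
```

    Here the second task is independent and its speculative result is kept, while the third read `a` before the first task's write was committed, so it is squashed once and re-run. The final memory matches what plain sequential execution would give.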
  • Will Sun port Linux to it?

    Will it be cheap?

    Enquiring minds want to know... (and benchmark)
  • by Anonymous Coward
    MU News has been tracking this story a bit, and has some links if you wanna learn more about MAJC.
  • I've been seeing some information about thsi floating around...

    Interesting times we live in: I read this sentence, and immediately my mental English parser interpreted the typo "thsi" as an acronym and went to work on translating it.

    ("THreaded Semiconductor Integrated circuit" was my interpretation before I realized it was an error.)
  • The article is referring to *hardware* multithreading, not software multithreading. The holy grail in the industry is implementing multithreading (not just multiprocessing) in hardware. What they're trying to do is remove the implementation of multithreading from the operating system level and transfer it to the hardware level, which would allow extremely fast thread switching, which would significantly increase speed and efficiency.
  • I believe the problem's not so much finding the apps to parallelize, but rather the cost of parallelizing those apps, i.e. the inter-thread communication. I've got a program I wrote that would really benefit from parallelization if it weren't for the communication costs. It's an N-body 3D gravity simulator (i.e. simulating a solar system). I believe any speed gained from spreading the load would be consumed by the communications (each planet has to know where every other planet is, an N^2 problem). BTW, the program's source (for DJGPP/Allegro) is on my web page. I've got Linux/svgalib code for it, but I haven't posted it yet (too lazy). If anyone's interested, email me.
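
    The pairwise structure described above can be sketched like this. It is not the poster's program, just a generic direct-sum gravity step with hypothetical names (`accelerations`, `split_bodies`), unit gravitational constant, and no actual threads. It shows where the sharing comes from: each worker computes accelerations only for its own bodies, but has to read every body's position to do it.

```python
def accelerations(positions, masses, my_bodies, g=1.0):
    """Accelerations of the bodies in my_bodies due to all other bodies."""
    result = {}
    for i in my_bodies:
        xi, yi, zi = positions[i]
        ax = ay = az = 0.0
        # Every body in the system is read here, not just this worker's share.
        for j, (xj, yj, zj) in enumerate(positions):
            if i == j:
                continue
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            inv_r3 = (dx * dx + dy * dy + dz * dz) ** -1.5
            ax += g * masses[j] * dx * inv_r3
            ay += g * masses[j] * dy * inv_r3
            az += g * masses[j] * dz * inv_r3
        result[i] = (ax, ay, az)
    return result


def split_bodies(n_bodies, n_workers):
    """Round-robin assignment of body indices to workers."""
    return [range(w, n_bodies, n_workers) for w in range(n_workers)]


positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
masses = [1.0, 1.0, 1.0, 1.0]

whole = accelerations(positions, masses, range(4))
pieces = {}
for share in split_bodies(4, 2):
    pieces.update(accelerations(positions, masses, share))
```

    Each worker's share of the arithmetic shrinks as workers are added, while the number of positions it must read each step stays at N. Whether that traffic eats the gain depends on the machine's interconnect, which this sketch doesn't model.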
  • BeOS is already pervasively multithreaded, unlike almost any other OS out there. Its nature makes debugging your apps a pain in the ass, but allows a 95% increase in processing power if you add a second CPU. Or so I've been told.

    Multithreaded code isn't so hard to debug as long as you design your program very carefully in advance with multithreading in mind. It's when you take a program or API that was designed to be single-threaded and try to hack in the multithreading after the fact that things can get awful.
  • ummm... shouldn't that be 2038 (in yer sig)?
  • >hey josh, stop reading slashdot

    But I wasn't reading /. at the time! 8)
    Anyway, it's not THAT big of a time-sink,
    and I get a lot of valuable news from
    the site.
  • by Ungrounded Lightning ( 62228 ) on Thursday August 19, 1999 @04:58PM (#1736777) Journal
    If there's any OS out there that is more comprehensively thread oriented (which leads to more application threading) it must be proprietary.

    Out there currently, perhaps that's true. But looking back in computing history there's the T.H.E. multiprocessing system (by Dijkstra and Riddle), plus an arbitrary number of clones of it, typically living in embedded systems.

    I used one done by Mark Weiser, on a Nova, about 1975, and cloned my own onto an 8080 a few years later. Mine was a preemptive multitasking kernel (excluding drivers) a little over 500 bytes long. Add a console driver, a debugger, a network stack (not IP), real-time-clock processing, scheduled event interpreter, instrumentation drivers, a relay logic ladder-diagram interpreter, drivers to receive and send relay/contact signals from/to optoisolators, and a network daemon that downloaded schedules, read meters, examined relay states and stuck virtual screwdrivers in to force them, and it still came in under 2K bytes. This left the other 2K of ROM available for a description of a hysterically-large emulated-relay network.

    That sucker flew, too. With the one tweak I added it became exactly an implementation of "actors", perhaps a bit before they were formalized. If you're not familiar with them: Imagine a machine where every program is in C++, but where every instance of every class is a separate thread of execution, every complicated class has been split into a set of simpler classes with one thread-related member function each, every call to a thread-related member function is an intertask message - at about the cost of a subroutine call (with free queueing of multiple messages), and every thread-related member function (with all the non-thread-related subroutines it calls) can in principle run simultaneously (because they explicitly mutex when they must share a resource, and the free queueing makes such occasions extremely rare). Now pour all these tiny tasks into the machine, with a half-K kernel to orchestrate them.

    On a single processor machine the fact that the individual objects could run in parallel was an unused side-effect of a programming style that simplified writing programs to take maximum advantage of the tiny kernel. But with a more modern hardware platform, with a slightly more complicated kernel and perhaps a little hardware assist, the same style automatically produces a great pile of tiny, simple objects that can all be run in parallel on as many CPUs as you've got.
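
    For readers who haven't met the style, here is a minimal sketch of an actor: one object, one thread, one queued mailbox, and a single message-handling method. It is written with Python threads purely for illustration and has nothing to do with the 8080 kernel described above; `Actor`, `Doubler` and `Collector` are hypothetical names.

```python
import queue
import threading

_STOP = object()  # sentinel message that shuts an actor down


class Actor:
    """An object with its own thread that handles one queued message at a time."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        """Queue a message; the caller does not wait for it to be handled."""
        self._mailbox.put(message)

    def stop(self):
        """Let the actor finish everything already queued, then exit."""
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                return
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Collector(Actor):
    def __init__(self):
        self.items = []          # set up state before the thread starts
        super().__init__()

    def receive(self, message):
        self.items.append(message)


class Doubler(Actor):
    def __init__(self, downstream):
        self.downstream = downstream
        super().__init__()

    def receive(self, message):
        self.downstream.send(message * 2)   # the only coupling is a message send


collector = Collector()
doubler = Doubler(collector)
for n in range(1, 6):
    doubler.send(n)
doubler.stop()     # drains the doubler's mailbox first
collector.stop()   # then the collector's
```

    Each actor touches only its own state, so the two never need a lock; the mailbox is the one shared structure, and `queue.Queue` handles that synchronization.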

  • Because one uses gasoline, and the other one uses propane. Duh! ;-)
  • Just write it with an actor-based OOP style. Then it's automatically split into tiny, simple, parallelizable chunks.

    Instead of having to explicitly declare what's parallelizable, you explicitly declare what's interdependent. Typically that's a much smaller set - especially after the message-send/receive dependencies (which are automatically handled for you) are excluded.
