GFD writes
"EETimes has a nice overview of Sun's new MAJC architecture. Combines multiple processors on one chip with VLIW and on chip multi threading. " I've been seeing some information about thsi floating around, but EETimes has done a nice job summarizing the chip itself.
Re:It's an SMP, no less, quite a lot more. (Score:1)
This would normally mean a wasted chance to use the ALU. On a threaded chip, there is another instruction stream (with its own decoding pipeline and everything), so instead of stalling, the CPU switches decoding pipelines and keeps the ALUs busy. Eventually this thread will stall too, or time out or something, and the other thread gets a chance to run. Cool!
The point is that you replicate everything (I think) but the ALUs (so each thread has its own register file and whatnot), so context switches are instantaneous. Tera has a (vaporware?) supercomputer that switches threads after every instruction.
Ok, so you think this seems like a lot of duplicated silicon. Consider this: much of the silicon on a chip is devoted to not stalling the pipeline (register renaming, forwarding, speculation). If stalls become free, we can simplify the circuitry a lot, and crank up the clock speed A LOT (simpler circuitry => smaller die => higher speed). Cool! Double cool! No stalls and a faster clock. Think of all the technologies on a chip these days.
Ok, things are not entirely simple (hands up those who thought they were), but that's the gist of it. For example, we'll need hardware support for locks and communication, probably.
So why haven't we seen these chips before, if the news is old? I think some of the previous posters got it right -- it's only recently that we've started getting applications that are almost trivially parallelisable. One poster mentioned communication overhead; by providing annotations saying that these 10 threads want to be on the same CPU, we can be assured that they'll have very quick interthread communication.
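For the curious, here's a toy software model of the switch-on-stall idea. Everything here (the 10-cycle miss cost, the instruction names, the scheduling policy) is made up for illustration -- real hardware does the switch in a cycle or less:

```python
# Toy model of hardware multithreading: each thread has its own
# program counter (and, on the real chip, its own register file),
# but the threads share one set of ALUs. When the running thread
# stalls on memory, the "chip" switches to a ready thread instead
# of idling. All names and numbers are illustrative, not Sun's.

STALL_CYCLES = 10  # pretend a cache miss costs 10 cycles

def run(threads, cycles):
    """threads: list of instruction streams; each instruction is
    'alu' (one cycle) or 'load' (stalls the issuing thread)."""
    stalled_until = [0] * len(threads)  # per-thread wakeup time
    pc = [0] * len(threads)             # per-thread program counter
    busy = idle = 0
    current = 0
    for t in range(cycles):
        ready = [i for i in range(len(threads))
                 if stalled_until[i] <= t and pc[i] < len(threads[i])]
        if not ready:
            idle += 1  # nothing to issue: the ALU sits idle
            continue
        # keep running the current thread if possible; otherwise
        # do a zero-cost switch to another ready thread
        current = current if current in ready else ready[0]
        op = threads[current][pc[current]]
        pc[current] += 1
        busy += 1
        if op == 'load':
            stalled_until[current] = t + STALL_CYCLES
    return busy, idle

# A thread that stalls every other instruction leaves the ALU idle
# most of the time on its own; add a second copy and many of the
# idle cycles are soaked up by the other thread.
prog = ['load', 'alu'] * 8
print(run([prog], 200))        # one thread: mostly idle
print(run([prog, prog], 200))  # two threads: much less idle
```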
Johan
But will it be usable by anybody but Sun? (Score:1)
It's an SMP, no less, not much more. (Score:1)
No need to complicate this debate with statements like `this will be *useless* unless we solve the dynamic autothreading problem of single apps!'. SMP is useful for running unrelated threads/apps concurrently today. Tomorrow's single chip form will be just as useful.
The only complication I see this chip introducing is that, because the two instruction streams share the same resource units (FPUs, barrel shifters, ALUs), heavy use of resources by one of the threads will slow the other thread down. Traditional SMPs don't have this problem.
That's fine by me. There are some pretty neat scheduling advantages to that restriction. For example, if the chip is designed right, one could schedule thread #1 to get 75% of the instruction microcycles while #2 gets 25%, with whatever microcycles each can't use going to the other. Alternatively, one could schedule thread #1 to get 100% of the microcycles while #2 gets whatever of those microcycles #1 wasn't able to use.
Ahhhhhh, heaven. True realtime scheduling at the instruction level, with no overhead. Who could ask for more? No wonder this chip is directed at the embedded market. They could really use this feature.
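A rough software sketch of that 75%/25% work-conserving split. The credit scheme and all numbers here are my own illustration, not anything Sun has described:

```python
# Toy sketch of the 75%/25% idea: a credit-based, work-conserving
# issue scheduler. Thread identities and fractions are illustrative.

def schedule(cycles, share, can_issue):
    """share: fraction of cycles reserved for thread 0.
    can_issue[i](t): whether thread i has a ready instruction at
    cycle t. A reserved cycle its owner can't use falls through
    to the other thread (work conservation)."""
    log = []
    credit = 0.0
    for t in range(cycles):
        credit += share
        # thread 0 owns this cycle once it has a full credit
        owner = 0 if credit >= 1.0 else 1
        if owner == 0:
            credit -= 1.0
        if can_issue[owner](t):
            log.append(owner)            # owner uses its cycle
        elif can_issue[1 - owner](t):
            log.append(1 - owner)        # other thread mops it up
        else:
            log.append(None)             # both stalled: wasted
    return log

# Thread 0 is entitled to 3 of every 4 cycles, but stalls on odd
# cycles; thread 1 is always ready and absorbs the slack.
stalls_half = lambda t: t % 2 == 0
always = lambda t: True
log = schedule(100, 0.75, [stalls_half, always])
print(log.count(0), log.count(1))
```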
A bit of history: I believe this idea was floated at Xerox PARC, at the time those folks were inventing mice and desktops and shared printers.
Re:National speed limit abolished (Score:2)
BeOS and a multi-threading chip (Score:1)
A multi-threading chip would be useless in a non-multi-threading environment (e.g. DOS). This sort of chip is responding to the prevalence of multi-threading, multitasking operating systems and applications. Since the BeOS is the furthest anyone's gone in making an OS multithreaded and multitasking, to my mind this chip would run the BeOS like teflon-coated lightning.
Think about it - the best way to get maximum performance out of this architecture is to have lots of 'small' threads, so that as many threads as possible are available for immediate execution should one stall. If there's any OS out there that is more comprehensively thread-oriented (which leads to more application threading), it must be proprietary.
But enough of that pipe-dream - let's get back to reality. Be won't dedicate engineering efforts to a new chip whose market is unproven without backup. PC owners won't abandon their investment just for some pretty new architecture which is basically incompatible (as far as we know) with their existing hardware. And the whole thing will stagnate because no-one can start a market big enough to get the software backing (the Hardware-Software Paradox).
It's on the wish-list somewhere, but I'm not selling the Celeron just yet.
Ebay quandary (Score:1)
The moral: with an internet company, you have to spend the money and "bet the farm".
_damnit_
Re:BeOS and a multi-threading chip (Score:1)
Re:thsi (Score:1)
*&^%$#@! anyway, you knew what he meant, didn't you?
*&^%$#@!
1 SIMP0 TH1GHZ!!!!!!
P33PL3 USU4LLY KN0W WHUT 1 MEAN T00!!!!!!!!%!%!%!!!!!
BUT S0MET1M35 THAY SAY, 'B1FF SUX!!!` WHYYYYY???????
:WQ
------ ------ ------
ALL HA1L B1FF, TH3 M05T 31337 D00D!!!!!1
Re:National speed limit abolished (Score:1)
limits do not apply to private property. Period.
I can drive as fast as I wish on _My_ property.
PeterT
Re:More links + some analysis (Score:2)
I have no idea if Sun plan to do redundancy checking with multiple pipelines with the UltraSparc-V. (some IBM, and other, chips do this...) They might do it as an option, but I would currently guess they're doing it mostly for performance.
EBay's reliability problems seem mostly related to poor management decisions rather than to EBay's (or Sun's or Oracle's) techs. Doing the above kind of checking wouldn't have helped EBay either. It doesn't matter what OS you use: if you have a screwed-up setup, you'll get problems. And you'd be surprised/horrified at just how long it can take a screwed setup to be fixed once the site's already gone live. (I know from experience. And no, it wasn't my screwed-up setup.)
Re:It's an SMP, no less, quite a lot more. (Score:1)
For the simple reason that, though a multithreaded machine is faster in the aggregate, though it makes more efficient use of chip resources, though it promotes fast context switching at the microinstruction level, any single thread will run *slower* on such a chip than on a chip optimized to run a single thread. This kills benchmark results for all the typical highly publicized benchmarks.
Until now, no one wanted to run a chip which had a lower benchmark rating. Nowadays, there is a greater appreciation for multiprocessing, and, due to the high performance of today's chips, the single-thread benchmark race is finally loosening its grip on the minds of purchasing agents and of the computing public.
Joe
pipelining processes: BINGO!!! (Score:3)
Three words come to mind: HIT, NAIL, and HEAD. :-)
To give an example from a paper I'm a coauthor on (being presented at ICSPAT'99), consider a JPEG decoder. Here's a quick overview of the bulk of a JPEG decoder: for each 8x8 block, Huffman decode the block's coefficients, dequantize them, run the inverse DCT, then color convert and store the pixels.
On a deeply pipelined / highly parallel processor, this is horribly inefficient, because each task is very small when applied to only one block, whereas switching between tasks is quite expensive. But that's exactly what a lot of JPEG decoders do (including the Independent JPEG Group's decoder). The decoder is a lot easier to write that way, but it is not nearly as efficient as it could be.
Instead, you want to batch things up as much as possible: Huffman decode a whole row of blocks, then dequantize them all, then IDCT them all, then color convert the whole row.
Now, you can make massive gains in efficiency due to better instruction cache locality, better parallelism across loop iterations due to the fact you're actually looping quite a bit now, and so on. (The wins are rather dramatic on a DSP which relies on programmed DMAs to move data on and off chip.)
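Here's the shape of the two loop structures in miniature. The stage names are hypothetical, with identity stand-ins for the real Huffman/IDCT/etc. work so the sketch runs (the actual IJG decoder is C; this is just the structure of the idea):

```python
# Stand-in stage functions so the sketch runs; the real stages do
# Huffman decoding, dequantization, inverse DCT, color conversion.
def huffman_decode(b): return b
def dequantize(c): return c
def idct(c): return c
def color_convert(p): return p

def decode_interleaved(blocks):
    # One block at a time: tiny tasks, constant task switching,
    # poor instruction-cache locality on a deep pipeline.
    out = []
    for b in blocks:
        coeffs = huffman_decode(b)
        coeffs = dequantize(coeffs)
        pixels = idct(coeffs)
        out.append(color_convert(pixels))
    return out

def decode_batched(blocks):
    # Batch each stage over a whole row of blocks: each stage
    # becomes a real loop, the caches stay warm, and each stage
    # could later become its own parallel-executing thread.
    coeffs = [huffman_decode(b) for b in blocks]
    coeffs = [dequantize(c) for c in coeffs]
    pixels = [idct(c) for c in coeffs]
    return [color_convert(p) for p in pixels]
```

Both produce identical output; only the loop structure (and hence the cache and pipeline behavior) differs.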
What's nice about a system with parallel processing units (whether multiprocessor or multithreaded) is that each stage in the pipeline can become another parallel-executing thread. Indeed, that was one common way to program the TMS320C80 family DSPs, which had 2 or 4 DSPs on one chip, alongside a fairly strong RISC CPU ... all on one die! The DSPs would be organized as a pipeline, communicating through a "crossbar" to shared on-chip SRAM. The RISC CPU would coordinate tasks and issue commands to the DSPs. It was really quite cool.
--Joe--
Re:hmmm, fascist (Score:1)
Tera MTA (Score:1)
FYI - I don't work for Tera or SDSC.
thsi (Score:1)
rarrr! (Score:1)
More links + some analysis (Score:3)
MAJC home page [sun.com]. See the docs home page [sun.com] - introduction, and a "community" page [sun.com].
They haven't really released enough details (on their website) just yet, but it does look interesting. One of the more obviously different attitudes the specification takes is highly customisable implementations - you design a variation targeted at a particular application, whatever that might be: graphics accelerator, MP3 player/decoder, MPEG2/DVD decoder, or a more general-purpose chip. Since it is mostly being targeted at embedded applications, this is not surprising though.
Some other interesting aspects include:
'Support' for JIT/access-time compilers - not only does this help Java, it should also make backwards compatibility with older versions quite simple. This seems a bit like what Transmeta are doing - a company co-founded by an ex-Sun guy, btw.
Hardware support for ultra-fast thread switching - so fast that if one thread stalls waiting for a DRAM access (which can take up to 100 clock cycles), you can switch to another thread rather than go idle. On many current OSes, threads are only switched if the current one has to do some slow I/O (i.e. read from disc) - so this is quite an improvement.
A more general approach to improving parallelism - you can have more than one CPU core in a single physical chip, which might or might not share their first-level caches. (Read this Microprocessor Report [mdronline.com] article for some background on this.) IBM are apparently going to do a version of the PowerPC G4 which has 2 CPU cores on one chip, and I kinda suspect Sun might be planning something similar for their UltraSparc-V.
I'm not sure how Sun plan to make money off the design. It seems pretty likely they'll do something like their "community source" model - you can get the design for free, but if you want to use it commercially you pay a license. ARM is doing well just licensing their CPU designs. I'd imagine Sun using it to 'assist' their servers, as add-on boards for doing heavy multimedia/3D graphics stuff - can you say "render farm"? Also, since Sun like selling their servers, they'd be happy for people to make lots of little, cheap devices that connect to nice big Sun servers.
Like the original poster said, IEEE Micro will probably have some interesting stuff, but it seems Sun aren't releasing all the details yet - looks like we'll have to wait until the Microprocessor Forum in October. I liked the article (written by the Sun engineers) about the UltraSparc-III - not only was it interesting (and I like Sun's approach), it helped me figure out the inherent problem with the IA-64 architecture...
Holy grail?! (Score:1)
I'm really surprised by this statement. It seems like a lot of the CPU-bound things that people run these days are easily multi-threadable, e.g. games, raytracing, image processing, even some aspects of compilers. Obviously people can just as easily name apps that aren't parallelizable, but there's already plenty of code out there just dying to run on more than one processor.
What kinds of commonly-run CPU-bound apps aren't threadable, which are giving these guys so much grief?
No diff (Score:1)
This chip would seem to take the pressure off the OS, and hence off the programmers. *whew*
Thread level data speculation (Score:1)
So here's my shameless plug for CMU research: what we need is hardware support to make it easier to write threaded programs. One approach is thread-level data speculation. In this system, one thread executes normally while other threads execute speculatively, basically assuming that the parallel execution will be safe and correct. The processor is responsible for detecting conflicts between threads that mean the optimistic parallel execution is not correct. When there is a conflict, the speculating thread that caused the conflict is killed and its speculative state is thrown away. It's not as hard to do this as you might think; it seems possible to do it by adding some tags to the data caches on each processor.
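A toy software rendition of the idea. Real TLDS does the conflict detection in the cache-coherence hardware; the function and names here are purely illustrative:

```python
# Toy model of thread-level data speculation: the speculative
# thread runs against a private copy of memory while its reads
# and writes are recorded. If the earlier (non-speculative)
# thread turns out to write an address the speculator read,
# that's a conflict: the speculative state is squashed.
# Otherwise the speculative writes are committed.

def run_speculative(mem, safe_writes, spec_fn):
    shadow = dict(mem)          # speculative thread's private copy
    reads, writes = set(), {}
    def load(addr):
        reads.add(addr)         # record the read for conflict checks
        return shadow[addr]
    def store(addr, val):
        shadow[addr] = val
        writes[addr] = val      # record the speculative write
    spec_fn(load, store)        # run the speculative work
    # apply the earlier thread's writes, then check for conflicts
    conflict = any(addr in reads for addr, _ in safe_writes)
    for addr, val in safe_writes:
        mem[addr] = val
    if conflict:
        return mem, False       # squash: discard speculative state
    mem.update(writes)          # commit only the speculative writes
    return mem, True

# Speculative thread reads x and writes y...
spec = lambda ld, st: st('y', ld('x') + 10)
# ...while the earlier thread writes x: a true dependence, so the
# speculation is squashed and could be re-run with the new x.
_, committed = run_speculative({'x': 1, 'y': 0}, [('x', 5)], spec)
print(committed)  # False: conflict detected
```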
See here for more:
http://www.cs.cmu.edu/~tcm/STAMPede.html
The Stanford Hydra project does something like this too, BTW.
Interesting... but will it run Linux? (Score:1)
Will it be cheap?
Enquiring minds want to know... (and benchmark)
MU has some stories... (Score:1)
Sign of the times (Score:1)
Interesting times we live in: I read this sentence, and immediately my mental English parser interpreted the typo "thsi" as an acronym and went to work on translating it.
("THreaded Semiconductor Integrated circuit" was my interpretation before I realized it was an error.)
Re:Holy grail?! (Score:1)
Re:Holy grail?! (Score:1)
Re:No diff (Score:1)
Multithreaded code isn't so hard to debug as long as you design your program very carefully in advance with multithreading in mind. It's when you take a program or API that was designed to be single-threaded and try to hack in the multithreading after the fact that things can get awful.
Re:hmmm, fascist (Score:2)
Re:hey josh, stop reading slashdot (Score:1)
But I wasn't reading! Anyway, it's not THAT big of a time-sink, and I get a lot of valuable news from the site.
J05H
GPL'ed Freedom CPU here. Porters welcome. (Score:1)
Multi-threaded OS (Score:3)
Out there currently, perhaps that's true. But looking back in computing history there's the T.H.E. multiprogramming system (by Dijkstra and Riddle), plus an arbitrary number of clones of it, typically living in embedded systems.
I used one done by Mark Weiser, on a Nova, about 1975, and cloned my own onto an 8080 a few years later. Mine was a preemptive multitasking kernel (excluding drivers) a little over 500 bytes long. Add a console driver, a debugger, a network stack (not IP), real-time-clock processing, a scheduled event interpreter, instrumentation drivers, a relay logic ladder-diagram interpreter, drivers to receive and send relay/contact signals from/to optoisolators, and a network daemon that downloaded schedules, read meters, examined relay states and stuck virtual screwdrivers in to force them, and it still came in under 2K bytes. This left the other 2K of ROM available for a description of a hysterically-large emulated-relay network.
That sucker flew, too. With the one tweak I added, it became exactly an implementation of "actors", perhaps a bit before they were formalized. If you're not familiar with them: imagine a machine where every program is in C++, but where every instance of every class is a separate thread of execution; every complicated class has been split into a set of simpler classes with one thread-related member function each; every call to a thread-related member function is an intertask message - at about the cost of a subroutine call (with free queueing of multiple messages); and every thread-related member function (with all the non-thread-related subroutines it calls) can in principle run simultaneously (because they explicitly mutex when they must share a resource, and the free queueing makes such occasions extremely rare). Now pour all these tiny tasks into the machine, with a half-K kernel to orchestrate them.
On a single processor machine the fact that the individual objects could run in parallel was an unused side-effect of a programming style that simplified writing programs to take maximum advantage of the tiny kernel. But with a more modern hardware platform, with a slightly more complicated kernel and perhaps a little hardware assist, the same style automatically produces a great pile of tiny, simple objects that can all be run in parallel on as many CPUs as you've got.
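That style can be sketched in a few lines. This is a toy, single-threaded rendition (illustrative names, and Python rather than C++ or 8080 assembly): every object owns a mailbox, a send is just a cheap enqueue, and a tiny "kernel" loop steps whichever actors have messages waiting:

```python
# Minimal sketch of the actor style described above: every object
# has a mailbox and one message handler; sends cost about as much
# as a subroutine call, with queueing of multiple messages free.
from collections import deque

class Actor:
    def __init__(self):
        self.mailbox = deque()
    def send(self, msg):
        self.mailbox.append(msg)   # ~cost of a subroutine call
    def step(self):
        # run one queued message, if any; the kernel calls this
        if self.mailbox:
            self.handle(self.mailbox.popleft())
            return True
        return False

class Counter(Actor):
    def __init__(self, sink):
        super().__init__()
        self.n, self.sink = 0, sink
    def handle(self, msg):
        self.n += msg
        self.sink.send(self.n)     # intertask message, not a call

class Printer(Actor):
    def handle(self, msg):
        print('total:', msg)

# The half-K kernel's whole job: keep stepping ready actors until
# every mailbox is empty. On real parallel hardware, independent
# actors could be stepped on different CPUs.
sink = Printer()
counter = Counter(sink)
counter.send(2)
counter.send(3)
actors = [counter, sink]
while any(a.step() for a in actors):
    pass
```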
Re:It doesn't make sense (Score:2)
Use the right style and it's trivial. (Score:1)
Instead of having to explicitly declare what's parallelizable, you explicitly declare what's interdependent. Typically that's a much smaller set - especially after the message-send/receive dependencies (which are automatically handled for you) are excluded.