Octopiler to Ease Use of Cell Processor
Sean0michael writes "Ars Technica is running a piece about The Octopiler from IBM. The Octopiler is supposed to be a compiler designed to handle the Cell processor (the one inside Sony's PS3). From the article: 'Cell's greatest strength is that there's a lot of hardware on that chip. And Cell's greatest weakness is that there's a lot of hardware on that chip. So Cell has immense performance potential, but if you want to make it programmable by mere mortals then you need a compiler that can ingest code written in a high-level language and produce optimized binaries that fit not just a programming model or a microarchitecture, but an entire multiprocessor system.' The article also has several links to some technical information released by IBM."
So don't hire mere mortals (Score:5, Funny)
Re:So don't hire mere mortals (Score:2, Funny)
Hmph. "Real Programmers" needing a bleedin' assembler to tell them what their bleedin' instructions mean? Why, back in my day we had to write our programs in machine language. We saved our work by means of a small bar magnet held a short distance above a hard disk platter. And we had to pay for our own bytes.
You had machine language? (Score:3)
Re:So don't hire mere mortals (Score:3, Funny)
Zeus was booked, Apollo was out of town, Hermes is still learning, Poseidon just signed a 500-year agreement with Apple, and Ares was killed off in God of War, so most of the good non-mortal programmers were out of the question. Hades claims to be a writer instead of a programmer, but most of the plot lines he comes up with end up with everyone dead.
Re:So don't hire mere mortals (Score:3, Funny)
Re:So don't hire mere mortals (Score:2)
Re:So don't hire mere mortals (Score:2)
Hamlet wasn't that bad. Besides, on some programming projects, having everyone dead is a blessing. It's the ghost of previous projects that continues to haunt the living.
Re:So don't hire mere mortals (Score:2)
"We always coded in assembler. We never let the compiler do all the work for us."
I crapped myself right then and there.
Re:So don't hire mere mortals (Score:2)
Re:So don't hire mere mortals (Score:2)
Re:So don't hire mere mortals (Score:4, Funny)
Must have 5 years experience coding in Assembly for the IBM Cell processor
Re:So don't hire mere mortals (Score:2)
Re:So don't hire mere mortals (Score:2)
And make sure that (Score:3)
[/kfg [slashdot.org] mode off]
Makes you wonder (Score:5, Insightful)
Re:Makes you wonder (Score:2, Interesting)
Actually, it sounds like they still don't have one, just some ideas on how to make one someday.
No, it's there alright (Score:5, Informative)
Not really (Score:2)
according to the article, the compiler's still in early stages of development...
Re:Makes you wonder (Score:2)
Hello, Itanium... (Score:5, Insightful)
Re:Hello, Itanium... (Score:3, Insightful)
From TFA:
"I say "intended to become," because judging from the paper the guys at IBM are still in the early stages of taming this many-headed beast. This is by no means meant to disparage all the IBM researchers who have done yeoman's work in their practically single-handed attempts to move the entire field of computer science forward by a quantum leap. No, the Octopiler paper is full of
Re:Hello, Itanium... (Score:4, Informative)
They didn't call it the Itanic for nothing...
Re:Hello, Itanium... (Score:4, Insightful)
Fortunately for IBM and Sony, games are one place where hand-optimizing certain algorithms is still practical. I doubt they will place all their eggs in the Octopiler basket. I can't imagine a compiler will find that much parallelism in code that isn't explicitly written to be parallel. Personally, I think they should instead focus on explicitly parallel libraries for common game algorithms like collision detection.
what are you talking about? (Score:2)
That's kind of a weird comparison given the differences in innovation, demonstrated results and company attitudes.
IBM's Cell is a much more radical break from previous chips like Itanium [wikipedia.org], but the CES demo was reported to be very impressive. IBM has already released the SDK [slashdot.org] and openly published all specifications [slashdot.org]. The pace of development has been very rapid and people are predicting th [linuxinsider.com]
Re:what are you talking about? (Score:2)
Sorry, you lost all credibility there. The Cell is a single core with a bunch of DSPs tacked on. It's a great replacement for a general-purpose PowerPC in many embedded applications, but won't touch Intel's target market any time soon. In the year and a half since that article was written we've learned how much Intel and AMD can do to keep ahead of the game and how applicable to general-purpose computing the
Re:what are you talking about? (Score:3, Interesting)
It's not quite as clean as it looks. "Full specifications" doesn't include any information on instruction latencies, cache performance, etc. They've documented the platform itself, but not the specific implementation. This makes optimization difficult.
I've had to distill information from several publications to determine even basic things like ho
Don't be a revisionist (Score:3, Interesting)
HP most certainly have not dumped it. If anything they're pushing harder than ever. All I hear from HP these days is Itanium, Itanium, Itanium .... and I've been to a few HP pre-sales events in the last couple of months where they've been pushing it very hard. In a few months they'll be revising their I
Sadly, not a lotta FPU hardware. (Score:5, Insightful)
'Cell's greatest strength is that there's a lot of hardware on that chip. And Cell's greatest weakness is that there's a lot of hardware on that chip.
Sadly, there's almost no FPU hardware to speak of: 32-bit single precision floats in hardware; 64-bit double precision floats are [somehow?] implemented in software and bring the chip to its knees [wikipedia.org].
Why can't someone invent a chip for math geeks? With 128-bit hardware doubles? Are we really that tiny a proportion of the world's population?
Re:Sadly, not a lotta FPU hardware. (Score:2)
Therefore an even smaller portion of an already small population.
Re:Sadly, not a lotta FPU hardware. (Score:2)
Perhaps you meant long double precision. Math geeks who can live with 32-bit floating point precision are also a small subset -- most of those who do heavy math (not pixel processing) pretty much require 64-bit double precision. And that is not available in hardware from Cell (come to think of it, not for AltiVec, either).
Quad precision (Score:2)
I've also implemented a simple double-double (represents numbers as an unevaluated sum of two non-overlapping doubles) arithmetic in CL. It was ~25% as fast as doubles (mostly branchless; each op expands into ~2-8 double precision ops). That gives an upper bound on the slowdown ratio for the emulation of doubles with singles.
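For the curious, here is a minimal C sketch of the same non-overlapping-sum idea (Knuth's error-free two-sum); the names and structure are illustrative, not the CL code described above:

    /* Minimal double-double in C: a value is the unevaluated sum hi + lo
     * of two non-overlapping doubles.  two_sum is Knuth's error-free
     * addition; a real library handles multiplication, division, etc. */
    #include <stdio.h>

    typedef struct { double hi, lo; } dd;

    /* Error-free: hi + lo == a + b exactly. */
    static dd two_sum(double a, double b) {
        double s = a + b;
        double v = s - a;
        double e = (a - (s - v)) + (b - v);   /* rounding error of a+b */
        return (dd){ s, e };
    }

    /* Add a plain double to a double-double, then renormalize. */
    static dd dd_add(dd x, double y) {
        dd s = two_sum(x.hi, y);
        return two_sum(s.hi, s.lo + x.lo);
    }

    int main(void) {
        dd acc = { 1.0, 0.0 };
        acc = dd_add(acc, 1e-30);   /* far below 1 ulp of a double at 1.0 */
        printf("hi=%g lo=%g\n", acc.hi, acc.lo);   /* lo preserves 1e-30 */
        return 0;
    }

The same two-sum trick works with pairs of singles, which is what emulating doubles on single-precision-only hardware comes down to.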
Re:Quad precision (Score:2)
I corresponded with a Sparc designer. (Score:2)
I corresponded with the Sparc designer about this very question, because LabVIEW supports a 128-bit "quad-precision" double for Sparc platforms:
I sent some email back and forth with one of the dudes on the Sparc design team, and he said that Sparc's 128-bit quad-precision double is a purely software implementation.
Compare e.g.
Re:Sadly, not a lotta FPU hardware. (Score:3, Insightful)
Re:Sadly, not a lotta FPU hardware. (Score:4, Interesting)
You wish. In a big 32-bit game world, effort has to be made to re-origin the data as you move. Suppose you want vertices to be positioned to within 1cm (worse than that and you'll see it), and you're 10km from the origin. The low order bit of a 32-bit floating point number is now more than 1cm.
It's even worse for physics engines, but that's another story.
If the XBox 360 had simply been a dual- or quad-core IA-32, life would have been much simpler for the game industry.
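A rough C sketch of that re-origining idea (purely illustrative): keep world positions in doubles, subtract the camera position while still in double precision, and hand only the small camera-relative offsets to the float pipeline:

    /* Store world positions in double, subtract the camera position while
     * still in double so the big magnitudes cancel, and pass only the
     * small relative offset on to 32-bit float rendering math. */
    #include <stdio.h>

    typedef struct { double x, y, z; } vec3d;   /* world space  */
    typedef struct { float  x, y, z; } vec3f;   /* render space */

    static vec3f to_camera_space(vec3d w, vec3d cam) {
        return (vec3f){ (float)(w.x - cam.x),
                        (float)(w.y - cam.y),
                        (float)(w.z - cam.z) };
    }

    int main(void) {
        vec3d cam    = { 10000.0,   0.0, 0.0 };  /* 10 km from origin */
        vec3d vertex = { 10000.004, 0.0, 0.0 };  /* 4 mm past the cam */
        float naive  = (float)vertex.x - 10000.0f;  /* snapped to the
                                                       float grid      */
        vec3f local  = to_camera_space(vertex, cam);
        printf("naive: %.6f  re-origined: %.6f\n", naive, local.x);
        return 0;
    }

The naive version quantizes the vertex to the float grid at 10 km before the subtraction; the re-origined version keeps the full sub-millimeter offset.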
Re:Sadly, not a lotta FPU hardware. (Score:2, Interesting)
Actually, what I can't figure out is why you want floating point at all. Floating-point data stores a certain number of bits of actual data, and a certain number of bits as a scaling factor. To use your example, this would mean that while items near the origin would be picture-perfect, the object 10km away would be out by well more than a cm.
Back when integer arithmetic was so much faster than floating point that it was worth the effort, game coders used to use fixed-point arithmetic. This kept a uniform
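A hypothetical 16.16 fixed-point sketch of that uniform-precision arithmetic in C (the names are illustrative):

    /* 16.16 fixed point: 16 integer bits, 16 fractional bits.  Absolute
     * precision is a uniform 1/65536 across the whole representable range,
     * unlike floats, whose spacing grows with magnitude. */
    #include <stdint.h>
    #include <stdio.h>

    typedef int32_t fix16;                    /* value = raw / 65536.0 */

    #define FIX_ONE (1 << 16)

    static fix16 fix_from_double(double d) { return (fix16)(d * FIX_ONE); }
    static double fix_to_double(fix16 f)   { return f / (double)FIX_ONE; }

    static fix16 fix_mul(fix16 a, fix16 b) {
        /* widen to 64 bits so the intermediate product can't overflow */
        return (fix16)(((int64_t)a * b) >> 16);
    }

    int main(void) {
        fix16 x    = fix_from_double(10000.0);  /* "10 km", same 1/65536
                                                   steps as near zero   */
        fix16 step = fix_from_double(0.01);     /* ~1 cm                */
        printf("%f\n", fix_to_double(x + step));
        printf("%f\n", fix_to_double(fix_mul(fix_from_double(1.5),
                                             fix_from_double(2.0))));  /* 3.0 */
        return 0;
    }
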
Re:Sadly, not a lotta FPU hardware. (Score:3, Interesting)
Yes it is, as long as you're willing to put a few seconds of thought into it (or just google [gameprogrammer.com] for the answer).
Re:Sadly, not a lotta FPU hardware. (Score:2)
-Ack
Re: (Score:2)
Re:Sadly, not a lotta FPU hardware. (Score:2, Informative)
Those machines are Cray Vector Processors, MIPS R8K and later, DEC Alpha, HP/Intel Itanium, IBM Power 4/5/n, IBM Vector Facility for the 3090, etc.
Notice how many of those you see every day, and how many fewer of those you can still buy.
Yes, unfortunately, you are that tiny a proportion of the world pop. I had hoped by this point that we'd have Cray Vector Proces
Re:Sadly, not a lotta FPU hardware. (Score:2)
YES! Re:Sadly, not a lotta FPU hardware. (Score:2)
Yes, in fact you are a really tiny proportion of the world's population!
Re:Sadly, not a lotta FPU hardware. (Score:4, Funny)
You math geeks need to multiply. :)
Re:Sadly, not a lotta FPU hardware. (Score:2)
Because the math geeks won't pay for the fab plants.
Yes. You're the math geek - you do the math.
Re:Sadly, not a lotta FPU hardware. (Score:2)
Each SPU can do 2 DP FMACs (in one vector) in 6 cycles, not pipelined, at 3.2 GHz -- roughly 4 flops every 6 cycles, or about 2 GFLOPS per SPU.  Then you can add the single pipelined DP FMAC unit in the PPE.
Sure, it's an order of magnitude less than SP, but it's not that anemic. And if I weren't still under NDA, I could speculate about what IBM/Sony might be doing about that situation. But I won't.
Oh and back on topic, I used to work at IBM on compilers, and I recognized some of the names on the list of authors of the
Check out William Kahan at UC-Berkeley. (Score:4, Informative)
What benefit does increasing the precision of floats to 128 bits bring? 64 bits are more than enough for 99.9999% of cases, and the rest can be handled in software emulation. You still can't solve (without massive growth of the error terms) an equation system described by a Hilbert matrix using Gaussian elimination no matter how many bits you make the mantissa -- the condition number of the n-by-n Hilbert matrix grows roughly exponentially in n, so a few extra mantissa bits only buy you a slightly larger n.
Check out some of Professor Kahan's shiznat at UC-Berkeley:
In particular, look at the pictures of "Borda's Mouthpiece" [page 13] or "Joukowski's Aerofoil" [page 14] in the following PDF document. As I understand it, the "wrong" pictures are computed using Java's strict 64-bit requirement; the "right" pictures are computed by embedding the 64-bit calculation within Intel/AMD 80-bit extended doubles, performing the calculations in 80 bits worth of hardware, and then rounding back down to 64 bits to present the final answer.

MORAL OF THE STORY: Precision matters. You can never have enough of it.
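Here is a hedged C miniature of the same trick; it assumes long double is the x86 80-bit extended format, which is not guaranteed everywhere, so treat it as a sketch of the idea rather than a portable guarantee:

    /* Carry intermediates in extended precision, round to double only
     * at the end. */
    #include <stdio.h>

    int main(void) {
        const int n = 10000000;
        double d_acc = 0.0;
        long double x_acc = 0.0L;
        for (int i = 0; i < n; i++) {
            d_acc += 0.1;       /* rounding error accumulates in double */
            x_acc += 0.1L;      /* the extra bits absorb most of it     */
        }
        printf("double accumulator:     %.15f\n", d_acc);
        printf("extended, then rounded: %.15f\n", (double)x_acc);
        printf("one-rounding reference: %.15f\n", n * 0.1);
        return 0;
    }
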
Re:Check out William Kahan at UC-Berkeley. (Score:2)
The two plots you point out aren't really examples of precision errors. Rather, they are errors brought about by not tracking the distinction between "positive 0" and "negative 0." You'll have this problem to some degree no matter how many bits of precision you've got if you don't track the sign of your numbers that round to 0.
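A small C demonstration, assuming IEEE 754 semantics: the two zeros compare equal, yet they select different sides of a branch cut:

    /* +0.0 and -0.0 compare equal, but functions with branch cuts
     * distinguish them -- which is exactly what the Borda's-mouthpiece
     * pictures are about. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double pz = +0.0, nz = -0.0;
        printf("pz == nz: %d\n", pz == nz);                /* 1 (equal) */
        printf("atan2(+0, -1) = %f\n", atan2(pz, -1.0));   /*  pi       */
        printf("atan2(-0, -1) = %f\n", atan2(nz, -1.0));   /* -pi       */
        printf("1/+0 = %f, 1/-0 = %f\n", 1.0 / pz, 1.0 / nz); /* +-inf  */
        return 0;
    }
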
Re:Check out William Kahan at UC-Berkeley. (Score:4, Informative)
Gods.
This is eight years old (1998), and has been fixed for five years.
FIVE YEARS. Join the 21st century, for god's sake.
java.lang.StrictMath
How long will people repeat this, even though it's been fixed for five years, since Java 1.3? The latest beta VM is 1.6...
Octointerpreter (Score:3, Interesting)
Re:Octointerpreter (Score:2)
As for not hiding reconfigurability: you can buy anything you desire as an add-on board, like an FPGA board or an array processor. People don't use them a lot because they are a pain to program.
Am I ignorant or . . . (Score:2)
Re:Am I ignorant or . . . (Score:2)
Doesn't work that way... (Score:2)
It's similar with programming. Instead of saying, this is a car, and it goes in that world, and
Is it just me or... (Score:2)
Surely it was screaming at them that this isn't something that's meant to be released so soon. I mean, the compiler has four tiers of 'optimisation', which are meant for the programmers to set so the compiler doesn't make a mess of their memory-management code if they manage memory correctly, or
Re:Is it just me or... (Score:2)
Not really, it's future-proofing. For the initial release titles it can be used as, essentially, a still-pretty-powerful single-core machine, and as the programmers get to grips with how to get the most out of the Cell architecture, and better tools come out, the titles will keep getting better.
Re:Is it just me or... (Score:2)
[1] Much as I like Erlang, it would not actually be quite suitable for the Cell.
vcl v2 (Score:2)
A new era in performance breakthroughs? (Score:2)
A change in architecture this radical should at least provide accelerated growth from introduction through the next several years, which I'm sure will provide added incentive for those involved in compiler optimization -- finally, some real enhancements.
Yay! A new generation, FINALLY! (Score:3, Interesting)
Re:Yay! A new generation, FINALLY! (Score:2)
It will be interesting to compare the Cell with the UltraSPARC T1 (Niagara). They both have about 8 cores (T1 is 8 cores, Cell is 8+1), but the T1 can do 32 threads of execution simultaneously. The Cell has good floating point performance, but the T1 only has 1 FPU for all 8 cores (it's specifically not designed for FP performance). The T1 has very low power requirements, at about 72 watts (79 peak), while (as far as I can tell fr
Will you need a modchip to make full use of it? (Score:2)
This is possibly one of the best things they could have done to help the cell. By doing this, you make open source developers happy and more inclined to port over their applications.
It's too bad that the only popular commercial implementation of the Cell processor for several years is going to be in a machine with a lockout chip, a technical measure that prohibits end users from compiling Free software on the machine. Otherwise, game developers could develop a Free engine subsidized by keeping game asse
Re:A new era in performance breakthroughs? (Score:2)
At the last PDC, Microsoft announced some very exciting ideas it is looking to propose for the next C++ standard that would give language support for parallelism, essentially letting you do
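Whatever form that proposal ends up taking, loop-level data parallelism of this kind already has a less integrated expression in OpenMP; a minimal C sketch (not the Microsoft proposal the comment describes):

    /* One annotation turns a serial loop into a parallel one.
     * Compile with e.g. gcc -fopenmp. */
    #include <stdio.h>

    #define N 1000000
    static double a[N], b[N], c[N];

    int main(void) {
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        #pragma omp parallel for        /* iterations are independent */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("%f\n", c[N - 1]);
        return 0;
    }
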
Re:A new era in performance breakthroughs? (Score:2)
It's even harder when there's no memory protection. One might imagine (within reason) that a Java compiler could separate independent tasks by tracking what variables are used in what sections of code, and inferring that one section must be independent of another until you reach line X (at which point you may need to synchronize access to a variable the two pieces have in common, or join the threads). That could (perhaps) achieve decent multithreaded perfor
nothing "radical" about it (Score:2)
There's nothing "radical" about it--it's just a bunch of CPUs on a chip. It's about the least radical way in which you can put a bunch of CPUs on a chip, beyond multicore.
Re:A new era in performance breakthroughs? (Score:3, Funny)
A summary of the idea here... (Score:2)
Posit: The Cell architecture is highly parallel.
Posit: Most programmers today are good at writing serial, not parallel, code.
Hypothesis: A compiler can be developed that takes serially written programs and auto-transforms them into parallel programs to exploit the benefits of parallelism.
Now comes the research to attempt to validate that hypothesis. Will it succeed? We'll find out in several years. There are
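To make the difficulty concrete, a hedged C illustration: the first loop's iterations are independent and parallelize trivially; the second carries a dependence from one iteration to the next, and no simple transformation fans it out:

    /* Loop A's iterations are independent: a compiler can fan them out
     * across cores.  Loop B is a recurrence: each iteration needs the
     * previous value of s, so naive reordering can't parallelize it. */
    void loop_a(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] * b[i];           /* embarrassingly parallel */
    }

    float loop_b(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s = s * 0.5f + a[i];          /* loop-carried dependence */
        return s;
    }

Real programs are mostly tangles of the second kind, which is why the "heroic compiler" has stayed out of reach.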
Re:A summary of the idea here... (Score:5, Insightful)
Parallel programming and automated parallelization have already been researched exhaustively throughout the last thirty years of the 20th century. The outcome of all this research is that it is not feasible/tractable to create a compiler that is capable of recognising parallelism, as you suggest. Compilers that can do this are sometimes called 'heroic' compilers, for the reason that the required transformations are so incredibly difficult, and heroic compilers that actually work (well) simply don't exist.
Re:A summary of the idea here... (Score:2)
VCL takes sequential code and splits it up into parallel code based on the constraints of the vector-units (each VU is dual-issue, with some restrictions). It'll re-order code, insert wait states, etc. Certainly it's a good start at auto-parallelisation of the code. It's supposed to do as well as a skilled engineer...
Simon
Compiler isn't necessarily serial (Score:2)
Re:A summary of the idea here... (Score:3, Interesting)
Let's also not lose sight of the big picture with regard to the Cell: the 8 parallel vector processors are coupled with a single CPU core derived from the PowerPC chip. So the overarching structure of the Cell isn't a
Anyone having flashbacks? (Score:5, Insightful)
All this meant that as the PS2 aged it could 'keep up' because the coders kept getting better and better.
Mere mortals do not write the latest graphics engines. I think there are a lot more tier-1 people running around than /. seems to think. They are just too busy to comment here.
All that really matters is whether the launch titles will be 'good' enough. Then the full power of the system can be unleashed over its lifespan.
If you're a game company and you're faced with the choice of either making just another engine OR spending some money on the kind of people that code for supercomputers and getting an engine that will blow the competition out of the water, then it will be a simple choice.
Just because some guy on a website finds it hard doesn't mean nobody can do it.
I'm totally having deja vu. (Score:3, Interesting)
Yeah, but what's the full power of a system? Prettier graphics?
The "full power" of the PS1 seemed to be that its games became marginally less ugly as time went on, although FF7 was very well done since it didn't use textured polygons for most of it (the shading methods were much sexier). When I think about FF9, I don't like it more because it uses the PS1 at a fu
compilers ... (Score:5, Insightful)
Re:compilers ... (Score:2)
Why hasn't this been done before ? (Score:2)
Re:Why hasn't this been done before ? (Score:2)
Far too complex? (Score:2, Insightful)
Your average C programmer doesn't take architecture into account, and so there's no user indication of whether a variable can be
Re:Far too complex? (Score:3, Insightful)
Re:Far too complex? (Score:2)
That's because, to the average C (or C++) programmer, speed doesn't matter -- ease of coding and debugging and maintenance does. However, that's not the case with games developers (or, more correctly, games engine developers these days), or high-performance computing people (i.e., scientists who write weather prediction programs and such). To them, it matters, and they'll code for it. But, they also have tools like MPI and PVM, which are desi
OK, Great a compiler, but ... (Score:2)
here's the real article... (Score:5, Informative)
enjoy... :)
Re:here's the real article... (Score:2)
On the other hand, though, some boffins had to code this, and there were probably a few junior programmers involved somewhere too, who can now claim to have been part of it all.
special compilers, expert programmer = DOA product (Score:3, Insightful)
Also, the division into "expert programmer" and "regular programmer" is silly. Most coding is done by people who aren't experts in the cell architecture (or any other architecture). That's not because people are too stupid to do this sort of thing, it's because it's not worth the investment.
If Cell can't deliver top-notch performance with a simple compiler back-end and regular programmers who know how to write decent imperative code, then Cell is going to lose. Hardware designers really need to get over the notion that they can push off all the hard stuff into software. People want hardware that works reliably, predictably, and with a minimum of software complexity.
Maybe CISC wasn't such a bad idea after all--you may get less bang for the buck, but at least you get a predictable bang for the buck.
Re:special compilers, expert programmer = DOA prod (Score:4, Insightful)
Re:special compilers, expert programmer = DOA prod (Score:3, Insightful)
Pretty much all modern CPUs need special compilers to give good performance. Unless you can keep track of the number of pipeline stages, the degree of superscalar architecture, etc., you will get sub-optimal code. The P4, for example, can have 140 instructions in flight at once. Can you keep track of your code over a 140-instruction window and make sure there are no hazards? If not, then you're probably better off
CISC? (Score:2)
I assumed less complex chips, with optimizations coming at compile time, were more efficient or cost-effective?
Re:CISC? (Score:2, Interesting)
The problem with the Cell is actually pretty interesting. They decided to go with in-order CPUs for the SPEs, which means that to get good performance you sure as hell better know what your dependencies are and take into
Re:CISC? (Score:3, Funny)
Yeah, but the advantage of doing it this way is that the 2nd transition (from RISC back to RISC) is really quick!
Wasn't this the same mistake Sega made? (Score:2)
Re:Wasn't this the same mistake Sega made? (Score:2, Interesting)
Sega didn't make a single mistake, they made a LOT of them. I imagine you're thinking of the Saturn. It was supposed to be a SNES killer. In other words, all the fancy technology it had was meant to throw sprites on the screen. Then Sony showed up with its fancy-ass 3D archit
No (Score:2)
Re:Wasn't this the same mistake Sega made? (Score:4, Interesting)
Simple parallelism? (Score:2)
without having to worry about manually setting up the threads, etc. -- if there are multiple resources available, they get used; if not, then it happens serially. Is there anything like this out now?
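What you're asking for exists in various wrappers; as a hedged illustration, here is the bare pattern in C with POSIX threads (the thread count and names are arbitrary):

    /* A bare-bones "parallel map": apply a function to every element,
     * splitting the range across however many workers are available, and
     * degrading to serial execution when nthreads == 1.  Systems like
     * OpenMP, TBB, and MapReduce wrap exactly this pattern. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double data[N];

    struct span { int lo, hi; };

    static void *worker(void *arg) {
        struct span *s = arg;
        for (int i = s->lo; i < s->hi; i++)
            data[i] = data[i] * 2.0 + 1.0;     /* the "map" function */
        return NULL;
    }

    static void parallel_map(int nthreads) {   /* assumes <= 16 */
        pthread_t tid[16];
        struct span sp[16];
        for (int t = 0; t < nthreads; t++) {
            sp[t].lo = (int)((long long)N * t / nthreads);
            sp[t].hi = (int)((long long)N * (t + 1) / nthreads);
            pthread_create(&tid[t], NULL, worker, &sp[t]);
        }
        for (int t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);
    }

    int main(void) {
        for (int i = 0; i < N; i++) data[i] = i;
        parallel_map(4);                       /* 1 == serial fallback */
        printf("%f\n", data[N - 1]);
        return 0;
    }
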
Re:Simple parallelism? (Score:2)
http://labs.google.com/papers/mapreduce.html [google.com]
Re:Simple parallelism? (Score:2)
Time to let C die ? (Score:2, Interesting)
Let me summarize
Re:Time to let C die ? (Score:4, Interesting)
Certainly if I'm writing a pleasant little modern desktop application I'm going to write in Objective C or C# -- it would seem a little silly not to... but for writing a compiler, a network stack, or gods forbid a kernel, I don't know of anything that works even close to as well as C. C still has a niche; you can't really change that.
This problem must be solved eventually (Score:3, Interesting)
The problem is that single-threaded programs will run just as slowly on your quad-core 'Core-Quattro' in 2008 as they did on your old Pentium 4, c. 2005. Great, yeah, I know, server loads parallelize very nicely (witness the miracle of Niagara), but consumer-grade CPUs are where the volume is, and people are going to have to notice a real difference in performance in order to stay on the hardware upgrade treadmill. This necessitates that Intel/AMD/IBM come up with new programming models that make it easy to parallelize existing code. Parallelized libraries and frameworks are all well and good, but it will be 20 years before everyone gets around to recoding the existing codebase to the new platform -- and most of them are probably not going to generate optimal code.
No, what we need are compilers that take programs written in a serial fashion, and emit code that scales well on multiple processors. The problems with the PS3 are only the beginning.
I remember (Score:3, Interesting)
They called a developer's conference in August 1998, where after the presentation a veteran game coder shrugged: "Another weird British assembler programming cult".
The Cell strikes me the same way, and for the same reasons, although Big Blue likely has more development-tool budget than VM ever did. Not to take anything away from the smart guys at IBM, but I suspect they'll have a fun time working around the Cell's limitations. I can tell them from experience that DMAed local memory will be much more of a pain in the ass than they can imagine, and unless they can guarantee sync in hardware they'll be wasting a bunch of time schlepping spinlocks in and out of memory. The vector stuff will also be nontrivial: the best way to make that usable, apart from having everyone write vector code from the get-go, would be to provide a stonking great math library in the style of the Intel Integrated Performance Primitives.
As an aside, the PS3 is in the tradition of Sony not caring about who programs their machine: the PS1 was easier to code than the Saturn, which was a true horror, the PS2 upped the difficulty a fair bit, and now even experienced coders are bitching about the PS3. Meanwhile Microsoft is learning from their mistakes: the X360 is easier than the X1, and if you doubt that makes a difference, check out game development budgets and time to delivery. I don't care, really: I eat algorithms and machine code for breakfast, so this just means more jobs and money for me.
Why the Cell processor is such a pain (Score:5, Interesting)
This architecture has been tried before, for supercomputers. Mostly unsuccessful supercomputers you've never heard of, such as the nCube [wikipedia.org] and the BBN Butterfly. [paralogos.com] There's no hardware problem building such machines; in fact, it's much easier than building an efficient shared-memory machine with properly interlocked caches. But these beasts are tough to program. The last time around, everybody gave up, mainly because more vanilla hardware came along and it wasn't worth dealing with weird architectures.
The approach works fine if you're doing something that looks like "streaming", such as multi-stream MPEG compression or cell phone processing. If you want to do eight unrelated things on eight processors, you're good.
But applying eight such processors to the same problem is tough. You've got to somehow break the problem into sections which can be pumped into the little CPUs in chunks that don't require access to any data in main memory. The chunks can't be bigger than 50-100K or so, because you have to double buffer (to overlap the transfers to and from main memory with computation) and you have to fit all the code to process the chunk into the same 256K. That's a program architecture problem; the compiler can't help you much there. Your whole program has to be architected around this limitation. That's the not-fun part.
You have to make sure that you do enough work on each chunk to justify pumping it in and out of the Cell processor. It's like cluster programming, although the I/O overhead is much less.
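The shape of such a kernel, sketched in C with hypothetical dma_get/dma_put/dma_wait helpers standing in for the SDK's real MFC intrinsics (assume dma_wait on an idle tag returns immediately): fetch chunk k+1 while computing on chunk k, so the transfers hide behind the math:

    /* Double-buffered chunk processing.  The structure is the point:
     * compute on one buffer while the other is in flight. */
    #define CHUNK 16384                 /* bytes; code + 2 buffers < 256K */

    extern void dma_get(void *ls, unsigned long ea, int size, int tag);
    extern void dma_put(void *ls, unsigned long ea, int size, int tag);
    extern void dma_wait(int tag);
    extern void process(char *buf, int size);

    void run(unsigned long ea, int nchunks) {
        static char buf[2][CHUNK];
        dma_get(buf[0], ea, CHUNK, 0);               /* prime buffer 0 */
        for (int k = 0; k < nchunks; k++) {
            int cur = k & 1, nxt = cur ^ 1;
            if (k + 1 < nchunks) {
                dma_wait(nxt);                       /* old put done   */
                dma_get(buf[nxt], ea + (k + 1UL) * CHUNK, CHUNK, nxt);
            }
            dma_wait(cur);                           /* chunk k is in  */
            process(buf[cur], CHUNK);                /* compute        */
            dma_put(buf[cur], ea + (unsigned long)k * CHUNK, CHUNK, cur);
        }
        dma_wait(0); dma_wait(1);                    /* drain writes   */
    }

Note what the compiler can't do for you here: deciding that the problem decomposes into CHUNK-sized pieces at all is the program-architecture problem described above.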
In some ways, C and C++ are ill-suited to this kind of architecture. There's a basic assumption in C and C++ that all memory is equally accessible, that the way to pass data around is by passing a pointer or reference to it, and that data can be linked to other data. None of that works well on the Cell. You need a language that encourages copying, rather than linking. Although it's not general-purpose, the OpenGL shading language is such a language, with "in" and "out" parameters, no pointers, and no interaction between shader programs.
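In C terms, the closest you get to that discipline is something like this hedged sketch: an explicit, non-aliased input block and output block, and no pointers into shared structures:

    /* The "copying, not linking" style: explicit in and out buffers,
     * declared non-aliasing via restrict, touching nothing else --
     * roughly the contract a shader's in/out parameters enforce by
     * construction. */
    void brighten(const float *restrict in, float *restrict out,
                  int n, float gain) {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * gain;   /* no shared mutable state touched */
    }
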
Note that the Cell processors don't do the rendering in the PS3. Sony gave up on that idea and added a conventional NVidia graphics chip. (This guaranteed that the early games would work, even if they didn't do much with the Cell engines.) Since the Cell processors didn't have useful access to the frame buffer, that was essential. So, unlike the PS2, the processors with the new architecture aren't doing the rendering.
It's possible to work around all these problems, but development cost, time, and risk all go up. If somebody builds a low-priced 8-core shared memory multiprocessor, the Cell guys are toast. The Cell approach is something you do because you have to, not because you want to.