Slashdot
IBM's OSS Code Morphing Code/or OSS vs. Transmeta
Posted by
Hemos
on Wed Nov 29, 2000 01:28 AM
from the morphing-for-fun-and-profit dept.
jjr writes: "It seems that IBM has an Open Source project called Daisy that does a lot of what Transmeta does. Their code-morphing technology supports PowerPC, x86, and S/390, as well as the Java Virtual Machine. They morph the [code] into VLIW just like Transmeta, but they still have some issues to work out. Other issues dealt with in the report include self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.

Re:Interesting spin-off's... (Score:5)
BTW, Transmeta has been working on their stuff since 1995, so the technology mentioned in the 1997 paper doesn't strictly predate it.
I read about Daisy a few years back when I was studying VLIW scheduling techniques and whatnot. The DAISY VLIW is quite different than most VLIWs around. Their instruction word is built upon the ability to execute large numbers of "branches" in parallel every cycle. (As best as I can tell, these "branches" are actually closer to being composite predication conditions in many cases, which is why I put "branches" in quotes.) Their experimental physical implementation could execute something like 8 branches every cycle. Downright weird.
A more traditional VLIW uses predication [google.com] to convert short branches into a simple "if (cond)" prefix on individual instructions. (This technique is known as if-conversion.) Also, traditional VLIW instruction words are flat: all N instructions in a VLIW bundle execute together in parallel, with no tree structure implicit in the encoding.
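For anyone who hasn't seen if-conversion before, here is a rough sketch of the idea in Python. It only models the transformation on a toy example; the function names are made up for illustration, and a real compiler does this on machine operations, not source code.

```python
# Toy model of if-conversion. Function names are hypothetical.

def branchy_abs_diff(a, b):
    # Original form: a short branch selects one of two subtractions.
    if a > b:
        r = a - b
    else:
        r = b - a
    return r

def predicated_abs_diff(a, b):
    # If-converted form: compute the predicate once, issue BOTH
    # subtractions, and let the predicate decide which result commits.
    p = int(a > b)           # predicate register: 1 or 0
    t_true = a - b           # (p)  r = a - b
    t_false = b - a          # (!p) r = b - a
    # Straight-line selection, no control flow left in the body.
    return p * t_true + (1 - p) * t_false
```

Both versions give the same answer; the second just has no branch in it, so all of its operations can be packed into the same bundle.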
All that aside, the DAISY scheduling techniques sound pretty similar to trace scheduling [google.com], which was used on the old Multiflow VLIW machines [google.com]. The actual process of converting PowerPC instructions to individual DAISY operations is mostly search and replace, and preserving program order is a matter of constructing proper dependences between the instructions.
Feel free to ask me questions if you're curious about this kind of stuff. It's my day job.
--Joe--
Program Intellivision! [schells.com]
Re:other stuff (Score:3)
True for LCD, but why limit yourself to one technology? There's no reason a screen has to emit light at all. After looking at several flavors of "electronic paper" it doesn't seem particularly fanciful to imagine a display which consumes zero power if the image isn't changing and which is readable under the same wide variety of conditions as regular paper. It may well be that such displays will always lag behind more conventional technologies in areas such as transition time or color depth, but for a very wide variety of devices and applications that would still be a big win.
Even within the realm of light-emitting display technology, there's plenty of room to reduce power consumption. For example, the Light Emitting Polymer work at CDT could lead to displays that consume a lot less power than CRT or LCD displays, in addition to being extremely thin, light and flexible.
I'm not trying to argue with you here. I completely agree with your main point that power consumption needs to be addressed beyond the CPU. Displays and rotating media in particular are at least as deserving of attention. This is all just FYI.
Re:Code morphing patented? (Score:4)
Re:Nice start... (Score:5)
>VLIW, but dynamic translation and
>parallelization will always be slower than
>native processes.
No, you're actually wrong (though it is counter-intuitive). Dynamic translation lets you make optimizations at runtime, based on the actual behavior of the code, that can't be done statically at compile time (or done as well in the CPU using branch prediction, etc.). For example, check out the 'Dynamo' project at HP: it emulates the PA-RISC processor on top of itself in software, and gets substantial speed improvements.
http://arstechnica.com/reviews/1q00/dynamo/dyna
http://www.hpl.hp.com/cambridge/projects/Dynamo
Re:other stuff (Score:4)
There is certainly research and development on low-energy components besides the CPU; check out the energy usage of the mobile Radeon, for one thing. However, there are limits on how much you can possibly squeeze out of some components. Hard drives (which probably eat the most energy in a portable system) need to spin, and there's a certain amount of mass which is being kept moving at a certain velocity, along with a certain amount of energy required to read/write data. That puts a limit on how much energy you can save there. CD-ROM drives have similar limitations.
A color LCD of usable brightness (another huge drain on battery life) is going to output a certain amount of energy; you could make the screens dimmer, but then they are harder to see. Wireless connections are going to require a certain amount of power for broadcast; the further the connection, the more juice. Sound output requires a certain amount of power, and so on.
What you're seeing is the design decisions which made the original Palm Pilot: no movable parts for storage, B&W, passive matrix screen, no wireless. And it could run for two months on two AAAs. Adding on just a color screen drops that down significantly and requires rechargeable batteries for a reasonable experience. Ditto for wireless. I just don't think there's going to be much of a way around it until we figure out how to store more energy in a light, safe way.
-jon
Rearranging Compiled Code for Optimization (Score:5)
This was in either late '95 or early '96 - but the IBM work on this had been around for a while by the time I read the paper.
This technology is widely available now - read all the way to the end to see how you can try it out.
If you have a jump to a certain offset in a routine, you can move the code where you jump to elsewhere in the file and change the offset you give in the jump. Complicated, because you need to parse RISC machine code, but doable.
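To make the offset patching concrete, here is a minimal sketch for one case only: the PowerPC unconditional relative branch (`b`, primary opcode 18, with the AA and LK bits clear). The helper names are hypothetical, and a real tool would also have to handle conditional branches, branch-and-link, absolute branches and jump tables.

```python
# Sketch: retarget a PowerPC relative "b" instruction after its
# destination block has been moved. Helper names are hypothetical.

def encode_b(branch_addr, target_addr):
    """Encode 'b target' placed at branch_addr (AA=0, LK=0)."""
    offset = target_addr - branch_addr
    if offset % 4 != 0:
        raise ValueError("branch target must be word aligned")
    if not -(1 << 25) <= offset < (1 << 25):
        raise ValueError("target out of range for a 26-bit displacement")
    return (18 << 26) | (offset & 0x03FFFFFC)

def decode_b_target(branch_addr, word):
    """Recover the absolute target of a relative 'b' instruction."""
    if word >> 26 != 18 or word & 0x3:
        raise ValueError("not a plain relative 'b' instruction")
    offset = word & 0x03FFFFFC
    if offset & 0x02000000:          # sign bit of the 26-bit displacement
        offset -= 1 << 26
    return branch_addr + offset

def retarget(branch_addr, word, old_block, new_block):
    """Re-encode a branch whose destination block moved from
    old_block to new_block (both are block start addresses)."""
    target = decode_b_target(branch_addr, word)
    return encode_b(branch_addr, target - old_block + new_block)
```

For example, a branch at 0x1000 that jumped to 0x1010 can be pointed at the block's new home at 0x9000 with `retarget(0x1000, 0x48000010, 0x1010, 0x9000)`.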
It's made a little easier by PowerPC instructions always being fixed at 32 bits with no extension words (a side effect of that is that there's no way to load a 32-bit constant into a register with a single instruction, which makes it hard to scan machine code by eye for constants in an assembly debugger.)
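Here is a small sketch of why those constants are hard to spot. The usual idiom is a `lis` (load the upper 16 bits) followed by an `ori` (OR in the lower 16 bits), so the constant ends up split across two instruction words. The function names and the choice of register are just for illustration.

```python
# Sketch: build the two-instruction lis/ori sequence that loads a
# 32-bit constant on PowerPC. Function names are hypothetical.

def split_const32(value):
    """Split a 32-bit constant into the halves used by lis and ori."""
    value &= 0xFFFFFFFF
    return value >> 16, value & 0xFFFF

def load_const32(reg, value):
    """Return the two instruction words for:
         lis reg, hi        (addis reg, 0, hi; primary opcode 15)
         ori reg, reg, lo   (primary opcode 24)
    """
    hi, lo = split_const32(value)
    lis = (15 << 26) | (reg << 21) | (0 << 16) | hi
    ori = (24 << 26) | (reg << 21) | (reg << 16) | lo
    return [lis, ori]

# Loading 0x12345678 into r3 gives two words, neither of which
# contains the full constant.
words = load_const32(3, 0x12345678)
print([f"{w:08X}" for w in words])   # ['3C601234', '60635678']
```

Because `ori` zero-extends its immediate, the two halves can be used as-is; a sequence that uses `addi` for the low half would have to adjust the high half for sign extension.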
This has the effect of speeding up the overall program execution because you group frequently used code blocks together in the executable file, and also in memory once it's loaded. You may find less-commonly used branches of an if-statement put miles away at the end of the file, so that you jump a long ways away and then back in sometimes, but this isn't a big deal because all the frequent cases flow straight along.
The reason this is a big win is twofold. First, you reduce virtual memory paging and the code resident in physical memory because less commonly used code is all grouped together and just sits idly paged out on disk; that which is taking up valuable physical RAM is of a minimum size and being used actively.
Also (and more importantly in small programs, and in CPU-bound cases), you make more effective use of your processor's code cache.
This is because jumping over an uncommonly used branch may load a few unused instructions into the cache at the beginning and end of the branch that's not taken - cache lines (blocks) are of a fixed size and are always aligned by the cache block size, so if you have 32 byte cache lines then the start of any cached code falls at a physical address that is divisible by 32.
If you run even one instruction into that address range, you load 32 whole bytes of code into the cache, evicting 32 bytes of code that might be useful later. If your code is not optimized this way, you'll then just end up jumping over most of what you loaded.
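A quick way to see the effect is to count which cache lines a set of executed address ranges touches. This is only a back-of-the-envelope model using the 32-byte line size from the example above; the addresses and function names are invented.

```python
# Sketch: count the cache lines pulled in by executed code ranges.

LINE_SIZE = 32  # bytes per cache line, as in the example above

def lines_touched(start, nbytes, line_size=LINE_SIZE):
    """Base addresses of every line overlapped by [start, start + nbytes)."""
    first = start - start % line_size
    last_byte = start + nbytes - 1
    last = last_byte - last_byte % line_size
    return list(range(first, last + line_size, line_size))

def footprint(ranges, line_size=LINE_SIZE):
    """Number of distinct lines fetched to execute (start, nbytes) ranges."""
    lines = set()
    for start, nbytes in ranges:
        lines.update(lines_touched(start, nbytes, line_size))
    return len(lines)

# The same 24 bytes of hot code, laid out two ways (made-up addresses).
scattered = [(0x1000, 8), (0x1040, 8), (0x1080, 8)]  # split by cold code
packed = [(0x1000, 24)]                              # hot code contiguous

print(footprint(scattered), footprint(packed))       # 3 1
```

In the scattered layout the 24 useful bytes occupy 96 bytes of cache; packed together they occupy 32.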
Many people who are trying to make their programs run faster would benefit from knowing more about how the cache works. Gary Kacmarcik's Optimizing PowerPC Code [fatbrain.com] has a good discussion of this that will benefit anyone who programs on modern microprocessors - not just PowerPCs. And while Kacmarcik emphasizes PowerPC assembly, most of the benefit of improving cache use you can do from C, C++ or another higher level language.
The way the profiler works is that an interrupt-driven task is used to check the instruction pointer at frequent but random intervals. The samples are saved to a file for later analysis, then a postprocessor makes a histogram which gives the number of samples per basic block of instructions.
(A basic block, essentially, is any code that falls between a pair of curly braces if it came from original C source code. It's more complicated than that in practice but basically it's a chunk of machine code that has one entry point and one exit. It's possible to analyze machine code with a program and divvy it up into basic blocks.)
Then basically what you do is sort the machine code, with the most frequently used basic blocks coming earlier in the file.
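The histogram and the sort can be sketched in a few lines. This is not how any particular tool does it; it assumes you already have the basic blocks as (name, start, size) tuples sorted by start address and a list of sampled instruction-pointer values, and it only computes the new layout. Actually moving the code also means patching every branch and restoring any fall-through that the move breaks.

```python
# Sketch: build a per-block sample histogram, then lay out the blocks
# hottest first. Block names, addresses and samples are invented.
from bisect import bisect_right
from collections import Counter

def histogram(blocks, samples):
    """Count samples per basic block. blocks must be sorted by start."""
    starts = [start for _, start, _ in blocks]
    counts = Counter()
    for pc in samples:
        i = bisect_right(starts, pc) - 1      # last block starting <= pc
        if i >= 0:
            name, start, size = blocks[i]
            if pc < start + size:             # ignore samples outside any block
                counts[name] += 1
    return counts

def reorder(blocks, counts, base=0):
    """Assign new start addresses, most frequently sampled blocks first."""
    hot_first = sorted(blocks, key=lambda b: -counts[b[0]])  # stable sort
    layout, addr = {}, base
    for name, _, size in hot_first:
        layout[name] = addr
        addr += size
    return layout

blocks = [("entry", 0, 16), ("error", 16, 64), ("loop", 80, 32), ("exit", 112, 8)]
samples = [4, 84, 88, 92, 100, 112, 116, 200]

counts = histogram(blocks, samples)
layout = reorder(blocks, counts)
print(layout)   # {'loop': 0, 'exit': 32, 'entry': 40, 'error': 56}
```

The never-sampled "error" block ends up last, which is exactly the "miles away at the end of the file" effect described earlier.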
Note that the profiling process depends necessarily on the use to which the program is put during the sampling. For best results, you might actually want to prepare several separate binaries of the same program, each optimized for a different purpose. Or you might want to construct test data or a test script that gives you a good overall average performance.
Now, how do you get this tool? It's more than just theory. It's available for IBM RS/6000s, although I don't remember what they call it.
But if you can spare the cash for an iMac you can get it included with the Macintosh Programmer's Workshop [apple.com] - MPW. The particular tool that's used for this is called MrPlus, which is discussed in Apple's Technote 1174 [apple.com] and Technote 1066 [apple.com].
I believe a variant of this is available in the Metrowerks CodeWarrior [metrowerks.com] development environment for PowerPC (CodeWarrior also supports Windows, Linux via GCC and lots of embedded systems, but I believe the code reordering is only available for PowerPC).
CodeWarrior provides both an IDE (on Windows there's a choice of MDI user interface or Mac style with a global menu bar and free windows, which makes me much happier when I program on Windows) and it also provides command line tools, including the entirety of MPW with mwcc preinstalled so you can do "make" style builds on the MacOS (but with a weird makefile syntax). I don't seem to find any mention of this on Metrowerks' website. I'll ask their friendly support guy if I'm correct about this.
Perhaps you're lusting over using this for Linux. It would certainly be interesting to try using this on the kernel - build the kernel, boot the machine off it, run it for a while under a normal load while you run the instruction pointer sampler, then reorder the instructions in the kernel and boot off the new kernel and you run faster!
This would probably be easiest to do on PowerPC Linux given the availability of published information from IBM and Apple about it, but I don't see why you couldn't do it for any instruction set. Some would just be harder to parse or rearrange correctly than others.
Stop drooling and start studying.
Michael D. Crawford
GoingWare Inc
daisy (Score:3)
From the FAQ (Score:4)
According to their white paper, Transmeta uses dynamic binary translation to convert x86 code into code for Transmeta's internal architecture. This is similar in concept to the current version of DAISY which converts PowerPC code into code for an underlying DAISY VLIW machine. DAISY was developed at IBM independently of Transmeta. The DAISY research project focuses less on low power and more on achieving instruction level parallelism in a server environment and on convergence of different architectures on a common microprocessor core. A more detailed comparison of the DAISY and Transmeta approaches will be possible after Transmeta publishes their techniques in more detail.
IBM licensing from Transmeta (Score:4)
-------
CAIMLAS
Re:Interesting spin-off's... (Score:3)