Clockless Computing: The State Of The Art

Michael Stutz writes: "This article in Technology Review is a good overview of the state of clockless computing, and profiles the people today who are making it happen." The article explains in simple terms some of the things that clockless chips are supposed to offer (advantages in raw performance, power consumption and security) and what characteristics make these advantages possible.
  • Is this another example of the 'bohemian/hippie renegade engineer out to save the computing world by their bold revolutionary ideas'?

    Sort of reminds me of the Rolling Stone cover back in '90 (or so) that had Jesus Jones on it. "Will Jesus Jones save Rock & Roll?" (And notice where they are now)
  • How... (Score:2, Funny)

    by blkros ( 304521 )
    are computers going to know what time it is if they don't have any clocks?

    • You'll just think that you've set the fuzziness scale to high.

    • are computers going to know what time it is if they don't have any clocks?


      I believe a clock is still needed, but the CPU itself doesn't depend on it. The OS will surely require a clock.
    • They're talking about removing the internal CPU clock, which, in effect, isn't really a clock at all. It's just something that ticks at regular intervals and lets you do a number of things, such as synchronize instructions, the pipeline, cache reads/writes, and all the other stuff I forgot from CS 101.

      A computer's clock (as in date, time, etc) is on another part of the motherboard, and runs (correct me if I'm wrong) off the CMOS battery. That'll always be a "clock" in the sense we understand.
      • No, a computer does indeed know what time it is based on a clock- it's the same way digital watches know what time it is (counting pulses). The answer? Computers will still have some sort of calibrated oscillating circuit in them, but they won't be synchronizing processor activity.
      • Except that every frickin' mobo clock (date/time) I have ever come across can't keep proper time worth a damn. Why is it that I could buy some cheap-ass cheesy Care-Bear watch and it would be assured of keeping better time, by a long stretch, than any mobo clock? Why is that?!

    • Most computer systems have auxiliary timers that they use for actually keeping time. For example, in an architecture/assembly language course [uwaterloo.ca] I just took in the summer, we were using a Motorola ColdFire processor and a MC68901 [motorola.com] multifunction peripheral controller, which includes 4 independent timers. We wrote some assembly routines that set up one of the timers (I believe it used a 25MHz clock) to run the signal through a 1/16 clock divider and then into a counter that generated an interrupt when it reached a certain value (25,000,000 / 16 = 1,562,500), i.e. once every second. The ISR that responded to this interrupt was our system clock.

      Of course, this system relies on software to do all the work. A real system clock would just be the same thing implemented in hardware.
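
      For illustration, here is a rough host-side analogue of that ISR-driven system clock in C, using a POSIX interval timer in place of the MC68901 and its divided-down clock (only the one-second period is carried over from the description above; everything else is made up):

      #include <signal.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/time.h>
      #include <unistd.h>

      /* Incremented once per timer "interrupt" -- this counter IS the system clock. */
      static volatile sig_atomic_t uptime_seconds = 0;

      static void timer_isr(int sig)       /* stand-in for the MC68901 interrupt handler */
      {
          (void)sig;
          uptime_seconds++;
      }

      int main(void)
      {
          struct sigaction sa;
          memset(&sa, 0, sizeof sa);
          sa.sa_handler = timer_isr;
          sigaction(SIGALRM, &sa, NULL);

          /* Fire once per second, the way the divided-down 25 MHz clock did. */
          struct itimerval iv = { { 1, 0 }, { 1, 0 } };
          setitimer(ITIMER_REAL, &iv, NULL);

          for (int i = 0; i < 3; i++) {
              pause();                               /* sleep until the "interrupt" fires */
              printf("uptime: %d s\n", (int)uptime_seconds);
          }
          return 0;
      }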

  • This is cool. (Score:1, Interesting)

    by Anonymous Coward
    And marketing these chips will have to get back to the real stuff: how many operations of a specific kind they can carry out per second.

    I'm just wondering, would such a processor execute the same machine code using the same internal sequence of signals twice? I guess asynchronous communication between elements would introduce some kind of randomness.
    • Well, I suppose it's the lesser of two evils. Having a benchmark result printed on a retail box does carry a little more information than having a clock rate on the box.

      (Of course for the benefit of the consumer who doesn't know the differences between benchmarks, some standard benchmark would have to be used so the consumer can simply know "a bigger number on the box is better".)

    • This 'randomness' exists in both synchronous and async logic. Clocked boolean logic will not handle the same operations exactly the same way twice with respect to timing at the gate level, but the clock hides that (that's why it's there). With clockless logic, other mechanisms hide it.

      In the case of Null Convention Logic, it's the extra signal saying 'wait until I'm done!' to the next logic unit. This makes the relative dataflow look more random. However, both approaches use design to ensure the end result is properly achieved.

      With Null Convention Logic, this 'relative randomness' means the chip produces more 'white noise', since multiple clocks are not joining together to produce a steady electromagnetic frequency. This should make design easier, as you don't have to fight your own chip design to keep it from interfering with itself. This is a HUGE problem with current clocked designs. I believe it results in many forced re-designs to get it right, and the problem only gets worse with higher clock rates and bigger chips.
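
      To make the NULL/DATA wavefront and completion-detection idea a bit more concrete, here is a tiny C sketch. It only models the 'wait until I'm done' signalling described above -- real NCL gates also have hysteresis (they hold their output until every input has returned to NULL), which this toy deliberately skips:

      #include <stdio.h>

      /* A dual-rail value: NUL means "no data yet", D0/D1 are the two data states. */
      typedef enum { NUL, D0, D1 } rail;

      /* Completion detection: the wavefront is complete when no wire is still NULL. */
      static int complete(const rail *v, int n)
      {
          for (int i = 0; i < n; i++)
              if (v[i] == NUL) return 0;
          return 1;
      }

      /* A dual-rail AND: it produces nothing until BOTH inputs carry data. */
      static rail dr_and(rail a, rail b)
      {
          if (a == NUL || b == NUL) return NUL;
          return (a == D1 && b == D1) ? D1 : D0;
      }

      int main(void)
      {
          rail in[2] = { D1, NUL };             /* only one input has arrived so far */
          rail out = dr_and(in[0], in[1]);
          printf("partial wavefront -> output %s\n", out == NUL ? "NULL (wait)" : "data");

          in[1] = D1;                           /* second input arrives, whenever    */
          out = dr_and(in[0], in[1]);
          if (complete(&out, 1))
              printf("complete wavefront -> output %s, tell the next stage it's done\n",
                     out == D1 ? "1" : "0");
          return 0;
      }
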
  • by chill ( 34294 ) on Saturday September 15, 2001 @09:49AM (#2302650) Journal
    What will AMD and Intel try to one-up each other with? No clock speed, so how do you classify, much less hype, new processors?

    The real reason they haven't moved to this yet is their marketing team doesn't want to give up on the MHz race.

    • by Anonymous Coward
      That is NOT the reason they have not moved. Designing something as complicated as a CPU without clocks is a daunting challenge: keeping everything in sync, removing race conditions, keeping the order of execution the same. There are a lot of challenges in a clockless design.
    • There should be an independent association that uses a battery of benchmarks to come up with a few "measurements" that the general public can use to gauge performance. We should implement this even now, with clock speed already being an almost meaningless comparison.
    • You have to remember that AMD is already about to abandon the "My MHz vs. Your MHz" game since the speeds are becoming an increasingly 'apples to oranges' comparison. They'll be referring to future chips by model number... and from what I've been hearing the consumer will actually have to dig to get to the speed of the chips.
  • I can see the point that clockless design can reduce power consumption. However, I don't really see why it would solve the other problems inherent in high-speed computation.

    Suppose we want to increment a register 1000M times. A clocked circuit will generate a hell of a lot of noise as all the signals push through the circuit at, say, 2GHz for a duration of, say, 0.5s.... But if we want the clockless design to perform as well, its asynchronous gates still have to switch that many times in the same 0.5s.
    In terms of noise generation it will be on par with a conventional design. Since all the gates still need to switch at pretty much the same speed, the other physical barriers still apply.

    Anyone have more detailed info on this topic?

    • But you can save time by removing the clock-synched latch that currently has to separate each piece of logic. Like the article says, async chip design is gradually being introduced in this way to current designs, like the Pentium 4 - not that that is much of an advert for the technology... It's not really anything new, it's just that previously the benefits were greatly outweighed by the difficulty in designing these systems.
    • by TeknoHog ( 164938 ) on Saturday September 15, 2001 @11:27AM (#2302869) Homepage Journal
      Say we're running at 2GHz, which allows a maximum time of 0.5 ns for an instruction. But if you use some simple instructions that only take 0.2 ns each, you'll be wasting 3/5 of your time waiting for the next cycle. With clockless computing you can move on to the next stage as quickly as you're done with the one before.

      Of course there is some overhead. There has to be a system telling other parts of the computer when something is finished. But if that is a long enough stage (perhaps thousands of instructions) then it'll be faster overall.
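
      A back-of-the-envelope version of that argument in C, with made-up instruction latencies and a made-up handshake overhead, just to show where the saving comes from:

      #include <math.h>
      #include <stdio.h>

      int main(void)
      {
          const double cycle = 0.5;                     /* ns, i.e. a 2 GHz clock          */
          const double handshake = 0.05;                /* ns, guessed async completion cost */
          const double latency[] = { 0.2, 0.5, 0.3, 0.2, 0.4 };   /* per-instruction times */
          const int n = sizeof latency / sizeof latency[0];

          double clocked = 0, clockless = 0;
          for (int i = 0; i < n; i++) {
              clocked   += ceil(latency[i] / cycle) * cycle;   /* round up to a full tick   */
              clockless += latency[i] + handshake;             /* move on as soon as done   */
          }
          printf("clocked: %.2f ns   clockless: %.2f ns\n", clocked, clockless);
          return 0;
      }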

      • With clockless computing you can move on to the next stage as quickly as you're done with the one before.

        Actually you can do even better. If the instruction executed does not need the memory stage of the pipeline, it can exit the pipeline before that stage. This will allow multiple quick instructions (e.g. shift) to execute and exit the pipeline while the slow memory instruction ties up the memory stage. This pseudo-parallel operation is something clocked processors can only do with multiple pipelines.

      • Say we're running at 2GHz, which allows a maximum time of 0.5 ns for an instruction.

        Not so; many instructions take multiple cycles. Which ones depends on the machine, but multiplication, division, jumps, and of course memory accesses usually take 2-20 clock cycles to execute.

        More time is spent doing memory accesses and missed branches than anything else (IIRC: the Pentium Pro guesses 90% of branches correctly, and missed branches account for about 30% of the overall time of executing a typical piece of code). IA-64 does some interesting things to prevent missed branches from hurting the code (basically, it executes both branches in parallel, throwing away whichever one was wrong). IA-64 has so many functional units that I guess in the long run, it turns out to be a win.

        If the advantages they cite for these chips are true, things could get very interesting in a couple of years. :)
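
        The usual way to put numbers on the branch-miss cost is an effective-CPI estimate. A tiny sketch with invented (but plausible) figures -- none of these numbers come from Intel documentation:

        #include <stdio.h>

        int main(void)
        {
            const double base_cpi     = 1.0;   /* cycles per instruction if nothing stalls    */
            const double branch_frac  = 0.20;  /* fraction of instructions that are branches  */
            const double miss_rate    = 0.10;  /* i.e. ~90% of branches predicted correctly   */
            const double miss_penalty = 15.0;  /* cycles to refill the pipeline after a miss  */

            double cpi = base_cpi + branch_frac * miss_rate * miss_penalty;
            printf("effective CPI = %.2f (mispredictions add %.0f%% to run time)\n",
                   cpi, 100.0 * (cpi - base_cpi) / base_cpi);
            return 0;
        }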

  • With its simplified core, a processor like the Crusoe seems like it could be a promising general-purpose chip to be the first to adopt technology like this.

    Any comments from someone more knowledgeable than I?
  • Because these chips give off no regularly timed signal, the way clocked circuits do, they can perform encryption in a way that is harder to identify and to crack.

    Not if you have a backdoor. Guess these guys don't read Wired [wired.com]..

    • One of the first common "thinking-outside-the-box" techniques used to crack smart cards exploited the fact that the software was written to take different amounts of time to compute legal and illegal keys. By measuring the battery consumption, the smart card crackers could restrict their search to the space of legal keys.

      No doubt this was a software path put in by a well-intentioned programmer trying to save battery life, but now all respected encryption systems recommend a "veil" strategy, where all encryption/decryption operations take the same amount of time and power regardless of the key.

      In practice this means that you find out the max time and power (plus some margin), and if you are done early and without using enough power, you waste time and power to pad out the veil...

      Nice thought, but this just goes to show that cryptographic systems really need to be designed by experts...
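
      A minimal C sketch of both halves of that advice -- a comparison that touches every byte no matter where the mismatch is, plus padding the whole operation out to a fixed worst-case budget (the 4-byte key and the 50 ms budget are invented):

      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>

      /* Constant-time compare: no early exit on the first wrong byte. */
      static bool keys_equal(const unsigned char *a, const unsigned char *b, size_t n)
      {
          unsigned char diff = 0;
          for (size_t i = 0; i < n; i++)
              diff |= (unsigned char)(a[i] ^ b[i]);
          return diff == 0;
      }

      /* Burn time until a fixed deadline, so legal and illegal keys look alike. */
      static void pad_to_budget(const struct timespec *start, long budget_ns)
      {
          struct timespec now;
          long elapsed;
          do {
              clock_gettime(CLOCK_MONOTONIC, &now);
              elapsed = (now.tv_sec - start->tv_sec) * 1000000000L
                      + (now.tv_nsec - start->tv_nsec);
          } while (elapsed < budget_ns);
      }

      int main(void)
      {
          const unsigned char stored[4] = { 1, 2, 3, 4 }, tried[4] = { 1, 9, 9, 9 };
          struct timespec start;
          clock_gettime(CLOCK_MONOTONIC, &start);

          bool ok = keys_equal(stored, tried, sizeof stored);
          pad_to_budget(&start, 50L * 1000 * 1000);        /* 50 ms worst case, made up */
          printf("%s\n", ok ? "accepted" : "rejected");
          return 0;
      }

      As noted above, a real smart card also has to equalise power draw, which a simple busy-wait like this does not do.
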
      • No doubt this was a software path put in by a well-intentioned programmer trying to save battery life, but now all respected encryption systems recommend a "veil" strategy, where all encryption/decryption operations take the same amount of time and power regardless of the key.

        That's not really necessary. All you have to do is randomize the computation. For example, power analysis of a smart card doing RSA can recover the secret key if the attacker knows what the input was (in many situations, a reasonable assumption). But if you multiply (or is it exponentiate?) the input by a random number, then do the RSA op, then demask the output, poof! - PA, electromagnetic emission analysis, etc. all get very, very hard.

        Also, it can be hard to disguise your "wasting time" as being part of the computation, if the attacker can, for example, track which memory is being accessed when.

        I wonder how well these clockless chips would fare against differential fault analysis; basically progressively destroying gates in the chip and looking at its output over time. Almost any chip will fail against this attack (but it requires lots of expensive equipment and a fair amount of expertise).
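
        For what it's worth, the masking trick (usually called "blinding") looks like this in miniature, using the textbook toy RSA key n = 3233, e = 17, d = 2753 so everything fits in 64-bit integers -- nothing like a production implementation, just the algebra: (x * r^e)^d = x^d * r mod n, so multiplying the result by r^-1 removes the mask.

        #include <assert.h>
        #include <stdint.h>
        #include <stdio.h>

        static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) { return (a * b) % n; }

        static uint64_t powmod(uint64_t b, uint64_t e, uint64_t n)
        {
            uint64_t r = 1;
            b %= n;
            while (e) {
                if (e & 1) r = mulmod(r, b, n);
                b = mulmod(b, b, n);
                e >>= 1;
            }
            return r;
        }

        /* Modular inverse via extended Euclid (n is tiny here, so no overflow worries). */
        static uint64_t invmod(uint64_t a, uint64_t n)
        {
            int64_t t = 0, newt = 1, r = (int64_t)n, newr = (int64_t)(a % n);
            while (newr) {
                int64_t q = r / newr, tmp;
                tmp = t - q * newt; t = newt; newt = tmp;
                tmp = r - q * newr; r = newr; newr = tmp;
            }
            return (uint64_t)(t < 0 ? t + (int64_t)n : t);
        }

        int main(void)
        {
            const uint64_t n = 3233, e = 17, d = 2753;    /* toy textbook RSA key      */
            const uint64_t x = 65;                        /* the "secret" input        */
            const uint64_t r = 1234;                      /* random blinding factor    */

            uint64_t plain   = powmod(x, d, n);                        /* unprotected op  */
            uint64_t blinded = powmod(mulmod(x, powmod(r, e, n), n), d, n);
            uint64_t result  = mulmod(blinded, invmod(r, n), n);       /* remove the mask */

            assert(result == plain);
            printf("x^d mod n = %llu (same answer, but computed on masked data)\n",
                   (unsigned long long)plain);
            return 0;
        }
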
  • Clockless ARM (Score:2, Interesting)

    by Anonymous Coward
    The Amulet Group [man.ac.uk] at The University Of Manchester [man.ac.uk] have a clockless ARM (ARMs are used in many mobile phones, the Compaq iPaq and the GBA).
    • Saw this at an Acorn Computer User Group meeting at the University of Manchester about 4 years ago.

      I was only about 14 and didn't have a clue about half the stuff that was being talked about, but the AMULET simulator they showed at the end looked kinda cool :-)

      Maybe it was longer than 4 years ago; I remember we were waiting for the first shipments of the StrongARM processor upgrade cards for our RiscPC 600s and 700s.

      Ah well, guess I'm getting old now...
  • by isj ( 453011 ) on Saturday September 15, 2001 @10:00AM (#2302679) Homepage
    The article is very interesting. I thought that research in asynchronous computing died in the sixties. What the article misses is that async. operations have an overhead too - the synchronization "here is the data". Synchronous computing does not have that.

    I have previously read (forgotten where) that in theory async. computers will always be slower than sync. computers. It seems that that is not true anymore. I guess that the latest-and-greatest CPUs have a non-trivial percentage of idle time for instructions which take slightly longer than an integral number of clock ticks. If an instruction takes 2.1ns and the clock runs at 1ns, everything has to assume that the instruction takes 3ns.

    Also imagine a fully async. computer. No need for a new motherboard or even changing settings in the BIOS when new and faster RAM chips are available - the system will automatically adapt.

    I think that we will see more and more async. parts in the years to come. But I don't know if everything is going to be asynchronous.
    • Also imagine a fully async. computer. No need for a new motherboard or even changing settings in the BIOS when new and faster RAM chips are available - the system will automatically adapt.

      Now I'm not an engineer, but the article mentioned that it was important to have wires and gates connected in a special manner so the data arrives in the proper order. It seems to me that would make the microprocessor more dependent on the hardware, not less so. Maybe this wouldn't be a problem if all of your RAM was the same speed, but it could cause a problem if you had one 100MHz SIMM and one 133MHz SIMM. I would think that the information coming from the 133 could screw things up. Can anyone clarify this for me?
      • You are right that it depends a lot on how you implement it.

        If the RAM delivers data in a serial manner (one bit at a time on one wire), faster RAM would definitely cause problems because the CPU would not know how to distinguish the individual bits ... unless the RAM chip generates a clock on a separate wire, which some RAM chips do.

        On the other hand if the data bus is e.g. 33 wires = 32 data wires and one "handshake" wire, the protocol between the CPU and the RAM chip could be:
        CPU -> "give me the contents of address 0x38762A63"
        CPU then waits for the handshake wire to go high
        The RAM chip sees the address, puts the contents on the data wires, and then sets the handshake wire high.
        And then the CPU can read the data.

        The above asynchronous protocol does not depend on the speed of the RAM chip. The RAM chip could be a future high-speed "zero-latency" chip, or a slow flash-ram chip. The CPU does not need to know.

        There are problems with this too. The protocol is sort of request-reply / lock-step. And how do multiple devices share the same bus? And and and...
        No one said it was easy :-)
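
        The same request/acknowledge protocol, sketched as two C threads standing in for the CPU and the RAM chip. The atomics play the role of the wires, and the usleep() is only there to stress that the CPU genuinely doesn't care how long the RAM takes (the address and the "stored" value are made up):

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>
        #include <unistd.h>

        static atomic_uint addr_lines, data_lines;   /* the shared bus wires       */
        static atomic_int  req, ack;                 /* the two handshake wires    */

        static void *ram_chip(void *unused)
        {
            (void)unused;
            while (!atomic_load(&req))               /* wait for the CPU's request */
                ;
            unsigned a = atomic_load(&addr_lines);
            usleep(3000);                            /* slow RAM, fast RAM -- who cares */
            atomic_store(&data_lines, a ^ 0xA5A5A5A5u);   /* pretend this is the lookup */
            atomic_store(&ack, 1);                   /* raise the handshake wire   */
            return NULL;
        }

        int main(void)
        {
            pthread_t ram;
            pthread_create(&ram, NULL, ram_chip, NULL);

            atomic_store(&addr_lines, 0x38762A63u);  /* "give me the contents of ..."  */
            atomic_store(&req, 1);
            while (!atomic_load(&ack))               /* CPU waits, however long it takes */
                ;
            printf("read 0x%08X\n", (unsigned)atomic_load(&data_lines));

            pthread_join(ram, NULL);
            return 0;
        }
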
    • I have previously read (forgotten where) that in theory async. computers will always be slower that sync. computers.

      I don't think the advantage of async is a direct increase in speed, but rather a decrease in die size (because the clock signal doesn't have to be propagated to all parts of the chip), which leads to a decrease in power requirements and allows the chip to operate faster without overheating.

  • They have a press release, see here: http://research.sun.com/features/async/

    (I'm sorry, I can't use HTML: the lameness filter doesn't want to allow the posting otherwise.)

    I imagine the "perfect" laptop:
    - an OLED screen (no need for backlighting)
    - an asynchronous processor (low power)
    - no HDD, but plenty of MRAM (this RAM is persistent)

    • And build in a microphone and make its screen touch-sensitive. That way you can get rid of the keyboard, trackpad and hinge and make it a single, consolidated unit.
  • Old news? (Score:2, Informative)

    by NoMercy ( 105420 )
    The AMULET group at Manchester University have been developing this for years based on ARM cores.

    http://www.cs.man.ac.uk/amulet/index.html [man.ac.uk]
  • Reliability (Score:3, Interesting)

    by numo ( 181335 ) on Saturday September 15, 2001 @10:17AM (#2302704)
    Well, I think that the reason the async chips are not being used is quite simple - a clocked system is much easier to design and verify. You know how long before and after a clock edge your signal needs to be there to be recognised. You know that if these constraints match across your system, it will work. Yes, this makes the system as fast as its slowest link - some circuits operate near their limits, some are actually wasting the time. But it works. An asynchronous design would be a pure hell to debug - that's probably why the industry doesn't (yet) mess with it.

    BTW, does anybody here remember analog computing? A bunch of cleverly connected operational amplifiers? These things were asynchronous, just as mother nature is. If you can get the physics to work for you, bingo - compare the time nature needs to raytrace a complex scene with the time a digital model needs :-) The only drawback is that most of us prefer a slow digital model of a thermonuclear reaction and similar problems...
    • Just the opposite actually -- it is usually easier to debug and design. You do have synchronization to avoid reading things at the wrong time, but all the synchronization is local, rather than tied to a global clock pulse, so you only need to verify things at the boundaries, not chipwide at once.

      If some unit takes a bit long to respond, you don't get a glitch, as you would in synchronous designs; instead the unit it is talking to slows down a bit.

      Synchronous and Asynchronous are really misnomers. Better terms would be "globally synchronized" and "locally synchronized".
    • Well, I think that the reason the async chips are not being used is quite simple - a clocked system is much easier to design and verify. You know how long before and after a clock edge your signal needs to be there to be recognised. You know that if these constraints match across your system, it will work. Yes, this makes the system as fast as its slowest link - some circuits operate near their limits, some are actually wasting the time. But it works. An asynchronous design would be a pure hell to debug - that's probably why the industry doesn't (yet) mess with it.

      Not so. In fact, one of the greatest problems with clocked boolean design is the interference caused by all the clocks on the chip. Fabrication will routinely result in broken chips, forcing multiple redesigns and long development cycles.

      Tremendous resources are dedicated to getting around this problem. Also, you can't really just change the design 'a little bit', as doing so results in more interference issues. Want to add a new unit to a clocked boolean logic chip (a new cache, 3D unit, new pipeline, etc.)? Sure you can do it, but it will require fundamental redesign, as the clocks associated with the new unit will interfere with other clocks on the chip, and other clocks will interfere with your new unit. The fact that they all have to fire off simultaneously, generating electromagnetic interference, is a real needle in the eye for chip designers.

      With well-thought-out async, all you have to do (more or less) is add the unit to the design. The 1st fab should work, no redesign cycle required. You can add cache memory or whatever, and as long as the design is logically valid you will have a functioning chip in a few days' time (as long as it takes to fab the chip). Try that with synchronous logic.
      • The fact that they all have to fire off simultaneously, generating electromagnetic interference

        Hmm... But in an async setup they may fire simultaneously - you simply don't know, it's up to the statistics. I fear that in such complex chips you will end up with a system that works by pure coincidence - some picosecond fluctuation somewhere and you get one glitch per 1000 hours of operation.

        It is probably not that simple and, as someone wrote, the more proper names would be globally or locally synced. I fully agree with you that there is no reason to tie the bigger units to a single universal clock. But I think that on the lower levels you can get a more reliable design by using the traditional approach.

        I have no experience in chip design (so I don't know specific problems of trying to stuff tens of millions transistors onto a square inch), but I designed some non-trivial circuits.
    • I suspect that the reason there aren't many (or any) commercial clockless logic designs has more to do with:
      1. Lack of availability of design and synthesis tools.
      2. Lack of engineers trained and experienced in the use of clockless logic.
      3. Lack of multi-sourced, high volume production of clockless logic components.
      4. Lack of a clear economic incentive to abandon clocked logic in favor of clockless logic.

      I would dearly love to be able to experiment hands-on with a clockless CPU myself, but the cost and difficulty of obtaining just one such device is more than I can justify personally.
  • This isn't new... (Score:1, Informative)

    by Anonymous Coward
    The old CDC supercomputers, and the Cray 1, were clockless. They were designed by that inspired madman, ...

    The reason he built them clockless is that the propagation time to get the clock signal across the machines (which were fairly large) would have significantly slowed the performance. Instead, all of the wires are the right length so that all of the signals arrive at their destination at the right time. I've been told horror stories by ex-CDC salesmen that when they installed new machines, they would spend days or weeks clipping wires to different lengths and debugging hardware failure modes until it all ran smoothly.

    Cray also solved the heat dissipation problem by designing the computer to run hot. This meant that when you turned it on it didn't work reliably until all of the ceramic boards heated up (and expanded) so that the connections were solid, etc.

    F-ing brilliant.
    • by Anonymous Coward
      That's not true, the Cray 1 was clocked at 75MHz. Cray did use exact wire lengths in the way you say, but he used the wires effectively as a buffer. It was most definitely not a clockless design (nor were the older CDC machines)
  • CPU Primer (Score:2, Interesting)

    When designing a "conventional" CPU, you can have a clock that essentially drives events and data movement.

    If you design a multiplier circuit using a bunch of full-adders, you'll notice that the output takes a long time to settle. In fact, depending on what numbers you are multiplying together, the circuit may take more or less time before the output settles.

    You can always determine the worst-case scenario for a multiply operation to settle. If the multiply takes longer than any other operation, then the multiply op is the "critical path".

    A chip's frequency is the inverse of the period of the critical path (in most cases). So, if it's possible to do 100 million critical path operations in a second, then your machine can run at 100MHz.

    What the article is hinting at is the amount of wasted time because everything is (currently) done on the clock cycle. Allow me to illustrate: Let's say a multiply takes 5 seconds, but an add only takes 1. A fixed clock rate (or having a clock at all) forces that add instruction to take the extra 4 seconds and use them for nothing. Wasted computer time. (A small worked example of this appears at the end of this comment.)

    Now, the reason people are skeptical is because there is no efficient way to tell if a multiply operation (or any other operation) has actually completed and the outputs have settled.

    Incidentally, if this interests you, go grab a free program called "diglog" or "chipmunk". The software (for linux/windows) allows you to simulate almost any digital circuit.

    Another thing to keep in mind about current CPUs is the way they execute an instruction. Every instruction is actually made of smaller instructions (called microinstructions). Microinstructions take one clock cycle each, but there is an arbitrary number of microinstructions for each larger instruction. The microinstructions perform the "fetch execute cycle" - the sequence that decodes the instruction, grabs the associated data, performs the desired task, and goes back for more.

    If you're interested in designing a CPU yourself, go grab a book by Morris Mano called "Computer System Architecture". With that book and DigLog, it's pretty easy, but it takes a long time.
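
    To put rough numbers on the critical-path point above (using the 5-vs-1 figures from the illustration, treated here as nanoseconds so the frequency comes out in familiar units): the clock has to be slow enough for the worst-case operation, so every faster operation donates its slack to the ceiling.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical settling times for two operations, in nanoseconds. */
        const double t_add = 1.0, t_mul = 5.0;

        double critical_path = t_mul > t_add ? t_mul : t_add;
        double freq_mhz = 1000.0 / critical_path;       /* 1/period, period in ns */
        printf("clock limited to %.0f MHz by the multiplier\n", freq_mhz);
        printf("every add wastes %.1f ns of its %.1f ns slot\n",
               critical_path - t_add, critical_path);
        return 0;
    }
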
  • by Ungrounded Lightning ( 62228 ) on Saturday September 15, 2001 @10:34AM (#2302737) Journal
    if there is no mass market for asynchronous chips, there's little incentive to create tools to build them; if there are no tools, no chips get produced. The same problem applies to the development of chip-testing technologies. Without any significant quantity of asynchronous circuits to test, there is no market for third-party testing tools.

    But at least here there's an accidental solution - the Cross-Check Array.

    Conventional clocked chips can be tested by scan: A multiplexer is added to the flop inputs, and a test signal turns them into one or more long shift registers. The old state of the flops is shifted out for examination while a new state is shifted in to start the next phase of the test. This only works when the flops to be strung together are all part of a common clocking domain.

    The Cross-Check Array is more like a RAM. A grid of select lines and sense lines are laid down on the chip, with a transistor at each intersection. The transistor is undersized compared to those of the gates, forming a small tap on a nearby signal - or it can inject a signal if the sense line is driven rather than monitored. Select drivers are laid down along one edge of the chip, sense amplifiers/drivers along another.

    This approach does not depend on the flip-flops to be active participants in the observation process (though it can still force their state), and thus can observe signals in asynchronous as well as synchronous designs. It also gives observability of testpoints in combinatorial logic without the addition of extra flops. Compared to a fullscan design it gives much greater observability and takes about half the silicon-area overhead.
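
    For readers who haven't met scan before, the shift-register trick described above is easy to model. This little C sketch only shows the conventional full-scan behaviour that the Cross-Check Array replaces (the flop contents and test pattern are made up); the grid of taps itself isn't modelled:

    #include <stdio.h>
    #include <string.h>

    #define FLOPS 8

    int main(void)
    {
        /* State captured in the flops after one functional clock... */
        unsigned char chain[FLOPS]     = { 1, 0, 1, 1, 0, 0, 1, 0 };
        /* ...and the next test pattern we want to load while reading it out. */
        unsigned char next_test[FLOPS] = { 0, 1, 0, 1, 0, 1, 0, 1 };

        printf("scan out: ");
        for (int cycle = 0; cycle < FLOPS; cycle++) {
            printf("%d", chain[FLOPS - 1]);               /* one bit appears at scan-out  */
            memmove(&chain[1], &chain[0], FLOPS - 1);     /* every flop shifts along      */
            chain[0] = next_test[cycle];                  /* a new bit enters at scan-in  */
        }
        printf("\nold state read out; new pattern shifted in, one bit per cycle\n");
        return 0;
    }
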
  • by andika ( 5684 )
    Does programming for a clockless chip differ from a synchronous one? Every link I tried to follow only explains the design, speed, or power consumption differences.
    • by 2nd Post! ( 213333 ) <gundbear.pacbell@net> on Saturday September 15, 2001 @12:17PM (#2303039) Homepage
      It *can* be different, but that's really a function of the state of compilers and languages adapted for an async system. It needn't be different at all.

      Disclaimer: I was a student at Caltech, and I took 1 async VLSI course, and not a very in-depth one at that.

      One way to go about it is to make an async CPU that externally looks like a sync CPU; then you drop it into just about any system, and it works. Speed is wholly dependent upon VCore settings, cooling solutions, and drive strength, I think, though of course there are always gate and transistor performance bottlenecks. Programming and using such a chip would be no different from any other CPU.

      Another method is to have a partially async system, in which the CPU, some of the motherboard, and the RAM interface are async because of how fast they operate; go ahead and clock something like PCI, USB, etc., because those operate slowly enough that the effort of async isn't worth it. This solution is just a question of degree, really, of how much of the system is async and how much isn't.

      Now, that aside, there's the software aspect; how do you program an async system? At the lowest level it resembles, slightly, multi-threaded programming, in which you have multiple threads equating to the multiple function units, execution units, decoders, and stages in the pipeline, etc.

      You shuttle data around and wait for acknowledges that the data has been processed before you continue shuttling and processing data. You can synchronize around stages or functional units by making other stages or units dependent upon the output of said unit; instead of waiting for a clock to signal the next cycle of execution, you wait for an acknowledge signal.

      To be a little more clear, at the ASM level you would mov data, wait for an ack before another mov data, wait for an ack before sending an instruction, etc. Due to the magic of pipelining, the CPU doesn't have to be finished before you can start stuffing the pipeline, and because it's asynchronous, that means you can actually feed in data as fast as the processor can receive it, even if the back end or the core is choking on a particularly nasty multiplication.

      So you're feeding data at a furious rate into the CPU, while the CPU is processing prior instructions. If the front end gets full, or whatnot, it fails to signal an ack, so whatever mechanism is feeding data in (ram, cache, memory, whatever) pauses until the CPU can handle more data.

      The core, independent of the front end, is processing the data and sending out more instructions, branches, setting bits. With multiple functional units, each unit can run at its own speed and its own rate. So if all it's doing is adds, checking conditionals, etc., it may be able to outrun the data feed mechanism, since an add can be completed in one pipeline unit, while data always has to wait upon a slower storage mechanism.

      Or if the execution units are waiting because the core is doing a square root or something, it just tells the prefetch or whatever front-end units to wait, because it cannot handle another chunk of data or instruction yet, which propagates back to the data feed to wait as well.

      When it finishes with its current instruction, a ready signal gets propagated back through all the stages, and then more data gets fed in.

      So at the lowest levels it would start to resemble writing threaded code, in which you have to wait for the thread to be ready, to be awake, to be active before you send data, and if the thread is asleep, you wait until it awakes, or something like that.

      Multiprocessor async is similar, except that each CPU is just another thread, and if there's a hardware front end that decides which CPU to send instructions to, then it's really just a function of stuffing instructions into the least loaded or fastest running CPU; each CPU could, more or less, look like just another functional unit, and clusters pretty well because they all run asynchronously, meaning you don't have to do anything particularly special for load balancing; just send the data to the first one who signals ready, or if there are multiple cpus ready, read a status register to see which is more empty or whatever.

      Apologies if I made some errors, especially to those who know much more than I; this is a 4 year old interpretation of my async vlsi class =)
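
      The "no ack means the feeder pauses" behaviour is essentially back-pressure, and it looks a lot like a bounded producer/consumer queue. A rough sketch under that analogy, with the queue depth, the instruction stream and the delays all invented:

      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      #define DEPTH 4                       /* how much the "front end" can buffer */

      static int queue[DEPTH], head, tail, count;
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
      static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

      /* The data feed: stuffs instructions in as fast as the core will ack them. */
      static void *feeder(void *unused)
      {
          (void)unused;
          for (int insn = 0; insn < 10; insn++) {
              pthread_mutex_lock(&lock);
              while (count == DEPTH)                   /* no ack -> the feed pauses   */
                  pthread_cond_wait(&not_full, &lock);
              queue[tail] = insn;
              tail = (tail + 1) % DEPTH;
              count++;
              pthread_cond_signal(&not_empty);         /* "data is ready for you"     */
              pthread_mutex_unlock(&lock);
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t t;
          pthread_create(&t, NULL, feeder, NULL);

          /* The execution core: drains at its own pace, occasionally choking on a
             "nasty multiplication" that takes much longer than everything else. */
          for (int done = 0; done < 10; done++) {
              pthread_mutex_lock(&lock);
              while (count == 0)
                  pthread_cond_wait(&not_empty, &lock);
              int insn = queue[head];
              head = (head + 1) % DEPTH;
              count--;
              pthread_cond_signal(&not_full);          /* the ack: feeder may continue */
              pthread_mutex_unlock(&lock);

              usleep(insn == 5 ? 200000 : 10000);      /* instruction 5 is the slow one */
              printf("retired instruction %d\n", insn);
          }
          pthread_join(t, NULL);
          return 0;
      }
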
      • > So at the lowest levels it would start to
        > resemble writing threaded code,

        In case of a pipeline stall, would SMT be advantageous?

        To me it looks like asynchronous non-CPU devices, pipelined CPU, and SMT would be an ideal combination.
        • The interesting part about asynch CPUs is that they, duh, aren't clocked...

          So in the case of a pipeline flush (and accompanying stall), it doesn't take N clocks (whatever the pipeline depth is), it goes as fast or as slow as the flush mechanism reset takes...

          If done well, then a pipeline flush can operate at thousands of times faster than the normal operation of the pipeline because, well, you're just dumping data without doing any work; raise the proper bits and reset signals, and the whole pipeline dumps as fast as it can, while the front end feeder just slows down a bit (without stopping) in feeding data into the pipeline.

          Above assembly, btw, the programming language for the CPU doesn't have to look like SMT; it can, but it doesn't have to.
    • For general programming there will be no difference. It's when you try to optimize that differences appear. Optimization in async logic is very difficult. In synchronous logic you need to optimize around the superpipeline. You reorder your instructions so data is available to later instructions without causing them to stall the pipeline. Since you know exactly how long each instruction takes in cycles, you can schedule correctly, even force schedules with NOP instructions.

      In async programming NOP instructions can't be implemented. They don't really make sense anyway. The pipeline in an async chip is technically always stalling. So you will need to learn many more consequences of the instructions you choose. At times the data has an effect on the speed of the instruction.

      Using logic like DCVSL, a 32-bit shift operation would finish faster if the data was a binary 1 versus any number that used multiple 1's in its representation. This makes optimization rather interesting. For example, you can perform byte operations faster than word operations.

      In the async processor I did for my thesis, we simply ran the optimized synchronous code, throwing out the NOP instructions. The result was faster execution even though the code was not optimized for the correct processor.

  • [from article] But after a point, cranking up the clock speed becomes an exercise in diminishing returns. That's why a one-gigahertz chip doesn't run twice as fast as a 500-megahertz chip.
    Wrong. A 1GHz chip doesn't run twice as fast as a 500MHz chip because of pipelining, and because the support infrastructure in a typical PC can't handle a 1GHz chip well, so it spends a lot of time waiting for hard disk access and memory. Eliminating the clock isn't going to make the heads on a hard disk move any faster. The real benefit is that an idle component can't be a bottleneck anymore.
    • The point they were trying to make with that comment is this:
      Even if you removed all other bottlenecks, a 1GHz version of a 500MHz CPU (with no other architectural improvements) will not perform twice the work of the 500MHz version due to clock overhead.
    • If you think about it, the article is actually correct. The latency within the wires themselves also prevents a single central clock from timing the whole system accurately.

      Even with a hypothetical chip that doesn't incur speed decreases due to pipelining, the clock will still end up nearer to parts of the chip than to others, which will result in latency at the end of the pipeline.

      Hence if you've got a 500MHz chip with 2 stages and the clock physically placed near stage 1, then stage 1 of the pipeline will run at 500MHz, stage 2 will also run at 500MHz but with some latency, so the two-stage pipeline will complete an instruction in very slightly over 2 cycles. Add more stages and you'll get a bigger effect at the end. And as clock speeds go faster, you'll eventually hit the ceiling -- the latency might actually be as long as a single cycle itself.

      And having multiple clocks to offload the work (and to bridge the gap from the other stages) can only do so much -- eventually it becomes an issue of timing all these clocks together. You'll eventually wish to remove the clock altogether. =)

      As for I/O with the rest of the system, it's not really an issue here -- what is being discussed is the processor's raw speed. I/O bottlenecks are already being solved via intelligent caching, and for more improvement we will probably have to wait for a totally new architecture.
        Hence if you've got a 500MHz chip with 2 stages and the clock physically placed near stage 1, then stage 1 of the pipeline will run at 500MHz, stage 2 will also run at 500MHz but with some latency, so the two-stage pipeline will complete an instruction in very slightly over 2 cycles. Add more stages and you'll get a bigger effect at the end. And as clock speeds go faster, you'll eventually hit the ceiling -- the latency might actually be as long as a single cycle itself.
        Latency doesn't affect the time required to complete the instruction, only the time at which it is executed. If the clock reaches second pipeline stage late, the time required to complete that pipeline step is latency+calculation time. If that's more than one cycle, the chip is clocked higher than it can run, period. Anyways, there's a fairly obvious solution to clock latency in pipelines. Put the start of the pipeline near the registers, and the end of the pipeline near the registers, then a U shape for all the ones in the middle; since a stage only needs to interact with the ones before and after it, which are physically adjacent, the effective latency is small.
  • It would not be economically viable to try to push this new type of processor into a market overtaken by traditional synchronized processors and computer equipment. However, it seems that asynchronous microprocessing can still be used inside traditional computers if it is mixed together with synchronized systems. Imagine a computer that uses a synchronous bus just the way it does now, but has an asynchronous co-processor which is communicated with by a special type of synchronous CPU that allows certain operations to be carried out asynchronously.

    If, for example, a matrix multiplication needs to be done, the normal CPU would require a number of clock cycles that is proportional to the number of multiplications within the matrix divided by the number of processor pipes allocated for this task. If it can be proven that asynchronous processing can do the same job three times faster than a 'normal' CPU, why can't the 'normal' or traditional CPU ask the asynchronous co-processor to do the task for it?

    The problem is of course asynchronous data retrieval and storage. Probably the co-processor could actually be a co-processor card with its own asynchronous memory bank on board that can be later synchronized with the traditional memory banks. Such a system should not be too difficult to implement, since it could use a PCI slot for example. Soon a computer would become less and less synchronous, with the synchronous parts synchronizing many asynchronous devices.
  • I read that article thru a link at the bottom of C-net's news.com a few days ago. Why bother /. it? Are you implying /. is the only place we look for news?

    Gee...
    • No, /. may not be the only place "we" look for news (not that I can remember anybody saying that, esp. not the original blurb), but it's a damn good place to discuss the news.

      Gee indeed.

  • The only way they can probably advertise the async chip is to give the MHz of the fastest segment of the chip. That, or they will actually have to advertise other parts of the computer that determine speed. Does that mean that computers will be sold with more cache again, or that they'll actually tell you the bus speed or even the pipeline depth of the systems? My god, this will turn computer advertising around. A system similar to a SunBlade 1000 with 8 megs of cache will actually be advertised as faster than a P4-like system with 1/2 meg of cache. Will wonders never cease.
  • Yikes, this seems a little sci-fi, with some bogus claims....

    Of course the new Pentium 4 contains some elements of asynchronous design... all synchronous chips do! In a synchronous design, the logic between registers (the article calls them flip-flops) is asynchronous. The gating factor on the amount of asynchronous logic you can place between registers in a synchronous design is a function of the clock speed and the gate speed -- the faster the gates and/or the slower the clock speed, the more logic you can place between registers (a rough budget calculation follows at the end of this comment). Looks like the article is about a system with a clock rate of 0 without changing gate speed, so the processing rate will be the sum delay of the asynchronous logic -- I wonder what this would be on a chip the complexity of a P4 or G4?

    The upside to slower clocks is reduced pipelining, which can be useful in designs with limited data paths.

    The downside to slower clock speed is increased complexity. Data skew has to be monitored across the chip, so gate delays have to be accounted for at every gate in every possible data path (very complex). The chances of glitching increase with the amount of logic. With no clock it gets worse: every glitch can be seen -- not the case with a clock (glitches between clock edges may be tolerated).

    I also disagree that clock distribution is a limiting factor. This problem is overcome in larger ICs by distributing PLLs throughout the silicon. The limiting factor in clock speed has more to do with the materials used in the chip -- gate speed, skin effect, etc.

    Finally, there are quite a few ways to increase the performance of synchronous design. One way is to have multiple data and ALU paths like the Pentium and G4. Another is IC technology. Personally, I'm waiting for the day an all optical processor hits the market.

    So an asynchronous chip runs a little faster; the trade is an enormous design risk, marketing, OS development, etc. I say leave the anarchy to the software.
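
    Here is the rough budget calculation promised above -- how much "free-running" asynchronous logic fits between registers once the register and skew overheads are paid. All the figures are invented:

    #include <stdio.h>

    int main(void)
    {
        /* All figures are made up, in picoseconds. */
        const double t_clk   = 2000.0;   /* 500 MHz clock period    */
        const double t_cq    = 80.0;     /* register clock-to-Q     */
        const double t_setup = 60.0;     /* register setup time     */
        const double t_skew  = 100.0;    /* clock skew margin       */
        const double t_gate  = 45.0;     /* one gate delay          */

        double budget = t_clk - t_cq - t_setup - t_skew;
        printf("%.0f ps of logic per cycle -> about %d gate levels between registers\n",
               budget, (int)(budget / t_gate));
        return 0;
    }
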
  • Asynchronous mainframes were built in the 1960s, by, I think, Honeywell. There was a modest performance gain, but well under 2x.

    Parts of processors are already asynchronous. The basic way you get stuff done in a clocked machine is that you have a register feeding an array of logic gates some number of gates deep, with the output going to some other register. Within the array of logic gates, which might be an adder, a multiplier, or an instruction decoder, things are asynchronous. But the timing is designed so that the logic will, in the slowest case, settle before the register at the receiving end locks in its input states. The worst case thus limits the clock rate, which is why there is interest in asynchronous logic.

    The claims of lower power consumption are probably bogus. As Transmeta found out, the power saving modes weren't exclusive to their architecture. Once power-saving became a competitive issue, everybody put it in.

  • Wouldn't an asynchronous microchip be fabricated as a disc rather than a square, to help make the wires closer to the same length?
  • The article is surprisingly accurate, for a change. Read it.

    However, it seems to have spawned the usual problems here with misunderstanding and confusion. Practically a /. trademark by this point...

    Whether you construct a processor using conventional or asynchronous logic makes no difference to the programmer. The programming paradigm can be completely independent of the underlying hardware. (Admittedly, if you want to squeeze the absolute most performance from a given hardware design, you need to program with it in mind, but there is no reason why an ix86, or PPC, or SPARC, or MIPS chip couldn't be implemented asynchronously.)

    One of the most interesting advantages of asynchronous logic is that it allows the use of arbitrarily large die sizes. In synchronous logic, you're limited by the delays that arise from transmitting your clock pulses across the chip... at some point maintaining a global lock-step becomes infeasible.

    One of the most marketable advantages of asynchronous logic is the power saved by not having to constantly drive the same clock circuitry. Most chips support a 'sleep' or 'low power' mode where they turn off the clock or provide it to only a limited portion of the chip. The chip then has to go through a 'wake up' cycle to re-establish the clock throughout the chip before returning to normal operation. The power saved by asynchronous operation can be substantial, and the lack of a 'wake up' latency can be critical in certain applications.

    The biggest problem right now is that the vast Layout and Design masses are used to solving the synchronous problems and not the asynchronous problems; ditto for the available tools. However, with an asynchronous-savvy group, a given solution can be designed in less time than the equivalent synchronous solution (someone here was claiming otherwise...).

    And this technology is -not- vaporware... it's real and it's here. And whether you believe it or not, it's at least one part of the future.

    -YA

    PS: BS in EE from Caltech. Working for a company mentioned in the article, although their opinions have no logical relation or tie to mine.
  • The article mentions Theseus' approach to asynchronous design -- Null Convention Logic (NCL) -- but does not go into any detail. For more info, check out Theseus' white paper on the subject: ncl_paper.pdf [theseus.com]. I read this a couple of years ago and thought it was fascinating. At the time, I tried to design some "primitives" that could be implemented in an FPGA to at least try out some of the ideas. Not a trivial exercise.
  • The human brain doesn't have a clock speed on its central processing unit - in fact, there _is_ no central clock, but our minds manage to function with a great deal of processing power. Imagine the bandwidth of the file equivalent of all the .wav, .avi, .ogg, .mp3, .txt, optical character recognition, and AI functions we use, plus mechanical functions like bipedal balance. I've heard estimates that the brain performs about a trillion operations per second; is that about right? Pretty impressive.

    An interesting thing to think about is, with no clock speed, how we can still perceive time. We need to do this to predict the paths of moving objects, like birds and arrows and spears... or more recently car trajectories when we're driving. With no absolutely authoritative central time in our minds, how do we still have such an accurate sense of time when it comes to predicting these paths?

    I personally imagine that the brain's neural loops have some sense of ratios... for example, what if, hypothetically, the motor loop between, say, the basal ganglia and the corpus callosum were twice the speed of an eyeblink? The exact milliseconds could vary between people but still give a basis for comparing motion and "time" in the real world. Of course, this would be affected by age as the loops break down - which would account for the way the old people I've seen tend to drive.
  • There's a man named Charles Moore who has been developing asynchronous microprocessors over the last decade. His current chip is called the X18 and it can maintain a sustained processing rate of 2.4 billion instructions per second. The power consumption at that rate is 20 milliwatts. Check out http://www.mindspring.com/~chipchuck/X18.html Also check out http://www.mindspring.com/~chipchuck/25x.html, which describes his X25, currently available only as a prototype. Basically it's 25 X18s on one chip, running in parallel. Assuming you can write a program that takes full advantage of 25 such CPUs, that would amount to 60 billion instructions per second. The power consumption is so low as to allow operation of the microprocessor array for one year on one 100mAh battery.
    • I got curious as to how the speed of the X18 would compare to a Pentium 4. I pulled up Intel's Instruction Set Reference (ftp://download.intel.com/design/Pentium4/manuals/24547104.pdf) and was surprised to discover that they are apparently not giving the programmer any clue as to how long, or how many clock cycles, it takes these instructions to execute.

      Likely this is because it is a very difficult question to answer, clock cycles per instruction being highly variable depending on what else is going on in the processor at the same time.

      As I recall, in earlier versions of the Pentium, clock cycles per instruction would range from 110 to 20 cycles.

      If we assume the average Pentium 4 instruction takes 30 clock cycles to complete, then a Pentium 4 running at 2 gigahertz is executing 66 million instructions per second.

      The X18 executes 2.4 billion instructions per second. That's 36 times faster.

      Further, the X18 in any quantity would probably cost several cents per CPU to produce. The Pentium 4 at 1.7 gigahertz costs about 209 dollars.

      A slightly fairer comparison would be the X25, costing one dollar in quantity once one has gone a million units down the learning curve. This is an array of 25 CPUs and its practical instruction processing rate is probably highly variable depending on the application. There might be special cases where one could use all the CPUs and deliver 60 billion instructions per second (909 times faster than the Pentium 4), but more typically I would guess it would be a fraction of that, although still of course much faster.
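
      Spelling the arithmetic above out in a few lines of C (the 30-cycle average is just the assumption made above, not an official figure, and the X18 rate is the one quoted):

      #include <stdio.h>

      int main(void)
      {
          const double p4_hz   = 2.0e9;    /* 2 GHz Pentium 4                        */
          const double avg_cpi = 30.0;     /* assumed average cycles per instruction */
          const double x18_ips = 2.4e9;    /* claimed X18 instruction rate           */

          double p4_ips = p4_hz / avg_cpi;
          printf("P4: %.1f MIPS   X18: %.0f MIPS   ratio: %.0fx\n",
                 p4_ips / 1e6, x18_ips / 1e6, x18_ips / p4_ips);
          return 0;
      }
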
  • Fant says, "There's no clear signal to watch. Potential hackers don't know where to begin."

    Don't you just have to look for the handshake signals instead?

    Also, what are the implications of the "dual-rail" circuits -- doesn't this mean that you won't be able to fit as many transistors on the chip?

  • How will Intel sell chips if clockless computing is ever successful? They won't be able to double the length of their pipe to "speed up" their chips. I guess we will finally have to develop some fair metric to be able to compare chips between product lines....
