No, it was indeed the Slashdot Pre-Teen Cruiser. Naked petrified Natalie Portman approved. Comes with a lifetime supply of hot grits down your pants.
Oh, and I love this item from the old Slashdot FAQ:
If you're not corporate drones, whose idea was the Slashdot PT Cruiser?
Marketing. Personally, we (Rob and Jeff) think the Slashdot Cruiser was a really stupid idea, and if they'd asked us about it, we'd have told them so. Usually the marketing department consults with us about promotional ideas, but they're not required to, and in this case they didn't. Given that the reaction to it has been largely negative, we expect they've learned their lesson.
What piece of code, in a non-assembler format, has been run the most often, ever, on this planet? By 'most often,' I mean the highest number of executions, regardless of CPU type. For the code in question, let's set a lower limit of 3 consecutive lines.
Somehow I don't think your entry is in the spirit of the question.
As far as the original question is concerned: If you don't tie to any particular program, but just a subroutine used everywhere, heavily, I think memcpy() is a contender. (Pick a memcpy() implementation to put here to reach the 3 line minimum.) You'd be surprised how many programs' run times are dominated by memory copies.
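To satisfy the three-line minimum, here's the canonical naive form; real libc memcpy() implementations are heavily optimized with word-sized and vector copies, but the core loop is this:

```c
#include <stddef.h>

/* Naive byte-at-a-time memcpy. Real libc versions copy words or
   vectors, but this is the textbook three-line core of the routine. */
void *my_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}
```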
Apparently, the TRON family of operating systems is one of the most popular operating system families in the world, embedded in literally billions of devices. Sure, all those devices are probably quite slow, but they're also probably running around the clock, mostly spinning in some idle loop. So I wouldn't be surprised to find out that it has the most-executed sequential code.
If you consider RTL (register transfer language) to be code, then I'm sure one of the state machines at the heart of the fastest, most prevalent CPU family is well in the lead. It could be the state machine that selects instructions to dispatch, the state machine that retires results, the state machine that fetches instructions, the branch predictor state machine... If a given state machine gets reused over many products in a family, that acts as a multiplier. If it's part of a core that gets replicated many, many times (such as a GPU core), that also acts as a multiplier. Of course, GPUs will clock down and power down cores aggressively when not needed.
I pretty much agree with all of the above, having worked in the biz awhile myself.
Since this is a graphics algorithm (apparently), the OP might do better to try to state what the computational complexity is in terms of the operations involved for one output, in terms of basic operations such as multiplies and adds, and perhaps how much storage you need.
Consider this example: suppose someone came to me and asked, "How much does an 8x8 IDCT cost?" After asking them whether it needs bit exactness (some standards require it, others don't), I could give them some numbers and some implementation bounds: "The Chen IDCT needs around 11 multiplies and 20 adds per 8-point IDCT. Multiply that by 16 to get the full cost for an 8x8 (176 multiplies, 320 adds). To meet video precision requirements for an 8x8, the multiplies should have greater than 16-bit precision, and you should carry greater than 16 bits of precision between the horizontal and vertical passes."
How many gates is that? Well, depends on the throughput you require, and the details of the implementation. Given the number of multiplies and adds required, you can work toward a number. Suppose you needed to have enough IDCT bandwidth to update a 1080p 4:2:2 image at 60Hz. So, that's 1920 * 1080 * 2 * 60 = approx 250M pixels/second that you need to produce. In terms of 8x8 blocks, that's a little under 4M blocks/second, with 176 multiplies and 320 adds. So, that's approx 700M multiplies a second and 1.3B adds.
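The back-of-the-envelope numbers above can be checked mechanically (all figures from this thread: 176 multiplies and 320 adds per 8x8 block, 1080p 4:2:2 at 60 Hz):

```c
/* Throughput arithmetic from the post: 1080p 4:2:2 @ 60 Hz through
   a Chen-style 8x8 IDCT (176 multiplies, 320 adds per block). */
long long idct_muls_per_sec(void) {
    long long pixels = 1920LL * 1080 * 2 * 60;  /* 248,832,000: ~250M samples/s */
    long long blocks = pixels / 64;             /* 3,888,000: a little under 4M 8x8 blocks/s */
    return blocks * 176;                        /* 684,288,000: ~700M multiplies/s */
}

long long idct_adds_per_sec(void) {
    long long pixels = 1920LL * 1080 * 2 * 60;
    return (pixels / 64) * 320;                 /* 1,244,160,000: ~1.3B adds/s */
}
```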
Still, that's far from enough to get to a gate count. If you put down 1 multiplier and 2 adders and ran it at 1GHz, you'd have more than enough compute throughput. You still need to add some control logic around it (especially if you only put 1 multiplier and 2 adders, because the IDCT's compute pattern is non-trivial), and some memory to store inputs, outputs and intermediate results. A more likely implementation probably has a lot more multipliers and adders in hardware, but also runs at a much slower clock rate.
So how many gates is that? Even with the analysis above, you need more information to answer that question. You now need to pick an implementation strategy, and more than one makes sense. But you have a much better idea of the computational cost, and can choose among multiple implementations. For example, if energy efficiency is your goal, you might implement the horizontal and vertical IDCTs as dedicated multipliers and adders tuned to exactly the precision necessary and wired exactly as the dataflow requires, and run the whole block at a low clock rate using slower transistors with less leakage. If flexibility is your goal, you might put in a small CPU with enough grunt for the computational load, with the idea that you can run other algorithms there if you need to. And so on.
Oh, I'm familiar with the noun form of affect; however, it's usually applied to someone or something capable of having and displaying an emotion, as opposed to the emotional impact of an inanimate object. Either way, it's sufficiently obscure as to effect the appropriate response.
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
You could of used irregardless in you're sig to embiggen it's affect.
Won't the DDR take "50 to 150 cycles" to service each request? Or is there some sort of pipelining going on, where the DDR can take a request every 10 cycles but have a whole bunch of queued requests in flight?
Actually, that's pretty much exactly how it works. If you have a bunch of independent requests to DDR (and by independent, I mean that the processor(s) do not stall waiting for the information from one request in order to make the next), then you can get multiple requests in flight and they can pipeline. Streaming works this way, for example. The STREAM benchmark is a textbook example of a benchmark dominated by throughput, where all the accesses are independent. For example, in STREAM's "Add" loop, a[i] = b[i] + c[i] does not depend on a[i - K] = b[i - K] + c[i - K] or a[i + K] = b[i + K] + c[i + K] for any nonzero K. All four loops of the benchmark have that character. So as long as the processor can get enough work in flight, it can get multiple cache misses outstanding to DDR. And if one processor and its caches have limited ability to 'execute ahead' like this, multiple processors (or multiple independent threads on the same processor) acting independently can fill in those gaps.
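The Add kernel in question looks like this (a sketch; STREAM itself uses much larger arrays and times the loop):

```c
#include <stddef.h>

/* STREAM "Add" kernel: every iteration is independent of every other,
   so the memory system can pipeline many cache-miss requests at once. */
void stream_add(double *a, const double *b, const double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```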
Linked list traversal results in a series of requests that are all dependent on each other. If all the requests miss the caches and must go out to DDR, then the CPU's performance is bounded by the round-trip latency to DDR, not the DDR's throughput. Take a look at the linked list benchmarks in Ulrich Drepper's paper, "What Every Programmer Should Know About Memory." (Specifically, go down to section 3.3.2 on page 20.) Pay particular attention to Figure 3.15, Sequential vs. Random Read (for a single thread), and also compare to Figure 3.21, which shows multi-threaded random accesses for 1, 2, and 4 threads.
The paper might be a little old (it uses a Pentium 4 for its benchmarks, after all), but the principles remain true. I should know... part of my day job is as a memory system architect.
Well, even on a shared memory, certain data structures are latency bound, not throughput bound.
For example, consider a linked list. If none of the 'next' pointers are in cache, then you spend a full round-trip to DDR to get the next 'next' pointer. Depending on the machine, that could be anywhere from 50 to 150 cycles of latency, but not a huge hit on throughput.
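The latency-bound pattern is a dependent load chain: each load's address comes from the previous load, so the misses serialize (a sketch of the access pattern, not Drepper's actual benchmark code):

```c
#include <stddef.h>

struct node { struct node *next; long payload; };

/* Each iteration's load address depends on the previous load, so
   cache misses cannot overlap: total time is roughly
   (number of hops) x (round-trip latency to DDR). */
long chase(const struct node *p) {
    long sum = 0;
    while (p) {
        sum += p->payload;
        p = p->next;   /* dependent load: cannot be pipelined */
    }
    return sum;
}
```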
Generalizing only slightly: a single processor chasing pointers will have a hard time maxing out the DDR throughput, although it will definitely be memory bottlenecked due to latency. Multiple processors all doing the same thing on the same memory will not, as a result, compete for bandwidth. Instead, their requests will execute in turn in the DDR, and you will be able to get some decent scaling up until the point where you have enough parallel requestors to start actually taxing the bandwidth.
If you bring disk accesses into the picture, you have some additional opportunities for scaling, if only some of the threads go to disk while others hit in DDR. But I grant that the crux of my argument assumes that accesses to DDR from a single thread bottleneck on latency, not throughput.
I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?
Two words: Map Reduce
Thank goodness Google doesn't linearly search the entire Internet every time I make a search. It'd get exponentially slower every year...
Did you miss the part in the article about 512-bit AVX and being able to do 32 double precision floating point operations per clock? Or the other part about running four-way SMT to hide memory system latency? Or the other, other part about 128 byte (1024-bit) L1D to CPU bandwidth?
These ain't plain ol' Atom processors.
For HPC workloads, these seem to be right up the alley of "heavy lifting."
Well, you'd hardly get to VGA quality. You can't even get all the way to CGA resolution. The TMS9918A VDP supported a maximum of 16 colors on a 256 x 192 (sub-CGA) resolution. Furthermore, it had a limitation of no more than 2 colors per 8-pixel wide span (in Graphics II "bitmap" mode). You need a minimum of 24K bytes to support a 4bpp bitmap at 256 x 192, but the TMS9918A VDP could only address 16K bytes. The graphics II mode (with its color limitations) fit into more like ~13K bytes. (6K for the pattern table, 6K for the color table and 768 bytes for the name table.)
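The VRAM arithmetic is easy to verify (sizes from the post: 256 x 192 at 4bpp versus the three Graphics II tables):

```c
/* TMS9918A memory budget arithmetic from the post. */

/* A true 4bpp bitmap at 256 x 192: more than the VDP's 16K of VRAM. */
int bytes_full_4bpp(void) { return 256 * 192 * 4 / 8; }   /* 24576 bytes */

/* Graphics II "bitmap" mode: fits in 16K, at the cost of
   only 2 colors per 8-pixel span. */
int bytes_graphics_ii(void) {
    int pattern = 256 * 192 / 8;          /* 6144 bytes: 1bpp tile patterns */
    int color   = 256 * 192 / 8;          /* 6144 bytes: color per 8-px span */
    int name    = (256 / 8) * (192 / 8);  /* 768 bytes: 32 x 24 tile map */
    return pattern + color + name;        /* 13056 bytes: ~13K */
}
```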
Of course, TI BASIC and TI Extended BASIC didn't offer this mode. For one thing, TI BASIC / TI XB programs got stored in graphics memory: the console itself only had 256 bytes of CPU-addressable RAM, and relied over-heavily on the VDP's graphics memory for everything else, including storing BASIC programs, variables, and sprite motion information (on top of the sprite position data that the VDP accessed directly itself).
From BASIC, you had a subset of the character set that was redefinable. I forget how many characters; I think it was something like 192 in TI BASIC and 168 in TI Extended BASIC, but those numbers are from (faded) memory. So, you could (slowly) simulate bitmapped memory in TI BASIC / TI XB by drawing a small patch of screen (much less than full-screen) and redefining character tiles to simulate bitmaps. It wasn't terribly efficient, because CALL CHAR took character patterns as ASCII strings of hexadecimal digits. Useful for fixed bitmaps, but not for variable bitmaps. If you had a TI MiniMemory cartridge, though, you could use CALL PEEKV and CALL POKEV to do things more efficiently, or write a small assembly routine stored in the cartridge and accessed through CALL LINK.
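For reference, CALL CHAR takes an 8x8 1bpp pattern as a 16-hex-digit string, two digits per row. A C sketch of that encoding (a hypothetical helper for illustration, not TI code):

```c
#include <stdio.h>

/* Encode an 8x8 1bpp tile (one byte per row, MSB = leftmost pixel)
   into the 16-hex-digit string TI BASIC's CALL CHAR expects. */
void tile_to_hex(const unsigned char rows[8], char out[17]) {
    for (int i = 0; i < 8; i++)
        sprintf(&out[i * 2], "%02X", rows[i]);
    out[16] = '\0';
}
```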
Not that I'd actually know how to use that old computer....
ASCII a stupid question, you get an EBCDIC answer.