
Comment Re:Holy crap (Score 2) 365

I pretty much agree with all of the above, having worked in the biz awhile myself.

Since this is a graphics algorithm (apparently), the OP might do better to state the computational cost in terms of the operations needed per output, counted in basic operations such as multiplies and adds, and perhaps how much storage you need.

Consider this example: suppose someone came to me and asked, "How much does an 8x8 IDCT cost?" After asking them whether it needs bit exactness or not (some standards require it, others don't), I could give them some numbers and some implementation bounds: "The Chen IDCT needs around 11 multiplies and 20 adds per 8-pt IDCT. Multiply that by 16 to get the full cost for an 8x8 (176 multiplies, 320 adds). To meet video precision requirements for an 8x8, the multiplies should carry greater than 16 bits of precision, and you should carry greater than 16 bits of precision between the horizontal and vertical passes."

How many gates is that? Well, it depends on the throughput you require and the details of the implementation. Given the number of multiplies and adds required, you can work toward a number. Suppose you needed enough IDCT bandwidth to update a 1080p 4:2:2 image at 60Hz. That's 1920 * 1080 * 2 * 60 = approx 250M samples/second that you need to produce. In terms of 8x8 blocks, that's a little under 4M blocks/second, at 176 multiplies and 320 adds each. So, that's approx 700M multiplies and 1.3B adds per second.
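Just to make the arithmetic concrete, here it is as a throwaway C snippet (a sketch of the estimate above, nothing more):

    #include <stdio.h>

    int main(void)
    {
        const long width = 1920, height = 1080;
        const long components = 2;   /* 4:2:2 -> 2 samples per pixel on average */
        const long fps = 60;
        const long mults_per_block = 176, adds_per_block = 320;

        long samples_per_sec = width * height * components * fps;
        long blocks_per_sec  = samples_per_sec / 64;   /* 8x8 = 64 samples */

        printf("samples/s: %ld\n", samples_per_sec);                  /* ~249M  */
        printf("blocks/s:  %ld\n", blocks_per_sec);                   /* ~3.9M  */
        printf("mults/s:   %ld\n", blocks_per_sec * mults_per_block); /* ~684M  */
        printf("adds/s:    %ld\n", blocks_per_sec * adds_per_block);  /* ~1.24B */
        return 0;
    }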

Still, that's far from enough to get to a gate count. If you put down 1 multiplier and 2 adders and ran it at 1GHz, you'd have more than enough compute throughput. You still need to add some control logic around it (especially if you only put 1 multiplier and 2 adders, because the IDCT's compute pattern is non-trivial), and some memory to store inputs, outputs and intermediate results. A more likely implementation probably has a lot more multipliers and adders in hardware, but also runs at a much slower clock rate.

So how many gates is that? You need much more information to answer that question, despite the analysis above. You now need to pick an implementation strategy, and more than one makes sense. But you have a much better idea of the computational cost, and can pick among multiple implementations. For example, if energy efficiency is your goal, you might implement the horizontal and vertical IDCTs as explicitly tuned multiplies and adds, sized to the exact precision necessary and connected exactly as the dataflow requires, and run the whole block at a low clock rate using slower transistors with less leakage. If flexibility is your goal, you might put in a small CPU with enough grunt to fit the computational load, with the idea that you can run other algorithms there if you need to, and so on.

Comment Re:I think most of us grasped this intuitively (Score 1) 264

Oh, I'm familiar with the noun form of affect; however, it's usually applied to someone or something capable of having and displaying an emotion, as opposed to the emotional impact of an inanimate object. Either way, it's sufficiently obscure as to effect the appropriate response.

Comment Re:How is DDR pipelined? (Score 1) 208

Won't the DDR take "50 to 150 cycles" to service each request? Or is there some sort of pipelining going on, where the DDR can take a request every 10 cycles but have a whole bunch of queued requests in flight?

Actually, that's pretty much exactly how it works. If you have a bunch of independent requests to DDR—and by independent, I mean that the processor(s) do not stall waiting for the information from one request in order to make the next—then you can get multiple requests in flight and they can pipeline. Streaming works this way, for example. The STREAM benchmark is a textbook example of a workload dominated by throughput, where all the accesses are independent: in STREAM's "Add" loop, a[i] = b[i] + c[i] does not depend on a[i - K] = b[i - K] + c[i - K] or a[i + K] = b[i + K] + c[i + K] for any value of K. All four loops of the benchmark have that character.

So as long as the processor can get enough work in flight, it can keep multiple cache misses outstanding to DDR. And if one processor and its caches have limited ability to 'execute ahead' like this, multiple processors (or multiple independent threads on the same processor) acting independently can fill in those gaps.
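For reference, here's a minimal sketch of a STREAM-style "Add" loop in C; every iteration stands alone, which is exactly what lets the misses pipeline:

    #include <stddef.h>

    /* STREAM-style "Add": no iteration depends on any other, so a
     * processor (or several) can keep many cache misses in flight
     * and the DDR pipeline stays full. */
    void stream_add(double *a, const double *b, const double *c, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }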

Linked list traversal results in a series of requests that are all dependent on each other. If all the requests miss the caches and must go out to DDR, then the CPU's performance is bounded by the round-trip latency to DDR, not the DDR's throughput. Take a look at the linked list benchmarks in Ulrich Drepper's paper, "What Every Programmer Should Know About Memory." (Specifically, go down to section 3.3.2 on page 20.) Pay particular attention to Figure 3.15, Sequential vs. Random Read (for a single thread), and also compare to Figure 3.21, which shows multi-threaded random accesses for 1, 2, and 4 threads.
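To be concrete, the traversal in question has this shape (a minimal sketch); each load's address comes from the previous load, so nothing can overlap:

    struct node { struct node *next; /* payload... */ };

    long chase(const struct node *p)
    {
        long hops = 0;
        while (p) {
            p = p->next;   /* serialized: the next address isn't known
                              until the previous load completes */
            hops++;
        }
        return hops;
    }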

The paper might be a little old (it uses a Pentium 4 for its benchmarks, after all), but the principles remain true. I should know... part of my day job is as a memory system architect. :-)

Comment Re:Cores wait for their turn to read RAM (Score 1) 208

Well, even on a shared memory, certain data structures are latency bound, not throughput bound.

For example, consider a linked list. If none of the 'next' pointers are in cache, then you spend a full round-trip to DDR to get the next 'next' pointer. Depending on the machine, that could be anywhere from 50 to 150 cycles of latency, but not a huge hit on throughput.

Generalizing only slightly: a single processor chasing pointers will have a hard time maxing out the DDR throughput, although it will definitely be memory bottlenecked due to latency. Multiple processors all doing the same thing on the same memory will not, as a result, compete for bandwidth. Instead, their requests will execute in turn in the DDR, and you will be able to get some decent scaling up until the point where you have enough parallel requestors to start actually taxing the bandwidth.
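A hypothetical sketch of what "enough parallel requestors" looks like: interleave several independent chains in one loop (or give each to its own thread), so the misses overlap and you start consuming bandwidth instead of just eating latency:

    struct node { struct node *next; };

    long chase4(const struct node *a, const struct node *b,
                const struct node *c, const struct node *d)
    {
        long hops = 0;
        while (a && b && c && d) {
            /* four independent loads can all be outstanding at once */
            a = a->next; b = b->next; c = c->next; d = d->next;
            hops += 4;
        }
        return hops;
    }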

If you bring disk accesses into the picture, you have some additional opportunities for scaling, if only some of the threads go to disk while others hit in DDR. But I grant that the crux of my argument assumes that accesses to DDR from a single thread bottleneck on latency, not throughput.

Comment Re:I fail to see parallelism in CSS flow (Score 1) 208

I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?

Two words: Map Reduce

Thank goodness Google doesn't linearly search the entire Internet every time I make a search. It'd get exponentially slower every year...

Comment Re:Yay more cores that I won't be using much of! (Score 1) 208

Did you miss the part in the article about 512-bit AVX and being able to do 32 double-precision floating point operations per clock? Or the other part about running four-way SMT to hide memory system latency? Or the other, other part about 128-byte (1024-bit) L1D-to-CPU bandwidth?

These ain't plain ol' Atom processors.

For HPC workloads, these seem to be right up the alley of "heavy lifting."
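If you're curious where 32 flops per clock comes from: one 512-bit FMA covers 8 doubles at 2 ops each (multiply plus add, 16 flops), and two FMA pipes per core doubles that. A minimal AVX-512 sketch in C (assumes a compiler with -mavx512f; not tied to any particular chip):

    #include <immintrin.h>

    /* a[i] = b[i] * c[i] + a[i], 8 double-precision lanes per FMA */
    void fma_arrays(double *a, const double *b, const double *c, long n)
    {
        for (long i = 0; i + 8 <= n; i += 8) {
            __m512d vb = _mm512_loadu_pd(&b[i]);
            __m512d vc = _mm512_loadu_pd(&c[i]);
            __m512d va = _mm512_loadu_pd(&a[i]);
            va = _mm512_fmadd_pd(vb, vc, va);   /* fused multiply-add */
            _mm512_storeu_pd(&a[i], va);
        }
    }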

Comment Re:But... why? (Score 1) 430

Well, you'd hardly get to VGA quality. You can't even get all the way to CGA resolution. The TMS9918A VDP supported a maximum of 16 colors at a 256 x 192 (sub-CGA) resolution. Furthermore, it was limited to no more than 2 colors per 8-pixel-wide span (in Graphics II "bitmap" mode). You need a minimum of 24K bytes to support a 4bpp bitmap at 256 x 192, but the TMS9918A VDP could only address 16K bytes. Graphics II mode (with its color limitations) fit into more like ~13K bytes. (6K for the pattern table, 6K for the color table, and 768 bytes for the name table.)
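Those byte counts are easy to check; here's the arithmetic as a trivial C snippet (a sketch assuming the standard Graphics II table sizes):

    #include <stdio.h>

    int main(void)
    {
        long full_4bpp = 256L * 192 * 4 / 8;  /* 24576 bytes: more than 16K VRAM */
        long pattern   = 256L * 8 * 3;        /* 6144: 256 tiles x 8 bytes x 3 screen thirds */
        long color     = 256L * 8 * 3;        /* 6144: one color byte per pattern byte */
        long name      = 32L * 24;            /* 768: 32x24 grid of tile indices */

        printf("full 4bpp bitmap:  %ld bytes\n", full_4bpp);
        printf("Graphics II total: %ld bytes\n", pattern + color + name); /* 13056 */
        return 0;
    }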

Of course, TI BASIC and TI Extended BASIC didn't offer this mode. For one thing, TI BASIC / TI XB programs got stored in graphics memory, because the console itself only had 256 bytes of CPU addressable memory, and relied over-heavily on the VDP's graphics memory for everything else, including storing BASIC programs, variables, and sprite motion information (on top of the sprite position data that the VDP accessed directly itself).

From BASIC, you had a subset of the character set that was redefinable. I forget how many characters; I think it was something like 192 in TI BASIC and 168 in TI Extended BASIC, but those numbers are from (faded) memory. So, you could (slowly) simulate bitmapped memory in TI BASIC / TI XB by drawing a small patch of screen (much less than full-screen) and redefining character tiles to simulate bitmaps. It wasn't terribly efficient, because CALL CHAR took character patterns as ASCII strings of hexadecimal digits. Useful for fixed bitmaps, but not for variable bitmaps. If you had a TI MiniMemory cartridge, though, you could use CALL PEEKV and CALL POKEV to do things more efficiently, or write a small assembly routine, stored in the cartridge, accessed through CALL LINK.

Not that I'd actually know how to use that old computer.... ;-)

Comment Re:Vectorized factorials! (Score 1) 225

I think you misunderstood here: there were two different levels of optimization that happened depending on the compiler I used. Both were much faster than a stacked if/else construction.

On the older GCCs and other toolchains, I got code roughly equivalent to FORTRAN's computed GOTO. The switch/case became roughly goto label_table[switchvar], and each of the cases had a branch back to the common join point. That's not pessimizing at all. For general switch/case statements with relatively dense case values, that's pretty much what you need to do.

On newer GCCs, for this particular construct, where every case was of the form x = value, it got rid of all the branches and replaced them with a lookup table on the value itself, removing nearly all the branches (except some basic range guards) and generating, roughly, x = lookup_table[switchvar], which is quite a fundamental leap over the normal switch/case reduction.
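A hedged sketch of the two shapes, in C; the second function is roughly what the newer GCCs synthesize from the first (details vary by version and target):

    /* Every case is a plain "x = constant"... */
    int dense_switch(int v)
    {
        int x;
        switch (v) {
        case 0: x = 7;  break;
        case 1: x = 42; break;
        case 2: x = 3;  break;
        case 3: x = 9;  break;
        default: x = 0; break;
        }
        return x;
    }

    /* ...so the compiler can replace the jump table with a data
     * lookup, eliminating the computed branch entirely. */
    int dense_lookup(int v)
    {
        static const int table[] = { 7, 42, 3, 9 };
        return ((unsigned)v < 4) ? table[v] : 0;   /* range guard */
    }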

So, to recap: the default behavior is to generate a lookup table of branch targets. That's what I expect as a baseline minimum for switch/case code generation, and I think we agree on that. On modern deep-pipeline processors, branches with variable targets tend to be very expensive (the BTB needs to guess correctly, and often it can't if the inputs vary wildly), but they still may be the best choice for some algorithms. The more advanced optimization that I was surprised to see GCC implement replaced the computed-GOTO construct with an actual lookup table on the value itself. This eliminates the computed branch entirely and is a huge win (order of magnitude!) on modern architectures, when it can be applied.

Comment Re:First let's understand this x32 correctly. (Score 1) 262

I think we're in violent agreement. The only reason I included Firefox in the list is that I've seen it top 4GB on my own system. Maybe it's an x86-64-related memory leak, since all the memory measurements I hear people touting for the 32-bit version are far lower. Or maybe it's just full of pointers, which wouldn't be too surprising, really. :-)

again, most people posting on the x32 subject don't even know what exactly a mmap mapping is, or a shared memory segment, or memory allocated from sbrk and other methods, or dynamically allocated stacks, and they jump to the conclusion that a browser must require over 4GB of address space

Well, I only included Firefox because, on my computer right now, it has mapped over 3GB and has 2.1GB resident. That still fits within the 3GB/1GB split of 32-bit Linux. I have seen it go as high as 6GB, but its usual steady state for me is around 4GB. I never close any tabs. You're right, though, that a web browser shouldn't require that much RAM under normal circumstances. FWIW, I am quite familiar with mmap, shared memory, sbrk, huge pages (including using mmap to map files on a hugetlbfs to get larger pages and improve my TLB performance), etc. I didn't include Firefox out of ignorance.
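For anyone curious, a minimal sketch of the huge-page trick on Linux (the anonymous MAP_HUGETLB variant; mapping a file on hugetlbfs works similarly, and this assumes the admin has reserved huge pages):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;   /* one 2MB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        /* ... one TLB entry now covers 2MB instead of 4KB ... */
        munmap(p, len);
        return 0;
    }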

So, I reiterate: I think we're in violent agreement that x32 looks interesting and relevant, and the vast majority of applications don't need 64-bit, and many would benefit from smaller pointers.

A few applications I've written (large heuristic solvers, for example) benefit from 10-15GB of RAM. And from what I hear, some EDA apps in use at work could use 200+GB. And, of course, there are the ever-present large databases, as you mentioned. But that's pretty specialized compared to everything else.

Comment Re:First let's understand this x32 correctly. (Score 1) 262

I have a warm place in my heart for x32 for the reasons you mention: 95% - 98% of code is perfectly happy with 32-bit pointers. (I would posit 99%+ for many folks, actually. The kernel and a few big apps benefit from 64-bit, and the rest fit well in 4GB. I put "web browser" in the "big app" list, but it is only one app even if it's one of the biggest cycle-eaters.)
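A quick illustration of why smaller pointers matter for pointer-heavy code; the struct below is hypothetical, but its cache footprint halves under an ILP32 ABI like x32:

    #include <stdio.h>

    struct node {
        struct node *next;
        struct node *prev;
        int          key;
    };

    int main(void)
    {
        /* x86-64: 8-byte pointers -> sizeof == 24 (4 bytes of padding);
           x32:    4-byte pointers -> sizeof == 12, twice the nodes per
           cache line. */
        printf("sizeof(struct node) = %zu\n", sizeof(struct node));
        return 0;
    }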

Now that the ABI has matured and presumably they've had some time to work the kinks out, I'd love to see them post some up-to-date benchmarks. The previous benchmarks were actually somewhat disappointing; I expected a more noticeable speedup.
