
Comment Re:Holy crap (Score 2) 365

I pretty much agree with all of the above, having worked in the biz awhile myself.

Since this is a graphics algorithm (apparently), the OP might do better to state the computational cost in terms of the operations needed per output, counted in basic operations such as multiplies and adds, and perhaps how much storage you need.

Consider this example: suppose someone came to me and asked, "How much does an 8x8 IDCT cost?" After asking them whether it needs bit exactness or not (some standards require it, others don't), I could give them some numbers and some implementation bounds: "The Chen IDCT needs around 11 multiplies and 20 adds per 8-pt IDCT. Multiply that by 16 to get the full cost for an 8x8 (176 multiplies, 320 adds). To meet video precision requirements for an 8x8, the multiplies should carry greater than 16 bits of precision, and you should carry greater than 16 bits of precision between the horizontal and vertical passes."

How many gates is that? Well, it depends on the throughput you require and the details of the implementation. Given the number of multiplies and adds required, you can work toward a number. Suppose you needed enough IDCT bandwidth to update a 1080p 4:2:2 image at 60Hz. That's 1920 * 1080 * 2 * 60 = approx 250M samples/second that you need to produce. In terms of 8x8 blocks, that's a little under 4M blocks/second, at 176 multiplies and 320 adds each. So, that's approx 700M multiplies and 1.3B adds per second.
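Just to make the arithmetic concrete, here it is as a throwaway C snippet (a sketch of the estimate above, nothing more):

    #include <stdio.h>

    int main(void)
    {
        const long width = 1920, height = 1080;
        const long components = 2;   /* 4:2:2 -> 2 samples per pixel on average */
        const long fps = 60;
        const long mults_per_block = 176, adds_per_block = 320;

        long samples_per_sec = width * height * components * fps;
        long blocks_per_sec  = samples_per_sec / 64;   /* 8x8 = 64 samples */

        printf("samples/s: %ld\n", samples_per_sec);                  /* ~249M  */
        printf("blocks/s:  %ld\n", blocks_per_sec);                   /* ~3.9M  */
        printf("mults/s:   %ld\n", blocks_per_sec * mults_per_block); /* ~684M  */
        printf("adds/s:    %ld\n", blocks_per_sec * adds_per_block);  /* ~1.24B */
        return 0;
    }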

Still, that's far from enough to get to a gate count. If you put down 1 multiplier and 2 adders and ran it at 1GHz, you'd have more than enough compute throughput. You still need to add some control logic around it (especially if you only put 1 multiplier and 2 adders, because the IDCT's compute pattern is non-trivial), and some memory to store inputs, outputs and intermediate results. A more likely implementation probably has a lot more multipliers and adders in hardware, but also runs at a much slower clock rate.

So how many gates is that? You need much more information to answer that question, despite the analysis above. You now need to pick an implementation strategy, and more than one makes sense. But you have a much better idea of the computational cost, and can pick among multiple implementations. For example, if energy efficiency is your goal, you might implement the horizontal and vertical IDCTs as explicitly tuned multiplies and adds, sized to the exact precision necessary and connected exactly as the dataflow requires, and run the whole block at a low clock rate using slower transistors with less leakage. If flexibility is your goal, you might put in a small CPU with enough grunt to fit the computational load, with the idea that you can run other algorithms there if you need to, and so on.

Comment Re:I think most of us grasped this intuitively (Score 1) 264

Oh, I'm familiar with the noun form of affect; however, it's usually applied to someone or something capable of having and displaying an emotion, as opposed to the emotional impact of an inanimate object. Either way, it's sufficiently obscure as to effect the appropriate response.

Comment Re:How is DDR pipelined? (Score 1) 208

Won't the DDR take "50 to 150 cycles" to service each request? Or is there some sort of pipelining going on, where the DDR can take a request every 10 cycles but have a whole bunch of queued requests in flight?

Actually, that's pretty much exactly how it works. If you have a bunch of independent requests to DDR—and by independent, I mean that the processor(s) do not stall waiting for the information from one request in order to make the next—then you can get multiple requests in flight and they can pipeline. Streaming works this way, for example. The STREAM benchmark is a textbook example of a workload dominated by throughput, where all the accesses are independent: in STREAM's "Add" loop, a[i] = b[i] + c[i] does not depend on a[i - K] = b[i - K] + c[i - K] or a[i + K] = b[i + K] + c[i + K] for any value of K. All four loops of the benchmark have that character.

So as long as the processor can get enough work in flight, it can keep multiple cache misses outstanding to DDR. And if one processor and its caches have limited ability to 'execute ahead' like this, multiple processors (or multiple independent threads on the same processor) acting independently can fill in those gaps.
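For reference, here's a minimal sketch of a STREAM-style "Add" loop in C; every iteration stands alone, which is exactly what lets the misses pipeline:

    #include <stddef.h>

    /* STREAM-style "Add": no iteration depends on any other, so a
     * processor (or several) can keep many cache misses in flight
     * and the DDR pipeline stays full. */
    void stream_add(double *a, const double *b, const double *c, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }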

Linked list traversal results in a series of requests that are all dependent on each other. If all the requests miss the caches and must go out to DDR, then the CPU's performance is bounded by the round-trip latency to DDR, not the DDR's throughput. Take a look at the linked list benchmarks in Ulrich Drepper's paper, "What Every Programmer Should Know About Memory." (Specifically, go down to section 3.3.2 on page 20.) Pay particular attention to Figure 3.15, Sequential vs. Random Read (for a single thread), and also compare to Figure 3.21, which shows multi-threaded random accesses for 1, 2, and 4 threads.
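To be concrete, the traversal in question has this shape (a minimal sketch); each load's address comes from the previous load, so nothing can overlap:

    struct node { struct node *next; /* payload... */ };

    long chase(const struct node *p)
    {
        long hops = 0;
        while (p) {
            p = p->next;   /* serialized: the next address isn't known
                              until the previous load completes */
            hops++;
        }
        return hops;
    }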

The paper might be a little old (it uses a Pentium 4 for its benchmarks, after all), but the principles remain true. I should know... part of my day job is as a memory system architect. :-)

Comment Re:Cores wait for their turn to read RAM (Score 1) 208

Well, even on a shared memory, certain data structures are latency bound, not throughput bound.

For example, consider a linked list. If none of the 'next' pointers are in cache, then you spend a full round-trip to DDR to get the next 'next' pointer. Depending on the machine, that could be anywhere from 50 to 150 cycles of latency, but not a huge hit on throughput.

Generalizing only slightly: a single processor chasing pointers will have a hard time maxing out the DDR throughput, although it will definitely be memory bottlenecked due to latency. Multiple processors all doing the same thing on the same memory will not, as a result, compete for bandwidth. Instead, their requests will execute in turn in the DDR, and you will be able to get some decent scaling up until the point where you have enough parallel requestors to start actually taxing the bandwidth.
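A hypothetical sketch of what "enough parallel requestors" looks like: interleave several independent chains in one loop (or give each to its own thread), so the misses overlap and you start consuming bandwidth instead of just eating latency:

    struct node { struct node *next; };

    long chase4(const struct node *a, const struct node *b,
                const struct node *c, const struct node *d)
    {
        long hops = 0;
        while (a && b && c && d) {
            /* four independent loads can all be outstanding at once */
            a = a->next; b = b->next; c = c->next; d = d->next;
            hops += 4;
        }
        return hops;
    }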

If you bring disk accesses into the picture, you have some additional opportunities for scaling, if only some of the threads go to disk while others hit in DDR. But I grant that the crux of my argument assumes that accesses to DDR from a single thread bottleneck on latency, not throughput.

Comment Re:I fail to see parallelism in CSS flow (Score 1) 208

I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?

Two words: Map Reduce

Thank goodness Google doesn't linearly search the entire Internet every time I make a search. It'd get exponentially slower every year...

Comment Re:Yay more cores that I won't be using much of! (Score 1) 208

Did you miss the part in the article about 512-bit AVX and being able to do 32 double-precision floating point operations per clock? Or the other part about running four-way SMT to hide memory system latency? Or the other, other part about 128-byte (1024-bit) L1D-to-CPU bandwidth?

These ain't plain ol' Atom processors.

For HPC workloads, these seem to be right up the alley of "heavy lifting."
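If you're curious where 32 flops per clock comes from: one 512-bit FMA covers 8 doubles at 2 ops each (multiply plus add, 16 flops), and two FMA pipes per core doubles that. A minimal AVX-512 sketch in C (assumes a compiler with -mavx512f; not tied to any particular chip):

    #include <immintrin.h>

    /* a[i] = b[i] * c[i] + a[i], 8 double-precision lanes per FMA */
    void fma_arrays(double *a, const double *b, const double *c, long n)
    {
        for (long i = 0; i + 8 <= n; i += 8) {
            __m512d vb = _mm512_loadu_pd(&b[i]);
            __m512d vc = _mm512_loadu_pd(&c[i]);
            __m512d va = _mm512_loadu_pd(&a[i]);
            va = _mm512_fmadd_pd(vb, vc, va);   /* fused multiply-add */
            _mm512_storeu_pd(&a[i], va);
        }
    }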

Comment Re:But... why? (Score 1) 430

Well, you'd hardly get to VGA quality. You can't even get all the way to CGA resolution. The TMS9918A VDP supported a maximum of 16 colors at a 256 x 192 (sub-CGA) resolution. Furthermore, it was limited to no more than 2 colors per 8-pixel-wide span (in Graphics II "bitmap" mode). You need a minimum of 24K bytes to support a 4bpp bitmap at 256 x 192, but the TMS9918A VDP could only address 16K bytes. Graphics II mode (with its color limitations) fit into more like ~13K bytes. (6K for the pattern table, 6K for the color table, and 768 bytes for the name table.)
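Those byte counts are easy to check; here's the arithmetic as a trivial C snippet (a sketch assuming the standard Graphics II table sizes):

    #include <stdio.h>

    int main(void)
    {
        long full_4bpp = 256L * 192 * 4 / 8;  /* 24576 bytes: more than 16K VRAM */
        long pattern   = 256L * 8 * 3;        /* 6144: 256 tiles x 8 bytes x 3 screen thirds */
        long color     = 256L * 8 * 3;        /* 6144: one color byte per pattern byte */
        long name      = 32L * 24;            /* 768: 32x24 grid of tile indices */

        printf("full 4bpp bitmap:  %ld bytes\n", full_4bpp);
        printf("Graphics II total: %ld bytes\n", pattern + color + name); /* 13056 */
        return 0;
    }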

Of course, TI BASIC and TI Extended BASIC didn't offer this mode. For one thing, TI BASIC / TI XB programs got stored in graphics memory, because the console itself only had 256 bytes of CPU addressable memory, and relied over-heavily on the VDP's graphics memory for everything else, including storing BASIC programs, variables, and sprite motion information (on top of the sprite position data that the VDP accessed directly itself).

From BASIC, you had a subset of the character set that was redefinable. I forget how many characters; I think it was something like 192 in TI BASIC and 168 in TI Extended BASIC, but those numbers are from (faded) memory. So, you could (slowly) simulate bitmapped memory in TI BASIC / TI XB by drawing a small patch of screen (much less than full-screen) and redefining character tiles to simulate bitmaps. It wasn't terribly efficient, because CALL CHAR took character patterns as ASCII strings of hexadecimal digits. Useful for fixed bitmaps, but not for variable bitmaps. If you had a TI MiniMemory cartridge, though, you could use CALL PEEKV and CALL POKEV to do things more efficiently, or write a small assembly routine, stored in the cartridge, accessed through CALL LINK.

Not that I'd actually know how to use that old computer.... ;-)

Comment Re:Vectorized factorials! (Score 1) 225

I think you misunderstood here: there were two different levels of optimization that happened depending on the compiler I used. Both were much faster than a stacked if/else construction.

On the older GCCs and other toolchains, I got code roughly equivalent to FORTRAN's computed GOTO. The switch/case became roughly goto label_table[switchvar], and each of the cases had a branch back to the common join point. That's not pessimizing at all. For general switch/case statements with relatively dense case values, that's pretty much what you need to do.

On newer GCCs, for this particular construct, where every case was of the form x = value, it got rid of all the branches and replaced them with a lookup table on the value itself, removing nearly all the branches (except some basic range guards) and generating, roughly, x = lookup_table[switchvar], which is quite a fundamental leap over the normal switch/case reduction.
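A hedged sketch of the two shapes, in C; the second function is roughly what the newer GCCs synthesize from the first (details vary by version and target):

    /* Every case is a plain "x = constant"... */
    int dense_switch(int v)
    {
        int x;
        switch (v) {
        case 0: x = 7;  break;
        case 1: x = 42; break;
        case 2: x = 3;  break;
        case 3: x = 9;  break;
        default: x = 0; break;
        }
        return x;
    }

    /* ...so the compiler can replace the jump table with a data
     * lookup, eliminating the computed branch entirely. */
    int dense_lookup(int v)
    {
        static const int table[] = { 7, 42, 3, 9 };
        return ((unsigned)v < 4) ? table[v] : 0;   /* range guard */
    }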

So, to recap: the default behavior is to generate a lookup table of branch targets. That's what I expect as a baseline minimum for switch/case code generation, and I think we agree on that. On modern deep-pipeline processors, branches with variable targets tend to be very expensive (the BTB needs to guess correctly, and often it can't if the inputs vary wildly), but they still may be the best choice for some algorithms. The more advanced optimization that I was surprised to see GCC implement replaced the computed-GOTO construct with an actual lookup table on the value itself. This eliminates the computed branch entirely and is a huge win (order of magnitude!) on modern architectures, when it can be applied.

Comment Re:First let's understand this x32 correctly. (Score 1) 262

I think we're in violent agreement. The only reason I included Firefox in the list is that I've seen it top 4GB on my own system. Maybe it's an x86-64-related memory leak, since all the memory measurements I hear people touting for the 32-bit version are far lower. Or maybe it's just full of pointers, which wouldn't be too surprising, really. :-)

again, most people posting on the x32 subject don't even know what exactly a mmap mapping is, or a shared memory segment, or memory allocated from sbrk and other methods, or dynamically allocated stacks, and they jump to the conclusion that a browser must require over 4GB of address space

Well, I only included Firefox because, on my computer right now, it has mapped over 3GB and has 2.1GB resident. That still fits within the 3GB/1GB split of 32-bit Linux. I have seen it go as high as 6GB, but its usual steady state for me is around 4GB. I never close any tabs. You're right, though, that a web browser shouldn't require that much RAM under normal circumstances. FWIW, I am quite familiar with mmap, shared memory, sbrk, huge pages (including using mmap to map files on a hugetlbfs to get larger pages and improve my TLB performance), etc. I didn't include Firefox out of ignorance.
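For anyone curious, a minimal sketch of the huge-page trick on Linux (the anonymous MAP_HUGETLB variant; mapping a file on hugetlbfs works similarly, and this assumes the admin has reserved huge pages):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;   /* one 2MB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        /* ... one TLB entry now covers 2MB instead of 4KB ... */
        munmap(p, len);
        return 0;
    }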

So, I reiterate: I think we're in violent agreement that x32 looks interesting and relevant, and the vast majority of applications don't need 64-bit, and many would benefit from smaller pointers.

A few applications I've written (large heuristic solvers, for example) benefit from 10-15GB of RAM. And from what I hear, some EDA apps in use at work could use 200+GB. And, of course, there are the ever-present large databases, as you mentioned. But that's pretty specialized compared to everything else.

Comment Re:First let's understand this x32 correctly. (Score 1) 262

I have a warm place in my heart for x32 for the reasons you mention: 95% - 98% of code is perfectly happy with 32-bit pointers. (I would posit 99%+ for many folks, actually. The kernel and a few big apps benefit from 64-bit, and the rest fit well in 4GB. I put "web browser" in the "big app" list, but it is only one app even if it's one of the biggest cycle-eaters.)
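A quick illustration of why smaller pointers matter for pointer-heavy code; the struct below is hypothetical, but its cache footprint halves under an ILP32 ABI like x32:

    #include <stdio.h>

    struct node {
        struct node *next;
        struct node *prev;
        int          key;
    };

    int main(void)
    {
        /* x86-64: 8-byte pointers -> sizeof == 24 (4 bytes of padding);
           x32:    4-byte pointers -> sizeof == 12, twice the nodes per
           cache line. */
        printf("sizeof(struct node) = %zu\n", sizeof(struct node));
        return 0;
    }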

Now that the ABI has matured and presumably they've had some time to work the kinks out, I'd love to see them post some up-to-date benchmarks. The previous benchmarks were actually somewhat disappointing; I expected a more noticeable speedup.
