Become a fan of Slashdot on Facebook

 



Forgot your password?
typodupeerror

Comment Re:How is DDR pipelined? (Score 1) 208

Won't the DDR take "50 to 150 cycles" to service each request? Or is there some sort of pipelining going on, where the DDR can take a request every 10 cycles but have a whole bunch of queued requests in flight?

Actually, that's pretty much exactly how it works. If you have a bunch of independent requests to DDR—and by independent, I mean that the processor(s) do not stall waiting for the information from one request in order to make the next—then you can get multiple requests in flight and they can pipeline. Streaming works this way, for example. The STREAM benchmark is a textbook example of a benchmark dominated by throughput, where all the accesses are independent. For example, a[i] = b[i] + c[i] does not depend on a[i - K] = b[i - K] + c[i - K] or a[i + K] = b[i + K] + c[i + K] for any value of K in STREAM's "Add" loop. All four loops of the benchmark have that character. So as long as the processor can get enough work in-flight, it can get multiple cache misses outstanding to DDR. And if one processor and its caches have limited ability to 'execute ahead' like this, multiple processors (or multiple independent threads on the same processor) acting independently can fill in those gaps.

Linked list traversal results in a series of requests that are all dependent on each other. If all the requests miss the caches and must go out to DDR, then the CPU's performance is bounded by the round trip latency to DDR, not the DDR's throughput. Take a look at the linked list benchmarks in Ulrich Drpper's paper, "What Every Programmer Should Know About Memory." (Specifically, go down to section 3.3.2 on page 20.) Pay particular attention to Figure 3.15, Sequential vs. Random Read (for a single thread), and also compare to Figure 3.21 which shows multi-threaded random accesses for 1, 2, and 4 threads.

The paper might be a little old (it uses a Pentium 4 for its benchmarks, after all), but the principles remain true. I should know... part of my day job is as a memory system architect. :-)

Comment Re:Cores wait for their turn to read RAM (Score 1) 208

Well, even on a shared memory, certain data structures are latency bound, not throughput bound.

For example, consider a linked list. If none of the 'next' pointers are in cache, then you spend a full round-trip to DDR to get the next 'next' pointer. Depending on the machine, that could be anywhere from 50 to 150 cycles of latency, but not a huge hit on throughput.

Generalizing only slightly: a single processor chasing pointers will have a hard time maxing out the DDR throughput, although it will definitely be memory bottlenecked due to latency. Multiple processors all doing the same thing on the same memory will not, as a result, compete for bandwidth. Instead, their requests will execute in turn in the DDR, and you will be able to get some decent scaling up until the point where you have enough parallel requestors to start actually taxing the bandwidth.

If you bring disk accesses in the picture, you have some additional opportunities for scaling, if only some of the threads go to disk, while others hit in DDR. But, I grant that the crux of my argument assumes that accesses to DDR from a single thread bottleneck on latency, not throughput.

Comment Re:I fail to see parallelism in CSS flow (Score 1) 208

I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?

Two words: Map Reduce

Thank goodness Google doesn't linearly search the entire Internet every time I make a search. It'd get exponentially slower every year...

Comment Re:Yay more cores that I won't be using much of! (Score 1) 208

Did you miss the part in the article about 512-bit AVX and being able to do 32 double precision floating point operations per clock? Or the other part about running four-way SMT to hide memory system latency? Or the other, other part about 128 byte (1024-bit) L1D to CPU bandwidth?

These ain't plain ol' Atom processors.

For HPC workloads, these seem to be right up the alley of "heavy lifting."

Comment Re:But... why? (Score 1) 430

Well, you'd hardly get to VGA quality. You can't even get all the way to CGA resolution. The TMS9918A VDP supported a maximum of 16 colors on a 256 x 192 (sub-CGA) resolution. Furthermore, it had a limitation of no more than 2 colors per 8-pixel wide span (in Graphics II "bitmap" mode). You need a minimum of 24K bytes to support a 4bpp bitmap at 256 x 192, but the TMS9918A VDP could only address 16K bytes. The graphics II mode (with its color limitations) fit into more like ~13K bytes. (6K for the pattern table, 6K for the color table and 768 bytes for the name table.)

Of course, TI BASIC and TI Extended BASIC didn't offer this mode. For one thing, TI BASIC / TI XB programs got stored in graphics memory, because the console itself only had 256 bytes of CPU addressable memory, and relied over-heavily on the VDP's graphics memory for everything else, including storing BASIC programs, variables, and sprite motion information (on top of the sprite position data that the VDP accessed directly itself).

From BASIC, you had a subset of the character set that was redefineable. I forget how many characters; I think it was something like 192 in TI BASIC and 168 in TI Extended BASIC, but those numbers are from (faded) memory. So, you could (slowly) simulate bitmapped memory in TI BASIC / TI XB by drawing a small patch of screen (much less than full-screen), and redefining character tiles to simulate bitmaps. It wasn't terribly efficient, because CALL CHAR took character patterns as ASCII strings of hexadecimal digits. Useful for fixed bitmaps, but not for variable bitmaps. If you had a TI MiniMemory cartridge, though, you could use CALL PEEKV andCALL POKEV to do things more efficiently, or write a small assembly routine stored in the cartridge accessed through CALL LINK.

Not that I'd actually know how to use that old computer.... ;-)

Comment Re:Vectorized factorials! (Score 1) 225

I think you misunderstood here: There were two different levels of optimization that happened depending on the compiler I used. Both were much faster than a stacked ifelse construction.

On the older GCCs and other tool-chains, I got code roughly equivalent to FORTRAN's computed GOTO. The switchcase became roughly goto label_table[ switchvar ], and each of the cases had a branch back to the common join point. That's not pessimizing at all. For the general switchcase statements with relatively dense case values, that's pretty much what you need to do.

On newer GCCs, for this particular construct, where every case was of the form x = value , it got rid of all the branches and replaced it with a lookup table on x itself, removing nearly all the branches (except some basic range guards), and generating, roughly, x = lookup_table[ switchvar ], which is quite a fundamental leap over normal switchcase reduction.

So to re-cap, the default behavior is to generate a lookup table of branch targets. That's what I expect is a baseline minimum for switchcase code generation, and I think we agree on that. On modern deep-pipeline processors, branches with variable targets tend to be very expensive (the BTB needs to guess correctly, and often it can't if the inputs vary wildly), but they still may be the best choice for some algorithms. The more advanced optimization that I was surprised to see GCC implement replaced the computed-GOTO construct with an actual lookup table on the value itself. This eliminates the computed branch entirely and is a huge win (order of magnitude!) on modern architectures, when it can be applied.

Comment Re:First let's understand this x32 correctly. (Score 1) 262

I think we're in violent agreement. The only reason I included Firefox in the list is that I've seen it top 4GB on my own system. Maybe it's a x86-64 related memory leak, since all the memory measurements I hear people touting for the 32-bit version are far lower. Or maybe it's just full of pointers, which wouldn't be too surprising, really. :-)

again, most people posting on the x32 subject don't even know what exactly is a mmap mapping, a shared memory segment, memory allocated from sbrk and other methods, stacks allocated dynamically, and jump to the conclusion a browser must require over 4GB of address space

Well I only included Firefox because, on my computer right now, it has mapped over 3GB, and has 2.1GB resident. That still fits within the 3GB/1GB split of 32-bit Linux. I have seen it go as high as 6GB, but its usual steady state for me is around 4GB. I never close any tabs. You're right though that a web-browser shouldn't require that much RAM under normal circumstances. FWIW, I am quite familiar with mmap, shared memory, sbrk, huge pages (including using mmap to map files on a hugetlbfs to get larger pages and improve my TLB performace), etc. I didn't include Firefox out of ignorance.

So, I reiterate: I think we're in violent agreement that x32 looks interesting and relevant, and the vast majority of applications don't need 64-bit, and many would benefit from smaller pointers.

A few applications I've written (large heuristic solvers, for example) benefit from 10GB - 15GB RAM. And from what I hear, some EDA apps they use at work could use 200+GB. And, of course, there's the ever present large databases, as you mentioned. But that's pretty specialized as compared to everything else.

Comment Re:First let's understand this x32 correctly. (Score 1) 262

I have a warm place in my heart for x32 for the reasons you mention: 95% - 98% of code is perfectly happy with 32-bit pointers. (I would posit 99%+ for many folks, actually. The kernel and a few big apps benefit from 64-bit, and the rest fit well in 4GB. I put "web browser" in the "big app" list, but it is only one app even if it's one of the biggest cycle-eaters.)

Now that the ABI has matured and presumably they've had some time to work the kinks out, I'd love to see them post some up-to-date benchmarks. The previous benchmarks were actually somewhat disappointing. I expected a bit more noticeable speedup.

Comment Re:Probably optimizing for larger numbers (Score 1) 225

Well, I just compared the vectorized implementation to the simple tail-call optimized version (ie. reduced to a simple loop) that GCC 4.4 produced.

The vectorized version was a paltry 6% faster on my box, measured across 65,536 iterations. (My machine is a AMD Phenom X4, 3.4GHz.)

So, it is faster, but not by a meaningful amount, and probably not enough faster to justify the hilarious code size increase.

Comment Re:News for nerds or not (Score 1) 225

So did you read the two paragraphs that followed? And in case you think I'm imagining these compilers, flip to PDF page 129 (page 119 in the text) and see:

High-level optimizations (HLO) exploit the properties of source code constructs, such as loops and arrays, in the applications developed in high-level programming languages, such as C++. They include loop interchange, loop fusion, loop unrolling, loop distribution, unroll-and-jam, blocking, data prefetch, scalar replacement, data layout optimizations, and others. The option that turns on the high-level optimizations is -O3.

The optimizations I highlighted all will affect the data access pattern, especially loop interchange, blocking and data layout optimization. Next, check the date on the document: 2003. That's a decade ago. It's reasonable to expect they've only gotten more aggressive since then.

This much shorter survey of modern compilers also goes into optimizations that might change access order. They even call this reordering out: "Since loop interchanging may change the access pattern, care must be taken to not introduce non-unit stride access."

Welcome to modern compilers.

Slashdot Top Deals

"Lead us in a few words of silent prayer." -- Bill Peterson, former Houston Oiler football coach

Working...