
Comment Re:Vectorized factorials! (Score 1) 225

I think you misunderstood here: there were two different levels of optimization, depending on the compiler I used. Both were much faster than a stacked if/else construction.

On the older GCCs and other toolchains, I got code roughly equivalent to FORTRAN's computed GOTO. The switch-case became roughly goto label_table[switchvar], and each of the cases had a branch back to the common join point. That's not a pessimization at all; for general switch-case statements with relatively dense case values, that's pretty much what you need to do.

On newer GCCs, for this particular construct, where every case was of the form x = value, it replaced the whole thing with a lookup table on the value itself, removing nearly all the branches (except some basic range guards) and generating, roughly, x = lookup_table[switchvar]. That's a fundamental leap over normal switch-case reduction.

So, to recap: the default behavior is to generate a lookup table of branch targets. That's what I expect as a baseline minimum for switch-case code generation, and I think we agree on that. On modern deep-pipeline processors, branches with variable targets tend to be very expensive (the BTB needs to guess correctly, and often it can't if the inputs vary wildly), but they may still be the best choice for some algorithms. The more advanced optimization that I was surprised to see GCC implement replaces the computed-GOTO construct with an actual lookup table on the value itself. This eliminates the computed branch entirely and is a huge win (an order of magnitude!) on modern architectures, when it can be applied.
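To make that concrete, here's a minimal C sketch of the kind of construct I'm describing, with made-up values; the second function is my hand-written picture of what the newer GCC effectively emits, not its literal output:

    /* Every case just assigns a constant to x. */
    static int classify(unsigned switchvar)
    {
        int x;
        switch (switchvar) {
        case 0:  x = 7; break;
        case 1:  x = 3; break;
        case 2:  x = 9; break;
        case 3:  x = 1; break;
        default: x = 0; break;
        }
        return x;
    }

    /* What the optimization amounts to: a range guard plus a data
       lookup, with no computed branch at all. */
    static const int value_table[4] = { 7, 3, 9, 1 };

    static int classify_opt(unsigned switchvar)
    {
        return (switchvar < 4) ? value_table[switchvar] : 0;
    }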

Comment Re:First let's understand this x32 correctly. (Score 1) 262

I think we're in violent agreement. The only reason I included Firefox in the list is that I've seen it top 4GB on my own system. Maybe it's an x86-64-related memory leak, since all the memory measurements I hear people touting for the 32-bit version are far lower. Or maybe it's just full of pointers, which wouldn't be too surprising, really. :-)

"again, most people posting on the x32 subject don't even know what exactly is a mmap mapping, a shared memory segment, memory allocated from sbrk and other methods, stacks allocated dynamically, and jump to the conclusion a browser must require over 4GB of address space"

Well, I only included Firefox because, on my computer right now, it has mapped over 3GB and has 2.1GB resident. That still fits within the 3GB/1GB split of 32-bit Linux. I have seen it go as high as 6GB, but its usual steady state for me is around 4GB. I never close any tabs. You're right, though, that a web browser shouldn't require that much RAM under normal circumstances. FWIW, I am quite familiar with mmap, shared memory, sbrk, huge pages (including using mmap to map files on a hugetlbfs to get larger pages and improve my TLB performance), etc. I didn't include Firefox out of ignorance.

So, I reiterate: I think we're in violent agreement that x32 looks interesting and relevant, and the vast majority of applications don't need 64-bit, and many would benefit from smaller pointers.

A few applications I've written (large heuristic solvers, for example) benefit from 10GB to 15GB of RAM. And from what I hear, some EDA apps in use at work could use 200+GB. And, of course, there are the ever-present large databases, as you mentioned. But that's pretty specialized compared to everything else.

Comment Re:First let's understand this x32 correctly. (Score 1) 262

I have a warm place in my heart for x32 for the reasons you mention: 95% - 98% of code is perfectly happy with 32-bit pointers. (I would posit 99%+ for many folks, actually. The kernel and a few big apps benefit from 64-bit, and the rest fit well in 4GB. I put "web browser" in the "big app" list, but it is only one app even if it's one of the biggest cycle-eaters.)

Now that the ABI has matured and presumably they've had some time to work the kinks out, I'd love to see them post some up-to-date benchmarks. The previous benchmarks were actually somewhat disappointing; I expected a more noticeable speedup.

Comment Re:Probably optimizing for larger numbers (Score 1) 225

Well, I just compared the vectorized implementation to the simple tail-call-optimized version (i.e., reduced to a simple loop) that GCC 4.4 produced.

The vectorized version was a paltry 6% faster on my box, measured across 65,536 iterations. (My machine is an AMD Phenom X4, 3.4GHz.)

So, it is faster, but not by a meaningful amount, and probably not enough faster to justify the hilarious code size increase.

Comment Re:News for nerds or not (Score 1) 225

So did you read the two paragraphs that followed? And in case you think I'm imagining these compilers, flip to PDF page 129 (page 119 in the text) and see:

High-level optimizations (HLO) exploit the properties of source code constructs, such as loops and arrays, in the applications developed in high-level programming languages, such as C++. They include loop interchange, loop fusion, loop unrolling, loop distribution, unroll-and-jam, blocking, data prefetch, scalar replacement, data layout optimizations, and others. The option that turns on the high-level optimizations is -O3.

The optimizations I highlighted will all affect the data access pattern, especially loop interchange, blocking, and data layout optimization. Next, check the date on the document: 2003. That's a decade ago. It's reasonable to expect they've only gotten more aggressive since then.
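To illustrate loop interchange, here's a tiny hand-written C sketch (my own example, not from the document): the two nests compute the same thing, but the memory access pattern is completely different.

    #define N 1024

    /* Inner loop walks a column of a row-major array: every access
       strides by N doubles.  Poor locality. */
    void zero_cols(double a[N][N])
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = 0.0;
    }

    /* Interchanged: the inner loop walks a row, giving unit-stride,
       cache-friendly access. */
    void zero_rows(double a[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 0.0;
    }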

This much shorter survey of modern compilers also goes into optimizations that might change access order. They even call this reordering out: "Since loop interchanging may change the access pattern, care must be taken to not introduce non-unit stride access."

Welcome to modern compilers.

Comment Re:News for nerds or not (Score 1) 225

Ugh... should have previewed. That third paragraph should read:

Consider a simple memory copy: for (i = 0; i < len; i++) dst[i] = src[i]; As long as dst and src do not overlap, you can vectorize this code. But, that implies reading N elements of src before writing N elements of dst. I don't know how you define reordered memory access, but that looks like reordered memory accesses to me.
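If it helps, here's a hand-written C picture of what a 4-wide vectorized copy does (a real compiler would emit vector loads and stores, but the access order is the point):

    #include <stddef.h>

    void copy4(float *dst, const float *src, size_t len)
    {
        size_t i;
        for (i = 0; i + 4 <= len; i += 4) {
            /* read four elements of src... */
            float t0 = src[i], t1 = src[i+1], t2 = src[i+2], t3 = src[i+3];
            /* ...then write four elements of dst. */
            dst[i] = t0; dst[i+1] = t1; dst[i+2] = t2; dst[i+3] = t3;
        }
        for (; i < len; i++)   /* scalar tail */
            dst[i] = src[i];
    }

Compare that to the scalar loop's strict read/write/read/write order: the loads and stores have been reordered.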

Comment Re:News for nerds or not (Score 1) 225

I don't know what world you live in. Nearly every modern C compiler reorders memory accesses substantially when optimization is enabled. Reordering memory accesses is part of what the restrict keyword is about, after all, and the volatile keyword exists precisely to prevent the compiler from reordering or eliminating memory accesses.
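A quick sketch of what I mean by those two keywords (my own example, not from any particular codebase):

    #include <stddef.h>

    /* restrict promises the compiler that dst and src don't alias, so
       it's free to reorder and vectorize the accesses. */
    void copy(float *restrict dst, const float *restrict src, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            dst[i] = src[i];
    }

    /* volatile forbids exactly that: every access must happen, in
       order, as written -- think memory-mapped device registers. */
    unsigned poll(volatile unsigned *reg)
    {
        return *reg;   /* cannot be cached, eliminated, or reordered */
    }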

Vectorizing memory accesses nearly always implies reordering them.

Consider a simple memory copy: for (i = 0; i < len; i++) dst[i] = src[i]; As long as dst and src do not overlap, you can vectorize this code. But, that implies reading N elements of src before writing N elements of dst. I don't know how you define reordered memory access, but that looks like reordered memory accesses to me.

OK, I hear you shouting, "That's not what I meant! I meant it still accesses array a in the same order before and after, and likewise for b!" Well, that's just a trivial example. I've worked with C compilers that interchange loop levels, unroll outer loops and jam the inner loops together, and so on. Such compilers are more the rule than the exception these days.

So, I call BS on this bald, false statement of yours: "no optimizer of any C compiler changes the memory access pattern of your code." That has hardly ever been true. Even just having a register allocator changes the memory access pattern of your code, not to mention instruction scheduling and the rest. And it's especially untrue these days. Google "loop interchange", "loop fusion", "loop tiling", "loop distribution", etc. You might be surprised what compilers do to your code.
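Those terms google well, but a picture is faster. Here's a hand-written sketch of unroll-and-jam (my own example):

    #define M 512
    #define N 512

    /* Original nest: c[i] += a[i][j] * b[j]. */
    void madd(double c[M], double a[M][N], const double b[N])
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                c[i] += a[i][j] * b[j];
    }

    /* Unroll-and-jam: unroll the outer loop by two and fuse the inner
       loops.  Each b[j] is now loaded once per two uses, and the memory
       access order is visibly different (M assumed even for brevity). */
    void madd_uaj(double c[M], double a[M][N], const double b[N])
    {
        for (int i = 0; i < M; i += 2)
            for (int j = 0; j < N; j++) {
                c[i]   += a[i][j]   * b[j];
                c[i+1] += a[i+1][j] * b[j];
            }
    }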

Comment Re:Vectorized factorials! (Score 1) 225

FWIW, I ran the same test on an older machine with G++ 3.4 (a 1.2GHz Duron, if you're curious). That G++ doesn't optimize the switch-case into a lookup table, and on that system the switch-case code ran at one-third the speed of an explicit lookup table. The switch-case turns into a jump table, so you end up with an indirect branch, a move, and an unconditional branch on any given path through the switch-case. The unconditional branch is easy for hardware to handle. The indirect branch, not so much, especially since I sent a pseudorandom sequence of inputs to the function.

So, this goes back to what someone said in a comment elsewhere on this article: Trust, but verify. Trust your compiler to do a good job, but verify that it actually is.

Comment Re:News for nerds or not (Score 1) 225

How is Floating Point Operations Per Second not a real unit? It measures the amount of floating-point computation performed per unit of time. I can compute the theoretical peak FLOPS for a given architecture by multiplying the number of computational units by the clock rate. I can compute the achieved FLOPS for an algorithm by dividing the total work performed by the time taken. The ratio of achieved rate to theoretical peak even gives me a measure of efficiency, so I know what's performing well and what isn't.
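For instance (numbers invented purely to show the arithmetic):

    #include <stdio.h>

    int main(void)
    {
        double units    = 4;       /* FP ops retired per cycle (hypothetical) */
        double clock_hz = 3.4e9;   /* clock rate */
        double peak     = units * clock_hz;   /* theoretical peak FLOPS */

        double work    = 2.0e9;    /* FP ops the algorithm performed */
        double seconds = 0.8;      /* measured wall time */
        double achieved = work / seconds;     /* achieved FLOPS */

        printf("peak %.3g, achieved %.3g, efficiency %.1f%%\n",
               peak, achieved, 100.0 * achieved / peak);
        return 0;
    }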

And, I would argue that cache behavior is extremely important when considering vectorization, especially for loop nests. I might get much more impressive vectorization by executing a loop nest in a particular order, but if interchanging two loops gives better cache locality, the interchanged version may perform much better. Matrix multiply is the poster child for this.
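For the skeptical, here's the standard illustration (my own sketch, using C99 variable-length array parameters for brevity):

    /* Classic i-j-k order: the inner loop walks b by column, striding
       by n doubles per access in row-major storage. */
    void mm_ijk(int n, double a[n][n], double b[n][n], double c[n][n])
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    /* Interchange j and k: both b and c are now walked with unit stride
       in the inner loop, which is also far friendlier to vectorization. */
    void mm_ikj(int n, double a[n][n], double b[n][n], double c[n][n])
    {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][k] * b[k][j];
    }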

So if you look at the output of the compiler's optimizer and declare "compiler A is better than compiler B at vectorizing" based only on the instruction sequence, ignoring the actual memory access pattern and the effects of the cache, you might draw the wrong conclusion. The optimizer may have made other transformations that help the cache at the cost of throttling vectorization, but that result in better overall code. I recall seeing elsewhere that Intel's compiler is more aggressive about reordering loop nests to help the cache, for example. When that happens, it might look like the Intel compiler didn't vectorize as aggressively as another compiler, when that's not really what's at play.
