Become a fan of Slashdot on Facebook

 



Forgot your password?
typodupeerror

Comment Re:Probably optimizing for larger numbers (Score 1) 225

Well, I just compared the vectorized implementation to the simple tail-call optimized version (ie. reduced to a simple loop) that GCC 4.4 produced.

The vectorized version was a paltry 6% faster on my box, measured across 65,536 iterations. (My machine is a AMD Phenom X4, 3.4GHz.)

So, it is faster, but not by a meaningful amount, and probably not enough faster to justify the hilarious code size increase.

Comment Re:News for nerds or not (Score 1) 225

So did you read the two paragraphs that followed? And in case you think I'm imagining these compilers, flip to PDF page 129 (page 119 in the text) and see:

High-level optimizations (HLO) exploit the properties of source code constructs, such as loops and arrays, in the applications developed in high-level programming languages, such as C++. They include loop interchange, loop fusion, loop unrolling, loop distribution, unroll-and-jam, blocking, data prefetch, scalar replacement, data layout optimizations, and others. The option that turns on the high-level optimizations is -O3.

The optimizations I highlighted all will affect the data access pattern, especially loop interchange, blocking and data layout optimization. Next, check the date on the document: 2003. That's a decade ago. It's reasonable to expect they've only gotten more aggressive since then.

This much shorter survey of modern compilers also goes into optimizations that might change access order. They even call this reordering out: "Since loop interchanging may change the access pattern, care must be taken to not introduce non-unit stride access."

Welcome to modern compilers.

Comment Re:News for nerds or not (Score 1) 225

Ugh... should have previewed. That third paragraph should read:

Consider a simple memory copy: for (i = 0; i < len; i++) dst[i] = src[i]; As long as dst and src do not overlap, you can vectorize this code. But, that implies reading N elements of src before writing N elements of dst. I don't know how you define reordered memory access, but that looks like reordered memory accesses to me.

Comment Re:News for nerds or not (Score 1) 225

I don't know what world you live in. Nearly every modern C compiler majorly reorders memory accesses when optimization is enabled. Reordering memory accesses is part of what the new restrict keyword is about, after all. The volatile keyword is there to prevent the compiler from reordering or eliminating memory access.

Vectorized memory accesses nearly always implies reordered memory access.

Consider a simple memory copy: for (i = 0; i As long as dst and src do not overlap, you can vectorize this code. But, that implies reading N elements of src before writing N elements of dst. I don't know how you define reordered memory access, but that looks like reordered memory accesses to me.

Ok, I hear you shouting "That's not what I meant! I meant it still accesses array a in the same order before and after, and likewise for b!" Well, that's just a trivial example. I've worked with C compilers that interchange loop levels, unroll outerloops and jam the inner loops together, etc. Such compilers are more the rule than the exception these days.

So, I call BS on this bald, false statement of yours: "no optimizer of any C compiler changes the memory access pattern of your code." That pretty much hasn't been true since nearly the beginning. Even just having a register allocator changes the memory access pattern of your code, not to mention instruction scheduling, etc. But it's especially true these days. Google "loop interchange", "loop fusion", "loop tiling", "loop distribution", etc. You might be surprised what compilers might do to your code.

Comment Re:Vectorized factorials! (Score 1) 225

FWIW, I ran the same test on an older machine with G++ 3.4. (1.2GHz Duron, if you're curious.) G++ doesn't optimize the switch-case into a lookup table. On that system, the switch-case code ran 1/3rd the speed of an explicit lookup table. The switch-case turns into a jump table, so you end up with an indirect branch, move, and unconditional branch on any given path through the switch-case. The unconditional branch is easy for hardware to handle. The indirect branch, not so much, especially since I sent a pseudorandom sequence of inputs to the function.

So, this goes back to what someone said in a comment elsewhere on this article: Trust, but verify. Trust your compiler to do a good job, but verify that it actually is.

Comment Re:News for nerds or not (Score 1) 225

How is Floating Point Operations Per Second not a real unit? It measures amount of floating point computation performed in a unit of time. I can compute the theoretical peak FLOPS for a given architecture by multiplying the number of computational units by the clock rate. I can compute the achieved FLOPS for an algorithm by dividing the total work performed by the time taken. The ratio of achieved rate to theoretical peak even gives me a measure of efficiency, so I know what's performing well and what isn't.

And, I would argue that cache is extremely important when considering vectorization, especially when considering loop nests. I might get much more impressive vectorization if I execute a loop nest in a particular order. But, if I get better cache locality by interchanging two loops, I may see much better performance in the second case. Matrix multiply is a poster child for this.

So if you're looking at the output of the compiler's optimizer and saying "compiler A is better than compiler B at vectorizing" looking only at the instruction sequence, and ignoring the actual memory access pattern and the effects of the cache, you might draw the wrong conclusion. The optimizer may have made other transformations to help the cache that have the effect of throttling vectorization, but result in better overall code. I recall seeing elsewhere that Intel's compiler is more aggressive about reordering loop nests to help out the cache, for example. When that happens, it might look like the Intel compiler didn't vectorize nearly as aggressively as another compiler, when that's not really what's at play.

Comment Re:Vectorized factorials! (Score 1) 225

I never meant to suggest that it is optimal. But it certainly is "optimized!" Vectorizing this function is simply ridiculous.

That said, I just ran a benchmark, comparing it to the more straightforward code output by G++ 4.4. The vectorized version produced by 4.8 is slightly faster, by about 12%. The recursive approach is still quite a bit slower than a lookup table or switch-case. Interestingly, the lookup table and switch-case versions got slightly slower in 4.8 compared to 4.4.

Comment Re:Vectorized factorials! (Score 1) 225

Interestingly, GCC 4.8 actually replaces that switch/case with a lookup table. On older GCCs and with compilers for other platforms, the switch-case is an order of magnitude slower or worse, as it actually resulted in branches. And `switch-case` branches are sometimes very difficult to predict, depending on the hardware branch predictor and the code around it.

It appears GCC has an "unswitch" optimization that handles a switch-case used in this way.

Comment Re:Very different code (Score 1) 225

I think you meant if ( ( a = b ) ), which highlights a different reason this construct is problematic: If you make that error outside the context of a control construct, you'll get a warning about a meaningless computation.

Your proposed fix isn't really a fix, though. It shuts up GCC, but it doesn't shut up RVCT, for example.

Comment Re:Trust but verify (Score 1) 225

I'm in the same camp.

It's also worth noting that C and C++ make it really easy to trip up the optimizer and disqualify code from certain optimizations. Little things like const, restrict, alignment and trip-count hints can go a long way, though. Reviewing the generated code can highlight places where these hints would be useful.

Comment Re:Very different code (Score 1) 225

I've used naked { } to scope things before. It's actually quite handy. In C, it serves two purposes: It makes sure that this variable's value doesn't intentionally spill into code beyond it, and it gives you a new scope to declare temporaries without having to worry about clobbering some other value you didn't think of. (But, if you're doing things right, then that latter concern should be a lesser concern.) In C++, it also gives you a clear boundary where objects go out of scope that isn't just "by the end of the function."

Slashdot Top Deals

In order to get a loan you must first prove you don't need it.

Working...