Comment: Re:Clueless (Score 1) 125
Counting operations is not enough. Memory access time is nonuniform because of cache effects, architecture (NUMA, distributed memory), code layout (e.g., is your loop body one instruction larger than L1 i-cache?), etc. Machine instructions have different timings. CISC instructions may be slower than their serial RISC counterparts. Or they may not be. SMT may make sequential code faster than parallel code by resolving dependencies faster. Branch predictors and speculation can precompute parts of your algorithm with idle function units. Better algorithms can do more work with fewer "flops". And on and on and on...
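To make the first point concrete, here is a minimal sketch (not a rigorous benchmark) of two traversals that perform exactly the same number of additions but touch memory in different orders. The names `sum_in_order` and `time_traversals`, the array size, and the stride are all made up for illustration. The stride is assumed to be coprime with the array length so every element is visited exactly once.

```python
import time
from array import array


def sum_in_order(data, order):
    """Add up data[i] for each i in order: one addition per element."""
    total = 0.0
    for i in order:
        total += data[i]
    return total


def time_traversals(n=1 << 20, stride=4099):
    """Time a sequential and a scattered walk over the same n doubles.

    Assumes stride is coprime with n (true for any odd stride when n is
    a power of two), so the scattered order is a permutation of 0..n-1.
    """
    data = array("d", [1.0]) * n
    orders = {
        "sequential": list(range(n)),
        "scattered": [(i * stride) % n for i in range(n)],
    }
    results = {}
    for name, order in orders.items():
        start = time.perf_counter()
        total = sum_in_order(data, order)
        results[name] = (total, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    for name, (total, seconds) in time_traversals().items():
        print(f"{name}: sum={total:.0f} time={seconds:.4f}s")
```

Both walks return the same sum and have the same operation count, so any timing gap comes from the access pattern. In Python the interpreter overhead hides much of it; the same experiment in a compiled language would likely show a larger gap, but that is something to measure on your own machine, not assume.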
The best way to try to write fast code is to write it and run it (on representative inputs). Then write another version and run it. Treat it as an experiment, and do a hypothesis test to see whether one version has a statistically significant speedup over the other. That's the only reliable way to write fast code on modern machines. The idea that you can hand-write fast code on modern architectures by reasoning about it up front, without measuring, is largely a myth.
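One way to run that experiment, sketched with only the standard library: time each version repeatedly, then use a permutation test on the difference of mean run times. The permutation test is my choice for illustration (any reasonable hypothesis test would do), and `time_runs`, `permutation_p_value`, `version_a`, and `version_b` are hypothetical names, not anything from a real benchmarking tool.

```python
import random
import statistics
import time


def time_runs(func, arg, runs=30):
    """Return a list of wall-clock timings for func(arg)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func(arg)
        samples.append(time.perf_counter() - start)
    return samples


def permutation_p_value(a, b, trials=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Null hypothesis: both sample sets come from the same distribution,
    so the group labels are interchangeable.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(left) - statistics.mean(right)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (trials + 1)


def version_a(values):
    """Hypothetical candidate 1: explicit loop."""
    total = 0
    for v in values:
        total += v
    return total


def version_b(values):
    """Hypothetical candidate 2: built-in sum."""
    return sum(values)


if __name__ == "__main__":
    inputs = list(range(100_000))  # stand-in for representative inputs
    times_a = time_runs(version_a, inputs)
    times_b = time_runs(version_b, inputs)
    print("mean a:", statistics.mean(times_a))
    print("mean b:", statistics.mean(times_b))
    print("p-value:", permutation_p_value(times_a, times_b))
```

A small p-value only says the two timing distributions differ on this machine with these inputs. Interleaving the runs of the two versions, rather than running all of one and then all of the other, would reduce bias from warm-up and background load.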