...Of course, since the biggest bottlenecks in code usually occur in essentially serial sections of code, you can't just reorder them and hope for the best.
Well, that's highly dependent on your code. If you're writing something like matrix manipulation or most image processing routines, none of your code is particularly serial. You can even get away with things like WAW hazards because of register renaming.
If you've already done all of the high-level optimization that you can, maybe it's time to start looking at a profiler like VTune or CodeAnalyst and figuring out where your branches are being mispredicted and where you're seeing stalls. But the reality is, those optimizations get you the last 20%, which doesn't mean shit if your algorithm or your memory-access pattern is inefficient.
TFA using examples such as shift-vs-multiply sounds like your grandfather complaining that you don't double-clutch when downshifting into first gear. "True" in the sense that yes, it once had meaning and no longer does, but totally wrong in the sense that people who think about their code at that level have moved beyond such trivialities and on to actual modern ones: how to feed N pipelines so as to minimize stalls, what degenerate conditions flog the latest branch-prediction techniques, or (more usefully, as a classic example) how to write your code so as to minimize branching.
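To make the "minimize branching" point concrete, here's a minimal sketch: a branchless absolute value next to the obvious branching version. On input the predictor can't learn, the bit-twiddling form avoids mispredictions entirely. (Illustrative only; a good compiler or JIT may perform this transformation itself for something like Math.abs.)

```java
public class Branchless {
    // Branching version: the predictor has to guess the sign each time.
    static int absBranchy(int x) {
        return (x < 0) ? -x : x;
    }

    // Branchless version: x >> 31 is all-ones for negative x, zero
    // otherwise, so (x ^ mask) - mask negates x only when it's negative.
    static int absBranchless(int x) {
        int mask = x >> 31;
        return (x ^ mask) - mask;
    }

    public static void main(String[] args) {
        int[] samples = {5, -5, 0, Integer.MIN_VALUE + 1, 123456};
        for (int x : samples) {
            assert absBranchy(x) == absBranchless(x);
        }
        System.out.println("ok");
    }
}
```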
I also hate the 'instruction weenie' optimizations. It doesn't matter (for example) if an integer multiply instruction has two-cycle latency and a shift instruction has one-cycle latency. Something like 1 in 5 instructions is a memory access, and another 1 in 5 is a branch. Both of those are potentially far more problematic than an extra cycle that's probably going to be scheduled around anyway.
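For what it's worth, the shift-vs-multiply example from TFA isn't even a trade-off anymore: for power-of-two factors the two forms are semantically identical, and any modern compiler or JIT strength-reduces the multiply on its own. Write whichever says what you mean.

```java
public class ShiftVsMultiply {
    // Semantically identical for any int x; the JIT emits comparable
    // code for both (a shift or an LEA-style address computation).
    static int timesEightMul(int x)   { return x * 8; }
    static int timesEightShift(int x) { return x << 3; }

    public static void main(String[] args) {
        for (int x : new int[] {0, 1, -7, 1000, Integer.MAX_VALUE / 8}) {
            assert timesEightMul(x) == timesEightShift(x);
        }
        System.out.println("ok");
    }
}
```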
Mostly this article sounds like exactly the reasons I don't like Java for every task, and why the vast majority of Java apps feel like molasses in January despite every benchmark telling you that in theory they run just as fast as unmanaged code: although you can do the above, you have to work against the language rather than with it.
Pretty much no benchmark shows Java (or .NET) running as fast as unmanaged code with a decent compiler; even the best JIT runtimes usually come out in the 50-70% range.
Also, Java doesn't feel slow because of execution performance. It feels slow because it has crappy UI libraries that are slow. There are many, many GTK+/Python apps that are perfectly fine despite the fact that Python is abysmally slow compared to even Java.
"When merely assigning a value to a basic machine-supported data type (32-bit integer, as the simple example) involves an implicit function call (and the whole stack-frame preservation that entails)..."
I'm not sure where you're getting this, but assigning to an int (not an Integer) in Java does not involve a function call in any mainstream JRE I'm familiar with; indeed, it performs very similarly to assignment in C.
The big fault of Java (and also .NET) is that the JIT doesn't have very much time to optimize. Compared with unoptimized output from a C/C++ compiler, Java is considerably faster. It's only once you add the substantial benefits of optimization (loop unrolling, constant propagation, function inlining, instruction scheduling, and a whole host of others) that the JIT starts to look pretty crappy.
Simple cases, like assignment to an int, are well-optimized by the JIT.
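The distinction the thread is circling: assigning to a primitive int is a plain store, while assigning an int value to an Integer autoboxes it, which really is a method call (Integer.valueOf) and, outside the small-value cache, a heap allocation. The "assignment is a function call" complaint applies to the boxed type, not to int. A small demo:

```java
public class BoxingDemo {
    public static void main(String[] args) {
        int a = 42;          // plain primitive store, no call involved
        Integer b = 42;      // compiles to Integer.valueOf(42): a real call

        // valueOf is required to cache -128..127, so small boxed
        // values come back as the same shared object...
        Integer c = 42;
        assert b == c;       // same cached object (reference comparison)

        // ...but larger values allocate fresh objects each time
        // (under default JVM settings; the cache size is tunable).
        Integer d = 1000, e = 1000;
        assert d != e;       // distinct objects
        assert d.equals(e);  // but equal values

        System.out.println("ok");
    }
}
```

So the grandparent's complaint is real for Integer-heavy code, but assignment to int is exactly the cheap operation the parent describes.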