In the discussion about IBM putatively buying Sun, we were having a side-discussion about prefetches and branch prediction.
I had forgotten why my branch prediction performance experiments had failed ("confirmed the null hypothesis") and had to go back to my notes.
It turns out that mature production software tends to be full of small blocks of error-handling and debug/logging code, which is seldom executed. A Smarter Colleague[TM] and I set out to test the newly-available branch prediction logic, expecting to see a significant speedup. I manually set the branch-prediction bits in a large production application, only to find no detectable improvement.
The test application was Samba, so we changed the driver script to only read a few files from a ram disk, to eliminate disk I/O overheads. Still no detectable advantage from predicting the branches correctly!
Then we tried just a few functions, under a test framework that did no I/O at all. Still nothing.
Eventually we tracked it down to the debug/log/else logic: the branches around it were always taken, but the branch-arounds were long enough that the next instructions were in a different icache line, and that cache line had to be fetched.
It turned out that we had reproduced in code what our HPC colleagues see in data: the cache doesn't help if you're constantly leaping to a different cache line!