Ah, I remember looking at the sorting algorithm in a low level graphics library. Tightly hand coded, packed to keep every pipeline stage filled and make optimal use of all the parallel units in a VLIW graphics processor, it was probably the most efficient bubble sort I'd ever seen.
I coded up a quicksort to about the same degree of tightness in a couple days, and, golly gee, a whole bunch of code suddenly got faster. Some MUCH faster... That was Lesson 1 for me. Optimize algorithms before optimizing implementations.