Generic programming in many languages relies on virtual dispatch because everything is an Object. C++ templates, in contrast, generate a class or function tailored to the requested type. So it isn't all bad.
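To make the contrast concrete, here is a minimal sketch. `Shape`, `Square`, `total_area_dynamic` and `total_area_static` are made-up names for illustration only. The first function works on any `Shape` through a virtual call; the second is a template that the compiler instantiates separately for each element type.

```cpp
#include <memory>
#include <vector>

// Hypothetical interface: every call to area() goes through the vtable.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

// Hypothetical concrete type used by both versions below.
struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

// "Everything is an Object" style: one compiled function, dynamic dispatch.
double total_area_dynamic(const std::vector<std::unique_ptr<Shape>>& shapes) {
    double sum = 0.0;
    for (const auto& shape : shapes) {
        sum += shape->area();  // resolved at run time
    }
    return sum;
}

// Template style: a separate function is generated for each T, so the
// compiler knows the exact type it is calling area() on.
template <typename T>
double total_area_static(const std::vector<T>& items) {
    double sum = 0.0;
    for (const auto& item : items) {
        sum += item.area();  // static type is T; may be devirtualized and inlined
    }
    return sum;
}
```

Whether the template version actually ends up faster depends on the compiler and the surrounding code, so treat this as an illustration of the mechanism rather than a benchmark.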
With C++ templates, the compiler often produces better code for the generic version than for the non-generic one (compare C++ std::sort with C qsort). The generic version is usually faster because the compiler knows the element type being sorted and doesn't have to go through a type-erased comparison function (C's qsort takes an int(const void*, const void*) function pointer). In C++, as long as your type has operator<, it just works, or you can pass a function like bool is_less(Type a, Type b).
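A rough side-by-side of the two calls, as a sketch. `Employee`, `compare_ints` and the `sort_*` wrappers are made-up names for illustration; only `std::qsort` and `std::sort` come from the standard library.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort comparator: the elements arrive type-erased as const void*.
int compare_ints(const void* a, const void* b) {
    const int lhs = *static_cast<const int*>(a);
    const int rhs = *static_cast<const int*>(b);
    return (lhs > rhs) - (lhs < rhs);  // avoids the overflow of lhs - rhs
}

// Hypothetical user type without operator<.
struct Employee {
    int id;
    double salary;
};

// Comparison function in the style mentioned above.
bool is_less(const Employee& a, const Employee& b) {
    return a.salary < b.salary;
}

// C style: element size and comparator are only known at run time.
std::vector<int> sort_with_qsort(std::vector<int> v) {
    std::qsort(v.data(), v.size(), sizeof(int), compare_ints);
    return v;
}

// C++ style: the element type is known at compile time, operator< is used.
std::vector<int> sort_with_std_sort(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v;
}

// C++ style with a user-supplied comparison.
std::vector<Employee> sort_by_salary(std::vector<Employee> v) {
    std::sort(v.begin(), v.end(), is_less);
    return v;
}
```

One caveat: passing `is_less` by name hands `std::sort` a function pointer. A lambda or function object has its own type, which generally makes it easier for the compiler to inline the comparison.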
The biggest bottleneck, it seems, is keeping today's CPUs fed with data. The STL's algorithms are a great aid here. Thinking in terms of sorts, rotations, partitions, transforms, etc. keeps the code small and clear. It also makes it easier to reason about the algorithm and improve on it. Most of them can be parallelized without changing the observable behaviour (mostly), so dropping in a parallel version can be very easy. Following this chain of thought, one can account for things like false sharing inside the algorithm rather than in an ungodly for loop with 5 or 6 branches.
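As a small illustration of "thinking in algorithms", here is a sketch with made-up helper names (`scale_all`, `negatives_first`, `rotate_left`), each a thin wrapper around one standard algorithm. It sticks to the sequential overloads so it builds without extra dependencies. Since C++17, many standard algorithms (`std::transform` and `std::sort` among them) also have overloads that take an execution policy such as `std::execution::par` from `<execution>` as the first argument.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Transform: scale every element in place.
// A C++17 parallel variant would pass std::execution::par as the first argument.
void scale_all(std::vector<double>& v, double factor) {
    std::transform(v.begin(), v.end(), v.begin(),
                   [factor](double x) { return x * factor; });
}

// Partition: move negative values to the front, keeping relative order.
// Returns how many negatives there were.
std::size_t negatives_first(std::vector<int>& v) {
    auto mid = std::stable_partition(v.begin(), v.end(),
                                     [](int x) { return x < 0; });
    return static_cast<std::size_t>(mid - v.begin());
}

// Rotation: shift elements left by k positions, wrapping around.
void rotate_left(std::vector<int>& v, std::size_t k) {
    if (v.empty()) {
        return;
    }
    const auto offset = static_cast<std::ptrdiff_t>(k % v.size());
    std::rotate(v.begin(), v.begin() + offset, v.end());
}
```

Each function body is a single named operation with no hand-written loop, which is what makes it straightforward to swap in a different implementation later.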