I remember doing image processing on a 4MHz 8088, in 1986, in assembly
And nowadays, would you write a separate assembly version for x86_64 processors with SSE support, then one that exploits SSE2, one for SSE3, one for SSE4, and one for AVX, and then a variant of each of those targeting 1, 2, 4, 6, 8, or 12 cores, to squeeze "every bit of performance out of a system"? Then look at all the custom Apple chips and all the custom ARM chips and write individually optimized versions for those as well? Of course you wouldn't.
Sure, you could do that, but is the performance gain (if there is one at all) worth it over writing it in C and targeting these platforms with different compilers and compiler flags? I'm curious what you're actually suggesting should be done here.
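To make the "write it in C and vary the flags" option concrete, here is a minimal sketch of what I mean. The `brighten` function is a made-up example, not anything from the parent comment, and the build lines in the comment assume GCC or Clang. Whether a given compiler actually vectorizes this loop depends on the compiler and version, so treat it as an illustration rather than a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical example: add a brightness offset to every 8-bit pixel,
 * clamping the result to the range 0..255.
 *
 * The same source can be built for different instruction sets just by
 * changing flags (GCC/Clang assumed):
 *
 *   cc -O3 -msse2        -c brighten.c    (baseline x86_64 SIMD)
 *   cc -O3 -mavx2        -c brighten.c    (wider vectors where available)
 *   cc -O3 -march=native -c brighten.c    (whatever the build machine has)
 *
 * `restrict` promises the compiler that dst and src do not overlap,
 * which makes the loop easier for it to vectorize.
 */
void brighten(uint8_t *restrict dst, const uint8_t *restrict src,
              size_t n, int delta)
{
    for (size_t i = 0; i < n; i++) {
        int v = src[i] + delta;   /* widen to int so the sum cannot wrap */
        if (v < 0)   v = 0;       /* clamp low */
        if (v > 255) v = 255;     /* clamp high */
        dst[i] = (uint8_t)v;
    }
}
```

One source file, one thing to maintain, and the per-target work is pushed onto the compiler. If profiling then shows a specific hot loop where the compiler's output is measurably worse than hand-written code, that is the point to consider intrinsics or assembly for that one loop.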
Hand optimization has its place when you know what you're targeting and there is a measurable performance advantage that doesn't come with considerable maintenance debt. Tuning shader algorithms for various GPUs and their specific extensions is certainly still done, for example. But most software these days needs to target many architectures, configurations, and software platforms, and the benefits of hand optimization simply don't justify the cost.