Bullshit. Thinking that caches are the only thing that matters just indicates that you are clueless when it comes to optimization (and most likely much more too). And even when it comes to caching, assembly language programmers tend to generate tighter code, which leads to fewer instruction cache misses.
Most instruction scheduling is done by the processor; it's a "new" thing called OoO, or out-of-order execution. The rest is trivial.
Register coloring is a funny way to describe register allocation; assembly language programmers normally don't run a coloring algorithm because it just isn't useful to them. Register allocation is needed, yes, but a linear scan combined with (learned) heuristics is probably closer to what real programmers do, and the performance difference is in the noise anyway. BTW, standard register coloring isn't optimal for x86 in the first place, with its small and irregular register set.
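To give an idea of what I mean by linear scan, here is a rough sketch in C (the live intervals, register count and names are made-up example data, and a real allocator would spill the interval that ends furthest away instead of just the one it is currently looking at):

    #include <stdio.h>

    /* Rough sketch of linear scan register allocation over made-up
       live intervals.  A real allocator would spill the interval that
       ends furthest away, not simply the current one. */

    #define NUM_REGS 3

    struct interval {
        const char *name;
        int start, end;    /* first and last use, in instruction order */
        int reg;           /* assigned register, or -1 = spilled       */
    };

    int main(void)
    {
        /* Live intervals, already sorted by increasing start point. */
        struct interval ivals[] = {
            {"a", 0, 9, -1}, {"b", 1, 4, -1}, {"c", 2, 6, -1},
            {"d", 3, 5, -1}, {"e", 7, 8, -1},
        };
        int n = sizeof ivals / sizeof ivals[0];
        int reg_free_at[NUM_REGS] = {0};  /* when each register becomes free */

        for (int i = 0; i < n; i++) {
            /* Take any register whose previous interval has already ended. */
            for (int r = 0; r < NUM_REGS; r++) {
                if (reg_free_at[r] <= ivals[i].start) {
                    ivals[i].reg = r;
                    reg_free_at[r] = ivals[i].end;
                    break;
                }
            }
            if (ivals[i].reg >= 0)
                printf("%s -> r%d\n", ivals[i].name, ivals[i].reg);
            else
                printf("%s -> spill\n", ivals[i].name);
        }
        return 0;
    }

The point is that a single greedy pass over the live ranges gets you most of the way there; no interference graph needed.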
There's another reason for this too... today's CPUs are designed to recognize some standard compiler instruction chains and shortcut them -- so if you hand-code those instructions, the CPU will have to take your instructions literally, whereas if you use the manufacturer's compiler (or a common compiler such as those provided by GNU or MS), the CPU will often recognize the expensive routines and optimize them for you in the pipeline.
That said, if the assembly programmer actually knows the CPU they're targeting, they can take advantage of these pipeline shortcuts as well. But it won't be portable unless they duplicate a lot of the logic that goes into compilers in the first place -- at which point you're adding an extra layer that's going to take more time/space.
I think you are mistaken here. Yes, some Intel processors optimize certain instruction patterns, but those are the same patterns assembly language programmers use too. One example is fusing a comparison instruction with the conditional branch that follows it. Any assembly programmer not using that pattern isn't optimizing for performance, whether out of ignorance or by intention (= size optimization). And these patterns have been in use since the Pentium Pro was released, so it isn't a recent change.
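To make that concrete, here is roughly what the pattern looks like when written by hand (GCC inline assembly, AT&T syntax; the function itself is just an illustration). The CMP immediately followed by the conditional branch at the bottom of the loop is exactly the pair the decoder can fuse:

    #include <stdio.h>

    /* Hand-written counted loop whose back edge is the classic
       CMP + Jcc pair.  The function is made up for illustration. */
    static unsigned long sum_below(unsigned long n)
    {
        unsigned long sum, i;
        __asm__ (
            "xor  %[s], %[s]\n\t"
            "xor  %[i], %[i]\n"
            "1:\n\t"
            "add  %[i], %[s]\n\t"
            "inc  %[i]\n\t"
            "cmp  %[n], %[i]\n\t"    /* the compare ...                   */
            "jb   1b"                /* ... and the branch: the fusable pair */
            : [s] "=&r" (sum), [i] "=&r" (i)
            : [n] "r" (n)
            : "cc");
        return sum;                  /* 0 + 1 + ... + (n-1) */
    }

    int main(void)
    {
        printf("%lu\n", sum_below(10));  /* prints 45 */
        return 0;
    }

Any compiler worth its salt emits the same pair at the bottom of a loop, which is the point: the CPU's shortcut doesn't care who wrote the instructions.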
Somewhat more esoteric is the detection and special handling of CALL x; x: POP EAX type patterns. Here one calls the next instruction (labeled x here), causing the processor to push the return address onto the stack, which is then popped into the EAX register by the POP instruction. Intel processors detect this pattern and avoid treating it as a normal call/branch (in particular it doesn't disturb the return-address predictor), so it executes faster.
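In case the description is hard to follow, the idiom looks like this as GCC inline assembly (32-bit x86, where it's the standard way to read EIP for position-independent code; the wrapper function is made up):

    /* The get-EIP idiom from the parent comment, 32-bit x86 only. */
    static void *current_eip(void)
    {
        void *eip;
        __asm__ volatile (
            "call 1f\n"    /* pushes the address of label 1 ...  */
            "1:\n\t"
            "pop  %0"      /* ... which we pop straight into eip */
            : "=r" (eip));
        return eip;
    }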
Other than those two examples I can't recall anything that isn't exposed to assembly language programmers - in fact those kinds of rewrites _are_ exposed to any programmer who bothers to read the manuals.
[My political viewpoint is significantly closer to Fascism than Marxism BTW]
This is measurably false. I measured a modern high-end notebook recently and it had ~70 ohm output impedance. It also doesn't have enough output power to drive e.g. 600 ohm headphones (but I guess those are in your "ridiculously high impedance" category). An output impedance that high also has a clearly measurable impact on the frequency response, and one that you can easily hear.
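To put a rough number on it: the output stage and the headphone form a voltage divider, so any swing in the headphone's impedance over frequency turns directly into a level change. A quick sketch of that arithmetic (the 70 ohm figure is the measurement above; the headphone impedance swing is an assumed, typical value for a low-impedance dynamic headphone):

    #include <math.h>
    #include <stdio.h>

    /* Voltage divider formed by the source's output impedance and the
       headphone load.  70 ohm is the measured output impedance; the
       headphone impedance swing (32 ohm nominal, 60 ohm at the bass
       resonance) is an assumed, illustrative value. */
    static double level_db(double z_source, double z_load)
    {
        return 20.0 * log10(z_load / (z_load + z_source));
    }

    int main(void)
    {
        double z_src  = 70.0;
        double z_min  = 32.0;
        double z_peak = 60.0;

        printf("deviation across the impedance swing: %.1f dB\n",
               level_db(z_src, z_peak) - level_db(z_src, z_min));  /* ~3.4 dB */
        return 0;
    }

With those numbers the response varies by about 3.4 dB between the impedance minimum and the bass resonance peak, which is well past "clearly measurable".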
Also, even though the chips in many cases have better specifications than any human sound receiver (= ear), in the majority of cases they are built into circuits that can't deliver the specified performance. One example of faulty design is routing line-level audio traces in parallel with e.g. USB data lines, which can induce ~1 kHz pulses (the USB frame rate) in the audio output that are detectable when driving sensitive headphones. Another is not filtering the power lines to the codec and relying only on the supply ripple rejection of the codec/amplifier, which isn't enough in some cases.
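For a rough sense of scale on that last point, a sketch of the arithmetic (all figures here are assumed, illustrative values, and real PSRR also drops off sharply at the higher frequencies switching regulators put on the rail):

    #include <math.h>
    #include <stdio.h>

    /* Back-of-the-envelope: how much supply ripple survives the codec's
       PSRR if nothing else filters the rail.  All numbers are assumed,
       illustrative values, not measurements. */
    int main(void)
    {
        double ripple_v  = 0.050;  /* 50 mV of regulator ripple on the rail */
        double psrr_db   = 55.0;   /* codec PSRR at the ripple frequency    */
        double fullscale = 1.0;    /* ~1 V RMS full-scale output            */

        double residual = ripple_v * pow(10.0, -psrr_db / 20.0);
        printf("residual ripple: %.1f uV (%.0f dB below full scale)\n",
               residual * 1e6,
               -20.0 * log10(residual / fullscale));
        return 0;
    }

With sensitive in-ears that kind of residual can already be audible as a faint background tone, which is why a properly designed board filters the codec supply instead of trusting PSRR alone.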