There's not 64k of assembly pumping bytes into a framebuffer and twiddling the PC speaker port to synthesize digital audio.
Of course. But all the creative work is squeezed into 64K.
One thing I couldn't find in there (and I've been out of the scene for a LONG time, so I don't know how this works on new-fangled fancy computers...) -- do these write directly to the video hardware? Or do they go through OS services like DirectX 11?
They use DirectX, because that is the only way to support a reasonable range of hardware. (Also, you can't hit the hardware without installing a new driver or exploiting a kernel bug. Neither of which is very friendly.)
But are people still getting down and counting clock cycles?
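Per-instruction cycle counts aren't even documented the way they used to be, and on an out-of-order CPU a fixed count wouldn't tell you much anyway. Now it's all about avoiding cache misses and cache invalidation.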
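To make the cache point concrete, here is a minimal sketch in C, not taken from any actual demo. The names `sum_row_major` and `sum_col_major` and the 1024x1024 size are made up for illustration. Both functions execute the same number of additions over the same buffer; the only difference is the order in which memory is touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dimensions, chosen only for illustration. */
enum { ROWS = 1024, COLS = 1024 };

/* Row-major walk: consecutive reads hit adjacent addresses, so each
 * cache line that gets fetched is fully used before moving on. */
uint64_t sum_row_major(const uint8_t *grid)
{
    assert(grid != NULL);
    uint64_t total = 0;
    for (size_t y = 0; y < ROWS; y++)
        for (size_t x = 0; x < COLS; x++)
            total += grid[y * COLS + x];
    return total;
}

/* Column-major walk over the same row-major buffer: each read lands
 * COLS bytes past the previous one, so nearly every access touches a
 * different cache line. */
uint64_t sum_col_major(const uint8_t *grid)
{
    assert(grid != NULL);
    uint64_t total = 0;
    for (size_t x = 0; x < COLS; x++)
        for (size_t y = 0; y < ROWS; y++)
            total += grid[y * COLS + x];
    return total;
}
```

Call both on the same buffer and they return the same sum. Counting instructions would rate them as equal, but on typical hardware the second one tends to run slower because of the extra cache misses. How much slower depends on the CPU and cache sizes, so time it on your own machine rather than trusting a number from a manual.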