I'm not sure. In my experience the flushes are not necessary: the hardware keeps the caches coherent, and the locked instructions (compare-and-exchange, for example) and mfence take care of atomicity and ordering.
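A rough sketch of what I mean, using C11 atomics instead of hand-written assembly. The names shared_counter and counter_add are made up for illustration. On x86 a compiler will typically turn the compare-exchange into a lock cmpxchg, and there is no cache flush anywhere in it:

```c
#include <stdatomic.h>

/* Hypothetical shared counter; the name is illustrative only. */
static atomic_int shared_counter = 0;

/* Add delta to the counter with a compare-and-exchange loop.
 * Returns the value the counter holds after this update.
 * No explicit cache flush is issued anywhere. */
static int counter_add(int delta)
{
    int expected = atomic_load(&shared_counter);

    /* On failure, expected is refreshed with the current value,
     * so the loop simply retries with up-to-date data. */
    while (!atomic_compare_exchange_weak(&shared_counter,
                                         &expected,
                                         expected + delta)) {
        /* retry */
    }
    return expected + delta;
}
```

Several threads can call counter_add at the same time and every update stays visible to the others. That is the normal memory-to-memory case between cores, not the device case.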
Maybe driver code needs the flushes. A driver has to know the data has really reached RAM before a device using DMA can read it.
The cache-line flush instruction, clflush, seems to be a late addition that arrived with SSE2.