Actually, a much larger effect comes from the differing attenuation and reflection of the signal by each ear. That's how "2-speaker 3D sound" systems like QSound, A3D etc. worked - by slightly altering the actual sound pattern to simulate it passing through your skull / around your head, rather than just changing the volume.
The problem is that 0.7ms of delay is NOTHING when the primary data channel is operating over something like Bluetooth (i.e. a 2.4GHz carrier, data rates around 1Mbit/s, etc.). At those rates, 0.7ms is orders of magnitude longer than a single bit period, even after you account for error detection, retransmission (if you even bother), correction, etc.
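To put rough numbers on that (a back-of-envelope sketch assuming a ~1Mbit/s Bluetooth-class link; the figures are illustrative, not from any spec):

```python
# At ~1 Mbit/s, one bit takes ~1 microsecond on the air.
bit_rate = 1_000_000          # bits per second (illustrative Bluetooth-class rate)
bit_period_s = 1 / bit_rate   # ~1 microsecond per bit

sync_budget_s = 0.7e-3        # the 0.7 ms sync target

# How many bit periods fit inside the sync budget:
bits_in_budget = sync_budget_s / bit_period_s
print(round(bits_in_budget))  # 700
```

So the 0.7ms budget is hundreds of bit times wide - plenty of room for a timing exchange, even a sloppy one.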
What they are saying is that they have to synchronize three separate wireless devices to within 0.7ms of each other. My wireless network does that all day long on similar frequencies, on commodity-priced hardware, with error correction, encryption and retransmission.
And if you're really that worried, you buffer ever so slightly (even a few ms will do) and spend more time on sync to make sure both earbuds keep the same idea of "now".
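The buffer-plus-shared-clock idea can be sketched in a few lines. This is a toy illustration, not anything Apple or the Bluetooth spec actually does: it assumes each earbud keeps a clock synced to the source to within the sync budget (e.g. via periodic beacons), and the names and numbers are made up.

```python
BUFFER_S = 0.005  # a few ms of buffered audio, enough to hide radio jitter

def playback_start(shared_now: float) -> float:
    """Agree to start the next audio frame at the same absolute time,
    one buffer's worth into the future, rather than 'on arrival'."""
    return shared_now + BUFFER_S

# Two earbuds whose clocks disagree by 0.3 ms (within the 0.7 ms budget):
left_start = playback_start(shared_now=100.0000)
right_start = playback_start(shared_now=100.0003)

# The start times differ only by the residual clock error, not by the
# (much larger and more variable) per-packet radio latency.
print(abs(left_start - right_start))  # ~0.0003 s
```

The point is that once playback is anchored to an agreed future timestamp, link jitter disappears into the buffer and only clock error remains.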
Basically, Apple chose a crap design with inherent problems that everyone else has so far avoided, and then blamed it for their supply problems, while similar - and far superior - solutions are already sitting in everyone's phones, laptops, cars and access points.
Honestly, if your pings across a local network spike by more than a few ms, you have a crap network. Hell, I have to use the Linux ping tool because it reports proper floating-point times rather than the bare "1ms" that Windows ping gives me.
And once you buffer and accept a tiny, imperceptible delay between the audio source and the headphones for that buffer, syncing two buffered speakers playing the same source is relatively trivial.