It all depends on how much time you have. All you need to do is emit sounds that are inaudible, or at least indistinguishable from the high-frequency noise we have been trained to accept (PWM noise from LCD brightness control, etc.). If you have plenty of time, you can cut your bitrate heavily in the handshaking step, basically looking for just a few bits of signature across a very wide span of frequencies and encodings. Once you have a basic channel, you can tell your counterpart what SNR you are getting and successively tune the channel.
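To make the handshake idea concrete, here is a rough sketch of the receiving side: integrate over a long window and check a handful of candidate carrier frequencies for a weak tone buried in noise. It uses the Goertzel algorithm, which is just one reasonable way to do this, not a claim about how any real implementation works. The 48 kHz sample rate, the candidate frequencies, the amplitudes and the helper names are all assumptions picked for illustration.

```python
import math
import random


def goertzel_power(samples, freq_hz, sample_rate):
    """Return the (unnormalized) power of `samples` at `freq_hz` (Goertzel)."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def find_carrier(samples, candidates_hz, sample_rate):
    """Return the candidate frequency carrying the most energy."""
    powers = {f: goertzel_power(samples, f, sample_rate) for f in candidates_hz}
    return max(powers, key=powers.get)


def make_test_signal(freq_hz, sample_rate, n, amplitude, noise_sigma, seed=0):
    """Hypothetical test input: a weak sine tone plus Gaussian noise."""
    rng = random.Random(seed)
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
        + rng.gauss(0.0, noise_sigma)
        for i in range(n)
    ]


if __name__ == "__main__":
    # Assumed setup: 0.1 s of audio at 48 kHz, tone amplitude well below the noise.
    signal = make_test_signal(19000, 48000, 4800, amplitude=0.2, noise_sigma=1.0)
    print(find_carrier(signal, [17000, 18000, 19000, 20000], 48000))
```

The point is the trade-off: the longer the window, the weaker the tone you can still pick out, which is exactly why a slow handshake can work at levels nobody would notice.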
You would never want this for regular networking with any kind of latency demands. If instead you are just trying to get a specific updated payload across at some point, with any number of retransmissions allowed, then I find it quite believable.