Their problem was that they designed the architecture before putting real constraints on performance (latency).
I can only assume the designers weren't audio programmers, because the architecture is a mess (so many layers of mixing and buffering).
Then on top of this messy architecture, you have the problem that none of your user applications run with a high enough priority to get the CPU when they need it, so you get gaps in playback at shorter latencies.
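To make the priority point concrete, here's roughly what an audio thread has to ask for just to get scheduled reliably. This is a plain POSIX sketch, not any Android-specific call, and the priority value is only illustrative; the punchline is that an unprivileged app usually gets refused, which is exactly the problem.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Ask the kernel to run the calling thread under SCHED_FIFO so it preempts
// normal timeshared threads whenever an audio buffer is due.
static bool request_realtime_priority() {
    sched_param param{};
    param.sched_priority = 80;  // illustrative value within the SCHED_FIFO range

    int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
    if (err != 0) {
        // Typically EPERM for an unprivileged process: the audio thread stays
        // at normal priority and starts missing deadlines once buffers shrink.
        std::fprintf(stderr, "SCHED_FIFO denied (err=%d), staying at default priority\n", err);
        return false;
    }
    return true;
}
```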
I mean, PulseAudio beats it hands down... on its own platform...
For the moment, I don't see this getting fixed to the point where you can run soft synths and drum machines in real time.
Someone at Google needs to stand up, take ownership of it, and _rewrite the audio layer_ from the ground up (maybe starting with pulse), putting sound modules _inside_ the realtime sound generation thread.
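Something like this is what I mean by putting the sound module inside the realtime thread. The names and the callback shape are made up for illustration; the point is the structure: the synth renders directly in the buffer the hardware thread is about to consume, with no extra mixing or buffering layers in between.

```cpp
#include <cmath>
#include <cstdint>

constexpr double kTwoPi = 6.283185307179586;

// A trivial sound module: runs on the realtime thread, so no locks,
// no allocation, just math into the output buffer.
struct SineVoice {
    double phase = 0.0;
    double freq  = 440.0;

    void render(float* out, int32_t frames, double sample_rate) {
        const double inc = kTwoPi * freq / sample_rate;
        for (int32_t i = 0; i < frames; ++i) {
            out[i] = static_cast<float>(0.2 * std::sin(phase));
            phase += inc;
            if (phase > kTwoPi) phase -= kTwoPi;
        }
    }
};

// Hypothetical callback the audio layer would invoke once per buffer,
// on the same high-priority thread that feeds the hardware.
void render_audio(SineVoice& voice, float* buffer, int32_t frames) {
    voice.render(buffer, frames, 48000.0);
}
```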
Tell you what, Google: hire me, and I'll do it.