Why? Because voice processing and searching at the scale of applications like Siri require centralized processing. Therefore your voice commands have to be sent somewhere else and processed.
At the moment. As the technology improves, more and more will be done client side, because round-tripping audio is stupid if you can do it locally. If Siri or something like it were completely local, there would be no issue. Unfortunately there has been little or no work on practical on-the-spot voice recognition lately, because the money is all in spying - be it for surveillance or ads.
It's not like appliance controls are complicated - there's only a handful of "TV: Change channel to ESPN" or "Kettle: Tea, Earl Grey, Hot" phrases that need to be trained in. But since the business models of operators like Nuance are predicated on licensing access to their huge server farms, no other option is even considered except the one that destroys privacy.
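To make the point concrete, here's a minimal sketch of how small that "handful of phrases" problem really is once the audio has been transcribed locally. The command grammar and device names below are made up for illustration; the fuzzy matching just uses Python's standard library, so minor recognition errors still resolve to the intended phrase:

```python
import difflib

# Hypothetical command grammar for a local home hub - the whole
# "vocabulary" an appliance controller actually needs. Names and
# actions here are illustrative, not any real device API.
COMMANDS = {
    "tv change channel to espn": ("tv", "set_channel", "ESPN"),
    "kettle tea earl grey hot": ("kettle", "brew", "earl grey"),
    "lights off": ("lights", "power", "off"),
}

def match_command(transcript):
    """Map a locally recognized transcript to a device action.

    Fuzzy matching tolerates small transcription errors; returns
    None when nothing in the grammar is close enough.
    """
    key = transcript.lower().strip()
    hits = difflib.get_close_matches(key, COMMANDS.keys(), n=1, cutoff=0.6)
    return COMMANDS[hits[0]] if hits else None

# A slightly garbled transcript still resolves to the right command,
# and out-of-grammar speech is rejected rather than shipped to a server.
print(match_command("TV change chanel to ESPN"))
print(match_command("order me a pizza"))
```

The matching is trivial precisely because the grammar is tiny; the hard part (acoustic transcription) is the piece that currently gets round-tripped to a server farm, and it's the piece that could be done on-device for a closed vocabulary like this.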
We need regulation - no server-side processing of client-side controls. If it can be done locally, then it MUST be done locally.