For starters, there are a lot of tricks used in devices. You don't need full speech-model inference to do decent voice recognition for dialing and commands.
Offline inference is pretty spotty for me on my Pixel 9A. It's often unable to understand me well enough to take down an address. But on the other hand, when it has a good signal it is good enough to sniff conversations in the room and start recording products my wife and I are discussing. Probably because the engines are far better trained on product names than on place names. Just the sort of thing you'd expect from an enterprise that gives away free services in order to sell more advertising.
CC automation is not trivial. It requires a much broader context in order to distinguish different voices and different characters, and to handle the far wider range of things they might say compared to someone trying to order a new blender. And these engines typically pull in extra context when they detect certain key tokens, a Markov-chain-style match against extended dictionaries. With a film or TV program, with lots of foley work mixed into the main track, the inference is tough to do in a general way. At least not on the cheap, with a small model that scales across the Borg instances YT would conceivably launch en masse.