I did some videos where I had a script, so it was easy to extract the subtitles and compare them to the script. Yeah, it's pretty bad. Probably every other sentence had some obviously incorrect subtitling. And these videos had clean audio: no background noise, good mic.
I ended up having to correct the subtitles using the transcript.
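For anyone who wants to put a number on that kind of comparison, here is a rough sketch of a word error rate calculation: the word-level edit distance between script and subtitles, divided by the script length. It assumes both are available as plain strings; the function names and the sample sentences are made up for illustration, and real subtitle files would need their timestamps stripped first.

```python
import re


def words(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference text contains no words")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from subtitles
                            curr[j - 1] + 1,      # extra word in subtitles
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sample: 3 of the 10 script words come out wrong.
script = "the quick brown fox jumps over the lazy dog today"
subtitle = "the quick brown box jumps over a lazy fog today"
print(f"{word_error_rate(script, subtitle):.0%}")
```

Punctuation and capitalization are ignored here on purpose, since subtitles often differ from a script in those without being wrong in any way that matters to a viewer.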
I saw a study recently on noisy audio in a group setting, so the speakers may not have been perfectly mic'ed and all. The speech-to-text was done with a couple of Whisper models. The error rate was something like 30%, meaning roughly 3 words in 10 were bogus. I imagine that YT does some preprocessing. But overall, I find the subtitles not terribly good. Still, they're useful and better than nothing.