Comment Re:Shit (Score 5, Interesting) 29
Not entirely.
Whisper actually works rather well in several specific use cases, and fails spectacularly in others. You need to know this in advance:
- Whisper is roughly 90% accurate at transcription and translation
- Whisper absolutely does not know what to do with silence and will randomly inject "subtitled by (fansub group, netflix, etc)" into silence
- Whisper does not really understand singing well
- Whisper does not understand code-switching (eg switching between English and Japanese in the same context window)
- Whisper understands zero onomatopoeia, just like all ASR systems.
With that said, it is not useful or reliable for:
1. Fansubbing, especially anything adult. It can only understand words, not onomatopoeia. So when it stumbles into a scene where someone goes "ah!" it has zero context for it. The result is actually pretty silly, and often turns sex scenes in R-rated and unrated media into a series of random gibberish words that begin with the same sound. Likewise children playing and women giggling often turns it into a series of nonsense, sometimes sexually charged words.
2. Transcription of podcasts. Sorry bub, your average podcaster has a shitty microphone, and can not subtitle when multiple people are speaking over each other. Especially when people use Zoom or Discord to have a multi-party video. If you want to use it to transcribe a podcast, record each participant separately and merge the result.
3. ASR technology is often built on corpus of bad data that elevates profanity when it tries to guess words it can not understand. So it's more likely to use racist language "trigger" becomes the same word with an n, that isn't even in the audio. So your input source must be professional grade, or it's word error rate will be higher and favor profanity or racist language over other more less-often but more obvious words.
I doubt most people will use this in practice as Whisper.cpp is insanely slow without being expressly used on a 16GB nvidia GPU anyway.