Researchers Use Fluid Dynamics To Spot Artificial Imposter Voices (theconversation.com) 23
An anonymous reader quotes a report from The Conversation: To detect audio deepfakes, we and our research colleagues at the University of Florida have developed a technique that measures the acoustic and fluid dynamic differences between voice samples created organically by human speakers and those generated synthetically by computers.
The first step in differentiating speech produced by humans from speech generated by deepfakes is understanding how to acoustically model the vocal tract. Luckily scientists have techniques to estimate what someone -- or some being such as a dinosaur -- would sound like based on anatomical measurements of its vocal tract. We did the reverse. By inverting many of these same techniques, we were able to extract an approximation of a speaker's vocal tract during a segment of speech. This allowed us to effectively peer into the anatomy of the speaker who created the audio sample.
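The article doesn't name the exact estimation method, but a common way to recover a rough vocal tract shape from recorded speech is linear predictive coding: the LPC coefficients can be stepped down to reflection (PARCOR) coefficients, which map onto the relative cross-sectional areas of a concatenated-tube model of the tract. A minimal sketch, assuming librosa for the LPC fit and a placeholder "sample.wav"; the frame choice, model order, and sign convention are illustrative guesses, not the authors' pipeline:

import numpy as np
import librosa

def lpc_to_reflection(a):
    # Step-down (backward Levinson) recursion: LPC polynomial -> PARCOR coefficients.
    a = np.asarray(a[1:], dtype=float)          # drop the leading 1
    order = len(a)
    k = np.zeros(order)
    for p in range(order, 0, -1):
        k[p - 1] = a[p - 1]
        if abs(k[p - 1]) >= 1.0:
            raise ValueError("unstable LPC frame")
        if p > 1:
            a = (a[:p - 1] - k[p - 1] * a[p - 2::-1]) / (1.0 - k[p - 1] ** 2)
    return k

def reflection_to_areas(k, lips_area=1.0):
    # Kelly-Lochbaum tube model: each reflection coefficient fixes the ratio of
    # adjacent tube cross-sections (sign convention varies between textbooks).
    areas = [lips_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 + ki) / (1.0 - ki))
    return np.asarray(areas)

y, sr = librosa.load("sample.wav", sr=16000)            # placeholder file name
frame = y[sr // 2 : sr // 2 + 512] * np.hamming(512)    # one (hopefully voiced) frame
areas = reflection_to_areas(lpc_to_reflection(librosa.lpc(frame, order=16)))
print(areas / areas.max())                              # relative cross-sections along the tract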
From here, we hypothesized that deepfake audio samples would fail to be constrained by the same anatomical limitations humans have. In other words, when analyzed, deepfaked audio samples would imply vocal tract shapes that do not exist in people. Our testing results not only confirmed our hypothesis but revealed something interesting. When extracting vocal tract estimations from deepfake audio, we found that the estimations were often comically incorrect. For instance, it was common for deepfake audio to result in vocal tracts with the same relative diameter and consistency as a drinking straw, in contrast to human vocal tracts, which are much wider and more variable in shape. This realization demonstrates that deepfake audio, even when convincing to human listeners, is far from indistinguishable from human-generated speech. By estimating the anatomy responsible for creating the observed speech, it's possible to identify whether the audio was generated by a person or a computer.
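As a toy illustration of that "drinking straw" observation, one could flag tract estimates whose cross-sections are implausibly narrow and uniform along their length. The spread threshold below is an invented illustrative number, not a figure from the paper:

import numpy as np

def looks_like_a_straw(areas, min_spread=0.2):
    # `areas` is a 1-D array of relative tube cross-sections (e.g. from the sketch above).
    # Human tracts vary a lot along their length; a near-constant profile is suspicious.
    rel = np.asarray(areas, dtype=float)
    rel = rel / rel.max()
    return (rel.max() - rel.min()) < min_spread

print(looks_like_a_straw([0.90, 0.95, 1.00, 0.92, 0.97]))  # True: straw-like, deepfake-ish
print(looks_like_a_straw([0.20, 0.60, 1.00, 0.40, 0.80]))  # False: human-like variation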
Arms race (Score:3)
Re: (Score:1)
I was thinking the same ...
Re: Arms race (Score:4)
Re: Arms race (Score:3)
Re: (Score:2)
https://youtu.be/-zVgWpVXb64?t... [youtu.be]
Re: (Score:2)
Just re-watched it last month!
Re: (Score:1)
I would imagine a lot of the deepfake detection research going on around the world is unpublished since publishing the detector is a roadmap to defeating it. Ultimately a recorded sound is a sequence of bytes and there's no theoretical reason it couldn't be faked perfectly.
A recorded voice has organic vocalizations that computers can't emulate. Regardless of how good you think you made your simulation, short of cutting out or growing custom organic vocal cords and pumping the output through those, you are not going to get the subtle oscillations of organic resonances.
Re: Arms race (Score:5, Informative)
Bruh do you even ML. In short, no. You don't need to make meat voiceboxes to train a neural network to make the correct sounds. You literally just need what this paper has - a detector - and then you use its signal during training. The neural network will figure it out somehow. (how it does it "depends")
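For the curious, "use its signal during training" roughly means bolting the detector's score onto the generator's loss, GAN-style. A minimal sketch with made-up PyTorch stand-ins for the generator and detector (the shapes, layers, and weights are arbitrary, and this is not the detector from the article):

import torch
import torch.nn.functional as F

# Schematic stand-ins: `generator` maps conditioning features to audio frames,
# `detector` is a frozen copy of some real-vs-fake classifier.
generator = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.Tanh(),
                                torch.nn.Linear(256, 1024))
detector = torch.nn.Sequential(torch.nn.Linear(1024, 64), torch.nn.ReLU(),
                               torch.nn.Linear(64, 1))
for p in detector.parameters():
    p.requires_grad_(False)              # detector is fixed; only the generator adapts

opt = torch.optim.Adam(generator.parameters(), lr=1e-4)

def training_step(cond, target_audio, adv_weight=0.1):
    fake = generator(cond)
    recon = F.l1_loss(fake, target_audio)        # ordinary synthesis loss
    # Adversarial term: push the detector's "this is real" logit toward 1 on fakes,
    # so the generator learns to produce whatever the detector can't flag.
    adv = F.binary_cross_entropy_with_logits(detector(fake),
                                             torch.ones(fake.shape[0], 1))
    loss = recon + adv_weight * adv
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# One toy step on random tensors, just to show the plumbing.
print(training_step(torch.randn(8, 128), torch.randn(8, 1024)))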
Re: Arms race (Score:2)
How it does it at a low level is probably a mystery even to its programmers. ANNs are highly opaque.
Re: (Score:2)
Deepfakes don't have to be perfect, they just have to be good enough to "pass" as real.
But, let's suppose you are right, and the only way to "pass" is to use real or lab-grown vocal cords. If that's what it takes, you can be pretty sure that CIA-type organizations around the world and deep-pocketed private companies are already trying to do exactly this.
Re: (Score:2)
Man no. Just no. We have yet to find an audio source we can't model.
Audio is just sine waves my dude, lots of sine waves. And we've known how to do those since Ptolemy's table of chords, and we've known how to work out what those sine waves are since Fourier in the early 1800s.
Ultimately you really just need the formant frequencies and the overtones, and you'll derive the parameters of your artificial voicebox from that.
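For what it's worth, pulling formant frequencies out of a recording is a few lines of code: the roots of an LPC polynomial sit near the formants. A rough sketch (frame position, model order, and the "sample.wav" file name are arbitrary placeholders):

import numpy as np
import librosa

def formants(frame, sr, order=12):
    # Fit an LPC model to one windowed frame and read formant estimates (Hz)
    # off the angles of the polynomial roots in the upper half-plane.
    a = librosa.lpc(frame * np.hamming(len(frame)), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]   # one of each conjugate pair
    freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
    return [f for f in freqs if f > 90]                  # drop near-DC artifacts

y, sr = librosa.load("sample.wav", sr=16000)             # placeholder file name
print(formants(y[sr // 2 : sr // 2 + 512], sr)[:4])      # roughly F1..F4 for that frame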
Re: (Score:2)
Re: Arms race (Score:2)
Not necessarily. If the AI doesn't have enough parameters to tweak it may solve one issue but cause another in the process.
Re: (Score:2)
Ultimately a recorded sound is a sequence of bytes and there's no theoretical reason it couldn't be faked perfectly.
I'm not an audio engineer, but I imagine it would depend on the bit rate of the fake/digital sound and the capability of the analysis gear. Human speech, and natural sound, is analog and continuous...
Re: (Score:2)
Nah. 100% of deepfakes give themselves away:
1. They lack emotional depth. SOTA (State of the Art) stuff still cannot emote, express sarcasm, laugh, or scream on cue. It can only do these in post-processing. If you were to engage in real time, the deepfake would fail every time.
2. Most deepfake audio is done deliberately with poor quality audio. This is because the voice systems used are often only sampled at 16 kHz, not the 48 kHz or 96 kHz that would be necessary to fool the human ear AND digital forensics. T
Department of Redundancy Department (Score:1)
What is an "Artificial Imposter"? Is it a real person or a robot that's pretending to be an imposter?
Re: (Score:3)
It's a simulation of a human rather than an actual human performing the imposture.
Adversarial training will solve this (Score:3)
Re: (Score:2)
upvote parent. That was my first thought too.
Of course, now this can be fixed (Score:2)
All you need to do is set up a loop with the "deepfake detection" algorithm against deepfake audio creation and in a few weeks, you won't be able to detect a difference any more.
Re: (Score:2)
Well.. hell I should have read the responses ... everyone realizes this and I can't delete it.