AI applied to audiobooks is simply the next step in the same progression of AI and skilled labor that we see everywhere else in our society.
I have been listening to TTS books since before 2008, when I used eSpeak to make an audiobook of the Shon Harris 4th Edition Complete CISSP Study Guide. I have since used eSpeak for gigabytes of fiction. However, now I listen to the Google TTS on Android, using AlReader.
eSpeak sounded like a "drunken robot", and I hid some of its flaws by using an accent that was not my own--in my case, a UK English voice. Listening to eSpeak meant training your ear to the inflections of the robot, but while it was mechanical, it had remarkable pronunciation and moderately good inflection. It handled commas, colons, semicolons, periods, question marks, and exclamation marks with reasonable choices for stresses in the generated voice, better than other TTS voices AT THAT TIME. What it couldn't do was sound natural.
AlReader uses the Google TTS voices on my phone, and handles ePub, mobi, text, and some other formats. It still stumbles on some words; "CO", as "Commanding Officer", is regularly read as "Colorado".
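One workaround for a stumble like "CO" is to expand the abbreviation in the text before the TTS engine ever sees it. What follows is only a minimal sketch, not something AlReader or Google TTS provides: the function name `expand_abbreviations` and the `EXPANSIONS` table are hypothetical, and it assumes a per-book list where "CO" never legitimately means Colorado.

```python
import re

# Hypothetical per-book table of abbreviations the TTS voice misreads.
# Assumption: in this book "CO" always means "Commanding Officer".
EXPANSIONS = {
    "CO": "Commanding Officer",
}

def expand_abbreviations(text, expansions=EXPANSIONS):
    """Replace whole-word, case-sensitive abbreviations with their spoken form."""
    if not expansions:
        return text
    # Longest keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(expansions, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: expansions[m.group(1)], text)

if __name__ == "__main__":
    print(expand_abbreviations("The CO wants a report by noon."))
```

The word-boundary match leaves words such as "COLD" or "Colorado" alone, but a book that uses "CO" both ways would need something smarter than a lookup table.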
An indicator of how TTS has improved is that I no longer use UK voices unless the narrator is British. The American accents don't jar me the way they used to, and the inflections have improved. Are the performances less than perfect? Yes. However, I would say that the Google TTS performances are better than a bad LibriVox effort, almost equivalent to a poor commercial effort. One significant advantage: the TTS voices NEVER get tired. The accent doesn't change across chapter breaks. If it mispronounces something, it ALWAYS mispronounces it the same way.
But THESE have been mechanical reproductions, without AI contributions. I expect that AI reproductions will begin to sound even better, so that their efforts will be as good as a good LibriVox recording, BETTER than a poor commercial recording.
But one thing eSpeak COULD do was follow markup language for voice synthesis. I experimented once, and found that I didn't know quite what I was doing--but the Russian speaker had a deep voice with a Russian accent, while another character had a British accent, and another... If the markup language were made available with the Google TTS engine, you might hear something quite remarkable.
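For anyone curious what that kind of markup looks like, here is a rough sketch in SSML, the W3C speech markup that eSpeak can interpret when run with its `-m` option. The helper `build_ssml`, the character names, and the language codes are all made up for illustration, and whether any particular engine honors the `<voice>` tags this way is an assumption.

```python
from xml.sax.saxutils import escape, quoteattr

def build_ssml(lines, voices, default_lang="en-US"):
    """Wrap each (speaker, text) pair in an SSML <voice> element.

    `voices` maps a speaker name to a language code; speakers not
    listed fall back to `default_lang`.
    """
    parts = ['<speak version="1.0" xml:lang=%s>' % quoteattr(default_lang)]
    for speaker, text in lines:
        lang = voices.get(speaker, default_lang)
        # escape() protects &, < and > so the markup stays well-formed.
        parts.append("  <voice xml:lang=%s>%s</voice>" % (quoteattr(lang), escape(text)))
    parts.append("</speak>")
    return "\n".join(parts)

if __name__ == "__main__":
    # Hypothetical characters and voice assignments.
    dialogue = [("Ivan", "Tea & vodka?"), ("Nigel", "Quite.")]
    print(build_ssml(dialogue, {"Ivan": "ru", "Nigel": "en-GB"}))
```

The output could be saved to a file and handed to a synthesizer that reads SSML; tagging every line of dialogue in a whole novel by speaker is the hard part, not the markup itself.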
The AI impact on copyright and voice performance and reproduction rights is going to take years to shake out, and I'm not going to even TRY to predict where it's going. But cars are built by robots, now.