To be honest, I don't understand the push toward training. If you had "good" AI, couldn't you run it locally (which I think is Apple's gambit: "you" run the AI, not some cloud server somewhere), and then have it do what humans do, which is get information by "reading" a webpage, instead of trying to download and encode the entire internet into "the parameters", which is what much of the current brand of AI appears to be doing?
That is, why do you have to "train" an AI by having it process a billion images? Why can't it, when you ask it to find something, simply go and find it?
(The flip side, of course, is that training an AI by having it look at, say, images is materially no different from humans looking at images to learn. So either there is copyright infringement any time a person learns by looking at an image, or there is no infringement when an AI "learns" by "looking" at one. And beyond the transient "copy" of the data made in order to "view" the image, once it has been processed into parameters, is it still a copy at all? Are companies really trying to make rules like "no entity may look at this web page and learn from it"?)