A point, but (and this is admittedly a quibble) I wouldn't call languages a "vast body of human knowledge". The data encoded within a language might qualify, but not the language itself. Unfortunately, without understanding the language there's no way to reasonably estimate how much of the "human knowledge" it contains isn't in sources already covered.
FWIW, I think treating "the internet" as a body of human knowledge is foolish. Parts of it are, but much of it is negative knowledge (i.e. learning it makes you stupider). The internet *is* a body of human information, but some information is garbage.
Now I admit that, say, Tamil may encode large amounts of history and large amounts of myth. Whether they are separated clearly enough to be called knowledge isn't something I can tell. (Actually, Tamil should contain much of the history of the development of math, but it's not clear to me that this would be readily distinguishable from the related myths even by a careful historian, much less by a current LLM.)