My first thought was: who in their right mind in private industry is going to share datasets? Training datasets are some of the most valuable intellectual property in this new world, and most companies in this space spend huge sums of money curating them. The exact contents of those datasets and (most importantly) the processes used to curate them are tightly held secrets.
I think a better approach would be to fund the development (probably via the existing NSF grant process) of relatively generic training datasets, along the lines of ImageNet and PASCAL, that are useful for comparing different models and techniques, hopefully without some of those datasets' flaws and quirks.
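To make the comparison point concrete, here's a rough sketch of what a shared benchmark buys you: score any two models on the same held-out split and the numbers are directly comparable. (The `./imagenet/val` path is just a placeholder for a local copy of a standard ImageNet-style validation folder, and the two models are arbitrary examples.)

```python
# Rough sketch: why a shared benchmark matters. Score any two models on the
# same held-out split and the numbers are directly comparable.
# Assumes torchvision is installed; "./imagenet/val" is a placeholder path.
import torch
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torchvision.models import resnet18, mobilenet_v2, ResNet18_Weights, MobileNet_V2_Weights

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),  # standard ImageNet stats
])
val_set = ImageFolder("./imagenet/val", transform=transform)
loader = torch.utils.data.DataLoader(val_set, batch_size=64, num_workers=2)

def top1_accuracy(model):
    """Fraction of validation images the model labels correctly."""
    model.eval().to(device)
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Two off-the-shelf models, same data, directly comparable numbers.
for name, model in [("resnet18", resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)),
                    ("mobilenet_v2", mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1))]:
    print(f"{name}: top-1 accuracy {top1_accuracy(model):.3f}")
```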
On the other side of it, you don't need cloud services to run or train AIs. Right now, in my opinion, much of the biggest leverage is in fairly lightweight AI implementations that you can put in low-power, low-cost edge computing devices. You can train those models on an inexpensive Linux box with a few RTX cards (I train 'em in about an hour on an older Ubuntu box with dual RTX cards). So there are a couple of things you could do there: one is to fund the open-source development of standardized training and evaluation tools; another is to subsidize the purchase of such hardware (already not very expensive) by researchers; even a very modest subsidy of around $1,000 would likely go a long way.
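To give a sense of just how modest the hardware requirements are, here's a minimal PyTorch sketch of that kind of lightweight training run. The tiny model, CIFAR-10, and the hyperparameters are placeholder choices for illustration, not a prescription.

```python
# Minimal sketch: training a small, edge-sized CNN on a single consumer GPU.
# Assumes PyTorch and torchvision; CIFAR-10 downloads to ./data on first run.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class TinyNet(nn.Module):
    """A deliberately tiny network, the kind that fits on low-power edge hardware."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                         transform=T.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True, num_workers=2)

model = TinyNet().to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):  # a handful of epochs finishes in minutes on one RTX card
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")

torch.save(model.state_dict(), "tinynet.pt")  # small enough to ship to an edge device
```

On a dual-GPU box like mine, wrapping the model in torch.nn.DataParallel is the quick-and-dirty way to use both cards.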
The final thing you could do is break the damned NVIDIA monopoly.
Also, please decide on the specific research objectives you are going to prioritize before you start writing checks.
Note that none of these things (except possibly fixing the NVIDIA monopoly) would cost very much money, and all of them would speed up AI development quite a bit.