Let's look at this from the opposite point of view: PyTorch is the actual joke here, offering not much more than entry-level, programmer-friendly data transformation pipelines on top of a haphazardly coded bridge between tensors and whatever CUDA has to offer this week. Other GPU architectures don't seem to get much attention at all, and their support comes down to a series of badly maintained hacks that barely make things work (hence your frustration with MPS, I guess).
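To make the MPS gripe concrete: even picking a device is a backend-by-backend dance, and MPS still needs an escape hatch for ops it doesn't implement. A minimal sketch using the public torch API (the fallback env var is real, but exactly which ops need it shifts between releases):

    import os
    # Must be set before any ops run: lets unsupported MPS ops fall back
    # to CPU instead of raising NotImplementedError.
    os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
    import torch

    if torch.cuda.is_available():
        device = torch.device("cuda")   # the first-class citizen
    elif torch.backends.mps.is_available():
        device = torch.device("mps")    # works, with asterisks
    else:
        device = torch.device("cpu")

    x = torch.randn(1024, 1024, device=device)
    y = x @ x  # matmul is fine everywhere; plenty of other ops aren't on MPS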
On the other hand, Apple has a rock-solid track record when it comes to supplying APIs and toolchains for their own stack, so there's reason to expect something pretty decent once they start rolling out LLM functionality for their own devices.
As for things like FP16 support, it's still quite debatable whether it's a good idea for training in the first place (I'm sure there will be some kind of quantization for inference). Having said that, even with Apple's insane memory pricing, you get MUCH more RAM per buck compared to Nvidia cards, so whether you're using FP16, FP32, or FP64 matters much less.
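To put the RAM-per-buck point in numbers, here's the back-of-the-envelope for weights alone; the 7B parameter count is my own illustrative pick, and actual training needs optimizer state, gradients, and activations on top of this:

    # Weight memory for a hypothetical 7B-parameter model at each precision.
    params = 7e9
    for dtype, bytes_per_param in [("FP64", 8), ("FP32", 4), ("FP16", 2)]:
        print(f"{dtype}: {params * bytes_per_param / 2**30:.0f} GiB")
    # FP64: 52 GiB, FP32: 26 GiB, FP16: 13 GiB
    # A 64-128 GB unified-memory Mac fits any of these; a 24 GB consumer
    # Nvidia card only fits the FP16 copy.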