Apple Silicon can't realistically "replace" a discrete GPU. Rather, they're... different.
The compute performance of Apple Silicon is vastly inferior to a mid-range discrete GPU's, and its memory bandwidth isn't great in comparison, either.
So, in terms of GB-of-VRAM-to-GB-of-VRAM, Apple Silicon is worse than any discrete GPU you're likely to have for ML purposes.
However, they've got something you can't get on a discrete GPU: 128GB of VRAM in a laptop, and 512GB of VRAM in a desktop.
This changes the equation, because it means your Apple Silicon (with enough RAM) can simply run models that the discrete GPU just can't*
So in terms of "being able to run a local model of size X" (where X > 32GB already rules out a top-end consumer NVIDIA card, and far less rules out a mid-grade one), Apple Silicon is competing with datacenter cards and clusters, at least as far as which models it can actually run**
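To give a rough sense of where that size cutoff lands, here's a hedged back-of-envelope sketch. The formula and the ~20% overhead factor (for KV cache, activations, and runtime) are assumptions of mine, not measured figures; real usage varies with context length and inference stack.

```python
# Rough memory estimate for running a quantized LLM locally.
# Rule of thumb (assumption): weights = params * bytes_per_param,
# plus ~20% overhead for KV cache, activations, and the runtime.

def approx_vram_gb(params_billions: float, bytes_per_param: float,
                   overhead: float = 1.2) -> float:
    """Approximate memory footprint in GB for a params_billions-B model."""
    return params_billions * bytes_per_param * overhead

# A 70B model at 4-bit quantization (~0.5 bytes/param):
print(round(approx_vram_gb(70, 0.5), 1))  # ~42 GB: too big for a 32GB card
# The same model at 8-bit (~1 byte/param):
print(round(approx_vram_gb(70, 1.0), 1))  # ~84 GB: fits in 128GB unified memory
```

By this estimate, even an aggressively quantized 70B model blows past any consumer discrete GPU but sits comfortably inside a 128GB machine.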
As for the NPU: it's useless for bandwidth-bound tasks (which means LLM inference), and for non-bandwidth-bound tasks it generally performs worse than the GPU, though much more efficiently. It also has the drawback of usually requiring interaction with weird frameworks (while compute shaders are generally well understood).
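The bandwidth-bound point can be made concrete with a quick estimate: during autoregressive decoding, each generated token reads roughly every weight once, so peak throughput is capped at memory bandwidth divided by model size in bytes. The bandwidth number below is the advertised top M4 Max figure; treat both inputs as illustrative assumptions.

```python
# Why LLM decoding is bandwidth-bound: each token reads ~all weights once, so
#   tokens/sec ceiling ~= memory_bandwidth / model_size_in_bytes.
# Inputs are illustrative assumptions, not benchmarks.

def approx_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode throughput for a memory-bandwidth-bound model."""
    return bandwidth_gb_s / model_gb

# ~546 GB/s (top M4 Max configuration) on a 40 GB quantized model:
print(round(approx_tokens_per_sec(546, 40), 2))  # ~13.65 tok/s ceiling
```

No amount of extra compute (GPU or NPU) raises that ceiling; only more bandwidth or a smaller model does, which is why the NPU buys nothing here.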
I would not say its GPU is designed for client-side processing: unless the model is big, any discrete GPU will do the job drastically better.
That being said, I have an M4 Max with 128GB that I purchased specifically for local agentic LLM testing and development.
* Not strictly true, but effectively true since the performance is generally as bad as doing it on your CPU alone.
** Technically, AMD has the Ryzen AI Max+ 395, which kind of competes with an M4 Pro, but not an M5 Pro, an M4 Max, and absolutely not an M3 Ultra.