Anybody doing special hand optimisation for Arm processors has probably spent years optimising for Intel chipsets as well, so you can't really call bullshit on that.
Indeed, Apple have popped up and offered us some advice to improve the SIMD optimisations we'd done for Arm/NEON, and we found an extra 10% speedup. Those optimisations are good for all Arm systems, though, whether they're on-prem Ampere Altras or Amazon Graviton instances. And believe me, we've spent years tuning threading, writing better C and C++ code, and hand-writing Intel assembly or using compiler intrinsics, coupled with some expensive profiling tools to find the code hotspots. In most cases, the Arm optimisations were done quickly because the hard work had already been done.
The Amazon Graviton instances are now getting pretty capable, with higher core counts. I don't have recent benchmarks, but up to Graviton3 they were always cheaper than the Intel and AMD machines while still being very performant.