It depends hugely on the workload and it also depends a lot on the core. The ARMv8 ecosystem is quite diverse. For example, you have players like NVIDIA's Project Denver, which fuses some of their GPU ideas with designs inherited from Transmeta. The Denver core is VLIW, but with staggered pipelines, so results from one instruction in a VLIW bundle can be fed into the next (without needing rename registers, which are one of the biggest power sinks on a modern OoO CPU). When you start a program running, a simple decoder turns ARM instructions into fairly inefficient VLIW instructions, but after a little while hot loops are optimised by a JIT and get a lot faster.
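To make the two-tier idea concrete, here's a minimal sketch in C of how that kind of translator might promote hot code: cold blocks go through the cheap per-block translation, and once a block's execution count crosses a threshold it gets re-translated by the optimiser. Everything here (the names, the threshold, the block_t layout) is hypothetical illustration, not Denver's actual optimiser, which forms whole traces and schedules them into dense VLIW bundles.

```c
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

/* Toy model of two-tier translation: a cheap decoder handles cold code,
 * and blocks that cross a hotness threshold are re-translated by an
 * optimiser. All names are made up for illustration. */

#define HOT_THRESHOLD 50

typedef struct {
    uint64_t guest_pc;     /* address of the guest (ARM) basic block  */
    uint32_t exec_count;   /* how many times this block has run       */
    int      optimised;    /* 0 = naive translation, 1 = optimised    */
} block_t;

/* Stand-ins for the two execution tiers. */
static void run_naive(block_t *b)     { (void)b; /* 1:1 decode, slow */ }
static void run_optimised(block_t *b) { (void)b; /* cached fast trace */ }

static void optimise(block_t *b)
{
    /* A real optimiser would re-schedule the block's instructions into
     * dense VLIW bundles, hoist invariants, etc. Here we just flip the
     * flag and report the promotion. */
    b->optimised = 1;
    printf("block %#" PRIx64 " promoted to optimised tier\n", b->guest_pc);
}

/* Dispatch one execution of a block, promoting it when it gets hot. */
static void execute_block(block_t *b)
{
    if (b->optimised) {
        run_optimised(b);
        return;
    }
    run_naive(b);
    if (++b->exec_count >= HOT_THRESHOLD)
        optimise(b);
}

int main(void)
{
    block_t loop_body = { .guest_pc = 0x400000 };
    for (int i = 0; i < 100; i++)   /* a hot loop: crosses the threshold */
        execute_block(&loop_body);
    return 0;
}
```

The interesting trade-off is the same one Denver faces: the naive tier has to be cheap enough that cold code doesn't hurt, while the hot tier can afford expensive analysis because it only runs on loops that have already proven themselves.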
At the other end of the design spectrum, Cavium's ThunderX has 48 ARMv8 cores (real cores, not hyperthreads) per die, and supports dual-socket configurations for up to 96 cores per board. Individually the cores are weaker than a Xeon core, but on some workloads (network routing, some database serving) they're pretty impressive in aggregate. That many physical cores also makes it easier to load balance VMs in a hosted environment. This is especially good for the kind of workload where most clients are idle most of the time, but when they're busy they're very busy.
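As a rough sketch of why lots of independent cores suit that shape of workload, here's what the software side often looks like: one worker pinned per core, each serving its own stream of requests with essentially no shared state, so throughput scales with core count rather than single-core speed. This is a generic Linux/pthreads illustration, nothing ThunderX-specific, and handle_requests() is a made-up stand-in.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical per-core request loop: in a router or server this would
 * poll a queue owned by this core and serve its clients. */
static void handle_requests(long core) { (void)core; }

static void *worker(void *arg)
{
    long core = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* Pin this thread to its core: no migration, no shared hot state,
     * so 48 modest cores can be used at close to full efficiency. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    handle_requests(core);
    return NULL;
}

int main(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN); /* 48 per ThunderX die */
    pthread_t *tids = malloc(sizeof(pthread_t) * ncores);

    for (long c = 0; c < ncores; c++)
        pthread_create(&tids[c], NULL, worker, (void *)c);
    for (long c = 0; c < ncores; c++)
        pthread_join(tids[c], NULL);

    free(tids);
    return 0;
}
```

The bursty-client case falls out of the same structure: an idle client costs its core nothing, and a suddenly busy one gets a whole physical core to itself rather than contending for a hyperthread.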