SMT on a dynamically scheduled architecture requires resolving and tagging data dependencies between instructions from two or more contexts as those instructions enter the reservation station, are dispatched to execution units, and eventually retire through the reorder buffer. Speculative and cancelled instructions from two or more contexts also have to be resolved at once. That's not particularly easy to do, and the difficulty grows with the number of execution ports and accompanying execution units. Intel has been working on this for many years; even when HT was not shipping in its x86 parts (the Core 2 era), the technology was still being developed in the Itanium family of microprocessors.
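To make the tagging problem concrete, here is a minimal sketch in plain C, assuming a shared reservation station with one rename table per hardware context. The struct names, field widths, and functions are hypothetical and don't describe any actual Intel design; the point is only that every dependency tag has to identify which context owns it.

```c
/* Minimal sketch: per-context rename tables plus context-aware wakeup tags.
 * Everything here is illustrative, not any real microarchitecture. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_CONTEXTS  2    /* hardware threads sharing the core    */
#define NUM_ARCH_REGS 16   /* architectural registers per context  */

/* One source-operand tag as it might sit in a reservation-station entry. */
typedef struct {
    uint8_t  ctx;          /* which hardware context produced it    */
    uint16_t phys_reg;     /* physical register holding the value   */
} src_tag_t;

/* One rename table per context: this is what keeps r3 of thread 0 from
 * ever aliasing r3 of thread 1 when both streams are in flight at once. */
static uint16_t rename_table[NUM_CONTEXTS][NUM_ARCH_REGS];

/* Resolve a source operand of an incoming instruction to a tag. */
static src_tag_t rename_source(uint8_t ctx, uint8_t arch_reg)
{
    src_tag_t t = { .ctx = ctx, .phys_reg = rename_table[ctx][arch_reg] };
    return t;
}

/* Wakeup comparison when a result broadcasts its tag: an entry wakes only
 * if both the physical register and the owning context match. */
static bool tags_match(src_tag_t a, src_tag_t b)
{
    return a.ctx == b.ctx && a.phys_reg == b.phys_reg;
}

int main(void)
{
    src_tag_t t0_r3 = rename_source(0, 3);   /* thread 0's r3 */
    src_tag_t t1_r3 = rename_source(1, 3);   /* thread 1's r3 */

    /* Same architectural register, different contexts: must not match. */
    printf("match across contexts: %s\n",
           tags_match(t0_r3, t1_r3) ? "yes (bug!)" : "no (correct)");
    return 0;
}
```

In a real design with a shared physical register file the physical-register number alone may already be unique across contexts, but the underlying issue is the same: every structure the two threads share has to carry, or imply, which context an entry belongs to, and that bookkeeping scales with the width of the machine.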
If the ideal efficiency condition is keeping every backend execution port busy on every cycle, CMT can theoretically reach parity with SMT by running a demanding thread on each logical processor, provided no other factor such as cache latency becomes a bottleneck (assume ideality on these). Outside of synthetic benchmarks, however, this is incredibly hard to accomplish. As soon as one thread blocks (on an IO syscall, for example) or hits a long-latency event (a page fault, for example), the operating system can either put the thread on the wait queue and context switch, or do nothing and let the core stall until the event is resolved. Excessive context switches cause overhead and should be avoided, and stalls are inefficient by definition. If no context switch is performed, the CMT frontend must stall, which means its backend execution units sit idle. An SMT frontend must also stall, but its backend execution units can still be used unless the complementary thread hits a long-latency event as well.
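A rough way to probe the "one demanding thread per logical processor" condition is to pin a compute-bound thread to each sibling logical CPU of a single core and count how much work gets done. This is only a measurement sketch under assumptions: it is Linux-specific, and it assumes CPUs 0 and 1 are SMT siblings of the same physical core (check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list before trusting the numbers).

```c
/* Rough SMT throughput probe. Compile: gcc -O2 -pthread smt_probe.c -o smt_probe
 * Assumption: logical CPUs 0 and 1 share one physical core. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define RUN_SECONDS 2

static atomic_int  stop;
static atomic_long counts[2];

struct worker_arg { int cpu; int slot; };

static void *worker(void *p)
{
    struct worker_arg *a = p;

    /* Pin this thread to one logical processor. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(a->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Small integer workload to keep backend execution ports busy. */
    volatile unsigned long x = 1;
    long iters = 0;
    while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
        x = x * 2654435761u + 12345;
        iters++;
    }
    atomic_store(&counts[a->slot], iters);
    return NULL;
}

int main(void)
{
    struct worker_arg args[2] = { { .cpu = 0, .slot = 0 },
                                  { .cpu = 1, .slot = 1 } };
    pthread_t t[2];

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &args[i]);

    sleep(RUN_SECONDS);
    atomic_store(&stop, 1);

    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("cpu0: %ld iters, cpu1: %ld iters\n",
           atomic_load(&counts[0]), atomic_load(&counts[1]));
    return 0;
}
```

Rerun it with a single worker to get the baseline; on an SMT core the combined throughput of two compute-bound siblings typically falls well short of double the single-thread figure, which is the gap CMT spends extra hardware per thread to close.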
Intel's performance advantage most certainly does erode when highly concurrent tasks are employed, but AMD's microprocessors require significantly more transistors and significantly more power to reach the same level of performance.