Since the inception of HT, is there a reason CPU design hasn't advanced to the point of executing 4 threads per core rather than the 2 it has always been?
Workload and system balance, mostly.
If you look back several years (2008? earlier?) you'll see some Sun SPARC designs, and some IBM POWER designs, that supported 4 or 8 threads per core. They worked well for very specific workloads and applications.
The Sun SPARC designs with 8 threads per core were mostly tailored for "simple" highly-scalable web servers, where a thread spends most of its time blocked on I/O, and a web server can spawn a great many threads to support many simultaneous connections. They worked very well for that purpose.
IBM did stuff like that with their POWER architecture for terminal servers and financial transaction processing, where, again, the thread spends most of its time blocking on I/O.
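To make the "threads mostly blocked" argument concrete, here is a rough back-of-the-envelope sketch. It is not how any of these CPUs actually schedule work; it assumes each hardware thread is stalled for a fixed fraction of the time, independently of the others, and ignores contention for caches and execution units. The function name `core_utilization` and the stall fractions are made up for illustration.

```python
def core_utilization(stall_fraction: float, threads: int) -> float:
    """Modeled chance that at least one hardware thread on a core is ready to run.

    Assumption (not from any vendor documentation): every thread is stalled
    `stall_fraction` of the time, independently of the other threads.
    """
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # The core sits idle only when every thread is stalled at the same moment.
    return 1.0 - stall_fraction ** threads


if __name__ == "__main__":
    # Hypothetical workloads: one that waits most of the time, one that rarely waits.
    for label, stall in (("mostly waiting", 0.9), ("mostly computing", 0.1)):
        row = ", ".join(
            f"{n} threads: {core_utilization(stall, n):.0%}" for n in (1, 2, 4, 8)
        )
        print(f"{label} (stalled {stall:.0%} of the time): {row}")
```

Under these assumptions, a thread that waits 90% of the time leaves the core busy only about 19% of the time with 2 threads and about 57% with 8, so extra threads per core keep paying off. A thread that waits only 10% of the time already reaches 99% with 2 threads, so going to 4 or 8 buys almost nothing in this model.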
You don't see that so much on Intel x86/x64 systems. On the desktop side, frankly, most users don't use 4 cores well, and the few that do aren't running I/O-blocking tasks; they are running CPU-bound tasks such as video encoding, stuff that hits the SIMD units hard. CPU-bound tasks don't benefit nearly as much from HT, and that market is small enough not to be worth the extra architecture/development time.
For x64 servers there is a bit more of a market, but Intel would much rather serve it with their high-end 4-socket Xeon systems. With 10 cores per CPU and 4 CPUs, you get 40 cores and 80 threads. Oh, and you pay about $4,000 per CPU that way. That also gets you ridiculous amounts of RAM and better networking support, and you usually want both of those on an 80-thread server system anyway.
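For anyone who wants to redo the server arithmetic with other numbers, a small sketch follows. The core count, socket count, and the roughly $4,000 price are the figures quoted above; the helper name `server_totals` and the 4-threads-per-core line are hypothetical, added only for comparison.

```python
def server_totals(sockets: int, cores_per_cpu: int, threads_per_core: int,
                  price_per_cpu_usd: int) -> tuple[int, int, int]:
    """Return (total cores, total hardware threads, total CPU cost in USD)."""
    cores = sockets * cores_per_cpu
    threads = cores * threads_per_core
    cpu_cost = sockets * price_per_cpu_usd
    return cores, threads, cpu_cost


if __name__ == "__main__":
    # Figures from the answer: 4 sockets, 10 cores each, 2 threads per core (HT).
    cores, threads, cost = server_totals(4, 10, 2, 4000)
    print(f"{cores} cores, {threads} threads, about ${cost:,} in CPUs")

    # Hypothetical: the same box if each core ran 4 threads instead of 2.
    _, threads_4way, _ = server_totals(4, 10, 4, 4000)
    print(f"With 4 threads per core it would expose {threads_4way} threads")
```

The second call is purely a what-if; no such Xeon configuration is implied by the answer.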
So I suppose the answer is, basically, it has, but only where it's worthwhile.