The existing password hashing methods won't run on GPU well for user authentication, even when they do run well for cracking passwords. They lack sufficient parallelism within one hash computation. This is an issue I first raised in 1998, in pre-GPU context (it applies to recent CPUs as well, and the problem is getting worse with time).
A solution is to define a new password hashing method with sufficient (configurable) parallelism within one instance. We could then consider running it on GPU, unless it is GPU-unfriendly by other criteria. Do we really want to, though? GPUs in servers are not yet common, except in computing clusters. Their reliability may be lower than that of other typical server components. The drivers are currently relatively unreliable as well (although they may be reliable enough if running the same code, with no upgrades). Sure, computing clusters use them anyway, and get them to run reliably enough for their needs, but the extra hurdle and/or risk is there. Will we get embedded GPUs in typical servers soon? Will they be similar to current gamers' or HPC GPUs or not? This is not clear. Then there's Intel MIC, which delivers GPU-like performance, but is a lot closer to a CPU - it will require a lot of parallelism in the algorithm too, but it may run certain types of otherwise GPU-unfriendly code. Is this possibly a better target?
For current GPUs, a better strategy might be to make them inefficient - by using GPU-unfriendly hashes (for cracking, and for validation as well - as a side-effect).
We had a project last summer to research this kind of possibilities, focusing on use of FPGA boards in authentication servers. This could optionally buy us GPU-unfriendliness (if we want to make things more difficult for attackers with GPUs, but not FPGAs, and for botnets, which almost surely will lack FPGAs). We even considered some moderate CPU-unfriendliness of the component that we'd put on FPGA. Specifically, we experimented with bcrypt on FPGA, as well as with much smaller Blowfish-like "non-crypto" cores (not actual Blowfish), so that we could hopefully fit hundreds or thousands of those per chip (and have them somewhat CPU-unfriendly as well). Yuri, our GSoC 2011 student working on this project, did have some of this implemented in an experimental fashion, and some of it even worked (on FPGA boards kindly provided by Pico Computing), but an outcome of the summer project was that this would be time-consuming to bring to desired levels of performance and reliability. At that point, the project was put on hold.
A simpler and cheaper alternative (if there are only a handful of customers for this) may be to use dedicated servers, existing HSMs, or microcontrollers for just the password hashing. Indeed, microcontrollers are super slow, so their only function would be to hold and apply a local parameter, with the rest of the hashing method implemented on the host's CPU and RAM. If dedicated servers are used, they would need to be separate from authentication servers - that is, they won't know usernames, won't have access to any database, won't have any persistent storage except for the local parameter, and the OS and software indeed. They will accept password, salt, and parameters (such as the configurable per-hash processing and memory cost settings), and provide the hash. Thus, their attack surface would be minimal and they'd provide an extra layer of security against network-based attacks. We'd do this with FPGA boards as well, and we'd also have the greater/unusual computational complexity as a security layer (in case the local parameter or its backup copy is leaked/stolen), but well - using typical and pre-existing server hardware, drivers, etc. is just simpler and cheaper unless we start a new business and expect to have plenty of customers (although that might be possible).
Prospective customers for these solutions involving extra hardware would be organizations with large user/password databases (millions of records): social networks, major service providers, banks, some government sites.