First, the basics: what are FPGAs and GPGPUs?
A GPGPU is an ASIC. The designer wants to make a GPU that can also be used for general purpose compute tasks. The designer chooses how many shaders to include, how they are architected, and how it is all controlled, and puts registers where it makes sense to optimize that architecture.
A CPU is similar, also an ASIC: the designer figures out how to most optimally (in his opinion) implement the instruction set, what a core looks like and how it works, the ALU design, and puts registers where it makes the most sense to do so.
An FPGA is also an ASIC. The application specific to an FPGA chip is to be a reconfigurable logic chip. Designer 1 makes up an architecture that implements an FPGA and puts that into silicon; then the FPGA user (designer 2) loads his particular design configuration into that now-programmable ASIC chip that designer 1 called an FPGA. So there are two levels of design for an FPGA. I myself have been a designer 1 in this scheme.
Now, as above, an ASIC can be pretty much anything. A chip designer sees the word "register" in a somewhat different context than a software person who wants to put a value at a particular address. A chip designer considers a flip flop to be a register: something that holds its value between clock edges and changes value at a clock edge. Most of these are probably never seen or observable by software people. Some become CPU register file registers; the vast majority probably do not. There are hundreds of millions, or billions, of transistors on CPU and GPU chips today, but how many software-visible register addresses are there?
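The flip-flop-as-register idea can be sketched in plain Python. This is just an illustrative model I'm making up here (not any real tool's API): the stored value changes only on a rising clock edge and holds steady between edges.

```python
# Hypothetical model of a D flip-flop: the hardware designer's "register".
# It has no software-visible address; it just captures its data input on a
# rising clock edge and holds that value until the next rising edge.

class DFlipFlop:
    def __init__(self):
        self.q = 0        # stored value, stable between clock edges
        self._clk = 0     # previously seen clock level, for edge detection

    def tick(self, clk, d):
        """Present clock level clk and data input d; q updates only on a rising edge."""
        if clk == 1 and self._clk == 0:   # rising edge detected
            self.q = d
        self._clk = clk
        return self.q

ff = DFlipFlop()
assert ff.tick(0, 1) == 0   # clock low: d is ignored, q holds 0
assert ff.tick(1, 1) == 1   # rising edge: q captures d
assert ff.tick(1, 0) == 1   # clock still high: d ignored, q holds
```

Billions of these sit inside a modern CPU or GPU, invisibly holding intermediate values between pipeline stages.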
ANYWAY... So the CPU designer optimized his logic to be an x64 or an ARM processor, the GPU designer optimized his logic to be a GPU chip with general purpose capability, and FPGA designer 1 optimized his logic to be programmable or reconfigurable logic, which someone else then puts to some unpredictable use.
By definition, an FPGA has a lot of overhead to be programmable, and designer 1 can make it into a giant, expensive AND gate. Even optimally, it takes a lot of multiplexers and wiring to make that AND gate. A direct ASIC design would drop an AND gate right between the input pad circuit and the output pad circuit. The FPGA is complicated, and that makes it less efficient than a direct ASIC. Anything an FPGA can do, a hardwired, optimal ASIC design can do faster and with less power.
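To make that overhead concrete: FPGA logic is typically built from k-input lookup tables (LUTs), each a 2**k-entry truth table plus routing muxes. A rough sketch of the idea in Python (names and structure are illustrative, not any vendor's architecture):

```python
# Sketch of why an FPGA "AND gate" carries overhead: a 4-input LUT stores a
# full 16-entry truth table plus needs routing, even when the function it is
# configured for is a plain 2-input AND.

def make_lut(truth_table):
    """Return a function that indexes the stored truth table by its input bits."""
    def lut(*bits):
        index = 0
        for b in bits:                 # pack input bits into a table index
            index = (index << 1) | b
        return truth_table[index]
    return lut

# Configure the LUT as a 2-input AND on its last two inputs a, b.
# The first two inputs are tied off and most of the 16 entries go unused --
# that unused storage and the surrounding muxes are the programmability tax.
and_table = [((i >> 1) & 1) & (i & 1) for i in range(16)]
lut_and = make_lut(and_table)

assert lut_and(0, 0, 1, 1) == 1   # a=1, b=1 -> 1
assert lut_and(0, 0, 1, 0) == 0   # a=1, b=0 -> 0
```

A hardwired ASIC AND gate is a handful of transistors; the LUT version burns configuration memory and mux delay to buy reprogrammability.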
Now, a CPU can do a lot of things, but is not optimal for the particular algorithm in question.
A GPU can do a lot of things, many of them much faster than a CPU, but the GPU may still not be optimal for the algorithm in question.
An FPGA can implement any digital circuit that fits, and your clock rate is whatever that circuit best runs at without violating timing constraints. You can optimize a digital logic design to be the best design for the algorithm in question: forget the software control and just have an appropriately designed pipeline for data to pass through, with inputs at one end, outputs at the other, and whatever registers in the middle make sense as "variable" storage locations between the combinational logic steps of the algorithm, likely never seen by software at all. But the FPGA still has lots of overhead.
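That pipeline structure can be sketched behaviorally. This is a made-up Python model, not HDL: each clock cycle, the registers between stages all capture their stage's result at once, so a new input can enter every cycle and finished results emerge from the far end after the pipeline fills.

```python
# Behavioral sketch of a register-separated pipeline: combinational stage
# functions with a pipeline register after each one. The stage functions
# here are arbitrary placeholders for the algorithm's combinational steps.

def pipeline(inputs, stages):
    """Simulate the pipeline cycle by cycle; returns outputs in arrival order."""
    regs = [None] * len(stages)        # one register after each stage
    outputs = []
    # Feed Nones after the last input so in-flight data flushes out.
    for x in list(inputs) + [None] * len(stages):
        # On the clock edge, every register captures simultaneously.
        new_regs = []
        prev = x
        for stage, r in zip(stages, regs):
            new_regs.append(stage(prev) if prev is not None else None)
            prev = r                   # old register value feeds the next stage
        if prev is not None:
            outputs.append(prev)       # value leaving the last register
        regs = new_regs
    return outputs

# Two combinational steps: add one, then double.
assert pipeline([1, 2, 3], [lambda v: v + 1, lambda v: v * 2]) == [4, 6, 8]
```

The point is throughput: once the pipeline is full, one result pops out per clock, no instruction fetch or software control anywhere.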
Finally, take your optimal and tested FPGA design, convert it to a hardwired ASIC design, shave off the programmable overhead, and build on the fastest fab process you can afford. (Can you afford a 7 nm mask set and capable ASIC design tools? Not many can...)
An optimal digital design built exactly for this one algorithm, and no other, will be the fastest thing you can possibly make. Depending on your finances you may not be able to afford the fastest possible, but you can likely still beat an FPGA, GPGPU, or CPU in performance and power. If you are financially poor, then start with what you can afford...
Now, how can you simulate... First, spend a lot of time optimizing the digital circuit. Consider what you want the clock rate to be, and how that limits how deep the combinational logic between registers can be while still meeting timing constraints. Run that design through the synthesis and place-and-route steps to see how your timing looks. That isn't a Verilog simulator; it is timing analysis, to make sure the design will function as expected at speed. Verilog simulation is for functional testing, not speed testing. You do also want functional simulation, to make sure the design actually performs the algorithm you think it does, and does everything correctly, but it does not understand whether the clock speed is a friend or a foe to correct functionality. Your timing analysis tool does that.
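The core constraint the timing tool checks can be shown with back-of-envelope arithmetic: the clock period must cover the register's clock-to-Q delay, the worst-case combinational path, and the next register's setup time. The delay numbers below are invented for illustration, not from any real process library.

```python
# Back-of-envelope version of the register-to-register timing check.
# Delays are in picoseconds; the returned value is the fastest legal clock.

def max_clock_mhz(t_clk_to_q_ps, t_comb_ps, t_setup_ps):
    """Fastest clock (MHz) for a path with these delays, with no margin."""
    period_ps = t_clk_to_q_ps + t_comb_ps + t_setup_ps
    return 1_000_000 / period_ps   # 1e6 ps per microsecond -> MHz

# 200 ps clk->Q + 1600 ps of logic + 200 ps setup = 2000 ps period.
assert max_clock_mhz(200, 1600, 200) == 500.0

# Double the logic depth between registers and the clock must slow down --
# which is exactly why you pipeline: shallower logic per stage, faster clock.
assert max_clock_mhz(200, 3600, 200) == 250.0
```

Real static timing analysis does this over every register-to-register path in the design, across process, voltage, and temperature corners, but the arithmetic per path is this simple.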
Look up the VSD set of courses on Udemy by Kunal Ghosh (sp??) for a good intro to the world of chip design, mostly digital from what I have seen. (The real world remains analog, and there remains a lot of analog chip design as well, but that is less relevant to this question.)