I shall give it a go.

First up, most algorithms can't be directly translated to hardware without either changing them or taking a serious performance hit.

Nearly all widespread algorithms (eg. H264 video) are designed specifically with a hardware implementation in mind, and in fact must usually have elements removed that would produce good results simply because it wouldn't be sensible to implement in hardware.

In particular, in hardware, loops that iterate an unknown number of times are generally not allowed.

Steps to make this estimate would probably be to take your code and 'flatten' it (IE. Rewrite it to avoid all use of pointers, except arrays).

For every variable, figure out how many bits wide it needs to be(IE. What is the smallest and largest possible value). You probably want to convert floating point to fixed point.

Next, to make a lower bound of how many gates would be used if you were to design for minimal gate use, take every add and subtract operation and call them 15 gates per bit. For every multiply call it 5 gates per input bit squared. Don't do division (division can be done as a multiplication by the inverse of a number).

For the upper bound, do the same, but multiply by the number of times each loop goes round. That gives you a design with lots more gates but much higher performance.

For the upper bound finally add on 5 gates for every bit of every variable times the number of lines of your input code. This approximates the d type flip flops for storage in a pipeline. Note that if two lines of code operate on entirely different variables, you can call them the same line as far as this metric goes.

For the lower bound, if you got a value greater than 10000 plus 16 times the number of bytes that your program is compiled plus the ram it allocates to run, it would be more gate efficient to put in a tiny processor and keep your algorithm in a ROM. (Lots of complex algorithms are implemented this way when space is at a premium).