Here is my best answer. I am not active in the field, so the answer is a combination of knowledge, extrapolation, and intuition, but I think it provides some of the kind of information you are curious about.
Typically, the first layer of nodes will receive the output of feature detectors run on the image, for example edge sharpness and orientation calculations. These will operate at a range of scales that are small compared to the overall picture.
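As a concrete illustration of the kind of feature detector I mean, here is a minimal sketch using finite-difference gradients (the function name and the toy patch are my own invention, not from any particular library):

```python
import numpy as np

def edge_features(patch):
    # Finite-difference gradients as a stand-in for the edge detectors
    # that would feed the first layer of nodes.
    gy, gx = np.gradient(patch.astype(float))  # per-axis gradients
    magnitude = np.hypot(gx, gy)               # edge sharpness at each pixel
    orientation = np.arctan2(gy, gx)           # edge orientation in radians
    return magnitude, orientation

# Example: a vertical step edge in an 8x8 patch
patch = np.zeros((8, 8))
patch[:, 4:] = 1.0
mag, ori = edge_features(patch)
# Strong horizontal gradient (orientation 0) along the step, zero elsewhere
```

In a real system these per-pixel values would be pooled per patch before being handed to the first layer.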
This first layer will connect to, and provide weighted values to, another layer or two that are also probably spatially restricted in range.
You would not actually need so many independent low-level nodes, because you can run pieces of the image through the same low-level node and then route the output to the next aggregating layer.
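That routing trick is essentially weight sharing: one set of filter weights is reused at every image location instead of allocating an independent node per location. A minimal sketch, with illustrative names (this is just a strided 2-D correlation written out by hand):

```python
import numpy as np

def shared_filter_responses(image, filt, stride=1):
    # One low-level "node" (filt) is slid across the whole image and its
    # outputs routed to the next layer, rather than having an independent
    # node per location -- weight sharing, i.e. a strided convolution.
    fh, fw = filt.shape
    rows = (image.shape[0] - fh) // stride + 1
    cols = (image.shape[1] - fw) // stride + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = image[i*stride:i*stride + fh, j*stride:j*stride + fw]
            out[i, j] = np.sum(window * filt)
    return out

# 8x8 image of ones, one shared 2x2 filter, non-overlapping stride of 2
responses = shared_filter_responses(np.ones((8, 8)), np.ones((2, 2)), stride=2)
```

The output grid then becomes the spatially restricted input to the next layer.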
One aspect of deep learning is that you can train the input nodes layer by layer against a local objective [or at least against the nearby high-level nodes] rather than back-propagating through all the layers at once. This improves the results and simplifies the interactions, allowing more nodes to be implemented because of the reduced computing time.
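A minimal sketch of that layer-by-layer idea, using a closed-form linear "autoencoder" (PCA) for each layer so that no gradients ever flow through the whole stack; the function names and layer sizes are illustrative, not a standard API:

```python
import numpy as np

def fit_layer(X, hidden):
    # Closed-form stand-in for training one layer on its own: the top
    # principal directions minimize linear reconstruction error, so each
    # layer is fit without backpropagating through any other layer.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:hidden].T                       # (input_dim, hidden) weights

def greedy_pretrain(X, layer_sizes):
    # Each layer is trained only on the activations of the layer below it.
    weights, H = [], X
    for h in layer_sizes:
        W = fit_layer(H, h)
        weights.append(W)
        H = np.tanh((H - H.mean(axis=0)) @ W)  # nonlinearity between layers
    return weights, H

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                 # 100 samples, 20 input features
weights, top = greedy_pretrain(X, [10, 5])
```

In practice the stack would then be fine-tuned, but the point is that the expensive full-depth pass is avoided during most of the training.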
I tried to Google for some practical values, but did not see anything that offered up numbers, just guidelines. I am now quite curious about the values and will have to spend more than 15 minutes searching; if someone has practical experience and some typical values, I would certainly be interested in the answer.

I will now hazard a complete guess. For each 40x40 pixel square I would imagine roughly a hundred or two feature values going into the first layer of nodes on a one-to-one basis. I would imagine 3 to 5 intermediate layers that taper down minimally over the first 2 layers and then more dramatically toward the final layer. This ends up with a ball-park calculation for a 2000x2000 image of 2,500 patches [obviously the real patches would overlap, but this is good enough for the estimate], a first layer of 500,000 nodes, and roughly x3 that across 5 layers of decreasing count, for about 1.5 million nodes. I am confident that I am within 4 orders of magnitude with this guess.
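The ball-park arithmetic, written out (every number here is a guess from the paragraph above, not a measured value):

```python
# Back-of-envelope node count for a 2000x2000 image.
image_side, patch_side = 2000, 40
patches = (image_side // patch_side) ** 2    # 50 x 50 = 2,500 patches
features_per_patch = 200                     # "a hundred or two" per patch
first_layer = patches * features_per_patch   # 500,000 first-layer nodes
total_nodes = first_layer * 3                # rough x3 across ~5 tapering layers
print(patches, first_layer, total_nodes)     # 2500 500000 1500000
```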