I'm going to make some rough approximations here.
Dissipating power is difficult in high-speed processors. Assume that the power that can be dissipated is proportional to the area of the chip. Relative to a single-active-layer chip, the power that can be dissipated per layer is proportional to thermal_conduction_to_coolant / number_of_layers. Thermal conduction to the coolant is dominated by the copper in the heatsink and the SiO2 in the chip. Copper is at least 200 times more thermally conductive than SiO2. Assume that the maximum acceptable temperature rise is 50 kelvin across a 1 cubic centimeter copper cube; that corresponds to 200 watts. Assume that diminishing returns set in when the thermal drop across the SiO2 equals the drop across the copper. Since the two drops add, if we keep the limit at 50 K the limiting power is 100 watts. The implied thickness of SiO2 is (1 cm)/200 = 50 microns. How many layers can be squeezed into 50 microns?
A brief internet search seems to yield a minimum layer thickness of 100 nm (0.1 micron) for gate logic (one active layer plus many interconnect layers). Thus 50 microns / 0.1 micron = 500 active layers can be squeezed into 50 microns. What happens then?
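The thermal budget and layer count above can be checked with a few lines of Python. This is only a sketch of the same rough model: the 400 W/(m*K) figure for copper is an assumption chosen to match the 200 W number, the factor of 200 is the round conductivity ratio used above, and all the names are made up for illustration.

```python
# Rough thermal budget for the stacked-chip estimate.
# All constants are round numbers from the estimate, not measured data.

K_COPPER = 400.0            # W/(m*K), assumed round value for copper
CONDUCTIVITY_RATIO = 200.0  # copper vs. SiO2, the "at least 200 times" figure
CUBE_SIDE = 0.01            # m, the 1 cm copper cube
MAX_RISE = 50.0             # K, maximum acceptable temperature rise
LAYER_THICKNESS = 100e-9    # m, assumed 100 nm per active layer


def slab_resistance(thickness, conductivity, area):
    """Thermal resistance (K/W) of a slab conducting heat through its thickness."""
    return thickness / (conductivity * area)


area = CUBE_SIDE ** 2
r_copper = slab_resistance(CUBE_SIDE, K_COPPER, area)
power_copper_only = MAX_RISE / r_copper

# SiO2 thickness whose thermal resistance equals that of the copper cube.
sio2_thickness = CUBE_SIDE / CONDUCTIVITY_RATIO
r_sio2 = slab_resistance(sio2_thickness, K_COPPER / CONDUCTIVITY_RATIO, area)

# The two resistances are in series, so the temperature drops add.
power_stacked = MAX_RISE / (r_copper + r_sio2)
layers = sio2_thickness / LAYER_THICKNESS

print(f"copper only: {power_copper_only:.0f} W")
print(f"copper + SiO2: {power_stacked:.0f} W")
print(f"SiO2 thickness: {sio2_thickness * 1e6:.0f} microns, layers: {layers:.0f}")
```

Changing CONDUCTIVITY_RATIO or LAYER_THICKNESS shows how sensitive the 500-layer figure is to those two guesses.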
Power dissipation in CMOS logic, ignoring leakage, is proportional to freq * V^2. Let our single-layer CPU performance be 1 unit, limited by 1 cm of copper and running at 1.2 volts. (There's very little SiO2 for the heat to pass through.) At first glance, our 500-layer CPU at the same voltage, limited by 1 cm of copper plus 50 microns of SiO2, performs at 1 * (500 layers) * (1/500 heat per layer) * (1/2 thermal conductivity) = 1/2 unit. Layering loses. However, that is not the whole truth. Layering allows many more transistors and thus more clever circuitry, which might improve performance somewhat. 3D also means shorter interconnects; shorter interconnects mean less capacitance, and less capacitance means less power dissipation. (The other major contributor to capacitance is the FET's gate.) I can only guess how much lower heat (and so more speed) that allows. Maybe 1.5x? Speed is then 3/4 unit. Note that at a fixed voltage the (1/500 heat per layer) means (1/500 speed) per layer, and with CMOS, reduced speed allows reduced voltage.
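The first-glance arithmetic above, written out as a tiny Python sketch. The function name and the wiring_gain parameter are made up for illustration, and the 1.5x figure is only the guess from the text.

```python
def first_glance_performance(n_layers, thermal_factor, wiring_gain=1.0):
    """Relative performance of a stacked CPU at the same supply voltage.

    With the voltage fixed, power is proportional to frequency, so each
    layer's speed tracks its share of the heat budget.
    thermal_factor is the total heat budget relative to the single-layer chip
    (1/2 when the SiO2 drop equals the copper drop).
    wiring_gain is a guessed speedup from shorter interconnects.
    """
    per_layer_speed = thermal_factor / n_layers
    return n_layers * per_layer_speed * wiring_gain


print(first_glance_performance(500, 0.5))        # plain layering
print(first_glance_performance(500, 0.5, 1.5))   # with the guessed 1.5x wiring gain
```

At a fixed voltage the layer count cancels out, which is why layering alone does not help in this model.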
Over a limited range, CMOS speed is proportional to supply voltage, and lowering the voltage reduces heating. So when the voltage is reduced, speed does not have to drop all the way to 1/500 of the single-layer CPU. With a supply voltage of 1.2 V x 1/10 = 0.12 V, speed falls to 1/10 and power per layer falls to 1/10 x (1/10)^2 = 1/1000 of the single-layer CPU, which matches the per-layer heat budget of 1/500 x 1/2. 500 layers operating at 1/10 the speed is a 50x performance improvement.
Alas, we can't do that. Huge CMOS CPUs can't be made to operate at 0.12 V, and I don't know if it will ever be possible. I'll guess that somewhere between 0.3 V and 0.6 V will someday be practical. If it's 0.6 V, V^2 is 1/4 of its 1.2 V value, so the 1/1000 power budget per layer allows a speed of 1/250; times 500 layers = 2 units. If it's 0.3 V, V^2 is 1/16, so speed could be 1/62.5; times 500 layers = 8 units.
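Here is a small Python sketch of the voltage-scaling arithmetic for the three supply voltages above. It assumes power proportional to freq * V^2 and a per-layer budget of 1/500 x 1/2, as in the text. The cap of speed at V/1.2 is my reading of "speed is proportional to voltage", and the function and parameter names are made up for illustration.

```python
V_REF = 1.2  # volts, supply voltage of the single-layer reference CPU


def stacked_performance(v, n_layers=500, thermal_factor=0.5, v_ref=V_REF):
    """Relative performance of an n_layers stack run at supply voltage v.

    Units: 1.0 is the single-layer CPU at v_ref.
    """
    budget = thermal_factor / n_layers   # per-layer power relative to the reference
    v_rel = v / v_ref
    f_power = budget / v_rel ** 2        # speed allowed by heat, from P ~ f * V^2
    f_voltage = v_rel                    # assumed cap: speed proportional to voltage
    return n_layers * min(f_power, f_voltage)


for volts in (0.12, 0.3, 0.6):
    print(f"{volts:.2f} V: {stacked_performance(volts):.1f} units")
```

The same function could be called with other layer counts, but thermal_factor would need to be recomputed for the thinner SiO2 stack, so treat the defaults as specific to the 500-layer case.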
The above is too optimistic, because of the difficulty of controlling threshold voltage and leakage, and the difficulty of keeping that many layers busy with massive parallelism and massive multi-threading.
I'd like to repeat the calculations for 10 layers and 50 layers. I'd like to check my work. I've already spent about 2 hours on this reply, so I'm giving up. Have fun.