And you think printf() and strtol() are major bottlenecks worth dedicated silicon area why?
Modern CPUs already have many accelerators for high end functions, such as numerical computations, cryptography, and the all important memcpy. (Memory copies are a traditional bottleneck, and general enough that they can be easily offloaded.) They come in two forms—specialized SIMD/vector instruction sets, and dedicated blocks for high-level functions that take multiple microseconds. An example of the former are the SIMD-oriented AVX instructions found on modern x86 chips. As an example of the latter, chips aimed at high end signal processing often have discrete blocks such as FFT accelerators. Others aimed at network tasks (especially DPI) have regular expression engines.
The problem with accelerator blocks is that they do take up area. And if they're powered up, they leak. Leakage current is a significant factor in modern designs. To get faster transistors, you need to drive their threshold voltage down. As you lower the threshold voltage, their leakage current goes up exponentially. So, that circuit better be bringing a lot of bang for the buck if it's going to be sitting there taking up space and leaking.
Another issue with dedicating area to fixed functions is the impact it has on distance between functions on the die. In the Old Days, you could get anywhere on the die in a single clock cycle. With modern designs and modern clock rates, cross-die communication is slow, taking many many cycles. So, when you plop down your custom accelerator, you have to figure out where to put it. Do you put it right in the middle of the rest of the computational units, slowing down the communication between their functions (either lowering clock rate or increasing cycle counts), or do you put it on the other side of the cache, meaning it takes several cycles to send it a request and several cycles to see the result?
This is why many custom accelerator blocks out there today focus on meaty workloads. A large FFT still takes a good bit of time to execute, and there's usually other work the main CPU can do while it executes. Thus, the communication overhead doesn't tank your performance. printf(), on the other hand, generally shows up right in the middle of a bunch of other serial steps. You can't overlap that with anything. Hauling off to a printf() accelerator block generally would make zero sense. If you're really spending that much time in printf(), you're better off rewriting the code to use a less general facility.
A final issue with dedicated hardware is that you can't patch it. Someone finds a bug in your printf() and you're back to using a library version. I could go on, but I think I've made my point.