Given how long these libraries have existed, I am surprised there are still opportunities for improvement like the one described in TFA.
Back in the mid-80s I was involved in the design of a "mini Cray" supercomputer. We did not yet have any hardware to run on, but we did have a software simulator, and we wanted to publish some Whetstone numbers. We got some numbers, were not too happy with them, and really dug in to analyze what we could do to improve them. The Whetstone code was in C, and it used a fair number of library functions both to compute the numerical results and to prepare the Whetstone answer. It turned out that most of the time was being spent in the library's string copy and string compare functions. We concentrated our efforts on rewriting those library routines in assembly to use as many register-to-register and multiple-byte operations as we could. Although the Whetstone benchmark was supposed to measure numerical performance, our results showed that the numerical calculations took up little of the time.
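As an illustration only, the multiple-byte idea can be sketched in portable C. This is not the original assembly, the function name bytes_equal_wordwise is made up for the example, and it assumes both buffer lengths are known up front (a real string compare also has to watch for the terminating NUL):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: compare n bytes, one 64-bit word at a time
   where possible. Returns 1 if the buffers match, 0 otherwise. */
int bytes_equal_wordwise(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    /* Main loop: load and compare eight bytes per iteration. */
    while (n >= sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, pa, sizeof wa);  /* alignment-safe load */
        memcpy(&wb, pb, sizeof wb);
        if (wa != wb)
            return 0;
        pa += sizeof wa;
        pb += sizeof wb;
        n -= sizeof wa;
    }

    /* Tail: compare any remaining bytes one at a time. */
    while (n--) {
        if (*pa++ != *pb++)
            return 0;
    }
    return 1;
}
```

The point is simply that the inner loop touches eight bytes per comparison instead of one, which is the kind of saving a byte-at-a-time library routine leaves on the table.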
Sadly, our "mini Cray" never saw the light of day. The mid-80s were a tough time to stand out in that arena as there were so many people trying to do the same thing.