I was shocked to find how much slower `expf()` was than `exp()` in glibc. It turns out that a handful of functions change the rounding mode of the FPU, which flushes the entire FPU state and obliterates performance. After I switched to a version from another library that didn't change rounding modes, performance was back on par.

It's perfectly understandable why the rounding mode changes are there: the FPU can be in any rounding mode coming in, and some guarantees are required. Still, they should really provide variants that skip this. I truly hope the new implementation avoids it altogether; otherwise we're back to square one.