How often does the leap second bug recur?
That one? Once. I've seen plenty of different styles of leap second bug. Too many, really: leap seconds should be a relatively easy calculation, but we only get to test them once every 3 years or so, and in real time, because it's kinda hard to convince a global timekeeping system that a fake leap second is about to happen for testing. Still, I'd rather we fixed the software than do stupid things like get rid of UTC, as some idiots are proposing. But the one that causes a futex loop in Java processes (and the Opera web browser)? Just the once, and mostly only on the RHEL6 and Debian ~wheezy kernels of the time.
If it is known to occur, then why would such platforms be relied upon instead of being patched ahead of time?
The point of bugs is that they're not known to occur beforehand. This particular one was quite neat in that it wasn't the leap second code itself that was at fault, but the in-kernel mechanism ntp used to tell the kernel that a leap second was coming up. At least it didn't happen over the public holiday New Year period this time. I knew Monday was going to be a busy day in the datacentre when I saw my 3 laptops at home exhibit the problem on Sunday morning, though.
It seems to me that developing new DCIM solutions is a bit of a stretch to solve the leap second issue. Or is that just an excuse to fund new DCIM solutions (in other words, a solution in search of a problem)?
Anything can cause a kernel or userland software to suddenly enter a hard loop, burning through CPU cycles and thus power. And in a large homogeneous environment, that bug can be triggered in many locations at the exact same moment. Another good example might be the RHEL6 bug that affected us around the same time last year: the old "uptime has reached a hundred and something days, let's overflow a counter and kernel PANIC now!" bug. We found out about that bug after patching all of our systems, found out that it only applied to the version of the patch we had managed to apply, and had to start planning to bring the next patching cycle forward (but at least we knew about it). You'd think these were the kinds of bugs we learnt about in 1995 and would never be stupid enough to put back into the kernel, but it seems every generation must learn about them for itself instead of reading its operating systems textbooks.
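To put rough numbers on the "uptime overflows a counter" class of bug, here is a minimal sketch of the arithmetic. The counter widths and tick rates are illustrative assumptions, not the parameters of the RHEL6 bug mentioned above, and the function names are made up for this example.

```python
SECONDS_PER_DAY = 86_400


def days_until_wrap(bits: int, ticks_per_second: float) -> float:
    """Days of uptime before an unsigned counter of `bits` bits wraps to zero."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


def tick_counter(uptime_seconds: int, bits: int, ticks_per_second: int) -> int:
    """Value a fixed-width tick counter would hold after the given uptime."""
    return (uptime_seconds * ticks_per_second) % (2 ** bits)


if __name__ == "__main__":
    # Hypothetical configurations: a 32-bit counter at two common tick rates.
    for bits, hz in [(32, 100), (32, 1000)]:
        days = days_until_wrap(bits, hz)
        print(f"{bits}-bit counter at {hz} Hz wraps after {days:.1f} days")
```

Any code that assumes such a counter only ever increases will misbehave the moment it wraps, and in a fleet that was booted together, every machine hits that moment together.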
The point of these bugs is that anything might cause a large fraction of your machines to start chewing through electricity. In an overprovisioned environment (VMs, power, thin storage, whatever), you want to know about them before you trip your fuses, run out of memory, or fill up all your disks.
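As a rough illustration of the kind of check that helps here, the sketch below flags hosts whose power draw has jumped between two samples and warns when the total gets close to a circuit's limit. The host names, wattages, thresholds, and function names are all hypothetical; a real DCIM product would get its readings from PDUs or BMCs rather than from dictionaries.

```python
from typing import Dict, List


def hosts_spiking(previous: Dict[str, float],
                  current: Dict[str, float],
                  jump_ratio: float = 1.5) -> List[str]:
    """Hosts whose draw grew by at least `jump_ratio` since the last sample."""
    return sorted(
        host for host, watts in current.items()
        if previous.get(host, 0) > 0 and watts / previous[host] >= jump_ratio
    )


def headroom_alert(current: Dict[str, float],
                   breaker_watts: float,
                   warn_fraction: float = 0.8) -> bool:
    """True once the total draw reaches `warn_fraction` of the breaker limit."""
    return sum(current.values()) >= warn_fraction * breaker_watts


if __name__ == "__main__":
    # Made-up readings in watts, before and after a bug starts spinning CPUs.
    before = {"a": 200.0, "b": 210.0, "c": 190.0}
    after = {"a": 350.0, "b": 340.0, "c": 195.0}
    print("spiking:", hosts_spiking(before, after))
    print("near limit:", headroom_alert(after, breaker_watts=1000.0))
```

The useful signal is many hosts spiking in the same sample: one busy machine is normal, but a simultaneous jump across a homogeneous fleet usually means a shared trigger.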