We had a program that was doing session matching of RTP streams (via RTCP). We had to be able to handle a potentially very high load.
Things had been going OK - development progressing, QA testing going well. And then one day our scaling tests took a nosedive. Whereas we had been handling tens of thousands of RTP sessions with decent CPU load, suddenly we were running at 100% CPU with an order of magnitude fewer sessions.
I spent over a week inspecting recent commits, profiling, etc. I could see where it was happening in a general sense, but couldn't pin down the precise cause. And then a comment by one of the other developers connected up with everything I'd been looking at.
Turns out that we had been using a single instance of an object to handle all sessions going through a particular server, but that resulted in incorrect matching - it was missing a vital identifier. So an additional field had been added to hold the conversation ID, and an instance was created for each conversation.
Now, that in itself wasn't an issue - but the objects were stored in a hash table. Objects for the same server but different conversations compared non-equal ... but the conversation ID hadn't been included as part of the hashcode calculation. So all conversation objects for a particular server would hash the same (but compare different).
We had 3 servers and tens of thousands of conversations between endpoints. Instead of the respective server objects being approximately evenly spread across the hash map, they were all stuck into a single bucket per server ... so instead of a nice amortised O(1) lookup, we instead effectively had an O(N) lookup for these objects - and they were being looked up a lot.
The effect was completely invisible under low load and in unit tests. The hash codes weren't verified as being different in the unit tests as there was the theoretical possibility that the hashcodes being verified as different could end up the same with a new version of the compiler/library/etc.