Two embarrassments... (Score 1)
When I was a junior programmer working on a mainframe, I was given a problem ticket for an intermittent issue. I stuck diagnostics into the code, but because my disk quota was far too small, I sent the output to a virtual printer that I looped back to my account. Unfortunately, after I got the whole testcase set up (a couple of hours), the mainframe crashed, and I went for coffee along with the rest of the 300 users on the system for the 10 minutes it took to restart. After several days in which I hadn't been able to make progress because of the suddenly frequent mainframe crashes, I got a message from the operator asking me to delete my large spool files, since the mainframe was crashing due to a lack of spool space. That's when the penny dropped: my testcase had been exhausting the system spool space, crashing the mainframe about 8 times. Probably $100,000 in lost labour.
Years later, while extending some high-reliability software, I found some bugs in pre-existing code. The system had internal checks and watchdog timers that would force a restart if it thought some code was taking too long. Both bugs made something take long enough to trip the watchdog timer and force a restart. One was in very complicated code, but explained some intermittent issues we'd seen over the years. The other was in a newly released, still-unused utility that didn't work properly on old HW and would need to be rewritten to fix. I only had time to fix and test one bug before going on a month-long vacation, so I fixed the complicated one. While I was away, an alpha release of the product went out and promptly started crashing intermittently with stack corruption issues. I got back to find six such tickets on my desk. In the meantime, the broken utility had acquired some users, so I decided to spend a couple of days fixing it.
It turned out that the stack corruption issue was holding up the production release, worth many millions of dollars.
Of course, I wasn't able to reproduce the intermittent stack corruption.
I spent 3 weeks looking everywhere, trying anything to reproduce it, eventually resorting to rebuilding the alpha load, where I could sometimes reproduce it, but never once I loaded my diagnostics.
Meanwhile, management was getting very antsy about the revenue implications.
My boss was very good and shielded me from the flames, but I didn't like seeing him getting fried as the release date kept getting pushed.
I tried hunting around to see if anyone had been changing code in that area of the system, but of course there were only my updates. I asked anyone I could find for suggestions, and nobody had any ideas until one person said it reminded them of a very old issue they'd worked on, and described the problem they'd had.
I went back and checked my archived output. Sure enough, I'd been a bit careless testing the broken utility before fixing it: I'd only checked that my testcase triggered a restart, not why. It turned out that long before it could trip the watchdog timer, the utility corrupted the stacks of other processes.
I'd just spent 3 weeks holding up an important release, because I didn't realize I'd already fixed the bug.