I have no idea if once a week is realistic; it sounds far too high. I have around 5-10 such windows a year. Some are stuff I can do from home (with support from the guys on shift) and some entail me being physically there, though there have been none of the second kind this year.
Major outages of one of our production systems have been featured on national news and Slashdot before, although it takes an outage of several hours to cross that threshold. Our windows are at around 02:00 to 03:00, depending on which system is affected.
Murphy has really bitten us in the ass a few times:
- Someone made an update (on a test system) that kept the system from coming up properly after the next reboot, which was days later. The symptoms made it look as though the test "window update" had caused the problems. It was an accident but very annoying.
- A weird error on one switchable hardware unit rendered it unusable on our main production system. That unit was one of 32, and the allocation system automatically used it only on other machines; the next reboot would have cleared the problem anyway. Someone decided to use *that* unit for a critical update and brought it up manually for that purpose. The update failed and our main system was down. I drove in at 03:30 and (I thought) fixed things by falling back. Shortly after I left again, one application stopped working and dragged the rest down with it. I went back in and did the original update cleanly, over initial management objections, after which things were fine.
There have been others, but they were even more arcane. The absolute worst cases we had happened with virtually everyone there. They made the news, and two of them made it to Slashdot. The cause was different in each case.