I managed to go 16 years in the IT world, first as a sys admin and now in an awesome mid-level management position, without any serious data management scares. (And by 'awesome', I mean I work for demoralizing leadership and I've hit a glass ceiling that will force me to find another company to work for if I want any shot at career advancement.) I've always made sure there are many, many layers of redundancy and good processes in place.
That was until three weeks ago.
We use Microsoft DFS to sync data between two sites. Because of some other things going on, we had to turn DFS off for three weeks. We thought we had everyone transitioned to using the "master" file repository, the one that gets backed up every night, etc, etc. The day we turned DFS back on, all hell broke loose.
Oh - and this is fairly important stuff: 10 years' worth of CAD, design, and legal paperwork. It's a few terabytes' worth. For our medium-size company, this is basically everything we hold near and dear.
The first thing that happened is that DFS completely puked and trashed BOTH filesystems. Fantastic, Microsoft - what a wonderful piece of shit DFS is. Fairly quickly we had to face some data integrity issues. First, we discovered that a fella at the remote site had apparently been using the copy of the files there. Great. Through a fairly manual process we were able to retrieve most of his changes to the dataset. Next, we gave up on trying to fix DFS - on Microsoft's advice, it seemed fairly hopeless.
This is where shit gets real.
Our head sys admin had been troubleshooting an issue where a drive in a RAID'ed NAS backup device had failed. All the other backups had been shifted to other NAS devices, but that one backup was so large that it had apparently just been failing. While looking into that, we also discovered the quarterly backup from December had failed. (That's the point where I wanted to put on my manager hat and go rip someone a new one, but I decided that probably wouldn't be the most productive thing at the moment, and saved that little teachable-moment asskicking until after we were out of the woods.) Now, the sys admin hadn't been completely foolish: before turning DFS back on, he had run some full backups using a different NAS device.
In a f*cking brilliant stroke of disastrous luck, when we went to perform the recovery we discovered that the RAID array on that backup NAS device also had corruption.
Now, how bad the corruption was and what exactly it meant remained to be seen. The backups had completed without error; it was the NAS filesystem itself that was throwing the errors. The NAS was still running, and our backup software seemed to recognize the backup catalogs on it. OK, so other than what seemed to be one potentially corrupt backup, the next best-case scenario was a quarterly backup from September, and I was also staring at a complete set of disks from 2010, dreading the thought of bringing them back online. Well, with nothing to do other than try a restore, we pressed the button.
That's when I went home mid-morning, chainsmoked four cigarettes on my porch, and wondered what would happen if everything went south. In other words, I was contemplating my next job.
Lo and behold, the restore worked. We had to merge all kinds of things back together to get a complete copy reassembled, and then we still had to get DFS working again (which took four days of syncing over the WAN). When it was all said and done, it looked like there were just two files from one set of changes that we couldn't recover.
I think I'll go double check on the backup jobs now.
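And that double check doesn't have to be manual. A minimal sketch of the idea (the job-record format, job names, and dates below are all hypothetical, not from any particular backup product): flag any job that failed outright or hasn't had a recent successful run, since a December quarterly that silently failed is exactly what this catches. Of course, a job that "completed without error" can still sit on corrupt media, so periodic test restores are the real proof.

```python
from datetime import datetime, timedelta

def find_suspect_jobs(jobs, max_age_days=7, now=None):
    """Flag jobs whose last run failed or whose last success is too old.

    Each job record is a dict with 'name', 'status', and 'finished'
    keys -- a stand-in for whatever your backup software's reporting
    API or logs actually give you.
    """
    now = now or datetime.now()
    suspect = []
    for job in jobs:
        if job["status"] != "success":
            suspect.append((job["name"], "last run failed"))
        elif now - job["finished"] > timedelta(days=max_age_days):
            suspect.append((job["name"], "no recent successful run"))
    return suspect

# Hypothetical sample data for illustration only.
jobs = [
    {"name": "quarterly-full", "status": "failed",
     "finished": datetime(2013, 1, 1)},
    {"name": "nightly-master", "status": "success",
     "finished": datetime(2013, 1, 20)},
]
print(find_suspect_jobs(jobs, now=datetime(2013, 1, 21)))
# → [('quarterly-full', 'last run failed')]
```

Wire something like this to a morning email and a failed quarterly gets noticed in January, not during a disaster.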