So been on this new job exactly 3 weeks today. No real training program, no real "hey this is how we do things" document. Just 8 admins, 4 of us less than 6 months here, and about 200 Sun servers with a mix of Xserves and some Linux systems for good measure, and a whole directory of "HowTos" labeled and categorized with no real sense of order.
So what is the first system I build (rebuild actually)? A critical system that is 1) the batching system for the computer room operators, and 2) the main backup server for a library/archival system that has a 400 tape silo/robot attached to it. It completely lost 90% of the shared objects on the system, and after 3 hours of trying to recover, the boss decides to reload the system and come back up from tapes. Except it is decided to also upgrade the system while we are at it. And of course, the guy who has never built a system here and also has about 0.1% experience with Solaris 10 gets to be the stuckee, because of all the late shift crew (leave at 5pm) I apparently have the most clue.
OS load went fine, basic system software went fine, Backup software (Legato) installs fine. Note I said INSTALLS fine, not WORKS fine... This is Friday @ 8am (started Thurs @3pm) after I stayed 2 hours late and came in 2 hours early.
We then spend 2 complete days (Friday and Monday) on the phone with Legato cursing them out because when they went from .1 to .2 of their software, they dropped the Motif GUI and went to a Java web-app interface for the management piece. The problem is that the management console doesn't like:
1. Being routed through a proxy
2. Being behind a NAT
3. Being behind a firewall
4. Being on a DoD locked-down system.
The company's FAQ of course says to disable all of these things if you can't seem to connect. We managed to get around it by installing about 20 extra packages we usually don't install, just to get a browser onto the local system (this is a very stripped system load) so we could forward the console. That isn't the kicker, though: most of those 2 days were spent trying to get the backups to restore, because nothing seemed to want to work with this new software, as in "Hey, I can see your entire catalogue of indexed backups, I can't seem to ACCESS them to restore from"...which we had problems with even before the crash.
So, the reason I am upset now is that just as I was powering off my system tonight, my boss asks if I can stop by his cube. So I ask if he means stop by on my way out, and he kind of hems and haws. End result: now that the phone support with the vendor got us access to our backups, I get to miss my mountain biking plans and spend another 1.5 hours after work fixing this stupid server. And I never did get to finish it, since I couldn't find a certain key set of files in the backups and had to leave because I am not on the after-hours access list for our area. All this because I stayed an extra 5 minutes late to finish a trouble ticket a DBA had put in right at 5pm.
So now I got a boss who is extremely upset since it MUST be my fault that none of the key files exist on tape in the last month or so of backups, and we have REALLY IMPORTANT PEOPLE who are now upset that this system has been down for a total of 5 days now, even though it was the boss' call to not work over the weekend. And tomorrow has forecasted rain, which means the trails are closed, either officially or by unwritten rule (don't ride on muddy trails so you don't ruin them).
Someday I'll get back on that bike.
PS. If anyone is considering buying Legato as a new backup/restore system, DON'T! I used to think Veritas' interface was terrible, but at least it was a local Java app interface, not some Web Service whiz-bang non-functioning beta release that apparently never got tested outside of a lab. I'll post some excerpts of our phone conversations with 3 different reps if I feel in the mood later.