Automating shit that can be automated, so that you can actually do things that benefit the business instead of simply maintaining the status quo, is not a bad thing. Doing automatable drudge work by hand is just stupid. Muppets who can click Next through a Windows installer or run apt-get, etc. are a dime a dozen. IT staff who can get rid of that shit so they can actually help people get their own jobs done better are way more valuable.
The job of IT is to enable the business to continue to function and improve. Never forget that. People don't spend up big on computer stuff just because. They do it in order to save money by improving process. Improving process is where you should be focused; anything to do with general maintenance of the status quo is dead time.
Alternatively, perhaps somewhere up the chain they have no idea what can be done (this IT shit isn't their area of expertise), and are not being told by their IT department how to actually fix the problem properly. Instead, IT just applies band-aid after band-aid to whatever breaks.
It is my experience that if you outline the risks, the costs and the possible strategies to mitigate or eliminate the risk, most sensible businesses are all ears. Even if they don't agree on the spot, they are at least aware of what is possible and, when the inevitable happens, will be more keen to fix the problem next time.
Downtime cost adds up pretty fucking quickly. For example, my company: we have 650 PC users. Pay rates probably range from 25 bucks an hour to 100 bucks an hour or more. Let's say the average is somewhere around 45 per hour.
1 hour of downtime × 650 users × 45 bucks per hour = $29,250 in lost productivity. Plus the embarrassment of not being able to deal with clients, etc. Plus potentially other flow-on effects (e.g., in our case, possibly: maintenance scheduling for our mining equipment (trucks, drills, etc.) didn't run, so plant didn't get serviced properly, and a $500k engine dies).
If you fuck something up and are down for a day? Well... you can do the math.
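If you'd rather not do it in your head, the same back-of-the-envelope estimate fits in a few lines of Python. This is only a rough sketch: `downtime_cost` is a name made up for illustration, the 45/hr figure is the same guess as above, and the 8-hour working day is an assumption, not something measured.

```python
def downtime_cost(users: int, avg_hourly_rate: float, hours: float) -> float:
    """Lost productivity only; ignores embarrassment and flow-on failures."""
    return users * avg_hourly_rate * hours


one_hour = downtime_cost(650, 45, 1)
full_day = downtime_cost(650, 45, 8)  # assumption: an 8-hour working day

print(f"1 hour: ${one_hour:,.0f}")   # 1 hour: $29,250
print(f"1 day:  ${full_day:,.0f}")   # 1 day:  $234,000
```

Plug in your own headcount and rates; the point is to have a number in hand before you talk to management.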
This is why you move the fuck on and adapt. If your job is relying on stuff that can be done by a shell script, you need to up-skill and find another job. Because if you don't do it, someone like myself will.
And we'll be getting paid more due to being able to work at scale (same shit for 10 machines or 10,000 machines), doing less work and being much happier doing it.
Yeah, don't get me wrong (I've been posting about setting up a test lab using vSphere, vFilters and VLANs): you can't replace the need to have someone on call or watching in case it all fucks up. But you can generally reduce the outage window and risk significantly by actually testing (both the roll-out and the roll-back) first. And if you've got it to the point where you can reliably test, you can work on your automation scripts, test the shit out of them, and, once they've been run against a copy of live using a copy of live data, be reasonably confident that they will work.
If they don't? Snapshot the breakage, roll back to pre-fuckup, and examine at your leisure. Then re-schedule once you know wtf happened.
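Roughly, the "snapshot the breakage, roll back, examine later" flow looks like the sketch below. Everything in it is hypothetical: `run_change` and the four callables are made-up names, and in real life they would wrap whatever your hypervisor or filer tooling provides. The demo at the bottom fakes a "system" with a dict so the thing runs anywhere.

```python
from typing import Callable


def run_change(
    take_snapshot: Callable[[], str],
    apply_change: Callable[[], None],
    verify: Callable[[], bool],
    roll_back: Callable[[str], None],
) -> bool:
    """Apply a change; on failure, keep the broken state and roll back.

    Returns True if the change stuck, False if it was rolled back.
    """
    pre = take_snapshot()  # known-good state to return to
    try:
        apply_change()
        ok = verify()
    except Exception as exc:
        print(f"change failed: {exc}")
        ok = False
    if not ok:
        broken = take_snapshot()  # preserve the breakage for later examination
        print(f"kept broken state as {broken}, rolling back to {pre}")
        roll_back(pre)
    return ok


# In-memory stand-in for a real system, purely for demonstration.
state = {"version": 1}
snapshots = []


def fake_snapshot() -> str:
    snapshots.append(dict(state))
    return f"snap-{len(snapshots) - 1}"


def fake_upgrade() -> None:
    state["version"] = 2


def fake_roll_back(name: str) -> None:
    state.clear()
    state.update(snapshots[int(name.split("-")[1])])


# Simulate an upgrade whose post-checks fail.
result = run_change(fake_snapshot, fake_upgrade, lambda: False, fake_roll_back)
```

After the failed run, `state` is back at version 1 and the second snapshot holds the broken version 2 state for the post-mortem.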
OS choice is irrelevant. I've seen plenty of critical Linux fuck-ups in my day, and OS choice doesn't account for human error. And, being human, you WILL make human errors. You need a test environment and a back-out plan. If you don't at least have a back-out plan and an estimate of how much the fuck-up will cost BEFORE proceeding (and balancing that against the cost/risk of leaving it the fuck alone), you should not be carrying out the work.
Sure, that sounds like management speak, but seriously... cover your fucking ass. Because one day it will fuck up (whatever the OS; this isn't just a Linux or Windows problem) and whilst the fuck-up may not necessarily be your fault, the extended downtime because you have not tested and have no back-out plan will be.
Yup. Although, that said, if you have a proper test environment, like say, a snap-clone of your live environment and an isolated test VLAN, you can do significant testing on copies of live systems and be pretty confident it will work. You can figure out your back-out plan, which may be as simple as rolling back to a snapshot (or possibly not).
Way too many environments have no test environment, but these days with the mass deployment of FAS/SAN and virtualization, you owe it to your team to get that shit set up.
I have a maintenance window at about 5AM tomorrow. It's fairly simple: upgrade CentOS, remove a package, install a package, reboot. Downtime shouldn't be more than 5 minutes. While I don't think it would be wise to automate this window, I think with sufficient testing we might be able to automate future maintenance windows so I or someone else can sleep in. Aside from the benefit of getting a bit more sleep, automating this kind of thing means that it can be written, reviewed and tested well in advance. Of course, if something goes horribly wrong, having a live body keeping watch is probably helpful. That said, we do have people on call 24/7 and they could probably respond capably in an emergency. Have any of you tried to do something like this? What's your experience been like?
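For a concrete starting point, a window like this (upgrade, remove a package, install a package, reboot) could be written down as a script that can be reviewed and dry-run well in advance. A minimal sketch, with assumptions flagged: `old-package` and `new-package` are placeholders (the actual package names aren't given here), `run_window` is a made-up helper, and it assumes a yum-based CentOS box with root privileges. It defaults to a dry run that only prints what it would do.

```python
import subprocess

# Placeholder names; substitute the real packages for your window.
OLD_PKG = "old-package"
NEW_PKG = "new-package"

STEPS = [
    ["yum", "-y", "update"],            # upgrade the OS packages
    ["yum", "-y", "remove", OLD_PKG],   # remove a package
    ["yum", "-y", "install", NEW_PKG],  # install a package
    ["shutdown", "-r", "now"],          # reboot
]


def run_window(steps, dry_run=True):
    """Run steps in order, stopping at the first failure.

    Returns the commands that were run (or would be run, in dry-run mode).
    """
    done = []
    for cmd in steps:
        print(("DRY-RUN: " if dry_run else "RUN: ") + " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit,
            # so a failed step prevents the later ones (and the reboot).
            subprocess.run(cmd, check=True)
        done.append(cmd)
    return done


if __name__ == "__main__":
    run_window(STEPS)  # dry run by default; pass dry_run=False on the real box
```

Run it against a clone of the live box first, and have the on-call person paged if any step fails rather than trusting it blind.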