And if it doesn't work?
To the original poster: It is entirely possible, but you're going to need to learn a lot about modern automation and configuration management tools appropriate to the types of maintenance you're looking to automate. You also need solid vision and alignment on how you're going to achieve this level of automation across multiple parts of your business -- Development, IT, the Business, everyone. They all have to buy in and commit, because any of those folks can fuck it all up if everyone isn't on the same page. You can't do it alone on the admin side. As a start, I would suggest learning about Continuous Integration/Continuous Delivery and Agile and DevOps methodologies to get on the road to where you want to be.
To the rest of you:
The original comment ("Learn and use Puppet") is grossly oversimplified -- there is a lot more to it -- but with proper implementation of configuration management software (Chef, Puppet, Salt, etc.), proper automated testing (think Jenkins, TeamCity, etc.) and a real commitment in your organization to Continuous Integration and Delivery practices, you can easily do regular automated maintenance. Yes -- sometimes it will break and you'll have to clean it up. But when it is properly and thoughtfully implemented in policy and practice, those times when it breaks will be the exception that proves the rule.
Forgive the argument from authority, but at our firm (international, thousands of primarily Linux servers across 14 countries and 40+ datacenters, mostly bare-iron, some virtualization) we have regular daily and weekly automated maintenance. We handle all sorts of significant change -- driver updates, software upgrades, network switch configuration, even forklift OS upgrades involving the full re-imaging of a bare-iron system combined with re-deployment of software (including things like databases and Hadoop clusters) -- automatically and without human intervention on a regular basis. And by regular I mean daily.
The attitude that "Murphy always wins" or "something will fail and you will have failed by not being there to fix it immediately" is a relic of a time when the tools available to manage large-scale infrastructure were inadequate or unavailable. Again, there are failures that will require manual intervention, but if you are doing your jobs well as developers, network admins, systems admins, 'devops' [NOTE: I strongly object to that term being used as a job title, but that's how folks have started using it], then you should be able to conduct automated hands-free production change at 2am on a Saturday and sleep like a baby, knowing that when you check your upgrade report in the morning, 99% of the time everything will have gone off without a hitch.
Frankly, if you approach complex infrastructure management with that defeatist viewpoint of "things will always fail", you are doing yourself and your employer a disservice, and you are severely restricting your career prospects. My company is not in any way unique in our ability to automate and manage our infrastructure, and maintaining that type of outdated attitude is going to cause lots of doors to be slammed in your face. Do you really believe the Googles, Facebooks and Amazons of the world rely on having a human being white-knuckling every change in their infrastructure?
One additional note: If your infrastructure is designed such that you cannot push change without guaranteed downtime or the risk of downtime then you have failed to design your infrastructure properly.
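To make that last point concrete, here is a rough sketch of what "pushing change without downtime" can look like: a rolling upgrade that takes one host out of rotation at a time, checks its health before putting it back, and stops the run once failures hit a threshold. This is not our tooling and not any particular product's API. The names `rolling_upgrade`, `Report`, `drain`, `upgrade`, `healthy`, `rollback`, `restore` and `max_failures` are all made up for illustration, and the callbacks stand in for whatever your load balancer and configuration management tool actually expose.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Report:
    """What you read in the morning (hypothetical structure)."""
    upgraded: List[str] = field(default_factory=list)
    rolled_back: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # Clean run: nothing was rolled back and nothing was skipped.
        return not self.rolled_back and not self.skipped


def rolling_upgrade(
    hosts: Sequence[str],
    drain: Callable[[str], None],      # take host out of rotation
    upgrade: Callable[[str], None],    # apply the change
    healthy: Callable[[str], bool],    # post-change health check
    rollback: Callable[[str], None],   # undo the change
    restore: Callable[[str], None],    # put host back in rotation
    max_failures: int = 1,
) -> Report:
    """Upgrade hosts one at a time; halt after max_failures failures."""
    hosts = list(hosts)
    report = Report()
    for i, host in enumerate(hosts):
        if len(report.rolled_back) >= max_failures:
            # Stop touching production; remaining hosts stay as they were.
            report.skipped.extend(hosts[i:])
            break
        drain(host)
        try:
            upgrade(host)
            good = healthy(host)
        except Exception:
            good = False
        if good:
            restore(host)
            report.upgraded.append(host)
        else:
            # Assumption: a failed host is rolled back and left drained
            # so a human can look at it during working hours.
            rollback(host)
            report.rolled_back.append(host)
    return report
```

The point of the design is that at most one host is ever out of service, and a bad change stops itself after the first failure instead of marching through the whole fleet. If your service can't survive one host being drained, that is the design problem to fix first.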