... is actually quite simple: You keep your hands off the systems. Period.
In detail, you plan, install and _test_ your setup before it enters production. You make sure that you can survive whatever you throw at it wrt. errors and incidents. You then figure out how much downtime you are allowed to have according to SLA. You then divide this number into equal sized maintaince windows together with the customer. And then you adhere to these windows! No manager should ever be allowed to demand downtime out of band. Period. In between you basically minimize your involvement with the systems and plan your activities for the next scheduled closing window.
And you ofcourse only deploy stable, true and tested versions of software and operating systems. And even though your OS supports online capacity expansion on the fly, you really shouldn't use the capability unless you absolutely have to. Instead you plan ahead in your capacity management procedure and add capacity in the closing windows. And you do not test and rehearse failures! It only introduces risks
So in essence. Common sense will easily yield 99.9%. Carefull planning and execution will yield 99.99%. The really hard part is 99.999%...