Ditto here (I work in the same field): an ATS (automatic transfer switch) failure sounds right. No matter how well one designs redundancy and resiliency into a data center, there will always be that one weak spot. I hope Delta management and IT conduct a serious failure assessment (without it devolving into a witch hunt) to understand how the failure unfolded and determine what to change. I would also like to see a fuller discussion of the exact cause so those of us in the data center industry can use the information to assess our own facilities.
And yes, regular service and maintenance might have helped catch this issue before it caused the outage.