Comment Its All In the Process ... (Score 2) 185
Having been involved in Technical Ops of both large and small companies for many years, I have seen DR exercises and design that have run the gambit. I tend to think The key thing I have found to the success of any organization, exercise, or philosophy, is the underlying process that drives execution. The larger the team/org, the more change points, which in turn leads to more variables between tests. This creates complexity, as a test that ran fine a few months ago may not run the same today. However, ensuring change does not overrun process in understanding and applying the change into the greater design is a key to ensuring each test improves upon the last, until such time this is a finite process.
For example, when working for one of the big 401k's, the first DR exercise evaluated the data center completely being leveled and re-locating both technical services as well as the ~300 on site employees to another location. Long story short, the first exercise of this was scheduled for 2 days, and while it worked, we identified dozens of issues. We scheduled the next test 6 months later and addressed what we believed were all of the issues; on next test, we ran into perhaps ~10 issues. The next test we scheduled 3 months ahead and ran into ~2 issues. All awhile, things continue to change and innovation is occurring, change process control is ensuring that new things are being factored into the continual DR process/exercise. For a small telecom I worked for, the same type of testing was accomplished with ~2-3 week turn around time (smaller team, less change points, more dynamic response), but with same underlying principles.
Documentation of such things is critical, and employee turnover is often one of the greatest risk points. Having a diversified staff with overlapping knowledge should minimize the later risk to some degree, and if implemented fully, risk should be diminished.
So how does all this tie back into maint? Well, it is anticipated that if any system runs long enough, their will be opportunity for failure. It is preparation for when such failure occurs, one can balance the capability of providing a measured window of downtime (if any) and provide some degree of predictability (i.e. I test once a quarter). The counter to this can certainly be overzealous maint, so certainly their is a point to being reasonable. For example, what many of go through with our cars - the dealer wants us to come in every 3k miles for an oil change, whereas realistically most mfr's and my own experience dictates that ~5k (if not longer depending on circumstance) is much more cost effective. Either way, this is providing some degree of confidence that this should prolong engine life.
For example, when working for one of the big 401k's, the first DR exercise evaluated the data center completely being leveled and re-locating both technical services as well as the ~300 on site employees to another location. Long story short, the first exercise of this was scheduled for 2 days, and while it worked, we identified dozens of issues. We scheduled the next test 6 months later and addressed what we believed were all of the issues; on next test, we ran into perhaps ~10 issues. The next test we scheduled 3 months ahead and ran into ~2 issues. All awhile, things continue to change and innovation is occurring, change process control is ensuring that new things are being factored into the continual DR process/exercise. For a small telecom I worked for, the same type of testing was accomplished with ~2-3 week turn around time (smaller team, less change points, more dynamic response), but with same underlying principles.
Documentation of such things is critical, and employee turnover is often one of the greatest risk points. Having a diversified staff with overlapping knowledge should minimize the later risk to some degree, and if implemented fully, risk should be diminished.
So how does all this tie back into maint? Well, it is anticipated that if any system runs long enough, their will be opportunity for failure. It is preparation for when such failure occurs, one can balance the capability of providing a measured window of downtime (if any) and provide some degree of predictability (i.e. I test once a quarter). The counter to this can certainly be overzealous maint, so certainly their is a point to being reasonable. For example, what many of go through with our cars - the dealer wants us to come in every 3k miles for an oil change, whereas realistically most mfr's and my own experience dictates that ~5k (if not longer depending on circumstance) is much more cost effective. Either way, this is providing some degree of confidence that this should prolong engine life.