I have built and seen succeed and crash-n-burn spectacularly a variety of automated test frameworks for large enterprises. Let's start with the successes:
- High availability / Robust
- Staging environment for automated test developers
- Performance metrics
- Easy to understand test results
The failures were due to:
- Brittle and poorly designed tests which don't run the same in the CI system versus the tester's machine
- Testers committing bugs into the test environment
- No performance metrics
- Hard-to-understand failure results require duplication and deep analysis
As you can tell, the failures are the opposite of the successes. Allow me to further explain.
The most important item is that the tests always work and always be running. This means test machines and back-up test machines. Running the same test on three different machines is even better because you can throw away or temporarily ignore outliers. Outliers need to be addressed eventually, but for day-to-day developers and managers just want to know if the code they committed causes failures. Having the test be in any way unreliable causes faith in the tests to disappear. The tests must run, and they must run well. Test environments go down or require maintenance and you want to be able to continue to run tests during these downtimes. Treat the tests like a production system.
Next, a big improvement I've seen is to have automated test developers contribute new work to a separate-but-equal staging environment. Automated test teams run on an Agile/Scrum iteration and only "release" their new tests at specific times. Another thing which reduces faith in test results if the tests are breaking due to the fault of automated test developers-- which happens all too often.
Automated tests are the ideal platform for generating performance metrics.
Lastly, a big pet peeve of mine is for understandable test failures. Test failures should obviously describe the set-up, expected and actual result. If test failures are obtuse and require a lot of time to analyze and triage-- that is wasted time that could have been spent fixing the root cause.
Good luck! If past experience is any indicator, you will be spending far more time and money than you ever imagined to create a robust system that developers and managers will have faith in.