Overarching principle of making-your-life-easy: if you support more than three systems, treat them as a cluster.
- This means you have a dedicated admin machine that is very secure (no root logins, heavily firewalled, patched often, etc.) and that only a few very trustworthy admins can access. I highly recommend running SuSE Enterprise Linux 9 with the IBM EAL4+ Security Configuration. All maintenance activities are run from this management server.
- Use the Parallel Distributed SHell (PDSH) utilities: http://www.llnl.gov/linux/pdsh/pdsh.html. These allow you to run commands on, or copy files to, a single system, a group of systems, or all systems at the same time. Wondering what kernel all your systems are running? Just issue a `pdsh -a uname -a`. Need to copy out the sudoers file? `pdcp -a /home/admin/node_files/sudoers /etc/sudoers`
- Run Ganglia for resource monitoring: http://ganglia.info/
- Run Samhain for filesystem integrity scanning on all servers: http://la-samhna.de/samhain/
- Run host-based firewalls on all servers, for example Shorewall: http://www.shorewall.net/
- In my experience, power supplies have caused more instability than any other single hardware component. Buy good equipment, and for the important machines buy systems with dual redundant hot-swappable power supplies.
- Good deals can be had from the big vendors. Although we run a lot of whitebox and IBM equipment, Sun currently has a great system for a very cheap price (starts at $745): http://www.sun.com/servers/entry/x2100/.
- NFS sucks, but it is the best filesystem glue layer available. It is very sensitive to high-latency environments, so if you need to squeeze out the best performance, run it over InfiniBand, which has very low latency and massive bandwidth (5us, 1.25GB/s).
- Every system should have an electronic "system book", which contains the full hardware specs, including where each part gets service from (if bought separately), how long the warranty lasts (give end dates), contact info, etc. If you are managing 50 or fewer systems, keep track of all changes in a central location; otherwise track changes with a system that scales (even a handwritten script and a DB table would be sufficient).
- Good enough is the enemy of the Best, but that is a good thing. Never overengineer a solution; doing so only means that other problems go unsolved.
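
To make the "handwritten script and DB table" idea from the system book item concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. Everything in it is an assumption for illustration, not something from a real tool: the table layout, the column names, and the helper functions `open_book`, `add_system`, `log_change`, `history` and `warranties_ending_by` are all hypothetical. Adapt the fields to whatever your own system book records.

```python
import sqlite3
from datetime import date

# Hypothetical schema: one row per machine, one row per change made to it.
SCHEMA = """
CREATE TABLE IF NOT EXISTS systems (
    hostname        TEXT PRIMARY KEY,
    hardware_spec   TEXT NOT NULL,
    vendor          TEXT NOT NULL,
    support_contact TEXT NOT NULL,
    warranty_end    TEXT NOT NULL  -- ISO date, so string order is date order
);
CREATE TABLE IF NOT EXISTS changes (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    hostname   TEXT NOT NULL REFERENCES systems(hostname),
    changed_on TEXT NOT NULL,
    admin      TEXT NOT NULL,
    summary    TEXT NOT NULL
);
"""


def open_book(path=":memory:"):
    """Open (or create) the system book database at the given path."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when asked to.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_system(conn, hostname, hardware_spec, vendor, support_contact, warranty_end):
    """Register a machine along with its warranty end date (YYYY-MM-DD)."""
    conn.execute(
        "INSERT INTO systems (hostname, hardware_spec, vendor, support_contact, warranty_end) "
        "VALUES (?, ?, ?, ?, ?)",
        (hostname, hardware_spec, vendor, support_contact, warranty_end),
    )
    conn.commit()


def log_change(conn, hostname, admin, summary, changed_on=None):
    """Record a change to a registered machine; defaults to today's date."""
    changed_on = changed_on or date.today().isoformat()
    conn.execute(
        "INSERT INTO changes (hostname, changed_on, admin, summary) VALUES (?, ?, ?, ?)",
        (hostname, changed_on, admin, summary),
    )
    conn.commit()


def history(conn, hostname):
    """Return (changed_on, admin, summary) tuples for one machine, oldest first."""
    rows = conn.execute(
        "SELECT changed_on, admin, summary FROM changes "
        "WHERE hostname = ? ORDER BY changed_on, id",
        (hostname,),
    )
    return rows.fetchall()


def warranties_ending_by(conn, cutoff):
    """Return hostnames whose warranty ends on or before cutoff (YYYY-MM-DD)."""
    rows = conn.execute(
        "SELECT hostname FROM systems WHERE warranty_end <= ? ORDER BY warranty_end",
        (cutoff,),
    )
    return [row[0] for row in rows]
```

Kept on the admin machine, something like `open_book("/path/to/systembook.db")` followed by a `log_change(...)` call after each maintenance task would give you the central change log, and `warranties_ending_by(...)` answers the "which warranties are about to run out" question. The path here is a placeholder. This is deliberately "good enough": one file, no server, and easy to replace if you outgrow it.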