Where I work, we run a number of servers around the world to meet the varying needs of our customers: dedicated hardware, virtual private servers, shared hosting, dedicated database servers, intranets, you name it. One thing we have always found challenging is monitoring the general status of those servers reliably. We currently use a combination of services and tools to do it.
We use Pingdom to monitor our web, DNS and email servers. Pingdom is a relatively inexpensive service that pings your server at regular intervals from multiple locations around the world and times the responses. It then produces some nice graphs reporting your uptime. If there should ever happen to be some downtime (that never happens, does it?), it can notify a list of people via email or SMS. The main downside to Pingdom is that it is purely reactive: by the time it sends out an email, the server is already down.
A better solution is to have a proactive monitoring system. For this we use an open source tool named Monit. It can be configured to trigger actions when certain limits are reached. For example, if Apache is using 75% or more of your system’s memory, Monit can trigger a restart of httpd. Or, if your volume is 95% or more full, it can send a notification email to an admin so they can take appropriate action. Check out their samples and documentation. It’s a pretty powerful system that can help prevent a complete server crash. One thing we have noticed, however: if you intentionally bring down Apache for maintenance and Monit is checking for a live instance of the web server, be sure to stop Monit first. Otherwise, it will restart Apache when you least expect it, which can cause problems.
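To make the two examples above a bit more concrete, here is a rough sketch of how they might look in a Monit control file. Treat it as a starting point rather than our actual configuration: the check names, the pidfile and init script paths, the mail server and the email address are all placeholders chosen for illustration, and they will differ by distribution. The thresholds use a plain `>` to stay on the conservative side of Monit's syntax.

```
# Where alerts go (placeholder address) and how to send them
# (assumes a mail server is listening on localhost).
set mailserver localhost
set alert admin@example.com

# Restart Apache if it and its children use too much memory.
# The pidfile and init script paths are assumptions; check your system.
check process apache with pidfile /var/run/httpd.pid
    start program = "/etc/init.d/httpd start"
    stop program  = "/etc/init.d/httpd stop"
    if totalmem > 75% then restart

# Email the admin when the root volume is nearly full.
check filesystem rootfs with path /
    if space usage > 95% then alert
```

With a setup like this, the maintenance problem mentioned above can also be handled without stopping Monit entirely: running `monit unmonitor apache` pauses just that check, and `monit monitor apache` turns it back on once the work is done.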