Server Monitoring Solutions

Monday, June 2nd, 2008 at 8:30 am

Where I work, we run a number of servers around the world to meet the varying needs of our customers: dedicated hardware, virtual private servers, shared hosting, dedicated database servers, intranets, you name it. One thing we have always found challenging is monitoring the general status of these servers reliably. We currently use a combination of services and tools to achieve this.

We use Pingdom to monitor our web, DNS and email servers. Pingdom is a relatively inexpensive service that pings your server on a regular basis from multiple locations around the world and times the responses. It then creates some nice pretty graphs reporting your uptime. If there should ever happen to be some downtime (that never happens, does it?), it can notify a list of people via email or SMS. The main downside to Pingdom is that it is only a reactive service: by the time it sends out an email, the server is already down.
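To make the "ping and time the response" idea concrete, here is a minimal sketch of a single external probe. It is not how Pingdom itself is implemented; the function names classify_probe and probe_url, the PROBE_LIMIT variable and the 2 second default are all hypothetical, and it assumes curl and awk are installed.

```shell
# classify_probe CODE SECONDS LIMIT
# Turns one probe result into UP, SLOW or DOWN (hypothetical labels).
classify_probe() {
    cp_code="$1"; cp_seconds="$2"; cp_limit="$3"
    case "$cp_code" in
        2??|3??) ;;                      # success or redirect: keep going
        *) echo "DOWN"; return 0 ;;      # errors, or 000 when nothing answered
    esac
    # awk does the floating-point comparison that plain sh cannot.
    if awk -v t="$cp_seconds" -v l="$cp_limit" 'BEGIN { exit !(t > l) }'; then
        echo "SLOW"
    else
        echo "UP"
    fi
}

# probe_url URL
# Requests the URL once and prints the classification.
probe_url() {
    pu_result=$(curl -s -o /dev/null --max-time 10 \
        -w '%{http_code} %{time_total}' "$1") || pu_result="000 0"
    set -- $pu_result
    classify_probe "$1" "$2" "${PROBE_LIMIT:-2}"
}

# Usage (placeholder URL):
#   probe_url http://www.yourdomain.com/
```

A check like this can only report a failure after it has happened, which is exactly the limitation described above.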

A better solution is a pro-active monitoring system. For this we use an open-source tool named Monit. It can be configured to trigger actions when certain limits are met. For example, if Apache is using >= 75% of your system’s memory, Monit can trigger a restart of httpd. Or, if your volume is >= 95% full, it can send a notification email to an admin to take appropriate action. Check out their samples and documentation. It’s a pretty powerful system that can help prevent a complete server crash. One thing we have noticed, however: if you intentionally bring down Apache for maintenance and Monit is checking for a live instance of the web server, be sure to stop Monit first. Otherwise it will unexpectedly restart Apache, potentially causing issues.

Since I thought only the source was available for the Linux distribution on one of the servers I manage, I had to compile it. Afterwards I did find an .rpm, but hey, compiling has never killed anyone, right? Thankfully it was pretty straightforward:

yum install byacc flex gcc
wget http://www.tildeslash.com/monit/dist/monit-4.10.1.tar.gz
md5sum monit-4.10.1.tar.gz
tar -zxvf monit-4.10.1.tar.gz
cd monit-4.10.1
./configure
make
make install
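The md5sum step above only prints the checksum, so the comparison against the published value still has to be done by eye. A small sketch of automating that comparison follows; the function name verify_md5 is hypothetical, and the expected hash must be copied from the download page yourself (none is given here).

```shell
# verify_md5 FILE EXPECTED_HASH
# Returns 0 when the file's MD5 checksum matches EXPECTED_HASH, 1 otherwise.
verify_md5() {
    # md5sum prints "<hash>  <filename>"; keep only the hash.
    vm_actual=$(md5sum "$1" | awk '{ print $1 }')
    if [ -n "$vm_actual" ] && [ "$vm_actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got '$vm_actual')" >&2
        return 1
    fi
}

# Usage (replace the placeholder with the published checksum):
#   verify_md5 monit-4.10.1.tar.gz "<published md5>" && tar -zxvf monit-4.10.1.tar.gz
```

Chaining the extraction with && means the tarball is only unpacked when the checksum matches.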

Then edit /etc/monitrc to add some checks into it. Change this to whatever will make sense for you. Feel free to take a look at some of the configuration samples on the Monit Documentation page.

set daemon  60 #number of seconds between checks

#where and how to log stuff
set logfile syslog facility log_daemon

#outgoing mail settings
set mail-format { from: your_server@address.here
 subject: Server monit alert -- $SERVICE $EVENT
}

#email addresses to notify
set alert email@1.com
set alert email@2.com

#overall system health checks
  check system server_name-memory_cpu
    if loadavg (1min) > 5 for 3 cycles then alert
    if loadavg (5min) > 2 for 3 cycles then alert
    if memory usage > 90% then alert
    if cpu usage (user) > 95% then alert
    if cpu usage (system) > 95% then alert
    if cpu usage (wait) > 95% then alert

#apache web server checks
  check process apache with pidfile /var/run/httpd.pid
    start program = "/etc/init.d/httpd start"
    stop program  = "/etc/init.d/httpd stop"
    #if cpu > 60% for 2 cycles then alert
    #if cpu > 80% for 5 cycles then restart
    if totalmem > 512.0 MB for 5 cycles then alert
    if children > 400 then restart
    if failed host www.yourdomain.com port 80 protocol http
       and request "/index.html"
       for 2 cycles then alert

#drive space checks
  check device datafs_main with path /dev/sda1
    if space usage > 90% for 5 times within 15 cycles then alert
    if inode usage > 80% then alert
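One thing that can trip up a first run: as far as the Monit documentation goes, the control file must have permissions no more than 0700, and Monit complains if group or others can access it. A minimal sketch of checking that before starting, where the function name control_file_is_private is hypothetical:

```shell
# control_file_is_private FILE
# Succeeds only when FILE grants no permissions to group or others.
control_file_is_private() {
    # In "ls -l" output, characters 5-10 of the mode string are the
    # group and other permission bits, e.g. "-rw-------" gives "------".
    cf_bits=$(ls -ld "$1" | cut -c5-10)
    [ "$cf_bits" = "------" ]
}

# Usage (path taken from the text above):
#   control_file_is_private /etc/monitrc || chmod 600 /etc/monitrc
```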

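Then double-check that things will work (monit -t checks only the control file’s syntax; the command below runs every configured check once):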

monit validate

If you get the thumbs up, finally start the daemon with:

monit

It should send out an email to your contact address letting you know that it has started. Now any time something starts to go awry on your server, you’ll know ahead of time.

It’s great to have a combination of monitoring solutions in place. Monit is good for notifying you before serious server issues happen. The main downside is that if sendmail dies on your system, there will be no way for it to send out its notifications. This is where having another solution such as Pingdom as a backup is very handy.

Has anyone else found a better solution for monitoring multiple remote servers? Something that will be pro-active and still work if the server completely fails? Feel free to comment and let me know.

UPDATE:

The developers of Monit were very kind and pointed out a couple of things I missed. To prevent monit from restarting apache (during maintenance, or whatever) use the command:

monit unmonitor apache

and then when you’re ready to go again,

monit monitor apache
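Since it is easy to forget the second command after maintenance, the two can be wrapped around the maintenance task itself. This is only a sketch: the helper name with_unmonitored and the MONIT override variable are hypothetical, and it assumes monit is on the PATH.

```shell
# MONIT may be overridden, e.g. with a full path to the binary.
MONIT="${MONIT:-monit}"

# with_unmonitored SERVICE COMMAND [ARGS...]
# Runs COMMAND while SERVICE is unmonitored, then resumes monitoring.
with_unmonitored() {
    wu_service="$1"
    shift
    "$MONIT" unmonitor "$wu_service" || return 1
    "$@"
    wu_status=$?
    # Resume monitoring even if the maintenance command failed.
    "$MONIT" monitor "$wu_service"
    return "$wu_status"
}

# Usage (service name and init script taken from the config above):
#   with_unmonitored apache /etc/init.d/httpd restart
```

The function returns the exit status of the maintenance command, so a failed restart is still visible to whatever called it.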

Also, Monit has a built-in event queue (for the situation where sendmail dies, for example) as well as options for backup SMTP servers. To set these up, use the following lines in your config file:

set eventqueue basedir /var/monit slots 100
set mailserver smtp1.foo.bar, smtp2.foo.bar

Thanks again Monit Team!
