Where I work, we run a number of servers around the world to meet the varying needs of our customers. These include dedicated hardware, virtual private servers, shared hosting, dedicated database servers, intranets, you name it. One thing we have always found challenging is monitoring the general status of the servers in a reliable fashion. We currently use a combination of services and tools to achieve that.
We use Pingdom to monitor our web, DNS and email servers. Pingdom is a relatively inexpensive service that pings your server on a regular basis from multiple locations around the world and times the responses. It then creates some nice graphs reporting your uptime. If there should ever happen to be some downtime (that never happens, does it?), it can notify a list of people via email or SMS. The main downside to Pingdom is that it is only a reactive service. By the time it sends out an email, the server is already down.
A better solution is a proactive monitoring system. For this we use an open source tool named Monit. It can be configured to trigger actions when certain limits are met. For example, if Apache is using 75% or more of your system’s memory, Monit can trigger a restart of httpd. Or, if your volume is at least 95% full, it can send a notification email to an admin so they can take appropriate action. Check out their samples and documentation. It’s a pretty powerful system that can help prevent a complete server crash. One thing we have noticed, however: if you intentionally bring down Apache for maintenance and Monit is checking for a live instance of the web server, be sure to stop Monit first. Otherwise, it will unexpectedly restart Apache, potentially causing problems.
Since I thought only the source was available for the Linux distribution on one of the servers I manage, I had to compile it. Afterwards I did find an .rpm, but hey, compiling has never killed anyone, right? Thankfully it was pretty straightforward:
yum install byacc flex gcc
wget http://www.tildeslash.com/monit/dist/monit-4.10.1.tar.gz
md5sum monit-4.10.1.tar.gz
tar -zxvf monit-4.10.1.tar.gz
cd monit-4.10.1
./configure
make
make install
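The md5sum step above only prints a digest; it is up to you to compare it with the checksum published alongside the download. Here is a minimal sketch of that comparison. The helper name `verify_md5` is made up for this example, and the expected value is whatever the download site lists for your tarball.

```shell
#!/bin/sh
# Compare a file's MD5 digest with an expected value.
# verify_md5 is a hypothetical helper, not part of Monit or coreutils.
verify_md5() {
    file="$1"
    expected="$2"
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}
```

You would call it as `verify_md5 monit-4.10.1.tar.gz <published checksum>` and only unpack the tarball if it reports OK.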
Then edit /etc/monitrc to add some checks. Change these to whatever makes sense for you. Feel free to take a look at some of the configuration samples on the Monit Documentation page.
set daemon 60 #number of seconds between checks

#where and how to log stuff
set logfile syslog facility log_daemon

#outgoing mail settings
set mail-format {
  from: you_server@address.here
  subject: Server monit alert -- $SERVICE $EVENT
}

#email addresses to notify
set alert email@1.com
set alert email@2.com

#overall system health checks
check system server_name-memory_cpu
  if loadavg (1min) > 5 for 3 cycles then alert
  if loadavg (5min) > 2 for 3 cycles then alert
  if memory usage > 90% then alert
  if cpu usage (user) > 95% then alert
  if cpu usage (system) > 95% then alert
  if cpu usage (wait) > 95% then alert

#apache web server checks
check process apache with pidfile /var/run/httpd.pid
  start program = "/etc/init.d/httpd start"
  stop program = "/etc/init.d/httpd stop"
  #if cpu > 60% for 2 cycles then alert
  #if cpu > 80% for 5 cycles then restart
  if totalmem > 512.0 MB for 5 cycles then alert
  if children > 400 then restart
  if failed host www.yourdomain.com port 80 protocol http
    and request "/index.html"
    for 2 cycles then alert

#drive space checks
check device datafs_main with path /dev/sda1
  if space usage > 90% for 5 times within 15 cycles then alert
  if inode usage > 80% then alert
Then double-check that things will work with:
monit validate
If you get the thumbs up, finally start the daemon with:
monit
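If you only want to confirm that the control file parses before starting the daemon, Monit also has a `-t` syntax-check flag. Here is a minimal sketch that starts Monit only when that check passes, assuming the control file lives at /etc/monitrc as above. The names `start_monit_checked`, `MONIT` and `MONITRC` are made up for this example.

```shell
#!/bin/sh
# Start Monit only if the control file passes a syntax check.
# MONIT and MONITRC are illustrative shell variables, not Monit settings.
MONIT="${MONIT:-monit}"
MONITRC="${MONITRC:-/etc/monitrc}"

start_monit_checked() {
    # -t runs a syntax check on the control file; -c selects the file.
    if "$MONIT" -t -c "$MONITRC"; then
        "$MONIT" -c "$MONITRC"
    else
        echo "syntax check failed for $MONITRC; not starting" >&2
        return 1
    fi
}
```

Calling `start_monit_checked` from a deploy or init script keeps a typo in the config from leaving you with no monitoring at all without you noticing.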
It should send out an email to your contact address letting you know that it has started. Now any time something starts to go awry on your server, you’ll know ahead of time.
It’s great to have a combination of monitoring solutions in place. Monit is good for notifying you before serious server issues happen. The main downside is that if sendmail dies on your system, Monit has no way to send out its notifications. This is where having another solution such as Pingdom as a backup is very handy.
Has anyone else found a better solution for monitoring multiple remote servers? Something that will be pro-active and still work if the server completely fails? Feel free to comment and let me know.
UPDATE:
The developers of Monit were very kind and pointed out a couple of things I missed. To prevent monit from restarting apache (during maintenance, or whatever) use the command:
monit unmonitor apache
and then when you’re ready to go again,
monit monitor apache
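The two commands above are easy to wrap so that monitoring is always switched back on after maintenance, even if the maintenance step fails. Here is a minimal sketch. `with_unmonitored` is a made-up helper name, `MONIT` is an illustrative override, and the service name must match the one in your `check process` line.

```shell
#!/bin/sh
# Run a maintenance command while Monit's watch on a service is paused.
# with_unmonitored is a hypothetical helper; MONIT lets you override the binary.
MONIT="${MONIT:-monit}"

with_unmonitored() {
    service="$1"
    shift
    # Stop watching the service; give up if Monit refuses.
    "$MONIT" unmonitor "$service" || return 1
    # Run the maintenance command and remember its exit status.
    status=0
    "$@" || status=$?
    # Re-enable monitoring even if the maintenance command failed.
    "$MONIT" monitor "$service"
    return "$status"
}
```

For example, `with_unmonitored apache /etc/init.d/httpd restart` would pause the apache check, restart httpd, and resume monitoring afterwards.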
Also, Monit has a built-in event queue (for the situation where sendmail dies, for example) as well as options for backup SMTP servers. To set these up, use the following lines in your config file:
set eventqueue basedir /var/monit slots 100
set mailserver smtp1.foo.bar, smtp2.foo.bar
Thanks again Monit Team!
