I recently joined Tag1 Consulting full time. My main project since joining has been building up our monitoring infrastructure. My personal goal is world domination, but first, the best monitoring system I can build. Baby steps... baby steps. For quite a while now, I've used the monitoring system we built at the Open Source Lab to detect and diagnose problems with drupal.org, apache.org, kerneltrap.org and many more Free Software projects hosted at the lab. These experiences convinced me of the absolute necessity of a robust monitoring system.

It began simply at the OSL: a very stripped-down Cacti instance to monitor load, memory usage and network usage, combined with Nagios monitoring ping latency and not much more. From there, we went on to monitor disk space, load, users, MySQL availability and much more with Nagios. We also added some simple Cacti graphs for monitoring MySQL performance.

At Tag1, I've been allowed to take this even further. First, I tracked down some newer MySQL Cacti Templates developed by Xaprb. These templates are extremely cool and allow you to graph InnoDB Buffer Pool Activity, IO Activity, Log Activity, Processlist, MyISAM Index Usage, Connections, Network Traffic, Sorts, Temp Tables and much more. You can see a complete list and some screenshots here.

Along with this, we tracked down a non-standard MySQL Nagios plugin. This plugin checks not only MySQL availability, but also index hit rate, buffer pool hit rate, slave lag and even the number of threads connected. Nagios can then send out pages at Warning and Critical levels for each of these metrics. The upshot is that we can be paged when problems start appearing, not when they end in downtime. This external plugin is currently developed here. The default plugin has a segfault issue, which we have fixed, and we have submitted patches upstream.
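To make the Warning and Critical idea concrete, here is a minimal sketch in Python of how a check like this could be structured. It is not the plugin's actual code. The function names (`hit_rate`, `classify`, `check_mysql`) and every threshold value are my own assumptions for illustration. The status keys are real MySQL counter names from `SHOW GLOBAL STATUS`, the lag value corresponds to `Seconds_Behind_Master`, and the return codes follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def hit_rate(misses, requests):
    """Percentage of requests served from memory, or None with no traffic."""
    if requests <= 0:
        return None
    return 100.0 * (requests - misses) / requests


def classify(value, warn, crit, higher_is_better=True):
    """Map one metric to a Nagios state using warning/critical thresholds."""
    if value is None:
        return UNKNOWN
    if higher_is_better:
        if value < crit:
            return CRITICAL
        if value < warn:
            return WARNING
    else:
        if value > crit:
            return CRITICAL
        if value > warn:
            return WARNING
    return OK


def check_mysql(status, seconds_behind_master):
    """Return a per-metric dict of Nagios states.

    `status` is a dict of counters as reported by SHOW GLOBAL STATUS.
    All thresholds below are illustrative assumptions, not recommendations.
    """
    buffer_pool = hit_rate(status["Innodb_buffer_pool_reads"],
                           status["Innodb_buffer_pool_read_requests"])
    index = hit_rate(status["Key_reads"], status["Key_read_requests"])
    return {
        "buffer_pool_hit_rate": classify(buffer_pool, warn=99.0, crit=90.0),
        "index_hit_rate": classify(index, warn=99.0, crit=90.0),
        "slave_lag": classify(seconds_behind_master, warn=60, crit=300,
                              higher_is_better=False),
        "threads_connected": classify(status["Threads_connected"],
                                      warn=100, crit=200,
                                      higher_is_better=False),
    }
```

A real plugin would fetch these counters from the server, print a one-line summary and exit with the most severe state. The point of the sketch is only that each metric gets two thresholds, so a page can go out at Warning, well before things reach Critical.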

We deployed all of this on a server locked down with Tripwire, RK-Hunter, Apache authentication, SSL and all the usual security trappings. The point of all this is a hosted monitoring service that gives everyone access to a truly robust monitoring server. We also use this in our "Remote DBA" service, in which our consultants are actually on call for your infrastructure. It has been quite a lot of fun working on all of this.