System monitoring – what are my options?

There are many options for system monitoring – so many, in fact, that a lot of people turn to one of the two worst options: writing their own, or getting struck with paralysis by analysis and doing nothing.

Monitoring your systems and alerting when something weird happens is crucial to understanding and tackling issues as early as possible. That lets you work on the activities you have planned instead of reacting to outages, and ultimately keeps your customers happy.

There’s a whole range of free tools that monitor your systems and create metrics for you to graph, evaluate, and use to create alerts. In part 1 of this series, we’ll explore the pros and cons of three popular ones.

CollectD

CollectD is a daemon that gathers system information and passes it on to a storage backend such as RRD files or Graphite. It is, as the name suggests, a collector rather than a monitoring tool, and stresses modularity, performance and scale.
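To make the hand-off to Graphite concrete, here is a minimal sketch of what a single sample looks like on the wire. Graphite's plaintext protocol expects one line per sample: a metric path, a value, and a Unix timestamp. The function name, metric path and values below are made up for illustration; this is not CollectD's own code.

```python
import time

def graphite_line(path, value, timestamp=None):
    """Format one sample as 'path value timestamp' followed by a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"

# Hypothetical metric path and values, for illustration only.
print(graphite_line("servers.web01.cpu.user", 12.5, 1700000000), end="")
# prints: servers.web01.cpu.user 12.5 1700000000
```

A collector would write lines like this to Graphite's plaintext listener (commonly TCP port 2003).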

Benefits

Quick and Easy – Setup is straightforward, configuration is painless, and maintenance is minimal. As a multithreaded daemon written in C, it’s light on system resources and fast on clients. CollectD supports multiple servers and has a multitude of ways to store data.

Plugins – CollectD has a pile of ‘em: for specialized servers, for sending metrics to various systems, for logging and notification, for nearly anything. The default is enough to get started, but there’s plenty of flexibility once you get going. It plays nicely with Graphite.

Drawbacks

No GUI – CollectD is not a graphing tool – it simply spits out RRD files. There are scripts for a minimal interface packaged with it, but even the project admits that it’s not up to much. You’ll need to plug into Graphite or something similar to read CollectD’s outputs effectively.

Too Much Info – Sub-minute refreshing and the sheer variety of plugins make it easy to overreach. If you ask for a lot of statistics from a node, you may get more data than you can graph and read effectively.
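A quick back-of-the-envelope calculation shows how fast the volume grows. The figures here (200 metrics per node, 50 nodes, a 10-second collection interval) are assumptions picked for illustration, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def samples_per_day(metrics_per_node, nodes, interval_seconds):
    """Total data points written per day across all nodes."""
    return metrics_per_node * nodes * (SECONDS_PER_DAY // interval_seconds)

# Assumed fleet: 200 metrics on each of 50 nodes.
print(samples_per_day(200, 50, 10))  # 86400000 at a 10-second interval
print(samples_per_day(200, 50, 60))  # 14400000 at a 60-second interval
```

Dropping plugins you never graph, or lengthening the interval, cuts the total proportionally.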

 

Munin

Munin is a resource and performance monitoring tool written in Perl. Alerting isn’t its strong suit, but Munin is a robust solution for cleanly presenting a lot of system and network data.

Benefits

Out-of-the-Box – Munin stresses ease of use; installation and configuration take minutes. Writing code to extend monitors is so simple you can use it for non-performance tasks like counting web traffic. You can set warning and critical thresholds in Munin itself, and the recommended way to generate alerts from them is a Nagios plugin.

Plug and Play – Like CollectD, Munin has a wide range of plug-ins to choose from: just grab a few scripts from the Plugin Gallery. The more elegant plug-ins can monitor long-view trends like annual resource usage. Writing new plug-ins for yourself is also no trouble.
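To give a feel for how little a plugin needs, here is a minimal sketch of one in Python. A Munin plugin is just an executable: called with the argument config it prints graph metadata, and called with no argument it prints the current values. The field name load and the graph labels are choices made for this example, and os.getloadavg() assumes a Unix-like host.

```python
#!/usr/bin/env python3
import os
import sys

def config_lines():
    """Graph metadata, printed when the plugin is called with 'config'."""
    return [
        "graph_title Load average (1 min)",
        "graph_vlabel load",
        "graph_category system",
        "load.label load",
    ]

def value_lines(load):
    """Current reading, printed when the plugin is called with no argument."""
    return [f"load.value {load:.2f}"]

def main(argv):
    if len(argv) > 1 and argv[1] == "config":
        lines = config_lines()
    else:
        lines = value_lines(os.getloadavg()[0])
    print("\n".join(lines))

if __name__ == "__main__":
    main(sys.argv)
```

Made executable and linked into the node’s plugin directory, a script like this would show up as one more graph.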

Drawbacks

Central Server – Each server you’re monitoring runs a Munin node process, and a central master server polls all of them. This model can lead to performance issues when the scale rises to hundreds of servers. Budgeting for that dedicated server will need to come sooner rather than later.

Graphs – The graphs generated by Munin are static images – not ideal if you want interactive views of your data. Also, in the default setup the graphs are redrawn on every update run, creating big disk I/O and CPU hits on the central server. As a whole, it’s pretty dated.

 

Dropwizard

Dropwizard is a Java framework that supports metrics and ops tools for web services. This collection of best-of-breed libraries is built for speed and robustness.

Benefits

Built-in Metrics – Pick the service calls you want to track and performance metrics are collected automatically. Health checks report on each service, too – handy when you’re making a lot of REST calls. Add in service configuration validation as a default feature and Dropwizard is quick to both deploy and change.

Container-less – All resources and dependencies are packed into fat JARs, making it a snap to write micro-services or add instances. Default configurations are sensible and updates are easy, too – you can deploy with one line.

Drawbacks

Performance – Each request has its own thread – even tuning maxThreads and maxConnections may not help throughput. This is problematic for the kind of I/O-bound applications that Dropwizard is likely to service. Dropwizard’s light weight cuts both ways – if you have high loads and a lot of developers, other options may work better.
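A rough way to see the ceiling: if every in-flight request holds a thread until it finishes, throughput can’t exceed the thread count divided by the time each request takes. The sketch below works through that bound with assumed numbers (1024 threads, 250 ms and 2 s per request); it ignores CPU limits, queueing and everything else a real load test would reveal.

```python
def max_throughput(max_threads, seconds_per_request):
    """Upper bound on requests/second when each in-flight request holds one thread."""
    return max_threads / seconds_per_request

# Assumed figures, for illustration only.
print(max_throughput(1024, 0.25))  # 4096.0 requests/second
print(max_throughput(1024, 2.0))   # 512.0 requests/second
```

When requests spend most of their time waiting on I/O, the time per request dominates the bound, which is why raising the thread limit alone may not buy much.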

Support – Dropwizard has an active community, but it’s not as lively as it was when Coda Hale was developing it. Releases can be months apart. Documentation could be meatier, and even Stack Overflow has less discussion of it than of other tools.

 

In the next article, we’ll check out a few other useful tools and dig through the main factors you might look at when making a decision.