Alerting from first principles

An Introduction to Alerting

Having recently added our Alerting for Graphite, we thought it’d be useful to put together a short primer on alerting. What should you consider when deciding what to alert on, and where those alerts should go? An early warning system is only as good as its alarms.

What is alerting?

Monitoring uses alerts to tell you when something unexpected happens, whether you need to act, and how you might fix the problem. Good alerts give you the right context to act and enough lead time to be effective. Bad alerts tell you what you already know or don’t need to hear – once you know a database is down, you don’t need to be reminded every minute.
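To make the “don’t remind me every minute” point concrete, here is a minimal sketch of a notifier that only speaks up when a check changes state. The class name, the check name “orders-db” and the message format are all made up for illustration; real alerting tools handle this with their own deduplication or re-notification settings.

```python
from dataclasses import dataclass, field


@dataclass
class StateChangeNotifier:
    """Notify only when a check changes state, not on every evaluation."""

    last_state: dict = field(default_factory=dict)  # check name -> last known health
    sent: list = field(default_factory=list)        # notifications actually sent

    def observe(self, check: str, healthy: bool) -> bool:
        """Record a check result; return True if a notification was sent."""
        # Assumption: a check we have never seen is treated as healthy.
        previous = self.last_state.get(check, True)
        self.last_state[check] = healthy
        if healthy == previous:
            return False  # nothing new to say, so stay quiet
        status = "RECOVERED" if healthy else "DOWN"
        self.sent.append(f"{check}: {status}")
        return True


notifier = StateChangeNotifier()
# Five one-minute evaluations: the database is down for three of them.
for healthy in [True, False, False, False, True]:
    notifier.observe("orders-db", healthy)

print(notifier.sent)  # ['orders-db: DOWN', 'orders-db: RECOVERED']
```

Three failing evaluations produce one “DOWN” message and one “RECOVERED” message, rather than a reminder per minute.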

If monitoring gives you data, then alerting gives you information.

How to do Alerts

Done properly, your alerts should trigger only for states or events that require attention or intervention. If you flood your sysadmins with minor alerts, they will either try to read them all or ignore them altogether – both poor outcomes! Every sysadmin I’ve ever spoken to gets a thousand-yard stare when I mention Nagios’s propensity to fill your mailbox with redundant information.

For simple record keeping, set up descriptive logging in a human-readable format to capture an event so you can dig into it later – e.g. “Production web server number of 500 errors”. A good rule of thumb for sensitivity is to log everything, but trigger alerts only on events that would equate to the standard syslog severity level of Error or higher (that is, Error, Critical, Alert and Emergency).
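As a rough sketch of that rule of thumb, the snippet below uses the standard syslog severity numbering (0 is Emergency, 7 is Debug, so a lower number means a more severe event) and routes anything at Error or above to an alert. The function names, the threshold constant and the second example message are our own inventions for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Standard syslog severity levels; a lower number is more severe."""

    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFORMATIONAL = 6
    DEBUG = 7


# Assumption: alert on Error and anything more severe, log the rest.
ALERT_THRESHOLD = Severity.ERROR


def should_alert(severity: Severity) -> bool:
    """True when the event is at least as severe as the threshold."""
    return severity <= ALERT_THRESHOLD


def handle(event: str, severity: Severity) -> str:
    """Decide where an event goes: 'alert' pages someone, 'log' just records it."""
    return "alert" if should_alert(severity) else "log"


print(handle("Production web server number of 500 errors", Severity.ERROR))  # alert
print(handle("Cache hit ratio dipped", Severity.WARNING))                    # log
```

The comparison is `<=` rather than `>=` because “higher” severity means a smaller number in syslog terms, which is an easy thing to get backwards.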


Each alert should capture at least these fields:

A simple, specific statement of what’s changed: a server offline, power supply interrupted, large numbers of users dropped, unusually long response times.

A list of product/service owners, escalation paths, and immediate corrective actions. This is a good place for some easy troubleshooting – if the team working overnight can solve the issue with a reboot, then you don’t need to take it any further. Runbooks are a life-saver in the small hours of the morning, giving the bleary-eyed ops team some simple guidance when nothing’s making sense.
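The fields above could be captured in something as simple as the following sketch. The class, field names, team names and runbook steps are all hypothetical; the point is only that every alert carries what changed, who owns it, where to escalate, and what to try first.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alert:
    """The minimum an alert should carry (illustrative field names)."""

    summary: str                  # simple, specific statement of what changed
    owners: List[str]             # product/service owners
    escalation_path: List[str]    # who to contact next, in order
    corrective_actions: List[str] = field(default_factory=list)  # runbook steps

    def render(self) -> str:
        """Format the alert as plain text for email or chat."""
        lines = [
            f"WHAT CHANGED: {self.summary}",
            f"OWNERS: {', '.join(self.owners)}",
            f"ESCALATE TO: {' -> '.join(self.escalation_path)}",
            "TRY FIRST:",
        ]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.corrective_actions, 1)]
        return "\n".join(lines)


alert = Alert(
    summary="Production web server offline",
    owners=["web-team"],
    escalation_path=["overnight ops", "web-team on-call", "engineering manager"],
    corrective_actions=["Reboot the server", "If still offline, escalate"],
)
print(alert.render())
```

Keeping the corrective actions inside the alert itself means the overnight team does not have to hunt for the runbook before trying the reboot.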


Further Tips


A final word

Every business has a different set of critical paths – you know your systems and people best. Alerts can be automated, but the wisdom behind them can’t be.

An alarm doesn’t mean panic when everyone knows there’s an established process they can trust.