Amazon CloudWatch Monitoring: Feature Spotlight

This is the first of a new series highlighting some key features and integrations we’ve launched. It’s a look at things we’ve been working hard to improve, as well as some features that customers have told us we’ve been a bit too quiet about until now. First up, we’re looking at monitoring your AWS services with Hosted Graphite.

We’ve had a lot of great feedback on our AWS CloudWatch add-on in the years since its release. It syncs your AWS metrics straight to your Hosted Graphite account so you can view your CloudWatch data on curated, interactive dashboards. As a result, the health of your CloudWatch services is very easy to check.

Data store

CloudWatch stores your five-minute data points for 63 days by default, while our maximum retention is 180 days. CloudWatch keeps your one-hour data points for 455 days, while we hold onto them for two years. We also give you access to a full alert history so you can track historic incidents and review anomalies.

Alerts and annotations

All teams have different sets of critical paths. Our alerts tell you when your metrics do something unexpected, if you need to act and how you might fix the problem. Flexible rules and notifications give you more control over what you hear and how you hear it. Choose to be alerted via email, PagerDuty, VictorOps, HipChat, Slack, OpsGenie or alternatively build your own integration with our webhooks. Whatever way you’re notified, for deeper context you can overlay automatic alert annotations on your historic graphs.

Sharing dashboards

It’s easy to share dashboards with your team. You can hide or share specific dashboards with particular users or verify access using Active Directory.

Graphite monitoring

All of the following CloudWatch services are covered:

  • Application Elastic Load Balancing
  • AWS Billing
  • CloudFront
  • DynamoDB (DDB)
  • Elastic Block Store (EBS)
  • Elastic Compute Cloud (EC2)
  • ECS Metrics
  • Elastic Load Balancing (ELB)
  • ElastiCache
  • Elastic MapReduce (EMR)
  • Kinesis Firehose
  • Kinesis Streams
  • Lambda Functions
  • Relational Database Service (RDS)
  • Redshift
  • Route 53
  • Simple Queue Service (SQS)

Getting set up

It’s quick to get started. If you’re already a Hosted Graphite customer, you can start monitoring your AWS services immediately. Just give us the access key for a read-only IAM user and we’ll take care of the rest. What you share is up to you: you can send us everything or choose the metrics that matter most. AWS lets you tag most of their resource types, so if you select the Service Tagging option and send us a tag name (and value), we’ll only collect metrics for resources with those specific tags.
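For illustration, a read-only IAM policy granting access to CloudWatch metrics might look something like this (a sketch only; check the getting started guide for the exact permissions the add-on needs):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:ListMetrics",
        "cloudwatch:GetMetricStatistics"
      ],
      "Resource": "*"
    }
  ]
}
```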

Read the full getting started guide.


Adventures in fault tolerant alerting with Python

We gave a presentation a couple of weeks ago at Python Ireland’s April meetup where we described our experiences with PySyncObj, a relatively new but solid library for building fault tolerant distributed systems in Python. Most of the services that run Hosted Graphite are built in Python, and this includes our alerting system. While that talk wasn’t recorded, this blog post discusses what we did, the tech we chose, and why.

A monitoring system without alerting

We’ve been running Hosted Graphite as a big distributed custom time-series database since 2012, and once we mastered the monitoring side of things, the next obvious step was an integrated alerting system.


We released a beta version of the alerting feature in 2016, and we’ve kept it technically in “beta” for the same reasons Google kept GMail in beta for more than five years: people are using it for real work, we’re supporting it, but we know we still have work to do and so there’s still a “beta” label on it for now.

One of the biggest things holding us back from calling the alerting feature fully baked was its failover characteristics. Almost everything else at Hosted Graphite is a distributed, fault-tolerant system, and we knew we needed the same for the alerting system. After all, we’re on-call 24/7 for it!

Keeping state

Some parts of the alerting system are easy to do failover for because they keep no state, like the streaming and polling monitors in this slide:

Basic alerting architecture slide, taken from the Python Ireland presentation.

The trouble starts when we consider the state kept by the Alerter component. The Alerter receives a series of “OK”, “OK”, “not OK!” messages from the streaming and polling monitors, and needs to keep a lot of those in memory in order to make decisions: which notifications to send, whether we need to wait a few more minutes to satisfy a “must be alerting for more than five minutes” constraint, whether we’ve already sent enough notifications, and so on. Discarding the alerter state each time there is a failure would cause a pretty poor user experience (duplicate, delayed and missing notifications), and we can’t tolerate that.
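As a hypothetical sketch (names and structure are ours, not the actual alerter code), the state behind a “must be alerting for more than five minutes” rule might look like:

```python
import time


class AlertState(object):
    """Sketch of per-alert state: notify only once a metric has been
    failing continuously for `hold_seconds` (e.g. five minutes)."""

    def __init__(self, hold_seconds=300):
        self.hold_seconds = hold_seconds
        self.failing_since = None
        self.notified = False

    def observe(self, ok, now=None):
        """Feed in one OK / not-OK check result; return True if we
        should send a notification right now."""
        now = time.time() if now is None else now
        if ok:
            # Recovery resets everything.
            self.failing_since = None
            self.notified = False
            return False
        if self.failing_since is None:
            self.failing_since = now
        if not self.notified and now - self.failing_since >= self.hold_seconds:
            self.notified = True
            return True
        return False
```

Lose this object on a node failure and you forget how long the metric has been failing and whether you already paged someone, which is exactly the duplicate/missing notification problem described above.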

Fine – just duplicate all the traffic!

We considered just duplicating all the traffic to keep multiple alerters in sync with checkpointing to allow new nodes to catch up. Kafka looked like a pretty good option at first, but it didn’t fare well in a review with the SRE team.


Here’s why:

  1. The data volume flowing to the alerter nodes isn’t huge – only a few thousand messages per second.
  2. We run everything on bare metal hardware for high performance and avoiding the noisy neighbour problem.
  3. While most of our users are served by our shared production infrastructure, we have many users whose needs demand a dedicated cluster. This isn’t a problem in itself, but it does mean that we’ll need to duplicate the services for each cluster, no matter how much load that customer puts on us.
  4. Kafka requires Zookeeper for cluster management too, so that’s yet another thing needing several machines. We already run etcd anyway, and we don’t want to run two services that do similar things.

Basically, Kafka is designed for far bigger data volumes than we needed to provide high availability for this one service, and it didn’t make a lot of sense to run it here.

“Using Kafka for this is like using a JCB to plant a daffodil.” — Hosted Graphite’s SRE team

Re-evaluating: what do we actually want?

We zoomed out a bit and reconsidered. What failover properties do we want the alerting system to have? Here’s what we settled on:

  • Nodes should exchange state among themselves.
  • Nodes should detect failure of a peer.
  • Nodes should figure out a new primary after the old one fails.
  • Nodes should assemble themselves into a new cluster automatically when we provision them.
  • … all without waking anyone up.

PySyncObj – pure Python Raft consensus

PySyncObj is a pure Python library that solves all of these requirements except basic service discovery, which we figured we could handle with etcd. PySyncObj allows us to define an object that is replicated among all nodes in the cluster, and it takes care of all the details: it uses the Raft protocol to replicate the data, elect a leader, detect when nodes have failed, deal with network partitions, and so on.

Following Python’s “batteries included” philosophy, PySyncObj includes a set of basic replicated data types: a counter, dictionary, list, and a set. You can also define custom replicated objects with a couple of extra lines and a decorator, which is pretty awesome.

Here’s a simple example of how easy it is to replicate a dictionary across three servers with PySyncObj:

from pysyncobj import SyncObj
from pysyncobj.batteries import ReplDict

dict1 = ReplDict()

syncobj = SyncObj('serverA:4321', ['serverB:4321', 'serverC:4321'],
                  consumers=[dict1])

dict1['somekey'] = 'somevalue'

All the distributed Raft magic is done for you. Pretty fantastic. For more, check out the other examples on the PySyncObj GitHub page.

Monitoring the new Raft cluster

PySyncObj exports a bunch of data about the state of the Raft cluster with the getStatus() function:

# Get internal Raft cluster state
status_dict = self.sync_obj.getStatus()

# status_dict contains entries like:
# {
#     'readonly_nodes_count': ...,
#     'log_len': ...,
#     'unknown_connections_count': ...,
#     'last_applied': ...,
#     'uptime': ...,
#     'match_idx_count': ...,
#     'partner_nodes_count': ...,
#     'state': ...,
#     'leader_commit_idx': ...,
#     'next_node_idx_count': ...,
#     'commit_idx': ...,
# }

To monitor this data while we built confidence in the new service before rolling it out, we just fired it into Hosted Graphite using the dead simple graphiteudp library:

def send_metrics(self):
    # (the iterable was elided in the original post; we assume the
    # values come from getStatus())
    for key, value in self.sync_obj.getStatus().items():
        metric_path = "%s.%s.%s" % (self.hg_api_key, self.metric_prefix, key)
        graphiteudp.send(metric_path, value)

The result is a pretty dashboard detailing the Raft cluster state for production and staging environments:


Chaos monkey testing

At this stage, we were pretty confident that this was the right route to take, but we wanted more. The only way to make sure your failover handling works is to continuously make it fail and see how it reacts, and that’s exactly what we did. We stood up a full-size test cluster and ran a copy of all of the alerting traffic from the production environment through it. The output was discarded so customers didn’t get duplicate alerts, but we were collecting all the same metrics about the output that we usually do.

Randomly restarting alerter nodes.

In the style of the Netflix chaos monkey, we wrote a small tool to wait a random amount of time, and then restart one of the nodes without warning. We ran this for several days and it worked flawlessly – every time we lost a node, another was elected the new leader in a few seconds, the restarted node recovered the state it lost, and the overall output of the cluster (alerts checked, notifications that would have been sent, etc) was steady throughout.
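Our little tool isn’t public, but the idea fits in a few lines. Here’s a sketch, assuming SSH access and a systemd-managed service (the hostnames and the “alerter” service name are hypothetical):

```python
import random
import subprocess
import time

# Hypothetical hostnames for the alerter cluster nodes.
NODES = ["alerter-1", "alerter-2", "alerter-3"]


def pick_victim(nodes):
    """Choose one node, uniformly at random, to restart next."""
    return random.choice(nodes)


def chaos_loop(nodes, min_wait=300, max_wait=3600):
    """Wait a random interval, then restart one node without warning, forever."""
    while True:
        time.sleep(random.uniform(min_wait, max_wait))
        victim = pick_victim(nodes)
        # "alerter" is a hypothetical service name; substitute your own.
        subprocess.run(
            ["ssh", victim, "sudo", "systemctl", "restart", "alerter"],
            check=False,
        )
```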

Emergency options

Distributed systems are pretty fantastic when they work well, but it’s easy to get into a state where you just want a simple option to take control of a situation. In case we ever need it, we built in a ‘standalone’ mode, where all the state is persisted to disk regularly. During a particularly chaotic incident, we have the option of quickly bypassing the clustered automatic failover functionality while we get a situation under control. We hope never to need it, but it’s nice that it’s there.

Current state

After a couple of weeks building confidence in the new alerter cluster, we quietly launched it to… no fanfare at all and no user interruption during the migration. In the weeks since launch, it has already faced several production incidents and fared well in all but one: a small bug that took the alerting service out of the 99.9% SLA for the month. Oops. Despite all the testing, there’ll always be something to catch you out.

After more than a year of effort by our dev and SRE teams, the alerting service will soon have the ‘beta’ label torn off, and this automatic failover work, already in production, forms a crucial piece of that.

Here’s what it looks like now:


Service discovery for the alerter cluster is handled by etcd, and PySyncObj takes care of all other cluster operations. We’re pretty happy with this – we’re able to avoid waking someone on the SRE team for losing a machine or two, resize and upgrade the cluster without maintenance periods, and scale it up and down to trade off SLA requirements against cost, all of which is pretty powerful stuff.


To learn more, follow us on Twitter.

Hosted Graphite’s Alerting now integrates with OpsGenie!

TL;DR: Hosted Graphite’s alerting feature now integrates with OpsGenie, including auto-resolving incidents according to the alerting rules.

Hosted Graphite’s alerting feature continues to sprout new functionality – we just launched the ability to send notifications of infrastructure problems straight to your on-call engineering team via OpsGenie.

If you’re not familiar with OpsGenie, here’s how they describe their service: “OpsGenie is an incident management solution for dev & ops teams, providing on-call schedule management, escalations and alerting via email, SMS, phone and mobile push notifications.”

Here’s what the integration of our two complementary services looks like in the OpsGenie UI:


Automatic incident resolution

Being notified of a problem is one thing, but once the incident has been dealt with you’ll often need to manually mark the incident as ‘resolved’. If you’re using Hosted Graphite and OpsGenie, this step is unnecessary because you can have your Hosted Graphite alerts automatically close your OpsGenie incident:


This saves you time and frustration, not only for incidents that require a lot of attention but for those inevitable quick blips that resolve themselves after a few minutes, perhaps after a brief network interruption. In those cases, having Hosted Graphite resolve an OpsGenie incident automatically reduces frustration because your responding engineer will see the incident is no longer open, sometimes before they’ve even managed to get back to their keyboard.

For longer incidents, it is incredibly helpful to have your monitoring tell your ops team when everything is OK again, rather than having to check the state of the alerts and manually resolve your OpsGenie incidents. This approach lowers stress (“Whew, all resolved!”), reduces confusion (“What’s the state of this incident right now?”), and saves time for everyone.

Setting up Hosted Graphite and OpsGenie

Setting it up involves just three steps:

1. In your OpsGenie account, find the Hosted Graphite integration. Copy the API key, and click ‘Save integration’.


Don’t forget to click the ‘Save integration’ button! (and don’t worry, the example key in the gif has been deactivated)

2. In your Hosted Graphite account, add a new Notification Channel for OpsGenie:


3. Configure an alert to use the new OpsGenie Notification Channel:


That’s it! The next time a Hosted Graphite alert fires, OpsGenie will know about it seconds later and start notifying your team according to the schedules you’ve set up and the notification preferences of your team. When Hosted Graphite’s monitoring detects that your alert is in a healthy state again, the incident will be automatically resolved in OpsGenie.

More resources:

If you have any trouble, just send our stellar support team an email:

System monitoring – what are my options? (part 2)

In part one of this series on system monitoring libraries we checked out some popular libraries used to monitor servers. In this follow-up, we take a look at a few more options and make a recommendation to answer the question ‘which of the many available monitoring tools is best for your environment?’


Diamond is a Python daemon for collecting system metrics and presenting them to Graphite. Diamond is good for scale and customization.


Extensibility – Diamond uses a straightforward model: add a collector to the configuration file to add new monitoring. This makes it low friction to scale to dozens or even hundreds of servers, because each instance is the same and responsible for reporting its own metrics. Diamond can handle it, too – the project claims up to three million datapoints per minute across 1,000 servers without undue CPU load.

Variety – Support extends to a range of operating systems with the documentation to back it up. Diamond comes with hundreds of collector plugins plus it lets you customize collectors and handlers with little effort for metrics gathering from nearly any source. Installation is easy, too.
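For instance, enabling the bundled CPU collector is typically just a tiny per-collector config file (the path and options here are illustrative and may vary by install):

```ini
# /etc/diamond/collectors/CPUCollector.conf
enabled = True
interval = 10
```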



Functionality – Collection is all Diamond does. It can talk directly to tools such as Graphite, but many setups still choose to use an aggregator like StatsD in addition to Diamond for their application metrics.

Updates – Brightcove, the original developers, stopped working on Diamond and it graduated to a standalone open source project. That’s noticeably slowed its release cycle.  Diamond is a mature and well-established project, though, so decide for yourself how much of an issue this is.


StatsD is a metrics aggregation daemon originally released by Etsy. It is not a collector like other tools on this list but instead crunches high-volume metrics data and forwards statistical views to your graphing tool.


Maturity – The simple, text-based protocol has spawned client libraries in nearly every popular language. You can plug into just about any monitoring or graphing back end. StatsD has been around a long time, and it shows.

Free-Standing – The StatsD server sits outside your application – if it crashes, it doesn’t take anything down with it. Listening for UDP packets is a good way to take in a lot of information without a latency hit to your application, or needing to worry about maintaining TCP connections.
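The protocol itself shows why UDP fits so well: each metric is a single plain-text datagram of the form “<name>:<value>|<type>”. A minimal sketch (host, port, and metric names are placeholders):

```python
import socket


def statsd_line(name, value, mtype):
    """Format one metric in StatsD's plain-text protocol: '<name>:<value>|<type>'."""
    return ("%s:%s|%s" % (name, value, mtype)).encode()


# Fire-and-forget over UDP: no connection setup, no blocking, no app latency.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(statsd_line("app.logins", 1, "c"), ("127.0.0.1", 8125))          # counter
sock.sendto(statsd_line("app.response_ms", 320, "ms"), ("127.0.0.1", 8125))  # timer
```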



Limited Functionality – StatsD is an aggregator more than a collector. You still need to instrument your application to send data to it. For thresholds or alerts, you’ll need to build in a backend or use something like Nagios.

Data Reliability – Fire-and-forget UDP has downsides, too – for dispersed networks or essential data, packet loss is a risk. Also, if you are using vanilla Graphite and send StatsD more than it can handle within its flush interval, those metrics are dropped and won’t graph. Hosted Graphite can handle it, though. 🙂


Zabbix is an open source monitoring framework written in C. Zabbix positions itself as an enterprise-level solution and so offers all-in-one monitoring that handles collection, storage, visualization and alerting.



Range of Coverage – Zabbix can track not just in real time, but trends and historical data, too. Monitoring extends to hardware, log files, and VMs. A small footprint and low maintenance mean you can fully exploit what Zabbix has to offer.

Convenience – Don’t like plug-ins? Nearly all the functionality you might want is built in to Zabbix. If something is missing, there are extensive, simple customization options. Templates tell you what to collect and configurations are set centrally through the GUI.



New Alerts  – Service discovery and configuring things like triggers are both more involved than they really ought to be. Tools like Graphite and StatsD can start to track a new metric just by referencing it in your application. Zabbix is not one of those tools.

Large Environments – Zabbix doesn’t do well on big networks – all that historical data is greedy for database space, PHP for the front end has performance limitations, and debugging is awkward. For an enterprise-level system, that’s a bit underwhelming.



So – what do we recommend? One set of criteria for running monitoring services in production goes like this:

  1. Operationally it should play well with others – no taking down a box if something goes wrong, no dependency hell, no glaring security issues.
  2. Extensible without hacking on source code – a library should have some mechanism for supporting the entire stack you’re using without having to mutilate the code to get it working.
  3. Well supported by the community – Lots of users means that you know that bugs will be squashed, and updates will arrive to support new measuring scenarios or technologies.
  4. Works to its strengths – A library that tries to do everything itself ends up doing nothing particularly well.  Opinionated design means that a library can focus on the important issues without adding in pointless frivolities.

So – If I were to pick based on these criteria, I’d recommend either Diamond or CollectD.  They both handle the collection of your data admirably with extensible plugins, and can forward it on to a storage and visualization service like Hosted Graphite (we even give you pre-made dashboards for both services!). They’re both well supported by the open-source community and play nicely with your systems.

If you’re in the Java ecosystem there may be some natural attraction to DropWizard, or StatsD if you’re using a PaaS such as Heroku – but if you’re running your own servers or using AWS, CollectD or Diamond are a good fit.

System monitoring – what are my options?

There are many options for system monitoring – so many, in fact, that a lot of people turn to one of the two worst options: writing your own, or getting stuck in analysis paralysis and doing nothing.

Monitoring your systems and alerting when something weird happens is crucial to understanding and tackling issues as early as possible. That means allowing you to work on the activities you have planned instead of reacting to outages, and ultimately keeping your customers happy.

There’s a whole range of free tools that monitor your systems and create metrics for you to graph, evaluate, and use to create alerts. In part 1 of this series, we’ll explore the pros and cons of three of these popular libraries.


CollectD is a daemon that gathers system information and passes it on to Graphite. It is, as the name suggests, a collector rather than a monitoring tool, and stresses modularity, performance and scale.


Quick and Easy – Setup is straightforward, configuration is painless, and maintenance is minimal. As a multithreaded daemon written in C, it’s light on system resources and fast on clients. CollectD supports multiple servers and has a multitude of ways to store data.

Plugins – CollectD has a pile of ‘em: for specialized servers, for sending metrics to various systems, for logging and notification, for nearly anything. The default is enough to get started, but there’s plenty of flexibility once you get going. It plays nicely with Graphite.
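As an illustration, wiring CollectD to a Graphite endpoint takes only a few lines of collectd.conf (the hostname and port here are placeholders):

```
LoadPlugin cpu
LoadPlugin write_graphite

<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.example.com"
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>
```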


No GUI – CollectD is not a graphing tool – it simply spits out RRD files. There are scripts for a minimal interface packaged with it, but even the project admits that it’s not up to much. You’ll need to plug into Graphite or something similar to read CollectD’s outputs effectively.

Too Much Info – Sub-minute refreshing and the variety of plugins make it easy to overreach. If you ask for a lot of statistics from a node, you may get more data than you can graph and read effectively.



Munin is a resource and performance monitoring tool written in Perl. It doesn’t provide alerting, but Munin is a robust solution for cleanly presenting a lot of network data.


Out-of-the-Box – Munin stresses ease of use; installation and configuration take minutes. Writing code to extend monitors is so simple you can use it for non-performance tasks like counting web traffic. You can set thresholds in Munin, but there is a recommended Nagios plugin to generate alerts.

Plug and Play – Like CollectD, Munin has a wide range of plug-ins to choose from: just grab a few scripts from the Plugin Gallery. The more elegant plug-ins can monitor long-view trends like annual resource usage. Writing new plug-ins for yourself is also no trouble.
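To illustrate how simple a plugin can be, here’s a sketch of one that counts logged-in users (the field and graph names are our own invention; any executable works, and Munin calls it with “config” for graph metadata and with no arguments for the current value):

```python
#!/usr/bin/env python
# A minimal Munin plugin sketch.
import subprocess
import sys


def munin_line(field, value):
    """Format a value line in Munin's plain-text plugin protocol."""
    return "%s.value %s" % (field, value)


def logged_in_users():
    """Count non-empty lines of `who` output."""
    try:
        out = subprocess.check_output(["who"]).decode("utf-8", "replace")
    except OSError:
        return 0  # `who` unavailable; a real plugin would pick a metric you have
    return len([line for line in out.splitlines() if line.strip()])


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "config":
        print("graph_title Logged-in users")
        print("users.label users")
    else:
        print(munin_line("users", logged_in_users()))
```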


Central Server – Each server you’re monitoring runs a Munin process; these servers then connect to a main server. This model can lead to performance issues when the scale rises to hundreds of servers. Budgeting for that dedicated server will need to come sooner rather than later.

Graphs – The graphs generated by Munin are static – not ideal if you want some interactive views of your data. Also, these HTML graphs redraw after every refresh, creating big disk I/O and CPU hits on your system. As a whole, it’s pretty dated.



Dropwizard is a Java framework that supports metrics and ops tools for web services. This collection of best-in-breed libraries is built for speed and robustness.


Built-in Metrics – Choose which service calls to instrument and performance metrics are collected automatically. Health checks publish metrics by service, too – handy if you’re doing a lot of REST calls. Add in service configuration validation as a default feature and Dropwizard is quick to both deploy and change.

Container-less – All resources and dependencies are packed into fat JARs, making it a snap to write micro-services or add instances. Default configurations are sensible and updates are easy, too – you can deploy with one line.


Performance – Each request has its own thread – even tuning maxThreads and maxConnections may not help throughput. This is problematic for the kind of I/O-bound applications that Dropwizard is likely to service. Dropwizard’s light weight cuts both ways – if you have high loads and a lot of developers, other options may work better.

Support – Dropwizard has an active community, but it’s not as active as when Coda Hale was developing it. The cadence of releases can stretch to months. Documentation could be meatier, and even StackOverflow doesn’t discuss it as much as other tools.


In the next article, we’ll check out a few other useful libraries and dig through the main factors you might look at when making a decision.

Enabling remote work

At Hosted Graphite, we rely on remote work – our CEO works full-time from the US and the rest of the team work from Ireland. We have a flexible policy on working from home (essentially, Nike-style: just do it). As long as work gets done, we don’t sweat the details of when or where it happens. Some work from home regularly, and some rarely.

The key to effective remote teamwork is communication. Every interaction with a colleague is easier in person, which means that when we’re remote we need to put in a bit more effort, and we need to know exactly when to put that effort in.

If you’re a remote worker or a team that works with remote employees or freelancers, then we hope these tips will make your life a little easier.


Declaring

It’s crucial that your team knows when you’re working and when you’re unavailable. If you just drift in and out all day, your colleagues won’t be able to rely on you for a discussion or a decision because they never know when you’re going to be responsive.

When your day starts, declare it to the team. Nothing complicated, just “Morning! Working from home today.” will do. When you’re done for the day, make sure you say it: “I’m outta here, talk to yis tomorrow.” If you’re on a call, in a meeting or out for lunch then make your remote team aware of it, (“Lunch, back in 30”) especially if timezones are involved.

Being present

Related to “Declaring” but different is being present. The team should feel that they can call on you when you’re around, just as they would in the office. This means checking your communication tools regularly, or keeping notifications turned on. Basically, don’t hide and isolate yourself; be available for your colleagues, because they can’t tap you on the shoulder as they would in the office. If you do need to go quiet for a while to focus on a task, declare it so people know what availability to expect.

Discussions and decisions

With some people in the office and some remote, it quickly becomes apparent to anyone remote that decisions made in person in the office don’t get discussed online. If this happens often enough, you feel completely disconnected from the decision-making process, which quickly leads to dissatisfaction and demotivation.

Often it’s easier to just turn to each other in the office for a discussion and that’s fine, but if you know one of the remote folks might like to comment you should (1) make them aware that there is a discussion, and (2) give them a chance to either take part or say they’re happy to go with what the in-office folks decide on that issue. Often just leaving a little note like “X and Y discussed this in person and decided it’s OK” on a pull request, or saying “We’re discussing this in the office – Z, do you have an opinion on this one?” is a big step forward.

Being clear about when and how decisions are made and discussions are had will go a long way toward dealing with the feeling of disconnection that remote workers can feel.

Virtual standups

Usually, some of the teams within the company will have a daily standup where they talk through what they’re doing and any blocking points they have before deciding on what they’ll do next, and discussing the reasoning behind any changes in direction.
Once an in-person standup is done, the team will summarise it on Slack to allow remote folks to participate, and also to create a record for answering the “Where /did/ the week go?” question. A bot pings everyone to update their standup notes at the same time every day.

Usually this is a quick three-liner:

Done: (What I've done so far, how long it took)
Doing: (What I'm doing now, how long I've spent on it)
Next: (What task I'm switching to soon)

From this, anyone can see at a glance what the whole team are working on and how long it’ll take for us to push a feature, fix, or new integration. We can make sure any marketing efforts are keeping pace with the development team, and any new automation needed is in place. If anyone’s getting stuck it becomes pretty clear and we can figure it out.


Not everyone does the dev-oriented daily standup notes. We also keep a git repo of what we call ‘snippets’, somewhat modelled on Google’s snippets. This is a daily summary of the stuff that people work on. It’s an optional way of describing the building blocks of the day – particularly useful for developers that are usually remote and anyone working on softer, non-dev work. My day usually looks something like this:

June XXth 2016
– Call with Charlie
– Sales call with Customer X
– Talking with our Lawyer about Y
– Support followup with Customer Z
– One-to-one call with <team member>
– Talking with partner company A
– Reviewing our marketing conversion rates

This has the added benefit of highlighting regular tasks that might be better off automated. If you’re doing something that a computer can do on a semi-regular basis, you probably just need to leave it to a script. If someone’s regularly working on low-value tasks we can direct that attention somewhere more valuable. There have been some great articles on the difference between ‘action’ vs ‘work’ – work drives your business forward, action smells like work but has no outcome (e.g. any task that starts with “investigating” is usually a giveaway).

It’s also a good way of noticing where you’re spending time that you shouldn’t be. For example, if you’re supposed to be doing product and development management but you have lots of actual development tasks in your snippets, that’s evidence that maybe you don’t have the balance right.

Snippets can also be a helpful tool for giving the team a chance to see what the non-technical management types do all day – how much is support, sales, hiring, dealing with the accountant, etc.

Planning and task management

We keep a set of Trello boards which we separate into a few different themes:

  1. Product – Adding features to Hosted Graphite, or improving existing ones.
  2. Growth – Efforts to get new users into and through our sales funnel, improve conversion, or promote referrals.
  3. Bugs – Anything that’s creating a less than amazing experience for our customers (or operations team!) and needs to be fixed, or technical debt that needs to be paid off.
  4. The Salt Mines – Current issues/projects, what we’re working on in the next week or two.

We then divide these boards into Short, Medium, and Long-term sections. As co-founders and managers of the product Charlie and I have a call a few times a week to wrangle these tasks between queues as necessary, or break them into smaller tasks. Once the tasks are on the Salt Mines board developers are free to pull items from the ‘Next’ queue and drop them into ‘Doing’, then ‘Done’, Kanban-style.


Keeping a partially remote team on the same page is tricky, and possibly even harder than a fully remote team because there are more edge cases. We hope this insight into how we run a partially remote team is helpful, and that you’ll find some of these tips useful for your own team.

Alerting from first principles

An Introduction to Alerting

Having recently launched our alerting feature for Graphite, we thought it’d be useful to put together a short primer on alerting. What do you need to look at when considering what you alert on, and where those alerts go? An early warning system is only as good as its alarms.

What is alerting?

Monitoring uses alerts to tell you when something unexpected happens, whether you need to act, and how you might fix the problem. Good alerts give you the right context to act and enough lead time to be effective. Bad alerts tell you what you already know or don’t need to hear – once you know a database is down, you don’t need to be reminded every minute.

If monitoring gives you data, then alerting gives you information.

How to do Alerts

Done properly, your alerts should trigger only for states or events that require attention or intervention. If you flood your sysadmins with minor alerts, they will either try to read them all or ignore them altogether – both poor outcomes! Every sysadmin I’ve ever spoken to gets a thousand-yard stare when I mention Nagios’s propensity to fill your mailbox with redundant information.

For simple record keeping, set up descriptive logging in a human-readable format to capture an event so you can dig into it later – e.g. logging the number of 500 errors on a production web server. A good rule of thumb for alerting sensitivity is to trigger alerts only on what would equate to syslog severity levels of Error and higher.
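As a minimal sketch of that rule of thumb, here is one way to route everything to a human-readable log while forwarding only Error and higher to an alerting channel. The `AlertHandler` class and the `notify` callable are our own illustrative names, not part of any real alerting API:

```python
import logging

# Hypothetical alert hook: anything at ERROR or above is forwarded to an
# alerting channel; lower severities are only logged for later digging.
class AlertHandler(logging.Handler):
    def __init__(self, notify):
        super().__init__(level=logging.ERROR)  # ignore anything below Error
        self.notify = notify  # callable that delivers the alert (email, webhook, ...)

    def emit(self, record):
        self.notify(self.format(record))

alerts = []
logger = logging.getLogger("webapp")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())      # human-readable record keeping
logger.addHandler(AlertHandler(alerts.append))  # alert only on Error and higher

logger.info("production web server: 12 HTTP 500 errors in last 5m")    # logged, no alert
logger.error("production web server: 500 error rate above threshold")  # triggers an alert
```

The standard library does the severity filtering for us: the logger only hands a record to a handler whose level it meets, so tuning alert sensitivity is a one-line change to the handler’s level.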


Each alert should capture at least these fields:

  • Status – What’s wrong?

A simple, specific statement of what’s changed: a server offline, power supply interrupted, large numbers of users dropped, unusually long response times.

  • Priority – How urgent is it?
      • High – Something is on fire that must be fixed; wake the right person to tackle the problem. A smoke alarm in a data centre needs a quick response from your on-call engineer, and probably the Fire Department, too.
      • Medium – Something needs action but not right away; check the logs tomorrow so technical staff can follow up. Your secondary backup server running low on disk space is a risk for you to deal with this month, but not a crisis today.
  • Low – Something unusual happened; email the details to create an evidence trail for later investigation. There are weird traffic patterns on your internal network – is everyone streaming Game of Thrones clips on Monday morning? Have a look when you get the chance.
  • Next steps – What do we do?

A list of product/service owners, escalation paths, and immediate corrective actions. This is a good place for some easy troubleshooting – if the team working the overnight shift can solve the issue with a reboot, then you don’t need to take it any further. Runbooks are a life-saver in the small hours of the morning, giving the bleary-eyed ops team some simple guidance when nothing’s making sense.
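The fields above can be sketched as a simple structure. This is illustrative only, assuming nothing about any particular alerting system’s schema – the class and field names are our own:

```python
from dataclasses import dataclass, field
from enum import Enum

class Priority(Enum):
    HIGH = "high"      # wake the on-call engineer now
    MEDIUM = "medium"  # needs action, but working hours will do
    LOW = "low"        # record an evidence trail for later investigation

# Hypothetical alert record capturing the minimum fields discussed above.
@dataclass
class Alert:
    status: str                 # what's wrong, stated simply and specifically
    priority: Priority          # how urgent is it
    owner: str                  # who owns this service / who to escalate to
    next_steps: list = field(default_factory=list)  # runbook-style actions

smoke = Alert(
    status="smoke detected in rack B4",
    priority=Priority.HIGH,
    owner="dc-ops on-call",
    next_steps=["page the on-call engineer", "call the Fire Department"],
)
```

Keeping `next_steps` on the alert itself is what turns a 3 a.m. page from a puzzle into a checklist.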


Further Tips

  • Tune your thresholds regularly to eliminate noise and create alerts for previously undetected incidents. If load spikes during commercial breaks in the big game, tweak your alerts to accommodate that.
  • Don’t confuse priority and severity. Extra processing time for an ecommerce transaction, for example, might be a medium severity defect; but priority depends on factors such as user traffic and SLA terms. What’s an inconvenience on Easter Sunday could be a potential disaster on Black Friday!
  • Disable alerts for test environments, maintenance windows, and newly deployed services – waking someone up for a false positive makes for an angry ops team.
  • Update your call sheet with current contact details – when time is crucial, there’s no room to chase down the former service owner who handed over their admin rights last month.


A final word

Every business has a different set of critical paths – you know your systems and people best. Alerts can be automated, but the wisdom behind them can’t be.

  • Establish the remediation procedures that will be kicked off by alerts.
  • Discuss with engineers the kind of diagnostic data that is useful to them – Hosted Graphite alerts can drop graph images directly into Hipchat and Slack.
  • Write a text description for each alert that gives unambiguous instructions for resolution.

An alarm doesn’t mean panic when everyone knows there’s an established process they can trust.