Hosted Graphite’s Alerting now integrates with OpsGenie!

TL;DR: Hosted Graphite’s alerting feature now integrates with OpsGenie, including auto-resolving incidents according to the alerting rules.

Hosted Graphite’s alerting feature continues to sprout new functionality – we just launched the ability to send notifications of infrastructure problems straight to your on-call engineering team via OpsGenie.

If you’re not familiar with OpsGenie, here’s how they describe their service: “OpsGenie is an incident management solution for dev & ops teams, providing on-call schedule management, escalations and alerting via email, SMS, phone and mobile push notifications.”

Here’s what the integration of our two complementary services looks like in the OpsGenie UI:


Automatic incident resolution

Being notified of a problem is one thing, but once the incident has been dealt with you’ll often need to manually mark the incident as ‘resolved’. If you’re using Hosted Graphite and OpsGenie, this step is unnecessary because you can have your Hosted Graphite alerts automatically close your OpsGenie incident:


This saves you time and frustration, not only for incidents that require a lot of attention but for those inevitable quick blips that resolve themselves after a few minutes, perhaps after a brief network interruption. In those cases, having Hosted Graphite resolve an OpsGenie incident automatically reduces frustration because your responding engineer will see the incident is no longer open, sometimes before they’ve even managed to get back to their keyboard.

For longer incidents, it is incredibly helpful to have your monitoring tell your ops team when everything is OK again, rather than having to check the state of the alerts and manually resolve your OpsGenie incidents. This approach lowers stress (“Whew, all resolved!”), reduces confusion (“What’s the state of this incident right now?”) and saves time for everyone.

Setting up Hosted Graphite and OpsGenie

Setting it up involves just three steps:

1. In your OpsGenie account, find the Hosted Graphite integration. Copy the API key, and click ‘Save integration’.


Don’t forget to click the ‘Save integration’ button! (and don’t worry, the example key in the gif has been deactivated)

2. In your Hosted Graphite account, add a new Notification Channel for OpsGenie:


3. Configure an alert to use the new OpsGenie Notification Channel:


That’s it! The next time a Hosted Graphite alert fires, OpsGenie will know about it seconds later and start notifying your team according to the schedules you’ve set up and the notification preferences of your team. When Hosted Graphite’s monitoring detects that your alert is in a healthy state again, the incident will be automatically resolved in OpsGenie.
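To make the fire-and-resolve behaviour concrete, here is a minimal sketch of the idea in Python. It is an illustration only, not how Hosted Graphite or OpsGenie implement it: the state names (`healthy`, `alerting`), the action names (`open`, `resolve`) and the `incident_action` function are all assumptions made up for this example.

```python
from typing import Optional


def incident_action(previous_state: str, current_state: str) -> Optional[str]:
    """Map an alert state transition to a hypothetical incident action."""
    if previous_state == "healthy" and current_state == "alerting":
        return "open"      # alert fired: open an incident and start notifying
    if previous_state == "alerting" and current_state == "healthy":
        return "resolve"   # alert recovered: close the incident automatically
    return None            # no change in state: nothing to send


# A brief blip: the alert fires, stays unhealthy for one check, then recovers.
states = ["healthy", "alerting", "alerting", "healthy"]
actions = [incident_action(a, b) for a, b in zip(states, states[1:])]
print(actions)
```

The point of the sketch is that only transitions produce an action, so a blip that clears on its own results in one “open” and one “resolve” with no manual step in between.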

More resources:

If you have any trouble, just send our stellar support team an email:

System monitoring – what are my options? (part 2)

In part one of this series on system monitoring libraries we checked out some popular libraries used to monitor servers. In this follow-up, we take a look at a few more options and make a recommendation to answer the question ‘which of the many available monitoring tools is best for your environment?’


Diamond is a Python daemon for collecting system metrics and presenting them to Graphite. Diamond is good for scale and customization.
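As a rough illustration of what “collecting system metrics and presenting them to Graphite” boils down to, the sketch below formats load averages as lines in Graphite’s plaintext format (`path value timestamp`). This is not Diamond’s collector API; the function names and the `servers.web01` prefix are assumptions chosen for the example.

```python
import os
import time


def format_graphite_line(path, value, timestamp):
    """Render one datapoint in Graphite's plaintext format: 'path value timestamp'."""
    return f"{path} {value} {int(timestamp)}\n"


def collect_load(prefix, loadavg, timestamp):
    """Turn a (1m, 5m, 15m) load average tuple into three Graphite lines."""
    names = ("01", "05", "15")  # hypothetical metric name suffixes
    return [
        format_graphite_line(f"{prefix}.loadavg.{name}", value, timestamp)
        for name, value in zip(names, loadavg)
    ]


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    for line in collect_load("servers.web01", os.getloadavg(), time.time()):
        print(line, end="")
```

A real collector daemon adds scheduling, configuration and delivery on top of this, which is the part Diamond takes care of for you.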


Extensibility – Diamond uses a straightforward model: add a collector to the configuration file to add new monitoring. This makes it low friction to scale to dozens or even hundreds of servers, because each instance is the same and responsible for reporting its own metrics. Diamond can handle it, too – the project claims up to 3 million datapoints per minute on 1000 servers without undue CPU load.

Variety – Support extends to a range of operating systems with the documentation to back it up. Diamond comes with hundreds of collector plugins plus it lets you customize collectors and handlers with little effort for metrics gathering from nearly any source. Installation is easy, too.



Functionality – Collection is all Diamond does. It can talk directly to tools such as Graphite, but many setups still choose to use an aggregator like StatsD in addition to Diamond for their application metrics.

Updates – Brightcove, the original developers, stopped working on Diamond and it graduated to a standalone open source project. That’s noticeably slowed its release cycle.  Diamond is a mature and well-established project, though, so decide for yourself how much of an issue this is.


StatsD is a metrics aggregation daemon originally released by Etsy. It is not a collector like other tools on this list but instead crunches high-volume metrics data and forwards statistical views to your graphing tool.
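To show what “crunching” means here, this is a simplified sketch of aggregating one flush interval’s worth of StatsD-style lines (`name:value|type`). It is not StatsD’s actual code: it only handles counters (`c`) and timers (`ms`), ignores sample rates, and the output names (`.count`, `.mean`, `.upper`) are a simplified assumption about how the summary might be labelled.

```python
from collections import defaultdict


def aggregate(lines):
    """Aggregate StatsD-style 'name:value|type' lines from one flush interval."""
    counters = defaultdict(float)
    timers = defaultdict(list)
    for line in lines:
        name, rest = line.split(":", 1)
        value, metric_type = rest.split("|", 1)
        if metric_type == "c":        # counter: sum the increments
            counters[name] += float(value)
        elif metric_type == "ms":     # timer: keep samples for summary stats
            timers[name].append(float(value))
    summary = dict(counters)
    for name, samples in timers.items():
        summary[f"{name}.count"] = len(samples)
        summary[f"{name}.mean"] = sum(samples) / len(samples)
        summary[f"{name}.upper"] = max(samples)
    return summary


received = ["logins:1|c", "logins:1|c", "logins:3|c", "render:120|ms", "render:80|ms"]
print(aggregate(received))
```

Five incoming datapoints become four summary values, and the reduction gets far larger at real traffic volumes, which is why the graphing back end only sees one statistical view per interval.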


Maturity – The simple, text-based protocol has spawned client libraries in nearly every popular language. You can plug into just about any monitoring or graphing back end. StatsD has been around a long time, and it shows.

Free-Standing – The StatsD server sits outside your application – if it crashes, it doesn’t take anything down with it. Listening for UDP packets is a good way to take in a lot of information without a latency hit to your application, or needing to worry about maintaining TCP connections.
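Here is a minimal sketch of that fire-and-forget approach using only Python’s standard library. The host and port are assumptions (8125 is the port StatsD commonly listens on, so adjust both for your setup), and the metric name `signups` is made up for the example.

```python
import socket


def format_counter(name, value=1):
    """Build a StatsD counter line, e.g. 'signups:1|c'."""
    return f"{name}:{value}|c"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP; returns without waiting for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Assumed host and port: point these at wherever your StatsD server runs.
    send_metric(format_counter("signups"))
```

There is no connection to set up and no response to wait for, so the application carries on even if nothing is listening at the other end.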



Limited Functionality – StatsD is an aggregator more than a collector. You still need to instrument your application to send data to it. For thresholds or alerts, you’ll need to build in a backend or use something like Nagios.

Data Reliability – Fire-and-forget UDP has downsides, too – for dispersed networks or essential data, packet loss is a risk. Also, if you are using vanilla Graphite and send too much to StatsD in its flush interval, those metrics drop and won’t graph. Hosted Graphite can handle it though. 🙂


Zabbix is an open source monitoring framework written in C. Zabbix positions itself as an enterprise-level solution and so offers all-in-one monitoring that handles collection, storage, visualization and alerting.



Range of Coverage – Zabbix can track not just in real time, but trends and historical data, too. Monitoring extends to hardware, log files, and VMs. A small footprint and low maintenance means you can fully exploit what Zabbix has to offer.

Convenience – Don’t like plug-ins? Nearly all the functionality you might want is built in to Zabbix. If something is missing, there are extensive, simple customization options. Templates tell you what to collect and configurations are set centrally through the GUI.



New Alerts – Service discovery and configuring things like triggers are both more involved than they really ought to be. Tools like Graphite and StatsD can start to track a new metric just by referencing it in your application. Zabbix is not one of those tools.

Large Environments – Zabbix doesn’t do well on big networks – all that historical data is greedy for database space, PHP for the front end has performance limitations, and debugging is awkward. For an enterprise-level system, that’s a bit underwhelming.



So – what do we recommend? One set of criteria for running monitoring services in production goes like this:

  1. Operationally it should play well with others – no taking down a box if something goes wrong, no dependency hell, no glaring security issues.
  2. Extensible without hacking on source code – a library should have some mechanism for supporting the entire stack you’re using without having to mutilate the code to get it working.
  3. Well supported by the community – Lots of users means that you know that bugs will be squashed, and updates will arrive to support new measuring scenarios or technologies.
  4. Works to its strengths – A library that tries to do everything itself ends up doing nothing particularly well.  Opinionated design means that a library can focus on the important issues without adding in pointless frivolities.

So – If I were to pick based on these criteria, I’d recommend either Diamond or CollectD.  They both handle the collection of your data admirably with extensible plugins, and can forward it on to a storage and visualization service like Hosted Graphite (we even give you pre-made dashboards for both services!). They’re both well supported by the open-source community and play nicely with your systems.

If you’re in the Java ecosystem there may be some natural attraction to DropWizard, or StatsD if you’re using a PaaS such as Heroku – but if you’re running your own servers or using AWS, CollectD or Diamond are a good fit.

System monitoring – what are my options?

There are many options for system monitoring – so many, in fact, that a lot of people turn to one of the two worst options: writing your own, or getting struck with paralysis by analysis and doing nothing.

Monitoring your systems and alerting when something weird happens is crucial to understanding and tackling issues as early as possible. That means allowing you to work on the activities you have planned instead of reacting to outages, and ultimately keeping your customers happy.

There’s a whole range of free tools that monitor your systems and create metrics for you to graph, evaluate, and use to create alerts. In part 1 of this series, we’ll explore the pros and cons of three of these popular libraries.


CollectD is a daemon that gathers system information and passes it on to Graphite. It is, as the name suggests, a collector rather than a monitoring tool, and stresses modularity, performance and scale.


Quick and Easy – Setup is straightforward, configuration is painless, and maintenance is minimal. As a multithreaded daemon written in C, it’s light on system resources and fast on clients. CollectD supports multiple servers and has a multitude of ways to store data.

Plugins – CollectD has a pile of ‘em: for specialized servers, for sending metrics to various systems, for logging and notification, for nearly anything. The default is enough to get started, but there’s plenty of flexibility once you get going. It plays nicely with Graphite.


No GUI – CollectD is not a graphing tool – it simply spits out RRD files. There are scripts for a minimal interface packaged with it, but even the project admits that it’s not up to much. You’ll need to plug into Graphite or something similar to read CollectD’s outputs effectively.

Too Much Info – Sub-minute refresh intervals and the variety of plugins make it easy to overreach. If you ask for a lot of statistics from a node, you may get more data than you can graph and read effectively.



Munin is a resource and performance monitoring tool written in Perl. It doesn’t provide alerting, but Munin is a robust solution for cleanly presenting a lot of network data.


Out-of-the-Box – Munin stresses ease of use; installation and configuration take minutes. Writing code to extend monitors is so simple you can use it for non-performance tasks like counting web traffic. You can set thresholds in Munin, but there is a recommended Nagios plugin to generate alerts.

Plug and Play – Like CollectD, Munin has a wide range of plug-ins to choose from: just grab a few scripts from the Plugin Gallery. The more elegant plug-ins can monitor long-view trends like annual resource usage. Writing new plug-ins for yourself is also no trouble.


Central Server – Each server you’re monitoring runs a Munin process; these servers then connect to a main server. This model can lead to performance issues when the scale rises to hundreds of servers. Budgeting for that dedicated server will need to come sooner rather than later.

Graphs – The graphs generated by Munin are static – not ideal if you want some interactive views of your data. Also, these HTML graphs redraw after every refresh, creating big disk I/O and CPU hits on your system. As a whole, it’s pretty dated.



Dropwizard is a Java framework that supports metrics and ops tools for web services. This collection of best-in-breed libraries is built for speed and robustness.


Built-in Metrics – Choose your service calls and performance metrics run automatically. Health Checks publishes metrics by service, too – handy for doing a lot of REST calls. Add in service configuration validation as a default feature and Dropwizard is quick to both deploy and change.

Container-less – All resources and dependencies are packed into fat JARs, making it a snap to write micro-services or add instances. Default configurations are sensible and updates are easy, too – you can deploy with one line.


Performance – Each request has its own thread – even tuning maxThreads and maxConnections may not help throughput. This is problematic for the kind of I/O-bound applications that Dropwizard is likely to service. Dropwizard’s light weight cuts both ways – if you have high loads and a lot of developers, other options may work better.

Support – Dropwizard has an active community, but it’s no match for when Coda Hale developed it. The cadence of releases can stretch to months. Documentation could be meatier, and even StackOverflow doesn’t talk about it as much as other tools.


In the next article, we’ll check out a few other useful libraries and dig through the main factors you might look at when making a decision.

Enabling remote work

At Hosted Graphite, we rely on remote work – our CEO works full-time from the US and the rest of the team work from Ireland. We have a flexible policy on working from home (essentially, Nike-style: just do it). As long as work gets done, we don’t sweat the details of when or where it happens. Some work from home regularly, and some rarely.

The key to effective remote teamwork is communication. Interacting with a colleague is always easier in person, so when we’re remote we need to put in a bit more effort, and we need to know exactly when that effort is needed.

If you’re a remote worker or a team that works with remote employees or freelancers, then we hope these tips will make your life a little easier.


It’s crucial that your team knows when you’re working and when you’re unavailable. If you just drift in and out all day, your colleagues won’t be able to rely on you for a discussion or a decision because they never know when you’re going to be responsive.

When your day starts, declare it to the team. Nothing complicated, just “Morning! Working from home today.” will do. When you’re done for the day, make sure you say it: “I’m outta here, talk to yis tomorrow.” If you’re on a call, in a meeting or out for lunch then make your remote team aware of it, (“Lunch, back in 30”) especially if timezones are involved.

Being present

Related to “Declaring” but different is being present. The team should feel that they can call on you when you’re around, just as they would in the office. This means checking your communication tools regularly, or keeping notifications turned on. Basically, don’t hide and isolate yourself, be available for your colleagues because they can’t tap you on the shoulder as they would in the office. If you do need to go quiet for a while to focus on a task, declare it so people know what to expect your availability to be.

Discussions and decisions

With some people in the office and some remote, it quickly becomes apparent to anyone remote that when decisions happen in the office in person, they don’t get discussed online. If this happens frequently enough you feel completely disconnected from the decision-making process, which quickly leads to dissatisfaction and demotivation.

Often it’s easier to just turn to each other in the office for a discussion and that’s fine, but if you know one of the remote folks might like to comment you should (1) make them aware that there is a discussion, and (2) give them a chance to either take part or say they’re happy to go with what the in-office folks decide on that issue. Often just leaving a little note like “X and Y discussed this in person and decided it’s OK” on a pull request, or saying “We’re discussing this in the office – Z, do you have an opinion on this one?” is a big step forward.

Being clear about when and how decisions are made and discussions are had will go a long way toward dealing with the feeling of disconnection that remote workers can feel.

Virtual standups

Usually, some of the teams within the company will have a daily standup where they talk through what they’re doing and any blocking points they have before deciding on what they’ll do next, and discussing the reasoning behind any changes in direction.

Once an in-person standup is done, the team will summarise it on Slack to allow remote folks to participate, and also to create a record for answering the “Where /did/ the week go?” question. A bot pings everyone to update their standup notes at the same time every day.

Usually this is a quick three-liner:

Done: (What I've done so far, how long it took)
Doing: (What I'm doing now, how long I've spent on it)
Next: (What task I'm switching to soon)

From this, anyone can see at a glance what the whole team are working on and how long it’ll take for us to push a feature, fix, or new integration. We can make sure any marketing efforts are keeping pace with the development team, and any new automation needed is in place. If anyone’s getting stuck it becomes pretty clear and we can figure it out.


Not everyone does the dev-oriented daily standup notes. We also keep a git repo of what we call ‘snippets’, somewhat modelled on Google’s snippets. This is a daily summary of the stuff that people work on. It’s an optional way of describing the building blocks of the day – particularly useful for developers that are usually remote and anyone working on softer, non-dev work. My day usually looks something like this:

June XXth 2016
– Call with Charlie
– Sales call with Customer X
– Talking with our Lawyer about Y
– Support followup with Customer Z
– One-to-one call with <team member>
– Talking with partner company A
– Reviewing our marketing conversion rates

This has the added benefit of highlighting regular tasks that might be better off automated. If you’re doing something that a computer can do on a semi-regular basis, you probably just need to leave it to a script. If someone’s regularly working on low-value tasks we can direct that attention somewhere more valuable. There have been some great articles on the difference between ‘action’ vs ‘work’ – work drives your business forward, action smells like work but has no outcome (e.g. any task that starts with “investigating” is usually a giveaway).

It’s also a good way of noticing where you’re spending time that you shouldn’t be. For example, if you’re supposed to be doing product and development management but you have lots of actual development tasks in your snippets, that’s evidence that maybe you don’t have the balance right.

Snippets can also be a helpful tool for giving the team a chance to see what the non-technical management types do all day – how much is support, sales, hiring, dealing with the accountant, etc.

Planning and task management

We keep a set of Trello boards which we separate into a few different themes:

  1. Product – Adding features to Hosted Graphite, or improving existing ones.
  2. Growth – Efforts to get new users into and through our sales funnel, improve conversion, or promote referrals.
  3. Bugs – Anything that’s creating a less than amazing experience for our customers (or operations team!) and needs to be fixed, or technical debt that needs to be paid off.
  4. The Salt Mines – Current issues/projects, what we’re working on in the next week or two.

We then divide these boards into Short, Medium, and Long-term sections. As co-founders and managers of the product, Charlie and I have a call a few times a week to wrangle these tasks between queues as necessary, or break them into smaller tasks. Once the tasks are on the Salt Mines board, developers are free to pull items from the ‘Next’ queue and drop them into ‘Doing’, then ‘Done’, Kanban-style.


Keeping a partially remote team on the same page is tricky, and possibly even harder than a fully remote team because there are more edge cases. We hope this insight into how we run a partially remote team is helpful, and that you’ll find some of these tips useful for your own team.

Alerting from first principles

An Introduction to Alerting

Having recently added our Alerting for Graphite, we thought it’d be useful to put together a short primer on Alerting. What do you need to look at when considering what you alert on, and where those alerts go? An early warning system is only as good as its alarms.

What is alerting?

Monitoring uses alerts to tell you when something unexpected happens, if you need to act, and how you might fix a problem. Good alerts give you the right context to act and enough lead time to be effective. Bad alerts tell you what you already know or don’t need to hear – once you know a database is down, you don’t need to be reminded every minute.

If monitoring gives you data, then alerting gives you information.

How to do Alerts

Done properly, your alerts should trigger only for states or events that require attention or intervention. If you flood your sysadmins with minor alerts, they will try to read them all or ignore them altogether – both poor outcomes! Every sysadmin I’ve ever spoken to gets a thousand-yard stare when I mention Nagios’s propensity to fill your mailbox with redundant information.

For simple record keeping, set up descriptive logging in a human readable format to capture an event so you can dig into it later – e.g. “Production web server number of 500 errors”.  A good rule of thumb for logging sensitivity is to trigger alerts on what might equate to syslog standard severity levels of Error and higher.
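As a sketch of that rule of thumb, the snippet below uses the standard syslog severity numbering, where a lower number means a more severe event, and treats anything at Error or more severe as worth an alert. The `should_alert` function and its default threshold are assumptions for illustration, not part of any particular tool.

```python
# Syslog severity levels (RFC 5424): a lower number means a more severe event.
SEVERITIES = {
    "emergency": 0,
    "alert": 1,
    "critical": 2,
    "error": 3,
    "warning": 4,
    "notice": 5,
    "informational": 6,
    "debug": 7,
}


def should_alert(severity, threshold="error"):
    """Return True when an event is at least as severe as the threshold."""
    return SEVERITIES[severity] <= SEVERITIES[threshold]


for level in ("critical", "error", "warning"):
    print(level, should_alert(level))
```

Everything below the threshold still gets logged for later digging; it just doesn’t wake anyone up.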


Each alert should capture at least these fields:

  • Status – What’s wrong?

A simple, specific statement of what’s changed: a server offline, power supply interrupted, large numbers of users dropped, unusually long response times.

  • Priority – How urgent is it?
      • High – Something is on fire that must be fixed; wake the right person to tackle the problem. A smoke alarm in a data centre needs a quick response from your on-call engineer, and probably the Fire Department, too.
      • Medium – Something needs action but not right away; check the logs tomorrow so technical staff can follow up. Your secondary backup server running low on disk space is a risk for you to deal with this month, but not a crisis today.
      • Low –  Something unusual happened; email the details to create an evidence trail for later investigation. There are weird traffic patterns on your internal network – is everyone streaming Game of Thrones clips on Monday morning? Have a look when you get the chance.
  • Next steps – What do we do?

A list of product/service owners, escalation paths, and immediate corrective actions. This is a good place for some easy troubleshooting – if the team working the overnight can solve the issue with a reboot, then you don’t need to take it any further. Runbooks are a life-saver in the small hours of the morning, giving the bleary-eyed ops team some simple guidance when nothing’s making sense.
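The three fields above can be sketched as a small data structure. This is only one possible shape: the class and field names, and the routing text for each priority, are assumptions that mirror the descriptions in this post rather than any real alerting API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Priority(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical routing that mirrors the three priority levels described above.
ROUTES = {
    Priority.HIGH: "page the on-call engineer",
    Priority.MEDIUM: "log for follow-up tomorrow",
    Priority.LOW: "email the details",
}


@dataclass
class Alert:
    status: str                                      # What's wrong?
    priority: Priority                               # How urgent is it?
    next_steps: list = field(default_factory=list)   # What do we do?

    def route(self):
        """Pick where this alert goes based on its priority."""
        return ROUTES[self.priority]


alert = Alert(
    status="Secondary backup server is low on disk space",
    priority=Priority.MEDIUM,
    next_steps=["Check the logs", "Free up or add disk space this month"],
)
print(alert.route())
```

Keeping the next steps on the alert itself means whoever receives it has the runbook pointer in hand, rather than having to go looking for it.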


Further Tips

  • Tune your thresholds regularly to eliminate noise and create alerts for previously undetected incidents. If load spikes during commercial breaks in the big game, tweak your alerts to accommodate that.
  • Don’t confuse priority and severity. Extra processing time for an ecommerce transaction, for example, might be a medium severity defect; but priority depends on factors such as user traffic and SLA terms. What’s an inconvenience on Easter Sunday could be a potential disaster on Black Friday!
  • Disable alerts for test environments, maintenance windows, and newly deployed services – waking someone up for a false positive makes for an angry ops team.
  • Update your call sheet with current contact details – when time is crucial, there’s no room to chase down the former service owner who handed over their admin rights last month.
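The tip about test environments and maintenance windows can be sketched as a simple check before any notification goes out. The function names, the `production` environment label and the example window are all assumptions for illustration.

```python
from datetime import datetime


def in_maintenance(now, windows):
    """Return True if 'now' falls inside any (start, end) maintenance window."""
    return any(start <= now < end for start, end in windows)


def should_notify(environment, now, windows):
    """Only notify for production, and only outside maintenance windows."""
    if environment != "production":
        return False
    return not in_maintenance(now, windows)


# Hypothetical maintenance window: 02:00 to 04:00 on 1 June 2016.
windows = [(datetime(2016, 6, 1, 2, 0), datetime(2016, 6, 1, 4, 0))]
print(should_notify("production", datetime(2016, 6, 1, 3, 0), windows))
print(should_notify("production", datetime(2016, 6, 1, 5, 0), windows))
print(should_notify("staging", datetime(2016, 6, 1, 5, 0), windows))
```

Only the second call results in a notification: the first falls inside the maintenance window and the third comes from a non-production environment.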


A final word

Every business has a different set of critical paths – you know your systems and people best. Alerts can be automated, but the wisdom behind them can’t be.

  • Establish the remediation procedures that will be kicked off by alerts.
  • Discuss with engineers the kind of diagnostic data that is useful to them – Hosted Graphite alerts can drop graph images directly into Hipchat and Slack.
  • Write a text description of your alerts so that it gives unambiguous instructions for resolution.

An alarm doesn’t mean panic when everyone knows there’s an established process they can trust.

No brogrammers: Practical tips for writing inclusive job ads

A common problem with hiring for tech companies is that job ads often use strong, offputting language that alienates women, people of colour, and other minorities in the tech community. By paying attention to the language we use to describe ourselves, our ideal candidates, and the job responsibilities, we can broaden the net of candidates that might apply and help in some small way to tackle the tech diversity problem.

We’re trying to make our job ads more friendly and focused on human qualities rather than technical qualities. While we can always train someone in a technical skill, we can’t train someone to be a nice person. Or, maybe we can but that’s much, much harder, and it doesn’t sound like anyone would have a good time with that.

Quite a lot has been written about what to do about the problem of poor diversity in tech, but it can be hard to distill that into actionable suggestions. That’s what this post is – an attempt to show examples of what we changed and the thinking behind it.

In this post we’ll be comparing two versions of the job spec for very similar positions. Here’s the one from two years ago, and here’s the most recent version.

We used to say:

Several years of Linux sysadmin experience.

Now we say:

Significant Linux system administration experience.

The reason is that stating an amount of time is offputting to people that haven’t put in that specific amount of time. Years of experience is a proxy for learned skill, but in this case it wasn’t helping us filter candidates better and could only have been offputting to someone without “several years” of experience. “Significant” is still poorly defined (intentionally!) and hopefully less strict.

We used to say:

Your code will be exercised by 125,000 events every second, so performance is pretty important to us! A decent knowledge of common data structures and algorithms is expected.

Now we say:

An eye for performance is important – your contributions will be exercised by more than fifty billion events per day. We always have to think about how something will scale and fail.

The thinking here is that “A decent knowledge of common data structures and algorithms is expected.” is quite specific, and this could be offputting. Instead, we talk about how we have to think about how things will scale and fail, which puts more of a focus on learning than already knowing the technical details. We also changed the number of events to a per-day figure – we’re very proud of our growth, but we don’t want the figure to be offputting to a potential candidate.

Where once we said:

We want to see that you know your stuff.

Now we say:

We want to see that you have relevant experience, that you like automating away repetitive work, that you have good attention to detail, an aptitude for learning new skills and that you have empathy for your team-mates and our customers.

We hope this one is obvious. What does “knowing your stuff” mean, anyway? It sounds too close to the rockstar/brogrammer/crushing it nonsense that infests the tech industry. We chose to replace this macho phrase with something better that covers relevant experience, a preference for the kind of work we do, a personality trait, a focus on learning, and empathy for colleagues and customers.

Where we used to put a burden on someone by saying:

You’ll need to help us scale them individually, …

We now say:

We’ll need your help to scale them individually, …

This seems minor, but turning this around makes it clear which direction the responsibility and contribution goes. It’s not that you need to help us, it’s that we need your help. Instead of having a burden dumped on you individually, maybe you can help the team work on this problem?


We make a point of saying that we care about the health of our employees:

We want healthy, well rested ops people.

Some early feedback on this blog post pointed out that the word ‘healthy’ here might feel exclusionary to someone with a disability. After some thought, this was changed to ‘relaxed’:

We want relaxed, well rested ops people.

This isn’t quite the same as what we meant by ‘healthy’ because on-call work can damage one’s mental and physical health and we wanted to point out that we care about this, but ‘relaxed’ and ‘well rested’ convey most of it and are good enough. Suggestions appreciated!


Where we previously implied good communication skills:

We have one co-founder living in the US and we use IRC, Workflowy and video chat tools like to keep in touch.

We now explicitly state it:

Most of the team works out of the Dublin office, but we’re flexible about working from home and one of our co-founders is living in the US, so we’re partially remote and we have to be good at communicating. We use Slack, Google Docs, Trello, Workflowy and video chat tools like to keep in touch.

The thinking here is that we wanted to mention that we’re flexible about working from home which is better for families, and we explicitly say that we “have to be good at communicating.” We’re not saying that we are good at communicating, just that the business recognises that we have to be, so you can expect supportive and communicative colleagues that won’t make working from home any harder than it has to be.


As a small and growing company, we didn’t offer health insurance two years ago but we do now, and we wanted to make sure to point out that it includes family cover:

Health insurance for you and your family.


In both job ads, we said:

We’d like to see some of your code, but it’s not essential.

We understand that not everyone will have published code – we recognise that open source male privilege is real, and some people are discouraged from publishing code because of that. Other people may not be able to publish code due to their employer’s privacy requirements, or are just too busy spending time with their family to code the evenings away.


Finally, we used to say:

No ninjas, rockstars or brogrammers, please.

This is amusing and captures our opposition to Silicon Valley rockstar/brogrammer culture well, but for a job ad that’s all about inclusiveness it felt a little odd. So we made it better:

No ninjas, rockstars or brogrammers, please; just nice, caring humans.

Of course if you actually practice martial arts, play in a rock band, or enjoy coding *and* going to the gym, you are welcome here. It’s just the “bro culture” and hiring “ninjas and rockstars” trends we’re not keen on. 🙂

There you go, those are some of the thoughts that went into our most recent job ad. We think our old job ad was already pretty good (and many people have told us so!) but we tried to make the latest one more inclusive and to focus more on the individual, learning, family, and support. We tried to remove elements of competition or hard technical requirements, and to keep the job ad buzzword nonsense to a minimum.

So how are we doing? What can we improve for next time? Let us know by tweeting at @HostedGraphite.

Managing ChatOps Signal-to-Noise with HipChat and Hosted Graphite

We’re close to releasing a big new feature for Hosted Graphite – the ability to define alerts that can notify you when your metric data indicates something might be wrong with your infrastructure. You can choose to be notified via email, webhooks, PagerDuty, Slack, and now, HipChat. The alerting feature is in beta right now but many teams are already using it. If you’d like to get early access to try it out, get in touch with our support team and we’ll flip the switch for you.

Configure alert thresholds for your metrics.

When our friends at Atlassian asked if we’d be interested in building an add-on for HipChat, we weren’t immediately sold on the idea. However, once we saw that we could embed parts of the Hosted Graphite experience inside HipChat, we paid closer attention!

We thought the new alerting feature would be a great place to start. Most of the notification methods are relatively straightforward – when something goes wrong, we notify you. When it’s fixed, we notify you.

Nothing ground-breaking there, but with the HipChat Connect API we were able to offer a much richer experience by embedding parts of the Hosted Graphite product experience right into the HipChat interface. This is pretty powerful, and something that other chat tools don’t offer.


First, the basics. When an alert goes off, we post a notification to your HipChat room:

This is a good start – there’s a thumbnail of a graph, a link to the full size graph, the metric name related to the alert, and the conditions that caused it to fire. This “card” view lets us pack a lot of information into a small space, making the notification as useful as possible and giving the signal-to-noise ratio a welcome boost.

Keeping up with the chaos

One of the challenges of ChatOps is keeping everyone on the same page during a chaotic incident, without everyone having to read every single thing that’s said and done in the chat room. We found that the HipChat “glance” feature provides an excellent way to communicate real-time high level status information to everyone in the room by embedding a small widget into the right-hand panel that sits next to the conversation. The picture on the right shows what it looks like when everything is healthy.

Unhealthy looks like this:


Context switching

Context switches are expensive – flipping between multiple tools always adds overhead, so the more you can do in a single tool, the less time you’ll waste. Seeing the overall state of the infrastructure at a glance is useful, but we wanted more. If there’s something wrong, you need to know which things are wrong in order to be able to act. If you click the “glance” inside HipChat, the sidebar changes to a filterable list of all your alerts, showing which ones are unhealthy and what the alert conditions are.

With other tools you’d have to switch to another browser tab and navigate the UI to get a high level view of the state of your infrastructure, and that’s assuming you’re already logged in. Having the information right next to the relevant discussion is powerful, and it’s available for your entire team, which adds up to a lot of saved time.

Acting quickly

Now that you know what’s alerting, the next step is usually to check one of your dashboards to get the full picture, look for correlations, see what the magnitude of the problem is, etc. For that, you can jump right from HipChat to a Hosted Graphite dashboard. For teams with a large number of dashboards, there’s a quick filtering box.

These dashboard links include a Hosted Graphite access key, so everyone in the room gets one-click read-only access as quickly as possible to help them diagnose the problem. They’ll need to log in to make any changes, of course.


Using our HipChat add-on is the richest way to take advantage of our new alerting feature. It keeps your team informed, improves the signal-to-noise ratio in your chat rooms, reduces context switching, and provides for lower time-to-resolution, which directly impacts your customers.

Want to try it out? You can install the Hosted Graphite for HipChat integration in the Atlassian Marketplace.