Adventures in fault tolerant alerting with Python

We gave a presentation a couple of weeks ago at Python Ireland's April meetup, where we described our experiences with PySyncObj, a relatively new but solid library for building fault tolerant distributed systems in Python. Most of the services that run Hosted Graphite are built in Python, and this includes our alerting system. While that talk wasn't recorded, this blog post discusses what we did, the tech we chose, and why.

A monitoring system without alerting

We’ve been running Hosted Graphite as a big distributed custom time-series database since 2012, and once we mastered the monitoring side of things, the next obvious step was an integrated alerting system.

Beta

We released a beta version of the alerting feature in 2016, and we’ve kept it technically in “beta” for the same reasons Google kept GMail in beta for more than five years: people are using it for real work, we’re supporting it, but we know we still have work to do and so there’s still a “beta” label on it for now.

One of the biggest things that was holding us back from calling the alerting feature fully baked was the failover characteristics. Almost everything else at Hosted Graphite is a distributed, fault-tolerant system, and we knew we needed the same for the alerting system. After all, we're on-call 24/7 for it!

Keeping state

Some parts of the alerting system are easy to do failover for because they keep no state, like the streaming and polling monitors in this slide:

alerter_beta.png
Basic alerting architecture slide, taken from the Python Ireland presentation.

The trouble starts when we consider the state kept by the Alerter component. The Alerter receives a series of "OK", "OK", "not OK!" messages from the streaming and polling monitors, and needs to keep a lot of those in memory in order to make decisions: which notifications to send, whether we need to wait a few more minutes to satisfy a "must be alerting for more than five minutes" constraint, whether we've already sent enough notifications, and so on. Discarding the alerter state each time there is a failure would cause a pretty poor user experience, with duplicate, delayed and missing notifications, and we can't tolerate that.
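
To make that concrete, here's a toy illustration (not our actual implementation) of the kind of state the Alerter has to hold onto – when each alert first started failing and when we last notified – so it can honour the five-minute rule without spamming anyone:

import time

ALERT_DELAY = 5 * 60          # must be failing this long before we notify
RENOTIFY_INTERVAL = 60 * 60   # don't repeat a notification more often than this

first_failure = {}   # alert_id -> timestamp of the first "not OK" message
last_notified = {}   # alert_id -> timestamp of the last notification sent

def handle_check_result(alert_id, ok):
    now = time.time()
    if ok:
        # Back to healthy: forget the failure so the timer restarts next time.
        first_failure.pop(alert_id, None)
        return
    first_failure.setdefault(alert_id, now)
    failing_for = now - first_failure[alert_id]
    since_last = now - last_notified.get(alert_id, 0)
    if failing_for >= ALERT_DELAY and since_last >= RENOTIFY_INTERVAL:
        last_notified[alert_id] = now
        send_notification(alert_id)   # hypothetical helper

Lose those two dictionaries in a failover and you get exactly the duplicate, delayed and missing notifications described above.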

Fine – just duplicate all the traffic!

We considered just duplicating all the traffic to keep multiple alerters in sync, with checkpointing to allow new nodes to catch up. Kafka looked like a pretty good option at first, but it didn't fare well in a review with the SRE team.

Here’s why:

  1. The data volume flowing to the alerter nodes isn’t huge – only a few thousand messages per second.
  2. We run everything on bare metal hardware for high performance and to avoid the noisy neighbour problem.
  3. While most of our users are served by our shared production infrastructure, we have many users whose needs demand a dedicated cluster. This isn’t a problem in itself, but it does mean that we’ll need to duplicate the services for each cluster, no matter how much load that customer puts on us.
  4. Kafka requires Zookeeper for cluster management too, so that’s yet another thing needing several machines. We already run etcd anyway, and we don’t want to run two services that do similar things.

Basically, Kafka is designed for far bigger data volumes than we have here, and running it just to provide high availability for this one service didn't make a lot of sense.

jcb_flowers
“Using Kafka for this is like using a JCB to plant a daffodil.” — Hosted Graphite’s SRE team

Re-evaluating: what do we actually want?

We zoomed out a bit and reconsidered. What failover properties do we want the alerting system to have? Here’s what we settled on:

  • Nodes should exchange state among themselves.
  • Nodes should detect failure of a peer.
  • Nodes should figure out a new primary after the old one fails.
  • Nodes should assemble themselves into a new cluster automatically when we provision them.
  • … all without waking anyone up.

PySyncObj – pure Python Raft consensus

PySyncObj is a pure Python library that solves all the requirements except the basic service discovery one, which we figured we could handle with etcd. PySyncObj allows us to define an object that is replicated among all nodes in the cluster, and it takes care of all the details: it uses the Raft protocol to replicate the data, elect a leader, detect when nodes have failed, deal with network partitions, and so on.

Following Python’s “batteries included” philosophy, PySyncObj includes a set of basic replicated data types: a counter, dictionary, list, and a set. You can also define custom replicated objects with a couple of extra lines and a decorator, which is pretty awesome.

Here's a simple example of how easy it is to replicate a dictionary across three servers with PySyncObj:

from pysyncobj import SyncObj
from pysyncobj.batteries import ReplDict

dict1 = ReplDict()

syncobj = SyncObj('serverA:4321', ['serverB:4321', 'serverC:4321'],
                  consumers=[dict1])

dict1['somekey'] = 'somevalue'

All the distributed Raft magic is done for you. Pretty fantastic. For more, check out the other examples on the PySyncObj GitHub page.
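
And here's roughly what a custom replicated object looks like. This follows the pattern from the PySyncObj examples rather than anything from our alerter, so treat it as a sketch:

from pysyncobj import SyncObj, replicated

class ReplCounter(SyncObj):
    def __init__(self, self_address, partner_addresses):
        super(ReplCounter, self).__init__(self_address, partner_addresses)
        self.__value = 0

    @replicated
    def increment(self):
        # Applied on every node once the Raft log entry is committed.
        self.__value += 1
        return self.__value

    def get_value(self):
        # Reads are served locally from the replicated state.
        return self.__value

counter = ReplCounter('serverA:4321', ['serverB:4321', 'serverC:4321'])
counter.increment()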

Monitoring the new Raft cluster

PySyncObj exports a bunch of data about the state of the Raft cluster with the getStatus() function:

# Get internal Raft cluster state
status_dict = self.sync_obj.getStatus()
{
 'readonly_nodes_count': ...,
 'log_len': ...,
 'unknown_connections_count': ...,
 'last_applied': ...,
 'uptime': ...,
 'match_idx_count': ...,
 'partner_nodes_count': ...,
 'state': ...,
 'leader_commit_idx': ...,
 'next_node_idx_count': ...,
 'commit_idx': ...,
 'raft_term': ...
}

To monitor this data while we built confidence in the new service before rolling it out, we just fired it into Hosted Graphite using the dead simple graphiteudp library:

def send_metrics(self):
    # Push every Raft status field to Hosted Graphite as its own metric.
    for key, value in self.sync_obj.getStatus().iteritems():
        metric_path = "%s.%s.%s" % (self.hg_api_key, self.metric_prefix, key)
        graphiteudp.send(metric_path, value)
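
The only other piece is pointing the client at a Graphite endpoint once at start-up – something along these lines, assuming graphiteudp's init helper (the endpoint and arguments here are illustrative):

import graphiteudp

# One-off setup: send UDP datapoints to Hosted Graphite. The metric paths
# above already carry the API key, so no prefix is configured here.
graphiteudp.init("carbon.hostedgraphite.com")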

The result is a pretty dashboard detailing the Raft cluster state for production and staging environments:

cluster_dashboard

Chaos monkey testing

At this stage, we were pretty confident that this was the right route to take, but we wanted more. The only way to make sure your failover handling works is to continuously make it fail and see how it reacts, and that's exactly what we did. We stood up a full-size test cluster and ran a copy of all of the alerting traffic from the production environment through it. The output was discarded so customers didn't get duplicate alerts, but we were collecting all the same metrics about the output that we usually do.

leader_status_many
Randomly restarting alerter nodes.

In the style of the Netflix chaos monkey, we wrote a small tool to wait a random amount of time, and then restart one of the nodes without warning. We ran this for several days and it worked flawlessly – every time we lost a node, another was elected the new leader in a few seconds, the restarted node recovered the state it lost, and the overall output of the cluster (alerts checked, notifications that would have been sent, etc) was steady throughout.
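
The tool itself doesn't need to be anything clever – a rough sketch of the idea, with made-up hostnames, intervals and service name:

#!/usr/bin/env python
import random
import subprocess
import time

NODES = ["alerter-test-1", "alerter-test-2", "alerter-test-3"]  # hypothetical

while True:
    # Wait a random amount of time, then restart one node without warning.
    time.sleep(random.randint(10 * 60, 2 * 60 * 60))
    victim = random.choice(NODES)
    print("Chaos monkey restarting %s" % victim)
    subprocess.check_call(["ssh", victim, "sudo", "service", "alerter", "restart"])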

Emergency options

Distributed systems are pretty fantastic when they work well, but it’s easy to get into a state where you just want a simple option to take control of a situation. In case we ever need it, we built in a ‘standalone’ mode, where all the state is persisted to disk regularly. During a particularly chaotic incident, we have the option of quickly bypassing the clustered automatic failover functionality while we get a situation under control. We hope never to need it, but it’s nice that it’s there.

Current state

After a couple of weeks building confidence in the new alerter cluster, we quietly launched it to… no fanfare at all and no user interruption during the migration. In the weeks since launch, it has already faced several production incidents and fared well in all but one: a small bug that took the alerting service out of the 99.9% SLA for the month. Oops. Despite all the testing, there’ll always be something to catch you out.

After more than a year of effort by our dev and SRE teams, the alerting service will soon have the 'beta' label torn off, and this automatic failover work, already in production, forms a crucial piece of that.

Here’s what it looks like now:

alerter_beta_pysyncobj.png

Service discovery for the alerter cluster is handled by etcd, and PySyncObj takes care of all other cluster operations. We’re pretty happy with this – we’re able to avoid waking someone on the SRE team for losing a machine or two, resize and upgrade the cluster without maintenance periods, and scale it up and down to trade off SLA requirements against cost, all of which is pretty powerful stuff.
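
As an illustration of how those two pieces fit together, a node might bootstrap its Raft peer list from etcd before joining the cluster. The key layout and the python-etcd client usage below are assumptions for the sake of the example, not our actual code:

import socket

import etcd  # python-etcd client, used here purely for illustration
from pysyncobj import SyncObj
from pysyncobj.batteries import ReplDict

ETCD_DIR = "/alerting/alerter-nodes"   # hypothetical key layout
RAFT_PORT = 4321

client = etcd.Client(host="127.0.0.1", port=2379)

# Register this node, then treat every other registered node as a Raft peer.
self_addr = "%s:%d" % (socket.getfqdn(), RAFT_PORT)
client.write("%s/%s" % (ETCD_DIR, socket.getfqdn()), self_addr)
peers = [node.value for node in client.read(ETCD_DIR, recursive=True).children
         if node.value != self_addr]

state = ReplDict()
sync_obj = SyncObj(self_addr, peers, consumers=[state])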


To learn more, visit hostedgraphite.com or follow us on twitter.

Hosted Graphite’s Alerting now integrates with OpsGenie!

TL;DR: Hosted Graphite’s alerting feature now integrates with OpsGenie, including auto-resolving incidents according to the alerting rules.

Hosted Graphite’s alerting feature continues to sprout new functionality – we just launched the ability to send notifications of infrastructure problems straight to your on-call engineering team via OpsGenie.

If you’re not familiar with OpsGenie, here’s how they describe their service: “OpsGenie is an incident management solution for dev & ops teams, providing on-call schedule management, escalations and alerting via email, SMS, phone and mobile push notifications.”

Here’s what the integration of our two complementary services looks like in the OpsGenie UI:

og-alert-ui

Automatic incident resolution

Being notified of a problem is one thing, but once the incident has been dealt with you’ll often need to manually mark the incident as ‘resolved’. If you’re using Hosted Graphite and OpsGenie, this step is unnecessary because you can have your Hosted Graphite alerts automatically close your OpsGenie incident:

autoresolve

This saves you time and frustration, not only for incidents that require a lot of attention but for those inevitable quick blips that resolve themselves after a few minutes, perhaps after a brief network interruption. In those cases, having Hosted Graphite resolve an OpsGenie incident automatically reduces frustration because your responding engineer will see the incident is no longer open, sometimes before they’ve even managed to get back to their keyboard.

For longer incidents, it is incredibly helpful to have your monitoring tell your ops team when everything is OK again, rather than having to check the state of the alerts and manually resolve your OpsGenie incidents. This approach lowers stress ("Whew, all resolved!"), reduces confusion ("What's the state of this incident right now?") and saves time for everyone.

Setting up Hosted Graphite and OpsGenie

Setting it up involves just three steps:

1. In your OpsGenie account, find the Hosted Graphite integration. Copy the API key and click 'Save integration'.

integration

Don’t forget to click the ‘Save integration’ button! (and don’t worry, the example key in the gif has been deactivated)

2. In your Hosted Graphite account, add a new Notification Channel for OpsGenie:

hg-nc

3. Configure an alert to use the new OpsGenie Notification Channel:

hg-alert

That’s it! The next time a Hosted Graphite alert fires, OpsGenie will know about it seconds later and start notifying your team according to the schedules you’ve set up and the notification preferences of your team. When Hosted Graphite’s monitoring detects that your alert is in a healthy state again, the incident will be automatically resolved in OpsGenie.

More resources:

If you have any trouble, just send our stellar support team an email: help@hostedgraphite.com

System monitoring – what are my options? (part 2)

In part one of this series on system monitoring libraries we checked out some popular libraries used to monitor servers. In this follow-up, we take a look at a few more options and make a recommendation to answer the question: which of the many available monitoring tools is best for your environment?

Diamond

Diamond is a Python daemon for collecting system metrics and presenting them to Graphite. Diamond is good for scale and customization.

Benefits

Extensibility – Diamond uses a straight-forward model: add a collector to the configuration file to add new monitoring. This makes it low friction to scale to dozens or even hundreds of servers, because each instance is the same and responsible for reporting its own metrics. Diamond can handle it, too – the project claims up to 3m datapoints per minute on 1000 servers without undue CPU load.

Variety – Support extends to a range of operating systems, with the documentation to back it up. Diamond comes with hundreds of collector plugins, and it lets you customize collectors and handlers with little effort, so you can gather metrics from nearly any source. Installation is easy, too.


Drawbacks

Functionality – Collection is all Diamond does. It can talk directly to tools such as Graphite, but many setups still choose to use an aggregator like StatsD in addition to Diamond for their application metrics.

Updates – Brightcove, the original developers, stopped working on Diamond and it graduated to a standalone open source project. That’s noticeably slowed its release cycle.  Diamond is a mature and well-established project, though, so decide for yourself how much of an issue this is.

StatsD

StatsD is a metrics aggregation daemon originally released by Etsy. It is not a collector like the other tools on this list; instead it crunches high-volume metrics data and forwards statistical views to your graphing tool.

Benefits

Maturity – The simple, text-based protocol has spawned client libraries in nearly every popular language. You can plug into just about any monitoring or graphing back end. StatsD has been around a long time, and it shows.

Free-Standing – The StatsD server sits outside your application – if it crashes, it doesn’t take anything down with it. Listening for UDP packets is a good way to take in a lot of information without a latency hit to your application, or needing to worry about maintaining TCP connections.
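
To give a flavour of how lightweight the instrumentation is, here's a quick example using the Python statsd client package – the metric names are made up, and each call is a single fire-and-forget UDP packet:

import statsd

client = statsd.StatsClient("localhost", 8125, prefix="webapp")

client.incr("signups")              # count an event
client.timing("db.query_ms", 320)   # record a duration in milliseconds
client.gauge("queue.depth", 42)     # report a current value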


Drawbacks

Limited Functionality – StatsD is an aggregator more than a collector. You still need to instrument your application to send data to it. For thresholds or alerts, you’ll need to build in a backend or use something like Nagios.

Data Reliability – Fire-and-forget UDP has downsides, too – for dispersed networks or essential data, packet loss is a risk. Also, if you are using vanilla Graphite and send too much to StatsD within its flush interval, those metrics drop and won't graph. Hosted Graphite can handle it, though. 🙂

Zabbix

Zabbix is an open source monitoring framework written in C. Zabbix positions itself as an enterprise-level solution and so offers all-in-one monitoring that handles collection, storage, visualization and alerting.


Benefits

Range of Coverage – Zabbix can track not just in real time, but trends and historical data, too. Monitoring extends to hardware, log files, and VMs. A small footprint and low maintenance means you can fully exploit what Zabbix has to offer.

Convenience – Don’t like plug-ins? Nearly all the functionality you might want is built in to Zabbix. If something is missing, there are extensive, simple customization options. Templates tell you what to collect and configurations are set centrally through the GUI.


Drawbacks

New Alerts  – Service discovery and configuring things like triggers are both more involved than they really ought to be. Tools like Graphite and StatsD can start to track a new metric just by referencing it in your application. Zabbix is not one of those tools.

Large Environments – Zabbix doesn’t do well on big networks – all that historical data is greedy for database space, PHP for the front end has performance limitations, and debugging is awkward. For an enterprise-level system, that’s a bit underwhelming.


Recommendations

So – what do we recommend? One set of criteria for running monitoring services in production goes like this:

  1. Operationally it should play well with others – no taking down a box if something goes wrong, no dependency hell, no glaring security issues.
  2. Extensible without hacking on source code – a library should have some mechanism for supporting the entire stack you’re using without having to mutilate the code to get it working.
  3. Well supported by the community – Lots of users means that you know that bugs will be squashed, and updates will arrive to support new measuring scenarios or technologies.
  4. Works to its strengths – A library that tries to do everything itself ends up doing nothing particularly well.  Opinionated design means that a library can focus on the important issues without adding in pointless frivolities.

So – if I were to pick based on these criteria, I'd recommend either Diamond or CollectD. They both handle the collection of your data admirably with extensible plugins, and can forward it on to a storage and visualization service like Hosted Graphite (we even give you pre-made dashboards for both services!). They're both well supported by the open-source community and play nicely with your systems.

If you’re in the Java ecosystem there may be some natural attraction to DropWizard, or StatsD if you’re using a PaaS such as Heroku – but if you’re running your own servers or using AWS, CollectD or Diamond are a good fit.

System monitoring – what are my options?

There are many options for system monitoring – so many, in fact, that a lot of people turn to one of the two worst options: writing your own, or getting stuck in analysis paralysis and doing nothing.

Monitoring your systems and alerting when something weird happens is crucial to understanding and tackling issues as early as possible. That lets you work on the activities you have planned instead of reacting to outages, and ultimately keeps your customers happy.

There’s a whole range of free tools that monitor your systems and create metrics for you to graph, evaluate, and use to create alerts. In part 1 of this series, we’ll explore the pros and cons of three of these popular libraries.

CollectD

CollectD is a daemon that gathers system information and passes it on to Graphite. It is, as the name suggests, a collector rather than a monitoring tool, and stresses modularity, performance and scale.

Benefits

Quick and Easy – Setup is straightforward, configuration is painless, and maintenance is minimal. As a multithreaded daemon written in C, it's light on system resources and fast on clients. CollectD supports multiple servers and has a multitude of ways to store data.

Plugins – CollectD has a pile of ‘em: for specialized servers, for sending metrics to various systems, for logging and notification, for nearly anything. The default is enough to get started, but there’s plenty of flexibility once you get going. It plays nicely with Graphite.

Drawbacks

No GUI – CollectD is not a graphing tool – it simply spits out RRD files. There are scripts for a minimal interface packaged with it, but even the project admits that it’s not up to much. You’ll need to plug into Graphite or something similar to read CollectD’s outputs effectively.

Too Much Info – Sub-minute refreshing and the variety of plugins make it easy to overreach. If you ask for a lot of statistics from a node, you may get more data than you can graph and read effectively.


Munin

Munin is a resource and performance monitoring tool written in Perl. It doesn’t provide alerting, but Munin is a robust solution for cleanly presenting a lot of network data.

Benefits

Out-of-the-Box – Munin stresses ease of use; installation and configuration take minutes. Writing code to extend monitors is so simple you can use it for non-performance tasks like counting web traffic. You can set thresholds in Munin, but there is a recommended Nagios plugin to generate alerts.

Plug and Play – Like CollectD, Munin has a wide range of plug-ins to choose from: just grab a few scripts from the Plugin Gallery. The more elegant plug-ins can monitor long-view trends like annual resource usage. Writing new plug-ins for yourself is also no trouble.

Drawbacks

Central Server – Each server you’re monitoring runs a Munin process; these servers then connect to a main server. This model can lead to performance issues when the scale rises to hundreds of servers. Budgeting for that dedicated server will need to come sooner rather than later.

Graphs – The graphs generated by Munin are static – not ideal if you want some interactive views of your data. Also, these HTML graphs redraw after every refresh, creating big disk I/O and CPU hits on your system. As a whole, it’s pretty dated.


Dropwizard

Dropwizard is a Java framework that supports metrics and ops tools for web services. This collection of best-in-breed libraries is built for speed and robustness.

Benefits

Built-in Metrics – Instrument your service calls and performance metrics are collected automatically. Health checks publish metrics by service, too – handy for doing a lot of REST calls. Add in service configuration validation as a default feature and Dropwizard is quick to both deploy and change.

Container-less – All resources and dependencies are packed into fat JARs, making it a snap to write micro-services or add instances. Default configurations are sensible and updates are easy, too – you can deploy with one line.

Drawbacks

Performance – Each request has its own thread – even tuning maxThreads and maxConnections may not help throughput. This is problematic for the kind of I/O-bound applications that Dropwizard is likely to service. Dropwizard’s light weight cuts both ways – if you have high loads and a lot of developers, other options may work better.

Support – Dropwizard has an active community, but it's not as active as it was when Coda Hale was developing it. The cadence of releases can stretch to months. Documentation could be meatier, and even StackOverflow doesn't talk about it as much as other tools.


In the next article, we’ll check out a few other useful libraries and dig through the main factors you might look at when making a decision.

Enabling remote work

At Hosted Graphite, we rely on remote work – our CEO works full-time from the US and the rest of the team work from Ireland. We have a flexible policy on working from home (essentially, Nike-style: just do it). As long as work gets done, we don’t sweat the details of when or where it happens. Some work from home regularly, and some rarely.

The key to effective remote teamwork is communication. Every interaction with a colleague is easier in person, so when we're remote we need to put in a bit more effort – and we need to know exactly when to put that effort in.

If you’re a remote worker or a team that works with remote employees or freelancers, then we hope these tips will make your life a little easier.

Declaring

It’s crucial that your team knows when you’re working and when you’re unavailable. If you just drift in and out all day, your colleagues won’t be able to rely on you for a discussion or a decision because they never know when you’re going to be responsive.

When your day starts, declare it to the team. Nothing complicated, just "Morning! Working from home today." will do. When you're done for the day, make sure you say it: "I'm outta here, talk to yis tomorrow." If you're on a call, in a meeting or out for lunch, make your remote team aware of it ("Lunch, back in 30"), especially if timezones are involved.

Being present

Related to "Declaring" but different is being present. The team should feel that they can call on you when you're around, just as they would in the office. This means checking your communication tools regularly, or keeping notifications turned on. Basically, don't hide and isolate yourself; be available for your colleagues, because they can't tap you on the shoulder as they would in the office. If you do need to go quiet for a while to focus on a task, declare it so people know what availability to expect.

Discussions and decisions

With some people in the office and some remote, it quickly becomes apparent to anyone remote that decisions made in person in the office don't get discussed online. If this happens frequently enough you feel completely disconnected from the decision-making process, which quickly leads to dissatisfaction and demotivation.

Often it’s easier to just turn to each other in the office for a discussion and that’s fine, but if you know one of the remote folks might like to comment you should (1) make them aware that there is a discussion, and (2) give them a chance to either take part or say they’re happy to go with what the in-office folks decide on that issue. Often just leaving a little note like “X and Y discussed this in person and decided it’s OK” on a pull request, or saying “We’re discussing this in the office – Z, do you have an opinion on this one?” is a big step forward.

Being clear about when and how decisions are made and discussions are had will go a long way toward dealing with the feeling of disconnection that remote workers can feel.

Virtual standups

Usually, some of the teams within the company will have a daily standup where they talk through what they’re doing and any blocking points they have before deciding on what they’ll do next, and discussing the reasoning behind any changes in direction.
Once an in-person standup is done, the team will summarise it on Slack to allow remote folks to participate, and also to create a record for answering the “Where /did/ the week go?” question. A bot pings everyone to update their standup notes at the same time every day.

Usually this is a quick three-liner:

Done: (What I've done so far, how long it took)
Doing: (What I'm doing now, how long I've spent on it)
Next: (What task I'm switching to soon)

From this, anyone can see at a glance what the whole team are working on and how long it’ll take for us to push a feature, fix, or new integration. We can make sure any marketing efforts are keeping pace with the development team, and any new automation needed is in place. If anyone’s getting stuck it becomes pretty clear and we can figure it out.

Snippets

Not everyone does the dev-oriented daily standup notes. We also keep a git repo of what we call ‘snippets’, somewhat modelled on Google’s snippets. This is a daily summary of the stuff that people work on. It’s an optional way of describing the building blocks of the day – particularly useful for developers that are usually remote and anyone working on softer, non-dev work. My day usually looks something like this:

June XXth 2016
– Call with Charlie
– Sales call with Customer X
– Talking with our Lawyer about Y
– Support followup with Customer Z
– One-to-one call with <team member>
– Talking with partner company A
– Reviewing our marketing conversion rates

This has the added benefit of highlighting regular tasks that might be better off automated. If you're doing something that a computer can do on a semi-regular basis, you probably just need to leave it to a script. If someone's regularly working on low-value tasks we can direct that attention somewhere more valuable. There have been some great articles on the difference between 'action' and 'work' – work drives your business forward, while action smells like work but has no outcome (e.g. any task that starts with "investigating" is usually a giveaway).

It’s also a good way of noticing where you’re spending time that you shouldn’t be. For example, if you’re supposed to be doing product and development management but you have lots of actual development tasks in your snippets, that’s evidence that maybe you don’t have the balance right.

Snippets can also be a helpful tool for giving the team a chance to see what the non-technical management types do all day – how much is support, sales, hiring, dealing with the accountant, and so on.

Planning and task management

We keep a set of Trello boards which we separate into a few different themes:

  1. Product – Adding features to Hosted Graphite, or improving existing ones.
  2. Growth – Efforts to get new users into and through our sales funnel, improve conversion, or promote referrals.
  3. Bugs – Anything that’s creating a less than amazing experience for our customers (or operations team!) and needs to be fixed, or technical debt that needs to be paid off.
  4. The Salt Mines – Current issues/projects, what we’re working on in the next week or two.

We then divide these boards into Short, Medium, and Long-term sections. As co-founders and managers of the product, Charlie and I have a call a few times a week to wrangle these tasks between queues as necessary, or break them into smaller tasks. Once the tasks are on the Salt Mines board, developers are free to pull items from the 'Next' queue and drop them into 'Doing', then 'Done', Kanban-style.

Summary

Keeping a partially remote team on the same page is tricky – possibly even harder than keeping a fully remote team aligned, because there are more edge cases. We hope this insight into how we run a partially remote team is helpful, and that you'll find some of these tips useful for your own team.

Alerting from first principles

An Introduction to Alerting

Having recently added our Alerting for Graphite, we thought it’d be useful to put together a short primer on Alerting. What do you need to look at when considering what you alert on, and where those alerts go? An early warning system is only as good as its alarms.

What is alerting?

Monitoring uses alerts to tell you when something unexpected happens, if you need to act, and how you might fix a problem. Good alerts give you the right context to act and enough lead time to be effective. Bad alerts tell you what you already know or don’t need to hear – once you know a database is down, you don’t need to be reminded every minute.

If monitoring gives you data, then alerting gives you information.

How to do Alerts

Done properly, your alerts should trigger only for states or events that require attention or intervention. If you flood your sysadmins with minor alerts, they will try to read them all or ignore them altogether – both poor outcomes! Every sysadmin I’ve ever spoken to gets a thousand-yard-stare when I mention Nagios’s propensity to fill your mailbox with redundant information.

For simple record keeping, set up descriptive logging in a human readable format to capture an event so you can dig into it later – e.g. “Production web server number of 500 errors”.  A good rule of thumb for logging sensitivity is to trigger alerts on what might equate to syslog standard severity levels of Error and higher.
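
In Python terms, that rule of thumb looks something like routing only Error-and-above records to whatever does the paging, while everything else just goes to the log file – a minimal sketch:

import logging

# Keep everything at INFO and above in the log file for later digging.
logging.basicConfig(filename="webserver.log", level=logging.INFO)

# Stand-in for a real paging handler: only Error and higher should alert.
pager = logging.StreamHandler()
pager.setLevel(logging.ERROR)
logging.getLogger().addHandler(pager)

logging.warning("500 count creeping up")     # logged, no alert
logging.error("500 rate above threshold")    # logged and alerted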


Each alert should capture at least these fields:

  • Status – What’s wrong?

A simple, specific statement of what’s changed: a server offline, power supply interrupted, large numbers of users dropped, unusually long response times.

  • Priority – How urgent is it?
      • High – Something is on fire that must be fixed; wake the right person to tackle the problem. A smoke alarm in a data centre needs a quick response from your on-call engineer, and probably the Fire Department, too.
      • Medium – Something needs action but not right away; check the logs tomorrow so technical staff can follow up. Your secondary backup server running low on disk space is a risk for you to deal with this month, but not a crisis today.
      • Low –  Something unusual happened; email the details to create an evidence trail for later investigation. There are weird traffic patterns on your internal network – is everyone streaming Game of Thrones clips on Monday morning? Have a look when you get the chance.
  • Next steps – What do we do?

A list of product/service owners, escalation paths, and immediate corrective actions. This is a good place for some easy troubleshooting – if the team working overnight can solve the issue with a reboot, then you don't need to take it any further. Runbooks are a life-saver in the small hours of the morning, giving the bleary-eyed ops team some simple guidance when nothing's making sense.
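
Putting those fields together, a single alert might be structured something like this (purely illustrative):

alert = {
    "status": "Production web server: 500 error rate above 5% for 10 minutes",
    "priority": "high",   # high / medium / low, as described above
    "next_steps": [
        "Check recent deploys and roll back if one lines up with the spike",
        "Runbook: <link to the web tier runbook>",
        "Escalate to the web team lead if unresolved after 30 minutes",
    ],
}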


Further Tips

  • Tune your thresholds regularly to eliminate noise and create alerts for previously undetected incidents. If load spikes during commercial breaks in the big game, tweak your alerts to accommodate that.
  • Don’t confuse priority and severity. Extra processing time for an ecommerce transaction, for example, might be a medium severity defect; but priority depends on factors such as user traffic and SLA terms. What’s an inconvenience on Easter Sunday could be a potential disaster on Black Friday!
  • Disable alerts for test environments, maintenance windows, and newly deployed services – waking someone up for a false positive makes for an angry ops team.
  • Update your call sheet with current contact details – when time is crucial, there’s no room to chase down the former service owner who handed over their admin rights last month.


A final word

Every business has a different set of critical paths – you know your systems and people best. Alerts can be automated, but the wisdom behind them can’t be.

  • Establish the remediation procedures that will be kicked off by alerts.
  • Discuss with engineers the kind of diagnostic data that is useful to them – Hosted Graphite alerts can drop graph images directly into Hipchat and Slack.
  • Write a text description for each alert that gives unambiguous instructions for resolution.

An alarm doesn’t mean panic when everyone knows there’s an established process they can trust.

No brogrammers: Practical tips for writing inclusive job ads

A common problem with hiring for tech companies is that job ads often use strong, offputting language that alienates women, people of colour, and other minorities in the tech community. By paying attention to the language we use to describe ourselves, our ideal candidates, and the job responsibilities, we can broaden the net of candidates that might apply and help in some small way to tackle the tech diversity problem.

We're trying to make our job ads more friendly and focused on human qualities rather than technical qualities. While we can always train someone in a technical skill, we can't train someone to be a nice person. Or maybe we can, but that's much, much harder, and it doesn't sound like anyone would have a good time with that.

Quite a lot has been written about what to do about the problem of poor diversity in tech, but it can be hard to distill that into actionable suggestions. That’s what this post is – an attempt to show examples of what we changed and the thinking behind it.

In this post we’ll be comparing two versions of the job spec for very similar positions. Here’s the one from two years ago, and here’s the most recent version.

We used to say:

Several years of Linux sysadmin experience.

Now we say:

Significant Linux system administration experience.

The reason is that stating an amount of time is offputting to people that haven’t put in that specific amount of time. Years of experience is a proxy for learned skill, but in this case it wasn’t helping us filter candidates better and could only have been offputting to someone without “several years” of experience. “Significant” is still poorly defined (intentionally!) and hopefully less strict.

We used to say:

Your code will be exercised by 125,000 events every second, so performance is pretty important to us! A decent knowledge of common data structures and algorithms is expected.

Now we say:

An eye for performance is important – your contributions will be exercised by more than fifty billion events per day. We always have to think about how something will scale and fail.

The thinking here is that "A decent knowledge of common data structures and algorithms is expected." is quite specific, and this could be offputting. Instead, we talk about how we have to think about how things will scale and fail, which puts more of a focus on learning than on already knowing the technical details. We also changed the number of events to a per-day figure – we're very proud of our growth, but we don't want the figure to be offputting to a potential candidate.

Where once we said:

We want to see that you know your stuff.

Now we say:

We want to see that you have relevant experience, that you like automating away repetitive work, that you have good attention to detail, an aptitude for learning new skills and that you have empathy for your team-mates and our customers.

We hope this one is obvious. What does “knowing your stuff” mean, anyway? It sounds too close to the rockstar/brogrammer/crushing it nonsense that infests the tech industry. We chose to replace this macho phrase with something better that covers relevant experience, a preference for the kind of work we do, a personality trait, a focus on learning, and empathy for colleagues and customers.

Where we used to put a burden on someone by saying:

You’ll need to help us scale them individually, …

We now say:

We’ll need your help to scale them individually, …

This seems minor, but turning this around makes it clear which direction the responsibility and contribution goes. It’s not that you need to help us, it’s that we need your help. Instead of having a burden dumped on you individually, maybe you can help the team work on this problem?


We make a point of saying that we care about the health of our employees:

We want healthy, well rested ops people.

Some early feedback on this blog post pointed out that the word ‘healthy’ here might feel exclusionary to someone with a disability. After some thought, this was changed to ‘relaxed’:

We want relaxed, well rested ops people.

This isn't quite the same as what we meant by 'healthy', because on-call work can damage one's mental and physical health and we wanted to point out that we care about this, but 'relaxed' and 'well rested' convey most of it and are good enough. Suggestions appreciated!


Where we previously implied good communication skills:

We have one co-founder living in the US and we use IRC, Workflowy and video chat tools like appear.in to keep in touch.

We now explicitly state it:

Most of the team works out of the Dublin office, but we’re flexible about working from home and one of our co-founders is living in the US, so we’re partially remote and we have to be good at communicating. We use Slack, Google Docs, Trello, Workflowy and video chat tools like appear.in to keep in touch.

The thinking here is that we wanted to mention that we’re flexible about working from home which is better for families, and we explicitly say that we “have to be good at communicating.” We’re not saying that we are good at communicating, just that the business recognises that we have to be, so you can expect supportive and communicative colleagues that won’t make working from home any harder than it has to be.


As a small and growing company, we didn’t offer health insurance two years ago but we do now, and we wanted to make sure to point out that it includes family cover:

Health insurance for you and your family.


In both job ads, we said:

We’d like to see some of your code, but it’s not essential.

We understand that not everyone will have published code – we recognise that open source male privilege is real, and some people are discouraged from publishing code because of that. Other people may not be able to publish code due to their employer’s privacy requirements, or are just too busy spending time with their family to code the evenings away.


Finally, we used to say:

No ninjas, rockstars or brogrammers, please.

This is amusing and captures our opposition to Silicon Valley rockstar/brogrammer culture well, but for a job ad that’s all about inclusiveness it felt a little odd. So we made it better:

No ninjas, rockstars or brogrammers, please; just nice, caring humans.

Of course if you actually practice martial arts, play in a rock band, or enjoy coding *and* going to the gym, you are welcome here. It’s just the “bro culture” and hiring “ninjas and rockstars” trends we’re not keen on. 🙂

There you go, those are some of the thoughts that went into our most recent job ad. We think our old job ad was already pretty good (and many people have told us so!) but we tried to make the latest one more inclusive and to focus more on the individual, learning, family, and support. We tried to remove elements of competition or hard technical requirements, and to keep the job ad buzzword nonsense to a minimum.

So how are we doing? What can we improve for next time? Let us know by tweeting at @HostedGraphite.