Status page updates: It’s all about timing

To continue our push for transparency in how we do things at Hosted Graphite, we’ve decided to share a series of posts detailing our internal SRE processes. Last time, we looked at how to write a status page update, with some practical tips, examples, and the thinking behind it all. In this second post of…

How to write a status page update

No matter how hard we try to provide an uninterrupted service, outages and service degradations are inevitable. Notifying customers about incidents is tricky, and engineering teams often get it wrong. However, by paying close attention to how we write our status page updates, it’s possible to make a significant impact on our customers’ experience…

The Secrets of Load-balancing Long Lived TCP Connections

How do you deal with load-balancing customer traffic at the border of your infrastructure when you don’t own the network? Following a series of experiments, I implemented a service that leverages our internal Graphite monitoring to dynamically weight HAProxy backend servers based on some measurement of load. Scaling problems: In the early days, we relied…

Communicating with twits: How to minimize friction between Dev and SRE

In most companies, there’s an ongoing battle between Dev and SRE teams. A lot has already been written about this and the problems it causes. As with the majority of conflicts, most of the friction comes from misunderstanding and poor communication. That’s why one way to resolve this tension is to break down barriers and…

Request schema validation, a double-edged sword

Making sure data is valid can be a tedious process, especially for complex systems. We have many models in our system that are changed constantly – these models are controlled by our APIs. An example is our alerting API, which allows users to control their alerts via HTTP requests. Over the past few years, we…

Walk, talk and git commit: SRE onboarding (2/2)

In part one of this series, I talked about my early weeks as an SRE at Hosted Graphite. After jumping into on-call, getting to grips with our architecture and getting acquainted with five years’ worth of tasks, I was almost ready to call myself a fully fledged member of SRE. Little did I know, my…

But first, on-call: SRE onboarding (skydiving for nerds) 1/2

Onboarding a new hire is a tricky process and can be very difficult to get right. I’ve worked at/with companies that have had zero onboarding or way too much. In the past, it was either: being pushed out of the plane without a parachute; or the parachute was already deployed and I didn’t make it…

Spooky action at a distance, how an AWS outage ate our load balancer

Distributed systems are complex beasts and notoriously hard to debug. Sometimes it’s hard to understand how an outage on one service will affect another, and no matter how much we think we understand a given system, it will still surprise us in new and interesting ways. What follows is the story of one of those…

Developing and deploying Python in private repos

At Hosted Graphite, most of our deployed services are written in Python, and run across a large installation of Ubuntu Linux hosts. Unfortunately, the Python packaging and deployment ecosystem is something of a tire fire, particularly if your code is in private Git repositories. There are quite a few ways to do it, and not…