How to write a status page update

No matter how hard we try to provide an uninterrupted service, outages and service degradations are inevitable. Notifying customers about incidents is tricky, and engineering teams often get it wrong. However, by paying close attention to how we write our status page updates, it’s possible to make a significant impact on our customers’ experience…

The Secrets of Load-balancing Long Lived TCP Connections

How do you deal with load-balancing customer traffic at the border of your infrastructure when you don’t own the network? Following a series of experiments, I implemented a service that leverages our internal Graphite monitoring to dynamically weight HAProxy backend servers based on some measurement of load. Scaling problems: In the early days, we relied…

Communicating with twits: How to minimize friction between Dev and SRE

In most companies, there’s an ongoing battle between Dev and SRE teams. A lot has already been written about this and the problems it causes. As with the majority of conflicts, most of the friction comes from misunderstanding and poor communication. That’s why one way to resolve this tension is to break down barriers and…

Walk, talk and git commit: SRE onboarding (2/2)

In part one of this series, I talked about my early weeks as an SRE at Hosted Graphite. After jumping into on-call, getting to grips with our architecture and getting acquainted with five years’ worth of tasks, I was almost ready to call myself a fully fledged member of SRE. Little did I know, my…

But first, on-call: SRE onboarding (skydiving for nerds) 1/2

Onboarding a new hire is a tricky process and can be very difficult to get right. I’ve worked at or with companies that have had zero onboarding or way too much. In the past, it was either being pushed out of the plane without a parachute, or the parachute was already deployed and I didn’t make it…

Spooky action at a distance, how an AWS outage ate our load balancer

Distributed systems are complex beasts and notoriously hard to debug. Sometimes it’s hard to understand how an outage on one service will affect another, and no matter how much we think we understand a given system, it will still surprise us in new and interesting ways. What follows is the story of one of those…

Collaboration > evaluation: Why we pay SRE candidates to interview all-day

As a team of (mostly) engineers, we understand why a growing consensus holds that the process of hiring for tech is broken. Between us, we’ve interviewed hundreds of times and have had our fair share of lacklustre, or downright terrible, interview experiences. When interviewing for SRE it becomes particularly difficult: we’re looking for qualities like empathy…