In most companies, there’s an ongoing battle between Dev and SRE teams. A lot has already been written about this and the problems it causes. As with the majority of conflicts, most of the friction comes from misunderstanding and poor communication. That’s why one way to resolve this tension is to break down barriers and work in a way that’s transparent to other teams.
Transparency is one of those words that’s thrown around a lot but often leads to very little meaningful action. At Hosted Graphite we don’t just say we’re transparent: we share everything from the status of our internal systems and how we test them, to the thoughts behind the language in our job ads and what the first few months are like as a new engineer. We even teach staff how to ask for a pay rise.
That said, when it comes to SRE, working in a transparent way can be tricky. There are usually hurdles like tribal knowledge and technical context that make it difficult to keep Devs and other teams in the loop. Below, Dave Fennell talks about all the steps Hosted Graphite’s SRE team take to keep the rest of the company posted about how they work and what they’re up to – and why this is so important.
‘Us vs Them’
Poor communication causes all sorts of issues between teams. Very often, a lack of understanding makes it hard to know who is responsible for which piece of infrastructure or code. In the worst and most dysfunctional cases, there is an ‘Us vs Them’ attitude where teams operate as isolated tribes, in opposition to everyone else. This isn’t unique to SRE of course, but the risk is higher for SRE than for teams whose work is more visible to the rest of the company (feature work being rolled out, for example).
Things like an open planning board or task management tool are often pitched as the solution. This sounds great on paper but rarely works in practice. While it’s better than being totally closed off, realistically very few people are going to keep track, and those who do will still miss one of the fundamental aspects of transparency – context.
The Twits
At Hosted Graphite, the SRE team release a weekly update aimed at the whole company called “This Week in Team SRE”, or more colloquially “TWiTS”. It usually includes what each team member has been working on, a count of incidents (and the times they occurred) as well as SLO breaches, if there are any. On top of this, the on-call engineer talks about any incidents, what the team did to mitigate them, and planned actions for preventing similar issues in the future. The update is aimed at a general audience, so it offers context without the need for much technical knowledge. As a result, the wider company can see what SRE are up to at a glance, rather than reading through a week’s worth of tickets (which, let’s face it, nobody’s going to do). Engineers from other teams have told us it’s particularly useful when we’re working on long-running projects that don’t have an immediate external impact, and for flagging blockers (such as busy on-call weeks).
Audience participation time
While a weekly update like TWiTS is a good way to update teams, there will always be people who just won’t get to read it. That’s one of the reasons we also organise a weekly talk. Every Wednesday, someone in the company gives a talk about their own area of expertise, be it marketing or sales or how a specific part of our infrastructure works. This encourages people to share what they’re working on and offer that ever-valuable context across teams. It pays off when, for example, a support situation arises that SRE need to help out with – we have common ground to work from. On top of that, SRE have been experimenting with on-call drills and incident simulations (almost like a role-playing game, with members of the SRE team serving as the game masters). These give other teams a feel for our on-call and incident management protocol by using an abstracted version of one of our real incidents, condensed into an hour’s resolution time.
Why this is helpful to SRE
This openness has already been beneficial. Knowing what other teams are doing has allowed us to catch problems or flag potential pain points across various projects well in advance, where a lack of transparency would have hidden potential flaws in the design. One of the things we do, regardless of the team, is to consider ‘who is this going to affect outside the team?’ when we make changes, or plan to. As a part of the process, we make sure the people affected provide input and are involved in the review.
Transparency comes in a lot of forms, and will likely mean something different to everyone, but in our experience, regularly investing in transparency has already started to show benefits.
Like the sound of working in an open team? Join us, we’re hiring software engineers!