To continue our push for transparency in how we do things at Hosted Graphite, we’ve decided to share a series of posts detailing our internal SRE processes. Last time, we looked at how to write a status page update, with some practical tips, examples, and the thinking behind it all. In this second post of the series, we dig a little deeper into the timings and responsibilities of communication during incidents: when to write an update, how quickly it should be posted, and who’s responsible for what.
Who should write status page updates?
The SRE team is responsible for incident communication on our public status page. Although it’s the on-call person’s responsibility to decide whether a status update is needed, it’s not necessarily the on-call person’s responsibility to write the status update themselves, and they can delegate this task to other SREs following an Incident Command approach.
When should I write a status page update?
Not every alert or event that requires on-call attention will automatically require a status page update. For instance, sometimes we’re able to detect issues before they’re able to cause user impact, so there’s no point in notifying users of impact that doesn’t exist. While it’s very important that our status page shows our commitment to transparency with our users, we need to make sure our updates don’t result in alert fatigue for our users.
Deciding whether a given incident requires us to write a status page update or not is up to you (assuming you’re the on-call person). That said, there are a few guidelines we can follow to inform this decision:
- Is this impacting some of our users in a visible way? For example, are render times visibly increased for 5% of our users? Or maybe there are gaps in 2% of metrics for all users? Something that we know affects a single user isn’t necessarily worthy of a status page update, and can be handled via the usual support mechanisms.
- Can this result in people noticing the impact and deciding to open a support ticket asking us about it? In that case, a status page update is a good way to let users know that we’re aware of a given problem and currently working on correcting it.
As a general rule, if in doubt, err on the side of caution and post an update anyway. We’d rather overshare than give users the impression that we don’t notice or care about issues with our infrastructure. Therefore, should we find ourselves debating whether an incident belongs on our status page or not, the answer should default to “yes, we’ll post it and we’ll sit down to discuss and define our policies afterwards.”
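The guidelines above can be sketched as a small decision helper. This is only an illustration: the `Incident` fields and the `needs_status_update` function are hypothetical names invented for this example, not part of any real tooling, and the final call always stays with the on-call person.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All fields are hypothetical, chosen to mirror the guidelines above.
    affected_users: int           # users we know are affected
    visible_impact: bool          # e.g. slow renders or gaps in metrics
    likely_support_tickets: bool  # would users notice and ask us about it?
    in_doubt: bool = False        # on-call can't decide either way


def needs_status_update(incident: Incident) -> bool:
    """Return True if the incident should go on the status page."""
    # If in doubt, err on the side of caution and post anyway.
    if incident.in_doubt:
        return True
    # Visible impact on more than one user is worth an update; a
    # single affected user can be handled through regular support.
    if incident.visible_impact and incident.affected_users > 1:
        return True
    # Otherwise, post only if users are likely to notice and open tickets.
    return incident.likely_support_tickets
```

For example, an issue caught before it reaches users (`affected_users=0`, `visible_impact=False`, `likely_support_tickets=False`) returns `False`, while the same incident with `in_doubt=True` returns `True`.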
How quickly should I get the status page update out?
Ideally, as soon as possible.
Our users won’t know that we’ve detected something is wrong and are working to mitigate the issue until we acknowledge the situation, so a quick response is important. That said, there’s a delicate balance between providing accurate and relevant information (what the exact impact is, the actual start time of the incident) and having a fast response.
For the initial notification, we should prioritise speed over accuracy. Later, we can (and should) expand on the initial update and add/correct any details that might be either missing or incorrect.
As a general rule of thumb, if we decide a status page update is needed for a particular incident, it’s good to cap initial fact gathering to a few minutes before posting the first update. We’ll have time later on to be more accurate.
How often should we update the status page?
This usually depends on the incident’s current status:
- For incidents in “Investigating” status, we shouldn’t wait more than 20 minutes between updates. Obviously, if we have a relevant update (such as moving the incident to an “Identified” status) we should post that right away.
- For incidents in “Identified” status, 20 minutes between updates is a good rule of thumb. However, when working on mitigating incidents, frequent updates might not necessarily make sense. In that case, the frequency can be lowered, provided this is communicated in the previous update. For example, if we need to run a data replay that we expect to take 4 hours, we don’t need an update every 20 minutes saying that the replay is in progress. Instead, it makes more sense to lower the frequency to hourly and explain that new updates will be posted hourly, or whenever relevant new information emerges.
- For incidents in “Monitoring” status, frequency of updates depends entirely on the situation. That said, it’s still important to keep users in the loop by letting them know when they should expect an update. If we plan to update them in four hours but we don’t let them know, they might think we’ve forgotten about them (and we wouldn’t do that sort of thing).
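The cadence rules above can be sketched as a small helper that tells on-call when the next update is due. The function name, the status strings, and the `announced_interval` parameter are assumptions made for this example; they don't correspond to any status page provider’s API.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default maximum gap between updates, taken from the rules above.
# "Monitoring" has no default: the interval depends on the situation.
DEFAULT_INTERVALS = {
    "investigating": timedelta(minutes=20),
    "identified": timedelta(minutes=20),
}


def next_update_due(
    status: str,
    last_update: datetime,
    announced_interval: Optional[timedelta] = None,
) -> Optional[datetime]:
    """Return when the next update is due, or None if nothing is committed.

    announced_interval is the (hypothetical) cadence we told users to
    expect in the previous update, e.g. "hourly" during a long replay.
    """
    status = status.lower()
    # While investigating, never wait longer than the 20-minute default.
    if status != "investigating" and announced_interval is not None:
        return last_update + announced_interval
    default = DEFAULT_INTERVALS.get(status)
    if default is None:
        # No default and nothing announced: tell users when to expect
        # the next update instead of leaving them guessing.
        return None
    return last_update + default
```

With an update posted at 12:00, an “Identified” incident with an announced hourly cadence is next due at 13:00, while a “Monitoring” incident with no announced interval returns `None`, which is a prompt to go and announce one.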
Can we provide additional details via other means?
As a norm, we don’t provide details about ongoing incidents through channels other than our status page (such as support requests or Twitter), and we should encourage other teams to redirect any user enquiries on the matter to our status page.
The reason we take this approach is that any time we spend explaining a new development or details to a single customer could be better spent adding that information to our status page. Added to that, if you’re communicating through two different channels, one is bound to contradict the other when information gets out of date, which leads to confusion rather than clarity for our users.
During incidents, we need to be as efficient as possible with our time. To that effect, we’ve found that explaining things to members of other teams so they can communicate them to customers isn’t the best use of our time during an incident and, most importantly, results in a worse quality of service for our customers. This is usually because information relayed to other teams during an ongoing incident is prone to be inaccurate or out of date by the time it’s communicated. And if the information is relevant to the incident, we need to ensure that all of our users are exposed to it.
If we (or our users, or other teams) feel that we’re missing important details on our status page, it’s our job to collect that feedback and address it, as we want this page to be useful and relevant for everybody.
Components are people too!
Well, OK, not really people (not even in the "Soylent Green is people" sense), but they're an important part of our status page that's often overlooked. The first thing any user sees when they visit https://status.hostedgraphite.com is the historical uptime graphs, which are based on component statuses.
If we don’t update our components when there’s something wrong with them, these graphs will be built on lies and broken dreams. Users will lose confidence in our status page if they know alerting was down yesterday but our graph shows a historical 100% uptime.
To avoid this, we try to update components to a relevant status whenever an issue has been identified/resolved.
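To see why stale component statuses skew the graphs, here is a deliberately simplified sketch. It assumes uptime is just the fraction of a time window in which a component was not marked as degraded; the real graphs may be computed differently, and `uptime_percent` is a hypothetical helper written for this example.

```python
from datetime import timedelta
from typing import Iterable


def uptime_percent(
    recorded_outages: Iterable[timedelta],
    window: timedelta = timedelta(days=1),
) -> float:
    """Uptime over `window`, based only on outages that were recorded."""
    # Only outages recorded against the component count as downtime.
    downtime = sum(recorded_outages, timedelta())
    return 100.0 * (1 - downtime / window)


# Alerting was down for six hours, but nobody updated the component:
unrecorded = uptime_percent([])
# The same day with the component status updated during the outage:
recorded = uptime_percent([timedelta(hours=6)])
```

Under this assumption, `unrecorded` comes out at 100.0 and `recorded` at 75.0: the graph can only be as honest as the component statuses behind it.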
We’re done now! Do we need a post-mortem?
That’s a really good question, and the answer is a resounding and definite “maybe”.
In an ideal world, every single incident except the most trivial ones would have a post-mortem attached to it (and “trivial” incidents would never make it to our status page), but this is not always possible. Sometimes we have too many other things going on and can’t afford to spend time writing a public post-mortem. As it stands, this is a judgement call on the part of on-call/management. That said, our goal is to have a post-mortem included with as many incidents as we can.
If a given incident felt “big” enough (remember, all incidents are big to the users affected) or if there’s an important lesson to be learned, then it’s quite probable that a post-mortem will be required.
Just because it’s on-call’s responsibility to make sure a post-mortem is written doesn’t necessarily mean the on-call person needs to write it themselves. You just need to make sure it gets done, but can delegate the actual work to your fellow SREs so you can focus on coming up with follow-up tasks or addressing any other fires that might be going off elsewhere. Working on a post-mortem shouldn’t prevent you from dealing with other ongoing incidents, but ongoing incidents aren’t necessarily an excuse for not publishing a post-mortem on a past incident. The bottom line is that a post-mortem is a great way to remind our users that we care about providing them with good service.
When should we write a post-mortem?
Some investigations take longer than others, but as a general rule, posting the post-mortem at some point on the following business day is acceptable. There’s no need to stay late just so we can publish a post-mortem on the same day the incident happened. You’re also not expected to write one over the weekend (unless something truly catastrophic happened, and at that point your manager will probably want to help you with it). We don’t write post-mortems to check a box, so there’s no sense in rushing it. If, when resolving an incident, you already know we’ll want to publish a post-mortem, you can include in the resolution message that a post-mortem can be expected in the next 24-48 hours.
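As a rough illustration of the “following business day” rule, the sketch below computes the target publication date for a post-mortem. The `postmortem_due` function is a hypothetical helper; it only skips Saturdays and Sundays and knows nothing about public holidays or truly catastrophic incidents.

```python
from datetime import date, timedelta


def postmortem_due(incident_day: date) -> date:
    """Return the next business day after the incident."""
    due = incident_day + timedelta(days=1)
    # weekday() is 5 for Saturday and 6 for Sunday: nobody is
    # expected to write a post-mortem over the weekend.
    while due.weekday() >= 5:
        due += timedelta(days=1)
    return due
```

An incident resolved on a Friday would therefore have its post-mortem due the following Monday, not on Saturday.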
Some related reading
- https://www.atlassian.com/blog/statuspage/how-to-write-a-good-status-update
- https://signalvnoise.com/posts/1528-the-bullshit-of-outage-language
- https://blog.serverdensity.com/write-status-updates/
Next up in this series, we’ll cover how to write a post-mortem after a production incident. If you haven’t read it, don’t forget to check out part 1 for some tips on writing a status update. To learn more about what we do, visit hostedgraphite.com or follow us on Twitter.