No matter how hard we try to provide an uninterrupted service, outages and service degradations are inevitable. Notifying customers about incidents is tricky, and engineering teams often get it wrong. However, by paying close attention to how we write our status page updates, we can significantly improve our customers’ experience when things do go off course.
At Hosted Graphite, we push to be as open as possible about how we do things. So in this new series, we’re sharing some of our internal SRE processes. This first post looks at the guidelines our SRE team follow to communicate with customers during an incident, with some practical tips, examples, and the thinking behind it all.
We’re a team of engineers, so for most of us writing is not our focus (and no, Harry Potter/Teenage Mutant Ninja Turtles crossover fanfic doesn’t count). We don’t expect our updates to be perfect, but we do try to set a common tone that correctly represents the company. Keep in mind: every mention of “you” is intended as “you, the oncall person and keeper of the sacred pager” or the person in charge of comms if you’re following an Incident Command system.
The elements of a good status page update
In this section, we look at each element of an update, its content and structure, and include some notes on language and tone. We’re a cloud monitoring service, so our customers tend to be fellow engineers or otherwise technical people. Our users therefore expect a certain level of technical detail, and our updates are written with this type of reader in mind.
The title
In some ways, the title is the most important part of an update. Very often, it’s the only thing the user will see before making the decision to read on, so it must answer the following question: “Should I [the user] care about this?”.
To answer this, the title should state how the problem affects them, as clearly and succinctly as possible. For example,
Issues with graphs
is not a very descriptive title, and it could mean a myriad of things (are they slow? missing data on the leading edge? are there gaps? is it something else?), whereas,
Render gaps when querying 30s data
Increased render times
are both clear and specific.
The body
Try to think what would be useful for you to read if you were in the user’s shoes. As a rule this should include:
- A clear description of what the impact is, with as much detail as possible: There’s nothing more frustrating than a status update that doesn’t tell you if whatever you’re experiencing relates to the described incident or not.
- What’s not impacted: In particular, if the incident relates to an issue that affects our ingestion layer, it’s important to clearly state whether data has, or hasn’t, been lost. The worst type of incident for our users usually involves data loss, so we need to make sure there’s no ambiguity in our communications regarding the integrity of the data we store.
- The time at which the problem started: This is important because it shows that we know what’s happening and that our monitoring gives us full observability into user impact.
- The time we consider the issue to be resolved: If we’re still seeing problems after the incident is marked as resolved, there may be something else at play.
- A technical description of what the problem is: Our audience is (mostly) composed of technical people, so some degree of detail and openness in our technical descriptions is appreciated. This also lends us credibility when we outline the actions we will take to avoid/mitigate future incidents.
- What we are doing to resolve the problem, again with some level of detail: Ideally this means something other than “we are applying a fix”. Of course we are applying a fix, but we need to communicate what that fix is. Maybe we are increasing capacity on one specific layer in our ingestion pipeline, or clearing the cache on one particular service.
Updates should be clear and concise, but at the same time, should answer all the questions above.
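Putting the checklist together, a complete update for a hypothetical incident (the service names, times, and details below are invented for illustration) might read something like:

```
Render gaps when querying 30s data

Starting at 14:05 UTC, graphs querying 30s resolution data may show
gaps on the leading edge. Data ingestion is not affected and no data
has been lost; renders of other resolutions are unaffected.

We have identified a capacity problem in our aggregation service and
are adding nodes to that layer to work through the backlog. We expect
renders to be fully caught up within the next hour.
```

Note how it covers impact, what’s not impacted, the start time, a technical description, and the specific fix being applied.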
How open should we be?
The answer is, as much as we reasonably can be. We already expose a number of essentially internal metrics on our public status page, because our users appreciate that level of detail. That said, there are some things we don’t necessarily want to share:
- Internal service names: Refer to them with generic names that make sense to our users, such as “aggregation service” or “alerting service”.
- User/server counts, traffic rates: Essentially, think twice about sharing anything defined by an absolute number, e.g. “1234567 datapoints were dropped during this incident”. It’s better to express these as percentages instead, e.g. “10% of users were affected”, “3.5% of datapoints ingested between X and Y were affected”, etc.
- Additional details that don’t relate to this particular incident. For example, if things broke after a deployment, it might not be necessary to specify that the deployment was “adding stardate support to grafana” unless it’s particularly relevant to the issue at hand.
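As a rough sketch of the arithmetic behind the percentages guideline: take the absolute count (which we don’t want to publish) and turn it into a percentage (which we do). The counts below are made up for illustration.

```python
# Hypothetical internal numbers; these never appear in the update itself.
dropped = 1_234_567
total_ingested = 35_273_342

# Publish the ratio, not the raw counts.
pct = dropped / total_ingested * 100
message = f"{pct:.1f}% of datapoints ingested between X and Y were affected"
print(message)
```

One decimal place is usually plenty of precision for a status page; more detail just invites back-of-the-envelope reconstruction of the absolute numbers.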
Some notes on language and things to do/avoid
Do: Be clear and concise
Our goal is to provide our users with answers, not more questions, so let’s try to be as clear as possible. This means that,
we have implemented a fix
is meaningless without actual details on what the fix really is. Here’s a great example of what not to do (what happened? was there an issue? did it get fixed? what did you do? who killed Laura Palmer?):
|18:23 IST Everything is operating normally.
18:06 IST We are investigating issues loading stylesheets.
18:04 IST We are investigating reports of a spike in error rates.
We are monitoring the situation.
is a bit useless as well. If we want to convey to our users that we’re monitoring something, it pays off to be specific about what exactly we’re monitoring, for example:
We are monitoring the results of the backup restore job and it looks good so far.
So far no errors have been reported by our plumbus cleanup subroutine.
Usually we prefer the active voice over the passive voice, so:
we are rebooting the affected server.
is better than,
the affected server is being rebooted.
This helps us better indicate who is performing the action.
Splitting our update across paragraphs is usually a good idea too. For example, use the first paragraph to describe the issue that has been identified, the second paragraph to talk about the actions we are currently taking to mitigate it, and the closing paragraph to add some extra clarification. This helps to separate different ideas in a clear way. Here’s an example from our own status page:
|We have identified and resolved an issue that caused some traffic alerts to trigger as false positives. This only affected traffic alerts (alerts of the form “HG-Alert: Concurrent metrics above 80% of limit”) and not regular alerts.
A new version of one of our services was deployed at around 16:00 UTC yesterday that introduced this issue, which was fixed at 23:10 UTC. Unfortunately the alerts triggered by this issue had entered an invalid state that needed to be fixed before the alerts could be resolved. This change has been made as of 10:45 UTC, so all traffic alerts are back to normal.
If you have received a traffic alert during this time interval it’s likely that it’s just a false positive.
Don’t: Use language that implies we don’t care
Our users pay us to care about their data (and we do!) but sometimes it’s easy to give the wrong impression if we’re not careful with how we communicate certain things.
For example, calling something a “minor issue” suggests we don’t think it’s important, but it certainly feels important to any affected user.
Empty apologies are another great way to give users the impression that you don’t really care about their issue. Things like,
We apologize for any inconvenience this may have caused.
reads like an empty apology at best, and a mildly infuriating dismissal at worst (there’s no “may have” when users are impacted). If you can’t find a way to word a sincere apology, then it’s better to leave it out of the update, rather than including some random stock apology for the sake of it. We are humans, not NPCs, after all.
Passing the buck to our providers is also not okay. We can’t just say that it’s our provider’s fault and shrug it off, as this implies that we don’t care enough, and that we don’t believe it’s our responsibility to provide a reliable service to our users. Providing the right level of service to our users is always our responsibility. It’s acceptable to say that we are working with our provider to address an issue, as long as we make it clear that we’re not just shifting the blame in their general direction.
Do: There is no IST, there’s only UTC
Make sure to add exact times whenever possible, and always include the timezone with them. To be consistent, all times in our updates are in UTC and 24h format:
16:20 UTC # Like this
04:20 PM # Not like this
17:20 IST # And not like this
Stardate 2017.3234 # What is wrong with you
Ask for help!
If you’re not sure what the appropriate wording for a given situation is, ask around! People will be happy to offer suggestions on what language to use, or do some proofreading on what you’ve written.
As a final note — most of us are just “techies”, so we’re not expecting any of our status page updates to be a literary masterpiece. That said, communicating clearly and with the right level of detail is important. Given that we’re representing the company in public communications, taking some time to strike the right (and consistent) tone is a good idea.
Next up in this series, we’ll look at when to write a status page. Comments, questions, or anything to add? Tweet us @HostedGraphite.
This post was written by Fran Garcia, SRE at Hosted Graphite.