Advanced data views: Better observability, more control

At Hosted Graphite, we process billions of datapoints every day, and we aggregate every metric you send us in ten different ways. These data views give you more control over what you see, at the most appropriate resolution. It’s an advanced feature that makes it easier to analyse your data, set up effective alerts and take action right when you need to.

Storage and how we differ from standard Graphite

The default Graphite storage backend, Whisper, is a fixed-size database comprising multiple “archives” of varying resolution – e.g. 10s, 30s, 300s, 3600s.

When a datapoint arrives, it is added to each archive separately: to the finest resolution as-is, and to the coarser resolutions using a configurable aggregation function.

A problem might occur with the finest resolution, where the “last write” wins. So if you send many datapoints for your metric within one 10s period, all but the last are lost. This means, for example, that it’s difficult to send to the same metric from multiple writers, and that writers have to be careful to either “pre-aggregate” data or use something like carbon-aggregator in front of Whisper. 
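To make the last-write-wins problem concrete, here is a minimal sketch of a single 10s archive modelled as a plain dictionary. This is not Whisper’s actual on-disk format or API; `whisper_like_write` and the sample values are hypothetical and exist only to illustrate the behaviour.

```python
def whisper_like_write(archive, timestamp, value, resolution=10):
    """Store value in the slot for its resolution period, overwriting any earlier value."""
    slot = timestamp - (timestamp % resolution)  # start of the 10s period
    archive[slot] = value  # last write wins: earlier values in this slot are dropped
    return slot


archive = {}
# Three datapoints for the same metric arrive within one 10s period.
for ts, value in [(1000, 4), (1003, 7), (1008, 2)]:
    whisper_like_write(archive, ts, value)

print(archive)  # {1000: 2}
```

Only the final value survives, which is why multiple writers to one metric need pre-aggregation in this model.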

StatsD (hosted yourself or using our Hosted StatsD feature) also helps to avoid this last-write-wins behaviour, but it has the unfortunate side effect of creating many more output metrics as it splits out the different views. This explodes your metric count. With the data views feature, you can have better data with fewer metrics.

We built Hosted Graphite from the ground up to be multi-tenant and highly reliable. As such, we needed storage that handled data for many customers at once and which could easily survive partial outages (e.g. a missing machine). Building something like this around Whisper is possible, but it’s relatively difficult to get the clustering parts right. Instead, we settled on Riak as the basis of our data store. This meant that we had an opportunity to improve on Whisper’s last-write-wins behaviour.

Instead of a single datapoint, we store a kind of summary of all data received in a resolution period, which we call a bucket.

How it works

When a user sends a metric observation (or “datapoint”) O to Hosted Graphite, it is added to a set of buckets inside our aggregation service.

 

[Figure: observed datapoints are represented by O]

By default, these buckets represent each of 5s, 30s, 300s, and 3600s resolutions. For dedicated cluster users, these resolutions are configurable.

The aggregation service maintains one bucket of each resolution per metric for a given period. For example, at an N-second resolution, one such bucket could be sent from the aggregation service to storage every N seconds, and a new one started.

A bucket is laid out as follows:

[Figure: bucket layout]
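As a rough illustration of the idea, a bucket can be sketched as a small accumulator that summarises every datapoint it sees instead of keeping only the last one. The class and field names below are assumptions inferred from the data views discussed in this post (observation count, sum, minimum, maximum, average). They are not the actual storage layout, and the percentile samples are left out here.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Hypothetical summary of all datapoints received in one resolution period."""
    start: int          # start of the period (seconds)
    resolution: int     # length of the period (seconds)
    obvs: int = 0       # number of observations
    total: float = 0.0  # sum of observed values
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value):
        """Fold one observation into the summary."""
        self.obvs += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def avg(self):
        """Average of the observations, or None for an empty bucket."""
        return self.total / self.obvs if self.obvs else None


bucket = Bucket(start=1000, resolution=10)
for value in (4, 7, 2):
    bucket.add(value)

print(bucket.obvs, bucket.total, bucket.minimum, bucket.maximum, bucket.avg)
```

Because each field can be updated incrementally, a summary like this needs no pre-aggregation on the sending side.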

Data Views

Storing buckets allows us to show multiple views of the same data, and forms the basis of the Hosted Graphite data views feature. All metrics sent to us have these data views created for them. The feature gives users access to all of this summary data and lets them control which of the fields inside each bucket is extracted and made available to view.

By default you get the average for a particular resolution period.


You can control what you view either using the dropdown in Grafana, or by adding suffixes to the metric expression in other tools. For example, you can look at the number of observations that arrived in a period, the minimum and maximum, or a percentile.

An example application

To demonstrate the use of data views, we’re going to look at a small component of Hosted Graphite’s infrastructure that we’ve instrumented natively using the graphiteudp module.

This component, flapwings (named after the moustache), accepts line-based log data on a /add endpoint. For each request, it emits an “event counter”:

graphiteudp.send("log.add", 1)

the number of lines added:

graphiteudp.send("log.lines", lines_total)

and how long it took to process the request:

graphiteudp.send("log.time", time.time() - start)

Note that it does no batching or aggregation of these datapoints before sending to Hosted Graphite: it just fires them out on the wire. A re-implementation might handle this more elegantly, but for now this gets us lots of useful data with low overhead.
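Put together, the instrumentation might look something like the sketch below. The handler itself is hypothetical: `handle_add`, its `body` argument and the line-splitting logic are assumptions made for illustration, and the `send` callable is injected so the example runs without a network connection. In the real component this role is played by `graphiteudp.send`.

```python
import time


def handle_add(body, send):
    """Process one /add request and emit the three datapoints described above."""
    start = time.time()
    lines = [line for line in body.splitlines() if line]
    # ... store the log lines somewhere (omitted in this sketch) ...
    send("log.add", 1)                     # event counter: one per request
    send("log.lines", len(lines))          # number of lines in this request
    send("log.time", time.time() - start)  # processing time in seconds
    return len(lines)


# Stand-in for graphiteudp.send that records what would go out on the wire.
sent = []
handle_add("first line\nsecond line\n", lambda metric, value: sent.append((metric, value)))

print([metric for metric, _ in sent])  # ['log.add', 'log.lines', 'log.time']
```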

Counting requests

First, let’s look at the log.add metric. The default data view is avg (average), which is useful for lots of metrics but not so good when every datapoint is 1.0:

[Graph: log.add, avg data view]

Instead, we’d like to count the number of times we see a request, using the obvs (observations) data view:

[Graph: log.add, obvs data view]

That’s more like it! However, these are counts of observations per resolution period (here 30s); let’s refine this with a per-second rate instead, using obvsrate:

[Graph: log.add, obvsrate data view]
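The relationship between a per-period count and a per-second rate is simple arithmetic. The sketch below assumes the rate is the per-period value divided by the period length in seconds. The function name and the sample counts are made up for illustration, and this is not necessarily how the service computes the view internally.

```python
def per_second_rate(value_per_period, resolution_seconds):
    """Convert a per-period total (e.g. an observation count) to a per-second rate."""
    return value_per_period / resolution_seconds


# Hypothetical observation counts for four consecutive 30s periods.
counts = [45, 60, 30, 90]
rates = [per_second_rate(count, 30) for count in counts]

print(rates)  # [1.5, 2.0, 1.0, 3.0]
```

A rate stays comparable when the graph switches between resolutions, whereas a raw count grows with the length of the period.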

Counting lines

Next, let’s look at log.lines. Note that we send this metric once per request, so obvsrate looks the same as for log.add:

[Graph: log.lines, obvsrate data view]

However, each request contains multiple lines. Let’s look at the avg default view:

[Graph: log.lines, avg data view]

On average, each request has between 15 and 25 log lines. How about the min and max per resolution period?

[Graph: log.lines, min and max data views]

We can see that some requests have only one or two log lines, while others have 70+. We’d like a total sum though, so we can use sumrate (similarly to how we used obvsrate for counting events):

[Graph: log.lines, sumrate data view]

Tracking latency

Finally, let’s look at log.time. This measures the time for each /add request, in seconds. When looking at request latencies, using avg can be misleading.

[Graph: log.time, avg data view]

The average can “swallow” outlier events. It’s better to look at percentiles. For example, we can get an idea of the median latency using the 50pct data view:

[Graph: log.time, 50pct data view]
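A small worked example shows why the average can mislead. The latencies below are invented for illustration: seven fast requests and one slow outlier.

```python
import statistics

# Hypothetical request latencies in seconds: mostly fast, one slow outlier.
latencies = [0.02, 0.03, 0.02, 0.04, 0.03, 0.02, 0.03, 2.5]

mean_latency = statistics.mean(latencies)      # about 0.34s
median_latency = statistics.median(latencies)  # 0.03s
worst_latency = max(latencies)                 # 2.5s

print(mean_latency, median_latency, worst_latency)
```

The mean of roughly 0.34s describes neither the typical request (0.03s) nor the outlier (2.5s), while the median and the maximum together tell us both.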

Percentile data views are calculated from the bucket’s samples. This is a set of observed datapoints, reservoir-sampled such that any datapoint has an equal chance of being retained. We keep 10 samples at 5s resolution, 40 at 30s and 300s resolutions, and 100 at 3600s resolution.

This is not as useful or as accurate as storing a full histogram or something like a t-digest, but it requires little storage.
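For readers unfamiliar with reservoir sampling, here is a minimal sketch using the classic “Algorithm R”, followed by a nearest-rank percentile over the retained samples. We are not claiming this is the exact algorithm or percentile definition the service uses; the function names and the synthetic latency data are assumptions for illustration.

```python
import math
import random


def reservoir_sample(stream, k, rng=random):
    """Keep k items from stream so that every item is equally likely to be retained."""
    samples = []
    for i, value in enumerate(stream):
        if i < k:
            samples.append(value)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:              # happens with probability k / (i + 1)
                samples[j] = value
    return samples


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


rng = random.Random(42)
# Synthetic latencies for one 30s period, far more than we want to store.
latencies = [rng.expovariate(20) for _ in range(1000)]
samples = reservoir_sample(latencies, k=40, rng=rng)

print(len(samples), percentile(samples, 50), percentile(samples, 95))
```

With only 40 samples, high percentiles such as the 95th rest on just a couple of datapoints, which is the accuracy cost mentioned above.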

As long as we keep the sampling in mind, we find it’s a reasonable trade-off for simple metrics. Presenting the sampled median and tail latencies on a single log-scale graph gives us a useful idea of the latency distribution:

[Graph: log.time, sampled median and tail latencies on a log scale]

Conclusion

Our advanced data views feature gives you more control over what you see and lets you easily view your metrics at the most appropriate resolution. Aside from those highlighted here, there are several other ways you can use these views when exploring and troubleshooting your metrics. Read more about advanced data views and their uses in our Docs.