At Hosted Graphite, we receive a lot of metric data from our customers. They rely on us to ingest the data, store it, and make it available to query, so they can view it on their dashboards right in-app.
Over the years, we've run Grafana in several different ways. What started as duct-tape reactions to early Grafana updates has become a mature infrastructure that supports hundreds of customers sending us millions of metrics per second.
Today, each of our users gets an independent Grafana server running in its own Docker container. This post describes the long road to that setup, and the many bumps along the way to getting it right.
It was a long, challenging road to this containerized world, and the story of Grafana at Hosted Graphite is a parallel story of my progress from Junior Developer to Technical Lead of the Development team, and lessons learned along the way.
Oh, and this “Per User Grafana” project was (obviously) named “PUG”, so you can expect some cute dog photos as well.
I'll begin with a description of how we introduced Grafana as a backend, by showing outlines and simple descriptions of how Grafana can be run on a small scale with Docker, and continue by describing how we evolved into the full-scale operation we use today. Along the way, I’ll describe the tooling we use to orchestrate everything and the monitoring we’ve found most useful to employ.
Multiple Grafana Instances
Running multiple Grafana instances is a huge challenge, but in many use cases it's a necessity. If setting up your own set of Grafana instances takes too much time, consider getting Hosted Graphite to do it for you! Here at Hosted Graphite, we host Grafana dashboards so our customers don't have to, and we also handle the data ingestion, storage, alerting and maintenance. Use as many instances as you like, and visualize your data on pre-built dashboards.
If you're interested in using Hosted Grafana, sign up for the Hosted Graphite free trial here. Also, reach out to us at the Hosted Graphite team and we'll jump on a video call to talk about monitoring solutions that work for you! Happy monitoring!
The Grafana Journey at Hosted Graphite
Grafana, before v2, was very simple. It had no back-end server or integrated database, so we ran it for our users and saved the JSON for a dashboard into our regular user database. The next version of Grafana would make some pretty big changes, which cemented their plans for the future and led to them being as popular as they are today.
In 2015, we began work to upgrade to "Grafana 2". This marked a big change for Grafana, introducing a Golang-based server and an integrated database. To continue hosting Grafana for our users, we needed to learn and adapt to this new setup.
As Hosted Graphite was a smaller company at the time, the task of upgrading to Grafana 2 became my responsibility. I was in my 3rd or 4th month of being a Junior Developer at Hosted Graphite, and the decisions I made on that project had a longer-lasting impact than I anticipated.
I wrote almost every line of code that was part of this upgrade, beginning with the first PR, and had PRs for the next 2 months consisting almost entirely of Grafana-based bug fixes. This was a large project with a pretty high impact, and I learned a great deal in the 2 or 3 months it took to complete that first upgrade.
There were several goals with this first integration, but the overall theme was consistency for our users. Things that worked before Grafana 2 would need to work after. For example, a pattern of "https://www.hostedgraphite.com/unique-id/grafana/" was going to need to exist within Grafana 2, and the users' dashboards would have to continue to exist at known URLs.
It’s worth noting that at the time, Grafana didn’t have nearly the same number of features and options for configuration that it has today, so I worked with what we had and ploughed ahead with what turned out to be some rather duct-tape solutions.
Initially, this was ok. The new Grafana did work, things were kept consistent for our users, and the upgrade went pretty smoothly (only a very small percentage of dashboards went temporarily missing, and all was fine after a few quick fixes).
We made a lot of small changes to the Grafana app: including changing which URLs were requested, changing the format of the URL, swapping icons, hiding features we couldn’t yet support, and even removing large sections of code to get rid of some random behaviours we thought could cause trouble.
For a long time, we changed which HTTP method was used for all Graphite renders (Grafana defaults to using a POST; we chose to make it a GET). These small changes (duct tape!) would, as time went on, set the precedent for more and more duct tape.
By the time we deployed “Grafana 2”, I was happy with things. Our Grafana set-up was a bit unique and a little rough in places, but could essentially be summarised as looking like this:
- All of our users were using a single shared instance of Grafana, and talking to a single Grafana database.
- HG Users and teams were kept separate from each other using Grafana's concept of "Organisations". The logic was maintained by passing all requests through our web app.
- We handled all of Grafana’s session tokens, cookies, and authorizations within our web app.
- We proxied API endpoints for Grafana through our own API endpoints in our web app.
A month or two after the first set of Grafana 2 deploys, I graduated from Jr. Developer to full-time developer. The bosses were happy with things so far.
However, as Grafana kept churning out new releases, we started to fall behind. The many little hacks we had made to make the project work became very difficult to maintain. We were a small team and had plenty of other projects to work on. With more and more Grafana upgrades coming out, it felt like we were constantly chasing our tails.
Our Development team changed over time, and now that I was no longer a Jr. Developer, I could delegate Grafana upgrades to newer members of the team. Completing a Grafana upgrade was almost a rite of passage when someone joined the development team. It was time-consuming, boring and quite confusing at times. After completing one, however, you came away with a very clear picture of the intricacies of how our various web apps hooked together. While everyone was delighted when it was done, nobody wanted to do another one. This was not ideal when you consider that this was a key part of our infrastructure.
It became clear that this Grafana situation was not ideal and, besides the dread of having to work on another Grafana version upgrade, could be broken up into several key issues.
- Grafana wanted to run at a “root URL”:
Probably the biggest issue (or at least the one with the biggest impact on development work for us) was that we couldn't make use of a unique per-user "root URL" for Grafana (in the examples below that will be "http://0.0.0.0:3000/"). This led to us making a boatload of useless changes to Grafana just to support our single Grafana instance serving requests such as https://www.hostedgraphite.com/myid789/grafana/.
- Using “Organisations” to separate users wasn’t very comfortable:
As time went on, we became increasingly concerned about how users were split by “organisation”. It led to a lot of code changes in our web app dealing with users’ permission levels, and to some unintuitive mappings between users in our web app and users in Grafana. In this world, a small bug would have been enough for Customer A to suddenly see all of Customer B’s dashboards and graphs. Thankfully, this never happened. Our unit, integration, and manual tests proved savvy enough to catch these problems, although the manual tests were very time-consuming.
One mildly interesting quirk was that at some point in time (it’s different now), Grafana saved the default light vs dark theme as a per-instance setting. This was not ideal when we had hundreds of customers, all with different preferences, all living blissfully unaware of each other in the same Grafana instance. As a result, we introduced yet another hack: some JS which would load different CSS files based on URL parameters.
- Grafana introduced installable plugins (which are great), but with all the duct-tape solutions we’d implemented, it was almost impossible to swiftly add them in. It took us a few weeks to add one which should have been a 2-minute “install and restart”.
- Feature clashes:
Grafana supplied some things we already had in our stack and some newer features we didn’t yet have the capacity to support - Alerting and Datasources being two good examples (these are now part of our mature product).
- Our package build and test configuration began to differ from Grafana’s over time. We also used CircleCI, but the initial configs were quite different, and when we tried to upgrade, the test environment would have incompatible package dependencies, leading to tests that were working in Grafana’s project suddenly failing in ours. There were even a scary few days in the early stages of the “Grafana 2” deployment when the only place we could successfully build the package was my own laptop.
Just to be clear, these issues weren’t Grafana’s fault -- they all sprang from different design decisions we made over 2.5 years as we gradually moved toward the mature solution we have today.
Enough is enough
After a while, we got fed up with this and knew we wanted a different set-up. Here were our goals:
- Isolate our users into separate Grafana instances. This gives us:
- Better overall security.
- Individual fail case tolerance (one instance going down impacts 1 user, not all users).
- More customizability: each user’s Grafana can be loaded with its own config file.
- Make it possible to upgrade our Grafana versions quickly:
- We wanted to get rid of as many of the duct-tape solutions as possible.
- Provide a way to test that features work - no more searching files and replacing seemingly random lines of code.
- Support newer Grafana features.
- Easily install plugins to existing Grafana instances, aiming for a scale of days rather than weeks.
Docker seemed like the obvious solution for our first goal. Docker provided a simple way of running one Grafana per user, thus isolating each user from all of the others.
For a super quick proof of concept, I searched Grafana documentation, found a way to run a simple Grafana server (described below), and then reworked sections of our web app to make it forward any Grafana requests for a particular user to the container. Within a morning's work, I had a clear idea that this was a viable solution, though the path to release would be a long one.
Running Grafana in Docker
This should work for you, too, and serve as a nice break from our early Grafana woes.
It’s easy to recreate the test I did as a proof of concept. While the version of Grafana may differ, the result would essentially be the same. Of course, you probably don’t have another web app managing users, logins, or teams that you need to work around.
You'll need Docker installed; https://docs.docker.com/install/ should make that pretty easy no matter what operating system you're on.
Grafana provides some good documentation on running a single Grafana server with what I call "stock settings". Run this command:
$ docker run -d -p 3000:3000 grafana/grafana
Flags explained:
"-d" means that as the command is executed, we tell Docker to start the container in the background and just print back the container ID.
"-p 3000:3000" tells Docker that there's a port being used in the container at "3000", and we want to publish that to the host machine (where you typed in the "$ docker run"
command).
Once the command runs successfully:
Head over to http://0.0.0.0:3000/ in your browser and log in with "admin" as both username & password (this is the default for Grafana and you will be prompted to change it).
For the PUG project’s proof of concept, I then pointed our web app to talk to 0.0.0.0:3000 for all Grafana requests, and most things loaded just fine! However, we didn’t yet have access to any metric data.
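To give a flavour of what that looked like, here's a minimal sketch of that kind of request forwarding, using Flask and the requests library. The route shape, the GRAFANA_URL constant, and the grafana_proxy function are illustrative assumptions - our real web app handles users, sessions, and authorization quite differently.

import requests
from flask import Flask, Response, request

app = Flask(__name__)
GRAFANA_URL = "http://0.0.0.0:3000"  # the single test container started above

@app.route("/<user_id>/grafana/<path:subpath>", methods=["GET", "POST", "PUT", "DELETE"])
def grafana_proxy(user_id, subpath):
    # In production this would look up the user's own Grafana instance;
    # for the proof of concept, every user maps to the one test container.
    upstream = requests.request(
        method=request.method,
        url=f"{GRAFANA_URL}/{subpath}",
        params=request.args,
        data=request.get_data(),
        headers={k: v for k, v in request.headers if k.lower() != "host"},
    )
    return Response(upstream.content, status=upstream.status_code,
                    content_type=upstream.headers.get("Content-Type"))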
Hosted Grafana on Hosted Graphite in 2020
We've come such a long way from our early Grafana installation at Hosted Graphite. Today, you can see a lot of our developments related to Hosted Grafana on our MetricFire blog! MetricFire is a new brand that encompasses Hosted Grafana, Hosted Graphite and Hosted Prometheus. Check out these guides on monitoring your own infrastructure with Graphite and Grafana, deploying Grafana to Kubernetes, and our Azure Integration with Graphite and Grafana.
We're now running Grafana alerting, multiple Grafana data sources, and all kinds of Grafana panels and integrations!
You can sign up for our Hosted Graphite free trial here, check out the Hosted Grafana dashboards, and start monitoring your metrics right away!
How about graphing some data
So, I had a new Grafana in Docker, and I had my web app talking to the Grafana instance. Now I needed to configure a Grafana “data source,” which would allow me to graph the data we care about.
Adding a data source via the UI is fairly easy in Grafana. If you ran the commands above, you can follow these steps to graph the metric data you have in Hosted Graphite:
- Go to: http://0.0.0.0:3000/datasources
- Click "add data source"
- Choose the "Graphite" option
- Browse to https://www.hostedgraphite.com/app/sharing/ (log in if prompted)
- Copy the "Graphite" key from this page. This URL should look something like: https://www.hostedgraphite.com/<ID>/<TOKEN>/graphite/
- Choose "Browser" under the dropdown for "Access Type" (both work in the latest Grafana versions, but that was not always the case).
- Click "Save & Test"
Now that we had a working Grafana with a working Graphite data source, I concluded that we could scale, automate and configure this setup to run Per User Grafana at Hosted Graphite.
Docker, at scale, for many users
We knew we could do this for one user. Now, we needed to determine how to do it for a multi-tenant setup. Put simply, we wanted to run a Grafana server, in Docker, for each user.
Knowing it was possible to hook the two together and graph data, the first question we asked was "How reliable will that be?". One Grafana server in one container didn't sound ideal, as we hold our services to very high standards of reliability. Docker Swarm (https://docs.docker.com/engine/swarm/) was an obvious first place to start looking for a better solution.
A Swarm of PUGs
Using Docker Swarm, we could run a Docker "service" for each user. A service consists of several Grafana containers (the service calls these “replicas”), each running an identical Grafana server. Ideally, the containers would be spread across several different machines, or nodes. Note: we opted to use an external MySQL database for user sessions and Grafana data.
It seemed simple to configure each user to have a Grafana container running on a unique port, spread across a Docker Swarm, and we added some changes to our web app to route Grafana requests to the Docker Swarm rather than the old-school "shared Grafana on the server".
A simplistic view of the path taken by a request now looked something like this:
frontend -> webapp -> Docker-swarm -> Docker-container -> Grafana-server -> MySQL DB (not in Docker)
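To make that concrete, here's roughly what creating one of those per-user services looks like using the docker-py Python client (more on how we drive this later). This is a simplified sketch: the image name, port numbers, database host, and the create_grafana_service helper are all illustrative, and in production we deploy our own Grafana image rather than the stock grafana/grafana one.

import docker
from docker.types import EndpointSpec, ServiceMode

client = docker.from_env()

def create_grafana_service(user_id, host_port, image_tag, db_user, db_password):
    # One swarm service per user, with three identical Grafana replicas spread
    # across the nodes. The user's unique published port is what our web app
    # routes that user's Grafana requests to.
    return client.services.create(
        image=f"grafana/grafana:{image_tag}",
        name=f"grafana-{user_id}",
        mode=ServiceMode("replicated", replicas=3),
        endpoint_spec=EndpointSpec(ports={host_port: 3000}),
        env=[
            "GF_DATABASE_TYPE=mysql",
            "GF_DATABASE_HOST=mysql.internal:3306",  # the external MySQL database
            f"GF_DATABASE_USER={db_user}",
            f"GF_DATABASE_PASSWORD={db_password}",
        ],
    )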
As you might expect, we created a test environment that we felt would represent a scaled-down version of what our production swarm might look like.
Quickly, we began to see the swarm fail. One issue was found when we were testing how quickly we could get some services up and running after some kind of "total failure". We saw a lot of RAM usage and it bogged down the system - https://github.com/moby/moby/issues/36264. Eventually, this was deemed to either be a load issue or one that disappeared if we ran the swarm with more nodes (maybe they were the same thing). Either way, we mitigated the problem with more nodes. Perhaps we were expecting a bit more from the swarm than it could handle.
It took time and some tweaking, but we managed to scale the test environment and run Grafana for all our users (more on how we orchestrated that later). The PUGs were running!
The test environment had everything we needed to get our SRE team involved (to be fair, they had already helped with most steps along the way). Before taking the next steps toward production, we consulted with the SRE team for clear direction on how to monitor the swarm.
As we ran more tests over a couple of weeks, we developed ideas of what metrics to look at to determine swarm health. We built tools to gather these metrics and began building dashboards to monitor them. This is a typical process at Hosted Graphite, as dashboards are considered part of any new project and are developed over time with the project.
As we became more confident in the swarm, we moved it into production. We spun up a service per user with 3 replicas (so 3 Grafana servers for each user), initially running silently alongside everything else, but not serving any requests. Once the SRE team cleared everything, we began a migration of users from the single-shared Grafana to each having their own unique instance of Grafana.
This was the first time we ran Docker in production at Hosted Graphite and we did quite a lot of testing and research. Now it’s just another tool in our belt for when we're designing new services.
From a user's point of view, this changeover was only cosmetic. They got a slightly newer Grafana version, which came as part of getting our (still hacked) version of Grafana built into a Docker image. The big changes were on the development team's side. We had discovered a whole new way of working with Grafana that provided a better experience for our users. In some ways, the work was only beginning.
How we orchestrate the Docker swarm and its PUGs
We had a pre-existing service at Hosted Graphite which used Python RQ to schedule jobs based on user data pulled from our web app. The shape of this service fit the bill for what we needed on our Docker swarm.
As a result, we extended the existing project to include new jobs for adding, removing, and upgrading Docker services. We're running a Grafana service (services contain multiple containers) per user, and we store any information about how we want that service to run in our user database. However, the actual amount of information needed to spin up the Grafana service is minimal:
- Port number to run the Docker service on
- A Grafana Docker image tag (this is how we handle new versions)
- Username and password (to talk to the DB)
- A few other trivial pieces of information that aren’t relevant here
We already had a secure, internal API for pulling some user info, so it was easy to add a new endpoint to pull back a list of per-user Grafana configs, pop each one into a job with RQ, and use something like https://github.com/docker/docker-py to tell the swarm what to do based on the information passed in. If an endpoint wasn't available through the client, we made direct requests to the Docker API.
We added a second job to RQ to remove a service once a user was deemed "not active" anymore.
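As a rough sketch of how the scheduling side hangs together, the snippet below enqueues those jobs with RQ. The internal API URL, the queue name, the config fields, and the dotted job paths (pug.jobs.create_grafana_service and pug.jobs.remove_grafana_service) are illustrative stand-ins, not our actual service's code.

import requests
from redis import Redis
from rq import Queue

queue = Queue("pug", connection=Redis())

def schedule_grafana_jobs():
    # Pull the list of per-user Grafana configs from the internal web app API
    # and turn each one into a job. RQ workers pick these up and talk to the
    # swarm via docker-py, much like the service-creation sketch earlier.
    configs = requests.get("https://internal.example/api/grafana-configs").json()
    for cfg in configs:
        if cfg["active"]:
            queue.enqueue("pug.jobs.create_grafana_service",
                          cfg["user_id"], cfg["port"], cfg["image_tag"],
                          cfg["db_user"], cfg["db_password"])
        else:
            queue.enqueue("pug.jobs.remove_grafana_service", cfg["user_id"])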
A slight issue we had was that (at the time) RQ automatically logged all the parameters we were passing to each job. This wasn't ideal, as it included some sensitive information, so I made a little addition to the RQ project to solve that: https://github.com/rq/rq/pull/991. We can now log the job status without worrying about logging sensitive user information.
This system has worked well for us for about a year now. As time has gone on, we've improved our monitoring, our SRE team has runbooks for recovering from various failures (thankfully these haven't been needed very often), and we've even added some Grafana-specific deploy commands to our ChatOps Slack deploy-bot.
As a bonus, this section produced a depiction of a shepherd minding a flock of Docker containers (something we’re all very proud of):
Pug photo by Matthew Henry, background by Shoot N' Design via Unsplash
Monitoring the swarm
For a better explanation of the following terms, have a look here: https://docs.docker.com/engine/swarm/key-concepts/.
Each node in our swarm reports:
- Ingress peer count
- Leader status
- Manager status
- Count of containers
All of this information is available through the Docker API. We’ve used https://github.com/docker/docker-py at times or made direct requests to the Docker API, usually having to do some snooping to figure out the exact endpoints.
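A simplified sketch of what that collection can look like with docker-py is below, run against the local Docker socket on each node. The metric names are illustrative, and the ingress peer count is one of the values we had to dig out of the raw API rather than the client, so it's omitted here.

import docker

client = docker.from_env()

def swarm_health():
    # Basic node and swarm state, pulled from the local Docker daemon.
    info = client.info()
    node_id = info["Swarm"]["NodeID"]
    nodes = client.nodes.list()  # listing nodes only works on a manager
    me = next(n for n in nodes if n.id == node_id)
    return {
        "containers.running": info["ContainersRunning"],
        "swarm.nodes": len(nodes),
        "swarm.is_manager": int(me.attrs["Spec"]["Role"] == "manager"),
        "swarm.is_leader": int(me.attrs.get("ManagerStatus", {}).get("Leader", False)),
    }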
As with our own services, we send this information through our monitoring stack and then use Grafana to visualise it, giving us a nice overview of what sort of state our swarm is in. Sticking this info beside some standard machine metrics (which we collect using Diamond) puts us in a great position to monitor the swarm.
Our Grafana service gets monitored differently. We treat it as part of our “webapp” stack, focusing on the user experience, and checking status codes and response times to gauge user impact.
What about the rest?
Once we had Grafana pushed into our Docker swarm, the rest of our goals opened up and became much more approachable. This “phase” of work happened in the first half of 2019 and, as I had taken on the newer role of Tech Lead, it was the phase for which I wrote the least code. Instead, I defined each of the individual tasks and added them to our team’s sprints over several weeks, while liaising with our team manager; I was busy writing code elsewhere, and was heavily involved in code reviews.
The project was a success, everything got deployed, and the teams were great at reporting back or highlighting newer tasks that emerged as the project progressed. We had our “hack-free” version of Grafana deployed and running. Another great outcome of this phase was spreading knowledge around the team with regards to how our Grafana was integrated with our web app.
Duct-tape free since 2k19
By simply deploying a clone of the official Grafana image to our swarm, taking advantage of the many configuration options it provides, and then making a handful of small changes to our web app, we got a version of Grafana “without all the hacks” up and running in production with just a few weeks of development work.
Installing plugins
We’ve baked different plugins into the Docker image we provide for users. In most cases, it takes only a day or two to test and deploy a new one and involves almost zero code changes.
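We bake plugins into our own image, but if you're running the stock grafana/grafana image, the official GF_INSTALL_PLUGINS environment variable gives a similar "install and restart" experience. Here's a quick sketch with docker-py; the piechart panel is just an example plugin.

import docker

client = docker.from_env()

# Start a Grafana container that installs the piechart panel plugin at startup.
container = client.containers.run(
    "grafana/grafana",
    detach=True,
    ports={"3000/tcp": 3000},
    environment={"GF_INSTALL_PLUGINS": "grafana-piechart-panel"},
)
print(container.id)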
Newer features
We allow users to configure external data sources in their hosted Grafana, so you can graph data from many other sources alongside your Hosted Graphite metrics.
Upgrading to newer Grafana versions
Grafana upgrades are more or less stress-free for our developers. Within a month of getting everyone onto the “hack free” version of Grafana (initially 5.0.3), we upgraded to Grafana 6.0 and then straight to 6.1.
A nice coincidence occurred as I was writing this post. I got a message from one of the team, where the conversation started with a link to a pull request for a Grafana upgrade to version 6.2:
I’d call this a roaring success, though maybe we can improve that build time.
Where are we now?
Moving forward, we can focus on new features rather than the hours of work required to maintain what we have. During this process, we wrote a lot of internal documentation that makes the project(s) clear to newer hires, and upgrading our Grafana version is now an easy process.
I learnt an awful lot over the years with this project, from the first deployment to the current (much more frequent) ones. I ended up involved in most of them in some way, whether that was planning and task specification, running proofs of concept, or digging deep into the code. The journey Grafana at Hosted Graphite has taken us on has taught us a lot that we can apply to future projects.
So you’ve read this far and only seen one pug? Have another:
Conclusion
Running multiple instances of Grafana is a big challenge - but entirely worthwhile. If you're not interested in running multiple instances of Grafana on your own, let us do it for you! Hosted Graphite is here to host Grafana for you, so you can focus on what matters. Get your free trial of Hosted Graphite here and start monitoring now! You can also reach out to us directly and book a demo with our team.
Happy monitoring!