In part one of this series, I talked about my early weeks as an SRE at Hosted Graphite. After jumping into on-call, getting to grips with our architecture and getting acquainted with five years’ worth of tasks, I was almost ready to call myself a fully fledged member of SRE. Little did I know, my onboarding wasn’t quite finished yet…
Building A Cluster
You gotta learn how to make the plane before you get to jump out of it.
Our overall infrastructure is divided into different collections of our main services, called clusters. Generally, we build dedicated clusters for users that need an enterprise-y level of isolation, because they’re heavy-hitters in terms of usage. This keeps them from tanking our production services and, conversely, guarantees that if our production service is hit by high usage, it won’t affect their cluster’s availability.
The biggest portion of our onboarding plan is that every new SRE hire has to build their own cluster. Essentially, I was given a week to duplicate our production environment on a smaller scale. Oh boy.
Thankfully, we had fairly detailed documentation about the entire process. The bad news, however, was that we innovate pretty frequently, so it was already outdated after only two months. In kind of an unplanned way, this allowed for a whole level of investigation, exploration, testing and documenting that wouldn’t be there if we had a silver-platter build-out process.
The entire build-out gave me a very detailed view of how our services interact, what each service depends on and how portions of our config management can be used to swap primary and secondary services.
Once it was built, I was heavily encouraged to break it in as many ways as I could think of. I built a multi-threaded program to spam my pipeline with metrics and datapoints until portions of it collapsed or overflowed or caught fire. I played with yanking portions of the architecture out mid-flow and witnessed how they recovered and failed over. I purposefully corrupted data mid-process, just to see what happened.
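For a flavour of what that kind of load generator can look like, here is a minimal sketch. It is not the program I actually ran: it assumes the cluster accepts the standard Graphite plaintext protocol (one "path value timestamp" line per datapoint), and the "loadtest" prefix, the worker naming and the injectable send function are all made up for illustration.

```python
import random
import socket
import threading
import time


def format_line(prefix, name, value, timestamp):
    """Build one Graphite plaintext line: 'path value timestamp'."""
    return f"{prefix}.{name} {value} {int(timestamp)}\n"


def worker(worker_id, count, send, prefix="loadtest"):
    """Push `count` random datapoints through the `send` callable."""
    for i in range(count):
        # Cycle through 100 metric names per worker (an arbitrary choice).
        name = f"worker{worker_id}.metric{i % 100}"
        line = format_line(prefix, name, random.random(), time.time())
        send(line.encode("ascii"))


def run_load(threads, per_thread, send):
    """Start `threads` workers and block until they have all finished."""
    pool = [
        threading.Thread(target=worker, args=(n, per_thread, send))
        for n in range(threads)
    ]
    for t in pool:
        t.start()
    for t in pool:
        t.join()


def tcp_sender(host, port):
    """Return a send function that shares one TCP connection across threads."""
    sock = socket.create_connection((host, port))
    lock = threading.Lock()

    def send(payload):
        # Serialise writes so lines from different threads never interleave.
        with lock:
            sock.sendall(payload)

    return send
```

Pointing it at a test cluster would be something like run_load(8, 100000, tcp_sender("my-test-cluster.example", 2003)), where the hostname is a placeholder and 2003 is the conventional Graphite plaintext port. Keeping the sender separate from the workers means the same code can be dry-run against a plain list before it goes anywhere near a real pipeline.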
After a couple of days of kicking my cluster, a colleague noticed I’d missed an (undocumented) step necessary for database failover to work properly. If everything had been set up correctly, failover of our database was simply a matter of opening Slack and telling Glitter who we wanted to take over instead. Without it, the process was a little more… involved.
At 7 PM, after work, he pointed it out in the Slack channel and then promptly turned off my master DB server. This was a great opportunity to treat it like a real incident.
(Worth noting: at no point was I forced to fix this outside of business hours or my on-call shift; this was totally a voluntary thing!)
I tried rebooting my master. That same colleague made sure it stayed off. I had to work through our failure response and come up with a solution: restoring backups and doing a manual DB failover to the slave. I had missed a vital setup step, and these were the consequences of that. You can bet your sweet ass I documented what I missed the morning after.
From Zero To Impact
Flying planes and doing jumps just for fun.
I had spent a couple of days on a new feature for our AWS polling service, and on a Wednesday it was deployed to all dedicated clusters. The clusters were reporting healthy with the change and it would take a lot of pressure off our authentication and rate-limit layer, so I was looking forward to pushing this change to production.
I checked up on the service after an hour or two on the dedicated clusters. Everything looked good, no disruptions. I used our ChatOps to push the change to production.
The service spikes initially as it’s restarted and back-fills data it might have missed during the restart time, so a queue-climb was expected. It climbed to a minute of data. And then five minutes. And then ten. Oh sh*t.
Long story short, I caused this incident. The problem was that our machines for that service are under constant heavy load: there’s some headroom, but not enough for what I added. In fact, what I added had pushed the machines way beyond their limits. I’ve never seen a machine kill so many child processes in the space of a minute before.
I rolled back the deploy and laid out a basic testing environment. Something about those Out Of Memory notifications worried me. Sure, we’d added overhead but it’s just a link to a tool we use all over our infrastructure. There was no way it had, say, a memory leak? Right?
I took a copy of the data during the incident, put my changes on a staging server and set up some tests. I sampled 1 million datapoints from the data and used that as my representative sample. I ran my test scenario with my vanilla changes and noticed some worrying trends: it could only handle ~200 datapoints per second and, worse still, it consumed 15GB of RAM to do so. We process thousands of datapoints a second through this service, so this was no bueno.
I could talk for days about the process I went through but that’s a blog post for another day. I did some timing tests and some memory profiling until, eventually, I had changed only 5 lines of code but had improved throughput to ~9,000 datapoints per second and it only consumed 2GB of RAM at most.
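If you want to try this kind of measurement yourself, a minimal harness might look like the sketch below. It is an illustration rather than the tooling I used: the profile function and its arguments are hypothetical names, and it relies on Python’s standard time.perf_counter and tracemalloc modules. Note that tracemalloc only sees allocations made through Python, so its peak will not match the RAM figure the operating system reports for a process.

```python
import time
import tracemalloc


def profile(process, datapoints):
    """Run `process` over `datapoints`; return (datapoints per second, peak bytes)."""
    tracemalloc.start()
    started = time.perf_counter()
    count = 0
    for point in datapoints:
        process(point)
        count += 1
    elapsed = time.perf_counter() - started
    # get_traced_memory() returns (current, peak) in bytes since start().
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    rate = count / elapsed if elapsed > 0 else float("inf")
    return rate, peak
```

Running the same sample through the old and new versions of a function, for example profile(old_handler, sample) and then profile(new_handler, sample) with your own handlers, gives two (rate, peak) pairs that can be compared directly, which is the before-and-after comparison described above.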
Although my original AWS polling change is still awaiting deploy (for other feature reasons), I pushed my improvements to that tool we use everywhere.
My onboarding (bar many months of on-call rotation) is over and it culminated in me making a change to a vital component of our pipeline that increased throughput and decreased footprint tremendously.
It took 105 minutes for the tool to process my 1 million datapoint sample. When I was done, it did it in less than 6 minutes.
Having spent the last two months training to be the kind of paratrooper I see in the other SREs, I can wholeheartedly endorse Hosted Graphite’s onboarding process. It’s done everything to give me confidence in my changes and the humility to lean on our team when things go wrong.
I’ve learned so much and feel like I’ve contributed so much more in such a short space of time. It feels ethereal that I could have become an active member of the team so quickly, having come from a radically different background. My work is a testament to the team I work with, the colleagues who support me and how well the process works.
The onboarding process has given me the confidence to go skydiving. It might be time to start looking into base-jumping.