Continuous deployment at Etsy

A QCON Talk

Posted by hossg on March 08, 2014 · 9 mins read

So I went to QCon London this week and attended some really fascinating talks, and I figured I’d try and write up some short notes and thoughts on what I learned there. This is going to come in bite-sized chunks rather than as a single post, so expect a drip feed of articles over the coming days. Urgh – I really mixed up some metaphors there.

The first session I went to – aside from the brilliant, wide-ranging opening keynote from Damian Conway (http://qconlondon.com/london-2014/speaker/Damian+Conway) – was by Daniel Schauenberg of Etsy, describing their continuous deployment approach.

Given that Etsy perform more than 50 production deployments per day, I think there are plenty of lessons to be learned from them. I work in a corporate environment, and currently in a much smaller team than the Etsy group, so it’s not obvious at first glance that we would need to release 50 times per day.

There’s some wrong thinking in that perspective, however, which I think is summed up by the first question Daniel asked:

“How comfortable are you deploying a change right now?”

I think that’s a key challenge to anyone trying to operate in an Agile world. To me – even in my corporate IT environment – being able to release right now is an invaluable capability. It allows real flexibility of prioritisation. If we want to change direction today, develop something new, reap the benefits of the 80% of a project we’ve completed, we can make that decision. Right now.

Coming from the corporate IT world, anything that makes prioritisation and re-prioritisation easy is a great thing. Prioritisation is essentially waste (in the Lean sense). It’s not producing anything, it’s not earning money – it occupies huge amounts of time, argument, attention, compromise, etc. If you can simplify the process, that has to be a good thing. Taking out the “fixed cost” of deployment is a big part of reducing the overall cost of development, and it makes it easy to “de-prioritise” items until the next sprint or release, because the delay to de-prioritised features need not be long. And prioritisation is all about de-prioritisation!

But how does Etsy get comfortable releasing code “now”? By using config flags that tell the system whether or not to execute a piece of code in production, which allows the release (now!) of incomplete features. It also allows easy A/B testing, and enables roll-out of new features in a managed fashion (e.g. migration of users). There’s a nice article about this approach here: http://www.agileconnection.com/article/configuration-flags-love-story?page=0%2C0
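To make the config flag idea concrete for myself, here is a minimal Python sketch. It is not Etsy's implementation: the flag names, the `FLAGS` structure, and the hash-based percentage rollout are all my own assumptions, just to show how one check can keep unfinished code dark, ramp a feature up gradually, or split users for an A/B test.

```python
import hashlib

# Hypothetical flag definitions; names and structure are illustrative only.
FLAGS = {
    "new_checkout": {"enabled": True, "rollout_percent": 10},
    "search_v2": {"enabled": False, "rollout_percent": 100},
}


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, flags: dict = FLAGS) -> bool:
    """Return True if the flagged code path should run for this user."""
    flag = flags.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False  # unknown or switched-off flags keep the code dark
    return bucket(flag_name, user_id) < flag["rollout_percent"]


def checkout(user_id: str) -> str:
    """Deployed code contains both paths; the flag picks one at runtime."""
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "old checkout flow"
```

Because the bucket is derived from the flag name and the user id, a given user sees the same variant on every request, and raising `rollout_percent` only ever adds users to the new path.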

Daniel then moved on to talk about some of the cultural ways in which they embed the “release now” culture. The kicker for me was this: if it’s your first day at Etsy, you do a release – Day 1! It’s a simple change – uploading your own photo to the site – but it takes the new dev through the release process, and since that lies at the heart of how Etsy works, you learn it first. I think that’s a really nice model for cultural development: “we believe this [continuous deploy] is so important, it’s going to be the first thing you do”.

Next up Daniel explained that every developer gets a dev environment that matches the full production stack. It’s a virtual environment served up by Chef. This makes it easy for the dev to be confident that their change – which could be released imminently – is going to be a fit for the prod config. Although we make full use of VMware virtual machines in our office, we hand-craft configs each time we need them (as well as using shared dev environments running on server builds). With Daniel’s example fresh in my mind, I wonder whether using something like Chef or Puppet could move us to a consistent dev environment model, which is surely a prerequisite for confidence in the “release now” mentality.

I also liked the idea of the Try command that each developer has – it automates a diff between your working copy and trunk, and submits the patch to the Continuous Integration service (in Etsy’s case, this is Jenkins) so that the CI service can run your code in the build environment and make sure it doesn’t break the build. What a fantastic idea! Any time at all in a team-development environment will tell you that “breaking the build” is the worst crime a dev can commit. Many moons ago we used to send any such offender to Starbucks to fetch Frappuccinos for the whole team. Any build-breaking bug was called a “Frap Error” and the culprit was widely mocked.
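I haven’t seen how Etsy’s Try command works internally, so the following is only a rough Python sketch of the idea as described in the talk: build a patch from the difference between trunk and your working copy, then package it up for a CI job. The function names and the shape of the request are hypothetical, and nothing here actually talks to Jenkins.

```python
import difflib


def make_patch(trunk_text: str, working_text: str, path: str) -> str:
    """Return a unified diff that turns the trunk version of a file into the working copy."""
    diff = difflib.unified_diff(
        trunk_text.splitlines(keepends=True),
        working_text.splitlines(keepends=True),
        fromfile=f"trunk/{path}",
        tofile=f"working/{path}",
    )
    return "".join(diff)


def build_try_request(patch: str, developer: str) -> dict:
    """Package a patch for a (hypothetical) CI job; nothing is sent anywhere here."""
    if not patch:
        raise ValueError("no changes against trunk, nothing to try")
    return {"developer": developer, "patch": patch}
```

A real tool would get the diff from version control and post the request to the CI server, which would apply the patch to a clean checkout of trunk before running the build and tests.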

Next up is the concept of a staging environment (called “Princess” at Etsy) which makes use of Production data, and allows both smoke tests and manual tests as a final validation before release. There’s nothing radical in this – it’s a widely used pattern, but it does of course make you think carefully about how to make data- or data-structure changes. Clearly these “system-stateful” changes are high risk, and need to be well planned. I think this staging model actually makes it clearer that data/state-changing changes are a “special case” and reinforces a culture that treats them as such.

Deploys at Etsy are coordinated via an IRC channel, manned by bots, that allows devs to join the deploy queue/train. Deploys are actually done from a custom app with two buttons, so there is no ambiguity about the process. Interestingly there are two “deploy trains” – one for config and one for code. I think I like that approach, though it does require some forward planning and coordination to ensure that inter-dependent code and config changes are released in the right order. That seems to rely on human judgement and practice, which could – in principle – be easy to screw up.
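The deploy train is essentially a first-come, first-served queue, and a small Python sketch helps me picture it. The class and method names below are my own invention rather than anything from Etsy’s IRC bots, and the two trains are modelled simply as two independent queues.

```python
from collections import deque
from typing import Optional


class DeployTrain:
    """A first-come, first-served queue of developers waiting to deploy."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._queue = deque()

    def join(self, developer: str) -> int:
        """Add a developer to the back of the train; return their position (1 = deploying next)."""
        if developer not in self._queue:
            self._queue.append(developer)
        return self._queue.index(developer) + 1

    def current(self) -> Optional[str]:
        """Return who is at the head of the train, or None if it is empty."""
        return self._queue[0] if self._queue else None

    def done(self) -> Optional[str]:
        """Mark the head deploy as finished and return who is up next."""
        if self._queue:
            self._queue.popleft()
        return self.current()


# Two separate trains, mirroring the code/config split described in the talk.
code_train = DeployTrain("code")
config_train = DeployTrain("config")
```

Note that nothing in this sketch orders a code deploy relative to the config change it depends on, which is exactly the coordination that is left to human judgement.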

Another aspect – combined with the feature toggle/configs – that gives Etsy confidence to release is the post-release monitoring. The culture is that Devs write their own feature-monitors (using statsd) and do their own monitoring, that everybody can access all the graphs and that everything gets graphed onto dashboards, and that all the logs get streamed and monitored using supergrep. These tools are Etsy-written, but general-purpose and open-source.
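For a feel of how lightweight a feature monitor can be, here is a minimal Python sketch that emits metrics in the plain-text StatsD line format (`name:value|type`) over UDP. The metric names are made up, and I’m assuming a StatsD daemon listening on the usual default port of 8125 on localhost; in practice you would probably use an existing client library rather than rolling your own.

```python
import socket


def format_metric(name: str, value: float, metric_type: str, sample_rate: float = 1.0) -> str:
    """Build a StatsD line, e.g. 'checkout.errors:1|c' for a counter."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"  # tells StatsD only a sample of events is being sent
    return line


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget send over UDP, so monitoring never blocks the feature itself."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metrics for a newly released feature:
counter = format_metric("new_checkout.completed", 1, "c")   # count an event
timer = format_metric("new_checkout.render_time", 320, "ms")  # record a duration
```

Calling `send_metric(counter)` at the interesting points in a feature is all it takes to get a new line on a graph.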

[Image: Etsy graphs]

Daniel summed this up with a great website:

http://shouldigraphit.com/

Basically – if it moves, graph it!

This is important because once you’ve deployed your feature you need to watch the responses from the system to see if the behaviour of the charts changes – load, error pages, response times, etc. If it does, that indicates a problem, and if the feature is in A/B testing you should be able to confirm that clearly from the figures. Etsy use a #warroom IRC channel solely to coordinate outage/issue-related responses, along with open-invite post-mortems.

Daniel moved on to talk about dev support rotas, “on-call”, etc. There’s nothing especially novel about this, though of course the culture of having to “eat your own dog-food” as a dev is generally a good thing – if you know you’re going to get woken up or have to stay late, you’re more likely to take extra care to prevent the bug in the first place, I guess.

There’s a world of difference between Daniel’s world and my own, but nevertheless I think there are some real lessons to be drawn from the Etsy experience – in both the culture and the tools/recipes they use.

You can find a link to Daniel’s slides here: http://qconlondon.com/dl/qcon-london-2014/slides/DanielSchauenberg_DevelopmentDeploymentCollaborationAtEtsy.pdf