DevOps – some lessons from QCon

DevOps is a software development method that stresses communication, collaboration and integration between software developers and IT Operations (e.g. Support, Helpdesk, Infrastructure).

I view it as a breaking-down of the barriers between these groups and an end to the “silo” mentality, with its associated hand-offs that are essentially waste (in the Lean sense). That’s not to say that the assurance, compliance, etc, that are associated with IT Operations are not important, but really to emphasise that if these can be achieved without the waste of rigid silo-isation, then that can give us a much more efficient and effective organisation.

That was a thrust of a talk by Damon Edwards (@damonedwards) at QCon last week where he spoke about how to introduce DevOps to an organisation, with a focus on doing this from the development team. For a synopsis and copies of the slides, please see: this link.

I also went to a talk by Ola Ellnestam (@ellnestam) about DevOps at a small Scandinavian bank.  See this link for a synopsis and slides.  Ola’s talk was very interesting, though it felt very unreal to me as the entire staff of this bank is 12 – with only 3 people in IT – naturally giving a DevOps requirement. Of course necessity is the mother of invention, so given a need to operate in a DevOps style, it was very interesting to hear Ola’s experiences and to think how these could be translated to bigger organisations.

Ola said that working in such a small environment forced people to take on responsibility that might not be defined by their nominal role, and that it was essential to focus on Activities not Roles.  Funnily enough despite having worked in some organisations several orders of magnitude larger than Ola’s, the same characteristic seems important to me.  In big hierarchical organisations people have no way of navigating and understanding the complex structures, decision-making processes, etc, and instead look to the Role that people have as a proxy.  Then they look inward and focus on their own Role, and the definition of that role in turn becomes a proxy for achievement, recognition, reward and self-worth.  Promotion, rank, grade, job title all assume far greater apparent value than they deserve and occupy an unnatural amount of attention – and thus result in Waste.

Operating at such a small scale also forces Ola to focus on minimalism and pragmatism.  There’s a note of caution here – if you only focus on these characteristics there’s a danger of only producing “duct-tape” solutions to problems that stand neither the test of time nor change in an organisation.  Ola also espouses a need to look at Holism – the understanding of the whole, and when this is also brought into the equation it’s possible to look for really powerful but Lean solutions.  Ola summed it up with a great slide:

MPH

Damon also brought in concepts of Holism when he spoke about “Seeing the System” and “Organisational Alignment”, but more on these later.

From a more practical standpoint, Ola talked about the need for reducing the time taken to roll-back, and he described this brilliantly as having a Ctrl-Z strategy.  We always undo typos and mistakes when we’re working at a computer, so why should doing releases/deploys be any different?  Being able to do this would certainly give more confidence in an ability to “release now” which was a central theme of the talk about deployment at Etsy. Interestingly at Etsy they cannot roll-back – their release cycles are so frequent, they adopt a roll-forward approach.

One of the rationales for this Ctrl-Z strategy is in order to have better control over the “error surface”, which is the impact of an error multiplied by its duration.  I don’t think this is something that is easily quantified in most circumstances, but the point is this:  it’s very difficult to know whether a given error (e.g. a typo) will result in a large impact or a small impact.  One thing you can try and control however is the duration of that impact – and a key technique for this is the ability to easily and swiftly roll-back, i.e. Ctrl-Z.
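To make the arithmetic concrete, here is a minimal sketch in Python. The function name, the units (affected users and minutes) and the numbers are my own assumptions for illustration, not figures from Ola's talk.

```python
def error_surface(impact, duration_minutes):
    """Error surface as described above: impact multiplied by duration.

    'impact' is a hypothetical measure (here: number of affected users);
    the units are an assumption made purely for illustration.
    """
    return impact * duration_minutes


# The same error (same impact), with a slow roll-back versus a fast one.
slow = error_surface(impact=200, duration_minutes=60)
fast = error_surface(impact=200, duration_minutes=5)
print(slow, fast)
```

The impact is the part you usually cannot predict, so it is held constant here; the only lever that changes the result is how quickly you can undo the release.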

Ola achieves this Ctrl-Z approach by investing in a pure build – i.e. building and releasing from scratch rather than in an incremental fashion. Clearly an approach like this commoditizes the “running instances” and can be useful in other scenarios – e.g. failover to BCP locations, etc.  If you always release by doing a “virgin” build, then it’s easy to have confidence in doing the same in an extenuating, disaster-like situation.
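One common way to get this effect (not necessarily how Ola's team does it) is to build every release from scratch into its own directory and keep a "current" pointer that can be switched back. The sketch below assumes a POSIX filesystem with symlinks; the directory layout and the function names are hypothetical.

```python
import os


def build_release(root, version, files):
    """Build a release from scratch into its own directory (never patched in place)."""
    release_dir = os.path.join(root, "releases", version)
    os.makedirs(release_dir)  # raises if this version was already built
    for name, content in files.items():
        with open(os.path.join(release_dir, name), "w") as fh:
            fh.write(content)
    return release_dir


def activate(root, version):
    """Point 'current' at the given release by swapping a symlink into place."""
    target = os.path.join(root, "releases", version)
    tmp_link = os.path.join(root, "current.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # Renaming over the old link replaces it in a single step on POSIX.
    os.replace(tmp_link, os.path.join(root, "current"))


def active_version(root):
    """Return the name of the release that 'current' points at."""
    return os.path.basename(os.readlink(os.path.join(root, "current")))
```

With this layout, Ctrl-Z is simply `activate(root, previous_version)`: the previous build is still on disk, untouched, so rolling back is the same cheap operation as rolling out.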

Other techniques Ola mentioned were to do with database changes.  It’s imperative to make these easy to roll-back as well, so they need to be non-destructive where possible, e.g. by use of shadow-copies of DB tables, and by adding columns and renaming old ones, allowing the reverse operations in case of roll-back.
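As a rough illustration of the shadow-copy idea, here is a small sketch using Python's built-in sqlite3 module. The table and column names are made up, and this is my own reading of the technique rather than the scripts Ola's team actually uses.

```python
import sqlite3


def column_names(conn, table):
    """Return the column names of a table, in order (empty if it doesn't exist)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def migrate(conn):
    """Forward change: keep the old table as a shadow copy, build the new shape beside it."""
    conn.execute("ALTER TABLE customers RENAME TO customers_old")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.execute("INSERT INTO customers (id, name) SELECT id, name FROM customers_old")
    conn.commit()


def rollback(conn):
    """Reverse change: drop the new table and restore the untouched shadow copy."""
    conn.execute("DROP TABLE customers")
    conn.execute("ALTER TABLE customers_old RENAME TO customers")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (name) VALUES ('Ada')")
migrate(conn)
print(column_names(conn, "customers"))
rollback(conn)
print(column_names(conn, "customers"))
```

Nothing is destroyed by the forward step, which is what makes the reverse step trivial. The obvious caveat is that rows written to the new table after the migration are lost on roll-back unless you copy them back first.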

Interestingly he moved his team from Subversion to Git and explained that the focus of development should be on conversations rather than complex-locking. In some ways there’s no surprise here, but the most interesting aspect to me is that he found value in the SCM switch (and the cultural, behavioural switch!) even for a team of three.  I’m a big fan of Git, but it’s reassuring to hear about its benefits even in very small teams.

The final notion that I took from Ola was a fantastic metaphor:

How do you eat an Elephant?

 

Elephant

The normal approach is to cut it up into (small!) slices!  Ola’s answer is instructive: better to shrink it into a small elephant!  This is a great way of thinking about Agile – do the “whole of something” many times, but on a small scale, rather than trying to do “part of something” big and hoping you can get through it all!

How to make the transition to a DevOps model?

Back to Damon’s talk now.  Damon presented a “recipe” for introducing DevOps to an organisation, from a developer’s perspective. I’ll summarise that recipe here, though it seems to me to require quite an intensive effort, and funnily enough to be fairly “non-agile” in its approach (i.e. the implementation, not the end-state).  I say that because I think it requires some considerable blocked-out time for a team to analyse their current processes, etc, and agree a programme of change.  Damon himself suggests a workshop lasting a few days to kick things off, though he does say with practice this can be significantly reduced. In my own organisation, we’re only likely to do this once (we are pretty small) and hence this represents a pretty high fixed cost.  Furthermore, I think we have other lower-hanging fruit to pick first, for example introducing some Agile development practices, and being more selective about what we build versus what we buy.

Anyway, all that being said here’s Damon’s DevOps Recipe for a Developer:

1. Socialize the concepts and vocabulary

2. Visualize the system:

  • value stream mapping
  • time analysis
  • waste

3. Pick metrics that matter

4. Identify projects, experiment, measure against a baseline, and repeat steps 2-4

Damon’s recipe does seem to me to involve some pretty heavy lifting in terms of the value stream mapping and the time analysis.  I guess these things could be done quickly – but I suspect you need considerable experience to be able to do this, and some strong guidance and mentoring.  He suggested doing this analysis by looking at two things:

  • the flow of artefacts
  • the flow of information

Waste is a somewhat easier concept for me – at least inasmuch as I am familiar with Lean – and so I find wastes more natural to look for and identify.

The seven wastes of software development:

When it comes to metrics – clearly an important topic if you’re looking to iteratively improve something – Damon suggested:

  1. cycle time
  2. mean time to detect 
  3. mean time to repair
  4. quality at the source (scrap) – how often does the problem escape the place it was created?

He also emphasised the need to tie the metrics to the individual, i.e. to be able to answer the question:  “What can I do to improve the metrics?”
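For what it's worth, the time-based metrics are cheap to compute once you record a few timestamps per incident. The sketch below is my own, with assumed definitions: mean time to detect runs from when a problem started to when it was noticed, and mean time to repair runs from when it was noticed to when it was fixed. The incident data is invented.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_duration(pairs):
    """Average the gap between each (start, end) pair of timestamps."""
    return timedelta(seconds=mean((end - start).total_seconds() for start, end in pairs))


# Hypothetical incidents: (started, detected, repaired).
incidents = [
    (datetime(2013, 3, 11, 10, 0), datetime(2013, 3, 11, 10, 30), datetime(2013, 3, 11, 11, 30)),
    (datetime(2013, 3, 12, 14, 0), datetime(2013, 3, 12, 14, 10), datetime(2013, 3, 12, 14, 40)),
]

# Assumed definitions: detect = started -> detected, repair = detected -> repaired.
mttd = mean_duration([(started, detected) for started, detected, _ in incidents])
mttr = mean_duration([(detected, repaired) for _, detected, repaired in incidents])
print(mttd, mttr)
```

Cycle time can be measured with the same helper, given a start and end timestamp for each piece of work (for example, commit time and deploy time).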

When it comes to implementation, Damon suggested viewing the transition from the point of view of the challenges that an Ops group faces, for example a desire for Audit, Compliance, predictable workload etc, and then suggesting solutions that specifically address those, but in an automated way.  For example, address the “queuing” problem of hand-offs with Chef/Runbook deploys, self-service tooling and so on, while demonstrating how these can be Audited, Secure, etc.

Damon also suggested launching with a “burst of energy” which he proposed would be in the form of a multi-day workshop, along with a “brand name” to help rally the cause and establish a common goal, e.g. “Ticketless IT”. While I can see how that would work, I think the bigger gains to be had in my own organisation are probably to be found elsewhere (other than DevOps) so I think my own approach will be to look for some specific, obvious examples of waste and to try and address these initially, and then look to DevOps as a second or third stage transition.

 

3 thoughts on “DevOps – some lessons from QCon”

  1. damonedwards

    Thanks for writing up such detailed notes.

    I want to respond to your comment about the process being heavyweight and your preference for starting with “low hanging fruit”.

    My first question would be: how do you know if your low hanging fruit is going to actually fix an organization wide problem? Or could it seem like a great idea from your silo but in reality make things worse somewhere else? Until you get everyone in the organization seeing the whole and aligned on what the problems are, you really don’t know. That vision and alignment is a skill that needs to be developed just the same as any technical skill.

    DevOps problems develop at organizations where the various parts of the organization fall out of alignment. Silos build up and the problems mount because everyone thinks they are doing the right thing but are doing so from their own perspective and not the perspective of the whole system (i.e the end-to-end delivery process that spans your organization and reaches your customers).

    I probably should have been clearer in my talk that, while the ideas are the same, the way I presented the process was intended for medium and large enterprises (multiple business lines, dozens of teams, multiple locations, etc.). It’s in these organizations where once the various parts fall out of alignment, it will undermine any improvement efforts you want to make. It’s in these scenarios where you need to attack that alignment problem head on and bring the various parties together to develop that shared vision of how things are and how to move forward. You don’t need the entire company to do this at once. You just go value stream by value stream. As the patterns emerge it really does speed up. At a certain point it just becomes a natural way for the entire org to see their work and improve. At that point the dedicated workshop approach is no longer needed to focus people.

    I’ve seen companies treat this as a training exercise. I’ve also seen companies treat this as something akin to an agile spike for culture and process. In any case, 4 days (really 3.5) to get people together to start to sort out the culture and process issues that are forcing an organization out of alignment and spawning DevOps problems (namely slowing time to market and decreasing quality) shouldn’t seem like a lot considering the importance. Especially when you factor in what the true compounding costs of these problems are to an organization of any significant scale.

    One of the things that didn’t come across in Daniel’s talk about Etsy is just how much time they spend on getting their culture right (hint: it’s a lot more than a one-time 4 day event and some 1-2 day follow-ups). Look up some of the writings by John Allspaw (SVP Tech Ops) or Chad Dickerson (CEO). It’s fascinating just how much thought and effort they put into getting the culture (i.e. alignment) right. Because they have done that, all of the tips and tools that Daniel spoke about just come as obvious answers to them (with lots of trial and error to get there, obviously).

    1. zeresht

      Hi Damon – thanks for such a comprehensive comment. I do take your point, and I completely agree about the value of “getting out of the silo”. I do actually prefer the value-stream by value-stream approach (if only because by demonstrating success in one area, you can generate momentum to repeat it elsewhere). I guess the “low hanging fruit” I have in mind (here) is to introduce automated testing, iterative development and continuous build. These are pre-requisites to an effective DevOps culture in my mind, and making improvements in these areas will have the biggest impact in my organisation right now. I do think stepping towards DevOps after that makes perfect sense, and I think your model and approach has a lot to offer. I do think it may require specialist advice and assistance though… the value-stream mapping and time-analysis for example look pretty daunting to someone who has not approached them in a formal way before.

      1. damonedwards

        By packing it all into a 45 minute presentation I probably gave the impression that this is a heavy, all-or-nothing, prescriptive approach. That’s not the case in reality. Everyone goes about it a bit differently depending on their company’s culture and politics… it’s just that in the end the successful ones all sort of look like that general pattern.

        Like any skill, there are some orgs that can figure it out on their own and others who find it more productive to bring in a third party to help sort it all out. With the human complexity of culture and process issues, a third party often helps play neutral change agent (something you usually don’t have to worry about with bringing in a new technology).

        Last thing regarding the allure of “low hanging fruit”… :)

        If the org is aligned, then low hanging fruit really is low hanging fruit. If it’s not, I’ve seen plenty of cases where people think something is obvious but the solution causes more problems. For example, I’ve seen plenty of automation solutions (everybody needs deployment automation, right?) in which the implementation caused it to become yet another silo. So then the team that implemented it was happy but the rest of the org was bottlenecking even worse. People would start to route around it and finger-pointing would ensue. Right goal but wrong implementation because nobody saw and really understood the whole.
