Acceptance Testing for Continuous Delivery
I went to #qconlondon last week and one of the stand-out presentations was Acceptance Testing for Continuous Delivery by @davefarley77. This was one of those talks which read pretty much like a recipe book for the topic, but when the author is Dave Farley, pretty much the authority on Continuous Delivery, it’s a foolish man who doesn’t listen to what he has to say.
What is acceptance testing?
Acceptance testing is the ability to assert that the code does what the user wants, that the system works in a prod-like environment (not just “on my computer”) and that the deployment and configuration of the system work correctly. Acceptance testing acts like an automated Definition of Done, and provides a timely feedback loop on user stories. Taking the ATDD view of things, automated acceptance tests should be an executable spec of what we want from the system.
Who owns the tests?
A key question that Dave posed up front is “Who owns the tests?”. He acknowledged that anyone can write tests – with modern tools and DSLs – but it’s the developers who will break the tests when they make code changes, and so it is the developers who have a responsibility to keep the tests running (remember: keeping the continuity of delivery is key). In his view, separate test teams owning acceptance testing is a toxic anti-pattern.
I don’t disagree with this, though there are a couple of traditional biases towards that model – at least in the enterprises where I have experience. Firstly, until you’ve made a genuine transition towards TDD of some description, developers have a tendency to look down upon testing. There are probably all sorts of reasons for this, but I suspect it’s a slightly circular technical snobbery within the dev team (“I’m a techie so I build things; the testers can’t code so they just test stuff, therefore testing doesn’t need techie skills, and therefore it’s beneath me.”). Of course, once you have automated tests and you acquire the technical skills to implement and maintain these, this bias disappears quite quickly.
Secondly, in regulated environments there’s a belief that quality and assurance come from the independence of testing, which in turn is implemented in the simplest possible way (from a management perspective, though not from an efficacy perspective!) by simply having a separate team. That’s independent. Simples! As always with “management from a distance” there’s no understanding about the quality of testing, and no empirical measure of the number of tests or test coverage (in relative terms) that can be achieved by automation. In my experience these characteristics vastly outweigh the benefits of “independence”. Nevertheless it’s all too easy to underestimate the ability of people to screw things up, so there’s a lot to be gained by having someone check your work, but we call that a “peer review” and there’s no need to reinforce the “independence” at a management level; instead simply have some mechanism or tool for recording the peer review, and use that and the metrics resulting from automated testing to demonstrate in quantitative terms that level of assurance and review required by auditors or regulators. @SmartBear were at QCon selling their #smartbearcollaborator tool which looks like it could fit the bill by providing a structure and measurement model around the peer review process, and of course there’s always #fisheye.
Of course this raises the question of what happens to the existing dedicated QA/Assurance folk. I think it’s fair to say that they must be a threatened species, so I’d think it would make sense for them to reskill themselves – whether by embracing the change and using the opportunity of DSLs to learn some coding, or by taking advantage of their acquired business knowledge to move into the analysis or product-owner space. A lot of the testers I’ve come across in my career have been very impressive individuals with a forensic ability to detect and identify bugs, with a deep knowledge of the systems under test and the businesses supported by those systems. I don’t view them as threatened – just the nature of their current roles; they need to adapt to survive and recognise that automation is going to happen. Some have indeed recognised this but have not sufficiently accepted the scale of change required, thinking that “automated UI testing with UI recording tools” is a good enough adoption of automation. It really isn’t – these tools are extremely difficult to use and to maintain, they have a steep learning curve, and they are often not automatable, scalable or performant enough for a CD environment. More on that later…
Other nasty anti-patterns Dave mentioned included the record-and-playback of production data as well as dumps of production data to test systems.
Properties of good acceptance tests
So what does make a good acceptance test anyway? This really formed the bulk of Dave’s talk, and so I’ll summarise here what I heard and how I interpret that.
What not how
Firstly, it’s critical to focus on “what” we want a test to express, not “how” it is expressed.
Typically for a system under test, we write test cases to simulate users and/or use-cases of the system. The problems arise when a change is made to the system, because then you have to go and change every test case affected by that change. The traditional way in IT of solving a problem like this is to add a level of abstraction to the system – to add a proxy between the test cases and the system under test, for example as with the Pagedriver pattern, with only the proxy needing to be changed in parallel with the application change. This means that the tests don’t need to know anything about the underlying system itself (the “how”), and instead can focus just on the “what” the test is intended to achieve, with the proxy translating that into the actual interactions required. This in turn makes it easier to express the tests themselves in an abstract language, e.g. a DSL – more later on that subject!
Over time you start to build up a “test infrastructure”, which in turn accelerates the development of further tests and you build a virtuous circle of TDD. (Note: infrastructure in this context is not supposed to mean a physical or virtual infrastructure of servers and networks, but instead the underlying paraphernalia of services, software and tools necessary for writing, running and supporting the tests).
This focus on “what not how” also means that we don’t want every test to control its initial conditions by starting and stopping the app, since doing so would incur a huge cost every time we run a test. Instead we should start the system once and then run a bunch of tests against it, amortizing the cost of startup. Separating these concerns (the “what” and the “how”) also gives us the opportunity to parallelise the tests.
Good tests use the language of the problem domain
Although this doesn’t necessarily mean you should use a domain-specific-language with which to specify the tests (though there are plenty of such frameworks and toolsets, and they’re definitely worth looking at) it does suggest that over time – even in a traditional language – you should build up the artifacts of a testing framework with a nomenclature that represents the problem domain – the entities/nouns as well as the actions/verbs. This helps with test-case creation, readability and maintenance and of course helps to ensure that the tests do live up to the goal of describing the “what” not the “how”.
Isolated from one another
Any form of testing is about evaluating something in controlled circumstances, and so good acceptance tests should be isolated in a number of different ways and ensuring this is a vital part of test strategy. Plus of course the isolation means that they can be run in parallel with one another for efficiency and speed.
Firstly the tests should isolate the system under test from other systems. A typical scenario (especially in big enterprises) is to say we want to see “end to end tests” of multiple systems connected together. Of course this is the antithesis of isolation, and typically we have little to no control over the input systems that feed data to our system. We won’t be able to test the corner cases, the peculiar situations, etc, plus of course it’s unfair to assume knowledge and understanding of all these other systems. Instead a good approach is to fake the external systems and produce validatable output.
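Here is a small sketch of what faking an external system might look like. FakePaymentGateway and Checkout are hypothetical names I have made up for illustration; the fake lets the test choose the response (including awkward ones such as a timeout) and records what it was sent so that the output can be validated.

```python
class FakePaymentGateway:
    """Hypothetical stub for an external system that is outside our control."""

    def __init__(self):
        self.next_response = "approved"  # the test decides what comes back
        self.requests = []               # record of calls, for validation

    def charge(self, account, amount):
        self.requests.append((account, amount))
        if self.next_response == "timeout":
            raise TimeoutError("gateway did not respond")
        return self.next_response


class Checkout:
    """Hypothetical system under test, which depends on the gateway."""

    def __init__(self, gateway):
        self._gateway = gateway

    def pay(self, account, amount):
        try:
            outcome = self._gateway.charge(account, amount)
        except TimeoutError:
            return "retry-later"
        return "confirmed" if outcome == "approved" else "rejected"


gateway = FakePaymentGateway()
checkout = Checkout(gateway)

gateway.next_response = "timeout"  # a corner case that is hard to force for real
timeout_result = checkout.pay("acct-1", 25)
```

With a real upstream system we would have to wait for a timeout to happen by chance; with the fake it is a one-line setup.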
Of course there’s a valid concern about the stability of the interfaces between systems when we change something, which is usually the driver of people wanting these end-to-end tests, but the tests required to test an interface itself are usually far fewer and simpler than the complexity added by linking the systems together, and so this is usually a much better approach.
Next, tests should be isolated from each other and to achieve this (and thus achieve parallelisation and independence) we need to avoid ANY dependencies between tests. An important technique is functional isolation – identify the different functions that the system has and test them in complete isolation. As an example, Dave suggested that a test of Amazon (say) would involve creating a new account and a new book or product, and then have the account purchase the book/product. Doing otherwise would imply a functional dependence on a pre-created test user and a pre-created test book.
Tests should be repeatable, which is actually just another form of isolation in as much as it means they should be isolated from themselves: they should have “temporal isolation”. If I run a test twice it should work both times. But what if I create data (or delete, update) in a test? When I re-run the test, that data will already exist – so I won’t have the isolation or true repeatability I am looking for. A good technique for this is to have the test infrastructure create (consistent) unique IDs or aliases for each test within a test-run, and thus ensure the test isolation I need.
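One way to picture the aliasing idea is sketched below. AliasRegistry is a hypothetical helper of my own, not something from the talk: the test refers to “alice”, and the infrastructure maps that alias to a name that is unique to this run but consistent within it.

```python
import uuid


class AliasRegistry:
    """Hypothetical helper: maps readable aliases to names unique to one test run."""

    def __init__(self):
        # A fresh random suffix per registry, i.e. per test (or per test run).
        self._run_id = uuid.uuid4().hex
        self._aliases = {}

    def real_name(self, alias):
        # The same alias always resolves to the same real name within this registry.
        if alias not in self._aliases:
            self._aliases[alias] = f"{alias}-{self._run_id}"
        return self._aliases[alias]


first_run = AliasRegistry()
second_run = AliasRegistry()

name_in_first_run = first_run.real_name("alice")
name_in_second_run = second_run.real_name("alice")
```

The test script stays readable (“alice buys a book”) while the data created in the system never collides with data left behind by a previous run.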
Another technique to assist with repeatability, often used in conjunction with the interface-based testing discussed above (instead of end-to-end testing) is to create a back-channel to confirm that the expected outputs correspond to the inputs of a test.
Tests should test any change
Tests should be able to test ANY change to the system, and should be a method of applying the scientific method to the system development – they are about performing experiments.
Test cases should be deterministic, and yet time is a problem for determinism since dates and times are frequently important concepts in our data and our system logic, and of course we always run the same test at different times. There are two approaches to dealing with this. Firstly, the simplest: filter out time-based values in the test infrastructure so that they are ignored. Although fairly trivial to implement, this approach can obviously miss errors and prevents complex time-based scenarios from being tested. An alternative is to control time – to treat it as an external dependency, like any external system, and to fake it! This is obviously more complex in terms of the test infrastructure required, but provides a great deal of flexibility. @davefarley77 recommended creating a time.travel() method to manage the shifting of “system” time (or any time field passed to it), and again the use of a back channel to manage the expected results.
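The following is my own minimal take on treating time as a fake-able dependency; it is not the implementation shown in the talk. FakeClock, its travel() method and the Subscription class are hypothetical names, and the sketch assumes the system under test asks an injected clock for the time rather than reading the real one.

```python
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Hypothetical controllable clock, injected in place of the real one."""

    def __init__(self, start):
        self._now = start

    def now(self):
        return self._now

    def travel(self, **delta):
        # Shift "system" time forward, e.g. travel(days=30) or travel(hours=2).
        self._now += timedelta(**delta)


class Subscription:
    """Hypothetical time-dependent logic under test."""

    def __init__(self, clock, days):
        self._clock = clock
        self._expires = clock.now() + timedelta(days=days)

    def is_active(self):
        return self._clock.now() < self._expires


clock = FakeClock(datetime(2015, 3, 1, tzinfo=timezone.utc))
subscription = Subscription(clock, days=30)

active_at_start = subscription.is_active()
clock.travel(days=30)
active_after_expiry = subscription.is_active()
```

Because the test owns the clock, a thirty-day scenario runs in microseconds and gives the same answer whenever it is run.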
Efficiency and test environment types
We want thousands of tests and we want prod-like environments, but simply replicating production is likely to be too expensive or complex to manage. Instead we need to focus on the deployment topology and ensure that what we are testing replicates the challenges we will face in production. The test environments need to be representative not identical. The key to success is to fail fast, in order to provide feedback so we can correct issues and move on. There’s simply no point in spending time, money and effort in replicating an entire environment if it doesn’t actually yield a greater number of failures any more quickly. In @davefarley77’s words: “If the feedback loop from running these tests goes much above an hour it severely compromises the behaviour of the team.”
A key tip relating to efficiency is to make the tests appear externally synchronous, but have them complete internally in an asynchronous fashion, for example by listening for a concluding event to represent completion. Never use wait() and expect reliability or efficiency; at worst use poll() and timeout if absolutely necessary.
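Here is a sketch of how a test helper might look synchronous to the test case while listening for a completion event underneath. OrderSystem and place_order_and_wait are hypothetical names for illustration. Note that threading.Event.wait(timeout) here is not the fixed-delay kind of wait warned against above: it returns as soon as the event is signalled, and the timeout is only an upper bound that turns a hang into a failure.

```python
import threading


class OrderSystem:
    """Hypothetical asynchronous system: work completes on another thread."""

    def __init__(self):
        self._listeners = []

    def on_order_completed(self, listener):
        self._listeners.append(listener)

    def submit_order(self, order_id):
        def work():
            # Pretend processing happened, then announce completion.
            for listener in self._listeners:
                listener(order_id)

        threading.Thread(target=work, daemon=True).start()


def place_order_and_wait(system, order_id, timeout=5.0):
    """Looks synchronous to the test case, but waits on a completion event."""
    completed = threading.Event()

    def listener(done_id):
        if done_id == order_id:
            completed.set()

    # Register the listener before submitting so the event cannot be missed.
    system.on_order_completed(listener)
    system.submit_order(order_id)

    # Returns as soon as the event fires; the timeout is only a safety net.
    if not completed.wait(timeout):
        raise TimeoutError(f"order {order_id} did not complete within {timeout}s")
    return order_id


completed_order = place_order_and_wait(OrderSystem(), "order-1")
```

The test case itself just calls place_order_and_wait(...) and carries on, so it reads as a simple sequence of steps.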
Some tests need special treatment. A good approach is to tag tests with their characteristic properties (e.g. @timetravel, @fpga, @destructive) and to allocate them dynamically to the overall test infrastructure in a coordinated way, for example ensuring that tests with these tags are not run in parallel where they may break the overall test isolation that is assumed by the test system.
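A very simple version of tag-based allocation could look like the sketch below. The tagged decorator, the EXCLUSIVE_TAGS set and the partition function are all hypothetical; which tags need exclusive treatment is an assumption on my part, and a real scheduler would be rather more sophisticated than two lists.

```python
# Assumption for this sketch: tests carrying these tags must not share the
# system with other tests while they run.
EXCLUSIVE_TAGS = {"timetravel", "destructive"}


def tagged(*tags):
    """Hypothetical decorator that records characteristic properties on a test."""

    def decorate(func):
        func.tags = frozenset(tags)
        return func

    return decorate


def partition(tests):
    """Split tests into those safe to run in parallel and those to run alone."""
    parallel, serial = [], []
    for test in tests:
        tags = getattr(test, "tags", frozenset())
        if tags & EXCLUSIVE_TAGS:
            serial.append(test)
        else:
            parallel.append(test)
    return parallel, serial


@tagged("timetravel")
def test_subscription_expires():
    pass


@tagged("destructive")
def test_database_failover():
    pass


def test_place_order():
    pass


parallel, serial = partition(
    [test_subscription_expires, test_place_order, test_database_failover]
)
```

Here the untagged test ends up in the parallel pool and the two tagged tests are held back to run on their own.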
Anti-patterns
Don’t use UI record-and-playback systems. They’re fragile and force you to couple the “what” and the “how”.
Don’t use record-and-playback production data. Although this has a place, it’s just not about acceptance testing – we want to be able to test in controlled circumstances, and recorded prod data is simply not representative of all of the control and edge cases we need to understand and test for.
Don’t dump production data into your test systems, even if you just use it to create specific test cases. Prod data is almost always too big and heavyweight. Instead define the absolute minimum data that you need to test with so that the tests can be run anywhere.
Don’t assume that an out-of-the-box automated testing product is going to define your testing strategy. Start with your OWN strategy which is specific to your system and problem domain. There are some good tools (e.g. Cucumber), but the pattern is more important than the technology.
Don’t have a separate Test/QA team… it doesn’t work! Quality is down to everyone, and developers should own the acceptance tests.
Don’t let every test start and init the app. Optimize for cycle-time and be efficient in your use of test environments.
Don’t include systems outside of your control within your acceptance test scope.
Don’t have wait() statements in your tests and hope it will solve intermittency issues… it won’t!
Tricks for success
Do ensure that developers own the tests
Do focus your tests on “what” not “how”
Do think of your tests as “executable specifications”
Do make acceptance testing part of your Definition of Done – integrating the acceptance test into the human approach to building systems
Do keep tests isolated from one another
Do keep tests repeatable
Do use the language of the problem domain – whatever your tech
Do stub external systems
Do test in “production like” environments
Do make tests appear synchronous at the level of the test case
Do test for ANY change
Do make your tests efficient