Test Failure Analysis Best Practices


In the world of software testing, tests sometimes fail. That’s a fact of life.

And when those failures happen, your first thought as a QA engineer might be to Slack the developers, say “test failed—try again thx!” and call it a day. After all, the main role of the QA team is to make sure that the software is ready to ship. Fixing problems that are discovered during testing is usually developers’ responsibility.

However, the fact is that the QA team needs to do more than merely record failures and alert developers to them. Whenever a test fails, QA engineers should take the additional step of performing test failure analysis. Doing so not only helps the QA team provide insights to developers that might allow them to resolve an issue faster, but can also make testing operations smoother at the same time.

In this post, we take a look at what test failure analysis means and discuss strategies for getting the most out of it.

What is test failure analysis?

Test failure analysis is what it sounds like: it’s the process of analyzing a failed test to figure out what went wrong.

The exact nature of your test failure analysis process should be tailored to your needs, but typically, the analysis should allow you to answer the following questions:

Application vs. test failures

Did the test fail because of a problem with the software that you were testing, or because of a problem with your test? This is the first and most basic question that you need to answer; after all, before you go telling the developers that their code has a bug, you should make certain that the problem was not caused by a test that you wrote.

Root cause

Regardless of whether the problem lies with the software or with your test, you need to know the root cause of the issue. The ways in which a test failure manifests itself on the surface may or may not reflect the root cause of the problem. For example, perhaps your test reveals that a menu in your application fails to load on a certain browser. There are multiple potential root causes of this issue: a corrupted CSS file, a permissions problem with the file, or a bug in the browser (which your developers will need to work around), to name just a few possibilities. You need to figure out what the core cause of the issue is to help developers address it quickly.

Failure scope

If the failure was caused by a problem with your application’s code, how many builds or configurations are affected by the failure? If you have run tests on every build and configuration, then this will be an easy question to answer. But if, like most QA teams, you don’t have the resources to test every single possible environment, your test failure analysis should include an assessment of how many environments are likely to experience the same problem.
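One rough way to make that assessment is to look for configuration attributes that only ever show up in failing runs. The sketch below is a heuristic under stated assumptions, not a definitive method: the function name `likely_affected`, the `browser` and `os` keys, and the sample data are all hypothetical.

```python
def likely_affected(tested, untested):
    """Guess which untested configurations may share a failure.

    tested:   list of (config_dict, passed) pairs from runs you did execute
    untested: list of config_dict for environments you could not run
    (Hypothetical data shapes, chosen only for this illustration.)
    """
    seen = {}  # (attribute, value) -> [pass_count, fail_count]
    for config, passed in tested:
        for item in config.items():
            counts = seen.setdefault(item, [0, 0])
            counts[0 if passed else 1] += 1

    # An attribute value is suspect if it appeared only in failing runs.
    suspect = {item for item, (ok, bad) in seen.items() if ok == 0 and bad > 0}

    # Flag every untested configuration that shares a suspect value.
    return [c for c in untested if any(item in suspect for item in c.items())]


tested = [
    ({"browser": "firefox", "os": "windows"}, False),
    ({"browser": "firefox", "os": "macos"}, False),
    ({"browser": "chrome", "os": "windows"}, True),
    ({"browser": "chrome", "os": "macos"}, True),
]
untested = [
    {"browser": "firefox", "os": "linux"},
    {"browser": "chrome", "os": "linux"},
]
print(likely_affected(tested, untested))
```

With this sample data, only the untested Firefox configuration is flagged, because Firefox is the one value that never passed. Treat the result as a starting point for the analysis, not as proof.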

Failure significance

A final question that your test failure analysis should be able to answer is: how significant is the failure? If it was caused by a problem with the application, is it significant enough that you need to delay deployment until it is fixed? Or is it a relatively minor issue that doesn’t warrant canceling a whole deployment?

If, alternatively, the problem lies with your test, you should evaluate how seriously the issue affects your testing pipeline, and whether you need to drop everything and address it ASAP or if you can live with it for a little while.

Getting the most out of test failure analysis

You might be thinking: “Test failure analysis sounds great, but who has the time to run a detailed analysis of every failed test? Not me!”

Fair enough. You don’t need to perform a full test failure analysis every single time a test fails. Instead, consider the following strategies, which can help you identify which failures require a full analysis and which you can ignore.

Parallel testing

Parallel testing offers many benefits, including the ability to assess the probable significance of a test failure quickly. This is because, when you run tests in parallel, you’ll know as soon as they are complete how many tests for a given build are failing.

If you run several dozen automated tests in parallel and only one or two fail, there’s a decent chance that the problem is relatively minor. That doesn’t mean you should necessarily ignore it, but it does give you a sense of how to approach the issue before you even begin any kind of analysis.

If, on the other hand, a significant portion (say, more than 20 percent) of your tests fail at the same time during a parallel run, you know that you either have a problematic bug on your hands that will probably require you to cancel deployment, or a widespread problem in your tests that you need to address immediately before you do any more testing.
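The rule of thumb above can be sketched as a small triage function. This is a minimal illustration, assuming your test runner can give you a pass/fail flag per test. The names `RunResult` and `triage_parallel_run` are hypothetical, and the 20 percent threshold is simply the example figure from the paragraph above, so tune it for your own suite.

```python
from dataclasses import dataclass

# Assumed threshold, taken from the "say, more than 20 percent" example above.
WIDESPREAD_FAILURE_RATIO = 0.20


@dataclass
class RunResult:
    """One test's outcome in a parallel run (hypothetical structure)."""
    name: str
    passed: bool


def triage_parallel_run(results, threshold=WIDESPREAD_FAILURE_RATIO):
    """Return a rough triage label for one parallel run."""
    if not results:
        return "no-results"
    failed = sum(1 for r in results if not r.passed)
    if failed == 0:
        return "all-passed"
    # Many simultaneous failures point to a serious bug or a broken test setup.
    if failed / len(results) > threshold:
        return "widespread"
    # Only a few failures: probably minor, but still worth a look.
    return "isolated"


run = [RunResult(f"test_{i}", True) for i in range(38)]
run += [RunResult("test_menu", False), RunResult("test_login", False)]
print(triage_parallel_run(run))
```

The label only tells you how to approach the failures. It does not replace the analysis itself.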

Auto-restarts

Despite the old cliché that the definition of insanity is repeating the same thing and expecting different results, the fact is that software is a fickle thing. Sometimes, running the same test twice will yield different results. Maybe a network connection failed temporarily on the first run, or a server crashed on your test grid. Weird and unpredictable things happen from time to time, even in software testing environments where we strive for consistency.

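For that reason, configuring your tests to restart automatically when a failure occurs is one way to reduce the number of potential failures that you need to analyze. If a test still fails after one or more automatic restarts, you can be much more confident that it is truly failing and deserves a deeper look. A test that passes only on a retry is still worth noting, since recurring flakiness can point to a problem in the test or its environment.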
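As a minimal sketch of the idea, assuming each test is a plain Python callable that raises an exception on failure, an auto-restart wrapper might look like the following. The names `run_with_retries` and `classify` and the default of three attempts are hypothetical. Most test frameworks and grids offer their own retry settings, which you would normally use instead.

```python
def run_with_retries(test_fn, max_attempts=3):
    """Run test_fn up to max_attempts times.

    Returns (passed, attempts_used, errors), where errors holds the
    exception from each failed attempt.
    """
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            test_fn()
            return True, attempt, errors
        except Exception as exc:  # a failed assertion or a runtime error
            errors.append(exc)
    return False, max_attempts, errors


def classify(passed, attempts):
    """Label the outcome so that only real failures go on to analysis."""
    if passed and attempts == 1:
        return "passed"
    if passed:
        return "flaky"  # passed only after a restart; log it for later review
    return "consistent-failure"  # failed every attempt; analyze this one


# Usage: a stand-in test that fails once, as a dropped connection might.
calls = {"n": 0}

def sometimes_fails():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("temporary network failure")

passed, attempts, errors = run_with_retries(sometimes_fails)
print(classify(passed, attempts))
```

Keeping the errors from the failed attempts means that flaky results still leave a trail you can review later.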

Failure playbook

As virtually anyone who works in QA knows, consistency is the mother (or one of them, at least) of quality. That rule certainly holds true when it comes to test failure analysis. You want a consistent, predictable process for assessing and reacting to failures.

That’s why you should develop a failure analysis “playbook” that lays out the process you will follow for determining whether a failure warrants analysis, as well as how you’ll perform the analysis if it does. Your playbook should also specify what you’ll do after analysis is complete by identifying the process for things like contacting developers and determining whether to retest.
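One way to keep a playbook consistent is to write its decision steps down as code or configuration, so that everyone on the team follows the same order of questions. The sketch below is purely illustrative. The `Failure` fields, the `next_step` function, and the wording of each action are assumptions, and your own playbook will have its own steps.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    """Facts gathered about one failed test (hypothetical fields)."""
    test_name: str
    reproduced_on_retry: bool  # did it fail again after an automatic restart?
    caused_by_test: bool       # problem in the test, not in the application
    blocks_release: bool       # significant enough to delay deployment


def next_step(failure):
    """Walk the playbook in a fixed order and return the action to take."""
    if not failure.reproduced_on_retry:
        return "log as flaky; no full analysis"
    if failure.caused_by_test:
        return "fix the test, then retest"
    if failure.blocks_release:
        return "notify developers and hold deployment"
    return "file a bug for developers; continue deployment"


menu_bug = Failure("test_menu_loads", reproduced_on_retry=True,
                   caused_by_test=False, blocks_release=True)
print(next_step(menu_bug))
```

Because the order of the checks is fixed, two engineers looking at the same failure should arrive at the same next step.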

AI

Artificial intelligence (AI) might not be a production-ready test failure analysis solution in most cases today, but as AI becomes an increasingly important part of software testing, you should keep it in mind as your team plans its test failure analysis strategy. AI is already transforming the ways in which IT teams troubleshoot post-deployment software and infrastructure problems, and it has the same potential to help analyze complex test failures quickly.

Conclusion

Your tests will fail, at least sometimes. The way you react to the failures plays a pivotal role in shaping the effectiveness of your overall testing strategy. Instead of simply sending failed code back to developers and expecting them to handle it, you should have a consistent plan in place for analyzing test failures and reacting to them. 

Chris Tozzi has worked as a journalist and Linux systems administrator. He has particular interests in open source, agile infrastructure and networking. He is Senior Editor of content and a DevOps Analyst at Fixate IO. His latest book, For Fun and Profit: A History of the Free and Open Source Software Revolution, was published in 2017.

Written by

Chris Tozzi
