Using Sauce Breakpoints to Find and Fix Flakey Tests

Spoiler Alert:
If you read this article, you'll be one of the first to hear about a previously unpublicized feature from Sauce. It's like an Easter egg!

It probably comes as no surprise that at Sauce we write a lot of Selenium tests. Our website needs good test coverage, just like our customers' apps. We have a build that runs all of these tests (and many more unit tests besides) after every chunk of commits. If tests fail in our build, it stays "red" until someone commits a fix. During that time, we can't deploy the new code, and it's our custom not to push more commits on top while the build is red, so the problem can be diagnosed and fixed without complicating matters. In other words, it's a big deal when the build breaks, because it potentially interferes with other developers' workflows.

That's one of the reasons we pull out our hair and yell obscenities when we encounter flakes in our build. A flake occurs when a test that normally passes fails non-deterministically (i.e., under seemingly random conditions). If we run the build again, that same test might pass, leaving us without much information about what went wrong. Is something wrong in the code? Is something wrong in our build infrastructure? It leaves us uncertain whether we might actually have a problem with that functionality in production, too: if it's failing 1 out of every 1,000 times in the build, is it affecting 0.1% of our customers?

On a recent Flakey Friday (a Friday dedicated to tracking down and eliminating flakiness from tests), we caught a test acting strangely, failing one out of every ten or so runs. The test looked like this:

def test_can_publish_and_back(self):
    self.login(self.user)
    self.open_job(self.test_job['_id'])
    self.find_element_by_link_text("make public").click()
    self._check_public()
    self.find_element_by_link_text("make private").click()
    self._check_private()

This is one test for our job* detail page. The setUp function for this test class handles creating a new random user and a new random job. The test logs the user in and goes to the page for this job. It then clicks a link designed to make the job "public" (i.e., viewable by anyone on the web), checks both the database and the website to make sure the AJAX-powered toggle did its trick, and finally makes sure we can toggle the job back to "private" in the same way.

This is a straightforward test, built using Selenium test best practices (creating a fresh random user object, a fresh random job, and using spin asserts to avoid race conditions). Every so often, though, it would fail: Selenium would check that the link text had changed after the click, and in these cases it hadn't. Likewise, the job was not marked with the appropriate status in the database.

How do you diagnose and fix a problem that only occurs on average 10% of the time in the build? Well, the first thing we tried was reproducing the behavior manually. Unfortunately, no matter how many times we performed the test actions ourselves in a browser, we could never observe the failure. Since we couldn't reproduce the bug, all we had were various hypotheses about website load in our test environment, or JavaScript issues that prevented the AJAX call from taking place. We were basically looking at a long, hard road of guesswork.

At that point we decided to make use of Sauce Breakpoints to try and catch the bug in the wild. I've written previously about how you can use Breakpoints to debug JavaScript errors in tests you are writing. That particular technique wouldn't have helped us here, because we couldn't reliably reproduce the failure. What we needed was a way to run so many instances of this test that we were likely to observe a failure, and then to enter Breakpoint mode on just the tests that failed.
The first step was taken care of in a rather brute-force way: I simply created 14 new versions of the same test, like so:

def test_can_publish_and_back2(self):
    self.test_can_publish_and_back()

...

def test_can_publish_and_back15(self):
    self.test_can_publish_and_back()

This way, I could run our custom version of the Nose test runner and have it pick up all and only the tests I was interested in, using a wildcard match:

nose test_can_publish_and_back*

Then, I made use of a feature we have not yet publicized: programmatic Sauce Breakpoints. This is achieved by sending a special Selenium command that the Sauce Cloud understands to mean that you want the job breakpointed. For both Selenium RC and WebDriver, the special command is sauce: break. For Selenium RC, this command is sent as the context parameter for setContext. For Selenium WebDriver, it is passed as the script value of the execute command. Luckily, the Python WebDriver API implements these commands, so all I had to do was hack sauce: break into our main test class's tearDown function:

def tearDown(self):
    if not self.passed:
        self.collect_web_traceback()
        if self.break_on_fail:
            self.driver.execute_script("sauce: break")
    self.report_pass_fail()
    if self.stop_on_teardown:
        self.driver.quit()

Essentially, our tearDown logic here says, "If the test didn't pass, get a traceback and breakpoint the test if I've set self.break_on_fail. Then, report the status to Sauce, and close the WebDriver session." With all of these modifications in hand, I was able to run the offending test multiple times in parallel like so:

nose --processes=15 test_can_publish_and_back*

Then, all I had to do was go to my Sauce Labs tests page and watch to see which tests turned up as breakpointed. I could navigate to the detail page for a breakpointed test and use the dev tools in Chrome to examine what was happening. In the case of this flake, I discovered that the AJAX request was not successful: it was receiving a 401 response from our test server. This meant that the CSRF protection for the AJAX POST was messing up somehow. After a lot of website backend debugging, we were able to determine that, under load, new CSRF tokens sometimes took longer to save to our persistent data store than the website took to send them back in its response. The browser's next (valid) request therefore appeared invalid to the server, which replied with a 401. Luckily, upgrading our backend code and making our session save synchronous took care of the problem.

The details I have shared about our particular flake are not important to the big story here. What is important is that we had a kind of flake that was nigh-impossible to pin down without a tool like Sauce Breakpoints. It allowed me (within the space of one parallel Sauce test run) to observe the bug in its natural habitat and get into the dev tools of the problem session, where we found the first clue on the trail that eventually led to squashing the issue.

We hope this strategy can also be useful to others who aren't tolerant of mysterious flakes in their build. Let us know if you can think of any other testing practices which can be augmented by Breakpoints!

Addendum: Selenium RC
The example code for programmatic Sauce Breakpoints above is for Selenium 2 (a.k.a. WebDriver). Breakpoints also work for Selenium 1 (a.k.a. Selenium RC) tests, but the code is different. Here is our tearDown function for Selenium 1 tests, which illustrates the use of the set_context function:

def tearDown(self):
    if not self.passed:
        self.collect_web_traceback()
        if self.break_on_fail:
            self.selenium.set_context("sauce: break")
    self.report_pass_fail()
    if self.stop_on_teardown:
        self.selenium.stop()
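
Since the two clients differ only in which call carries the command, a small helper could choose the right one. This is a sketch under assumptions: send_sauce_break and the two fake client classes are hypothetical names for illustration, and the only client calls it relies on are execute_script (WebDriver) and set_context (Selenium RC), as used in the tearDown functions above.

```python
SAUCE_BREAK = "sauce: break"


def send_sauce_break(client):
    """Ask Sauce to breakpoint the current job, whichever client is in use."""
    if hasattr(client, "execute_script"):
        # WebDriver: the command travels as the script of an execute call.
        client.execute_script(SAUCE_BREAK)
    elif hasattr(client, "set_context"):
        # Selenium RC: the command travels as the context string.
        client.set_context(SAUCE_BREAK)
    else:
        raise TypeError("unrecognized Selenium client: %r" % (client,))


class FakeWebDriver:
    """Hypothetical stand-in that records calls instead of driving a browser."""

    def __init__(self):
        self.calls = []

    def execute_script(self, script):
        self.calls.append(("execute_script", script))


class FakeRC:
    """Hypothetical stand-in for a Selenium RC client."""

    def __init__(self):
        self.calls = []

    def set_context(self, context):
        self.calls.append(("set_context", context))
```

The fakes make it possible to check the dispatch logic locally, without opening a Sauce session.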

* At Sauce, we call an individual test run by a customer in our infrastructure a "job".




Written by

Jonathan Lipps
