Last week we quietly rolled out Sauce Connect version 3. Many of our users know Sauce Connect as the magic piece of software from Sauce Labs that enables users to securely connect their behind-firewall application under test with the Sauce OnDemand and Scout cloud services. It began life as a simple reverse SSH tunnel, but it’s come a long way since then.
Over the course of the 7 million customer Selenium tests we’ve run, we’ve accumulated a great deal of test data and user feedback on Sauce Connect, and we’ve acted on it by building three increasingly sophisticated generations. With version 3, we’ve made a major enhancement to the protocol that makes using Selenium over the Internet far more reliable than it has ever been.
Reliability is a big deal to us. By our internal tracking, the rate of errors under our control is approaching 1 in 10,000 tests. But over time, some Sauce Labs users have reported mysterious test failures, and forensics on neither their side nor ours exposed the culprit. Enter the Internet, and the complex interactions between an application (Selenium) and its transport (the Internet).
First, some quick Selenium background: Selenium scripts are HTTP clients of the Selenium server, which controls the web browser. HTTP requests flow in one direction from the Selenium script to the browser, and then back from the browser to the application under test. All of this is designed and tested for a situation in which the script, the browser, and the application under test are all on the same machine, and network problems never happen. Unfortunately, running everything on a single computer doesn’t scale to handle large, cross-platform test suites. Cloud-based testing takes care of platform support and speeds things up via parallel test execution, but in the process it replaces the pristine reliability of your computer’s internal virtual network with the messiness of the Internet.
Typical Selenium test suites consist of hundreds of tests, each made up of tens or hundreds of Selenium commands. Many of those commands cause page loads, each of which typically involves dozens of requests for assets. If you do the math, that means a typical Selenium suite can easily involve 100,000 HTTP requests. For an ordinary Internet user, it’s not very noticeable if 1 in 100,000 HTTP requests fails. That’s in the neighborhood of one failure every 10,000 pages, and not all asset load failures are noticeable to users. But for a Selenium suite of that size, a 1 in 100,000 failure rate works out to about one failed request per run on average, which is enough to break build after build.
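To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes each request fails independently with the same probability, which is a simplification (real network failures tend to cluster), and the function and variable names are ours, chosen for illustration.

```python
def p_suite_has_failure(requests_per_suite: int, p_request_fails: float) -> float:
    """Chance that at least one request in a run fails, assuming independent failures."""
    return 1.0 - (1.0 - p_request_fails) ** requests_per_suite


requests_per_suite = 100_000   # the "typical suite" figure from the text
p_fail = 1 / 100_000           # one failed request in 100,000

# Average number of failed requests in a single run of the suite.
expected_failures = requests_per_suite * p_fail

# Probability that a given run contains one or more failed requests.
p_broken = p_suite_has_failure(requests_per_suite, p_fail)

print(f"expected failed requests per run: {expected_failures:.2f}")
print(f"chance a run sees at least one failure: {p_broken:.1%}")
```

Under these assumptions a run averages one failed request, and roughly 63% of runs hit at least one. A failure rate that a person browsing the web would never notice is enough to make a large suite fail more often than it passes.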
Why do individual requests fail when most succeed? In our experience this usually comes down to two factors: routing problems on the Internet backbone, and TCP’s handling of congestion and retries. With knowledge of the actual behavior and delay tolerances of Selenium and browsers, it’s possible in principle to recover from temporary routing problems, and tolerate longer delays and more congestion than TCP in general can.
Sauce Connect 3 puts that principle into practice, working around these issues by doing part of TCP’s job over again, better: it acknowledges receipt of data, so that when connections fail it can reconnect and resend unacknowledged data, just like TCP already does, but with longer timeouts and less sensitivity to routes. You can even switch network connections, and Sauce Connect 3 will recover and keep on going as if nothing happened. It also uses a delay-detection mechanism inspired by LEDBAT to rapidly give up on TCP connections experiencing uncommonly large delays (which are often a sign of a routing or congestion problem that can kill an individual TCP connection).
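The acknowledge-and-resend idea can be illustrated with a toy simulation. This is not Sauce Connect's actual protocol or code: the `Sender`, `Receiver`, and `FlakyLink` classes are hypothetical, the "connection" is an in-memory object that drops after a fixed number of sends, and delay detection is left out entirely. It only shows how sequence numbers and cumulative acknowledgements let a sender reconnect and pick up where it left off.

```python
from collections import OrderedDict


class Receiver:
    """Accepts chunks in order and acknowledges how far it has got."""

    def __init__(self):
        self.next_seq = 0
        self.delivered = []

    def receive(self, seq, chunk):
        # Accept only the next expected chunk; duplicates from a resend are ignored.
        if seq == self.next_seq:
            self.delivered.append(chunk)
            self.next_seq += 1
        return self.next_seq  # cumulative ack: every seq below this has arrived


class FlakyLink:
    """Stand-in for a TCP connection that dies after a fixed number of sends."""

    def __init__(self, receiver, sends_before_failure, lose_ack=False):
        self.receiver = receiver
        self.remaining = sends_before_failure
        self.lose_ack = lose_ack

    def send(self, seq, chunk):
        if self.remaining <= 0:
            if self.lose_ack:
                # The data got through, but the ack was lost with the connection.
                self.receiver.receive(seq, chunk)
            raise ConnectionError("link dropped")
        self.remaining -= 1
        return self.receiver.receive(seq, chunk)


class Sender:
    """Keeps every chunk until it is acknowledged, resending after a reconnect."""

    def __init__(self, connect):
        self.connect = connect        # callable that returns a fresh link
        self.link = connect()
        self.next_seq = 0
        self.unacked = OrderedDict()  # seq -> chunk, held until acknowledged

    def send(self, chunk):
        self.unacked[self.next_seq] = chunk
        self.next_seq += 1
        self._flush()

    def _flush(self):
        while self.unacked:
            try:
                for seq, chunk in list(self.unacked.items()):
                    ack = self.link.send(seq, chunk)
                    # Drop everything the receiver has confirmed.
                    for done in [s for s in self.unacked if s < ack]:
                        del self.unacked[done]
            except ConnectionError:
                # Reconnect; the loop then resends whatever is still unacked.
                self.link = self.connect()


receiver = Receiver()
links_made = []


def connect():
    link = FlakyLink(receiver, sends_before_failure=3, lose_ack=True)
    links_made.append(link)
    return link


sender = Sender(connect)
chunks = [f"chunk-{i}" for i in range(10)]
for c in chunks:
    sender.send(c)

print("delivered intact:", receiver.delivered == chunks)
print("connections used:", len(links_made))
```

Even though every connection in this sketch drops after three sends, and the drop swallows an acknowledgement each time, the receiver ends up with each chunk exactly once and in order. The sender never has to know why a link died, only which data has not yet been confirmed.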
Early beta usage shows promising results. If you were one of the users who experienced mysterious test anomalies, you should give Sauce Connect 3 a spin. But if Sauce Connect 2 is working for you, don’t fix it!