Automating Test Retries
Once we had a list of the flaky tests, we tried to go through each one and determine why it was failing. We found that some UI elements, such as menus and popovers, were particularly prone to flakiness — they would sometimes be dismissed by the system for no discernible reason!
[…]
Since we already had the JUnit parsing code, we decided to build on top of that and rerun only the failed tests. By using the xcodebuild command's -only-testing flag, we ran only the failed tests again. Another optimization we made was to build the project only once, even when testing multiple times. We accomplished that by using the xcodebuild build-for-testing and xcodebuild test-without-building commands. […]
Flaky tests still exist, but they no longer slow down our developers' workflow. CI automatically retries any failing tests, and almost all flaky tests pass when run again. Only if a test fails three times in a row is it considered an actual failure and the build marked as failed.
Xcode 13 has a built-in option to do this. But why are these tests flaky?
3 Comments
The general notion of expecting a test to randomly fail, and only considering it an actual failure if it fails more than a few times, seems fundamentally flawed. Obviously the correct treatment is to identify the cause of the failing test and fix that. Otherwise it's just wallpaper, baling wire, crossed fingers, and a shrug. What is the value in such a test?
@Ben I agree in general, but in this case (and I've heard as much from others, too) it seems there's an inherent problem in either UIKit or the testing framework that causes the flakiness.
Over a decade or more of test-driven development, Tinderbox has had its share of fragile and flaky tests. We've eventually tracked them all down. The common causes:
1. Race conditions. The test environment — especially when running tests in parallel — often brings intermittent failures to the fore. Requiring three consecutive failures before you consider a failure real guarantees that you'll never catch these. (See the first sketch after this list.)
2. Date arithmetic. Tests of calculations based on the current date catch blunders that synthetic dates would miss, but they bring their own set of troubles. We have tests that fail only on the day before Daylight Saving Time starts here, or when Summer Time starts in the UK. We have other tests that fail in the days leading up to Jan. 1. (See the second sketch after this list.)
3. Unstubbed file/network access. For some tests, we don't mock the network or the file system. This lets us test more code with simpler tests. The resulting flakiness is innocuous because it's obvious, and it tends to be temporarily repeatable.
4. Tests that don't initialize some precondition, usually a shared test fixture that is *usually* left in the expected state but that some other test leaves askew.
5. Tests with timeouts that are “long enough”. If the timeout is needed, sooner or later it will be exceeded.
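To illustrate the first category, here is a minimal XCTest sketch (the FeedLoader type is hypothetical) contrasting a sleep-based wait, which races the background work, with an XCTestExpectation, which waits on the completion event itself:

```swift
import XCTest

// Hypothetical async API, for illustration only.
struct FeedLoader {
    func load(completion: @escaping ([String]) -> Void) {
        DispatchQueue.global().async { completion(["item"]) }
    }
}

final class FeedLoaderTests: XCTestCase {
    // Flaky: passes only when the background work beats the sleep.
    func testLoad_withSleep_flaky() {
        var result: [String] = []
        FeedLoader().load { result = $0 }
        Thread.sleep(forTimeInterval: 0.1)   // a race, not a guarantee
        XCTAssertEqual(result, ["item"])
    }

    // Deterministic: waits on the completion event, with the timeout
    // as a generous ceiling rather than a timing bet.
    func testLoad_withExpectation() {
        let loaded = expectation(description: "feed loads")
        var result: [String] = []
        FeedLoader().load {
            result = $0
            loaded.fulfill()
        }
        wait(for: [loaded], timeout: 5)
        XCTAssertEqual(result, ["item"])
    }
}
```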
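For the second category, one common mitigation is to inject the date, calendar, and time zone so a test can exercise a DST boundary deliberately instead of stumbling over it once a year. A sketch, with daysUntil standing in as a hypothetical function under test:

```swift
import XCTest

// Hypothetical date logic of the kind a DST boundary can trip up.
func daysUntil(_ target: Date, from now: Date, calendar: Calendar) -> Int {
    calendar.dateComponents([.day],
                            from: calendar.startOfDay(for: now),
                            to: calendar.startOfDay(for: target)).day ?? 0
}

final class DateArithmeticTests: XCTestCase {
    // Deterministic: pins "now", the calendar, and the time zone, so the
    // test behaves the same on any day the suite happens to run.
    func testDayCountAcrossDSTBoundary() {
        var calendar = Calendar(identifier: .gregorian)
        calendar.timeZone = TimeZone(identifier: "America/New_York")!
        // US DST began on March 14 in 2021; span the boundary on purpose.
        let now = DateComponents(calendar: calendar,
                                 year: 2021, month: 3, day: 13).date!
        let target = DateComponents(calendar: calendar,
                                    year: 2021, month: 3, day: 15).date!
        XCTAssertEqual(daysUntil(target, from: now, calendar: calendar), 2)
    }
}
```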
I do think there's some real value to be gained from fragile and flaky tests, but you can't trust them, and if they're numerous they may make TDD unpleasant. Fragile tests of types 1 and 4 really ought to be pursued, even if failures are rare. The others can be tolerated until someone has spare time.