3 Subtle Ways to Write a Flakey Test

Jon Bleiberg
8 min read · Aug 8, 2023


Picture this. After hours of poring over the codebase, you’ve finally squashed that tricky bug. You do a bit of local testing, and everything looks good — all the tests you wrote are passing!

You commit your changes and open a PR, then step out to grab a coffee while your full test suite runs up in CI land.

Ten minutes later, you head back to your desk, expecting to see that elusive green checkmark which means your work is nearly done.

But no!

Instead of a green checkmark to crown your hard work, you see a Big. Red. X.

Clicking in to see what happened, you notice that some test in a completely unrelated part of the codebase (which was passing on all of your prior commits!) has failed for some inexplicable reason.

Defeated, you head back to VS Code, emit a hearty sigh, and reconsider all of the life choices that led you to become a SWE.

As I’m sure anyone who’s worked with a large, complex codebase can agree, there are few things more frustrating than dealing with a flakey test (particularly one written by some random engineer you’ve never even met since they left the company so long ago). Below, we’ll cover three subtle ways that you can accidentally become that engineer, and how you can avoid doing so in the first place.

1. Assuming Random Values will be Unique

Here’s a subtle one I came across the other day.

As you probably know, a common pattern in testing is to leverage a fake data generation library (e.g. faker) to generate a bunch of test data. That way, you can test a range of valid values that change each time, rather than testing the same value over and over. You could think of this as a “lite” version of property-based testing.

However, you can get into trouble very quickly by assuming your fake data generator will always act as it did on your first couple of test runs.

Consider the following test case:

const { faker } = require('@faker-js/faker')

...

test('it groups the customers by source', () => {
  const source1 = faker.lorem.word()
  const source2 = faker.lorem.word()

  const mockCustomers = [source1, source2].map(source => ({
    name: faker.person.fullName(),
    source
  }))

  // Some example function that groups the customers within each source
  // in the same way as Lodash's groupBy function
  // E.g. {
  //   source1: [{...customer1}, {...customer2}],
  //   source2: [{...customer3}]
  // }
  const result = groupCustomersBySource(mockCustomers)

  // Check that each source has only the customer we created
  expect(result[source1]).toHaveLength(1)
  expect(result[source2]).toHaveLength(1)
})

...

In most cases this test will pass just fine. Every now and then, however, faker.lorem.word() will generate the same word twice!¹ In that case, if our groupCustomersBySource function is working correctly, both customers land in the same group, so result[source1] has length 2 instead of 1, giving you a sneaky test failure!
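To get a feel for how often this bites, here's a rough back-of-the-envelope sketch. It assumes the generator picks uniformly from a pool of 1,000 words (the figure quoted in the footnote; the real distribution may differ), and both function names are made up for illustration.

```python
def collision_probability(pool_size: int, draws: int) -> float:
    """Chance that at least two of `draws` uniform picks from the pool match."""
    p_all_distinct = 1.0
    for i in range(draws):
        # The (i + 1)-th pick must avoid the i values already taken
        p_all_distinct *= (pool_size - i) / pool_size
    return 1.0 - p_all_distinct


def draws_for_even_odds(pool_size: int) -> int:
    """Smallest number of draws whose collision chance exceeds 50%."""
    draws = 1
    while collision_probability(pool_size, draws) <= 0.5:
        draws += 1
    return draws


# Two sources drawn from an assumed pool of 1,000 equally likely words
print(f"{collision_probability(1000, 2):.3%}")  # 0.100%
print(draws_for_even_odds(1000))                # 38
```

One failure in a thousand runs sounds harmless, but across a busy CI pipeline that's exactly the kind of rate that produces an "inexplicable" red X every few weeks.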

So how can we fix this?

Well, there are a few options, each with their own pros and cons.

One simple fix is to remove the randomness entirely — e.g.

const source1 = 'source1'
const source2 = 'source2'

Now, every time we run the test, source1 and source2 are guaranteed to differ (and presumably we have a test that doesn’t give false negatives).

Of course, this defeats the original purpose of using a data generator like faker. Say, for example, there was some bug in groupCustomersBySource that only arose when a customer has source equal to 'foo'. In that case, even though our test would pass, the underlying function we’re testing would be incorrect!²

Another option is to use faker’s built-in unique method, like so:

// Now source1 never equals source2!

const source1 = faker.unique(faker.lorem.word)
const source2 = faker.unique(faker.lorem.word)

Under the hood, unique will store all previously generated values of faker.lorem.word and filter these out from subsequent calls. This preserves the randomness of the original test, while avoiding the sneaky bug we had above.³
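If you'd rather not lean on a library helper, the same "remember what you've seen and skip repeats" idea is easy to sketch by hand. The snippet below is not faker's implementation, just a minimal Python illustration; unique_values, max_attempts, and the tiny WORDS pool are all hypothetical names invented for the example.

```python
import random


def unique_values(generate, count, max_attempts=1000):
    """Call `generate` until `count` distinct values have been collected.

    Raises RuntimeError if `max_attempts` calls were not enough, which is
    what happens once the generator runs out of fresh values.
    """
    seen = set()
    values = []
    attempts = 0
    while len(values) < count:
        if attempts == max_attempts:
            raise RuntimeError(f"no {count} unique values after {max_attempts} attempts")
        attempts += 1
        candidate = generate()
        if candidate not in seen:  # filter out anything generated before
            seen.add(candidate)
            values.append(candidate)
    return values


# Stand-in for a fake data generator with a deliberately tiny word pool
WORDS = ["alpha", "beta", "gamma"]
rng = random.Random()

source1, source2 = unique_values(lambda: rng.choice(WORDS), 2)
assert source1 != source2  # holds on every run, whatever the RNG does
```

Asking this helper for four values from a three-word pool raises the RuntimeError, which mirrors the exhaustion caveat in the footnote.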

2. Assuming Random Numbers will always be Valid

It’s not just strings that pose a problem: careless use of fake number generators can just as easily get you into trouble. Consider the following toy example:

from faker import Faker

fake = Faker()

...

def test_calculate_tax_rate():
    annual_income = fake.pyint()
    tax_paid = fake.pyint()

    # Tax Rate = tax_paid / income
    calculated_tax_rate = calculate_tax_rate(annual_income, tax_paid)
    expected_tax_rate = tax_paid / annual_income

    assert calculated_tax_rate == expected_tax_rate

...

Seems like a pretty straightforward (if somewhat silly) unit test, no?

Well, lurking beneath the covers is a random test failure, just waiting to rear its ugly head!

Again, most of the time, this test will pass with no issues. But pyint() draws from 0 to 9999 by default, so what if fake.pyint() were to generate a value of 0 for annual_income?

Well, you’d get a cheeky ZeroDivisionError of course, causing your test to fail before it even had a chance to get off the ground!⁴

Fortunately, this one is an easy fix. By setting ranges for the random numbers you generate, you can dodge the wrath of your future coworkers:

from faker import Faker

fake = Faker()

...

def test_calculate_tax_rate():
    # Now we're safe from that pesky ZeroDivisionError
    # (pyint takes its bounds as min_value / max_value)
    annual_income = fake.pyint(min_value=10_000, max_value=1_000_000)
    tax_paid = fake.pyint(min_value=1000, max_value=5000)

    ...

While the above example may seem obvious, this kind of trap can be surprisingly easy to fall into if you’re trying to test a more complex function, perhaps one that makes calls to several helper methods under the hood.

Of course, ZeroDivisionErrors aren’t the only thing you need to watch out for. In general, the code you’re testing may implicitly assume some invariant is always true — e.g. foo is always greater than bar, or baz is always positive. If you’re not careful to respect these invariants, you can inadvertently add a flakey test to your codebase.
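One way to keep yourself honest is to put the invariants in a single helper that every test draws from, and to cover the forbidden input with its own deterministic test. The sketch below uses Python's standard random module rather than Faker so that it runs on its own. The calculate_tax_rate body, the random_tax_inputs helper, and the assumption that tax paid never exceeds income are all stand-ins for illustration.

```python
import random


def calculate_tax_rate(annual_income: int, tax_paid: int) -> float:
    """Hypothetical function under test: the share of income paid as tax."""
    return tax_paid / annual_income


def random_tax_inputs(rng: random.Random):
    """Random inputs that respect the assumed invariants:
    annual_income > 0 and 0 <= tax_paid <= annual_income."""
    annual_income = rng.randint(10_000, 1_000_000)
    tax_paid = rng.randint(0, annual_income)
    return annual_income, tax_paid


def test_tax_rate_is_a_fraction():
    rng = random.Random()
    for _ in range(1_000):
        annual_income, tax_paid = random_tax_inputs(rng)
        rate = calculate_tax_rate(annual_income, tax_paid)
        assert 0.0 <= rate <= 1.0


def test_zero_income_raises():
    # The edge case gets an explicit test instead of a 1-in-10,000 surprise
    try:
        calculate_tax_rate(0, 100)
    except ZeroDivisionError:
        return
    raise AssertionError("expected ZeroDivisionError for zero income")


test_tax_rate_is_a_fraction()
test_zero_income_raises()
```

The random test now passes on every run if the code is right, while the zero-income behavior is checked on purpose rather than by accident.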

3. Assuming Dates Will Stay Fixed

As any programmer worth their salt can attest, dates, times, and time zones are among the hardest things to handle correctly in code. If you don’t believe me, take a quick peek at this classic.

Testing with dates and times is no exception. Consider this toy example:

const { faker } = require('@faker-js/faker')
const { addDays } = require('date-fns')

...

test('it correctly calculates the daily membership fee', () => {
  const startDate = new Date()
  const endDate = addDays(startDate, 1)

  // Suppose we charge a daily fee of $10
  const calculatedMembershipFee = calculateMembershipFee(startDate, endDate)

  expect(calculatedMembershipFee).toBe(10)
})

...

Seems like a pretty straightforward test — what could possibly go wrong?

Well, suppose our calculateMembershipFee function looks something like this:

const {
  differenceInDays,
  differenceInCalendarMonths
} = require('date-fns')

const calculateMembershipFee = (startDate: Date, endDate: Date): number => {
  // $10 daily membership fee
  const daysEnrolled = differenceInDays(endDate, startDate)
  const dailyFees = daysEnrolled * 10

  // $50 service fee billed on the first of each month
  const calendarMonthsEnrolled = differenceInCalendarMonths(endDate, startDate)
  const monthlyFees = calendarMonthsEnrolled * 50

  return dailyFees + monthlyFees
}

As the actual implementation makes clear, our test is completely missing the monthly service fee!

And yet, despite this notable (and costly) omission, most of the time our test will pass, lulling us into a false sense of security. However, when some unlucky engineer runs this test on the last day of the month, they’ll be hit with a (correct) test failure! Worse yet, if that engineer is strapped for time and decides to come back to it the next day, the test will look completely fine again, and in all likelihood will be forgotten.
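To see just how rarely this surfaces, here's a small sketch that replays the test's logic for every possible start date in a year. It's a hypothetical Python port of the fee calculation (plain dates, no time zones), so treat the function names as illustrative rather than as the real implementation.

```python
from datetime import date, timedelta


def calculate_membership_fee(start: date, end: date) -> int:
    """$10 per day enrolled plus $50 per calendar-month boundary crossed."""
    days_enrolled = (end - start).days
    calendar_months = (end.year - start.year) * 12 + (end.month - start.month)
    return days_enrolled * 10 + calendar_months * 50


def failing_start_dates(year: int) -> list:
    """Start dates in `year` where a one-day membership does not cost $10."""
    failures = []
    day = date(year, 1, 1)
    while day.year == year:
        if calculate_membership_fee(day, day + timedelta(days=1)) != 10:
            failures.append(day)
        day += timedelta(days=1)
    return failures


failures = failing_start_dates(2023)
print(len(failures))  # 12
print(failures[0])    # 2023-01-31
```

Twelve bad days out of 365 means the test goes red roughly once a month and then fixes itself by the next morning.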

So what’s the fix here?

Well, there are a couple of options. If you only wanted to test the daily component of the membership fee, you could of course hardcode your startDate like so:

...

const startDate = new Date('2023-07-18')
const endDate = addDays(startDate, 1)

...

Alternatively, if you’re working with a bunch of dates, you could use something like Sinon’s fake-timers to control the current date/time returned by new Date() and similar. For example:


const sinon = require('sinon')

...

let clock, today

beforeEach(() => {
  today = new Date('2023-07-18')
  clock = sinon.useFakeTimers(today.getTime())
})

afterEach(() => {
  clock.restore()
})

test('it correctly calculates the daily membership fee', () => {
  const startDate = new Date() // Now returns '2023-07-18T00:00:00.000Z'!
  const endDate = addDays(startDate, 1)

  ...
})

Finally, if you actually wanted to test the monthly service fee component of the calculation, you could of course fix your test to incorporate this new logic.
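As a rough illustration, pinning "today" lets you test both sides of the month boundary on purpose. This sketch swaps in a different technique from fake timers: the clock is passed in as an argument (plain dependency injection). The one_day_fee wrapper and the Python port of the fee logic are hypothetical.

```python
from datetime import date, timedelta


def calculate_membership_fee(start: date, end: date) -> int:
    """$10 per day enrolled plus $50 per calendar-month boundary crossed."""
    days_enrolled = (end - start).days
    calendar_months = (end.year - start.year) * 12 + (end.month - start.month)
    return days_enrolled * 10 + calendar_months * 50


def one_day_fee(today=date.today) -> int:
    """Fee for a one-day membership starting "today".

    `today` is any zero-argument callable returning a date, so tests can
    replace the real clock with a fixed one.
    """
    start = today()
    return calculate_membership_fee(start, start + timedelta(days=1))


def test_daily_fee_mid_month():
    assert one_day_fee(today=lambda: date(2023, 7, 18)) == 10


def test_fee_across_month_boundary():
    # July 31 -> August 1 crosses one calendar month: $10 daily + $50 service
    assert one_day_fee(today=lambda: date(2023, 7, 31)) == 60


test_daily_fee_mid_month()
test_fee_across_month_boundary()
```

Both tests give the same answer no matter which day of the month CI happens to run on.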

Again, this example might seem obvious when pulled into isolation. In the context of a large, complex codebase, however, it’s all too easy to fall into this sort of trap.

All code relies on certain assumptions and invariants. Testing allows us to push the boundaries of these assumptions, ensuring that our code responds robustly when these fundamental assumptions are respected (and hopefully throws a helpful error message if not).

However, as we’ve seen above, tests that don’t properly respect the assumptions of underlying code can lead to a whole array of tricky-to-debug issues. For this reason, when writing tests, it’s critical to 1) understand the assumptions your code makes and 2) properly encode these assumptions in your tests.

This is certainly not to say that you should only test your code on a limited set of inputs that you know won’t break the underlying code. Rather, you should take care to write tests which pass (or fail) deterministically if the underlying code is correct (or incorrect).

Hopefully after reading this article, you’ll be able to avoid some of the more common goofs that lead to hours of painful debugging on the part of your future colleagues. And if you’re lucky, next time you spot a flakey test in the wild, you’ll be able to diagnose the issue more effectively with these principles in mind.

Find any flakey tests in your codebase that aren’t covered here? Feel free to drop a comment or shoot me an email at mail@jonbleiberg.com

[1] There are currently 1,000 distinct words that can be returned by faker.lorem.word() if the locale is set to 'en'. By the logic of a generalized form of the Birthday Problem, you’d only need to generate ~38 words before you have a >50% chance of a collision!

[2] In statistical terms, you could think of this as a false positive caused by a reduction in the “specificity” of the original test. I’m planning to write a follow-up article on the correspondences between statistical testing and unit testing, so stay tuned!

[3] It should be noted that as of Jan 2023, the unique method is scheduled for deprecation in Faker V8+. It looks like some members of the community have built a replacement here, though I haven’t personally vetted the package. It’s also worth noting that if you called unique on the same faker method enough times, you could exhaust all of the possible unique values and get an error. If you need to generate upwards of 1,000 unique random words in a single test, you might want to look into alternatives.

[4] The astute reader will notice that I switched to Python for this example. If we stuck with JS, we’d get an expected_tax_rate of Infinity, which (depending on the underlying implementation of calculate_tax_rate) might actually cause the test to pass, though probably not for the reason we wanted it to (thanks, JavaScript…)

Jon Bleiberg

Software Engineer - Data Scientist - Math and Language Enthusiast.