Property-based Testing For The Rest Of Us - Property-driven Development

Arialdo Martini — 10/08/2023 — tdd functional programming property-based testing

Index

  1. Utterly opinionated introduction to Property Testing
  2. Shut up and code!
  3. It’s properties all the way down
  4. Property-driven Development

Property-driven Development

If you apply the practices of TDD relying on Property-based Tests rather than on Example-based ones, what you get is Property-driven Development (PDD).

I claim that PDD leads to a more correct, safer and faster development experience than example-based TDD. I will argue this bold claim through the implementation of a classic, simple coding kata.

The Prime Factors Kata

I’m using the Prime Factors Kata, a little exercise that Bob Martin has been employing since 2005 to demo TDD. We will first follow the classical approach, mirroring what Bob Martin does; then we will repeat the same exercise using PBT, commenting on the differences.

The requirement is:

Write a function that takes an integer and returns the collection of its prime factors;

Factors of a number are integers that,
when multiplied together, result in the original number.

A number is prime if it has no divisors other than 1 and itself

Bob Martin solves it with a few tests (from 7 to 10, depending on the session). You can watch him in action in the video The Three Laws of TDD (Featuring Kotlin).

The classic TDD approach

Bob Martin’s approach boils down to this iterative process:

  1. He starts from the most extreme minimization of the requirement, reduced to the simplest possible problem
  2. He resolves the requirement with a minimal implementation, making sure the test is green
  3. He asks himself if the implementation is complete
  4. If it is not, he extends the requirement with the next logical use case, and starts over from 2.

Here’s a summary of the steps he performs in the video.

Step 1

The simplest possible requirement is: the list of the prime factors of 1 is empty.

Now, technically speaking, this is mathematical nonsense: prime factorization is defined for numbers n >= 2 (see Prime Factorization - Wolfram Mathworld), so Bob should have started from 2. But let us ignore that.

The requirement to implement is:

@Test fun factors() {
  assertThat(factorsOf(1), `is` (emptyList()))
}

which can be resolved with:

private fun factorsOf(n: Int): List<Int> {
  return emptyList()
}

Easy peasy. Here Bob Martin applies “Fake it until you make it”, an idiomatic technique described by Kent Beck in his seminal “Test-Driven Development By Example” as:

return a constant and gradually replace constants with variables until you have the real code

We will see later how PBT improves on this.

Step 2

After 1, the logical step is to assert the prime factors of 2:

@Test fun factors() {
  assertThat(factorsOf(1), `is` (emptyList()))
  assertThat(factorsOf(2), `is` (listOf(2)))
}

which is implemented with:

val factors = mutableListOf<Int>()

if(n > 1)
  factors.add(2)
  
return factors 

Basically, Bob Martin keeps adding the next assert, each time incrementing the number by 1. That’s the application of “Triangulation”, another technique described in Kent Beck’s book, which is meant to be complementary to “Fake it”.

Step 3

Asserting that factorsOf(3) is listOf(3) leads to replacing 2 with n:

val factors = mutableListOf<Int>()

if(n > 1)
  factors.add(n)
  
return factors 

Step 4

factorsOf(4) is listOf(2, 2)

leads to

var remainder = n
val factors = mutableListOf<Int>()

if(remainder > 1) {
  if(remainder % 2 == 0) {
    factors.add(2)
    remainder /= 2
  }
}
if(remainder > 1)
  factors.add(remainder)

return factors 

Step 5

It turns out that this implementation is already fine for 5.

Step 6

Also with factorsOf(6), the algorithm works.

Wait a minute: does it mean we are done?
We don’t know. The test suite is green, and technically this should mean there is no evidence of bugs.
But we also know we have proceeded by arbitrary — and probably simplistic — examples.

The sad truth is: we cannot completely trust the test suite at this point. It might not be expressing the business functionality in its entirety. In a way, it might be lying.
Not being led by tests, we can only use our brains and common sense, and keep adding further tests.

Step 7

Asserting that factorsOf(7) is listOf(7): again, all the tests are green.

Damn. Shall we finally stop? Maybe yes, the algorithm is complete, and there are no bugs.
But there are! The algorithm will not work with 8.

Step 8

factorsOf(8) should be listOf(2, 2, 2).

Finally, a red test! We found how to proceed!
We should consider ourselves very lucky that we only got 3 green tests in a row: what if the tests had kept being green up to 30 or 150? I’m sure many would have just stopped trying.

Here’s the improved (I would say, fixed) implementation:

var remainder = n
val factors = mutableListOf<Int>()

if(remainder > 1) {
  while(remainder % 2 == 0) {
    factors.add(2)
    remainder /= 2
  }
}
if(remainder > 1)
  factors.add(remainder)

return factors 

Step 9

factorsOf(9) shall be listOf(3, 3).

Luckily, the test fails, which brings us to:

var remainder = n
val factors = mutableListOf<Int>()

if(remainder > 1)
{
  var divisor = 2
  while(remainder > 1)
  {
      while(remainder % divisor == 0)
      {
        factors.add(divisor)
        remainder /= divisor
      }
      divisor++
  }
}

return factors 

Step 10

Green test.

Here Bob Martin senses that the algorithm must be complete, and challenges it with

factorsOf(2*2*3*3*5*7*11*13)

getting again a green test. We trust him, and he is in fact right. But would you trust yourself in a real-world, more complex, production case?

Can we do better?

Kent Beck is my hero, and I’m an avid admirer of Bob Martin. I have profoundly loved TDD since I got infected, years ago, and I’m convinced I still have everything to learn on the topic.
So, it’s with a heart full of respect that I try to challenge the approach above, looking for chances of progress.

Capturing the requirement

Let me start from the test suite we got with the classic approach:

@Test fun factors() {
  assertThat(factorsOf(1), `is` (emptyList()))
  assertThat(factorsOf(2), `is` (listOf(2)))
  assertThat(factorsOf(3), `is` (listOf(3)))
  assertThat(factorsOf(4), `is` (listOf(2, 2)))
  assertThat(factorsOf(5), `is` (listOf(5)))
  assertThat(factorsOf(6), `is` (listOf(2, 3)))
  assertThat(factorsOf(7), `is` (listOf(7)))
  assertThat(factorsOf(8), `is` (listOf(2, 2, 2)))
  assertThat(factorsOf(9), `is` (listOf(3, 3)))
  assertThat(factorsOf(2*2*3*3*5*7*11*13), `is` (listOf(2, 2, 3, 3, 5, 7, 11, 13)))
}

It’s fine, but it says nothing about “prime factorization”. Nowhere does the idea of a “prime number” emerge. Beyond the name factorsOf(), little intuition is conveyed about what is going to happen.

Fundamentally, the test is a collection of input-output pairs:

Input              Output
1                  []
2                  [2]
3                  [3]
4                  [2, 2]
5                  [5]
6                  [2, 3]
7                  [7]
8                  [2, 2, 2]
9                  [3, 3]
2*2*3*3*5*7*11*13  [2, 2, 3, 3, 5, 7, 11, 13]

I guess that displayed like that, it would not be apparent to everyone that this is about extracting the prime factors.
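In fact, the two clauses of the requirement are already enough to validate every row of the table mechanically. Here is a small stdlib Python check (a sketch, not part of the original kata) that verifies each pair against the definition:

```python
from math import prod

# The table above, as literal input/output pairs.
cases = {
    1: [],
    2: [2],
    3: [3],
    4: [2, 2],
    5: [5],
    6: [2, 3],
    7: [7],
    8: [2, 2, 2],
    9: [3, 3],
    2*2*3*3*5*7*11*13: [2, 2, 3, 3, 5, 7, 11, 13],
}

def is_prime(n):
    # A number is prime if it has no divisors other than 1 and itself.
    return n >= 2 and all(n % i != 0 for i in range(2, n))

# Every pair satisfies the definition: the factors are all prime,
# and multiplied together they give back the input.
for n, factors in cases.items():
    assert all(is_prime(f) for f in factors)
    assert prod(factors) == n
```

Note that the pair for 1 passes vacuously: the product of an empty list is 1.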

How can this be improved?

The PBT approach

Let’s take a step back, starting from the original requirement:

Write a function that takes an integer and returns the collection of its prime factors;

Factors of a number are integers that,
when multiplied together, result in the original number.

A number is prime if it has no divisors other than 1 and itself

Using a C# FsCheck property test, this could be translated to:

[Property]
bool boolean_factorization_in_prime_numbers(PositiveInt positiveNumber)
{
    var n = positiveNumber.Item;
        
    var factors = factorsOf(n);
        
    return factors.AreAllPrime() && factors.Multiplied() == n;
}

where AreAllPrime() and Multiplied() are implemented as a literal translation of their definitions:

internal static bool IsPrime(this int n) => 
    Enumerable.Range(1, n)
        .Where(i => !(i == 1 || i == n))
        .All(i => n.CannotBeDividedBy(i));

private static bool CannotBeDividedBy(this int n, int i) => 
    n % i != 0;

internal static bool AreAllPrime(this IEnumerable<int> xs) => 
    xs.All(IsPrime);

internal static int Multiplied(this IEnumerable<int> xs) => 
    xs.Aggregate(1, (product, i) => product * i);

Notice that you don’t need to engage your brain to write this test and its helper functions: there is no judgment or opinion, no common-sense evaluation of which values to consider or which shortcuts to take. Indeed, this is the result of a verbatim transcription of the specification into C#. It is a mechanical translation. It could be done by a machine.

Yet, it captures the requirement, it reads like a requirement. It conveys a domain meaning and it’s not arbitrary.

If tests are intended to serve as documentation, I argue that this particular property test imparts significantly more information than the original 10 examples.
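For readers who prefer to experiment without an F#/C# toolchain, the same essential property can be sketched in plain Python. The check function below is a hypothetical, hand-rolled stand-in for what a PBT runner such as FsCheck does: generate random inputs and report the first counterexample found:

```python
import random
from math import prod

def is_prime(n):
    # A number is prime if it has no divisors other than 1 and itself.
    return n >= 2 and all(n % i != 0 for i in range(2, n))

def holds_for(factors_of, n):
    # The essential property: all factors are prime, and
    # multiplied together they give back the original number.
    factors = factors_of(n)
    return all(is_prime(f) for f in factors) and prod(factors) == n

def check(factors_of, runs=100, seed=0):
    # Try `runs` random positive integers; return the first
    # counterexample, or None if the property always held.
    rng = random.Random(seed)
    for _ in range(runs):
        n = rng.randint(1, 1000)
        if not holds_for(factors_of, n):
            return n
    return None
```

Raising runs from 100 to 1,000 or 200,000 is a one-argument change; a real library also adds shrinking, which this sketch omits.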

So, let’s run this single property test and see how it leads the development.

Step 1

The test fails stating that factorsOf(1) is not matched:

Falsifiable, after 1 test (1 shrink) (StdGen (1957335681,297231759)):
Original:
PositiveInt 2
Shrunk:
PositiveInt 1

which leads you to write:

IEnumerable<int> factorsOf(int n) => 
        new[] { 1 };

That’s exactly the same as what happened with TDD.

Interestingly, though, running the test again with no modification gives you a new red, stating that now it’s the case of 2 that is broken.

Falsifiable, after 1 test (0 shrinks) (StdGen (334860507,297231761)):
Original:
PositiveInt 2

This immediately introduces you to the next use case.

Step 2

This leads you to write:

IEnumerable<int> factorsOf(int n)
{
    var factors = new List<int>();
    if(n > 1)
        factors.Add(2);
        
    return factors;
}

Run the test again, and again you get a new red case, this time for 3. Do you start to see the pattern? This mirrors the TDD experience, without the need to update the test, and without misleading green results.

Step 3

As before, you replace 2 with n:

IEnumerable<int> factorsOf(int n)
{
    var factors = new List<int>();
    if(n > 1)
        factors.Add(n);
        
    return factors;
}

No surprises: the test now fails for 4:

Falsifiable, after 6 tests (0 shrinks) (StdGen (605250109,297231772)):
Original:
PositiveInt 4

Step 4

As before, you implement:

IEnumerable<int> factorsOf(int n)
{
    var remainder = n;
    var factors = new List<int>();
    if(n > 1)
    {
        if(remainder % 2 ==  0)
        {
            factors.Add(2);
            remainder /= 2;
        }
    }

    if (remainder > 1)
        factors.Add(remainder);
        
    return factors;
}

Steps 5, 6 and 7

Remember how with TDD you had to write the cases for 5, 6 and 7, only to discover that they were already covered?
Interestingly, this is not needed with PBT: re-executing the test jumps right to the case for 8:

Falsifiable, after 7 tests (0 shrinks) (StdGen (1965459728,297231772)):
Original:
PositiveInt 8

Step 8

Again, the implementation is identical to the one with TDD. After all, you are doing TDD:

IEnumerable<int> factorsOf(int n)
{
    var remainder = n;
    var factors = new List<int>();
    if(remainder > 1)
    {
        while(remainder % 2 ==  0)
        {
            factors.Add(2);
            remainder /= 2;
        }
    }

    if (remainder > 1)
        factors.Add(remainder);
    
    return factors;
}

Run again, get again a failure, for 9:

Falsifiable, after 8 tests (0 shrinks) (StdGen (2011486583,297231773)):
Original:
PositiveInt 9

Step 9

As before:

IEnumerable<int> factorsOf(int n)
{
    var remainder = n;
    var divisor = 2;
    var factors = new List<int>();
    while(remainder > 1)
    {
        while(remainder % divisor ==  0)
        {
            factors.Add(divisor);
            remainder /= divisor;
        }

        divisor++;
    }

    return factors;
}

Run the test again. Tada! Green!

As hard as the library tried to prove you wrong, it surrendered, admitting that you finally implemented the right algorithm.
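For reference, here is the same final trial-division algorithm transcribed into plain Python, for anyone who wants to run it directly (a sketch mirroring the C# above):

```python
def factors_of(n):
    # Divide out the smallest divisor first. Every divisor found this way
    # is necessarily prime: any composite divisor's prime factors would
    # already have been divided out at a smaller value.
    remainder, divisor = n, 2
    factors = []
    while remainder > 1:
        while remainder % divisor == 0:
            factors.append(divisor)
            remainder //= divisor
        divisor += 1
    return factors

print(factors_of(9))                    # [3, 3]
print(factors_of(2*2*3*3*5*7*11*13))    # [2, 2, 3, 3, 5, 7, 11, 13]
```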

Step 10

Pause a little and ask yourself: do you need to test 2*2*3*3*5*7*11*13 as before?

Ruminating on PBT

Covering the examples

Can this single test replace the original 10, without loss of information?
As a matter of fact, yes: indeed, it explores a much larger domain space, and almost for free.

By default, the property will be exercised with 100 cases, but it is a matter of configuration to increase that number to 1,000 or 200,000.
By no means could we do the same with examples, both because of the volume and because of the risk of inputting wrong values.

But can it be used to lead development?

Tests in TDD are a design tool. I like to think that they end up forming a non-regression test harness almost as a side-effect: their main goal, the reason why they are written in the first place, is to guide the programmer during the development, and let a design emerge.

You might think that a single property test, not being built incrementally, cannot be effective in supporting an incremental style of development.
Here is where the shrinker comes into play.

When you run a test such as

var n = positiveNumber.Item;
        
var factors = factorsOf(n);
        
return factors.AreAllPrime() && factors.Multiplied() == n;

against an empty implementation

IEnumerable<int> factorsOf(int n) => 
    null;

the test will fail with random inputs, and the shrinker will eventually find the simplest case for you to troubleshoot.

Falsifiable, after 1 test (1 shrink) (StdGen (737062029,297223429)):
Original:
PositiveInt 2
Shrunk:
PositiveInt 1

That’s interesting!
1 is exactly the number Bob Martin started with. No reason not to proceed with the same implementation he wrote, then:

IEnumerable<int> factorsOf(int n) => 
    new []{ 1 };

Run the same test. Even more interestingly, the test fails again, this time bringing your attention to the number 2:

Falsifiable, after 2 tests (0 shrinks) (StdGen (1305811796,297223430)):
Original:
PositiveInt 2

There are 2 remarkable results here:

  • First, the shrinker is smart enough to do the job Bob Martin did, selecting for you the next logical case to troubleshoot, the number 2
  • Second, and more importantly: the test is still red! Bob got a green. We get a red.

In fact, with the traditional TDD approach, at this stage you would see a whole green test suite.
Every time you got a green test before, you had to infer from your experience and your gut feeling what that meant: do you need to exercise some other use cases, or can you cross your fingers and declare the algorithm complete?
Remember: you could never be sure.

A green test suite during an incremental TDD session based on “Fake it until you make it” gives no direct evidence of when “you made it”: it is all left to your educated analysis and experience.

PBT behaves differently here.
When it comes to using “Fake it”, PBT doesn’t forgive: it splits hairs and chases you with the next failing case, relentlessly. No matter how many hard-coded cases you add, it will find the next counterexample, until the algorithm is really complete.

Far from being perfect (it still isn’t a theorem prover), when a PBT library gives up and declares itself unable to find counterexamples, you can be more confident that your algorithm is finally complete.
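Shrinking itself is less magical than it looks. Here is a deliberately crude sketch of integer shrinking in plain Python (real shrinkers, FsCheck's included, are considerably smarter about the candidates they try):

```python
def shrink(fails, n):
    # Given a failing input n, repeatedly try smaller candidates
    # (halving first, then decrementing) and keep the smallest
    # value that still fails the property.
    while True:
        for candidate in (n // 2, n - 1):
            if 0 < candidate < n and fails(candidate):
                n = candidate
                break
        else:
            return n

# An empty factorsOf fails the property for every n >= 2 (the product
# of no factors is 1, not n), so any large random counterexample
# shrinks down to the smallest failing input:
print(shrink(lambda n: n >= 2, 982))    # 2
```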

Is it faster?

So, when the case for 2 fails, you are led to enhance the code the same way Bob did.
Immediately after you do, 3 also fails, and so on, in the very same order Bob manually chose.

Maybe it does not come as a surprise that FsCheck will not bother you with 5, 6 and 7. Remember the series of 3 green tests and the questions that we raised? That’s not the case with PBT: the library will directly jump to 8.

Nor will you have doubts after developing the implementation for 9, at which point Bob assumed the implementation was complete. In your case, the library will immediately greet you with a splendid green light, communicating that you are really done.
Of course, you might still reassure yourself by trying 2*2*3*3*5*7*11*13, as before; chances are you will not feel that need at all.
More probably, you will be curious to know how the random values generated by FsCheck cover the domain space. You will be happy to know there are specific tools for observing the statistical distribution of test cases. Here’s an example taken from Time Travelling and Fixing Bugs with Property-Based Testing, by Oskar Wickström:

λ> check prop_invalid_age_fails
  ✓ <interactive> passed 100 tests.
    not yet born 18% ███▌················ ✓ 5%
    too young    54% ██████████▊········· ✓ 5%
    too old      46% █████████▏·········· ✓ 5%

Sounds like a more systematic approach, doesn’t it?
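Those percentages come from labels the test attaches to each generated case. Here is a rough stdlib Python analogue of that labelling (the buckets and ranges are hypothetical, chosen only to mimic the sample output above):

```python
import random
from collections import Counter

def classify(age):
    # Hypothetical buckets, mirroring the labels in the sample output.
    if age < 0:
        return "not yet born"
    if age < 18:
        return "too young"
    if age > 65:
        return "too old"
    return "valid"

rng = random.Random(0)
samples = [rng.randint(-10, 100) for _ in range(1000)]
stats = Counter(classify(age) for age in samples)

# Print the observed distribution of generated test cases.
for label, count in stats.most_common():
    print(f"{label:>12} {count / len(samples):6.1%}")
```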

Is it safer?

Here are some observations:

  • With Example-based tests, we started with a simplification of the requirement (the function shall return emptyList). In each step, we then extended the requirement, until it was finally complete.
    This means that, along the way, for a good deal of time the specification was incomplete and partially lying. Some green tests were deceiving. In fact, we did not trust them, and we kept adding tests.

  • With Example-based tests, the algorithm was buggy for a good deal of time.
    For example: when we implemented the test for 4, the function returned the wrong result for a multitude of cases, such as [2, 9] instead of [2, 3, 3] as the factors of 18.
    Yet the test suite was green. Green tests did not mean completeness at all.

  • Surprisingly, at the same step 4, the code would also return *the right result* for 5 and 7, although this was completely incidental and not intentional. This lack of intentionality and control gives pause.

  • With PDD we practiced incremental development, letting the shrinker identify increasingly complex use cases. Doing this, we never needed to modify the requirement: on the contrary, we started from the beginning with a complete specification, and we kept it unchanged until the end.
    A green test really meant completeness.

Generally, we pursue a green test suite for the confidence it instills. There might be exceptions where green tests do not necessarily imply we are done, but as a matter of fact, in the prime factors kata, the PDD approach had fewer such exceptions than TDD. I’m inclined to think that this holds more broadly.

What about Collateral Properties?

We started with a direct translation of the business requirement into an Essential Property.

It might be interesting to enrich the test suite with other valid observations on the algorithm. Ideally, a proper set of collateral properties can even replace the original test.

This is what Johannes Link does in Property-based Testing in Java: Property-driven Development, using the very same Kata.

He lists the following intuitions:

factorize(prime) -> [prime]
factorize(prime^2) -> [prime, prime]
factorize(prime^n) -> [prime, ..., prime]
factorize(prime1 * prime2) -> [prime1, prime2]
factorize(prime1 * .... * primeN) -> [prime1, ..., primeN]
factorize(n < 2) -> IllegalArgumentException
factorize(2 <= number <= Integer.MAX_VALUE) -> no exception  
product of all returned numbers must be equal to input number
all numbers in produced list must be primes

Interestingly, the last 2 in the list constitute the Essential Property we have used.

Why would this be a good idea?
On the one hand, the Essential Property we chose has the benefit of capturing the meaning of the original requirement; on the other hand, the Collateral Properties Johannes relies on for his implementation give extra solidity to the test suite. They can really be complementary.
There is an excellent study on the bug-hunting effectiveness of different styles of properties in Bug Hunting: How to Specify it! In Java!.
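A few of these collateral properties can be spot-checked in plain Python against a reference trial-division implementation (a sketch; the ValueError stands in for Java's IllegalArgumentException, and all names here are mine):

```python
from math import prod

def factorize(n):
    # Reference trial-division implementation of the kata.
    if n < 2:
        raise ValueError("prime factorization is defined for n >= 2")
    remainder, divisor, factors = n, 2, []
    while remainder > 1:
        while remainder % divisor == 0:
            factors.append(divisor)
            remainder //= divisor
        divisor += 1
    return factors

# Spot-check some of the properties from the list above.
for p in (2, 3, 5, 7, 11, 13):
    assert factorize(p) == [p]             # factorize(prime) -> [prime]
    assert factorize(p * p) == [p, p]      # factorize(prime^2) -> [prime, prime]

assert factorize(3 * 7) == [3, 7]          # factorize(prime1 * prime2)
assert prod(factorize(360)) == 360         # product of factors equals the input
assert all(factorize(f) == [f] for f in factorize(360))   # all factors are prime
```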

What’s next?

I hope you got the idea. And that you agree that, in a way, all tests are about properties and that more or less you already knew how to write one.

If you are the kind of developer who likes to think in abstractions, you could have invented Property-based Testing yourself, and chances are you already had.

The next natural step is to crack open the manual of your preferred language’s PBT library and start playing. A non-exhaustive list:

Library           Comment                                            Languages
Hedgehog          An excellent choice, with integrated shrinking     C#, F#, Scala, Haskell
FsCheck           From the QuickCheck family                         C#, F#
jqwik             A lot of documentation, integrated shrinking       Java, Kotlin
junit-quickcheck                                                     Java
QuickTheories                                                        Java
ScalaCheck                                                           Scala
test.check                                                           Clojure
Kotest                                                               Kotlin
Hypothesis                                                           Python, Java, Ruby
CrossHair         More than a PBT library                            Python
fast-check                                                           JavaScript, TypeScript
js-verify         From the QuickCheck family                         JavaScript, TypeScript
stream_data                                                          Elixir



Yippee! We made it to the end!
I wish you happy testing, and a lot of chocolate!

References

See References
