jenny: a pairwise testing tool

jenny is tool for generating regression tests. (It "jenny"rates tests, thus its name.) Any time exhaustive testing looks painful due to the combinatorial explosion of features interactions to be tested, consider using jenny. It will cover most of the interactions with far fewer testcases. It can guarantee pairwise testing of all features that can be used together, and it can avoid those feature combinations that cannot.

Table of Contents:

C code

jenny.c (February 5 2005 version)

Compile it like so: gcc -O3 jenny.c -o jenny

Prebuilt Windows executable: jenny.exe (supplied by Derrick Pallas, I think the September 14 2003 version)

The reasoning behind jenny

Features tend to come in dimensions, where a testcase can choose one feature from each dimension. For example, the datatype of a variable is a dimension: the variable can have only one of several datatype. The variable can be compared to another variable. The type of the other variable is another dimension. The type of comparison (<, >, =) is another dimension. Each testcase chooses one feature from each dimension, and often chooses them independently of each other.

jenny has exhaustive testing as its limit. Exhaustive testing is to run one testcase for every combination in the cross-product of all dimensions. For example, exhaustive testing of 12 dimensions with 2 features each (for example 12 flags that can each be on or off) requires 212 = 4096 testcases, and that's what jenny produces:

PROMPT>jenny -n12 2 2 2 2 2 2 2 2 2 2 2 2 | wc
   4096   49152  167936
(wc lists lines, words, characters. jenny produces one line per testcase. So, 4096 testcases cover all 12-feature combinations.)

Exhaustive testing is overkill. The goal of testing is to find bugs. I have never seen a bug that only happens when twelve features are used together. Most bugs only require one, two, or three features to be used together. (One-feature bugs should all be caught before you check in your code. Two-feature bugs should all be caught by regression testing before going production. Three-feature bugs are typically found in production.) In our example, testing all twelve-feature combinations is overkill. Testing all three-feature combinations would have been good enough.

jenny allows the user to ask to cover combinations of features smaller than the number of dimensions. At the other limit from exhaustive testing is combinations of one feature, that is, to cover every feature at least once. 12 dimension of 2 features each have 24 features to cover. One testcase per feature would be 24 testcases. But behold:

PROMPT>jenny -n1 2 2 2 2 2 2 2 2 2 2 2 2 | wc
      2      24      82
jenny produced just two testcases. How can that be? Let's look at the actual testcases this time:
PROMPT>jenny -n1 2 2 2 2 2 2 2 2 2 2 2 2
 1a 2b 3a 4b 5b 6b 7a 8a 9a 10a 11b 12a 
 1b 2a 3b 4a 5a 6a 7b 8b 9b 10b 11a 12b 
Each line is a "testcase", saying which features a single testcase should cover. Each testcase contains 12 features, and 12*2 = 24. Two testcases cover all features.

Suppose we want to cover all triples of features. There are (12 choose 3) = 220 ways to choose dimensions, and 23 = 8 ways to choose features within those dimensions, for a total of 1760 possible triples of features. That's a quite a bit fewer than the 4096 testcase that exhaustive testing would give. But jenny does almost two orders of magnitude better than both:

PROMPT>jenny -n3 2 2 2 2 2 2 2 2 2 2 2 2 | wc
     20     240     820
PROMPT>jenny -n3 2 2 2 2 2 2 2 2 2 2 2 2 | sort
 1a 2a 3a 4a 5a 6b 7b 8a 9a 10b 11a 12a 
 1a 2a 3a 4a 5a 6b 7b 8b 9b 10a 11b 12b 
 1a 2a 3a 4a 5b 6b 7a 8b 9b 10b 11b 12b 
 1a 2a 3b 4b 5b 6a 7a 8a 9a 10b 11b 12b 
 1a 2a 3b 4b 5b 6b 7b 8b 9b 10a 11a 12a 
 1a 2b 3a 4b 5a 6a 7b 8a 9b 10a 11a 12b 
 1a 2b 3a 4b 5b 6b 7a 8a 9a 10a 11b 12a 
 1a 2b 3b 4a 5a 6a 7a 8b 9a 10a 11a 12a 
 1a 2b 3b 4a 5a 6b 7b 8b 9a 10a 11b 12b 
 1a 2b 3b 4a 5b 6a 7b 8a 9b 10b 11a 12a 
 1a 2b 3b 4b 5a 6a 7b 8b 9b 10b 11b 12a 
 1b 2a 3a 4a 5b 6a 7b 8b 9a 10a 11b 12b 
 1b 2a 3a 4b 5b 6a 7a 8a 9b 10b 11a 12a 
 1b 2a 3b 4a 5a 6a 7b 8b 9b 10b 11a 12b 
 1b 2a 3b 4a 5a 6b 7a 8a 9b 10a 11b 12a 
 1b 2a 3b 4b 5a 6b 7a 8b 9a 10a 11a 12b 
 1b 2b 3a 4a 5a 6a 7a 8a 9a 10b 11b 12b 
 1b 2b 3a 4b 5b 6b 7b 8b 9a 10b 11a 12a 
 1b 2b 3b 4a 5b 6b 7a 8b 9b 10a 11a 12b 
 1b 2b 3b 4b 5a 6b 7b 8a 9b 10b 11b 12b 
jenny covers all triples of features in 20 testcases. Run down any three columns, if you'd like, and confirm that all 8 possible feature combinations in them are covered. Every testcase covers exactly one of those choices, so 8 testcases is a lower bound for covering all triples of features. The minimal number of testcases may be 8, or may be higher. (In this case it seems to be 12 testcases that form a Golay code. I don't know how to generalize that in any useful way, though.) jenny usually achieves the minimum number of testcases when -n is 1 or the number of dimensions. For -n in between it usually does slightly worse than optimal.

Usage

jenny must be given at least n dimensions of features when you want to cover all n-tuples. A dimension is represented by just a number (the number of features in that dimension). The number of features in a dimension must be an integer in 2..52. Features in dimensions are implicitly named, in order, a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z. Dimensions are implicitly numbered 1..65535, in the order that they appear.

-n gives the n in n-tuples. It must be a number between 1 and 32. For example, -n2 means pairs and -n3 means triples. If there are several -n, the last one wins. The default for n is 2.

-s seeds the random number generator. It must be an integer in 0..2^^32-1. The default is 0. Different seeds produce different tests.

-w gives a restriction, a "without". -w1a2bc5d means that the combination of the first dimension's first feature (1a), the second dimension's second or third feature (2bc), and the fifth dimension's fourth feature (5d) is disallowed. The position of withouts is irrelevant.

-o is for old testcases. For example -ofoo.txt will read old testcases from the file foo.txt. It will include those testcases first in the output, then append new testcases as required. The set of inputs can be terminated by end-of-file or by a line starting with '.' . The old testcases should be as they appear in jenny output, with features given for every vector in order, separated by spaces, with a leading and trailing space. -o is useful for reusing the testcases that you've already written when you find you need to add a -w or change the size of some dimension in your jenny command.

-h prints out instructions for using jenny.

Strategy

Here are some heuristics for using jenny.

The work required to add a new feature or new dimension to an existing product is similar to the work required to update the pairwise tests to cover that new feature or dimension. It is not enough to just test the new feature, it has to work with all other features. This cross-feature compatibility doesn't necessarily require a product with n features to require n*n work, but doubling or the work per feature sounds about right. It is common for software teams to be doing no new features at all, but spending all their time running to keep up with a changing underlying base. This is the work of maintaining cross-feature compatibility, where the new features are in the underlying base rather than the actual product. Management often tries to destaff an old product while continuing to modify the underlying base, and the old product will mysteriously fail to continue working despite nothing changing in it ... this is what causes that. "Bitrot" is this lack of maintaining cross-feature compatibility.

Using jenny to test itself

The obvious way to show that jenny works in practice is to use it to test itself. I'm not sure whether -n2 or -n3 is appropriate. We'll be pessimistic and use -n3, that is, we'll cover all triples of features, and to include as many dimensions as possible. So, here goes!

First I came up with all the dimensions of features I could think of. I only found 12 (jenny is a pretty simple program, and I'm not a tester by trade). I spelled them out in a sed script, jenny_gen.sed, that can translate jenny output into something short but interpretable. I spent some effort rearranging the dimensions so that I could type in a testcase as I read through the output without too much backtracking. Each dimension has three or four settings (each setting is a "feature").

Next came the command itself:

./jenny -n3 4 4 3 3 3 3 3 3 4 3 3 4 -w1abc2d -w1d2abc -w6ab7bc \
  -w6b8c -w6a8bc -w6a9abc -w6a10ab -w11a12abc -w11bc12d -w4c5ab \
  -w1a3a -w1a9a -w3a9c | sed -f jenny_gen.sed > jenny_gen.txt
This is the final form. The original was simpler, but when trying to write the first two testcases I found many restrictions that had to be added. (For example, -w1a9a says you can't have n=1 and also have withouts that restrict fewer than n dimensions, because there is no such thing as a without with zero dimensions.)

Running that command recommended 120 testcases that would cover all triples of features. The first 26 cover all pairs of features. The first 5 cover every feature at least once.

For each of jenny's recommendations, I wrote a testcase by hand that contained all those features (jenny_test.sh). For example, here's the first jenny output, the handwritten testcase, and its results:

# test1: (use -n1) (put -n last) (between n and 10 dimensions) (a dimension of size 2)
  (all dimensions equal size) (several restrictions) (some restrictions fully duplicated)
  (a restriction comes after the dimensions it constrains) (one -s) (put -s first) 

./jenny -s9 2 -n8 2 2 2 2 2 2 2 2 -w4b3a -w3a4b -w3a5a4a -w5b4a3a -w7a4b \
 -w8a3b4b -w9a5a6a7a4a -n1

Could not cover tuple  3a 
 1a 2a 3b 4a 5b 6b 7a 8a 9b 
 1b 2b 3b 4b 5a 6a 7b 8b 9a 
The testcase I wrote obeys jenny's recommendations.

Once I got rolling it took me about 3 minutes on average to write a testcase. Once finished, I had to proofread and correct the testcases. The output is about 700K (jenny_test.log). On Solaris, I could produce the log like so:

sh -v jenny_test.sh >& jenny_test.log
On Windows that didn't work, so I did "sh -v jenny_test.sh" in a shell in Emacs, trimmed the buffer, and saved the results to jenny_test.log.

So, did it find any bugs? I wrote this test after doing some code coverage and playing with jenny for a few weeks, so it was already stable for all the inputs I could think of. But jenny still found 7 new bugs in all, in testcases #2, #10, #11, #11, #13, #13, and #35. It was clear the bugs were petering out once I'd covered all pairs of features (testcases #1 through #26), but I slogged through all triples just to demonstrate it could be done and to find out whether more bugs would be found. Perhaps more complicated software would find covering all triples worthwhile, but for testing jenny, -n2 was the right choice.

I could have written another program specific to the thing being tested to automatically translate the output into runnable testcases. Would that have taken longer than writing the tests by hand? Probably. Definitely, had I stopped after covering all pairs of features. I also added more variability in the testcases by hand than the dimensions specifically asked for. That would have been lost had I written a program to automatically produce runnable testcases. Writing the testcases by hand also allowed me to use complicated features. For example, one features was "do the combined restrictions imply another implicit restriction?" That would have been difficult for an automatic test generator to deal with. Writing the testcases by hand appears to be worthwhile.

Algorithm

jenny first covers single features, then pairs of features, then triples, and so forth up to the n-tuples requested by the user. Each pass checks whether the existing tests cover all tuples, and if not, make a list of uncovered tuples and add more tests until all tuples are covered. It tries to find testcases that obey the restrictions and cover a lot of new tuples. Any tuple that it can't cover no matter how hard it tries without disobeying some restriction, it says it can't cover it and adds it to its list of restrictions. Then it goes on to one-bigger tuples. The running time is empirically about O(d2.3f4.6) for d dimensions all of size f (for -n2).

Several improvements are possible.

Tools for covering all n-tuples

Exhaustive testing grows exponentially with the number of dimensions to be tested together. But if you limit yourself to just covering all pairs (triples, quadruples, n-tuples) of features, and allow every testcase to cover many n-tuples, the number of testcases required grows only logarithmically with the of the number of dimensions. jenny is just one tool that exploits this. The site pairwise.org has a list of many such tools, and a comparison of the number of testcases each produces for various inputs. (So far it doesn't list which ones support restrictions. I think these tools are unusable unless they support restrictions. Several do, several don't.)

History


A minimal perfect hash generator
Klemperer Rosettes and other odd orbits
Ye Olde Catalogue of Boy Scout Skits
Chocolate Chicken and other recipes
Table of Contents