jenny: a pairwise testing tool

`jenny: a pairwise testing tool`

jenny is tool for generating regression tests. (It "jenny"rates tests, thus its name.) Any time exhaustive testing looks painful due to the combinatorial explosion of features interactions to be tested, consider using jenny. It will cover most of the interactions with far fewer testcases. It can guarantee pairwise testing of all features that can be used together, and it can avoid those feature combinations that cannot.

Table of Contents:

C code for jenny
The reasoning behind jenny
Usage
Strategy
Using jenny to test itself
Algorithm used
Pairwise combinatorial testing: aetg, jenny, allpairs

C code

jenny.c (February 5 2005 version)

Compile it like so: gcc -O3 jenny.c -o jenny

Prebuilt Windows executable: jenny.exe (supplied by Derrick Pallas, I think the September 14 2003 version)

The reasoning behind `jenny`

Features tend to come in dimensions, where a testcase can choose one feature from each dimension. For example, the datatype of a variable is a dimension: the variable can have only one of several datatype. The variable can be compared to another variable. The type of the other variable is another dimension. The type of comparison (<, >, =) is another dimension. Each testcase chooses one feature from each dimension, and often chooses them independently of each other.

jenny has exhaustive testing as its limit. Exhaustive testing is to run one testcase for every combination in the cross-product of all dimensions. For example, exhaustive testing of 12 dimensions with 2 features each (for example 12 flags that can each be on or off) requires 2¹² = 4096 testcases, and that's what jenny produces:

PROMPT>jenny -n12 2 2 2 2 2 2 2 2 2 2 2 2 | wc
   4096   49152  167936

(wc lists lines, words, characters. jenny produces one line per testcase. So, 4096 testcases cover all 12-feature combinations.)

Exhaustive testing is overkill. The goal of testing is to find bugs. I have never seen a bug that only happens when twelve features are used together. Most bugs only require one, two, or three features to be used together. (One-feature bugs should all be caught before you check in your code. Two-feature bugs should all be caught by regression testing before going production. Three-feature bugs are typically found in production.) In our example, testing all twelve-feature combinations is overkill. Testing all three-feature combinations would have been good enough.

jenny allows the user to ask to cover combinations of features smaller than the number of dimensions. At the other limit from exhaustive testing is combinations of one feature, that is, to cover every feature at least once. 12 dimension of 2 features each have 24 features to cover. One testcase per feature would be 24 testcases. But behold:

PROMPT>jenny -n1 2 2 2 2 2 2 2 2 2 2 2 2 | wc
      2      24      82

jenny produced just two testcases. How can that be? Let's look at the actual testcases this time:

PROMPT>jenny -n1 2 2 2 2 2 2 2 2 2 2 2 2
 1a 2b 3a 4b 5b 6b 7a 8a 9a 10a 11b 12a 
 1b 2a 3b 4a 5a 6a 7b 8b 9b 10b 11a 12b

Each line is a "testcase", saying which features a single testcase should cover. Each testcase contains 12 features, and 12*2 = 24. Two testcases cover all features.

Suppose we want to cover all triples of features. There are (12 choose 3) = 220 ways to choose dimensions, and 2³ = 8 ways to choose features within those dimensions, for a total of 1760 possible triples of features. That's a quite a bit fewer than the 4096 testcase that exhaustive testing would give. But jenny does almost two orders of magnitude better than both:

PROMPT>jenny -n3 2 2 2 2 2 2 2 2 2 2 2 2 | wc
     20     240     820
PROMPT>jenny -n3 2 2 2 2 2 2 2 2 2 2 2 2 | sort
 1a 2a 3a 4a 5a 6b 7b 8a 9a 10b 11a 12a 
 1a 2a 3a 4a 5a 6b 7b 8b 9b 10a 11b 12b 
 1a 2a 3a 4a 5b 6b 7a 8b 9b 10b 11b 12b 
 1a 2a 3b 4b 5b 6a 7a 8a 9a 10b 11b 12b 
 1a 2a 3b 4b 5b 6b 7b 8b 9b 10a 11a 12a 
 1a 2b 3a 4b 5a 6a 7b 8a 9b 10a 11a 12b 
 1a 2b 3a 4b 5b 6b 7a 8a 9a 10a 11b 12a 
 1a 2b 3b 4a 5a 6a 7a 8b 9a 10a 11a 12a 
 1a 2b 3b 4a 5a 6b 7b 8b 9a 10a 11b 12b 
 1a 2b 3b 4a 5b 6a 7b 8a 9b 10b 11a 12a 
 1a 2b 3b 4b 5a 6a 7b 8b 9b 10b 11b 12a 
 1b 2a 3a 4a 5b 6a 7b 8b 9a 10a 11b 12b 
 1b 2a 3a 4b 5b 6a 7a 8a 9b 10b 11a 12a 
 1b 2a 3b 4a 5a 6a 7b 8b 9b 10b 11a 12b 
 1b 2a 3b 4a 5a 6b 7a 8a 9b 10a 11b 12a 
 1b 2a 3b 4b 5a 6b 7a 8b 9a 10a 11a 12b 
 1b 2b 3a 4a 5a 6a 7a 8a 9a 10b 11b 12b 
 1b 2b 3a 4b 5b 6b 7b 8b 9a 10b 11a 12a 
 1b 2b 3b 4a 5b 6b 7a 8b 9b 10a 11a 12b 
 1b 2b 3b 4b 5a 6b 7b 8a 9b 10b 11b 12b

jenny covers all triples of features in 20 testcases. Run down any three columns, if you'd like, and confirm that all 8 possible feature combinations in them are covered. Every testcase covers exactly one of those choices, so 8 testcases is a lower bound for covering all triples of features. The minimal number of testcases may be 8, or may be higher. (In this case it seems to be 12 testcases that form a Golay code. I don't know how to generalize that in any useful way, though.) jenny usually achieves the minimum number of testcases when -n is 1 or the number of dimensions. For -n in between it usually does slightly worse than optimal.

Usage

jenny must be given at least n dimensions of features when you want to cover all n-tuples. A dimension is represented by just a number (the number of features in that dimension). The number of features in a dimension must be an integer in 2..52. Features in dimensions are implicitly named, in order, a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z. Dimensions are implicitly numbered 1..65535, in the order that they appear.

-n gives the n in n-tuples. It must be a number between 1 and 32. For example, -n2 means pairs and -n3 means triples. If there are several -n, the last one wins. The default for n is 2.

-s seeds the random number generator. It must be an integer in 0..2^^32-1. The default is 0. Different seeds produce different tests.

-w gives a restriction, a "without". -w1a2bc5d means that the combination of the first dimension's first feature (1a), the second dimension's second or third feature (2bc), and the fifth dimension's fourth feature (5d) is disallowed. The position of withouts is irrelevant.

-o is for old testcases. For example -ofoo.txt will read old testcases from the file foo.txt. It will include those testcases first in the output, then append new testcases as required. The set of inputs can be terminated by end-of-file or by a line starting with '.' . The old testcases should be as they appear in jenny output, with features given for every vector in order, separated by spaces, with a leading and trailing space. -o is useful for reusing the testcases that you've already written when you find you need to add a -w or change the size of some dimension in your jenny command.

-h prints out instructions for using jenny.

Strategy

Here are some heuristics for using jenny.

You have to implement every testcase jenny recommends. You can't skip any. For example, suppose you do
PROMPT>jenny 2 2 2 1a 2a 3b 1a 2b 3a 1b 2a 3a 1b 2b 3b
then afterwards you decide you want to not cover 1b2a. If you don't implement 1b2a3a, it's true you won't cover 1b2a. But you'll also lose your coverage of pairs 1b3a and 2a3a. (Yes I faked that output.)
If you want to not cover certain combinations, use -w. You can have as many -w as you want. Use -w to avoid combinations that raise errors or otherwise prevent the other features in the testcase from being covered. jenny assumes that each testcase demonstrates all tuples in that testcase actually working. If that assumption is not satisfied, jenny's guarantee of coverage does not hold.
Use a sed script like this one to translate jenny output into something readable. Arrange your dimensions so it is as easy as possible to read through the output and type in the testcase.
A complete test suite needs more than just feature interaction tests. It also needs (this list may not be complete) error cases, unit tests that do code coverage, testcases for bugs found, and stress tests. The bugs found in the other tests can give you good ideas about what features to feed to jenny.
You're probably forgetting over half of your dimensions. As you write your testcases, make an effort to do them all differently, even more differently than jenny's specification requires. jenny still proves useful by giving you situations that suggest interesting things to test.
Very critical, very intertwined, or very large projects may benefit from covering all triples of features (-n3). For everything else, covering all pairs (-n2) is sufficient.
Add as many dimensions as you can. You get a lot more coverage at little cost. The number of testcases (the work you'll do) only grows with the log of the number of dimensions. The number of n-tuples covered (the coverage you'll get) grows with the nth power of the number of dimensions.
Features don't have to be things in the syntax. They can be things the code does. For example flushing an object to disk due to low memory can be a feature, even if there's no syntax for it. Writing testcases for such features can be tricky, though.
Try to avoid large dimensions. The number of testcases is at least the product of the sizes of the n largest dimensions. Large dimensions can often be viewed as the product of smaller dimensions: for example, the 6-feature dimension (=, !=, <, <=, >, >=) can be viewed as a 3-feature dimension (=, < >) times a 2-feature dimension (self, not-self). Beware, factoring a dimension means turning each feature in it into two features, so it reduces coverage. Factoring is only appropriate if each feature really is two features, or if which feature of the dimension is covered isn't very important.
When adding a new feature to an existing dimension in a product with an existing test suite generated by jenny, if there are now n features in that dimension, change 1/n of the testcases to use the new feature instead of some old features. Then run jenny with -o to take the modified old testcases and tell you want new testcases are needed. You might want to experiment with which testcases to modify before going through the work of actually modifying the written testcases and adding new ones; you might get away with adding no new testcases that way.
When adding a new dimension to a product with an existing test suite generated by jenny, add the new dimension to all existing testcases, choosing from the features randomly with the constraint that the old testcases have to keep running. Then run jenny with -o to accept the old testcases, and it'll tell you want additional new testcases need to be written (if any).

The work required to add a new feature or new dimension to an existing product is similar to the work required to update the pairwise tests to cover that new feature or dimension. It is not enough to just test the new feature, it has to work with all other features. This cross-feature compatibility doesn't necessarily require a product with n features to require n*n work, but doubling or the work per feature sounds about right. It is common for software teams to be doing no new features at all, but spending all their time running to keep up with a changing underlying base. This is the work of maintaining cross-feature compatibility, where the new features are in the underlying base rather than the actual product. Management often tries to destaff an old product while continuing to modify the underlying base, and the old product will mysteriously fail to continue working despite nothing changing in it ... this is what causes that. "Bitrot" is this lack of maintaining cross-feature compatibility.

Using `jenny` to test itself

The obvious way to show that jenny works in practice is to use it to test itself. I'm not sure whether -n2 or -n3 is appropriate. We'll be pessimistic and use -n3, that is, we'll cover all triples of features, and to include as many dimensions as possible. So, here goes!

First I came up with all the dimensions of features I could think of. I only found 12 (jenny is a pretty simple program, and I'm not a tester by trade). I spelled them out in a sed script, jenny_gen.sed, that can translate jenny output into something short but interpretable. I spent some effort rearranging the dimensions so that I could type in a testcase as I read through the output without too much backtracking. Each dimension has three or four settings (each setting is a "feature").

Next came the command itself:

./jenny -n3 4 4 3 3 3 3 3 3 4 3 3 4 -w1abc2d -w1d2abc -w6ab7bc \
  -w6b8c -w6a8bc -w6a9abc -w6a10ab -w11a12abc -w11bc12d -w4c5ab \
  -w1a3a -w1a9a -w3a9c | sed -f jenny_gen.sed > jenny_gen.txt

This is the final form. The original was simpler, but when trying to write the first two testcases I found many restrictions that had to be added. (For example, -w1a9a says you can't have n=1 and also have withouts that restrict fewer than n dimensions, because there is no such thing as a without with zero dimensions.)

Running that command recommended 120 testcases that would cover all triples of features. The first 26 cover all pairs of features. The first 5 cover every feature at least once.

For each of jenny's recommendations, I wrote a testcase by hand that contained all those features (jenny_test.sh). For example, here's the first jenny output, the handwritten testcase, and its results:

# test1: (use -n1) (put -n last) (between n and 10 dimensions) (a dimension of size 2)
  (all dimensions equal size) (several restrictions) (some restrictions fully duplicated)
  (a restriction comes after the dimensions it constrains) (one -s) (put -s first) 

./jenny -s9 2 -n8 2 2 2 2 2 2 2 2 -w4b3a -w3a4b -w3a5a4a -w5b4a3a -w7a4b \
 -w8a3b4b -w9a5a6a7a4a -n1

Could not cover tuple  3a 
 1a 2a 3b 4a 5b 6b 7a 8a 9b 
 1b 2b 3b 4b 5a 6a 7b 8b 9a

The testcase I wrote obeys jenny's recommendations.

Once I got rolling it took me about 3 minutes on average to write a testcase. Once finished, I had to proofread and correct the testcases. The output is about 700K (jenny_test.log). On Solaris, I could produce the log like so:

sh -v jenny_test.sh >& jenny_test.log

On Windows that didn't work, so I did "sh -v jenny_test.sh" in a shell in Emacs, trimmed the buffer, and saved the results to jenny_test.log.

So, did it find any bugs? I wrote this test after doing some code coverage and playing with jenny for a few weeks, so it was already stable for all the inputs I could think of. But jenny still found 7 new bugs in all, in testcases #2, #10, #11, #11, #13, #13, and #35. It was clear the bugs were petering out once I'd covered all pairs of features (testcases #1 through #26), but I slogged through all triples just to demonstrate it could be done and to find out whether more bugs would be found. Perhaps more complicated software would find covering all triples worthwhile, but for testing jenny, -n2 was the right choice.

I could have written another program specific to the thing being tested to automatically translate the output into runnable testcases. Would that have taken longer than writing the tests by hand? Probably. Definitely, had I stopped after covering all pairs of features. I also added more variability in the testcases by hand than the dimensions specifically asked for. That would have been lost had I written a program to automatically produce runnable testcases. Writing the testcases by hand also allowed me to use complicated features. For example, one features was "do the combined restrictions imply another implicit restriction?" That would have been difficult for an automatic test generator to deal with. Writing the testcases by hand appears to be worthwhile.

Algorithm

jenny first covers single features, then pairs of features, then triples, and so forth up to the n-tuples requested by the user. Each pass checks whether the existing tests cover all tuples, and if not, make a list of uncovered tuples and add more tests until all tuples are covered. It tries to find testcases that obey the restrictions and cover a lot of new tuples. Any tuple that it can't cover no matter how hard it tries without disobeying some restriction, it says it can't cover it and adds it to its list of restrictions. Then it goes on to one-bigger tuples. The running time is empirically about O(d^2.3f^4.6) for d dimensions all of size f (for -n2).

Several improvements are possible.

jenny could read in old testcases with fewer dimensions and fill in the missing dimensions.
jenny could walk through all tuples for each dimension once, caching as many as it can and moving on only once cached ones are covered, rather than generating all n-tuples at once and forcing memory to be big enough to hold them. That would allow jenny to do some amount of optimization even when the total number of tuples is too big to fit in memory.
If there is enough room in memory, jenny could use augmenting paths to move tuples covered only in the last test into earlier tests. The end of the augmenting path would be a change that doesn't eliminate any existing tuples. The links would be changes that displace one existing tuple, which needs to be placed elsewhere. I've used this technique successfully before, for perfect hashing.
jenny ought to support nested dimensions. For example, the type of comparison is a dimension, the type of lefthand side is a dimension, and the type of righthand side is a dimension. But whether to have a predicate at all is another dimension. If you choose not to have a dimension, those nested dimensions (lhs, comparison, rhs) are ignored. I can't simply add the whether-or-not dimension because jenny would count the chosen but ignored features toward its total. So I could add lots of restrictions, or better yet, change the interface to support nested dimensions. The interface should be BNF, with something like xpath to specify restrictions. This one sounds like a lot of work, so I'll put it off until I learn that someone is actually using jenny and running into this problem. A workaround in the meantime is to run jenny once at n with all nested dimensions, then once per set of nested dimensions at n-1 with one set removed, et cetera, and use the union of all the testcases.
jenny ought to have a file-based interface that names all the features and dimensions directly, instead of using a command line with numbers and a sed script. The file-based interface should allow up to 64K features per dimensions and up to 64K dimensions, bypassing the limits on how long a command line can be. Nested dimensions are complicated enough to almost demand a file-based interface. jenny can keep the current command line interface, but it ought to have a file-based interface as well.
jenny could guarantee different coverage for different dimensions. For example, it could guarantee all pairs for some dimensions, but all triples for others. I'm not going to do this because it can already be done: define a dimension as the product of two smaller dimensions, eg a 24-case dimension becomes a 6-case dimension times a 4-case dimension. If you're normally covering triples, you'll cover just pairs of the split dimension.

Tools for covering all n-tuples

Exhaustive testing grows exponentially with the number of dimensions to be tested together. But if you limit yourself to just covering all pairs (triples, quadruples, n-tuples) of features, and allow every testcase to cover many n-tuples, the number of testcases required grows only logarithmically with the of the number of dimensions. jenny is just one tool that exploits this. The site pairwise.org has a list of many such tools, and a comparison of the number of testcases each produces for various inputs. (So far it doesn't list which ones support restrictions. I think these tools are unusable unless they support restrictions. Several do, several don't.)

History

Feb 5 2005. Renamed rand to flearand and #included stdlib.h. No change to semantics or perforance.
March 2 2004. Fix a bug in -o. Add a sanity check at the end that every tuple is covered either by a without or by a testcase.
September 14 2003. Added -o.
August 4 2003. Self-tested.

A minimal perfect hash generator
Klemperer Rosettes and other odd orbits
Ye Olde Catalogue of Boy Scout Skits
Chocolate Chicken and other recipes
Table of Contents