jenny is a tool for generating regression tests. (It "jenny"rates tests, thus its name.) Any time exhaustive testing looks painful due to the combinatorial explosion of feature interactions to be tested, consider using jenny. It will cover most of the interactions with far fewer testcases. It can guarantee pairwise testing of all features that can be used together, and it can avoid those feature combinations that cannot.
Compile it like so: gcc -O3 jenny.c -o jenny
Prebuilt Windows executable: jenny.exe
(supplied by Derrick Pallas, I think the September 14 2003 version)
The reasoning behind jenny
Features tend to come in dimensions, where a testcase can choose one feature from each dimension. For example, the datatype of a variable is a dimension: the variable can have only one of several datatypes. The variable can be compared to another variable; the type of the other variable is another dimension. The type of comparison (<, >, =) is another dimension. Each testcase chooses one feature from each dimension, and often chooses them independently of the others.
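To make that concrete, the variable-comparison scenario maps onto a jenny command line with three dimensions of three features each (an invocation invented here for illustration; say a=int, b=float, c=string for the two type dimensions, and a='<', b='>', c='=' for the comparison dimension):

PROMPT>jenny 3 3 3

Each line of output picks one feature from each dimension; a line such as " 1a 2c 3b" would mean "compare an int to a string with >".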
jenny has exhaustive testing as its limit. Exhaustive testing is to run one testcase for every combination in the cross-product of all dimensions. For example, exhaustive testing of 12 dimensions with 2 features each (for example 12 flags that can each be on or off) requires 2^12 = 4096 testcases, and that's what jenny produces:
PROMPT>jenny -n12 2 2 2 2 2 2 2 2 2 2 2 2 | wc
   4096   49152  167936
Exhaustive testing is overkill. The goal of testing is to find bugs. I have never seen a bug that only happens when twelve features are used together. Most bugs only require one, two, or three features to be used together. (One-feature bugs should all be caught before you check in your code. Two-feature bugs should all be caught by regression testing before going production. Three-feature bugs are typically found in production.) In our example, testing all twelve-feature combinations is overkill. Testing all three-feature combinations would have been good enough.
jenny allows the user to ask to cover combinations of features smaller than the number of dimensions. At the other limit from exhaustive testing are combinations of just one feature, that is, covering every feature at least once. 12 dimensions of 2 features each have 24 features to cover. One testcase per feature would be 24 testcases. But behold:
PROMPT>jenny -n1 2 2 2 2 2 2 2 2 2 2 2 2 | wc
      2      24      82
PROMPT>jenny -n1 2 2 2 2 2 2 2 2 2 2 2 2
 1a 2b 3a 4b 5b 6b 7a 8a 9a 10a 11b 12a
 1b 2a 3b 4a 5a 6a 7b 8b 9b 10b 11a 12b
Suppose we want to cover all triples of features. There are (12 choose 3) = 220 ways to choose the dimensions, and 2^3 = 8 ways to choose features within those dimensions, for a total of 1760 possible triples of features. That's quite a bit fewer than the 4096 testcases that exhaustive testing would give. But jenny does almost two orders of magnitude better than both:
PROMPT>jenny -n3 2 2 2 2 2 2 2 2 2 2 2 2 | wc
     20     240     820
PROMPT>jenny -n3 2 2 2 2 2 2 2 2 2 2 2 2 | sort
 1a 2a 3a 4a 5a 6b 7b 8a 9a 10b 11a 12a
 1a 2a 3a 4a 5a 6b 7b 8b 9b 10a 11b 12b
 1a 2a 3a 4a 5b 6b 7a 8b 9b 10b 11b 12b
 1a 2a 3b 4b 5b 6a 7a 8a 9a 10b 11b 12b
 1a 2a 3b 4b 5b 6b 7b 8b 9b 10a 11a 12a
 1a 2b 3a 4b 5a 6a 7b 8a 9b 10a 11a 12b
 1a 2b 3a 4b 5b 6b 7a 8a 9a 10a 11b 12a
 1a 2b 3b 4a 5a 6a 7a 8b 9a 10a 11a 12a
 1a 2b 3b 4a 5a 6b 7b 8b 9a 10a 11b 12b
 1a 2b 3b 4a 5b 6a 7b 8a 9b 10b 11a 12a
 1a 2b 3b 4b 5a 6a 7b 8b 9b 10b 11b 12a
 1b 2a 3a 4a 5b 6a 7b 8b 9a 10a 11b 12b
 1b 2a 3a 4b 5b 6a 7a 8a 9b 10b 11a 12a
 1b 2a 3b 4a 5a 6a 7b 8b 9b 10b 11a 12b
 1b 2a 3b 4a 5a 6b 7a 8a 9b 10a 11b 12a
 1b 2a 3b 4b 5a 6b 7a 8b 9a 10a 11a 12b
 1b 2b 3a 4a 5a 6a 7a 8a 9a 10b 11b 12b
 1b 2b 3a 4b 5b 6b 7b 8b 9a 10b 11a 12a
 1b 2b 3b 4a 5b 6b 7a 8b 9b 10a 11a 12b
 1b 2b 3b 4b 5a 6b 7b 8a 9b 10b 11b 12b
jenny must be given at least n dimensions of features when you want to cover all n-tuples. A dimension is represented by just a number (the number of features in that dimension). The number of features in a dimension must be an integer in 2..52. Features in dimensions are implicitly named, in order, a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z. Dimensions are implicitly numbered 1..65535, in the order that they appear.
-n gives the n in n-tuples. It must be a number between 1 and 32. For example, -n2 means pairs and -n3 means triples. If there are several -n, the last one wins. The default for n is 2.
-s seeds the random number generator. It must be an integer in 0..2^32-1. The default is 0. Different seeds produce different tests.
-w gives a restriction, a "without". -w1a2bc5d means that the combination of the first dimension's first feature (1a), the second dimension's second or third feature (2bc), and the fifth dimension's fourth feature (5d) is disallowed. The position of withouts is irrelevant.
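For example, here is a restriction invented for the variable-comparison scenario above (with a='<', b='>', c='=' as the third dimension). If strings support '=' but not '<' or '>', you could forbid ordering comparisons whenever either variable is a string:

PROMPT>jenny 3 3 3 -w1c3ab -w2c3ab

-w1c3ab disallows any testcase combining the first dimension's third feature (a string) with the third dimension's first or second feature ('<' or '>'); -w2c3ab does the same for the second variable.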
-o is for old testcases. For example -ofoo.txt will read old testcases from the file foo.txt. It will include those testcases first in the output, then append new testcases as required. The set of inputs can be terminated by end-of-file or by a line starting with '.' . The old testcases should be as they appear in jenny output, with features given for every vector in order, separated by spaces, with a leading and trailing space. -o is useful for reusing the testcases that you've already written when you find you need to add a -w or change the size of some dimension in your jenny command.
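For example, a hypothetical workflow (the file name and dimensions here are invented): save today's testcases, then rerun jenny after the third dimension grows from two features to three:

PROMPT>jenny 2 2 2 > old.txt
PROMPT>jenny 2 2 3 -oold.txt

The second run includes the testcases from old.txt first, then appends just enough new testcases to cover the pairs involving the new feature 3c.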
-h prints out instructions for using jenny.
Strategy
Here are some heuristics for using jenny. By default (-n2) jenny covers all pairs of features; for example, three dimensions of two features each need only four testcases:

PROMPT>jenny 2 2 2
 1a 2a 3b
 1a 2b 3a
 1b 2a 3a
 1b 2b 3b
The work required to add a new feature or new dimension to an existing product is similar to the work required to update the pairwise tests to cover that new feature or dimension. It is not enough to just test the new feature; it has to work with all the other features. This cross-feature compatibility doesn't necessarily require a product with n features to require n*n work, but doubling or tripling the work per feature sounds about right. It is common for software teams to be doing no new features at all, but spending all their time running to keep up with a changing underlying base. This is the work of maintaining cross-feature compatibility, where the new features are in the underlying base rather than the actual product. Management often tries to destaff an old product while continuing to modify the underlying base, and the old product will mysteriously fail to continue working despite nothing changing in it ... this is what causes that. "Bitrot" is this lack of maintaining cross-feature compatibility.
Using jenny to test itself
The obvious way to show that jenny works in practice is to use it to test itself. I'm not sure whether -n2 or -n3 is appropriate. We'll be pessimistic and use -n3, that is, we'll cover all triples of features, and include as many dimensions as possible. So, here goes!
First I came up with all the dimensions of features I could think of. I only found 12 (jenny is a pretty simple program, and I'm not a tester by trade). I spelled them out in a sed script, jenny_gen.sed, that can translate jenny output into something short but interpretable. I spent some effort rearranging the dimensions so that I could type in a testcase as I read through the output without too much backtracking. Each dimension has three or four settings (each setting is a "feature").
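Although jenny_gen.sed itself isn't reproduced here, it presumably consists of one substitution per feature, along these lines (the phrase wording below is guessed back from the "# test1:" comment in the log further down):

# jenny_gen.sed (sketch): one substitution per feature of each dimension
s/ 1a / (use -n1) /
s/ 1b / (use -n2) /
s/ 1c / (use -n3) /
s/ 1d / (use -n4) /
s/ 2a / (put -n first) /
s/ 2b / (put -n last) /
# ...and so on, one s command for every feature of all 12 dimensions.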
Next came the command itself:
./jenny -n3 4 4 3 3 3 3 3 3 4 3 3 4 -w1abc2d -w1d2abc -w6ab7bc \
        -w6b8c -w6a8bc -w6a9abc -w6a10ab -w11a12abc -w11bc12d -w4c5ab \
        -w1a3a -w1a9a -w3a9c | sed -f jenny_gen.sed > jenny_gen.txt
Running that command recommended 120 testcases that would cover all triples of features. The first 26 cover all pairs of features. The first 5 cover every feature at least once.
For each of jenny's recommendations, I wrote a testcase by hand that contained all those features (jenny_test.sh). For example, here's the first jenny output, the handwritten testcase, and its results:
# test1: (use -n1) (put -n last) (between n and 10 dimensions) (a dimension of size 2) (all dimensions equal size) (several restrictions) (some restrictions fully duplicated) (a restriction comes after the dimensions it constrains) (one -s) (put -s first)
./jenny -s9 2 -n8 2 2 2 2 2 2 2 2 -w4b3a -w3a4b -w3a5a4a -w5b4a3a -w7a4b \
        -w8a3b4b -w9a5a6a7a4a -n1
Could not cover tuple 3a
 1a 2a 3b 4a 5b 6b 7a 8a 9b
 1b 2b 3b 4b 5a 6a 7b 8b 9a
Once I got rolling, it took me about 3 minutes on average to write a testcase. When finished, I had to proofread and correct the testcases. The output is about 700K (jenny_test.log). On Solaris, I could produce the log like so:
sh -v jenny_test.sh >& jenny_test.log
So, did it find any bugs? I wrote this test after doing some code coverage and playing with jenny for a few weeks, so it was already stable for all the inputs I could think of. But jenny still found 7 new bugs in all, in testcases #2, #10, #11, #11, #13, #13, and #35. It was clear the bugs were petering out once I'd covered all pairs of features (testcases #1 through #26), but I slogged through all triples just to demonstrate it could be done and to find out whether more bugs would be found. Perhaps more complicated software would find covering all triples worthwhile, but for testing jenny, -n2 was the right choice.
I could have written another program, specific to the thing being tested, to automatically translate the output into runnable testcases. Would that have taken longer than writing the tests by hand? Probably. Definitely, had I stopped after covering all pairs of features. I also added more variability in the testcases by hand than the dimensions specifically asked for; that would have been lost had I written a program to automatically produce runnable testcases. Writing the testcases by hand also allowed me to use complicated features. For example, one feature was "do the combined restrictions imply another implicit restriction?" That would have been difficult for an automatic test generator to deal with. Writing the testcases by hand appears to be worthwhile.
Algorithm
jenny first covers single features, then pairs of features, then triples, and so forth up to the n-tuples requested by the user. Each pass checks whether the existing tests cover all tuples; if not, it makes a list of uncovered tuples and adds more tests until all tuples are covered. It tries to find testcases that obey the restrictions and cover a lot of new tuples. Any tuple that it cannot cover, no matter how hard it tries, without disobeying some restriction, it reports as uncoverable and adds to its own list of restrictions. Then it goes on to one-bigger tuples. The running time is empirically about O(d^2.3 f^4.6) for d dimensions all of size f (for -n2).
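For the curious, here is a minimal C sketch of the greedy idea, written for this writeup rather than taken from jenny.c: it handles only pairs (-n2), ignores restrictions and deterministic seeding, and simply keeps the best of 100 random candidates each round. jenny itself additionally handles arbitrary n-tuples, honors withouts, and takes an -s seed, none of which this sketch attempts.

#include <stdio.h>
#include <stdlib.h>

#define D 12                     /* number of dimensions */
#define F  2                     /* features per dimension */

/* covered[i][a][j][b] is nonzero once some testcase has paired
   feature a of dimension i with feature b of dimension j (i < j). */
static char covered[D][F][D][F];

/* How many not-yet-covered pairs would this testcase cover? */
static int count_new(const int *t)
{
  int i, j, n = 0;
  for (i = 0; i < D; ++i)
    for (j = i + 1; j < D; ++j)
      n += !covered[i][t[i]][j][t[j]];
  return n;
}

/* Record every pair in this testcase as covered. */
static void mark(const int *t)
{
  int i, j;
  for (i = 0; i < D; ++i)
    for (j = i + 1; j < D; ++j)
      covered[i][t[i]][j][t[j]] = 1;
}

int main(void)
{
  int remaining = (D * (D - 1) / 2) * F * F;   /* pairs left to cover */

  while (remaining > 0) {
    int best[D], cand[D], bestnew = -1, attempt, i;

    /* Try random candidate testcases; keep the greediest one.  With
       small F a random candidate soon hits any remaining pair, so
       this terminates in practice. */
    for (attempt = 0; attempt < 100; ++attempt) {
      int n;
      for (i = 0; i < D; ++i) cand[i] = rand() % F;
      n = count_new(cand);
      if (n > bestnew) {
        bestnew = n;
        for (i = 0; i < D; ++i) best[i] = cand[i];
      }
    }

    mark(best);
    remaining -= bestnew;

    /* Print in jenny's style: dimension number, then feature letter. */
    for (i = 0; i < D; ++i) printf(" %d%c", i + 1, 'a' + best[i]);
    printf("\n");
  }
  return 0;
}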
Several improvements are possible.
Exhaustive testing grows exponentially with the number of dimensions to be tested together. But if you limit yourself to just covering all pairs (triples, quadruples, n-tuples) of features, and allow every testcase to cover many n-tuples, the number of testcases required grows only logarithmically with the number of dimensions. jenny is just one tool that exploits this. The site pairwise.org has a list of many such tools, and a comparison of the number of testcases each produces for various inputs. (So far it doesn't list which ones support restrictions. I think these tools are unusable unless they support restrictions. Several do, several don't.)