A unit test of concurrency always has some problems to solve:
The simplest solution to that is to do a trial with n threads that each make m calls, each to a random routine with random arguments. It is guaranteed to finish if individual calls are guaranteed to finish. You can test invariants that are true when that routine and that combination of arguments is used, or when some other state of the system is true. However, in itself it doesn't guarantee that any particular invariant will be tested. It might be that the conditions where that invariant would be tested never happen. There are more complicated trials (call some routines more frequently, do calls in a certain pattern, stop the trial when some state is reached rather than when n calls have been done), but having each of the n threads call m random routines is the simplest design.
Doing a set of concurrent calls, then resetting everything to its initial state, is a trial. You could increase your chances of hitting tests by doing some large fixed number of trials, but that alone would still not guarantee that any particular test is hit.
A solution is to have a flag per test. Set the flag when the test is hit. Continue doing trials until all flags have been set. For example, suppose there are x tests numbered 0..x-1, where x is at most 32. Make a global 32-bit integer F and initialize it to 0. When the i'th test is hit, do F |= (1<<i). When all tests have been hit at least once, F will have value (1<<x)-1. So, do trials until all flags are set (that is, F == (1<<x)-1), or too many trials have passed. This guarantees that all tests get hit, and that they get hit in a bounded amount of time, otherwise the test fails.
That solution might be too simple. The assignment F |= (1<<i) is concurrent. Some assignments might get lost. For example you could set F |= 1 and F |= 2 concurrently, which maps to F' = F (value 0), F'' = F (value 0), F = F' | 1 (value 1), then F = F'' | 2 (value 2), where you really wanted (1|2)==3, see the 1 got lost. If you are OK with maybe waiting until the test is hit two or three times, this is probably OK. You can reduce the chances of this race by testing before setting: if ((F & (1<<i)) == 0) F |= (1<<i);. That's almost certainly good enough, the race would only hit tests that happen very frequently. Infrequent tests would happen long after the start of the test and probably wouldn't have any other tests racing with them to write to F. You could further eliminate the race entirely by using interlocked assignments.
An easy optimization is to only have a flag per code block containing tests, rather than a flag per test, since if one test in the block is hit then all of them will be hit.
How long do you have to wait? If your least frequent test is hit within q trials half the time, then on average you have to wait 2q trials, and the chance of having to wait over qz trials is 2-z. One in a million is 2-20, so if the test does at most 20q trials then it will pass all but one in a million times. 30q is is one in a billion, 40q is one in a trillion. If you have only 2-r chance of not having all flags set after a single trial, then 40/r trials will pass all but one in a trillion times.
How do you tell how frequently flags are set? In the process of writing the tests, fix the number of trials to 1000, and for each flag keep a count of how many trials set that flag. This lets to see how frequently flags are set, and makes it easy to tune the trials to adjust flag frequencies. After you have flag frequencies that you like, you can change the test to have the right max number of trials and also stop early if all flags are set.
This strategy cannot test all cases. It just lets you cover the ones that are reached frequently. A way to increase the number of reachable cases is to reduce the problem size. Like use 3-bit integers instead of 64-bit integers, or use lists of length 4 instead of length 100. The goal is to keep all the edge cases, but make them a larger percentage of total cases, making more edge cases reachable.
Another useful technique is to have a history array. If a trial has n threads each doing m calls, make the history array of size nm, zero the array offset before each trial, use interlocked increments to choose an array slot after each call, and fill the array slot with a description of the thread and the call and its arguments. That way, when a test fails, you can look at the history of that particular trial to see what commands led up to it. The array gets filled after calls are done, and not quite in the same order as the calls, but still it's better than having no idea what calls led up to a test failing. The array entries from a single thread are always ordered correctly, even though entries between threads are not.