### Stupid RCU Tricks: Failure Probability and CPU Count


So `rcutorture` found a bug, whether in RCU or elsewhere, and it is now time to reproduce that bug, whether to make good use of `git bisect` or to verify an alleged fix. One problem is that, `rcutorture` being what it is, that bug is likely a race condition, and it likely takes longer than you would like to reproduce. Assuming that it reproduces at all.

How to make it reproduce faster? Or at all, as the case may be?

One approach is to tweak the Kconfig options and maybe even the code to make the failure more probable. Another is to find a “near miss” that is related to and more probable than the actual failure.

But given that we are trying to make a race condition happen more frequently, it is only natural to try tweaking the number of CPUs. After all, one would hope that increasing the number of CPUs would increase the probability of hitting the race condition. So the straightforward answer is to use all available CPUs.

But how to use them? Run a single `rcutorture` scenario covering all the CPUs, give or take the limitations imposed by qemu and KVM? Or run many instances of that same scenario, with each instance using a small fraction of the available CPUs?

As is so often the case, the answer is: “It depends!”

If the race condition happens randomly between any pair of CPUs, then bigger is better. To see this, consider the following old-school ASCII-art comparison:

```
+---------------------+
|        N * M        |
+---------------------+

+---+---+---+-----+---+
| N | N | N | ... | N |
+---+---+---+-----+---+
```

If there are n CPUs that can participate in the race condition, then at any given time there are n(n-1)/2 possible races. The upper row has N*M CPUs, and thus N*M*(N*M-1)/2 possible races. The lower row has M sets of N CPUs, and thus M*N*(N-1)/2, which is almost a factor of M smaller. For this type of race condition, you should therefore run a small number of scenarios, each using as many CPUs as possible, and preferably only one scenario that uses all of the CPUs. For example, to make the `TREE03` scenario run on 64 CPUs, edit the `tools/testing/selftests/rcutorture/configs/rcu/TREE03` file so as to set `CONFIG_NR_CPUS=64`.

But there is no guarantee that the race condition will be such that all CPUs participate with equal probability. For example, suppose that the bug was due to a race between RCU's grace-period kthread (named either `rcu_preempt` or `rcu_sched`, depending on your Kconfig options) and its expedited grace period, which at any given time will be running on at most one workqueue kthread.

In this case, no matter how many CPUs were available to a given `rcutorture` scenario, at most two of them could be participating in this race. It is instead best to run as many two-CPU `rcutorture` scenarios as possible, give or take the memory footprint of that many guest OSes (one per `rcutorture` scenario). For example, to make 32 `TREE03` scenarios run on 64 CPUs, edit the `tools/testing/selftests/rcutorture/configs/rcu/TREE03` file so as to set `CONFIG_NR_CPUS=2`, and remember to pass either the `--allcpus` or the `--cpus 64` argument to `kvm.sh`.

What happens in real life?
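To make the pair-counting arithmetic concrete, here is a quick shell sketch; the values N=8 and M=8 are illustrative choices, not from the original text:

```shell
#!/bin/sh
# Number of distinct CPU pairs that can race among n CPUs: n*(n-1)/2.
pairs() { echo $(( $1 * ($1 - 1) / 2 )); }

N=8   # CPUs per small scenario (illustrative)
M=8   # number of small scenarios (illustrative)

big=$(pairs $(( N * M )))        # one 64-CPU scenario: 64*63/2
small=$(( M * $(pairs "$N") ))   # eight 8-CPU scenarios: 8 * (8*7/2)
echo "$big vs $small"
```

This prints `2016 vs 224`: for an any-pair race, the single large scenario offers roughly M times as many simultaneous racing opportunities, matching the N*M*(N*M-1)/2 versus M*N*(N-1)/2 comparison above.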

For a race condition that `rcutorture` uncovered during the v5.8 merge window, running one large `rcutorture` instance instead of 14 smaller ones (very) roughly doubled the probability of locating the race condition.

In other words, real life is completely capable of lying somewhere between the two theoretical extremes outlined above.
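As a sketch, the two strategies discussed above might be launched as follows. This assumes you are at the top of a Linux kernel source tree and have already edited `CONFIG_NR_CPUS` in the `TREE03` config file as described; the `"32*TREE03"` repeat syntax for `--configs` is a `kvm.sh` convenience for running multiple instances of one scenario:

```shell
cd tools/testing/selftests/rcutorture

# One large scenario: with CONFIG_NR_CPUS=64 in configs/rcu/TREE03,
# give the single run all available CPUs:
bin/kvm.sh --allcpus --configs TREE03

# Many small scenarios: with CONFIG_NR_CPUS=2, run 32 instances of
# TREE03 spread across 64 CPUs:
bin/kvm.sh --cpus 64 --configs "32*TREE03"
```

Which of the two is more effective depends, as above, on how many CPUs can actually participate in the race you are chasing.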