How should you celebrate a successful heisenbug hunt? Any five people would likely have ten different opinions, but whatever your choice, you should also fix your broken tests. Just what makes me say that your tests are broken? Consider the following definitions:
- The only bug-free programs are trivial programs.
- A reliable program has no known bugs.
From these two definitions, it clearly follows that any reliable non-trivial program has at least one bug that you don't know about. The fact that you don't know about it means that your tests have not yet found that bug, and therefore, as asserted above, your tests are broken. And yes, I am giving your program the benefit of the doubt by assuming that it is non-trivial, and in any case, I claim that the Linux kernel's RCU implementation is non-trivial. RCU therefore contains at least one bug I don't know about — one bug my tests didn't find — and I therefore need to raise my rcutorture game.
I fixed the rcutorture test process as follows:
- Choosing CONFIG_RCU_FANOUT=2 on an eight-CPU machine, which exercised code paths that would otherwise require 1024 CPUs.
- Decreasing the wait time between successive randomly chosen CPU-hotplug operations from three seconds to a hundred milliseconds (using “sleep 0.1” from a bash script!)
- Retaining the artificially low one-scheduler-clock-tick delay between invocations of force_quiescent_state().
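To see why a fanout of two helps on a small machine, here is a rough back-of-the-envelope sketch of how many levels the rcu_node tree needs for a given CPU count. The function name rcu_tree_levels() is made up for illustration, and the sketch assumes the same fanout at every level of the tree. The comparison fanout of 32 is likewise my assumption, chosen because it matches the 1024-CPU figure above (32 × 32 = 1024).

```python
def rcu_tree_levels(ncpus: int, fanout: int) -> int:
    """Levels of a uniform-fanout tree needed to cover ncpus CPUs.

    Hypothetical helper for illustration only; it is not kernel code and
    assumes every level of the tree uses the same fanout.
    """
    if ncpus < 1 or fanout < 2:
        raise ValueError("need at least one CPU and a fanout of at least 2")
    levels = 1
    capacity = fanout          # CPUs that a one-level tree can cover
    while capacity < ncpus:
        capacity *= fanout     # each extra level multiplies the capacity
        levels += 1
    return levels


if __name__ == "__main__":
    # Eight CPUs with a fanout of 2 already need a three-level tree.
    print(rcu_tree_levels(8, 2))
    # With an assumed fanout of 32, two levels cover up to 1024 CPUs,
    # so a third level appears only beyond that point.
    print(rcu_tree_levels(1024, 32), rcu_tree_levels(1025, 32))
```

Under these assumptions, the eight-CPU machine exercises the same three-level propagation paths that a fanout of 32 would reach only on a system with more than 1024 CPUs.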
This improved testing regimen resulted in grace-period hangs, but only with CONFIG_TREE_PREEMPT_RCU. The really nice thing about this improved test was that the failures occurred within a few minutes, which permits use of printk() debugging, a rare luxury when working with RCU infrastructure. I nevertheless suppressed my urge to wildly code up random printk() statements in favor of enabling the RCU tracing infrastructure. The tracing data clearly showed that all leaf rcu_node structures had recorded quiescent-state passage for all their CPUs, but that this information had not propagated up the rcu_node hierarchy. Given that these grace-period hangs occurred only in CONFIG_TREE_PREEMPT_RCU, this data fingered two possible code paths:
- rcu_read_unlock_special() invoked from a task that had blocked within the prior RCU read-side critical section, and
- __rcu_offline_cpu() when offlining the last CPU from a given leaf rcu_node structure when that structure has queued a task that has blocked within its current RCU read-side critical section.
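The symptom in the tracing data can be pictured with a toy model of the hierarchy. This is only a sketch: ToyNode, build_tree(), and report_qs() are hypothetical names, the model ignores blocked tasks and locking entirely, and the propagate flag is merely a stand-in for "something failed to push the report upward", not the actual bug.

```python
class ToyNode:
    """Toy stand-in for an rcu_node: a bitmask of children still pending."""

    def __init__(self):
        self.parent = None
        self.bit = 1       # this node's bit in its parent's pending mask
        self.pending = 0   # children (or CPUs) that still owe a quiescent state


def build_tree(ncpus, fanout):
    """Build a uniform-fanout tree; return (root, leaves)."""
    leaves = [ToyNode() for _ in range(-(-ncpus // fanout))]
    for cpu in range(ncpus):
        leaves[cpu // fanout].pending |= 1 << (cpu % fanout)
    level = leaves
    while len(level) > 1:
        parents = [ToyNode() for _ in range(-(-len(level) // fanout))]
        for i, child in enumerate(level):
            child.parent = parents[i // fanout]
            child.bit = 1 << (i % fanout)
            child.parent.pending |= child.bit
        level = parents
    return level[0], leaves


def report_qs(leaves, cpu, fanout, propagate=True):
    """Record a quiescent state for cpu, walking upward as masks empty."""
    node = leaves[cpu // fanout]
    bit = 1 << (cpu % fanout)
    while node is not None:
        node.pending &= ~bit
        if node.pending or not propagate:
            return             # siblings still pending, or the report is lost
        bit = node.bit         # this node is done: clear its bit in the parent
        node = node.parent


if __name__ == "__main__":
    for propagate in (True, False):
        root, leaves = build_tree(8, 2)
        for cpu in range(8):
            report_qs(leaves, cpu, 2, propagate)
        leaves_done = all(leaf.pending == 0 for leaf in leaves)
        print(propagate, leaves_done, root.pending == 0)
```

In the toy model, the grace period can end only when the root's mask reaches zero. With propagate=False, every leaf shows all of its CPUs as done while the root still waits, which is the same shape as the hang described above.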
A little code inspection on these two code paths located the bug and the fix. With this fix, RCU appears to be quite reliable.
So how should I celebrate success in this latest hunt?
