You are viewing paulmck

Hunting More Heisenbugs

How should you celebrate a successful heisenbug hunt? Any five people would likely have ten different opinions, but whatever your choice, you should also fix your broken tests. Just what makes me say that your tests are broken? Consider the following definitions:

  1. The only bug-free programs are trivial programs.
  2. A reliable program has no known bugs.

From these two definitions, it clearly follows that any reliable non-trivial program has at least one bug that you don't know about. The fact that you don't know about it means that your tests have not yet found that bug, and therefore, as asserted above, your tests are broken. And yes, I am giving your program the benefit of the doubt by assuming that it is non-trivial, and in any case, I claim that the Linux kernel's RCU implementation is non-trivial. RCU therefore contains at least one bug I don't know about — one bug my tests didn't find — and I therefore need to raise my rcutorture game.

I fixed the rcutorture test process as follows:

  1. Choosing CONFIG_RCU_FANOUT=2 on an eight-CPU machine, which exercised code paths that would otherwise require 1024 CPUs.
  2. Decreasing the wait time between successive randomly chosen CPU-hotplug operations from three seconds to a hundred milliseconds (using “sleep 0.1” from a bash script!).
  3. Retaining the artificially low one-scheduler-clock-tick delay between invocations of force_quiescent_state().
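For readers who want to try something like item 2, here is a minimal sketch of a random CPU-hotplug loop. It is not the script I used (that was bash), and the function names and parameters are invented for illustration. The only thing taken from the real system is the sysfs file /sys/devices/system/cpu/cpuN/online, which needs root to write. The sketch assumes that every listed CPU starts out online and that CPU 0 is left alone, since many systems do not allow it to be offlined.

```python
import random
import time

# Real Linux sysfs control file: writing "0" or "1" offlines or onlines the CPU.
SYSFS_ONLINE = "/sys/devices/system/cpu/cpu{}/online"


def write_online(cpu, online):
    """Ask the kernel to online or offline one CPU via sysfs (needs root)."""
    with open(SYSFS_ONLINE.format(cpu), "w") as f:
        f.write("1" if online else "0")


def hotplug_stress(cpus, iterations, delay_s=0.1,
                   set_online=write_online, sleep=time.sleep, rng=random):
    """Toggle randomly chosen CPUs, pausing delay_s between operations.

    Hypothetical helper: assumes every CPU in `cpus` starts out online.
    The set_online, sleep, and rng hooks exist only so that the loop can
    be dry-run without root privileges.
    """
    online = {cpu: True for cpu in cpus}
    for _ in range(iterations):
        cpu = rng.choice(cpus)
        online[cpu] = not online[cpu]      # flip the chosen CPU's state
        set_online(cpu, online[cpu])
        sleep(delay_s)
    # Bring back anything still offline so the machine ends up as it started.
    for cpu in cpus:
        if not online[cpu]:
            set_online(cpu, True)
            online[cpu] = True


if __name__ == "__main__":
    # Dry run: print each operation instead of touching sysfs.
    hotplug_stress(
        [1, 2, 3, 4, 5, 6, 7], iterations=10, delay_s=0.0,
        set_online=lambda cpu, on: print(cpu, "online" if on else "offline"),
        sleep=lambda seconds: None,
    )
```

Dropping the set_online and sleep overrides and running as root would perform real hotplug operations at the default hundred-millisecond spacing, so try that only on a machine you can afford to hang.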

This improved testing regimen resulted in grace-period hangs, but only with CONFIG_TREE_PREEMPT_RCU. The really nice thing about this improved test was that the failures occurred within a few minutes, which permits use of printk() debugging, a rare luxury when working with RCU infrastructure. I nevertheless suppressed my urge to wildly code up random printk() statements in favor of enabling the RCU tracing infrastructure. The tracing data clearly showed that all leaf rcu_node structures had recorded quiescent-state passage for all their CPUs, but that this information had not propagated up the rcu_node hierarchy. Given that these grace-period hangs occurred only in CONFIG_TREE_PREEMPT_RCU, this data fingered two possible code paths:

  1. rcu_read_unlock_special() invoked from a task that had blocked within the prior RCU read-side critical section, and
  2. __rcu_offline_cpu() when offlining the last CPU from a given leaf rcu_node structure when that structure has queued a task that has blocked within its current RCU read-side critical section.
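To make the symptom concrete, here is a toy model of quiescent-state propagation. It is not kernel code: the Node class, its fields, and every function name are invented for illustration, and the propagate=False switch is a deliberately injected fault, not a description of the actual bug. The model builds a fanout-2 tree over eight CPUs (three levels), lets each leaf wait both for its CPUs and for any reader that blocked within its read-side critical section, and shows how a missed upward report leaves every leaf finished while the root waits forever.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Toy stand-in for an rcu_node structure (hypothetical fields)."""
    parent: Optional["Node"] = None
    index_in_parent: int = 0   # bit this node owns in parent.pending
    pending: int = 0           # bitmask of CPUs/children that have not yet reported
    blocked_readers: int = 0   # readers that blocked on this node (leaves only)


def build_tree(ncpus, fanout):
    """Return levels of the tree: levels[0] is [root], levels[-1] holds the leaves."""
    counts, n = [], ncpus
    while True:
        n = -(-n // fanout)    # ceiling division: nodes needed at this level
        counts.append(n)
        if n == 1:
            break
    levels = [[Node() for _ in range(c)] for c in reversed(counts)]
    for upper, lower in zip(levels, levels[1:]):
        for i, child in enumerate(lower):
            child.parent = upper[i // fanout]
            child.index_in_parent = i % fanout
    return levels


def start_grace_period(levels, ncpus, fanout):
    """Mark every CPU and every child node as still owing a report."""
    for level in levels:
        for node in level:
            node.pending = 0
    for cpu in range(ncpus):
        levels[-1][cpu // fanout].pending |= 1 << (cpu % fanout)
    for lower in levels[1:]:
        for child in lower:
            child.parent.pending |= 1 << child.index_in_parent


def report_up(node):
    """Clear this node's bit in its parent, repeating while nodes complete."""
    while (node.pending == 0 and node.blocked_readers == 0
           and node.parent is not None):
        node.parent.pending &= ~(1 << node.index_in_parent)
        node = node.parent


def cpu_quiescent(levels, cpu, fanout):
    """Record a quiescent state for one CPU and propagate if the leaf is done."""
    leaf = levels[-1][cpu // fanout]
    leaf.pending &= ~(1 << (cpu % fanout))
    report_up(leaf)


def reader_unblocked(leaf, propagate=True):
    """A blocked reader leaves its critical section.

    propagate=False injects a fault for illustration only: the leaf
    becomes idle but nobody tells its parent.
    """
    leaf.blocked_readers -= 1
    if propagate:
        report_up(leaf)


def grace_period_done(levels):
    root = levels[0][0]
    return root.pending == 0 and root.blocked_readers == 0


if __name__ == "__main__":
    for propagate in (True, False):
        levels = build_tree(8, 2)
        start_grace_period(levels, 8, 2)
        levels[-1][0].blocked_readers = 1      # one reader blocked on leaf 0
        for cpu in range(8):
            cpu_quiescent(levels, cpu, 2)
        reader_unblocked(levels[-1][0], propagate=propagate)
        leaves_idle = all(l.pending == 0 and l.blocked_readers == 0
                          for l in levels[-1])
        print("propagate:", propagate, "leaves idle:", leaves_idle,
              "grace period done:", grace_period_done(levels))
```

In the faulty run the leaves all look finished but the grace period never ends, which is the same shape as what the tracing data showed. The point of the model is only that each path able to make a leaf idle must also be responsible for reporting upward.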

A little code inspection on these two code paths located the bug and the fix. With this fix, RCU appears to be quite reliable.

So how should I celebrate success in this latest hunt?


Nov. 22nd, 2009 09:53 pm (UTC)
I might consider that ...
What sort of organization and presentation of this material would seem most effective to you?