
Hunting Heisenbugs

My children's opinions notwithstanding, I recently found myself pursuing some nasty concurrency bugs in Linux's TREE_RCU implementation. This was not particularly surprising, given that I recently added preemption support, and the code really hadn't put up that much of a fight. In fact, I was getting the feeling that the bugs had gotten together and decided to hide out, the better to ambush me. This feeling wasn't far from wrong.

My first hint of trouble appeared when I began running longer sequences of rcutorture runs, seeing an occasional failure on one-hour runs. My first reaction was to increase the duration to ten hours and attempt to bisect the problem. Of course, even with bisection, finding the bug takes quite some time given ten hours for each probe, so rather than use “git bisect”, I manually ran parallel runs and (for example) quadrisected. I also ran multiple configurations. The results initially hinted that CONFIG_NO_HZ might have something to do with it, but later runs showed no shortage of failures in !CONFIG_NO_HZ runs as well.

The initial results of the bisection were quite puzzling, converging on a commit that could not possibly change RCU's behavior. Then I noticed that one of the machines seemed to be generating more failures than others, and, sure enough, this difference in failure rate was responsible for the false convergence. I therefore started keeping more careful records, including the machine name, test duration, configuration parameters, commit number, and number of errors for each run. These records proved extremely helpful later on.
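Something like the following C structure captures the idea; the field names and sizes are purely my illustration of the records just described, not the actual scripts' format:

    /* Illustrative only: one way to record each rcutorture run. */
    struct rcu_run_record {
            char machine[32];               /* name of the test machine */
            unsigned long duration;         /* test duration in seconds */
            char config[128];               /* configuration parameters, e.g. "CONFIG_NO_HZ=y" */
            char commit[41];                /* git commit ID under test */
            unsigned int nr_errors;         /* number of rcutorture errors observed */
    };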

Further testing showed that 2.6.32-rc1 (AKA 2.6.32-rc2) was reliable, even for the error-prone machine, and that 2.6.32-rc3 was buggy. Unfortunately, there are no RCU-related commits between 2.6.32-rc1 and 2.6.32-rc3. Unless you count commit #828c0950, which simply applies const to a few data structures involved in RCU tracing, which I don't and you shouldn't. So I ran a few more runs on 2.6.32-rc1, and eventually did trigger a failure. In contrast, 2.6.31 was rock solid.

Now there are quite a few RCU-related patches between 2.6.31 and 2.6.32-rc1, so I started searching for the offending commit. However, by this time I had written scripts to analyze rcutorture output, which I used to check the status of the test runs, stopping runs as soon as they generated an error. This sped things up considerably, because failed runs now took on average only a few hours rather than the 10 hours I was using as a (rough) success criterion.
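A minimal sketch of such an early-stop check (not the actual scripts; the console-log path and the "!!!" failure marker below are assumptions) is a trivial watcher that polls the log and exits non-zero as soon as a failure appears, so that a wrapper can kill that run:

    /* Minimal sketch: watch a console log and bail out on the first failure. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            const char *path = argc > 1 ? argv[1] : "console.log";
            char line[1024];
            FILE *fp;

            for (;;) {
                    fp = fopen(path, "r");
                    if (fp != NULL) {
                            while (fgets(line, sizeof(line), fp) != NULL) {
                                    if (strstr(line, "!!!") != NULL) {
                                            /* Assumed failure marker: stop this run early. */
                                            fclose(fp);
                                            return 1;
                                    }
                            }
                            fclose(fp);
                    }
                    sleep(60);      /* re-scan once a minute */
            }
    }

A real harness would of course also enforce the ten-hour success cutoff and record the result, but the point is simply that failing runs terminate as soon as their first error appears.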

Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?

Testing eventually converged on commit #b8d57a76. By this time, I was getting a bit paranoid, so I ran no fewer than three ten-hour runs at the preceding commit on the most error-prone machine, none of which failed. However, this commit does not touch RCU itself; instead, it makes rcutorture testing more severe, inserting delays of up to 50 milliseconds in RCU read-side critical sections. I therefore cherry-picked this commit back onto 2.6.31 and 2.6.30, and, sure enough, got failures in both cases. As it turned out, I was dealing with a day-one bug in TREE_RCU.

This did simplify matters, permitting me to focus my testing efforts on the most recent version of RCU rather than spreading them across every change since 2.6.31. In addition, the fact that long-running RCU read-side critical sections triggered the bug told me roughly where the bug had to be: force_quiescent_state() or one of the functions it calls, because that function runs more often in the face of long-running RCU read-side critical sections. This also explained the earlier CONFIG_NO_HZ results, because one of force_quiescent_state()'s responsibilities is detecting dyntick-idle CPUs. Finally, it raised the possibility that the bug was unrelated to memory ordering, which motivated me to try a few runs on x86. To my surprise, these runs produced much higher failure rates than did the earlier tests on the Power machines.

I stubbed out force_quiescent_state() to check my assumption that it was to blame (but please, please do not do this on production systems!!!). Doing so resulted in a statistically significant 3x decrease in failures on the x86 machine, confirming my assumption, at least for some subset of the bugs. Now that there was a much smaller body of code to inspect, I was able to locate one race involving mishandling of the ->completed values. Fixing this race reduced the error rate on the x86 machine by roughly the same amount as did stubbing out force_quiescent_state(). One bug down, but more bugs still hiding.
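For the record, "stubbing out" here means nothing more sophisticated than an early return at the top of the function. The sketch below is a debugging hack only, with the function signature approximated from the 2.6.32-era kernel/rcutree.c:

    /* Debugging hack only: neutralize force_quiescent_state() entirely. */
    static void force_quiescent_state(struct rcu_state *rsp, int relaxed)
    {
            return;         /* bypass all forcing of quiescent states */

            /* ... original function body, now unreachable ... */
    }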

I was also now in a position to take some good advice from Ingo Molnar: when you see a failure, work to increase the failure rate. This might seem counter-intuitive, but the more frequent the failures, the shorter the test runs, and the faster you can find the bug. I therefore changed the value of RCU_JIFFIES_TILL_FORCE_QS from three to one, which increased the failure rate by well over an order of magnitude on the x86 machine.
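The change itself is a one-line tweak to the RCU_JIFFIES_TILL_FORCE_QS definition, shown approximately below (the exact location and surrounding comments in the 2.6.32-era kernel/rcutree.h may differ, and again this is a debugging aid only):

    /* Debugging aid only: wait one jiffy instead of three before forcing. */
    /* was:  #define RCU_JIFFIES_TILL_FORCE_QS       3 */
    #define RCU_JIFFIES_TILL_FORCE_QS       1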

Quick Quiz 2: How could increasing the frequency of force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?

Given that the race I found involved unsynchronized access to the ->completed values, it made sense to look at other unsynchronized accesses. I found three other such issues, and testing of the resulting patches has thus far turned up zero rcutorture failures.
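To give a flavor of this class of bug, here is a simplified illustration (not the actual TREE_RCU code or the actual patches): reading a shared grace-period counter such as ->completed more than once without holding the lock that guards it lets the value change underfoot, so one fix is to sample it exactly once under that lock.

    /* Simplified illustration of an unsynchronized-access bug and one fix. */
    #include <linux/spinlock.h>

    struct gp_state {
            spinlock_t lock;                /* guards updates to ->completed */
            unsigned long completed;        /* number of completed grace periods */
    };

    /* Buggy: ->completed may change between the two unsynchronized reads. */
    static void buggy_note_completion(struct gp_state *gs, unsigned long *snap)
    {
            if (gs->completed != *snap)     /* read #1 */
                    *snap = gs->completed;  /* read #2: may see a different value */
    }

    /* Fixed: sample ->completed exactly once, under the lock that guards it. */
    static void fixed_note_completion(struct gp_state *gs, unsigned long *snap)
    {
            unsigned long c;

            spin_lock(&gs->lock);
            c = gs->completed;
            spin_unlock(&gs->lock);
            if (c != *snap)
                    *snap = c;
    }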

And it only took 882.5 hours of machine time to track down these bugs. :-)

This raises the question of why these bugs happened in the first place. After all, I do try to be quite careful with RCU-infrastructure code. In this case, it appears that these bugs were inserted during a bug-fixing session fairly late in the TREE_RCU effort. Bug-fixing is often done under considerably more time pressure than is production of new code, and the mistake in this case was failing to follow up with more careful analysis.

Another question is the number of bugs remaining. This is of course hard to say at present, but Murphy would assert that, no matter what you do, there will always be at least a few more bugs.

Answer to Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?

Answer to Quick Quiz 2: How could increasing the frequency of force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?

Comments

paulmck
Nov. 6th, 2009 10:41 pm (UTC)
Couldn't agree more!!!
In fact, doing just that found a missing increment in synchronize_sched_expedited()...

And possibly a much better way to represent the ->gpnum and ->completed counters, though I need to think about this some more.