?

Log in

No account? Create an account

Previous Entry | Next Entry

My previous posting described an RCU bug that I might plausibly blame on falsehoods from firmware. The RCU bug in this post, alas, I can blame only on myself.

In retrospect, things were going altogether too smoothly while I was getting my RCU commits ready for the 3.19 merge window. That changed suddenly when my automated testing kicked out a “BUG: FAILURE, 1 instances”. This message indicates a grace-period failure, in other words that RCU failed to be RCU, which is of course really really really bad. Fortunately, it was still some weeks until the merge window, so there was some time for debugging and fixing.

Of course, we all have specific patches that we are suspicious of. So my next step was to revert suspect patches and to otherwise attempt to outguess the bug. Unfortunately, I quickly learned that the bug is difficult to reproduce, requiring something like 100 hours of focused rcutorture testing. Bisection based on 100-hour tests would have consumed the remainder of 2014 and a significant fraction of 2015, so something better was required. In fact, something way better was required because there was only a very small number of failures, which meant that the expected test time to reproduce the bug might well have been 200 hours or even 300 hours instead of my best guess of 100 hours.

My first attempt at “something better” was to inspect the suspect patches. This effort did locate some needed fixes, but nothing that would explain the grace-period failures. My next attempt was to take a closer look at the dmesg logs of the two runs with failures, which produced much better results.

You see, rcutorture detects failures using the RCU API, for example, invoking synchronize_rcu() and waiting for it to return. This overestimates the grace-period duration, because RCU's grace-period primitives wait not only for a grace period to elapse, but also for the RCU core to notice their requests and also for RCU to inform them of the grace period's end, as shown in the following figure.

RCUnearMiss.svg

This means that a given RCU read-side critical section might overlap a grace period by a fair amount without rcutorture being any the wiser. However, some RCU implementations provide rcutorture access to the underlying grace-period counters, which in theory provide rcutorture with a much more precise view of each grace period's duration. These counters have long recorded in the rcutorture output as “Reader Batch” counts as shown on the second line of the following (the first line is the full-API data):


rcu-torture: !!! Reader Pipe:  13341924415 88927 1 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 1 0 0 0 0 0 0 0 0



This shows rcutorture output from a run containing a failure. On each line, the first two numbers correspond to legal RCU activity: An RCU read-side critical section might have been entirely contained within a single RCU grace period (first number) or it might have overlapped the boundary between an adjacent pair of grace periods (second number). However, a single RCU read-side critical section is most definitely not allowed to overlap three different grace periods, because that would mean that the middle grace period was by definition too short. And exactly that error occured above, indicated by the exclamation marks and the “1” in the third place in the “Reader Pipe” line above.

If RCU was working perfectly, in theory, the output would instead look something like this, without the exclamation marks and without non-zero value in the third and subsequent positions:


rcu-torture: Reader Pipe:  13341924415 88927 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 0 0 0 0 0 0 0 0 0



In practice, the “Reader Batch” counters were intended only for statistical use, and access to them is therefore unsynchronized, as indicated by the jagged grace-period-start and grace-period-end lines in the above figure. Furthermore, any attempt to synchronize them in rcutorture's RCU readers would incur so much overhead that rcutorture could not possibly do a good job of torturing RCU. This means that these counters can result in false positives, and false positives are not something that I want in my automated test suite. In other words, it is at least theoretically possible that we might legitimately see something like this from time to time:


rcu-torture: Reader Pipe:  13341924415 88927 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 1 0 0 0 0 0 0 0 0



We clearly need to know how bad this false-positive problem is. One way of estimating the level of false-positive badness is to scan my three months worth of rcutorture test results. This scan showed that there were no suspicious Reader Batch counts in more than 1,300 hours of rcutorture testing, aside from the roughly one per hour from the TREE03 test, which happened to also be the only test to produce real failures. This suggested that I could use the Reader Batch counters as indications of a “near miss” to guide my debugging effort. The once-per-hour failure rate suggested a ten-hour test duration, which I was able compress into two hours by running five concurrent tests.

Why ten hours?

Since TREE03 had been generating Reader Batch near misses all along, this was not a recent problem. Therefore, git bisect was more likely to be confused by unrelated ancient errors than to be of much help. I therefore instead bisected by configuration and by rcutorture parameters. This alternative bisection showed that the problem occurred only with normal grace periods (as opposed to expedited grace periods), only with CONFIG_RCU_BOOST=y, and only with concurrent CPU-hotplug operations. Of course, one possibility was that the bug was in rcutorture rather than RCU, but rcutorture was exonerated via priority-boost testing on a kernel built with CONFIG_RCU_BOOST=n, which showed neither failures nor Reader Batch near misses. This combination of test results points the finger of suspicion directly at the rcu_preempt_offline_tasks() function, which is the only part of RCU's CPU-hotplug code path that has a non-trivial dependency on CONFIG_RCU_BOOST.

One very nice thing about this combination is that it is very unlikely that people are encountering this problem in production. After all, for this problem to appear, you must be doing the following:


  1. Running a kernel with both CONFIG_PREEMPT=y and CONFIG_RCU_BOOST=y.
  2. Running on hardware containing more than 16 CPUs (assuming the default value of 16 for CONFIG_RCU_FANOUT_LEAF).
  3. Carrying out frequent CPU-hotplug operations, and so much so that a given block of 16 CPUs (for example, CPUs 0-15 or 16-31) might reasonably all be offline at the same time.


That said, if this does describe your system and workload, you should consider applying this patch set. You should also consider modifying your workload.

Returning to the rcu_preempt_offline_tasks() function that the finger of suspicion now points to:


 1 static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
 2                                      struct rcu_node *rnp,
 3                                      struct rcu_data *rdp)
 4 {
 5   struct list_head *lp;
 6   struct list_head *lp_root;
 7   int retval = 0;
 8   struct rcu_node *rnp_root = rcu_get_root(rsp);
 9   struct task_struct *t;
10 
11   if (rnp == rnp_root) {
12     WARN_ONCE(1, "Last CPU thought to be offlined?");
13     return 0;
14   }
15   WARN_ON_ONCE(rnp != rdp->mynode);
16   if (rcu_preempt_blocked_readers_cgp(rnp) && rnp->qsmask == 0)
17     retval |= RCU_OFL_TASKS_NORM_GP;
18   if (rcu_preempted_readers_exp(rnp))
19     retval |= RCU_OFL_TASKS_EXP_GP;
20   lp = &rnp->blkd_tasks;
21   lp_root = &rnp_root->blkd_tasks;
22   while (!list_empty(lp)) {
23     t = list_entry(lp->next, typeof(*t), rcu_node_entry);
24     raw_spin_lock(&rnp_root->lock);
25     smp_mb__after_unlock_lock();
26     list_del(&t->rcu_node_entry);
27     t->rcu_blocked_node = rnp_root;
28     list_add(&t->rcu_node_entry, lp_root);
29     if (&t->rcu_node_entry == rnp->gp_tasks)
30       rnp_root->gp_tasks = rnp->gp_tasks;
31     if (&t->rcu_node_entry == rnp->exp_tasks)
32       rnp_root->exp_tasks = rnp->exp_tasks;
33 #ifdef CONFIG_RCU_BOOST
34     if (&t->rcu_node_entry == rnp->boost_tasks)
35       rnp_root->boost_tasks = rnp->boost_tasks;
36 #endif
37     raw_spin_unlock(&rnp_root->lock);
38   }
39   rnp->gp_tasks = NULL;
40   rnp->exp_tasks = NULL;
41 #ifdef CONFIG_RCU_BOOST
42   rnp->boost_tasks = NULL;
43   raw_spin_lock(&rnp_root->lock);
44   smp_mb__after_unlock_lock();
45   if (rnp_root->boost_tasks != NULL &&
46       rnp_root->boost_tasks != rnp_root->gp_tasks &&
47       rnp_root->boost_tasks != rnp_root->exp_tasks)
48     rnp_root->boost_tasks = rnp_root->gp_tasks;
49   raw_spin_unlock(&rnp_root->lock);
50 #endif
51   return retval;
52 }



First, we should of course check the code under #ifdef CONFIG_RCU_BOOST. The code starting at line 43 is quite suspicious, as it is not clear why it is safe to make these modifications after having dropped the rnp_root structure's ->lock on line 37. And removing lines 43-49 does in fact reduce the number of Reader Batch near misses by an order of magnitude, but sadly not to zero.

I was preparing to dig into this function to find the bug, but then I noticed that the loop spanning lines 22-38 is executed with interrupts disabled. Given that the number of iterations through this loop is limited only by the number of tasks in the system, this loop is horrendously awful for real-time response.

Or at least it is now—at the time I wrote that code, the received wisdom was “Never do CPU-hotplug operations on a system that runs realtime applications!” I therefore had invested zero effort into maintaining good realtime latencies during these operations. That expectation has changed because some recent real-time applications offline then online CPUs in order to clear off irrelevant processing that might otherwise degrade realtime latencies. Furthermore, the large CPU counts on current systems are an invitation to run multiple real-time applications on a single system, so that the CPU hotplug operations run as part of one application's startup might interfere with some other application. Therefore, this loop clearly needs to go. So I abandoned my debugging efforts and focused instead on getting rid of rcu_preempt_offline_tasks() entirely, along with all of its remaining bugs.

The point of the loop spanning lines 22-38 is handle any tasks on the rnp structure's ->blkd_tasks list. This list accumulates tasks that block while in an RCU read-side critical section while running on a CPU associated with the leaf rcu_node structure pointed to by rnp. This blocking will normally be due to preemption, but could also be caused by -rt's blocking spinlocks. This function is called only when the last CPU associated with the leaf rcu_node structure is going offline, after which there will no longer be any online CPUs associated with this leaf rcu_node structure. This loop then moves those tasks to the root rcu_node structure's ->blkd_tasks list. Because there is always at least one CPU online somewhere in the system, there will always be at least one online CPU associated with the root rcu_node structure, which means that RCU will be guaranteed to take proper care of these tasks.

The obvious question at this point is “why not just leave the tasks on the leaf rcu_node structure?” After all, this clearly avoids the unbounded time spent moving tasks while interrupts are disabled. However, it also means that RCU's grace-period computation must change in order to account for blocked tasks associated with a CPU-free leaf rcu_node structure.

In the old approach, the leaf rcu_node structure in question was excluded from RCU's grace-period computations immediately. In the new approach, RCU will need to continue including that leaf rcu_node structure until the last task on its list exits its outermost RCU read-side critical section. At that point, the last task will remove itself from the list, leaving the list empty, thus allowing RCU to ignore the leaf rcu_node structure from that point forward.

This patch set makes this change by abstracting an rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu(), which can then be called from rcu_read_unlock_special() when the last task removes itself from the ->blkd_tasks list. This change allows the rcu_preempt_offline_tasks() function to be dispensed with entirely. With this patch set applied, TEST03 ran for 1,000 hours with neither failures nor near misses, which gives a high degree of confidence that this patch set made the bug less likely to appear. This, along with the decreased complexity and the removal of a source of realtime latency, is a definite plus!

I also modified the rcutorture scripts to pay attention to the “Reader Batch” near-miss output, giving a warning if one such near miss occurs in a run, and giving an error if two or more near misses occur, but at a rate of at least one near miss per three hours of test time. This should inform me should I make a similar mistake in the future.

Given that RCU priority boosting has been in the Linux kernel for many years and given that I regularly run rcutorture with CPU-hotplug testing enabled on CONFIG_RCU_BOOST=y kernels, it is only fair to ask why rcutorture did not locate this bug years ago. The likely reason is that I only recently added RCU callback flood testing, in which rcutorture registers 20,000 RCU callbacks, waits a jiffy, registers 20,000 more callbacks, waits another jiffy, then finally registers a third set of 20,000 callbacks. This causes call_rcu() to take evasive action, which has the effect of starting the RCU grace period more quickly, which in turn makes rcutorture more sensitive to too-short grace periods. This view is supported by the fact that I saw actual failures only on recent kernels whose rcutorture testing included callback-flood testing.

Another question is “exactly what was the bug?” I have a fairly good idea of what stupid mistake led to that bug, but given that I have completely eliminated the offending function, I am not all that motivated to chase it down. In fact, it seems much more productive to leave it as an open challenge for formal verification. So, if you have a favorite formal-verification tool, why not see what it can make of rcu_preempt_offline_tasks()?

Somewhat embarrassingly, slides 39-46 of this Collaboration Summit presentation features my careful development of rcu_preempt_offline_tasks(). I suppose that I could hide behind the “More changes due to RCU priority boosting” on slide 46, but the fact remains that I simply should not have written this function in the first place, whether carefully before priority boosting or carelessly afterwards.

In short, it is not enough to correctly code a function. You must also code the correct function!

Comments

( 4 comments — Leave a comment )
(Anonymous)
Nov. 21st, 2014 03:37 pm (UTC)
Dear Lord.
" Carrying out frequent CPU-hotplug operations, and so much so that a given block of 16 CPUs (for example, CPUs 0-15 or 16-31) might reasonably all be offline at the same time."
^- Wow. Just, wow.

Thanks for the excellent write up!
paulmck
Nov. 21st, 2014 05:05 pm (UTC)
Re: Dear Lord.
So I am correct in presuming that your workloads don't do that? :-)
(Anonymous)
Nov. 21st, 2014 05:48 pm (UTC)
Re: Dear Lord.
That was my thoughts, at first, thinking of it as a physical act. (What hardware would even allow a CPU hotplug? some sort of HA enterprise cluster in a box ...) But then I realized, the massively multi-cpu systems certainly allow for software _offlining_ of CPUs, presumably as a power-saving measure, like the new truck engines that shut off cylinders, running on 6 or 4 instead of all 8. Then it makes sense.

Also, great detective work! This is the sort of stuff that makes me confident in the software being developed. Like the PostgreSQL's team's insistence of doing and surviving literal pull-the-plug tests.
paulmck
Nov. 21st, 2014 08:14 pm (UTC)
Re: Dear Lord.
Glad you liked it!
( 4 comments — Leave a comment )