You are viewing paulmck

Tired of ads? Upgrade to paid account and never see ads again!
Some time ago, I figured out that there are more than a billion instances of the Linux kernel in use, and this in turn led to the realization that a million-year RCU bug is happening about three times a day across the installed base. This realization has caused me to focus more heavily on RCU validation, which has uncovered a number of interesting bugs. I have also dabbled a bit in formal verification, which has not yet found a bug. However, formal verification might be getting there, and might some day be a useful addition to RCU's regression testing. I was therefore quite happy to be invited to this Dagstuhl Seminar. In what follows, I summarize a few of the presentation. See here for the rest of the presentations.

Viktor Vafeiadis presented his analysis of the C11 memory model, including some “interesting” consequences of data races, where a data race is defined as a situation involving multiple concurrent accesses to a non-atomic variable, at least one of which is a write. One such consequence involves a theoretically desirable “strengthening” property. For example, this property would mean that multiplexing two threads onto a single underlying thread would not introduce new behaviors. However, with C11, the undefined-behavior consequences of data races can actually cause new behaviors to appear with fewer threads, for example, see Slide 7. This suggests the option of doing away with the undefined behavior, which is exactly the option that LLVM has taken. However, this approach requires some care, as can be seen on Slide 19. Nevertheless, this approach seems promising. One important takeaway from this talk is that if you are worried about weak ordering, you need to pay careful attention to reining in the compiler's optimizations. If you are unconvinced, take a look at this! Jean Pichon-Pharabod, Kyndylan Nienhuis, and Mike Dodds presented on other aspects of the C11 memory model.

Martin T. Vechev apparently felt that the C11 memory model was too tame, and therefore focused on event-driven applications, specifically javascript running on Android. This presentation included some entertaining concurrency bugs and their effects on the browser's display. Martin also discussed formalizing javascript's memory model.

Hongjin Liang showed that ticket locks can provide starvation freedom given a minimally fair scheduler. This provides a proof point for Björn B. Brandenburg's dissertation, which analyzed the larger question of real-time response from lock-based code. It should also provide a helpful corrective to people who still believe that non-blocking synchronization is required.

Joseph Tassarotti presented a formal proof of the quiescent-state based reclamation (QSBR) variant of userspace RCU. In contrast to previous proofs, this proof did not rely on sequential consistency, but instead leveraged a release-acquire memory model. It is of course good to see researchers focusing their tools on RCU! That said, when a researcher asked me privately whether I felt that the proof incorporated realistic assumptions, I of course could not resist saying that since they didn't find any bugs, the assumptions clearly must have been unrealistic.

My first presentation covered what would be needed for me to be able to use formal verification as part of Linux-kernel RCU's regression testing. As shown on slide 34, these are:

  1. Either automatic translation or no translation required. After all, if I attempt to manually translate Linux-kernel RCU to some special-purpose language every release, human error will make its presence known.
  2. Correctly handle environment, including the memory model, which in turn includes compiler optimizations.
  3. Reasonable CPU and memory overhead. If these overheads are excessive, RCU is better served by simple stress testing.
  4. Map to source code lines containing the bug. After all, I already know that there are bugs—I need to know where they are.
  5. Modest input outside of source code under test. The sad fact is that a full specification of RCU would be at least as large as the implementation, and also at least as buggy.
  6. Find relevant bugs. To see why this is important, imagine that some tool finds 100 different million-year bugs and I fix them all. Because roughly one of six fixes introduces a bug, and because that bug is likely to reproduce in far less than a million years, this process has likely greatly reduced the robustness of the Linux kernel.

I was not surprised to get some “frank and honest” feedback, but I was quite surprised (but not at all displeased) to learn that some of the feedback was of the form “we want to see more C code.” After some discussion, I provided just that.

Lies that firmware tells RCU

One of the complaints that real-time people have against some firmware is that it lies about its age, attempting to cover up cycle-stealing via SMIs by reprogramming the TSC. Some firmware goes farther and lies about the number of CPUs on the system, apparently on the grounds that more is better, regardless of how many of those alleged CPUs actually exist.

RCU used to naively believe the firmware, and would therefore create one set of rcuo kthreads per advertised CPU. On some systems, this resulted in hundreds of such kthreads on systems with only a few tens of CPUs. But RCU can choose to create the rcuo kthreads only for CPUs that actually come online. Problem solved!

Mostly solved, that is.

Yanko Kaneti, Jay Vosburgh, Meelis Roos, and Eric B Munson discovered the “mostly” part when they encountered hangs in _rcu_barrier(). So what is rcu_barrier()?

The rcu_barrier() primitive waits for all pre-existing callbacks to be invoked. This is useful when you want to unload a module that uses call_rcu(), as described in this LWN article. It is important to note that rcu_barrier() does not necessarily wait for a full RCU grace period. In fact, if there are currently no RCU callbacks queued, rcu_barrier() is within its rights to simply return immediately. Otherwise, rcu_barrier() enqueues a callback on each CPU that already has callbacks, and waits for all these callbacks to be invoked. Because RCU is careful to invoke all callbacks posted to a given CPU in order, this guarantees that by the time rcu_barrier() returns, all pre-existing RCU callbacks will have already been invoked, as required.

However, it is possible to offload invocation of a given CPU's RCU callbacks to rcuo kthreads, as described in this LWN article. This kthread might well be executing on some other CPU, which means that the callbacks are moved from one list to another as they pass through their lifecycles. This makes it difficult for rcu_barrier() to reliably determine whether or not there are RCU callbacks pending for an offloaded CPU. So rcu_barrier() simply unconditionally enqueues an RCU callback for each offloaded CPU, regardless of that CPU's state.

In fact, rcu_barrier() even enqueues a callback for offloaded CPUs that are offline. The reason for this odd-seeming design decision is that a given CPU might enqueue a huge number of callbacks, then go offline. It might take the corresponding rcuo kthread significant time to work its way through this backlog of callbacks, which means that rcu_barrier() cannot safely assume that an offloaded CPU is callback-free just because it happens to be offline. So, to come full circle, rcu_barrier() enqueues an RCU callback for all offloaded CPUs, regardless of their state.

This approach works quite well in practice.

At least, it works well on systems where the firmware provides the Linux kernel with an accurate count of the number of CPUs. However, it breaks horribly when the firmware over-reports the number of CPUs, because then the system will then have CPUs that never ever come online. If these CPUs have been designated as offloaded CPUs, this means that their rcuo kthreads will never ever be spawned, which in turn means that any callbacks enqueued for these mythical CPUs will never ever be invoked. And because rcu_barrier() waits for all the callbacks that it posts to be invoked, rcu_barrier() ends up waiting forever, which can of course result in hangs.

The solution is to make rcu_barrier() refrain from posting callbacks for offloaded CPUs that have never been online, in other words, for CPUs that do not yet have an rcuo kthread.

With some luck, this patch will clear things up. And I did take the precaution of reviewing all of RCU's uses of for_each_possible_cpu(), so here is hoping! ;-)
I had the privilege of being asked to present on ordering, RCU, and validation at a joint meeting of the REORDER (Third International Workshop on Memory Consistency Models) and EC2 (7th International Workshop on Exploiting Concurrency Efficiently and Correctly) workshops.

Before my talk, Michael Tautschnig gave a presentation (based on this paper) on an interesting prototype tool (called “mole,” with the name chosen because the gestation period of a mole is about 42 days) that helps identify patterns of usage in large code bases. It is early days for this tool, but one could imagine it growing into something quite useful, perhaps answering questions such as “what are the different ways in which the Linux kernel uses reference counting?” He also took care to call out the many disadvantages of testing, which include not being able to test all paths, all possible races, all possible inputs, or all possible much of anything, at least not in finite time.

I started my talk with an overview of split counters, where each CPU (or task or whatever) updates its own counter, and the aggregate counter is read out by summing all the per-CPU counters. There was some concern expressed by members of the audience about the possibility of inconsistent results. For example, if one CPU adds five, another CPU adds seven, and a third CPU adds 11 to initially zero counter, then two CPUs reading out the counter might see 12 and 18, respectively, which are inconsistent (they differ by six, and no CPU added six). To their credit, the attendees picked right up on a reasonable correctness criterion. The idea is that the aggregate counter's value varies with time, and that any given reader will be guaranteed to return a value between that of the counter when the reader started and that of the counter when the reader ended: Consistency is neither needed nor provided in a great number of situations.

I then gave my usual introduction to RCU, and of course it is always quite a bit of fun introducing RCU to people who have never encountered anything like it. There was quite a bit of skepticism initially, as well as a lot of questions and comments.

I then turned to validation, noting the promise of some formal-validation tooling. I ended by saying that although I agreed with the limitations of testing called out by the previous speaker, the fact remains that a number of people have devised tests that had found RCU bugs (thank you, Stephen, Dave, and Fengguang!), but no one has yet devised a hard-core formal-validation tool that has found any bugs in RCU. I also pointed out that this is definitely not because there are no bugs in RCU! (Yes, I have gotten rid of all day-one bugs in RCU, but only by having also gotten rid of all day-one code in RCU.) When asked if I meant bugs in RCU usage or in RCU itself, I replied “Either would be good.” Several people wrote down where to find RCU in the Linux kernel, so it will be interesting to see what they come up with. (Perhaps all too interesting!)

There were several talks on analyzing weakly ordered systems, but keep in mind that for these guys, even x86 is weakly ordered. After all, it allows prior stores to be reordered with later loads.

Another interesting talk was given by Kapil Vaswani on the topic of wait freedom. Recall that in a wait-free algorithm, every process is guaranteed to make some progress in a finite time, even in the presence of arbitrarily long delays for any given process. In contrast, in a lock-free algorithm, only one process is guaranteed to make some progress in a finite time, again, even in the presence of arbitrarily long delays for any given process. It is worth noting that neither of these guarantees is sufficient for real-time programs, which require a specified amount of progress (not merely some progress) in a bounded amount of time (not merely a finite amount of time). Wait-freedom and lock-freedom are nevertheless important forward-progress guarantees, and there are numerous other similar guarantees including obstruction freedom, deadlock freedom, starvation freedom, many more besides.

It turns out that most software in production, even in real-time systems, is not wait-free, which has been a source of consternation for many researchers for quite some time. Kapil went on to describe how Alistarh et al. showed that, roughly speaking, given a non-hostile scheduler and crash-free execution, lock-free algorithms have wait-free behavior.

The interesting thing about this is that you can take it quite a bit farther, and those of you who know me well won't be surprised to learn that I did just that in a question to the speaker. If you have a non-hostile scheduler, crash-free execution, FIFO locks, bounded lock-hold times, no lock nesting, a finite number of processes, and so on, you can obtain the benefits of the more aggressive forward-progress guarantees. The general idea is that if you have at most N processes, and if the maximum lock-hold time is T, then you can wait at most (N-1)T time to acquire a given lock. (Those wishing greater rigor should read Bjoern B. Brandenburg's dissertation — Full disclosure: I was on Bjoern's committee.) In happy contrast to the authors of the paper mentioned in the previous paragraph, the speaker and audience seemed quite intrigued by this line of thought.

In all, it was an extremely interesting and thought-provoking time. With some luck, perhaps we will see some powerful software tools introduced by this group of researchers.

And it used to be so simple...

It is something I have done many times under many different operating systems. You write something into the startup scripts, take the system down, then regain control at startup. So I planned to take this approach for my KVM-based rcutorture setup: Mount the image file, deposit a test script in the startup scripts, unmount the image file, start qemu, then analyze the results (either deposited in the filesystem or dumped to the console). What could be easier?

So I manually mounted the file I use as the qemu disk image, my plan being to manually create simple tests as a first step towards automation.

Hmmm... Everything looks very strange and unfamiliar. Ah, yes, this is upstart. No problem, the Internet is but a mouse-click away, right? But wait, the documentation says that upstart and the more-familiar (to old guys like myself, anyway) sysvinit can co-exist on the same system. And what about this new systemd thing that I keep hearing about?

So my tests are going to have to inspect the filesystem and figure out which of these three things is installed. And maybe someone installed one of them, then switched to the other, leaving traces of their earlier choice lying around. How is my script supposed to figure all that out? Of course, if it was just me running the tests, I could simply make sure that a pre-selected boot-up mechanism was installed. But that is silly — anyone anywhere should be able to test RCU. So I need to be able to deal with any of the three boot-up mechanisms. But wait, it is far worse than that: What is my script supposed to do when someone else comes up with a fourth, fifth, or sixth boot-up mechanism? Ouch!!!

It is clearly time to just say “no”! After all, my main test (rcutorture) is coded in the kernel, so it is clearly going to be far easier to translate my driver scripts to Linux kernel code than to code up a script that can outwit all present and future perpetrators of bootup mechanisms. In-kernel coding will allow me to fully control the test using kernel boot parameters, and thus happily ignore which of sysvinit, upstart, systemd, or whatever is installed. Some times you just have to get a horse!


Stupid RCU Tricks: Bug Hunt Number 2

Lai Jiangshan spotted another bug in the solution posted for bug hunt number 1. Here is the fixed (but still buggy) version:

  1 struct foo {
  2   struct list_head list;
  3   int key;
  4   int data;
  5 };
  7 LIST_HEAD(mylist);
  8 DEFINE_SPINLOCK(mylock);
  9 struct foo *cache;
 11 int search(int key, int *data)
 12 {
 13   struct foo *p;
 15   rcu_read_lock();
 16   p = rcu_dereference(cache);
 17   if (p != NULL && p->key == key)
 18     goto found;
 19   list_for_each_entry_rcu(p, &mylist, list)
 20     if (p->key == key) {
 21       rcu_assign_pointer(cache, p);
 22       goto found;
 23     }
 24   rcu_read_unlock();
 25   return -ENOENT;
 26 found:
 27   *data = p->data;
 28   rcu_read_unlock();
 29   return 0;
 30 }
 32 int insert(int key, int data)
 33 {
 34   struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);
 36   if (p == NULL)
 37     return -ENOMEM;
 38   p->key = key;
 39   p->data = data;
 40   spin_lock(&mylock);
 41   list_add_rcu(&p->list, &mylist);
 42   spin_unlock(&mylock);
 43 }
 45 int delete(int key)
 46 {
 47   struct foo *p;
 49   spin_lock(&mylock);
 50   list_for_each_entry(p, &mylist, list)
 51     if (p->key == key) {
 52       list_del_rcu(&p->list);
 53       RCU_INIT_POINTER(cache, NULL);
 54       spin_unlock(&mylock);
 55       synchronize_rcu();
 56       free(p);
 57       return 0;
 58     }
 59   spin_unlock(&mylock);
 60   return -ENOENT;
 61 }

Can you spot the additional bug?

Stupid RCU Tricks: RCU-Protected Deletion

Consider the following code fragment:



  1 rcu_read_lock();
  2 p = rcu_dereference(gp);
  3 if (must_delete(p)) {
  4   spin_lock(&my_lock);
  5   if (must_delete(p)) {
  6     kfree_rcu(p);
  7     rcu_assign_pointer(gp, q);
  8     already_deleted(p);
  9   }
 10   spin_unlock(&my_lock);
 11 } else {
 12   do_something_with(p);
 13 }
 14 rcu_read_unlock();

Note that the whole fragment is enclosed in an RCU read-side critical section. Line 2 picks up a local pointer to the global RCU-protected pointer gp, which we assume is never NULL. Line 3 checks to see if it is necessary to delete the structure referenced by gp, and, if so, lines 4-10 do the deleting. Otherwise, line 12 does something with the just-accessed structure.

If the structure must be deleted, we first acquire a spinlock on line 4, and then line 5 re-checks whether it is still necessary to delete the structure, but this time under the lock. If the structure still needs to be deleted, then line 6 frees it (but only after an RCU grace period has elapsed) and line 7 makes gp point to some replacement structure q. Clearly this replacement structure must have been allocated somehow, but the details of exactly how that happens are not important to this example. Next, line 8 marks the old structure as having already been deleted, which resolves races when multiple CPUs concurrently notice that the structure needs deleting, and is also the reason that line 5 rechecks under the lock. Finally, line 10 releases the spinlock.

This code fragment is a bit unconventional: Normally lines 6 and 7 would be interchanged. But the RCU read-side critical section will prevent the structure from being freed until after the global pointer is updated, right? Why or why not?

Stupid RCU Tricks: Bug Hunt Number 1

The following code fragment contains a bug semi-similar to one recently found in the Linux kernel.



struct foo {
	struct list_head list;
	int key;
	int data;

struct foo *cache;

int search(int key, int *data)
	struct foo *p;

	p = rcu_dereference(cache);
	if (p != NULL && p->key == key)
		goto found;
	list_for_each_entry_rcu(p, &mylist, list)
		if (p->key == key) {
			rcu_assign_pointer(cache, p);
			goto found;
	return -ENOENT;
	*data = p->data;
	return 0;

int insert(int key, int data)
	struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

	if (p == NULL)
		return -ENOMEM;
	p->key = key;
	p->data = data;
	list_add_rcu(&p->list, &mylist);

int delete(int key)
	struct foo *p;

	list_for_each_entry(p, &mylist, list)
		if (p->key == key) {
			return 0;
	return -ENOENT;

Can you find it and fix it?

Hunting More Heisenbugs

How should you celebrate a successful heisenbug hunt? Any five people would likely have ten different opinions, but whatever your choice, you should also fix your broken tests. Just what makes me say that your tests are broken? Consider the following definitions:

  1. The only bug-free programs are trivial programs.
  2. A reliable program has no known bugs.

From these two definitions, it clearly follows that any reliable non-trivial program has at least one bug that you don't know about. The fact that you don't know about it means that your tests have not yet found that bug, and therefore, as asserted above, your tests are broken. And yes, I am giving your program the benefit of the doubt by assuming that it is non-trivial, and in any case, I claim that the Linux kernel's RCU implementation is non-trivial. RCU therefore contains at least one bug I don't know about — one bug my tests didn't find — and I therefore need to raise my rcutorture game.

I fixed the rcutorture test process as follows:

  1. Choosing CONFIG_RCU_FANOUT=2 on an eight-CPU machine, which exercised code paths that would otherwise require 1024 CPUs.
  2. Decreasing the wait time successive randomly chosen CPU-hotplug operations from three seconds to a hundred milliseconds (using “sleep 0.1” from a bash script!)
  3. Retaining the artificially low one-scheduler-clock-tick delay between invocations of force_quiescent_state().

This improved testing regimen resulted in grace-period hangs, but only with CONFIG_PREEMPT_TREE_RCU. The really nice thing about this improved test was that the failures occurred within a few minutes, which permits use of printk() debugging, a rare luxury when working with RCU infrastructure. I nevertheless suppressed my urge to wildly code up random printk() statements in favor of enabling the RCU tracing infrastructure. The tracing data clearly showed that all leaf rcu_node structures had recorded quiescent-state passage for all their CPUs, but that this information had not propagated up the rcu_node hierarchy. Given that these grace-period hangs occurred only in CONFIG_TREE_PREEMPT_RCU, this data fingered two possible code paths:

  1. rcu_read_unlock_special() invoked from a task that had blocked within the prior RCU read-side critical section, and
  2. __rcu_offline_cpu() when offlining the last CPU from a given leaf rcu_node structure when that structure has queued a task that has blocked within its current RCU read-side critical section.

A little code inspection on these two code paths located the bug and the fix. With this fix, RCU appears to be quite reliable.

So how should I celebrate success in this latest hunt?

Hunting Heisenbugs

My children's opinions notwithstanding, I recently found myself pursuing some nasty concurrency bugs in Linux's TREE_RCU implementation. This was not particularly surprising, given that I recently added preemption support, and the code really hadn't put up that much of a fight. In fact, I was getting the feeling that the bugs had gotten together and decided to hide out, the better to ambush me. This feeling wasn't far from wrong.

My first hint of trouble appeared when I began running longer sequences of rcutorture runs, seeing an occasional failure on one-hour runs. My first reaction was to increase the duration to ten hours and attempt to bisect the problem. Of course, even with bisection, finding the bug takes quite some time given ten hours for each probe, so rather than use “git bisect”, I manually ran parallel runs and (for example) quadrisected. I also ran multiple configurations. The results initially hinted that CONFIG_NO_HZ might have something to do with it, but later runs showed no shortage of failures in !CONFIG_NO_HZ runs as well.

The initial results of the bisection were quite puzzling, converging on a commit that could not possibly change RCU's behavior. Then I noticed that one of the machines seemed to be generating more failures than others, and, sure enough, this difference in failure rate was responsible for the false convergence. I therefore started keeping more careful records, including the machine name, test duration, configuration parameters, commit number, and number of errors for each run. These records proved extremely helpful later on.

Further testing showed that 2.6.32-rc1 (AKA 2.6.32-rc2) was reliable, even for the error-prone machine, and that 2.6.32-rc3 was buggy. Unfortunately, there are no RCU-related commits between 2.6.32-rc1 and 2.6.32-rc3. Unless you count commit #828c0950, which simply applies const to a few data structures involved in RCU tracing, which I don't and you shouldn't. So I ran a few more runs on 2.6.32-rc1, and eventually did trigger a failure. In contrast, 2.6.31 was rock solid.

Now there are quite a few RCU-related patches between 2.6.31 and 2.6.32-rc1, so I started searching for the offending commit. However, by this time I had written scripts to analyze rcutorture output, which I used to check the status of the test runs, stopping runs as soon as they generated an error. This sped things up considerably, because failed runs now took on average only a few hours rather than the 10 hours I was using as a (rough) success criterion.

Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?

Testing eventually converged on commit #b8d57a76. By this time, I getting a bit paranoid, so I ran no fewer than three ten-hour runs at the preceding commit on the most error-prone machine, none of which failed. But this commit does nothing to RCU, but rather makes rcutorture testing more severe, inserting delays of up to 50 milliseconds in RCU read-side critical sections. I therefore cherry-picked this commit back onto 2.6.31 and 2.6.30, and, sure enough, got failures in both cases. As it turned out, I was dealing with a day-one bug in TREE_RCU.

This did simplify matters, permitting me to focus my testing efforts on the most recent version of RCU rather than spreading my testing efforts across every change since 2.6.31. In addition, the fact that long-running RCU read-side critical sections triggered the bug told me roughly where the bug had to be: force_quiescent_state() or one of the functions it calls. This function runs more often in face of long-running RCU read-side critical sections. In addition, this explained the earlier CONFIG_NO_HZ results, because one of the force_quiescent_state() function's responsibilities is detecting dyntick-idle CPUs. In addition, it raised the possibility that the bug was unrelated to memory ordering, which motivated me to try a few runs on x86 — which, to my surprise, resulted in much higher failure rates than did the earlier tests on the Power machines.

I stubbed out force_quiescent_state() to check my assumption that it was to blame (but please, please do not do this on production systems!!!). Stubbing out force_quiescent_state() resulted in a statistically significant 3x decrease in failures on the x86 machine, confirming my assumption, at least for some subset of the bugs. Now that there was a much smaller section of code to inspect, I was able to locate one race involving mishandling of the ->completed values. This reduced the error rate on the x86 machine by roughly the same amount as did stubbing out force_quiescent_state(). One bug down, but more bugs still hiding.

I was also now in a position to take some good advice from Ingo Molnar: when you see a failure, work to increase the failure rate. This might seem counter-intuitive, but the more frequent the failures, the shorter the test runs, and the faster you can find the bug. I therefore changed the value of RCU_JIFFIES_TILL_FORCE_QS from three to one, which increased the failure rate by well over an order of magnitude on the x86 machine.

Quick Quiz 2: How could increasing the frequency of force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?

Given that the race I found involved unsynchronized access to the ->completed values, it made sense to look at other unsynchronized accesses. I found three other such issues, and testing of the resulting patches has thus far turned up zero rcutorture failures.

And it only took 882.5 hours of machine time to track down these bugs. :–)

This raises the question of why these bugs happened in the first place. After all, I do try to be quite careful with RCU-infrastructure code. In this case, it appears that these bugs were inserted during a bug-fixing session fairly late in the TREE_RCU effort. Bug-fixing is often done under considerably more time pressure than is production of new code, and the mistake in this case was failing to follow up with more careful analysis.

Another question is the number of bugs remaining. This is of course hard to say at present, but Murphy would assert that, no matter what you do, there will always be at least a few more bugs.

Answer to Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?.

Answer to Quick Quiz 2: How could increasing the frequency of force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?

RCU and unloadable modules

The rcu_barrier() function was described some time back in an article on Linux Weekly News. This rcu_barrier() function solves the problem where a given module invokes call_rcu() using a function in that module, but the module is removed before the corresponding grace period elapses, or at least before the callback can be invoked. This results in an attempt to call a function whose code has been removed from the Linux kernel. Oops!!!

Since the above article was written, rcu_barrier_bh() and rcu_barrier_sched() have been accepted into the Linux kernel, for use with call_rcu_bh() and call_rcu_sched(), respectively. These functions have seen relatively little use, which is no surprise, given that they are quite specialized. However, Jesper Dangaard recently discovered that they need to be used a bit more heavily. This lead to the question of exactly when they needed to be used, to which I responded as follows:

Unless there is some other mechanism to ensure that all the RCU callbacks have been invoked before the module exit, there needs to be code in the module-exit function that does the following:

  1. Prevents any new RCU callbacks from being posted. In other words, make sure that no future call_rcu() invocations happen from this module unless those call_rcu() invocations touch only functions and data that outlive this module.
  2. Invokes rcu_barrier().
  3. Of course, if the module uses call_rcu_sched() instead of call_rcu(), then it should invoke rcu_barrier_sched() instead of rcu_barrier(). Similarly, if it uses call_rcu_bh() instead of call_rcu(), then it should invoke rcu_barrier_bh() instead of rcu_barrier(). If the module uses more than one of call_rcu(), call_rcu_sched(), and call_rcu_bh(), then it must invoke more than one of rcu_barrier(), rcu_barrier_sched(), and rcu_barrier_bh().
What other mechanism could be used? I cannot think of one that it safe. For example, a module that tried to count the number of RCU callbacks in flight would be vulnerable to races as follows:

  1. CPU 0: RCU callback decrements the counter.
  2. CPU 1: module-exit function notices that the counter is zero, so removes the module.
  3. CPU 0: attempts to execute the code returning from the RCU callback, and dies horribly due to that code no longer being in memory.

If there was an easy solution (or even a hard solution) to this problem, then I do not believe that Nikita Danilov would have asked Dipankar Sarma for rcu_barrier(). Therefore, I do not expect anyone to be able to come up with an alternative to rcu_barrier() and friends. Always happy to learn something by being proven wrong, of course!!!

So unless someone can show me some other safe mechanism, every unloadable module that uses call_rcu(), call_rcu_sched(), or call_rcu_bh() must use rcu_barrier(), rcu_barrier_sched(), and/or rcu_barrier_bh() in its module-exit function.

So if you have a module that uses one of the call_rcu() functions, please use the corresponding rcu_barrier() function in the module-exit code!

Update: Peter Zijlstra rightly points out that the issue is not whether your module invokes call_rcu(), but rather whether the corresponding RCU callback invokes a function that is in a module. So, if there is a call_rcu(), call_rcu_sched(), or call_rcu_bh() anywhere in the kernel whose RCU callback either directly or indirectly invokes a function in your module, then your module's exit function needs to invoke rcu_barrier(), rcu_barrier_sched(), and/or rcu_barrier_bh(). Thanks to Peter for pointing this out!