Stupid RCU Tricks: A tour through rcutorture

Although Linux-kernel RCU gets most of the attention, without rcutorture, RCU would not be what it is today. To see this, note that the old saying “If it ain't tested, it don't work!” is if anything more valid today than it was back then. After all, software has not gotten any simpler, workloads have not become less demanding, and systems have not grown smaller, except in terms of physical size. That said, the decrease in size has been truly impressive. Back when Jack and I invented RCU, the hardware contained in my laptop would have filled no fewer than fifteen standard racks, and that ignores the hardware that simply was not available back then, and also ignores the reliability issues that would have resulted from such an imposing agglomeration of hardware.

It is rcutorture's job to make sure that Linux-kernel RCU actually works, and so it is worthwhile getting to know rcutorture a bit better. The following blog posts cover design of, use of, and experience with this test suite:

  1. Stupid RCU Tricks: So you want to torture RCU? (use)
  2. Stupid RCU Tricks: So rcutorture is Not Aggressive Enough For You? (use)
  3. Stupid RCU Tricks: Failure Probability and CPU Count (use)
  4. Stupid RCU Tricks: Enlisting the Aid of a Debugger (use)
  5. Stupid RCU Tricks: Torturing RCU Fundamentally, Part I (design)
  6. Stupid RCU Tricks: Torturing RCU Fundamentally, Part II (design)
  7. Stupid RCU Tricks: Torturing RCU Fundamentally, Part III (design)
  8. Stupid RCU Tricks: Torturing RCU Fundamentally, Parts IV and V (design)
  9. Stupid RCU Tricks: So rcutorture is Still Not Aggressive Enough For You? (use)
  10. Stupid RCU Tricks: rcutorture fails to find an RCU bug (experience)
  11. Stupid RCU Tricks: The design of rcutorture (design)

And here are a few older posts covering rcutorture:

  1. Hunting Heisenbugs (experience, 2009)
  2. Hunting More Heisenbugs (experience, 2009)
  3. Stupid RCU Tricks: RCU Priority Inversion (design, 2010)
  4. And it used to be so simple... (design, 2011)
  5. Stupid RCU Tricks: Bug Found by Refactored Tests (design, experience, and use, 2014)
  6. Stupid RCU Tricks: rcutorture Catches an RCU Bug (experience, 2014)
  7. Stupid RCU Tricks: rcutorture Accidentally Catches an RCU Bug (experience, 2017)
I hope that this series is helpful, and I further hope that it will inspire more aggressive torturing of other software!

Stupid RCU Tricks: The design of rcutorture

This installment of the rcutorture series takes a high-level look at its design. At the highest level, rcutorture is a stress test with a few unit-test components thrown in for good measure. It also includes scripts to handle both single-system and distributed testing. All of this code is of course paying homage to the many moods of Mr. Murphy.

The Many Moods of Mr. Murphy

As I have progressed through my career, I seem to have progressively miffed Mr. Murphy.

I completed my first professional (but pro bono) project in the mid-1970s. It had one user. Any million-year bugs it might have contained took the full million years to appear. This meant that Murphy was actually a pretty nice guy. Sure, whatever could happen would. Eventually. Maybe in geologic time.

In the 1980s, I completed a number of contract-programming projects that might have had installed bases of as many as 100 units. A million-year bug could be expected to appear about once per 10,000 years. In the 1990s, I worked on Sequent's DYNIX/ptx proprietary-UNIX operating system, which had an installed base of perhaps 6,000 systems. A million-year bug could be expected to appear not quite once per two centuries.

Shortly after the year 2000, I started working on the Linux kernel. There are at best rough estimates of the Linux kernel's installed base, and as of 2017, there were an estimated 20 billion systems of one sort or another running the Linux kernel, including smartphones, automobiles, household appliances, and much more. A million-year bug could be expected to appear more than once per hour across this huge installed base. In other words, over a period of about 40 years, Murphy has transitioned from being a pretty nice guy to being a total jerk!

Worse yet, should the Linux kernel capture even a modest fraction of the Internet-of-things market, a million-year bug could be expected to appear every few minutes across the installed base. Which might well result in Murphy becoming nothing less than a homicidal maniac.
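
As a rough check on this arithmetic, here is a minimal Python sketch. It assumes that systems hit the bug independently and at a constant rate, so that the expected interval across the installed base is simply the per-system interval divided by the number of systems. The function name and the 8,766-hour year are illustrative choices, not anything taken from rcutorture.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours, averaging over leap years

def fleet_interval_years(bug_interval_years, installed_base):
    """Expected years between appearances of a bug across all systems.

    Assumes independent systems, each hitting the bug at a constant rate.
    """
    return bug_interval_years / installed_base

if __name__ == "__main__":
    million_years = 1_000_000
    # Installed-base estimates quoted in the text above.
    for era, systems in [("1970s", 1), ("1980s", 100),
                         ("1990s", 6_000), ("2017", 20_000_000_000)]:
        years = fleet_interval_years(million_years, systems)
        print(f"{era}: once per {years:g} years "
              f"({years * HOURS_PER_YEAR:g} hours)")
```

Under these assumptions, the 2017 estimate works out to a bit under half an hour between appearances, which is consistent with "more than once per hour".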

Fortunately, there are some validation strategies that might help keep Murphy on the straight and narrow.

If You Cannot Beat Him, Join Him!

Given that everything that can happen eventually will, the task at hand is to try to make it happen in the comparative comfort and safety of the lab. This means aiding and abetting Mr. Murphy, at least within the lab environment. And this is the whole point of rcutorture, whose tricks include the following:

  1. Temporal fuzzing.
  2. Exercising race conditions.
  3. Anticipating abuse.
Of course, none of these tricks are new, but it does not hurt to review them.

Temporal Fuzzing

But why not go for the full effect and apply straight-up fuzzing? The answer to this question may be found in RCU's core API:
void rcu_read_lock(void);
void rcu_read_unlock(void);
void synchronize_rcu(void);
void call_rcu(struct rcu_head *head, rcu_callback_t func);
For the first three functions, there is nothing to fuzz, unless you are trying to test your compiler. For the last function, fuzzing of pointers—and most especially pointers to functions—is reserved for the truly brave and for those wishing to test their kernel's exception handling.

But it does make sense to fuzz the timing of calls to these functions, and that is exactly what rcutorture does. RCU readers and updaters are invoked at random times, with readers and updaters cooperating to detect any too-short grace periods, memory misordering, and so on. Much of the fuzzing is randomly generated at run time, but there are also module parameters that insert delays in specific locations. This strategy is straightforward, but can also be powerful. For example, careful choice of delays and other configuration settings decreased the mean time between failures (MTBF) of a memorable heisenbug from hundreds of hours to less than five hours. This had the beneficial effect of de-heisening this bug.
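
The idea of fuzzing timing rather than arguments can be sketched with a toy model. The following Python is not rcutorture code and runs no real concurrency: it assumes integer time, represents each reader as a (start, end) pair, and treats a grace period as too short when a reader that began before the grace period started is still running when it ends. All of the names are hypothetical.

```python
import random

def correct_gp_end(gp_start, readers):
    """A grace period must outlast every reader that began before it did."""
    return max([gp_start] + [end for start, end in readers if start < gp_start])

def busted_gp_end(gp_start, readers):
    """Deliberately broken: the grace period ends the instant it starts."""
    return gp_start

def too_short(gp_start, gp_end, readers):
    """True if some pre-existing reader is still running at gp_end."""
    return any(start < gp_start and end > gp_end for start, end in readers)

def torture(gp_end_fn, seed, n_readers=100, n_updates=50, horizon=1000):
    """Fuzz reader and updater timing; count the too-short grace periods."""
    rng = random.Random(seed)
    readers = []
    for _ in range(n_readers):
        start = rng.randrange(horizon)
        readers.append((start, start + rng.randint(1, 50)))
    failures = 0
    for _ in range(n_updates):
        gp_start = rng.randrange(horizon)
        if too_short(gp_start, gp_end_fn(gp_start, readers), readers):
            failures += 1
    return failures

if __name__ == "__main__":
    print("correct:", torture(correct_gp_end, seed=42))
    print("busted: ", torture(busted_gp_end, seed=42))
```

In this model the correct grace-period computation can never be flagged, whatever the seed, while the busted one is flagged whenever a randomly timed update happens to land inside a reader.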

Exercising Race Conditions

Many of the most troublesome bugs involve rare operations, and one way to join forces with Murphy is to make rare operations less rare during validation. And rcutorture takes this approach often, including for the following operations:

  1. CPU hotplug.
  2. Transitions to and from idle, including transitions to and from the whole system being idle.
  3. Long RCU readers.
  4. Readers from interrupt handlers.
  5. Complex readers, for example, those overlapping with irq-disable regions.
  6. Delayed grace periods, for example, allowing a CPU to go offline and come back online during grace-period initialization.
  7. Racing call_rcu() invocations against rcu_barrier().
  8. Periodic forced migrations to other CPUs.
  9. Substantial testing of less-popular grace-period mechanisms.
  10. Processes running on the hypervisor to preempt code running in rcutorture guest OSes.
  11. Process exit.
  12. “Near misses” where the RCU grace-period guarantee is almost violated.
  13. Moving CPUs to and from rcu_nocbs callback-offloaded mode.
This exercising of race conditions might be reminiscent of the Netflix Chaos Monkey.

Anticipating Abuse

There are things that RCU users are not supposed to do. Just as users of the fork() system call are not supposed to code up forkbombs, RCU users are not supposed to code up endless blasts of call_rcu() invocations (see Documentation/RCU/checklist.rst item 8). Nevertheless, rcutorture does engage in (carefully limited forms of) call_rcu() abuse in order to find stress-related RCU bugs. This abuse is enabled by default and may be controlled by the rcutorture.fwd_progress module parameter and friends.

In addition, rcutorture inserts the occasional long-term delay in preemptible RCU readers and exercises code paths that must avoid deadlocks involving the scheduler and RCU.

Meta-Murphy, AKA Test the Test

Of course, one danger of joining Murphy is that things can go wrong in test code just as easily as they can go wrong in the code under test.

For this reason, rcutorture provides the rcutorture.object_debug module parameter that verifies that the code checking for double call_rcu() invocations is working properly. In addition, the rcutorture.stall_cpu module parameter and friends may be used to force RCU CPU stall warning messages of various types.

The rcutorture tests of more fundamental RCU properties may be enabled by using the rcutorture.torture_type module parameter. For example, rcutorture.torture_type=busted selects a broken RCU implementation, which may also be selected using the BUSTED scenario. Either way, rcutorture had jolly well better complain about too-short grace periods. In addition, rcutorture.torture_type=busted_srcud forces rcutorture to run compound readers against SRCU, which does not support this notion. In this case also, rcutorture had better complain about too-short grace periods for these compound readers. The rcutorture.leakpointer module parameter tests the CONFIG_RCU_STRICT_GRACE_PERIOD Kconfig option's ability to detect pointers leaked from RCU read-side critical sections. Finally, the rcutorture tests of RCU priority boosting can themselves be tested by using the BUSTED-BOOST scenario, which must then complain about priority-boosting failures.

Additional unscheduled tests of rcutorture testing are of course provided by bugs in RCU itself. Perhaps these are rare examples of Murphy working against himself, but they normally do not feel that way at the time!

Enlisting Darwin

Those who are willing to consider the possibility that natural selection applies to non-living objects might do well to consider validation such as that provided by rcutorture to be a selection function. Now, some developers might object to the thought that their carefully created changes are random mutations, but the sad fact is that long experience has often supported that view.

With this in mind, a good validation suite will select against bugs, resulting in robust software, right?


You see, bugs are a form of software. An undesirable form, perhaps, but a form nevertheless. Bugs will therefore adapt to any fixed validation suite and accumulate in your software, degrading its robustness. This means that any bugs located by end users must also be considered bugs against the validation suite, which after all failed to find those bugs. Modifying the validation suite to successfully find those bugs is therefore important, as are independent efforts to make the validation suite more capable. The hope is that modifying the test suite will make it more difficult for bugs to adapt to it.

In short, the price of robust software is eternal test development.

Stupid RCU Tricks: rcutorture fails to find an RCU bug

I recently took a close look at rcutorture's console output and noticed the following string: rtbf: 0 rtb: 0. The good news is that there were no rcutorture priority-boosting failures (rtbf: 0). The bad news is that this was only because there was no priority-boosting testing (rtb: 0). And as we all know, if it isn't tested, it doesn't work, so this implied bugs in RCU priority boosting itself.
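
A check of this sort is easy to automate. The following Python sketch assumes that the two counters appear next to each other in exactly the form quoted above, and the status names are invented for illustration. It treats a zero test count as a problem in its own right instead of as a pass.

```python
import re

# Matches the two priority-boosting counters as quoted in the text above.
BOOST_STATS = re.compile(r"rtbf: (\d+) rtb: (\d+)")

def boost_status(console_line):
    """Classify the priority-boosting counters found in a console line."""
    match = BOOST_STATS.search(console_line)
    if match is None:
        return "missing"
    failures, tests = (int(field) for field in match.groups())
    if tests == 0:
        return "untested"   # no failures only because nothing was tested
    return "failed" if failures else "passed"

if __name__ == "__main__":
    print(boost_status("rtbf: 0 rtb: 0"))
```

For the string quoted above, this reports "untested", which is the bad news that a glance at rtbf: 0 alone would miss.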

What is RCU Priority Boosting?

If you are running a kernel built with CONFIG_PREEMPT=y, RCU read-side critical sections can be preempted by higher-priority tasks, regardless of whether these tasks are executing kernel or userspace code. If there are enough higher-priority tasks, and especially if someone has foolishly disabled realtime throttling, these RCU read-side critical sections might remain preempted for a good long time. And as long as they remain preempted, RCU grace periods cannot complete. And if RCU grace periods cannot complete, your system has an OOM in its future.

This is where RCU priority boosting comes in, at least in kernels built with CONFIG_RCU_BOOST=y. If a given grace period is blocked only by preempted RCU read-side critical sections, and that grace period is at least 500 milliseconds old (this timeout can be adjusted using the RCU_BOOST_DELAY Kconfig option), then RCU starts boosting the priority of these RCU readers to the level specified by the rcutree.kthread_prio kernel boot parameter, which defaults to FIFO priority 2. RCU does this using one rcub kthread per rcu_node structure. Given a default Kconfig, this works out to one rcub kthread per 16 CPUs.

Why did rcutorture Fail to Test RCU Priority Boosting?

As with many things in life, this happened one step at a time:

  1. A bug I was chasing a few years back reproduced much more quickly if I enabled CPU hotplug on the TREE03 rcutorture scenario.
  2. And in addition, x86 no longer supports configurations where CPUs cannot be hotplugged (mumble mumble security mumble mumble), which means that the rcutorture scripting is always going to test CPU hotplug.
  3. TREE03 was the one scenario that tested RCU priority boosting.
  4. But RCU priority-boost testing assumes that CPU hotplug was disabled. So much so that it would disable itself if CPU-hotplug testing was enabled. Which it now always was.
  5. So RCU priority boosting has gone completely untested for quite a few years.
  6. Quite a few more years back, I learned that firmware sometimes lies about the number of CPUs. I learned this from bug reports noting that RCU was sometimes creating way more kthreads than made any sense on small systems.
  7. So the spawning of kthreads that are per-CPU or per-group-of-CPUs is done at CPU-online time. Which ensures that systems get the right number of RCU kthreads even in the presence of lying firmware. In the case of the RCU boost kthreads, the code verifies that the rcu_node structure in question has at least one online CPU before spawning the corresponding kthread.
  8. Except that it is now quite possible for the incoming CPU to not be fully online at the time that rcutree_online_cpu() executes, in part due to RCU being much more careful about CPU hotplug. This means that the RCU boost kthread will be spawned when the second CPU corresponding to a given rcu_node structure comes online.
  9. Which means that rcu_node structures that have only one CPU never have an RCU boost kthread, and in turn that RCU readers preempted on such CPUs will never be boosted. This problematic situation is unusual, requiring 17, 33, 49, 65, ... CPUs on the system, assuming default RCU kconfig options. But it can be made to happen, especially when using the rcutorture scripting. (--kconfig "CONFIG_NR_CPUS=17" ...)

The fix is to refactor the creation of rcub kthreads so that a CPU coming online is assumed to eventually make it online, which means that one online CPU suffices to spawn an rcub kthread.
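
The CPU counts listed above follow from simple arithmetic, sketched here in Python. This model assumes that CPUs are assigned to leaf rcu_node structures in contiguous blocks of 16 (the default mentioned earlier), and it models the pre-fix behavior as "a leaf needs two CPUs before its kthread is spawned" and the post-fix behavior as "one CPU suffices". The function names are hypothetical, and the real kernel's spawning logic is of course more involved.

```python
def leaf_cpu_counts(nr_cpus, fanout_leaf=16):
    """CPUs per leaf rcu_node, assuming contiguous blocks of fanout_leaf CPUs."""
    full, rest = divmod(nr_cpus, fanout_leaf)
    return [fanout_leaf] * full + ([rest] if rest else [])

def leaves_without_rcub(nr_cpus, cpus_needed_to_spawn, fanout_leaf=16):
    """Indexes of leaf rcu_node structures that never get an rcub kthread."""
    return [i for i, cpus in enumerate(leaf_cpu_counts(nr_cpus, fanout_leaf))
            if cpus < cpus_needed_to_spawn]

if __name__ == "__main__":
    # Before the fix: in effect, a leaf's second CPU spawned the kthread.
    print([n for n in range(2, 66) if leaves_without_rcub(n, 2)])
    # After the fix: a single incoming CPU suffices.
    print([n for n in range(2, 66) if leaves_without_rcub(n, 1)])
```

Restricting attention to multi-CPU systems of up to 65 CPUs, the pre-fix model flags exactly 17, 33, 49, and 65 CPUs, and the post-fix model flags none.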

Additional Testing Challenges

The rcu_torture_boost() function required additional rework because CPUs can fail to pass through a quiescent state for some seconds from time to time, and there is nothing that RCU priority boosting can do about this. There are now checks for this condition, and rcutorture refrains from reporting an error in such cases.

Worse yet, this testing proceeds by disabling the aforementioned realtime throttling, then running a FIFO realtime priority 1 kthread on each CPU. This sort of abuse is a great way to break your kernel, yet nothing less abusive will reliably and efficiently test RCU priority boosting. It just so happens that many of RCU's kthreads will do just fine because in this configuration they run at FIFO realtime priority 2. Unfortunately, timers often run in a ksoftirqd kthread, which runs at a non-realtime priority. This means that although RCU's grace-period kthread runs just fine, if it tries to sleep for (say) three milliseconds, it won't awaken until RCU priority boosting testing has completed, which is a great way to force this testing to fail.

Therefore, rcutorture now takes the rude and crude approach of checking to see if it is built into the kernel (as opposed to running as a kernel module), and if so, it forces all of the ksoftirqd kthreads to run at FIFO realtime priority 2. (Needless to say, don't try this at home.)

The usual way to asynchronously determine when a grace period has ended is to post an RCU callback using call_rcu(). Except that in realtime configurations, RCU callbacks are often offloaded to rcuo kthreads. It is the system administrator's responsibility to decide where to run these, and, failing that, the Linux-kernel scheduler's responsibility. Neither of which should be expected to do the right thing in the presence of a full set of CPU-bound unthrottled real-time-priority boost-test kthreads.

Fortunately, RCU now has polling APIs for managing grace periods. The start_poll_synchronize_rcu() function starts a new grace period if needed and returns a “cookie” that can be passed to poll_state_synchronize_rcu(), which will return true if the needed grace period has completed. These functions do not rely on RCU callbacks, and thus will function correctly even if the rcuo kthreads are inauspiciously scheduled, or even if these kthreads are not scheduled at all. Thus, rcutorture's test of RCU priority boosting now uses these two functions.
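
The cookie-based polling pattern can be illustrated with a toy userspace model. The Python below is not kernel code: the class, its methods, and its cookie encoding are all invented for illustration, and it assumes that no grace period is already in progress when the cookie is taken. It shows only the shape of the interaction: take a cookie, then poll with it until the needed grace period has ended, with no callback involved.

```python
class ToyGracePeriods:
    """Toy stand-in for RCU's grace-period bookkeeping (not kernel code)."""

    def __init__(self):
        self.completed = 0   # number of grace periods that have ended
        self.requested = 0   # highest grace period anyone has asked for

    def start_poll(self):
        """Ask for a full grace period from now; return a cookie naming it."""
        cookie = self.completed + 1
        self.requested = max(self.requested, cookie)
        return cookie

    def poll_state(self, cookie):
        """True once the grace period named by the cookie has ended."""
        return self.completed >= cookie

    def end_grace_period(self):
        """Simulate the grace-period machinery finishing requested work."""
        if self.completed < self.requested:
            self.completed += 1

if __name__ == "__main__":
    rcu = ToyGracePeriods()
    cookie = rcu.start_poll()
    print(rcu.poll_state(cookie))   # grace period still pending
    rcu.end_grace_period()
    print(rcu.poll_state(cookie))   # grace period has now ended
```

Because the poller only compares its cookie against shared state, nothing in this pattern depends on a callback being invoked, which is the property that the boost test relies on.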

With all of this in place, RCU priority boosting lives again!

But untested software does not work, and that includes the tests themselves. Thus, a new BUSTED-BOOST scenario tests RCU priority boosting on a kernel built with CONFIG_RCU_BOOST=n, which does not do RCU priority boosting. This scenario fails within a few tens of seconds, so the test being tested might actually be working!

Stupid RCU Tricks: So rcutorture is Still Not Aggressive Enough For You?

An earlier post discussed ways of making rcutorture more aggressive, but even with these techniques, rcutorture's level of aggression is limited by build time on the one hand and the confines of a single system on the other. This post describes some recent ways around those limitations.

Play It Again, Sam!

A full rcutorture run will do about 20 kernel builds, which can take some tens of minutes or, on slower systems, well over an hour. This can be extremely annoying when you simply want to re-run the last test in order to obtain better failure statistics or to get more test time on a recent bug fix.

The traditional rcutorture way of avoiding rebuilds is to optionally edit the qemu-cmd files for each scenario to be re-run, then manually invoke sh on each resulting file. The editing step allows you to avoid overwriting the previous run's console output, but may be omitted if you don't care about that console output or if you have already saved it off somewhere. This works, but is painstaking and error-prone.

This is where the new kvm-again.sh script comes in. Its first argument is the path to the directory for the old run, for example, on my laptop, tools/testing/selftests/rcutorture/res/2021.03.31-10.52.56. This can be a relative pathname as in this example, but use of absolute pathnames can make your life easier when reviewing output from prior kvm-again.sh runs. By default, the new run will have the same duration as the old run, but the --duration argument may be used to specify the new run's duration. Also by default, kvm-again.sh will generate the new run's directory based on the current date and time (suffixed with -again), but the --rundir argument may be used to specify some other location. Finally, and again by default, hard links are used to “copy” the needed data from the old run directory (such as the Linux kernel), but the --link argument can be used to specify soft links or explicit copy operations. The full set of scenarios generates some 20 kernels, each of which is somewhat larger than it would have been in the past. You may therefore need to exercise some caution when using --link copy, especially if you are doing repeated kvm-again.sh runs.

The re-run file in the new run directory gives the pathname of the old run directory. Although you can give a run directory produced by a prior kvm-again.sh invocation to a later kvm-again.sh invocation, best practice is to continue specifying the original run directory. If nothing else, following this best practice avoids ever-growing qemu-cmd files.

Of course, the shorter the runs, the greater an advantage kvm-again.sh provides. In the extreme case, it can be amazingly helpful when testing for rare boot-time failures.

Strength in Numbers

It seems likely that there are quite a few more people with access to eight 16-CPU systems than there are people with access to a single 128-CPU system. You can of course run kvm.sh on each of eight 16-CPU systems, but working out which scenarios to run on each of those systems can be time-consuming and error-prone. And this is why the new kvm-remote.sh script exists.

Build or Buy?

This script can be invoked in two different modes. In both cases, the first argument is a quoted list of system names, as in names that the ssh command understands. Specifying localhost or any of its synonyms might work, but is an option for the brave at this point. Should this prove useful, it will be taken care of in a later version of this script.

The first form builds all needed kernels on the system on which the kvm-remote.sh script is run. In this case, the second and subsequent arguments can be anything accepted by the kvm.sh script.

In the second form, the second and subsequent arguments must be suitable for the kvm-again.sh script, that is, the second argument must specify the path to an old run directory and the third and subsequent arguments can be --duration, --rundir, and --link.

In both forms, once the kernels are available, a tarball of all scenarios is downloaded to all of the systems. Each such download is run sequentially, which means that downloading can take significant time, especially if low-bandwidth network links are involved. Once all systems have had the tarball downloaded and expanded, batches of scenarios are parceled out among the systems specified by the first argument. If there are more batches than there are systems, once a system completes its current batch, it will be given another batch.
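
The parceling-out policy described above amounts to greedy dispatch: whichever system becomes free first gets the next batch. The following Python sketch models only that policy, not the script itself. The function name, the system names, and the per-batch durations are hypothetical.

```python
import heapq

def parcel_out(batch_minutes, systems):
    """Give each batch, in order, to whichever system becomes free first.

    Returns (schedule, finish): a map from system name to the batch indexes
    it ran, and the time at which the last batch completes.
    """
    free_at = [(0, i) for i in range(len(systems))]  # (time free, system index)
    heapq.heapify(free_at)
    schedule = {name: [] for name in systems}
    finish = 0
    for batch, minutes in enumerate(batch_minutes):
        now, i = heapq.heappop(free_at)      # earliest-available system
        schedule[systems[i]].append(batch)
        done = now + minutes
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return schedule, finish

if __name__ == "__main__":
    schedule, finish = parcel_out([30, 10, 10], ["sys-a", "sys-b"])
    print(schedule, finish)
```

With three batches and two systems as in this example, the system that drew the short batch finishes first and picks up the remaining batch while the other system is still busy.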

Once all batches have completed, the results from each system are uploaded back to the system running the kvm-remote.sh script, where the usual end-of-run error-checking and analysis is carried out.

This script assumes that all systems have the same number of CPUs. Addressing this limitation is future work. In the meantime, one workaround is to do multiple --buildonly runs of kvm.sh, one for each type of system. Then multiple runs of the second form of the kvm-remote.sh script can safely be run concurrently on the same build system. Because all the pre-built kernels for each type of system are safely collected up in the corresponding old-run directory, the multiple invocations of kvm-remote.sh will not interfere with each other.

Why ssh?

The kvm-remote.sh script uses ssh to do all downloading, control, and uploading operations. This might seem to be a poor choice in this age of Kubernetes and friends, but the fact remains that ssh is widely available, easy to configure, and reasonably robust. In contrast, there is a wide variety of Kubernetes-like systems, and they can be configured in a wide variety of ways. It would be impossible to choose just one of these systems, and it would be quite difficult to accommodate all of the configurations, versions, and variants of even one of them.

However, please note that kvm-remote.sh assumes that all of the systems have been set up properly. This means that low-level virtualization support must be in place, and it also means that running an ssh command to any of the specified systems must complete without the need for any human interaction. For example, if ssh foo date does not open a connection to system foo, run the date command, and print the result without any need to type any sort of password or passphrase, then system foo is not yet set up properly.
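
One way to run this check ahead of time across a list of systems is sketched below in Python. It relies on the OpenSSH client's BatchMode=yes option, which makes ssh fail instead of prompting for a password or passphrase. The helper names and the 30-second timeout are arbitrary choices, and this is a standalone pre-flight check, not a description of anything kvm-remote.sh itself does.

```python
import subprocess

def ssh_check_command(host):
    """Command that fails rather than prompting for a password or passphrase."""
    return ["ssh", "-o", "BatchMode=yes", host, "date"]

def ssh_ready(host, run=subprocess.run, timeout=30):
    """True if 'ssh host date' completes with no human interaction."""
    try:
        result = run(ssh_check_command(host), capture_output=True,
                     stdin=subprocess.DEVNULL, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False   # ssh missing, or the connection hung
    return result.returncode == 0

if __name__ == "__main__":
    for host in ["foo"]:   # hypothetical system name, as in the text above
        print(host, "ready" if ssh_ready(host) else "NOT set up properly")
```

The run parameter exists only so that the check can be exercised without a network, for example by passing in a stub that returns a canned exit status.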

Similarly, kvm-remote.sh does not take any actions that might be necessary to reserve system foo for your exclusive use, nor does it do anything to release this system upon completion of the test. Thus, these system-configuration, reservation, and release operations are jobs for which you may wish to enlist the help of Kubernetes or of similar frameworks. For example, I use (admittedly crude) scripts that interact with Facebook's internal environment to reserve and configure the desired number and type of systems, invoke kvm-remote.sh once everything is set up, and then release those systems.

What Might The Future Hold?

Although the kvm-remote.sh approach of using ssh works reasonably well on a few tens of systems, if someone wanted to run rcutorture on thousands of systems, something else would likely be required. On the other hand, there are not that many sites where one would reasonably devote anywhere near that many systems to rcutorture. There might be downloading improvements at some point, most likely in the form of allowing a script to be provided to allow kvm-remote.sh to use some site-specific optimized multi-system download utility. Both kvm-again.sh and kvm-remote.sh might someday need a way to specify that only a subset of a prior run's scenarios be re-run, for example, to chase down a bug that occurred in only a few of those scenarios.

And as mentioned earlier, perhaps a future version of kvm-remote.sh will gracefully handle remote systems with varying numbers of CPUs, or will run actual tests on the system running the kvm-remote.sh script.

But if things go as they usually do, a fair fraction of the future changes will come as complete surprises.

Parallel Programming: Second Edition

The second edition of “Is Parallel Programming Hard, And, If So, What Can You Do About It?” is now available. I have no plans to create a dead-tree version, but I have no objection to others doing so, whether individually or in groups.

Big-animal changes over the First Edition include:

  1. A full rewrite of the memory-barriers section, which is now its own chapter. This new chapter includes discussion of the Linux-kernel memory model, courtesy of Akira Yokosawa, who kindly pulled in the LWN article.
  2. A number of new tools have been added to the formal-verification chapter.
  3. A new section on SMP real-time programming.
  4. The “Tools of the Trade” chapter has been dragged kicking and screaming into the 2020s, courtesy of Akira Yokosawa, Junchang Wang, and Slavomir Kaslev.
  5. Hyperlinking between quizzes and answers, courtesy of Paolo Bonzini and Akira Yokosawa.
  6. Improved formatting and build system, courtesy of Akira Yokosawa.
  7. Bibliographic facelift, courtesy of Stamatis Karnouskos and Akira Yokosawa.
  8. Grammatical fixes from a great many people, but especially from translators SeongJae Park and Motohiro Kanda.
  9. Several new cartoons.
  10. Performance results from a system with hundreds of CPUs, courtesy of my employer, Facebook.
  11. Substantial updates pretty much everywhere else. (Yes, this might be the first time in a long time that I read through the entire book. Why do you ask?)

Contributors include Akira Yokosawa; SeongJae Park; Junchang Wang; Borislav Petkov; Stamatis Karnouskos; Palik, Imre; Paolo Bonzini; Praveen Kumar; Tobias Klauser; Andreea-Cristina Bernat; Balbir Singh; Bill Pemberton; Boqun Feng; Emilio G. Cota; Namhyung Kim; Andrew Donnellan; Dominik Dingel; Igor Dzreyev; Pierre Kuo; Yubin Ruan; Chris Rorvick; Dave; Mike Rapoport; Nicholas Krause; Patrick Marlier; Patrick Yingxi Pan; Slavomir Kaslev; Zhang, Kai; and Zygmunt Bazyli Krynicki. On behalf of all who read this book, I thank you all for all you did to help make this second edition a reality!

Parallel Programming: December 2020 Update

This release of Is Parallel Programming Hard, And, If So, What Can You Do About It? features numerous improvements:


  1. LaTeX and build-system upgrades (including helpful error checking and reporting), formatting improvements (including much nicer display of hyperlinks and of Quick Quizzes, polishing of numerous figures and tables, plus easier builds for A4 paper), refreshing of numerous broken URLs, an improved “make help” command (see below), improved FAQ-BUILD material, and a prototype index, all courtesy of Akira Yokosawa.
  2. A lengthy Quick Quiz on the relationship of half-barriers, compilers, CPUs, and locking primitives, courtesy of Patrick Yingxi Pan.
  3. Updated performance results throughout the book, courtesy of a large x86 system kindly provided by Facebook.
  4. Compiler tricks, RCU semantics, and other material from the Linux-kernel memory model added to the memory-ordering and tools-of-the-trade chapters.
  5. Improved discussion of non-blocking-synchronization algorithms.
  6. Many new citations, cross-references, fixes, and touchups throughout the book.
A number of issues were spotted by Motohiro Kanda in the course of his translation of this book to Japanese, and Borislav Petkov, Igor Dzreyev, and Junchang Wang also provided much-appreciated fixes.

The output of the aforementioned make help is as follows:
Official targets (Latin Modern Typewriter for monospace font):
  Full,              Abbr.
  perfbook.pdf,      2c:   (default) 2-column layout
  perfbook-1c.pdf,   1c:   1-column layout

Set env variable PERFBOOK_PAPER to change paper size:
   PERFBOOK_PAPER=A4: a4paper
   PERFBOOK_PAPER=HB: hard cover book
   other (default):   letterpaper

"make help-full" will show the full list of available targets.

The following excerpt of the make help-full command's output might be of interest to those who find Quick Quizzes distracting:
Experimental targets:
  Full,              Abbr.
  perfbook-qq.pdf,   qq:   framed Quick Quizzes
  perfbook-nq.pdf,   nq:   no inline Quick Quizzes (chapterwise Answers)

Thus, the make nq command creates a perfbook-nq.pdf with Quick Quizzes and their answers grouped at the end of each chapter, in the usual textbook style, while still providing PDF navigation from each Quick Quiz to the relevant portion of that chapter.

Finally, this release also happens to be the first release candidate for the long-awaited Second Edition, which should be available shortly.

Stupid RCU Tricks: Torturing RCU Fundamentally, Parts IV and V

Continuing further into the Linux-kernel Documentation/RCU/Design/Requirements/Requirements.rst file uncovers RCU's final two fundamental guarantees:


  1. The common-case RCU primitives are unconditional, and
  2. RCU users can perform a guaranteed read-to-write upgrade.

The first guarantee is trivially verified by inspection of the RCU API. The return types of rcu_read_lock(), rcu_read_unlock(), synchronize_rcu(), call_rcu(), and rcu_assign_pointer() are all void. These API members therefore have no way to indicate failure. Even primitives like rcu_dereference(), which do have non-void return types, will succeed any time a load of their pointer argument would succeed. That is, if you do rcu_dereference(*foop), where foop is a NULL pointer, then yes, you will get a segmentation fault. But this segmentation fault will be unconditional, as advertised!

The second guarantee is a consequence of the first four guarantees, and must be tested not within RCU itself, but rather within the code using RCU to carry out the read-to-write upgrade.

Thus for these last two fundamental guarantees there is no code in rcutorture. But maybe even rcutorture deserves a break from time to time! ;–)

Stupid RCU Tricks: Torturing RCU Fundamentally, Part III

Even more reading of the Linux-kernel Documentation/RCU/Design/Requirements/Requirements.rst file encounters RCU's memory-barrier guarantees. These guarantees are a bit ornate, but roughly speaking guarantee that RCU read-side critical sections lapping over one end of a given grace period are fully ordered with anything past the other end of that same grace period. RCU's overall approach towards this guarantee is shown in the Linux-kernel Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst file, so one approach would be to argue that these guarantees are proven by a combination of this documentation along with periodic code inspection. Although this approach works well for some properties, the periodic code inspections require great attention to detail spanning a large quantity of intricate code. As such, these inspections are all too vulnerable to human error.

Another approach is formal verification, and in fact RCU's guarantees have been formally verified. Unfortunately, these formal-verification efforts, groundbreaking though they are, must be considered to be one-off tours de force. In contrast, RCU needs regular regression testing.

This leaves rcutorture, which has the advantage of being tireless and reasonably thorough, especially when compared to human beings. Except that rcutorture does not currently test RCU's memory-barrier guarantees.

Or at least it did not until today.

A new commit (which has since been accepted into Linux kernel v5.11) enlists the existing RCU readers. Each reader frequently increments a free-running counter, which can then be used to check memory ordering: If the counter appears to have counted backwards, something is broken. Each reader samples and records a randomly selected reader's counter, and assigns some other randomly selected reader to check for backwardsness. A flag is set at the end of each grace period, and once this flag is set, that other reader takes another sample of that same counter and compares the two samples.
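The sample-then-resample check can be modeled in a few lines of userspace C. This is only a single-threaded sketch of the idea, not rcutorture's code: the names mb_check, mb_record(), and mb_counter_went_backwards() are hypothetical, and the end-of-grace-period flag is set by hand rather than by RCU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one pending ordering check. */
struct mb_check {
	unsigned long first_sample;	/* Counter value seen before the grace period. */
	bool gp_ended;			/* Set once the grace period has ended. */
};

/* Sample the target reader's free-running counter and arm the check. */
static void mb_record(struct mb_check *c, const unsigned long *counter)
{
	assert(c != NULL && counter != NULL);
	c->first_sample = *counter;
	c->gp_ended = false;
}

/*
 * Resample the same counter once the grace period is over.
 * Returns true if the counter appears to have run backwards.
 * The signed difference keeps the comparison valid across wraparound.
 */
static bool mb_counter_went_backwards(const struct mb_check *c,
				      const unsigned long *counter)
{
	assert(c != NULL && counter != NULL);
	if (!c->gp_ended)
		return false;	/* Too early to draw any conclusion. */
	return (long)(*counter - c->first_sample) < 0;
}
```

In this model, a checker that finds the second sample smaller than the first after the flag is set has caught an ordering failure. A checker that runs before the flag is set simply learns nothing.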

The test strategy for this particular fundamental property of RCU is more complex and likely less effective than the strategy for the memory-ordering property described earlier, but life is like that sometimes.

Stupid RCU Tricks: Torturing RCU Fundamentally, Part II

Further reading of the Linux-kernel Documentation/RCU/Design/Requirements/Requirements.rst file encounters RCU's publish/subscribe guarantee. This guarantee ensures that RCU readers that traverse a newly inserted element of an RCU-protected data structure never see pre-initialization garbage in that element. In CONFIG_PREEMPT_NONE=y kernels, this guarantee combined with the grace-period guarantee permits RCU readers to traverse RCU-protected data structures using exactly the same sequence of instructions that would be used if these data structures were immutable. As always, free is a very good price!

However, some care is required to make use of this publish/subscribe guarantee. When inserting a new element, updaters must take care to first initialize everything that RCU readers might access and only then use an RCU primitive to carry out the insertion. Such primitives include rcu_assign_pointer() and list_add_rcu(), but please see The RCU API, 2019 edition or the Linux-kernel source code for the full list.

For their part, readers must use an RCU primitive to carry out their traversals, for example, rcu_dereference() or list_for_each_entry_rcu(). Again, please see The RCU API, 2019 edition or the Linux-kernel source code for the full list of such primitives.
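The initialize-then-publish discipline can be illustrated with a userspace analogy built on C11 atomics. This is a sketch under stated assumptions rather than kernel code: struct elem, gp, publish(), and subscribe_sum() are hypothetical names, a release store stands in for rcu_assign_pointer(), and an acquire load stands in for rcu_dereference(). The acquire load is a conservative stand-in, because the kernel primitive relies on the weaker dependency ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct elem {
	int a;
	int b;
};

/* Stand-in for an RCU-protected pointer. */
static _Atomic(struct elem *) gp;

/* Updater: initialize every field first, and only then publish. */
static void publish(struct elem *p, int a, int b)
{
	assert(p != NULL);
	p->a = a;
	p->b = b;
	/* Release ordering plays the role of rcu_assign_pointer() here. */
	atomic_store_explicit(&gp, p, memory_order_release);
}

/* Reader: fetch the pointer once, then access fields through that copy. */
static int subscribe_sum(void)
{
	/* Acquire ordering plays the role of rcu_dereference() here. */
	struct elem *p = atomic_load_explicit(&gp, memory_order_acquire);

	return p ? p->a + p->b : -1;
}
```

A reader that obtains a non-NULL pointer this way is guaranteed to see the values stored before the publication, never the element's earlier contents.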

Of course, rcutorture needs to test this publish/subscribe guarantee. It does this using yet another field in the rcu_torture structure:

struct rcu_torture {
  struct rcu_head rtort_rcu;
  int rtort_pipe_count;
  struct list_head rtort_free;
  int rtort_mbtest;
};

This additional field is ->rtort_mbtest, which is set to zero when a given rcu_torture structure is freed for reuse (see the rcu_torture_pipe_update_one() function), and then set to 1 just before that structure is made available to readers (see the rcu_torture_writer() function). For its part, the rcu_torture_one_read() function checks to see if this field is zero, and if so flags the error by atomically incrementing the global n_rcu_torture_mberror counter. As you would expect, any run ending with a non-zero value in this counter is considered to be a failure.
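The life cycle of this flag can be sketched as a small single-threaded userspace model. The names here (mb_elem, mb_elem_free(), mb_elem_prepare(), mb_elem_read_check(), and n_mberror) are hypothetical stand-ins for the kernel functions named above, and a plain increment replaces the atomic increment because the model has no concurrency.

```c
#include <assert.h>
#include <stddef.h>

struct mb_elem {
	int mbtest;	/* 0 while free, 1 once fully initialized. */
};

static int n_mberror;	/* Models the global n_rcu_torture_mberror counter. */

/* Freeing an element for reuse clears the flag. */
static void mb_elem_free(struct mb_elem *p)
{
	assert(p != NULL);
	p->mbtest = 0;
}

/* The writer sets the flag just before publishing the element. */
static void mb_elem_prepare(struct mb_elem *p)
{
	assert(p != NULL);
	p->mbtest = 1;
}

/* A reader that sees a zero flag has seen pre-initialization state. */
static void mb_elem_read_check(const struct mb_elem *p)
{
	assert(p != NULL);
	if (p->mbtest == 0)
		n_mberror++;
}
```

In a correct run, readers only ever reach elements that have been prepared, so the error count stays at zero.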

Thus we have an important fundamental property of RCU that nevertheless happens to have a simple but effective test strategy. To the best of my knowledge, this was also the first aspect of Linux-kernel RCU that was subjected to an automated proof of correctness.

Sometimes you get lucky! ;–)

Stupid RCU Tricks: Torturing RCU Fundamentally, Part I

A quick look at the beginning of the Documentation/RCU/Design/Requirements/Requirements.rst file in a recent Linux-kernel source tree might suggest that testing RCU's fundamental requirements is Job One. And that suggestion would be quite correct. This post describes how rcutorture tests RCU's grace-period guarantee, which is usually used to make sure that data is not freed out from under an RCU reader. Later posts will describe how the other fundamental guarantees are tested.

What Exactly is RCU's Fundamental Grace-Period Guarantee?

Any RCU reader that started before the start of a given grace period is guaranteed to complete before that grace period completes. This is shown in the following diagram:

Diagram of RCU grace-period guarantee 1

Similarly, any RCU reader that completes after the end of a given grace period is guaranteed to have started after that grace period started. And this is shown in this diagram:

Diagram of RCU grace-period guarantee 2

More information is available in the aforementioned Documentation/RCU/Design/Requirements/Requirements.rst file.
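Both statements of the guarantee rule out the same situation: a reader that starts before a grace period starts and is still running after that grace period ends. The following sketch expresses this as a predicate over timestamps. It assumes a single global timeline, which real systems do not provide, and the names interval and gp_guarantee_violated() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical timestamps on a single global timeline. */
struct interval {
	unsigned long start;
	unsigned long end;
};

/*
 * Returns true if the reader violates the grace-period guarantee:
 * it began before the grace period began, yet was still running
 * after the grace period ended.
 */
static bool gp_guarantee_violated(struct interval reader, struct interval gp)
{
	assert(reader.start <= reader.end && gp.start <= gp.end);
	return reader.start < gp.start && reader.end > gp.end;
}
```

Readers that overlap only one end of the grace period, or that fit entirely inside it, are permitted. Only a reader that completely spans the grace period indicates a too-short grace period.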

Whose Fault is This rcutorture Failure, Anyway?

Suppose an rcutorture test fails, perhaps by triggering a WARN_ON() that normally indicates a problem in some other area of the kernel. But how do we know this failure is not instead RCU's fault?

One straightforward way to test RCU's grace-period guarantee would be to maintain a single RCU-protected pointer (let's call it rcu_torture_current) to a single structure, perhaps defined as follows:

struct rcu_torture {
  struct rcu_head rtort_rcu;
  atomic_t rtort_nreaders;
  int rtort_pipe_count;
} *rcu_torture_current;

Readers could then do something like this in a loop:

p = rcu_dereference(rcu_torture_current);

An updater could do something like this, also in a loop:

p = kzalloc(sizeof(*p), GFP_KERNEL);
q = xchg(&rcu_torture_current, p);
call_rcu(&q->rtort_rcu, rcu_torture_cb);

And the rcu_torture_cb() function might be defined as follows:

static void rcu_torture_cb(struct rcu_head *p)
{
  struct rcu_torture *rp = container_of(p, struct rcu_torture, rtort_rcu);

  WRITE_ONCE(rp->rtort_pipe_count, 1);
}

This approach is of course problematic, never mind that one of rcutorture's predecessors actually did something like this. For one thing, a reader might be interrupted or (in CONFIG_PREEMPT=y kernels) preempted between its rcu_dereference() and its atomic_inc(). Then a too-short RCU grace period could result in the above reader doing its atomic_inc() on some structure that had already been freed and allocated as some other data structure used by some other part of the kernel. This could in turn result in a confusing failure in that other part of the kernel that was really RCU's fault.

In addition, the read-side atomic_inc() will result in expensive cache misses that will end up synchronizing multiple tasks concurrently executing the RCU reader code shown above. This synchronization will reduce read-side concurrency, which will in turn likely reduce the probability of these readers detecting a too-short grace period.

Finally, using the passage of time for synchronization is almost always a bad idea, so burn_a_bit_more_cpu_time() really needs to go. One might suspect that burn_a_random_amount_of_cpu_time() is also a bad idea, but we will see the wisdom in it.

Making rcutorture Preferentially Break RCU

The rcutorture module reduces the probability of false-positive non-RCU failures using these straightforward techniques:

  1. Allocate the memory to be referenced by rcu_torture_current in an array, whose elements are only ever used by rcutorture.
  2. Once an element is removed from rcu_torture_current, keep it in a special rcu_torture_removed list for some time before allowing it to be reused.
  3. Keep the random time delays in the rcutorture readers.
  4. Run rcutorture on an otherwise idle system, or, more commonly these days, within an otherwise idle guest OS.
  5. Make rcutorture place a relatively heavy load on RCU.

Use of the array keeps rcutorture from use-after-free clobbering of other kernel subsystems' data structures. Keeping to-be-freed elements on the rcu_torture_removed list increases the probability that rcutorture will detect a too-short grace period, as do the delays in the readers. Finally, ensuring that most of the RCU activity is done at rcutorture's behest decreases the probability that any too-short grace periods will clobber other kernel subsystems.

The rcu_torture_alloc() and rcu_torture_free() functions manage a freelist of array elements. The freelist is a simple list creatively named rcu_torture_freelist and guarded by a global rcu_torture_lock. Because allocation and freeing happen at most once per grace period, this global lock is just fine: It is nowhere near being any sort of performance or scalability bottleneck.
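A freelist of this general shape might look as follows in userspace C. This is a minimal sketch, not the rcutorture code: it assumes a pthread mutex in place of the kernel lock and a hand-rolled singly linked list in place of the kernel's list primitives, and the names N_ELEMS, pool, pool_init(), elem_alloc(), and elem_free() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_ELEMS 4	/* Hypothetical pool size. */

struct elem {
	struct elem *next;	/* Freelist linkage. */
};

static struct elem pool[N_ELEMS];	/* Only the test itself uses these. */
static struct elem *freelist;
static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;

/* Put every array element on the freelist. */
static void pool_init(void)
{
	freelist = NULL;
	for (int i = 0; i < N_ELEMS; i++) {
		pool[i].next = freelist;
		freelist = &pool[i];
	}
}

/* Pop an element, or return NULL if the pool is exhausted. */
static struct elem *elem_alloc(void)
{
	struct elem *p;

	pthread_mutex_lock(&freelist_lock);
	p = freelist;
	if (p)
		freelist = p->next;
	pthread_mutex_unlock(&freelist_lock);
	return p;
}

/* Push an element back for later reuse. */
static void elem_free(struct elem *p)
{
	assert(p >= pool && p < pool + N_ELEMS);
	pthread_mutex_lock(&freelist_lock);
	p->next = freelist;
	freelist = p;
	pthread_mutex_unlock(&freelist_lock);
}
```

Because every element comes from the private array, a use-after-free caused by a too-short grace period can only damage memory that the test itself owns.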

The rcu_torture_removed list is handled by the rcu_torture_pipe_update_one() function that is invoked by rcutorture callbacks and the rcu_torture_pipe_update() function that is invoked by rcu_torture_writer() after completing a synchronous RCU grace period. The rcu_torture_pipe_update_one() function updates only the specified array element, and the rcu_torture_pipe_update() function updates all of the array elements residing on the rcu_torture_removed list. These updates each increment the ->rtort_pipe_count field. When the value of this field reaches RCU_TORTURE_PIPE_LEN (by default 10), the array element is freed for reuse.
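The aging of removed elements can be modeled in userspace as follows. This is a simplified sketch under stated assumptions, not the kernel functions themselves: an array of pointers stands in for the rcu_torture_removed list, PIPE_LEN mirrors the default quoted above, and pipe_update_one() and pipe_update() are hypothetical names chosen to echo the real ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PIPE_LEN 10	/* Mirrors the RCU_TORTURE_PIPE_LEN default. */

struct elem {
	int pipe_count;	/* Grace periods survived since removal. */
};

/* Advance one element; returns true once it is old enough to reuse. */
static bool pipe_update_one(struct elem *p)
{
	assert(p != NULL);
	p->pipe_count++;
	return p->pipe_count >= PIPE_LEN;
}

/*
 * Advance every element in removed[], dropping those now reusable.
 * Returns the number of elements still waiting.
 */
static int pipe_update(struct elem **removed, int n)
{
	int kept = 0;

	for (int i = 0; i < n; i++)
		if (!pipe_update_one(removed[i]))
			removed[kept++] = removed[i];
	return kept;
}
```

Each call corresponds to one elapsed grace period, so in this model an element must survive PIPE_LEN of them before it can be handed out again.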

The rcu_torture_reader() function handles the random time delays and leverages the awesome power of multiple kthreads to maintain a high read-side load on RCU. The rcu_torture_writer() function runs in a single kthread in order to simplify synchronization, but it enlists the help of several other kthreads repeatedly invoking the rcu_torture_fakewriter() function in order to keep the update-side load on RCU at a respectable level.


This blog post described RCU's fundamental grace-period guarantee and how rcutorture stress-tests it. It also described a few simple ways that rcutorture increases the probability that any failures to provide this guarantee are attributed to RCU and not to some hapless innocent bystander.