
RCU
My previous posting described an RCU bug that I might plausibly blame on falsehoods from firmware. The RCU bug in this post, alas, I can blame only on myself.

In retrospect, things were going altogether too smoothly while I was getting my RCU commits ready for the 3.19 merge window. That changed suddenly when my automated testing kicked out a “BUG: FAILURE, 1 instances”. This message indicates a grace-period failure, in other words that RCU failed to be RCU, which is of course really really really bad. Fortunately, it was still some weeks until the merge window, so there was some time for debugging and fixing.

Of course, we all have specific patches that we are suspicious of. So my next step was to revert suspect patches and to otherwise attempt to outguess the bug. Unfortunately, I quickly learned that the bug was difficult to reproduce, requiring something like 100 hours of focused rcutorture testing. Bisection based on 100-hour tests would have consumed the remainder of 2014 and a significant fraction of 2015, so something better was required. In fact, something way better was required because there was only a very small number of failures, which meant that the expected test time to reproduce the bug might well have been 200 hours or even 300 hours instead of my best guess of 100 hours.

My first attempt at “something better” was to inspect the suspect patches. This effort did locate some needed fixes, but nothing that would explain the grace-period failures. My next attempt was to take a closer look at the dmesg logs of the two runs with failures, which produced much better results.

You see, rcutorture detects failures using the RCU API, for example, invoking synchronize_rcu() and waiting for it to return. This overestimates the grace-period duration, because RCU's grace-period primitives wait not only for a grace period to elapse, but also for the RCU core to notice their requests and also for RCU to inform them of the grace period's end, as shown in the following figure.

[Figure: RCUnearMiss.svg]

This means that a given RCU read-side critical section might overlap a grace period by a fair amount without rcutorture being any the wiser. However, some RCU implementations provide rcutorture access to the underlying grace-period counters, which in theory provide rcutorture with a much more precise view of each grace period's duration. These counters have long been recorded in the rcutorture output as “Reader Batch” counts, as shown on the second line of the following (the first line is the full-API data):


rcu-torture: !!! Reader Pipe:  13341924415 88927 1 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 1 0 0 0 0 0 0 0 0



This shows rcutorture output from a run containing a failure. On each line, the first two numbers correspond to legal RCU activity: An RCU read-side critical section might have been entirely contained within a single RCU grace period (first number) or it might have overlapped the boundary between an adjacent pair of grace periods (second number). However, a single RCU read-side critical section is most definitely not allowed to overlap three different grace periods, because that would mean that the middle grace period was by definition too short. And exactly that error occurred above, as indicated by the exclamation marks and by the “1” in the third position of the “Reader Pipe” line above.
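
To make this bookkeeping concrete, here is a minimal sketch of how such a histogram might be tallied. This is not the actual rcutorture code: rcu_torture_gp_seq() and do_some_reading() are hypothetical stand-ins, and the counter accesses are deliberately left unsynchronized.

/* Illustrative sketch only, not the actual rcutorture code. */
#define READER_PIPE_LEN 11

static unsigned long reader_pipe[READER_PIPE_LEN];  /* "Reader Pipe"-style histogram */

static void torture_one_reader(void)
{
  unsigned long gp_start;
  unsigned long gp_end;
  unsigned long delta;

  rcu_read_lock();
  gp_start = rcu_torture_gp_seq();  /* hypothetical grace-period-counter accessor */
  do_some_reading();                /* the read-side work being tortured */
  gp_end = rcu_torture_gp_seq();
  rcu_read_unlock();

  delta = gp_end - gp_start;        /* grace-period boundaries crossed by this reader */
  if (delta >= READER_PIPE_LEN)
    delta = READER_PIPE_LEN - 1;
  reader_pipe[delta]++;             /* unsynchronized, hence statistical only */
}

Any increment of reader_pipe[2] or beyond means that some reader saw at least three grace periods, in other words, that the middle grace period was too short.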

In theory, if RCU were working perfectly, the output would instead look something like this, without the exclamation marks and without non-zero values in the third and subsequent positions:


rcu-torture: Reader Pipe:  13341924415 88927 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 0 0 0 0 0 0 0 0 0



In practice, the “Reader Batch” counters were intended only for statistical use, and access to them is therefore unsynchronized, as indicated by the jagged grace-period-start and grace-period-end lines in the above figure. Furthermore, any attempt to synchronize them in rcutorture's RCU readers would incur so much overhead that rcutorture could not possibly do a good job of torturing RCU. This means that these counters can result in false positives, and false positives are not something that I want in my automated test suite. In other words, it is at least theoretically possible that we might legitimately see something like this from time to time:


rcu-torture: Reader Pipe:  13341924415 88927 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 1 0 0 0 0 0 0 0 0



We clearly need to know how bad this false-positive problem is. One way of estimating the level of false-positive badness is to scan my three months' worth of rcutorture test results. This scan showed that there were no suspicious Reader Batch counts in more than 1,300 hours of rcutorture testing, aside from the roughly one per hour from the TREE03 test, which happened to also be the only test to produce real failures. This suggested that I could use the Reader Batch counters as indications of a “near miss” to guide my debugging effort. The once-per-hour near-miss rate suggested a ten-hour test duration, which I was able to compress into two hours by running five concurrent tests.

Why ten hours?

Since TREE03 had been generating Reader Batch near misses all along, this was not a recent problem. Therefore, git bisect was more likely to be confused by unrelated ancient errors than to be of much help. I therefore instead bisected by configuration and by rcutorture parameters. This alternative bisection showed that the problem occurred only with normal grace periods (as opposed to expedited grace periods), only with CONFIG_RCU_BOOST=y, and only with concurrent CPU-hotplug operations. Of course, one possibility was that the bug was in rcutorture rather than RCU, but rcutorture was exonerated via priority-boost testing on a kernel built with CONFIG_RCU_BOOST=n, which showed neither failures nor Reader Batch near misses. This combination of test results points the finger of suspicion directly at the rcu_preempt_offline_tasks() function, which is the only part of RCU's CPU-hotplug code path that has a non-trivial dependency on CONFIG_RCU_BOOST.

One very nice thing about this combination is that it is very unlikely that people are encountering this problem in production. After all, for this problem to appear, you must be doing the following:


  1. Running a kernel with both CONFIG_PREEMPT=y and CONFIG_RCU_BOOST=y.
  2. Running on hardware containing more than 16 CPUs (assuming the default value of 16 for CONFIG_RCU_FANOUT_LEAF).
  3. Carrying out CPU-hotplug operations so frequently that a given block of 16 CPUs (for example, CPUs 0-15 or 16-31) might reasonably all be offline at the same time.


That said, if this does describe your system and workload, you should consider applying this patch set. You should also consider modifying your workload.

Returning to the rcu_preempt_offline_tasks() function that the finger of suspicion now points to:


 1 static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
 2                                      struct rcu_node *rnp,
 3                                      struct rcu_data *rdp)
 4 {
 5   struct list_head *lp;
 6   struct list_head *lp_root;
 7   int retval = 0;
 8   struct rcu_node *rnp_root = rcu_get_root(rsp);
 9   struct task_struct *t;
10 
11   if (rnp == rnp_root) {
12     WARN_ONCE(1, "Last CPU thought to be offlined?");
13     return 0;
14   }
15   WARN_ON_ONCE(rnp != rdp->mynode);
16   if (rcu_preempt_blocked_readers_cgp(rnp) && rnp->qsmask == 0)
17     retval |= RCU_OFL_TASKS_NORM_GP;
18   if (rcu_preempted_readers_exp(rnp))
19     retval |= RCU_OFL_TASKS_EXP_GP;
20   lp = &rnp->blkd_tasks;
21   lp_root = &rnp_root->blkd_tasks;
22   while (!list_empty(lp)) {
23     t = list_entry(lp->next, typeof(*t), rcu_node_entry);
24     raw_spin_lock(&rnp_root->lock);
25     smp_mb__after_unlock_lock();
26     list_del(&t->rcu_node_entry);
27     t->rcu_blocked_node = rnp_root;
28     list_add(&t->rcu_node_entry, lp_root);
29     if (&t->rcu_node_entry == rnp->gp_tasks)
30       rnp_root->gp_tasks = rnp->gp_tasks;
31     if (&t->rcu_node_entry == rnp->exp_tasks)
32       rnp_root->exp_tasks = rnp->exp_tasks;
33 #ifdef CONFIG_RCU_BOOST
34     if (&t->rcu_node_entry == rnp->boost_tasks)
35       rnp_root->boost_tasks = rnp->boost_tasks;
36 #endif
37     raw_spin_unlock(&rnp_root->lock);
38   }
39   rnp->gp_tasks = NULL;
40   rnp->exp_tasks = NULL;
41 #ifdef CONFIG_RCU_BOOST
42   rnp->boost_tasks = NULL;
43   raw_spin_lock(&rnp_root->lock);
44   smp_mb__after_unlock_lock();
45   if (rnp_root->boost_tasks != NULL &&
46       rnp_root->boost_tasks != rnp_root->gp_tasks &&
47       rnp_root->boost_tasks != rnp_root->exp_tasks)
48     rnp_root->boost_tasks = rnp_root->gp_tasks;
49   raw_spin_unlock(&rnp_root->lock);
50 #endif
51   return retval;
52 }



First, we should of course check the code under #ifdef CONFIG_RCU_BOOST. The code starting at line 43 is quite suspicious, as it is not clear why it is safe to make these modifications after having dropped the rnp_root structure's ->lock on line 37. And removing lines 43-49 does in fact reduce the number of Reader Batch near misses by an order of magnitude, but sadly not to zero.

I was preparing to dig into this function to find the bug, but then I noticed that the loop spanning lines 22-38 is executed with interrupts disabled. Given that the number of iterations through this loop is limited only by the number of tasks in the system, this loop is horrendously awful for real-time response.

Or at least it is now—at the time I wrote that code, the received wisdom was “Never do CPU-hotplug operations on a system that runs realtime applications!” I therefore had invested zero effort into maintaining good realtime latencies during these operations. That expectation has changed because some recent real-time applications offline then online CPUs in order to clear off irrelevant processing that might otherwise degrade realtime latencies. Furthermore, the large CPU counts on current systems are an invitation to run multiple real-time applications on a single system, so that the CPU hotplug operations run as part of one application's startup might interfere with some other application. Therefore, this loop clearly needs to go. So I abandoned my debugging efforts and focused instead on getting rid of rcu_preempt_offline_tasks() entirely, along with all of its remaining bugs.

The point of the loop spanning lines 22-38 is to handle any tasks on the rnp structure's ->blkd_tasks list. This list accumulates tasks that block while in an RCU read-side critical section while running on a CPU associated with the leaf rcu_node structure pointed to by rnp. This blocking will normally be due to preemption, but could also be caused by -rt's blocking spinlocks. This function is called only when the last CPU associated with the leaf rcu_node structure is going offline, after which there will no longer be any online CPUs associated with this leaf rcu_node structure. This loop then moves those tasks to the root rcu_node structure's ->blkd_tasks list. Because there is always at least one CPU online somewhere in the system, there will always be at least one online CPU associated with the root rcu_node structure, which means that RCU will be guaranteed to take proper care of these tasks.

The obvious question at this point is “why not just leave the tasks on the leaf rcu_node structure?” After all, this clearly avoids the unbounded time spent moving tasks while interrupts are disabled. However, it also means that RCU's grace-period computation must change in order to account for blocked tasks associated with a CPU-free leaf rcu_node structure.

In the old approach, the leaf rcu_node structure in question was excluded from RCU's grace-period computations immediately. In the new approach, RCU will need to continue including that leaf rcu_node structure until the last task on its list exits its outermost RCU read-side critical section. At that point, the last task will remove itself from the list, leaving the list empty, thus allowing RCU to ignore the leaf rcu_node structure from that point forward.
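
The shape of this new approach can be sketched as follows. This is a greatly simplified illustration rather than the kernel's actual code, with all locking omitted; report_quiescent_state_up() is a hypothetical stand-in for RCU's quiescent-state-reporting machinery.

/* Simplified sketch only, not the actual kernel code; locking omitted. */
static void reader_exit_cleanup(struct task_struct *t, struct rcu_node *rnp)
{
  list_del(&t->rcu_node_entry);          /* leave this rcu_node's ->blkd_tasks list */
  if (!list_empty(&rnp->blkd_tasks))
    return;                              /* other blocked tasks remain */
  if (rnp->qsmaskinit)
    return;                              /* some CPU for this rcu_node is still online */

  /*
   * No online CPUs and no blocked tasks:  this leaf rcu_node structure
   * can no longer block grace periods, so report that fact up the tree.
   */
  report_quiescent_state_up(rnp);
}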

This patch set makes this change by abstracting an rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu(), which can then be called from rcu_read_unlock_special() when the last task removes itself from the ->blkd_tasks list. This change allows the rcu_preempt_offline_tasks() function to be dispensed with entirely. With this patch set applied, TREE03 ran for 1,000 hours with neither failures nor near misses, which gives a high degree of confidence that this patch set made the bug less likely to appear. This, along with the decreased complexity and the removal of a source of realtime latency, is a definite plus!

I also modified the rcutorture scripts to pay attention to the “Reader Batch” near-miss output, giving a warning if one such near miss occurs in a run, and giving an error if two or more near misses occur, but at a rate of at least one near miss per three hours of test time. This should inform me should I make a similar mistake in the future.

Given that RCU priority boosting has been in the Linux kernel for many years and given that I regularly run rcutorture with CPU-hotplug testing enabled on CONFIG_RCU_BOOST=y kernels, it is only fair to ask why rcutorture did not locate this bug years ago. The likely reason is that I only recently added RCU callback flood testing, in which rcutorture registers 20,000 RCU callbacks, waits a jiffy, registers 20,000 more callbacks, waits another jiffy, then finally registers a third set of 20,000 callbacks. This causes call_rcu() to take evasive action, which has the effect of starting the RCU grace period more quickly, which in turn makes rcutorture more sensitive to too-short grace periods. This view is supported by the fact that I saw actual failures only on recent kernels whose rcutorture testing included callback-flood testing.
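
For reference, a callback flood of this general form can be produced by something as simple as the following sketch, which is illustrative rather than the actual rcutorture code:

/* Illustrative callback-flood sketch, not the actual rcutorture code. */
#define RCU_FLOOD_N 20000

static struct rcu_head flood_heads[RCU_FLOOD_N];
static atomic_t flood_invoked;

static void flood_cb(struct rcu_head *rhp)
{
  atomic_inc(&flood_invoked);            /* count invoked callbacks */
}

static void do_one_flood_burst(void)
{
  int i;

  for (i = 0; i < RCU_FLOOD_N; i++)
    call_rcu(&flood_heads[i], flood_cb);
  schedule_timeout_interruptible(1);     /* wait roughly one jiffy */
}

Note that reusing the flood_heads[] array for a second burst would require first waiting for the previous burst's callbacks to be invoked.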

Another question is “exactly what was the bug?” I have a fairly good idea of what stupid mistake led to that bug, but given that I have completely eliminated the offending function, I am not all that motivated to chase it down. In fact, it seems much more productive to leave it as an open challenge for formal verification. So, if you have a favorite formal-verification tool, why not see what it can make of rcu_preempt_offline_tasks()?

Somewhat embarrassingly, slides 39-46 of this Collaboration Summit presentation feature my careful development of rcu_preempt_offline_tasks(). I suppose that I could hide behind the “More changes due to RCU priority boosting” on slide 46, but the fact remains that I simply should not have written this function in the first place, whether carefully before priority boosting or carelessly afterwards.

In short, it is not enough to correctly code a function. You must also code the correct function!

Lies that firmware tells RCU

RCU
One of the complaints that real-time people have against some firmware is that it lies about its age, attempting to cover up cycle-stealing via SMIs by reprogramming the TSC. Some firmware goes farther and lies about the number of CPUs on the system, apparently on the grounds that more is better, regardless of how many of those alleged CPUs actually exist.

RCU used to naively believe the firmware, and would therefore create one set of rcuo kthreads per advertised CPU. This resulted in hundreds of such kthreads on systems that actually had only a few tens of CPUs. But RCU can choose to create the rcuo kthreads only for CPUs that actually come online. Problem solved!

Mostly solved, that is.

Yanko Kaneti, Jay Vosburgh, Meelis Roos, and Eric B Munson discovered the “mostly” part when they encountered hangs in _rcu_barrier(). So what is rcu_barrier()?

The rcu_barrier() primitive waits for all pre-existing callbacks to be invoked. This is useful when you want to unload a module that uses call_rcu(), as described in this LWN article. It is important to note that rcu_barrier() does not necessarily wait for a full RCU grace period. In fact, if there are currently no RCU callbacks queued, rcu_barrier() is within its rights to simply return immediately. Otherwise, rcu_barrier() enqueues a callback on each CPU that already has callbacks, and waits for all these callbacks to be invoked. Because RCU is careful to invoke all callbacks posted to a given CPU in order, this guarantees that by the time rcu_barrier() returns, all pre-existing RCU callbacks will have already been invoked, as required.
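
The following is a conceptual sketch of this algorithm, not the kernel's actual _rcu_barrier() implementation: cpu_has_rcu_callbacks() is a hypothetical predicate, and the exclusion of concurrent CPU-hotplug operations is omitted.

/* Conceptual sketch only, not the kernel's actual implementation. */
static atomic_t barrier_cpu_count;
static struct completion barrier_completion;
static DEFINE_PER_CPU(struct rcu_head, barrier_head);

static void barrier_callback(struct rcu_head *rhp)
{
  if (atomic_dec_and_test(&barrier_cpu_count))
    complete(&barrier_completion);       /* last callback: wake up the waiter */
}

static void barrier_enqueue_here(void *unused)
{
  call_rcu(this_cpu_ptr(&barrier_head), barrier_callback);
}

void sketch_rcu_barrier(void)
{
  int cpu;

  init_completion(&barrier_completion);
  atomic_set(&barrier_cpu_count, 1);     /* avoid premature completion */
  for_each_online_cpu(cpu) {
    if (!cpu_has_rcu_callbacks(cpu))     /* hypothetical predicate */
      continue;
    atomic_inc(&barrier_cpu_count);
    /* Run on the target CPU so that the callback lands on its list. */
    smp_call_function_single(cpu, barrier_enqueue_here, NULL, 1);
  }
  if (atomic_dec_and_test(&barrier_cpu_count))
    complete(&barrier_completion);
  wait_for_completion(&barrier_completion);  /* all enqueued callbacks have run */
}

The initial count of one, dropped only after all the callbacks have been posted, prevents the waiter from being awakened before the loop completes.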

However, it is possible to offload invocation of a given CPU's RCU callbacks to rcuo kthreads, as described in this LWN article. This kthread might well be executing on some other CPU, which means that the callbacks are moved from one list to another as they pass through their lifecycles. This makes it difficult for rcu_barrier() to reliably determine whether or not there are RCU callbacks pending for an offloaded CPU. So rcu_barrier() simply unconditionally enqueues an RCU callback for each offloaded CPU, regardless of that CPU's state.

In fact, rcu_barrier() even enqueues a callback for offloaded CPUs that are offline. The reason for this odd-seeming design decision is that a given CPU might enqueue a huge number of callbacks, then go offline. It might take the corresponding rcuo kthread significant time to work its way through this backlog of callbacks, which means that rcu_barrier() cannot safely assume that an offloaded CPU is callback-free just because it happens to be offline. So, to come full circle, rcu_barrier() enqueues an RCU callback for all offloaded CPUs, regardless of their state.

This approach works quite well in practice.

At least, it works well on systems where the firmware provides the Linux kernel with an accurate count of the number of CPUs. However, it breaks horribly when the firmware over-reports the number of CPUs, because the system will then have CPUs that never ever come online. If these CPUs have been designated as offloaded CPUs, this means that their rcuo kthreads will never ever be spawned, which in turn means that any callbacks enqueued for these mythical CPUs will never ever be invoked. And because rcu_barrier() waits for all the callbacks that it posts to be invoked, rcu_barrier() ends up waiting forever, which can of course result in hangs.

The solution is to make rcu_barrier() refrain from posting callbacks for offloaded CPUs that have never been online, in other words, for CPUs that do not yet have an rcuo kthread.
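
In terms of the sketch above, the fix amounts to a single additional check in the per-CPU loop, along the following lines; rcu_nocb_kthread_exists() is a hypothetical helper standing in for “this offloaded CPU's rcuo kthread has actually been spawned”.

  for_each_possible_cpu(cpu) {
    /*
     * An offloaded CPU that has never been online has no rcuo kthread,
     * so a callback posted for it would never be invoked.  Skip it.
     */
    if (rcu_is_nocb_cpu(cpu) && !rcu_nocb_kthread_exists(cpu))
      continue;
    /* ... otherwise post the barrier callback as before ... */
  }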

With some luck, this patch will clear things up. And I did take the precaution of reviewing all of RCU's uses of for_each_possible_cpu(), so here is hoping! ;-)
I had the privilege of being asked to present on ordering, RCU, and validation at a joint meeting of the REORDER (Third International Workshop on Memory Consistency Models) and EC2 (7th International Workshop on Exploiting Concurrency Efficiently and Correctly) workshops.

Before my talk, Michael Tautschnig gave a presentation (based on this paper) on an interesting prototype tool (called “mole,” with the name chosen because the gestation period of a mole is about 42 days) that helps identify patterns of usage in large code bases. It is early days for this tool, but one could imagine it growing into something quite useful, perhaps answering questions such as “what are the different ways in which the Linux kernel uses reference counting?” He also took care to call out the many disadvantages of testing, which include not being able to test all paths, all possible races, all possible inputs, or all possible much of anything, at least not in finite time.

I started my talk with an overview of split counters, where each CPU (or task or whatever) updates its own counter, and the aggregate counter is read out by summing all the per-CPU counters. There was some concern expressed by members of the audience about the possibility of inconsistent results. For example, if one CPU adds five, another CPU adds seven, and a third CPU adds 11 to an initially zero counter, then two CPUs reading out the counter might see 12 and 18, respectively, which are inconsistent (they differ by six, and no CPU added six). To their credit, the attendees picked right up on a reasonable correctness criterion. The idea is that the aggregate counter's value varies with time, and that any given reader will be guaranteed to return a value between that of the counter when the reader started and that of the counter when the reader ended: Consistency is neither needed nor provided in a great number of situations.
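
Here is a minimal user-space sketch of the split-counter idea, using C11 atomics with one slot per thread; it is illustrative rather than a production implementation:

#include <stdatomic.h>

#define MAX_THREADS 64

/* One counter slot per thread, so that updates never contend. */
static atomic_long counter[MAX_THREADS];

/* Called by thread 'tid' (0 <= tid < MAX_THREADS) to add 'n'. */
static void counter_add(int tid, long n)
{
  atomic_fetch_add_explicit(&counter[tid], n, memory_order_relaxed);
}

/* Aggregate value: sum the per-thread slots. */
static long counter_read(void)
{
  long sum = 0;

  for (int i = 0; i < MAX_THREADS; i++)
    sum += atomic_load_explicit(&counter[i], memory_order_relaxed);
  return sum;  /* for add-only counters, between the start-of-read and end-of-read totals */
}

Two concurrent readers can indeed obtain the inconsistent-looking 12 and 18 from the example above, but each result nevertheless lies between the aggregate value at the start and at the end of that reader's summation, which is exactly the correctness criterion the attendees arrived at.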

I then gave my usual introduction to RCU, and of course it is always quite a bit of fun introducing RCU to people who have never encountered anything like it. There was quite a bit of skepticism initially, as well as a lot of questions and comments.

I then turned to validation, noting the promise of some formal-validation tooling. I ended by saying that although I agreed with the limitations of testing called out by the previous speaker, the fact remains that a number of people have devised tests that had found RCU bugs (thank you, Stephen, Dave, and Fengguang!), but no one has yet devised a hard-core formal-validation tool that has found any bugs in RCU. I also pointed out that this is definitely not because there are no bugs in RCU! (Yes, I have gotten rid of all day-one bugs in RCU, but only by having also gotten rid of all day-one code in RCU.) When asked if I meant bugs in RCU usage or in RCU itself, I replied “Either would be good.” Several people wrote down where to find RCU in the Linux kernel, so it will be interesting to see what they come up with. (Perhaps all too interesting!)

There were several talks on analyzing weakly ordered systems, but keep in mind that for these guys, even x86 is weakly ordered. After all, it allows prior stores to be reordered with later loads.

Another interesting talk was given by Kapil Vaswani on the topic of wait freedom. Recall that in a wait-free algorithm, every process is guaranteed to make some progress in a finite time, even in the presence of arbitrarily long delays for any given process. In contrast, in a lock-free algorithm, it is guaranteed only that some process will make progress in a finite time, again even in the presence of arbitrarily long delays for any given process. It is worth noting that neither of these guarantees is sufficient for real-time programs, which require a specified amount of progress (not merely some progress) in a bounded amount of time (not merely a finite amount of time). Wait-freedom and lock-freedom are nevertheless important forward-progress guarantees, and there are numerous other similar guarantees, including obstruction freedom, deadlock freedom, starvation freedom, and many more besides.

It turns out that most software in production, even in real-time systems, is not wait-free, which has been a source of consternation for many researchers for quite some time. Kapil went on to describe how Alistarh et al. showed that, roughly speaking, given a non-hostile scheduler and crash-free execution, lock-free algorithms have wait-free behavior.

The interesting thing about this is that you can take it quite a bit farther, and those of you who know me well won't be surprised to learn that I did just that in a question to the speaker. If you have a non-hostile scheduler, crash-free execution, FIFO locks, bounded lock-hold times, no lock nesting, a finite number of processes, and so on, you can obtain the benefits of the more aggressive forward-progress guarantees. The general idea is that if you have at most N processes, and if the maximum lock-hold time is T, then you can wait at most (N-1)T time to acquire a given lock. (Those wishing greater rigor should read Bjoern B. Brandenburg's dissertation — Full disclosure: I was on Bjoern's committee.) In happy contrast to the authors of the paper mentioned in the previous paragraph, the speaker and audience seemed quite intrigued by this line of thought.
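
For example, under those assumptions, if at most N = 4 processes contend for a given FIFO lock and the maximum lock-hold time is T = 10 microseconds, then the last process to arrive waits behind at most three earlier holders, and therefore acquires the lock within (N-1)T = 30 microseconds.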

In all, it was an extremely interesting and thought-provoking time. With some luck, perhaps we will see some powerful software tools introduced by this group of researchers.
We have all suffered from changing requirements. We are almost done with implementation, maybe even most of the way done testing, and then a change in requirements invalidates all of our hard work. This can of course be extremely irritating.

It turns out that changing requirements are not specific to software, nor are they anything new. Here is an example from my father's area of expertise, which is custom-built large-scale food processing equipment.

My father's company was designing a new-to-them piece of equipment, namely a spinach chopper able to chop 15 tons of spinach per hour. Given that this was a new area, they built a small-scale prototype with which they ran a number of experiments on “small” batches of spinach (that is, less than a ton of spinach). The customer provided feedback on the results of these experiments, which fed into the design of the full-scale production model.

After the resulting spinach chopper had been in production for a short while, there was a routine follow-up inspection. The purpose of the inspection was to check for unexpected wear and other unanticipated problems with the design and fabrication. While the inspector was there, the chopper kicked a large rock into the air. It turned out that spinach from the field can contain rocks up to six inches (15cm) in diameter, and this requirement had not been communicated during development. This situation of course inspired a quick engineering change, installing a cage that captured the rocks.

Of course, it is only fair to point out that this requirement didn't actually change. After all, spinach from the field has always contained rocks up to six inches in diameter. There had been a failure to communicate an existing requirement, not a change in the actual requirement.

However, from the viewpoint of the engineers, the effect is the same. Regardless of whether there was an actual change in the requirements or the exposure of a previously hidden requirement, redesign and rework will very likely be required.

One other point of interest to those who know me... The spinach chopper was re-inspected some five years after it started operation. Its blades had never been sharpened, and despite encountering the occasional rock, still did not need sharpening. So to those of you who occasionally accuse me of over-engineering, all I can say is that I come by it honestly! ;–)
The first edition of “Is Parallel Programming Hard, And, If So, What Can You Do About It?” is now available in electronic form. The dead-tree version will be available a bit later, or, as they say, Real Soon Now.

Big-animal changes over the early 2013 version include:


  1. Some material buried in the introduction and a few key Quick Quizzes has been pulled up into a new “How To Use This Book” chapter.
  2. There is significantly more material in the “Beyond Partitioning” section.
  3. A new “Hazard Pointers” section has been added to the “Deferred Processing” chapter.
  4. The “Data Structures” chapter has been filled out.
  5. Formal verification has been moved from an appendix to a new chapter following the pre-existing “Validation” chapter.
  6. A new “Putting It All Together” chapter contains some case studies.
  7. New cartoons have been added and some of the old cartoons have been updated.


Interestingly enough, even before its first edition, this book has seen some use in various classrooms. The memory-barrier information and the “Counting” chapter seem to be the most popular in that environment.

So what next?


Why, the second edition, of course!!! What else? :-)

Parallel Programming: January 2014 Update

This release of Is Parallel Programming Hard, And, If So, What Can You Do About It? features filled-out data-structures, debugging, and advanced-synchronization chapters; the addition of hazard pointers to the deferred-processing chapter; two new cartoons; and random updates to a number of other chapters. It also includes contributions from Angela Demke Brown, Darren Hart, Ralf Moeller, Charles-François Natali, Borislav Petkov, Anatol Pomozov, Alexey Roytman, and Frederic Weisbecker.

This release is a bit unusual in that it is a release candidate for a first edition. Although I still stand by my 2011 statement that this book will never be finished, some readers have been urging me to finish it, and releasing a first edition seems as good a way as any of resolving these two conflicting impulses. I expect to release the first edition in a month or two, and will of course address any editorial feedback I receive during that time. Another request from readers has been to make hardcopies easier to obtain, which I find quite surprising, but which I will nevertheless do for the first edition. In fact, a few of the people who do the best job of providing editorial feedback on the release candidates will be eligible to receive rare signed first editions. The decision of the judge (which would be me) as to who is eligible to receive signed first editions is final.

What is “editorial feedback”? This is hard to define precisely, but would include grammar fixes, fixes for bugs in example code, and small expansions or clarifications to text. It does not include things like adding a chapter on update-friendly mechanisms or on GPUs and FPGAs. In other words, “editorial feedback” consists of fixes that can reasonably be accommodated at the current release-candidate stage.

As always, git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git will be updated in real time.
RCU
I recently refactored my rcutorture tests, getting that item off of my todo list after “only” three or four years. To my surprise, one (but only one) of the new rcutorture test scenarios failed, namely the one creatively named TREE08:





CONFIG_SMP=y
CONFIG_NR_CPUS=16
CONFIG_PREEMPT_NONE=n
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=y
CONFIG_HZ_PERIODIC=n
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ_FULL=n
CONFIG_RCU_FAST_NO_HZ=n
CONFIG_RCU_TRACE=n
CONFIG_HOTPLUG_CPU=n
CONFIG_SUSPEND=n
CONFIG_HIBERNATION=n
CONFIG_RCU_FANOUT=3
CONFIG_RCU_FANOUT_EXACT=y
CONFIG_RCU_FANOUT_LEAF=2
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_PROVE_RCU_DELAY=n
CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_RCU_CPU_STALL_VERBOSE=n
CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_PRINTK_TIME=y


One reason for my surprise was that I had been running similar rcutorture scenarios for some years. In sharp (and happy) contrast to previous efforts featuring but a handful of failures in a given ten-hour run, the effect was not subtle. In fact, TREE08 often failed more than 20 times per minute.

My first thought was that I had blown a recent commit, but testing earlier kernel versions gave similar failure rates.

Additional testing and analysis showed that the bug was quite sensitive to the exact values of the Kconfig parameters. For example, setting CONFIG_RCU_FANOUT=2 or CONFIG_RCU_FANOUT=4 resulted in rock-solid behavior, despite CONFIG_RCU_FANOUT=3 failing miserably. Similarly, setting CONFIG_RCU_FANOUT_EXACT=n (which is thankfully the default) also resulted in rock-solid behavior. This fortunately provides not only a workaround for the bug, but, even better, a workaround that is enabled by default. This enabled-by-default workaround did much to explain the total lack of bug reports.

This sensitivity to a few Kconfig parameters implicated the boot-time code that sets up the rcu_node structures. I therefore ran this boot-time setup code in usermode scaffolding, resulting in the following output:

NR_CPUS = 16, CONFIG_RCU_FANOUT = 3, CONFIG_RCU_FANOUT_LEAF = 2, MAX_RCU_LVLS = 4
rcu_fanout_leaf = 16, nr_cpu_ids = 3
NR_CPUS: 16, RCU_NUM_LVLS: 3, rcu_num_lvls: 3, rcu_num_nodes: 12
num_rcu_lvl[0]: 1
num_rcu_lvl[1]: 3
num_rcu_lvl[2]: 8
num_rcu_lvl[3]: 16
num_rcu_lvl[4]: 0
rsp->levelspread[0]: 2
rsp->levelspread[1]: 3
rsp->levelspread[2]: 3
CPU 0: rdp=0x804c420 rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 1: rdp=0x804c4cc rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 2: rdp=0x804c578 rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 3: rdp=0x804c624 rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 4: rdp=0x804c6d0 rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 5: rdp=0x804c77c rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 6: rdp=0x804c828 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 7: rdp=0x804c8d4 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 8: rdp=0x804c980 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 9: rdp=0x804ca2c rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 10: rdp=0x804cad8 rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 11: rdp=0x804cb84 rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 12: rdp=0x804cc30 rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 13: rdp=0x804ccdc rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 14: rdp=0x804cd88 rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 15: rdp=0x804ce34 rnp=0x804b2e8 (15-15) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)


The rdp= numbers show each CPU's rcu_data structure addresses, and the sequence of rnp= numbers shows the rcu_node structures encountered when traversing upwards from the corresponding CPU to the root of the rcu_node tree. The numbers in parentheses show which CPUs are handled by each rcu_node structure. The initial lines simply show the values of a few key boot-time macros, variables, and arrays.
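
As a quick check of the geometry shown above: with CONFIG_RCU_FANOUT_LEAF=2, covering 16 CPUs requires ceil(16/2) = 8 leaf rcu_node structures; with CONFIG_RCU_FANOUT=3, those 8 leaves require ceil(8/3) = 3 second-level rcu_node structures, which in turn fit under a single root. This accounts for the three levels, the num_rcu_lvl[] values of 1, 3, and 8, and the total of 12 rcu_node structures reported above.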

Had I been paying attention, I would have noticed that this output shows two very odd things, with one of them called out twice. However, some of my assumptions blinded me to these warning signs.

I therefore enabled RCU event tracing, with the usual false start where I failed to build the kernel with CONFIG_RCU_TRACE=y. This resulted in a 20MB trace, roughly 15MB of which was required to show the entire grace-period dance, starting from the callback being registered and ending with rcutorture detecting the RCU bug. The following entries caught my eye:

rcu_pree-7       0d..2 5852674733: rcu_grace_period_init: rcu_preempt 18446744073709551553 0 0 15 3
rcu_pree-7 0d..2 5852674946: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 0 8 7
rcu_pree-7 0d..2 5852675138: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 9 15 7
rcu_pree-7 0d..2 5852675313: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 18 15 0
rcu_pree-7 0d..2 5852675631: rcu_grace_period: rcu_preempt 18446744073709551553 cpustart
rcu_pree-7 0d..2 5852675802: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 0 2 7
rcu_pree-7 0d..2 5852675974: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 3 5 7
rcu_pree-7 0d..2 5852676148: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 6 8 7
rcu_pree-7 0d..2 5852676321: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 9 11 7
rcu_pree-7 0d..2 5852676519: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 12 14 7
rcu_pree-7 0d..2 5852676691: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 15 15 1
rcu_pree-7 0d..2 5852676862: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 18 15 0
rcu_pree-7 0d..2 5852677034: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 21 15 0


The smoking gun is the fourth line, the one containing the “18 15”. This indicates that the rcu_node structure corresponding to this trace event covers all CPUs whose numbers (as listed by smp_processor_id()) are greater than or equal to 18 and also less than or equal to 15. There are of course no such CPUs. The last two lines show similar brokenness.

These trace events narrowed the bug down to the following code fragment in kernel/rcutree.c:

#ifdef CONFIG_RCU_FANOUT_EXACT
static void __init rcu_init_levelspread(struct rcu_state *rsp)
{
  int i;

  for (i = rcu_num_lvls - 1; i > 0; i--)
    rsp->levelspread[i] = CONFIG_RCU_FANOUT;
  rsp->levelspread[0] = rcu_fanout_leaf;
}
#else /* #ifdef CONFIG_RCU_FANOUT_EXACT */
static void __init rcu_init_levelspread(struct rcu_state *rsp)
{
  int ccur;
  int cprv;
  int i;

  cprv = nr_cpu_ids;
  for (i = rcu_num_lvls - 1; i >= 0; i--) {
    ccur = rsp->levelcnt[i];
    rsp->levelspread[i] = (cprv + ccur - 1) / ccur;
    cprv = ccur;
  }
}
#endif /* #else #ifdef CONFIG_RCU_FANOUT_EXACT */


Can you spot the problem and the fix?

RCU
Al Viro came up with an example similar to this gem, where A and B are both initially zero:

CPU 0:
  rcu_read_lock();
  B = 1;
  r1 = A;
  rcu_read_unlock();

CPU 1:
  A = 1;
  synchronize_rcu();
  r2 = B;


Al's question is whether the outcome r1 == 0 && r2 == 0 is possible after both code fragments complete and all the dust settles.

To answer this question, we will use the fundamental property of RCU, which guarantees that an RCU read-side critical section cannot span a grace period. In the above example, the RCU read-side critical section is the code between rcu_read_lock() and rcu_read_unlock(), and the RCU grace period is provided by synchronize_rcu(). It turns out that this rather weak guarantee is actually quite powerful, as we will see in this example.

Now we know that if r1 == 0, then CPU 0's load from A must have preceded CPU 1's store to A, and thus also have preceded the grace period provided by synchronize_rcu(). Then RCU guarantees that CPU 0's store to B also preceded that same grace period. Because CPU 1's load from B follows the grace period, it must see the results of CPU 0's store, which means that if r1 == 0 we know that r2 == 1. Therefore, RCU prohibits Al's outcome of r1 == 0 && r2 == 0, as Linus Torvalds surmised.

You can also start by assuming that r2 == 0, but this is left as an exercise for the reader.

Let us instead proceed to a more elaborate example:

CPU 0:
  rcu_read_lock();
  r1 = A;
  r2 = B;
  rcu_read_unlock();

CPU 1:
  r3 = B;
  synchronize_rcu();
  r4 = A;

CPU 2:
  A = 1;

CPU 3:
  B = 1;


Is the result r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0 possible? Why or why not?
The computing field has had its share of holy wars over the decades, with the subject of one famous 1970s/1980s holy war being endianness. Of course, endianness within a system refers to the mapping between numerical values and character arrays, so that for little-endian the numerical value 1 results in the first character of the corresponding array being non-zero, while for big-endian it results in the last character of the corresponding array being non-zero. For more on endianness, see the Wikipedia article.
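
To make the definition concrete, here is a small, illustrative C program that reports a machine's endianness exactly as described above, by checking which byte of the value 1 is non-zero:

#include <stdio.h>
#include <string.h>

int main(void)
{
  unsigned int value = 1;
  unsigned char bytes[sizeof(value)];

  memcpy(bytes, &value, sizeof(value));  /* view the value as a character array */
  if (bytes[0] == 1)
    printf("little-endian\n");           /* first byte holds the 1 */
  else
    printf("big-endian\n");              /* last byte holds the 1 */
  return 0;
}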

Although many of the “cool kids” of the 1980s (such as Apple and Sun) were big-endian, the vast majority of today's Linux-capable systems are little-endian. This of course includes x86, but it also includes ARM, whose partners shipped enough ARM CPUs to give one to each man, woman, and child on this planet—and that only counts the ARM CPUs shipped in 2012.

Although most modern software is endian-neutral, device drivers and related software do sometimes make assumptions about endianness. In addition, someone considering creating special-purpose hardware would likely do a little-endian implementation first and foremost, and a big-endian implementation later, if at all. Furthermore, because single-threaded CPU throughput is not rising anywhere near as quickly as it was back in the 1980s and 1990s, special-purpose hardware is becoming increasingly important.

So what is a big-endian architecture like Power supposed to do?

The answer turns out to be “Both!!!”

Power hardware has long supported both big-endian and little-endian byte ordering, and the toolchain has had prototype little-endian support for some years. Furthermore, if you have been paying close attention, you will have noticed that the toolchain's little-endian support has received significant care and feeding over the past few months. Expect to see patches for the Linux kernel, QEMU, and other packages soon, from IBM and from others. IBM's partners in this effort include Google, Mellanox, NVIDIA, and Tyan.

Should be exciting! ;-)
Last year, I noted that hardware transactional memory (HTM) announcements lacked forward-progress guarantees. As noted in that posting:

Until HTM implementations provide some sort of forward-progress guarantee, HTM will be limited by its fallback code sequences. For example, if the fallback code uses locking, then HTM will be just as vulnerable to deadlock as is the fallback code.


And during the past year, the IBM Mainframe announced an HTM implementation that includes constrained transactions in addition to the usual best-effort HTM implementation. A constrained transaction starts with the tbeginc instruction instead of the tbegin instruction that is used for best-effort transactions. Constrained transactions are guaranteed to always complete (eventually), so if a transaction aborts, rather than branching to a fallback path (as is done for best-effort transactions), the hardware instead restarts the transaction at the tbeginc instruction.

The Mainframe architects needed to take extreme measures to deliver on this forward-progress guarantee. If a given constrained transaction repeatedly fails, the CPU might disable branch prediction, force in-order execution, and even completely disable pipelining. If the repeated failures are due to high contention, the CPU might disable speculative fetches, introduce random delays, and even serialize execution of the conflicting CPUs. “Interesting” forward-progress scenarios involve as few as two CPUs or as many as one hundred CPUs. Perhaps these extreme measures provide some insight as to why other CPUs have thus far refrained from offering constrained transactions.

As the name implies, constrained transactions are in fact severely constrained:


  1. The maximum data footprint is four blocks of memory, where each block can be no larger than 32 bytes.
  2. The maximum code footprint is 256 bytes.
  3. If a given 4K page contains a constrained transaction’s code, then that page may not contain that transaction’s data.
  4. The maximum number of assembly instructions that may be executed is 32.
  5. Backwards branches are forbidden.

Nevertheless, transactions fitting within these constraints can support a number of important data structures, including linked lists, stacks, queues, and arrays. Perhaps HTM with constrained transactions will eventually become an important tool in the parallel programmer's toolbox.
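
As an illustration of just how small such transactions can be, here is a sketch of a stack push implemented as a constrained transaction. It assumes GCC's s390 HTM builtins (__builtin_tbeginc() and __builtin_tend()), and it is meant only to show the shape of the code: because the transaction is constrained, there is no abort handler and no fallback path.

struct node {
  struct node *next;
};

static struct node *top;  /* top of a global stack */

/* Push n onto the stack using a constrained transaction (sketch only). */
static void push(struct node *n)
{
  __builtin_tbeginc();    /* constrained transaction: hardware retries as needed */
  n->next = top;
  top = n;
  __builtin_tend();       /* commit */
}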