Some time ago, I figured out that there are more than a billion instances of the Linux kernel in use, and this in turn led to the realization that a million-year RCU bug is happening about three times a day across the installed base. This realization has caused me to focus more heavily on RCU validation, which has uncovered a number of interesting bugs. I have also dabbled a bit in formal verification, which has not yet found a bug. However, formal verification might be getting there, and might some day be a useful addition to RCU's regression testing. I was therefore quite happy to be invited to this Dagstuhl Seminar. In what follows, I summarize a few of the presentations. See here for the rest of the presentations.

Viktor Vafeiadis presented his analysis of the C11 memory model, including some “interesting” consequences of data races, where a data race is defined as a situation involving multiple concurrent accesses to a non-atomic variable, at least one of which is a write. One such consequence involves a theoretically desirable “strengthening” property. For example, this property would mean that multiplexing two threads onto a single underlying thread would not introduce new behaviors. However, with C11, the undefined-behavior consequences of data races can actually cause new behaviors to appear with fewer threads, for example, see Slide 7. This suggests the option of doing away with the undefined behavior, which is exactly the option that LLVM has taken. However, this approach requires some care, as can be seen on Slide 19. Nevertheless, this approach seems promising. One important takeaway from this talk is that if you are worried about weak ordering, you need to pay careful attention to reining in the compiler's optimizations. If you are unconvinced, take a look at this! Jean Pichon-Pharabod, Kyndylan Nienhuis, and Mike Dodds presented on other aspects of the C11 memory model.
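
To make that definition concrete, here is a minimal illustration of a C11 data race. This example is mine rather than taken from the talk: two threads access the non-atomic variable flag concurrently, and one of those accesses is a write.

#include <pthread.h>
#include <stdio.h>

static int flag;  /* non-atomic: unsynchronized concurrent access is a data race */

static void *writer(void *arg)
{
  flag = 1;  /* plain write */
  return NULL;
}

static void *reader(void *arg)
{
  printf("flag = %d\n", flag);  /* plain read racing with the write above */
  return NULL;
}

int main(void)
{
  pthread_t tw, tr;

  if (pthread_create(&tw, NULL, writer, NULL) ||
      pthread_create(&tr, NULL, reader, NULL))
    return 1;
  pthread_join(tw, NULL);
  pthread_join(tr, NULL);
  return 0;
}

Because the race makes the behavior undefined, a C11 compiler is entitled to optimize on the assumption that it never happens, which underlies the counterintuitive outcomes described above.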

Martin T. Vechev apparently felt that the C11 memory model was too tame, and therefore focused on event-driven applications, specifically JavaScript running on Android. This presentation included some entertaining concurrency bugs and their effects on the browser's display. Martin also discussed formalizing JavaScript's memory model.

Hongjin Liang showed that ticket locks can provide starvation freedom given a minimally fair scheduler. This provides a proof point for Björn B. Brandenburg's dissertation, which analyzed the larger question of real-time response from lock-based code. It should also provide a helpful corrective to people who still believe that non-blocking synchronization is required.

Joseph Tassarotti presented a formal proof of the quiescent-state based reclamation (QSBR) variant of userspace RCU. In contrast to previous proofs, this proof did not rely on sequential consistency, but instead leveraged a release-acquire memory model. It is of course good to see researchers focusing their tools on RCU! That said, when a researcher asked me privately whether I felt that the proof incorporated realistic assumptions, I of course could not resist saying that since they didn't find any bugs, the assumptions clearly must have been unrealistic.

My first presentation covered what would be needed for me to be able to use formal verification as part of Linux-kernel RCU's regression testing. As shown on slide 34, these are:


  1. Either automatic translation or no translation required. After all, if I attempt to manually translate Linux-kernel RCU to some special-purpose language every release, human error will make its presence known.
  2. Correctly handle environment, including the memory model, which in turn includes compiler optimizations.
  3. Reasonable CPU and memory overhead. If these overheads are excessive, RCU is better served by simple stress testing.
  4. Map to source code lines containing the bug. After all, I already know that there are bugs—I need to know where they are.
  5. Modest input outside of source code under test. The sad fact is that a full specification of RCU would be at least as large as the implementation, and also at least as buggy.
  6. Find relevant bugs. To see why this is important, imagine that some tool finds 100 different million-year bugs and I fix them all. Because roughly one of six fixes introduces a bug, and because that bug is likely to reproduce in far less than a million years, this process has likely greatly reduced the robustness of the Linux kernel.


I was not surprised to get some “frank and honest” feedback, but I was quite surprised (but not at all displeased) to learn that some of the feedback was of the form “we want to see more C code.” After some discussion, I provided just that.

Verification Challenge 5: Uses of RCU

This is another self-directed verification challenge, this time to validate uses of RCU instead of validating the RCU implementations as in earlier posts. As you can see from Verification Challenge 4, the logic expression corresponding even to the simplest Linux-kernel RCU implementation is quite large, weighing in at tens of thousands of variables and hundreds of thousands of clauses. It is therefore worthwhile to look into the possibility of a trivial model of RCU that could be used for verification.

Because logic expressions do not care about cache locality, memory contention, energy efficiency, CPU hotplug, and a host of other complications that a Linux-kernel implementation must deal with, we can start with extreme simplicity. For example:

static int rcu_read_nesting_global;

static void rcu_read_lock(void)
{
  (void)__sync_fetch_and_add(&rcu_read_nesting_global, 2);
}

static void rcu_read_unlock(void)
{
  (void)__sync_fetch_and_add(&rcu_read_nesting_global, -2);
}

static inline void assert_no_rcu_read_lock(void)
{
  BUG_ON(rcu_read_nesting_global >= 2);
}

static void synchronize_rcu(void)
{
  if (__sync_fetch_and_xor(&rcu_read_nesting_global, 1) < 2)
    return;
  SET_NOASSERT();
  return;
}


The idea is to reject any execution in which synchronize_rcu() does not wait for all readers to be done. As before, SET_NOASSERT() sets a variable that suppresses all future assertions.
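
The scaffolding behind SET_NOASSERT() and BUG_ON() is not shown above. A minimal version consistent with this model might look as follows; the names and details here are my assumptions rather than necessarily those of the actual source.

#include <assert.h>

static int noassert;  /* once set, all subsequent BUG_ON() checks are suppressed */

#define SET_NOASSERT() do { noassert = 1; } while (0)

/* Complain only if assertions have not been suppressed. */
#define BUG_ON(condition) assert(noassert || !(condition))

With definitions along these lines, an execution in which synchronize_rcu() encounters a reader simply switches off checking, so that counterexamples can come only from executions in which the grace period completed with no readers present.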

Please note that this model of RCU has some shortcomings:


  1. There is no diagnosis of rcu_read_lock()/rcu_read_unlock() misnesting. (A later version of the model provides limited diagnosis, but under #ifdef CBMC_PROVE_RCU.)
  2. The heavyweight operations in rcu_read_lock() and rcu_read_unlock() result in artificial ordering constraints. Even in TSO systems such as x86 or s390, a store in a prior RCU read-side critical section might be reordered with loads in later critical sections, but this model will act as if such reordering were prohibited (see the sketch following this list).
  3. Although synchronize_rcu() is permitted to complete once all pre-existing readers are done, in this model it will instead wait until a point in time at which there are absolutely no readers, whether pre-existing or new. Therefore, this model's idea of an RCU grace period is even heavier weight than in real life.
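
To illustrate the second shortcoming, consider the following sketch of my own, which builds on the model above. On a real TSO machine, the store to a could be delayed in the store buffer until after the load from b, but the full barriers implied by the model's __sync_fetch_and_add() calls rule out that reordering.

int a, b;
int r1;

void reader_thread(void)
{
  rcu_read_lock();
  a = 1;              /* store in the first critical section */
  rcu_read_unlock();  /* full barrier in the model... */
  rcu_read_lock();    /* ...and another one here... */
  r1 = b;             /* ...so the model never lets this load pass the store above */
  rcu_read_unlock();
}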


Nevertheless, this approach will allow us to find at least some RCU-usage bugs, and it fits in well with cbmc's default fully-ordered settings. For example, we can use it to verify a variant of the simple litmus test used previously:

int r_x;
int r_y;

int x;
int y;

void *thread_reader(void *arg)
{
  rcu_read_lock();
  r_x = x;
#ifdef FORCE_FAILURE_READER
  rcu_read_unlock();
  rcu_read_lock();
#endif
  r_y = y;
  rcu_read_unlock();
  return NULL;
}

void *thread_update(void *arg)
{
  x = 1;
#ifndef FORCE_FAILURE_GP
  synchronize_rcu();
#endif
  y = 1;
  return NULL;
}

int main(int argc, char *argv[])
{
  pthread_t tr;

  if (pthread_create(&tr, NULL, thread_reader, NULL))
    abort();
  (void)thread_update(NULL);
  if (pthread_join(tr, NULL))
    abort();

  BUG_ON(r_y != 0 && r_x != 1);
  return 0;
}


This model has only 3,032 variables and 8,844 clauses, more than an order of magnitude smaller than for the Tiny RCU verification. Verification takes about half a second, which is almost two orders of magnitude faster than the 30-second verification time for Tiny RCU. In addition, the model successfully flags several injected errors. We have therefore succeeded in producing a simpler and faster approximation of RCU, one that can also handle multi-threaded litmus tests.
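
Assuming the model and litmus test are combined into a single file (I will call it rculitmus.c, a name of my choosing rather than necessarily that of the actual source), the passing and failing cases might be checked with commands along the following lines:

cbmc -I . rculitmus.c                         # expect VERIFICATION SUCCESSFUL
cbmc -I . -DFORCE_FAILURE_READER rculitmus.c  # broken-up reader: expect a counterexample
cbmc -I . -DFORCE_FAILURE_GP rculitmus.c      # missing grace period: expect a counterexample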

A natural next step would be to move to litmus tests involving linked lists. Unfortunately, there appear to be problems with cbmc's handling of pointers in multithreaded situations. On the other hand, cbmc's multithreaded support is quite new, so hopefully there will be fixes for these problems in the near future. After fixes appear, I will give the linked-list litmus tests another try.

In the meantime, the full source code for these models may be found here.

So the Linux kernel now has a Documentation/CodeOfConflict file. As one of the people who provided an Acked-by for this file, I thought I should set down what went through my mind while reading it. Taking it one piece at a time:

The Linux kernel development effort is a very personal process compared to “traditional” ways of developing software. Your code and ideas behind it will be carefully reviewed, often resulting in critique and criticism. The review will almost always require improvements to the code before it can be included in the kernel. Know that this happens because everyone involved wants to see the best possible solution for the overall success of Linux. This development process has been proven to create the most robust operating system kernel ever, and we do not want to do anything to cause the quality of submission and eventual result to ever decrease.

In a perfect world, this would go without saying, give or take the “most robust” chest-beating. But I am probably not the only person to have noticed that the world is not always perfect. Sadly, it is probably necessary to remind some people that “job one” for the Linux kernel community is the health and well-being of the Linux kernel itself, and not their own pet project, whatever that might be.

On the other hand, I was also heartened by what does not appear in the above paragraph. There is no assertion that the Linux kernel community's processes are perfect, which is all to the good, because delusions of perfection all too often prevent progress in mature projects. In fact, in this imperfect world, there is nothing so good that it cannot be made better. On the other hand, there also is nothing so bad that it cannot be made worse, so random wholesale changes should be tested somewhere before being applied globally to a project as important as the Linux kernel. I was therefore quite happy to read the last part of this paragraph: “we do not want to do anything to cause the quality of submission and eventual result to ever decrease.”

If however, anyone feels personally abused, threatened, or otherwise uncomfortable due to this process, that is not acceptable.

That sentence is of course critically important, but must be interpreted carefully. For example, it is all too possible that someone might feel abused, threatened, and uncomfortable by the mere fact of a patch being rejected, even if that rejection was both civil and absolutely necessary for the continued robust operation of the Linux kernel. Or someone might claim to feel that way, if they felt that doing so would get their patch accepted. (If this sounds impossible to you, be thankful, but also please understand that the range of human behavior is extremely wide.) In addition, I certainly feel uncomfortable when someone points out a stupid mistake in one of my patches, but that discomfort is my problem, and furthermore encourages me to improve, which is a good thing. For but one example, this discomfort is exactly what motivated me to write the rcutorture test suite. Therefore, although I hope that we all know what is intended by the words “abused”, “threatened”, and “uncomfortable” in that sentence, the fact is that it will never be possible to fully codify the difference between constructive and destructive behavior.

Therefore, the resolution process is quite important:

If so, please contact the Linux Foundation's Technical Advisory Board at <tab@lists.linux-foundation.org>, or the individual members, and they will work to resolve the issue to the best of their ability. For more information on who is on the Technical Advisory Board and what their role is, please see:

http://www.linuxfoundation.org/programs/advisory-councils/tab

There can be no perfect resolution process, but this one seems to be squarely in the “good enough” category. The timeframes are long enough that people will not be rewarded by complaining to the LF TAB instead of fixing their patches. The composition of the LF TAB, although not perfect, is diverse, consisting of both men and women from multiple countries. The LF TAB appears to be able to manage the inevitable differences of opinion, based on the fact that not all members provided their Acked-by for this Code of Conflict. And finally, the LF TAB is an elected body that has oversight via the LF, so there are feedback mechanisms. Again, this is not perfect, but it is good enough that I am willing to overlook my concerns about the first sentence in the paragraph.

On to the final paragraph:

As a reviewer of code, please strive to keep things civil and focused on the technical issues involved. We are all humans, and frustrations can be high on both sides of the process. Try to keep in mind the immortal words of Bill and Ted, “Be excellent to each other.”

And once again, in a perfect world it would not be necessary to say this. Sadly, we are human beings rather than angels, and so it does appear to be necessary. Then again, if we were all angels, this would be a very boring world.

Or at least that is what I keep telling myself!

Verification Challenge 4: Tiny RCU

The first and second verification challenges were directed to people working on verification tools, and the third challenge was directed at developers. Perhaps you are thinking that it is high time that I stop picking on others and instead direct a challenge at myself. If so, this is the challenge you were looking for!

The challenge is to take the v3.19 Linux kernel code implementing Tiny RCU, unmodified, and use some formal-verification tool to prove that its grace periods are correctly implemented.

This requires a tool that can handle multiple threads. Yes, Tiny RCU runs only on a single CPU, but the proof will require at least two threads. The basic idea is to have one thread update a variable, wait for a grace period, then update a second variable, while another thread accesses both variables within an RCU read-side critical section, and a third parent thread verifies that this critical section did not span a grace period, like this:

int x;
int y;
int r1;
int r2;

void rcu_reader(void)
{
  rcu_read_lock();
  r1 = x;
  r2 = y;
  rcu_read_unlock();
}

void *thread_update(void *arg)
{
  x = 1;
  synchronize_rcu();
  y = 1;
}

. . .

assert(r2 == 0 || r1 == 1);


Of course, rcu_reader()'s RCU read-side critical section is not allowed to span thread_update()'s grace period, which is provided by synchronize_rcu(). Therefore, rcu_reader() must execute entirely before the end of the grace period (in which case r2 must be zero, keeping in mind C's default initialization to zero), or it must execute entirely after the beginning of the grace period (in which case r1 must be one).

There are a few technical problems to solve:


  1. The Tiny RCU code #includes numerous “interesting” files. I supplied empty files as needed and used “-I .” to focus the C preprocessor's attention on the current directory.
  2. Tiny RCU uses a number of equally interesting Linux-kernel primitives. I stubbed most of these out in fake.h, but copied a number of definitions from the Linux kernel, including IS_ENABLED, barrier(), and bool.
  3. Tiny RCU runs on a single CPU, so the two threads shown above must act as if this was the case. I used pthread_mutex_lock() to provide the needed mutual exclusion, keeping in mind that Tiny RCU is available only with CONFIG_PREEMPT=n. The thread that holds the lock is running on the sole CPU.
  4. The synchronize_rcu() function can block. I modeled this by having it drop the lock and then re-acquire it (see the sketch following this list).
  5. The dyntick-idle subsystem assumes that the boot CPU is born non-idle, but in this case the system starts out idle. After a surprisingly long period of confusion, I handled this by having main() invoke rcu_idle_enter() before spawning the two threads. The confusion eventually proved beneficial, but more on that later.
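
Items 3 and 4 can be made concrete with something like the following sketch. This is my reconstruction rather than the actual scaffolding, and the function names are invented.

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t cpu_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Whichever thread holds cpu_mutex is "running" on Tiny RCU's sole CPU. */
static void fake_acquire_cpu(void)
{
  if (pthread_mutex_lock(&cpu_mutex))
    abort();
}

static void fake_release_cpu(void)
{
  if (pthread_mutex_unlock(&cpu_mutex))
    abort();
}

/* Model synchronize_rcu()'s blocking by letting the other thread run for a while. */
static void block_as_if_sleeping(void)
{
  fake_release_cpu();
  fake_acquire_cpu();
}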


The first step is to get the code to build and run normally. You can omit this step if you want, but given that compilers usually generate better diagnostics than do the formal-verification tools, it is best to make full use of the compilers.

I first tried goto-cc, goto-instrument, and satabs [Slide 44 of PDF] and impara [Slide 52 of PDF], but both tools objected strenuously to my code. My copies of these two tools are a bit dated, so it is possible that these problems have since been fixed. However, I decided to download version 5 of cbmc, which is said to have gained multithreading support.

After converting my code to a logic expression with no fewer than 109,811 variables and 457,344 clauses, cbmc -I . -DRUN fake.c took a bit more than ten seconds to announce VERIFICATION SUCCESSFUL. But should I trust it? After all, I might have a bug in my scaffolding or there might be a bug in cbmc.

The usual way to check for this is to inject a bug and see if cbmc catches it. I chose to break up the RCU read-side critical section as follows:

void rcu_reader(void)
{
  rcu_read_lock();
  r1 = x;
  rcu_read_unlock();
  cond_resched();
  rcu_read_lock();
  r2 = y;
  rcu_read_unlock();
}


Why not remove thread_update()'s call to synchronize_rcu()? Take a look at Tiny RCU's implementation of synchronize_rcu() to see why not!

With this change enabled via #ifdef statements, “cbmc -I . -DRUN -DFORCE_FAILURE fake.c” took almost 20 seconds to find a counter-example in a logic expression with 185,627 variables and 815,691 clauses. Needless to say, I am glad that I didn't have to manipulate this logic expression by hand!

Because cbmc catches an injected bug and verifies the original code, we have some reason to hope that the VERIFICATION SUCCESSFUL was in fact legitimate. As far as I know, this is the first mechanical proof of the grace-period property of a Linux-kernel RCU implementation, though admittedly of a rather trivial implementation. On the other hand, a mechanical proof of some properties of the dyntick-idle counters came along for the ride, courtesy of the WARN_ON_ONCE() statements in the Linux-kernel source code. (Previously, researchers at Oxford mechanically validated the relationship between rcu_dereference() and rcu_assign_pointer(), taking the whole of Tree RCU as input, and researchers at MPI-SWS formally validated userspace RCU's grace-period guarantee—manually.)

As noted earlier, I had confused myself into thinking that cbmc did not handle pthread_mutex_lock(). I verified that cbmc handles the gcc atomic builtins, but it turns out to be impractical to build a lock for cbmc's use from atomics. The problem stems from the “b” for “bounded” in “cbmc”, which means cbmc cannot analyze the unbounded spin loops used in locking primitives.

However, cbmc does do the equivalent of a full state-space search, which means it will automatically model all possible combinations of lock-acquisition delays even in the absence of a spin loop. This suggests something like the following:

if (__sync_fetch_and_add(&cpu_lock, 1))
  exit(0);


The idea is to exclude from consideration any executions where the lock cannot be immediately acquired, again relying on the fact that cbmc automatically models all possible combinations of delays that the spin loop might have otherwise produced, but without the need for an actual spin loop. This actually works, but my mis-modeling of dynticks fooled me into thinking that it did not. I therefore made lock-acquisition failure set a global variable and added this global variable to all assertions. When this failed, I had sufficient motivation to think, which caused me to find my dynticks mistake. Fixing this mistake fixed all three versions (locking, exit(), and flag).
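
For reference, here is a sketch of the flag approach, an atomic-and-flag substitute for the pthread_mutex_lock() version, again with names of my own invention rather than those of the actual fake.h:

static int cpu_lock;        /* zero when the modeled CPU is available */
static int skip_execution;  /* set when this execution failed to "acquire the CPU" */

static void fake_acquire_cpu(void)
{
  if (__sync_fetch_and_add(&cpu_lock, 1))
    skip_execution = 1;  /* exclude this execution instead of calling exit() */
}

Each assertion then takes the form assert(skip_execution || condition), so that excluded executions cannot produce counterexamples.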

The exit() and flag approaches result in exactly the same number of variables and clauses, which turns out to be quite a bit fewer than the locking approach:

                             exit()/flag                                  locking
Verification                 69,050 variables, 287,548 clauses (output)   109,811 variables, 457,344 clauses (output)
Verification Forced Failure  113,947 variables, 501,366 clauses (output)  185,627 variables, 815,691 clauses (output)


So locking increases the size of the logic expressions by quite a bit, but interestingly enough does not have much effect on verification time. Nevertheless, these three approaches show a few of the tricks that can be used to accomplish real work using formal verification.

The GPL-licensed source for the Tiny RCU validation may be found here. C-preprocessor macros select the various options, with -DRUN being necessary for both real runs and cbmc verification (as opposed to goto-cc or impara verification), -DCBMC forcing the atomic-and-flag substitute for locking, and -DFORCE_FAILURE forcing the failure case. For example, to run the failure case using the atomic-and-flag approach, use:

cbmc -I . -DRUN -DCBMC -DFORCE_FAILURE fake.c


Possible next steps include verifying dynticks and interrupts, dynticks and NMIs, and of course use of call_rcu() in place of synchronize_rcu(). If you try these out, please let me know how it goes!

Verification Challenge 3: cbmc

The first and second verification challenges were directed to people working on verification tools, but this one is instead directed at developers.

It turns out that there are a number of verification tools that have seen heavy use. For example, I have written several times about Promela and spin (here, here, and here), which I have used from time to time over the past 20 years. However, this tool requires that you translate your code to Promela, which is not conducive to use of Promela for regression tests.

For those of us working in the Linux kernel, it would be nice to have a verification tool that operated directly on C source code. And there are tools that do just that, for example, the C Bounded Model Checker (cbmc). This tool, which is included in a number of Linux distributions, converts a C-language input file into a (possibly quite large) logic expression. This expression is constructed so that if any assignment of values to its variables causes the logic expression to evaluate to true, then (and only then) one of the assertions can be triggered. This logic expression is then passed to a SAT solver, and if this SAT solver finds a solution, then there is a set of inputs that can trigger the assertion. The cbmc tool is also capable of checking for array-bounds errors and some classes of pointer misuse.
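
For readers who have not used cbmc before, here is a trivial example of the workflow, with a property and file contents of my own invention. A function that is declared but never defined, such as nondet_int() below, is treated by cbmc as returning an arbitrary value, so the assertion is checked against every possible int.

#include <assert.h>

int nondet_int(void);  /* no body, so cbmc treats the return value as arbitrary */

int main(void)
{
  int x = nondet_int();

  if (x > 10)
    x = 10;
  assert(x <= 10);  /* holds for every input, so expect VERIFICATION SUCCESSFUL */
  return 0;
}

Running cbmc on this file should report VERIFICATION SUCCESSFUL, while changing the assertion to x < 10 should instead produce a counterexample trace in which x ends up equal to 10.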

Current versions of cbmc can handle some useful tasks. For example, suppose it was necessary to reverse the sense of the if condition in the following code fragment from Linux-kernel RCU:

  if (rnp->exp_tasks != NULL ||
      (rnp->gp_tasks != NULL &&
       rnp->boost_tasks == NULL &&
       rnp->qsmask == 0 &&
       ULONG_CMP_GE(jiffies, rnp->boost_time))) {
    if (rnp->exp_tasks == NULL)
      rnp->boost_tasks = rnp->gp_tasks;
    /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
    t = rnp->boost_kthread_task;
    if (t)
      rcu_wake_cond(t, rnp->boost_kthread_status);
  } else {
    rcu_initiate_boost_trace(rnp);
    /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
  }


This is a simple application of De Morgan's law, but an error-prone one, particularly if carried out in a distracting environment. Of course, to test a validation tool, it is best to feed it buggy code to see if it detects those known bugs. And applying De Morgan's law in a distracting environment is an excellent way to create bugs, as you can see below:

  if (rnp->exp_tasks == NULL &&
      (rnp->gp_tasks == NULL ||
       rnp->boost_tasks != NULL ||
       rnp->qsmask != 0 &&
       ULONG_CMP_LT(jiffies, rnp->boost_time))) {
    rcu_initiate_boost_trace(rnp);
    /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
  } else {
    if (rnp->exp_tasks == NULL)
      rnp->boost_tasks = rnp->gp_tasks;
    /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
    t = rnp->boost_kthread_task;
    if (t)
      rcu_wake_cond(t, rnp->boost_kthread_status);
  }


Of course, a full exhaustive test is infeasible, but structured testing would result in a manageable number of test cases. However, we can use cbmc to do the equivalent of a full exhaustive test, despite the fact that the number of combinations is on the order of two raised to the power 1,000. The approach is to create task_struct and rcu_node structures that contain only those fields that are used by this code fragment, but that also contain flags that indicate which functions were called and what their arguments were. This allows us to wrapper both the old and the new versions of the code fragment in their respective functions, and call them in sequence on different instances of identically initialized task_struct and rcu_node structures. These two calls are followed by an assertion that checks that the return value and the corresponding fields of the structures are identical.

This approach results in checkiftrans-1.c (raw C code here). Lines 5-8 show the abbreviated task_struct structure and lines 13-22 show the abbreviated rcu_node structure. Lines 10, 11, 24, and 25 show the instances. Lines 27-31 record a call to rcu_wake_cond() and lines 33-36 record a call to rcu_initiate_boost_trace().

Lines 38-49 initialize a task_struct/rcu_node structure pair. The rather unconventional use of the argv[] array works because cbmc assumes that this array contains random numbers. The old if statement is wrappered by do_old_if() on lines 51-71, while the new if statement is wrappered by do_new_if() on lines 73-93. The assertion is in check() on lines 95-107, and finally the main program is on lines 109-118.

Running cbmc checkiftrans-1.c gives this output, which prominently features VERIFICATION FAILED at the end of the file. On lines 4, 5, 12 and 13 of the file are complaints that neither ULONG_CMP_GE() nor ULONG_CMP_LT() are defined. Lacking definitions for these two functions, cbmc seems to treat them as random-number generators, which could of course cause the two versions of the if statement to yield different results. This is easily fixed by adding the required definitions:

#define ULONG_MAX         (~0UL)
#define ULONG_CMP_GE(a, b)  (ULONG_MAX / 2 >= (a) - (b))
#define ULONG_CMP_LT(a, b)  (ULONG_MAX / 2 < (a) - (b))


This results in checkiftrans-2.c (raw C code here). However, running cbmc checkiftrans-2.c gives this output, which still prominently features VERIFICATION FAILED at the end of the file. At least there are no longer any complaints about undefined functions!

It turns out that cbmc provides a counterexample in the form of a traceback. This traceback clearly shows that the two instances executed different code paths, and a closer examination of the two representations of the if statement shows that I forgot to convert one of the && operators to a ||—that is, the “rnp->qsmask != 0 &&” on line 84 should instead be “rnp->qsmask != 0 ||”. Making this change results in checkiftrans-3.c (raw C code here). The inverted if statement is now as follows:

  if (rnp->exp_tasks == NULL &&
      (rnp->gp_tasks == NULL ||
       rnp->boost_tasks != NULL ||
       rnp->qsmask != 0 ||
       ULONG_CMP_LT(jiffies, rnp->boost_time))) {
    rcu_initiate_boost_trace(rnp);
    /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
  } else {
    if (rnp->exp_tasks == NULL)
      rnp->boost_tasks = rnp->gp_tasks;
    /* raw_spin_unlock_irqrestore(&rnp->lock, flags); */
    t = rnp->boost_kthread_task;
    if (t)
      rcu_wake_cond(t, rnp->boost_kthread_status);
  }


This time, running cbmc checkiftrans-3.c produces this output, which prominently features VERIFICATION SUCCESSFUL at the end of the file. Furthermore, this verification consumed only about 100 milliseconds on my aging laptop. And, even better, because it refused to verify the buggy version, we have at least some reason to believe it!

Of course, one can argue that doing such work carefully and in a quiet environment would eliminate the need for such verification, and 30 years ago I might have emphatically agreed with this argument. I have since learned that ideal work environments are not always as feasible as we might like to think, especially if there are small children (to say nothing of adult-sized children) in the vicinity. Besides which, human beings do make mistakes, even when working in ideal circumstances, and if we are to have reliable software, we need some way of catching these mistakes.

The canonical pattern for using cbmc in this way is as follows:

retref = funcref(...);
retnew = funcnew(...);
assert(retref == retnew && ...);


The ... sequences represent any needed arguments to the calls and any needed comparisons of side effects within the assertion.

Of course, there are limitations:


  1. The “b” in cbmc stands for “bounded.” In particular, cbmc handles neither infinite loops nor infinite recursion. The --unwind and --depth arguments to cbmc allow you to control how much looping and recursion is analyzed. See the manual for more information, and see the example following this list.
  2. The SAT solvers used by cbmc have improved greatly over the past 25 years. In fact, where a 100-variable problem was at the edge of what could be handled in the 1990s, most circa-2015 solvers can handle more than a million variables. However, the NP-complete nature of SAT does occasionally make its presence known; for example, programs that reduce to a proof involving the pigeonhole principle are not handled well as of early 2015.
  3. Handling of concurrency is available in later versions of cbmc, but is not as mature as is the handling of single-threaded code.
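
As a concrete example of the first limitation (with file name and bound chosen arbitrarily), a program containing loops might be checked with a fixed unwinding depth, optionally asking cbmc to prove that the bound was sufficient:

cbmc --unwind 10 --unwinding-assertions loopy.c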


All that aside, everything has its limitations, and cbmc's ease of use is quite impressive. I expect to continue to use it from time to time, and strongly recommend that you give it a try!

Although junk mail, puppies, and patches often are unwelcome, there are exceptions. For example, if someone has been wanting a particular breed of dog for some time, that person might be willing to accept a puppy, even if that means giving it shots, housebreaking it, teaching it the difference between furniture and food, doing bottlefeeding, watching over it day and night, and even putting up with some sleepless nights.

Similarly, if a patch fixes a difficult and elusive bug, the maintainer might be willing to apply the patch by hand, fix build errors and warnings, fix a few bugs in the patch itself, run a full set of tests, fix any style problems, and even accept the risk that the bug might have unexpected side effects, some of which might result in some sleepless nights. This in fact is one of the reasons for the common advice given to open-source newbies: start by fixing bugs.

Other good advice for new contributors can be found here:


  1. Greg Kroah-Hartman's HOWTO do Linux kernel development – take 2 (2005)
  2. Jonathan Corbet's How to Participate in the Linux Community (2008)
  3. Greg Kroah-Hartman's Write and Submit your first Linux kernel Patch (2010)
  4. My How to make a positive difference in a FOSS project (2012)
  5. Daniel Lezcano's What do we mean by working upstream: A long-term contributor’s view


This list is mostly about contributing to the Linux kernel, but most other projects have similar pages giving good new-contributor advice.

True confession: I was once a serial junk mailer. Not mere email spam, but physical bulk-postage-rate flyers, advertising a series of non-technical conferences. It was of course for a good cause, and one of the most difficult things about that task was convincing that cause's leaders that this flyer was in fact junk mail. They firmly believed that anyone with even a scrap of compassion would of course read the flyer from beginning to end, feeling the full emotional impact of each and every lovingly crafted word. They reluctantly came around to my view, which was that we had at most 300 milliseconds to catch the recipient's attention, that being the amount of time that the recipient might glance at the flyer on its way into the trash. Or at least I think that they came around to my view. All I really know is that they stopped disputing the point.

But junk mail for worthy causes is not the only thing that can be less welcome than its sender might like.

For example, Jim Wasko noticed a sign at a daycare center that read: “If you are late picking up your child and have not called us in advance, we will give him/her an espresso and a puppy. Have a great day.”

Which goes to show that although puppies are cute and lovable, and although their mother no doubt went to a lot of trouble to bring them into this world, they are, just like junk mail, not universally welcome. And this should not be too surprising, given the questions that come to mind when contemplating a free puppy. Has it had its shots? Is it housebroken? Has it learned that furniture is not food? Has it been spayed/neutered? Is it able to eat normal dogfood, or does it still require bottlefeeding? Is it willing to entertain itself for long periods? And, last, but most definitely not least, is it willing to let you sleep through the night?

Nevertheless, people are often surprised and bitterly disappointed when their offers of free puppies are rejected.

Other people are just as surprised and disappointed when their offers of free patches are rejected. After all, they put a lot of work into their patches, and they might even get into trouble if the patch isn't eventually accepted.

But it turns out that patches are a lot like junk mail and puppies. They are greatly valued by those who produce them, but often viewed with great suspicion by the maintainers receiving them. You see, the thought of accepting a free patch also raises questions. Does the patch apply cleanly? Does it build without errors and warnings? Does it run at all? Does it pass regression tests? Has it been tested with the commonly used combination of configuration parameters? Does the patch have good code style? Is the patch maintainable? Does the patch provide a straightforward and robust solution to whatever problem it is trying to solve? In short, will this patch allow the maintainer to sleep through the night?

I am extremely fortunate in that most of the RCU patches that I receive are “good puppies.” However, not everyone is so lucky, and I occasionally hear from patch submitters whose patches were not well received. They often have a long list of reasons why their patches should have been accepted, including:

  1. I put a lot of work into that patch, so it should have been accepted! Unfortunately, hard work on your part does not guarantee a perception of value on the maintainer's part.
  2. The maintainer's job is to accept patches. Maybe not; your maintainer might well be an unpaid volunteer.
  3. But my maintainer is paid to maintain! True, but your maintainer is probably not being paid to do your job.
  4. I am not asking the maintainer to do my job, but rather his or her own job, which is to accept patches! The maintainer's job is not to accept any and all patches, but instead to accept good patches that further the project's mission.
  5. I really don't like your attitude! I put a lot of work into making this be a very good patch! It should have been accepted! Really? Did you make sure it applied cleanly? Did you follow the project's coding conventions? Did you make sure that it passed regression tests? Did you test it on the full set of platforms supported by the project? Does it avoid problems discussed on the project's mailing list? Did you promptly update your patch based on any feedback you might have received? Is your code maintainable? Is your code aligned with the project's development directions? Do you have a good reputation with the community? Do you have a track record of supporting your submissions? In other words, will your patch allow the maintainer to sleep through the night?
  6. But I don't have time to do all that! Then the maintainer doesn't have time to accept your patch. And most especially doesn't have time to deal with all the problems that your patch is likely to cause.

As a recovering proprietary programmer, I can assure you that things work a bit differently in the open-source world, so some adjustment is required. But participation in an open-source project can be very rewarding and worthwhile!

Parallel Programming: January 2015 Update

This release of Is Parallel Programming Hard, And, If So, What Can You Do About It? features a new chapter on SMP real-time programming, an updated formal-verification chapter, removal of a couple of the less-relevant appendices, several new cartoons (along with some refurbishing of old cartoons), and other updates, including contributions from Bill Pemberton, Borislav Petkov, Chris Rorvick, Namhyung Kim, Patrick Marlier, and Zygmunt Bazyli Krynicki.

As always, git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git will be updated in real time.

I suppose that I might as well get the “But what about Verification Challenge 1?” question out of the way to start with. You will find Verification Challenge 1 here.

Now, Verification Challenge 1 was about locating a known bug. Verification Challenge 2 takes a different approach: The goal is instead to find a bug if there is one, or to prove that there is no bug. This challenge involves the Linux kernel's NO_HZ_FULL_SYSIDLE functionality, which is supposed to determine whether or not all non-housekeeping CPUs are idle. The normal Linux-kernel review process located an unexpected bug (which was allegedly fixed), so it seemed worthwhile to apply some formal verification. Unfortunately, all of the tools that I tried failed. Not simply failed to verify, but failed to run correctly at all—though I have heard a rumor that one of the tools was fixed, and thus advanced to the “failed to verify” state, where “failed to verify” apparently meant that the tool consumed all available CPU and memory without deigning to express an opinion as to the correctness of the code.

So I fell back to 20-year-old habits and converted my C code to a Promela model and used spin to do a full-state-space verification. After some back and forth, this model did claim verification, and correctly refused to verify bug-injected perturbations of the model. Mathieu Desnoyers created a separate Promela model that made more deft use of temporal logic, and this model also claimed verification and refused to verify bug-injected perturbations. So maybe I can trust them. Or maybe not.

Unfortunately, regardless of whether or not I can trust these models, they are not appropriate for regression testing. How large a change to the C code requires a corresponding change to the Promela models? And how can I be sure that this corresponding change was actually the correct change? And so on: These kinds of questions just keep coming.

It would therefore be nice to be able to validate straight from the C code. So if you have a favorite verification tool, why not see what it can make of NO_HZ_FULL_SYSIDLE? The relevant fragments of the C code, along with both Promela models, can be found here. See the README file for a description of the files, and you know where to find me for any questions that you might have.

If you do give it a try, please let me know how it goes!

My previous posting described an RCU bug that I might plausibly blame on falsehoods from firmware. The RCU bug in this post, alas, I can blame only on myself.

In retrospect, things were going altogether too smoothly while I was getting my RCU commits ready for the 3.19 merge window. That changed suddenly when my automated testing kicked out a “BUG: FAILURE, 1 instances”. This message indicates a grace-period failure, in other words that RCU failed to be RCU, which is of course really really really bad. Fortunately, it was still some weeks until the merge window, so there was some time for debugging and fixing.

Of course, we all have specific patches that we are suspicious of. So my next step was to revert suspect patches and to otherwise attempt to outguess the bug. Unfortunately, I quickly learned that the bug is difficult to reproduce, requiring something like 100 hours of focused rcutorture testing. Bisection based on 100-hour tests would have consumed the remainder of 2014 and a significant fraction of 2015, so something better was required. In fact, something way better was required because there was only a very small number of failures, which meant that the expected test time to reproduce the bug might well have been 200 hours or even 300 hours instead of my best guess of 100 hours.

My first attempt at “something better” was to inspect the suspect patches. This effort did locate some needed fixes, but nothing that would explain the grace-period failures. My next attempt was to take a closer look at the dmesg logs of the two runs with failures, which produced much better results.

You see, rcutorture detects failures using the RCU API, for example, invoking synchronize_rcu() and waiting for it to return. This overestimates the grace-period duration, because RCU's grace-period primitives wait not only for a grace period to elapse, but also for the RCU core to notice their requests and also for RCU to inform them of the grace period's end, as shown in the following figure.

[Figure RCUnearMiss.svg: rcutorture's RCU-API view of a grace period extends well beyond the underlying grace period, whose jagged start and end lines reflect unsynchronized access to the grace-period counters.]

This means that a given RCU read-side critical section might overlap a grace period by a fair amount without rcutorture being any the wiser. However, some RCU implementations provide rcutorture access to the underlying grace-period counters, which in theory provide rcutorture with a much more precise view of each grace period's duration. These counters have long been recorded in the rcutorture output as “Reader Batch” counts, as shown on the second line of the following (the first line is the full-API data):


rcu-torture: !!! Reader Pipe:  13341924415 88927 1 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 1 0 0 0 0 0 0 0 0



This shows rcutorture output from a run containing a failure. On each line, the first two numbers correspond to legal RCU activity: An RCU read-side critical section might have been entirely contained within a single RCU grace period (first number) or it might have overlapped the boundary between an adjacent pair of grace periods (second number). However, a single RCU read-side critical section is most definitely not allowed to overlap three different grace periods, because that would mean that the middle grace period was by definition too short. And exactly that error occurred above, as indicated by the exclamation marks and the “1” in the third position of the “Reader Pipe” line.

If RCU were working perfectly, in theory, the output would instead look something like this, without the exclamation marks and without non-zero values in the third and subsequent positions:


rcu-torture: Reader Pipe:  13341924415 88927 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 0 0 0 0 0 0 0 0 0



In practice, the “Reader Batch” counters were intended only for statistical use, and access to them is therefore unsynchronized, as indicated by the jagged grace-period-start and grace-period-end lines in the above figure. Furthermore, any attempt to synchronize them in rcutorture's RCU readers would incur so much overhead that rcutorture could not possibly do a good job of torturing RCU. This means that these counters can result in false positives, and false positives are not something that I want in my automated test suite. In other words, it is at least theoretically possible that we might legitimately see something like this from time to time:


rcu-torture: Reader Pipe:  13341924415 88927 0 0 0 0 0 0 0 0 0
rcu-torture: Reader Batch:  13341824063 189279 1 0 0 0 0 0 0 0 0



We clearly need to know how bad this false-positive problem is. One way of estimating the level of false-positive badness is to scan my three months' worth of rcutorture test results. This scan showed that there were no suspicious Reader Batch counts in more than 1,300 hours of rcutorture testing, aside from roughly one per hour from the TREE03 test, which happened to also be the only test to produce real failures. This suggested that I could use the Reader Batch counters as indications of a “near miss” to guide my debugging effort. The once-per-hour near-miss rate suggested a ten-hour test duration, which I was able to compress into two hours by running five concurrent tests.

Why ten hours?
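
One way to arrive at ten hours, assuming that TREE03's near misses arrive roughly as a Poisson process at the observed rate of about one per hour: a ten-hour run of a still-broken kernel would be expected to produce about ten near misses, and the probability of such a run producing none at all is about e^-10, or roughly 0.00005. A clean ten-hour run is therefore reasonably strong evidence that a given change actually helped, while keeping the test duration manageable.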

Since TREE03 had been generating Reader Batch near misses all along, this was not a recent problem. Therefore, git bisect was more likely to be confused by unrelated ancient errors than to be of much help. I therefore instead bisected by configuration and by rcutorture parameters. This alternative bisection showed that the problem occurred only with normal grace periods (as opposed to expedited grace periods), only with CONFIG_RCU_BOOST=y, and only with concurrent CPU-hotplug operations. Of course, one possibility was that the bug was in rcutorture rather than RCU, but rcutorture was exonerated via priority-boost testing on a kernel built with CONFIG_RCU_BOOST=n, which showed neither failures nor Reader Batch near misses. This combination of test results points the finger of suspicion directly at the rcu_preempt_offline_tasks() function, which is the only part of RCU's CPU-hotplug code path that has a non-trivial dependency on CONFIG_RCU_BOOST.

One very nice thing about this combination is that it is very unlikely that people are encountering this problem in production. After all, for this problem to appear, you must be doing the following:


  1. Running a kernel with both CONFIG_PREEMPT=y and CONFIG_RCU_BOOST=y.
  2. Running on hardware containing more than 16 CPUs (assuming the default value of 16 for CONFIG_RCU_FANOUT_LEAF).
  3. Carrying out frequent CPU-hotplug operations, and so much so that a given block of 16 CPUs (for example, CPUs 0-15 or 16-31) might reasonably all be offline at the same time.


That said, if this does describe your system and workload, you should consider applying this patch set. You should also consider modifying your workload.

Returning to the rcu_preempt_offline_tasks() function that the finger of suspicion now points to:


 1 static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
 2                                      struct rcu_node *rnp,
 3                                      struct rcu_data *rdp)
 4 {
 5   struct list_head *lp;
 6   struct list_head *lp_root;
 7   int retval = 0;
 8   struct rcu_node *rnp_root = rcu_get_root(rsp);
 9   struct task_struct *t;
10 
11   if (rnp == rnp_root) {
12     WARN_ONCE(1, "Last CPU thought to be offlined?");
13     return 0;
14   }
15   WARN_ON_ONCE(rnp != rdp->mynode);
16   if (rcu_preempt_blocked_readers_cgp(rnp) && rnp->qsmask == 0)
17     retval |= RCU_OFL_TASKS_NORM_GP;
18   if (rcu_preempted_readers_exp(rnp))
19     retval |= RCU_OFL_TASKS_EXP_GP;
20   lp = &rnp->blkd_tasks;
21   lp_root = &rnp_root->blkd_tasks;
22   while (!list_empty(lp)) {
23     t = list_entry(lp->next, typeof(*t), rcu_node_entry);
24     raw_spin_lock(&rnp_root->lock);
25     smp_mb__after_unlock_lock();
26     list_del(&t->rcu_node_entry);
27     t->rcu_blocked_node = rnp_root;
28     list_add(&t->rcu_node_entry, lp_root);
29     if (&t->rcu_node_entry == rnp->gp_tasks)
30       rnp_root->gp_tasks = rnp->gp_tasks;
31     if (&t->rcu_node_entry == rnp->exp_tasks)
32       rnp_root->exp_tasks = rnp->exp_tasks;
33 #ifdef CONFIG_RCU_BOOST
34     if (&t->rcu_node_entry == rnp->boost_tasks)
35       rnp_root->boost_tasks = rnp->boost_tasks;
36 #endif
37     raw_spin_unlock(&rnp_root->lock);
38   }
39   rnp->gp_tasks = NULL;
40   rnp->exp_tasks = NULL;
41 #ifdef CONFIG_RCU_BOOST
42   rnp->boost_tasks = NULL;
43   raw_spin_lock(&rnp_root->lock);
44   smp_mb__after_unlock_lock();
45   if (rnp_root->boost_tasks != NULL &&
46       rnp_root->boost_tasks != rnp_root->gp_tasks &&
47       rnp_root->boost_tasks != rnp_root->exp_tasks)
48     rnp_root->boost_tasks = rnp_root->gp_tasks;
49   raw_spin_unlock(&rnp_root->lock);
50 #endif
51   return retval;
52 }



First, we should of course check the code under #ifdef CONFIG_RCU_BOOST. The code starting at line 43 is quite suspicious, as it is not clear why it is safe to make these modifications after having dropped the rnp_root structure's ->lock on line 37. And removing lines 43-49 does in fact reduce the number of Reader Batch near misses by an order of magnitude, but sadly not to zero.

I was preparing to dig into this function to find the bug, but then I noticed that the loop spanning lines 22-38 is executed with interrupts disabled. Given that the number of iterations through this loop is limited only by the number of tasks in the system, this loop is horrendously awful for real-time response.

Or at least it is now—at the time I wrote that code, the received wisdom was “Never do CPU-hotplug operations on a system that runs realtime applications!” I therefore had invested zero effort into maintaining good realtime latencies during these operations. That expectation has changed because some recent real-time applications offline then online CPUs in order to clear off irrelevant processing that might otherwise degrade realtime latencies. Furthermore, the large CPU counts on current systems are an invitation to run multiple real-time applications on a single system, so that the CPU hotplug operations run as part of one application's startup might interfere with some other application. Therefore, this loop clearly needs to go. So I abandoned my debugging efforts and focused instead on getting rid of rcu_preempt_offline_tasks() entirely, along with all of its remaining bugs.

The point of the loop spanning lines 22-38 is to handle any tasks on the rnp structure's ->blkd_tasks list. This list accumulates tasks that block within an RCU read-side critical section while running on a CPU associated with the leaf rcu_node structure pointed to by rnp. This blocking will normally be due to preemption, but could also be caused by -rt's blocking spinlocks. This function is called only when the last CPU associated with the leaf rcu_node structure is going offline, after which there will no longer be any online CPUs associated with this leaf rcu_node structure. This loop then moves those tasks to the root rcu_node structure's ->blkd_tasks list. Because there is always at least one CPU online somewhere in the system, there will always be at least one online CPU associated with the root rcu_node structure, which means that RCU will be guaranteed to take proper care of these tasks.

The obvious question at this point is “why not just leave the tasks on the leaf rcu_node structure?” After all, this clearly avoids the unbounded time spent moving tasks while interrupts are disabled. However, it also means that RCU's grace-period computation must change in order to account for blocked tasks associated with a CPU-free leaf rcu_node structure.

In the old approach, the leaf rcu_node structure in question was excluded from RCU's grace-period computations immediately. In the new approach, RCU will need to continue including that leaf rcu_node structure until the last task on its list exits its outermost RCU read-side critical section. At that point, the last task will remove itself from the list, leaving the list empty, thus allowing RCU to ignore the leaf rcu_node structure from that point forward.

This patch set makes this change by abstracting an rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu(), which can then be called from rcu_read_unlock_special() when the last task removes itself from the ->blkd_tasks list. This change allows the rcu_preempt_offline_tasks() function to be dispensed with entirely. With this patch set applied, TREE03 ran for 1,000 hours with neither failures nor near misses, which gives a high degree of confidence that this patch set made the bug less likely to appear. This, along with the decreased complexity and the removal of a source of realtime latency, is a definite plus!

I also modified the rcutorture scripts to pay attention to the “Reader Batch” near-miss output, giving a warning if one such near miss occurs in a run, and giving an error if two or more near misses occur, but at a rate of at least one near miss per three hours of test time. This should inform me should I make a similar mistake in the future.

Given that RCU priority boosting has been in the Linux kernel for many years and given that I regularly run rcutorture with CPU-hotplug testing enabled on CONFIG_RCU_BOOST=y kernels, it is only fair to ask why rcutorture did not locate this bug years ago. The likely reason is that I only recently added RCU callback flood testing, in which rcutorture registers 20,000 RCU callbacks, waits a jiffy, registers 20,000 more callbacks, waits another jiffy, then finally registers a third set of 20,000 callbacks. This causes call_rcu() to take evasive action, which has the effect of starting the RCU grace period more quickly, which in turn makes rcutorture more sensitive to too-short grace periods. This view is supported by the fact that I saw actual failures only on recent kernels whose rcutorture testing included callback-flood testing.

Another question is “exactly what was the bug?” I have a fairly good idea of what stupid mistake led to that bug, but given that I have completely eliminated the offending function, I am not all that motivated to chase it down. In fact, it seems much more productive to leave it as an open challenge for formal verification. So, if you have a favorite formal-verification tool, why not see what it can make of rcu_preempt_offline_tasks()?

Somewhat embarrassingly, slides 39-46 of this Collaboration Summit presentation feature my careful development of rcu_preempt_offline_tasks(). I suppose that I could hide behind the “More changes due to RCU priority boosting” on slide 46, but the fact remains that I simply should not have written this function in the first place, whether carefully before priority boosting or carelessly afterwards.

In short, it is not enough to correctly code a function. You must also code the correct function!