Verification Challenges

LPC Logo

TL;DR: Memory-Model Recommendations for Rusting the Linux Kernel

These recommendations assume that the initial Linux-kernel targets for Rust developers are device drivers that do not have unusual performance and scalability requirements, meaning that wrappering of small C-language functions is tolerable. (Please note that most device drivers fit into this category.) It also assumes that the main goal is to reduce memory-safety bugs, although other bugs might be addressed as well. Or, Murphy being Murphy, created as well. But that is a risk in all software development, not just Rust in the Linux kernel.

Those interested in getting Rust into Linux-kernel device drivers sooner rather than later should look at the short-term recommendations, while those interested in extending Rust's (and, for that matter, C's) concurrency capabilities might be more interested in the long-term recommendations.

Short-Term Recommendations

The goal here is to allow the Rust language considerable memory-model flexibility while also providing Rust considerable freedom in what it might be used for within the Linux kernel. The recommendations are as follows:

  1. Provide wrappers around the existing Linux kernel's locking, MMIO, I/O-barrier, and I/O-access primitives. If I was doing this work, I would add wrappers incrementally as they were actually needed, but I freely admit that there are benefits to providing a full set of wrappers from the get-go.
  2. Either wrapper READ_ONCE() or be careful to take Alpha's, Itanium's, and ARMv8's requirements into account as needed. The situation with WRITE_ONCE() appears more straightforward. The same considerations apply to the atomic_read() and atomic_set() family of primitives.
  3. Atomic read-modify-write primitives should be made available to Rust programs via wrappers around C code. But who knows? Maybe it is once again time to try out intrinsics. Again, if I was doing the work, I would add wrappers/implementations incrementally.
  4. Although there has been significant discussion surrounding how sequence locking and RCU might be handled by Rust code, more work appears to be needed. For one thing, it is not clear that people proposing solutions are aware of the wide range of Linux-kernel use cases for these primitives. In the meantime, I recommend keeping direct use of sequence locking and RCU in C code, and providing Rust wrappers for the resulting higher-level APIs. In this case, it might be wise to make good use of compiler directives in order to limit Rust's ability to apply code-motion optimizations. However, if you need sequence-locking or RCU Rust-language wrappers, there are a number that have been proposed. (Except that you should first take another look or three at wrappering higher-level APIs. Yes, APIs are hard, but there are huge benefits to proper APIs!)
  5. Control dependencies should be confined to C code. If needed, higher-level APIs whose C-language implementations require control dependencies can be wrappered for Rust use. But my best guess is that it will be some time before Rust code needs control dependencies.

Taking this approach should avoid memory-model-mismatch issues between Rust and C code in the Linux kernel. And discussions indicate that much (and maybe all) of the wrappering work called out in the first three items above has already been done.

Of course, situations that do not fit this set of recommendations can be addressed on a case-by-case basis. I would of course be happy to help. Specifically, in my role as lead Linux-kernel RCU maintainer:

  1. My first reaction to submission of Rust wrappers for the RCU API will be to ask hard questions about the possibility of higher-level APIs.
  2. If higher-level APIs are infeasible, I will look carefully at which RCU use cases are supported by the proposed wrappering. I am working to improve documentation of the range of RCU use cases in order to help submitters to do this as well. (This work will likely be helpful elsewhere as well.)
  3. Again, if higher-level APIs are infeasible, I will look carefully at how the Rust wrappers are helping to find bugs. This will clearly require me to continue learning Rust, as this requires me to have a detailed view of Rust's ownership mechanisms and how things like reference-counting use cases work around limitations in these mechanisms.

This procedure should help buy the time required for me to learn more about the Rust language and for the Rust community to learn more about RCU and its many use cases.

Long-Term Recomendations

This section takes a more utopian view. What would a perfect Rust sequence-locking implementation/wrapper look like? Here are some off-the-cuff desiderata, none of which are met by the current C-code Linux-kernel implementation:

  1. Use of quantities computed from sequence-lock-protected variables in a failed reader should result in a warning. But please note that reliably associating variables with sequence locks may not be easy. One approach suggested in response to this series is to supply a closure containing the read-side critical section, thus restricting such leakage to (unsafe?) side effects.
  2. Improper access to variables not protected by the sequence lock should result in a warning. But please note that it is quite difficult to define "improper" in this context, let alone detect it in real code.
  3. Data races in failed sequence-lock readers should not cause failures. But please note that this is extremely difficult in general if the data races involve non-volatile C-language accesses in the reader. For example, the compiler would be within its rights to refetch the value after the old value had been checked. (This is why the sequence-locking post suggests marked accesses to sequence-lock-protected variables.)
  4. Data races involving sequence-locking updaters are detected, for example, via KCSAN.

As noted elsewhere, use of a wrapper around the existing C-language implementation allows C and Rust to use the same sequence lock. This might or might not prove to be important.

Similarly, what would a perfect Rust RCU wrapper look like? Again, here are some off-the-cuff desiderata, similarly unmet by existing C-code Linux-kernel implementations:

  1. Make call_rcu() and friends cause the specified object to make an end-of-grace-period transition in other threads' ownership from readers to unowned. Again, not an immediate transition at the time of call_rcu() invocation, but rather a deferred transition that takes place at the end of some future grace period.
  2. Some Rust notion of type safety would be useful in slab caches flagged as SLAB_TYPESAFE_BY_RCU.
  3. Some Rust notion of existence guarantee would be useful for RCU readers.
  4. Detect pointers to RCU-protected objects being improperly leaked from RCU read-side critical sections, where proper leakage is possible via locks and reference counters. Perhaps an emphasis on closures is at least part of the answer.
  5. Detect mishandling of dependency-carrying pointers returned by rcu_dereference() and friends. Or perhaps introduce some marking such pointers to that the compiler will avoid breaking the ordering. See the address/data dependency post for a list of C++ working papers moving towards this goal.

There has recently been some work attempting to make the C compiler understand control dependencies. Perhaps Rust could make something useful happen in this area, perhaps by providing markings that allow the compiler to associate the reads with the corresponding control-dependent writes.

Perhaps Rust can better understand more of the ownership schemes that are used within the Linux kernel.


Please note that middle-term approaches are likely to be useful, for example, those that simply provide wrappers around the C-language RCU and sequence-locking implementations. However, such approaches also carry risks, especially during that time when the intersections of the set of people deeply understanding Rust with the set of people deeply understanding RCU and sequence locking is the empty set. For example, one risk that became apparent during the effort to add RCU to the C++ standard is that people new to RCU and sequence locking will latch onto the first use case that they encounter and focus solely on that use case. This has also been a recurring issue for concurrency experts who encounter sequence locking and RCU for the first time. This of course means that I need to do a better job of documenting RCU, its use cases, and the relationships between them.

This is not to say that per-use-case work is pointless. In fact, such work can be extremely valuable, if nothing else, in helping to build understanding of the overall problem. It is also quite possible that current RCU and sequence-locking use cases will resemble those of the future in the same way that the venerable "while" loop resembles the wide panoply of interator-like constructs found not only in modern languages, including those written in C in the Linux kernel, in other words, perhaps Rust will focus on specific fearless-concurrency-friendly RCU/sequence-locking use cases. Except that the Linux kernel still contains "while" loops, as does a great deal of other software, which suggests that the low-level RCU and sequence-locking APIs will always be required. There will also be questions as to whether a given new-age use case is there to help developers using RCU and sequence locking on the one hand or whether its main purpose is instead to work around a shortcoming in Rust on the other. "This should be good clean fun!" ;-)

Experience indicates that it will take significant time to sort out all of these issues. We should therefore make this time available by proceeding initially as described in the short-term recommendations.

Criteria for judging proposals include safety, performance, scalability, build time, development cost, maintenance effort, and so on, not necessarily in that order.


October 18, 2021: Add "Rationale" section.
October 20, 2021: Add a few expansions and clarifications.
October 21, 2021: RCU-specific recommendations.
LPC Logo

Rusting the Linux Kernel: Summary and Conclusions

We have taken a quick trip through history, through a number of the differences between the Linux kernel and the C/C++ memory models, sequence locks, RCU, ownership, zombie pointers, and KCSAN. I give a big "thank you" to everyone who has contributed to this discussion, both publicly and privately. It has been an excellent learning experience for me, and I hope that it has also been helpful to all of you.

To date, Can Rust Code Own Sequence Locks? has proven the most popular by far. Porting Linux-kernel code using sequence locking to Rust turns out to be trickier than one might expect, in part due to the inherently data-racy nature of this synchronization primitive.

So what are those advocating use of Rust within the Linux kernel to do?

The first thing is to keep in mind that the Linux kernel comprises tens of millions of lines of code. One disadvantage of this situation is that the Linux kernel is not going to be ported to much of anything quickly or easily. One corresponding advantage is that it allows Rust-for-Linux developers to carefully pick their battles, focusing first on those portions of the Linux kernel where conversion to Rust might do the most good. Given that order-of-magnitudes performance wins are unlikely, the likely focus is likely to be on proof of concept and on reduced bug rates. Given Rust's heavy focus on bugs stemming from undefined behavior, and given the Linux kernel's use of the -fno-strict-aliasing and -fno-strict-overflow compiler command-line options, bug-reduction choices will need to be made quite carefully. In addition, order-of-magnitude bug-rate differences across the source base provides a high noise floor that makes it more difficult to measure small bug-reduction rates. But perhaps enabling the movement of code into mainline and out of staging can be another useful goal. Or perhaps there is an especially buggy driver out there somewhere that could make good use of some Rust code.

Secondly, careful placement of interfaces between Rust and C code is necessary, especially if there is truth to the rumors that Rust does not inline C functions without the assistance of LTO. In addition, devilish details of Linux-kernel synchronization primitives and dependency handling may further constrain interface boundaries.

Finally, keep in mind that the Linux kernel is not written in standard C, but rather in a dialect that relies on gcc extensions, code-style standards, and external tools. Mastering these extensions, standards, and tools will of course require substantial time and effort, but on the other hand this situation provides precedent for Rust developers to also rely on extensions, style standards, and tools.

But what about the question that motivated this blog in the first place? What memory model should Linux-kernel Rust code use?

For Rust outside of the Linux kernel, the current state of compiler backends and the path of least resistance likely leads to something resembling the C/C++ memory model. Use of Rust in the Linux kernel is therefore not a substitute for continued participation in the C/C++ standards committees, much though some might wish otherwise.

Rust non-unsafe code does not depend much on the underlying memory model, at least assuming that unsafe Rust code and C code avoids undermining the data-race-free assumption of non-unsafe code. The need to preserve this assumption was in fact inspired the blog post discussing KCSAN.

In contrast, Rust unsafe code must pay close attention to the underlying memory model, and in the case of the Linux kernel, the only reasonable choice is of course the Linux-kernel memory model. That said, any useful memory-ordering tool will also need to pay attention to safe Rust code in order to correctly evaluate outcomes. However, there is substantial flexibility, depending on exactly where the interfaces are placed:

  1. As noted above, forbidding use of Rust unsafe code within the Linux kernel would render this memory-model question moot. Data-race freedom for the win!
  2. Allowing Rust code to access shared variables only via atomic operations with acquire (for loads), release (for stores) or stronger ordering would allow Rust unsafe code to use that subset of the Linux-kernel memory model that mirrors the C/C++ memory model.
  3. Restricting use of sequence locks, RCU, control dependencies, lockless atomics, and non-standard locking (e.g., stores to shared variables within read-side reader-writer-locking critical sections) to C code would allow Rust unsafe code to use that subset of the Linux-kernel memory model that closely (but sadly, not exactly) mirrors the C/C++ memory model.
  4. Case-by-case relaxation of the restrictions called out in the preceding pair of items pulls in the corresponding portions of the Linux-kernel memory model. For example, adding sequence locking, but only in cases where readers access only a limited co-located set of objects might permit some of the simpler Rust sequence-locking implementations to be used (see for example the comments to the sequence-locking post).
  5. Insisting that Rust be able to do everything that Linux-kernel C code currently does pulls in the entirety of the Linux-kernel memory model, including those parts that have motivated many of my years of C/C++ standards-committee fun and excitement.

With the first three options, a Rust compiler adhering to the C/C++ memory model will work just fine for Linux-kernel C code. In contrast, the last two options would require code-style restrictions (preferably automated), just as they are in current Linux-kernel C code. See the recommendations post for more detail.

In short, choose wisely and be very careful what you wish for! ;-)


October 7, 2021: Added atomic-operation-only option to memory-model spectrum.
October 8, 2021: Fix typo noted by Miguel Ojeda
October 12, 2021: Self-review.
October 13, 2021: Add reference to recommendations post.
LPC Logo

Can the Kernel Concurrency Sanitizer Own Rust Code?

Given the data-race-freedom guarantees of Rust's non-unsafe code, one might reasonably argue that there is no point in the Kernel Concurrency Sanitizer (KCSAN) analyzing such code. However, the Linux kernel is going to need unsafe Rust code, and although there has been significant progress in formal verification of such code, one of the leading researchers in this area says “we hope to eventually develop formal methods that can be put directly in the hands of programmers”. We would be wise to take careful note of the word “eventually”. Furthermore, even given unanticipated universal acclamation of Rust within the Linux kernel community combined with equally unanticipated advances in C-to-Rust translation capabilities, a significant fraction of the existing tens of millions of lines of Linux-kernel C code will persist for some time to come. Both the unsafe Rust code and the C code can interfere with Rust non-unsafe code, and furthermore there are special cases where safe code can violate unsafe code's assumptions. Therefore, run-time analysis of Rust code (safe code included) is likely to be able to find issues that compile-time analysis cannot.

As a result, Rust seems unlikely to render KCSAN obsolete any time soon. This suggests that Rust code should include KCSAN instrumentation, both via the compiler and via atomic and volatile primitives, as is currently the case for the C compilers and primitives currently in use. The public KCSAN interface is as follows:

  1. __kcsan_check_access(): This function makes a relevant memory access known to KCSAN. It takes a pointer to the object being accessed, the size of the object, and an access type (for example, zero for read, KCSAN_ACCESS_WRITE for write, and KCSAN_ACCESS_ATOMIC for atomic read-modify-write).
  2. The __tsan_{read,write}{1,2,4,8}() family of functions serve the same purpose as __kcsan_check_access(), but the size and access type are implicit in the function name instead of being passed as arguments, which can provide better performance. This family of functions is used by KCSAN-enabled compilers.
  3. ASSERT_EXCLUSIVE_ACCESS() is invoked from C code and causes KCSAN to complain if there is a concurrent access to the specified object. A companion ASSERT_EXCLUSIVE_ACCESS_SCOPED() expands the scope of the concurrent-access complaint to the full extent of the compound statement invoking this macro.
  4. ASSERT_EXCLUSIVE_WRITER() is invoked from C code and causes KCSAN to complain if there is a concurrent write to the specified object. A companion ASSERT_EXCLUSIVE_WRITER_SCOPED() expands the scope of the concurrent-writer complaint to the full extent of the compound statement invoking this macro.
  5. ASSERT_EXCLUSIVE_BITS() is similar to ASSERT_EXCLUSIVE_WRITER(), but focuses KCSAN's attention on changes to bits set in the mask passed as this macro's second argument.

These last three categories of interface members are designed to be directly invoked in order to check concurrency designs. See for example the direct call to ASSERT_EXCLUSIVE_WRITER() from the rcu_cpu_starting() function in Linux-kernel RCU. It would of course be quite useful for these interface members to be available from within Rust code.

KCSAN has helped me fix a number of embarrassing bugs in Linux-kernel RCU that might otherwise have inconvenience end users, so they just might be helpful to people introducing Rust code into the Linux kernel. It is therefore quite encouraging that Marco Elver (KCSAN lead developer and maintainer) points out off-list that the rustc compiler already has sanitizer support, which should suffice not only for KCSAN, but for the equally helpful Kernel Address Sanitizer (KASAN) as well. He also notes that some care is required to pass in the relevant -mllvm options to rustc and to ensure that it is not attempting to link against compiler-rt because doing so is not appropriate when building the Linux kernel. Marco also noted that KCSAN relies on the front-end to properly handle volatile accesses, and Miguel Ojeda kindly confirmed that it does. So there is hope!

For more information on KCSAN and its reason for being:

  1. "Concurrency bugs should fear the big bad data-race detector" (Part 1 and Part 2).
  2. Who's afraid of a big bad optimizing compiler?.
  3. Calibrating your fear of big bad optimizing compilers.
  4. Sections 4.3.4 ("Accessing Shared Variables"), 15.2 ("Tricks and Traps"), and 15.3 ("Compile-Time Consternation") of perfbook.


October 7, 2021: Add current state of KCSAN/Rust integration. Clarify Rust module per Gary Guo email.
October 12, 2021: Self-review.
LPC Logo

Will Your Rust Code Survive the Attack of the Zombie Pointers?

Some of the previous posts in this series have been said to be quite difficult, so I figured I owed you all an easy one. And the zombie-pointer problem really does have a trivial solution, at least in the context of the Linux kernel. In other environments, all bets are off.

But first, what on earth is a zombie pointer?

To answer that question, let's start with a last-in-first-out (LIFO) stack. We don't know when this algorithm was invented or who invented it, but a very similar algorithm was described in R. J. Treiber's classic 1986 technical report (IBM Almaden Research Center RJ 5118), and it was referred to as prior art in (expired) US Patent 3,886,525, which was filed in 1973 (hat trick to Maged Michael). Fun though it might be to analyze this code's original IBM assembly-language implementation, let's instead look at a C11 implementation:

 1 struct node_t* _Atomic top;
 3 void list_push(value_t v)
 4 {
 5   struct node_t *newnode = (struct node_t *) malloc(sizeof(*newnode));
 7   set_value(newnode, v);
 8   newnode->next = atomic_load(&top);
 9   do {
10     // newnode->next may have become invalid
11   } while (!atomic_compare_exchange_weak(&top, &newnode->next, newnode));
12 }
14 void list_pop_all()
15 {
16   struct node_t *p = atomic_exchange(&top, NULL);
18   while (p) {
19     struct node_t *next = p->next;
21     foo(p);
22     free(p);
23     p = next;
24   }
25 }

Those used to the Linux-kernel cmpxchg() atomic operation should keep in mind that when the C11 atomic_compare_exchange_weak() fails, it writes the current value referenced by its first argument to the pointer passed as its second argument. In other words, instead of passing atomic_compare_exchange_weak() the new value, you instead pass it a pointer to the new value. Each time atomic_compare_exchange_weak() fails, it updates the pointed-to new value. This sometimes allows this function to be used in a tight loop, as in line 11 above.

With these atomic_compare_exchange_weak() semantics in mind, let's step through execution of list_push(C) on a stack initially containing objects A and B:

  1. Line 5 allocates memory for the new object C.
  2. Line 7 initializes this newly allocated memory.
  3. Line 8 sets the new object's ->next pointer to the current top-of-stack pointer.
  4. Line 11 invokes atomic_compare_exchange_weak(), which atomically compares the value of newnode->next to the value of top, and if they are equal, stores the pointer newnode into top and returns true. (As noted before, if the comparison is instead not-equal, atomic_compare_exchange_weak() instead stores the value of top into newnode->next and returns false, repeating until such time as the pointers compare equal.)
  5. One way or another, the stack ends up containing objects C, A, and B, ordered from the top of stack down.

Oddly enough, this code would still work even if line 8 were omitted, however, that would result in a high probability that the first call to atomic_compare_exchange_weak() would fail, which would not be so good for common-case performance. It would also result in compile-time complaints about uninitialized variables, so line 8 is doubly unlikely to be omitted in real life.

While we are at it, let's step through list_pop_all():

  1. Line 16 atomically stores NULL into top and returns its previous value. After this lines executes, the stack is empty and its previous contents are referenced by local variable p.
  2. Lines 18-24 execute for each object on list p:
    1. Line 19 prefetches a pointer to the next object.
    2. Line 21 passes the current object to function foo().
    3. Line 22 frees the current object.
    4. Line 23 advances to the next object.

Thus far, we have seen no zombie pointers. Let's therefore consider the following sequence of events, with the stack initially containing objects A and B:

  1. Thread 1 starts pushing object C, but is interrupted, preempted, otherwise impeded just after executing line 8.
  2. Thread 2 pops the entire list and frees all of the objects. Object C's ->next pointer is now invalid because it points to now-freed memory formerly known as object A.
  3. Thread 2 pushes an object, and happens to allocate memory for it at the same address as the old object A, so let's call it A'.
  4. Object C's ->next pointer is still invalid as far as an omniscient compiler is concerned, but from an assembly language viewpoint happens to reference the type-compatible object A'. This pointer is now a "zombie pointer" that has come back from the dead, or that has at least come to have a more entertaining form of invalidity.
  5. Thread 1 resumes and executes the atomic_compare_exchange_weak() on line 11. Now this primitive, like its Linux-kernel counterpart cmpxchg(), operates not on the compiler's idea of what the pointer is, but rather on the actual bits making up that pointer. Therefore, that atomic_compare_exchange_weak() succeeds despite the fact that the object C's ->next pointer is invalid.
  6. Thread 1 has thus completed its push of object C, and the resulting stack looks great from an assembly-language viewpoint. But the pointer from object C to A' is a zombie pointer, and compilers are therefore within their rights to do arbitrarily strange things with it.

As noted earlier, this has a trivial solution within the Linux kernel. Please do not peek too early. ;-)

But what can be done outside of the Linux kernel?

The traditional solution has been to hide the allocator from the compiler. After all, if the compiler cannot see the calls to malloc() and free(), it cannot figure out that a given pointer is invalid (or, in C-standard parlance, indeterminate). Creating your own allocator with its own function names suffices, and back in the day you often needed to create your own allocator if performance and scalability was important to you.

However, this would also deprive Rust of information that it needs to check pointer operations. So maybe Rust needs to know about allocation and deallocation, but needs to be very careful not to pass this information on to the optimizer. Assuming that this even makes sense in the context of the implementation of Rust.

Perhaps code working with invalid and zombie pointers needs to be carefully written in C. And perhaps there is some way to write it safely in unsafe Rust.

The final approach is to wait until backend support for safe handling of invalid/zombie pointers arrives, and then add Rust syntax as appropriate. Here is some documentation of current efforts towards adding such support to C++:

  1. CPPCON 2020 presentation
  2. Pointer lifetime-end zap (informational/historical)
  3. Pointer lifetime-end zap proposed solutions

So there is good progress, but it might still be some time before this is supported by either the standard or by compilers.


October 6, 2021: Expand description per feedback from Jonathan Corbet
October 12, 2021: Self-review.
LPC Logo

How Much of the Kernel Can Rust Own?

Rust concurrency makes heavy use of ownership and borrowing. The purpose of this post is not to give an exposition of Rust's capabilities and limitations in this area, but rather to give a series of examples of ownership in the Linux kernel.

The first example involves Linux-kernel per-CPU variables. In some cases, such variables are protected by per-CPU locks, for example, a number of fields in the per-CPU rcu_data structure are used by the kernel threads that manage grace periods for offloaded callbacks, and these fields are protected by the ->nocb_gp_lock field in the same instance of that same structure. In other cases, access to a given per-CPU variable is permitted only by the corresponding CPU, and even then only if that CPU has disabled preemption. For example, the per-CPU rcu_data structure's ->ticks_this_gp field may be updated only from the corresponding CPU, and only when preemption is disabled. In the particular case, preemption is disabled as a side-effect of having disabled interrupts.

The second example builds on the first. In kernels built with CONFIG_RCU_NOCB_CPU=n, the per-CPU rcu_data structure's ->cblist field may be updated from the corresponding CPU, and only when preemption is disabled. However, it is also allowed from some other CPU when the corresponding CPU has been taken offline, but only when that other CPU that is orchestrating the offlining of the corresponding CPU.

(What about kernels built with CONFIG_RCU_NOCB_CPU=y? They must also acquire a ->nocb_lock that is also contained within the per-CPU rcu_data structure.)

The third example moves to the non-per-CPU rcu_node structures, which are arranged into a combining tree that processes events from an arbitrarily large number of CPUs while maintaining bounded lock contention. Each rcu_node> structure has a ->qsmask bitmask that tracks which of its children need to report a quiescent state for the current grace period, and a ->gp_tasks pointer that, when non-NULL, references a list of tasks that blocked while in an RCU read-side critical section that blocks the current grace period. The children of each leaf rcu_node structure are the rcu_data structures feeding into that rcu_node structure. Finally, there is a singleton rcu_state structure that contains a ->gp_seq field that numbers the current grace period and also indicates whether or not it is in progress.

We now have enough information to state the ownership rule for the rcu_state structure's ->gp_seq field. This field may be updated only if all of the following hold:

  1. The current CPU holds the root rcu_node structure's ->lock.
  2. All rcu_node structures' ->qsmask fields are zero.
  3. All rcu_node structures' ->gp_tasks pointers are NULL.

This might seem excessively ornate, but it is the natural consequence of RCU's semantics and design:

  1. The next grace period is not permitted to start until after the previous one has ended. This is a design choice that allows RCU to function well in the face of update-side overload from large numbers of CPUs.
  2. A grace period is not allowed to end until all CPUs have been observed in a quiescent state, that is, until all rcu_node structures' ->qsmask fields have become zero.
  3. A grace period is not allowed to end until all tasks that were preempted while executing within a critical section blocking that grace period have resumed and exited their critical sections, that is, until all rcu_node structures' ->gp_tasks pointers have become NULL.
  4. If multiple CPUs would like to start a grace period at the same time, there has to be something that works out which CPU is actually going to do the starting, and the last part of that something is the root rcu_node structure's ->lock.

Trust me, there are far more complex ownership models in the Linux kernel!

The fourth example involves per-CPU data that is protected by a reader-writer lock. A CPU is permitted to both read and update its own data only if it read-holds the lock. A CPU is permitted to read other CPUs' data only if it write-holds the lock. That is right, CPUs are allowed to write when read-holding the lock and and are allowed to read while write-holding the lock. Perhaps reader-writer locks should instead have been called exclusive-shared locks, but it is some decades too late now!

The fifth and final example involves multiple locks, for example, when some readers must traverse the data structure from an interrupt handler and others must sleep while traversing that same structure. One way to implement this is to have one interrupt-disabled spinlock (for readers in interrupt handlers) and a mutex (for readers that must sleep during the traversal). Updaters must then hold both locks.

So how on earth does the current C-language Linux kernel code keep all this straight???

One indispensable tool is the assertion, which in the Linux kernel includes WARN(), BUG(), and friends. For example, RCU uses these to verify the values of the aforementioned ->qsmask and ->gp_tasks fields during between-grace-periods traversals of the rcu_node combining tree.

Another indispensable tool is lockdep, which, in addition to checking for deadlocks, allows code to verify that conditions are right. A few of the many available lockdep assertions are shown below:

  1. lockdep_assert_held(): Complains if the specified lock is not held by the current CPU/task.
  2. lockdep_assert_held_write(): Complains if the specified reader-writer lock is not write-held by the current CPU/task.
  3. lockdep_assert_held_read():: Complains if the specified reader-writer lock is not read-held by the current CPU/task.
  4. lockdep_assert_none_held_once(): Complains if the current CPU/task holds any locks.
  5. lockdep_assert_irqs_disabled(): Complains if the current CPU has interrupts enabled.
  6. rcu_read_lock_held(): Complains if the current CPU/task is not within an RCU read-side critical section.

Of course, none of these assertions are going to do anything for you unless you make use of them. On the other hand, some of the more complex ownership criteria are going to be quite difficult for compilers or other tooling to intuit without significant help from the developer.

A more entertaining ownership scenario is described in Section 5.4.6 ("Applying Exact Limit Counters") of perfbook. Chapter 8 ("Data Ownership") describes several other ownership scenarios.


October 12, 2021: Self-review.
LPC Logo

Can Rust Code Own RCU?

Read-copy update (RCU) vaguely resembles a reader-writer lock [1], but one in which readers do not exclude writers. This change in semantic permits RCU readers to be exceedingly fast and scalable. In fact, in the most aggressive case, rcu_read_lock() and rcu_read_unlock() generate no code, and rcu_dereference() emits but a single load instruction. This most aggressive case is achieved in production via Linux-kernel CONFIG_PREEMPT_NONE=y builds. Additional information on RCU is presented at the end of this post.

Can Rust Ownership be Adapted to RCU?

The fact that RCU readers do not exclude writers opens the door to data races between RCU readers and updaters. This in turn suggests that least some RCU use cases are likely to pose challenges to Rust's ownership semantics, however, discussions with Wedson Almeida Filho at Linux Plumbers Conference suggested that these semantics might be flexed a little bit, perhaps in a manner similar to the way that Linux-kernel RCU flexes lockdep's semantics. In addition, the data races between the read-side rcu_dereference() primitive and the update-side rcu_assign_pointer() primitive might reasonably be resolved by having the Rust implementation of these two primitives use unsafe mode. Or, perhaps better yet, the Rust implementation of these two primitives might simply wrapper the Linux-kernel primitives in order to get the benefit of existing tools such as the aforementioned lockdep and the sparse static-analysis checker. One downside of this approach is potential performance degradation due to the extra function call implied by the wrappering. Longer term, LTO might eliminate this function call, but if the initial uses of Rust are confined to performance-insensitive device drivers, the extra overhead should not be a problem.

Linux-Kernel Tooling and RCU

Here are a few examples of these Linux-kernel tools:

  1. A pointer can be marked __rcu, which will cause the sparse static-analysis checker to complain if that pointer is accessed directly, instead of via rcu_dereference() and rcu_assign_pointer(). This checking can help developers avoid accidental destruction of the address and data dependencies on which many RCU use cases depend.
  2. In most RCU use cases, the rcu_dereference() primitive that accesses RCU-protected pointers must itself be protected by rcu_read_lock(). The Linux-kernel lockdep facility has been therefore adapted to complain about unprotected uses of rcu_dereference().
  3. Passing the same pointer twice in quick succession to a pair of call_rcu() invocations is just as bad as doing the same with a pair of kfree() invocations: Both situations are a double free. The Linux kernel's debug-objects facility checks for this sort of abuse.

This is by no means a complete list. Although I am by no means claiming that these sorts of tools provide the same degree of safety as does Rust safe mode, it is important to understand that Rust unsafe mode is not to be compared to standard C, but rather to the dialect of C used in the Linux kernel, augmented by the associated tools, processes, and coding guidelines.

RCU Use Cases

There are in fact Rust implementations of variants of RCU, including:

  1. crossbeam-rs
  2. ArcSwapAny
I currently have no opinion on whether or not these could be used within the Linux kernel, and any such opinion is irrelevant. You see, Rust's use of RCU within the Linux kernel would need to access RCU-protected data structures provided by existing C code. In this case, it will not help for Rust code to define its own RCU. It must instead interoperate with the existing Linux-kernel RCU implementations.

But even interoperating with Linux-kernel RCU is not trivial. To help with this task, this section presents a few RCU use cases within the Linux kernel.

The first use case is the textbook example where once an RCU-protected object is exposed to readers, its data fields remain constant. This approach allows Rust's normal read-only mode of ownership to operate as intended. The pointers will of course change as objects are inserted or deleted, but these are handled by rcu_dereference() and rcu_assign_pointer() and friends. These special primitives might be implemented using unsafe Rust code or C code, as the case might be.

In-place updates of RCU-protected objects are sometimes handled using the list_replace_rcu() primitive. In this use case, a new version of the object is allocated, the old version is copied to the new version, any required updates are carried out on the new version, and then list_replace_rcu() is used to make the new version available to RCU readers. Readers then see either the old version or the new version, but either way they see an object with constant data fields, again allowing Rust ownership to work as intended.

However, one complication arises for objects removed from an RCU-protected data structure. Such objects are not owned by the updater until after a grace period elapses, and they can in fact be accessed by readers in the meantime. It is not clear to me how Rust would handle this. For purposes of comparison, within the Linux kernel, the Kernel Concurrency Sanitizer (KCSAN) handles this situation using dynamic runtime checking. With this checking enabled, when an updater takes full ownership of a structure before readers are done with it, KCSAN reports a data race. These issues should be most immediately apparent to anyone attempting to create a Rust wrapper for the call_rcu() function.

Because RCU readers do not exclude RCU updaters, it is possible for an RCU reader to upgrade itself to an updater while traversing an RCU-protected data structure. This is usually handled by a lock embedded in the object itself. From what I understand, the easiest way to apply Rust ownership to these situations is to make the RCU-protected object contain a structure that in turn contains the data protected by the lock. People wishing to apply Rust to these sorts of situations are invited to review the Linux-kernel source code to check for possible complications, for example, cases where readers locklessly read fields that are updated under the protection of the lock, and cases where multiple locks provide one type of protection or another to overlapping subsets of fields in the RCU-protected object.

Again because RCU readers do not exclude RCU updaters, there can also be lockless in-place updates (either from readers or updaters) to RCU-protected objects, for example, to set flags, update statistics, or to provide type-safe memory for any number of lockless algorithms (for example, using SLAB_TYPESAFE_BY_RCU). Rust could presumably use appropriate atomic or volatile operations in this case.

One important special case of RCU readers locklessly reading updated-in-place data is the case where readers cannot be allowed to operate on an object that has been removed from the RCU-protected data structure. These situations normally involve a per-object flag and lock. Updaters acquire the object's lock, remove the object from the structure, set the object's flag, and release the lock. Readers check the flag, and if the flag is set, pretend that they did not find the corresponding object. This class of use cases illustrate how segregating read-side and update-side fields of an RCU-protected object can be non-trivial.

RCU is sometimes used to protect combinations of data structures, sometimes involving nested RCU readers. In some cases, a given RCU read-side critical section might span a great many functions across several translation units.

It is not unusual for RCU use cases to include a search function that is invoked both by readers and updaters. This function will therefore sometimes be within an RCU read-side critical section and other times be protected by the update-side lock (or by some other update-side synchronization mechanism).

Sequence locking is sometimes used in conjunction with RCU, so that RCU protects traversal of the data structure and sequence locking detects updates profound enough to cause problems for the RCU traversals. The poster boy for this use case is in the Linux kernel's directory-entry cache. In this case, certain sequences of rename operations could fool readers into traversing pathnames that never actually existed. Using the sequence lock named rename_lock to protect such traversals allows readers to reject such bogus pathnames. More information on the directory-entry cache is available in Neil Brown's excellent Linux Weekly News series (part 1, part 2, part 3). One key take-away of this use case is that sequence locking is sometimes used to protect arbitrarily large data structures.

Sequence locking can be used to easily provide readers with a consistent view of a data structure, for a very wide range of definitions of "consistent". However, all of these techniques allow updaters to block readers. There are a number of other approaches to read-side consistency both within and across data structures, including the Issaquah Challenge.

One final use case is phased state change, where RCU readers check a state variable and take different actions depending on the current state. Updaters can change the state variable, but must wait for the completion of all readers that might have seen the old value before taking further action. The rcu-sync functionality (rcu_sync_init() and friends) implements a variant of RCU-protected phased state change. This use case is notable in that there are no linked data structures and nothing is ever freed.

This is by no means an exhaustive list of RCU use cases. RCU has been in the Linux kernel for almost 20 years, and during that time kernel developers have come up with any number of new ways of applying RCU to concurrency problems.

Rust RCU Options

The most straightforward approach is to provide Rust wrappers for the relevant C-language RCU API members. As noted earlier, wrappering the low-overhead read-side API members will introduce function-call overhead in non-LTO kernels, but this should not be a problem for performance-insensitive device drivers. However, fitting RCU into Rust's ownership model might not be as pretty as one might hope.

A more utopian approach would be to extend Rust's ownership model, one way or another, to understand RCU. One approach would be for Rust to gain "type safe" and "guaranteed existence" ownership modes, which would allow Rust to better diagnose abuses of the RCU API. For example, this might help with pointers to RCU-protected objects being erroneously leaked from RCU read-side critical sections.

So how might a pointer to an RCU-protected object be non-erroneously leaked from an RCU read-side critical section, you ask? One way is to acquire a reference via a reference count within that object before exiting that critical section. Another way is to acquire a lock within that object before exiting that critical section.

So what does the Linux kernel currently do to find this type of pointer leak? One approach is to enable KCSAN and build the kernel with CONFIG_RCU_STRICT_GRACE_PERIOD=y, and to run this on a small system. Nevertheless, this might be one area where Rust could improve Linux-kernel diagnostics, albeit not without effort.

For More Information

Additional information on RCU may be found here:

  1. Linux-kernel documentation, with special emphasis on Linux-kernel RCU's requirements.
  2. The following sections of perfbook discuss other aspects of RCU:

    1. Section 9.5 ("Read-Copy Update").
    2. Section 10.3 ("Read-Mostly Data Structures").
    3. Section 13.5 ("RCU Rescues").
    4. Section 15.4.2 ("RCU [memory-ordering properties ]").
  3. The RCU API, 2019 Edition.
  4. Userspace RCU.
  5. Folly-library RCU.
  6. Section of perfbook lists several other RCU implementations.
  7. Proposed Wording for Concurrent Data Structures: Read-Copy-Update (RCU) (C++ Working Paper).

RCU's semantics ordered by increasing formality:

  1. The "Fundamental Requirements" section of the aforementioned Linux-kernel RCU requirements. Extremely informal, but you have to start somewhere!
  2. RCU Semantics: A First Attempt. Also quite informal, but with a bit more math involved.
  3. Verifying Highly Concurrent Algorithms with Grace (extended version). Formal semantics expressed in separation logic.
  4. Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel. Executable formal semantics expressed in Cat language. This paper also includes a proof that the traditional "wait for all pre-existing readers" semantic is equivalent to these executable formal semantics within the context of the Linux-kernel memory model.
  5. Linux-kernel memory model (LKMM), see especially the Sync-rcu and Sync-srcu relations in file linux-kernel.cat.

Furthermore, significant portions of Linux-kernel RCU have been subjected to mechanical proofs of correctness:

  1. Lihao Liang applied the C Bounded Model Checker (CBMC) to Tree RCU (paper).
  2. Michalis Kokologiannakis applied Nidhugg to Tree RCU (paper).
  3. Lance Roy applied CBMC to Classic SRCU, represented by Linux-kernel commit 418b2977b343 ("rcutorture: Add CBMC-based formal verification for SRCU"). Sadly, the scripting has not aged well.

For more information, see Verification Challenge 6.

For those interested in the userspace RCU library, there have been both manual and mechanical proofs of correctness. A manual proof of correctness of userspace RCU in the context of a linked list has also been carried out.


[1]  At its core, RCU is nothing more nor less than a way of waiting for already-started things to finish. Within the Linux kernel, something to be waited on is delimited by rcu_read_lock() and rcu_read_unlock(). For their part, synchronize_rcu() and call_rcu() do the waiting, synchronously and asynchronously, respectively.

It turns out that you can do quite a lot with this simple capability, including approximating some reader-writer-lock use cases, emulating a number of reference-counting use cases, providing type-safe memory, and providing existence guarantees, this last sometimes being pressed into service as a poor-man's garbage collector. Section 9.3.3 ("RCU Usage") of perfbook provides more detail on these and other RCU use cases.


October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add RCU-protected phased state change use case.
October 18, 2021: Add read/update common code and Tasseroti citation.
LPC Logo

Can Rust Code Own Sequence Locks?

What Are Sequence Locks?

Sequence locks vaguely resemble reader-writer locks except that readers cannot block writers. If a writer runs concurrently with a reader, that reader is forced to retry its critical section. If a reader successfully completes its critical section (as in there were no concurrent writers), then its accesses will have been data-race free. Of course, an incessant stream of writers could starve all readers, and the Linux-kernel sequence-lock implementation provides extensions to deal with this in cases where incessant writing is a possibility. The usual approach is to instead arrange things so that writers are never incessant.

In the simple case where readers loop until they succeed, with no provisions for incessant writers, a reader looks like this:

do {
  seq = read_seqbegin(&test_seqlock);
  /* read-side access. */
} while (read_seqretry(&test_seqlock, seq));

A writer looks like this:

  /* Update */

The read-side access can run concurrently with the writer's update, which is not consistent with the data-race-free nature of non-unsafe Rust. How can this be handled?

Linux-Kernel Sequence-Lock Use Cases

First, please note that if Rust-language sequence-locking readers are to interoperate with C-language sequence-locking updaters (or vice versa), then the Rust-language and C-language implementations must be compatible. One easy way to achieve such compatibility is to provide Rust code wrappers around the C-language implementation, or perhaps better yet, around higher-level C-language functions making use of sequence locking. Of course, if Rust-language use of sequence locks only ever accesses data that is never accessed by C-language code, then Rust could provide its own implementation of sequence locking. As always, choose carefully!

Next, let's look at a few classes of sequence-locking use cases within the Linux kernel.

One common use case is the textbook case where the read-side loop picks up a few values, all within a single function. For example, sequence locking is sometimes used to fetch a 64-bit value on a 32-bit system (see the sock_read_timestamp() function in include/net/sock.h). The remainder of this section looks at some additional use cases that receive significant Linux-kernel use:

  1. Readers can gather data from multiple sources, including functions in other translation units that are invoked via function pointers. See for example the timer_cs_read() function in arch/sparc/kernel/time_32.c.
  2. Readers often perform significant computations. See for example the badblocks_check() function in block/badblocks.c.
  3. Readers can contain sequence-lock writers for that same sequence lock, as is done in the sdma_flush() function in drivers/infiniband/hw/hfi1/sdma.c. Although one could reasonably argue that this could be structured in other ways, this approach does avoid an added check, which could be valuable on fastpaths.
  4. The call to read_seqretry() is sometimes buried in a helper function, for example, in the sdma_progress() function in drivers/infiniband/hw/hfi1/sdma.h and the follow_dotdot_rcu() function in fs/namei.c.
  5. Readers sometimes use memcpy(), as might be expected by those suggesting that sequence-lock readers should clone/copy the data. See for example the neigh_hh_output() function in include/net/neighbour.h.
  6. Sequence-locking readers are often used in conjunction with RCU readers, as is described in the Can Rust Code Own RCU? posting.

Some of these use cases impose significant constraints on sequence-locking implementations. If the constraints rule out all reasonable Rust implementations, one approach would be to provide different Rust implementations for different use cases. However, it would be far better to avoid further API explosion!

Rust Sequence-Lock Options

One approach would be for all sequence-lock critical sections to be marked unsafe, but this removes at least some of the rationale for switching from C to Rust. In addition, some believe that this approach can pose severe Rust-syntax challenges in some use cases.

Another approach is to require all sequence-lock critical sections to use marked accesses. This avoids the data races and presumably also the need for unsafe, but it prevents the kernel concurrency sanitizer (KCSAN) from locating data races involving sequence-lock write-side critical sections. Or might, if it were not for the use of kcsan_flat_atomic_begin() and kcsan_flat_atomic_end() in sequence-locking read-side primitives and the use of kcsan_nestable_atomic_begin() and kcsan_nestable_atomic_end() in sequence-locking write-side primitives. So perhaps Rust could make use of similar hooks.

Yet another approach is to place the data referenced by readers into a separate structure and to switch back and forth between a pair of instances of such structures. This has been attempted in the Linux kernel with unsatisfactory results. In fact, in those cases where it is feasible to use multiple instances of a separate structure, Linux-kernel RCU is usually better choice.

One more approach would be creating something like a "tentatively readable" Rust ownership class that could directly handle sequence-locking read-side critical sections. For example, if a quantity computed from a read within a failed critical section were used subsequently, Rust could emit a warning. Hopefully without much in the way of false positives. (We can dream, can't we?)

One might instead search the Linux-kernel source tree for uses of sequence locking and to note that it is used only by a very few device drivers:

     16 drivers/dma-buf/
      1 drivers/firmware/efi/
      5 drivers/gpu/drm/
      2 drivers/gpu/drm/amd/amdgpu/
      2 drivers/gpu/drm/i915/gem/
     15 drivers/gpu/drm/i915/gt/
      5 drivers/hwmon/
     72 drivers/infiniband/hw/hfi1/
     11 drivers/md/
     15 drivers/net/ethernet/mellanox/mlx4/
     19 drivers/net/ethernet/mellanox/mlx5/core/lib/

These device drivers (or at least those portions of them that make use of sequence locking) could then be relegated to C code.

However, the use of sequence locking has been increasing, and it is likely that it will continue to increase, including within device drivers. Longer term, Rust's ownership model should therefore be extended to cover sequence locking in a natural and safe manner.

For More Information

For more on sequence locking, see the seqlock.rst documentation. Sequence locking is also covered in Sections 9.4 ("Sequence Locking") and 13.4 ("Sequence-Locking Specials") of perfbook.


October 6, 2021: Updated to add sectioning, Linux-kernel use cases, and updated KCSAN commentary.
October 12, 2021: Self-review. Note that some of the comments are specific to earlier versions of this blog post.
October 13, 2021: Add a note discussing possible interoperability between Rust-language and C-language sequence locking.

Rusting the Linux Kernel: Compiler Writers Hate Dependencies (OOTA)

Out-of-thin-air (OOTA) values are a troubling theoretical problem, but thus far do not actually occur in practice. In the words of the Bard: "It is a tale told by an idiot, full of sound and fury, signifying nothing."

But what is life without a little sound and fury? Besides, "mere theory" can loom large for those creating the tools required to analyze concurrent software. This post therefore gives a brief introduction to OOTA, and then provides some simple practical solutions, along with some difficult ones.

Here is the canonical OOTA example, involving two threads each having but one statement:

T1: x.store(y.load(mo_relaxed), mo_relaxed);
T2: y.store(x.load(mo_relaxed), mo_relaxed);

Believe it or not, according to the C/C++ memory model, even if both x and y are initially 0, after both threads complete it might be that x==y==42!

LKMM avoids OOTA by respecting dependencies, at least when headed by marked accesses. Plus the dependent access must either be marked or be free of data races. For example:


If x and y are both initially zero, then they are both guaranteed to stay that way!

There has been great quantities of ink spilled on the OOTA problem, but these three C++ working papers give a reasonable overview and cite a few of the more prominent papers:

  1. P0422R0: Out-of-Thin-Air Execution is Vacuous. This presents a fixed-point analysis that to the best of my knowledge is the first scheme capable of differentiating OOTA scenarios from straightforward reordering scenarios. Unfortunately, this analysis cannot reasonably be applied to automated tooling. This paper also includes a number of references to other work, which unfortunately either impose unnecessary overhead on weakly ordered systems on the one hand or are well outside of what compiler developers are willing to countenance on the other.
  2. P1916R0: There might not be an elegant OOTA fix. This diabolically clever paper shows how back-propagation of undefined behavior can further complicate the OOTA question.
  3. P2055R0: A Relaxed Guide to memory_order_relaxed. This paper presents memory_order_relaxed use cases that are believed to avoid the OOTA problem.

As with address, data, and control dependencies, the trivial solution is for Rust to simply promote all READ_ONCE() operations to smp_load_acquire(), but presumably while leaving the C-language READ_ONCE() untouched. This approach increases overhead, but keeps the people building analysis tools happy, or at least happier. Unfortunately, this trivial solution incurs overhead on weakly ordered systems, overhead that is justified only in theory, never in practice.

But if analysis tools are not an issue, then Rust could simply use volatile loads (perhaps even directly using an appropriately Rust-wrapped instance of the Linux kernel's READ_ONCE() primitive), just as the Linux kernel currently does (but keeping architecture-specific requirements firmly in mind). The problem is after all strictly theoretical.


October 12, 2021: Self-review. Note that the comments are specific to earlier versions of this blog post.
LPC Logo

Rusting the Linux Kernel: Compiler Writers Hate Dependencies (Address/Data)

An address dependency involves a load whose return value directly or indirectly determines the address of a later load or store, which results in the earlier load being ordered before the later load or store. A data dependency involves a load whose return value directly or indirectly determines the value stored by a later store, which results in the load being ordered before the store. These are used heavily by RCU. Although they are not quite as fragile as control dependencies, compilers still do not know about them. Therefore, care is still required, as can be seen in the rcu_dereference.rst Linux-kernel coding guidelines. As with control dependencies, address and data dependencies enjoy very low overheads, but unlike control dependencies, they are used heavily in the Linux kernel via rcu_dereference() and friends.

  1. As with control dependencies, the trivial solution is to promote READ_ONCE() to smp_load_acquire(). Unlike with control dependencies, there are only a few such READ_ONCE() instances, and almost all of them are conveniently located in definitions of the rcu_dereference() family of functions. Because the rest of the kernel might not be happy with the increase in overhead due to such a promotion, it would likely be necessary to provide Rust-specific implementations of rcu_dereference() and friends.
  2. Again as with control dependencies, an even more trivial solution is to classify code containing address and data dependencies as core Linux-kernel code that is outside of Rust's scope. However, given that there are some thousands of instances of rcu_dereference() scattered across the Linux kernel, this solution might be a bit more constraining than many Rust advocates might hope for.
  3. Provide Rust wrappers for the rcu_dereference() family of primitives. This does incur additional function-call overhead, but on the other hand, if initial use of Rust is confined to performance-insensitive device drivers, this added overhead is unlikely to be problem.
  4. But if wrapper overhead nevertheless proves problematic, provide higher-level C-language functions that encapsulate the required address and data dependencies, and that do enough work that the overhead of the wrappering for Rust-language use is insignificant.
  5. And yet again as with control dependencies, the best approach from the Linux-kernel-in-Rust developer's viewpoint is for Rust to enforce the address/data-dependency code-style restrictions documented in the aforementioned rcu_dereference.rst. There is some reason to hope that this enforcement would be significantly easier than for control dependencies.
  6. Wait for compiler backends to learn about address and data dependencies. This might take some time, but there is ongoing work along these lines that is described below.
One might hope that C/C++'s memory_order_consume would correctly handle address and data dependencies, and in fact it does. Unfortunately, in all known compilers, it does so by promoting memory_order_consume to memory_order_acquire, which adds overhead just as surely as does smp_load_acquire(). There has been considerable work done over a period of some years towards remedying this situation, including these working papers:

  1. P0371R1: Temporarily discourage memory_order_consume (for some definition of "temporarily").
  2. P0098R1: Towards Implementation and Use of memory_order_consume, which reviews a number of potential remedies.
  3. P0190R4: Proposal for New memory_order_consume Definition, which selects a solution involving marking pointers carrying dependencies. Note that many Linux-kernel developers would likely demand a compiler command-line argument that caused the compiler to act as if all pointers had been so marked. Akshat Garg prototyped marked dependency-carrying pointers in gcc as part of a Google Summer of Code project.
  4. P0750R1: Consume, which proposes carrying dependencies in an extra bit associated with the pointer. This approach is much more attractive to some compiler developers, but many committee members did not love the resulting doubling of pointer sizes.
  5. P0735R1: Interaction of memory_order_consume with release sequences, which addresses an obscure C/C++ memory-model corner case
There appears to be a recent uptick in interest in a solution to this problem, so there is some hope for progress. However, much more work is needed.

More information on address and data dependencies may be found in Section 15.3.2 ("Address- and Data-Dependency Difficulties") of perfbook.


October 12, 2021: Self-review.