Paul E. McKenney (paulmck) wrote,
Paul E. McKenney

How Much of the Kernel Can Rust Own?

Rust concurrency makes heavy use of ownership and borrowing. The purpose of this post is not to give an exposition of Rust's capabilities and limitations in this area, but rather to give a series of examples of ownership in the Linux kernel.

The first example involves Linux-kernel per-CPU variables. In some cases, such variables are protected by per-CPU locks, for example, a number of fields in the per-CPU rcu_data structure are used by the kernel threads that manage grace periods for offloaded callbacks, and these fields are protected by the ->nocb_gp_lock field in the same instance of that same structure. In other cases, access to a given per-CPU variable is permitted only by the corresponding CPU, and even then only if that CPU has disabled preemption. For example, the per-CPU rcu_data structure's ->ticks_this_gp field may be updated only from the corresponding CPU, and only when preemption is disabled. In the particular case, preemption is disabled as a side-effect of having disabled interrupts.

The second example builds on the first. In kernels built with CONFIG_RCU_NOCB_CPU=n, the per-CPU rcu_data structure's ->cblist field may be updated from the corresponding CPU, and only when preemption is disabled. However, it is also allowed from some other CPU when the corresponding CPU has been taken offline, but only when that other CPU that is orchestrating the offlining of the corresponding CPU.

(What about kernels built with CONFIG_RCU_NOCB_CPU=y? They must also acquire a ->nocb_lock that is also contained within the per-CPU rcu_data structure.)

The third example moves to the non-per-CPU rcu_node structures, which are arranged into a combining tree that processes events from an arbitrarily large number of CPUs while maintaining bounded lock contention. Each rcu_node> structure has a ->qsmask bitmask that tracks which of its children need to report a quiescent state for the current grace period, and a ->gp_tasks pointer that, when non-NULL, references a list of tasks that blocked while in an RCU read-side critical section that blocks the current grace period. The children of each leaf rcu_node structure are the rcu_data structures feeding into that rcu_node structure. Finally, there is a singleton rcu_state structure that contains a ->gp_seq field that numbers the current grace period and also indicates whether or not it is in progress.

We now have enough information to state the ownership rule for the rcu_state structure's ->gp_seq field. This field may be updated only if all of the following hold:

  1. The current CPU holds the root rcu_node structure's ->lock.
  2. All rcu_node structures' ->qsmask fields are zero.
  3. All rcu_node structures' ->gp_tasks pointers are NULL.

This might seem excessively ornate, but it is the natural consequence of RCU's semantics and design:

  1. The next grace period is not permitted to start until after the previous one has ended. This is a design choice that allows RCU to function well in the face of update-side overload from large numbers of CPUs.
  2. A grace period is not allowed to end until all CPUs have been observed in a quiescent state, that is, until all rcu_node structures' ->qsmask fields have become zero.
  3. A grace period is not allowed to end until all tasks that were preempted while executing within a critical section blocking that grace period have resumed and exited their critical sections, that is, until all rcu_node structures' ->gp_tasks pointers have become NULL.
  4. If multiple CPUs would like to start a grace period at the same time, there has to be something that works out which CPU is actually going to do the starting, and the last part of that something is the root rcu_node structure's ->lock.

Trust me, there are far more complex ownership models in the Linux kernel!

The fourth example involves per-CPU data that is protected by a reader-writer lock. A CPU is permitted to both read and update its own data only if it read-holds the lock. A CPU is permitted to read other CPUs' data only if it write-holds the lock. That is right, CPUs are allowed to write when read-holding the lock and and are allowed to read while write-holding the lock. Perhaps reader-writer locks should instead have been called exclusive-shared locks, but it is some decades too late now!

The fifth and final example involves multiple locks, for example, when some readers must traverse the data structure from an interrupt handler and others must sleep while traversing that same structure. One way to implement this is to have one interrupt-disabled spinlock (for readers in interrupt handlers) and a mutex (for readers that must sleep during the traversal). Updaters must then hold both locks.

So how on earth does the current C-language Linux kernel code keep all this straight???

One indispensable tool is the assertion, which in the Linux kernel includes WARN(), BUG(), and friends. For example, RCU uses these to verify the values of the aforementioned ->qsmask and ->gp_tasks fields during between-grace-periods traversals of the rcu_node combining tree.

Another indispensable tool is lockdep, which, in addition to checking for deadlocks, allows code to verify that conditions are right. A few of the many available lockdep assertions are shown below:

  1. lockdep_assert_held(): Complains if the specified lock is not held by the current CPU/task.
  2. lockdep_assert_held_write(): Complains if the specified reader-writer lock is not write-held by the current CPU/task.
  3. lockdep_assert_held_read():: Complains if the specified reader-writer lock is not read-held by the current CPU/task.
  4. lockdep_assert_none_held_once(): Complains if the current CPU/task holds any locks.
  5. lockdep_assert_irqs_disabled(): Complains if the current CPU has interrupts enabled.
  6. rcu_read_lock_held(): Complains if the current CPU/task is not within an RCU read-side critical section.

Of course, none of these assertions are going to do anything for you unless you make use of them. On the other hand, some of the more complex ownership criteria are going to be quite difficult for compilers or other tooling to intuit without significant help from the developer.

A more entertaining ownership scenario is described in Section 5.4.6 ("Applying Exact Limit Counters") of perfbook. Chapter 8 ("Data Ownership") describes several other ownership scenarios.


October 12, 2021: Self-review.
Tags: linux, lkmm, rust

  • Post a new comment


    default userpic

    Your reply will be screened

    Your IP address will be recorded 

    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.