What is RCU Priority Boosting?
If you are running a kernel built with CONFIG_PREEMPT=y, RCU read-side critical sections can be preempted by higher-priority tasks, regardless of whether these tasks are executing kernel or userspace code. If there are enough higher-priority tasks, and especially if someone has foolishly disabled realtime throttling, these RCU read-side critical sections might remain preempted for a good long time. And as long as they remain preempted, RCU grace periods cannot complete. And if RCU grace periods cannot complete, your system has an OOM in its future.
This is where RCU priority boosting comes in, at least in kernels built with CONFIG_RCU_BOOST=y. If a given grace period is blocked only by preempted RCU read-side critical sections, and that grace period is at least 500 milliseconds old (this timeout can be adjusted using the CONFIG_RCU_BOOST_DELAY Kconfig option), then RCU starts boosting the priority of these RCU readers to the level specified by the rcutree.kthread_prio kernel boot parameter, which defaults to FIFO priority 2. RCU does this using one rcub kthread per rcu_node structure. Given a default Kconfig, this works out to one rcub kthread per 16 CPUs.
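The trigger condition and the kthread count described above can be sketched in userspace C. This is only an illustration under stated assumptions: the helper names nr_rcub_kthreads() and should_boost() and both macros are hypothetical and do not appear in the kernel, and the constants simply restate the defaults quoted in the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed defaults, restated from the text (not read from any kernel header). */
#define LEAF_FANOUT    16   /* CPUs covered by one rcub kthread */
#define BOOST_DELAY_MS 500  /* minimum grace-period age before boosting */

/* Hypothetical helper: how many rcub kthreads a system of nr_cpus CPUs gets. */
static unsigned int nr_rcub_kthreads(unsigned int nr_cpus)
{
	assert(nr_cpus > 0);  /* a running system has at least one CPU */
	return (nr_cpus + LEAF_FANOUT - 1) / LEAF_FANOUT;  /* round up */
}

/*
 * Hypothetical predicate: boosting starts only when the grace period is
 * blocked solely by preempted readers and is at least BOOST_DELAY_MS old.
 */
static bool should_boost(bool blocked_only_by_preempted_readers,
			 unsigned int gp_age_ms)
{
	return blocked_only_by_preempted_readers && gp_age_ms >= BOOST_DELAY_MS;
}
```

In this sketch, nr_rcub_kthreads(16) is 1 and nr_rcub_kthreads(17) is 2, and should_boost(true, 499) is false because the grace period is not yet old enough.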
Why did rcutorture Fail to Test RCU Priority Boosting?
As with many things in life, this happened one step at a time:
- A bug I was chasing a few years back reproduced much more quickly if I enabled CPU hotplug on the TREE03 rcutorture scenario.
- And in addition, x86 no longer supports configurations where CPUs cannot be hotplugged (mumble mumble security mumble mumble), which means that the rcutorture scripting is always going to test CPU hotplug.
- TREE03 was the one scenario that tested RCU priority boosting.
- But RCU priority-boost testing assumes that CPU hotplug was disabled. So much so that it would disable itself if CPU-hotplug testing was enabled. Which it now always was.
- So RCU priority boosting has gone completely untested for quite a few years.
- Quite a few more years back, I learned that firmware sometimes lies about the number of CPUs. I learned this from bug reports noting that RCU was sometimes creating way more kthreads than made any sense on small systems.
- So the spawning of kthreads that are per-CPU or per-group-of-CPUs is done at CPU-online time. Which ensures that systems get the right number of RCU kthreads even in the presence of lying firmware. In the case of the RCU boost kthreads, the code verifies that the rcu_node structure in question has at least one online CPU before spawning the corresponding kthread.
- Except that it is now quite possible for the incoming CPU to not be fully online at the time that rcutree_online_cpu() executes, in part due to RCU being much more careful about CPU hotplug. This means that the RCU boost kthread will not be spawned until the second CPU corresponding to a given rcu_node structure comes online.
- Which means that rcu_node structures that have only one CPU never have an RCU boost kthread, and in turn that RCU readers preempted on such CPUs will never be boosted. This problematic situation is unusual, requiring 17, 33, 49, 65, ... CPUs on the system, assuming default RCU kconfig options. But it can be made to happen, especially when using the rcutorture scripting. (--kconfig "CONFIG_NR_CPUS=17" ...)
The fix is to refactor the creation of rcub kthreads so that a CPU coming online is assumed to eventually make it online, which means that one online CPU suffices to spawn an rcub kthread.
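The sequence of events above, and the effect of the fix, can be modeled with a small userspace C sketch. It is a toy model only: leaves_without_rcub() and LEAF_FANOUT are hypothetical names, the model assumes the default of 16 CPUs per rcu_node structure, and it assumes CPUs come online one at a time in numerical order. The flag count_incoming_cpu selects between the old behavior (the incoming CPU is not yet seen as online) and the fixed behavior (the incoming CPU is assumed to make it online).

```c
#include <assert.h>
#include <stdbool.h>

#define LEAF_FANOUT 16  /* assumed default: CPUs per rcu_node structure */

/*
 * Toy model: return how many rcu_node structures end up with no rcub
 * kthread after CPUs 0..nr_cpus-1 have each come online in turn.
 */
static unsigned int leaves_without_rcub(unsigned int nr_cpus,
					bool count_incoming_cpu)
{
	unsigned int nr_leaves;
	unsigned int missing = 0;

	assert(nr_cpus > 0);
	nr_leaves = (nr_cpus + LEAF_FANOUT - 1) / LEAF_FANOUT;

	for (unsigned int leaf = 0; leaf < nr_leaves; leaf++) {
		unsigned int first = leaf * LEAF_FANOUT;
		unsigned int last = first + LEAF_FANOUT;  /* exclusive bound */
		unsigned int online = 0;  /* CPUs already fully online */
		bool spawned = false;

		if (last > nr_cpus)
			last = nr_cpus;

		for (unsigned int cpu = first; cpu < last; cpu++) {
			/* What the online-time check sees for this leaf. */
			unsigned int visible = online + (count_incoming_cpu ? 1 : 0);

			if (visible >= 1)
				spawned = true;  /* at least one online CPU: spawn */
			online++;  /* the incoming CPU finishes coming online */
		}
		if (!spawned)
			missing++;
	}
	return missing;
}
```

In this model, leaves_without_rcub(17, false) is 1 (the leaf holding only CPU 16 never gets a kthread), while leaves_without_rcub(17, true) is 0.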
Additional Testing Challenges
The rcu_torture_boost() function required additional rework because CPUs can fail to pass through a quiescent state for some seconds from time to time, and there is nothing that RCU priority boosting can do about this. There are now checks for this condition, and rcutorture refrains from reporting an error in such cases.
Worse yet, this testing proceeds by disabling the aforementioned realtime throttling, then running a FIFO realtime priority 1 kthread on each CPU. This sort of abuse is a great way to break your kernel, yet nothing less abusive will reliably and efficiently test RCU priority boosting. It just so happens that many of RCU's kthreads will do just fine because in this configuration they run at FIFO realtime priority 2. Unfortunately, timers often run in a ksoftirqd kthread, which runs at a non-realtime priority. This means that although RCU's grace-period kthread runs just fine, if it tries to sleep for (say) three milliseconds, it won't awaken until RCU priority boosting testing has completed, which is a great way to force this testing to fail.
Therefore, rcutorture now takes the rude and crude approach of checking to see if it is built into the kernel (as opposed to running as a kernel module), and if so, it forces all of the ksoftirqd kthreads to run at FIFO realtime priority 2. (Needless to say, don't try this at home.)
The usual way to asynchronously determine when a grace period has ended is to post an RCU callback using call_rcu(). Except that in realtime configurations, RCU callbacks are often offloaded to rcuo kthreads. It is the system administrator's responsibility to decide where to run these, and, failing that, the Linux-kernel scheduler's responsibility. Neither of which should be expected to do the right thing in the presence of a full set of CPU-bound unthrottled real-time-priority boost-test kthreads.
Fortunately, RCU now has polling APIs for managing grace periods. The start_poll_synchronize_rcu() function starts a new grace period if needed and returns a “cookie” that can be passed to poll_state_synchronize_rcu(), which will return true if the needed grace period has completed. These functions do not rely on RCU callbacks, and thus will function correctly even if the rcuo kthreads are inauspiciously scheduled, or even if these kthreads are not scheduled at all. Thus, rcutorture's test of RCU priority boosting now uses these two functions.
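The cookie pattern can be illustrated with a toy userspace model in C. This is emphatically not the kernel's implementation: every toy_ name is hypothetical, the single counter stands in for RCU's grace-period bookkeeping, and counter wraparound is ignored. The point is only that the poller learns about grace-period completion by comparing a saved cookie against a counter, with no callback involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy grace-period counter: odd means a grace period is in progress. */
static unsigned long toy_gp_seq;

/* Toy stand-in for RCU starting a grace period (no-op if one is running). */
static void toy_gp_start(void)
{
	if (!(toy_gp_seq & 1))
		toy_gp_seq++;
}

/* Toy stand-in for RCU ending the grace period that is in progress. */
static void toy_gp_end(void)
{
	assert(toy_gp_seq & 1);  /* only a running grace period can end */
	toy_gp_seq++;
}

/*
 * Toy analog of start_poll_synchronize_rcu(): compute the counter value
 * at which a full grace period will have elapsed, start a grace period
 * if none is running, and return that value as the cookie. A grace period
 * already in progress does not count, so the cookie then waits for the
 * next one.
 */
static unsigned long toy_start_poll(void)
{
	unsigned long cookie = (toy_gp_seq + 3) & ~1UL;

	toy_gp_start();
	return cookie;
}

/* Toy analog of poll_state_synchronize_rcu(): has the needed period ended? */
static bool toy_poll_state(unsigned long cookie)
{
	return toy_gp_seq >= cookie;  /* wraparound deliberately ignored */
}
```

A caller in this model saves the value returned by toy_start_poll() and later checks toy_poll_state() whenever convenient, which mirrors how the text describes rcutorture using the two real functions.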
With all of this in place, RCU priority boosting lives again!
But untested software does not work, and that includes the tests themselves. Thus, a new BUSTED-BOOST scenario tests RCU priority boosting on a kernel built with CONFIG_RCU_BOOST=n, which does not do RCU priority boosting. This scenario fails within a few tens of seconds, so the test being tested might actually be working!