RCU used to naively believe the firmware, and would therefore create one set of rcuo kthreads per advertised CPU. On some systems, this resulted in hundreds of such kthreads despite only a few tens of CPUs actually being present. But RCU can instead choose to create the rcuo kthreads only for CPUs that actually come online. Problem solved!
Mostly solved, that is.
Yanko Kaneti, Jay Vosburgh, Meelis Roos, and Eric B Munson discovered the “mostly” part when they encountered hangs in rcu_barrier(). So what is rcu_barrier()? The rcu_barrier() primitive waits for all pre-existing callbacks to be invoked. This is useful when you want to unload a module that uses
call_rcu(), as described in this LWN article. It is important to note that
rcu_barrier() does not necessarily wait for a full RCU grace period. In fact, if there are currently no RCU callbacks queued, rcu_barrier() is within its rights to simply return immediately. Otherwise, rcu_barrier() enqueues a callback on each CPU that already has callbacks, and waits for all these callbacks to be invoked. Because RCU is careful to invoke all callbacks posted to a given CPU in order, this guarantees that by the time rcu_barrier() returns, all pre-existing RCU callbacks will have already been invoked, as required.
However, it is possible to offload invocation of a given CPU's RCU callbacks to an rcuo kthread, as described in this LWN article. This kthread might well be executing on some other CPU, which means that the callbacks are moved from one list to another as they pass through their lifecycles. This makes it difficult for rcu_barrier() to reliably determine whether or not there are RCU callbacks pending for an offloaded CPU. So rcu_barrier() simply unconditionally enqueues an RCU callback for each offloaded CPU, regardless of that CPU's state.
rcu_barrier() even enqueues a callback for offloaded CPUs that are offline. The reason for this odd-seeming design decision is that a given CPU might enqueue a huge number of callbacks, then go offline. It might take the corresponding rcuo kthread significant time to work its way through this backlog of callbacks, which means that rcu_barrier() cannot safely assume that an offloaded CPU is callback-free just because it happens to be offline. So, to come full circle, rcu_barrier() enqueues an RCU callback for all offloaded CPUs, regardless of their state.
This approach works quite well in practice.
At least, it works well on systems where the firmware provides the Linux kernel with an accurate count of the number of CPUs. However, it breaks horribly when the firmware over-reports the number of CPUs, because the system will then have CPUs that never ever come online. If these CPUs have been designated as offloaded CPUs, this means that their rcuo kthreads will never ever be spawned, which in turn means that any callbacks enqueued for these mythical CPUs will never ever be invoked. And because rcu_barrier() waits for all the callbacks that it posts to be invoked, rcu_barrier() ends up waiting forever, which can of course result in hangs.
The solution is to make rcu_barrier() refrain from posting callbacks for offloaded CPUs that have never been online, in other words, for CPUs that do not yet have an rcuo kthread.
With some luck, this patch will clear things up. And I did take the precaution of reviewing all of RCU's uses of
for_each_possible_cpu(), so here is hoping! ;-)