Log in

No account? Create an account

Previous Entry | Next Entry

Lies that firmware tells RCU

One of the complaints that real-time people have against some firmware is that it lies about its age, attempting to cover up cycle-stealing via SMIs by reprogramming the TSC. Some firmware goes farther and lies about the number of CPUs on the system, apparently on the grounds that more is better, regardless of how many of those alleged CPUs actually exist.

RCU used to naively believe the firmware, and would therefore create one set of rcuo kthreads per advertised CPU. On some systems, this resulted in hundreds of such kthreads on systems with only a few tens of CPUs. But RCU can choose to create the rcuo kthreads only for CPUs that actually come online. Problem solved!

Mostly solved, that is.

Yanko Kaneti, Jay Vosburgh, Meelis Roos, and Eric B Munson discovered the “mostly” part when they encountered hangs in _rcu_barrier(). So what is rcu_barrier()?

The rcu_barrier() primitive waits for all pre-existing callbacks to be invoked. This is useful when you want to unload a module that uses call_rcu(), as described in this LWN article. It is important to note that rcu_barrier() does not necessarily wait for a full RCU grace period. In fact, if there are currently no RCU callbacks queued, rcu_barrier() is within its rights to simply return immediately. Otherwise, rcu_barrier() enqueues a callback on each CPU that already has callbacks, and waits for all these callbacks to be invoked. Because RCU is careful to invoke all callbacks posted to a given CPU in order, this guarantees that by the time rcu_barrier() returns, all pre-existing RCU callbacks will have already been invoked, as required.

However, it is possible to offload invocation of a given CPU's RCU callbacks to rcuo kthreads, as described in this LWN article. This kthread might well be executing on some other CPU, which means that the callbacks are moved from one list to another as they pass through their lifecycles. This makes it difficult for rcu_barrier() to reliably determine whether or not there are RCU callbacks pending for an offloaded CPU. So rcu_barrier() simply unconditionally enqueues an RCU callback for each offloaded CPU, regardless of that CPU's state.

In fact, rcu_barrier() even enqueues a callback for offloaded CPUs that are offline. The reason for this odd-seeming design decision is that a given CPU might enqueue a huge number of callbacks, then go offline. It might take the corresponding rcuo kthread significant time to work its way through this backlog of callbacks, which means that rcu_barrier() cannot safely assume that an offloaded CPU is callback-free just because it happens to be offline. So, to come full circle, rcu_barrier() enqueues an RCU callback for all offloaded CPUs, regardless of their state.

This approach works quite well in practice.

At least, it works well on systems where the firmware provides the Linux kernel with an accurate count of the number of CPUs. However, it breaks horribly when the firmware over-reports the number of CPUs, because then the system will then have CPUs that never ever come online. If these CPUs have been designated as offloaded CPUs, this means that their rcuo kthreads will never ever be spawned, which in turn means that any callbacks enqueued for these mythical CPUs will never ever be invoked. And because rcu_barrier() waits for all the callbacks that it posts to be invoked, rcu_barrier() ends up waiting forever, which can of course result in hangs.

The solution is to make rcu_barrier() refrain from posting callbacks for offloaded CPUs that have never been online, in other words, for CPUs that do not yet have an rcuo kthread.

With some luck, this patch will clear things up. And I did take the precaution of reviewing all of RCU's uses of for_each_possible_cpu(), so here is hoping! ;-)


( 2 comments — Leave a comment )
Nov. 3rd, 2014 09:46 am (UTC)
Which HW does this?
Would you care to be more specific about the HW which lies about the number of CPUs? Thanks!
Nov. 4th, 2014 12:38 am (UTC)
Re: Which HW does this?
Sorry, but I will decline to give a direct answer.

However, please do thorough testing before putting a given piece of hardware into service running a real-time application.
( 2 comments — Leave a comment )