rcu_read_lock(), presumably because of concerns about contention or overhead. This blog entry is intended to help address any lingering concerns about rcu_read_lock() contention and overhead.To be fair, at first glance, rcu_read_lock()'s source code does look a bit scary and slow:
static inline void rcu_read_lock(void)
{
__rcu_read_lock();
__acquire(RCU);
rcu_lock_acquire(&rcu_lock_map);
}
However, a closer look reveals that both __acquire() and rcu_lock_acquire() compile to nothing in production kernels (CONFIG_PROVE_LOCKING=n). This leaves __rcu_read_lock(), which is compiled differently for different settings of CONFIG_PREEMPT.
Let's start with CONFIG_PREEMPT=n. We have:
static inline void __rcu_read_lock(void)
{
preempt_disable();
}
And, again for CONFIG_PREEMPT=n:
#define preempt_disable() do { } while (0)
The overall result for rcu_read_lock() when CONFIG_PREEMPT=n is therefore as follows:
static inline void rcu_read_lock(void)
{
}
Similar analysis of rcu_read_unlock()</code> reveals that it is also an empty static inline function for CONFIG_PREEMPT=n. It is to be hoped that this is sufficiently lightweight for most practical purposes. If you find a case where it is too heavyweight, I would be very interested in hearing about it!
That leaves CONFIG_PREEMPT=y, which actually has executable code in its definition of __rcu_read_lock() as follows:
void __rcu_read_lock(void)
{
current->rcu_read_lock_nesting++;
barrier();
}
The first statement is a simple non-atomic increment of a simple int that is located in the task's own task_struct. The barrier in the second statement expands as follows:
#define barrier() __asm__ __volatile__("": : :"memory")
This is an empty asm that generates no code, but that does disable code-motion optimizations that might otherwise move memory references across the barrier() statement. So, the executable code for rcu_read_lock() for CONFIG_PREEMPT=y is as follows:
void rcu_read_lock(void)
{
current->rcu_read_lock_nesting++;
}
A similar analysis of rcu_read_unlock() for CONFIG_PREEMPT=y yields the following:
void __rcu_read_unlock(void)
{
struct task_struct *t = current;
if (t->rcu_read_lock_nesting != 1)
--t->rcu_read_lock_nesting;
else {
t->rcu_read_lock_nesting = INT_MIN;
if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
rcu_read_unlock_special(t);
t->rcu_read_lock_nesting = 0;
}
}
The common case of a single level of rcu_read_lock() nesting executes the else clause of the first if statement, and only executes the then clause of the second if statement when the RCU read-side critical section was either preempted or ran for several milliseconds.
So in the common case, rcu_read_unlock() for CONFIG_PREEMPT=y executes two tests of task-local variables and two assignments to task-local variables.
This should be sufficiently lightweight for most purposes.
Of course, RCU is intended for read-mostly situations, and RCU updates can have significant overhead, incurring significant latency, CPU overhead, and/or cache misses. That said, there are some special cases where RCU updates can be extremely fast, as shown in Figures 12 and 13 of the supplementary materials to User-Level Implementations of Read-Copy Update. (No, the supplementary materials are not behind a paywall: Click on the “Supplemental Material(PDF)” link.)
Comment Form