
What is the overhead of rcu_read_lock()?

A recent post to LKML stated that the patch in question did not plan to hold any global locks, including rcu_read_lock(), presumably because of concerns about contention or overhead. This blog entry is intended to help address any lingering concerns about rcu_read_lock() contention and overhead.

To be fair, at first glance, rcu_read_lock()'s source code does look a bit scary and slow:

static inline void rcu_read_lock(void)
{
        __rcu_read_lock();
        __acquire(RCU);
        rcu_lock_acquire(&rcu_lock_map);
}

However, a closer look reveals that both __acquire() and rcu_lock_acquire() compile to nothing in production kernels (CONFIG_PROVE_LOCKING=n). This leaves __rcu_read_lock(), which is compiled differently for different settings of CONFIG_PREEMPT.

Let's start with CONFIG_PREEMPT=n. We have:

static inline void __rcu_read_lock(void)
{
        preempt_disable();
}

And, again for CONFIG_PREEMPT=n:

#define preempt_disable()               do { } while (0)

The overall result for rcu_read_lock() when CONFIG_PREEMPT=n is therefore as follows:

static inline void rcu_read_lock(void)
{
}

Similar analysis of rcu_read_unlock() reveals that it is also an empty static inline function for CONFIG_PREEMPT=n. It is to be hoped that this is sufficiently lightweight for most practical purposes. If you find a case where it is too heavyweight, I would be very interested in hearing about it!

That leaves CONFIG_PREEMPT=y, which actually has executable code in its definition of __rcu_read_lock() as follows:

void __rcu_read_lock(void)
{
        current->rcu_read_lock_nesting++;
        barrier();
}

The first statement is a simple non-atomic increment of a simple int that is located in the task's own task_struct. The barrier in the second statement expands as follows:

#define barrier() __asm__ __volatile__("": : :"memory")

This is an empty asm that generates no code, but that does disable code-motion optimizations that might otherwise move memory references across the barrier() statement. So, the executable code for rcu_read_lock() for CONFIG_PREEMPT=y is as follows:

void rcu_read_lock(void)
{
        current->rcu_read_lock_nesting++;
        barrier();
}

A similar analysis of rcu_read_unlock() for CONFIG_PREEMPT=y yields the following:

void __rcu_read_unlock(void)
{
        struct task_struct *t = current;

        if (t->rcu_read_lock_nesting != 1)
                --t->rcu_read_lock_nesting;
        else {
                barrier();
                t->rcu_read_lock_nesting = INT_MIN;
                barrier();
                if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
                        rcu_read_unlock_special(t);
                barrier();
                t->rcu_read_lock_nesting = 0;
        }
}

The common case of a single level of rcu_read_lock() nesting executes the else clause of the first if statement, and only executes the then clause of the second if statement when the RCU read-side critical section was either preempted or ran for several milliseconds.

So in the common case, rcu_read_unlock() for CONFIG_PREEMPT=y executes two tests of task-local variables and two assignments to task-local variables.

This should be sufficiently lightweight for most purposes.

Of course, RCU is intended for read-mostly situations, and RCU updates can have significant overhead, incurring latency, CPU time, and/or cache misses. That said, there are some special cases where RCU updates can be extremely fast, as shown in Figures 12 and 13 of the supplementary materials to User-Level Implementations of Read-Copy Update. (No, the supplementary materials are not behind a paywall: click on the “Supplemental Material (PDF)” link.)

