
What is the overhead of rcu_read_lock()?

A recent post to LKML stated that the patch in question did not plan to hold any global locks, including rcu_read_lock(), presumably because of concerns about contention or overhead. This blog entry is intended to help address any lingering concerns about rcu_read_lock() contention and overhead.

To be fair, at first glance, rcu_read_lock()'s source code does look a bit scary and slow:

static inline void rcu_read_lock(void)
{       
        __rcu_read_lock();
        __acquire(RCU);
        rcu_lock_acquire(&rcu_lock_map);
}

However, a closer look reveals that both __acquire() and rcu_lock_acquire() compile to nothing in production kernels (CONFIG_PROVE_LOCKING=n). This leaves __rcu_read_lock(), which is compiled differently for different settings of CONFIG_PREEMPT.
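
For reference, here is roughly what those two calls reduce to in a production build. The exact definitions vary from one kernel version to the next, so please treat this as a sketch rather than a quotation:

#ifndef __CHECKER__
# define __acquire(x) (void)0                  /* annotation for the sparse checker only */
#endif

#ifndef CONFIG_DEBUG_LOCK_ALLOC
# define rcu_lock_acquire(a) do { } while (0)  /* lockdep tracking disabled */
#endif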

Let's start with CONFIG_PREEMPT=n. We have:

static inline void __rcu_read_lock(void)
{       
        preempt_disable();
}

And, again for CONFIG_PREEMPT=n:

#define preempt_disable()               do { } while (0)

The overall result for rcu_read_lock() when CONFIG_PREEMPT=n is therefore as follows:

static inline void rcu_read_lock(void)
{       
}

Similar analysis of rcu_read_unlock() reveals that it is also an empty static inline function for CONFIG_PREEMPT=n. It is to be hoped that this is sufficiently lightweight for most practical purposes. If you find a case where it is too heavyweight, I would be very interested in hearing about it!
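
In other words, the CONFIG_PREEMPT=n unlock side should boil down to the same empty function:

static inline void rcu_read_unlock(void)
{
}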

That leaves CONFIG_PREEMPT=y, which actually has executable code in its definition of __rcu_read_lock() as follows:

void __rcu_read_lock(void)
{       
        current->rcu_read_lock_nesting++;
        barrier();
}

The first statement is a simple non-atomic increment of an int located in the task's own task_struct. The barrier() in the second statement expands as follows:

#define barrier() __asm__ __volatile__("": : :"memory")

This is an empty asm that generates no code, but that does disable code-motion optimizations that might otherwise move memory references across the barrier() statement. So, the executable code for rcu_read_lock() for CONFIG_PREEMPT=y is as follows:

void rcu_read_lock(void)
{       
        current->rcu_read_lock_nesting++;
}
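
To see what the barrier() buys, consider the following toy example, which is illustrative rather than kernel code. Without barrier(), the compiler could discard the first store as dead; with barrier(), both stores must be emitted, in order. Within rcu_read_lock(), this same mechanism prevents memory references in the critical section from being moved before the increment of rcu_read_lock_nesting:

#define barrier() __asm__ __volatile__("": : :"memory")

int x;

void example(void)
{
        x = 1;          /* cannot be optimized away ... */
        barrier();      /* ... because the "memory" clobber makes this store visible here */
        x = 2;
}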

A similar analysis of rcu_read_unlock() for CONFIG_PREEMPT=y yields the following:

void __rcu_read_unlock(void)
{
        struct task_struct *t = current;

        if (t->rcu_read_lock_nesting != 1)
                --t->rcu_read_lock_nesting;             /* still nested: just decrement */
        else {
                t->rcu_read_lock_nesting = INT_MIN;     /* mark outermost unlock in progress */
                if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
                        rcu_read_unlock_special(t);     /* slow path, taken only rarely */
                t->rcu_read_lock_nesting = 0;           /* fully unlocked */
        }
}

The common case of a single level of rcu_read_lock() nesting executes the else clause of the first if statement, and executes the then clause of the second if statement only when the RCU read-side critical section has either been preempted or has run for several milliseconds.

So in the common case, rcu_read_unlock() for CONFIG_PREEMPT=y executes two tests of task-local variables and two assignments to task-local variables.

This should be sufficiently lightweight for most purposes.
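
For concreteness, a typical RCU reader looks something like the following sketch, in which gp and struct foo are hypothetical names rather than kernel identifiers:

struct foo {
        int a;
};

struct foo __rcu *gp;           /* RCU-protected pointer */

int reader(void)
{
        struct foo *p;
        int ret = -1;

        rcu_read_lock();                /* free or nearly so, as analyzed above */
        p = rcu_dereference(gp);        /* p stays valid until rcu_read_unlock() */
        if (p)
                ret = p->a;
        rcu_read_unlock();
        return ret;
}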

Of course, RCU is intended for read-mostly situations, and RCU updates can have significant overhead, incurring significant latency, CPU overhead, and/or cache misses. That said, there are some special cases where RCU updates can be extremely fast, as shown in Figures 12 and 13 of the supplementary materials to User-Level Implementations of Read-Copy Update. (No, the supplementary materials are not behind a paywall: Click on the “Supplemental Material(PDF)” link.)
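
For comparison, here is a minimal update-side sketch using the same hypothetical gp. The grace-period latency shows up at the synchronize_rcu() call:

DEFINE_SPINLOCK(gp_lock);

void updater(struct foo *newp)
{
        struct foo *oldp;

        spin_lock(&gp_lock);            /* serialize updaters */
        oldp = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
        rcu_assign_pointer(gp, newp);   /* publish the new version to readers */
        spin_unlock(&gp_lock);
        synchronize_rcu();              /* wait for all pre-existing readers */
        kfree(oldp);                    /* only then reclaim the old version */
}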


