RCU
I recently refactored my rcutorture tests, getting that item off of my todo list after “only” three or four years. To my surprise, one (but only one) of the new rcutorture test scenarios failed, namely the one creatively named TREE08:

CONFIG_SMP=y
CONFIG_NR_CPUS=16
CONFIG_PREEMPT_NONE=n
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=y
CONFIG_HZ_PERIODIC=n
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ_FULL=n
CONFIG_RCU_FAST_NO_HZ=n
CONFIG_RCU_TRACE=n
CONFIG_HOTPLUG_CPU=n
CONFIG_SUSPEND=n
CONFIG_HIBERNATION=n
CONFIG_RCU_FANOUT=3
CONFIG_RCU_FANOUT_EXACT=y
CONFIG_RCU_FANOUT_LEAF=2
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_PROVE_RCU_DELAY=n
CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_RCU_CPU_STALL_VERBOSE=n
CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_PRINTK_TIME=y


One reason for my surprise was that I had been running similar rcutorture scenarios for some years. In sharp (and happy) contrast to previous efforts featuring but a handful of failures in a given ten-hour run, the effect was not subtle. In fact, TREE08 often failed more than 20 times per minute.

My first thought was that I had blown a recent commit, but testing earlier kernel versions gave similar failure rates.

Additional testing and analysis showed that the bug was quite sensitive to the exact values of the Kconfig parameters. For example, setting CONFIG_RCU_FANOUT=2 or CONFIG_RCU_FANOUT=4 resulted in rock-solid behavior, despite CONFIG_RCU_FANOUT=3 failing miserably. Similarly, setting CONFIG_RCU_FANOUT_EXACT=n (which is thankfully the default) also resulted in rock-solid behavior. This fortunately provides not only a workaround for the bug, but, even better, a workaround that is enabled by default. This enabled-by-default workaround did much to explain the total lack of bug reports.
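
In Kconfig terms, either of the following one-line changes to the TREE08 configuration shown above sufficed to make the failures vanish:

CONFIG_RCU_FANOUT=2          # CONFIG_RCU_FANOUT=4 is equally solid
CONFIG_RCU_FANOUT_EXACT=n    # the default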

This sensitivity to a few Kconfig parameters implicated the boot-time code that sets up the rcu_node structures. I therefore ran this boot-time setup code in usermode scaffolding, resulting in the following output:

NR_CPUS = 16, CONFIG_RCU_FANOUT = 3, CONFIG_RCU_FANOUT_LEAF = 2, MAX_RCU_LVLS = 4
rcu_fanout_leaf = 2, nr_cpu_ids = 16
NR_CPUS: 16, RCU_NUM_LVLS: 3, rcu_num_lvls: 3, rcu_num_nodes: 12
num_rcu_lvl[0]: 1
num_rcu_lvl[1]: 3
num_rcu_lvl[2]: 8
num_rcu_lvl[3]: 16
num_rcu_lvl[4]: 0
rsp->levelspread[0]: 2
rsp->levelspread[1]: 3
rsp->levelspread[2]: 3
CPU 0: rdp=0x804c420 rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 1: rdp=0x804c4cc rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 2: rdp=0x804c578 rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 3: rdp=0x804c624 rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 4: rdp=0x804c6d0 rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 5: rdp=0x804c77c rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 6: rdp=0x804c828 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 7: rdp=0x804c8d4 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 8: rdp=0x804c980 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 9: rdp=0x804ca2c rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 10: rdp=0x804cad8 rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 11: rdp=0x804cb84 rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 12: rdp=0x804cc30 rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 13: rdp=0x804ccdc rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 14: rdp=0x804cd88 rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 15: rdp=0x804ce34 rnp=0x804b2e8 (15-15) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)


The rdp= number shows each CPU's rcu_data structure address, and the sequence of rnp= numbers shows the rcu_node structures encountered when traversing upward from the corresponding CPU to the root of the rcu_node tree. The numbers in parentheses show which CPUs are handled by each rcu_node structure. The initial lines simply show the values of a few key boot-time macros, variables, and arrays.
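
For those unfamiliar with these structures, the traversal that produced each of those lines looks roughly like the following user-mode sketch. The struct layouts are simplified stand-ins that keep only the fields the printout needs; the real rcu_data and rcu_node structures live in kernel/rcutree.h:

#include <stdio.h>

struct rcu_node {
        struct rcu_node *parent;        /* NULL at the root of the tree */
        int grplo;                      /* lowest-numbered CPU covered by this node */
        int grphi;                      /* highest-numbered CPU covered by this node */
};

struct rcu_data {
        struct rcu_node *mynode;        /* leaf rcu_node handling this CPU */
};

static void print_cpu_path(int cpu, struct rcu_data *rdp)
{
        struct rcu_node *rnp;

        printf("CPU %d: rdp=%p", cpu, (void *)rdp);
        for (rnp = rdp->mynode; rnp; rnp = rnp->parent)
                printf(" rnp=%p (%d-%d)", (void *)rnp, rnp->grplo, rnp->grphi);
        printf("\n");
}

int main(void)
{
        /* Toy two-level tree covering CPUs 0-1, just to exercise the loop. */
        struct rcu_node root = { NULL, 0, 1 };
        struct rcu_node leaf = { &root, 0, 1 };
        struct rcu_data rdp = { &leaf };

        print_cpu_path(0, &rdp);
        return 0;
}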

Had I been paying attention, I would have noticed that this output shows two very odd things, with one of them called out twice. However, some of my assumptions blinded me to these warning signs.

I therefore enabled RCU event tracing, with the usual false start in which I failed to build the kernel with CONFIG_RCU_TRACE=y. This resulted in a 20MB trace, roughly 15MB of which was required to show the entire grace-period dance, starting from the callback being registered and ending with rcutorture detecting the RCU bug. The following entries caught my eye:

rcu_pree-7 0d..2 5852674733: rcu_grace_period_init: rcu_preempt 18446744073709551553 0 0 15 3
rcu_pree-7 0d..2 5852674946: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 0 8 7
rcu_pree-7 0d..2 5852675138: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 9 15 7
rcu_pree-7 0d..2 5852675313: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 18 15 0
rcu_pree-7 0d..2 5852675631: rcu_grace_period: rcu_preempt 18446744073709551553 cpustart
rcu_pree-7 0d..2 5852675802: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 0 2 7
rcu_pree-7 0d..2 5852675974: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 3 5 7
rcu_pree-7 0d..2 5852676148: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 6 8 7
rcu_pree-7 0d..2 5852676321: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 9 11 7
rcu_pree-7 0d..2 5852676519: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 12 14 7
rcu_pree-7 0d..2 5852676691: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 15 15 1
rcu_pree-7 0d..2 5852676862: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 18 15 0
rcu_pree-7 0d..2 5852677034: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 21 15 0


The smoking gun is the fourth line, the one containing the “18 15”. This indicates that the rcu_node structure corresponding to this trace event covers all CPUs whose numbers (as listed by smp_processor_id()) are greater than or equal to 18 and also less than or equal to 15. There are of course no such CPUs. The last two lines show similar brokenness.
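
Where does the impossible 18 come from? Grace-period initialization sweeps through all eight leaf-level rcu_node structures, each spanning levelspread[2] = 3 CPUs, so the last two leaves begin at CPUs 18 and 21, past the end of this 16-CPU system, and clamping their upper bounds to 15 produces the empty ranges. The following stand-alone program (my reconstruction of the arithmetic, not the kernel's code) reproduces the ranges seen in the trace:

#include <stdio.h>

int main(void)
{
        const int nr_cpu_ids = 16;      /* CPUs in this system */
        const int leaf_spread = 3;      /* rsp->levelspread[2] from the scaffolding output */
        const int nr_leaves = 8;        /* num_rcu_lvl[2] from the scaffolding output */
        int j;

        for (j = 0; j < nr_leaves; j++) {
                int grplo = j * leaf_spread;
                int grphi = grplo + leaf_spread - 1;

                if (grphi >= nr_cpu_ids)
                        grphi = nr_cpu_ids - 1;  /* clamp to the last possible CPU */
                printf("leaf %d: CPUs %d-%d%s\n", j, grplo, grphi,
                       grplo > grphi ? "  <- no such CPUs!" : "");
        }
        return 0;
}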

These trace events narrowed the bug down to the following code fragment in kernel/rcutree.c:

#ifdef CONFIG_RCU_FANOUT_EXACT
static void __init rcu_init_levelspread(struct rcu_state *rsp)
{
        int i;

        for (i = rcu_num_lvls - 1; i > 0; i--)
                rsp->levelspread[i] = CONFIG_RCU_FANOUT;
        rsp->levelspread[0] = rcu_fanout_leaf;
}
#else /* #ifdef CONFIG_RCU_FANOUT_EXACT */
static void __init rcu_init_levelspread(struct rcu_state *rsp)
{
        int ccur;
        int cprv;
        int i;

        cprv = nr_cpu_ids;
        for (i = rcu_num_lvls - 1; i >= 0; i--) {
                ccur = rsp->levelcnt[i];
                rsp->levelspread[i] = (cprv + ccur - 1) / ccur;
                cprv = ccur;
        }
}
#endif /* #else #ifdef CONFIG_RCU_FANOUT_EXACT */


Can you spot the problem and the fix?

Comments

pavelmachek
Jan. 6th, 2014 10:05 am (UTC)
Is there something clever going on?
I mean... that's a very strange way to write a loop in C. Is there a reason why the first loop can't be in "normal" for (i=0; i<bla; i++) form? Because this looks really interesting...
paulmck
Jan. 6th, 2014 09:49 pm (UTC)
Re: Is there something clever going on?
Nothing particularly clever, just wanted the loops for both alternatives to go in the same direction.

That said, it would need to be “for (i=1; i<bla; i++)” so that it starts at index 1 rather than index 0. I suppose that one could argue that double-assigning rsp->levelspread[0] wouldn't be that much of a problem, given that this only happens once at boot time, but still...
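
(For concreteness, the forward-going variant described above would look something like the following sketch, behaviorally identical to the CONFIG_RCU_FANOUT_EXACT=y version quoted earlier, with only the loop direction changed:)

static void __init rcu_init_levelspread(struct rcu_state *rsp)
{
        int i;

        /* Same effect as the original: index 0 gets rcu_fanout_leaf,
         * all remaining levels get CONFIG_RCU_FANOUT. */
        for (i = 1; i < rcu_num_lvls; i++)
                rsp->levelspread[i] = CONFIG_RCU_FANOUT;
        rsp->levelspread[0] = rcu_fanout_leaf;
}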