Stupid RCU Tricks: Enlisting the Aid of a Debugger

Using Debuggers With rcutorture



So rcutorture found a bug, you have figured out how to reproduce it, git bisect was unhelpful (perhaps because the bug has been around forever), and the bug happens to be one of those rare RCU bugs for which a debugger might be helpful. What can you do?

What I have traditionally done is to get partway through figuring out how to make gdb work with rcutorture, then suddenly realize what the bug's root cause must be. At this point, I of course abandon gdb in favor of fixing the bug. As a result, although I have tried to apply gdb to the Linux kernel many times over the past 20 years, I never have actually succeeded in doing so. Now, this is not to say that gdb is useless to Linux-kernel hackers. Far from it! For one thing, the act of trying to use gdb has inspired me to perceive the root cause of a great many bugs, which means that it has served as a great productivity aid. For another thing, I frequently extract Linux-kernel code into a usermode scaffolding and use gdb in that context. And finally, there really are a number of Linux-kernel hackers who make regular use of gdb.

One of these hackers is Omar Sandoval, who happened to mention that he had used gdb to track down a Linux-kernel bug. And without first extracting the code to userspace. I figured that it was time for this old dog to learn a new trick, so I asked Omar how he made this happen.

Omar pointed out that because rcutorture runs in guest OSes, gdb can take advantage of the debugging support provided by qemu. To make this work, you build a kernel with CONFIG_DEBUG_INFO=y (which supplies gdb with additional symbols), provide the nokaslr kernel boot parameter (which prevents kernel address-space randomization from invalidating these symbols), and supply qemu with the -s -S command-line arguments (which causes it to wait for gdb to connect instead of immediately booting the kernel). You then specify the vmlinux file's pathname as the sole command-line argument to gdb. Once you see the (gdb) prompt, the target remote :1234 command will connect to qemu and then the continue command will boot the kernel.
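
Putting that together, the manual procedure looks something like the following, where the bzImage and vmlinux pathnames are illustrative and any additional qemu arguments are omitted:

qemu-system-x86_64 -s -S -kernel /path/to/bzImage -append nokaslr ...
gdb /path/to/vmlinux     # in another window
(gdb) target remote :1234
(gdb) continue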

I tried this, and it worked like a charm.

Alternatively, you can now use the shiny new rcutorture --gdb command-line argument in the -rcu tree, which will automatically set up the kernel and qemu, and will print out the required gdb commands, including the path to the newly built vmlinux file.

And yes, I do owe Omar a --drgn command-line argument, which I will supply once he lets me know how to connect drgn to qemu. :-)

In the meantime, the following sections cover a couple of uses I have made of --gdb, mostly to get practice with this approach to Linux-kernel debugging.

Case study 1: locktorture

For example, let's use gdb to investigate a long-standing locktorture hang when running scenario LOCK05:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock \
    --duration 3 --configs LOCK05 --gdb

This will print out the following once the kernel is built and qemu has started:

Waiting for you to attach a debug session, for example:
    gdb /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2020.08.27-14.51.45/LOCK05/vmlinux
After symbols load and the "(gdb)" prompt appears:
    target remote :1234
    continue

Once you have started gdb and entered the two suggested commands, the kernel will start. You can track its console output by locating its console.log file as described in an earlier post. Or you can use the ps command to dump the qemu command line, looking for the -serial file: argument, which is followed by the pathname of the file receiving the console output.
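
For example, something like the following locates that pathname, with the output of course varying from run to run:

ps -ef | grep qemu | grep -o 'serial file:[^ ]*'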

Once the kernel is sufficiently hung, that is, once more than 15 seconds have elapsed after the last statistics output line (Writes: Total: 27668769 Max/Min: 27403330/34661 Fail: 0), you can hit control-C at gdb. The usual info threads command will show the CPUs' states, here with the 64-bit hexadecimal addresses abbreviated:

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 1 (CPU#0 [running]) stutter_wait (title=0xf... "lock_torture_writer")
    at kernel/torture.c:615
  2    Thread 2 (CPU#1 [running]) 0xf... in stutter_wait (
    title=0xf... "lock_torture_writer") at kernel/torture.c:615
  3    Thread 3 (CPU#2 [halted ]) default_idle () at arch/x86/kernel/process.c:689
  4    Thread 4 (CPU#3 [halted ]) default_idle () at arch/x86/kernel/process.c:689

It is odd that CPUs 0 and 1 are in stutter_wait(), spinning on the global variable stutter_pause_test. Even more odd is that the value of this variable is not zero, as it should be at the end of the test, but rather the value two. After all, all paths out of torture_stutter() should zero this variable.
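
For example, checking the variable's value from the (gdb) prompt, with the value-history number being illustrative:

(gdb) print stutter_pause_test
$1 = 2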

But maybe torture_stutter() is still stuck in the loop prior to the zeroing of stutter_pause_test. A quick look at torture_stutter_init() shows us that the task_struct pointer to the task running torture_stutter() lives in stutter_task, which is non-NULL, meaning that this task still lives. One might hope to use sched_show_task(), but this sadly fails with Could not fetch register "fs_base"; remote failure reply 'E14'.

The value of stutter_task.state is zero, which indicates that this task is running. But on what CPU? CPUs 0 and 1 are both spinning in stutter_wait, and the other two CPUs are in the idle loop. So let's look at stutter_task.on_cpu, which is zero, as in not on a CPU. In addition, stutter_task.cpu has the value one, and CPU 1 is definitely running some other task.
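
In gdb, these checks look something like this, again with illustrative value-history numbers:

(gdb) print stutter_task.state
$2 = 0
(gdb) print stutter_task.on_cpu
$3 = 0
(gdb) print stutter_task.cpu
$4 = 1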

It would be good to be able to print the stack of the blocked task, but it is also worth rerunning this test with the locktorture.stutter module parameter set to zero. This run completed successfully, in particular, with no hangs. Given that no other locktorture or rcutorture scenario suffers from similar hangs, perhaps the problem is in rt_mutex_lock() itself. To check this, let's restart the test, but with the default value of the locktorture.stutter module parameter, let it hang, and interrupt it with control-C (even though it still feels strange to control-C a kernel):

(gdb)  print torture_rtmutex
$1 = {wait_lock = {raw_lock = {{val = {counter = 0}, {locked = 0 '\000', pending = 0 '\000'}, {
          locked_pending = 0, tail = 0}}}}, waiters = {rb_root = {rb_node = 0xffffc9000025be50}, 
    rb_leftmost = 0xffffc90000263e50}, owner = 0x1 <fixed_percpu_data+1>}

The owner = 0x1 looks quite strange for a task_struct pointer, but the block comment preceding rt_mutex_set_owner() says that this value is legitimate, and represents one of two transitional states. So maybe it is time for CONFIG_DEBUG_RT_MUTEXES=y, but enabling this Kconfig option produces little additional enlightenment.

However, the torture_rtmutex.waiters field indicates that there really is something waiting on the lock. Of course, it might be that we just happened to catch the lock at this point in time. To check on this, let's add a variable to capture the time of the last lock release. I empirically determined that it is necessary to use WRITE_ONCE() to update this variable in order to prevent the compiler from optimizing it out of existence. Learn from my mistakes!
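
A minimal sketch of this instrumentation, assuming a hypothetical last_lock_release variable added alongside locktorture's rtmutex operations:

/* Hypothetical debug instrumentation, not in mainline. */
static unsigned long last_lock_release; /* jiffies at most recent unlock. */

static void torture_rtmutex_unlock(void) __releases(torture_rtmutex)
{
    /* WRITE_ONCE() keeps the compiler from optimizing the store away. */
    WRITE_ONCE(last_lock_release, jiffies);
    rt_mutex_unlock(&torture_rtmutex);
}

Comparing this variable to the current value of jiffies at the (gdb) prompt then shows how long ago the last release occurred.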

With the addition of WRITE_ONCE(), the next run showed that the last lock operation was more than three minutes in the past and that the transitional lock state still persisted, which provides strong evidence that this is the result of a race condition in the locking primitive itself. Except that a quick scan of the code didn't immediately identify a race condition. Furthermore, the failure happens even with CONFIG_DEBUG_RT_MUTEXES=y, which disables the lockless fastpaths (or the obvious lockless fastpaths, anyway).

Perhaps this is instead a lost wakeup? This would be fortuitous given that there are rare lost-IPI issues, and having this reproduce so easily on my laptop would be extremely convenient. Adding a bit of debug code to mark_wakeup_next_waiter() and lock_torture_writer() shows that there is a task that was awakened, but that never exited from rt_mutex_lock(). And this task is runnable, that is, its ->state value is zero. But it is clearly not running very far! Further instrumentation demonstrates that control is not reaching the __smp_call_single_queue() call from __ttwu_queue_wakelist(). The chase is on!

Except that the problem ended up being in stutter_wait(). As the name suggests, this function controls stuttering, that is, periodically switching between full load and zero load. Such stuttering can expose bugs that a pure full-load stress test would miss.

The stutter_wait() function uses adaptive waiting, so that schedule_timeout_interruptible() is used early in each no-load interval, but a tight loop containing cond_resched() is used near the end of the interval. The point of this is to more tightly synchronize the transition from no-load to full load. But the LOCK05 scenario's kernel is built with CONFIG_PREEMPT=y, which causes cond_resched() to be a no-op. In addition, the kthreads doing the write locking lower their priority using set_user_nice(current, MAX_NICE), which appears to be preventing preemption. (We can argue that even MAX_NICE should not indefinitely prevent preemption, but the multi-minute waits that have been observed are for all intents and purposes indefinite.)
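
Condensed from the kernel/torture.c of that era, the waiting logic looks roughly like this:

/* Sketch of stutter_wait()'s adaptive waiting, condensed. */
spt = READ_ONCE(stutter_pause_test);
for (; spt; spt = READ_ONCE(stutter_pause_test)) {
    if (spt == 1) {
        /* Early in the no-load interval: sleep. */
        schedule_timeout_interruptible(1);
    } else if (spt == 2) {
        /* Near the end of the interval: spin for a tight transition.
         * But cond_resched() is a no-op under CONFIG_PREEMPT=y! */
        while (READ_ONCE(stutter_pause_test))
            cond_resched();
    } else {
        schedule_timeout_interruptible(round_jiffies_relative(HZ));
    }
    torture_shutdown_absorb(title);
}

This also explains why stutter_pause_test held the value two earlier: both kthreads were trapped in the spt == 2 spin loop.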

The fix (or workaround, as the case might be) is for stutter_wait() to block periodically, thus allowing other tasks to run.
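
Roughly along the lines of the eventual change, the spinning branch of the sketch above might block every so often, with the period shown here being illustrative:

/* Sketch of the workaround: block periodically even in the tight loop,
 * so that a MAX_NICE kthread cannot spin indefinitely under
 * CONFIG_PREEMPT=y.  The variable i is a local counter. */
} else if (spt == 2) {
    while (READ_ONCE(stutter_pause_test)) {
        if (!(i++ & 0xffff))
            schedule_timeout_interruptible(1);
        cond_resched();
    }
}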

Case study 2: RCU Tasks Trace

I designed RCU Tasks Trace for the same grace-period latency that I had designed RCU Tasks for, namely roughly one second. Unfortunately, this proved to be about 40x too slow, so adjustments were called for.

After those reporting the issue kindly verified for me that this was not a case of too-long readers, I used --gdb to check statistics and state, this time with rcuscale, a member of the rcutorture family designed to measure the performance and scalability of the various RCU flavors' grace periods:

tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus \
    --configs TRACE01 --bootargs "rcuscale.nreaders=0 rcuscale.nwriters=10" \
    --trust-make --gdb

Once the (gdb) prompt appears, we connect to qemu, set a breakpoint, and then continue execution:

(gdb) target remote :1234
Remote debugging using :1234
0x000000000000fff0 in exception_stacks ()
(gdb) b rcu_scale_cleanup
Breakpoint 1 at 0xffffffff810d27a0: file kernel/rcu/rcuscale.c, line 505.
(gdb) cont
Continuing.
Remote connection closed
(gdb)

Unfortunately, as shown above, this gets us Remote connection closed instead of a breakpoint. Apparently, the Linux kernel does not take kindly to debug exception instructions being inserted into its code. Fortunately, gdb also supplies a hardware breakpoint command:

(gdb) target remote :1234
Remote debugging using :1234
0x000000000000fff0 in exception_stacks ()
(gdb) hbreak rcu_scale_cleanup
Hardware assisted breakpoint 1 at 0xffffffff810d27a0: file kernel/rcu/rcuscale.c, line 505.
(gdb) cont
Continuing.
[Switching to Thread 12]

Thread 12 hit Breakpoint 1, rcu_scale_cleanup () at kernel/rcu/rcuscale.c:505
505     {

This works much better, and the various data structures may now be inspected to check the validity of various optimization approaches. Of course, as the optimization effort continued, hand-typing gdb commands became onerous, and was therefore replaced with crude but automatic accumulation and display of relevant statistics.

Of course, Murphy being who he is, the eventual grace-period speedup also caused a few heretofore latent race conditions to be triggered by a few tens of hours of rcutorture. These race conditions resulted in rcu_torture_writer() stalls, along with the occasional full-fledged RCU-Tasks-Trace CPU stall warning.

Now, rcutorture does dump out RCU grace-period kthread state when these events occur, but in the case of the rcu_torture_writer() stalls, this state is for vanilla RCU rather than for the flavor of RCU under test. This is an rcutorture bug that will be fixed, but in the meantime, gdb provides a quick workaround: set a hardware breakpoint on the ftrace_dump() function, which is called when either of these sorts of stalls occurs. When the breakpoint triggers, it is easy to manually dump the data pertaining to the grace-period kthread of your choice.
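
Concretely, that workaround is a single command at the (gdb) prompt:

(gdb) hbreak ftrace_dump
(gdb) cont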

For those who are curious, the race turned out to be an IPI arriving between a pair of stores in rcu_read_unlock_trace() that could leave the corresponding task forever blocking the current RCU Tasks Trace grace period. The solution, as with vanilla RCU in the v3.0 timeframe, is to set the read-side nesting value to a negative number while clearing the .need_qs field indicating that a quiescent state is required. The buggy code is as follows:

if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
    // BUG: IPI here sets .need_qs after check!!!
    WRITE_ONCE(t->trc_reader_nesting, nesting);
    return;  // We assume shallow reader nesting.
}

Again, the fix is to set the nesting count to a large negative number, which allows the IPI handler to detect this race and refrain from updating the .need_qs field when the ->trc_reader_nesting field is negative, thus avoiding the grace-period hang:

WRITE_ONCE(t->trc_reader_nesting, INT_MIN); // FIX
if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
    WRITE_ONCE(t->trc_reader_nesting, nesting);
    return;  // We assume shallow reader nesting.
}

This experience of course suggests testing with grace-period latencies tuned much more aggressively than they are in production, with an eye to finding additional low-probability race conditions.

To be continued...
Tags: rcu, scalability, stupid rcu tricks