How should you celebrate a successful heisenbug hunt? Any five people would likely have ten different opinions, but whatever your choice, you should also fix your broken tests. Just what makes me say that your tests are broken? Consider the following definitions:
From these two definitions, it clearly follows that any reliable non-trivial program has at least one bug that you don't know about. The fact that you don't know about it means that your tests have not yet found that bug, and therefore, as asserted above, your tests are broken. And yes, I am giving your program the benefit of the doubt by assuming that it is non-trivial, and in any case, I claim that the Linux kernel's RCU implementation is non-trivial. RCU therefore contains at least one bug I don't know about — one bug my tests didn't find — and I therefore need to raise my rcutorture game.
I fixed the rcutorture test process as follows:
This improved testing regimen resulted in grace-period hangs, but only with
A little code inspection on these two code paths located the bug and the fix. With this fix, RCU appears to be quite reliable.
So how should I celebrate success in this latest hunt?
- The only bug-free programs are trivial programs.
- A reliable program has no known bugs.
From these two definitions, it clearly follows that any reliable non-trivial program has at least one bug that you don't know about. The fact that you don't know about it means that your tests have not yet found that bug, and therefore, as asserted above, your tests are broken. And yes, I am giving your program the benefit of the doubt by assuming that it is non-trivial, and in any case, I claim that the Linux kernel's RCU implementation is non-trivial. RCU therefore contains at least one bug I don't know about — one bug my tests didn't find — and I therefore need to raise my rcutorture game.
I fixed the rcutorture test process as follows:
- Choosing
CONFIG_RCU_FANOUT=2on an eight-CPU machine, which exercised code paths that would otherwise require 1024 CPUs. - Decreasing the wait time successive randomly chosen CPU-hotplug operations from three seconds to a hundred milliseconds (using “sleep 0.1” from a
bashscript!) - Retaining the artificially low one-scheduler-clock-tick delay between invocations of
force_quiescent_state().
This improved testing regimen resulted in grace-period hangs, but only with
CONFIG_PREEMPT_TREE_RCU. The really nice thing about this improved test was that the failures occurred within a few minutes, which permits use of printk() debugging, a rare luxury when working with RCU infrastructure. I nevertheless suppressed my urge to wildly code up random printk() statements in favor of enabling the RCU tracing infrastructure. The tracing data clearly showed that all leaf rcu_node structures had recorded quiescent-state passage for all their CPUs, but that this information had not propagated up the rcu_node hierarchy. Given that these grace-period hangs occurred only in CONFIG_TREE_PREEMPT_RCU, this data fingered two possible code paths:-
rcu_read_unlock_special()invoked from a task that had blocked within the prior RCU read-side critical section, and -
__rcu_offline_cpu()when offlining the last CPU from a given leafrcu_nodestructure when that structure has queued a task that has blocked within its current RCU read-side critical section.
A little code inspection on these two code paths located the bug and the fix. With this fix, RCU appears to be quite reliable.
So how should I celebrate success in this latest hunt?
My children's opinions notwithstanding, I recently found myself pursuing some nasty concurrency bugs in Linux's TREE_RCU implementation. This was not particularly surprising, given that I recently added preemption support, and the code really hadn't put up that much of a fight. In fact, I was getting the feeling that the bugs had gotten together and decided to hide out, the better to ambush me. This feeling wasn't far from wrong.
My first hint of trouble appeared when I began running longer sequences of
The initial results of the bisection were quite puzzling, converging on a commit that could not possibly change RCU's behavior. Then I noticed that one of the machines seemed to be generating more failures than others, and, sure enough, this difference in failure rate was responsible for the false convergence. I therefore started keeping more careful records, including the machine name, test duration, configuration parameters, commit number, and number of errors for each run. These records proved extremely helpful later on.
Further testing showed that 2.6.32-rc1 (AKA 2.6.32-rc2) was reliable, even for the error-prone machine, and that 2.6.32-rc3 was buggy. Unfortunately, there are no RCU-related commits between 2.6.32-rc1 and 2.6.32-rc3. Unless you count commit #828c0950, which simply applies
Now there are quite a few RCU-related patches between 2.6.31 and 2.6.32-rc1, so I started searching for the offending commit. However, by this time I had written scripts to analyze rcutorture output, which I used to check the status of the test runs, stopping runs as soon as they generated an error. This sped things up considerably, because failed runs now took on average only a few hours rather than the 10 hours I was using as a (rough) success criterion.
Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?
Testing eventually converged on commit #b8d57a76. By this time, I getting a bit paranoid, so I ran no fewer than three ten-hour runs at the preceding commit on the most error-prone machine, none of which failed. But this commit does nothing to RCU, but rather makes
This did simplify matters, permitting me to focus my testing efforts on the most recent version of RCU rather than spreading my testing efforts across every change since 2.6.31. In addition, the fact that long-running RCU read-side critical sections triggered the bug told me roughly where the bug had to be:
I stubbed out
I was also now in a position to take some good advice from Ingo Molnar: when you see a failure, work to increase the failure rate. This might seem counter-intuitive, but the more frequent the failures, the shorter the test runs, and the faster you can find the bug. I therefore changed the value of
Quick Quiz 2: How could increasing the frequency of
Given that the race I found involved unsynchronized access to the
And it only took 882.5 hours of machine time to track down these bugs. :–)
This raises the question of why these bugs happened in the first place. After all, I do try to be quite careful with RCU-infrastructure code. In this case, it appears that these bugs were inserted during a bug-fixing session fairly late in the
Another question is the number of bugs remaining. This is of course hard to say at present, but Murphy would assert that, no matter what you do, there will always be at least a few more bugs.
Answer to Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?.
Answer to Quick Quiz 2: How could increasing the frequency of
My first hint of trouble appeared when I began running longer sequences of
rcutorture runs, seeing an occasional failure on one-hour runs. My first reaction was to increase the duration to ten hours and attempt to bisect the problem. Of course, even with bisection, finding the bug takes quite some time given ten hours for each probe, so rather than use “git bisect”, I manually ran parallel runs and (for example) quadrisected. I also ran multiple configurations. The results initially hinted that CONFIG_NO_HZ might have something to do with it, but later runs showed no shortage of failures in !CONFIG_NO_HZ runs as well.The initial results of the bisection were quite puzzling, converging on a commit that could not possibly change RCU's behavior. Then I noticed that one of the machines seemed to be generating more failures than others, and, sure enough, this difference in failure rate was responsible for the false convergence. I therefore started keeping more careful records, including the machine name, test duration, configuration parameters, commit number, and number of errors for each run. These records proved extremely helpful later on.
Further testing showed that 2.6.32-rc1 (AKA 2.6.32-rc2) was reliable, even for the error-prone machine, and that 2.6.32-rc3 was buggy. Unfortunately, there are no RCU-related commits between 2.6.32-rc1 and 2.6.32-rc3. Unless you count commit #828c0950, which simply applies
const to a few data structures involved in RCU tracing, which I don't and you shouldn't. So I ran a few more runs on 2.6.32-rc1, and eventually did trigger a failure. In contrast, 2.6.31 was rock solid.Now there are quite a few RCU-related patches between 2.6.31 and 2.6.32-rc1, so I started searching for the offending commit. However, by this time I had written scripts to analyze rcutorture output, which I used to check the status of the test runs, stopping runs as soon as they generated an error. This sped things up considerably, because failed runs now took on average only a few hours rather than the 10 hours I was using as a (rough) success criterion.
Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?
Testing eventually converged on commit #b8d57a76. By this time, I getting a bit paranoid, so I ran no fewer than three ten-hour runs at the preceding commit on the most error-prone machine, none of which failed. But this commit does nothing to RCU, but rather makes
rcutorture testing more severe, inserting delays of up to 50 milliseconds in RCU read-side critical sections. I therefore cherry-picked this commit back onto 2.6.31 and 2.6.30, and, sure enough, got failures in both cases. As it turned out, I was dealing with a day-one bug in TREE_RCU.This did simplify matters, permitting me to focus my testing efforts on the most recent version of RCU rather than spreading my testing efforts across every change since 2.6.31. In addition, the fact that long-running RCU read-side critical sections triggered the bug told me roughly where the bug had to be:
force_quiescent_state() or one of the functions it calls. This function runs more often in face of long-running RCU read-side critical sections. In addition, this explained the earlier CONFIG_NO_HZ results, because one of the force_quiescent_state() function's responsibilities is detecting dyntick-idle CPUs. In addition, it raised the possibility that the bug was unrelated to memory ordering, which motivated me to try a few runs on x86 — which, to my surprise, resulted in much higher failure rates than did the earlier tests on the Power machines.I stubbed out
force_quiescent_state() to check my assumption that it was to blame (but please, please do not do this on production systems!!!). Stubbing out force_quiescent_state() resulted in a statistically significant 3x decrease in failures on the x86 machine, confirming my assumption, at least for some subset of the bugs. Now that there was a much smaller section of code to inspect, I was able to locate one race involving mishandling of the ->completed values. This reduced the error rate on the x86 machine by roughly the same amount as did stubbing out force_quiescent_state(). One bug down, but more bugs still hiding.I was also now in a position to take some good advice from Ingo Molnar: when you see a failure, work to increase the failure rate. This might seem counter-intuitive, but the more frequent the failures, the shorter the test runs, and the faster you can find the bug. I therefore changed the value of
RCU_JIFFIES_TILL_FORCE_QS from three to one, which increased the failure rate by well over an order of magnitude on the x86 machine.Quick Quiz 2: How could increasing the frequency of
force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?Given that the race I found involved unsynchronized access to the
->completed values, it made sense to look at other unsynchronized accesses. I found three other such issues, and testing of the resulting patches has thus far turned up zero rcutorture failures.And it only took 882.5 hours of machine time to track down these bugs. :–)
This raises the question of why these bugs happened in the first place. After all, I do try to be quite careful with RCU-infrastructure code. In this case, it appears that these bugs were inserted during a bug-fixing session fairly late in the
TREE_RCU effort. Bug-fixing is often done under considerably more time pressure than is production of new code, and the mistake in this case was failing to follow up with more careful analysis.Another question is the number of bugs remaining. This is of course hard to say at present, but Murphy would assert that, no matter what you do, there will always be at least a few more bugs.
Answer to Quick Quiz 1: If successful tests take 10 hours and failed runs take only a single hour, is bisection still the optimal bug-finding method?.
Answer to Quick Quiz 2: How could increasing the frequency of
force_quiescent_state() by a factor of three increase the rcutorture failure rate by more than an order of magnitude? Wouldn't the increase instead be about a factor of three?I first encountered the C language about three decades ago, and I must confess that my first impressions were not particularly positive. Part of my early dissatisfaction was due to the fact that I was using 8-bit microprocessors and that early C compilers had a bad habit of promoting
This sort of assembly code sequence was simply not what you wanted to see when attempting to jam 50,000 lines of source code into a 64KB address space, even when given a full 256K of physical memory and an overlay loader hacked up to do bank switching. Other long-dead and unlamented languages were able to generate a pair of loads, an add, and a store, which did much to limit C's uptake on 8-bit platforms. Of course, these other languages had their own charming limitations, such as lacking any variable-initialization syntax (use assembly instead!), insanely inefficient
After all, the compiler had to fit in 48K, since the operating system (such as it was) consumed the remaining 16K. (My application dispensed with the operating system, and hence could use the entire 64K.)
The past three decades have seen C grow up to be almost unrecognizable, which, believe me, is a very good thing. This growth allows us to enjoy a huge number of language features, including:
But one feature that I hadn't noticed until recently is the empty structure declaration, which Peter Zijlstra recently introduced me to in the form of the non-lockdep definition of
To my surprise, this actually works, at least in gcc:
This generates an object module whose bss and data really are of zero size:
And different elements of the array really do have identical addresses:
The C language certainly has come a long way in the past three decades. And yes, after all these years, I am still easily amused.
char to int in the generated object code, regardless of whether this made any particular sense. For example, if x and y were of type char, and were to be added with the sum placed in z, then the assembly output would execute something resembling the following steps:- Load
xinto an eight-bit register, which on the 8080 or z80 would normally be thearegister. - Move the
aregister to the lower half of a sixteen-bit register, for example, thelportion of thehlregister. - Test the sign bit of the
aregister. - Using one of a number of obscure instruction sequences, fill the upper half of the sixteen-bit register (in this example,
h) with eight copies of this sign bit. - Repeat the above steps to get a sign-extended copy of
yinto (say) thederegister. - Move
ltoa. - Add
etoa(which sets the carry bit in the condition-code register appropriately). - Move
ato (say)c. - Repeat the preceding three steps to add (with carry)
htod, placing the result inb. - Move
btoa, never mind that this is where we just gotbfrom. - Store
aintoz.
This sort of assembly code sequence was simply not what you wanted to see when attempting to jam 50,000 lines of source code into a 64KB address space, even when given a full 256K of physical memory and an overlay loader hacked up to do bank switching. Other long-dead and unlamented languages were able to generate a pair of loads, an add, and a store, which did much to limit C's uptake on 8-bit platforms. Of course, these other languages had their own charming limitations, such as lacking any variable-initialization syntax (use assembly instead!), insanely inefficient
for loops, strange restrictions on their equivalent of union, and much else besides.After all, the compiler had to fit in 48K, since the operating system (such as it was) consumed the remaining 16K. (My application dispensed with the operating system, and hence could use the entire 64K.)
The past three decades have seen C grow up to be almost unrecognizable, which, believe me, is a very good thing. This growth allows us to enjoy a huge number of language features, including:
- identifiers longer than eight characters (and longer than seven for
externidentifiers). - type-checked function arguments and return values.
- the deprecation of the
gets()function. - 32-bit and 64-bit integers.
- inline functions.
- standard
printf()formats (yes, these really did vary from compiler to compiler).
But one feature that I hadn't noticed until recently is the empty structure declaration, which Peter Zijlstra recently introduced me to in the form of the non-lockdep definition of
struct lock_class_key. A recently detected hang in RCU under extreme conditions caused me to want an array of struct lock_class_key, which of course turns into an array of empty structures in non-lockdep builds.To my surprise, this actually works, at least in gcc:
#include <stdio.h>
struct foo {};
struct foo a[10];
int main(int argc, char *argv[])
{
printf("%p %p\n", &a[0], &a[1]);
return 0;
}
This generates an object module whose bss and data really are of zero size:
text data bss dec hex filename
66 0 0 66 42 arrayemptystruct.o
And different elements of the array really do have identical addresses:
./arrayemptystruct 0x8049588 0x8049588
The C language certainly has come a long way in the past three decades. And yes, after all these years, I am still easily amused.
My children seem to have decided that I am not putting enough bugs in my software, given that they gave me one of these:

This does raise the question as to whether I prefer bugs in my software or in my candy. In this case, the answer is easy: the candy is transparent, so I know what I would be eating, so that cricket in the candy is far less troublesome than a bug in my software. Of course, if the candy were opaque, this decision might be somewhat more difficult.
In happy contrast to previous lives, these days my software is also transparent: freely available for all to download, use, study, redistribute, and improve. This means that other people can find my bugs (and vice versa), which can be surprisingly helpful. It is often the case that the “simple” matter of identifying the bug is the most difficult step in the process of fixing it.
Bugs can be inserted into code in the process of writing it and in the process of fixing other bugs, and I certainly have inserted plenty of bugs using both of these time-honored mechanisms. However, bugs can also appear spontaneously, as a result of changing workloads and changing requirements. I offer two examples from the current mainline implementation of RCU:
All that aside, exactly when did these bugs appear? When the code was last changed? When the new use cases were first articulated? Or when I decided to accept them as bugs?
In the end, I guess it really doesn't matter so long as I fix them.

This does raise the question as to whether I prefer bugs in my software or in my candy. In this case, the answer is easy: the candy is transparent, so I know what I would be eating, so that cricket in the candy is far less troublesome than a bug in my software. Of course, if the candy were opaque, this decision might be somewhat more difficult.
In happy contrast to previous lives, these days my software is also transparent: freely available for all to download, use, study, redistribute, and improve. This means that other people can find my bugs (and vice versa), which can be surprisingly helpful. It is often the case that the “simple” matter of identifying the bug is the most difficult step in the process of fixing it.
Bugs can be inserted into code in the process of writing it and in the process of fixing other bugs, and I certainly have inserted plenty of bugs using both of these time-honored mechanisms. However, bugs can also appear spontaneously, as a result of changing workloads and changing requirements. I offer two examples from the current mainline implementation of RCU:
- The
call_rcu()function has a tunable safety limit (defaulting to 10,000 callbacks per CPU) beyond which RCU takes extreme measures to force the current grace period to complete. This has worked flawlessly for more than five years, but recent patches (which are otherwise eminently reasonable) cause this limit to be exceeded on a routine basis. Poof!!! Thecall_rcu()function now has a bug! And yes, I sent out a temporary hack, and now have a real fix in mainline. - Both Mathieu Desnoyers and Udo Steinberg have recently argued that the
call_rcu()function should execute deterministically when invoked from a process running at real-time priority. And older versions ofcall_rcu()used to do just that. However, recent concerns with grace-period latency have resulted in my adding non-deterministic grace-period-acceleration code tocall_rcu(). So, doescall_rcu()have a real-time-response bug or doesn't it? I guess that the final answer is up to me, but the more I think about it, the more reasonable Mathieu's and Udo's position seems. So it seems that a secondcall_rcu()bug is in the offing.
All that aside, exactly when did these bugs appear? When the code was last changed? When the new use cases were first articulated? Or when I decided to accept them as bugs?
In the end, I guess it really doesn't matter so long as I fix them.
If you actually read my full set of Transactional Memory Everywhere blog postings, I offer my enthusiastic congratulations and heartfelt condolences.
In short, although TM offers much promise for small changes to memory-only data structures, there are a number of issues with large-scale transacations in real software systems:
So, what can we conclude from the above list of issues?
What should TM researchers and developers do about all of this?
One approach is to focus on TM in the small, focusing on situations where hardware assist potentially provides substantial advantages over other synchronization primitives. This is in fact the approach Sun took with its Rock research CPU. Some TM researchers seem to agree with this approach, while others have much higher hopes for TM.
Of course, it is quite possible that TM will be able to take on larger problems, and this series of blog posts lists a few of the issues that must be resolved if TM is to achieve this lofty goal.
My personal hope is that everyone involved treats this as a learning experience.
It appears to me that TM researchers have great deal to learn from practitioners who have successfully built large software systems using traditional synchronization primitives.
And vice versa.
In short, although TM offers much promise for small changes to memory-only data structures, there are a number of issues with large-scale transacations in real software systems:
- I/O operations, especially RPCs, which cannot in general be rolled back.
- Memory-mapping operations, particularly when you unmap some of a given transaction's variables from outside of that transaction.
- Multi-threaded transactions do not seem to be supported by most TM implementations, which means
that TM does not generally support things like pthread_create(). - Extra-transactional accesses, whose behavior varies greatly from one TM implementation to another.
- Time delays, which can interact with the notion of atomicity in interesting and ill-defined ways.
- Interactions with locking, particularly reader-writer locking. These interactions are clearly critically important if TM is to be introduced into large existing software programs that use locking. Of course, the interactions between TM and RCU are a special interest of mine.
- Persistent transactions are an interesting possibility. There are forms of locking that can span address spaces, and that can survive reboots and even operating-system upgrades. Should TM offer similar persistence?
- Dynamic linking and loading of functions invoked from within transactions.
- Debugging transactions, especially setting breakpoints within transactions for hardware TM implementations.
- Transactions containing the exec() system call have interesting implications.
So, what can we conclude from the above list of issues?
- One interesting property of TM is the fact that transactions are subject to rollback and retry. This property underlies TM's difficulties with irreversible operations, including unbuffered I/O, RPCs, memory-mapping operations, time delays, and the exec() system call.
- Another interesting property of TM, noted by Shpeisman et al., is that TM intertwines the synchronization with the data it protects. This property underlies TM's issues with I/O, memory-mapping operations, extra-transactional accesses, and debugging breakpoints. In contrast, conventional synchronization primitives, including locking and RCU, maintain a clear separation between the synchronization
primitives and the data that they protect. - One of the stated goals of many workers in the TM area is to ease parallelization of large sequential programs. As such, individual transactions are commonly expected to execute serially, which might do much to explain TM's issues with multithreaded transactions.
What should TM researchers and developers do about all of this?
One approach is to focus on TM in the small, focusing on situations where hardware assist potentially provides substantial advantages over other synchronization primitives. This is in fact the approach Sun took with its Rock research CPU. Some TM researchers seem to agree with this approach, while others have much higher hopes for TM.
Of course, it is quite possible that TM will be able to take on larger problems, and this series of blog posts lists a few of the issues that must be resolved if TM is to achieve this lofty goal.
My personal hope is that everyone involved treats this as a learning experience.
It appears to me that TM researchers have great deal to learn from practitioners who have successfully built large software systems using traditional synchronization primitives.
And vice versa.
Because read-copy update (RCU) finds its main use in the Linux kernel, one might be forgiven for assuming that there had been no academic work on combining RCU and TM. However, the TxLinux group from the University of Texas at Austin had no choice. The fact that they applied TM to the Linux 2.6 kernel, which uses RCU, forced them to integrate TM and RCU, with TM taking the place of locking for RCU updates. Unfortunately, although the paper does state that the RCU implementation's locks (e.g., rcu_ctrlblk.lock) were converted to transactions, it is silent about what happened to locks used in RCU-based updates (e.g., dcache_lock).
It is important to note that RCU permits readers and updaters to run concurrently, further permitting RCU readers to access data that is in the act of being updated. Of course, this property of RCU, whatever its performance, scalability, and real-time-response benefits might be, flies in the face of the underlying atomicity properties of TM.
So how should TM-based updates interact with concurrent RCU readers?
It is important to note that RCU permits readers and updaters to run concurrently, further permitting RCU readers to access data that is in the act of being updated. Of course, this property of RCU, whatever its performance, scalability, and real-time-response benefits might be, flies in the face of the underlying atomicity properties of TM.
So how should TM-based updates interact with concurrent RCU readers?
One can execute an exec() system call while holding a lock, and also from within an RCU read-side critical section. The exact semantics depends on the type of primitive.
In the case of non-persistent primitives (including pthread_mutex_lock(), pthread_rwlock_rdlock(), and RCU), if the exec() succeeds, the whole address space vanishes, along with any locks being held. Of course, if the exec() fails, the address space still lives, so any associated locks would also still live. A bit strange perhaps, but reasonably well defined.
On the other hand, persistent primitives (including the flock family, lockf(), System V semaphores, and the O_CREAT flag to open()) would survive regardless of whether the exec() succeeded or failed, so that the exec()ed program might well release them. (Non-persistent primitives represented by data structures in mmap() regions of memory are an interesting intermediate point that is left to the reader.)
What happens when you attempt to execute an exec() system call from within a transaction?
In the case of non-persistent primitives (including pthread_mutex_lock(), pthread_rwlock_rdlock(), and RCU), if the exec() succeeds, the whole address space vanishes, along with any locks being held. Of course, if the exec() fails, the address space still lives, so any associated locks would also still live. A bit strange perhaps, but reasonably well defined.
On the other hand, persistent primitives (including the flock family, lockf(), System V semaphores, and the O_CREAT flag to open()) would survive regardless of whether the exec() succeeded or failed, so that the exec()ed program might well release them. (Non-persistent primitives represented by data structures in mmap() regions of memory are an interesting intermediate point that is left to the reader.)
What happens when you attempt to execute an exec() system call from within a transaction?
The usual debugging operations such as breakpoints work normally within lock-based critical sections and from RCU read-side critical sections. However, in initial transactional-memory hardware implementations an exception within a transaction will abort that transaction, which in turn means that breakpoints abort all enclosing transactions.
So how can transactions be debugged?
So how can transactions be debugged?
Both lock-based critical sections and RCU read-side critical sections can legitimately contain code that invokes dynamically linked and loaded functions, including C/C++ shared libraries and Java class libraries. Of course, the code contained in these libraries is by definition unknowable at compile time. So, what happens if a dynamically loaded function is invoked within a transaction?
This question has two parts: (a) how do you dynamically link and load a function within a transaction and (b) what do you do about the unknowable nature of the code within this function? To be fair, item (b) poses some challenges for locking and RCU as well, at least in theory. For example, the dynamically linked function might introduce a deadlock for locking or might (erroneously) introduce a quiescent state into an RCU read-side critical section. The difference is that while the class of operations permitted in locking and RCU critical sections is well-understood, there appears to still be considerable uncertainty in the case of TM. In fact, different implementations of TM seem to have different restrictions.
So what can TM do about dynamically linked and loaded library functions?
This question has two parts: (a) how do you dynamically link and load a function within a transaction and (b) what do you do about the unknowable nature of the code within this function? To be fair, item (b) poses some challenges for locking and RCU as well, at least in theory. For example, the dynamically linked function might introduce a deadlock for locking or might (erroneously) introduce a quiescent state into an RCU read-side critical section. The difference is that while the class of operations permitted in locking and RCU critical sections is well-understood, there appears to still be considerable uncertainty in the case of TM. In fact, different implementations of TM seem to have different restrictions.
So what can TM do about dynamically linked and loaded library functions?
There are many different types of locking primitives. One interesting distinction is persistence, in other words, whether the lock can exist independently of the address space of the process using the lock.
Non-persistent locks include pthread_mutex_lock(), pthread_rwlock_rdlock(), and most kernel-level locking primitives. If the memory locations instantiating a non-persistent lock's data structures disappear, so does the lock. For typical use of pthread_mutex_lock(), this means that when the process exits, all of its locks vanish. This property can be exploited in order to trivialize lock cleanup at program shutdown time, but makes it more difficult for unrelated applications to share locks, as such sharing requires the applications to share memory.
Persistent locks help avoid the need to share memory among unrelated applications. Persistent locking APIs include the flock family, lockf(), System V semaphores, or the O_CREAT flag to open(). These persistent APIs can be used to protect large-scale operations spanning runs of multiple applications, and, in the case of O_CREAT even surviving operating-system reboot. If need be, locks can span multiple computer systems via distributed lock managers.
Persistent locks can be used by any application, including applications written using multiple languages and software environments.
How could similar functionality be provided via transactional memory?
Non-persistent locks include pthread_mutex_lock(), pthread_rwlock_rdlock(), and most kernel-level locking primitives. If the memory locations instantiating a non-persistent lock's data structures disappear, so does the lock. For typical use of pthread_mutex_lock(), this means that when the process exits, all of its locks vanish. This property can be exploited in order to trivialize lock cleanup at program shutdown time, but makes it more difficult for unrelated applications to share locks, as such sharing requires the applications to share memory.
Persistent locks help avoid the need to share memory among unrelated applications. Persistent locking APIs include the flock family, lockf(), System V semaphores, or the O_CREAT flag to open(). These persistent APIs can be used to protect large-scale operations spanning runs of multiple applications, and, in the case of O_CREAT even surviving operating-system reboot. If need be, locks can span multiple computer systems via distributed lock managers.
Persistent locks can be used by any application, including applications written using multiple languages and software environments.
How could similar functionality be provided via transactional memory?
It is commonplace to read-acquire reader-writer locks while holding other locks, which just works, at least as long as the usual well-known software-engineering techniques are employed to avoid deadlock. Read-acquiring reader-writer locks from within RCU read-side critical sections also works, and doing so eases deadlock concerns because RCU read-side primitives cannot participated in lock-based deadlock cycles. But what happens when you attempt to read-acquire a reader-writer lock from within a transaction?
Unfortunately, the straightforward approach to read-acquiring the traditional counter-based reader-writer lock within a transaction defeats the purpose of the reader-writer lock. To see this, consider a pair of transactions concurrently attempting to read-acquire the same reader-writer lock. Because read-acquisition involves modifying the reader-writer lock's data structures, a conflict will result, which will roll back one of the two transactions. This behavior is completely inconsistent with the reader-writer lock's goal of allowing concurrent readers.
So, what is TM to do about reader-writer locking?
Unfortunately, the straightforward approach to read-acquiring the traditional counter-based reader-writer lock within a transaction defeats the purpose of the reader-writer lock. To see this, consider a pair of transactions concurrently attempting to read-acquire the same reader-writer lock. Because read-acquisition involves modifying the reader-writer lock's data structures, a conflict will result, which will roll back one of the two transactions. This behavior is completely inconsistent with the reader-writer lock's goal of allowing concurrent readers.
So, what is TM to do about reader-writer locking?
It is commonplace to acquire locks while holding other locks, which works quite well, at least as long as the usual well-known software-engineering techniques are employed to avoid deadlock. It is not unusual to acquire locks from within RCU read-side critical sections, which eases deadlock concerns because RCU read-side primitives cannot participated in lock-based deadlock cycles. But happens when you attempt to acquire a lock from within a transaction?
In theory, the answer is trivial: simply manipulate the data structure representing the lock as part of the transaction, and everything works out perfectly. In practice, a number of non-obvious complications can arise, depending on implementation details of the TM system. These complications can be resolved, but at the cost of a 45% increase in overhead for locks acquired outside of transactions and a 300% increase in overhead for locks acquired within transactions. Although these overheads might be acceptable for transactional programs containing small amounts of locking, they are often completely unacceptable for production-quality lock-based programs wishing to use the occasional transaction.
So what is TM to do about locking?
In theory, the answer is trivial: simply manipulate the data structure representing the lock as part of the transaction, and everything works out perfectly. In practice, a number of non-obvious complications can arise, depending on implementation details of the TM system. These complications can be resolved, but at the cost of a 45% increase in overhead for locks acquired outside of transactions and a 300% increase in overhead for locks acquired within transactions. Although these overheads might be acceptable for transactional programs containing small amounts of locking, they are often completely unacceptable for production-quality lock-based programs wishing to use the occasional transaction.
So what is TM to do about locking?
An important special case of interaction with extra-transactional accesses involves explicit time delays within a transaction. Of course, the idea of a time delay within a transaction flies in the face of TM's atomicity property, but one can argue that this sort of thing is what weak atomicity is all about.
So, what should TM do about time delays in transactions?
So, what should TM do about time delays in transactions?
Within a lock-based critical section, it is perfectly legal to manipulate variables that are concurrently accessed or even modified outside that lock's critical section, with one common example being statistical counters. The same thing is possible within RCU read-side critical sections, and is in fact the common case.
Given mechanisms like the so-called “dirty reads” that are prevalent in production database systems, it is not surprising that extra-transactional accesses have received serious attention from the proponents of TM, with the concepts of weak and strong atomicity being but one case in point.
So what can TM do about extra-transactional accesses?
Given mechanisms like the so-called “dirty reads” that are prevalent in production database systems, it is not surprising that extra-transactional accesses have received serious attention from the proponents of TM, with the concepts of weak and strong atomicity being but one case in point.
So what can TM do about extra-transactional accesses?
The following pseudo-code fragment might be unconventional, but is meaningful, well-defined, and useful:
pthread_mutex_lock(...);
for (i = 0; i < ncpus; i++)
tid[i] = pthread_create(...);
for (i = 0; i < ncpus; i++)
pthread_join(tid[i], ...)
pthread_mutex_unlock(...);
This pseudo-code fragment uses pthread_create() to spawn one thread per CPU, then uses pthread_join() to wait for each to complete, all under the protection of pthread_mutex_lock(). The effect is to execute a lock-based critical section in parallel. Of course, the critical section would need to be quite large to justify the thread-spawning overhead, but there are many examples of large critical sections in production software. It is also legal to spawn threads within other types of critical sections, for example reader-writer locks and RCU.
What might TM do about thread spawning within a transaction?
pthread_mutex_lock(...);
for (i = 0; i < ncpus; i++)
tid[i] = pthread_create(...);
for (i = 0; i < ncpus; i++)
pthread_join(tid[i], ...)
pthread_mutex_unlock(...);
This pseudo-code fragment uses pthread_create() to spawn one thread per CPU, then uses pthread_join() to wait for each to complete, all under the protection of pthread_mutex_lock(). The effect is to execute a lock-based critical section in parallel. Of course, the critical section would need to be quite large to justify the thread-spawning overhead, but there are many examples of large critical sections in production software. It is also legal to spawn threads within other types of critical sections, for example reader-writer locks and RCU.
What might TM do about thread spawning within a transaction?
It is perfectly legal to execute memory-mapping operations (including
It should not be necessary to consider cases where the TM system's metadata is remapped, given that most locking primitives do not define the outcome of remapping their lock variables.
Here are some possible ways for TM to handle memory mapping operations.
mmap() and shmat) within a lock-based critical section, and, at least in principle, from within an RCU read-side critical section. What happens when you attempt to execute such an operation from within a transaction? More to the point, what happens if the memory region being remapped contains some variables participating in the current thread's transaction? And what if this memory region contains variables particpating in some other thread's transaction?It should not be necessary to consider cases where the TM system's metadata is remapped, given that most locking primitives do not define the outcome of remapping their lock variables.
Here are some possible ways for TM to handle memory mapping operations.
One can execute RPCs within a lock-based critical section, as well as from within an RCU read-side critical section. What happens when you attempt to execute an RPC from within a transaction?
If both the RPC request and its response are to be contained within the transaction, and if some part of the transaction depends on the result returned by the response, then it is not possible to use the memory-buffer tricks that can be used in the case of buffered I/O. Any attempt to take this buffering approach would deadlock the transaction, as the request could not be transmitted until the transaction was guaranteed to succeed, but the transaction's success might not be knowable until after the response is received, as is the case in the following example:
begin_trans();
rpc_request();
i = rpc_response();
a[i]++;
end_trans();
The transaction's memory footprint cannot be determined until after the RPC response is received, and until the transaction's memory footprint can be determined. It is therefore impossible to determine whether the transaction can be allowed to commit.
What can TM do about this scenario?
If both the RPC request and its response are to be contained within the transaction, and if some part of the transaction depends on the result returned by the response, then it is not possible to use the memory-buffer tricks that can be used in the case of buffered I/O. Any attempt to take this buffering approach would deadlock the transaction, as the request could not be transmitted until the transaction was guaranteed to succeed, but the transaction's success might not be knowable until after the response is received, as is the case in the following example:
begin_trans();
rpc_request();
i = rpc_response();
a[i]++;
end_trans();
The transaction's memory footprint cannot be determined until after the RPC response is received, and until the transaction's memory footprint can be determined. It is therefore impossible to determine whether the transaction can be allowed to commit.
What can TM do about this scenario?
One can execute I/O operations within a lock-based critical section, and, at least in principle, from within an RCU read-side critical section. What happens when you attempt to execute an I/O operation from within a transaction?
The underlying problem is that transactions may be rolled back, for example, due to conflicts. Roughly speaking, this requires that all operations within any given transaction be idempotent, so that executing the operation twice has the same effect as executing it once. Unfortunately, I/O is in general the prototypical non-idempotent operation, making it difficult to include general I/O operations in transactions.
I/O options available to TM.
The underlying problem is that transactions may be rolled back, for example, due to conflicts. Roughly speaking, this requires that all operations within any given transaction be idempotent, so that executing the operation twice has the same effect as executing it once. Unfortunately, I/O is in general the prototypical non-idempotent operation, making it difficult to include general I/O operations in transactions.
I/O options available to TM.
I recently encountered a researcher who assured me that transactional memory (TM) would soon be the only synchronization mechanism, supplanting all the complex and error-prone synchronization mechanisms currently in use. This did surprise me a bit, given that TM has some difficulty with commonplace operations such as input and output, but I most certainly cannot fault the man's ambitions for TM. To his credit, he did acknowledge TM's difficulties with input and output. Of course, there are a number of I/O scenarios that TM does quite well with, the prime example being buffered I/O, but other simple I/O scenarios such as remote procedure call appear to be problematic, at least at present.
However, anyone who has worked on a large software artifact will have encountered situations where the ability to do even very small and restricted transactions would be extremely helpful. Even the relatively slow software TM (STM) implementations could do well in such situations, as the competing deadlock-avoidance/recovery schemes can be quite complex and slow in and of themselves.
That said, if TM is to achieve the researcher's great ambitions, it will need to interact with other mechanisms, be they I/O, system calls, or other synchronization primitives, with this latter being critically important when converting legacy software to TM. To this end, this series of posts will look at challenges to TM posed by other mechanisms, along with alternative resolutions to these challenges.
However, anyone who has worked on a large software artifact will have encountered situations where the ability to do even very small and restricted transactions would be extremely helpful. Even the relatively slow software TM (STM) implementations could do well in such situations, as the competing deadlock-avoidance/recovery schemes can be quite complex and slow in and of themselves.
That said, if TM is to achieve the researcher's great ambitions, it will need to interact with other mechanisms, be they I/O, system calls, or other synchronization primitives, with this latter being critically important when converting legacy software to TM. To this end, this series of posts will look at challenges to TM posed by other mechanisms, along with alternative resolutions to these challenges.
Yes, the Linux Plumbers Conference is filling up quite quickly. A few short weeks ago we were just barely one-third full, but this past week we blew through the three-quarters mark.
So, if you are one of those enlightened people who intends to hold off registering until September, thus providing the full measure of support for Linux Plumbing via the late fee... Well, I do very much appreciate the thought, and I just love the way you think, but it is my painful duty to inform you that there is a very real chance that the conference will fill up before the sun rises to greet the first day of September. If that happens, the unfortunate fact is that you won't be able to register, and you will be excluded from the conference.
Now, I don't want that to happen to you, so please don't miss out! Please register now before the conference fills up!!!
So, if you are one of those enlightened people who intends to hold off registering until September, thus providing the full measure of support for Linux Plumbing via the late fee... Well, I do very much appreciate the thought, and I just love the way you think, but it is my painful duty to inform you that there is a very real chance that the conference will fill up before the sun rises to greet the first day of September. If that happens, the unfortunate fact is that you won't be able to register, and you will be excluded from the conference.
Now, I don't want that to happen to you, so please don't miss out! Please register now before the conference fills up!!!
