
We have all suffered from changing requirements. We are almost done with implementation, maybe even most of the way through testing, and then a change in requirements invalidates all of our hard work. This can of course be extremely irritating.

It turns out that changing requirements are not specific to software, nor are they anything new. Here is an example from my father's area of expertise, which is custom-built large-scale food processing equipment.

My father's company was designing a new-to-them piece of equipment, namely a spinach chopper able to chop 15 tons of spinach per hour. Given that this was a new area, they built a small-scale prototype with which they ran a number of experiments on “small” batches of spinach (that is, less than a ton of spinach). The customer provided feedback on the results of these experiments, which fed into the design of the full-scale production model.

After the resulting spinach chopper had been in production for a short while, there was a routine follow-up inspection. The purpose of the inspection was to check for unexpected wear and other unanticipated problems with the design and fabrication. While the inspector was there, the chopper kicked a large rock into the air. It turned out that spinach from the field can contain rocks up to six inches (15cm) in diameter, and this requirement had not been communicated during development. This situation of course inspired a quick engineering change, installing a cage that captured the rocks.

Of course, it is only fair to point out that this requirement didn't actually change. After all, spinach from the field has always contained rocks up to six inches in diameter. There had been a failure to communicate an existing requirement, not a change in the actual requirement.

However, from the viewpoint of the engineers, the effect is the same. Regardless of whether there was an actual change in the requirements or the exposure of a previously hidden requirement, redesign and rework will very likely be required.

One other point of interest to those who know me... The spinach chopper was re-inspected some five years after it started operation. Its blades had never been sharpened, and despite encountering the occasional rock, they still did not need sharpening. So to those of you who occasionally accuse me of over-engineering, all I can say is that I come by it honestly! ;-)
The first edition of “Is Parallel Programming Hard, And, If So, What Can You Do About It?” is now available in electronic form. The dead-tree version will be available a bit later, or, as they say, Real Soon Now.

Big-animal changes over the early 2013 version include:

  1. Some material buried in the introduction and a few key Quick Quizzes has been pulled up into a new “How To Use This Book” chapter.
  2. There is significantly more material in the “Beyond Partitioning” section.
  3. A new “Hazard Pointers” section has been added to the “Deferred Processing” chapter.
  4. The “Data Structures” chapter has been filled out.
  5. Formal verification has been moved from an appendix to a new chapter following the pre-existing “Validation” chapter.
  6. A new “Putting It All Together” chapter contains some case studies.
  7. New cartoons have been added and some of the old cartoons have been updated.

Interestingly enough, even before its first edition, this book has seen some use in various classrooms. The memory-barrier information and the “Counting” chapter seem to be the most popular in that environment.

So what next?

Why, the second edition, of course!!! What else? :-)

Parallel Programming: January 2014 Update

This release of Is Parallel Programming Hard, And, If So, What Can You Do About It? features filled-out data-structures, debugging, and advanced-synchronization chapters; the addition of hazard pointers to the deferred-processing chapter; two new cartoons; and random updates to a number of other chapters. It also includes contributions from Angela Demke Brown, Darren Hart, Ralf Moeller, Charles-François Natali, Borislav Petkov, Anatol Pomozov, Alexey Roytman, and Frederic Weisbecker.

This release is a bit unusual in that it is a release candidate for a first edition. Although I still stand by my 2011 statement that this book will never be finished, some readers have been urging me to finish it, and releasing a first edition seems as good a way as any of resolving these two conflicting impulses. I expect to release the first edition in a month or two, and will of course address any editorial feedback I receive during that time. Another request from readers has been to make hardcopies easier to obtain, which I find quite surprising, but which I will nevertheless do for the first edition. In fact, a few of the people who do the best job of providing editorial feedback on the release candidates will be eligible to receive rare signed first editions. The decision of the judge (which would be me) as to who is eligible to receive signed first editions is final.

What is “editorial feedback”? This is hard to define precisely, but would include grammar fixes, fixes for bugs in example code, and small expansions or clarifications to text. It does not include things like adding a chapter on update-friendly mechanisms or on GPUs and FPGAs. In other words, “editorial feedback” consists of fixes that can reasonably be accommodated at the current release-candidate stage.

As always, git:// will be updated in real time.
I recently refactored my rcutorture tests, getting that item off of my todo list after “only” three or four years. To my surprise, one (but only one) of the new rcutorture test scenarios failed, namely the one creatively named TREE08:


One reason for my surprise was that I had been running similar rcutorture scenarios for some years. In sharp (and happy) contrast to previous efforts featuring but a handful of failures in a given ten-hour run, the effect was not subtle. In fact, TREE08 often failed more than 20 times per minute.

My first thought was that I had blown a recent commit, but testing earlier kernel versions gave similar failure rates.

Additional testing and analysis showed that the bug was quite sensitive to the exact values of the Kconfig parameters. For example, setting CONFIG_RCU_FANOUT=2 or CONFIG_RCU_FANOUT=4 resulted in rock-solid behavior, despite CONFIG_RCU_FANOUT=3 failing miserably. Similarly, setting CONFIG_RCU_FANOUT_EXACT=n (which is thankfully the default) also resulted in rock-solid behavior. This fortunately provides not only a workaround for the bug, but, even better, a workaround that is enabled by default. This enabled-by-default workaround did much to explain the total lack of bug reports.

This sensitivity to a few Kconfig parameters implicated the boot-time code that sets up the rcu_node structures. I therefore ran this boot-time setup code in usermode scaffolding, resulting in the following output:

rcu_fanout_leaf = 16, nr_cpu_ids = 3
NR_CPUS: 16, RCU_NUM_LVLS: 3, rcu_num_lvls: 3, rcu_num_nodes: 12
num_rcu_lvl[0]: 1
num_rcu_lvl[1]: 3
num_rcu_lvl[2]: 8
num_rcu_lvl[3]: 16
num_rcu_lvl[4]: 0
rsp->levelspread[0]: 2
rsp->levelspread[1]: 3
rsp->levelspread[2]: 3
CPU 0: rdp=0x804c420 rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 1: rdp=0x804c4cc rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 2: rdp=0x804c578 rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 3: rdp=0x804c624 rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 4: rdp=0x804c6d0 rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 5: rdp=0x804c77c rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 6: rdp=0x804c828 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 7: rdp=0x804c8d4 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 8: rdp=0x804c980 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU 9: rdp=0x804ca2c rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 10: rdp=0x804cad8 rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 11: rdp=0x804cb84 rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 12: rdp=0x804cc30 rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 13: rdp=0x804ccdc rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 14: rdp=0x804cd88 rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU 15: rdp=0x804ce34 rnp=0x804b2e8 (15-15) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)

The rdp= numbers show each CPU's rcu_data structure addresses, and the sequence of rnp= numbers show the rcu_node structures when traversing upwards from the corresponding CPU to the root of the rcu_node tree. The numbers in parentheses show which CPUs are handled by each rcu_node structure. The initial lines simply show the values of a few key boot-time macros, variables, and arrays.

Had I been paying attention, I would have noticed that this output shows two very odd things, one of which is called out twice. However, some of my assumptions blinded me to these warning signs.

I therefore enabled RCU event tracing, with the usual false start where I failed to build the kernel with CONFIG_RCU_TRACE=y. This resulted in a 20MB trace, roughly 15MB of which was required to show the entire grace-period dance, starting from the callback being registered and ending with rcutorture detecting an RCU bug. The following entries caught my eye:

rcu_pree-7 0d..2 5852674733: rcu_grace_period_init: rcu_preempt 18446744073709551553 0 0 15 3
rcu_pree-7 0d..2 5852674946: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 0 8 7
rcu_pree-7 0d..2 5852675138: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 9 15 7
rcu_pree-7 0d..2 5852675313: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 18 15 0
rcu_pree-7 0d..2 5852675631: rcu_grace_period: rcu_preempt 18446744073709551553 cpustart
rcu_pree-7 0d..2 5852675802: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 0 2 7
rcu_pree-7 0d..2 5852675974: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 3 5 7
rcu_pree-7 0d..2 5852676148: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 6 8 7
rcu_pree-7 0d..2 5852676321: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 9 11 7
rcu_pree-7 0d..2 5852676519: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 12 14 7
rcu_pree-7 0d..2 5852676691: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 15 15 1
rcu_pree-7 0d..2 5852676862: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 18 15 0
rcu_pree-7 0d..2 5852677034: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 21 15 0

The smoking gun is the fourth line, the one containing the “18 15”. This indicates that the rcu_node structure corresponding to this trace event covers all CPUs whose numbers (as listed by smp_processor_id()) are greater than or equal to 18 and also less than or equal to 15. There are of course no such CPUs. The last two lines show similar brokenness.

These trace events narrowed the bug down to the following code fragment in kernel/rcutree.c:

#ifdef CONFIG_RCU_FANOUT_EXACT
static void __init rcu_init_levelspread(struct rcu_state *rsp)
{
        int i;

        for (i = rcu_num_lvls - 1; i > 0; i--)
                rsp->levelspread[i] = CONFIG_RCU_FANOUT;
        rsp->levelspread[0] = rcu_fanout_leaf;
}
#else /* #ifdef CONFIG_RCU_FANOUT_EXACT */
static void __init rcu_init_levelspread(struct rcu_state *rsp)
{
        int ccur;
        int cprv;
        int i;

        cprv = nr_cpu_ids;
        for (i = rcu_num_lvls - 1; i >= 0; i--) {
                ccur = rsp->levelcnt[i];
                rsp->levelspread[i] = (cprv + ccur - 1) / ccur;
                cprv = ccur;
        }
}
#endif /* #else #ifdef CONFIG_RCU_FANOUT_EXACT */

Can you spot the problem and the fix?

Al Viro came up with an example similar to this gem, where A and B are both initially zero:

CPU 0:
  rcu_read_lock();
  B = 1;
  r1 = A;
  rcu_read_unlock();

CPU 1:
  A = 1;
  synchronize_rcu();
  r2 = B;

Al's question is whether the outcome r1 == 0 && r2 == 0 is possible after both code fragments complete and all the dust settles.

To answer this question, we will use the fundamental property of RCU, which guarantees that an RCU read-side critical section cannot span a grace period. In the above example, the RCU read-side critical section is the code between rcu_read_lock() and rcu_read_unlock(), and the RCU grace period is provided by synchronize_rcu(). It turns out that this rather weak guarantee is actually quite powerful, as we will see in this example.

Now we know that if r1 == 0, then CPU 0's load from A must have preceded CPU 1's store to A, and thus also have preceded the grace period provided by synchronize_rcu(). Then RCU guarantees that CPU 0's store to B also preceded that same grace period. Because CPU 1's load from B follows the grace period, it must see the results of CPU 0's store, which means that if r1 == 0 we know that r2 == 1. Therefore, RCU prohibits Al's outcome of r1 == 0 && r2 == 0, as Linus Torvalds surmised.

You can also start by assuming that r2 == 0, but this is left as an exercise for the reader.

Let us instead proceed to a more elaborate example:

CPU 0:
  rcu_read_lock();
  r1 = A;
  r2 = B;
  rcu_read_unlock();

CPU 1:
  rcu_read_lock();
  r3 = B;
  r4 = A;
  rcu_read_unlock();

CPU 2:
  A = 1;
  synchronize_rcu();
  B = 1;

Is the result r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0 possible? Why or why not?
The computing field has had its share of holy wars over the decades, with the subject of one famous 1970s/1980s holy war being endianness. Of course, endianness within a system refers to the mapping between numerical values and character arrays, so that for little-endian the numerical value 1 results in the first character of the corresponding array being non-zero, while for big-endian it results in the last character of the corresponding array being non-zero. For more on endianness, see the Wikipedia article.

Although many of the “cool kids” of the 1980s (such as Apple and Sun) were big-endian, the vast majority of today's Linux-capable systems are little-endian. This of course includes x86, but it also includes ARM, whose partners shipped enough ARM CPUs to give one to each man, woman, and child on this planet—and that only counts the ARM CPUs shipped in 2012.

Although most modern software is endian-neutral, device drivers and related software do sometimes make assumptions about endianness. In addition, someone considering creating special-purpose hardware would likely do a little-endian implementation first and foremost, and a big-endian implementation later, if at all. Furthermore, because single-threaded CPU throughput is not rising anywhere near as quickly as it was back in the 1980s and 1990s, special-purpose hardware is becoming increasingly important.

So what is a big-endian architecture like Power supposed to do?

The answer turns out to be “Both!!!”

Power hardware has long supported both big-endian and little-endian byte ordering, and the toolchain has had prototype little-endian support for some years. Furthermore, if you have been paying close attention, you will have noticed that the toolchain's little-endian support has received significant care and feeding over the past few months. Expect to see patches for the Linux kernel, QEMU, and other packages soon, from IBM and from others. IBM's partners in this effort include Google, Mellanox, NVIDIA, and Tyan.

Should be exciting! ;-)
Last year, I noted that hardware transactional memory (HTM) announcements lacked forward-progress guarantees. As noted in that posting:

Until HTM implementations provide some sort of forward-progress guarantee, HTM will be limited by its fallback code sequences. For example, if the fallback code uses locking, then HTM will be just as vulnerable to deadlock as is the fallback code.

And during the past year, the IBM Mainframe announced an HTM implementation that includes constrained transactions in addition to the usual best-effort HTM implementation. A constrained transaction starts with the tbeginc instruction instead of the tbegin instruction that is used for best-effort transactions. Constrained transactions are guaranteed to always complete (eventually), so if a transaction aborts, rather than branching to a fallback path (as is done for best-effort transactions), the hardware instead restarts the transaction at the tbeginc instruction.

The Mainframe architects needed to take extreme measures to deliver on this forward-progress guarantee. If a given constrained transaction repeatedly fails, the CPU might disable branch prediction, force in-order execution, and even completely disable pipelining. If the repeated failures are due to high contention, the CPU might disable speculative fetches, introduce random delays, and even serialize execution of the conflicting CPUs. “Interesting” forward-progress scenarios involve as few as two CPUs or as many as one hundred CPUs. Perhaps these extreme measures provide some insight as to why other CPUs have thus far refrained from offering constrained transactions.

As the name implies, constrained transactions are in fact severely constrained:

  1. The maximum data footprint is four blocks of memory, where each block can be no larger than 32 bytes.
  2. The maximum code footprint is 256 bytes.
  3. If a given 4K page contains a constrained transaction’s code, then that page may not contain that transaction’s data.
  4. The maximum number of assembly instructions that may be executed is 32.
  5. Backwards branches are forbidden.

Nevertheless, these constraints support a number of important data structures, including linked lists, stacks, queues, and arrays. Perhaps HTM with constrained transactions will eventually become an important tool in the parallel programmer's toolbox.
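As a rough illustration of the sort of operation that fits within these constraints, consider a push onto a linked stack. The sketch below is mine, not mainframe code: since portable transactional hardware cannot be assumed, GCC's __atomic compare-and-swap stands in for the tbeginc/tend pair. Note that in a real constrained transaction the retry loop would be absent, because the hardware itself restarts execution at the tbeginc instruction on abort.

```c
#include <stddef.h>

struct node {
	struct node *next;
	int val;
};

/*
 * Push a node onto a stack.  The tiny body (one load plus two stores)
 * easily fits within a constrained transaction's four-block data
 * footprint and 32-instruction limit.  Here the retry is explicit;
 * with tbeginc/tend the hardware would restart the body on abort.
 */
static void stack_push(struct node **top, struct node *n)
{
	struct node *old;

	do {
		old = __atomic_load_n(top, __ATOMIC_RELAXED);
		n->next = old;
	} while (!__atomic_compare_exchange_n(top, &old, n, 0,
					      __ATOMIC_RELEASE,
					      __ATOMIC_RELAXED));
}
```

The attraction of the constrained-transaction version is precisely that no fallback path need be written or tested: forward progress is the hardware's problem.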


I had the privilege of attending this year's USENIX Workshop on Hot Topics in Parallelism (HOTPAR), which was as always an interesting gathering. One very positive change compared to the first HOTPAR in 2009 is that the participants seemed much more comfortable with parallelism. This is not to say that I agreed with all viewpoints put forward (quite the contrary, as other attendees can attest!), but rather that the discussions this year seemed to be driven by actual experience, in happy contrast with the first year's tendency towards conceptual opinions.
There were also more talks stretching beyond pure scalability. Some areas follow, along with examples: from the workshop, from the Linux community, of things needing doing, and of things that will likely be done to us.

The first area is extreme scale, with Bill Dally's keynote presentation being the best example. Over the next seven years, Bill expects a 50x performance improvement provided by systems sporting 18,000 GPUs with no fewer than ten billion concurrently executing threads. Yes, Bill expects systems to achieve 1,000 PFLOPS by the year 2020. There was less discussion of single-system extreme scale; in fact, a number of participants seemed quite surprised that the Linux kernel could run on systems with 4096 CPUs (admittedly with severely constrained workloads).

The second area is energy efficiency, where Bill again put forward some interesting predictions. You see, he calls for this 50x performance increase to be provided by a system drawing only 2x the power of current extreme-scale supercomputers. My paper and posters were also mainly about energy efficiency, albeit for much smaller devices. Unfashionable though it might be to admit this, much of my work on energy efficiency has felt like something being done to me. :-)

The third area is predictability, in this case illustrated by a lightning talk on capacity planning from Greg Bronevetsky. Of course, real-time response is another example of predictability, and many attendees were surprised that the Linux kernel's -rt patchset could achieve latencies in the low tens of microseconds. At a larger scale and at longer response times, Eric Brewer's Parallelism in the Cloud keynote discussed throughput/latency tradeoffs in cloud-computing environments, with the usual lament that many mechanisms that improve throughput degrade latency, which also qualifies as something being done to us. The saving grace for most cloud environments is that a large chunk of the cloud-computing workload is time-insensitive batch processing, which allows the cloud to run at reasonable utilization levels while still meeting interactive response-time goals. Interestingly enough, Berkeley is getting back into the OS business, working on an OS that provides just enough functionality for cloud-based applications. For example, this OS provides only rudimentary scheduling, with more complex scheduling policies being implemented by user programs.

The fourth area is heterogeneous computing, with Bill Dally's keynote being the primary case in point. Sheffield, Anderson, and Keutzer presented on Three-Fingered Jack, which allows Python programs to use SIMD vector units. Tillet, Rupp, Selberherr, and Lin presented Towards Performance-Portable, Scalable, and Convenient Linear Algebra, which discussed performance portability across multiple GPUs. They were able to automatically generate OpenCL code that beat the best hand-generated code, which I take as a sign that GPGPUs are finally coming of age. Perhaps GPUs will one day feel more like an organic part of the overall computing system.

The fifth area is software-engineering implications. The discussion in this area has advanced significantly since 2009; for example, it was good to see transactional-memory researchers taking debugging seriously (Gottschlich, Knauerhase, and Pokam, “But How Do We Really Debug Transactional Memory Programs?”). They proposed additional record-replay hardware support, which has a number of interesting issues, including the need to ensure that other CPUs replay in a manner consistent with the CPU that is executing the transaction that is being debugged. Another approach is to allow non-transactional accesses within a transaction, so that these non-transactional accesses are not rolled back should the transaction abort. This provides a straightforward printf-like capability without the need for replay. Such non-transactional accesses are supported on some commercial platforms, including Power (suspended transactions) and the mainframe (the non-transactional store instruction). Perhaps other hardware platforms supporting transactional memory will also gain support for non-transactional accesses within a transaction.

The sixth and last area is extreme productivity via application-specific approaches. Quite impressively, Best, Jacobsen, Vining, and Fedorova are looking to enable artists and designers to successfully exploit parallelism in Collection-focused Parallelism. This talk recalled to mind how much the spreadsheet, word processor, and presentation manager did for PC uptake in the 1980s, in stark contrast to any number of high-minded language-based innovations. As I have said before, it seems likely that application-specific tools will provide the best path towards ubiquitous parallel computing. It is certainly the case that other engineering fields have specialized over time, and it would be quite surprising if computing were to prove the sole exception to this rule.

There were other papers as well, which can be downloaded from the conference website. One talk deserving special mention is Martin Rinard's Parallel Synchronization-Free Approximate Data Structure Construction, which uses approximate data-structure construction for a digital orrery (similar to his earlier talk at RACES'2012). It is always good to have Martin around, as his ideas are perceived by many to be even crazier than RCU.

Finally, it is important to note that it will not be sufficient to do well in only one or two of these areas, craziness included. Parallel systems of the future must do all of this simultaneously, which means that there is no shortage of parallel-programming work left to be done!
I have been using xfig for a very long time, almost as long as I have been using gnuplot. But xfig has been getting a bit cranky lately, mostly in terms of font handling. I suspect that it is possible to make it handle fonts like it used to, but I decided to take this as a hint to try something that might actually be younger than the typical Linux kernel hacker. (Yes, I am getting a bit old to engage in ageism, but there you have it!)

I had tried inkscape some years back, but at the time it was not ready for prime time, at least not from the perspective of a long-time xfig user. But I had recently received a .svg, and had installed inkscape in order to be able to work with it. Besides, some of the more recent browsers can render directly from .svg, which may in the fullness of time remove the need to generate bitmap files for HTML documents.

So I gave inkscape a try.

The first pleasant surprise is that inkscape is able to import xfig's .fig file format. This import process is not perfect, for example, the fonts do not match exactly and arrowheads are sometimes imported as objects separate from the line that they are supposed to be attached to, but it is much nicer than recreating the diagram from scratch. In addition, in many cases, the import imperfections are not a problem, such as when the goal is simply to add something to the figure.

Of course, the menu layout is completely different than that of xfig, but this is not always a bad thing. For example, even given long familiarity with xfig, I found inkscape's object rotation to be much more powerful and easier to use than that of xfig. Object alignment and distribution are also much nicer in inkscape. The manual canvas configuration in inkscape is a step back from xfig's automation, but it might well be that I just haven't yet found the corresponding inkscape setting. Finally, the ability to directly generate .pdf files meshes more smoothly with pdflatex, which I use heavily. The fact that the generated files come out rotated 90 degrees was a bit surprising at first, but the \rotatebox{270} directive in LaTeX takes care of that.
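For concreteness, a pdflatex fragment along these lines might look as follows; the file name diagram.pdf is of course made up:

```latex
\documentclass{article}
\usepackage{graphicx} % provides \includegraphics and \rotatebox
\begin{document}
\begin{figure}
  \centering
  % Compensate for the 90-degree rotation of the exported .pdf.
  \rotatebox{270}{\includegraphics[width=.6\textwidth]{diagram.pdf}}
  \caption{A figure exported directly from inkscape.}
\end{figure}
\end{document}
```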

So who knows? After more years than I care to recall, it might finally be time to bid xfig a fond farewell.
Suppose that you have an initially empty RCU-protected hash table with per-bucket-locked updates. Suppose that one thread concurrently inserts items A and B in that order (but into different buckets) while a second thread concurrently looks up item A then item B—and while yet a third thread concurrently looks up these two items in the opposite order. The code for these three threads might look something like the following:

 1 void t1(void)
 2 {
 3   spin_lock(chain(A));
 4   insert(A);
 5   spin_unlock(chain(A));
 6   spin_lock(chain(B));
 7   insert(B);
 8   spin_unlock(chain(B));
 9 }
11 void t2(void)
12 {
13   rcu_read_lock();
14   l1 = lookup(A);
15   rcu_read_unlock();
16   rcu_read_lock();
17   l2 = lookup(B);
18   rcu_read_unlock();
19 }
21 void t3(void)
22 {
23   rcu_read_lock();
24   l3 = lookup(B);
25   rcu_read_unlock();
26   rcu_read_lock();
27   l4 = lookup(A);
28   rcu_read_unlock();
29 }

Because there is absolutely nothing guaranteeing the order of t2()'s and t3()'s lookups, it is quite possible that we could end up with l1==1&&l2==0&&l3==1&&l4==0, which would mean that t2() and t3() disagree on the order of insertion. However, this outcome is excluded if we place a synchronize_rcu() between lines 5 and 6 above.

But how can synchronize_rcu() possibly exclude this outcome given that t2()'s and t3()'s lookups can still be reordered?