
Last year, I noted that hardware transactional memory (HTM) announcements lacked forward-progress guarantees. As noted in that posting:

Until HTM implementations provide some sort of forward-progress guarantee, HTM will be limited by its fallback code sequences. For example, if the fallback code uses locking, then HTM will be just as vulnerable to deadlock as is the fallback code.

During the past year, IBM announced a mainframe HTM implementation that includes constrained transactions in addition to the usual best-effort HTM implementation. A constrained transaction starts with the tbeginc instruction instead of the tbegin instruction that is used for best-effort transactions. Constrained transactions are guaranteed to (eventually) complete, so if a transaction aborts, rather than branching to a fallback path (as is done for best-effort transactions), the hardware instead restarts the transaction at the tbeginc instruction.

The Mainframe architects needed to take extreme measures to deliver on this forward-progress guarantee. If a given constrained transaction repeatedly fails, the CPU might disable branch prediction, force in-order execution, and even completely disable pipelining. If the repeated failures are due to high contention, the CPU might disable speculative fetches, introduce random delays, and even serialize execution of the conflicting CPUs. “Interesting” forward-progress scenarios involve as few as two CPUs or as many as one hundred CPUs. Perhaps these extreme measures provide some insight as to why other CPUs have thus far refrained from offering constrained transactions.

As the name implies, constrained transactions are in fact severely constrained:

  1. The maximum data footprint is four blocks of memory, where each block can be no larger than 32 bytes.
  2. The maximum code footprint is 256 bytes.
  3. If a given 4K page contains a constrained transaction’s code, then that page may not contain that transaction’s data.
  4. The maximum number of assembly instructions that may be executed is 32.
  5. Backwards branches are forbidden.

Nevertheless, these constraints support a number of important data structures, including linked lists, stacks, queues, and arrays. Perhaps HTM with constrained transactions will eventually become an important tool in the parallel programmer's toolbox.


I had the privilege of attending this year's USENIX Workshop on Hot Topics in Parallelism (HOTPAR), which was as always an interesting gathering. One very positive change compared to the first HOTPAR in 2009 is that the participants seemed much more comfortable with parallelism. This is not to say that I agreed with all viewpoints put forward (quite the contrary, as other attendees can attest!), but rather that the discussions this year seemed to be driven by actual experience, in happy contrast with the first year's tendency towards conceptual opinions.
There were also more talks stretching beyond pure scalability. Several areas stood out, with examples drawn from the workshop, from the Linux community, from things that still need doing, and from things that will likely be done to us.

The first area is extreme scale, with Bill Dally's keynote presentation being the best example. Over the next seven years, Bill expects a 50x performance improvement provided by systems sporting 18,000 GPUs with no fewer than ten billion concurrently executing threads. Yes, Bill expects systems to achieve 1,000 PFLOPS by the year 2020. There was less discussion of single-system extreme scale; in fact, a number of participants seemed quite surprised that the Linux kernel could run on systems with 4096 CPUs (admittedly with severely constrained workloads).

The second area is energy efficiency, where Bill again put forward some interesting predictions. You see, he calls for this 50x performance increase to be provided by a system drawing only 2x the power of current extreme-scale supercomputers. My paper and posters were also mainly about energy efficiency, albeit for much smaller devices. Unfashionable though it might be to admit this, much of my work on energy efficiency has felt like something being done to me. :-)

The third area is predictability, exemplified in this case by a lightning talk on capacity planning from Greg Bronevetsky. Of course, real-time response is another example of predictability, and many attendees were surprised that the Linux kernel's -rt patchset could achieve latencies in the low tens of microseconds. At a larger scale and at longer response times, Eric Brewer's Parallelism in the Cloud keynote discussed throughput/latency tradeoffs in cloud-computing environments, with the usual lament that many mechanisms that improve throughput degrade latency, which also qualifies as something being done to us. The saving grace for most cloud environments is that a large chunk of the cloud-computing workload is time-insensitive batch processing, which allows the cloud to run at reasonable utilization levels while still meeting interactive response-time goals. Interestingly enough, Berkeley is getting back into the OS business, working on an OS that provides just enough functionality for cloud-based applications. For example, this OS provides only rudimentary scheduling, with more complex scheduling policies being implemented by user programs.

The fourth area is heterogeneous computing, with Bill Dally's keynote being the primary case in point. Sheffield, Anderson, and Keutzer presented on Three-Fingered Jack, which allows Python programs to use SIMD vector units. Tillet, Rupp, Selberherr, and Lin presented Towards Performance-Portable, Scalable, and Convenient Linear Algebra, which discussed performance portability across multiple GPUs. They were able to automatically generate OpenCL code that beat the best hand-generated code, which I take as a sign that GPGPUs are finally coming of age. Perhaps GPUs will one day feel more like an organic part of the overall computing system.

The fifth area is software-engineering implications. The discussion in this area has advanced significantly since 2009; for example, it was good to see transactional-memory researchers taking debugging seriously (Gottschlich, Knauerhase, and Pokam, But How Do We Really Debug Transactional Memory Programs?). They proposed additional record-replay hardware support, which has a number of interesting issues, including the need to ensure that other CPUs replay in a manner consistent with the CPU that is executing the transaction that is being debugged. Another approach is to allow non-transactional accesses within a transaction, so that these non-transactional accesses are not rolled back should the transaction abort. This provides a straightforward printf-like capability without the need for replay. Such non-transactional accesses are supported on some commercial platforms, including Power (suspended transactions) and the mainframe (the non-transactional store instruction). Perhaps other hardware platforms supporting transactional memory will also gain support for non-transactional accesses within a transaction.

The sixth and last area is extreme productivity via application-specific approaches. Quite impressively, Best, Jacobsen, Vining, and Fedorova are looking to enable artists and designers to successfully exploit parallelism in Collection-focused Parallelism. This talk recalled to mind how much the spreadsheet, word processor, and presentation manager did for PC uptake in the 1980s, in stark contrast to any number of high-minded language-based innovations. As I have said before, it seems likely that application-specific tools will provide the best path towards ubiquitous parallel computing. It is certainly the case that other engineering fields have specialized over time, and it would be quite surprising if computing were to prove the sole exception to this rule.

There were other papers as well, which can be downloaded from the conference website. One talk deserving special mention is Martin Rinard's Parallel Synchronization-Free Approximate Data Structure Construction, which uses approximate data-structure construction for a digital orrery (similar to his earlier talk at RACES'2012). It is always good to have Martin around, as his ideas are perceived by many to be even crazier than RCU.

Finally, it is important to note that it will not be sufficient to do well in only one or two of these areas, craziness included. Parallel systems of the future must do all of this simultaneously, which means that there is no shortage of parallel-programming work left to be done!
I have been using xfig for a very long time, almost as long as I have been using gnuplot. But xfig has been getting a bit cranky lately, mostly in terms of font handling. I suspect that it is possible to make it handle fonts like it used to, but I decided to take this as a hint to try something that might actually be younger than the typical Linux kernel hacker. (Yes, I am getting a bit old to engage in ageism, but there you have it!)

I had tried inkscape some years back, but at the time it was not ready for prime time, at least not from the perspective of a long-time xfig user. But I had recently received a .svg, and had installed inkscape in order to be able to work with it. Besides, some of the more recent browsers can render directly from .svg, which may in the fullness of time remove the need to generate bitmap files for HTML documents.

So I gave inkscape a try.

The first pleasant surprise is that inkscape is able to import xfig's .fig file format. This import process is not perfect; for example, the fonts do not match exactly, and arrowheads are sometimes imported as objects separate from the lines that they are supposed to be attached to, but it is much nicer than recreating the diagram from scratch. In addition, in many cases, the import imperfections are not a problem, such as when the goal is simply to add something to the figure.

Of course, the menu layout is completely different from that of xfig, but this is not always a bad thing. For example, even given long familiarity with xfig, I found inkscape's object rotation to be much more powerful and easier to use than that of xfig. Object alignment and distribution are also much nicer in inkscape. The manual canvas configuration in inkscape is a step back from xfig's automation, but it might well be that I just haven't yet found the corresponding inkscape setting. Finally, the ability to directly generate .pdf files works more smoothly with pdflatex, which I use heavily. The fact that the generated figures are rotated 90 degrees was a bit surprising at first, but the \rotatebox{270} directive in LaTeX takes care of that.
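For the record, the un-rotation looks something like this (figure.pdf is a hypothetical inkscape-exported filename, and \rotatebox comes from the graphicx package):

```latex
% In the preamble:
\usepackage{graphicx}

% In the document body, un-rotate the inkscape-generated PDF:
\rotatebox{270}{\includegraphics[width=0.8\columnwidth]{figure}}
```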

So who knows? After more years than I care to recall, it might finally be time to bid xfig a fond farewell.
Suppose that you have an initially empty RCU-protected hash table with per-bucket-locked updates. Suppose that one thread concurrently inserts items A and B in that order (but into different buckets) while a second thread concurrently looks up item A then item B—and while yet a third thread concurrently looks up these two items in the opposite order. The code for these three threads might look something like the following:

 1 void t1(void)
 2 {
 3   spin_lock(chain(A));
 4   insert(A);
 5   spin_unlock(chain(A));
 6   spin_lock(chain(B));
 7   insert(B);
 8   spin_unlock(chain(B));
 9 }
10
11 void t2(void)
12 {
13   rcu_read_lock();
14   l1 = lookup(A);
15   rcu_read_unlock();
16   rcu_read_lock();
17   l2 = lookup(B);
18   rcu_read_unlock();
19 }
20
21 void t3(void)
22 {
23   rcu_read_lock();
24   l3 = lookup(B);
25   rcu_read_unlock();
26   rcu_read_lock();
27   l4 = lookup(A);
28   rcu_read_unlock();
29 }

Because there is absolutely nothing guaranteeing the order of t2()'s and t3()'s lookups, it is quite possible that we could end up with l1 == 1 && l2 == 0 && l3 == 1 && l4 == 0, which would mean that t2() and t3() disagree on the order of insertion. However, this outcome is excluded if we place a synchronize_rcu() between lines 5 and 6 above.

But how can synchronize_rcu() possibly exclude this outcome given that t2()'s and t3()'s lookups can still be reordered?

Parallel Programming: January 2013 Update

This release of Is Parallel Programming Hard, And, If So, What Can You Do About It? features the (very skeletal) beginnings of the data-structures chapter, twelve new cartoons, and general updates to the first seven chapters. It also includes contributions from Kornilios Kourtis, Namhyung Kim, Ricardo Fabbri and his class, and Yuchen Dai.

As always, git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git will be updated in real time.

RCU and Crowds

“A Read-Copy Update based parallel server for distributed crowd simulations” (paywalled) recently appeared in the Journal of Supercomputing. Vigueras et al. apply RCU to crowd simulation (as you might guess from the title), which combines throughput (“faster is better”) and response-time requirements (“never trust any result older than 250 milliseconds”). Use of RCU instead of straight locking allows them to support crowds with three times the number of agents while still meeting their response-time requirements.

It is good to see academics experimenting with RCU. Even better, I learned of this only after seeing the paper. ;-)

Off to the races!

I was fortunate enough to be able to attend the first workshop on Relaxing Synchronization for Multicore and Manycore Scalability (also known as “RACES'12”), which drew concurrency researchers from across academia and industry. The topic was somewhat controversial, quoting from the call for participation:

How can we cast aside the security of correctness, the logic of a proof, and adopt a new way of thinking, where answers are good enough but not certain, and where many processors work together in parallel without quite knowing the states that the others are in? We may need some amount of synchronization, but how much? Or better yet, how little? What mental tools and linguistic devices can we give programmers to help them adapt to this challenge?

The workshop's presenters expressed a number of views on this matter.

One group of talks covered applications using approximate methods, where the approximations might be due to inaccuracies in the input data, convergence criteria in iterative algorithms, use of discrete rather than continuous time intervals, and so on. The key point was that all of these applications maintained an error budget, and that this budget could just as well include a component for errors due to weak synchronization—or even due to entirely omitted synchronization. Applications included the Barnes-Hut gravitational multi-body solver (Martin Rinard); k-means clustering and breadth-first search of large graphs (Ravi Nair et al.); and web-based analytics (Michael Carbin et al.). Of course, synchronization-free numerical algorithms have been around for some decades for dense arrays, but these talks used more complex data structures. The talks were quite interesting, though I would have liked to see more consistent comparisons against unsynchronized single-threaded implementations. After all, eliminating synchronization does not necessarily eliminate synchronization overhead in the form of communications cache misses. It nevertheless felt quite good to see people taking even more aggressive approaches to parallelism than I normally take. ;-)

There was considerable discussion as to when inaccuracy due to relaxed synchronization could be tolerated. Martin Rinard pointed out that the relaxed Barnes-Hut program actually provided greater accuracy than did some of the more heavily synchronized variants, but the fact that the unsynchronized version could omit objects completely caused some consternation. Of course, Barnes-Hut's use of centroids to represent large groups of bodies is inherently approximate in any case: Reasonable accuracy is achieved only when the ratio of the distances to the nearest object in the group and to the centroid of that group is reasonably close to 1.0. Doug Lea eventually closed this discussion by suggesting that: (1) Any algorithm willing to tolerate the approximations inherent in floating-point arithmetic should also be willing to tolerate the approximations inherent in relaxed or omitted synchronization, but that (2) algorithms using exact computations might wish to think twice before omitting quite so much synchronization. I suspect that important future work will include classifying what sorts of errors and non-determinism are acceptable in a given situation.

Hans-J. Boehm's talk on the evils of data races exposed a sharp difference in the definition of “data race”. A common definition is “at least two concurrent unsynchronized accesses, at least one of which is a write,” but it turned out that there was little agreement on what constitutes “unsynchronized.” Martin Rinard argued that a synchronized access should involve locking, atomic instructions, or at the very least a memory barrier, while Hans asserted that as long as the developer informed the compiler that the memory location in question was subject to concurrent conflicting accesses, the corresponding accesses were in fact synchronized. Hans's version of the definition has the interesting property that exactly the same assembly code might be emitted for a data-race-free program as for the analogous program having data races, but on the other hand, Hans's definition allows Martin to write his racy algorithms in C++11 without having to drop into assembly language. I doubt that these two groups will come to agreement on this point any time soon, but such is life in computing. Therefore, if someone talks to you about data races, you would be wise to ask them exactly what definition they are using.

The next presentation examined the meaning of “FIFO” in concurrent systems. Andreas Haas et al. noted that the usual definitions of “perfect FIFO queues” used “linearization points” that are not visible to the caller, who can really only observe when the enqueue and dequeue operations begin and end. The authors therefore measured the deviation of the order in the queue from the order in which the enqueue operations began. They found that even strict FIFO-ordered queues suffered significant misordering. Interestingly enough, queues designed to emphasize performance over ordering actually produced more strongly ordered results than did the “perfect FIFO” queues, mainly because the shorter durations of the enqueue and dequeue operations exposed less opportunity for misordering. I found this to be the most interesting talk, as it gave yet another reason why it is a bad idea to push your parallel application's entire data set through a single queue.

Philip Howard presented a talk on relativistic programming, which sparked a debate as to whether such techniques were actually usable in practice. I eventually squelched the debate by pointing out that this technique is heavily used in the Linux kernel. Trey Cain discussed hardware techniques for gaining more performance and scalability from weakly ordered hardware, David Ungar raised the question of whether increased scalability and throughput always implied degraded latency, and Max Orhai presented on parallel sort on a spatial computer, that is, a computer whose computing elements are arranged in a linear array or a two-dimensional grid. Finally, Sasa Misailovic discussed a system that automatically removed synchronization from a program and then tested the degree of inaccuracy that this removal introduced. I am not so sure I would want to bet my life on a program produced by Sasa's system, but random removal of synchronization does sound like a good way to evaluate the effectiveness of test/validation suites for parallel programs.

My presentation made the claim that parallel programming could in fact be learned by large groups of people (paper, presentation). In keeping with most of the other presentations, this claim proved to be somewhat controversial.

All in all, this was an interesting and worthwhile workshop, and I look forward to similar gatherings in the future.

Plant your flag!

About fifty years ago, I received a wonderful gift from a friend of the family. It was a colorful flag, just the thing for a three-year-old boy. I ran around with it for a while, watching it flap merrily along. Then I planted it in the front garden, where I could see it waving in the wind from inside our house.

That evening was a bit unusual, as my father was unable to come home from work. My mother made dinner and then read to me by candlelight, frequently warning me to stay away from the living room's large picture window. Several times, I asked about the strange noises from outside, and each time my mother answered that it was the wind. I had some difficulty believing her, as I had heard wind many times before, and it had never sounded anything at all like what I was hearing that evening. My mother eventually put me to bed, and when I woke up the next morning, my father had returned home.

In short, at the time, I was not all that impressed by the Columbus Day Storm of October 12, 1962. Of course, anyone unfortunate enough to have experienced Hurricane Camille or Hurricane Katrina won't be much impressed, either, but then again, tropical hurricanes are in a class of their own.

Others were plenty impressed, though. Wind gusts in excess of 160 miles an hour were reported in several places, although the actual maximum wind speed was often unknown. In some cases, anemometers were pegged at their maximum readings for extended periods of time, in other cases the anemometers were blown off their roofs, and in at least one case the wind took out the entire weather station for good measure. There were casualties, but the Pacific Northwest being what it is, far more trees than people perished. Property damage was such that one of the casualties of the Columbus Day Storm was an insurance company.

One reason for the casualties was lack of preparation. You see, there were no weather satellites back then, and, unlike the Gulf Coast, the Pacific Northwest did not fly hurricane-hunter airplane patrols. The weather forecast called for a typical winter gale, and the only indication that this forecast might be overly optimistic was a radio report from a ship out in the Pacific. This report featured not only unbelievably high winds and seas, but also barometer readings that dropped by an equally unbelievable 0.66 inches of mercury over the course of only two hours. Because malfunctioning ship-board equipment is not all that uncommon, the weathermen upgraded their forecast only after barometers along the coast of northern California started dropping like rocks. But even then, they expected “only” 70-mile-per-hour winds.

It is an ill wind that blows no good, and this storm was no exception. For example, one man worked in southern Oregon and lived in Portland, at the northern end of the state. He spent his weeks at work and his weekends with his family. His drive up to Portland normally took a full tank of gas, but on that Friday evening, he made the trip wind-assisted on only a quarter tank. Other people driving north on I-5 that evening reported seeing barns keeping up with traffic. Needless to say, the casualties among livestock were not inconsequential.

But from my three-year-old viewpoint, it was not all that big a deal. Sure, the power went out, but my dad managed to scrounge a diesel generator the very next day. Not only did our unassuming house somehow manage to escape damage, but there were absolutely no fallen trees or power lines within sight of our house.

But my parents were completely unable to explain the disappearance of my prized flag to my satisfaction. They just kept repeating this insane story about how it had been blown away by the wind. ;-)
Linux Plumbers Conference this past week in San Diego was a great event, and in my role as Program Committee Chair, I thank the many people who provided the excellent content that makes this event what it is. The program committee had to make some tough decisions between worthy submissions. The submitters, both successful and otherwise, provided an excellent set of abstracts and presentations. The Microconference leads put together excellent events, including first-ever Plumbers Constraint-Framework, LLVM, and Android microconferences. Last, but by no means least, there was wonderful audience participation, including a vibrant hallway track. (A number of the people I informally polled called out the hallway track as the most valuable part of the conference, which I believe to be a very good thing indeed!)

I join all the attendees in thanking the Plumbers Organizing Committee and the Linux Foundation staff for providing the venue, scheduling, organization, food, and drink that powers these events.

And I am already looking forward to next year's Plumbers! I hope to see you there!!!


Parallel Programming: August 2012 Update

This release of the parallel programming book features a completion of the functional-testing validation chapter (performance testing likely gets a separate chapter), updates to the locking and transactional-memory sections, and lots of fixes, primarily from Namhyung Kim, but with a fix to a particularly gnarly memory-ordering example from David Ungar. (The ppcmem tool is even more highly recommended as a result.)

This release is rather late. This tardiness was in roughly equal parts due to:

  1. Inertia.
  2. Unexpected requests for more information on hardware transactional memory.
  3. Frequent changes in approach to the validation chapter.
  4. A sudden desire to add a third example to the "Partitioning and Synchronization Design" chapter. This was motivated by seeing people blog on the difficulty of solving mazes in parallel, and it would indeed be difficult if you confined yourself to their designs. However, the results were surprising, so much so that I published a paper describing a scalable solution, which was not simply embarrassingly parallel, but rather humiliatingly parallel. Which means that this chapter is still short a third example.

Future releases will hopefully return to the 3-4 per year originally intended.

As always, git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git will be updated in real time.