You are viewing paulmck

Suppose that you have an initially empty RCU-protected hash table with per-bucket-locked updates. Suppose that one thread concurrently inserts items A and B in that order (but into different buckets) while a second thread concurrently looks up item A then item B—and while yet a third thread concurrently looks up these two items in the opposite order. The code for these three threads might look something like the following:

 1 void t1(void)
 2 {
 3   spin_lock(chain(A));
 4   insert(A);
 5   spin_unlock(chain(A));
 6   spin_lock(chain(B));
 7   insert(B);
 8   spin_unlock(chain(B));
 9 }
11 void t2(void)
12 {
13   rcu_read_lock();
14   l1 = lookup(A);
15   rcu_read_unlock();
16   rcu_read_lock();
17   l2 = lookup(B);
18   rcu_read_unlock();
19 }
21 void t3(void)
22 {
23   rcu_read_lock();
24   l3 = lookup(B);
25   rcu_read_unlock();
26   rcu_read_lock();
27   l4 = lookup(A);
28   rcu_read_unlock();
29 }

Because there is absolutely nothing guaranteeing the order of t2()'s and t3's lookups, it is quite possible that we could end up with l1==1&&l2==0&&l3==1&&l4==0, which would mean that t2() and t3() disagree on the order of insertion. However, this outcome is excluded if we place a synchronize_rcu() between lines 5 and 6 above.

But how can synchronize_rcu() possibly exclude this outcome given that t2()'s and t3()'s lookups can still be reordered?

Parallel Programming: January 2013 Update

This release of Is Parallel Programming Hard, And, If So, What Can You Do About It? features the (very skeletal) beginnings of the data-structures chapter, twelve new cartoons, and general updates to the first seven chapters. It also includes contributions from Kornilios Kourtis, Namhyung Kim, Ricardo Fabbri and his class, and Yuchen Dai.

As always, git:// will be updated in real time.

RCU and Crowds

A Read-Copy Update based parallel server for distributed crowd simulations” (paywalled) recently appeared in Journal of Supercomputing. Vigueras et al. apply RCU to crowd simulation (as you might guess from the title), which combines throughput (“faster is better”) and response-time requirements (“never trust any result older than 250 milliseconds”). Use of RCU instead of straight locking allows them to support crowds with three times the number of agents while still meeting their response-time requirements.

It is good to see academics experimenting with RCU. Even better, I learned of this only after seeing the paper. ;-)

Off to the races!

I was fortunate enough to be able to attend the first workshop on Relaxing Synchronization for Multicore and Manycore Scalability (also known as “RACES'12”), which drew concurrency researchers from across academia and industry. The topic was somewhat controversial, quoting from the call for participation:

How can we cast aside the security of correctness, the logic of a proof, and adopt a new way of thinking, where answers are good enough but not certain, and where many processors work together in parallel without quite knowing the states that the others are in? We may need some amount of synchronization, but how much? Or better yet, how little? What mental tools and linguistic devices can we give programmers to help them adapt to this challenge?

The workshop's presenters expressed a number of views on this matter.

One group of talks covered applications using approximate methods, where the approximations might be due to inaccuracies in the input data, convergence criteria in iterative algorithms, use of discrete rather than continuous time intervals, and so on. The key point was that all of these applications maintained an error budget, and that this budget could just as well include a component for errors due to weak synchronization—or even due to entirely omitted synchronization. Applications included the Barnes-Hutt gravitational multi-body solver (Martin Rinard); kmeans clustering and breadth-first search of large graphs (Ravi Nair et al.), and web-based analytics (Michael Carbin et al.). Of course, synchronization-free numerical algorithms have been around for some decades for dense arrays, but these talks used more complex data structures. The talks were quite interesting, though I would have liked to see more consistent comparisons against unsynchronized single-threaded implementation. After all, eliminating synchronization does not necessarily eliminate synchronization overhead in the form of communications cache misses. It nevertheless felt quite good to see people taking even more aggressive approaches to parallelism than I normally take. ;-)

There was considerable discussion as to when inaccuracy do to relaxed synchronization could be tolerated. Martin Rinard pointed out that the relaxed Barnes-Hutt program actually provided greater accuracy than did some of the more heavily synchronized variants, but the fact that the unsynchronized version could omit objects completely caused some consternation. Of course, Barnes-Hutt's use of centroids to represent large groups of bodies is inherently approximate in any case: Reasonable accuracy is achieved only when the ratio of the distances to the nearest object in the group and to the centroid of that group is reasonably close to 1.0. Doug Lea eventually closed this discussion by suggesting that: (1) Any algorithm willing to tolerate the approximations inherent in floating-point arithmetic should also be willing to tolerate the approximations inherent in relaxed or omitted synchronization, but that (2) algorithms using exact computations might wish to think twice before omitting quite so much synchronization. I suspect that important future work will include classifying what sorts of errors and non-determinism are acceptable in a given situation.

Hans-J. Boehm's talk on the evils of data races exposed a sharp difference in the definition of “data race”. A common definition is “at least two concurrent unsynchronized accesses, at least one of which is a write,” but it turned out that there was little agreement on what constitutes “unsynchronized.” Martin Rinard argued that a synchronized access should involve locking, atomic instructions, or at the very least a memory barrier, while Hans asserted that as long as the developer informed the compiler that the memory location in question was subject to concurrent conflicting accesses, the corresponding accesses were in fact synchronized. Hans's version of the definition has the interesting property that exactly the same assembly code might be omitted for a data-race-free program as for the analogous program having data races, but on the other hand, Hans's definition allows Martin to write his racy algorithms in C++11 without having to drop into assembly language. I doubt that these two groups will come to agreement on this point any time soon, but such is life in computing. Therefore, if someone talks to you about data races, you would be wise to ask them exactly what definition they are using.

The next presentation examined the meaning of “FIFO” in concurrent systems. Andreas Haas et al. noted that the usual definitions of “perfect FIFO queues” used ``linearization points'' that are not visible to the caller, who can really only observe when the enqueue and dequeue operation begin and end. The authors therefore measured the deviation of the order in the queue from the order in which the enqueue operations began. They found that even strict FIFO-ordered queues suffered significant misordering. Interestingly enough, queues designed to emphasize performance over ordering actually produced more strongly ordered results than did the “perfect FIFO” queues, mainly because the shorter durations of the enqueue and dequeue operation exposed less opportunity for misordering. I found this to be the most interesting talk, as it gave yet another reason why it is a bad idea to push your parallel application's entire data set through a single queue.

Philip Howard presented a talk on relativistic programming, which sparked a debate as to whether such techniques were actually usable in practice. I eventually squelched the debate by pointing out that this technique is heavily used in the Linux kernel. Trey Cain discussed hardware techniques for gaining more performance and scalability from weakly ordered hardware, David Ungar raised the question of whether increased scalability and throughput always implied degraded latency, and Max Orhai presented on parallel sort on a spatial computer, that is a computer whose computing elements are arranged in a linear array or a two-dimensional grid. Finally, Sasa Misailovic discussed a system that automatically removed synchronization from a program and the tested the degree of inaccuracy that this removal introduced. I am not so sure I would want to bet my life on a program produced by Sasa's system, but random removal of synchronization does sound like a good way to evaluate the effectiveness of test/validation suites for parallel programs.

My presentation made the claim that parallel programming could in fact be learned by large groups of people (paper, presentation). In keeping with most of the other presentations, this claim proved to be somewhat controversial.

All in all, this was an interesting and worthwhile workshop, and I look forward to similar gatherings in the future.

Plant your flag!

About fifty years ago, I received a wonderful gift from a friend of the family. It was a colorful flag, just the thing for a three-year-old boy. I ran around with it for a while, watching it flap merrily along. Then I planted it in the front garden, where I could see it waving in the wind from inside our house.

That evening was a bit unusual, as my father was unable to come home from work. My mother made dinner and then read to me by candlelight, frequently warning me to stay away from the living room's large picture window. Several times, I asked about the strange noises from outside, and each time my mother answered that it was the wind. I had some difficulty believing her, as I had heard wind many times before, and it had never sounded anything at all like what I was hearing that evening. My mother eventually put me to bed, and when I woke up the next morning, my father had returned home.

In short, at the time, I was not all that impressed by the Columbus Day Storm of October 12, 1962. Of course, anyone unfortunate enough to have experienced Hurricane Camille or Hurricane Katrina won't be much impressed, either, but then again, tropical hurricanes are in a class of their own.

Others were plenty impressed, though. Wind gusts in excess of 160 miles an hour were reported in several places, although the actual maximum wind speed was often unknown. In some cases, anemometers were pegged at their maximum readings for extended periods of time, in other cases the anemometers were blown off their roofs, and in at least one case the wind took out the entire weather station for good measure. There were casualties, but the Pacific Northwest being what it is, far more trees than people perished. Property damage was such that one of the casualties of the Columbus Day Storm was an insurance company.

One reason for the casualties was lack of preparation. You see, there were no weather satellites back then, and, unlike the Gulf Coast, the Pacific Northwest did not fly hurricane-hunter airplane patrols. The weather forecast called for a typical winter gale, and the only indication that this forecast might be overly optimistic was a radio report from a ship out in the Pacific. This report featured not only unbelievably high winds and seas, but also barometer readings that dropped by an equally unbelievable 0.66 inches of mercury over the course of only two hours. Because malfunctioning ship-board equipment is not all that uncommon, the weathermen upgraded their forecast only after barometers along the coast of northern California started dropping like rocks. But even then, they expected “only” 70-mile-per-hour winds.

It is a ill wind that blows no good, and this storm was no exception. For example, one man worked in southern Oregon and lived in Portland, at the northern end of the state. He spent his weeks at work and his weekends with his family. His drive up to Portland normally took a full tank of gas, but on that Friday evening, he made the trip wind-assisted on only a quarter tank. Other people driving north on I5 that evening reported seeing barns keeping up with traffic. Needless to say, the casualties among livestock were not inconsequential.

But from my three-year-old viewpoint, it was not all that big a deal. Sure, the power went out, but my dad managed to scrounge a diesel generator the very next day. Not only did our unassuming house somehow manage to escape damage, but there were absolutely no fallen trees or power lines within sight of our house.

But my parents were completely unable to explain the disappearance of my prized flag to my satisfaction. They just kept repeating this insane story about how it had been blown away by the wind. ;-)
LPC Logo
Linux Plumbers Conference this past week in San Diego was a great event, and in my role as Program Committee Chair, I thank the many people who provided the excellent content that makes this event what it is. The program committee had to make some tough decisions between worthy submissions. The submitters, both successful and otherwise, provided an excellent set of abstracts and presentations. The Microconference leads put together excellent events, including first-ever Plumbers Constraint-Framework, LLVM, and Android microconferences. Last, but by no means least, there was wonderfui audience participation, including a vibrant hallway track. (A number of the people I informally polled called out the hallway track as the most valuable part of the conference, which I believe to be a very good thing indeed!)

I join all the attendees in thanking the Plumbers Organizing Committee and the Linux Foundation staff for providing the venue, scheduling, organization, food, and drink that powers these events.

And I am already looking forward to next year's Plumbers! I hope to see you there!!!


Parallel Programming: August 2012 Update

This release of the parallel programming book features a completion of the functional-testing validation chapter (performance testing likely gets a separate chapter), updates to the locking and transactional-memory sections, and lots of fixes, primarily from Namhyung Kim, but with a fix to a particularly gnarly memory-ordering example from David Ungar. (The ppcmem tool is even more highly recommended as a result.)

This release is rather late. This tardiness was in roughly equal parts due to:

  1. Inertia.
  2. Unexpected requests for more information on hardware transactional memory.
  3. Frequent changes in approach to the validation chapter.
  4. A sudden desire to add a third example to the "Partitioning and Synchronization Design" chapter. This was motivated by seeing people blog on the difficulty of solving mazes in parallel, and it would indeed be difficult if you confined yourself to their designs. However, the results were surprising, so much so that I published paper describing a scalable solution, which was not simply embarrassingly parallel, but rather humiliatingly parallel. Which means that this chapter is still short a third example.

Future releases will hopefully return to the 3-4 per year originally intended.

As always, git:// will be updated in real time.
elephant, penguin
My transition from proprietary programmer to open-source programmer definitely increased my travel. And with travel comes jet lag. Fortunately, there are ways to avoid jet lag, a few of which are discussed below.

When devising algorithms for parallel hardware, understanding the capabilities of the underlying hardware and software is critically important. When devising ways of avoiding jet lag, understanding how your body and mind react to time shifts is no less important. Although people are not computers, and thus do vary, most people can shift one timezone east or two timezones west per day without ill effects (and perhaps you have noticed that it is easier to go to bed late than it is to get up early). Your body and mind will no doubt vary somewhat from the norm, so you will need to experiment a bit to learn your own personal limits.

The typical limit of one timezone per day east and two timezones per day west means that the most difficult trip is eight timezones east, for example, from Oregon to England. Regardless of whether you adjust your body eight timezones east or 16 timezones west, you are looking at an eight-day adjustment period. In contrast, the 11.5 timezone summertime westward shift from Oregon to India requires only a six-day adjustment period.

The key insight is that you can often schedule the adjustment period. For example, on my recent trip to Italy, I set my alarm clock 45 minutes earlier each day for the week prior to the workshop. This meant that on the day of my departure, I woke up at midnight Pacific time. This is 9AM Italy time, which left me only a timezone or two to adjust upon arrival. Which was important, because I had to give a 90-minute talk at 9AM the following morning, Italy time. This pre-adjusting period meant also that on the night before my departure, I went to bed at about 5PM Pacific time.

This pre-adjusting approach clearly requires the cooperation of your family and co-workers. Most of my meetings are early in the morning, which makes eastward adjustments easier on my co-workers than westward adjustments. On the other hand, fully adjusting the roughly eight timezones westward to (say) China requires only four days, two of which might be weekend days. That said, there will be times when you simply cannot pre-adjust, and on those times, you must take the brunt of a sudden large time change. More on this later.

Especially when pre-adjusting eastwards, I feel like I am inflicting mild jet lag on myself. Exercising when I first get up does help, and the nearby 24-hour-per-day gym is an important part of my eastwards pre-adjustment regimen. Shifting mealtimes can be difficult given the expectations of families and co-worker, but in my experience is of secondary importance.

As I get older, it is getting easier to go to bed early, but sleeping earlier than normal on an airplane is still quite challenging. In contrast, getting to sleep two hours later than normal (when travelling westwards) works quite well, even on an airplane. Some swear by various sovereign sleeping aids ranging from alcohol to melatonin, but when travelling eastwards I often simply stay awake for the whole trip, in effect skipping one night's sleep. This approach will likely become untenable at some point, but it currently has the good side-effect of allowing me to get to sleep easily in the evening local time when I do arrive.

But what if you cannot pre-adjust? It is possible to tough it out, especially with the help of caffeine. However, keep in mind that the half-life of caffeinne in your body is about five hours (your mileage may vary), so unless if you are already a heavy caffeine user, taking it late in the afternoon can be counterproductive. I am quite sensitive to caffeine, and normally must avoid taking any after about 9AM—as little as an ounce of chocolate in the afternoon will disrupt the following night's sleep.

The most difficult time will be that corresponding to about 3:30AM in your original timezone. For example, if I were to travel from Oregon nine timezones east to Europe, I would be very sleepy at about half past noon, Europe time. It can be very helpful to walk outside in full daylight during that time.

All that said, everyone reacts to time changes a little differently, so you will likely need to experiment a bit to arrive at the approach best suited to your body and mind. But an approach informed by your own personal time-change limitations can be considerably more comfortable than toughing out a large time change!
The previous post described some potential pitfalls in blindly applying hardware lock elision to legacy software. This post looks instead at how changes in CPU cache geometry have increased hardware transactional memory's (HTM's) cache footprint, at least from a theoretical standpoint. This is yet another example of how hardware giveth and software taketh away, at least according to my colleagues working on hardware.

For example, cache associativity has increased over the past couple of decades. For example, Sequent systems of 20 years ago had two-way set-associative caches, while the laptop I am typing this on has an eight-way set-associative cache. The following graph, which models a 4KB cache with 64-byte cache lines, shows why this is increase is important to HTM:

HTM success probability for 4K cache

To use this graph, select the transaction size in cache lines on the x-axis, then read the y-axis to find the overflow (failure) probability for the desired associativity. For example, a direct-mapped (AKA one-way set-associative) cache has about a 50% failure probability for a 10-cacheline transaction, while an eight-way set-associative cache has less than a one-in-a-million failure probability at this same size. Clearly, the increases in microprocessor cache associativity over the past 20 years have profoundly reduced the failure probability of modest-sized transactions.

But a 4KB cache with 64-byte cache lines cannot support a transaction of more than 64 cache lines, no matter how large the associativity. Fortunately, on-chip cache sizes have also increased, with my laptop's L0 cache being 32KB rather than 4KB. Unfortunately, the standard closed-form expression for success probability is severely stressed by this increase in size. For example, the expression for 64 references into a 32KB cache having 64-byte cache lines and eight-way set associativity weighs in at more than 12 megabytes. Because this expression is a sum of a very large number of terms having very small values, indefinite-precision arithmetic is a must, which makes the analytic approach quite slow for modest sized transactions (though it is quite efficient for the smallest transactions as well as the largest transactions that have some chance of success).

Fortunately, there is another approach that works quite well for modest failure probabilities, for example, down to 0.0001. This approach is monte carlo simulation, where we randomly generate a long series of sequences of references and estimate the failure probability based on the results. For example, the following figure shows the same analytic data as the earlier figure, but overlays it with billion-trial monte carlo simulations.

HTM success probability for 4K cache plus monte carlo

The monte carlo trials agree quite well with the analytic results, especially for non-infinitesimal failure probabilities, which gives us some confidence in the monte carlo results for a 32KB cache:

HTM success probability for 32K cache

As you can see, moving from a 4KB to a 32KB cache significantly decreases the failure probability for a given HTM transaction. But the x86 family normally increases L0 cache size in tandem with associativity because this allows the L0 cache lookup to be carried out concurrently with the TLB lookup. Thus, an x86 4KB cache will be direct mapped, a 8KB cache will be two-way set associative, and so on:

HTM success probability for x86 cache

The newer x86 systems are more friendly to HTM than are the older systems, and might be even more friendly if transactions are allowed to overflow from the L0 cache into the L1 or L2 caches.

But not all hardware advances have been helpful. The increase in cache-line size from 32 bytes to 64 bytes reduces the number of cache lines, which, if the addresses are randomly selected from a large memory, increases overflow probability as shown in the following figure:

HTM success probability for different cache-line sizes

In short, HTM failure probabilities have decreased with increasing hardware capabilities, which has interestingly enough made analysis more difficult. But personally, I have always preferred improved performance over easy analysis.
My earlier posting on hardware transactional memory ( brought an unusual amount of private email, especially on my discussion of the hazards of eliding empty critical sections ( Several people made the interesting claim that any correct code involving empty lock-based critical sections that relies entirely on memory references is guaranteed either to: (1) operate correctly under HTM or (2) result in an abort, causing execution to fall back to the lock-based code, again operating correctly. I invested more time than I care to admit unsuccessfully attempting to generate a counter-example, so I am becoming more inclined to believe them, despite the lack of a formal proof.

There is a straightforward rationale for their claim, namely that in strongly atomic HTM implementations, the results of a given transaction are not visible until after the transaction completes. So if you can see that the transaction started, it has already finished, at which point a no-op will successfully wait for it. So any sequence of code that tests variables to determine when a given critical section has started, and then uses an empty lock-based critical section to wait for it to complete will operate correctly even when the empty critical section becomes essentially a no-op under hardware transactional lock elision.

Of course, this would not be true of a weakly atomic HTM implementation, but as far as I know all the currently proposed production HTM implementations are strongly atomic.

But there are other ways to communicate than relying on memory references, for example, you could instead rely on the passage of time. Please note that relying on the passage of time is very difficult to get right, especially on today's super-scalar hardware that relies so heavily on speculative execution. And if your code is running on top of a hypervisor, relying on timing is even more risky. And if I catch you trying to sneak a timing-based patch into the Linux kernel, I will almost certainly NAK it. However, older computer systems were much more amenable to use of timing, so it would be no surprise if large legacy software made use of such techniques. The following is an example of what a timing-based algorithm might look like.

Each worker thread corresponds to a data feed out to some receiver. Each worker thread records the current time (e.g., from clock_gettime()) to a per-thread timestamp after executing each work unit. Because this is a per-thread timestamp, a normal write suffices. Because this is a real-time application, it is a fatal error for a given thread to fail to update its per-thread timestamp for more than (say) 100 microseconds. Results are placed into a shared-memory buffer, which is drained by a separate set of threads that transmit the results out over the network. Locks are used sparingly to access and update global state, and, again given that this is a real-time application, locks are granted in FIFO order within priority level.

Worker threads can finish, and when they do they must disentangle themselves from the rest of the application. Worker threads also indicate their status in a thread-local variable my_status that is initialized to -1. However, worker threads do not exit: They instead go to a thread pool to accommodate later processing requirements.

There is also a control thread that monitors the worker threads. In particular, it maintains a histogram of thread statuses for threads that have exited. This same thread is responsible for assigning new threads or reassigning old ones, and runs at a priority no higher than that of the worker threads.

Worker threads do something like the following:

	int my_status = -1;  /* Thread-local variable. */

	while (continue_working()) {
		enqueue_any_new_work(); /* Might include locks. */
		wp = dequeue_work();  /* Thread-local, no locking. */
		do_work(wp);  /* Might include locks. */
		my_timestamp = clock_gettime(...); /* Thread-local. */


	 * disentangle from application, might acquire other locks,
	 * can take much longer than MAX_LOOP_TIME, especially if
	 * many threads exit concurrently.

	my_status = get_return_status();


	/* thread awaits repurposing. */

The control thread might do the following:

	for (;;) {
		for_each_thread(t) {
			ct = clock_gettime(...);
		        d = ct - per_thread(my_timestamp, t);
			if (d >= MAX_LOOP_TIME) {
				/* thread departing. */
				i = per_thread(my_status, t);
				status_hist[i]++; /* Bad idea if TLE! */
		/* Repurpose threads as needed. */

Here the passage of time is used to ensure that the empty lock-based critical section happens after the departing worker thread has acquired the departing_thread_lock. If the empty lock is allowed to become a no-op, then the per_thread(my_status, t) access can abort the departing thread's transaction, so that the result is the initial value of -1. This cannot happen with locking.

Note that enclosing the access to my_status in a lock/transaction does not change things. This problem is not the extra-transactional access, but rather the structure of the application.

Again, please don't try this sort of thing on superscalar CPUs (which is almost all of them these days), and especially please don't try it in the Linux kernel. In fact, you should be very cautious about doing this sort of thing even on systems specially designed for hard real-time response. But the key point is that if you have a large application that is not well understood, you might want to be careful about applying HTM-based lock elision to it. This should not be a surprise: What you have is a real-time program, and there has not yet been much work on TM in real-time environments.

Speaking of real-time environments, how does priority boosting interact with HTM-based lock elision?

Acknoowledgments: Many thanks to Josh Triplett, Mathieu Desnoyers, Jon Walpole, Phil Howard, and Stephan Diestelhorst for many illuminating discussions. Please note that my acknowledgment of these people's contributions to my understanding of transactional lock elision should in no way be interpreted to imply their agreement with anything I say here. ;-)