## October 30th, 2012

I was fortunate enough to be able to attend the first workshop on Relaxing Synchronization for Multicore and Manycore Scalability (also known as “RACES'12”), which drew concurrency researchers from across academia and industry. The topic was somewhat controversial; quoting from the call for participation:

> How can we cast aside the security of correctness, the logic of a proof, and adopt a new way of thinking, where answers are good enough but not certain, and where many processors work together in parallel without quite knowing the states that the others are in? We may need some amount of synchronization, but how much? Or better yet, how little? What mental tools and linguistic devices can we give programmers to help them adapt to this challenge?

The workshop's presenters expressed a number of views on this matter.

One group of talks covered applications using approximate methods, where the approximations might be due to inaccuracies in the input data, convergence criteria in iterative algorithms, use of discrete rather than continuous time intervals, and so on. The key point was that all of these applications maintained an error budget, and that this budget could just as well include a component for errors due to weak synchronization—or even due to entirely omitted synchronization. Applications included the Barnes-Hut gravitational multi-body solver (Martin Rinard); k-means clustering and breadth-first search of large graphs (Ravi Nair et al.); and web-based analytics (Michael Carbin et al.). Of course, synchronization-free numerical algorithms for dense arrays have been around for some decades, but these talks used more complex data structures. The talks were quite interesting, though I would have liked to see more consistent comparisons against unsynchronized single-threaded implementations. After all, eliminating synchronization does not necessarily eliminate synchronization overhead in the form of communications cache misses. It nevertheless felt quite good to see people taking even more aggressive approaches to parallelism than I normally take. ;-)

There was considerable discussion as to when inaccuracy due to relaxed synchronization could be tolerated. Martin Rinard pointed out that the relaxed Barnes-Hut program actually provided greater accuracy than did some of the more heavily synchronized variants, but the fact that the unsynchronized version could omit objects completely caused some consternation. Of course, Barnes-Hut's use of centroids to represent large groups of bodies is inherently approximate in any case: reasonable accuracy is achieved only when the ratio of the distance to the nearest object in a group to the distance to that group's centroid is reasonably close to 1.0. Doug Lea eventually closed this discussion by suggesting that: (1) any algorithm willing to tolerate the approximations inherent in floating-point arithmetic should also be willing to tolerate the approximations inherent in relaxed or omitted synchronization, but that (2) algorithms using exact computations might wish to think twice before omitting quite so much synchronization. I suspect that important future work will include classifying what sorts of errors and non-determinism are acceptable in a given situation.

Hans-J. Boehm's talk on the evils of data races exposed a sharp difference in the definition of “data race”. A common definition is “at least two concurrent unsynchronized accesses, at least one of which is a write,” but it turned out that there was little agreement on what constitutes “unsynchronized.” Martin Rinard argued that a synchronized access should involve locking, atomic instructions, or at the very least a memory barrier, while Hans asserted that as long as the developer informed the compiler that the memory location in question was subject to concurrent conflicting accesses, the corresponding accesses were in fact synchronized. Hans's version of the definition has the interesting property that exactly the same assembly code might be emitted for a data-race-free program as for the analogous program having data races, but on the other hand, Hans's definition allows Martin to write his racy algorithms in C++11 without having to drop into assembly language. I doubt that these two groups will come to agreement on this point any time soon, but such is life in computing. Therefore, if someone talks to you about data races, you would be wise to ask them exactly what definition they are using.

The next presentation examined the meaning of “FIFO” in concurrent systems. Andreas Haas et al. noted that the usual definitions of “perfect FIFO queues” used “linearization points” that are not visible to the caller, who can really only observe when the enqueue and dequeue operations begin and end. The authors therefore measured the deviation of the order in the queue from the order in which the enqueue operations began. They found that even strict FIFO-ordered queues suffered significant misordering. Interestingly enough, queues designed to emphasize performance over ordering actually produced more strongly ordered results than did the “perfect FIFO” queues, mainly because the shorter durations of the enqueue and dequeue operations exposed less opportunity for misordering. I found this to be the most interesting talk, as it gave yet another reason why it is a bad idea to push your parallel application's entire data set through a single queue.

Philip Howard presented a talk on relativistic programming, which sparked a debate as to whether such techniques were actually usable in practice. I eventually squelched the debate by pointing out that this technique is heavily used in the Linux kernel. Trey Cain discussed hardware techniques for gaining more performance and scalability from weakly ordered hardware, David Ungar raised the question of whether increased scalability and throughput always implied degraded latency, and Max Orhai presented on parallel sort on a spatial computer, that is, a computer whose computing elements are arranged in a linear array or a two-dimensional grid. Finally, Sasa Misailovic discussed a system that automatically removed synchronization from a program and then tested the degree of inaccuracy that this removal introduced. I am not so sure I would want to bet my life on a program produced by Sasa's system, but random removal of synchronization does sound like a good way to evaluate the effectiveness of test/validation suites for parallel programs.

My presentation made the claim that parallel programming could in fact be learned by large groups of people (paper, presentation). In keeping with most of the other presentations, this claim proved to be somewhat controversial.

All in all, this was an interesting and worthwhile workshop, and I look forward to similar gatherings in the future.
