Log in

No account? Create an account

Previous Entry | Next Entry


I had the privilege of attending this year's USENIX Workshop on Hot Topics in Parallelism (HOTPAR), which was as always an interesting gathering. One very positive change compared to the first HOTPAR in 2009 is that the participants seemed much more comfortable with parallelism. This is not to say that I agreed with all viewpoints put forward (quite the contrary, as other attendees can attest!), but rather that the discussions this year seemed to be driven by actual experience, in happy contrast with the first year's tendency towards conceptual opinions.
There were also more talks stretching beyond pure scalability. Some areas follow, along with examples: from the workshop, from the Linux community, of things needing doing, and of things that will likely be done to us.

The first area is extreme scale, with Bill Dally's keynote presentation being the best example. Over the next seven years, Bill expects a 50x performance improvement provided by systems sporting 18,000 GPUs with no fewer than ten billion concurrently executing threads. Yes, Bill expects systems to achieve 1,000 PFLOPS by the year 2020. There was less discussion of single-system extreme scale, in fact, a number of participants seemed quite surprised that the Linux kernel could run on systems with 4096 CPUs (admittedly with severely constrained workloads).

The second area is energy efficiency, where Bill again put forward some interesting predictions. You see, he calls for this 50x performance increase to be provided by a system drawing only 2x the power of current extreme-scale supercomputers. My paper and posters were also mainly about energy efficiency, albeit for much smaller devices. Unfashionable though it might be to admit this, much of my work on energy efficiency has felt like something being done to me. :-)

The third area is predictability, in this case, a lightening talk on capacity planning from Greg Bronevetsky. Of course, real-time response is another example of predictability, and many attendees were surprised that the Linux kernel's -rt patchset could achieve latencies in the low tens of microseconds. At a larger scale and at longer response times, Eric Brewer's Parallelism in the Cloud keynote discussed throughput/latency tradeoffs in cloud-computing environments, with the usual lament that many mechanisms that improve throughput degrade latency, which also qualifies as something being done to us. The saving grace for most cloud environments is that a large chunk of the cloud-computing workload is time-insensitive batch processing, which allows the cloud to run at reasonable utilization levels while still meeting interactive response-time goals. Interestingly enough, Berkeley is getting back into the OS business, working on an OS that provides just enough functionality for cloud-based applications. For example, this OS provides only rudimentary scheduling, with more complex scheduling policies being implemented by user programs.

The fourth area is heterogenous computing, with Bill Dally's keynote being the primary case in point. Sheffield, Anderson, and Keutzer presented on Three-Fingered Jack, which allows Python programs to use SIMD vector units. Tillet, Rupp, Selberherr, and Lin presented Towards Performance-Portable, Scalable, and Convenient Linear Algebra, which discussed performance portability across multiple GPUs. They were able to automatically generate code from OpenCL that beat the best hand-generated code, which I take as a sign that GPGPUs are finally coming of age. Perhaps GPUs will one day feel more like an organic part of the overall computing system.

The fifth area is software-engineering implications. The discussion in this area has advanced significantly since 2009, for example, it was good to see transactional-memory researchers taking debugging seriously (Gottschlich, Knauerhase, and Pokem But How Do We Really Debug Transactional Memory Programs?). They proposed additional record-replay hardware support, which has a number of interesting issues, including the need to ensure that other CPUs replay in a manner consistent with the CPU that is executing the transaction that is being debugged. Another approach is to allow non-transactional accesses within a transaction, so that these non-transactional accesses are not rolled back should the transaction abort. This provides a straightforward printf-like capability without the need for replay. Such non-transactional accesses are supported on some commercial platforms, including Power (suspended transactions) and the mainframe (the non-transactional store instruction). Perhaps other hardware platforms supporting transactional memory will also gain support for non-transactional accesses within a transaction.

The sixth and last area is extreme productivity via application-specific approaches. Quite impressively, Best, Jacobsen, Vining, and Fedorova are looking to enable artists and designers to successfully exploit parallelism in Collection-focused Parallelism. This talk recalled to mind how much the spreadsheet, word processor, and presentation manager did for PC uptake in the 1980s, in stark contrast to any number of high-minded language-based innovations. As I have said before, it seems likely that application-specific tools will provide the best path towards ubiquitous parallel computing. It is certainly the case that other engineering fields have specialized over time, and it would be quite surprising if computing were to prove the sole exception to this rule.

There were other papers as well, which can be downloaded from the conference website. One talk deserving special mention is Martin Rinard's Parallel Synchronization-Free Approximate Data Structure Construction, which uses approximate data-structure construction for a digital orrery (similar to his earlier talk at RACES'2012). It is always good to have Martin around, as his ideas are perceived by many to be even crazier than RCU.

Finally, it is important to note that it will not be sufficient to do well in only one or two of these areas, craziness included. Parallel systems of the future must do all of this simultaneously, which means that there is no shortage of parallel-programming work left to be done!


Jul. 3rd, 2013 05:42 pm (UTC)
Escape actions
Hi Paul -

Thanks for putting together this excellent write-up. I have two questions for you.

1. Do you know where I can find out more about suspended transactions on Power? I googled it and I'm not finding anything.

2. I agree with you that if we have full escape actions, we can using ad hoc debugging techniques and breakpoint debugging for TM programs ... at least in theory. However, I see problems with both of these techniques even if we have support for them with HTM.

First, breakpoint debugging seems, at least to me, to be primarily a sequential program debugging technique. Trying to use breakpoints to reproduce a concurrency bug seems challenging. Second, using ad hoc debugging, such as printf() for pattern analysis, may introduce significant overhead to transactional executions, most especially if those transactions are being executed primarily in hardware. This overhead could change the contention signature of the program and therefore lead to either hiding correctness bugs and/or obfuscating real performance bottlenecks, preventing the programmer from seeing anything useful. Or worse, showing the programmer a performance issue that only exists due to the ad hoc techniques' execution perturbation.

So, my question is to you is this. Let's suppose we have full escape action support. Given the points above, do you believe the current debugging techniques are sufficient for debugging TM programs?

Thanks, as always, for sharing your thoughts and thinking deeply about these questions and problems with me. :)


Jul. 5th, 2013 04:35 pm (UTC)
Re: Escape actions
Hello, Justin!

On #1, please see https://www.power.org/documentation/power-isa-transactional-memory/. Registration required, but otherwise freely downloadable.

On to #2.

Practitioners, myself included, tend to use the following debugging techniques for parallel software: assertions, breakpoints, printf(), lightweight event tracing, statistical counters, formal methods, and programmatic combinations of the above. Others will no doubt add their favorite techniques to the mix.

Assertions are straightforward, relatively lightweight when they don't fire (hopefully the common case), and quite helpful in cases where the bug has a low probability of occurring. When debugging user-mode code, I normally run within a debugger in order to inspect global state after the assertion fires.

I often use breakpoints as a lightweight assertion, for example, setting breakpoints in locations that a particular set of inputs should not reach. Single-stepping parallel applications can also be useful, but inter-thread dependencies can make it difficult. That said, there is no substitute for single-stepping the code when running it single-threaded, in part because the vast majority of bugs are not subtle parallel-programming issues, but rather simple mistakes that affect both single-threaded and parallel execuion. (And you do test your code running single-threaded before running it in parallel, don't you?)

Despite its shortcomings, printf() does actually see some use. While it is possible that the bug is profoundly dependent on timing, bugs are often reasonably robust and printf() is profoundly easy to use. So one not-uncommon approach is to insert printf()s, and if that makes the bug morph or even disappear completely, substitute one of the lighter weight (but harder to use) techniques below.

Lightweight event tracing uses a per-thread (or per-CPU) data structure to record the trace events, with the usual example being a ring buffer. This approach is well-suited to escape-action support in current hardware. Ring-buffer overflow can be a problem (especially for lightweight primitives such as RCU) so the following techniques are often used in combination with event tracing.

When using the statistical counter technique, you simply increment per-thread variables at selected points in your code, which also works well with current hardware's escape-action support. These counters can be read out at any convenient point, and the values examined for discrepancies. These discrepancies often point directly at the bug.

Livejournal's length limitations mean that the rest of this discussion is in the next comment. ;-)
Jul. 5th, 2013 04:36 pm (UTC)
Re: Escape actions
Some types of formal methods are heavily used, especially static-analysis techniques, but also runtime techniques such as lock-dependency checking. Although the more-general formal methods currently have severe program-size limitations, they are exceedingly powerful. Recent work is also promising to relax current program-size limitations somewhat, especially in the common case where the analysis targets a specific property or assertion rather than attempting to capture the program's full behavior. In particular, I highly recommend recent work by Alglave, Kroening, and Tautschnig (http://www.kroening.com/papers/cav2013-wpo.pdf), which synthesizes logic expressions instead of explicitly visiting every possible state. This results in impressive reductions in analysis time and the size of code that can be analyzed. To give you some idea, one piece of code that they analyze in this paper is the Linux kernel's RCU implementation, albeit for one of RCU's less-involved properties.

Finally, I often use programmatic combinations of these techniques. For example, consider the (thankfully rare) case of an RCU bug that takes ten hours of rcutorture testing to locate. The event logs would be truly huge, and would be chock full of irrelevant information. Worse yet, the high event rate over such a long period of time would likely result in ring-buffer overflows, greatly complicating analysis of the combined trace logs. So I have occasionally used a staged approach, where I record high-event-rate information from each test iteration into a per-thread structure, then trace the contents of this structure only if an error is detected. I also sometimes place printf() statements prior to assertions or breakpoints, or, alternatively, invoke functions from the debugger after the breakpoint/assertion in order to print relevant state. I sometimes also use formal methods in conjunction with testing, proving that which can most readily be proven and testing that which can most readily be tested.

Now onto the unlabeled question #3. First, at any given time, current debugging techniques will sometimes prove insufficient -- but for sequential programs as well as for parallel programs. Therefore, there will always be some opportunity to advance the state of the art. The big question is therefore "In which direction should we take our next step?" I have tended to favor small evolutionary steps in response to problems that I encounter, but larger steps are also important, with the Linux kernel's lock-dependency checker ("lockdep") being but one case in point.

Does the hardware approach described in your paper qualify as a valuable larger step? This is of course hard to say. However, back in the days before caches moved onto the CPU chip, I made heavy use of logic analyzers, mostly for fine-grained performance analysis. These days, I use tracing instead, which has the advantage of covering all CPUs rather than just the one that the logic analyzer is connected to, a limitation that your approach might overcome. I could imagine a per-CPU logic analyzer being quite useful, but it is not yet clear to me whether or not it would provide sufficient advantages over light-weight event tracing. Possible advantages include fine-grained timing, lower overhead, and greater access to CPU state. Possible disadvantages include too-small event buffers (thus the possibility of overflow), limited flexibility (including limited collection-time event analysis), and the time delay to get the hardware and corresponding tools in place.

Over to you! ;-)