I had the privilege of attending this year's USENIX Workshop on Hot Topics in Parallelism (HOTPAR), which was, as always, an interesting gathering. One very positive change compared to the first HOTPAR in 2009 is that the participants seemed much more comfortable with parallelism. This is not to say that I agreed with all viewpoints put forward (quite the contrary, as other attendees can attest!), but rather that the discussions this year seemed to be driven by actual experience, in happy contrast with the first year's tendency towards conceptual opinions.
There were also more talks stretching beyond pure scalability. Some areas follow, along with examples: from the workshop, from the Linux community, of things needing doing, and of things that will likely be done to us.

The first area is extreme scale, with Bill Dally's keynote presentation being the best example. Over the next seven years, Bill expects a 50x performance improvement provided by systems sporting 18,000 GPUs with no fewer than ten billion concurrently executing threads. Yes, Bill expects systems to achieve 1,000 PFLOPS by the year 2020. There was less discussion of single-system extreme scale; in fact, a number of participants seemed quite surprised that the Linux kernel could run on systems with 4096 CPUs (admittedly with severely constrained workloads).

The second area is energy efficiency, where Bill again put forward some interesting predictions. You see, he calls for this 50x performance increase to be provided by a system drawing only 2x the power of current extreme-scale supercomputers. My paper and posters were also mainly about energy efficiency, albeit for much smaller devices. Unfashionable though it might be to admit this, much of my work on energy efficiency has felt like something being done to me. :-)

The third area is predictability, in this case a lightning talk on capacity planning from Greg Bronevetsky. Of course, real-time response is another example of predictability, and many attendees were surprised that the Linux kernel's -rt patchset could achieve latencies in the low tens of microseconds. At a larger scale and at longer response times, Eric Brewer's Parallelism in the Cloud keynote discussed throughput/latency tradeoffs in cloud-computing environments, with the usual lament that many mechanisms that improve throughput degrade latency, which also qualifies as something being done to us. The saving grace for most cloud environments is that a large chunk of the cloud-computing workload is time-insensitive batch processing, which allows the cloud to run at reasonable utilization levels while still meeting interactive response-time goals. Interestingly enough, Berkeley is getting back into the OS business, working on an OS that provides just enough functionality for cloud-based applications. For example, this OS provides only rudimentary scheduling, with more complex scheduling policies being implemented by user programs.

The fourth area is heterogeneous computing, with Bill Dally's keynote being the primary case in point. Sheffield, Anderson, and Keutzer presented on Three-Fingered Jack, which allows Python programs to use SIMD vector units. Tillet, Rupp, Selberherr, and Lin presented Towards Performance-Portable, Scalable, and Convenient Linear Algebra, which discussed performance portability across multiple GPUs. They were able to automatically generate code from OpenCL that beat the best hand-generated code, which I take as a sign that GPGPUs are finally coming of age. Perhaps GPUs will one day feel more like an organic part of the overall computing system.

The fifth area is software-engineering implications. The discussion in this area has advanced significantly since 2009. For example, it was good to see transactional-memory researchers taking debugging seriously (Gottschlich, Knauerhase, and Pokam, But How Do We Really Debug Transactional Memory Programs?). They proposed additional record-replay hardware support, which has a number of interesting issues, including the need to ensure that other CPUs replay in a manner consistent with the CPU that is executing the transaction that is being debugged. Another approach is to allow non-transactional accesses within a transaction, so that these non-transactional accesses are not rolled back should the transaction abort. This provides a straightforward printf-like capability without the need for replay. Such non-transactional accesses are supported on some commercial platforms, including Power (suspended transactions) and the mainframe (the non-transactional store instruction). Perhaps other hardware platforms supporting transactional memory will also gain support for non-transactional accesses within a transaction.

The sixth and last area is extreme productivity via application-specific approaches. Quite impressively, Best, Jacobsen, Vining, and Fedorova are looking to enable artists and designers to successfully exploit parallelism in Collection-focused Parallelism. This talk recalled to mind how much the spreadsheet, word processor, and presentation manager did for PC uptake in the 1980s, in stark contrast to any number of high-minded language-based innovations. As I have said before, it seems likely that application-specific tools will provide the best path towards ubiquitous parallel computing. It is certainly the case that other engineering fields have specialized over time, and it would be quite surprising if computing were to prove the sole exception to this rule.

There were other papers as well, which can be downloaded from the conference website. One talk deserving special mention is Martin Rinard's Parallel Synchronization-Free Approximate Data Structure Construction, which uses approximate data-structure construction for a digital orrery (similar to his earlier talk at RACES'2012). It is always good to have Martin around, as his ideas are perceived by many to be even crazier than RCU.

Finally, it is important to note that it will not be sufficient to do well in only one or two of these areas, craziness included. Parallel systems of the future must do all of this simultaneously, which means that there is no shortage of parallel-programming work left to be done!


Jul. 2nd, 2013 02:04 am (UTC)
Approximate Data Structures
I skimmed Martin Rinard's paper and it's interesting stuff but a few key issues seemed to be missing.

The paper only describes inserting data structure pieces and does not describe removing them. So of course it doesn't crash despite lacking synchronization; you can't hit a "use-after-free" bug if there is no "free". I suspect there is some form of removal and freeing of the data structures at some point but it wasn't covered.

This is also important because if you accidentally lose track of inserted structures you have a memory leak whether you "crash" or not. Given the large number of bodies involved in these problems that leak could be substantial. If you *don't* lose track of these structures then perhaps we ought to say we're further amortizing the synchronization cost of some other structure rather than using synchronization-free algorithms?

The paper seems to be predicated on the Intel architecture memory consistency model. Does it work on architectures without that consistency model? My guess is it does not work on weaker consistency models but I didn't see a discussion of the consequences of other consistency models in the paper. "Link at end" is basically a pretty old (but good!) technique with only the final locking/synchronization around the pointer write omitted. My guess is this synchronization-free technique is less helpful with those other memory models.

Finally, the uniform-distribution n-body problem described could very easily be concealing significant sources of error. What happens to accuracy when the masses are non-uniformly distributed? When the positions are non-uniformly distributed? For example, suppose you want to simulate dust accretion around a young star and the data structure accidentally omits the star. I'd expect you'd see some rather serious error despite the fact that 99% of the bodies were not omitted from the Barnes-Hut tree. Of course you could then model the star as many small(-ish) masses at potentially much greater computational cost. That might be an overhead worth analyzing since it could reduce the advantages of the synchronization-free approach.
Jul. 2nd, 2013 04:09 am (UTC)
Re: Approximate Data Structures
I believe that the leaked memory is "freed" when the program exits, avoiding the need to track the leaked memory as well as any use-after-free bugs.

On weaker memory models, I suspect that you would need a memory barrier just prior to storing the pointer to the object you are attempting to add. Dependency ordering would handle traversals to newly added data.
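As a rough illustration of that barrier-before-pointer-store idea, here is a minimal C11 sketch. The names body_node, child_slot, publish(), and read_mass() are hypothetical and are not taken from the paper, and the single global slot stands in for one child pointer of the tree. The release store plays the role of the memory barrier just before the pointer store, and the consume load stands in for dependency ordering on the traversal side (current compilers generally treat consume as acquire).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type; the layout used in the paper may differ. */
struct body_node {
    double mass;
    double pos[3];
};

/* Hypothetical: one child pointer of a tree node, initially NULL. */
_Atomic(struct body_node *) child_slot;

/* Writer: fully initialize the node, then publish its pointer. */
void publish(struct body_node *n, double mass)
{
    assert(n != NULL);
    n->mass = mass;
    n->pos[0] = n->pos[1] = n->pos[2] = 0.0;
    /* Release ordering acts as the barrier before the pointer store,
     * so the initialization above is visible before the pointer is. */
    atomic_store_explicit(&child_slot, n, memory_order_release);
}

/* Reader: dependency ordering covers the dereference of the loaded
 * pointer. Returns -1.0 if nothing has been published yet. */
double read_mass(void)
{
    struct body_node *n =
        atomic_load_explicit(&child_slot, memory_order_consume);
    return n ? n->mass : -1.0;
}

/* Single-threaded usage example. */
double publish_demo(void)
{
    static struct body_node n;
    publish(&n, 42.0);
    return read_mass();
}
```

Note that the plain store in publish() can still overwrite a pointer stored concurrently by another thread, so this sketch only addresses ordering, not the lossiness of the technique.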

The point about non-uniform n-body problems is a good one, and came up during the talk. Given your specific example, I would add the young star last using single-threaded execution, thus guaranteeing that it gets added. This sort of strategy would work well for problems with a modest number of large masses and a huge number of insignificant masses. For example, to model the solar system, one might add the sun, planets, dwarf planets, moons, and large asteroids during single-threaded execution, but only after concurrently adding the other asteroids, the comets, the Kuiper-belt objects, and so on.

On x86, there was an earlier suggestion that the additions be done using the x86 xchg() instruction. If an attempt to add via xchg() returned non-NULL, the CPU would add the corresponding object back in. I have no idea how much performance this would give up, but it would clearly avoid any leakage.
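A minimal sketch of that xchg()-based scheme, using C11 atomic_exchange() (which compiles to xchg on x86). Everything here is an assumption made for illustration: a flat array of slots stands in for the tree's child pointers, and "adding the object back in" is modeled as moving on to the next slot, whereas a real tree would re-descend. The names slots, insert_body(), and xchg_demo() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NSLOTS 8

struct body {
    double mass;
};

/* Hypothetical stand-in for the tree's child pointers. */
_Atomic(struct body *) slots[NSLOTS];

/* Insert b starting at slot "start". Each exchange always stores the
 * pointer in hand; whatever comes back is an object that was just
 * displaced and must be added back in. Returns 0 on success, or -1 if
 * every slot was visited and one object is still left over. */
int insert_body(struct body *b, size_t start)
{
    assert(b != NULL);
    for (size_t i = 0; i < NSLOTS && b != NULL; i++) {
        size_t idx = (start + i) % NSLOTS;
        b = atomic_exchange(&slots[idx], b);
    }
    return b ? -1 : 0;
}

/* Single-threaded usage example: three bodies all aimed at slot 0.
 * Returns the number of occupied slots afterwards. */
int xchg_demo(void)
{
    static struct body a, b, c;
    int occupied = 0;

    for (size_t i = 0; i < NSLOTS; i++)
        atomic_store(&slots[i], NULL);

    insert_body(&a, 0);
    insert_body(&b, 0);
    insert_body(&c, 0);

    for (size_t i = 0; i < NSLOTS; i++)
        occupied += (atomic_load(&slots[i]) != NULL);
    return occupied;
}
```

Because every displaced pointer is held by exactly one thread until it is stored again, no object is silently dropped in this model unless the array fills up. The loop condition is still a branch on the exchanged value.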

Another point raised during the talk was the possibility of more efficient algorithms than Barnes-Hut, but on that question I must defer to someone with actual experience with such algorithms.
Jul. 2nd, 2013 09:36 pm (UTC)
Re: Approximate Data Structures
I don't know how the Barnes-Hut data structure is meant to evolve over the course of the n-body simulation. My guess is that the positions and velocities will change so the data structure would not be static. Depending on how advanced the implementation is the whole thing might even be thrown away for each simulation step rather than attempt to exploit spatial and/or temporal coherence. So I don't know if process exit is a useful leak mitigation strategy or not.

re: Distribution: Fair enough. Once you have a specific distribution in mind specialization is a valid approach to gaining better accuracy and performance. Yours is also a much better solution than my idea of breaking up the mass into constituent bodies! However, it would still be trying to produce something resembling the "worst-case". My wild guess is the uniform distribution is the best case in terms of errors introduced by the approximate data structures and I was trying to come up with something that would resemble the worst case such a general n-body solver with approximate data structures could produce.

You mentioned the uniform n-body issue came up during the talk so I'm curious how it was addressed there.

The xchg() idea is interesting. I believe I've seen it in the kernel before but it's been so long since I saw it that I only remember the technique and not where it was used. I imagine performance-wise it's a bit like the CAS solution only without the overhead of a branch.
Jul. 3rd, 2013 07:12 pm (UTC)
Re: Approximate Data Structures
Well, I never let ignorance stop me before, so here is my speculation on some answers to your interesting questions. That said, you might wish to contact the author. I believe that his email address is publicly available.

I would create a big array for all the objects, with the big boys at the end and an index to the last non-big-boy element. Each element has all the pointers needed to maintain the octtree (though the internal nodes could be allocated in a separate array if desired). The key point is that objects might be omitted from the octtree, but they remain in the underlying array. Alternatively, if the number of objects is not known up front, use a separate set of links to track the full set of objects (and follow these links instead of indexing through the array in the steps below).

So when it comes time to restructure the octtree, just start from the beginning, doing the non-big-boy elements in parallel, then doing the big-boy elements sequentially. This essentially discards the old octtree.

If it were desirable to incrementally update the octtree, first remove the big-boy elements sequentially, do the incremental update in parallel, then re-insert the big-boy elements, again, sequentially.
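The array layout and the full-rebuild procedure might be sketched as follows. This is only an illustration under stated assumptions: the names body_set, lossy_insert(), exact_insert(), and rebuild_tree() are hypothetical, the tree itself is reduced to an in_tree flag per body, and the losses caused by unsynchronized parallel insertion are simulated by dropping every drop_every-th small body, so the first loop runs sequentially here even though it represents the parallel phase.

```c
#include <assert.h>
#include <stddef.h>

struct body {
    double mass;
    int in_tree;  /* 1 if currently linked into the (hypothetical) tree */
};

struct body_set {
    struct body *bodies;  /* small bodies first, big bodies at the end */
    size_t n_total;
    size_t n_small;       /* bodies[n_small .. n_total-1] are the big ones */
};

/* Stand-in for a racy insertion: every drop_every-th call loses its
 * body (drop_every == 0 means no simulated loss). */
void lossy_insert(struct body *b, size_t seq, size_t drop_every)
{
    b->in_tree = (drop_every == 0) || ((seq + 1) % drop_every != 0);
}

/* Stand-in for a single-threaded insertion, which never loses a body. */
void exact_insert(struct body *b)
{
    b->in_tree = 1;
}

void rebuild_tree(struct body_set *s, size_t drop_every)
{
    size_t i;

    assert(s->n_small <= s->n_total);

    /* Discard the old tree. The bodies themselves stay in the array. */
    for (i = 0; i < s->n_total; i++)
        s->bodies[i].in_tree = 0;

    /* Phase 1: small bodies. A real version would run this loop in
     * parallel without synchronization. */
    for (i = 0; i < s->n_small; i++)
        lossy_insert(&s->bodies[i], i, drop_every);

    /* Phase 2: big bodies, sequentially, so none can be omitted. */
    for (; i < s->n_total; i++)
        exact_insert(&s->bodies[i]);
}

size_t count_in_tree(const struct body_set *s, size_t from, size_t to)
{
    size_t n = 0;
    for (size_t i = from; i < to; i++)
        n += (size_t)s->bodies[i].in_tree;
    return n;
}

/* Usage example: 8 small bodies, 2 big ones, every 4th small one lost.
 * Returns 1 if 6 small and both big bodies ended up in the tree. */
int rebuild_demo(void)
{
    struct body bodies[10] = {{0}};
    struct body_set s = { bodies, 10, 8 };

    rebuild_tree(&s, 4);
    return count_in_tree(&s, 0, 8) == 6 && count_in_tree(&s, 8, 10) == 2;
}
```

The point of the layout is that a body missing from the tree is still present in the array, so the next rebuild gets another chance to insert it.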

I have no idea what the best-case and worst-case distributions would look like. Why not try a few and see?

The author addressed the question about distributions by agreeing, and noting that his goal was to demonstrate one case where his lossy technique was useful. Determining the technique's area of applicability is future work, and he seemed to welcome help with this work.

Oddly enough, when used this way, xchg() still has CAS's branch overhead. The thing is that although the xchg()'s insertion is guaranteed to succeed, it might cause some other insertion to fail, hence returning a non-NULL pointer. So the xchg() has to retry not for its own failure, but for the failures that it induces.