

Later in this series of postings, we will get back to the question of what lessons from the old Great Software Crisis might be applied to the new Great Parallel Software Crisis. In the meantime, this post will look at what can make parallel programming easy. Just to maintain my reputation for arguing with myself, several upcoming posts will examine things that can make parallel programming difficult.

The easiest problems to parallelize are those whose data is partitionable. In this case, just assign a separate CPU to each partition and stand back! This can be as easy as the following shell script:

./myprogram 1 & ./myprogram 2 & ./myprogram 3 &

Here we run three copies of the program, each working on a different part of the problem. Other examples include the data parallelism used in technical computing and the data locking used in operating systems and server applications.

Many types of computer-aided design have datasets that are nearly partitionable, a classic case being finite-element analysis. Here, a grid is superimposed on the object being analyzed, and a CPU is assigned to each region of the grid. As we make the grid more fine-grained by a factor of N (thus usually making the solution more accurate), the amount of computation per CPU rises as N³, while the amount of communication between CPUs rises only as N². As N increases, the fraction of effort required for communication becomes arbitrarily small, allowing arbitrarily good performance and scalability.

However, as soon as there is communication, we do need to start being careful. Even in the tightest SMP, the overhead of communication is a good order of magnitude greater than that of computation, and this ratio often amounts to multiple orders of magnitude. Therefore, even a small amount of communication can destroy your software's scalability, regardless of whether that communication occurs within a single CPU core or across a large cluster interconnect. This is why partitioning is a good thing: the more complete the partitioning, the better. Completely partitionable problems are often called “embarrassingly parallel”, though it has never been clear to me why partitioning should embarrass anyone. Perhaps it is a cultural thing, but whatever the reason, I simply think of it as an embarrassment of performance-and-scalability riches.

But what about RCU? RCU neither partitions data nor replicates it, so why is it fast?

Of course, for lazy people like me, the best cases are where the parallelism is done for you, as has long been the case for SQL developers. The database's query optimizer handles the parallelism, so that the developers don't have to. That said, a quick Google search for “SQL performance” should suffice to convince you that even SQL developers have not yet reached Parallel Nirvana.

But can you think of cases where parallelism is easy but useless?

Partitioning and replicating your program's data can be a wonderful path to excellent performance and scalability, but life is not always that easy. The next set of posts will look at some of the things that can make life difficult.


Dec. 11th, 2009 03:23 am (UTC)
You can use a library that (almost entirely) encapsulates the parallelism. By axiom, a language is just a library with a syntax; thus SQL or regex.

Using VSIPL++, you can write an array processing program (e.g. that finite-element thing) as if it were a sequential program, and let the library partition it and keep all your processors busy. The "almost" shows up only where adding the occasional copy from a poorly organized array to one better-organized (e.g. one partitioned into rows to one partitioned instead by columns) makes the program run much faster. As encapsulation failures go, that's pretty benign.

It says something about MPI that it can be so completely encapsulated, allowing the same program to run well on a dual-core laptop or a supercomputer without recompiling.

To get full benefit from a powerful library, you often need to be using a powerful language. There's a reason VSIPL++ exists only in C++. Something much like it could be written in Haskell, but not in Lisp.
Dec. 11th, 2009 04:30 am (UTC)
Nice marketing piece! :-)
But you do make a good point — relational databases are not the only specialized computational model that enables automatic parallelization. Large-matrix linear-algebra programs are indeed another.

Of course, fairness dictates mention of other products and projects targeting large linear-algebra applications, including RapidMind, Matlab*p, and perhaps Gedae as well, though some would say that the latter is targeted more specifically to signal-processing applications.

Other examples of situations enabling automatic parallelization are sorting large data sets, all manner of image processing (hence all those GPGPUs), genetic algorithms, Monte Carlo simulations, and many others besides.

The library approach does have its limits, however. One limit is Amdahl's Law, in which scalability is limited by the fraction of the program that is single-threaded. Some other limitations that occur in less-specialized settings are set forth in Hans-J. Boehm's paper “Threads Cannot Be Implemented as a Library”.

On your point about powerful libraries needing powerful languages: how do you explain the SQL linkage APIs in all manner of languages? Perhaps this point is specific to things like VSIPL++.
Dec. 11th, 2009 12:39 pm (UTC)
Re: Nice marketing piece! :-)
Matlab is slower, and not very embeddable. It's common to trade off speed or expressiveness. The reason I mentioned VSIPL++ is that it -- uniquely, to my knowledge -- doesn't trade off absolute speed for anything else. (Speed, parallelism, pretty code; choose any three?) Build times go through the roof, of course, but who cares about that?

SQL is not an expressive language. By artificially limiting the expressiveness of your language, you make it easier to optimize for the fewer things you allow the user to say. As you noted earlier, Excel and PowerPoint accomplished a lot that way.

Amdahl's Law applies everywhere, libraries or no. The need to implement a feature directly useful for application coding as a language built-in identifies a weakness in the language. The valuable language features are the ones that help code powerful libraries. E.g. C++ vtables imply a weakness. Exceptions imply a weakness. Maybe multi-argument function calls imply a weakness. A more powerful language would have those implemented in the standard library.

The less we say about Lisp's and Haskell's lists and GC, perhaps the better.
Dec. 13th, 2009 06:40 am (UTC)
My take on points of agreement and otherwise
We both agree that libraries can provide some parallelism to an otherwise sequential application. You are more enthusiastic about this approach than am I, but reasonable people can hold different opinions and outlooks.

We both agree that specialization can ease automation of parallelization and other optimizations, at the expense of limiting what the user can do.

We both agree that Amdahl's law is, if not universal, pretty close to it. The amount of trouble that Amdahl's law causes you will depend on your application. It is quite possible that you have seen a group of applications that are more amenable to library-based scalability than those that I am familiar with. But please note that I wholeheartedly agree that library-based parallelism is valuable and has a strong role to play in the Great Parallel Software Crisis. We might differ on exactly how strong a role, but we do agree that it will play a strong role.

You are more focussed on language features than am I, and again, reasonable people can hold differing opinions and outlooks.

All that aside, regardless of our individual experiences and outlooks, I thank you for your contributions to parallelism in general and wish you the best with VSIPL++!!!
Dec. 15th, 2009 01:43 am (UTC)
Re: My take on points of agreement and otherwise
I think we agree on everything. My astonishment that a library could automatically parallelize a large class of computations with no restriction on expressive power, and no appreciable overhead or conceptual leakage, led me to explore what conditions allowed that to be achieved. Maybe you're harder to astonish...
Dec. 15th, 2009 07:11 pm (UTC)
Some might indeed argue that my hair color indicates that my supply of astonishment is greatly diminished. ;-)
Dec. 11th, 2009 04:12 am (UTC)
Indeed for SQL we've found that the better way is to not execute a query in parallel, but just find better ways of executing 10,000 queries at the same time. For a lot of applications (e.g. web apps) queries are relatively short lived and any individual CPU core is more than capable of executing the query in adequate time. However, how best you service your 100,000 concurrent users... that's where it's interesting.

- stewart
Dec. 11th, 2009 04:48 am (UTC)
Good point!
Indeed, to benefit from intra-query parallelism, you need one whopping big SQL query and lots of mass-storage devices, lots of host-bus adapters, and impressive I/O bandwidth. I nevertheless have seen intra-query parallelism used to good effect in data-mining applications. Such an application in the late 1990s used no fewer than 20 big disk arrays connected via SAN to a single-OS-instance NUMA system with many tens of CPUs. If I remember correctly, one of the queries took something like 20 minutes to complete — and consumed every scrap of computational power and I/O bandwidth that the system had to offer.

As you say, inter-query parallelism is the only way to go for small queries. However, if each query completes in (say) 100 milliseconds, you need tens of thousands of queries per hour for there to be much chance of having more than one query in flight at a given time. There certainly are many web sites with that much traffic, but they are a small fraction of the total.

So much does indeed depend on your workload!