Later in this series of postings, we will get back to the question of what lessons from the old Great Software Crisis might be applied to the new Great Parallel Software Crisis. In the meantime, this post will look at what can make parallel programming easy. Just to maintain my reputation for arguing with myself, several upcoming posts will examine things that can make parallel programming difficult.

The easiest problems to parallelize are those whose data is partitionable. In this case, just assign a separate CPU to each partition and stand back! This can be as easy as the following shell script:

./myprogram 1 & ./myprogram 2 & ./myprogram 3 &
wait # don't exit until all three instances have finished

Here we run three copies of the program, each working on a different part of the problem. Other examples include the data parallelism used in technical computing and the data locking used in operating systems and server applications.

Many types of computer-aided design have datasets that are nearly partitionable, a classic case being finite-element analysis. Here, a grid is superimposed on the object being analyzed, and a CPU is assigned to a region of the grid. If we make the grid more fine-grained by a factor of N (thus usually making the solution more accurate), the amount of computation per CPU rises as N^3, while the amount of communication between CPUs rises only as N^2. As N rises, the fraction of effort required for communication becomes arbitrarily small, allowing arbitrarily good performance and scalability.

However, as soon as there is communication, we do need to start being careful. Even in the tightest SMP, the overhead of communicating is a good order of magnitude greater than that of computation, and this ratio often amounts to multiple orders of magnitude. Therefore, even a small amount of communication can destroy your software's scalability, regardless of whether that communication is within a single CPU core or across a large cluster interconnect. This is why partitioning is a good thing: the more complete the partitioning, the better. Completely partitionable problems are often called “embarrassingly parallel”, though it has never been clear to me why partitioning should embarrass anyone. Perhaps it is a culture thing, but whatever it is, I simply think of it as an embarrassment of performance and scalability riches.

But what about RCU? RCU neither partitions data nor replicates it, so why is it fast?

Of course, for lazy people like me, the best cases are where the parallelism is done for you, as has long been the case for SQL developers. The database's query optimizer handles the parallelism, so that the developers don't have to. That said, a quick Google search for “SQL performance” should suffice to convince you that even SQL developers have not yet reached Parallel Nirvana.

But can you think of cases where parallelism is easy but useless?

Partitioning and replicating your program's data can be a wonderful path to excellent performance and scalability, but life is not always that easy. The next set of posts will look at some of the things that can make life difficult.

Comments

macplusg3
Dec. 11th, 2009 04:12 am (UTC)
Indeed for SQL we've found that the better way is to not execute a query in parallel, but just find better ways of executing 10,000 queries at the same time. For a lot of applications (e.g. web apps) queries are relatively short lived and any individual CPU core is more than capable of executing the query in adequate time. However, how best you service your 100,000 concurrent users... that's where it's interesting.

- stewart
paulmck
Dec. 11th, 2009 04:48 am (UTC)
Good point!
Indeed, to benefit from intra-query parallelism, you need one whopping big SQL query and lots of mass-storage devices, lots of host-bus adapters, and impressive I/O bandwidth. I nevertheless have seen intra-query parallelism used to good effect in data-mining applications. Such an application in the late 1990s used no fewer than 20 big disk arrays connected via SAN to a single-OS-instance NUMA system with many tens of CPUs. If I remember correctly, one of the queries took something like 20 minutes to complete — and consumed every scrap of computational power and I/O bandwidth that the system had to offer.

As you say, inter-query parallelism is the only way to go for small queries. However, if each query completes in (say) 100 milliseconds, you need tens of thousands of queries per hour for there to be much chance of having more than one query in flight at a given time. There certainly are many web sites with that much traffic, but they are a small fraction of the total.

So much does indeed depend on your workload!