I presented “Is Parallel Programming Hard, And, If So, Why?” at the Multicore and Parallel Computing Miniconf, and had the good fortune of having Linus in the audience. As you will quickly learn if you post to LKML, Linus often gives very insightful (and sometimes quite pointed) feedback. My presentation was no exception.
Linus noted that many parallel programming experts are unwilling to admit that there are algorithms that do not parallelize nicely.
As you might expect, I responded by saying that parallelism is an optimization, and like other optimizations has places where it works well and places where it does not.
What else could I have said?
Linus noted that many parallel programming experts are unwilling to admit that there are algorithms that do not parallelize nicely.
As you might expect, I responded by saying that parallelism is an optimization, and like other optimizations has places where it works well and places where it does not.
What else could I have said?
As some of you know, I have been working on a book on parallel programming. My thought had been to complete it, then announce it. But I finally realized that it really never will be complete, at least not as long as people keep coming up with new parallel-programming ideas, which I most definitely hope will continue for a very long time.
So, here it is!
So, here it is!
So what can we say about parallel programming?
The many answers to this question fills many a book, but the following
points seem especially worth emphasizing:
But how can parallel software ever hope to keep up with the exponential growth in the number of cores?
The many answers to this question fills many a book, but the following
points seem especially worth emphasizing:
- Parallelism is universal and intuitive. It is programming, whether parallel or not, that is counter-intuitive.
http://paulmck.livejournal.com/15144.html - The primary cause of the Great Multicore Software Crisis is the new-found low price and large supply of shared-memory parallel systems. Many lessons can be learned from the Great Software Crisis of the 1970s and 1980s, which had similar economic underpinnings.
http://paulmck.livejournal.com/15526.html - “Embarrassing parallelism” is really not an embarrassment at all, except perhaps to those who thoughtlessly use this as a derogatory term. To the rest of is, it is truly an embarrassment of riches. Where they apply, trivial solutions are indeed a wonderful thing.
http://paulmck.livejournal.com/15851.html - Parallel programmers can be trained, and the conditions for training them has improved greatly, and continues to improve. However, in parallel programming, raw intelligence without experience or expert guidance is almost always a negative: The more intelligent you are, the deeper a hole you will dig for yourself before realizing that you are in trouble.
http://paulmck.livejournal.com/15977.html - During the decades when parallel systems were rare and expensive, Darwinian selection favored projects whose leaders were afraid of or revulsed by parallelism. This fear and revulsion must be overcome if such projects are to be successful on multicore systems.
http://paulmck.livejournal.com/16201.html - Sequential software is likely to have made design and coding choices that, while eminently reasonable for a sequential program, are completely inappropriate for parallel systems. These historical choices must be dealt with if the software is to be converted to run multithreaded.
http://paulmck.livejournal.com/16478.html - Many of the complications that parallelism introduces are global in nature. For example, the root cause of a deadlock might be distributed widely throughout the parallel program. People debugging parallel programs therefore have a greater than usual need to view the source code for the entire application.
http://paulmck.livejournal.com/16752.html - They say that you get what you measure. Therefore, be very careful what benchmarks you use when evaluating multicore systems. Of course, the same caution applies when evaluating single-CPU systems.
http://paulmck.livejournal.com/17124.html - Although Amdahl's law is useful, it measures only scalability. Raw performance is, if anything, more important than is scalability.
http://paulmck.livejournal.com/17395.html - They say that a developer is only as good as his or her tools. This is as true of parallel programming as it is of sequential programming.
http://paulmck.livejournal.com/17538.html - Multithreaded programs place added stress on testing and validation. Interestingly enough, running parallel programs on systems containing CPUs running at different frequencies can cause certain kinds of race conditions to occur more frequently. So, if you choose to develope multithreaded software, beef up your validation efforts accordingly.
http://paulmck.livejournal.com/17787.html - Beware the expert who claims something is impossible. Perhaps it is in fact impossible for him or her, but that does not mean that it is impossible for everyone.
http://paulmck.livejournal.com/18041.html - Creating a new artifact requires a different set of skills than does maintaining and operating an existing artifact. But parallelizing an existing artifact is more an act of creation than of maintenance or operation. However, there are some common special cases that can easily parallelized by operations and maintenance personnel.
http://paulmck.livejournal.com/18263.html - Just as the Great Software Crisis of the 1970s and 1980s spawned the good, the fad, and the ugly, so too will the current Great Multicore Software Crisis. But never forget that the definition of “good” is something that delivers either order-of-magnitude improvements in productivity or order-of-magnitude improvements in the size of population able to make good use of multicore systems. Preferably both.
http://paulmck.livejournal.com/18539.html
But how can parallel software ever hope to keep up with the exponential growth in the number of cores?
Back in the Heeding History posting in this series, I argued that the solutions put forward to deal with the Great Software Crisis of the 1970s and 1980s could be grouped into the good, the fad, and the ugly. Is a similar grouping appropriate for the present-day Great Multicore Software Crisis?
Perhaps a review of the definition of “good” would be helpful: a good multicore programming language will (1) increase the productivity of existing multicore developers by orders of magnitude, (2) increase the fraction of the population who can make good use of multicore/parallel systems, again by orders of magnitude, or preferably (3) both. Yes, this is a high bar, but this is the bar that must be cleared.
The programming languages in the “ugly” category are quite apparent: MPI, OpenMP, C/C++ with POSIX pthreads or fork/wait, Bourne shell and relatives (don't forget the & operator and the
It is too early to tell exactly which of the much-hyped multicore programming languages and environments will fall into the “fad” category, but it seems clear that a great many of them will. If you don't like my attitude, feel free to protest all you like. However, I have heard it all long ago from the many and vociferous proponents of such wonderous gems as PASCAL, Modula, Eiffel, ADA, and many others besides. Rest assured that your protestations sound very much like theirs did. And if you really don't like my attitude, the only correct response is to prove me wrong. Just keep the definition of “good” clearly in mind, and also mind the gap between research prototypes and production-quality software.
What if you cannot see that there is a gap between good research and solid practice?
And just what are the good multicore parallel programming languages?
Perhaps a review of the definition of “good” would be helpful: a good multicore programming language will (1) increase the productivity of existing multicore developers by orders of magnitude, (2) increase the fraction of the population who can make good use of multicore/parallel systems, again by orders of magnitude, or preferably (3) both. Yes, this is a high bar, but this is the bar that must be cleared.
The programming languages in the “ugly” category are quite apparent: MPI, OpenMP, C/C++ with POSIX pthreads or fork/wait, Bourne shell and relatives (don't forget the & operator and the
wait command!), and numerous other sturdy indefensibles. Some say that these languages should have been strangled at birth, but on the other hand they are always there when you need them.It is too early to tell exactly which of the much-hyped multicore programming languages and environments will fall into the “fad” category, but it seems clear that a great many of them will. If you don't like my attitude, feel free to protest all you like. However, I have heard it all long ago from the many and vociferous proponents of such wonderous gems as PASCAL, Modula, Eiffel, ADA, and many others besides. Rest assured that your protestations sound very much like theirs did. And if you really don't like my attitude, the only correct response is to prove me wrong. Just keep the definition of “good” clearly in mind, and also mind the gap between research prototypes and production-quality software.
What if you cannot see that there is a gap between good research and solid practice?
And just what are the good multicore parallel programming languages?
Although IT professionals should take care to avoid engineering envy, it is often useful to learn from the experiences of other engineering disciplines. In this posting, I will compare and contrast construction of a building to implementation of a large software project.
Leaving aside financial engineering, building construction starts with an architect, who lays out the general shape and look of the building. A structural engineer creates a detailed design, with an eye towards ensuring that the building will remain standing despite the best efforts of wind, gravity, and plate tectonics. A construction engineer works out the details of the construction process — for example, it is good if the building can support itself while being built as opposed to doing so only when completed. Other engineering specialties may be required as well, for example, HVAC (heating, ventilating, and air conditioning).
Once the building is built, different skills are needed, including operating engineers, maintenance personnel, and janitors.
A very similar sequence of events can play out for a large software application. Software architects (for better or worse) lay out the general shape of the project, developers design and code it, and others ensure that it is built, tested, and safely ensconced in some source-code management system.
However, once the application is completed, it is likely that its care and feeding will be taken over by application, database, and system administrators. The architects and developers will switch to other projects (possibly version N+1 of this same application), and perhaps even retire or otherwise move on. Of course, if the application runs at multiple sites, there might well be a separate set of administrators for each site. But for simplicity, let's assume that this application runs at only one site.
Now suppose that it is necessary to parallelize this application.
This is tantamount to major structural change to the building, such as adding several new floors. A structural change of this nature is clearly not a job that you would normally entrust to operating engineers, maintenance personnel, or janitors.
But what else can you do if the original architects and developers are gone?
Leaving aside financial engineering, building construction starts with an architect, who lays out the general shape and look of the building. A structural engineer creates a detailed design, with an eye towards ensuring that the building will remain standing despite the best efforts of wind, gravity, and plate tectonics. A construction engineer works out the details of the construction process — for example, it is good if the building can support itself while being built as opposed to doing so only when completed. Other engineering specialties may be required as well, for example, HVAC (heating, ventilating, and air conditioning).
Once the building is built, different skills are needed, including operating engineers, maintenance personnel, and janitors.
A very similar sequence of events can play out for a large software application. Software architects (for better or worse) lay out the general shape of the project, developers design and code it, and others ensure that it is built, tested, and safely ensconced in some source-code management system.
However, once the application is completed, it is likely that its care and feeding will be taken over by application, database, and system administrators. The architects and developers will switch to other projects (possibly version N+1 of this same application), and perhaps even retire or otherwise move on. Of course, if the application runs at multiple sites, there might well be a separate set of administrators for each site. But for simplicity, let's assume that this application runs at only one site.
Now suppose that it is necessary to parallelize this application.
This is tantamount to major structural change to the building, such as adding several new floors. A structural change of this nature is clearly not a job that you would normally entrust to operating engineers, maintenance personnel, or janitors.
But what else can you do if the original architects and developers are gone?
An Eminent Reader privately indicated some distaste for the non-technical nature of recent parallel programming posts. Given that many of the obstacles to successful development of parallel software are non-technical, there will be future non-technical posts, but there is no reason not to take a technical break from these issues. And so, just for you, Eminent Reader, I present this parallel programming puzzle.
This puzzle stems from some researchers’ very selective struggles with parallel algorithms. Of course, it should be no surprise that many people, researchers and developers included, will struggle quite happily with their “baby”, but will even more happily bad-mouth competing approaches, even when (or perhaps especially when) those approaches requiring much less struggling. And yes, some might accuse me of favoring RCU in just this manner, but this is my answer to the likes of them.
Such selective struggling seems to have given rise to an interesting urban legend within the concurrency research community, namely that allowing concurrent access to both ends of a double-ended queue is difficult when using locking.
Can you come up with a lock-based solution that permits the two ends of a double-ended queue to be manipulated concurrently?
This puzzle stems from some researchers’ very selective struggles with parallel algorithms. Of course, it should be no surprise that many people, researchers and developers included, will struggle quite happily with their “baby”, but will even more happily bad-mouth competing approaches, even when (or perhaps especially when) those approaches requiring much less struggling. And yes, some might accuse me of favoring RCU in just this manner, but this is my answer to the likes of them.
Such selective struggling seems to have given rise to an interesting urban legend within the concurrency research community, namely that allowing concurrent access to both ends of a double-ended queue is difficult when using locking.
Can you come up with a lock-based solution that permits the two ends of a double-ended queue to be manipulated concurrently?
An earlier post noted that parallel programming suffers more potential failures in planning than does sequential programming due to the usual suspects: deadlocks, memory misordering, race conditions, and performance/scalability issues. This should lead us to suspect that parallel programs might need better quality assurance (Q/A) than do sequential programs. Q/A activities include validation, verification, inspection, review, and of course testing.
Although there are many highly effective Q/A groups in many software development organizations, it is not hard to find Q/A groups that find bugs, but that either cannot or will not get developers to pay attention to them. It is also not hard to find Q/A groups that are overridden whenever they point out problems that might cause a schedule slip. One way to avoid these problems is via enlightened management based on (for example) bug trends over time, and another way is for the Q/A organization to report high up into the organization. Of course, with this latter approach, one wonders just how often the Q/A organization can get away with yanking on the silver chain connecting to their executive sponsor.
Of course, FOSS communities have their own Q/A challenges, but the fact that the maintainers are usually responsible for the quality of their code adds a breath of fresh air to the process. Not least, their gatekeeper role enables them to vigorously enforce any design and coding guidelines that their FOSS community might have.
But what are the technical effects of parallel software on Q/A?
Traditionally, Q/A groups serve many roles:
- Run tests and find bugs.
- Break in new hires, who, strangely enough, are sometimes reluctant to irritate developers.
- Distract developers who are already behind schedule with pesky bugs.
- Act as scapegoat for schedule slips.
- Act as a target of complaints from developers who are tired of debugging either their new features or any bugs located by the Q/A group.
Although there are many highly effective Q/A groups in many software development organizations, it is not hard to find Q/A groups that find bugs, but that either cannot or will not get developers to pay attention to them. It is also not hard to find Q/A groups that are overridden whenever they point out problems that might cause a schedule slip. One way to avoid these problems is via enlightened management based on (for example) bug trends over time, and another way is for the Q/A organization to report high up into the organization. Of course, with this latter approach, one wonders just how often the Q/A organization can get away with yanking on the silver chain connecting to their executive sponsor.
Of course, FOSS communities have their own Q/A challenges, but the fact that the maintainers are usually responsible for the quality of their code adds a breath of fresh air to the process. Not least, their gatekeeper role enables them to vigorously enforce any design and coding guidelines that their FOSS community might have.
But what are the technical effects of parallel software on Q/A?
In a previous life, I worked on a UNIX kernel on a team of a few tens of developers. This smallish team size resulted in a very smallish set of development tools. To see why, imagine a tool that required one developer one year to create, and that would provide a 1% improvement in everyone's productivity. Given 50 developers, it would take two years to recoup the time spent developing this tool. In contrast, consider the Linux kernel, which usually has thousands of developers. Assuming 1,000 developers, it would take less than six weeks to recoup the time spent developing this tool.
This example does much to explain the large quantity of tooling available for the Linux kernel. For example, the UNIX kernel I worked on did not have anything like
It is important to note that this effect is due to the number of Linux-kernel developers, not to the FOSS nature of the Linux kernel — except insofar as the FOSS nature of the Linux kernel has enabled it to accumulate a large number of developers.
So what does this have to do with the perceived difficulty of parallel programming?
This example does much to explain the large quantity of tooling available for the Linux kernel. For example, the UNIX kernel I worked on did not have anything like
lockdep, powertop, latencytop, or perf for in-kernel use, nor anything like valgrind or perf for user-mode use. Given the relatively small number of developers on that project, it almost always made more sense to find and fix problems manually.It is important to note that this effect is due to the number of Linux-kernel developers, not to the FOSS nature of the Linux kernel — except insofar as the FOSS nature of the Linux kernel has enabled it to accumulate a large number of developers.
So what does this have to do with the perceived difficulty of parallel programming?
The initialization portion of a kernel or server application may be amortized over an arbitrarily large runtime, limited only by the robustness of the kernel or application. If the runtime portion of that kernel or application is fully parallel, Amdahl's law gives a speedup as follows:
S = 1 / (1 - P)
As the runtime of the kernel/application increases, the parallelizable fraction P approaches
1, causing S to approach infinity, for perfect speedup.
We obtain the same result from Gustafson's law:
S(P) = P - α (P - 1)
As the runtime increases, the non-parallelizable portion of the program α approaches zero, so that the speedup S(P) approaches the number of processors P, which indicates perfect linear scalability.
What do you do if Amdahl's law says your program is perfect?
S = 1 / (1 - P)
As the runtime of the kernel/application increases, the parallelizable fraction P approaches
1, causing S to approach infinity, for perfect speedup.
We obtain the same result from Gustafson's law:
S(P) = P - α (P - 1)
As the runtime increases, the non-parallelizable portion of the program α approaches zero, so that the speedup S(P) approaches the number of processors P, which indicates perfect linear scalability.
What do you do if Amdahl's law says your program is perfect?
Benchmarking has a checkered history in IT. It is not for nothing that some have paraphrased the famous and well-worn quotation “lies, damned lies, and statistics” to “lies, damned lies, and benchmarks.” This is of course unfair to benchmarks, which have been an important driving force behind improvements to computing technologies ranging from CPUs to compilers to storage arrays to graphics processors to databases. At their best, benchmarks play a number of exceedingly important roles:
Benchmarks from the Transaction Processing Council (TPC) and the Standard Performance Evaluation Corporation (SPEC) have been very helpful in the past, and so we can hope that good benchmarks will play an important role in accelerating the evolution of multicore hardware and software. That said, designing benchmarks is a painstaking exercise, not to be undertaken lightly. It is therefore interesting to look carefully at the Fibonacci benchmark that has been used for well over a decade to evaluate parallel work-stealing schedulers. An early paper using this benchmark admits that this is “a terrible way to compute Fibonacci numbers”, but argues that it stresses task creation and synchronization overhead.
A doubly recursive C-language implementation of the Fibonacci function takes about 18 seconds to compute
But again, the authors of the aforementioned paper argue that this stresses the thread creation and synchronization overhead. Is this a good justification for use of this benchmark?
- Providing a fair framework for comparing competing implementations on performance, price-performance, performance per watt, time to solution, and much else besides.
- Focusing competitive energy on improving implementations in ways that matter to users.
- Serving as example uses of the technologies being benchmarked.
Benchmarks from the Transaction Processing Council (TPC) and the Standard Performance Evaluation Corporation (SPEC) have been very helpful in the past, and so we can hope that good benchmarks will play an important role in accelerating the evolution of multicore hardware and software. That said, designing benchmarks is a painstaking exercise, not to be undertaken lightly. It is therefore interesting to look carefully at the Fibonacci benchmark that has been used for well over a decade to evaluate parallel work-stealing schedulers. An early paper using this benchmark admits that this is “a terrible way to compute Fibonacci numbers”, but argues that it stresses task creation and synchronization overhead.
A doubly recursive C-language implementation of the Fibonacci function takes about 18 seconds to compute
fib(47) on my laptop. (Why fib(47)?) One could imagine this running faster if executed in parallel, although considerable imagination is required. For purposes of comparison, an iterative C-language implementation computes fib(47) in less than 100 nanoseconds, more than 8 orders of magnitude faster. Worse yet, the interpretive computer algebra system Maxima computes the exact 200,000-digit value of fib(1,000,000) in less than 500 milliseconds, which is still much less time than that required for fib(47) by the doubly recursive C-language implementation.But again, the authors of the aforementioned paper argue that this stresses the thread creation and synchronization overhead. Is this a good justification for use of this benchmark?