What Makes Parallel Programming Hard?

One of the challenges of multi-core and tera-scale architectures is how to make parallel programming “easier”. But what makes it hard in the first place? I thought it might be worth explaining some of our experiences here as a prelude to explaining how we’re solving it. I’ve ranked the things that make parallel programming hard in roughly increasing order of difficulty:

1. Finding The Parallelism

2. Avoiding The Bugs

3. Tuning Performance

4. Future Proofing

5. Using Modern Programming Methods

Finding The Parallelism: The first problem confronting the programmer is identifying where parallelism exists in their application. The solution might be as simple as “Hmm…I’m calling function A after function B and they don’t interact in any way, so I’ll turn them into parallel tasks” or “I’m doing the same thing to this entire array of elements, so let me do it all in parallel.” Sometimes, however, it’s a little trickier: the algorithm you use for good sequential (non-parallel) performance might not be easy to parallelize. To use a real-world example: a couple of years ago we were trying to parallelize an MPEG-4 video compression encoder. It seemed easy enough on the surface: there’s a lot of parallelism in those video frames! However, a funny thing happened on the way to getting good sequential performance…
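Before the video story continues, here’s roughly what those two easy cases might look like in code. This is only a sketch using OpenMP as one possible notation; functionA, functionB, and the array are hypothetical stand-ins for real work:

#include <omp.h>

void functionA(void);   /* hypothetical: does not interact with functionB */
void functionB(void);

/* Case 1: two calls that don't interact become two parallel tasks. */
void run_as_tasks(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        functionA();
        #pragma omp section
        functionB();
    }
}

/* Case 2: the same operation over a whole array becomes a parallel loop. */
void scale_elements(float *data, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] *= 2.0f;
}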

One of the tricks of video compression is to try to predict where a particular chunk of a previous video frame will go in the next video frame. This is called “motion estimation”, and it allows us to represent the chunk much more compactly. However, it is a really, really compute-intensive part of the code. So, one of the tricks that clever video hackers used was to predict the current motion vector from neighboring motion vectors. This makes intuitive sense…pixels next to each other are likely to be moving in the same direction in the video stream (think of an object moving through the scene, or a camera pan). This really narrowed the amount of searching needed for each chunk. The results were amazing: with little perceptible loss in video quality, you could speed up your video compression by up to a couple of orders of magnitude!

But this causes problems for parallelism…the sequential speedup came from creating a dependence between motion vector computations within each frame, so we could no longer (easily) parallelize them. The quality of the sequential result had raised the bar so high that we had to rethink the algorithm a little (which we did, successfully, by the way).
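A rough sketch of why the dependence serializes things (the block counts, names, and the trivial search stub are made up for illustration, not the encoder’s actual code): each block’s search is seeded from neighbors computed earlier in the same frame, so a block can’t start until the blocks to its left and above are done.

#define COLS 45   /* hypothetical: blocks per row */
#define ROWS 36   /* hypothetical: block rows per frame */

typedef struct { int dx, dy; } MV;   /* a motion vector */

/* Stand-in for the real search: the actual encoder searches a small
   window around the seed instead of the whole frame. */
static MV narrow_search(int bx, int by, MV seed) { (void)bx; (void)by; return seed; }

static void estimate_frame(MV mv[ROWS][COLS])
{
    for (int by = 0; by < ROWS; by++) {
        for (int bx = 0; bx < COLS; bx++) {
            /* Seed the search from neighbors computed earlier in this frame.
               This is what makes the sequential version fast... */
            MV seed = { 0, 0 };
            if (bx > 0)
                seed = mv[by][bx - 1];
            else if (by > 0)
                seed = mv[by - 1][bx];

            /* ...and also what chains the blocks together, so we can't
               simply hand every block of the frame to a different core. */
            mv[by][bx] = narrow_search(bx, by, seed);
        }
    }
}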

Avoiding The Bugs: Threads executing at the same time and accessing data at the same time introduce a new class of bugs that are really difficult to reason about and, in some cases, to reliably find. These are called “data races”…termed such because the result of the computation often depends on the outcome of multiple threads racing to write a shared object. Programmers have to be careful to make sure that any accesses to shared data objects are well coordinated. There are some pretty subtle issues here, but even a simple example illustrates the complex interactions. If two threads attempt to swap two variables, one of which is shared, as illustrated in the pseudocode below, several outcomes are possible. First, the two swaps might not overlap at all, so that the outcome is correct. However, there are several possible interleavings should they overlap. One bad outcome is when the assignments “*B = *A” are interleaved between the two swaps: we’d end up losing the value of either “c” or “b”.

void swap(int *A, int *B) {
    int tmp = *B;
    *B = *A;
    *A = tmp;
}

int a, b, c;

/* thread 1 */ {
    swap(&c, &a);
}

/* thread 2 */ {
    swap(&b, &a);
}
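One standard way to avoid the race (a sketch, and only one of several options) is to make the swap atomic with respect to other swaps by guarding it with a lock; the lock name here is made up for the example:

#include <pthread.h>

/* A single lock guarding all swaps. */
static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;

void swap(int *A, int *B)
{
    pthread_mutex_lock(&swap_lock);   /* only one swap can be in flight at a time */
    int tmp = *B;
    *B = *A;
    *A = tmp;
    pthread_mutex_unlock(&swap_lock);
}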

Tuning Performance: When we worked on that video compression project I mentioned earlier, we probably spent 95% of our time on tuning performance. Why? Well, in the world of multi-cores, things like cache behavior, how the cores are connected, and so on make a much bigger difference for performance. For example, in linear algebra algorithms (*), “blocking” is used to improve the caching behavior of the algorithm. In layman’s terms, this means that we modify the algorithm to work progressively on chunks of the problem that fit in the various levels of the cache. So, it’s useful to know what the cache sizes are.
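As a rough illustration of blocking (a sketch only: the block size BS is a hypothetical tuning parameter you would match to a particular cache level), here is a tiled matrix multiply:

#define N  1024
#define BS 64    /* hypothetical block size chosen to fit a cache level */

/* Assumes C is zero-initialized; walks the matrices block by block so
   each block-sized working set stays resident in the cache. */
void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++)
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
}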

With multi-core, this gets a little trickier because some of the cache levels are shared (e.g. a shared L2 cache in Core 2 Duo). So, you have to know how much of the L2 cache you can use….and you may not even know what’s running on the other core and how much cache it wants to use! Looking forward, this gets even more complex as we consider distributed shared caches, complex ways of connecting these cores, etc.

(*) Why is linear algebra important? It pops up in a lot of places…for example, it permeates those video games your kids (or you) are playing.

Future Proofing: For sequential applications, you benefit from the increasing performance of single cores from processor generation to processor generation. The same should be true for parallel applications, but it’s not so easy. Everything I said for item 3 above is exacerbated by increasing core counts: we know that cache sizes, interconnects, and so on will change; we just can’t say exactly how for products we haven’t even begun to design. And this will affect performance.

For example, one of the things that will likely change is how many cycles it takes for cores to communicate with each other. More importantly, the amount of arithmetic my core can do while communicating with another core will probably increase over time. So, if I implement an algorithm that does 100 arithmetic operations (in 100 cycles) while synchronizing with another core, which takes 50 cycles, then we’ll keep the cores nice and busy. However, if I run that same binary on a system with a higher clock frequency, so that the relative synchronization time increases to 100 or more cycles, then I’ll be spending increasing amounts of time waiting around doing no arithmetic.

Even without knowing precisely what future architectures will look like, we must still be able to future-proof our code through “forward-scaling” programming models, tools, and algorithms.
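As one tiny illustration of what forward-scaling can mean in practice (a sketch; OpenMP is just a stand-in for any model with this property): express the work as independent iterations and let the runtime decide the thread count and chunking, instead of compiling in today’s core count or cache size.

#include <omp.h>

void sum_of_squares(const double *x, long n, double *result)
{
    double sum = 0.0;

    /* Nothing here says "4 cores" or "8 MB of cache": the runtime picks
       the thread count and the chunking for whatever machine runs the binary. */
    #pragma omp parallel for reduction(+:sum) schedule(guided)
    for (long i = 0; i < n; i++)
        sum += x[i] * x[i];

    *result = sum;
}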

Using Modern Programming Methods: I learned to program in the “loop-y” days of C and Fortran programming. This meant that the compute-intensive parts of most programs were confined to well-understood, easy-to-identify chunks of code (typically loops) that we call “kernels”. How things have changed…

These days, most software development occurs in some object-oriented programming language, like C++, Java, C#, Ruby, or Python. The abstraction methods that make these languages so attractive to software developers (and, indeed, arguably raised software engineering to an art) also make it quite difficult to find the compute-intensive code we want to target for parallelization. Instead of loops, we have highly abstracted iterators, which are themselves composed of many (virtual) function calls. Moreover, the processing of a large collection of data (which we parallelization folks look for) will often involve traversing object hierarchies and complex data structures in varying and unpredictable orders. These aren’t the loops your parents knew and loved…and I don’t think we are going to (or should) reverse this trend.
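To caricature the contrast (the Node and Sprite types below are made up purely for illustration): the old-style kernel announces itself as a flat loop, while the modern equivalent hides the same kind of work behind containers, iterators, and virtual calls.

#include <vector>

// The "loop-y" kernel: the hot spot is obvious and easy to parallelize.
void scale_all(float *values, int n)
{
    for (int i = 0; i < n; i++)
        values[i] *= 2.0f;
}

// Hypothetical scene-graph-flavored types, for illustration only.
struct Node {
    virtual ~Node() {}
    virtual void update() = 0;          // virtual dispatch hides the real work
};

struct Sprite : Node {
    float x, y;
    Sprite() : x(0), y(0) {}
    void update() { x += 1.0f; y += 1.0f; }
};

// The compute-intensive part is now a traversal of an object hierarchy
// through iterators and virtual calls rather than an obvious flat loop.
void update_scene(std::vector<Node*>& scene)
{
    for (std::vector<Node*>::iterator it = scene.begin(); it != scene.end(); ++it)
        (*it)->update();
}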

Here’s a slide presented by a game developer at a data parallel programming workshop we co-sponsored earlier in the year that highlights this issue.

[Slide: gameslide.png]

[Editor's note: this thread continues in Anwar's next blog]

24 Responses to What Makes Parallel Programming Hard?

  1. Geva Perry says:

    There are other approaches to parallelization that are not the classic approach. At GigaSpaces we have developed an architecture called Space-Based Architecture that goes for true parallel, linearly scalable applications, including those that are stateful and data-intensive.
    Together with Intel, we’ve proven near-linear scalability on quad-core machines, as can be seen in this document: http://www.gigaspaces.com/download/GigaSpaces_Intel_Joint_Solution.pdf
    [editor’s note: this link requires registration]
    Learn more on our web site: http://www.gigaspaces.com

  2. Hello Anwar,
    The thing is that when the multi-core technology needs to simulate the human brain according to its natural tasks which are using multithreading one should consider that “to err is human”.
    The processing itself can’t provide an exact result – the processing of original data should just use the algorithm assigned by an operator to get any sensible results only to him/her.
    No one can trick the Nature or own logic – what goes around comes around.
    Michael

  3. Tony Mongkolsmai says:

    Yet another product worth mentioning, which was omitted here, is Intel Threading Building Blocks. While it works only with native C++ code, it provides similar functionality to OpenMP while allowing for the flexibility of iterators.
    Using this library approach addresses the last 4 issues you mention (at least on some level), which can be very helpful. The bonus for using TBB is that it is open sourced now, so people can really see what is going on within the library if they want to.

  4. Anwar, great article! Somewhere in your list, or perhaps embedded in item 5, I expected to see ‘Expressing the Parallelism’. Hopefully this is next blog’s subject…

  5. Nick Vaidyanathan says:

    Nice article, though it would help if you discussed monitors and semaphores a little more in depth when it comes to concurrency issues. Although the article tries not to be too technical, you did introduce code…
    Also, you’re a hardware guy, don’t be lame! Do your swap operation the sexy way, without temp storage:
    Check it Out!

  6. Mauricio, Yes, I’m setting up to talk a bit more about Ct, our work with deterministic parallel programming in future blogs. I want to give some background on how we got here, first:)
    Nick, I’m not a hardware person! I don’t think about values as bits;) I’ll try to talk a little more about specific algorithms in the context of some of the programming models we’ve been working with.

  7. Spranglenator says:

    Great article!
    You mention in the “tuning performance” section that cache performance makes a much bigger difference when tuning for multi-core performance. Of course cache performance also matters for single-thread performance. My question is why does it matter much more for multi-core performance vs. single-core performance?

  8. Good question. A couple of things might matter. First, one of the issues that hurts parallel code is “false sharing”, where two threads are accessing different words on the same cache line (and at least one is modifying it). While the data itself might not overlap (assuring correctness), the coherency protocol will likely end up serializing your threads through accesses to this line. So, we try to make sure this doesn’t happen by knowing exactly how we allocate and partition our data for parallel processing, but it’s (partially) a function of cache line size. Second, it’s true that single-threaded performance also depends on cache performance, but with multi-core you have more cores/threads, so you are likely to be more sensitive to this, since the available off-chip bandwidth is shared between the simultaneously running cores…so the hit you take on a cache miss might be higher. (I know I’m hand-waving a little here.)
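    Here’s a tiny sketch of the false-sharing case (the 64-byte line size, the names, and the loop counts are all assumptions, just to make the effect visible and give you something to time):

    #include <pthread.h>
    #include <stdio.h>

    /* a and b never logically overlap, but without pad[] they would sit on the
       same cache line and that line would ping-pong between the two cores.
       Delete pad[] and time the two variants to see the difference. */
    struct { long a; char pad[64]; long b; } counters;

    void *bump_a(void *arg) { for (long i = 0; i < 100000000; i++) counters.a++; return 0; }
    void *bump_b(void *arg) { for (long i = 0; i < 100000000; i++) counters.b++; return 0; }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, 0, bump_a, 0);
        pthread_create(&t2, 0, bump_b, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        printf("%ld %ld\n", counters.a, counters.b);
        return 0;
    }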

  9. Tony Mongkolsmai says:

    I’m not sure why monitors and semaphores would enter the discussion, as most code programmers write generally uses simpler constructs like critical sections or mutexes.
    And perhaps more importantly, we can see from this blog that perhaps we do not want people thinking in terms of complex synchronization constructs, but in terms of simple program layout. Again, this is where solutions like Intel Threading Building Blocks and OpenMP can help a lot.
    Last point: another reason why cache performance is important is because of how it can impact synchronization. For instance, if a thread is holding a lock which another thread is waiting for and it has to go to memory, the other thread now has to wait for that memory request to be handled first. If your working set fits in the cache, that extra latency generally won’t occur and threads can hopefully continue happily without having to wait for the synchronization object. This is true of any code protected by a synchronization object, but it is often something people don’t think about when writing code. =D

  10. Rael Ercia says:

    I once did an undergrad study about instruction-level parallel programming (ILP), just the high-level idea, and during that time it was promising but seemed totally unnecessary. With the multi-cores nowadays, parallel programming should indeed be in the spotlight.
    Specifically with ILP, the good thing I remembered about it is that it tries to abstract the developer from thinking parallel, leaving the job to the compiler and the processors to restructure sequential code into parallel chunks.
    I’ve lost track of developments in this field, and now I’m wondering if it’s a viable solution…
    If so, wouldn’t it be wonderful to just write sequential code but still get the best out of parallelism?

  11. Great article. I think all too often it’s tempting to discuss a solution without first spelling out a well-defined problem. As far as I can see, the problem is now, at least informally, well-defined for future discussions.
    Looking ahead to future articles, I’d like to encourage you to actually avoid discussing techniques that seem ultimately inadequate or too low-level for modular high-level reasoning (I’m talking now about programs written in terms of locks, monitors, semaphores, etc.). I hope most readers can agree these don’t represent a solution, but are in fact used to build the solutions, e.g., event-based programming, transactional memory, data-parallel programming, etc.
    I’m curious to instead see convincing evidence that a programmer can indeed be shielded from these successfully (and thus can avoid some of the pitfalls itemized in your article), and an explanation as to how this can be done. Why does one model work better than another in terms of the items you list? And perhaps most importantly, where does each model break down?

  12. paige kuni says:

    While I appreciate the comments from the technical side of the house listed in the responses- I want to give a shout of appreciation from the “non-techy” crowd. Your blog posting helped me to understand the opportunities and challenges of multi-core technology in a way that I could understand and apply to my own life situations. Many thanks for writing a readable explanation with enough technical references to help me tie your comments to Intel’s biz strategies. I am making this required reading for my whole (non-technical) team.

  13. Steve Cutler says:

    Can we learn anything from Occam? As far as I am aware, this was the first language specifically targeted at parallel programming, created to support the Transputer from INMOS. It could scale to arrays of hundreds of devices, so clearly if Occam was a good solution then, it should still be a good solution.

  14. Aaron Coday says:

    Similar to your ideas, I think the concept of “expressing parallelism” is quite important. Modern object-oriented languages, and older ones as well, definitely provide a different approach to thinking about parallelism. Languages like Charm, Concurrent C++, and some variations of Smalltalk and Lisp come to mind.
    I think more work should be done to start reconsidering the research done in these areas. Imagine, in object-oriented languages, if we treated each object as completely separate. Each method invocation is really a message sent to an object. Each message “could” be a separate thread of execution. This exposes a tremendous amount of fine-grained parallelism, but it makes the job of extracting high performance more difficult.
    However, it’s not outside the capability of compilers. The Concert compiler and runtime research project did something like this: the compiler, together with a run-time, could spawn and balance parallelism based on this highly fine-grained parallelism. Might be interesting…

  15. Christian Black says:

    Throughout this thread, I see implied (but not blatantly stated)… Parallelism is important because no one can take advantage of (or will purchase) Intel’s silicon if we don’t grow programmers and compilers that natively parallelize. 16 cores in a box is great, but only if you can actually use them! :-)

  16. Calvin Zhao says:

    Is there a way to identify parallelism in compiled binary code? I am not very familiar with how compilers work, but if there were a way to find the code that can be handled by parallel processing in binary code, then we wouldn’t have to worry about the high-level language abstraction: just make a binary compiler that recompiles the binary code before loading it into memory, and we’d have a universal solution for every program that is available today. Or maybe we could build a virtual machine that virtualizes the MP into a SP, where the virtual system SW layer identifies the code that needs parallel processing and breaks it up into threads on different cores dynamically, so that the programmer doesn’t even have to think about parallelism.
    I guess we can always teach people to program in parallel; however, that will take years. If we could do something as I point out, it would be something special to Intel; other vendors wouldn’t be able to get any benefit from it.

  17. Jet says:

    In the 80’s I built a 17-transputer ISA card and programmed it with occam. Occam had a folding editor, was fairly easy to use, and also handled communications between chips. Inmos went down and that was the last time I tried using more than one CPU at a time.
    I think we will find parallel programming will not be as difficult as the media is suggesting. You need some way to load code and data, pull the trigger, and come back or be notified when the “subroutine” is complete. So long as the stuff being done “for you” takes significantly more time than “doing it yourself”, and “you” can go off and do something useful in the meantime, then parallel processing will work. Many if not most problems can be set up like this. There are also rules of thumb for the data bandwidth needed: 1 byte per clock cycle for each CPU in each direction. So a 2.5 GHz quad should have at least 10 GB/s in each direction.
    You can now buy a 2.66 GHz Intel quad box with Vista, ready to go, from Tiger Direct for under $1000. Probably other places as well, so it is now about time to see what we can do with them.
    My questions about Intel Quads are
    1..Can you treat or think of one core as the master -ie can one core pull the trigger on another core
    2..Can you lock code and/or data in cache so when you pull the trigger you know something will happen quickly
    3..Can you allocate cache to each core
    4..Could you run windows on one core and linux on another
    5..Can you dedicate a core to ring zero device driver only
    6..Can you specify which core responds to an interrupt
    7..Do any Intel or other tools provide this sort of low level control
    8..Where would someone find technical support for trying to do these things
    If someone just answers these questions and provides the tools to do it, just step back and watch multicore explode.

  18. Adrian Stephens says:

    This article brought back some memories for me. Way back, I worked on a system for speech recognition that was based on a hardware solution consisting of some large number (say 256) of transputers. We had a working sequential program and the challenge was to “make it parallel” to use the large number of processors.
    In theory this was attractive, as we were using dynamic programming to simultaneously optimize a large number (around 5000) of variables characterising the speaker. However, data dependences ruined it for us. We managed a speed-up of about 3, which resulted from splitting the processing into acoustic, lower phonetic and upper phonetic parts treated as a pipeline.
    During this time uniprocessors had sped up by at least this factor. So if we’d just waited and run the sequential code on a faster processor, we would have had the same outcome.
    I can’t see that having some very large number of IA processors is going to help the average user. OK, certain tasks will parallelize nicely. I’m wondering what research we are doing in alternative architectures – i.e., data flow architectures (akin to Petri nets).
    In this possible alternative future, a relatively small number of IA processors are running user tasks. These processors make use of programmable data-flow networks by cross-compiling to the data-flow architecture – i.e., by monitoring performance and data locality and moving specific CPU-intensive chunks into an alternative architecture on the fly.

  19. Henry Brown says:

    Have parallel systems attempted to utilize biological models of deterministic systems?
    RNAi acts as a regulator of parallel processes in the cell. We used genetic algorithms to assemble DNA contig maps of chromosomes. Chromosomes act as parallel programs expressed at different rates due to aging, mosaicism, and epigenetics. Could multicore programs require systems of regulators to optimize code? How could a generalized set of regulators be created to optimize parallelism? Could code be passed through a self-regulating system to automate parallelism?
    The advent of genes in the ocean, found throughout all larger genomes, shows a genetic algorithm that selects genes (algorithms) that live in parallel multicore systems (chromosomes/genomes). Could a similar zoo of algorithms be developed for multicore systems?

  20. Louis Savain says:

    OK. Here’s my personal rant on the issue of parallelism:
    Threads are evil. It’s either parallel/concurrent or it’s not. Why cut corners when it comes to parallelism? The reason that programmers are having such a hard time with concurrent programming is that we are mixing sequences with parallelism. Computing will not come of age until we adopt a software model that eliminates sequential programming altogether. Only when the software model is in place and all the kinks are ironed out, should we even consider how to design the CPU architecture to support it.
    We must also stop believing that languages are a natural way to program parallel computers. Computer languages are primitive, period. They are inherently sequential and they are not suited for parallel computing.
    The algorithmic software model has been around since the days of Charles Babbage and Lady Ada Lovelace. We must design our multicore processors to support the correct software model, not a 150 year-old dinosaur. It is just plain wrong and ill-suited for parallelism. It is a relic from the distant past. It is not fit for the 21st century. It is the cause of the software crisis. It is the reason that complex software is unreliable. It belongs in a museum, not in our computers.
    There is a simple way to implement fine-grained concurrent software that does not involve the use of threads. It is well known to programmers of cellular automata, neural networks and simulation applications. With this scheme everything is a process and every process is elementary (down to simple operations) and have equal durations. This way, timing becomes 100% deterministic and this eliminates a whole slew of timing bugs. Of course, unless the CPU is optimized for and is designed to support this super fine-grained parallelism, performance will suffer.