A few weeks ago, I blogged on why parallel programming is “hard.” My goal was to set up discussions by others and myself on the Research@Intel blog on how we’re making it easier. While we’re continuing to write about this in some detail, I want to present one vision for making parallel programming easier. I hope this puts into context the work that Intel’s Corporate Technology Group will highlight at the Fall Intel Developer Forum in San Francisco. Ct, TBB, STM, OpenMP might seem like alphabet soup, but they are all designed to address the problems I discussed in distinct and complementary ways. The bottom line is that software is going to be more important than ever as multi-core and tera-scale puts more burden on the programmer to keep up with Moore’s law scaling. So, let’s tackle the problems in the same order I presented them originally.Implicit parallelism: As a programmer, you have to think about several different ways you might expose parallelism in your application: data, task, pipeline parallelism are all styles that suit different problems and, sometimes, the same problem simultaneously. Does this mean that the average programmer is going to have to re-educate themselves in new areas of software engineering? This will be hard to universally avoid. I believe that the tera-scale programming model that will ultimately gain broad acceptance will have to simultaneously support several different types of parallelism, including data, task, pipeline, and message passing flavors. (Yes, yes, these are non-orthogonal, but tell this to a programmer used to one style over another). However, if we’re successful in some of our longer-term efforts, I hope that most programmers won’t have to comprehend such explicit notions of parallelism. If you write a program that wants to do a task A followed by a task B, it would be nice if the programming language and runtime would automatically run these are two cores. Here’s some pseudo-code:
a = taskA(); // two function calls b = taskB(); // are they independent?Currently, we have to rely on auto-parallelizing compilers for languages that were never designed for parallelism (C, C++, Fortran), so the results are highly varied. Implicitly parallel programming models are languages or other application programming interfaces that don’t say anything about parallelism to the programmer, per se, but say a lot to the compiler about parallelism. There are many implicitly parallel models, including data parallel models (e.g. if you peek under the covers of Ct), functional programming languages, and logic languages. These latter two models are particularly interesting because while they’ve historically lagged behind in performance, they may offer much better scalability. What this means in practical terms is that while performance may be 50% to 2x slower for the single core versions of these language, on multi-core they may scale more effectively to sixteen, thirty-two, and many more cores. Giving up 2x sequential performance may be fine if what I’m after is 1024x scaling! Consider also that languages like C++ are considering adding more features from these “exotic” languages in their next generations. Also, consider that much of the current growth in programming tools is in scripting languages (like Python and Ruby), whose sequential performance is relatively abysmal. For example, in the example above, many programming languages define that taskA and taskB must be executable in parallel. While we aren’t going to move to implicitly parallel models really soon, language features will inexorably move that way. Determinism and atomicity: The next hurdle facing the programmer in parallelizing their applications are data races. These are basically bugs that arise when two parallel pieces of code are updating and/or reading a memory location without coordination between them. The tricky part is it gets really difficult to keep track of all the location you might be updating or reading in a piece of code (it might be embedded very deeply in a bunch of function calls). Think about how programmers rely on third party libraries for productivity. (For non-techies, this is like a piece of machinery that someone else makes and I can just drop into my program without recreating it.) Fortunately, we are continuing to develop tools like Thread Checker that help users track down these bugs. Moreover, there is now considerable attention on developing libraries that are thread-safe and exhibit scalable locking disciplines. There are, however, two longer-term solutions that will help address this problem: transactional memory and deterministic programming models. Transactional memory provides a convenient mechanism by which programmers can simply declare that region of code must execute as if it were executing without parallel threads. The compiler and the runtime system take care of the rest. The great thing about this is that if I have some legacy code that I’ve written and I’ve decided to turn a couple of function calls or a loop into parallel regions, then I can wrap the loop body in an “atomic” (i.e. indivisible) declarations, which assures that the transactional memory system will kick in to eliminate data races in these regions. What fundamentally makes data races so difficult is that they are non-deterministic in nature, meaning that they can unpredictably come and go. I can do all the testing I want in the lab, but if someone takes this software and runs it on an overclocked system (yes, users do this a lot), it may expose a software bug I wasn’t able to replicate. Future architectures are kind of unpredictable in this same way, so the problem gets worse with time. In fact, the methodology I use to unit test my components or drive regression tests may obscure the bugs. This is one reason people like to call these Heisenbugs. Determinism is a way of describing programs and programming models that always behave the same way, regardless of the number of cores they are running on. Some programming models exhibit deterministic characteristic, meaning that there is no way to write code that results in data races. These tend to be a little more constrained or more abstracted from the low-level hardware details (examples are Ct and some of the functional programming ideas I mentioned earlier), but CTG teams are trying to push determinism as far as possible in parallel programming. Programming for performance: Now, take that non-determinism I just talked about and try extending it to looking at your application’s performance. In a typical system, many variables can affect performance, including a lot of micro-architectural details the vast majority of programmers don’t think about. One of the fundamental trends of multi-core is that more of these details are exposed to the programmer as we add more cores and leverage other power-efficient mechanisms. We have an investment in great tools like Thread Profiler that help in predicting performance on near term architectures. In addition to extending these tools, what if we could get a sense of the innate scalability of an algorithm in the way that we wrote it.? That is, when developers think about an application-specific concept, like a physics solver for a 3D game engine, it would be great if the developer could guide their algorithmic choices based on higher-level programming tools, like data parallel operators or a choice of parallel representation of their data, use of particular parallel data structures, and so on. This would enable the programmer to make intelligent choices about implementation without exposing lower level hardware details to them. Let the programming language or model’s implementation take care of that. This is a goal of much of the work that we are doing in data parallel and data flow parallelism: It enables rich expression of algorithms, while providing a model where the programmer can make tradeoffs between data types that are simpler or more complex, operations that are synchronization-intensive or synchronization-free. Moreover, using these types of models, it becomes easier to feed back performance data to the user. Instead of messages that say things like “there are a lot L2 cache misses here” that might indicate a low arithmetic intensity (how much calculation we do per word of memory read), we can tell the programmer we need bigger chunks of work in a particular region of their program….this would be far more relevant to how they think about the application. Is the free lunch over?: These last few paragraphs flirt with a topic that is an area of intense focus for us: future-proofing (or, forward-scaling). It is still the case that if you have a IA32 or Intel64 binary and you buy a new processor, it will generally run faster (much faster in some cases). However, you might have expected it to run 4 times faster or better because that shiny new processor is a quad core enhanced with SSE4 instructions. Guess what? While we maintain backward compatibility, it’s largely up to the software developer to enable new features like SSE4 and multi-core performance by (at least) recompiling. Historically, this hasn’t been a big deal because the rate of change of these features is slow. Using libraries like IPP and MKL from Intel will help. Moore’s law marches on in the age of Tera-scale which means that we’ll be increasing core counts faster in the future and even enhancing the instruction set for this. Wouldn’t it be great if the application you compile to native, high-performance IA32 and Intel64 and ship today scaled to use more cores and new ISA features without recompiling? We are developing mechanisms that take the best of the virtual machine approach favored in some modern language frameworks (like Java and .Net) and combining it with the high-performance and power of native Intel instructions. Moreover, we are developing software techniques in fine-grained concurrency and synchronization that allow a programmer to either explicitly or implicitly express all the parallelism available in their application, without paying undue overhead related to parallelism (creating threads isn’t close to free, by the way). Equipped with these tools, adaptive binaries will be an important part of enabling forward-scaling applications. In Ct, which you’ll hear about if you attend our tech session, we aim to provide precisely that. Modern programming methods: Object-oriented programming is pretty ubiquitous these days. For the non-techie, this means that people program with a notion of creating things (objects) that have certain characteristics and behaviors. The benefit is that for software architecture, you can think about how different components need to talk to each other, which makes it easier to build software on a larger scale with many developers and to selectively rewrite these portions of code. (Object-orientation is simply one means to get to this end, but the problems remain in other methods of abstraction.) However, this comes at a cost: the higher levels of abstraction make it really hard for traditional compiler technology to effectively parallelize these applications. [ Intriguingly, there has been interest in using the message passing roots of object oriented language design to expose concurrency.] It is really difficult to anticipate about all the different ways a programmer might use a component when it comes to parallelism. For example, if I design my library for parallelism, I may be using a different locking discipline than my users, which could cause serious problems. Fortunately, the programming language communities have been thinking hard about thread-safety in libraries and adding parallelism features in future language proposals. Not only will future parallel programming constructs will be designed with modern software development techniques in mind, but these techniques will surely evolve to meet the requirements for parallelism. A great example is TBB, which includes parallel iterators and thread-safe containers that are commonly used in applications. Using frameworks like this gives the programmer parallelism at low adoption cost. Extending parallel data types further is essential to broadening frameworks’ applicability. We are going the next step to dynamically optimize performance across the many levels of indirection in this style of programming. Looking forward, as the non-uniformity of memory access times increases between cores, we’ll need to consider notions of placement, data layout, and parallel allocation in our designs. There are many proposals here, but we are treading carefully, as this will almost certainly (re)introduce scaling problems if not handled well. I really encourage readers to take a look at what we’re doing on the Tera-scale track at IDF! Beyond programming tools, there will be tons of cool stuff on real-time ray tracing, silicon photonics, etc.