Investing in hardware for parallel programmability

About a year ago, Intel and Microsoft each invested $10M to jointly fund Universal Parallel Computing Research Centers at UC Berkeley and U of Illinois, with the goal of making parallel programming mainstream in future client software. I’ve had the pleasure of attending updates where each center reported on its first year’s efforts. Clay Breshears’ blog here has a good overview of the UIUC Summit content.

One point made by both was, as the UIUC whitepaper puts it: “Hardware must be used for programmability.” For example, a UCB talk made a plea for better performance counters, and a UIUC talk proposed extensive micro-architecture changes to give programmers simpler, more easily understood parallel memory semantics.
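
To see why memory semantics made that list, consider the classic “store buffering” surprise that simpler semantics would eliminate. Below is a minimal sketch, in today’s C++ with relaxed atomics standing in for what loosely ordered hardware is allowed to do, in which both threads can read 0, an outcome the naive interleaving model never permits:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Store-buffering litmus test: each thread writes one flag and then
    // reads the other. With relaxed ordering, the outcome r1 == 0 && r2 == 0
    // is permitted, which surprises programmers who assume simple interleaving.
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    int main() {
        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        std::printf("r1=%d r2=%d (0,0 is a legal outcome here)\n", r1, r2);
        return 0;
    }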

Unfortunately, even the simplest HW modifications to increase programmability face big challenges in product cost and legacy compatibility. For example, useful as they are, performance counters are a tough sell to hard-nosed product managers. Many counters are intimately connected to the ‘guts’ of the processor, so they are intrusive to the design and present a big validation challenge. That means a significant investment in design and validation for something that lacks the direct end-user benefit of performance or other new features. Of course, their use in silicon debug helps, and getting the product out the door is of unquestionable value, but for that purpose not every counter has to work flawlessly, and model-specific instrumentation is fine.

The basic problem is that the customer for programmability features is not the end user but the programmer, and as one product planner facetiously commented to me: “Programmers aren’t a big market segment.” Extensive enabling of ISVs can be expensive, but it is still more cost-effective than burdening the hundreds of millions of processors shipped with the cost of features that exist only to enhance programmability.

Even so, the tuning and debug support they make possible is well recognized. We’ve continually added to and improved the performance counters since they first appeared in the original Pentium™ processor. Architectural performance monitoring, with its commitment to consistency across micro-architectures, first appeared with the Intel® Core™ Solo and Intel® Core™ Duo processors.
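
For example, software can ask how much of that architectural support is guaranteed on the processor it is running on: CPUID leaf 0AH reports the architectural performance monitoring version and the number of general-purpose counters. A minimal sketch using GCC’s <cpuid.h> (the printed wording is my own):

    #include <cpuid.h>
    #include <cstdio>

    int main() {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;

        // CPUID leaf 0AH: EAX[7:0] is the architectural performance
        // monitoring version (0 = not supported); EAX[15:8] is the number
        // of general-purpose counters per logical processor.
        if (__get_cpuid(0x0A, &eax, &ebx, &ecx, &edx) && (eax & 0xFF) != 0) {
            std::printf("arch perfmon version %u, %u general-purpose counters\n",
                        eax & 0xFF, (eax >> 8) & 0xFF);
        } else {
            std::printf("architectural performance monitoring not reported\n");
        }
        return 0;
    }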

Programmability has never been more of a concern than with today’s transition to multi-core and the need to make parallel programming mainstream. Programs that transparently scale to increasing numbers of cores are critical if multi-core is going to give ISVs the same performance progression that we enjoyed from scaling clock frequency. Lowering the bar to writing concurrent programs can help make more existing as well as emerging high-performance applications available sooner. So, there is clear motivation to continue improving counters and debug features that address parallelism.
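
To make “transparent scaling” concrete, here is a minimal sketch in today’s C++ (the data and function names are invented for illustration): a data-parallel sum that divides its work by however many hardware threads the machine reports, so the same binary picks up additional cores without source changes:

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Sum a large array by dividing it across the available hardware threads,
    // so the same code uses 2, 4, or 16 cores without modification.
    long long parallel_sum(const std::vector<int>& data) {
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<long long> partial(n, 0);
        std::vector<std::thread> workers;

        for (unsigned t = 0; t < n; ++t) {
            workers.emplace_back([&, t] {
                std::size_t begin = data.size() * t / n;
                std::size_t end = data.size() * (t + 1) / n;
                partial[t] = std::accumulate(data.begin() + begin,
                                             data.begin() + end, 0LL);
            });
        }
        for (auto& w : workers) w.join();
        return std::accumulate(partial.begin(), partial.end(), 0LL);
    }

    int main() {
        std::vector<int> data(10000000, 1);
        std::printf("sum = %lld\n", parallel_sum(data));
        return 0;
    }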

But the best way to accelerate the addition of programmability features is dual-use HW that helps both at development time and at run time. Some examples might be: instrumentation (performance counters) that is also needed by SW to provide outstanding QoS (quality of service) for multimedia, replay mechanisms that serve both debugging and resiliency, or ISA extensions that enable simple programming models to run faster. Which of these will really deliver value?
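
As one sketch of the instrumentation case, the timestamp counter a developer already uses for tuning can double as a run-time QoS signal. The frame routines and the cycle budget below are invented for illustration; GCC’s __rdtsc intrinsic is the only real interface assumed:

    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>  // __rdtsc()

    // Stand-ins for real media code; each just burns a different amount of CPU.
    static volatile std::uint64_t sink = 0;
    static void process_frame_high_quality() { for (int i = 0; i < 4000000; ++i) sink = sink + i; }
    static void process_frame_low_quality()  { for (int i = 0; i < 1000000; ++i) sink = sink + i; }

    // Arbitrary per-frame cycle budget; a real system would derive it from
    // the frame rate and the measured clock frequency.
    static const std::uint64_t kCycleBudget = 10000000;

    int main() {
        bool low_quality = false;
        for (int frame = 0; frame < 5; ++frame) {
            std::uint64_t start = __rdtsc();
            if (low_quality)
                process_frame_low_quality();
            else
                process_frame_high_quality();
            std::uint64_t elapsed = __rdtsc() - start;

            // The same counter used for tuning doubles as a QoS signal:
            // degrade to the cheaper path when a frame blows its budget.
            std::printf("frame %d: %llu cycles (%s quality)\n", frame,
                        (unsigned long long)elapsed, low_quality ? "low" : "high");
            low_quality = (elapsed > kCycleBudget);
        }
        return 0;
    }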

That’s the $10M question.

3 Responses to Investing in hardware for parallel programmability

  1. Mark Hoemmen says:

    To me as a programmer, something even more important than having “more counters” is having “documented counters.” If a counter isn’t well documented, or it returns output that doesn’t make sense, then I can’t use it; it might as well not exist. That’s why a performance counter standard is useful: it takes documentation out of the depths of processor manuals, and relieves programmers of the burden of reading all that documentation and testing to see whether it’s actually correct.

  2. Richard says:

    Do consider also changes to the heterogeneous multicore execution model that enable better *compiler* support. Compilers mediate programmer access to the computing hardware. Just as sound architecture orthodoxy says you shouldn’t add a new instruction to the ISA merely to accommodate the programmer, designing the multicore execution model for the programmer is also wrong. New instructions should be added *only* in concert with the ability of the compiler to utilize them; new multicore execution mechanisms should be added only in concert with the high-level compiler capability to support them. Principles used in designing an ISA for compilers, such as orthogonality, have analogues in the design of the execution mechanism. Furthermore, one can conceivably add things to the execution mechanism that a compiler can handle but which may not be feasible for a programmer to handle. The low-level example of this is the detailed scheduling of wide microcode in the context of complex hardware resource models; analogues exist at the high level.

  3. ProDigit says:

    I’m not a programmer, nor an engineer, just an end user, so to me a lot of this reads like gibberish that I’m trying to piece together.
    It seems to me you are talking about the technology behind Larrabee and that 80-core processor Intel invented?
    In other words, it seems to be about a processor with many smaller processors inside, much like a graphics card these days.
    It also seems to me that single-threaded applications will run tremendously slowly on these processors, as they seem to be optimized for applications having tens and perhaps even hundreds of threads?
    I wonder if you could combine this technology of many smaller processors with a single large processor that has all the instructions available, like SSE, MMX, etc…
    If the end user has a dual-core computer, where the first core is a standard single core much like a Pentium D, Pentium M, Celeron, or one core of a Core 2, and the other core consists of, say, 8 smaller cores, then it would be very beneficial for the OS to make use of 2 or 3 smaller cores, while larger single-threaded applications make use of the larger core.
    At least on the software side there needs to be some sort of intelligent design that can determine the size of threads (e.g. an antivirus might run perfectly fine on one or two smaller cores).
    I came up with the idea of a Core 2 with a smaller and a larger core a while ago, where the smaller core just runs the OS and the larger core runs a program, as I saw power saving in that.
    Instead, Intel came up with the idea of disabling unused cores, which is just as good, but not effective on Core 2 Duo processors yet (as any single-threaded application is divided among the cores, one core cannot fully get into sleep mode).
    Perhaps a software issue.
    But back to the article:
    If it is true that this article is talking about a design much like Larrabee, then it should become easier to program, as a game or program just needs to be written for a multithreaded environment, and you don’t need to worry about extensions like SSE, SSSE, MMX or so.
    Unfortunately, not every game is multithreadable. There are games out there that require a purely serial approach in programming and running and cannot be run across multiple cores (or if they are, the load will be unbalanced, e.g. 90%/10%, where say the 10% is the thread that writes everything to the screen and the 90% is the rest of the program).
    Just a thought…