The Problem(s) with GPGPU

Hundreds of GigaFLOPs are available in your PC today…in fact, you might even have a TeraFLOP in there. For someone who cut his teeth on a Cray C90 (15 GFLOPS max), this is an intriguing opportunity to dabble; for the latter-day high performance computing programmer (whether you’re trying to predict protein structure, price options, or figure out how to thread your game), it is almost too tempting to ignore. However, like a shimmering, unreachable oasis, today’s GPUs offer the promise of all the performance you require, but achieving that goal for all but a few applications (notably, those they were designed for: rasterization) is elusive.

I do a lot of work in data parallel computation and deterministic programming models…this means a lot of my peers and external collaborators are from the GPGPU community, so I expect some hate mail on this :-). But I think that by approaching parallelism purely from the GPU (or, more generally, streaming) side of things, we’re losing track of the many valuable lessons learned by the slightly broader-scoped High Performance Computing community over the last forty years. (However, even looking at where graphics is going, we might find shortcomings in existing GPU designs.)

There are three major problems with taking GPUs outside the niche of rasterization today: the programming model, system bottlenecks, and the architecture. (No, I’m not going to talk about form factor and cooling, which are bigger near-term show-stoppers for the IT architect.)

Let’s start with the architecture. At the lowest levels, GPUs are really good at crunching numbers, as long as you keep the shader processors busy doing useful work by anticipating their data requirements and keeping them working in lock step. Why is this? One of the reasons GPU designers can deliver huge peak performance numbers is that they’ve greatly constrained the architecture, most notably by efficiently supporting only simple control flow and memory access patterns. What this means concretely is that they can design more efficient processors by not dealing with the messy, irregular patterns of computation that most applications inevitably have: unpredictable branches, looping conditions, and irregular memory access patterns. This is a fine approach if you can fit into this very narrow niche of very regular computing (or if you undertake heroic measures to transform your application, but more on that later).
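To make this concrete, here is a minimal CUDA sketch (a hypothetical kernel, not taken from any real workload) of the kind of irregular control flow that fights a lock-step machine: threads in the same warp take different branches, so the hardware effectively executes both paths and masks off the idle lanes.

```cuda
// Minimal sketch (hypothetical kernel) of irregular control flow on a
// lock-step SIMD machine: threads in the same warp take different branches,
// so the hardware runs both paths and masks out the inactive lanes.
__global__ void irregular_work(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f) {
        // "Expensive" path: only some lanes of the warp take it...
        float acc = 0.0f;
        for (int k = 0; k < 64; ++k)
            acc += in[i] / (k + 1.0f);
        out[i] = acc;
    } else {
        // ...but the whole warp pays for it, because the cheap lanes sit
        // idle while the expensive path executes.
        out[i] = 0.0f;
    }
}
```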

Here’s a generic example we find in some visual computing applications. Oftentimes in an algorithm, we need to keep track of a bunch of partial orderings of data: say, the order in which we want to process subsets of geometric structures based on their distance from the camera or viewer in a scene, or the number of physical constraints we have to deal with per object in a physics simulation. However, for each ray we shoot out from the camera, or each object in the scene, there may be a different number of geometric objects to interact with. The natural way to represent this efficiently is to construct, per element, a linked list of the objects it needs to process.

[Figure linkedlist.png: each ray or object keeps a linked list of the objects it needs to process]
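As a sketch of that representation (the types below are illustrative, not from any real renderer), each ray carries a list of only the objects it actually has to process, so the per-ray loop length varies:

```cuda
// Illustrative (hypothetical) types: each ray carries a linked list of only
// the objects it actually needs to process, so list length varies per ray.
#include <cstddef>

struct ObjectNode {
    int         object_id;   // index into the scene's object array
    ObjectNode *next;        // next object for this ray, or nullptr
};

struct Ray {
    float       origin[3], dir[3];
    ObjectNode *hits;        // head of this ray's variable-length list
};

// The per-ray loop length differs from ray to ray, exactly the kind of
// irregularity a lock-step machine handles poorly.
inline int count_hits(const Ray &r)
{
    int n = 0;
    for (const ObjectNode *p = r.hits; p != nullptr; p = p->next)
        ++n;
    return n;
}
```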

Why not keep track of the same number of objects per ray or object? Well, this would have a multiplicative effect on the amount of computation in my algorithm. In this programmatic tragedy of the commons, everyone has to deal with the worst-case length. In the example below, say that length is 4: you do 4 objects’ worth of work per element, even if the average number of objects per ray or object is 1.5. (Remember, in parallelism the constants matter.)

[Figure Regularized-list.png: the same lists padded out to the worst-case length of 4]
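A back-of-the-envelope sketch of that regularized layout (made-up numbers matching the example above): every ray is padded to the worst-case length of 4 even though the average is 1.5, so most of the allocated slots, and the lock-step work that touches them, are wasted.

```cuda
// Back-of-the-envelope sketch of the regularized layout: every ray gets
// worst-case slots (4), even though the average list length is 1.5, so the
// lock-step work and storage are mostly padding.
#include <cstdio>

int main()
{
    const int num_rays   = 4;
    const int worst_case = 4;                  // longest per-ray list
    // -1 marks a padded (empty) slot; other entries are object ids.
    const int hits[4][4] = {
        { 7, -1, -1, -1 },                     // 1 real object
        { 3,  9, -1, -1 },                     // 2 real objects
        { 5, -1, -1, -1 },                     // 1 real object
        { 2,  4, -1, -1 },                     // 2 real objects
    };

    int used = 0;
    for (int r = 0; r < num_rays; ++r)
        for (int k = 0; k < worst_case; ++k)
            if (hits[r][k] != -1) ++used;

    std::printf("slots allocated: %d, slots used: %d (%.1f%% padding)\n",
                num_rays * worst_case, used,
                100.0 * (num_rays * worst_case - used) / (num_rays * worst_case));
    return 0;
}
```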

When you scratch the surface a little, you end up finding a lot of problems like this. (Don’t ask about basic core-to-core communication or synchronization.)

Thinking about data structures, anything that wants a graph, tree, or sparse matrix representation is going to behave this way…and that covers a lot of interesting stuff. The trick that nobody has mastered is how to deliver this level of performance with the flexibility to do the interesting stuff on a GPU or a CPU. Yet.
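For instance, here is a small sketch of a compressed sparse row (CSR) matrix-vector multiply; both the inner loop bounds and the accesses to x are data dependent, which is exactly the kind of irregularity being discussed.

```cuda
// Small sketch of compressed sparse row (CSR) matrix-vector multiply.
// Row lengths differ and x is read through an index array, so both the loop
// bounds and the memory pattern are irregular.
#include <cstdio>

int main()
{
    // 3x3 sparse matrix: [10 0 0; 0 20 30; 0 0 40]
    const int    row_ptr[] = { 0, 1, 3, 4 };   // row i spans [row_ptr[i], row_ptr[i+1])
    const int    col_idx[] = { 0, 1, 2, 2 };   // column of each stored value
    const double val[]     = { 10, 20, 30, 40 };
    const double x[]       = { 1, 2, 3 };
    double       y[3]      = { 0, 0, 0 };

    for (int i = 0; i < 3; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y[i] += val[k] * x[col_idx[k]];    // gather: data-dependent address

    std::printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);   // expect [10, 130, 120]
    return 0;
}
```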

Now, the system bottlenecks: because of the underlying constraints of the GPU architecture, programs often rely heavily on the CPU to manage the difficult parts of the control and data flow, as well as all the other (necessary) stuff like I/O. Here’s the problem with this: the CPU-GPU link is relatively low performance, engendering relatively high latencies for CPU-GPU interactions (like using the CPU to handle an outer-level loop that the GPU can’t). This can have a devastating effect on performance.

A GPU vendor released optimized BLAS kernel implementations for their latest and greatest discrete graphics part. BLAS (Basic Linear Algebra Subroutines) is meant to encapsulate some pretty useful computations, which also happen to be GPU-friendly in most cases. However, if you used these kernels out of the box in more complex computations, you’d get nowhere near peak performance from the GPU; in fact, your performance would degrade relative to the CPU. Looking at Linpack, an application widely used to benchmark architectures for linear algebra, we found that using this great BLAS library out of the box (much the way the average programmer would) gave pretty dismal performance: slower than simply using a single core of the CPU, and around 400x slower than the “peak” performance available on the GPU. The problem: the “easy adoption” implementation path left us bound by CPU-GPU roundtrip latencies. To be fair, we manually tuned our way to slightly better performance than the GPU vendor was themselves reporting…but this took weeks and was, in many parts, highly specific not just to this vendor, but to this particular generation of GPU architecture. To get good, sustainable performance (meaning useful beyond this particular core), you need a programming model that lets you compose these computations into more complex ones that can run unabated on the GPU side of things.
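Here is a minimal sketch of that “easy adoption” pattern, using a hypothetical axpy kernel rather than the vendor’s BLAS: the CPU drives the outer loop, and every iteration pays for transfers and synchronization across the CPU-GPU link, so roundtrip latency rather than GPU FLOPs sets the pace. The composed alternative, noted in the comments, is exactly what a better programming model should make easy.

```cuda
// Sketch of the "easy adoption" pattern (hypothetical axpy kernel, not the
// vendor's BLAS): a CPU-driven outer loop that pays for transfers and
// synchronization across the CPU-GPU link on every iteration.
#include <cuda_runtime.h>

__global__ void axpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

void cpu_driven_outer_loop(int n, int iters, const float *h_x, float *h_y)
{
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    for (int t = 0; t < iters; ++t) {
        // Per-iteration copies (and the implicit sync on the copy back):
        // for modest n, each bus trip costs more than the kernel itself.
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
        axpy<<<(n + 255) / 256, 256>>>(n, 0.5f, d_x, d_y);
        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    // The composed alternative: copy once, keep the data resident and run the
    // whole outer loop on the GPU, then copy the result back once at the end.

    cudaFree(d_x);
    cudaFree(d_y);
}
```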

Which brings us to the programming models. Let me be frank here…I like what I’m hearing from the GPGPU programming model side of things, but I don’t love it yet, because it’s not dealing with reality. Reality means dealing with I/O, messy boundary conditions, irregular control flow and data access, funny shapes of things. Reality also means that the parallelism isn’t necessarily somewhere you can easily coalesce it all…it might be distributed across several different libraries in use in the application. There is a lot of goodness in these models, but until they can be used in real, industrial-strength applications, nirvana will remain elusive. When you take a highly constrained architecture and wrap it in a thin veneer of familiar syntax (curly braces and semi-colons for you C fans), you still end up forcing the programmer to deal with the inelegant parts of mapping the application to the underlying architecture. In a real application, this means that your performance nosedives unless you’re willing to spend enormous amounts of programmer energy to optimize the application…for that particular GPU.

I expect that better versions of these programming models will get here sooner than most think…in fact, we’re trying to do that ourselves (see my earlier blogs). Will the architectures keep pace?

16 Responses to The Problem(s) with GPGPU

  1. Vegar Kleppe says:

    If a certain CPU vendor had been doing a better job in developing both parallel hardware and software over the years the industry would not have needed to take this GPU detour.
    Since the dominant CPU vendor in most cases ONLY focuses on increasing peak performance and the system clock frequency, they don’t pay attention to the worst cases. Graphics used to be one of these cases, but thanks to companies like SGI, NVIDIA, 3Dlabs, ATI and Microsoft, graphics is no longer bottlenecked by the CPU.
    In my opinion this less dominant CPU / GPU vendor may solve the system latency problem in a very elegant way. Ever heard of “Fusion”?
    The dominant CPU vendor can either be a part of the solution or a part of the problem.

  2. Paul Gray says:

    Very nice and true read. We have yet to leverage multi-core CPUs fully, and there are still 16-bit applications out there. The bottleneck of any system is, and always will be, the I/O communications: from cache memory, memory, disc, and peripherals down to the keyboard.
    Let’s not forget that a majority of optimisations are also implemented at the compile stage, whereas from a user perspective the impact is at runtime. It almost gets to the stage where an underlying management OS/CPU sits on top of everything just to monitor flow and feedback for real-time optimisations. That said, we already do that level of management on CPUs for power, so one can wonder where that line will get drawn. Given that we humans find it hard enough to naturally delegate tasks more efficiently than to do them ourselves, concluding that there is no answer yet to the same latencies evident in multi-core programming would be an easy and safe conclusion to make for a while yet.
    When the overheads outweigh the gain overall, you have to wonder at the whole approach: breaking tasks up so somebody does a whole unit of a task, when in real life the tasks are specialised, in that a bricklayer lays bricks and a plumber does the plumbing, and both end up talking to the fat controller, or the CPU in our case. Hence, where you can gain in some areas, you still get down to planning the tasks ahead, and that takes time which is often wasted.
    Now, if we could get several jobs with broken-up units of work and match up identical units, then we would be able to regain some of the lost management ground. But in an ideal world, nobody would talk, as there would be no need to. We’re not in an ideal world, and the road to multi-cores ain’t over yet.

    Knowing what is in the equation is not what worries me, but rather what is not in it, and why that encompasses my understanding of its existence.

  3. Igor says:

    As a GPGPU developer I can’t agree more with you. Great article.
    GPGPU is still too isolated from the outside world because of lacking and poorly designed interfaces.

  4. mark says:

    To me, general processing is not messy, irregular patterns. Each instruction is a ‘greatly constrained architecture’ in itself. However, instruction formats are messy and irregular. But a wider choice of instructions should not imply less performance.
    ‘Graphical Processing Unit’ makes no sense! One processes information, then interprets it as sound or visuals or whatever.

  5. GB says:

    Let the world recognize that Intel sees a large and looming threat on the horizon. Fascinating that at this early juncture, powers that be find it necessary to take some initial swipes in the foreseeable battle.
    “today’s GPUs offer the promise of all the performance you require, but achieving that goal for all but a few applications (notably, those they were designed for: rasterization)is elusive.”
    Not quite. The truth is, there are non-visualization applications that are seeing real orders-of-magnitude benefit from GPGPU (see: http://www.nvidia.com/object/tesla_testimonials.html).
    “Using NVIDIA GPU computing solutions, Headwave is able to increase compute rates, and also reduce time spent in manual operations, by 100x.”
    What used to take months now takes days. Where customers used to throw away data on the premise it was too time consuming to sort, they can now quickly process and utilize. That is real value, today.
    Isn’t GPGPU a shining example of how innovation works? Take a problem, dissect it, improve what you can, rinse and repeat. GPGPUs aren’t meant to replace the CPU but to work in complement. The CPU will always be a better general-purpose tool, better suited to handle exceptions and corner cases. Yet if something comes along that adds value now, and real industries as diverse as financial analysis, biological engineering, CG film rendering, and oil and gas exploration can all benefit today, shouldn’t we simply embrace it?
    Or is Intel’s answer: No, this sucks, wait two years for ours?
    Intel is behind and others are leading the way, and I’m sure that’s not a comfortable position. It’s simply business as usual that this blog is published to pooh-pooh what the rest of the world is doing (and I’m sure this is just the first of many Intel-sourced negative marketing pieces over the interim until Larrabee finally hits the market; and when it does, of course it will be proclaimed nirvana).
    Just consider that the problems your guys are just thinking about trying to solve, and ones you don’t even know exist, are already being worked on by some very bright minds who have a multi-year lead.

  6. Jeffrey Howard says:

    This is in response to GB’s post:
    Hi GB, you may think that just because someone from Intel has a complaint about GPUs, you can presume the company as a whole is taking a swipe at a viable technology for selfish reasons. I can see why you might think this, but it couldn’t be further from the truth.
    The people in the research group who actually established this blog did it so that people in the industry can share ideas and talk about technological problems and the subsequent innovations. nVidia may have testimonials from people who have made things work under their architecture and programming model, but as you can see from some of the other comments on this blog, your mileage will vary.
    It might actually be possible to have a large throughput oriented architecture like the GPU, and also have the means to program it efficiently. I think that is Anwar’s point, and as a research topic, I think it is interesting to hear what other people think about that. Not as a means for us Intel guys to step up to the podium and beat our chests, but for the readers to have a chance to tell us what they think, and to get a good technological discussion going.
    For what it’s worth, your point about getting 100x performance in some applications is well noted. Though you may want to give Anwar the benefit of the doubt, too, since he is one of the leading data parallel programmers in the world, and he may have some interesting insight into this subject. Of course, we all realize that our posts on this blog will be controversial, so we expect that some people will disagree from time to time. But just keep in mind that we are not speaking on behalf of Intel the company, but rather as a bunch of researchers who are in search of the right answers and who want to offer up our thoughts to the general community.
    Do you think that you have the right answers? If so, jump into the discussion. :-)

  7. GB says:

    Jeffrey Howard wrote: “we are not speaking on behalf of Intel the company”
    Thanks for your post, Jeffrey Howard. I appreciate what you’re saying, but I take issue. How can a blog by Intel employees, hosted on Intel servers with an Intel URL, not be construed as “from Intel”? True or false: this blog was reviewed prior to publication by a member of Intel’s legal team?
    My point was Intel’s competitors are innovating and making progress which delivers real value. When you guys come out with the whole weight of the largest semiconductor company in the world behind you, and say competitors are ‘getting it wrong’ you are doing a disservice to the innovation that is happening. And you are doing marketing.
    The Headwave guy is delivering value to his oil and gas customers. He is probably an Intel customer. Are you telling that guy he’s getting it wrong too? But he’s holding the check and his customer is happy. Is his customer wrong as well for getting his data processed more efficiently?
    Because there is an Intel logo at the top of this page, competitive and contrary views and discussion will not be widely forthcoming, which, if I’ve heard you correctly, is exactly the opposite of what you claim to desire. If you truly desired a pure technical discussion to flesh out your ideas, there are numerous educational or pure research oriented peer review websites that encourage feedback and that Anwar Ghuloum could’ve used without the taint of ‘speaking for the company.’
    I appreciate Anwar’s credentials. But that does not excuse him nor this blog from its agenda: what you guys are doing is advance marketing for your next product. Call it what you want, tell me I’m out of place for not contributing to the technical merits of the discussion, but I believe what you are doing is deceptive to the industry and your customers. You are marketing, and you shouldn’t obfuscate that in the shroud of ‘doing research.’

  8. Sean Koehl says:

    In response to GB’s post: As one of the editors of this blog, hopefully I can clarify a few points here. The posts here do represent the professional perspectives of Intel employees on technology and trends for the future. They do not necessarily represent the views of Intel across the board, or even across the research labs. In the labs, we are in the business of creating technology options, and we often explore multiple options in parallel. You only have to go so far as to read both Anwar’s and Tim Mattson’s posts on this blog to get two different perspectives on new parallel languages, for instance.
    As to having competing and contrasting views, we do try to publish anything that’s on-topic to this comment area, whether complimentary or critical. Several recent blogs show that.
    As far as marketing, Anwar’s work is a part of our larger tera-scale research effort, which is aimed at trying, in general, to make processors based on many IA cores not only practical, but desirable. We try to evangelize (market, if you prefer) our ideas in this area because we believe these trends will have broad implications industry wide. However, none of our posts are solicited by product marketing groups — we leave the product marketing to the product teams. We’re trying to prepare the industry for where we think technology is or should be going, and these concepts go well beyond any one future product.

  9. Igor says:

    I just love how GB ignored me completely and continued to pound the table with his fist, thinking that he has a point.
    GB, if you don’t have any personal experience with GPGPU, I suggest you keep your views, based on NVIDIA marketing, to yourself.

  10. As a young scientist and newcomer to GPGPU technology and the related debates, I am searching for experienced perspectives on future directions. I understand that corporate interests are presenting different models for CPU/GPU integration. We can only hope that this competition leads to some solutions for our particular science applications.
    All readers of this blog should recognize that the bloggers are obligated to promote (evangelize?!) their corporate preferences. I would be amused to hear an answer to GB regarding whether the blog is reviewed by Intel’s legal team (or higher-ups). I would be grateful, rather than amused, to have more detailed references to earlier blogs which describe Intel’s “better versions of [GPU] programming models.” I’m also curious about who exactly is developing Larrabee and why it is so much better than GPUs.

  11. David Schwartz says:

    What is the advantage of GPGPU supposed to be? Is it just “you already have it”? Because I don’t have a GPU that’s suitable for GPGPU use, and I think almost nobody does.
    Is it “future GPUs won’t need to add much to their cost to provide GPGPU features and there will be widespread benefits?” If so, then neither will future CPUs (additional asymmetric cores are not complicated), and that will eliminate the GPU-CPU bottleneck.
    So what’s the argument in favor?

  12. N30 says:

    The questions I have, with Larrabee entering the market as a GPGPU: what is the point of this article? And with NV and ATI able to scale their GPGPU technology, does Intel really believe they can compete? (Does anyone remember Intel’s last attempts at GPUs?) Bolting several cores to a board does not make it an accelerator or a GPGPU. The manufacturers of GPUs have been doing this a long time and are good at it! Fusion, and what it is capable of, remains to be seen, given that the Phenoms have not been the easiest to get along with. But I do hold out hope, because I do not believe add-on cards should be the wave of the future; we have enough as it is!

  13. Daniel Alvarez Arribas says:

    The GPGPU as it exists today is a design optimized towards speeding up the COMMON CASE, meaning the model tries to exploit some embarrassingly obvious parallelism inherent in many applications, at the expense of full applicability, but at a significantly cheaper price. Maybe it is this reduction in applicability that makes the GPGPU feasible. In the end the parallelism exploited is sufficient to make GPGPUs compelling. Real companies are achieving real (and significant) speedups with it, and that’s what counts. Maybe Larrabee will improve on it, but then again I don’t see any reason why GPGPU manufacturers shouldn’t be able to do it, too.
    I don’t know how much CPUs and GPGPUs are really going to converge in the end. However I expect OpenCL to make GPGPUs as they are today even more attractive and accessible.

  14. Anonymous says:

    I see that there are very opinionated viewpoints. I see a very valid point in the argument that the GPU is not here to completely replace the CPU. It’s beyond a shred of doubt that CPUs are and will still be very relevant, as pointed out, for most of the messy ops. But the fact can’t be overlooked that for those operations or sub-parts which can actually be framed into this neat set of (coalesced) operations, the GPU provided a very significant improvement in performance. It is for this segment that GPUs have been targeted and tailor-made. Now, it is a very clear and established fact that they do this job pretty well. I would wonder what Mr. Anwar would have to say about the current situation in the GPGPU space, with NVIDIA and ATI coming out with the Teslas and FireStreams…going far beyond rasterization-type apps to more general applications ranging from sorting to iterative solvers. And also with the debacle of Larrabee. I would really be interested to see his comment now.