Hundreds of GigaFLOPs are available in your PC today….in fact, you might even have a TeraFLOP in there. As someone who cut his teeth on a Cray C90 (15 GFLOPS max), this is an intriguing opportunity to dabble; for the latter-day high performance computing programming (whether you’re trying to predict protein structure, price options, or trying to figure out how to thread your game), it is almost too tempting to ignore. However, like a shimmering, unreachable oasis, today’s GPUs offer the promise of all the performance you require, but achieving that goal for all but a few applications (notably, those they were designed for: rasterization)is elusive.I do a lot of work in data parallel computation and deterministic programming models…this means a lot of my peers and external collaborators are from the GPGPU community, so I expect some hate mail on this . But I think that by approaching parallelism purely from the GPU (or more generally, streaming) side of things, we’re losing track of the many valuable lessons learned from the slightly broader-scoped High Performance Computing community in the last forty years. (However, even looking at where graphics is going, we might find shortcomings in existing GPU designs.) There are three major problems with taking GPUs outside the niche of rasterization today: The programming model, system bottlenecks, and the architecture. (No, I’m not going to talk about form factor and cooling, which are bigger near-term show stoppers for the IT architect.) Let’s start with the architecture. At the lowest levels, GPUs are really good at crunching numbers as long as the you keep the shader processors busy doing useful work by anticipating their data requirements and keeping them working in lock step. Why is this? One of the reasons GPU designers can deliver huge peak performance numbers is that they’ve greatly constrained the architecture, most notably by efficiently supporting simple control flow and memory access patterns. What this means concretely is that they can design more efficient processors by not dealing with the messy, irregular patterns of computation that most applications inevitably deal with. These include unpredictable branches, looping conditions, and irregular memory access patterns. This is a fine approach if you can fit into this very narrow niche of very regular computing (or if you undertake heroic measures to transform your application, but more on that later). Here’s a generic example we find in some visual computing applications: Oftentimes in an algorithm, we need to keep track of a bunch of partial orderings of data, like say the order in which we want to process subsets of geometric structures based on their distance from the camera or viewer in a scene or the number of physical constraints I have to deal with per object in a physics simulation. However, for each ray I shoot out from the camera or each object in the scene, there may be differing numbers of geometric objects that I might interact with. The natural way to represent this efficiently is to construct a linked list of those objects I need to process per element. Why not keep track of the same number of objects per ray or object? Well, this would have a multiplicative effect on the amount of computation in my algorithm. In this programmatic tragedy of the commons, everyone has to deal with the worse case length. In the example below, say, it’s 4, you do 4 times the number of objects worth of work, even if the average objects per ray or object were 1.5. (Remember, in parallelism the constants matter.) When you scratch the surface a little, you end up finding a lot of problems like this. (Don’t ask about basic core-to-core communication or synchronization.) Thinking about data structures, anything that wants a graph, tree, sparse matrix representation is going to behave this way…and that covers a lot of interesting stuff. The trick that nobody has mastered is how to deliver this level of performance with the flexibility to do the interesting stuff in GPU or CPU. Yet. Now, the system bottlenecks: because of the underlying constraints of GPU architecture, oftentimes the program relies heavily on the CPU to manage the difficult parts of the control and data flow, as well as all the other (necessary) stuff like I/O, etc. Here’s the problem with this, the CPU-GPU link is relatively lower performance, engendering relatively high latencies for CPU-GPU interactions (like using a CPU to handle an outer level loop that the GPU can’t handle). This can have a devastating effect on performance. A GPU vendor released optimized BLAS kernel implementations for their latest and greatest discrete graphics part. BLAS (== Basic Linear Algebra Subroutine) is meant to encapsulate some pretty useful computations, which also happen to be GPU-friendly in most cases. However, if you used these kernels out of the box in more complex computations, you’d get nowhere near peak performance from the GPU. In fact, your performance would degrade relative to the CPU. Looking at Linpack, an application widely used to benchmark architectures for linear algebra, we found that the using this great BLAS library out of the box (much the way the average programmer would), we saw pretty dismal performance (slower than simply using a single core of the CPU and around 400x slower than “peak” performance available on the GPU). The problem: the “easy adoption” implementation path led us to being bound by the CPU-GPU roundtrip latencies. To be fair, we manually tuned up the performance to get slightly better performance than the GPU vendor was themselves reporting…but this took weeks and was, in many parts, highly specific to not just this vendor, but this particular generation of GPU architecture. To really get good, sustainable (meaning useful beyond this particular core) performance, you have to use a programming model that allows you to compose these computations in to more complex ones that can run unabated on the GPU side of things. However, you need a good programming model to facilitate this. Which brings us to the programming models. Let me be frank here…I like what I’m hearing from the GPGPU programming model side of things, but I don’t love it yet because it’s not dealing with reality. Reality means dealing with I/O, messy boundary conditions, irregular control flow and data access, funny shapes of thing. Reality also means that the parallelism isn’t necessarily where you can easily coalesce it all…it might be distributed across several different libraries in use in the application. There is a lot of goodness in these models, but until they can be used in real, industrial strength applications, nirvana will remain elusive. When you take a highly constrained architecture, wrap it in a thin veneer of familiar syntax (curly braces and semi-colons for you C fans), you still end up forcing the programmer to deal with inelegant parts of mapping the application to the underlying architecture. In a real application, this means that your performance nosedives unless you’re willing to spend enormous amounts of programmer energy to optimize the application…for that particular GPU. I expect that better versions of these programming models will get here sooner than most think…in fact, we’re trying to do that ourselves (see my earlier blogs). Will the architectures keep pace?
Connect With Us
Tags#IntelR&Dday @idf08 Big Data Cloud Computing Ct CTO energy efficient Future Lab Future Lab Radio HPC IDF IDF2008 IDF 2010 Immersive Connected Experiences innovation Intel Intel Labs Intel Labs Europe Intel Research ISSCC Justin Rattner many core microprocessor mobility multi-core parallel computing parallel programming radio Rattner ray tracing research Research@Intel Research At Intel Day Robotics security silicon silicon photonics software development Stanford technology terascale virtual worlds Wi-Fi WiMAX wireless