Generally speaking, no one wants to deliver difficult news to customers and partners. Some of us are lucky enough to talk to folks about the performance and capabilities of our processors, shipping and soon-to-ship. Some of us, however, face a more challenging task: explaining how to tap into this performance. I find myself in this situation often, as I frequently talk to external developers about our ongoing research in programming for multi-core and terascale. The discussion typically goes in one of two directions (and the relative distribution has shifted over time):

1. Sometimes the developers are trying to do the minimal amount of work needed to tap dual- and quad-core performance, and perhaps stretch this to our DP and MP (dual- and quad-socket, or up to 16 cores) systems. This was the branch most discussions took a couple of years ago.
2. Increasingly, we are discussing how to scale performance to core counts that we aren’t yet shipping (though in some cases we’ve hinted heavily that we’re heading in this direction). Dozens, hundreds, and even thousands of cores are not unusual design points around which these conversations meander.

Over time, I find that developers migrate their thinking from the first kind of discussion to the second. We have starkly different conversations about these two paths. For the incremental path, the performance bar is often much lower, and the tools programmers want support a more incremental adoption path. We tend to discuss how to use new tools alongside old ones, how to support legacy code that (in some cases) is scarcely supported internally by the developers themselves, and so on. The second path usually requires going back to the algorithmic drawing board, at least to some degree, and rethinking some of the core methods they implement.
This also presents the “opportunity” for a major refactoring of their code base, including changes in the languages, libraries, and engineering methodologies and conventions they’ve adhered to for (often) most of their software’s existence.

Ultimately, the advice I’ll offer is that these developers should start thinking about tens, hundreds, and thousands of cores now, throughout their algorithmic development and deployment pipeline. This starts at a pretty early stage of development: usually, the basic logic of the application should be influenced, because it drives the asymptotic parallelism behavior. Consider a common pattern of optimization we’ve seen in single-core tuning: the use of locally adaptive algorithms to heuristically reduce computation time. By definition, this introduces dependences in the computation that are beneficial in the single-core case but limit parallelism on multi-core. Similar choices are made about libraries and programming languages that optimize for single-core performance (or even small-way parallelism) but sacrifice long-term scalability.

Eventually, developers realize that the end point lies on the other side of a mountain of silicon innovations, and that there are two routes: a flat but potentially longer and more circuitous route around the issues that arise with increased parallelism, and a direct route that the developer largely pays for earlier. Front-loading at least some of this transition is often less costly in the long run and positions them to reap the benefits of our silicon innovations more competitively over time. It’s not quite as simple as this binary choice, but you get the basic idea: program for as many cores as possible, even if that is more cores than are currently in shipping products. Folks from traditional or emerging HPC verticals either know this (and have known it for many years) or come to this conclusion pretty quickly.
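To make the adaptive-algorithm point concrete, here is a minimal sketch (my own toy example, not from any particular code base): two ways of approximating the same series sum. The first adapts its stopping point to the running total, so every iteration depends on the one before it and the loop is inherently serial. The second fixes the amount of work up front and splits it into independent chunks, each of which could be handed to its own core; the chunk boundaries and names are illustrative only.

```python
import math

def adaptive_sum(tol=1e-8):
    """Single-core friendly: the stopping point adapts to the terms seen
    so far. The loop-carried dependence serializes the computation."""
    total, k = 0.0, 1
    while True:
        term = 1.0 / (k * k)
        if term < tol:          # decision depends on results so far
            break
        total += term
        k += 1
    return total

def chunked_sum(n_terms=10_000, n_chunks=8):
    """Parallel-friendly: fixed work, no cross-chunk dependence.
    Each chunk could run on a separate core; here we map sequentially."""
    def chunk(lo, hi):
        return sum(1.0 / (k * k) for k in range(lo, hi))
    bounds = [(i * n_terms // n_chunks + 1, (i + 1) * n_terms // n_chunks + 1)
              for i in range(n_chunks)]
    return sum(chunk(lo, hi) for lo, hi in bounds)  # final reduction
```

Both converge toward π²/6, but only the second exposes parallelism that scales with `n_chunks`; the heuristic early exit in the first buys a little single-core time at the cost of that scalability.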
For more mainstream application developers, this advice is usually unwelcome, but it is an encouraging sign that developers are increasingly coming to this realization on their own. [Note: some follow-up here.]

HPC developers, however, have the interesting problem of (depending on how you look at it) scaling down a “plane” of parallelism, or scaling up an inner level of parallelism, to map efficiently onto the multi-core silicon in their clusters, data centers, and grids. This has sometimes subtle differences from the single-, dual-, and quad-core based clusters they’ve been used to programming.
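That two-level mapping can be sketched as a simple decomposition: split the work first across nodes (the outer “plane”), then across cores within each node (the inner level). This is a hypothetical illustration of the shape of the problem, not any particular runtime’s API; the function and parameter names are my own.

```python
def two_level_decompose(n_items, n_nodes, cores_per_node):
    """Partition n_items work units first across nodes, then strided
    across the cores within each node. Returns a nested list:
    plan[node][core] -> list of item indices for that core."""
    # Outer plane: contiguous block per node (as across cluster nodes).
    per_node = [range(i * n_items // n_nodes, (i + 1) * n_items // n_nodes)
                for i in range(n_nodes)]
    # Inner level: stride the node's block across its cores.
    return [[list(block)[c::cores_per_node] for c in range(cores_per_node)]
            for block in per_node]
```

The subtlety the paragraph alludes to lives in choices like these: a decomposition tuned for single- or dual-core nodes (few inner slots, coarse blocks) often needs rebalancing when each node suddenly has many cores sharing one memory system.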