This week we are excited to share further technical progress towards our vision of scalable, programmable multi-core architectures built from many cores. We are disclosing 8 technical papers from our Tera-scale program via the Intel Technology Journal, with new results, key findings, and details on how we expect future architectures with simplified parallel programming models to evolve. The journal will be live later this week [Ed. note: the journal is now live], but I wanted to provide a preview of the information therein and try to provide some context.

We are covering 8 topics from software to hardware. If you pay attention to the findings, you may see a common theme: a strong connection between our software and hardware research. An important part of our research agenda is the philosophy that future hardware must be designed to the demands of the usage models we expect to be prevalent when that hardware becomes available. As such, our Applications Research Lab prototypes these future apps today, often in collaboration with top academic institutions, and learns which architectural changes they warrant. The flip side to this philosophy is that new architectural ideas should be benchmarked with relevant future workloads, which are often very different from the popular benchmarks of today such as SPECint. So, as our software researchers develop and analyze future apps, they package them up as characteristic workloads for our hardware developers. In fact, we are releasing many of these workloads to the research community, where they will be part of a public suite at Princeton University.

Three of the papers analyze three characteristic future multi-core applications. The topics (all titles paraphrased) include:
1. “Datacenter-on-a-chip” (pure enterprise)
2. Physical modeling for realism in games/movies (both client and server)
3. Home multi-media search/mining (a client app)

In #1, we look at a characteristic e-commerce data center with 133+ processors, determine that the same configuration could be run on a single system based on a 32-core tera-scale processor with 4 SMT threads/core, and explore the changes to the platform, especially in the memory architecture, needed to balance all that processing. The proposed changes include a) a model for a hierarchy of shared caches, b) a new, high-bandwidth L4 cache, and c) cache quality of service (QoS) to optimize how multiple threads share cache space. Papers 2 and 3 demonstrate excellent parallel scalability for two model-based applications, but also point to the need for more cache/memory bandwidth, as would be provided by a large L4 cache. This leads us to the more hardware-focused papers we are disclosing, covering:
4. Packaging & integration of this L4 cache
5. On-die integration of many cores
6. A proposed hardware task scheduler

Paper 4 discusses the implications of providing this high-bandwidth memory. To do so, we must bring the memory closer to the die, eventually right on top of it. Our Assembly and Test Technology Development division is evaluating possible options to achieve this. Paper 5 explores how we might architecturally design and integrate caches shared between the cores, but also explores the next steps for the on-die interconnect mesh (first prototyped in our Teraflops Research Processor) and the other non-core components that will be integrated, such as memory controllers, I/O bridges, and graphics engines.

Additionally, paper #6 proposes a specific architectural change that will accelerate applications using many threads. Specifically, we propose implementing the function of task scheduling (mapping work to cores for execution) in hardware. The current software-based methods are good for the few, large tasks we typically schedule today, but introduce too much overhead for the very small tasks that are needed to support scaling of future highly parallel workloads. In fact, this paper is just such an example of using future workloads to test architectural ideas. Our application researchers tested the scheduler with benchmarks from the model-based application domain that provide advanced capabilities for the recognition, mining, and synthesis of models. Results are particularly good for the game physics workload discussed in paper #2.

The fact remains, however, that parallel programming is hard (see Anwar’s blog on the subject). Here’s an excerpt from the same physics modeling paper (#2):
“Many modules of these applications require extensive effort to achieve good performance scaling. In some cases, the best serial algorithms have poor parallel scalability. For these, we use alternative algorithms which are slower on one core, but have more parallelism. In other cases, we modify the algorithm to expose more parallelism. The overhead of exposing the parallelism is often small compared to the benefits of improved scaling.”

As such, we are working to simplify parallel programming through new hardware/software innovations. The final two papers are on this topic:
7. Giving an IA “look and feel” to the programming of integrated hardware accelerators
8. A comprehensive run-time environment for tera-scale platforms

Multi-core brings the opportunity to integrate non-IA accelerator cores, e.g. media accelerators. However, since accelerators have different instruction sets, they cannot leverage the compilers, tools, and knowledge base developed for IA programming. Research paper #7 outlines a proposed method to extend IA to make accelerators appear as application-level functional units, on which user-level threads, which we call “shreds,” can execute accelerator instructions. This paper describes the architectural extensions that enable this, as well as the language extensions and runtime used to program it.

Paper 8 expands significantly on the topic of tailoring runtimes to the special environment of tera-scale platforms. Most existing runtimes treat many cores integrated on die the same as a large traditional multi-processor system. Runtimes designed to enable efficient, low-overhead use of the many cores and threads on a tera-scale processor will be critical for software scalability. The run-time presented (called McRT) provides good support for fine-grain parallelism and supports new concurrency abstractions that ease parallel programming. It replaces many of the OS-level services with user-level threading “primitives” which programmers can code to directly, avoiding many expensive transitions between the user level and OS level. These primitives include a scheduler, memory manager, synchronization primitives, and a set of threading abstractions. Results show how an application using this runtime stack scales almost linearly to more than 64 hardware threads. McRT also provides a high-performance transactional memory library that eases parallel programming by allowing the programmer to often avoid error-prone and hard-to-scale locking techniques. Enabling programming is essential to making tera-scale computers widespread in the future.
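To make the transactional memory idea a little more concrete, here is a minimal sketch of the programming style it enables. This is a toy written for illustration only: the names `TVar`, `Transaction`, and `atomically` are hypothetical and are not the McRT API, and the sketch uses a single global commit lock internally, so it shows the programming model rather than a scalable implementation. The point is that the application code in `transfer` reads and writes shared data without taking any locks itself; conflicting transactions are detected at commit time and simply retried.

```python
import threading

# Toy software transactional memory. All names here (TVar, Transaction,
# atomically) are hypothetical and are NOT the McRT API.

_commit_lock = threading.Lock()  # one global lock keeps the sketch short; it does not scale


class TVar:
    """A shared variable tracked by a version number."""

    def __init__(self, value):
        self.value = value
        self.version = 0


class Transaction:
    def __init__(self):
        self.reads = {}   # TVar -> version first observed
        self.writes = {}  # TVar -> value buffered until commit

    def read(self, tvar):
        if tvar in self.writes:          # see our own buffered write
            return self.writes[tvar]
        with _commit_lock:               # consistent (version, value) snapshot
            version, value = tvar.version, tvar.value
        self.reads.setdefault(tvar, version)
        return value

    def write(self, tvar, value):
        self.writes[tvar] = value        # buffered, invisible to other threads

    def commit(self):
        with _commit_lock:
            # Validate: abort if anything we read has been changed since.
            if any(t.version != seen for t, seen in self.reads.items()):
                return False
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version += 1
            return True


def atomically(body):
    """Run body(tx) repeatedly until it commits without conflict."""
    while True:
        tx = Transaction()
        result = body(tx)
        if tx.commit():
            return result


def transfer(src, dst, amount):
    """Application code: no explicit locks, just reads and writes in a transaction."""
    def body(tx):
        tx.write(src, tx.read(src) - amount)
        tx.write(dst, tx.read(dst) + amount)
    atomically(body)


def demo(n_threads=8, transfers_each=200):
    a, b = TVar(1000), TVar(1000)

    def worker(i):
        # Even-numbered threads move money a -> b, odd-numbered threads b -> a.
        src, dst = (a, b) if i % 2 == 0 else (b, a)
        for _ in range(transfers_each):
            transfer(src, dst, 1)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.value, b.value
```

Calling `demo()` runs eight threads that move money back and forth between two accounts; because every transfer either commits as a whole or is retried, the combined balance is preserved. A production-quality library would replace the global commit lock with finer-grained conflict detection, which is where the scalability comes from.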
You have seen, and will continue to see, a number of posts on the topic of parallel programming on this blog, especially from Tim, Ali and Anwar. I hope this set of papers demonstrates how software and hardware innovations are intimately intertwined, and that taking a very broad view of multi-core is even more essential to taking computational capabilities to the next order of magnitude. Come back and check out the ITJ in a few days to learn more.