Multi-core research update: the intimate coupling of software & hardware

This week we are excited to share further technical progress toward our vision of scalable, programmable many-core architectures. We are disclosing 8 technical papers from our Tera-scale program via the Intel Technology Journal, with new results, key findings, and details on how we expect future architectures with simplified parallel programming models to evolve.

The journal will be live later this week [Ed. note: the journal is now live], but I wanted to provide a preview of the information therein and some context. We are covering 8 topics, from software to hardware. If you pay attention to the findings, you may notice a common theme: a strong connection between our software and hardware research. An important part of our research agenda is the philosophy that future hardware must be designed for the demands of the usage models we expect to be prevalent when that hardware becomes available. As such, our Applications Research Lab prototypes these future apps today, often in collaboration with top academic institutions, and learns which architectural changes they warrant.


The flip side of this philosophy is that new architectural ideas should be benchmarked with relevant future workloads, which are often very different from the popular benchmarks of today, such as SpecInt. So, as our software researchers develop and analyze future apps, they package them up as characteristic workloads for our hardware developers. In fact, we are releasing many of these workloads to the research community as part of a public suite at Princeton University.

Three of the papers analyze three characteristic future multi-core applications. The topics (all titles paraphrased) include:

1. “Datacenter-on-a-chip” (pure enterprise)

2. Physical modeling for realism in games/movies (both client and server)

3. Home multi-media search/mining (a client app)

In #1, we look at a characteristic e-commerce data center with 133+ processors, determine that the same configuration could be run on a single system based on a 32-core tera-scale processor with 4 SMT threads per core, and explore the changes to the platform, especially in the memory architecture, needed to balance all that processing. The proposed changes include a) a model for a hierarchy of shared caches, b) a new, high-bandwidth L4 cache, and c) cache quality of service (QoS) to optimize how multiple threads share cache space.
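To make the cache QoS idea concrete, here is a small sketch (our illustration, not the paper's design) of one common approach: way-based partitioning of a shared cache, where each thread may only allocate into its assigned subset of ways, so a cache-hungry thread cannot evict another thread's working set. The class and mask names are hypothetical.

```python
# Illustrative sketch of way-based cache QoS (not the paper's design):
# each thread allocates only into the cache "ways" in its mask, so a
# streaming thread cannot evict a latency-sensitive thread's data.

class QoSCache:
    def __init__(self, num_ways):
        self.ways = [None] * num_ways      # each entry: (thread, tag) or None
        self.lru = list(range(num_ways))   # least-recently-used way first

    def access(self, thread, tag, allowed_ways):
        # Hit: any way may serve the data.
        for w, entry in enumerate(self.ways):
            if entry == (thread, tag):
                self.lru.remove(w); self.lru.append(w)
                return True
        # Miss: allocate only within this thread's allowed ways.
        victim = next(w for w in self.lru if w in allowed_ways)
        self.ways[victim] = (thread, tag)
        self.lru.remove(victim); self.lru.append(victim)
        return False

cache = QoSCache(num_ways=4)
A, B = {0, 1}, {2, 3}                    # hypothetical per-thread way masks
cache.access("latency", "x", A)          # miss: installs x in ways {0, 1}
for t in range(100):                     # a "streaming" thread churns its ways
    cache.access("stream", f"s{t}", B)
assert cache.access("latency", "x", A)   # x survived the churn: still a hit
```

Without the per-thread masks, 100 streaming misses against a 4-way cache would have evicted everything, including `x`.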

Papers 2 and 3 demonstrate excellent parallel scalability for two model-based applications, but also point to the need for more cache/memory bandwidth, as would be provided by a large L4 cache. This leads us to the more hardware-focused papers we are disclosing, covering:

4. Packaging & integration of this L4 cache

5. On-die integration of many cores

6. A proposed hardware task scheduler

Paper 4 discusses the implications of providing this high-bandwidth memory. To do so, we must bring the memory closer to the die, eventually placing it right on top. Our Assembly and Test Technology Development division is evaluating possible options to achieve this.


Paper 5 explores how we might architecturally design and integrate caches shared between the cores, but also explores the next steps for the on-die interconnect mesh (first prototyped in our Teraflops Research Processor) and the other non-core components to be integrated, such as memory controllers, I/O bridges, and graphics engines.

Additionally, paper #6 proposes a specific architectural change that will accelerate applications using many threads. Specifically, we propose implementing the function of task scheduling (mapping work to cores for execution) in hardware. The current software-based methods are good for the few, large tasks we typically schedule today, but introduce too much overhead for the very small tasks needed to support scaling of future highly parallel workloads. This paper is itself an example of using future workloads to test architectural ideas: our application researchers tested the scheduler with benchmarks from the model-based application domain, which provide advanced capabilities for the recognition, mining, and synthesis of models. Results are particularly good for the game physics workload discussed in paper #2.
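A back-of-the-envelope model shows why scheduling overhead matters so much at fine granularity. The cycle counts below are our own illustrative assumptions, not numbers from the paper: if dispatching a task through a software scheduler costs on the order of a thousand cycles, tiny tasks spend most of their time in the scheduler, while a hardware scheduler with a much smaller per-task cost keeps them efficient.

```python
# Hypothetical overhead model (our illustration, not the paper's data):
# efficiency is the fraction of each scheduled task's time spent on
# useful work rather than on the scheduler itself.

def efficiency(task_cycles, sched_cycles):
    """Useful-work fraction for one scheduled task."""
    return task_cycles / (task_cycles + sched_cycles)

for task in (100, 1_000, 100_000):
    sw = efficiency(task, sched_cycles=1_000)  # assumed software-scheduler cost
    hw = efficiency(task, sched_cycles=10)     # assumed hardware-scheduler cost
    print(f"{task:>7}-cycle tasks: software {sw:.0%}, hardware {hw:.0%}")
```

With these assumed costs, a 100-cycle task is under 10% efficient through the software path but over 90% efficient through the hardware path; only very large tasks amortize a heavyweight scheduler.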

The fact remains, however, that parallel programming is hard (see Anwar’s blog on the subject). Here’s an excerpt from the same physics modeling paper (#2):

Many modules of these applications require extensive effort to achieve good performance scaling. In some cases, the best serial algorithms have poor parallel scalability. For these, we use alternative algorithms which are slower on one core, but have more parallelism. In other cases, we modify the algorithm to expose more parallelism. The overhead of exposing the parallelism is often small compared to the benefits of improved scaling.
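The trade-off the excerpt describes can be made concrete with a classic example (ours, not the paper's): a prefix sum. The straightforward serial scan performs the fewest additions, but every step depends on the previous one. The Hillis–Steele style scan performs more total additions, yet organizes them into only log2(n) rounds, and within each round every addition is independent and could run on a separate core.

```python
# Trading a faster serial algorithm for a more parallel one: prefix sum.

def scan_serial(xs):
    # Minimal work (n-1 additions), but a single dependency chain:
    # each partial sum needs the one before it.
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out

def scan_parallel_style(xs):
    # Hillis-Steele inclusive scan: about n*log2(n) additions in total,
    # but only log2(n) rounds; within a round, every element's update
    # is independent, so many cores could run them at once.
    out = list(xs)
    step = 1
    while step < len(out):
        out = [out[i] + (out[i - step] if i >= step else 0)
               for i in range(len(out))]
        step *= 2
    return out

data = list(range(16))
assert scan_serial(data) == scan_parallel_style(data)
```

The parallel-style version is slower on one core (it does more additions), but its critical path shrinks from n steps to log2(n) rounds, which is exactly the kind of algorithm substitution the excerpt describes.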

As such, we are working to simplify parallel programming through new hardware/software innovations. The final two papers are on this topic:

7. Giving an IA “look and feel” to the programming of integrated hardware accelerators

8. A comprehensive run-time environment for tera-scale platforms

Multi-core brings the opportunity to integrate non-IA accelerator cores, e.g. media accelerators. However, since accelerators have different instruction sets, they cannot leverage the compilers, tools, and knowledge base developed for IA programming. Research paper #7 outlines a proposed method to extend IA to make accelerators appear as application-level functional units, on which user-level threads, which we call “shreds,” can execute accelerator instructions. This paper describes the architectural extensions that enable this, as well as the language extensions and runtime used to program it.
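To give a feel for the programming model, here is a hypothetical sketch (the names and API are ours, not the paper's extensions): the accelerator is exposed as an application-level functional unit, and a "shred" runs accelerator work directly against the application's own memory, with no driver call or data copy, while the main thread simply waits on it.

```python
# Hypothetical sketch of the shred programming model (names are ours,
# not the paper's API). A user-level "shred" runs an accelerator kernel
# against the same address space as the main IA thread.

import threading

class Accelerator:
    """Stand-in for, e.g., a media accelerator sharing the app's memory."""
    def run_shred(self, kernel, *args):
        # A Python thread stands in for a user-level shred here.
        shred = threading.Thread(target=kernel, args=args)
        shred.start()
        return shred

frame = [5, 3, 8, 1]            # buffer in the application's own memory

def scale_kernel(buf, factor):  # stand-in for accelerator instructions
    for i in range(len(buf)):
        buf[i] *= factor

acc = Accelerator()
shred = acc.run_shred(scale_kernel, frame, 2)
shred.join()                    # main thread waits on the shred
assert frame == [10, 6, 16, 2]  # the "accelerator" wrote app memory directly
```

The point of the sketch is the shared address space: the kernel mutates `frame` in place, the way an accelerator-as-functional-unit would, rather than operating on a copied buffer behind a driver interface.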

Paper 8 expands significantly on the topic of tailoring runtimes to the special environment of tera-scale platforms. Most runtimes treat many cores integrated on a die the same as a large traditional multi-processor system. Runtimes designed to enable efficient, low-overhead use of the many cores and threads on a tera-scale processor will be critical for software scalability.


The run-time presented (called McRT) provides good support for fine-grain parallelism and new concurrency abstractions that ease parallel programming. It replaces many OS-level services with user-level threading “primitives” to which programmers can code directly, avoiding many expensive transitions between the user level and the OS level. These primitives include a scheduler, a memory manager, synchronization primitives, and a set of threading abstractions. Results show an application using this runtime stack scaling almost linearly to more than 64 hardware threads. McRT also provides a high-performance transactional memory library that eases parallel programming by allowing the programmer to avoid error-prone and hard-to-scale locking techniques.
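A toy contrast shows why the transactional style is easier to get right. The sketch below is only an illustration of the programming model, not McRT's implementation: we fake the `atomic()` block with a single global lock, whereas a real transactional memory like McRT's runs transactions optimistically in parallel and retries on conflict. The payoff for the programmer is composition: with fine-grained per-account locks, every caller of `transfer()` must agree on a lock-acquisition order to avoid deadlock; with an atomic block, the programmer just states what must happen indivisibly.

```python
# Toy illustration of the transactional programming model (a global-lock
# stand-in; McRT's actual library executes transactions optimistically).

import threading
from contextlib import contextmanager

_tm_lock = threading.RLock()

@contextmanager
def atomic():
    # Stand-in: a real TM would track read/write sets and retry on conflict.
    with _tm_lock:
        yield

accounts = {"a": 100, "b": 0}

def transfer(src, dst, amount):
    with atomic():              # indivisible regardless of src/dst order
        accounts[src] -= amount
        accounts[dst] += amount

threads = [threading.Thread(target=transfer, args=("a", "b", 1))
           for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
assert accounts == {"a": 0, "b": 100}   # no lost updates, no lock ordering
```

Without the atomic block, the unsynchronized read-modify-write of each balance could lose updates under contention; with it, the 100 concurrent transfers compose safely and the total balance is conserved.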

Enabling parallel programming is essential to making tera-scale computers widespread in the future, which is why you have seen, and will continue to see, a number of posts on the topic on this blog, especially from Tim, Ali, and Anwar. I hope this set of papers demonstrates how software and hardware innovations are intimately intertwined, and that taking a very broad view of multi-core is essential to taking computational capabilities to the next order of magnitude. Come back and check out the ITJ in a few days to learn more.

3 Responses to Multi-core research update: the intimate coupling of software & hardware

  1. Hello Sean,
    Just found that I have been using a kind of *model-based computing* for four years already. I’ve read in HPCwire’s article – “Model Based Computing” by Michael Feldman that *real-time translation* is supposed to be one of the primary tasks for RMS application: Recognition – Mining – Synthesis: “An important feature of this model is the interactive nature of the applications. This works on a couple of different levels. The first is that the types of applications being envisioned usually have to deal with real-time data input, and then turn around and generate timely output. Another aspect to the interactiveness is the ability of the software to learn, through the feedback of the mining and/or synthesis operations.” But I still have doubts whether the models could give relatively correct results even using *supercomputers* without involving an expert in a real-time process. They can just help for the first steps of the RMS process using the models saved in expert knowledge databases.
    As I recently mentioned in my comment to your post about the possible applications of an 80-core processor – *rule-based machine translation* models are just abstractions that generate simplified translations based on just several general meanings. For example, let’s take any general translation dictionary – even the biggest of them, having tens of volumes, hardly contains 5% of the phrases used in a *real language*. And only several percent of the big dictionary is contained in a general dictionary of an MT system – a real language is just not described by linguists for real-time translation at any sufficient level. What results would we get using these models?
    Let’s take model-based computer vision, for another example. A model of a certain person must have all his facial expressions for all possible emotions to be recognized from his face. Is it possible to achieve that for every real person at a time? And that’s just for the 3D space+time reality.
    Language is the meaning (intelligence, wisdom) of real communications between people, so if the *real examples* of phrases are described, the expert can select the right meaning for every case and a general meaning for automatic preprocessing. This is the concept of example-based machine translation. But again, just a simplified approach is used in TM (translation memory) systems, which save only translated sentences in databases and then just try to find a word or number that is new to the model of the sentence, substituting it into the rest of the sentence.
    A *real model* is all the meanings of a word, which are then combined into new models – ready phrases – by an expert, with a general meaning assigned for further automatic pretranslation; that’s the learning process for an expert knowledge database. The real-time processing is a *synthesis* of the meanings of the words and phrases driven by an expert (the same is true for visual identification systems), or *assembling*, as in genetics, driven by Nature. That’s the AI-based expert translation I’m working on (it’s been my hobby for these four years – I translate texts every day using this system). A really interesting thing: the elements of real language coded with a binary algorithm (General-Options) work like an assembler programming language – it’s like a construction of the digital world (which consists of two mirrored realities).

  2. Sean Koehl says:

    Sally, according to the lead author on the McRT paper (Bratin Saha), Pthread is not actually a part of core McRT. In the figure above, the darkened portion is core McRT which does not include pthreads. There is a pthread adaptor that sits on top of core McRT – this adaptor takes pthread calls and translates them into the core McRT API.