Exploring programming models with the Single-chip Cloud Computer research prototype

Whenever I speak about Intel Labs’ Tera-scale Computing Research program, I face skepticism about how far the growth to multiple cores can extend. “How will anyone ever use hundreds of cores on one task?” sums up the typical comment: coordinating the complex interactions among cores, even if it can be done without errors, will limit the benefit to just a few cores. How? We have plenty of examples in the world of “scale-out” computing in the “Cloud,” where thousands of nodes are routinely used for everything from search to machine learning. “Scale-out” refers to using more independent nodes to tackle an application, each with its own memory, as opposed to “scale-up,” where more resources (cores, memory) are added to a node with shared memory. Messaging among nodes replaces coordinated sharing of the same memory.
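
To make that contrast concrete, here is a minimal message-passing sketch in the style of MPI (illustrative only, not SCC-specific code): each rank owns its data privately and contributes to a shared result only by sending a message, never by writing into a common memory.

    /* Minimal MPI sketch: each rank computes on its own data, and rank 0
     * collects the results by receiving messages rather than by reading
     * shared memory. Build with an MPI wrapper such as mpicc; run with mpirun. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int local = rank * rank;   /* each node's private computation */

        if (rank != 0) {
            /* Workers hand their result to rank 0 as a message. */
            MPI_Send(&local, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else {
            int total = local;
            for (int src = 1; src < size; src++) {
                int incoming;
                MPI_Recv(&incoming, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                total += incoming;
            }
            printf("sum of squares across %d ranks = %d\n", size, total);
        }

        MPI_Finalize();
        return 0;
    }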

Today Intel Labs disclosed our next-generation many-core research prototype processor, the Single-chip Cloud Computer (SCC). It is designed to allow exploration of scale-out programming, or “Cloud computing,” on-die. The SCC has 48 Intel Architecture (IA) cores, the largest number ever put on a CPU, connected by an on-die network with latency and bandwidth (256GB/sec) only dreamed of by designers of traditional clusters. SCC is both a microcosm of a Cloud datacenter and an example of a possible architecture for a future many-core datacenter processor that exploits the efficiencies of high-integration silicon design. Power management is also a focus, providing fine-grained software control of voltage and frequency. SCC can run all 48 cores at a time over a range of 25W to 125W, or selectively vary the voltage and frequency of the mesh and of sets of cores.
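
To give a feel for what fine-grained software power control could look like to a programmer, here is a hypothetical sketch. The function, type and field names below are invented for illustration; they are not the SCC’s actual interface.

    /* Hypothetical sketch of per-domain voltage/frequency control.
     * All names here (power_request_t, set_domain_power) are invented for
     * illustration; a real implementation would program the chip's
     * power-management registers instead of printing. */
    #include <stdio.h>

    typedef struct {
        int domain_id;    /* a group of cores (or the mesh) sharing one supply */
        int freq_mhz;     /* requested clock frequency */
        int voltage_mv;   /* requested supply voltage */
    } power_request_t;

    /* Stub: report the request instead of touching hardware. */
    static int set_domain_power(const power_request_t *req)
    {
        printf("domain %d -> %d MHz @ %d mV\n",
               req->domain_id, req->freq_mhz, req->voltage_mv);
        return 0;
    }

    int main(void)
    {
        /* Throttle an idle group of cores while keeping a busy group at speed. */
        power_request_t idle = { .domain_id = 3, .freq_mhz = 100,  .voltage_mv = 700  };
        power_request_t busy = { .domain_id = 0, .freq_mhz = 1000, .voltage_mv = 1100 };
        set_domain_power(&idle);
        set_domain_power(&busy);
        return 0;
    }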

The supercomputing community has used massive parallelism for years, with message-passing programming models based on APIs such as MPI. The newest generation of supercomputers uses huge numbers of ordinary processors connected by high-performance networks as the computer, with the same programming model – and with proven scaling to thousands of nodes. Cloud computing centers have moved to the same architecture with greatly simplified programming models such as Google’s MapReduce, which abstract the messaging away from the programmer – applying thousands of processors to large data sets for problems such as sorting, machine learning and statistical machine translation. What might we learn from these areas and apply to parallel programming for mainstream desktop and laptop processors?
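
Part of the appeal of models such as MapReduce is that the messaging is hidden behind a higher-level operation. The sketch below uses MPI’s collective MPI_Reduce to show the same idea in miniature: each rank produces a partial result and a single library call combines them, with no explicit sends or receives. It is an illustrative analogy, not SCC or Hadoop code.

    /* Each rank "maps" over its own portion of the work, then one collective
     * call performs the "reduce"; the library decides how the messages flow. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* "Map" step: a local partial result (here, a trivial count). */
        long local_count = rank + 1;

        /* "Reduce" step: combine all partial results onto rank 0. */
        long total = 0;
        MPI_Reduce(&local_count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("combined result across %d ranks = %ld\n", size, total);

        MPI_Finalize();
        return 0;
    }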

We believe SCC is an ideal test-bed to explore parallel programming approaches for the mainstream, as well as how Cloud computing performance could be improved with an on-die architecture that reflects the larger Cloud. Microsoft Research is demonstrating how Visual Studio can be modified for development of a message-passing application on SCC. Intel Labs is demonstrating the implementation of a microcosm of a Cloud datacenter on a chip, with Linux on each core, running an application using Hadoop. We’re also showing demonstrations of message passing as well as of software-based coherent shared memory on SCC.

We expect to make SCC available to academic and industry partners (through an SCC software research program) who will explore all of these directions and more, leading to an understanding of how to build better processors for the Cloud, as well as scalable programming models and architectures that will apply to processors for all market segments. Today we enter an exciting new phase of our research, bringing new applications to users with the performance of many-core processors.

Related blog: Watch the livecast from the event

18 Responses to Exploring programming models with the Single-chip Cloud Computer research prototype

  1. AUGEARD Nicolas says:

    Hi, please bring us something great, a revolution… (and not only for professionals).
    Why are there only vertical and horizontal connections between cores? Wouldn’t diagonal connections across the corners be possible (they could reduce latency and increase grid density)?
    Are connections between cores along the Z axis possible, to allow multiple layers of cores (with nano heat-pipes to evacuate the heat)?
    Will my 3D software, which is capable of tiled rendering, benefit from the tera-scale CPU?
    Nicolas (France)

  2. Kevin says:

    Bioinformatics has a lot of problems that can be segmented easily without the need for parallel programming techniques. Building and maintaining an HPC cluster is often out of reach for small labs. I think that if the price is right, a lot of labs would want one of these for their bioinformatician. It will also simplify data retrieval across shared storage if a single processing entity is the data cruncher.

  3. Jim Held says:

    Nicolas,
    One of our goals is to gain insights that we can apply across all our products, so we certainly aspire to doing great things. :-) A simple 2D mesh (horizontal and vertical connections) has advantages in that it fits the planar layout of a chip well and supports a simple tiled design. We weren’t doing any experimentation with thermal solutions, and this processor doesn’t require any extraordinary cooling. If, as I’d expect, your 3D graphics software makes heavy use of floating point and vector-level parallelism, it is more likely a fit for the Larrabee architecture that we’ve described, for example at SIGGRAPH. -Jim

  4. Stan Muse says:

    The SCC sounds like it would be a good platform for IBM’s DB2 and/or Informix databases with their ‘shared nothing’ architecture.

  5. I’m writing a paper on options for correct operation for this chip (or this type of chip). Could you post a short list of models that you are aware of that I could search for in the academic literature? I am aware of software coherence (what can I search for to find a description of your general approach?), MPI (in what arrangement? With separate OSes?), and separate OSes running a network architecture. Are there other options?
    Thank you so much!
    -Jonathan, University of Cambridge

  6. Amit Pande says:

    Excellent article! Such innovations indeed show the paradigm shift happening in the computing model. The SCC processor, in my opinion, in a way confirms the opportunities seen in the cloud computing space. I wonder whether all the workloads presently scaled out across private/public clouds could be made to run on a single machine with an SCC chip. Meaning, I could run multiple workloads on a given server (with SCC) by dividing the cores across them. The same can be thought of for memory as well as storage.

  7. Jim Held says:

    Each router has 5 ports (N, E, S, W and Tile). The memory controllers are each on ports at the periphery of the mesh.

  8. Jim Held says:

    Not especially fast, since we have classic Pentium cores and only run them at 800MHz-1GHz. With the dual Pentium pipeline each core can issue at most 2 instructions at a time, so 48 cores give a peak of 96 GigaOps/sec. Of course, real programs would have a mix of multi-cycle instructions, and only one pipeline can do floating point. There are many advancements that came after this generation.
    But this is an experimental processor, not a product or product prototype. It is designed to serve as a concept vehicle for some technologies like the mesh and software programming model research. Scaling (getting proportional benefit with use of more cores) is the kind of issue we’re exploring, not raw performance.
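
    For reference, that peak figure is simply issue width × clock rate × core count, an upper bound that ignores multi-cycle and floating-point limits:

        2 instructions/cycle × 1 GHz × 48 cores = 96 × 10^9 operations/sec = 96 GigaOps/sec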