Whenever I speak about Intel Labs’ Tera-scale Computing Research program, I face skepticism about how far the growth to multiple cores can extend. “How will anyone ever use hundreds of cores on one task?” sums up the typical comments. The worry is that coordinating the complex interactions among cores will limit the benefit to a few cores, even if it can be done without errors. So how can it be done? We have plenty of examples in the world of “scale-out” computing in the “Cloud”, where thousands of nodes are routinely used for everything from search to machine learning. “Scale-out” refers to using more independent nodes to tackle an application, each with its own memory, as opposed to “scale-up”, where more resources (cores, memory) are added to a node with shared memory. Messaging among nodes replaces coordinated sharing of the same memory.
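To make the scale-out idea concrete, here is a minimal sketch in Python. It is only an illustration, not SCC software: the names `node` and `scale_out_sum` are made up for this example, and threads with queues stand in for independent nodes. Real nodes would not share an address space; here the "own memory" rule is kept purely by convention, with each worker touching only the data it receives as a message.

```python
import queue
import threading


def node(node_id, inbox, outbox):
    """A hypothetical 'node': private state, communication only by messages."""
    local_total = 0                      # private to this node, never shared
    while True:
        msg = inbox.get()                # block until a message arrives
        if msg is None:                  # sentinel: no more work for this node
            break
        local_total += sum(msg)          # work on the chunk that was sent
    outbox.put((node_id, local_total))   # report the partial result as a message


def scale_out_sum(data, num_nodes=4, chunk=100):
    """Sum `data` by dealing chunks out to `num_nodes` message-driven workers."""
    inboxes = [queue.Queue() for _ in range(num_nodes)]
    results = queue.Queue()
    workers = [
        threading.Thread(target=node, args=(i, inboxes[i], results))
        for i in range(num_nodes)
    ]
    for w in workers:
        w.start()
    # Deal chunks of the input round-robin to the nodes.
    for n, start in enumerate(range(0, len(data), chunk)):
        inboxes[n % num_nodes].put(data[start:start + chunk])
    for q in inboxes:
        q.put(None)                      # tell every node to finish
    for w in workers:
        w.join()
    # Combine one partial result per node.
    return sum(results.get()[1] for _ in range(num_nodes))
```

Adding nodes here means adding more inboxes and workers; no lock around a shared total is needed because the only coordination is the exchange of messages.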
Today Intel Labs disclosed our next-generation many-core research prototype processor, the Single-chip Cloud Computer (SCC). It is designed to allow exploration of scale-out programming, or “Cloud computing”, on-die. The SCC has 48 Intel Architecture (IA) cores, the largest number ever put on a CPU, connected by an on-die network with latency and bandwidth (256 GB/sec) only dreamed of by designers of traditional clusters. SCC is both a microcosm of a Cloud datacenter and an example of a possible architecture for a future many-core datacenter processor that exploits the efficiencies of high-integration silicon design. Power management is also a focus, with fine-grained software control of voltage and frequency. SCC can run all 48 cores at one time over a range of 25 to 125 W, or selectively vary the voltage and frequency of the mesh and of sets of cores.
The super-computing community has used massive parallelism for years, with message-passing programming models based on APIs such as MPI. The newest generation of super-computers uses huge numbers of ordinary processors connected by high-performance networks, with the same programming model and with proven scaling to thousands of nodes. Cloud computing centers have moved to the same architecture with greatly simplified programming models, such as Google’s MapReduce, that abstract the messaging from the programmer, applying thousands of processors to large data sets for problems such as sorting, machine learning and statistical machine translation.
What might we apply to parallel programming for mainstream desktop and laptop processors from what has been learned in these areas? We believe SCC is an ideal test-bed for exploring parallel programming approaches for the mainstream, as well as how Cloud computing performance could be improved with an on-die architecture that reflects the larger Cloud.
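As a rough illustration of the MapReduce style mentioned above, the following Python sketch counts words. It is not Google's or Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for this example, and everything runs sequentially in one process. In a real framework the map and reduce calls would be spread across many nodes, and the shuffle step is where the messaging that the programmer never sees would take place.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all counts emitted for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The programmer writes only `map_phase` and `reduce_phase`; because each call depends on nothing but its own arguments, a framework is free to run them on whichever node holds the data.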
Microsoft Research is demonstrating how Visual Studio can be modified for development of a message-passing application on SCC. Intel Labs is demonstrating the implementation of a microcosm of a Cloud datacenter on chip, with Linux on each core, running an application using Hadoop. We’re also showing demonstrations of message-passing as well as of software-based coherent shared memory on SCC. We expect to make SCC available to academic and industry partners (SCC software research program) who will explore all of these directions and more, leading to an understanding of how to build better processors for the Cloud as well as scalable programming models and architectures that will apply to all processors for all market segments. Today we enter an exciting new phase of our research to bring exciting new applications to users with the performance of many-core processors.
Related blog: Watch the livecast from the event

18 Responses to Exploring programming models with the Single-chip Cloud Computer research prototype
success!:)
Hi, Please bring us something great, a revolution… (and not only for pro).
Why are there only vertical and horizontal connections between cores? Aren’t diagonal connections possible (they could reduce latency and increase grid density)?
Are connections between cores through the Z axis possible, to give multiple core layer capability? (nano heat-pipes to evacuate heat?)
Will my 3D software, capable of tile rendering, benefit from the terascale CPU?
Nicolas (France)
Bioinformatics has a lot of problems that can be segmented easily without the need for parallel programming techniques. Building and maintaining an HPC cluster is often out of reach of small labs. I think if the price is right, a lot of labs would want one of these for their bioinformatician. It will also simplify data retrieval across shared storage if a single processing entity is the data cruncher.
Nicolas,
A simple 2D (horizontal and vertical) mesh has advantages in that it is a good fit to the layout of a chip as well as supporting a simple tiled design. We weren’t doing any experimentation with thermal solutions, and this processor doesn’t require any extraordinary cooling. If, as I’d expect, your 3D graphics software makes heavy use of floating point and vector-level parallelism, it is more likely a fit to the Larrabee architecture that we’ve described, for example at SIGGRAPH. -Jim
One of our goals is to gain insights that we can apply across all our products; we certainly aspire to do great things.
The SCC sounds like it would be a good platform for IBM’s DB2 and/or Informix databases with their ‘shared nothing’ architecture.
I’m writing a paper on options for correct operation for this chip (or this type of chip). Could you post a short list of models that you are aware of that I could search in the academic literature? I am aware of software coherence (what can I search to find a description of your general approach?), MPI (in what arrangement? With separate OSes?), and separate OSes running a network architecture. Are there other options?
Thank you so much!
-Jonathan, University of Cambridge
Excellent article! Such innovations indeed tell us of the paradigm shift happening in the computing model. The SCC processor, in my opinion, in a way confirms the opportunities seen in the cloud computing space. I wonder if it would be possible for all the scaled-out workloads presently put in private/public clouds to be made to run on a single machine having the SCC chip. Meaning, I could run multiple workloads on a given server (with SCC) by possibly dividing the cores across them. The same can be thought of for memory as well as storage.
Thanks for this very interesting article.
Thanks for post. Very interesting.
How are the 4 memory controllers attached to the network mesh?
Thanks,
Kent
Each router has 5 ports (N, E, S, W and Tile). The memory controllers are each on ports at the periphery of the mesh.
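For readers who want to picture how a message might cross such a mesh, here is a small Python sketch. It assumes dimension-ordered (X-then-Y) routing, which is a common choice for 2D meshes but is an assumption here; the reply above does not say which routing algorithm the SCC routers use. The function name `xy_route` and the coordinate convention (x grows to the east, y grows to the north) are also made up for the example.

```python
def xy_route(src, dst):
    """Return the router output ports used going from tile `src` to tile `dst`.

    Tiles are (x, y) pairs; x grows to the east and y grows to the north.
    The route travels along X first, then along Y, and finally leaves the
    mesh through the 'Tile' port of the destination router.
    """
    x, y = src
    dst_x, dst_y = dst
    ports = []
    while x != dst_x:                    # first correct the X coordinate
        if dst_x > x:
            ports.append("E")
            x += 1
        else:
            ports.append("W")
            x -= 1
    while y != dst_y:                    # then correct the Y coordinate
        if dst_y > y:
            ports.append("N")
            y += 1
        else:
            ports.append("S")
            y -= 1
    ports.append("Tile")                 # deliver to the local tile
    return ports
```

Under these assumptions, the number of router-to-router hops is simply the horizontal distance plus the vertical distance between the two tiles, which is why a destination attached at the periphery can be farther from some tiles than from others.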
Congrats, Dr. Held, on your team’s progress on Tera-scale Computing being reported in the latest Intel Technology Journal, http://www.click.intel.com. It has some great articles, thanks!
Thanks for posting this! Great article on programming models!
awesome! You always have great posts.
Thanks for these articles, I enjoyed them!
Very Interesting!
How fast is this solution?
Not especially fast, since we have classic Pentium cores and only run them at 800MHz-1GHz. With the dual Pentium pipeline each core can issue at most 2 instructions at a time, so 48 cores at 1GHz give a peak of 96 GigaOps/sec. Of course real programs would have a mix of multi-cycle instructions, and only one pipeline can do floating point. There are many advancements that came after this generation.
But this is an experimental processor, not a product or product prototype. It is designed to serve as a concept vehicle for some technologies like the mesh and software programming model research. Scaling (getting proportional benefit with use of more cores) is the kind of issue we’re exploring, not raw performance.
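The peak figure quoted in the reply above is simple arithmetic, sketched here in Python. The function name `peak_ops_per_sec` is made up for the example, and the result is an upper bound only: it assumes every core issues 2 instructions on every cycle, which real programs will not do.

```python
def peak_ops_per_sec(cores, issue_width, clock_hz):
    """Upper bound on instructions issued per second across all cores."""
    return cores * issue_width * clock_hz


# Figures from the reply: 48 cores, dual-issue, clocked at 800 MHz to 1 GHz.
low = peak_ops_per_sec(48, 2, 800_000_000)
high = peak_ops_per_sec(48, 2, 1_000_000_000)
print(f"{low / 1e9:.1f} to {high / 1e9:.1f} GigaOps/sec")
```

At 1 GHz this reproduces the 96 GigaOps/sec peak, and at 800 MHz the same formula gives 76.8 GigaOps/sec.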
Thanks for post. Very interesting.