Moving tera-scale data (Intel Labs @ ISSCC part 1)

For Intel’s hardcore lab jocks, one of the most exciting events each year is the International Solid-State Circuits Conference. It’s the Olympics of circuit design, where top researchers from industry and academia present their latest advances in computing and communications hardware. I’ve been working with our researchers to highlight papers from Intel Labs for the past four years, and I can say that these papers are not for the timid. They are chock full of circuit and architectural descriptions, measurement results, diagrams, tables and graphs, and leave no room for fluff whatsoever. ISSCC has accepted eight papers from Intel Labs this year, falling roughly into two general categories: Tera-scale Computing and Intelligent Circuits. In this blog, I’d like to provide a preview of those in the first category and help translate the highly technical abstracts published in the ISSCC Advanced Program.

The first paper should be familiar to some. “A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS” describes what we have dubbed our Single-Chip Cloud Computer, or SCC. This is a flagship project for Tera-scale and a successor to the Teraflops Research Processor disclosed at ISSCC in 2007. The SCC combines 48 fully programmable IA cores connected by a high-speed mesh network with 2 Terabit/s (256 GB/s) bisection bandwidth. The design supports message passing and an OS running on each core, making it appear to the programmer like a “cloud” of compute resources on a chip. To learn more about the SCC, read Jim Held’s blog and visit the project homepage. We are currently working on a program to share these chips externally with research partners to help advance parallel application development.
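To give a feel for the message-passing style the SCC supports, here is a minimal sketch in Python. It is purely illustrative: the SCC uses on-die message buffers rather than software queues, and the `Core` class and its methods are invented for this example. The point is that cores exchange explicit messages instead of sharing cached memory.

```python
from collections import deque

class Core:
    """Toy model of an SCC-style core: no shared state, only message queues.
    (Hypothetical class for illustration; not the SCC's actual API.)"""
    def __init__(self, core_id):
        self.core_id = core_id
        self.inbox = deque()   # messages delivered to this core

    def send(self, other, payload):
        # Communication is explicit: drop a message in the other core's inbox.
        other.inbox.append((self.core_id, payload))

    def receive(self):
        # Pull the oldest pending message (sender id, payload).
        return self.inbox.popleft()

# 48 independent "cores", as on the SCC.
cores = [Core(i) for i in range(48)]
cores[0].send(cores[1], "hello")
sender, payload = cores[1].receive()
```

In this model, the only way data moves between cores is a `send`/`receive` pair, which is what lets the chip look like a small "cloud" of independent machines to the programmer.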

The second paper, “A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS,” explores a new kind of on-chip network to connect cores or accelerators in future many-core chips (or even SoCs). The goal is to provide a way for these integrated networks to consume even less power by using circuit switching rather than packet switching. This means that rather than having a router switch data packet-by-packet as it traverses the network, a circuit or “channel” is established that allows a stream of data to travel directly from one location to another, non-stop. The 64-node prototype described in this paper shows that it is possible to move terabits of data per second with very high energy efficiency using this technique: as much as 1.5 Terabits/s per Watt.

[Photo: the parallel-interface prototype, with a high-density interconnect attached to the top of the processor packages]

The third paper, “A 47×10Gb/s 1.4mW/(Gb/s) Parallel Interface in 45nm CMOS,” deals with moving data on and off chip rather than within the chip. It describes a prototype I/O link that can move as much as 470 Gigabits of data per second chip-to-chip while consuming only about 0.7W of power. It does this in part by avoiding the computer motherboard. Instead, chips are connected to each other by attaching a high-quality, high-density interconnect to the top of the processor packages, as shown in this photo. This more direct chip-to-chip I/O approach and the use of many (47) parallel data channels allowed researchers to design circuits with roughly 10X better power efficiency than what is done in products today. Furthermore, the I/O links can go to sleep on command and wake in a few nanoseconds, about 1000x faster than today, consuming 93% less power when asleep. This means the sleep capability could be used liberally, further reducing the average power consumption. This paper is a significant milestone for us because it aims to eliminate a major barrier to getting the full value of future many-core chips: the need to feed the beast, so to speak.
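To see why nanosecond wake-up matters, here is a back-of-envelope calculation using only the figures quoted above (about 0.7W active, 93% less when asleep). The duty-cycle numbers in the example are my own assumptions; the point is that fast wake-up lets the link sleep in even short idle gaps, so the average tracks the sleep power rather than the active power.

```python
# Figures from the paper as quoted in the text above.
ACTIVE_W = 0.7                   # link power while transferring data
SLEEP_W = ACTIVE_W * (1 - 0.93)  # 93% lower when asleep, ~0.049 W

def average_power(active_fraction):
    """Average link power when it is busy only `active_fraction` of the time
    and sleeps the rest (feasible because wake-up takes only nanoseconds)."""
    return active_fraction * ACTIVE_W + (1 - active_fraction) * SLEEP_W
```

For example, a link that is busy only 25% of the time averages roughly 0.21W instead of a constant 0.7W, which is where the "use it liberally" savings come from.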

The final tera-scale paper brings us right back to the beginning, because it presents further results from the original Teraflops Research Chip from 2007. “Within-Die Variation-Aware Dynamic-Voltage-Frequency Scaling Core Mapping and Thread Hopping for an 80-Core Processor” studies how best to divide tasks among the chip’s 80 cores to optimize processing efficiency. The key finding is that one can take advantage of natural core-to-core variations in maximum frequency and leakage current. Normally all cores are treated as identical for simplicity, but that forces one to set parameters like clock frequency based on the lowest-performing core. Our researchers found that by measuring these variations in advance, they could define energy models and a variation-aware global optimizer that boost the chip’s efficiency by assigning tasks to specific cores and adjusting core clock speeds and voltages based on the needs of the application. The proposed approach was predicted to yield as much as 6-35% energy savings. Furthermore, a “thread hopping” technique that allows priority tasks to continuously hop to the best available cores could yield a 20-60% energy savings for certain tasks. These ideas are already being applied to the Single-Chip Cloud Computer, which implements some of the fine-grain voltage and frequency control needed to apply these concepts.
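A tiny sketch can make the core-mapping idea concrete. This is not the paper's optimizer: the per-core numbers are invented, and the greedy "frequency per watt of leakage" ranking is a stand-in for the variation-aware global optimizer described above. It simply shows how measured core-to-core variation turns task placement into an optimization problem.

```python
def map_tasks(cores, n_tasks):
    """Assign n_tasks to the most efficient cores first.

    cores: list of (core_id, fmax_ghz, leakage_w) tuples, assumed to come
    from measuring each core in advance. The ranking metric (fmax/leakage)
    is a simplified, hypothetical proxy for the paper's energy models.
    """
    ranked = sorted(cores, key=lambda c: c[1] / c[2], reverse=True)
    return [core_id for core_id, _, _ in ranked[:n_tasks]]

# Invented measurements for four cores: even on one die, max frequency
# and leakage differ, so the "best" cores are not interchangeable.
cores = [(0, 5.7, 0.30), (1, 4.3, 0.28), (2, 5.1, 0.22), (3, 4.9, 0.35)]
```

Here `map_tasks(cores, 2)` picks cores 2 and 0: core 2 is not the fastest, but its low leakage makes it the most efficient. Treating all four cores as identical would forfeit exactly this kind of saving.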

These variation-aware techniques are examples of intelligent circuits which can adapt and optimize to meet the needs of the user. I’ll describe more examples of this in my next blog previewing the remaining four papers from Intel Labs at ISSCC.
