Inside an 80-core chip: the on-chip communication and memory bandwidth solutions

By John Du, reposted from our Chinese language blog.

Here I would like to discuss some hot technical topics. On the tera-scale work, several readers of the Chinese blog asked about the on-chip communication and the memory bandwidth solutions, so I will go into more detail on both.

About the on-chip interconnect for many cores

A many-core system is essentially many processor cores integrated on a single chip, and communication among these cores relies on an on-chip interconnect network. In designing the interconnect, we weigh three costs: power, die area, and design complexity.

First is power. The interconnect fabric is very demanding on power; it can consume up to 36% of the chip's power budget. Naturally, if we increase the fabric bandwidth, we also increase the power consumption. We need to balance the demand for bandwidth against the power management system, so that power is delivered dynamically, on demand, to save energy.
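To make the bandwidth-versus-power trade-off concrete, here is a minimal sketch based on the standard dynamic-power relation P ≈ C·V²·f. It is my own illustration, not the chip's actual power-management logic, and every parameter value in it is hypothetical.

    # Illustrative only: dynamic switching power follows roughly P = C_eff * V^2 * f,
    # so running fabric links at full voltage and frequency under light traffic wastes energy.
    # All numbers below are hypothetical.

    def dynamic_power(c_eff, voltage, freq_hz):
        """Classic switching-power estimate: P = C_eff * V^2 * f."""
        return c_eff * voltage ** 2 * freq_hz

    c_eff = 1e-10                                    # effective switched capacitance (farads), hypothetical
    full_speed = dynamic_power(c_eff, 1.0, 4e9)      # link at nominal voltage, 4 GHz
    throttled  = dynamic_power(c_eff, 0.8, 2e9)      # lower voltage and frequency under light load

    print(f"full speed: {full_speed:.2f} W per router")
    print(f"throttled : {throttled:.2f} W per router")
    print(f"savings   : {100 * (1 - throttled / full_speed):.0f}%")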

The second factor is die area. VLSI chips give designers a vast number of transistors with which to build functional circuits, and the interconnect network is built from those same transistors; it can occupy more than one fifth of the area of a core. The more area that goes to the interconnect fabric, the less is left for computation, so there must be a trade-off at some point. We prefer not to sacrifice too much area from the compute functions, which means the die area available for the interconnect fabric is limited.

The third is design complexity. Every circuit design must be optimized, and a simple circuit is easy to optimize while a complex one is not. Consider four common network topologies: the bus, the bi-directional ring, the 2D mesh, and the crossbar. The bus is the simplest, but it can carry only one message at a time. The bi-directional ring can carry messages simultaneously and its links can run at high speed, but as the core count grows it becomes less economical, because there is effectively only one route between a given pair of cores. If we add a dimension to the topology we get the 2D mesh, which supports more concurrent messages and offers many possible routes. If we keep increasing the connectivity, for example to a crossbar, all the cores can communicate with each other simultaneously. The higher-dimensional networks have better architectural properties, but they are harder to design and optimize; the sketch below makes the comparison concrete.
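The following rough comparison is my own illustration, not the chip's design data: it tallies a simplified hardware cost (shared links or crosspoints) and the worst-case hop count for each topology, for a rows x cols arrangement of cores.

    # Rough, simplified cost metrics for the four topologies discussed above.
    def topology_costs(rows, cols):
        """Return {name: (link_or_crosspoint_cost, worst_case_hops)}."""
        n = rows * cols
        return {
            "bus":      (1, 1),                           # one shared medium, one message at a time
            "ring":     (n, n // 2),                      # bi-directional: route the shorter way around
            "2d_mesh":  (rows * (cols - 1) + cols * (rows - 1),
                         (rows - 1) + (cols - 1)),        # corner-to-corner worst case
            "crossbar": (n * n, 1),                       # n x n crosspoints, any-to-any in one step
        }

    for name, (cost, hops) in topology_costs(8, 10).items():   # 8 x 10 = 80 cores
        print(f"{name:8s}  cost={cost:5d}  worst-case hops={hops}")

The bus and ring are cheap but either serialize traffic or make messages cross many intermediate cores; the crossbar gives single-hop any-to-any communication at a quadratic cost; the 2D mesh sits in between, which is exactly the balance discussed next.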

So which interconnect is the right one? Our decision was driven by the bandwidth requirements of the target applications. In tera-scale systems the full-chip bandwidth is on the order of terabytes per second, and each link carries hundreds of GB/s. After weighing performance, die area, and power, our researchers chose the 2D mesh as the on-chip interconnect for the tera-scale many-core chip. For our many-core system it is the best balance of performance and power efficiency.
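As a quick sanity check on the terabytes-per-second figure, here is a back-of-envelope calculation for an 8 x 10 mesh of 80 cores. Only the "hundreds of GB/s per link" scale comes from the post; the specific 100 GB/s value is an assumption for illustration.

    def mesh_bandwidth(rows, cols, link_gb_s):
        """Aggregate and bisection bandwidth of a rows x cols mesh, in GB/s."""
        links = rows * (cols - 1) + cols * (rows - 1)   # horizontal + vertical links
        bisection_links = min(rows, cols)               # links cut when splitting the mesh into equal halves
        return links * link_gb_s, bisection_links * link_gb_s

    aggregate, bisection = mesh_bandwidth(8, 10, link_gb_s=100)  # assumed 100 GB/s per link
    print(f"aggregate : {aggregate / 1000:.1f} TB/s")            # about 14 TB/s across all links
    print(f"bisection : {bisection / 1000:.1f} TB/s")            # bandwidth across the chip's midline

With per-link bandwidth in the hundreds of GB/s, the mesh's aggregate bandwidth easily reaches the terabytes-per-second range the applications demand.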

About the memory and its bandwidth in the many-core system

Our solution is 3D stacked memory based on TSV (through-silicon via) technology. The basic scheme is to place a thin memory die on top of the CPU die and route the power and I/O signals through the memory die to the CPU, so that each core connects directly to the stacked memory. In our 80-core research chip, each core has 256 KB of SRAM (for data and instructions), and there are 8,490 through-silicon vias on the chip. This solution meets the needs for both capacity and low latency. The technology is already implemented in small-volume production, and our researchers are working on supporting HVM (high-volume manufacturing). It is only a question of timing before this technology appears in products.
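A few lines of arithmetic put these numbers in perspective. The per-core capacity and the TSV count are the figures quoted above; the per-core channel bandwidth is a hypothetical placeholder, included only to show that aggregate memory bandwidth grows with the core count when each core has its own vertical connection.

    cores = 80
    sram_per_core_kb = 256        # stacked SRAM per core, as quoted above
    tsvs = 8490                   # through-silicon vias on the chip, as quoted above
    channel_gb_s = 4              # hypothetical per-core TSV channel bandwidth

    print(f"total stacked SRAM      : {cores * sram_per_core_kb / 1024:.0f} MB")   # 20 MB
    print(f"TSVs per core (average) : {tsvs / cores:.0f}")                         # roughly 106
    print(f"aggregate memory BW     : {cores * channel_gb_s} GB/s, scaling with core count")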

2 Responses to Inside an 80-core chip: the on-chip communication and memory bandwidth solutions

  1. Lord Volton says:

    How much of this will be applicable to 16 and 32 core solutions coming down the pipeline in the next 4-6 years?
    On a similar note, when we have 16 cores is any of this applicable to 4×4 multiple chip systems?
    Interesting stuff!