Resiliency – A Key Strategy to Keep Reaping the Benefits of Moore’s Law (guest post)

This post comes from Antonio Gonzalez, director of the Intel Barcelona Research Center in Spain. His lab conducts a variety of research aimed at improving the performance and energy efficiency of future multi-core and tera-scale microprocessors. His post relates to a paper presented this month at the International Symposium on Microarchitecture on the topic of resilient microarchitectures.

Moore’s Law will continue to provide architects with smaller, faster and less energy-consuming transistors with which to design future microprocessors. This will allow architects to keep increasing the performance of future microprocessors and to enable new applications that otherwise would not be possible.

These future transistors will also have other characteristics that designers have to take into account to better exploit their enhanced capabilities. In particular, future transistors are likely to have a higher degree of variability. This variability has a spatial and a temporal component.

Spatial variability occurs when different transistors within the same chip (or on different chips) that were designed to be identical behave differently in practice (i.e., they differ in speed, leakage, etc.). Temporal variability refers to the fact that the same transistor may change its behavior over time.

The basic approach to dealing with variability is to use guardbands. Microprocessors are synchronous systems whose components operate at the pace set by a clock. The clock period is set so that it is larger than the nominal delay of each component plus a worst-case estimate of its variability.

This approach is simple, but it can sacrifice significant performance potential if the variability is low on average but reaches high values in a few uncommon cases, since every case ends up paying the worst-case margin.
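As a rough illustration of the guardband rule described above, the following Python sketch computes a clock period from per-block nominal delays and worst-case margins, then shows how much slack is left unused in a typical case. The block names and all picosecond values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical numbers, in picoseconds; none of them come from the post.
NOMINAL_DELAY_PS = {"alu": 400, "register_file": 380, "cache": 420}
WORST_CASE_VARIATION_PS = {"alu": 100, "register_file": 120, "cache": 90}
TYPICAL_VARIATION_PS = {"alu": 10, "register_file": 15, "cache": 12}


def guardbanded_period(nominal, worst_case):
    """Smallest period covering every block's nominal delay plus its worst-case margin."""
    return max(nominal[name] + worst_case[name] for name in nominal)


def unused_slack(period, nominal, actual_variation):
    """Per-block time left over in a cycle for one observed variation sample."""
    return {
        name: period - (nominal[name] + actual_variation[name])
        for name in nominal
    }


period = guardbanded_period(NOMINAL_DELAY_PS, WORST_CASE_VARIATION_PS)
slack = unused_slack(period, NOMINAL_DELAY_PS, TYPICAL_VARIATION_PS)

print(period)  # 510
print(slack)   # {'alu': 100, 'register_file': 115, 'cache': 78}
```

With these made-up numbers, every cycle is 510 ps long even though the slowest block typically needs only 432 ps, which is the kind of lost potential the worst-case margin implies.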

Under this scenario, “resilient architectures” that can adapt to the particular temporal and spatial variability of each individual block are becoming more appealing due to their potential to provide much better performance.

There are two basic, orthogonal approaches to building a resilient architecture. One is to mitigate the variability. This normally requires understanding the sources of variability and devising particular techniques that mitigate their effects. The other is to use timing speculation, which consists of assuming a timing that is normally higher than the actual delay but may sometimes turn out to be optimistic. Timing speculation requires a mechanism to detect and correct the occasional faults caused by these optimistic assumptions. Such mechanisms can also be used to detect and correct other types of faults, such as those caused by external agents, as is the case with neutron strikes (aka soft errors).
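The trade-off behind timing speculation can be sketched with a toy model in Python. It assumes that each operation takes one clock cycle, that an operation whose actual delay exceeds the clock period is detected as a timing fault, and that correcting a fault costs a fixed number of extra cycles. The function name, the delay values and the recovery penalty are all hypothetical; real detection and recovery hardware is considerably more involved.

```python
def total_time(delays_ps, period_ps, recovery_cycles):
    """Return (total time in ps, number of detected timing faults).

    Toy model: one cycle per operation; an operation slower than the
    clock period is flagged as a fault and costs recovery_cycles extra.
    """
    cycles = 0
    faults = 0
    for delay in delays_ps:
        cycles += 1
        if delay > period_ps:
            faults += 1
            cycles += recovery_cycles
    return cycles * period_ps, faults


# Hypothetical workload: 98 typical operations and 2 rare slow ones.
DELAYS_PS = [410] * 98 + [500] * 2

# Guardbanded clock: covers the worst case, so it never faults.
print(total_time(DELAYS_PS, 510, 5))  # (51000, 0)

# Speculative clock: faster period, occasional detect-and-correct.
print(total_time(DELAYS_PS, 420, 5))  # (46200, 2)
```

In this example the speculative clock finishes sooner despite two recoveries, because the slow cases are rare. If faults were frequent or recovery were expensive, the balance would tip the other way.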

Resilient microarchitectures have recently become a hot topic in microarchitecture research due to the increasing magnitude of the variations. This trend can be observed, for instance, by looking at the program of the top conference in microarchitecture, the IEEE/ACM International Symposium on Microarchitecture, which was recently held in Chicago (Dec. 1-5, 2007). The program consisted of 35 papers, and 9 of these papers (26%) focused on the topic of resiliency.

Resilient microarchitectures are the subject of one of the research projects in my lab, the Intel Barcelona Research Center. One of these 9 papers on resiliency was authored by our team. The paper presents a technique we call “Penelope” to mitigate a particular source of temporal variability, which is known as NBTI (Negative Bias Temperature Instability) wear-out.

We chose the name Penelope because, according to Greek mythology, Penelope spent 20 years waiting for her husband Odysseus to return from the Trojan War. In order to refuse marriage proposals, she wove a shroud and claimed that she would choose a suitor once the shroud was finished. Every night for three years she undid part of the work she had done during the day. Our technique does something similar: during certain intervals of time it tries to heal some of the wear-out caused by NBTI.

NBTI affects PMOS transistors when a negative voltage is applied at the gate (logic input “0”), causing an increase in the threshold voltage and, hence, a lower transistor speed. Conversely, when the gate of a PMOS transistor is set to “1”, it not only does not degrade but also enjoys a self-healing effect. The Penelope technique tries to maximize the time that PMOS transistors have a logic “1” at their gate in many of the key components of a microprocessor, in order to minimize NBTI degradation.
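To make the idea of maximizing time at logic “1” concrete, here is a small Python sketch that measures the fraction of cycles in which a PMOS gate sees a “0” (the stressing input) and shows how that fraction drops if a “1” is driven whenever the signal is idle. This is only an illustration of the metric: the trace, the busy/idle flags and the helper names are hypothetical, and the code is not the mechanism described in the paper.

```python
def stress_fraction(gate_values):
    """Fraction of cycles in which the gate input is logic 0 (NBTI stress)."""
    return sum(1 for value in gate_values if value == 0) / len(gate_values)


def fill_idle_with_ones(gate_values, busy):
    """Keep the real value in busy cycles; drive logic 1 in idle cycles."""
    return [value if is_busy else 1 for value, is_busy in zip(gate_values, busy)]


# Hypothetical per-cycle gate inputs and whether the value is in use.
trace = [0, 0, 0, 0, 1, 0, 0, 0]
busy = [True, False, False, False, True, True, False, False]

healed = fill_idle_with_ones(trace, busy)

print(stress_fraction(trace))   # 0.875
print(healed)                   # [0, 1, 1, 1, 1, 0, 1, 1]
print(stress_fraction(healed))  # 0.25
```

The values the circuit actually needs are left untouched; only cycles in which the content does not matter are used to put the transistor in its recovering state.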

The result is that NBTI degradation is slowed down significantly and, as a consequence, guardbands can be reduced significantly and performance increased.

This is one of the recent projects of our lab, whose charter is to develop innovative ideas for future microprocessors, with special emphasis on the synergy between microarchitecture and compilers to improve performance, increase reliability and reduce power dissipation and energy consumption.
