Randy Mooney on ISSCC: Scaling performance/watt through circuit innovation

As we look forward to enabling exciting new opportunities in platforms ranging from mobile computing to the data center, along with associated new applications, one common denominator of all these products will be the underlying process technology and the circuits built on that process technology. Providing the building blocks that will enable those platforms and applications is a challenge that we look forward to here in the Circuit Research Lab of Intel’s Corporate Technology Group. Today, I want to discuss four of these building blocks that will be presented at the International Solid-State Circuits Conference (ISSCC).

DRAM-Setup_med.jpg

These include advanced storage (memory) circuits, special purpose circuits for processing of visual data, circuits to increase the robustness of our platforms while adding performance, and circuits for speeding the flow of data between chips in a platform.

Dense Memory

As we move toward Terascale performance levels, the performance and cost of the memory within our platforms become increasingly important. This memory ranges from on-chip caches and register files to off-chip main memory and disk. On-chip memory is typically built using a six-transistor static memory cell (SRAM). A much denser one-transistor cell (DRAM), with an additional element, a capacitor, can be incorporated, but only with the additional cost associated with the more complex process technology needed to form the capacitor. This technology is also much slower than SRAM, and requires a periodic “refresh” of the data in order to retain its value.

An alternative is discussed in a paper titled “2GHz 2Mb 2T Gain-Cell Memory Macro with 128GB/s Bandwidth in a 65nm Logic Process”. The memory described in this paper uses a cell based on two transistors and no explicit capacitor. It produces a design that is twice as compact as SRAM while being much faster than a DRAM-based design. This enables either larger memories to be constructed on a chip of fixed size, or a reduction in chip area for a given memory size, while providing high-speed (high-bandwidth) access to the memory. The prototype described (pictured here) can provide up to 128 billion bytes (a byte is eight individual bits) of data per second. These performance characteristics will make this form of memory an attractive option for on-chip memory in future Terascale devices.
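To put the headline numbers in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the figures from the paper title; the assumptions are mine, namely that "GB" means one billion bytes (as described above) and that "2Mb" means 2 x 2^20 bits.

```python
# Figures taken from the paper title; unit interpretations are assumptions.
CLOCK_HZ = 2e9                    # 2 GHz
BANDWIDTH_BYTES_PER_S = 128e9     # 128 GB/s, assuming 1 GB = 1e9 bytes
MACRO_BITS = 2 * 2**20            # 2 Mb, assuming 1 Mb = 2**20 bits

# Data that must move on every clock cycle to sustain the quoted bandwidth.
bytes_per_cycle = BANDWIDTH_BYTES_PER_S / CLOCK_HZ
bits_per_cycle = bytes_per_cycle * 8

# Number of cycles needed to stream the whole macro once at that rate.
cycles_to_read_all = MACRO_BITS / bits_per_cycle

print(bytes_per_cycle)      # 64.0
print(bits_per_cycle)       # 512.0
print(cycles_to_read_all)   # 4096.0
```

In other words, sustaining the quoted bandwidth at 2GHz means moving 64 bytes (512 bits) every cycle. How the prototype organizes that width internally is not something this sketch tries to model.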

Faster Video Encoding

In order to provide the best performance/watt for both high performance and ultra-mobile applications, some of the transistors in future chips may be solely dedicated to accelerating common tasks, such as video processing. Though such accelerators are task-specific, they can provide 5-10x better performance/watt. Encoding video data to reduce the storage space it requires is an application that demands significant computing power and the associated energy consumption. Within this process, motion estimation for the objects embedded in the video data is the most performance- and power-critical operation to be performed. This estimation is used to determine the changes from one frame of video to the next in order to minimize the amount of data to be stored in the compressed video stream. In addition, these applications must meet a wide range of performance and power constraints to handle a variety of video resolutions, frame rates, and application specifications. The circuit blocks used to implement these applications need to be scalable to maximize performance/watt across all these various constraints.
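For readers who have not seen motion estimation before, the following Python sketch shows the basic idea using exhaustive block matching with a sum-of-absolute-differences (SAD) cost. This is a common textbook approach that I am using for illustration; it is an assumption on my part and not a description of the algorithm or hardware in the paper. All function names and the tiny test frames are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute pixel differences between two equal-sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]

def best_motion_vector(prev_frame, cur_frame, top, left, size, search):
    """Find the offset (dy, dx) into prev_frame that best matches a block
    of cur_frame, trying every offset within +/- search pixels."""
    target = get_block(cur_frame, top, left, size)
    height, width = len(prev_frame), len(prev_frame[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block would fall outside the frame
            cost = sad(target, get_block(prev_frame, y, x, size))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best

# Hypothetical 6x6 frames: a 2x2 object moves from (1, 1) to (2, 3).
prev_frame = [[0] * 6 for _ in range(6)]
cur_frame = [[0] * 6 for _ in range(6)]
for (r, c), value in {(0, 0): 9, (0, 1): 8, (1, 0): 7, (1, 1): 6}.items():
    prev_frame[1 + r][1 + c] = value
    cur_frame[2 + r][3 + c] = value

result = best_motion_vector(prev_frame, cur_frame, top=2, left=3, size=2, search=2)
print(result)  # (0, -1, -2): perfect match one row up and two columns left
```

Even in this toy case the search evaluates a SAD for every candidate offset of every block, which hints at why this step dominates the work in video encoding and why it is a natural target for a dedicated accelerator.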

Video-Accel-Chip_med.jpg

The paper “A 320mV 56µW 411GOPS/W Ultra-Low Voltage Motion Estimation Accelerator in 65nm CMOS” describes circuits (pictured here) built in 65nm CMOS technology to accelerate key signal processing algorithms targeted at video encoding applications. The work described in the paper demonstrates up to 10x better throughput than the best reported accelerator, the ability to tune voltage and performance to optimize energy efficiency for the task at hand, operation below the normal minimum power supply voltage (i.e. sub-threshold) down to 0.22 V, and a maximum energy efficiency of 411 billion operations per second for each watt of power consumed. Since the tasks targeted by this accelerator consume 60-80% of the processing for video compression, these circuits could make hardware-based compression possible on mobile devices and significantly increase the speed of these computations on larger systems, providing new opportunities across the spectrum of our future platforms.
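The efficiency figure is easier to appreciate when converted to energy per operation. The short Python calculation below does that conversion. It reads the power figure in the paper title as 56 microwatts and assumes, as the title suggests but this post does not state, that the power and efficiency figures describe the same operating point.

```python
# Figures from the paper title; pairing them at one operating point is an assumption.
GOPS_PER_WATT = 411        # billions of operations per second per watt
POWER_WATTS = 56e-6        # reading the title's power figure as 56 microwatts

# One watt is one joule per second, so ops/s per watt equals ops per joule.
ops_per_joule = GOPS_PER_WATT * 1e9

# Energy spent on a single operation, in picojoules (about 2.4 pJ).
energy_per_op_pj = 1e12 / ops_per_joule

# Implied throughput at the quoted power level (about 23 million ops/s).
ops_per_second = ops_per_joule * POWER_WATTS

print(round(energy_per_op_pj, 2))
print(round(ops_per_second / 1e6, 1))
```

Roughly 2.4 picojoules per operation is the number to keep in mind when comparing against a general-purpose core doing the same work in software.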

Robust Circuitry

Microprocessors are designed to achieve error-free operation across all specified operating conditions. Doing this requires that the nominal design include a clock frequency “guardband”, or safety margin, to guarantee correct operation during occasional excursions to worst case conditions. These excursions may include both temperature and power supply changes that are temporarily at a worst case value. The paper titled “Energy-Efficient and Metastability-Immune Timing-Error Detection and Instruction-Replay-Based Recovery Circuits for Dynamic-Variation Tolerance” describes circuits that enable design for nominal conditions, while detecting the infrequent timing errors that may occur at worst case conditions. Error recovery is performed by replaying the failed instruction. Since worst case conditions are infrequent, the performance cost of replaying failed instructions is extremely low (<<1%). This enables design for the nominal operating point, without the guardbanding required for worst case conditions.
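A simple model shows why rare errors make replay so cheap. The Python sketch below assumes a baseline of one cycle per instruction and a fixed penalty in cycles for each replayed instruction. Both the model and the example numbers (one error per 10,000 instructions, a 20-cycle penalty) are hypothetical values chosen for illustration, not measurements from the paper.

```python
def replay_overhead(error_rate, replay_penalty_cycles):
    """Fraction of all cycles spent replaying failed instructions.

    Assumes each instruction normally takes one cycle and each timing
    error adds a fixed number of extra cycles (a simplified model).
    """
    extra_cycles_per_instruction = error_rate * replay_penalty_cycles
    return extra_cycles_per_instruction / (1.0 + extra_cycles_per_instruction)

# Hypothetical inputs: 1 error per 10,000 instructions, 20-cycle replay penalty.
overhead = replay_overhead(error_rate=1e-4, replay_penalty_cycles=20)
print(f"{overhead:.2%}")  # about 0.20%
```

Under these assumed inputs the cost is about 0.2% of cycles, which is in line with the "<<1%" characterization above. The overhead only becomes significant if errors stop being rare.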

Error-Correction-Chip_med.jpg

Results from this prototype show that this is the lowest energy and fastest error-detection sequential circuit published to date. Data from the chip further demonstrated up to a 32% performance gain by increasing the frequency of operation of the chip at a given power supply voltage, or up to a 33% reduction in energy consumption by reducing the voltage at a given performance. This increase in the robustness of our circuits while designing for nominal conditions eliminates excess design margin, and helps us to approach the required performance levels to enable future Terascale components.
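As a rough illustration of what the energy figure implies, the Python snippet below estimates the supply voltage reduction that would correspond to a 33% energy saving. It relies on the first-order rule that dynamic switching energy scales with the square of the supply voltage and ignores leakage. That scaling rule is my simplifying assumption, not a result from the paper.

```python
import math

ENERGY_REDUCTION = 0.33   # up to 33% lower energy at the same performance

# Assumption: energy per operation is proportional to V**2 (dynamic energy
# only, leakage ignored), so the voltage ratio is the square root of the
# remaining energy fraction.
voltage_ratio = math.sqrt(1.0 - ENERGY_REDUCTION)
voltage_drop_percent = (1.0 - voltage_ratio) * 100.0

print(round(voltage_ratio, 3))         # about 0.819
print(round(voltage_drop_percent, 1))  # about 18.1
```

Under that assumption, a supply voltage reduction of a little under one fifth would account for the quoted saving.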

Scalable I/O

Future platforms will have tens to hundreds of cores sharing connections to memory, other sockets, and peripherals. To enable data-intensive emerging applications, I/O bandwidth must scale to hundreds of billions of bits per second of data moving between platform components. Increasing the rate at which data moves between chips requires improvements in the physical components used to connect chips in the system, the amplifiers used to send and receive data, the communications schemes employed, and the clocks used to time the transmission and reception of data. These improvements will take the form of higher data rates between components, more connections within a fixed area, and better power efficiency to minimize power as we increase performance.

FastRx-Chip_med.jpg

The paper titled “A 27Gb/s Forwarded-Clock I/O Receiver Using an Injection-Locked LC-DCO in 45nm CMOS” describes work to enable accurate clocks using low complexity circuits with good noise rejection, low circuit area, and excellent power efficiency. Receiving data requires that we know the arrival times of the bits in order to capture them accurately. This can be done by looking at the data stream to recover the correct timing, or by designing an explicit clock path in parallel with the data paths from chip to chip. Either approach is typically implemented using complex circuits such as phase-locked loops and delay-locked loops. These circuits are difficult to design and require large area for the filters needed for proper operation and rejection of noise. The paper describes a method of utilizing a forwarded clock (a clock sent on a path parallel to the data) injected directly into an oscillator at the receiver to eliminate many of the traditional circuits required for proper timing, along with their associated power, while achieving 27 billion bits per second (27Gb/s) of data bandwidth utilizing 45nm CMOS technology. The circuit (shown here) achieves this high performance while enabling good noise immunity, elimination of area-intensive filter components, and excellent power efficiency of 1.6 milliwatts per billion bits per second received (1.6mW/Gb/s). These circuits can be used with a range of physical data channels, and will be a key enabler of Terascale performance at reasonable power levels.
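The receiver figures translate into a few useful quantities with simple arithmetic, sketched in Python below. The data rate and efficiency come from the paper as quoted above. The 500Gb/s aggregate target is a hypothetical number I picked to stand in for "hundreds of billions of bits per second", and the lane count assumes every lane runs at the full 27Gb/s.

```python
import math

DATA_RATE_GBPS = 27.0            # per-lane data rate from the paper
EFFICIENCY_MW_PER_GBPS = 1.6     # receiver power efficiency from the paper
TARGET_AGGREGATE_GBPS = 500.0    # hypothetical platform bandwidth target

# Receiver power for one lane running at full rate (about 43.2 mW).
receiver_power_mw = DATA_RATE_GBPS * EFFICIENCY_MW_PER_GBPS

# 1 mW per Gb/s is the same as 1 pJ per bit, so this is also 1.6 pJ/bit.
energy_per_bit_pj = EFFICIENCY_MW_PER_GBPS

# Time available to capture each bit, in picoseconds (about 37 ps).
bit_time_ps = 1e12 / (DATA_RATE_GBPS * 1e9)

# Lanes needed to reach the hypothetical aggregate target.
lanes_needed = math.ceil(TARGET_AGGREGATE_GBPS / DATA_RATE_GBPS)

print(round(receiver_power_mw, 1), round(bit_time_ps, 1), lanes_needed)
```

A window of roughly 37 picoseconds per bit is why clock accuracy and noise rejection matter so much at these rates. Note that the power figure here covers only the receive side as quoted, not the transmitter or the channel.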

These four papers represent a fraction of the work underway within the Circuit Research Lab. Ongoing advances at the circuit level will be required to enable our vision of the future. Our researchers are excited to meet this challenge, and we look forward to presenting more of this groundbreaking work in the future.

Randy Mooney is an Intel Fellow and, as Director of I/O Research, is responsible for circuit and interconnect research for multi-gigabit, chip-to-chip connections on microprocessor platforms.
