Intel has just announced its first AI-optimized FPGA – the Intel® Stratix® 10 NX FPGA – to address the rapid increase in AI model complexity

FPGAs have been used for hardware customization for decades. That capability underpins the FPGA value proposition: pipelining for applications that require low batch size and low latency, and flexible fabric and I/O functions that scale up to deliver entire systems. Intel’s silicon and software portfolio, which includes FPGAs, empowers our customers’ intelligent services from the cloud to the edge. Many Intel® FPGA customers have already started implementing AI accelerators using the hardware customization available through Intel FPGA technologies.

It starts with Intel’s vision for device-agnostic AI development. This vision allows developers to focus on building their solutions rather than on specific devices. Intel has been fitting FPGAs into that vision for a while. Intel is focusing on acceleration, specifically on the trend of increasing AI model size and complexity. AI model complexity continues to double every 3.5 months. That’s roughly a factor of 10X per year. These AI models are used in applications such as Natural Language Processing (NLP), Fraud Detection, and Surveillance.
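The yearly figure follows from compounding the doubling period. The short check below is only a sketch that assumes the 3.5-month doubling time quoted above; the function name is illustrative and not part of any Intel tool.

```python
def annual_growth_factor(doubling_period_months: float) -> float:
    """Growth multiple over 12 months for a quantity that doubles every given period."""
    doublings_per_year = 12.0 / doubling_period_months
    return 2.0 ** doublings_per_year


if __name__ == "__main__":
    factor = annual_growth_factor(3.5)
    print(f"Growth per year: about {factor:.1f}X")
```

Doubling every 3.5 months works out to about 3.4 doublings per year, or a little under 11X, which is consistent with the rounded 10X figure.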

Intel has just announced its first AI-optimized FPGA – the Intel® Stratix® 10 NX FPGA – to address the rapid increase in AI model complexity. The Intel Stratix 10 NX FPGA embeds a new type of AI-optimized block called the AI Tensor Block, which delivers up to 15X more INT8 compute performance than today’s Stratix 10 MX. The INT8 data type is often used by AI inferencing algorithms. The AI Tensor Block is tuned for the common matrix-matrix or vector-matrix multiplications used by these AI algorithms, with capabilities designed to work efficiently for both small and large matrix sizes.
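To make the workload concrete, here is a minimal pure-Python sketch of an INT8 vector-matrix multiplication. It illustrates the arithmetic only and is not Intel’s implementation of the AI Tensor Block; the function names are hypothetical, and Python’s unbounded integers stand in for the wider accumulators that hardware would use.

```python
INT8_MIN, INT8_MAX = -128, 127


def check_int8(values):
    """Raise ValueError if any value falls outside the signed 8-bit range."""
    for v in values:
        if not INT8_MIN <= v <= INT8_MAX:
            raise ValueError(f"{v} is not a valid INT8 value")


def int8_vector_matrix(vector, matrix):
    """Multiply a length-K INT8 vector by a K x N INT8 matrix.

    Each output element is a dot product: K multiplications feeding one
    accumulator. The accumulator must be wider than 8 bits because the
    sum of products quickly exceeds the INT8 range.
    """
    check_int8(vector)
    for row in matrix:
        check_int8(row)
    if len(vector) != len(matrix):
        raise ValueError("vector length must equal the number of matrix rows")
    n_cols = len(matrix[0])
    outputs = []
    for col in range(n_cols):
        acc = 0  # wide accumulator (unbounded int in Python)
        for k, x in enumerate(vector):
            acc += x * matrix[k][col]
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    print(int8_vector_matrix([1, -2, 3], [[4, 5], [6, 7], [-8, 9]]))
```

The inner loop is a chain of multiply-accumulate operations, which is why the number of multipliers and accumulators available in hardware bounds the throughput of this kind of workload.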

 

David Moore, Corporate Vice President and General Manager for the Intel Programmable Solutions Group, holds up an Intel Stratix 10 NX FPGA, the company’s first AI-optimized FPGA

 

The Intel Stratix 10 NX FPGA has several additional in-package features that support high-performance AI inferencing. These features include high-speed HBM2 memory and high-speed transceivers for fast networking. Note that Intel was able to develop the Intel Stratix 10 NX FPGA quickly due to its chiplet-based FPGA architecture strategy.

Intel partnered with Microsoft to develop the AI Tensor Block to help accelerate AI workloads in the data center.

“As Microsoft designs our real-time multi-node AI solutions, we need flexible processing devices that deliver ASIC-level tensor performance, high memory and connectivity bandwidth, and extremely low latency. Intel® Stratix® 10 NX FPGAs meet Microsoft’s high bar for these requirements, and we are partnering with Intel to develop next-generation solutions to meet our hyperscale AI needs.” – Doug Burger, Technical Fellow, Microsoft Azure Hardware

Intel Stratix 10 NX FPGAs serve as multi-function AI accelerators for Intel® Xeon® processors. They specifically address applications that require hardware customization, low latency, and real-time capabilities. The AI Tensor Blocks in the Intel Stratix 10 NX FPGA deliver more compute throughput by implementing more multipliers and accumulators compared to the DSP block found in other Intel Stratix 10 devices. The AI Tensor Block contains 30 multipliers and 30 accumulators instead of the two multipliers and two accumulators in the DSP block. The multipliers in the AI Tensor Block are tuned for lower precision numerical formats such as INT4, INT8, Block Floating Point 12, and Block Floating Point 16. These specific precisions are frequently used for AI inferencing workloads.
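For readers unfamiliar with block floating point, the sketch below shows the general idea: a block of values shares one exponent, and each value keeps only a small integer mantissa, so the multiplications reduce to integer arithmetic. This is a generic illustration. The article does not describe the bit layout of the Block Floating Point 12 or 16 formats, so the `mantissa_bits` parameter, the rounding scheme, and the function names here are all assumptions.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Quantize floats to signed integer mantissas plus one shared exponent.

    Returns (mantissas, exponent) such that each original value is
    approximately mantissa * 2**exponent.
    """
    max_mag = max(abs(v) for v in block)
    limit = 2 ** (mantissa_bits - 1) - 1  # largest positive mantissa
    if max_mag == 0.0:
        return [0] * len(block), 0
    # Smallest exponent for which the largest magnitude still fits.
    exponent = math.ceil(math.log2(max_mag / limit))
    scale = 2.0 ** exponent
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in block]
    return mantissas, exponent


def bfp_dequantize(mantissas, exponent):
    """Reconstruct approximate float values from mantissas and the shared exponent."""
    return [m * 2.0 ** exponent for m in mantissas]


if __name__ == "__main__":
    mantissas, exponent = bfp_quantize([1.0, 0.5, -0.25, 0.0])
    print(mantissas, exponent)
    print(bfp_dequantize(mantissas, exponent))
```

Because the exponent is set by the largest value in the block, small values in the same block lose relative precision. That is the usual trade-off of block floating point against per-value floating point.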

Intel Stratix 10 NX FPGAs address today’s AI challenges. For example, NLP typically uses large AI models, and these models are growing larger. Language translation, one NLP workload, increasingly needs to detect, recognize, and understand the context of various languages before translating to the target language. These expanded workload requirements drive model complexity, which results in the need for more compute cycles, more memory, and more networking bandwidth.

The Intel Stratix 10 NX FPGA’s in-package HBM2 memory allows large AI models to be stored within the device package. Estimates suggest that a Stratix 10 NX FPGA running a large AI model like BERT at batch size 1 delivers 2.3X better compute performance than an NVIDIA V100.

Fraud Detection is another application where every microsecond matters, and where Intel FPGAs enable real-time data processing. Intel FPGAs’ ability to create custom hardware solutions, with direct ingestion of data through their transceivers and deterministic, low-latency compute elements, makes microsecond-class real-time performance possible. Typically, Fraud Detection employs LSTM (Long Short-Term Memory) AI models at batch size 1. Estimates suggest that the Intel Stratix 10 NX FPGA will deliver 9.5X better compute performance than an NVIDIA V100 GPU for LSTM models at batch size 1.

Finally, consider a video surveillance application. Intel FPGAs excel in video surveillance applications because of their hardware customization ability, which allows implementation of custom processing and custom I/O protocols for direct data ingestion. For example, estimates suggest that the Intel Stratix 10 NX FPGA will provide 3.8X better compute performance than an NVIDIA V100 GPU for video surveillance using the ResNet50 model at batch size 1.

The Intel Stratix 10 NX FPGA extends the benefits of FPGA-based, high-performance hardware customization for AI inferencing through the introduction of the AI Tensor Block. It delivers up to 15X more INT8 compute performance for AI inferencing than today’s Stratix 10 MX. This FPGA is Intel’s first AI-optimized FPGA, and it will be available later this year.

For more information about the Intel Stratix 10 NX FPGA, click here.

 

 

Intel’s silicon and software portfolio empowers our customers’ intelligent services from the cloud to the edge.

 

Notices and Disclaimers

15X more INT8 compute performance than today’s Stratix 10 MX for AI workloads:  When implementing INT8 computations using the standard Stratix 10 DSP Block, there are 2 multipliers and 2 accumulators used. On the other hand, when using the AI Tensor Block, you have 30 multipliers and 30 accumulators. Therefore 60/4 provides up to 15X more INT8 compute performance when comparing the AI Tensor Block with the standard Stratix 10 DSP block.

BERT 2.3X faster, LSTM 10X faster, ResNet50 3.8X faster: BERT batch 1 performance 2.3X faster than Nvidia V100 (DGX-1 server w/ 1x NVIDIA V100-SXM2-16GB | TensorRT 7.0 | Batch Size = 1 | 20.03-py3 | Precision: Mixed | Dataset: Sample Text);  LSTM batch 1 performance 9.5X faster than Nvidia V100 (Internal server w/Intel® Xeon® CPU E5-2683 v3 and 1x NVIDIA V100-PCIE-16GB | TensorRT 7.0 | Batch Size = 1 | 20.01-py3 | Precision: FP16 | Dataset: Synthetic);  ResNet50 batch 1 performance 3.8X faster than Nvidia V100 (DGX-1 server w/ 1x NVIDIA V100-SXM2-16GB | TensorRT 7.0 | Batch Size = 1 | 20.03-py3 | Precision: INT8 | Dataset: Synthetic).  Estimated on Stratix 10 NX FPGA using -1 speed grade, tested in May 2020.

Each end-to-end AI model includes all layers and computation as described in Nvidia’s published claims as of May 2020. Result is then compared against Nvidia’s published claims. Link for Nvidia: https://developer.nvidia.com/deep-learning-performance-training-inference. Results have been estimated or simulated using internal Intel analysis, architecture simulation, and modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.

No product or component can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks .

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.   For more complete information visit http://www.intel.com/benchmarks .

Intel Advanced Vector Extensions (Intel AVX) provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings.  Circumstances will vary.  Intel does not guarantee any costs or cost reduction.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.

Steven Leibson

About Steven Leibson

Be sure to add the Intel Logic and Power Group to your LinkedIn groups. Steve Leibson is a Senior Content Manager at Intel. He started his career as a system design engineer at HP in the early days of desktop computing, then switched to EDA at Cadnetix, and subsequently became a technical editor for EDN Magazine. He’s served as Editor in Chief of EDN Magazine and Microprocessor Report and was the founding editor of Wind River’s Embedded Developers Journal. He has extensive design and marketing experience in computing, microprocessors, microcontrollers, embedded systems design, design IP, EDA, and programmable logic.