Cygnus Supercomputer Simulates the Early Universe with an FPGA Acceleration Boost


The Center for Computational Sciences (CCS) at the University of Tsukuba in Japan has developed multiple generations of leading-edge, highly parallel supercomputer systems over the last three decades. CCS’s mission is to promote scientific discovery by providing fast computing resources to researchers at several Japanese universities. The latest machine, dubbed Cygnus, is CCS’s tenth-generation supercomputer, and it is just now coming online. Cygnus will be used for many types of research, including simulations of the early universe.

Cygnus consists of 80 computing nodes, which come in two flavors: Deneb and Albireo. Astronomically speaking, Deneb, designated Alpha Cygni, is the brightest single star in the constellation Cygnus (the Swan). Albireo, designated Beta Cygni, is a double star in the same constellation.

A Deneb computing node consists of two Intel® Xeon® Gold CPUs and four GPU cards connected through two PCIe network switches, as shown in Figure 1. Each node connects to a switched InfiniBand network through four 100-Gbps InfiniBand Host Channel Adapters (HCAs).


Figure 1: Cygnus Supercomputer Deneb Computing Node


An Albireo computing node adds two BittWare 520N Network Accelerator Cards, with one FPGA card attached to each PCIe switch as shown in Figure 2.


Figure 2: Cygnus Supercomputer Albireo Node


Each BittWare 520N Network Accelerator Card, shown in Figure 3, is based on an Intel® Stratix® 10 FPGA that’s connected to four 8-Gbyte banks of DDR4-2400 SDRAM for local storage. Four on-board QSFP28 optical module cages, each capable of operating at up to 100 Gbps, provide inter-FPGA communications among the BittWare 520N accelerator cards.


Figure 3: BittWare 520N Network Accelerator Card (based on an Intel Stratix 10 FPGA)


Inside the Cygnus supercomputer, the QSFP28 optical cages on the 64 BittWare 520N accelerator cards in the 32 Albireo nodes are connected as a 2D, 8×8-card torus, as shown in Figure 4, allowing the FPGAs on the BittWare cards to communicate with each other at high speed.
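To make the torus idea concrete, here is a minimal Python sketch of neighbor lookup and hop counting on an 8×8 torus. The row-major card numbering (0 through 63) and the direction names are assumptions made for illustration; the article does not describe how CCS actually numbers or cables the cards.

```python
# Illustrative sketch only. The row-major card IDs (0..63) and the
# north/south/west/east labels are assumptions, not the Cygnus cabling plan.
ROWS, COLS = 8, 8  # 64 FPGA cards arranged as an 8x8 2D torus


def torus_neighbors(card_id, rows=ROWS, cols=COLS):
    """Return the four directly connected neighbors of a card."""
    r, c = divmod(card_id, cols)
    return {
        "north": ((r - 1) % rows) * cols + c,  # wraps from top row to bottom row
        "south": ((r + 1) % rows) * cols + c,  # wraps from bottom row to top row
        "west": r * cols + (c - 1) % cols,     # wraps from first column to last
        "east": r * cols + (c + 1) % cols,     # wraps from last column to first
    }


def torus_hops(a, b, rows=ROWS, cols=COLS):
    """Minimum number of links between two cards, using wraparound links."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    dr = abs(ra - rb)
    dc = abs(ca - cb)
    return min(dr, rows - dr) + min(dc, cols - dc)


if __name__ == "__main__":
    print(torus_neighbors(0))
    print(torus_hops(0, 63))
```

In a torus of this size, every card has exactly four neighbors, which lines up with the four QSFP28 cages on each card, and no two cards are more than eight hops apart thanks to the wraparound links.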


Figure 4: Cygnus 2D Torus Configuration for FPGA Interconnect


The GPU cards co-located with the BittWare FPGA cards in the Albireo nodes can use DMA to move data directly to and from the FPGA cards through the PCIe switches without CPU intervention, which cuts data latency by a factor of ten. In addition to performing high-speed computation, the BittWare cards act as communications elements within the 32 Albireo nodes. CCS calls this configuration the “Accelerator in Switch” (AiS) concept.

CCS has established that the FPGAs in the Albireo nodes outperform the GPUs in certain applications. For example, simulations of the early universe are based on a program called Accelerated Radiative transfer on Grids using Oct-Tree (ARGOT). One of the key ARGOT algorithms, called Authentic Radiation Transfer (ART), is a ray-tracing algorithm that performs its computations by dividing space into a mesh. More than 90 percent of the ARGOT program’s computational time is spent running the ART algorithm, so accelerating ART in hardware greatly benefits the program as a whole.
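For readers unfamiliar with mesh-based ray tracing, the following Python sketch shows the general flavor of the technique: march a ray through a cubic mesh and accumulate optical depth from the cells it passes through. This is not the ARGOT or ART code. The function name, the fixed-step sampling, and the uniform test mesh are all assumptions chosen to keep the example short.

```python
import math


def optical_depth_along_ray(opacity, origin, direction, step=0.25):
    """Accumulate optical depth along one ray through an n x n x n mesh.

    Hypothetical helper for illustration only (not part of ARGOT/ART).
    opacity   -- nested list, opacity[i][j][k] for unit-sized cells
    origin    -- (x, y, z) starting point in mesh coordinates
    direction -- (dx, dy, dz); normalized internally
    step      -- fixed sampling distance along the ray
    """
    n = len(opacity)
    norm = math.sqrt(sum(d * d for d in direction))
    ux, uy, uz = (d / norm for d in direction)
    x, y, z = origin
    tau = 0.0
    # Sample the mesh at fixed intervals until the ray leaves the volume.
    while 0 <= x < n and 0 <= y < n and 0 <= z < n:
        i, j, k = int(x), int(y), int(z)
        tau += opacity[i][j][k] * step
        x += ux * step
        y += uy * step
        z += uz * step
    return tau


if __name__ == "__main__":
    n = 4
    uniform = [[[0.5] * n for _ in range(n)] for _ in range(n)]
    tau = optical_depth_along_ray(uniform, (0.0, 0.5, 0.5), (1.0, 0.0, 0.0))
    print("optical depth:", tau)
    print("transmitted fraction:", math.exp(-tau))
```

Every ray repeats the same small loop over many cells, and a simulation traces a very large number of rays, which helps explain why this kind of kernel can dominate run time and why it is a natural candidate for hardware acceleration.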

Earlier experiments with BittWare A10PL4 PCIe FPGA Boards, which are based on Intel® Arria® 10 FPGAs, demonstrated that the FPGA-based implementation was much faster than a GPU implementation for small meshes (sizes 16³ and 32³) and about as fast as GPUs for larger meshes (sizes 64³ and 128³). The research team expects that the BittWare 520N cards based on Intel Stratix 10 FPGAs will run the ART algorithm much more quickly because the Intel Stratix 10 FPGA runs faster than the Intel Arria 10 FPGA. In addition, the Intel Stratix 10 FPGA has significantly more computational resources including DSPs and M20K memory blocks to harness for computation.
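The mesh sizes quoted above are cubes, so the amount of data grows quickly from one size to the next. The short Python sketch below simply works out the cell counts; the helper name is made up for this example, and the numbers say nothing about the actual memory layout used by ARGOT.

```python
# Cell counts for cubic meshes with edge lengths 16, 32, 64, and 128.
# cell_count is a hypothetical helper used only for this illustration.
MESH_EDGES = [16, 32, 64, 128]


def cell_count(edge):
    """Number of cells in a cubic mesh with the given edge length."""
    return edge ** 3


if __name__ == "__main__":
    for edge in MESH_EDGES:
        print(f"{edge}^3 mesh: {cell_count(edge):,} cells")
    growth = cell_count(128) // cell_count(16)
    print(f"The largest mesh has {growth}x the cells of the smallest")
```

Each doubling of the edge length multiplies the cell count by eight, so the largest mesh in the experiments holds 512 times as many cells as the smallest.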

The ART ray-tracing algorithm has been written in OpenCL to make full use of the high-level synthesis (HLS) capabilities provided in the Intel® FPGA SDK for OpenCL development environment and Intel® Quartus Prime design software for the Intel Stratix 10 and Intel Arria 10 FPGAs.


Steven Leibson

About Steven Leibson

Steve Leibson is a Senior Content Manager at Intel. He started his career as a system design engineer at HP in the early days of desktop computing, then switched to EDA at Cadnetix, and subsequently became a technical editor for EDN Magazine. He’s served as Editor in Chief of EDN Magazine and Microprocessor Report and was the founding editor of Wind River’s Embedded Developers Journal. He has extensive design and marketing experience in computing, microprocessors, microcontrollers, embedded systems design, design IP, EDA, and programmable logic.