WaveNet Neural Network runs on Intel® Stratix® 10 NX FPGA, synthesizes 256 16 kHz audio streams in real time

sleibson · ‎11-09-2020

State-of-the-art text-to-speech (TTS) synthesis systems generally employ two neural network models that run sequentially to generate audio. The first model generates acoustic features such as spectrograms from input text. The second model, a vocoder, takes intermediate features from the first model and produces speech. Tacotron 2 is often used as the first model. A new White Paper from Myrtle.ai titled “Implementing WaveNet Using Intel® Stratix® 10 NX FPGA for Real-Time Speech Synthesis” focuses on the second model, a state-of-the-art vocoder based on a neural network model called WaveNet, which produces natural-sounding speech with near-human fidelity.
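As a rough illustration of the two-stage structure described above, here is a minimal Python sketch. It is not Tacotron 2, WaveNet, or anything from the Myrtle.ai White Paper. The function names, the 80-value feature frames, and the hop of 200 audio samples per frame are all hypothetical placeholders, chosen only to show how data flows from text to features to audio.

```python
SAMPLE_RATE_HZ = 16_000   # output sample rate quoted in this article
HOP_SAMPLES = 200         # hypothetical: audio samples produced per feature frame
N_FEATURES = 80           # hypothetical: feature values per frame


def acoustic_model(text):
    """Stage 1 stand-in: return one zero-filled feature frame per character."""
    return [[0.0] * N_FEATURES for _ in text]


def vocoder(frames):
    """Stage 2 stand-in: emit HOP_SAMPLES silent samples per feature frame."""
    return [0.0 for _ in frames for _ in range(HOP_SAMPLES)]


def synthesize(text):
    """Run the two stages sequentially: text -> features -> audio samples."""
    return vocoder(acoustic_model(text))


audio = synthesize("hello")
duration_s = len(audio) / SAMPLE_RATE_HZ
```

In a real system each stand-in would be a trained neural network, and the vocoder stage is the part the White Paper addresses.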

The key to the WaveNet model’s high speech quality is an autoregressive loop, in which each generated audio sample is fed back as an input for generating the next one. This property also makes the network exceptionally challenging to implement for real-time applications, and efforts to accelerate WaveNet models generally have not achieved real-time audio synthesis. The Myrtle.ai White Paper describes a WaveNet implementation on an Intel® Stratix® 10 NX FPGA. By using Block Floating Point (BFP16) quantization, which the Intel Stratix 10 NX FPGA supports, Myrtle.ai has been able to deploy a WaveNet model that synthesizes 256 16 kHz audio streams in real time.
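For readers unfamiliar with block floating point, the general idea is that a block of values shares a single exponent while each value keeps only a small integer mantissa. The Python sketch below illustrates that idea only. The 8-bit signed mantissa, the block size, and the function names are assumptions made for illustration, and the code does not describe the exact BFP16 format of the Intel Stratix 10 NX FPGA or the quantization scheme used by Myrtle.ai.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Quantize floats to one shared exponent plus signed integer mantissas."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp returns (m, e) with max_abs == m * 2**e and 0.5 <= m < 1.
    _, e = math.frexp(max_abs)
    # Choose the exponent so the largest value fills the mantissa range.
    shared_exp = e - (mantissa_bits - 1)
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = []
    for x in block:
        q = round(math.ldexp(x, -shared_exp))    # x / 2**shared_exp
        mantissas.append(max(-limit, min(limit, q)))  # clamp to signed range
    return shared_exp, mantissas


def bfp_dequantize(shared_exp, mantissas):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    return [math.ldexp(m, shared_exp) for m in mantissas]


exp, mants = bfp_quantize([0.5, -0.25, 0.125, 0.75])
restored = bfp_dequantize(exp, mants)
```

Because the exponent is stored once per block, multiply-accumulate operations inside a block reduce to integer arithmetic on the mantissas. The trade-off is that small values sharing a block with a much larger value lose precision.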

For more details and to download the White Paper, click here.

To see a video demo of this system in action, click here.

Notices & Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.