High-Level Synthesis of Neural Networks on FPGAs

Many modern applications that perform classification, prediction or clustering rely on Neural Networks (NNs). These networks are often deployed in data centres or on mobile devices, where high performance and energy efficiency are crucial. Thanks to their massive parallelism, NNs run much faster on GPUs than on CPUs. Nevertheless, GPUs are limited by their fixed architecture, which matters in a field that is changing quickly and demands fast innovation.

Field-Programmable Gate Arrays (FPGAs) offer more flexibility: the architecture can be specialised to the specific neural network application in order to achieve the best performance. Furthermore, they are highly energy-efficient and therefore well suited for NNs in embedded systems and large-scale server clusters.

However, deploying neural networks on FPGAs with high performance in mind is, like programming parallel accelerators in general, a complex task. Their flexible architecture gives FPGAs countless options for tuning performance, but in return they require substantial hardware-specific expertise and costly development time to configure properly. Rather than manual development, this process should be automated to speed up development and drive down costs. Furthermore, developers do not want to manually adapt their implementations for each accelerator (e.g. CPU, GPU, FPGA); instead, performance portability is desirable.

Lift addresses these challenges with a high-level, functional, data-parallel language that lets the user develop an application independently of the target hardware platform. Rewrite rules in the Lift compiler then open a vast design space of possible implementations for this abstract specification. This design space is explored to find a suitable solution that satisfies the performance and energy requirements. To exploit the parallel structure of NNs, the compiler employs pipelining and allocates distributed on-chip memory on the FPGA. Timing behaviour and scheduling are introduced until, finally, hardware description language (HDL) code is emitted, which can be used to generate the bitstream for the FPGA.
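To illustrate the programming model, the sketch below (plain Scala, not the actual Lift API) shows how a fully-connected layer reduces to a composition of zip, map and reduce patterns; it is this kind of pattern structure that the Lift compiler rewrites into pipelined, FPGA-specific implementations. Names such as `dot` and `dense` are purely illustrative.

```scala
// A minimal sketch, assuming a layer is written as data-parallel patterns.
// This is ordinary Scala, used only to convey the functional style; the
// real Lift language expresses the same zip/map/reduce structure with its
// own primitives, which the compiler then lowers to hardware.
object DenseLayerSketch {
  // Dot product as zip -> map -> reduce: the pattern shape the compiler exploits.
  def dot(xs: Vector[Float], ws: Vector[Float]): Float =
    xs.zip(ws).map { case (x, w) => x * w }.sum

  // A fully-connected layer: one dot product per output neuron, plus a bias.
  def dense(input: Vector[Float],
            weights: Vector[Vector[Float]],
            bias: Vector[Float]): Vector[Float] =
    weights.zip(bias).map { case (row, b) => dot(input, row) + b }

  def main(args: Array[String]): Unit = {
    val input   = Vector(1.0f, 2.0f, 3.0f)
    val weights = Vector(Vector(0.1f, 0.2f, 0.3f),
                         Vector(0.4f, 0.5f, 0.6f))
    val bias    = Vector(0.0f, 1.0f)
    println(dense(input, weights, bias)) // Vector(1.4, 4.2)
  }
}
```

Because the layer is stated purely in terms of these patterns, with no reference to threads, memories or clocks, the same specification can be mapped to a CPU, a GPU or an FPGA pipeline by the compiler rather than by the developer.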

Supported by:

[Microsoft Research](https://www.microsoft.com/en-us/research/)
[CIFAR AI Chair](https://cifar.ca/ai/canada-cifar-ai-chairs/)
[NSERC](https://www.nserc-crsng.gc.ca/)

Christophe Dubach
Associate Professor
Canada CIFAR AI Chair, Mila

My research interests include data-parallel language design and implementation, high-level code generation and optimisation for parallel hardware (e.g. GPUs, FPGAs), architecture design space exploration, and the use of machine-learning techniques applied to all these topics.

Christof Schlaak
PhD student (Edinburgh University)
Tzung-Han Juang
PhD student (McGill University)
Hamza Javed
PhD student (McGill University)
Ayan Chakraborty
Summer 2021 McGill intern (visiting from IIT Kharagpur)
Jiaxuan Cai
Summer 2021 McGill intern (visiting from Chongqing University)
Andrej Ivanis
MSc student 2018-2019 (Edinburgh University)
Martin Kristien
BSc student 2016-2017 (Edinburgh University)