June 17, 2022
Editor’s Note: This content is republished from the MicroZed Chronicles, with permission from the author.
One of the great things about FPGAs is their parallel structure and how we can accelerate algorithms by exploiting the parallel nature of programmable logic. A few weeks ago, we looked at the AXI Stream FIFO and how it could be used to communicate with AXI streaming devices.
In this blog, I am going to show how we can use an AXI Stream FIFO, DMA, and PYNQ to demonstrate the acceleration that is possible when implementing an FFT in the programmable logic compared to doing it in software on an A9 processor.
This is quite complex, so I have created a step-by-step guide, which is available here. We are going to complete the following steps to create this application:
This is what the finished block diagram looks like.
With the hardware implemented, we are able to build the bitstream. While the FPGA image builds, we can burn a PYNQ image to the SD card and start creating the Jupyter Notebook.
The PYNQ Notebook is going to download the overlay to the PYNQ-Z1. It will then create real and imaginary sample data of different sample lengths. This data will then be used to calculate the FFT in software using NumPy.
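The software side of this step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's actual code: the helper name `make_samples`, the test tone, and the list of FFT lengths are all assumptions made for the example.

```python
import numpy as np

# Assumed FFT lengths for illustration; the notebook's own sizes may differ.
FFT_SIZES = [64, 128, 256, 512, 1024, 2048, 4096]


def make_samples(n, tone_bin=4, seed=0):
    """Return (real, imag) sample arrays of length n: one tone plus light noise.

    make_samples is a hypothetical helper, not part of the overlay or PYNQ.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    real = np.cos(2 * np.pi * tone_bin * t / n) + 0.01 * rng.standard_normal(n)
    imag = np.sin(2 * np.pi * tone_bin * t / n) + 0.01 * rng.standard_normal(n)
    return real.astype(np.float32), imag.astype(np.float32)


# Software reference FFT for every length, computed with NumPy.
sw_results = {}
for n in FFT_SIZES:
    real, imag = make_samples(n)
    sw_results[n] = np.fft.fft(real + 1j * imag)
```

Because the test signal is a single complex tone, the largest magnitude in each result should sit in the bin selected by `tone_bin`, which gives a quick sanity check before comparing against the hardware output.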
Since the AXI Stream FIFO is used to control the FFT, the notebook also creates several functions that can be used to correctly send data to, and receive data from, the AXI Stream FIFO.
There are two FFT drivers provided with the overlay. The first driver does a copy to and from the DMA and handles buffer allocation etc. The second driver requires the buffers to be sized and allocated in advance, which reduces the time spent on the copying and buffer allocation / freeing associated with the first.
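The difference between the two styles can be sketched as follows. This is not the overlay's driver code: the class names and the `run_hw(in_buf, out_buf)` callable are hypothetical, and a NumPy FFT stands in for the DMA transfer so the example runs without a board. On a PYNQ board, `pynq.allocate` would supply the DMA-capable buffers; off-board, the sketch falls back to `np.empty`.

```python
import numpy as np

try:
    from pynq import allocate  # contiguous buffers on a PYNQ board
except ImportError:
    allocate = np.empty  # stand-in so the sketch runs off-board


class CopyingFFT:
    """First style (hypothetical): allocate and copy on every call."""

    def __init__(self, run_hw):
        self.run_hw = run_hw  # assumed callable: run_hw(in_buf, out_buf)

    def __call__(self, samples):
        in_buf = allocate(shape=samples.shape, dtype=np.complex64)
        out_buf = allocate(shape=samples.shape, dtype=np.complex64)
        in_buf[:] = samples              # copy into the transfer buffer
        self.run_hw(in_buf, out_buf)
        return np.array(out_buf)         # copy the result back out


class PreallocatedFFT:
    """Second style (hypothetical): buffers are sized once and reused."""

    def __init__(self, run_hw, n):
        self.run_hw = run_hw
        self.in_buf = allocate(shape=(n,), dtype=np.complex64)
        self.out_buf = allocate(shape=(n,), dtype=np.complex64)

    def __call__(self):
        # The caller fills self.in_buf in place; nothing is allocated here.
        self.run_hw(self.in_buf, self.out_buf)
        return self.out_buf


def software_stand_in(in_buf, out_buf):
    """Placeholder for the hardware transfer: a NumPy FFT."""
    out_buf[:] = np.fft.fft(in_buf)
```

In the second style the per-call work is reduced to the transfer itself, which is why it is the fairer way to measure the accelerator.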
The notebook then uses the first method to run through seven different FFT sizes, running each one 100 times, to calculate the difference in performance between the HW implementation and the SW implementation. This information is then plotted and can be seen below. With the inefficient transfer of data, the SW FFT running on the A9 processor cores performs much better than the FFT IP core.
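A timing comparison of this kind can be sketched as below. The function names are hypothetical and the harness only assumes that each FFT implementation is a callable taking an array of samples; on the board, `hw_fft` would be the overlay's driver, while here any callable will do.

```python
import time

import numpy as np


def time_fft(fft_func, samples, runs=100):
    """Return the mean time in seconds of fft_func(samples) over `runs` calls."""
    start = time.perf_counter()
    for _ in range(runs):
        fft_func(samples)
    return (time.perf_counter() - start) / runs


def compare(sizes, sw_fft, hw_fft, runs=100):
    """Time both implementations at each size.

    Returns a list of (size, sw_seconds, hw_seconds, sw_over_hw) tuples;
    a ratio above 1 means the hardware path was faster.
    """
    rows = []
    for n in sizes:
        samples = (np.random.rand(n) + 1j * np.random.rand(n)).astype(np.complex64)
        sw = time_fft(sw_fft, samples, runs)
        hw = time_fft(hw_fft, samples, runs)
        rows.append((n, sw, hw, sw / hw))
    return rows
```

Averaging over many runs smooths out scheduling jitter on the processor, and reporting the ratio per size makes it easy to see where the crossover between SW and HW falls.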
However, the next stage of the notebook uses the more efficient transfer method, which sizes the buffers correctly first. When the notebook is run this way, it is clear that the FFT IP in the PL significantly outperforms the SW FFT, as would be expected.
This simple experiment shows several things:
If you want to try this on the Arty Z7-20 or PYNQ-Z1, follow along with the slides to rebuild the design, and download the bit file from here. Happy experimenting!