MicroZed Chronicles: Signal Processing, FFT, and PYNQ

June 17, 2022


Editor’s Note: This content is republished from the MicroZed Chronicles, with permission from the author.

 

One of the great things about FPGAs is their parallel structure, which lets us accelerate algorithms by exploiting the parallel nature of programmable logic. A few weeks ago, we looked at the AXI Stream FIFO and how it could be used to communicate with AXI streaming devices.

In this blog, I am going to show how we can use an AXI Stream FIFO, DMA, and PYNQ to demonstrate the acceleration that is possible when implementing an FFT in the programmable logic compared to doing it in software on an Arm Cortex-A9 processor.

This is quite complex, so I have created a step-by-step guide, which is available here. We are going to work through the following steps to create this application:

  1. Add a Zynq PS block and configure it for the PYNQ-Z1/Z2
  2. Instantiate an FFT in the programmable logic
  3. Instantiate a DMA in the programmable logic
  4. Connect the stream Master and Slave interfaces of the FFT to the DMA. This enables us to insert samples and receive processed data.
  5. Instantiate an AXI Stream FIFO and connect it to the FFT Stream Config Input 
  6. Instantiate an AXI Timer and connect to the AXI GP Bus
  7. Instantiate an AXI Interrupt Controller and connect to the fabric-to-processor interrupt on the Zynq PS block

 

This is what the finished block diagram looks like.

MZ_448_AXI_Block_Diagram

With the hardware design complete, we can build the bitstream. While the FPGA image builds, we can burn a PYNQ image to the SD card and start creating the Jupyter Notebook.

The PYNQ Notebook is going to download the overlay to the PYNQ-Z1. It will then create real and imaginary sample data of different sample lengths. This data will then be used to calculate the FFT in software using NumPy.

MZ_448_FFT
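To make the software side concrete, here is a minimal sketch of generating complex test data at several lengths and running NumPy's FFT over it. The helper name `make_samples`, the test tone, and the list of sizes are assumptions for illustration; the notebook's actual data and sizes may differ.

```python
import numpy as np

def make_samples(n, seed=0):
    """Return n complex64 samples: a unit tone on bin 8 plus a little noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    tone = np.exp(2j * np.pi * 8 * t / n)
    noise = 0.01 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
    return (tone + noise).astype(np.complex64)

# Seven power-of-two lengths (assumed; chosen only to illustrate the idea).
SIZES = [2 ** k for k in range(6, 13)]  # 64 up to 4096

# Software reference FFT for every length.
sw_results = {n: np.fft.fft(make_samples(n)) for n in SIZES}
```

Because the tone sits exactly on a bin, the largest magnitude in each result lands on index 8, which gives a quick sanity check before comparing against hardware output.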

Since the AXI Stream FIFO is used to control the FFT, the notebook also creates several functions that can be used to send data to and from the AXI Stream FIFO correctly. 
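As a rough illustration of what such a helper might look like, the sketch below pushes a packet of 32-bit words through the transmit side of an AXI4-Stream FIFO. The register offsets follow the AXI4-Stream FIFO register map; the function name `fifo_send_words`, the `FakeMMIO` stand-in, and the placeholder config word are assumptions. The real FFT config word depends on how the FFT IP is configured, so it is not reproduced here.

```python
# Register offsets from the AXI4-Stream FIFO register map.
ISR = 0x00   # interrupt status register
TDFV = 0x0C  # transmit data FIFO vacancy (free 32-bit locations)
TDFD = 0x10  # transmit data FIFO write port
TLR = 0x14   # transmit length in bytes; writing it sends the packet

def fifo_send_words(mmio, words):
    """Send 32-bit words as one AXI Stream packet.

    `mmio` is any object with read(offset) and write(offset, value),
    which is the interface PYNQ's MMIO class exposes.
    """
    mmio.write(ISR, 0xFFFFFFFF)  # clear stale status bits
    if mmio.read(TDFV) < len(words):
        raise RuntimeError("not enough room in the transmit FIFO")
    for w in words:
        mmio.write(TDFD, w & 0xFFFFFFFF)
    mmio.write(TLR, 4 * len(words))  # length is given in bytes

class FakeMMIO:
    """Stand-in register window for a dry run without hardware."""
    def __init__(self, vacancy=16):
        self.vacancy = vacancy
        self.writes = []
    def read(self, offset):
        return self.vacancy if offset == TDFV else 0
    def write(self, offset, value):
        self.writes.append((offset, value))

fake = FakeMMIO()
fifo_send_words(fake, [0x00000001])  # placeholder word, not a real FFT setting
```

On a board, the same function could be handed the MMIO object for the FIFO instance in the overlay in place of `FakeMMIO`.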

There are two FFT drivers provided with the overlay. The first driver copies data to and from the DMA buffers and handles buffer allocation itself. The second driver requires the buffers to be sized and allocated in advance, which removes the time the first driver spends on copying and on allocating and freeing buffers.
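The difference between the two styles can be sketched as follows. This is not the overlay's driver code; the function names are hypothetical. It assumes a DMA object with `sendchannel` and `recvchannel` exposing `transfer()` and `wait()`, as PYNQ's DMA driver does, and an `allocate` callable with the same keywords as `pynq.allocate`, passed in as a parameter so the sketch can be read without a board attached.

```python
import numpy as np

def _run_dma(dma, in_buf, out_buf):
    """Arm the receive channel, send the samples, wait for both to finish."""
    dma.recvchannel.transfer(out_buf)
    dma.sendchannel.transfer(in_buf)
    dma.sendchannel.wait()
    dma.recvchannel.wait()

def fft_copy_each_call(dma, allocate, samples):
    """First style: allocate, copy in, transfer, copy out, free, every call."""
    in_buf = allocate(shape=samples.shape, dtype=np.complex64)
    out_buf = allocate(shape=samples.shape, dtype=np.complex64)
    in_buf[:] = samples              # copy into DMA-visible memory
    _run_dma(dma, in_buf, out_buf)
    result = np.array(out_buf)       # copy back out before freeing
    in_buf.freebuffer()
    out_buf.freebuffer()
    return result

def fft_preallocated(dma, in_buf, out_buf):
    """Second style: the caller owns correctly sized buffers and reuses them."""
    _run_dma(dma, in_buf, out_buf)
    return out_buf
```

With the second style, the allocation and copy costs are paid once outside the timed loop, so each call is little more than the DMA transfer itself.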

The notebook then uses the first method to run through seven different FFT sizes, running each one 100 times, to measure the difference in performance between the HW and SW implementations. The results are plotted below. As can be seen, with the inefficient transfer of data, the A9 processor cores perform much better than the FFT IP core.

MZ_448_Python_with_Xilinx_PYNQ_FPGA_Performance
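A timing loop of this kind can be sketched with the standard `timeit` module. The function names `mean_time` and `compare` are assumptions, and in the dry run at the bottom NumPy stands in for both sides, so the numbers it produces say nothing about real hardware performance.

```python
import timeit
import numpy as np

def mean_time(fn, data, runs=100):
    """Average wall-clock seconds per call of fn(data) over `runs` calls."""
    return timeit.timeit(lambda: fn(data), number=runs) / runs

def compare(sizes, sw_fft, hw_fft, runs=100):
    """Return (size, sw_seconds, hw_seconds) for every FFT size."""
    rows = []
    for n in sizes:
        data = np.ones(n, dtype=np.complex64)
        rows.append((n,
                     mean_time(sw_fft, data, runs),
                     mean_time(hw_fft, data, runs)))
    return rows

# Dry run: NumPy plays both roles. On a board, hw_fft would be a
# callable that wraps one of the overlay's FFT drivers.
rows = compare([64, 128, 256], np.fft.fft, np.fft.fft, runs=10)
for n, sw, hw in rows:
    print(f"N={n:5d}  sw={sw * 1e6:8.1f} us  hw={hw * 1e6:8.1f} us")
```

Averaging over many runs smooths out scheduler and cache noise, which matters for the small sizes where a single call takes only microseconds.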

However, the next stage of the notebook uses the more efficient transfer, which sizes the buffers correctly up front. When the notebook is run this way, it is clear that the FFT IP in the PL significantly outperforms the SW FFT, as would be expected.

MZ_448_Python_with_Xilinx_PYNQ_FPGA_FFT_Performance

This simple experiment shows several things:

  1. The major performance improvement comes from efficient data transfer between the PS and PL. Correct, efficient drivers for this are critical.
  2. PYNQ enables rapid prototyping to ensure your algorithms and drivers provide the performance required or expected.
  3. PYNQ enables visualization of the results and also enables real-world data to be inserted very easily into the processing chain.
  4. We can use the notebook as a pointer for SW application development in PetaLinux as required.

If you want to try this on the Arty Z7-20 or PYNQ-Z1, follow along with the slides to rebuild the design, and download the bit file from here. Happy experimenting!