Migrating to a New Target Platform
This migration guide is intended for users who need to migrate their accelerated SDAccel™ environment application from one target platform to another. For example, moving an application from a Virtex® UltraScale+™ VCU1525 Acceleration Development Board to a U200 Acceleration Development Board.
The following topics are addressed as part of this guide:
- An overview of the Design Migration Process including the physical aspects of FPGA devices.
- Any changes to the host code and design constraints if a new release is used.
- Controlling kernel placements and DDR interface connections.
- Timing issues in the new shell which might require additional options to achieve performance.
Design Migration
When migrating an application implemented in one target platform to another, it is important to understand the differences between the target platforms, and the impact those differences have on the design.
Key considerations:
- Is there a change in the release?
- Does the new target platform contain a different shell?
- Do the kernels need to be redistributed across the Super Logic Regions (SLRs)?
- Does the design meet the required frequency (timing) performance in the new platform?
The following diagram summarizes the migration flow described in this guide, and the topics to consider during the migration process.
Understanding an FPGA Architecture
Before migrating any design to a new target platform, you should have a fundamental understanding of the FPGA architecture. The following diagram shows the floorplan of a Xilinx® FPGA device. The concepts to understand are:
- SSI Devices
- SLRs
- SLR routing resources
- Memory interfaces
Stacked Silicon Interconnect Devices
An SSI device is one in which multiple silicon dies are connected together via silicon interconnect and packaged into a single device. An SSI device enables high-bandwidth connectivity between multiple dies by providing a much greater number of connections. It also imposes much lower latency and consumes dramatically less power than either a multiple-FPGA or a multi-chip module approach, while enabling the integration of massive quantities of interconnect logic, transceivers, and on-chip resources within a single package. The advantages of SSI devices are detailed in Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough FPGA Capacity, Bandwidth, and Power Efficiency.
Super Logic Region
An SLR is a single FPGA die slice contained in an SSI device. Multiple SLR components are assembled to make up an SSI device. Each SLR contains the active circuitry common to most Xilinx FPGA devices. This circuitry includes large numbers of:
- LUTs
- Registers
- I/O Components
- Gigabit Transceivers
- Block Memory
- DSP Blocks
One or more kernels may be implemented within an SLR. A single kernel may not be implemented across multiple SLRs.
SLR Routing Resources
The custom hardware implemented on the FPGA is connected via on-chip routing resources. There are two types of routing resources in an SSI device:
- Intra-SLR Resources
- Intra-SLR routing resources are the fast routing resources used to connect the hardware logic. The SDAccel environment automatically uses the optimal resources to connect the hardware elements when implementing kernels.
- Super Long Line (SLL) Resources
- SLLs are routing resources running between SLRs, used to connect logic from one region to the next. These routing resources are slower than intra-SLR routes. However, when a kernel is placed in one SLR and the DDR it connects to is in another, the SDAccel environment automatically implements dedicated hardware to use SLL routing resources without any impact on performance. More details on managing placement are provided in Modifying Kernel Placement.
Memory Interfaces
Each SLR contains one or more memory interfaces. These memory interfaces are used to connect to the DDR memory where the data in the host buffers is copied before kernel execution. Each kernel will read data from the DDR memory and write the results back to the same DDR memory. The memory interface connects to the pins on the FPGA and includes the memory controller logic.
Understanding Shells
In the SDAccel development environment, a shell is the hardware design that is implemented onto the FPGA before any custom logic, or accelerators are added. The shell defines the attributes of the FPGA used in the target platform and is composed of two regions:
- Static region which contains kernel and device management logic.
- Dynamic region where the custom logic of the accelerated kernels is placed.
The static region cannot be modified by the user and contains the logic required to operate the FPGA and transfer data to and from the dynamic region. The static region, shown above in gray, might exist within a single SLR, or, as in the above example, might span multiple SLRs. The static region contains:
- DDR memory interface controllers
- PCIe® interface logic
- XDMA logic
- Firewall logic, etc.
The dynamic region is the area shown in white above. This region contains all the reconfigurable components of the shell and is the region where all the accelerator kernels are placed.
Because the static region consumes some of the hardware resources available on the device, the custom logic to be implemented in the dynamic region can only use the remaining resources. In the example shown above, the shell defines that all four DDR memory interfaces on the FPGA can be used. This will require resources for the memory controller used in the DDR interface.
Details on how much logic may be implemented in the dynamic region of each shell are provided in the SDx Environments Release Notes, Installation, and Licensing Guide. This topic is also addressed in Modifying Kernel Placement, later in this guide.
Migrating Releases
Before migrating to a new target platform, you should also determine whether you need to use a different release of the SDAccel environment with the new platform. If you do intend to target a new release, it is highly recommended to first target the existing platform using the new software release to confirm that no changes are required, and then migrate to the new target platform.
There are two steps to follow when targeting a new release with an existing platform:
- Host Code Migration
- Release Migration
Host Code Migration
In the 2018.3 release of the SDAccel environment there are some fundamental changes to how the Xilinx runtime (XRT) environment and shell(s) are installed. In previous releases, both the XRT environment and the shell(s) were automatically installed with the SDAccel environment. This has implications for the setup required to compile the host code.
Refer to the SDx Environments Release Notes, Installation, and Licensing Guide for details on the 2018.3 installation.
The XILINX_XRT environment variable is used to specify the location of the XRT installation and must be set before you compile the host code. After the XRT environment has been installed, the XILINX_XRT environment variable can be set by sourcing the /opt/xilinx/xrt/setup.csh or /opt/xilinx/xrt/setup.sh file, as appropriate. Also ensure that your LD_LIBRARY_PATH variable points to the XRT installation area.
To compile and run the host code, make sure you source the <SDX_INSTALL_DIR>/settings64.csh or <SDX_INSTALL_DIR>/settings64.sh file from the SDAccel installation.
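For example, a typical environment setup might look like the following (these are the default installation paths from above; adjust them for your system):

```
# Set up the XRT environment (default install location):
source /opt/xilinx/xrt/setup.sh

# Set up the SDAccel tools; <SDX_INSTALL_DIR> is your SDx installation directory:
source <SDX_INSTALL_DIR>/settings64.sh
```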
If you are using the GUI, it automatically incorporates the new XRT location and generates the makefile when you build your project. However, if you are using your own custom makefile, you need to make the following changes:
- In your makefile, do not use the XILINX_SDX environment variable that was used in prior releases.
- The XILINX_SDX variables and paths must be updated to the XILINX_XRT environment variable:
  - Include directories are now specified as: -I${XILINX_XRT}/include and -I${XILINX_XRT}/include/CL
  - The library path is now: -L${XILINX_XRT}/lib
  - The OpenCL™ library is libxilinxopencl.so, so use -lxilinxopencl in your makefile
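Putting these flags together, a minimal host compile line might look like the following sketch (the compiler choice, source file name, and output name are illustrative):

```
# Hypothetical host compile line using the XRT paths described above:
g++ -std=c++11 \
    -I${XILINX_XRT}/include -I${XILINX_XRT}/include/CL \
    -o host host.cpp \
    -L${XILINX_XRT}/lib -lxilinxopencl
```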
Release Migration
After migrating the host code, build the code on the existing target platform using the new release of the SDAccel development environment. Verify that you can run the project in the SDAccel environment using the new release, that it completes successfully, and that it meets the timing requirements.
Issues which can occur when using a new release are:
- Changes to C libraries or library files.
- Changes to kernel path names.
- Changes to the HLS pragmas or pragma options embedded in the kernel code.
- Changes to C/C++/OpenCL compiler support.
- Changes to the performance of kernels: this may require adjustments to the pragmas in the existing kernel code.
Address these issues using the same techniques you would use during the development of any kernel. At this stage, ensure the throughput performance of the target platform using the new release meets your requirements. If there are changes to the final timing (the maximum clock frequency), you can address these when you have moved to the new target platform. This is covered in Address Timing.
Modifying Kernel Placement
The primary issue when targeting a new platform is ensuring that an existing kernel placement will work in the new target platform. Each target platform has an FPGA defined by a shell. As shown in the figure below, the shell(s) can be different.
- The shell of the original platform on the left has four SLRs, and the static region is spread across all four SLRs.
- The shell of the target platform on the right has only three SLRs, and the static region is fully-contained in SLR1.
This section explains how to modify the placement of the kernels.
Implications of a New Hardware Platform
The figure below highlights the issue of kernel placement when migrating to a new target platform, or shell. In the example below:
- The existing kernel, kernel_B, is too large to fit into SLR1 of the new target platform because most of that SLR is consumed by the static region.
- The existing kernel, kernel_D, must be relocated to a new SLR because the new target platform does not have four SLRs like the existing platform.
When migrating to a new platform, you need to take the following actions:
- Understand the resources available in each SLR of the new target platform, as documented in the SDx Environments Release Notes, Installation, and Licensing Guide.
- Understand the resources required by each kernel in the design.
- Use the xocc linker options (--slr and --sp) to specify which SLR each kernel is placed in, and which DDR bank each kernel connects to.
These items are addressed in the remainder of this section.
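As a preview of the options covered in detail below, both can appear on a single xocc link command line. The following is a minimal sketch; the platform, kernel, SLR, and bank names are illustrative:

```
# Hypothetical link command placing a kernel and binding its DDR connection:
xocc -t hw --link --platform <platform_name> \
     --slr kernel_A:SLR0 \
     --sp kernel_A.arg1:bank0 \
     -o app.xclbin kernel_A.xo
```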
Determining Where to Place the Kernels
To determine where to place kernels, two pieces of information are required:
- Resources available in each SLR of the shell of the hardware platform (.dsa).
- Resources required for each kernel.
With these two pieces of information you will then determine which kernel or kernels can be placed in each SLR of the shell.
Keep in mind when performing these calculations that 10% of the available resources can be used by the system infrastructure:
- Infrastructure logic can be used to connect a kernel to a DDR interface if the connection has to cross an SLR boundary.
- In an FPGA, resources are also used for signal routing. It is never possible to use 100% of the available resources in an FPGA because signal routing also requires resources.
For example, in an SLR that provides 388K LUTs to the dynamic region, plan on roughly 349K LUTs (90%) being available to your kernels.
Available SLR Resources
The resources available in each SLR of the shells provided by Xilinx can be found in the SDx Environments Release Notes, Installation, and Licensing Guide. The figure below shows an example shell. In this example you can see:
- The SLR description indicates which SLR contains static and/or dynamic regions.
- The resources available in each SLR (LUTs, Registers, RAM, etc.) are listed.
This allows you to determine what resources are available in each SLR.
| Area | SLR 0 | SLR 1 | SLR 2 |
|---|---|---|---|
| SLR description | Bottom of device; dedicated to dynamic region. | Middle of device; shared by dynamic and static region resources. | Top of device; dedicated to dynamic region. |
| Dynamic region pblock name | pfa_top_i_dynamic_region_pblock_dynamic_SLR0 | pfa_top_i_dynamic_region_pblock_dynamic_SLR1 | pfa_top_i_dynamic_region_pblock_dynamic_SLR2 |
| Compute unit placement syntax | set_property CONFIG.SLR_ASSIGNMENTS SLR0 [get_bd_cells <cu_name>] | set_property CONFIG.SLR_ASSIGNMENTS SLR1 [get_bd_cells <cu_name>] | set_property CONFIG.SLR_ASSIGNMENTS SLR2 [get_bd_cells <cu_name>] |
| Global memory resources available in dynamic region | | | |
| Memory channels; system port name | bank0 (16 GB DDR4) | bank1 (16 GB DDR4, in static region); bank2 (16 GB DDR4, in dynamic region) | bank3 (16 GB DDR4) |
| Approximate available fabric resources in dynamic region | | | |
| CLB LUT | 388K | 199K | 388K |
| CLB Register | 776K | 399K | 776K |
| Block RAM Tile | 720 | 420 | 720 |
| UltraRAM | 320 | 160 | 320 |
| DSP | 2280 | 1320 | 2280 |
Kernel Resources
The resources for each kernel can be obtained from the System Estimate report.
The System Estimate report is available in the Assistant view after either the Hardware Emulation or System run is complete. An example of this report is shown below.
- FF refers to the CLB Registers noted in the platform resources for each SLR.
- LUT refers to the CLB LUTs noted in the platform resources for each SLR.
- DSP refers to the DSPs noted in the platform resources for each SLR.
- BRAM refers to the block RAM Tile noted in the platform resources for each SLR.
This information can help you determine the proper SLR assignments for each kernel.
Assigning Kernels to SLRs
Each kernel in a design can be assigned to an SLR region using the xocc --slr command line option. When placing kernels, it is recommended to also assign the specific DDR memory bank that the kernel connects to, using the xocc --sp command line option. The following example demonstrates these two command line options.
The figure below shows an example where the existing target platform shell has four SLRs, while the new target platform has a shell with three SLRs; the static region is also structured differently between the two target platforms. In this migration example:
- Kernel_A is mapped to SLR0.
- Kernel_B, which no longer fits in SLR1, is remapped to SLR0, where there are available resources.
- Kernel_C is mapped to SLR2.
- Kernel_D is remapped to SLR2, where there are available resources.
The kernel mappings are illustrated in the figure below.
Specifying Kernel Placement
The following xocc command line options specify the SLR placement of each kernel:
xocc --slr kernel_A:SLR0 \
--slr kernel_B:SLR0 \
--slr kernel_C:SLR2 \
--slr kernel_D:SLR2
With these command line options, each of the kernels is placed as shown in the figure above.
Specifying Kernel DDR Interfaces
You should also specify the kernel DDR memory interface when specifying kernel placements. Specifying the DDR interface ensures automatic pipelining of kernel connections to a DDR interface in a different SLR, preventing the timing degradation that could otherwise reduce the maximum clock frequency.
In this example, using the kernel placements in the above figure:
- Kernel_A is connected to Memory Bank 0.
- Kernel_B is connected to Memory Bank 1.
- Kernel_C is connected to Memory Bank 2.
- Kernel_D is connected to Memory Bank 1.
The following xocc command line performs these connections:
xocc --sp kernel_A.arg1:bank0 \
--sp kernel_B.arg1:bank1 \
--sp kernel_C.arg1:bank2 \
--sp kernel_D.arg1:bank1
When using the --sp option to assign kernel ports to memory banks, you must specify the --sp option for all interfaces/ports of the kernel. Refer to "Customization of DDR Bank to Kernel Connection" in the SDAccel Environment Programmers Guide for more information.
Address Timing
Perform a system run; if it completes with no timing violations, then the migration is successful.
If timing has not been met, you might need to specify some custom constraints to help meet timing. Refer to the UltraFast Design Methodology Guide for the Vivado Design Suite (UG949) for more information on meeting timing.
Custom Constraints
Custom placement and timing constraints are passed to the Vivado® tools using the xocc --xp option. Custom Tcl constraints for floorplanning of the kernels will need to be reviewed in the context of the new target platform (.dsa). For example, if a kernel was moved to a different SLR in the new shell, the corresponding placement constraints for that kernel will also need to be modified.
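As an illustration, a custom Tcl constraints file might be attached to an implementation step in a form like the following; the vivado_prop property name and file path follow typical --xp usage and should be treated as an assumption to verify against the SDAccel documentation for your release:

```
# Hypothetical example: run a custom constraints script before opt_design
xocc --xp "vivado_prop:run.impl_1.STEPS.OPT_DESIGN.TCL.PRE=/path/to/constraints.tcl" ...
```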
In general, timing is expected to be comparable between different target platforms that are based on the same Virtex UltraScale+ VU9P device. Any custom Tcl constraints for timing closure will need to be evaluated and might need to be modified for the new platform.
Additionally, any non-default options that are passed to xocc, or to the Vivado tools using the xocc --xp switch, will need to be updated for the new shell.
Timing Closure Considerations
Design performance and timing closure can vary when moving across SDx™ releases or shell(s), especially when one of the following conditions is true:
- Floorplan constraints were needed to close timing.
- Device or SLR resource utilization was higher than the typical guideline:
- LUT utilization was higher than 70%
- DSP, RAMB, and UltraRAM utilization was higher than 80%
- FD (CLB register) utilization was higher than 50%
- High effort compilation strategies were needed to close timing.
In these cases, reuse the options previously added to the xocc command line while verifying that any floorplan constraint ensures the following:
- The utilization of each SLR is below the recommended guidelines.
- The utilization is balanced across SLRs if one type of hardware resource needs to be higher than the guideline.
For designs with overall high utilization, increasing the amount of pipelining in the kernels, at the cost of higher latency, can greatly help timing closure and achieve higher performance.
The xocc -R option can be used to generate additional reports that help with timing closure:

xocc -R 1
- report_failfast is run at the end of each kernel synthesis step
- report_failfast is run after opt_design on the entire design
- The opt_design DCP is saved

xocc -R 2
- Same reports as with -R 1, plus:
  - report_failfast is run post-placement for each SLR
  - Additional reports and intermediate DCPs are generated
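For example, the higher report level might be requested on the link command line as follows (the target, platform, and file names are illustrative):

```
# Hypothetical build requesting the extra fail-fast reports and DCPs:
xocc -R 2 -t hw --link --platform <platform_name> -o app.xclbin kernel_A.xo
```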
All reports and DCPs can be found in the implementation directory, including kernel synthesis reports:
<runDir>/_x/link/vivado/prj/prj.runs/impl_1
For more information about timing closure and the fail-fast report, see the UltraFast Design Methodology Timing Closure Quick Reference Guide (UG1292).