High Level Synthesis

Final

This assignment description is now complete. There could still be minor updates, which will be highlighted.

Follow the Spirit

Some screenshots may have been taken on other versions of Vivado/Vitis or for other configurations. The instructions could also vary slightly depending on the exact design/configuration you follow, such as whether you have separate or combined designs with multiple coprocessors, or which interfacing method (DMA/FIFO) you use. The spirit of the instructions remains the same. Understand the significance of each step rather than following it mechanically.

Introduction to High-level synthesis

High-level synthesis transforms C functions into hardware IPs.

HLS works fairly well for inner blocks with data-oriented (resource-dominated) functionality and no complicated control flow structures. Examples include digital signal processing and arithmetic on matrices, where loops have data-independent exit conditions.

It is not very good for outer blocks, which typically involve complicated control structures (control-dominated). HLS-based generation of a control-dominated circuit such as a microcontroller remains a holy grail.

The HLS tool is temperamental - sometimes you get very good results, and sometimes you end up wondering what just happened. Sometimes even slight code changes that should have little/no functional relevance can produce substantially different hardware. Well, using an inherently sequential high-level language to produce inherently parallel hardware is challenging.

Not all C code can be synthesized. Anything that depends on a runtime environment will not work. An example is dynamic memory allocation (no malloc). Keep in mind that the goal of HLS is not to create something that executes on a processor; it is to create a sort of processor itself.
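As a minimal sketch of what this restriction means in practice, the function below keeps its working buffer in an array whose size is known at compile time instead of calling malloc. The names scaled_sum and MAX_LEN are hypothetical and are not part of the assignment template.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fixed length; HLS needs sizes it can resolve at compile time
// in order to allocate on-chip storage.
constexpr std::size_t MAX_LEN = 64;

// Not synthesizable: the storage would come from a heap, which does not
// exist in the generated hardware.
//   uint32_t *buf = (uint32_t *)malloc(n * sizeof(uint32_t));

// Synthesizable style: a fixed-size local array and a loop whose exit
// condition does not depend on the data.
std::uint32_t scaled_sum(const std::uint32_t in[MAX_LEN], std::uint32_t factor) {
    std::uint32_t buf[MAX_LEN];  // static size, no heap
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < MAX_LEN; ++i) {
        buf[i] = in[i] * factor;
        sum += buf[i];
    }
    return sum;
}
```

The same function compiles and runs on a PC, which is how an HLS C simulation would typically exercise it before synthesis.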

Creating an HLS-based design based on default options might not yield a good IP/hardware. It will be kind of like going on an organized tour. You will get some feel of the place, you will have the bragging rights, but the experience is generally not so 'authentic'. You need to exert control over the generated hardware using optimization directives, as appropriate for the context and the design requirements/tradeoffs.

The accelerator IP that we create needs to be interfaced with the rest of the system (that you created in Assignment 1) so that it can act as a coprocessor to the processor. The interface of the IP generated by HLS can be one of the following.

Register-based: The processor can read and write the registers within the coprocessor to write inputs/read outputs - each register has an address within the address space of the processor. The parameters and return values are mapped to these registers/addresses. For example, A is in the offset range 0xx to 0xyy - the actual address range is the base address of the coprocessor peripheral (assigned in Vivado under the address tab) + the offset. The coprocessor in this case has an AXI or AXI Lite interface which can be connected to the AXI bus of the system as a slave (similar to how the timer was connected in Assignment 1).

Stream-based: There are separate input and output streams through which the data is streamed in/out. There is no concept of addresses, and the meaning of the data is derived from the order of the data (and possibly some 'tags'). For example, the first 512 words correspond to A, the next 64 correspond to B, and so on. Please read the Introduction to AXI Stream page.

Memory-based: The coprocessor reads inputs from / writes outputs to memory directly.

For now, we will start with the stream-based interface, which is perhaps the easiest to get started with.
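The difference between the register-based and stream-based styles can be sketched in plain C++ as follows. This is only an illustration: register_address, Operands and split_stream are hypothetical names, the base and offset values passed in are placeholders, and the sizes 512 and 64 are the example figures from the text above rather than the actual assignment dimensions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Register-based: absolute address = peripheral base address + offset.
// The real base comes from the Vivado address editor and the real offsets
// from the HLS-generated driver; nothing here is tool-generated.
constexpr std::uint64_t register_address(std::uint64_t base, std::uint64_t offset) {
    return base + offset;
}

// Stream-based: no addresses, only position in the stream.
constexpr std::size_t A_WORDS = 512;  // example size from the text
constexpr std::size_t B_WORDS = 64;   // example size from the text

struct Operands {
    std::vector<std::uint32_t> a;
    std::vector<std::uint32_t> b;
};

// Splits a received word stream into A and B purely by arrival order.
// Assumes the stream holds at least A_WORDS + B_WORDS words.
Operands split_stream(const std::vector<std::uint32_t>& stream) {
    Operands ops;
    ops.a.assign(stream.begin(), stream.begin() + A_WORDS);
    ops.b.assign(stream.begin() + A_WORDS,
                 stream.begin() + A_WORDS + B_WORDS);
    return ops;
}
```

In the stream case, sender and receiver must agree on the order and the counts in advance, because nothing in the data itself says where A ends and B begins.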

Assignment 2

The template/data files are here.

The assignment involves

1) Creating a stream-based coprocessor (aka accelerator) to do matrix multiplication (RES=A*B/256) using HLS and integrating it into the system. The matrix multiplication problem is exactly the same as in Assignment 1 (the part of your program that sends the matrices A and B from the PC can be commented out, and the matrices can be hard-coded in your program for convenience), except that the multiplication should now be offloaded to the coprocessor in addition to being done purely in software.

You should do coprocessor interfacing using both AXI Stream FIFO and AXI DMA. Try with FIFO first before venturing into DMA. AXI Stream FIFO and AXI DMA-based interfacing do not have to be done in a single system, i.e., it is fine to have separate projects for each case. Of course, combining them into a single project is fine as well - this page might give you some ideas.
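For reference, the computation RES=A*B/256 that both the software and the hardware versions carry out can be sketched as below. The function name follows the matrix_multiply_soft() naming suggested in this assignment, but the signature, the 32-bit element type, passing the dimensions as parameters, and applying the integer division once per output element after accumulation are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes RES = (A * B) / 256 using integer arithmetic.
// A is rows x inner, B is inner x cols, RES is rows x cols, all row-major.
// Assumption: the division (truncating) is applied after each dot product.
std::vector<std::uint32_t> matrix_multiply_soft(const std::vector<std::uint32_t>& a,
                                                const std::vector<std::uint32_t>& b,
                                                std::size_t rows,
                                                std::size_t inner,
                                                std::size_t cols) {
    std::vector<std::uint32_t> res(rows * cols, 0);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            std::uint32_t acc = 0;
            for (std::size_t k = 0; k < inner; ++k) {
                acc += a[r * inner + k] * b[k * cols + c];
            }
            res[r * cols + c] = acc / 256;
        }
    }
    return res;
}
```

An HLS version of the same loop nest would use fixed-size arrays or streams rather than std::vector, since the vector relies on dynamic allocation.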

2) Further, for each case above, you need to compare the hardware and software performance via profiling (TCF Profiler, mentioned in the page on Performance analysis) as well as by using the AXI Timer.

The time taken for the hardware version should be inclusive of the time taken for sending data to and receiving data from the coprocessor (i.e., writing to / reading from AXI Stream FIFO), as this is an unavoidable overhead associated with offloading computations to hardware. This overhead can possibly be ignored when using DMA in a non-blocking fashion, i.e., the CPU is performing some other useful task while the DMA data transfer and co-processor computations are in progress.

It is suggested that you create separate functions called from the main program - something like matrix_multiply_soft(), matrix_multiply_FIFO(), and matrix_multiply_DMA() - for the software and the hardware versions (using FIFO and DMA) respectively. This facilitates profiling using the TCF profiler, which can profile only at a function level.

The time taken for sending and receiving data via UART should not be considered, as it has nothing to do with computation/hardware acceleration. It is suggested that the data is hard-coded as a C-array rather than receiving it via UART for this assignment.
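The timing rules above amount to placing the start and stop readings around the whole offloaded call, so that sending and receiving are inside the timed region. The sketch below shows that structure only. It uses std::chrono as a stand-in for the AXI Timer so that it runs on a PC, and Trace, send_to_coprocessor, receive_from_coprocessor and time_call are hypothetical helpers, not driver functions.

```cpp
#include <chrono>
#include <cstdint>

// Records which steps ran; a placeholder for real data movement.
struct Trace {
    int sent = 0;
    int received = 0;
};

// Hypothetical stand-ins for the FIFO or DMA driver calls on the target.
inline void send_to_coprocessor(Trace& t) { ++t.sent; }
inline void receive_from_coprocessor(Trace& t) { ++t.received; }

// Hardware path: the transfers in both directions are part of the work,
// so they sit inside the function that gets timed and profiled.
inline void matrix_multiply_FIFO(Trace& t) {
    send_to_coprocessor(t);
    receive_from_coprocessor(t);
}

// Times a single call of fn and returns the elapsed time in nanoseconds.
template <typename Fn>
std::int64_t time_call(Fn fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

A call such as time_call([&t] { matrix_multiply_FIFO(t); }) then measures the transfers together with the computation, while any UART input or output stays outside the lambda and is therefore excluded.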

3) You are also required to try at least one possible optimization in HLS and compare the performance on hardware (which wouldn't require any modifications to your software C code) with the vanilla (non-optimized) version. The C code for HLS needs to have appropriate pragmas inserted manually or graphically. This is a self-learning / self-exploration exercise.

You should read and get an overview of the following 4 optimizations from the document https://docs.amd.com/v/u/en-US/ug1270-vivado-hls-opt-methodology-guide. The page numbers below are the pages where the topic starts, not necessarily the only 4 pages you need to read. These were covered in the lecture on a conceptual level.

  • pragma HLS array_partition: page 83
  • pragma HLS dataflow: page 91
  • pragma HLS pipeline: page 116
  • pragma HLS unroll: page 125

Note that while these optimizations are applied independently, some optimizations work well only when some other optimizations are also used. For example, doing pipelining or loop unrolling without partitioning the array wouldn't help much, as the bottleneck will be accessing the memory, 1 or 2 elements at a time. The effect of these optimizations is not always that deterministic though, given the nature and non-maturity of HLS tools.
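The interaction described above can be sketched on a small dot product. The function dot_product, the length N and the loop labels are hypothetical. The array_partition pragma is written in the older Vivado HLS form, and newer Vitis HLS releases may expect type=complete instead. An ordinary C++ compiler ignores the HLS pragmas, so the code also runs as plain software.

```cpp
#include <cstdint>

constexpr int N = 8;  // hypothetical vector length

// Dot product of two fixed-size vectors with HLS directives applied.
std::uint32_t dot_product(const std::uint32_t a_in[N], const std::uint32_t b_in[N]) {
    std::uint32_t a[N];
    std::uint32_t b[N];
    // Split the local arrays into individual registers so that all N
    // elements can be read in the same cycle.
#pragma HLS array_partition variable=a complete dim=1
#pragma HLS array_partition variable=b complete dim=1

copy_loop:
    for (int i = 0; i < N; ++i) {
        // Request that a new iteration starts every clock cycle.
#pragma HLS pipeline II=1
        a[i] = a_in[i];
        b[i] = b_in[i];
    }

    std::uint32_t acc = 0;
mac_loop:
    for (int i = 0; i < N; ++i) {
        // Replicate the multiply-accumulate for every iteration; without
        // the partitioning above, memory ports would limit the benefit.
#pragma HLS unroll
        acc += a[i] * b[i];
    }
    return acc;
}
```

Whether the tool actually achieves the requested initiation interval or full parallelism has to be checked in the synthesis report, in line with the caveat about determinism above.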

Newer versions of Vitis perform some of these optimizations (e.g., pipelining, array partitioning) automatically provided certain conditions are met. You can disable this from the settings or via pragmas to see the performance of the non-pipelined versions.

#pragma HLS pipeline off

It can also be done via tcl.

set_directive_pipeline -off [get_loops "loop_label"]

It can also be done by setting hls_component > settings > hls_config.cfg > C Synthesis > Compile > compile.pipeline_loops to 0 (change the HLS component and config file names as appropriate).
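As a sketch of where the pragma form goes, the directive is placed inside the body of the loop it applies to. The function sum_no_pipeline and the length LEN are hypothetical; the label loop_label matches the name used in the tcl example above.

```cpp
#include <cstdint>

constexpr int LEN = 16;  // hypothetical array length

// Sums LEN values with pipelining of the loop switched off, so that the
// synthesis report can serve as a non-pipelined baseline.
std::uint32_t sum_no_pipeline(const std::uint32_t data[LEN]) {
    std::uint32_t total = 0;
loop_label:
    for (int i = 0; i < LEN; ++i) {
        // Applies to the enclosing loop only.
#pragma HLS pipeline off
        total += data[i];
    }
    return total;
}
```

Synthesizing the function once with this pragma and once without it gives a pair of reports from which the effect of pipelining on latency can be compared.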

The dataflow optimisation by itself will likely not yield any improvement in performance without modifying the software (C program) significantly to take advantage of the hardware optimisations. This is not easy, and will ideally need an operating system (e.g., FreeRTOS that is supported out of the box), and hence is a purely optional exercise.

When evaluating the effect of HLS optimisations, you can choose either AXI Stream FIFO or AXI DMA-based interfacing, but the latter is strongly recommended. As is the case with FIFO and DMA-based designs, it is fine to have separate projects or a combined project for the designs with and without optimisations.

To summarise, we have 3 scenarios. We can have either a single project combining all 3 scenarios below, or 3 separate projects.

  • Non-optimised coprocessor interfaced using FIFO.
  • Non-optimised coprocessor interfaced using DMA.
  • Optimised coprocessor interfaced using DMA/FIFO (DMA recommended).

Submission Info

Assignment 2 (10 marks)

Upload a .zip file containing the following to Canvas by 11:59 PM, 4 Oct 2025:

  • The .cpp files used for HLS implementation and test/co-simulation. The directives.tcl file should also be included if the 'Directive Destination' is 'Directive File' instead of 'Source File'.
  • The .c/.h file(s) running on the ARM Cortex-A53 used to send data to the coprocessor, including the timer code (only those you have modified).
  • A screenshot of your IP integrator canvas, i.e., the block diagram (please do not upload the entire Vivado project folder) for each case (or combined).
  • The .xsa file(s).
  • A text file containing the information printed on the serial console in each case (or combined).
  • Screenshots of the TCF profiler tab showing the comparisons in each case (or combined).

The exact same files should be used for evaluation.

It should be a .zip archive with the filename <Team number>_<Team member 1 Name>_<Team member 2 Name>_Asst2.zip.

Please DO NOT upload the whole project!

You will also need to do an online demonstration to a teaching assistant (based on what you submitted at the point of the assignment deadline, not the version you may have improved after the deadline) - arrangements will be made known in due course.

References

Here are some references that can help you get started with the Vivado High-Level Synthesis tool: