Performance Analysis of AI Engine Graph Application

A system-level view of program execution can be helpful in identifying correctness and performance issues. Problems such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are difficult to debug using explicit print statements or traditional interactive debuggers. A systematic way of collecting system-level traces of the program execution is needed. The AI Engine architecture has direct support for generation, collection, and streaming of events as trace data during simulation, hardware emulation, or hardware execution.

Note: The event trace feature for hardware execution is an early access feature.

AI Engine Simulation-Based Performance Analysis

In simulation, value change dump (VCD) files can be used to view time-stamped events, event types, and the data associated with each event. VCD files provide a detailed dump of the simulated hardware signals. Additionally, a profile summary provides annotated details for the overall application performance.

AI Engine Simulation-Based Value Change Dump

In the simulation framework, the AI Engine simulator can generate a detailed dump of the hardware signals in the form of value change dump (VCD) files. A defined set of abstract events describes the execution of a multi-kernel AI Engine program. Output of a VCD file is enabled using the aiesimulator --dump-vcd command.

After simulation or emulation, the VCD file can be processed into events and viewed on a timeline in the Vitis™ analyzer. The events contain information such as time stamps, event types, and data associated with each event. This information can be correlated to the compiler-generated debug information, which includes program counter values mapped to function names and instruction offsets, and source-level symbolic data offsets for memory accesses.

The abstract AI Engine events are independent of the VCD format and can also be extracted directly from the hardware. The event traces can be produced as plain text, comma-separated values (CSV), common trace format (CTF), or waveform database (WDB) files, and the generated event trace data can be viewed in the Vitis analyzer.

VCD File Generation

To generate a VCD file from the Vitis IDE, right-click on your AI Engine graph project from the Explorer view and select Run As > Run Configurations as described in Creating the AI Engine Graph Project and Top-Level System Project. This opens up the Run Configurations dialog box for the current project.

Figure 1: Vitis IDE to Enable VCD File Generation

Select the AI Engine Emulator option and double-click to open a new configuration. Select the Generate Trace check box to enable trace capture, and select the VCD Trace button. By default, this produces a VCD dump in a file called foo.vcd in the current directory. You can rename the file if desired.

The VCD file can also be generated by invoking the AI Engine simulator with the --dump-vcd <filename> option on the command line. The VCD file is generated in the directory where the simulation is run. Assuming that the program is compiled using the AI Engine compiler, the simulator can be invoked in a shell with the VCD option.

$ aiesimulator --pkg-dir=./Work --dump-vcd=foo

This command produces the VCD file (foo.vcd) which is written to the current directory.

AI Engine Trace from VCD

The vcdanalyze utility is provided to generate an AI Engine event trace from the VCD file. This process is integrated into the Vitis tool flow automatically. From the Vitis IDE, after a simulation run has finished capturing AI Engine events, you can right-click on the project from the Project Explorer and select Analyze AIE Events. The trace data is produced under the current project at Traces/AIE_AXI_Trace and various views are automatically loaded into the current project.

The raw event trace under the directory Traces/AIE_AXI_Trace/ctf/events.txt should look like the following:

time=1741000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=65536,data1=0,tlast=0
time=1742000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=196610,data1=0,tlast=0
time=1743000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=327684,data1=0,tlast=0
time=1743000,event=CORE_RESET,col=1,row=0
time=1744000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=458758,data1=0,tlast=0
time=1745000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=589832,data1=0,tlast=0
time=1746000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=720906,data1=0,tlast=0
time=1747000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=851980,data1=0,tlast=0
time=1748000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=983054,data1=0,tlast=0
time=1749000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=1,data1=0,tlast=0
time=1750000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=131075,data1=0,tlast=0
time=1751000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=262149,data1=0,tlast=0
time=2186000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b6
time=2190000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b7
time=2194000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b6
time=2198000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b7
time=2202000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b2
time=2206000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b3

The following command produces the AI Engine trace data for foo.vcd in text form in the ./trdata/events.txt file.

vcdanalyze -vcd foo.vcd
TIP: Use vcdanalyze -h to get help for the command.

The following command produces a CSV file from the AI Engine trace data from the foo.vcd file.

vcdanalyze -vcd=foo.vcd -csv

The following command produces the waveform data files from the AI Engine trace data from the foo.vcd file.

vcdanalyze -vcd foo.vcd -wdb

Viewing the Run Summary in the Vitis Analyzer

After running the system, whether in simulation, hardware emulation, or in hardware, a run_summary report is generated when the application has been properly configured.

During simulation of the AI Engine graph, the AI Engine simulator or hardware emulation captures performance and activity metrics and writes the report to the output directory ./aiesimulator_output or ./sim/behav_waveform/xsim, respectively. The generated summary is called default.aierun_summary.

The run_summary can be viewed in the Vitis analyzer. The summary contains a collection of reports capturing the performance profile of the AI Engine application as it runs. For example, to open the AI Engine simulator run summary use the following command:

vitis_analyzer ./aiesimulator_output/default.aierun_summary

The Vitis analyzer opens displaying the Summary page of the report. The Report Navigator view of the tool lists the different reports that are available in the summary. For a complete understanding of the Vitis analyzer, see Using the Vitis Analyzer in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416).

Note: The default.aierun_summary also contains some of the same reports as default.aiecompile_summary, namely Graph and Array. For those reports, see Viewing Compilation Results in the Vitis Analyzer.

The listed reports include:

Summary
This is the top level of the report, and lists details of the run, such as the date, tool version, and the command line used to launch the simulator.
Profile
When the aiesimulator --profile option is specified, the simulator collects profiling data on the AI Engine graph and kernels, presenting a high-level view of the AI Engine graphs and the kernels mapped to processors, with tables and graphical presentations of metric data.

The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine processor/DMA memory stalls, deadlock, interference, critical paths, and maximum contention. This is useful for system-level performance tuning and debug. System performance is presented in terms of latency (the number of cycles taken to execute the system) and throughput (data transferred per unit time). Sub-optimal system performance forces you to examine and control (through constraints) mapping and buffer packing, stream and packet switch allocation, interaction with neighboring processors, and external interfaces. An example of the raw Profile Summary report is shown.

Figure 2: Profile Summary


Note: The row value in the tile name from the profile report is one above the actual tile row number. For example, tile_25_1 in the previous screen shot is tile(25, 0).

Additional tables show profile information specific to the kernels. This is shown as a chart with a table showing what is running on the tiles. The following is an example chart.

Figure 3: Example Chart


In this view, you can see a chart that shows the Total Function Time, which is the total number of cycles the function used while running the graph. The y-axis shows the ID of the function, which can be referenced in the following table. This information can be useful in determining where time is being spent in a function and helps with potential optimization or debug.

Trace
Issues such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are difficult to debug using traditional interactive debug techniques. Event trace provides a systematic way of collecting system-level traces of the program events, with direct support for generation, collection, and streaming of hardware events as a trace. The following image shows the Trace report open in the Vitis analyzer.
Figure 4: Trace Report


Note: If the VCD file is too large and it takes too much time for the Vitis analyzer to analyze the VCD and open the Trace view, you can do an online analysis of the VCD when running the AI Engine simulator. The Vitis analyzer then opens the existing WDB and CTF files instead of analyzing the VCD file. The command for AI Engine simulator is as follows.
aiesimulator --pkg-dir=./Work --online -wdb -ctf

Features of the trace report include the following.

  • Each tile is reported. Within each tile the report includes core, DMA, locks, and I/O if there are PL blocks in the graph.
  • There is a separate timeline for each kernel mapped to a core. It shows when the kernel is executing (blue) or stalled (red) due to memory conflicts or waiting for stream data.
  • By using lock IDs in the core, DMA, and locks sections you can identify how cores and DMAs interact with one another by acquiring and releasing locks.
  • The lock section shows the activities of the locks in the tile, both the allocation and release for read and write lock requests. A particular lock can be allocated by nearby tiles, so this section does not necessarily match the lock requests of the core shown in the left pane of the image.
  • If a lock is not released, a red bar extends through the end of simulation time.
  • Clicking the left or right arrows takes you to the start and end of a state, respectively.
  • The data view shows the data flowing through stream switch network with slave entry points and master exit points at each hop. This is most useful in finding the routing delays, as well as network congestion effects with packet switching, where one packet might get delayed behind another packet when sharing the same stream channel.

Trace View Data Visualization

Trace view data visualization provides a side-by-side view of events at the I/O ports, allowing you to look into the design and examine the relative event timing. This provides insight into how the design works in a multi-core environment.

Window Data Analysis

Window data analysis in the trace view allows you to cross-probe from a specific input/output window net connection in the graph view to the appropriate position in the event trace view. You can trace window input data as it is being buffered prior to the kernel starting execution. To perform data analysis on a window interface, use the following steps.
  1. The following image shows a net connection that has been selected in the Graph view. The selected net (net80) is highlighted both in the net view and in the nets table at the bottom of the window. In addition, the Graph view associates the simulation input and output files with the appropriate nets. The simulation input file associated with net80 is displayed on the right.

  2. In the upper net view, select the file source that connects to net80, as shown, circled in the previous image. The input data is displayed at the right hand side of the window.
  3. Switch to the Trace view to see the events and detailed event timing.

  4. In the events table at the bottom of the screen, select Net from the column selector.
  5. Type net80 in the adjacent (filter) box.
  6. Net events that match net80 are highlighted in the events table. Browse through the events using the Previous/Next toolbar buttons to show the matched events.
  7. The highlighted data matches the data from the input file, shown in the right-hand pane.
    Note: The AI Engine kernel does not start until all the data is available at the window interface.
Using these steps, you can inspect I/O data for the required window connections to ensure the correctness of the I/O data to and from a kernel.

Supported Window Data Types

Table 1. Supported Window Data Types
Data Type          Display Format Example
int8 (1)           ±123
int16              ±12345
int32              ±1234567890
int64              ±1234567890123456789
uint8              123
uint16             12345
uint32             1234567890
uint64             12345678901234567890
cint16             ±12345+±12345i
cint32             ±1234567890+±1234567890i
float              ±1.234567
cfloat             ±1.234567+±1.234567i
v4cint16           ±1+2i:3+4i:5+6i:7+8i
v4int32            ±1:2:3:4
v4cint32           ±1+2i:3+4i
v4int64            ±1:2
v4float            ±1.000000:2.000000:3.000000:4.000000
v4cfloat           ±1.000000+2.000000i:3.000000+4.000000i
v8int16            ±1:2:3:4:5:6:7:8
v8cint16           ±1+2i:3+4i:5+6i:7+8i
v8int32            ±1:2:3:4:5:6:7:8
v8float            ±1.000000:2.000000:3.000000:4.000000:5.000000:6.000000:7.000000:8.000000
v16int8            ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16
v16uint8           1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16
v16int16 (2)       N/A
v16uint16 (2)      N/A
v16int32           ±1:2:3:4:5:6:7:8
v16float (2)       N/A
v16cfloat (2)      N/A
v32int8            ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32
v32uint8           1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32
v32int16           ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16
v32cint16          ±1+2i:3+4i:5+6i:7+8i:9+10i:11+12i:13+14i:15+16i
v32int32           ±1:2:3:4:5:6:7:8
v32float           ±1.000000:2.000000:3.000000:4.000000:5.000000:6.000000:7.000000:8.000000
v64int8 (3)        ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32
                   ±33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:51:52:53:54:55:56:57:58:59:60:61:62:63:64
v64uint8 (3)       1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32
                   33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:51:52:53:54:55:56:57:58:59:60:61:62:63:64
v64int16 (4)       ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16
                   ±17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32

  1. Preceding 0 and ‘+’ not displayed.
  2. Currently under development.
  3. 32 samples at timestamp x, next 32 samples at timestamp x+1.
  4. 16 samples at timestamp x, next 16 samples at timestamp x+1.

Stream Data Analysis

Stream data analysis in the trace view allows you to cross probe from a specific input/output stream net connection in the graph view to the appropriate position in the event trace view. You can trace stream input data as it is being received during the kernel execution. To perform data analysis on a stream interface, use the following steps.
  1. The following image shows a kernel port that has been selected in the Graph view. The selected port with ID pi0 is highlighted both in the Graph view and in the net table at the bottom of the window.

  2. Switch to the Trace view to see the events and detailed event timing.

  3. Move the time marker to the beginning of the stream data by dragging the vertical line (time marker) in the upper part of the window.
  4. In the upper part of the window you can examine the data associated with the selected input port.
  5. In the events table at the bottom of the screen, select Data from the drop-down menu.
  6. Type 0.000000+0.000000i into the adjacent (filter) box and click Next. The expected time and data are highlighted in the events table.
    Note: The input value depends on data type. 0.000000+0.000000i in this example is for a complex float type.
Using these steps, you can inspect I/O data for the required stream connections to ensure the correctness of the I/O data to and from a kernel.

Supported Stream Data Types

Table 2. Supported Stream Data Types
Data Type          Display Format
int8 (1)           ±123
int16              ±12345
int32              ±1234567890
int64              ±1234567890123456789
uint8              123
uint16             12345
uint32             1234567890
uint64             12345678901234567890
cint16             ±12345+±12345i
cint32 (2)         N/A
float              ±1.234567
cfloat (2)         N/A
acc48              ±1:2:3:4:5:6:7:8
cacc48             ±1+1i_2+2i_3+3i_4+4i
acc80 (2)          N/A
cacc80 (2)         N/A
accfloat           ±1.000000:2.000000:3.000000:4.000000:5.000000:6.000000:7.000000:8.000000
caccfloat          1.000000+2.000000i:3.000000+4.000000i:5.000000+6.000000i:7.000000+8.000000i
v2cint32           ±1+2i:3+4i
v4cint16           ±1+2i:3+4i:5+6i:7+8i
v4int32            ±1:2:3:4
v4float            ±1.000000:2.000000:3.000000:4.000000
v8int16            ±1:2:3:4:5:6:7:8
v16int8            ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16
v16uint8           1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16
v8acc48            ±1:2:3:4:5:6:7:8
v4cacc48           ±1+2i_3+4i_5+6i_7+8i
  1. Preceding 0 and ‘+’ not displayed.
  2. Currently under development.

Cascade Data Analysis

Tracing cascade data is similar to tracing stream data. Cascade data analysis in the trace view allows you to cross-probe from a specific input/output cascade connection in the Graph view to the appropriate position in the event trace view. You can trace cascade input data as it is being received during the kernel execution. To perform data analysis on a cascade interface, use the following steps.
  1. The following image shows a kernel port that has been selected in the Graph view. The selected port with ID pi2 is highlighted both in the Graph view and in the net table at the bottom of the window.

  2. Switch to the Trace view to see the events and detailed event timing.

  3. View the data associated with the selected input port.
  4. In the events table at the bottom of the screen, select Data from the column selector drop-down menu.
  5. Type 316733708 into the adjacent (filter) box and click Next. The expected time and data are shown in the events table.
    Note: The input value depends on data type. 316733708+187875955i in this example is for a complex integer data type.
Using these steps, you can inspect I/O data for the required cascade connections to ensure the correctness of the I/O data to and from a kernel.

Data Display Limitations

The following lists some limitations of the trace view data visualization feature.

  1. Trace data visualization is only available for non-templated kernels.
  2. 64-bit non-vector window data types are displayed as two 32-bit high and low values.

    For example, an unsigned, 64-bit integer 0x0000000100000002 is displayed as 2L and 1H separately. L and H represent low and high 32 bits.

    Another example is a complex 32-bit value where the real 32 bits are displayed first, followed by the imaginary 32 bits.

  3. input_pktstream/output_pktstream data type display is not supported.

Cross-Probing

Debugging a design is a complicated process and often requires switching between views/reports. The Vitis analyzer supports this functionality and allows detailed inspection of events and their associated timing. Cross-probing allows you to view I/O data from different perspectives, including data on a particular tile or port and the time at which an event occurred, all within the same window. The Trace and Graph views can be shown in the same window, and moving the time marker in the Trace view or selecting object(s) in the Graph view applies to both views simultaneously.
  1. Use the Window Layout toolbar (circled in the following image) to manage views/reports. In the following image, the Trace and Graph view are selected to appear in the same window.

  2. The filter button, shown circled in the following image, selects tiles, functions, input/output ports, DMAs, and locks to be included in or excluded from the trace view, to focus on areas of interest.

  3. Drag the time marker to move the time backwards and forwards to evaluate events at a given time. Events occurring after the selected time are highlighted in the events table of the Trace view at the bottom of the screen.

  4. Select objects in the Graph view to map graph, tiles, I/O ports, and net connections to see the object ID, type, direction, data type, buffers, and connected ports.

  5. Values of I/O ports are available in the Trace view, shown circled in the following image. This example design uses a complex 16-bit data type (cint16).

These steps show the Trace and Graph views for an output port object at 4,341,000 ns into the design execution: the selected object in the Graph view (on the right side), and the time at which the output data is available in the Trace view (on the left side). Moving the time marker to a later time in the Trace view highlights the events in the events table (lower portion of the Trace view) that happened at that time. This information is useful for relating events in a multi-processor environment. Other examples can leverage this cross-probing feature in the Vitis analyzer tool.

Trace Compare

Trace Compare in the Vitis analyzer allows the comparison of two different design executions. Comparing two design runs helps you see the impact of one or more variables introduced in a run, letting you inspect performance differences regardless of which variables are adjusted.
  1. Open the two summary files to be compared.
  2. Select the Trace view for either design, right-click and select Compare to > <filename>_summary for the other design, or click the Compare link. The compare can start from any trace.



    Note: In Trace Compare mode, features in the Vitis analyzer apply to both Trace views.
This example uses the same design twice, but one run has added printf() statements in the kernel. By looking at this Trace Compare example, you can see that the upper design spent time on the printf() calls, which slowed down the overall execution of the design.

Using Trace Compass to Visualize AI Engine Traces

The CTF-based AI Engine trace can be visualized using Trace Compass, a free Eclipse-based tool. This tool is already integrated into the Vitis IDE as a plug-in, allowing you to visualize your traces from the Vitis IDE main panel.

Note: You can download standalone versions and documentation of Trace Compass from their website at http://tracecompass.org.
  1. In the Vitis IDE, after capturing trace data during simulation, right-click on your project and select Analyze AIE Events. This imports the event data from the simulation and creates various views to analyze them. Your screen could look as follows:

  2. To see various views, toggle between the Statistics, Data View, System View, and Function View tabs.

Trace Views

The trace reports support several views:

  • The top window shows a textual list of events in chronological order with various event types and other relevant information. The top row in each column allows you to filter events based on textual patterns. In the bottom window, there are multiple tabs providing different views relating to the execution.
  • The Statistics tab shows the aggregate event statistics based on the selected set of events or a time slice.
  • The System View tab represents the state of system resources such as AI Engines, locks, and DMAs.
  • The Function View tab represents the state of various kernels executing on an AI Engine (core).
  • The Data View tab represents the state of data flowing through the stream switch network.

The following are screen shots of the function view, system view, and data view. The top bar of a view has several options: a legend explaining the colors, zoom in and zoom out, navigation to the beginning or end of a state, and correlation to the textual event that causes the state change. Each view consists of a series of aligned timelines depicting the state of a certain resource or program object. Various events are represented in each timeline. You can hover over a timeline to see the information collected. Clicking on the timeline in one view creates a time bar that allows you to see the corresponding events at that time in other views.

Figure 5: System View with AI Engines, Locks, and DMAs


As shown in the system view, there are three sections: ActiveCores, ActiveDMA, and Locks. If there are PL blocks used in the application, the system view also shows a fourth section, ActivePLPorts. By using lock IDs in the ActiveCores, ActiveDMA, and Locks sections you can identify how the AI Engines and DMAs interact with one another by acquiring and releasing locks. The currently executing function name is shown when hovering over the Core(0,0).pc bar. The color coding is shown in the legend that opens when you click the legend icon (left of the home icon, which resets the timescale to the default). Clicking the left or right arrows takes you to the beginning or end of a state, respectively. A text window shows the event that caused the state change. In this example, all locks are properly acquired and released. If a lock is not released, you will see a red bar that extends through the end of simulation time.

Figure 6: Legend
Figure 7: Function View Showing Running and Stalled Kernels and Main on Each AI Engine


The function view is most useful when analyzing the application from the program standpoint. There is a separate timeline for each kernel mapped to an AI Engine (core), and the view shows when the kernel is executing (blue) or stalled (red). A pop-up window with details such as the type and duration of the stall appears when you hover over the stalls in the function view.

Figure 8: Data View Showing Data Flowing in the Stream Switch Network


The data view shows the data flowing through the stream switch network with slave entry points and master exit points at each hop. This is most useful in finding the routing delays, as well as network congestion effects with packet switching, when one packet might get delayed behind another packet when sharing the same stream channel.

Run-Time Event API for Performance Profiling

You can collect profile statistics of your design by calling event APIs in your PS host code. These event APIs are available both during simulation and when you run the design in hardware.

The AI Engine has hardware performance counters that can be configured to count hardware events for measuring performance metrics. You can use the run-time event API together with the graph control API to profile certain performance metrics during a controlled period of graph execution. The event API supports platform I/O ports (PLIO) and global memory I/O ports (GMIO) for measuring performance metrics such as I/O port bandwidth, graph throughput, and graph latency.

Profiling Platform I/O Port Bandwidth

The bandwidth of a platform I/O port can be defined as the average number of bytes transferred per second, which can be derived as the total number of bytes transferred divided by the time when the port is transferring or is stalled (for example, due to back pressure). The following example shows how to profile I/O port bandwidth using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and the graph is designed to produce 256 int32 data samples in eight iterations.

gr.init();
event::handle handle = event::start_profiling(plio_out, event::io_total_stream_running_to_idle_cycles);
if(handle==event::invalid_handle){
  printf("ERROR:Invalid handle. Only two performance counter in a AIE-PL interface tile\n");
  return 1;
}
gr.run(8);
gr.wait();
long long cycle_count = event::read_profiling(handle);
event::stop_profiling(handle);
double bandwidth = (double)256 * sizeof(int32) / (cycle_count * 1e-9); // bytes per second

In the example, after the graph is initialized, event::start_profiling is called to configure the AI Engine to count the accumulated clock cycles between the stream running event and the stream idle event. In other words, it counts the number of cycles when the stream port is in the running or stall state. The first argument to event::start_profiling can be a PLIO or a GMIO object; in this case, it is plio_out. The second argument is an event::io_profiling_option enumeration; in this case, it is set to event::io_total_stream_running_to_idle_cycles. event::start_profiling returns a handle, which is used later to read the counter value and to stop the profiling. After the graph finishes eight iterations, you can call event::read_profiling to read the counter value by supplying the handle. After profiling is done, it is recommended to stop the performance counter by calling event::stop_profiling with the handle, so the hardware resources configured for profiling can be released for other uses. Finally, the bandwidth is derived by dividing the total number of bytes transferred (256 × sizeof(int32)) by the time spent while the stream port is active (cycle_count × 1e-9, assuming the AI Engine is running at 1 GHz).

Profiling Graph Throughput

Graph throughput can be defined as the average number of bytes produced (or consumed) per second. The following example shows how to profile graph throughput using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and the graph is designed to produce 256 int32 data samples in eight iterations.

gr.init();
event::handle handle = event::start_profiling(plio_out, event::io_stream_start_to_bytes_transferred_cycles, 256*sizeof(int32));
if(handle==event::invalid_handle){
  printf("ERROR:Invalid handle. Only two performance counter in a AIE-PL interface tile\n");
  return 1;
}
gr.run(8);
gr.wait();
long long cycle_count = event::read_profiling(handle);
event::stop_profiling(handle);
double throughput = (double)256 * sizeof(int32) / (cycle_count * 1e-9); // bytes per second

In the example, after the graph is initialized, event::start_profiling is called to configure the AI Engine to count the clock cycles from the stream start event to the event that indicates 256 × sizeof(int32) bytes have been transferred, assuming that the stream stops right after the specified number of bytes are transferred. If the stream continues after that number of bytes has been transferred, the counter continues running and never stops. The first argument to event::start_profiling is plio_out, the second argument is set to event::io_stream_start_to_bytes_transferred_cycles, and the third argument specifies the number of bytes to be transferred before stopping the counter. The graph throughput is derived by dividing the total number of bytes produced in eight iterations (256 × sizeof(int32)) by the time spent from the first output data to the last output data (cycle_count × 1e-9, assuming the AI Engine is running at 1 GHz).

Profiling Port Throughput

Port throughput can be measured by counting the number of samples sent in a specific time. Xilinx provides the event::io_stream_running_event_count enumeration to count the running events, which corresponds to the number of samples sent.

After the graph runs, and data transfer from or to the port is stable, the following code can be inserted in the host code to measure the port throughput.

int wait_time_us=2000000;
event::handle handle = event::start_profiling(*plio_port, event::io_stream_running_event_count);
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
}
long long count0 = event::read_profiling(handle);
usleep(wait_time_us);
long long count1 = event::read_profiling(handle);
event::stop_profiling(handle);
long long samples = count1 - count0;
std::cout << "num running samples: " << samples << std::endl;
std::cout << "Throughput: " << (double)samples / wait_time_us << " MSPS" << std::endl;

This method can be used for an infinite running graph, or just to count how many samples are sent or received before the graph is stalled (for whatever reason).

To minimize variance, it is advised to run for many seconds in hardware. The accuracy of this method can vary in hardware emulation.

This profiling method also applies to the AI Engine simulator. You need to replace usleep with the SystemC wait function, and the wait time needs to be much smaller, because execution is much slower in simulation. For example, the sleep call in the preceding code can be replaced with the following call for the AI Engine simulator.

wait(20,SC_US);

Profiling Graph Latency

Graph latency can be defined as the time spent from receiving the first input data to producing the first output data. The following example shows how to profile graph latency using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and gmio_in is the GMIO object connecting to the graph input port.

gr.init();
event::handle handle = event::start_profiling(gmio_in, plio_out, event::io_stream_start_difference_cycles);
if(handle==event::invalid_handle){
  printf("ERROR:Invalid handle. Only two performance counter in a AIE-PL interface tile\n");
  return 1;
}
gr.run(8);
gr.wait();
long long latency_in_cycles = event::read_profiling(handle);
event::stop_profiling(handle);

In the example, after the graph is initialized, event::start_profiling is called to configure the AI Engine to count the clock cycles from the stream start event of the input I/O port to the stream start event of the output I/O port. The first and second arguments to event::start_profiling can be GMIO or PLIO ports, representing the input and the output I/O port, respectively. In this example, gmio_in is the input I/O port and plio_out is the output I/O port. The third argument is set to the event::io_stream_start_difference_cycles enumeration. The counter value indicates the graph latency in cycles.

Run-Time Event API Performance Counters

Run-time event APIs use the performance counters in the AI Engine-PL interface tiles and the AI Engine-NoC interface tiles. There are two performance counters in each column of the interface tiles. This section lists the number of performance counters used by each run-time event API. If the total number of performance counters needed exceeds the number available in a column of the interface tile, the API that cannot acquire a performance counter fails with the following error message in the AI Engine simulator.
[AIE WARNING]: Unable to request resources. RscType: 0
ERROR: event::start_profiling: Failed to request performance counter resources.
For hardware emulation or hardware flows, the following error message is reported.
[XRT] ERROR: ERROR: event::start_profiling: Failed to request performance counter resources.: Resource temporarily unavailable
Table 3. Run-Time Event API Performance Counters
Run-Time Event Enumeration                               Number of Performance Counters
event::io_total_stream_running_to_idle_cycles            1
event::io_stream_start_to_bytes_transferred_cycles       2
event::io_stream_start_difference_cycles                 1 for input port, 1 for output port
event::io_stream_running_event_count                     1
Note: Performance counters are released after event::stop_profiling. The run-time event API can acquire the same performance counters again after they are released.
Note: When multiple graph ports are mapped into the same interface tile, if run-time event APIs are used on these ports, they will compete for the performance counters in the same interface tile.