Performance Analysis of AI Engine Graph Application
A system-level view of program execution can be helpful in identifying problems during program execution including correctness and performance issues. Problems such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are examples that are difficult to debug by using explicit print statements or by using traditional interactive debuggers. A systematic way of collecting system level traces for the program execution is needed. The AI Engine architecture has direct support for generation, collection, and streaming of events as trace data during simulation, hardware emulation, or hardware execution.
AI Engine Simulation-Based Performance Analysis
In simulation, to view time stamped events, different event types, and data associated with each event, value change dump (VCD) files can be used. VCD files provide a detailed dump of the simulated hardware signals. Additionally, a profile summary provides annotated details for the overall application performance.
AI Engine Simulation-Based Value Change Dump
In the simulation framework, the AI Engine simulator can generate a detailed dump of the hardware signals in the form of value change dump (VCD) files. The execution of a multi-kernel AI Engine program is described in terms of a defined set of abstract events. VCD output is enabled by running aiesimulator with the --dump-vcd option.
After simulation, or emulation, the VCD file can be processed into events and viewed on a timeline in the Vitis™ analyzer. The events contain information such as time stamps, different event types, and data associated with each event. This information can be correlated to the compiler generated debug information. This includes program counter values mapped to function names and instruction offsets, and source level symbolic data offsets for memory accesses.
The abstract AI Engine events are independent of the VCD format and will be directly extracted from the hardware. The event traces can be produced as plain text, comma-separated values (CSV), common trace format (CTF), or waveform database (WDB) files, and the generated event trace data can be viewed in the Vitis analyzer.
VCD File Generation
To generate a VCD file from the Vitis IDE, right-click on your AI Engine graph project from the Explorer view and select as described in Creating the AI Engine Graph Project and Top-Level System Project. This opens up the Run Configurations dialog box for the current project.
Select the AI Engine Emulator option and double click to open a new configuration. Select the Generate Trace check box to enable trace capture, and select the VCD Trace button. By default, this produces a VCD dump in a file called foo.vcd in the current directory. You can rename the file if you like.
The VCD file can also be generated by invoking the AI Engine simulator with the --dump-vcd <filename> option on the command line. The VCD file is generated in the directory in which the simulation is run. Assuming that the program is compiled using the AI Engine compiler, the simulator can be invoked in a shell with the VCD option.
$ aiesimulator --pkg-dir=./Work --dump-vcd=foo
This command produces the VCD file (foo.vcd), which is written to the current directory.
AI Engine Trace from VCD
The vcdanalyze utility is provided to generate an AI Engine event trace from the VCD file. This process is integrated into the Vitis tool flow automatically. From the Vitis IDE, after a simulation run has finished capturing AI Engine events, you can right-click on the project from the Project Explorer and select Analyze AIE Events. The trace data is produced under the current project at Traces/AIE_AXI_Trace and various views are automatically loaded into the current project.
The raw event trace in the file Traces/AIE_AXI_Trace/ctf/events.txt should look like the following:
time=1741000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=65536,data1=0,tlast=0
time=1742000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=196610,data1=0,tlast=0
time=1743000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=327684,data1=0,tlast=0
time=1743000,event=CORE_RESET,col=1,row=0
time=1744000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=458758,data1=0,tlast=0
time=1745000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=589832,data1=0,tlast=0
time=1746000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=720906,data1=0,tlast=0
time=1747000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=851980,data1=0,tlast=0
time=1748000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=983054,data1=0,tlast=0
time=1749000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=1,data1=0,tlast=0
time=1750000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=131075,data1=0,tlast=0
time=1751000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=262149,data1=0,tlast=0
time=2186000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b6
time=2190000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b7
time=2194000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b6
time=2198000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b7
time=2202000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b2
time=2206000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b3
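Each line of the raw trace shown above is a comma-separated list of key=value fields. The following is a minimal sketch of how such a line could be split into fields for scripted analysis. The function names parse_event_line and event_time are hypothetical, and the parsing rules are inferred only from the sample lines above, not from a published format specification.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Split one "key=value,key=value,..." line from events.txt into a map.
// Tokens without '=' are skipped. Values are kept as strings.
std::map<std::string, std::string> parse_event_line(const std::string& line) {
    std::map<std::string, std::string> fields;
    std::istringstream in(line);
    std::string token;
    while (std::getline(in, token, ',')) {
        const std::size_t eq = token.find('=');
        if (eq == std::string::npos) continue;  // ignore malformed tokens
        fields[token.substr(0, eq)] = token.substr(eq + 1);
    }
    return fields;
}

// Return the "time" field as an integer, or -1 if the field is absent.
long long event_time(const std::map<std::string, std::string>& fields) {
    const auto it = fields.find("time");
    return it == fields.end() ? -1 : std::stoll(it->second);
}
```

For example, parsing the CORE_RESET line from the sample yields four fields (time, event, col, row), which can then be filtered by event type or sorted by time stamp.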
The following command produces the AI Engine trace data for foo.vcd in text form in the ./trdata/events.txt file.
vcdanalyze -vcd foo.vcd
Use vcdanalyze -h to get help for the command.
The following command produces a CSV file from the AI Engine trace data from the foo.vcd file.
vcdanalyze -vcd=foo.vcd -csv
The following command produces the waveform data files from the AI Engine trace data from the foo.vcd file.
vcdanalyze -vcd foo.vcd -wdb
Viewing the Run Summary in the Vitis Analyzer
After running the system, whether in simulation, hardware emulation, or in hardware, a run_summary report is generated when the application has been properly configured.
During simulation of the AI Engine graph, the AI Engine simulator or hardware emulation captures performance and activity metrics and writes the report to the output directory, ./aiesimulator_output or ./sim/behav_waveform/xsim. The generated summary is called default.aierun_summary.
The run_summary can be viewed in the Vitis analyzer. The summary contains a collection of reports that capture the performance profile of the AI Engine application as it runs. For example, to open the AI Engine simulator run summary, use the following command:
vitis_analyzer ./aiesimulator_output/default.aierun_summary
The Vitis analyzer opens displaying the Summary page of the report. The Report Navigator view of the tool lists the different reports that are available in the summary. For a complete understanding of the Vitis analyzer, see Using the Vitis Analyzer in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416).
default.aierun_summary also contains some of the same reports as default.aiecompile_summary. These reports are Graph and Array. For details on those reports, see Viewing Compilation Results in the Vitis Analyzer.
The listed reports include:
- Summary
- This is the top-level of the report, and reports the details of the run, such as date, tool version, and the command-line used to launch the simulator.
- Profile
- When the aiesimulator --profile option is specified, the simulator collects profiling data on the AI Engine graph and kernels, presenting a high-level view of the AI Engine graphs and the kernels mapped to processors, with tables and graphic presentation of metric data.
The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine processor/DMA memory stalls, deadlock, interference, critical paths, and maximum contention. This is useful for system-level performance tuning and debug. System performance is presented in terms of latency (number of cycles taken to execute the system) and throughput (data/time taken). Sub-optimal system performance forces you to examine and control (through constraints) mapping and buffer packing, stream and packet switch allocation, interaction with neighboring processors, and external interfaces. An example of the raw Profile Summary report is shown.
Note: The row value of the tile number in the profile report is one above the actual tile number. For example, tile_25_1 is tile(25, 0) from the previous screen shot.
Specific tables can be used to see profile information specific to the kernels. This is shown as a chart with a table showing what is running on the tiles. The following is an example chart.
In this view, you can see a chart that shows a Total Function Time which is the total cycles the function used in running the graph. The y-axis shows the id of the function that can be referenced in the following table. This information can be useful in determining where time is being spent in a function and helps with potential optimization or debug.
- Trace
- Issues such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are difficult to debug using traditional interactive debug techniques. Event trace provides a systematic way of collecting system-level traces for the program events, providing direct support for generation, collection, and streaming of hardware events as a trace. The following image shows the Trace report open in the Vitis analyzer.
Note: If the VCD file is too large and it takes too much time for the Vitis analyzer to analyze the VCD and open the Trace view, you can do an online analysis of the VCD when running the AI Engine simulator. The Vitis analyzer then opens the existing WDB and CTF files instead of analyzing the VCD file. The command for the AI Engine simulator is as follows.
aiesimulator --pkg-dir=./Work --online -wdb -ctf
Features of the trace report include the following.
- Each tile is reported. Within each tile the report includes core, DMA, locks, and I/O if there are PL blocks in the graph.
- There is a separate timeline for each kernel mapped to a core. It shows when the kernel is executing (blue) or stalled (red) due to memory conflicts or waiting for stream data.
- By using lock IDs in the core, DMA, and locks sections you can identify how cores and DMAs interact with one another by acquiring and releasing locks.
- The lock section shows the activities of the locks in the tile, both the allocation and release for read and write lock requests. A particular lock can be allocated by nearby tiles. Thus, this section does not necessarily match the core lock requests of the core shown in the left pane of the image.
- If a lock is not released, a red bar extends through the end of simulation time.
- Clicking the left or right arrows takes you to the start and end of a state, respectively.
- The data view shows the data flowing through stream switch network with slave entry points and master exit points at each hop. This is most useful in finding the routing delays, as well as network congestion effects with packet switching, where one packet might get delayed behind another packet when sharing the same stream channel.
Trace View Data Visualization
Trace view data visualization allows you a side-by-side view of events at the I/O ports, allowing you to look into the design and examine the relative event timing. This provides insight into how the design works in a multi-core environment.
Window Data Analysis
- The following image shows a net connection that has been selected in the Graph view. The selected net (net80) is highlighted both in the net view and in the nets table at the bottom of the window. In addition, the Graph view also associates the simulation input and output files with the appropriate nets. The simulation input file associated with net80 is displayed on the right.
- In the upper net view, select the file source that connects to net80, as shown, circled in the previous image. The input data is displayed at the right-hand side of the window.
- Switch to the Trace view to view the events and detailed event timing.
- In the events table at the bottom of the screen, select Net from the column selector.
- Type net80 in the adjacent (filter) box.
- Net events that match net80 are highlighted in the events table. Browse through the events using the Previous/Next toolbar buttons to show the matched events.
- The highlighted data matches the data from the input file, shown in the right-hand pane.
Note: The AI Engine kernel does not start until all the data is available at the window interface.
Supported Window Data Types
Data Types | Display Format Example |
---|---|
int8¹ | ±123 |
int16 | ±12345 |
int32 | ±1234567890 |
int64 | ±1234567890123456789 |
uint8 | 123 |
uint16 | 12345 |
uint32 | 1234567890 |
uint64 | 12345678901234567890 |
cint16 | ±12345+±12345i |
cint32 | ±1234567890+±1234567890i |
float | ±1.234567 |
cfloat | ±1.234567+±1.234567i |
v4cint16 | ±1+2i:3+4i:5+6i:7+8i |
v4int32 | ±1:2:3:4 |
v4cint32 | ±1+2i:3+4i |
v4int64 | ±1:2 |
v4float | ±1.000000:2.000000:3.000000:4.000000 |
v4cfloat | ±1.000000+2.000000i:3.000000+4.000000i |
v8int16 | ±1:2:3:4:5:6:7:8 |
v8cint16 | ±1+2i:3+4i:5+6i:7+8i |
v8int32 | ±1:2:3:4:5:6:7:8 |
v8float | ±1.000000:2.000000:3.000000:4.000000:5.000000:6.000000:7.000000:8.000000 |
v16int8 | ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16 |
v16uint8 | 1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16 |
v16int16² | N/A |
v16uint16² | N/A |
v16int32 | ±1:2:3:4:5:6:7:8 |
v16float² | N/A |
v16cfloat² | N/A |
v32int8 | ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32 |
v32uint8 | 1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32 |
v32int16 | ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16 |
v32cint16 | ±1+2i:3+4i:5+6i:7+8i:9+10i:11+12i:13+14i:15+16i |
v32int32 | ±1:2:3:4:5:6:7:8 |
v32float | ±1.000000:2.000000:3.000000:4.000000:5.000000:6.000000:7.000000:8.000000 |
v64int8³ | ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32 ±33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:51:52:53:54:55:56:57:58:59:60:61:62:63:64 |
v64uint8³ | 1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32 33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:51:52:53:54:55:56:57:58:59:60:61:62:63:64 |
v64int16⁴ | ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16 ±17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32 |
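When filtering the events table by data value, the filter text has to match the display format in the table above. The following is a minimal sketch of how a complex float value, or a vector of them, could be rendered in that format. The function names are hypothetical, and the sketch assumes that a sign is shown only for negative components and that vector lanes are joined with a colon, as the cfloat and v4cfloat rows suggest.

```cpp
#include <cassert>
#include <complex>
#include <cstdio>
#include <string>
#include <vector>

// Render one complex float as "re+imi" with six decimals.
// A negative imaginary part therefore appears as "+-...i".
std::string format_cfloat(const std::complex<float>& value) {
    char buffer[64];
    std::snprintf(buffer, sizeof(buffer), "%f+%fi", value.real(), value.imag());
    return std::string(buffer);
}

// Join several complex values with ':' between lanes.
std::string format_cfloat_vector(const std::vector<std::complex<float>>& values) {
    std::string out;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i != 0) out += ':';
        out += format_cfloat(values[i]);
    }
    return out;
}
```

The resulting string can be compared against the values shown in the Trace view or pasted into the filter box.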
Stream Data Analysis
- The following image shows a kernel port that has been selected in the Graph view. The selected port with ID, pi0, is highlighted both in the Graph view and in the net table at the bottom of the window.
- Switch to the Trace view to view the events and detailed event timing.
- Move the time marker to the beginning of the stream data by dragging the vertical line (time marker) in the upper part of the window.
- In the upper part of the window you can examine the data associated with the selected input port.
- In the events table at the bottom of the screen, select Data from the drop-down menu.
- Type 0.000000+0.000000i into the adjacent (filter) box and click Next. The expected time and data are highlighted in the events table.
Note: The input value depends on data type. 0.000000+0.000000i in this example is for a complex float type.
Supported Stream Data Types
Data Types | Display Format |
---|---|
int8¹ | ±123 |
int16 | ±12345 |
int32 | ±1234567890 |
int64 | ±1234567890123456789 |
uint8 | 123 |
uint16 | 12345 |
uint32 | 1234567890 |
uint64 | 12345678901234567890 |
cint16 | ±12345+±12345i |
cint32² | N/A |
float | ±1.234567 |
cfloat² | N/A |
acc48 | ±1:2:3:4:5:6:7:8 |
cacc48 | ±1+1i_2+2i_3+3i_4+4i |
acc80² | N/A |
cacc80² | N/A |
accfloat | ±1.000000:2.000000:3.000000:4.000000:5.000000:6.000000:7.000000:8.000000 |
caccfloat | 1.000000+2.000000i:3.000000+4.000000i:5.000000+6.000000i:7.000000+8.000000i |
v2cint32 | ±1+2i:3+4i |
v4cint16 | ±1+2i:3+4i:5+6i:7+8i |
v4int32 | ±1:2:3:4 |
v4float | ±1.000000:2.000000:3.000000:4.000000 |
v8int16 | ±1:2:3:4:5:6:7:8 |
v16int8 | ±1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16 |
v16uint8 | 1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16 |
v8acc48 | ±1:2:3:4:5:6:7:8 |
v4cacc48 | ±1+2i_3+4i_5+6i_7+8i |
Cascade Data Analysis
- The following image shows a kernel port that has been selected in the Graph view. The selected port with ID, pi2, is highlighted both in the Graph view and in the net table at the bottom of the window.
- Switch to the Trace view to view the events and detailed event timing.
- View the data associated with the selected input port.
- In the events table at the bottom of the screen, select Data from the column selector drop-down menu.
- Type 316733708 into the adjacent (filter) box and click Next. The expected time and data are shown in the events table.
Note: The input value depends on data type. 316733708+187875955i in this example is for a complex integer data type.
Data Display Limitations
The following lists some limitations of the trace view data visualization feature.
- Trace data visualization is only available for non-templated kernels.
- 64-bit non-vector window data types are displayed as two 32-bit high and low values. For example, an unsigned, 64-bit integer 0x0000000100000002 is displayed as 2L and 1H separately. L and H represent the low and high 32 bits. Another example is a complex 32-bit value where the real 32 bits are displayed first, followed by the imaginary 32 bits.
- input_pktstream/output_pktstream data type display is not supported.
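The high/low display of 64-bit values described in the list above can be reproduced with a shift and a mask. The following is a minimal sketch. The Halves struct and the split64 and join64 functions are hypothetical names introduced for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// The two 32-bit halves of a 64-bit value, as shown in the Trace view.
struct Halves {
    std::uint32_t low;   // displayed with the "L" suffix
    std::uint32_t high;  // displayed with the "H" suffix
};

// Split a 64-bit value into its low and high 32-bit halves.
Halves split64(std::uint64_t value) {
    Halves h;
    h.low = static_cast<std::uint32_t>(value & 0xFFFFFFFFu);
    h.high = static_cast<std::uint32_t>(value >> 32);
    return h;
}

// Rebuild the original 64-bit value from the two displayed halves.
std::uint64_t join64(const Halves& h) {
    return (static_cast<std::uint64_t>(h.high) << 32) | h.low;
}
```

For 0x0000000100000002, split64 returns low = 2 and high = 1, which corresponds to the 2L and 1H display in the example above.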
Cross-Probing
- Use the Window Layout toolbar (circled in the following image) to manage views/reports. In the following image, the Trace and Graph views are selected to appear in the same window.
- The filter button, shown circled in the following image, selects the tiles, functions, input/output ports, DMA, and locks to be included in or excluded from the Trace view, to focus on areas of interest.
- Drag the time marker to move the time backwards and forwards to evaluate events at a given time. Events occurring after the selected time are highlighted in the events table of the Trace view at the bottom of the screen.
- Select objects in the Graph view to map graph, tiles, I/O ports, and net connections to see the object ID, type, direction, data type, buffers, and connected ports.
- Values of I/O ports are available in the Trace view, shown circled in the following image. This example design uses a complex 16-bit value type (cint16).
Trace Compare
- Open the two summary files to be compared.
- Select the Trace view for either design, right-click and select for the other design, or click on the Compare link. The compare can start from any trace.
Note: In Trace Compare mode, features in the Vitis analyzer apply to both Trace views.
In this Trace Compare example, the upper design has printf() statements in the kernel. You can see that the upper design expended time on the printf() calls that slowed down the overall execution of the design.
Using Trace Compass to Visualize AI Engine Traces
The CTF-based AI Engine trace can be visualized using the free eclipse tool called Trace Compass. This tool is already integrated into the Vitis IDE as a plug-in, allowing you to visualize your traces from the Vitis IDE main panel.
- In the Vitis IDE, after capturing trace data during simulation you can right-click on your project then select Analyze AIE Events. This imports the event data from the simulation and creates various views to analyze them. Your screen could look as follows:
- To see various views, toggle between the Statistics, Data View, System View, and Function View tabs.
Trace Views
The trace reports support several views:
- The top window shows a textual list of events in chronological order with various event types and other relevant information. The top row in each column allows you to filter events based on textual patterns. In the bottom window, there are multiple tabs providing different views relating to the execution.
- The Statistics tab shows the aggregate event statistics based on the selected set of events or a time slice.
- The System View tab represents the state of system resources such as AI Engines, locks, and DMAs.
- The Function View tab represents the state of various kernels executing on an AI Engine (core).
- The Data View tab represents the state of data flowing through the stream switch network.
The following are screen shots of the function view, system view, and data view. The top bar of a view has several options: A legend explaining the colors, zoom in and zoom out, going to beginning and end of state, and correlating it to a textual event that causes the state change. Each view consists of a series of aligned timelines depicting the state of a certain resource or program object. Various events are represented in each timeline. You can hover over the timeline to see the information collected. Clicking on the timeline in one view creates a time bar that allows you to see the corresponding events at that time in other views.
As shown in the system view, there are three sections: ActiveCores, ActiveDMA, and Locks. If PL blocks are used in the application, the system view also shows a fourth section, ActivePLPorts. By using lock IDs in the ActiveCores, ActiveDMA, and Locks sections you can identify how the AI Engines and DMAs interact with one another by acquiring and releasing locks. The currently executing function name is shown when hovering over the Core(0,0).pc bar. The color coding is shown in the legend that opens with a click on the legend icon (left of the home icon, which resets the timescale to default). Clicking the left or right arrows takes you to the beginning and end of a state, respectively. A text window shows you the event that caused the state change. In this example, all locks are properly acquired and released. If a lock is not released, you will see a red bar that extends through the end of simulation time.
The function view is most useful when analyzing the application from the program standpoint. There is a separate timeline for each kernel mapped to an AI Engine (core), and the view shows when the kernel is executing (blue) or stalled. A detailed pop-up window with details such as the types of stall and duration comes up when you hover over the stalls in the function view.
The data view shows the data flowing through the stream switch network with slave entry points and master exit points at each hop. This is most useful in finding the routing delays, as well as network congestion effects with packet switching, when one packet might get delayed behind another packet when sharing the same stream channel.
Run-Time Event API for Performance Profiling
You can collect profile statistics of your design by calling event APIs in your PS host code. These event APIs are available both during simulation and when you run the design in hardware.
The AI Engine has hardware performance counters and can be configured to count hardware events for measuring performance metrics. You can use the run-time event API together with the graph control API to profile certain performance metrics during a controlled period of graph execution. The event API supports only platform I/O ports (PLIO and GMIO) to measure performance metrics such as platform I/O port bandwidth, graph throughput, and graph latency.
Profiling Platform I/O Port Bandwidth
The bandwidth of a platform I/O port can be defined as the average number of bytes transferred per second, which can be derived as the total number of bytes transferred divided by the time when the port is transferring or is stalled (for example, due to back pressure). The following example shows how to profile I/O port bandwidth using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and the graph is designed to produce 256 int32 data samples in eight iterations.
gr.init();
// Count the cycles during which the output stream is running or stalled.
event::handle handle = event::start_profiling(plio_out, event::io_total_stream_running_to_idle_cycles);
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
}
gr.run(8);
gr.wait();
long long cycle_count = event::read_profiling(handle);
event::stop_profiling(handle); // release the performance counter
double bandwidth = (double)256 * sizeof(int32) / (cycle_count * 1e-9); // bytes per second, assuming a 1 GHz AI Engine clock
In the example, after the graph is initialized, event::start_profiling is called to configure the AI Engine to count the accumulated clock cycles between the stream running event and the stream idle event. In other words, it counts the number of cycles when the stream port is in the running or stall state. The first argument in event::start_profiling can be a PLIO or a GMIO object. In this case, it is plio_out. The second argument is an event::io_profiling_option enumeration, and in this case, the enumeration is set to event::io_total_stream_running_to_idle_cycles. event::start_profiling returns a handle, which is used later to read the counter value and to stop the profile. After the graph finishes eight iterations, you can call event::read_profiling to read the counter value by supplying the handle. After profiling is done, it is recommended to stop the performance counter by calling event::stop_profiling with the handle so the hardware resources configured to do the profile can be released for other uses. Finally, the bandwidth is derived by dividing the total number of bytes transferred (256 × sizeof(int32)) by the time spent when the stream port is active (cycle_count × 1e-9, assuming the AI Engine is running at 1 GHz).
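The bandwidth formula above hard-codes a 1 GHz clock through the 1e-9 factor. The following is a minimal sketch of the same arithmetic with the clock frequency as a parameter, so the result stays correct for other clock rates. The function name bandwidth_bytes_per_sec and its parameters are hypothetical and are not part of the event API.

```cpp
#include <cassert>
#include <cstddef>

// Bytes per second from a byte count and a measured cycle count.
// clock_hz is the AI Engine clock frequency in Hz (1e9 for 1 GHz).
double bandwidth_bytes_per_sec(std::size_t bytes, long long cycles, double clock_hz) {
    assert(cycles > 0 && clock_hz > 0.0);
    const double seconds = static_cast<double>(cycles) / clock_hz;
    return static_cast<double>(bytes) / seconds;
}
```

As an illustration with a made-up counter reading, 256 four-byte samples (1024 bytes) transferred in 512 cycles at 1 GHz correspond to two bytes per cycle, or 2e9 bytes per second.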
Profiling Graph Throughput
Graph throughput can be defined as the average number of bytes produced (or consumed) per second. The following example shows how to profile graph throughput using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and the graph is designed to produce 256 int32 data samples in eight iterations.
gr.init();
// Count cycles from the stream start event until 256*sizeof(int32) bytes have been transferred.
event::handle handle = event::start_profiling(plio_out, event::io_stream_start_to_bytes_transferred_cycles, 256*sizeof(int32));
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
}
gr.run(8);
gr.wait();
long long cycle_count = event::read_profiling(handle);
event::stop_profiling(handle); // release the performance counters
double throughput = (double)256 * sizeof(int32) / (cycle_count * 1e-9); // bytes per second, assuming a 1 GHz AI Engine clock
In the example, after the graph is initialized, event::start_profiling is called to configure the AI Engine to count the clock cycles from the stream start event to the event that indicates 256 × sizeof(int32) bytes have been transferred, assuming that the stream stops right after the specified number of bytes are transferred. If the stream continues after the number of bytes transferred, the counter continues and never ends. The first argument in event::start_profiling is plio_out, the second argument is set to event::io_stream_start_to_bytes_transferred_cycles, and the third argument specifies the number of bytes to be transferred before stopping the counter. The graph throughput is derived by dividing the total number of bytes produced in eight iterations (256 × sizeof(int32)) by the time spent from the first output data to the last output data (cycle_count × 1e-9, assuming the AI Engine is running at 1 GHz).
Profiling Port Throughput
Port throughput can be measured by counting the number of samples sent in a specific time. Xilinx provides the event::io_stream_running_event_count enumeration to count the running event, which corresponds to the number of samples sent.
After the graph runs, and data transfer from or to the port is stable, the following code can be inserted in the host code to measure the port throughput.
int wait_time_us=2000000;
event::handle handle = event::start_profiling(*plio_port, event::io_stream_running_event_count);
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
}
long long count0 = event::read_profiling(handle);
usleep(wait_time_us);
long long count1 = event::read_profiling(handle);
event::stop_profiling(handle);
long long samples = count1 - count0;
std::cout << "num running samples: " << samples << std::endl;
// Samples per microsecond equals MSPS; divide in floating point to avoid integer truncation.
std::cout << " Throughput: " << (double)samples / wait_time_us << " MSPS " << std::endl;
This method can be used for an infinite running graph, or just to count how many samples are sent or received before the graph is stalled (for whatever reason).
To reduce the variance of the measurement, it is advised to run for many seconds in hardware. Accuracy of this method can vary in hardware emulation.
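The throughput figure is the sample count divided by the wait time in microseconds, which is numerically equal to mega-samples per second (MSPS). The following is a minimal sketch of that conversion as a standalone helper. The name throughput_msps is hypothetical, and the helper uses floating-point division so that rates below 1 MSPS are not truncated to zero.

```cpp
#include <cassert>

// Convert two counter readings taken wait_time_us microseconds apart
// into a throughput in MSPS (samples per microsecond).
double throughput_msps(long long count0, long long count1, long long wait_time_us) {
    assert(wait_time_us > 0);
    const long long samples = count1 - count0;
    return static_cast<double>(samples) / static_cast<double>(wait_time_us);
}
```

With made-up readings of 0 and 1000000 taken 2000000 µs apart, the helper returns 0.5 MSPS, whereas integer division of the same operands would yield 0.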
This profiling method also applies to the AI Engine simulator. You need to replace usleep with the wait function in SystemC, and the wait time needs to be much smaller because simulation is much slower. For example, the usleep call in the preceding code can be replaced with the following function call for the AI Engine simulator.
wait(20,SC_US);
Profiling Graph Latency
Graph latency can be defined as the time spent from receiving the first input data to producing the first output data. The following example shows how to profile graph latency using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and gmio_in is the GMIO object connecting to the graph input port.
gr.init();
// Count cycles between the stream start events of the input and output ports.
event::handle handle = event::start_profiling(gmio_in, plio_out, event::io_stream_start_difference_cycles);
if(handle==event::invalid_handle){
    printf("ERROR: Invalid handle. Only two performance counters in an AIE-PL interface tile\n");
    return 1;
}
gr.run(8);
gr.wait();
long long latency_in_cycles = event::read_profiling(handle);
event::stop_profiling(handle); // release the performance counters
In the example, after the graph is initialized, event::start_profiling is called to configure the AI Engine to count the clock cycles from the stream start event of the input I/O port to the stream start event of the output I/O port. The first and second arguments in event::start_profiling can be GMIO or PLIO ports, representing the input and the output I/O port respectively. In this example, gmio_in is the input I/O port and plio_out is the output I/O port. The third argument is set to the event::io_stream_start_difference_cycles enumeration. The counter value indicates the graph latency in cycles.
Run-Time Event API Performance Counters
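If the required performance counters cannot be acquired, event::start_profiling reports errors such as the following:
[AIE WARNING]: Unable to request resources. RscType: 0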
ERROR: event::start_profiling: Failed to request performance counter resources.
[XRT] ERROR: ERROR: event::start_profiling: Failed to request performance counter resources.: Resource temporarily unavailable
Run-time Event Enumeration | Number of Performance Counters |
---|---|
event::io_total_stream_running_to_idle_cycles | 1 |
event::io_stream_start_to_bytes_transferred_cycles | 2 |
event::io_stream_start_difference_cycles | 1 for input port, 1 for output port |
event::io_stream_running_event_count | 1 |
Performance counters are released by calling event::stop_profiling. The run-time event API can acquire the same performance counters again after they are released.
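The counter requirements in the table can be checked before profiling is started. The following is a minimal sketch that sums the counters needed by a set of requests on one port and compares the total with the counters available. The ProfilingOption enum is a local stand-in, not the real event::io_profiling_option type, and the default budget of two counters per interface tile is taken from the error message used in the examples above. The sketch also assumes that all requests land on the same interface tile.

```cpp
#include <cassert>
#include <vector>

// Local stand-in for the four enumerations listed in the table.
enum class ProfilingOption {
    TotalStreamRunningToIdleCycles,
    StreamStartToBytesTransferredCycles,
    StreamStartDifferenceCycles,  // counted per port: 1 on input, 1 on output
    StreamRunningEventCount
};

// Number of performance counters one option needs on a single port.
int counters_needed(ProfilingOption option) {
    switch (option) {
        case ProfilingOption::TotalStreamRunningToIdleCycles:      return 1;
        case ProfilingOption::StreamStartToBytesTransferredCycles: return 2;
        case ProfilingOption::StreamStartDifferenceCycles:         return 1;
        case ProfilingOption::StreamRunningEventCount:             return 1;
    }
    return 0;  // not reached for valid enumerators
}

// True if all requests fit within the counters of one interface tile.
bool fits_in_tile(const std::vector<ProfilingOption>& requests, int counters_per_tile = 2) {
    assert(counters_per_tile >= 0);
    int total = 0;
    for (ProfilingOption option : requests) total += counters_needed(option);
    return total <= counters_per_tile;
}
```

Under these assumptions, a bytes-transferred measurement uses the whole budget of a tile by itself, so a second profile on the same tile has to wait until the first one is stopped.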