Specialized Graph Constructs
This chapter describes several graph constructs that help when modeling specific scenarios.
Look-up Tables
Static File-scoped Tables
Kernel functions can use private, read-only data structures that are accessed as file-scoped variables. The compiler allocates a limited amount of static heap space for such data. As an example, consider the following header file (user_parameter.h):
#ifndef USER_PARAMETER_H
#define USER_PARAMETER_H
#include <adf.h>
static int32 lutarray[8] = {1,2,3,4,5,6,0,0};
#endif
This header file can be included in the kernel source file, and the look-up table can be accessed directly inside a kernel function. The static modifier ensures that the array definition is local to this file. The AI Engine compiler then allocates this array in static heap space for the processor where this kernel is used.
#include "user_parameter.h"
void simple_lut(input_window_cint16 * in, output_window_cint16 * out){
v4cint32 tmp;
v4cacc48 acc;
v32cint16 coeffs;
coeffs = upd_w(coeffs, 0, *(v8cint16 *)lutarray); // load the 256-bit LUT into the vector register
window_readincr(in, tmp);
acc = mul4(tmp, 0, 0x3210, 1, coeffs, 0, 0x0000, 1);
acc = mac4(acc, tmp, 2, 0x3210, 1, coeffs, 2, 0x0000, 1);
acc = mac4(acc, tmp, 4, 0x3210, 1, coeffs, 4, 0x0000, 1);
window_writeincr(out, srs(acc, 0)); // shift of 0 assumed for the final shift-round-saturate
}
Global Graph-scoped Tables
While the previous example only includes an eight-entry look-up table accessed as a global variable, many other algorithms require much larger look-up tables. Because AI Engine local memory is at a premium, it is much more efficient for the AI Engine compiler to manage the look-up table explicitly for the specific kernels that use it than to leave a large amount of stack or heap space on every processor. Such tables should not be declared static in the kernel header file.
#ifndef USER_PARAMETER_H
#define USER_PARAMETER_H
#include <adf.h>
int32 lutarray[8] = {1,2,3,4,5,6,0,0};
#endif
The kernel source file continues to include the header file and uses the table as before. However, you must now declare this table as extern in the graph class header and use the parameter::array(…) function to create a parameter object explicitly in the graph. You must also attach this parameter object to the kernel, as shown in the following code:
#include <adf.h>
using namespace adf;

extern int32 lutarray[8];

class simple_lut_graph : public graph {
public:
kernel k;
parameter p;
simple_lut_graph() {
k = kernel::create(simple_lut);
p = parameter::array(lutarray);
connect<>(p,k);
...
}
};
Including this explicit specification of the look-up table in the graph description ensures that the compiler is aware of the requirement to reserve a suitably sized piece of memory for the look-up table when it allocates memory for kernel input and output buffers.
Shared Graph-scoped Tables
Sometimes, the same table definition is used in multiple kernels. Because the AI Engine architecture is a distributed address-space architecture, each processor binary image that executes such a kernel needs to have the table defined in its own local memory. To get the correct graph linkage spread across multiple processors, you must declare the table as extern within the kernel source file as well as in the graph class definition file. The actual table definition must then be specified in a separate header file that is attached as a property to the kernel, as shown below.
#include <adf.h>
using namespace adf;

extern int32 lutarray[8];
class simple_lut_graph : public adf::graph {
public:
kernel k;
parameter p;
simple_lut_graph() {
k = kernel::create(simple_lut);
p = parameter::array(lutarray);
connect<>(p,k);
std::vector<std::string> myheaders;
myheaders.push_back("./user_parameter.h");
headers(k) = myheaders;
...
}
};
This ensures that the header file that defines the table is included in the final binary link wherever this kernel is used, without causing redefinition errors.
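For reference, the matching kernel source file declares the same table as extern instead of defining it. The following is a minimal sketch (the kernel body is elided):

#include <adf.h>
extern int32 lutarray[8]; // defined in user_parameter.h, attached with headers(k)

void simple_lut(input_window_cint16 * in, output_window_cint16 * out) {
    // ... access lutarray exactly as in the static example above ...
}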
FIFO Depth
The AI Engine architecture uses stream data extensively for DMA-based I/O, for communicating between two AI Engines, and for communicating between the AI Engine and the programmable logic (PL). This raises the potential for a resource deadlock when the data flow graph has reconvergent data paths. If the pipeline depth of one path is longer than the other, the producer kernel can stall and might not be able to push data into the shorter path because of back pressure. At the same time, the consumer kernel is waiting to receive data on the longer path due to the lack of data. If the order of data production and consumption between two data paths is different, a deadlock can happen even between two kernels that are directly connected with two data paths. The following figure illustrates the paths.
If the producer kernel is trying to push data on stream S1 and runs into back pressure while the consumer kernel is still trying to read data from stream S2, a deadlock occurs. A general way to fix this situation is to create more buffering in the paths that have back pressure by using a fifo_depth constraint on a connection in the source code.
p = kernel::create(producer);
c = kernel::create(consumer);
connect<stream> s1(p.out[0], c.in[0]);
connect<stream> s2(p.out[1], c.in[1]);
fifo_depth(s1) = 20;
fifo_depth(s2) = 10;
The fifo_depth() constraint is only valid on stream and window type kernel connections. It is not available on cascade stream connections, because there is a two-deep, 384-bit wide FIFO on both the input and output cascade streams that allows storing up to four values between AI Engines.

Stream Switch FIFO
The AI Engine has two 32-bit input AXI4-Stream interfaces and two 32-bit output AXI4-Stream interfaces. Each stream is connected to a FIFO on both the input and output side, allowing the AI Engine to have a four-word (128-bit) access every four cycles, or a one-word (32-bit) access per cycle on a stream. A fifo_depth() constraint specification below 40 allocates FIFOs from the stream switch. The following is an example of a FIFO allocation on the stream switch requesting a fifo_depth(8).
DMA FIFO
A fifo_depth() constraint specification above 40 allocates FIFOs from memory, known as DMA FIFOs. The following is an example of a FIFO allocation for a request of fifo_depth(3000), which is allocated in memory.
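In graph code, both cases use the same constraint; the following is a minimal sketch reusing the s1 and s2 connections from the earlier producer/consumer example:

fifo_depth(s2) = 8;    // 40 or less: implemented with stream switch FIFOs
fifo_depth(s1) = 3000; // above 40: implemented as a DMA FIFO in memory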
You can also specify the type of FIFO allocated, whether stream switch or DMA, as well as their locations. More information can be found in FIFO Location Constraints.
AI Engine Tile DMA Performance
In high throughput use cases where the AI Engine and PL throughput is close to the maximum, a DMA FIFO is used, and the PL communicates with the DMA FIFO across an asynchronous PL-to-AI Engine clock boundary, the read side must occasionally wait for data due to the nature of a single DMA FIFO. This can lead to slightly lower than 100% throughput on the AI Engine. Some recommended ways to avoid this small loss in throughput are as follows.
- Choose a fifo_depth constraint of 40 or less at the AI Engine-PL boundaries on streaming connections with a slack of 40 or less, as sketched after this list.
- Add a small asynchronous FIFO in the PL to shift the alignment into the AI Engine clock domain.
- Use a PL clock synchronous to the AI Engine: use a 128-bit AXI4-Stream interface from the PL and a PL clock at integer multiples of the AI Engine frequency.
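The following is a hedged sketch of the first recommendation, assuming a PLIO input pl_in and a kernel k already declared in the graph; keeping the requested depth at 40 or less keeps the FIFO in the stream switch rather than in a DMA FIFO:

connect<stream> plin(pl_in.out[0], k.in[0]); // AI Engine-PL boundary stream
fifo_depth(plin) = 32;                       // 40 or less: stream switch FIFO only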
Kernel Bypass
A bypass encapsulator construct, discussed in Run-Time Graph Reconfiguration Using Control Parameters, is used to execute a kernel conditionally. The bypass is controlled through a run-time parameter: 0 for no bypass and 1 for bypass. In addition to the control parameter, the external connections of a bypassed kernel or graph are directed to the external ports of the bypass construct itself. Internally, the bypass construct is connected to the bypassed kernel or graph automatically by the compiler. The following example shows the required coding.
inout_port control;
bypass b;
kernel f, p, c;
f = kernel::create(filter);
...
b = bypass::create(f);
connect<parameter> (control, b.bp);
connect<window<128>> n1(p.out[0], b.in[0]);
connect<window<128>> n2(b.out[0], c.in[0]);
Explicit Packet Switching
Just as multiple AI Engine kernels can share a single processor and execute in an interleaved manner, multiple stream connections can share a single physical channel. This mechanism is known as packet switching. The AI Engine architecture and compiler work together to provide a programming model where up to four stream connections can share the same physical channel.
The Explicit Packet Switching feature allows fine-grain control over how packets are generated, distributed, and consumed in a graph computation. Explicit packet switching is typically recommended where many low-bandwidth streams from a common PL source can be distributed to different AI Engine destinations. Similarly, many low-bandwidth streams from different AI Engine sources to a common PL destination can also take advantage of this feature. Because a single physical channel is shared between multiple streams, you minimize the number of AI Engine-PL interface streams used. This section describes graph constructs to create packet-switched streams explicitly in the graph.
Packet Switching Graph Constructs
input_pktstream and output_pktstream are introduced to represent multiplexed data streams as input to or output from a kernel, respectively. More details on the packet headers and data types can be found in Packet Stream Operations.

To explicitly control the multiplexing and de-multiplexing of packets, two templated node classes are added to the ADF graph library: pktsplit<n> and pktmerge<n>. A node instance of class pktmerge<n> is an n:1 multiplexer of n packet streams producing a single packet stream. A node instance of class pktsplit<n> is a 1:n de-multiplexer of a packet stream producing n different packet streams. The maximum number of allowable packet streams on a single physical channel is 32 (n ≤ 32). See Adaptive Data Flow Graph Specification Reference for more details.
Kernel ports that carry multiplexed packets are declared as input_pktstream and output_pktstream. To connect a packet stream to a window of data meant for an AI Engine kernel, use the following graph constructs:

connect<pktstream, window<32>>
connect<window<32>, pktstream>

To connect a packet stream directly to a packet-stream kernel port, use:

connect<pktstream, pktstream>

To connect a stream of data from/to a PLIO connection, use the following graph constructs:

connect<input_port, pktstream>
connect<pktstream, output_port>
When a kernel receives packets of data as a window of data, the header and TLAST are dropped prior to the kernel receiving the window of data. If the kernel writes an output window of data, the packet header and TLAST are automatically inserted.
If the kernel receives an input_pktstream of data, the kernel needs to process the packet header and TLAST in addition to the packet data. Similarly, if the kernel sends an output_pktstream of data, the kernel needs to insert the packet header and TLAST, in addition to the packet data, into the output stream. These concepts are illustrated in the following example:
class ExplicitPacketSwitching: public adf::graph {
private:
adf::kernel core[4];
adf::pktsplit<4> sp;
adf::pktmerge<4> mg;
public:
adf::port in;
adf::port out;
ExplicitPacketSwitching() {
core[0] = adf::kernel::create(aie_core1);
core[1] = adf::kernel::create(aie_core2);
core[2] = adf::kernel::create(aie_core3);
core[3] = adf::kernel::create(aie_core4);
adf::source(core[0]) = "aie_core1.cpp";
adf::source(core[1]) = "aie_core2.cpp";
adf::source(core[2]) = "aie_core3.cpp";
adf::source(core[3]) = "aie_core4.cpp";
sp = adf::pktsplit<4>::create();
mg = adf::pktmerge<4>::create();
for(int i=0;i<4;i++){
adf::runtime<adf::ratio>(core[i]) = 0.9;
adf::connect<adf::pktstream, adf::window<32> > (sp.out[i], core[i].in[0]);
adf::connect<adf::window<32>, adf::pktstream > (core[i].out[0], mg.in[i]);
}
adf::connect (in, sp.in[0]);
adf::connect (mg.out[0], out);
}
};
The graph has one input PLIO port and one output PLIO port. The input packet stream from the PL is split four ways and input to four different AI Engine kernels. The output streams from the four AI Engine kernels are merged into one packet stream which is output to the PL. The Vitis analyzer Graph view of the code is shown as follows.
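As noted above, a kernel that connects directly to packet streams must handle the packet header and TLAST itself. The following is a hedged sketch of such a pass-through kernel (pktstream_passthrough is a hypothetical name), assuming the readincr/writeincr packet-stream overloads and the getPacketid/writeHeader helpers described in Packet Stream Operations, and a fixed packet length of eight words:

#include "adf.h"

void pktstream_passthrough(input_pktstream *in, output_pktstream *out) {
    readincr(in);                        // consume the incoming packet header
    uint32 ID = getPacketid(out, 0);     // packet ID assigned to this output
    writeHeader(out, 0, ID);             // generate the output header (packet type 0 assumed)
    bool tlast;
    for (int i = 0; i < 8; i++) {        // eight data words per packet in this sketch
        int32 tmp = readincr(in, tlast); // read a data word and the TLAST flag
        writeincr(out, tmp, tlast);      // forward the word, propagating TLAST
    }
}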
Packet Switching and the AI Engine Simulator
Explicit packet switching is supported by the AI Engine simulator. Consider the example of the previous graph that expects packet switched data from the PL; the data is split inside the AI Engine and sent to four AI Engine kernels. On the output side the four kernel outputs are merged into one output stream to the PL.
The input data file contains all the packet-switched data from the PL for the four AI Engine kernels in the previous example. It contains the data for the different kernels, packet by packet. Each packet of data is one window input for an AI Engine kernel. The data format is as follows.
2415853568
0
1
2
3
4
5
6
TLAST
7
2415853568 is 0x8FFF0000 in hexadecimal format. The five least significant bits are the packet ID, 0 in this case. The last data word in the packet is marked with the keyword TLAST, which denotes the last data for the window input of the kernel. You can construct the header for each packet manually, or write helper functions to generate the header. The AI Engine compiler generates a packet switching report file, Work/reports/packet_switching_report.json, that lists the packet IDs used in the graph. In addition, it generates the Work/temp/packet_ids_c.h and Work/temp/packet_ids_v.h header files that can be included in your C or Verilog kernel code.
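The following is a hedged helper sketch for constructing headers, assuming the layout implied above: the packet ID sits in the five least significant bits, bit 31 is an odd-parity bit over the whole word, and the remaining fields are taken from a known-good template word such as 0x8FFF0000 (make_pkt_header is a hypothetical name):

#include <cstdint>

// Insert a 5-bit packet ID into a header template and recompute parity.
uint32_t make_pkt_header(uint32_t header_template, uint32_t pkt_id) {
    uint32_t h = (header_template & ~0x1Fu) | (pkt_id & 0x1Fu);
    h &= ~0x80000000u;                      // clear the parity bit
    uint32_t ones = 0;
    for (int b = 0; b < 31; b++)            // count ones in bits [30:0]
        ones += (h >> b) & 1u;
    if ((ones & 1u) == 0) h |= 0x80000000u; // set bit 31 so the total count is odd
    return h;
}

For example, make_pkt_header(0x8FFF0000, 0) returns 0x8FFF0000, matching the header shown in the data file above.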
Location Constraints
Kernel Location Constraints
When building large graphs with multiple subgraphs, it is sometimes useful to control the exact mapping of kernels to AI Engines, either relative to other kernels or in an absolute sense. The AI Engine compiler provides a mechanism to specify location constraints for kernels, which, when used with the C++ template class specification, provides a powerful mechanism to create a robust, scalable, and predictable mapping of your graph onto the AI Engine array. It also reduces the choices the mapper has to try, which can considerably speed up the mapper. Consider the following graph specification:
#include <adf.h>
#include "kernels.h
#define NUMCORES (COLS*ROWS)
using namespace adf;
template <int COLS, int ROWS, int STARTCOL, int STARTROW>
class indep_nodes_graph1 : public graph {
public:
kernel kr[NUMCORES];
port<input> datain[NUMCORES];
port<output> dataout[NUMCORES];
indep_nodes_graph1() {
for (int i = 0; i < COLS; i++) {
for (int j = 0; j < ROWS; j++) {
int k = i*ROWS + j;
kr[k] = kernel::create(mykernel);
source(kr[k]) = "kernels/kernel.cc";
runtime<ratio>(kr[k]) = 0.9;
location<kernel>(kr[k]) = tile(STARTCOL+i, STARTROW+j);
}
}
for (int i = 0; i < NUMCORES; i++) {
connect<stream, window<64> >(datain[i], kr[i].in[0]);
connect<window<64>, stream >(kr[i].out[0], dataout[i]);
}
};
};
The template parameters identify a COLS x ROWS logical array of kernels (COLS x ROWS = NUMCORES) that are placed within a larger logical device of some dimensionality starting at (STARTCOL, STARTROW) as the origin. Each kernel in that graph is constrained to be placed on a specific AI Engine. This is accomplished using an absolute location constraint for each kernel placing it on a specific processor tile. For example, the following declaration would create a 1 x 2 kernel array starting at offset (3,2). When embedded within a 4 x 4 logical device topology, the kernel array is constrained to the top right corner.
indep_nodes_graph1<1,2,3,2> mygraph;
Earlier versions of the AI Engine compiler provided the location<absolute>(k) function to specify kernel constraints and the proc(x,y) function to specify a processor tile location. These functions are now deprecated. Instead, use location<kernel>(k) to specify the kernel constraints and tile(x,y) to identify a specific tile location. See Adaptive Data Flow Graph Specification Reference for more information.

Buffer Location Constraints
The AI Engine compiler tries to automatically allocate buffers for windows, lookup tables, and run-time parameters in the most efficient manner possible. However, you might want to explicitly control their placement in memory. Similar to the kernels shown previously in this section, buffers inferred on a kernel port can also be constrained to be mapped to specific tiles, banks, or even address offsets using location constraints, as shown in the following example.
#include <adf.h>
#include "kernels.h"
#define NUMCORES (COLS*ROWS)
using namespace adf;
template <int COLS, int ROWS, int STARTCOL, int STARTROW>
class indep_nodes_graph2 : public graph {
public:
kernel kr[NUMCORES];
port<input> datain[NUMCORES];
port<output> dataout[NUMCORES];
indep_nodes_graph2() {
for (int i = 0; i < COLS; i++) {
for (int j = 0; j < ROWS; j++) {
int k = i*ROWS + j;
kr[k] = kernel::create(mykernel);
source(kr[k]) = "kernels/kernel.cc";
runtime<ratio>(kr[k]) = 0.9;
location<kernel>(kr[k]) = tile(STARTCOL+i, STARTROW+j); // kernel location
location<buffer>(kr[k].in[0]) =
{ address(STARTCOL+i, STARTROW+j, 0x0),
address(STARTCOL+i, STARTROW+j, 0x2000) }; // double buffer location
location<stack>(kr[k]) = bank(STARTCOL+i, STARTROW+j, 2); // stack location
location<buffer>(kr[k].out[0]) = location<kernel>(kr[k]); // relative buffer location
}
}
for (int i = 0; i < NUMCORES; i++) {
connect< stream, window<64> >(datain[i], kr[i].in[0]);
connect< window<64>, stream >(kr[i].out[0], dataout[i]);
}
};
};
In the previous code, the location of the double buffers at port kr[k].in[0] is constrained to the specific memory address offsets that are created using the address(col,row,offset) constructor. Furthermore, the location of the system memory (including the sync buffer, stack, and static heap) for the processor that executes kernel instance kr[k] is constrained to a particular bank using the bank(col,row,bankid) constructor. Finally, the tile location of the buffers connected to the port kr[k].out[0] is constrained to be the same tile as that of the kernel instance kr[k]. Buffer location constraints are only allowed on window kernel ports.
FIFO Location Constraints
The AI Engine compiler tries to automatically allocate FIFOs in the most efficient manner possible. However, you might want to explicitly control their placement in memory, as shown in the following example. This constraint is useful to preserve the placement of FIFO resources between runs of the AI Engine compiler.
Note the following considerations for FIFO constraints.
- If FIFO constraints are used, the entire depth of the FIFO must be constrained. It is not possible to constrain a portion of the FIFO and leave the rest for the compiler to add.
- If FIFO constraints are added to branching nets, the FIFO constraint should be added to each point-to-point net. To share stream switch FIFOs or DMA FIFOs before the branch, duplicate the FIFO type and location on each point-to-point net, as sketched after this list.
- The constraint can also be used to specify the desired type of FIFO without giving a location or depth.
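The following is a hedged sketch of the branching case, assuming two point-to-point nets netA and netB that branch from the same source; giving both nets an identical FIFO type and location shares a single DMA FIFO before the branch:

// Identical type and location on both branch nets shares one FIFO before the branch.
location<fifo>(netA) = { dma_fifo(aie_tile, 2, 0, 0x2100, 192) };
location<fifo>(netB) = { dma_fifo(aie_tile, 2, 0, 0x2100, 192) };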
The following example shows how a FIFO constraint can be used in a graph file.
class AieToAieHierarchicalGraph : public graph
{
public:
input_port in;
output_port out;
aieSrcGraph<64,64,64,0,0> aieSrc; // Depth: 128, Constrained: 80
aieDstGraph<64,64,32,1,0> aieDst; // Depth: 128, Constrained: 48
AieToAieHierarchicalGraph()
{
connect<> net0(in, aieSrc.in);
connect<> net1(aieSrc.out, aieDst.in);
connect<> net2(aieDst.out, out);
fifo_depth(net1) = 64;
//The DMA FIFO depth on net1 is set to 64 32-bit words. This DMA FIFO is
//located on tile 2,0 at address 0x2100, and constrained to 192 32-bit words.
location<fifo>(net1) = { dma_fifo(aie_tile, 2, 0, 0x2100, 192) };
}
};
The second example shows how a FIFO constraint can be added to a constraints file.
"PortConstraints": {
"fifo_locations_records": {
"dma_fifos": {
"DMAFIFO_AIE_MEMGRP_X0Y0_256_511": {
"tile_type": "core",
"row": 0,
"column": 0,
"size": 64,
"offset": 256,
"bankId": 0
}
},
"stream_fifos": {
"SSFIFO_11_1_1": {
"tile_type": "core",
"row": 0,
"column": 11,
"channel": 1
}
}
}
}
Hierarchical Constraints
When creating complex graphs with multiple subgraph classes, or multiple instances of the same subgraph class, the location constraints described above can also be applied to each kernel instance or kernel port instance individually at the point of subgraph instantiation instead of at the definition. In this case, you need to specify the graph-qualified name of the kernel instance or kernel port instance in the constraint, as shown below. Also, make sure that the kernels or ports being constrained are defined as public members of the subgraph.
class ToplevelGraph : public graph {
public:
indep_nodes_graph1<1,2,3,2> mygraph;
port<input> datain[2] ;
port<output> dataout[2] ;
ToplevelGraph() {
for (int i = 0; i < 2; i++) {
connect<stream, window<64> >(datain[i], mygraph.datain[i]);
connect<window<64>, stream >(mygraph.dataout[i], dataout[i]);
// hierarchical constraints
location<stack>(mygraph.kr[i]) = bank(3, 2+i, 2);
location<buffer>(mygraph.kr[i].out[0]) = location<kernel>(mygraph.kr[i]);
}
};
};
A constraints file generated by a previous compilation can be passed back to the AI Engine compiler using the --constraints switch:

aiecompiler --constraints Work/temp/graph_aie_mapped.aiecst src/graph.cpp
Buffer Allocation Control
The AI Engine compiler automatically allocates the desired number of buffers for each memory connection. There are several different cases.
- Lookup tables are always allocated as single buffers because they are expected to be read-only and private to a kernel. No locks are needed to synchronize lookup table accesses because they are expected to be accessed in an exclusive manner.
- Window connections are usually assigned double buffers if the producer and consumer kernels are mapped to different processors, or if the producer or the consumer is a DMA. This enables the two agents to operate in a pipelined manner using ping-pong synchronization with two locks. The AI Engine compiler automatically generates this synchronization in the respective processor main functions.
- If the producer and consumer kernels are mapped to the same processor, then the window connection is given only one buffer and no lock synchronization is needed, because the kernels are executed sequentially.
- Run-time parameter connections can be assigned double buffers (default) along with a selector word to choose the next buffer to be accessed.
Run-time parameter connections can also be assigned single buffers. Sometimes, with window connections, it is desirable to use only single-buffer synchronization instead of double buffers. This is useful when local data memory is at a premium and the performance penalty of using a single buffer for data transfer is not critical. This can be achieved using the single_buffer(port<T>&) constraint.
single_buffer(first.in[0]); //For window input or RTP input
single_buffer(first.inout[0]); //For RTP output
C++ Kernel Class Support
The AI Engine compiler supports C++ kernel classes. The following example shows how to set the filter coefficients and the number of samples of a FIR filter class through a constructor. The C++ kernel class allows the internal state of each kernel instance to be encapsulated within the corresponding class object. In the following code, the filter coefficients (coeffs) are specified through the constructor. This resolves the problem of using file-scope variables, global variables, or static function-scope variables to store the internal state of a C function kernel: when multiple instances of such a kernel are mapped to the same core, the internal state variables are shared across the instances and cause conflicts.
//fir.h
#pragma once
#include "adf.h"
#define NUM_COEFFS 12
class FIR
{
private:
int32 coeffs[NUM_COEFFS];
int32 tapDelayLine[NUM_COEFFS];
uint32 numSamples;
public:
FIR(const int32(&coefficients)[NUM_COEFFS], uint32 samples);
void filter(input_window_int32* in, output_window_int32* out);
static void registerKernelClass()
{
REGISTER_FUNCTION(FIR::filter);
}
};
You are required to write the static void registerKernelClass() method in the header file. Inside the registerKernelClass() method, you need to call the REGISTER_FUNCTION macro. This macro registers the class run method to be executed on the AI Engine core to perform the kernel functionality. In the preceding example, FIR::filter is registered using this macro. The kernel class constructor and run method should be implemented in a separate source file. The implementation of a run method of a kernel class is the same as writing a kernel function as described in previous chapters.
//fir.cpp
//implementation in this example is not optimized and is for illustration purposes
#include "fir.h"
FIR::FIR(const int32(&coefficients)[NUM_COEFFS], uint32 samples)
{
for (int i = 0; i < NUM_COEFFS; i++)
coeffs[i] = coefficients[i];
for (int i = 0; i < NUM_COEFFS; i++)
tapDelayLine[i] = 0;
numSamples = samples;
}
void FIR::filter(input_window_int32* in, output_window_int32* out)
{
for (int i = 0; i < numSamples; i++)
{
for (int j = NUM_COEFFS-1; j > 0; j--)
tapDelayLine[j] = tapDelayLine[j - 1];
tapDelayLine[0] = window_readincr(in);
int32 y = 0;
for (int j = 0; j < NUM_COEFFS; j++)
{
y += coeffs[j] * tapDelayLine[j];
}
window_writeincr(out, y);
}
}
//graph.h
#pragma once
#include "adf.h"
#include "fir.h"
using namespace adf;
class mygraph : public graph
{
public:
input_port in1, in2;
output_port out1, out2;
kernel k1, k2;
mygraph()
{
//see lab8.3 for narrow filter coefficients
k1 = kernel::create_object<FIR>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
runtime<ratio>(k1) = 0.1;
source(k1) = "src/fir.cpp";
//see lab8.3 for wide filter coefficients
k2 = kernel::create_object<FIR>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }), 8);
runtime<ratio>(k2) = 0.1;
source(k2) = "src/fir.cpp";
connect<window<32>>(in1, k1.in[0]);
connect<window<32>>(in2, k2.in[0]);
connect<window<32>>(k1.out[0], out1);
connect<window<32>>(k2.out[0], out2);
}
};
For a kernel class with a non-default constructor, you can specify the constructor parameter values in the arguments of kernel::create_object when creating a representation of a kernel instance. In the previous example, two FIR filter kernels (k1 and k2) are created using kernel::create_object<FIR>. k1 has filter coefficients { 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 } and k2 has filter coefficients { -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }. Both of them consume eight samples on each invocation.
The following code shows the AI Engine compiler generated program. The two FIR kernel objects are instantiated with the proper constructor parameters.
//Work/aie/x_y/src/x_y.cc
...
FIR i4({180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504}, 8);
FIR i5({-21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539}, 8);
int main(void) {
...
// Kernel call : i4:filter
i4.filter(get_input_window_int32(window_buf0_buf0d),get_output_window_int32(window_buf2_buf2d));
...
// Kernel call : i5:filter
i5.filter(get_input_window_int32(window_buf1_buf1d),get_output_window_int32(window_buf3_buf3d));
...
}
A kernel class can have a member variable occupying a significant amount of memory space that might not fit into program memory. The location of a kernel class member variable can be controlled. The AI Engine compiler supports array reference member variables, which allow the compiler to allocate or constrain the memory space while passing the reference to the object.
//fir.h
#pragma once
#include "adf.h"
#define NUM_COEFFS 12
class FIR
{
private:
int32 (&coeffs)[NUM_COEFFS];
int32 tapDelayLine[NUM_COEFFS];
uint32 numSamples;
public:
FIR(int32(&coefficients)[NUM_COEFFS], uint32 samples);
void filter(input_window_int32* in, output_window_int32* out);
static void registerKernelClass()
{
REGISTER_FUNCTION(FIR::filter);
REGISTER_PARAMETER(coeffs);
}
};
//fir.cpp
#include "fir.h"
FIR::FIR(int32(&coefficients)[NUM_COEFFS], uint32 samples)
: coeffs(coefficients)
{
for (int i = 0; i < NUM_COEFFS; i++)
tapDelayLine[i] = 0;
numSamples = samples;
}
void FIR::filter(input_window_int32* in, output_window_int32* out)
{
...
}
The previous example shows a slightly modified version of the FIR kernel class. Here, the member variable coeffs has the int32 (&)[NUM_COEFFS] data type. The constructor initializer coeffs(coefficients) initializes coeffs to a reference to an array allocated externally to the class object. To let the AI Engine compiler know that the coeffs member variable is intended to be allocated by the compiler, you must use REGISTER_PARAMETER to register the array reference member variable inside the registerKernelClass() method.
The use of kernel::create_object to create a representation of a FIR kernel instance, and to specify the initial values of the constructor parameters, is the same as in the previous example. See the following code.
//graph.h
...
class mygraph : public graph
{
...
mygraph()
{
k1 = kernel::create_object<FIR>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
...
k2 = kernel::create_object<FIR>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }), 8);
...
}
};
The following code shows the corresponding AI Engine compiler generated program. The memory spaces for int32 i4_coeffs[12] and int32 i5_coeffs[12] are outside the kernel object instances and are passed into the FIR objects by reference.
//Work/aie/x_y/src/x_y.cc
int32 i4_coeffs[12] = {180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504};
FIR i4(i4_coeffs, 8);
int32 i5_coeffs[12] = {-21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539};
FIR i5(i5_coeffs, 8);
int main(void) {
...
// Kernel call : i4:filter
i4.filter(get_input_window_int32(window_buf0_buf0d),get_output_window_int32(window_buf2_buf2d));
...
// Kernel call : i5:filter
i5.filter(get_input_window_int32(window_buf1_buf1d),get_output_window_int32(window_buf3_buf3d));
...
}
Because the memory space for an array reference member variable is allocated by the AI Engine compiler, a location constraint can be applied to constrain the memory location of these arrays, as shown in the following example code. The REGISTER_PARAMETER macro allows kernel::create_object to create a parameter handle for an array reference member variable, such as k1.param[0] and k2.param[0], to which the location<parameter> constraint can be applied.
//graph.h
...
class mygraph : public graph
{
...
mygraph()
{
k1 = kernel::create_object<FIR>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
...
k2 = kernel::create_object<FIR>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }), 8);
...
location<parameter>(k1.param[0]) = address(…);
location<parameter>(k2.param[0]) = bank(…);
}
};
The C++ kernel class header files and the C++ kernel function templates (see C++ Template Support) should not contain single-core specific intrinsic APIs or pragmas. This is the same programming guideline as for writing regular C function kernels. These header files are included in the graph header file and can be cross-compiled as part of the PS program, and the Arm® cross-compiler cannot understand single-core intrinsic APIs or pragmas. Single-core specific programming content must be kept inside the source files.
C++ Template Support
A template is a powerful tool in C++. By passing the data type as a parameter, you eliminate the need to rewrite code to support different data types. Templates are expanded at compile time, like macros. The difference is that the compiler performs type checking before template expansion. The source code contains only one template function or class definition, but the compiled code can contain multiple instantiations of the same function or class. Type parameters, non-type parameters, default arguments, scalar parameters, and template parameters can be passed to a template, and the compiler instantiates the function or class accordingly.
- Support for general C++ template features.
- Supported data types (T) and connection types between kernels:
  - Data type (T): int8, uint8, int16, uint16, cint16, int32, uint32, cint32, int64, uint64, float, cfloat. IMPORTANT: The acc48 and cacc48 data types are not supported in template stream connections.
  - Function parameter type: input_window<T>, output_window<T>, input_stream<T>, output_stream<T>
- The compiler does not support pre-compiled headers for template kernels.
Function Templates
Function template source code defines a generic function that can be used for different data types. Example function template:
// add.h
template<typename ELEMENT_TYPE, int FACTOR, size_t NUM_SAMPLES> void add(input_window<ELEMENT_TYPE>* in,
output_window<ELEMENT_TYPE>* out);
// add.cpp
template<typename ELEMENT_TYPE, int FACTOR, size_t NUM_SAMPLES> void add(input_window<ELEMENT_TYPE>* in,
output_window<ELEMENT_TYPE>* out)
{
for (int i=0; i<NUM_SAMPLES; i++)
{
ELEMENT_TYPE value = window_readincr(in);
value += FACTOR;
window_writeincr(out, value);
}
}
// graph.h
mygraph()
{
k[0] = kernel::create(add<int32, 6, 8>);
k[1] = kernel::create(add<int16, 3, 8>);
for (int i=0; i<NUM_KERNELS; i++)
{
runtime<ratio>(k[i]) = 0.3;
source(k[i]) = "src/add.cpp";
}
connect<window<32>>(in[0], k[0].in[0]);
connect<window<32>>(k[0].out[0], out[0]);
connect<window<16>>(in[1], k[1].in[0]);
connect<window<16>>(k[1].out[0], out[1]);
}
where:
- add.h declares the template add() function.
- add.cpp defines the code for the template add() function.
- graph.h uses the template add() function within the mygraph class.
Class Templates
Like function templates, class templates are useful when a class defines an object that is independent of a specific data type. Example class template:
// fir.h
...
template<size_t NUM_COEFFS, typename ELEMENT_TYPE> class FIR
{
private:
ELEMENT_TYPE (&coeffs)[NUM_COEFFS];
ELEMENT_TYPE tapDelayLine[NUM_COEFFS];
uint32 numSamples;
public:
FIR(ELEMENT_TYPE(&coefficients)[NUM_COEFFS], uint32 samples);
void filter(input_window<ELEMENT_TYPE>* in, output_window<ELEMENT_TYPE>* out);
//user needs to write this function to register necessary info
static void registerKernelClass()
{
REGISTER_FUNCTION(FIR::filter);
REGISTER_PARAMETER(coeffs);
}
};
// fir.cpp
...
template<size_t NUM_COEFFS, typename ELEMENT_TYPE> FIR<NUM_COEFFS, ELEMENT_TYPE>::FIR(ELEMENT_TYPE(&coefficients)[NUM_COEFFS], uint32 samples):coeffs(coefficients)
{
...
}
template<size_t NUM_COEFFS, typename ELEMENT_TYPE> void FIR<NUM_COEFFS, ELEMENT_TYPE>::filter(input_window<ELEMENT_TYPE>* in, output_window<ELEMENT_TYPE>* out)
{
...
}
// graph.h
...
mygraph()
{
k1 = kernel::create_object<FIR<12, int32>>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
runtime<ratio>(k1) = 0.1;
source(k1) = "src/fir.cpp";
headers(k1) = { "src/fir.h" };
k2 = kernel::create_object<FIR<15, int32>>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539, 0, 0, 0 }), 8);
runtime<ratio>(k2) = 0.1;
source(k2) = "src/fir.cpp";
headers(k2) = { "src/fir.h" };
...
}
where:
- fir.h defines the class template where class FIR is declared.
- fir.cpp contains the class FIR implementation, including the member function filter implementation.
- graph.h demonstrates the template class FIR instantiation within the mygraph class.
Multicast Support
Various multicast scenarios are supported in the graph, such as from a window to multiple windows, from a stream to multiple streams, from a PLIO to multiple windows, and so on. This section lists the supported types of multicast from a single source to multiple destinations. For more details on PLIO, FileIO, and GMIO, see Using a Virtual Platform.
# | Source | Destination 1 | Destination 2 | Supported |
---|---|---|---|---|
1 | AI Engine Window | AI Engine Window | AI Engine Window | Not Supported |
2 | AI Engine Window | AI Engine Window | AI Engine Stream | Not Supported |
3 | AI Engine Window | AI Engine Window | PLIO/FileIO/GMIO | Not Supported |
4 | AI Engine Window | AI Engine Stream | AI Engine Stream | Supported |
5 | AI Engine Window | AI Engine Stream | PLIO/FileIO | Supported |
6 | AI Engine Window | PLIO/FileIO | PLIO/FileIO | Supported |
7 | AI Engine Stream | AI Engine Window | AI Engine Window | Supported |
8 | AI Engine Stream | AI Engine Window | AI Engine Stream | Supported |
9 | AI Engine Stream | AI Engine Window | PLIO/FileIO/GMIO | Supported |
10 | AI Engine Stream | AI Engine Stream | AI Engine Stream | Supported |
11 | AI Engine Stream | AI Engine Stream | PLIO/FileIO/GMIO | Supported |
12 | AI Engine Stream | PLIO/FileIO/GMIO | PLIO/FileIO/GMIO | Supported |
13 | PLIO/FileIO/GMIO | AI Engine Window | AI Engine Window | Supported |
14 | PLIO/FileIO/GMIO | AI Engine Window | AI Engine Stream | Not Supported |
15 | PLIO/FileIO/GMIO | AI Engine Window | PLIO/FileIO/GMIO | Not Supported |
16 | PLIO/FileIO/GMIO | AI Engine Stream | AI Engine Stream | Supported |
17 | PLIO/FileIO/GMIO | AI Engine Stream | PLIO/FileIO/GMIO | Not Supported |
18 | PLIO/FileIO/GMIO | PLIO/FileIO/GMIO | PLIO/FileIO/GMIO | Not Supported |
19 | AI Engine Window | AI Engine Stream | GMIO | Not Supported |
20 | AI Engine Window | GMIO | GMIO | Not Supported |
21 | AI Engine Window | PLIO/FileIO | GMIO | Not Supported |
Note the following.
- All source and destination windows in the multicast connections are required to have the same size.
- RTP and packet switching are not covered in this section.
- If the multicast type is supported, the number of destinations is not limited, provided they fit into the hardware.
When multiple streams are connected to the same source, the data is sent to all the destination ports at the same time, and only when all destinations are ready to receive data. This might cause a stream stall or a design hang if the FIFO depths of the stream connections are not deep enough. Refer to the examples in the AI Engine Kernel Coding Best Practices Guide (UG1079) for more information about stream stalls and potential solutions.
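For example, the following is a hedged sketch of row 10 in the table (one AI Engine stream source multicast to two AI Engine stream destinations), assuming kernels k0, k1, and k2 already created in the graph; a FIFO on each branch gives a slower consumer some room before the multicast point stalls:

connect<stream> b1(k0.out[0], k1.in[0]); // one stream source...
connect<stream> b2(k0.out[0], k2.in[0]); // ...multicast to two stream destinations
fifo_depth(b1) = 32; // per-branch FIFOs reduce the chance of a multicast stall
fifo_depth(b2) = 32;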