Specialized Graph Constructs
This chapter describes several graph constructs that help when modeling specific scenarios.
Look-up Tables
Static File-scoped Tables
Kernel functions can use private, read-only data structures that are accessed as file-scoped variables. The compiler allocates a limited amount of static heap space for such data. As an example, consider the following header file (user_parameter.h):
#ifndef USER_PARAMETER_H
#define USER_PARAMETER_H
#include <adf.h>
static int32 lutarray[8] = {1,2,3,4,5,6,0,0};
#endif
This header file can be included in the kernel source file, and the look-up table can be accessed directly inside a kernel function. The static modifier ensures that the array definition is local to this file. The AI Engine compiler then allocates this array in static heap space for the processor where this kernel is used.
#include "user_parameter.h"
void simple_lut(input_window_cint16 * in, output_window_cint16 * out){
v4cint32 tmp;
v4cacc48 acc;
v32cint16 coeffs;
coeffs = upd_w(coeffs, 0, *(v8cint16 *)lutarray); // load the 256-bit LUT into the vector register
window_readincr(in, tmp);
acc = mul4(tmp, 0, 0x3210, 1, coeffs, 0, 0x0000, 1);
acc = mac4(acc, tmp, 2, 0x3210, 1, coeffs, 2, 0x0000, 1);
acc = mac4(acc, tmp, 4, 0x3210, 1, coeffs, 4, 0x0000, 1);
window_writeincr(out, srs(acc, 0)); // shift of 0 assumed for the final shift-round-saturate
}
Global Graph-scoped Tables
While the previous example only includes an eight-entry look-up table accessed as a global variable, many other algorithms require much larger look-up tables. Because AI Engine local memory is at a premium, it is much more efficient for the AI Engine compiler to manage the look-up table explicitly for the specific kernels that use it than to leave a large amount of stack or heap space on every processor. Such tables should not be declared static in the kernel header file.
#ifndef USER_PARAMETER_H
#define USER_PARAMETER_H
#include <adf.h>
int32 lutarray[8] = {1,2,3,4,5,6,0,0};
#endif
The kernel source file continues to include the header file and uses the table as before. However, you must now declare this table as extern in the graph class header and use the parameter::array(…) function to create a parameter object explicitly in the graph. You must also attach this parameter object to the kernel, as shown in the following code:
#include <adf.h>
using namespace adf;

extern int32 lutarray[8];

class simple_lut_graph : public graph {
public:
kernel k;
parameter p;
simple_lut_graph() {
k = kernel::create(simple_lut);
p = parameter::array(lutarray);
connect<>(p,k);
...
}
};
Including this explicit specification of the look-up table in the graph description ensures that the compiler is aware of the requirement to reserve a suitably sized piece of memory for the look-up table when it allocates memory for kernel input and output buffers.
Shared Graph-scoped Tables
Sometimes, the same table definition is used in multiple kernels. Because the AI Engine architecture is a distributed address-space architecture, each processor binary image that executes such a kernel needs to have the table defined in its own local memory. To get the correct graph linkage spread across multiple processors, you must declare the table as extern within the kernel source file as well as in the graph class definition file. The actual table definition must then be specified in a separate header file that is attached as a property to the kernel, as shown below.
#include <adf.h>
using namespace adf;

extern int32 lutarray[8];
class simple_lut_graph : public adf::graph {
public:
kernel k;
parameter p;
simple_lut_graph() {
k = kernel::create(simple_lut);
p = parameter::array(lutarray);
connect<>(p,k);
std::vector<std::string> myheaders;
myheaders.push_back("./user_parameter.h");
headers(k) = myheaders;
...
}
};
This ensures that the header file that defines the table is included in the final binary link wherever this kernel is used, without causing redefinition errors.
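For reference, the matching kernel source file declares the same table as extern instead of defining it. The following is a minimal sketch (the kernel body is elided):

#include <adf.h>
extern int32 lutarray[8]; // defined in user_parameter.h, attached with headers(k)

void simple_lut(input_window_cint16 * in, output_window_cint16 * out) {
    // ... access lutarray exactly as in the static example above ...
}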
FIFO Depth
The AI Engine architecture uses stream data extensively for DMA-based I/O, for communicating between two AI Engines, and for communicating between the AI Engine and the programmable logic (PL). This raises the potential for a resource deadlock when the data flow graph has reconvergent data paths. If the pipeline depth of one path is longer than the other, the producer kernel can stall and might not be able to push data into the shorter path because of back pressure. At the same time, the consumer kernel is waiting to receive data on the longer path due to the lack of data. If the order of data production and consumption between two data paths is different, a deadlock can happen even between two kernels that are directly connected with two data paths. The following figure illustrates the paths.
If the producer kernel is trying to push data on stream S1 and runs into back pressure while the consumer kernel is still trying to read data from stream S2, a deadlock occurs. A general way to fix this situation is to create more buffering in the paths that have back pressure by using a fifo_depth constraint on a connection in the source code.
p = kernel::create(producer);
c = kernel::create(consumer);
connect<stream> s1(p.out[0], c.in[0]);
connect<stream> s2(p.out[1], c.in[1]);
fifo_depth(s1) = 20;
fifo_depth(s2) = 10;
The fifo_depth() constraint is only valid on stream and window type kernel connections. It is not available on cascade stream connections, because there is a two-deep, 384-bit wide FIFO on both the input and output cascade streams that allows storing up to four values between AI Engines.

Stream Switch FIFO
The AI Engine has two 32-bit input AXI4-Stream interfaces and two 32-bit output AXI4-Stream interfaces. Each stream is connected to a FIFO on both the input and output side, allowing the AI Engine to have a four-word (128-bit) access every four cycles, or a one-word (32-bit) access per cycle on a stream. A fifo_depth() constraint specification below 40 allocates FIFOs from the stream switch. The following is an example of a FIFO allocation on the stream switch requesting a fifo_depth(8).
DMA FIFO
A fifo_depth() constraint specification above 40 allocates FIFOs from memory, known as DMA FIFOs. The following is an example of a FIFO allocation for a request of fifo_depth(3000), which is allocated in memory.
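In graph code, both cases use the same constraint; the following is a minimal sketch reusing the s1 and s2 connections from the earlier producer/consumer example:

fifo_depth(s2) = 8;    // 40 or less: implemented with stream switch FIFOs
fifo_depth(s1) = 3000; // above 40: implemented as a DMA FIFO in memory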
You can also specify the type of FIFO allocated, whether stream switch or DMA, as well as their locations. More information can be found in FIFO Location Constraints.
AI Engine Tile DMA Performance
In high throughput use cases where the AI Engine and PL throughput is close to the maximum, a DMA FIFO is used, and the PL communicates with the DMA FIFO across an asynchronous PL-to-AI Engine clock boundary, the read side must occasionally wait for data due to the nature of a single DMA FIFO. This can lead to slightly lower than 100% throughput on the AI Engine. Some recommended ways to avoid this small loss in throughput are as follows.
- Choose a fifo_depth constraint of 40 or less at the AI Engine-PL boundaries on streaming connections with a slack of 40 or less, as sketched after this list.
- Add a small asynchronous FIFO in the PL to shift the alignment into the AI Engine clock domain.
- Use a PL clock synchronous to the AI Engine: use a 128-bit AXI4-Stream interface from the PL and a PL clock at integer multiples of the AI Engine frequency.
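The following is a hedged sketch of the first recommendation, assuming a PLIO input pl_in and a kernel k already declared in the graph; keeping the requested depth at 40 or less keeps the FIFO in the stream switch rather than in a DMA FIFO:

connect<stream> plin(pl_in.out[0], k.in[0]); // AI Engine-PL boundary stream
fifo_depth(plin) = 32;                       // 40 or less: stream switch FIFO only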
Kernel Bypass
A bypass encapsulator construct, discussed in Run-Time Graph Reconfiguration Using Control Parameters, is used to execute a kernel conditionally. The bypass is controlled through a run-time parameter: 0 for no bypass and 1 for bypass. In addition to the control parameter, the external connections of a bypassed kernel or graph are directed to the external ports of the bypass construct itself. Internally, the bypass construct is connected to the bypassed kernel or graph automatically by the compiler. The following example shows the required coding.
inout_port control;
bypass b;
kernel f, p, c;
f = kernel::create(filter);
...
b = bypass::create(f);
connect<parameter> (control, b.bp);
connect<window<128>> n1(p.out[0], b.in[0]);
connect<window<128>> n2(b.out[0], c.in[0]);
Explicit Packet Switching
Just as multiple AI Engine kernels can share a single processor and execute in an interleaved manner, multiple stream connections can share a single physical channel. This mechanism is known as packet switching. The AI Engine architecture and compiler work together to provide a programming model where up to four stream connections can share the same physical channel.
The Explicit Packet Switching feature allows fine-grain control over how packets are generated, distributed, and consumed in a graph computation. Explicit packet switching is typically recommended where many low-bandwidth streams from a common PL source can be distributed to different AI Engine destinations. Similarly, many low-bandwidth streams from different AI Engine sources to a common PL destination can also take advantage of this feature. Because a single physical channel is shared between multiple streams, you minimize the number of AI Engine-PL interface streams used. This section describes graph constructs to create packet-switched streams explicitly in the graph.
Packet Switching Graph Constructs
input_pktstream and output_pktstream are introduced to represent multiplexed data streams as input to or output from a kernel, respectively. More details on the packet headers and data types can be found in Packet Stream Operations.

To explicitly control the multiplexing and de-multiplexing of packets, two templated node classes are added to the ADF graph library: pktsplit<n> and pktmerge<n>. A node instance of class pktmerge<n> is an n:1 multiplexer of n packet streams producing a single packet stream. A node instance of class pktsplit<n> is a 1:n de-multiplexer of a packet stream producing n different packet streams. The maximum number of allowable packet streams on a single physical channel is 32 (n ≤ 32). See Adaptive Data Flow Graph Specification Reference for more details.
Kernel ports that carry multiplexed packets are declared as input_pktstream and output_pktstream. To connect a packet stream to a window of data meant for an AI Engine kernel, use the following graph constructs:

connect<pktstream, window<32>>
connect<window<32>, pktstream>

To connect a packet stream directly to a packet-stream kernel port, use:

connect<pktstream, pktstream>

To connect a stream of data from/to a PLIO connection, use the following graph constructs:

connect<input_port, pktstream>
connect<pktstream, output_port>
When a kernel receives packets of data as a window of data, the header and TLAST are dropped prior to the kernel receiving the window of data. If the kernel writes an output window of data, the packet header and TLAST are automatically inserted.
If the kernel receives an input_pktstream of data, the kernel needs to process the packet header and TLAST in addition to the packet data. Similarly, if the kernel sends an output_pktstream of data, the kernel needs to insert the packet header and TLAST, in addition to the packet data, into the output stream. These concepts are illustrated in the following example:
class ExplicitPacketSwitching: public adf::graph {
private:
adf::kernel core[4];
adf::pktsplit<4> sp;
adf::pktmerge<4> mg;
public:
adf::port in;
adf::port out;
ExplicitPacketSwitching() {
core[0] = adf::kernel::create(aie_core1);
core[1] = adf::kernel::create(aie_core2);
core[2] = adf::kernel::create(aie_core3);
core[3] = adf::kernel::create(aie_core4);
adf::source(core[0]) = "aie_core1.cpp";
adf::source(core[1]) = "aie_core2.cpp";
adf::source(core[2]) = "aie_core3.cpp";
adf::source(core[3]) = "aie_core4.cpp";
sp = adf::pktsplit<4>::create();
mg = adf::pktmerge<4>::create();
for(int i=0;i<4;i++){
adf::runtime<adf::ratio>(core[i]) = 0.9;
adf::connect<adf::pktstream, adf::window<32> > (sp.out[i], core[i].in[0]);
adf::connect<adf::window<32>, adf::pktstream > (core[i].out[0], mg.in[i]);
}
adf::connect (in, sp.in[0]);
adf::connect (mg.out[0], out);
}
};
The graph has one input PLIO port and one output PLIO port. The input packet stream from the PL is split four ways and input to four different AI Engine kernels. The output streams from the four AI Engine kernels are merged into one packet stream which is output to the PL. The Vitis analyzer Graph view of the code is shown as follows.
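As noted above, a kernel that connects directly to packet streams must handle the packet header and TLAST itself. The following is a hedged sketch of such a pass-through kernel (pktstream_passthrough is a hypothetical name), assuming the readincr/writeincr packet-stream overloads and the getPacketid/writeHeader helpers described in Packet Stream Operations, and a fixed packet length of eight words:

#include "adf.h"

void pktstream_passthrough(input_pktstream *in, output_pktstream *out) {
    readincr(in);                        // consume the incoming packet header
    uint32 ID = getPacketid(out, 0);     // packet ID assigned to this output
    writeHeader(out, 0, ID);             // generate the output header (packet type 0 assumed)
    bool tlast;
    for (int i = 0; i < 8; i++) {        // eight data words per packet in this sketch
        int32 tmp = readincr(in, tlast); // read a data word and the TLAST flag
        writeincr(out, tmp, tlast);      // forward the word, propagating TLAST
    }
}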
Packet Switching and the AI Engine Simulator
Explicit packet switching is supported by the AI Engine simulator. Consider the example of the previous graph that expects packet switched data from the PL; the data is split inside the AI Engine and sent to four AI Engine kernels. On the output side the four kernel outputs are merged into one output stream to the PL.
The input data file contains all the packet-switched data from the PL for the four AI Engine kernels in the previous example. It contains the data for the different kernels, packet by packet. Each packet of data is one window input for an AI Engine kernel. The data format is as follows.
2415853568
0
1
2
3
4
5
6
TLAST
7
2415853568 is 0x8FFF0000 in hexadecimal format. The five least significant bits are the packet ID, 0 in this case. The last data word in the packet is marked with the keyword TLAST, which denotes the last data for the window input of the kernel. You can construct the header for each packet manually, or write helper functions to generate the header. The AI Engine compiler generates a packet switching report file, Work/reports/packet_switching_report.json, that lists the packet IDs used in the graph. In addition, it generates the Work/temp/packet_ids_c.h and Work/temp/packet_ids_v.h header files that can be included in your C or Verilog kernel code.
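The following is a hedged helper sketch for constructing headers, assuming the layout implied above: the packet ID sits in the five least significant bits, bit 31 is an odd-parity bit over the whole word, and the remaining fields are taken from a known-good template word such as 0x8FFF0000 (make_pkt_header is a hypothetical name):

#include <cstdint>

// Insert a 5-bit packet ID into a header template and recompute parity.
uint32_t make_pkt_header(uint32_t header_template, uint32_t pkt_id) {
    uint32_t h = (header_template & ~0x1Fu) | (pkt_id & 0x1Fu);
    h &= ~0x80000000u;                      // clear the parity bit
    uint32_t ones = 0;
    for (int b = 0; b < 31; b++)            // count ones in bits [30:0]
        ones += (h >> b) & 1u;
    if ((ones & 1u) == 0) h |= 0x80000000u; // set bit 31 so the total count is odd
    return h;
}

For example, make_pkt_header(0x8FFF0000, 0) returns 0x8FFF0000, matching the header shown in the data file above.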
Location Constraints
Kernel Location Constraints
When building large graphs with multiple subgraphs, it is sometimes useful to control the exact mapping of kernels to AI Engines, either relative to other kernels or in an absolute sense. The AI Engine compiler provides a mechanism to specify location constraints for kernels, which, when used with the C++ template class specification, provides a powerful mechanism to create a robust, scalable, and predictable mapping of your graph onto the AI Engine array. It also reduces the choices the mapper has to try, which can considerably speed up the mapper. Consider the following graph specification:
#include <adf.h>
#include "kernels.h
#define NUMCORES (COLS*ROWS)
using namespace adf;
template <int COLS, int ROWS, int STARTCOL, int STARTROW>
class indep_nodes_graph1 : public graph {
public:
kernel kr[NUMCORES];
port<input> datain[NUMCORES];
port<output> dataout[NUMCORES];
indep_nodes_graph1() {
for (int i = 0; i < COLS; i++) {
for (int j = 0; j < ROWS; j++) {
int k = i*ROWS + j;
kr[k] = kernel::create(mykernel);
source(kr[k]) = "kernels/kernel.cc";
runtime<ratio>(kr[k]) = 0.9;
location<kernel>(kr[k]) = tile(STARTCOL+i, STARTROW+j);
}
}
for (int i = 0; i < NUMCORES; i++) {
connect<stream, window<64> >(datain[i], kr[i].in[0]);
connect<window<64>, stream >(kr[i].out[0], dataout[i]);
}
};
};
The template parameters identify a COLS x ROWS logical array of kernels (COLS x ROWS = NUMCORES) that are placed within a larger logical device of some dimensionality starting at (STARTCOL, STARTROW) as the origin. Each kernel in that graph is constrained to be placed on a specific AI Engine. This is accomplished using an absolute location constraint for each kernel placing it on a specific processor tile. For example, the following declaration would create a 1 x 2 kernel array starting at offset (3,2). When embedded within a 4 x 4 logical device topology, the kernel array is constrained to the top right corner.
indep_nodes_graph1<1,2,3,2> mygraph;
Earlier versions of the AI Engine compiler provided the location<absolute>(k) function to specify kernel constraints and the proc(x,y) function to specify a processor tile location. These functions are now deprecated. Instead, use location<kernel>(k) to specify the kernel constraints and tile(x,y) to identify a specific tile location. See Adaptive Data Flow Graph Specification Reference for more information.

Buffer Location Constraints
The AI Engine compiler tries to automatically allocate buffers for windows, lookup tables, and run-time parameters in the most efficient manner possible. However, you might want to explicitly control their placement in memory. Similar to the kernels shown previously in this section, buffers inferred on a kernel port can also be constrained to be mapped to specific tiles, banks, or even address offsets using location constraints, as shown in the following example.
#include <adf.h>
#include "kernels.h"
#define NUMCORES (COLS*ROWS)
using namespace adf;
template <int COLS, int ROWS, int STARTCOL, int STARTROW>
class indep_nodes_graph2 : public graph {
public:
kernel kr[NUMCORES];
port<input> datain[NUMCORES];
port<output> dataout[NUMCORES];
indep_nodes_graph2() {
for (int i = 0; i < COLS; i++) {
for (int j = 0; j < ROWS; j++) {
int k = i*ROWS + j;
kr[k] = kernel::create(mykernel);
source(kr[k]) = "kernels/kernel.cc";
runtime<ratio>(kr[k]) = 0.9;
location<kernel>(kr[k]) = tile(STARTCOL+i, STARTROW+j); // kernel location
location<buffer>(kr[k].in[0]) =
{ address(STARTCOL+i, STARTROW+j, 0x0),
address(STARTCOL+i, STARTROW+j, 0x2000) }; // double buffer location
location<stack>(kr[k]) = bank(STARTCOL+i, STARTROW+j, 2); // stack location
location<buffer>(kr[k].out[0]) = location<kernel>(kr[k]); // relative buffer location
}
}
for (int i = 0; i < NUMCORES; i++) {
connect< stream, window<64> >(datain[i], kr[i].in[0]);
connect< window<64>, stream >(kr[i].out[0], dataout[i]);
}
};
};
In the previous code, the location of the double buffers at port kr[k].in[0] is constrained to the specific memory address offsets that are created using the address(col,row,offset) constructor. Furthermore, the location of the system memory (including the sync buffer, stack, and static heap) for the processor that executes kernel instance kr[k] is constrained to a particular bank using the bank(col,row,bankid) constructor. Finally, the tile location of the buffers connected to the port kr[k].out[0] is constrained to be the same tile as that of the kernel instance kr[k]. Buffer location constraints are only allowed on window kernel ports.
FIFO Location Constraints
The AI Engine compiler tries to automatically allocate FIFOs in the most efficient manner possible. However, you might want to explicitly control their placement in memory, as shown in the following example. This constraint is useful to preserve the placement of FIFO resources between runs of the AI Engine compiler.
Note the following considerations for FIFO constraints.
- If FIFO constraints are used, the entire depth of the FIFO must be constrained. It is not possible to constrain a portion of the FIFO and leave the rest for the compiler to add.
- If FIFO constraints are added to branching nets, the FIFO constraint should be added to each point-to-point net. To share stream switch FIFOs or DMA FIFOs before the branch, duplicate the FIFO type and location on each point-to-point net, as sketched after this list.
- The constraint can also be used to specify the desired type of FIFO without giving a location or depth.
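The following is a hedged sketch of the branching case, assuming two point-to-point nets netA and netB that branch from the same source; giving both nets an identical FIFO type and location shares a single DMA FIFO before the branch:

// Identical type and location on both branch nets shares one FIFO before the branch.
location<fifo>(netA) = { dma_fifo(aie_tile, 2, 0, 0x2100, 192) };
location<fifo>(netB) = { dma_fifo(aie_tile, 2, 0, 0x2100, 192) };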
The following example shows how a FIFO constraint can be used in a graph file.
class AieToAieHierarchicalGraph : public graph
{
public:
input_port in;
output_port out;
aieSrcGraph<64,64,64,0,0> aieSrc; // Depth: 128, Constrained: 80
aieDstGraph<64,64,32,1,0> aieDst; // Depth: 128, Constrained: 48
AieToAieHierarchicalGraph()
{
connect<> net0(in, aieSrc.in);
connect<> net1(aieSrc.out, aieDst.in);
connect<> net2(aieDst.out, out);
fifo_depth(net1) = 64;
//The DMA FIFO depth on net1 is set to 64 32-bit words. This DMA FIFO is
//located on tile 2,0 at address 0x2100, and constrained to 192 32-bit words.
location<fifo>(net1) = { dma_fifo(aie_tile, 2, 0, 0x2100, 192) };
}
};
The second example shows how a FIFO constraint can be added to a constraints file.
"PortConstraints": {
"fifo_locations_records": {
"dma_fifos": {
"DMAFIFO_AIE_MEMGRP_X0Y0_256_511": {
"tile_type": "core",
"row": 0,
"column": 0,
"size": 64,
"offset": 256,
"bankId": 0
}
},
"stream_fifos": {
"SSFIFO_11_1_1": {
"tile_type": "core",
"row": 0,
"column": 11,
"channel": 1
}
}
}
}
Hierarchical Constraints
When creating complex graphs with multiple subgraph classes, or multiple instances of the same subgraph class, the location constraints described above can also be applied to each kernel instance or kernel port instance individually at the point of subgraph instantiation instead of at the definition. In this case, you need to specify the graph-qualified name of the kernel instance or kernel port instance in the constraint, as shown below. Also, make sure that the kernels or ports being constrained are defined as public members of the subgraph.
class ToplevelGraph : public graph {
public:
indep_nodes_graph1<1,2,3,2> mygraph;
port<input> datain[2] ;
port<output> dataout[2] ;
ToplevelGraph() {
for (int i = 0; i < 2; i++) {
connect<stream, window<64> >(datain[i], mygraph.datain[i]);
connect<window<64>, stream >(mygraph.dataout[i], dataout[i]);
// hierarchical constraints
location<stack>(mygraph.kr[i]) = bank(3, 2+i, 2);
location<buffer>(mygraph.kr[i].out[0]) = location<kernel>(mygraph.kr[i]);
}
};
};
A constraints file generated by a previous compilation can be passed back to the AI Engine compiler using the --constraints switch:

aiecompiler --constraints Work/temp/graph_aie_mapped.aiecst src/graph.cpp
Buffer Allocation Control
The AI Engine compiler automatically allocates the desired number of buffers for each memory connection. There are several different cases.
- Lookup tables are always allocated as single buffers because they are expected to be read-only and private to a kernel. No locks are needed to synchronize lookup table accesses because they are expected to be accessed in an exclusive manner.
- Window connections are usually assigned double buffers if the producer and consumer kernels are mapped to different processors, or if the producer or the consumer is a DMA. This enables the two agents to operate in a pipelined manner using ping-pong synchronization with two locks. The AI Engine compiler automatically generates this synchronization in the respective processor main functions.
- If the producer and consumer kernels are mapped to the same processor, then the window connection is given only one buffer and no lock synchronization is needed, because the kernels are executed sequentially.
- Run-time parameter connections can be assigned double buffers (default) along with a selector word to choose the next buffer to be accessed.
Run-time parameter connections can also be assigned single buffers. Sometimes, with window connections, it is desirable to use only single-buffer synchronization instead of double buffers. This is useful when local data memory is at a premium and the performance penalty of using a single buffer for data transfer is not critical. This can be achieved using the single_buffer(port<T>&) constraint.
single_buffer(first.in[0]); //For window input or RTP input
single_buffer(first.inout[0]); //For RTP output
C++ Kernel Class Support
The AI Engine compiler supports C++ kernel classes. The following example shows how to set the filter coefficients and the number of samples of a FIR filter class through a constructor. The C++ kernel class allows the internal state of each kernel instance to be encapsulated within the corresponding class object. In the following code, the filter coefficients (coeffs) are specified through the constructor. This resolves the problem of using file-scope variables, global variables, or static function-scope variables to store the internal state of a C function kernel: when multiple instances of such a kernel are mapped to the same core, the internal state variables are shared across the instances and cause conflicts.
//fir.h
#pragma once
#include "adf.h"
#define NUM_COEFFS 12
class FIR
{
private:
int32 coeffs[NUM_COEFFS];
int32 tapDelayLine[NUM_COEFFS];
uint32 numSamples;
public:
FIR(const int32(&coefficients)[NUM_COEFFS], uint32 samples);
void filter(input_window_int32* in, output_window_int32* out);
static void registerKernelClass()
{
REGISTER_FUNCTION(FIR::filter);
}
};
You are required to write the static void registerKernelClass() method in the header file. Inside the registerKernelClass() method, you need to call the REGISTER_FUNCTION macro. This macro registers the class run method to be executed on the AI Engine core to perform the kernel functionality. In the preceding example, FIR::filter is registered using this macro. The kernel class constructor and run method should be implemented in a separate source file. The implementation of a run method of a kernel class is the same as writing a kernel function as described in previous chapters.
//fir.cpp
//implementation in this example is not optimized and is for illustration purposes
#include "fir.h"
FIR::FIR(const int32(&coefficients)[NUM_COEFFS], uint32 samples)
{
for (int i = 0; i < NUM_COEFFS; i++)
coeffs[i] = coefficients[i];
for (int i = 0; i < NUM_COEFFS; i++)
tapDelayLine[i] = 0;
numSamples = samples;
}
void FIR::filter(input_window_int32* in, output_window_int32* out)
{
for (int i = 0; i < numSamples; i++)
{
for (int j = NUM_COEFFS-1; j > 0; j--)
tapDelayLine[j] = tapDelayLine[j - 1];
tapDelayLine[0] = window_readincr(in);
int32 y = 0;
for (int j = 0; j < NUM_COEFFS; j++)
{
y += coeffs[j] * tapDelayLine[j];
}
window_writeincr(out, y);
}
}
//graph.h
#pragma once
#include "adf.h"
#include "fir.h"
using namespace adf;
class mygraph : public graph
{
public:
input_port in1, in2;
output_port out1, out2;
kernel k1, k2;
mygraph()
{
//see lab8.3 for narrow filter coefficients
k1 = kernel::create_object<FIR>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
runtime<ratio>(k1) = 0.1;
source(k1) = "src/fir.cpp";
//see lab8.3 for wide filter coefficients
k2 = kernel::create_object<FIR>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }), 8);
runtime<ratio>(k2) = 0.1;
source(k2) = "src/fir.cpp";
connect<window<32>>(in1, k1.in[0]);
connect<window<32>>(in2, k2.in[0]);
connect<window<32>>(k1.out[0], out1);
connect<window<32>>(k2.out[0], out2);
}
};
For a kernel class with a non-default constructor, you can specify the constructor parameter values in the arguments of kernel::create_object when creating a representation of a kernel instance. In the previous example, two FIR filter kernels (k1 and k2) are created using kernel::create_object<FIR>. k1 has filter coefficients { 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 } and k2 has filter coefficients { -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }. Both of them consume eight samples on each invocation.
The following code shows the AI Engine compiler generated program. The two FIR kernel objects are instantiated with the proper constructor parameters.
//Work/aie/x_y/src/x_y.cc
...
FIR i4({180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504}, 8);
FIR i5({-21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539}, 8);
int main(void) {
...
// Kernel call : i4:filter
i4.filter(get_input_window_int32(window_buf0_buf0d),get_output_window_int32(window_buf2_buf2d));
...
// Kernel call : i5:filter
i5.filter(get_input_window_int32(window_buf1_buf1d),get_output_window_int32(window_buf3_buf3d));
...
}
A kernel class can have a member variable occupying a significant amount of memory space that might not fit into program memory. The location of a kernel class member variable can be controlled. The AI Engine compiler supports array reference member variables, which allow the compiler to allocate or constrain the memory space while passing the reference to the object.
//fir.h
#pragma once
#include "adf.h"
#define NUM_COEFFS 12
class FIR
{
private:
int32 (&coeffs)[NUM_COEFFS];
int32 tapDelayLine[NUM_COEFFS];
uint32 numSamples;
public:
FIR(int32(&coefficients)[NUM_COEFFS], uint32 samples);
void filter(input_window_int32* in, output_window_int32* out);
static void registerKernelClass()
{
REGISTER_FUNCTION(FIR::filter);
REGISTER_PARAMETER(coeffs);
}
};
//fir.cpp
#include "fir.h"
FIR::FIR(int32(&coefficients)[NUM_COEFFS], uint32 samples)
: coeffs(coefficients)
{
for (int i = 0; i < NUM_COEFFS; i++)
tapDelayLine[i] = 0;
numSamples = samples;
}
void FIR::filter(input_window_int32* in, output_window_int32* out)
{
...
}
The previous example shows a slightly modified version of the FIR kernel class. Here, the member variable coeffs has the int32 (&)[NUM_COEFFS] data type. The constructor initializer coeffs(coefficients) initializes coeffs to a reference to an array allocated externally to the class object. To let the AI Engine compiler know that the coeffs member variable is intended to be allocated by the compiler, you must use REGISTER_PARAMETER to register the array reference member variable inside the registerKernelClass() method.
The use of kernel::create_object to create a representation of a FIR kernel instance, and to specify the initial values of the constructor parameters, is the same as in the previous example. See the following code.
//graph.h
...
class mygraph : public graph
{
...
mygraph()
{
k1 = kernel::create_object<FIR>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
...
k2 = kernel::create_object<FIR>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }), 8);
...
}
};
The following code shows the corresponding AI Engine compiler generated program. The memory spaces for int32 i4_coeffs[12] and int32 i5_coeffs[12] are outside the kernel object instances and are passed into the FIR objects by reference.
//Work/aie/x_y/src/x_y.cc
int32 i4_coeffs[12] = {180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504};
FIR i4(i4_coeffs, 8);
int32 i5_coeffs[12] = {-21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539};
FIR i5(i5_coeffs, 8);
int main(void) {
...
// Kernel call : i4:filter
i4.filter(get_input_window_int32(window_buf0_buf0d),get_output_window_int32(window_buf2_buf2d));
...
// Kernel call : i5:filter
i5.filter(get_input_window_int32(window_buf1_buf1d),get_output_window_int32(window_buf3_buf3d));
...
}
Because the memory space for an array reference member variable is allocated by the AI Engine compiler, a location constraint can be applied to constrain the memory location of these arrays, as shown in the following example code. The REGISTER_PARAMETER macro allows kernel::create_object to create a parameter handle for an array reference member variable, such as k1.param[0] and k2.param[0], to which the location<parameter> constraint can be applied.
//graph.h
...
class mygraph : public graph
{
...
mygraph()
{
k1 = kernel::create_object<FIR>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
...
k2 = kernel::create_object<FIR>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }), 8);
...
location<parameter>(k1.param[0]) = address(…);
location<parameter>(k2.param[0]) = bank(…);
}
};
The C++ kernel class header files and the C++ kernel function templates (see C++ Template Support) should not contain single-core specific intrinsic APIs or pragmas. This is the same programming guideline as for writing regular C function kernels. These header files are included in the graph header file and can be cross-compiled as part of the PS program, and the Arm® cross-compiler cannot understand single-core intrinsic APIs or pragmas. Single-core specific programming content must be kept inside the source files.
C++ Template Support
A template is a powerful tool in C++. By passing the data type as a parameter, you eliminate the need to rewrite code to support different data types. Templates are expanded at compile time, like macros. The difference is that the compiler performs type checking before template expansion. The source code contains only one template function or class definition, but the compiled code can contain multiple instantiations of the same function or class. Type parameters, non-type parameters, default arguments, scalar parameters, and template parameters can be passed to a template, and the compiler instantiates the function or class accordingly.
- Support for general C++ template features.
- Supported data types (T) and connection types between kernels:
  - Data type (T): int8, uint8, int16, uint16, cint16, int32, uint32, cint32, int64, uint64, float, cfloat. IMPORTANT: The acc48 and cacc48 data types are not supported in template stream connections.
  - Function parameter type: input_window<T>, output_window<T>, input_stream<T>, output_stream<T>
- The compiler does not support pre-compiled headers for template kernels.
Function Templates
Function template source code defines a generic function that can be used for different data types. Example function template:
// add.h
template<typename ELEMENT_TYPE, int FACTOR, size_t NUM_SAMPLES> void add(input_window<ELEMENT_TYPE>* in,
output_window<ELEMENT_TYPE>* out);
// add.cpp
template<typename ELEMENT_TYPE, int FACTOR, size_t NUM_SAMPLES> void add(input_window<ELEMENT_TYPE>* in,
output_window<ELEMENT_TYPE>* out)
{
for (int i=0; i<NUM_SAMPLES; i++)
{
ELEMENT_TYPE value = window_readincr(in);
value += FACTOR;
window_writeincr(out, value);
}
}
// graph.h
mygraph()
{
k[0] = kernel::create(add<int32, 6, 8>);
k[1] = kernel::create(add<int16, 3, 8>);
for (int i=0; i<NUM_KERNELS; i++)
{
runtime<ratio>(k[i]) = 0.3;
source(k[i]) = "src/add.cpp";
}
connect<window<32>>(in[0], k[0].in[0]);
connect<window<32>>(k[0].out[0], out[0]);
connect<window<16>>(in[1], k[1].in[0]);
connect<window<16>>(k[1].out[0], out[1]);
}
where:
- add.h declares the template add() function.
- add.cpp defines the code for the template add() function.
- graph.h uses the template add() function within the mygraph class.
Class Templates
Like function templates, class templates are useful when a class defines an object that is independent of a specific data type. Example class template:
// fir.h
...
template<size_t NUM_COEFFS, typename ELEMENT_TYPE> class FIR
{
private:
ELEMENT_TYPE (&coeffs)[NUM_COEFFS];
ELEMENT_TYPE tapDelayLine[NUM_COEFFS];
uint32 numSamples;
public:
FIR(ELEMENT_TYPE(&coefficients)[NUM_COEFFS], uint32 samples);
void filter(input_window<ELEMENT_TYPE>* in, output_window<ELEMENT_TYPE>* out);
//user needs to write this function to register necessary info
static void registerKernelClass()
{
REGISTER_FUNCTION(FIR::filter);
REGISTER_PARAMETER(coeffs);
}
};
// fir.cpp
...
template<size_t NUM_COEFFS, typename ELEMENT_TYPE> FIR<NUM_COEFFS, ELEMENT_TYPE>::FIR(ELEMENT_TYPE(&coefficients)[NUM_COEFFS], uint32 samples):coeffs(coefficients)
{
...
}
template<size_t NUM_COEFFS, typename ELEMENT_TYPE> void FIR<NUM_COEFFS, ELEMENT_TYPE>::filter(input_window<ELEMENT_TYPE>* in, output_window<ELEMENT_TYPE>* out)
{
...
}
// graph.h
...
mygraph()
{
k1 = kernel::create_object<FIR<12, int32>>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
runtime<ratio>(k1) = 0.1;
source(k1) = "src/fir.cpp";
headers(k1) = { "src/fir.h" };
k2 = kernel::create_object<FIR<15, int32>>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539, 0, 0, 0 }), 8);
runtime<ratio>(k2) = 0.1;
source(k2) = "src/fir.cpp";
headers(k2) = { "src/fir.h" };
...
}
where:
- fir.h defines the class template where class FIR is declared.
- fir.cpp contains the class FIR implementation, including the member function filter implementation.
- graph.h demonstrates the template class FIR instantiation within the mygraph class.
Multicast Support
Various multicast scenarios are supported in the graph, such as from a window to multiple windows, from a stream to multiple streams, from a PLIO to multiple windows, and so on. This section lists the supported types of multicast from a single source to multiple destinations. For more details on PLIO, FileIO, and GMIO, see Using a Virtual Platform.
# | Source | Destination 1 | Destination 2 | Supported |
---|---|---|---|---|
1 | AI Engine Window | AI Engine Window | AI Engine Window | Not Supported |
2 | AI Engine Window | AI Engine Window | AI Engine Stream | Not Supported |
3 | AI Engine Window | AI Engine Window | PLIO/FileIO/GMIO | Not Supported |
4 | AI Engine Window | AI Engine Stream | AI Engine Stream | Supported |
5 | AI Engine Window | AI Engine Stream | PLIO/FileIO | Supported |
6 | AI Engine Window | PLIO/FileIO | PLIO/FileIO | Supported |
7 | AI Engine Stream | AI Engine Window | AI Engine Window | Supported |
8 | AI Engine Stream | AI Engine Window | AI Engine Stream | Supported |
9 | AI Engine Stream | AI Engine Window | PLIO/FileIO/GMIO | Supported |
10 | AI Engine Stream | AI Engine Stream | AI Engine Stream | Supported |
11 | AI Engine Stream | AI Engine Stream | PLIO/FileIO/GMIO | Supported |
12 | AI Engine Stream | PLIO/FileIO/GMIO | PLIO/FileIO/GMIO | Supported |
13 | PLIO/FileIO/GMIO | AI Engine Window | AI Engine Window | Supported |
14 | PLIO/FileIO/GMIO | AI Engine Window | AI Engine Stream | Not Supported |
15 | PLIO/FileIO/GMIO | AI Engine Window | PLIO/FileIO/GMIO | Not Supported |
16 | PLIO/FileIO/GMIO | AI Engine Stream | AI Engine Stream | Supported |
17 | PLIO/FileIO/GMIO | AI Engine Stream | PLIO/FileIO/GMIO | Not Supported |
18 | PLIO/FileIO/GMIO | PLIO/FileIO/GMIO | PLIO/FileIO/GMIO | Not Supported |
19 | AI Engine Window | AI Engine Stream | GMIO | Not Supported |
20 | AI Engine Window | GMIO | GMIO | Not Supported |
21 | AI Engine Window | PLIO/FileIO | GMIO | Not Supported |
Note the following.
- All source and destination windows in the multicast connections are required to have the same size.
- RTP and packet switching are not covered in this section.
- If the multicast type is supported, the number of destinations is not limited, provided they fit into the hardware.
When multiple streams are connected to the same source, the data is sent to all the destination ports at the same time, and only when all destinations are ready to receive data. This might cause a stream stall or a design hang if the FIFO depths of the stream connections are not deep enough. Refer to the examples in the AI Engine Kernel Coding Best Practices Guide (UG1079) for more information about stream stalls and potential solutions.
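For example, the following is a hedged sketch of row 10 in the table (one AI Engine stream source multicast to two AI Engine stream destinations), assuming kernels k0, k1, and k2 already created in the graph; a FIFO on each branch gives a slower consumer some room before the multicast point stalls:

connect<stream> b1(k0.out[0], k1.in[0]); // one stream source...
connect<stream> b2(k0.out[0], k2.in[0]); // ...multicast to two stream destinations
fifo_depth(b1) = 32; // per-branch FIFOs reduce the chance of a multicast stall
fifo_depth(b2) = 32;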