Widening / Vectorization

SDAccel™ transforms the interface of a kernel to prevent it from becoming the bottleneck of the computation. The most significant transformation is the widening of the datapath, which is achieved by automatic vectorization of the memory accesses.

A case where this transformation is most visible is the following vector addition kernel:
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((vec_type_hint(int)))
__attribute__((xcl_zero_global_work_offset))
__kernel void vector_add(__global int* c, __global const int* a, __global const int* b) {
    size_t idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}

Here, the interface is a 32-bit wide datapath, which would drastically limit the memory throughput if implemented directly. SDAccel automatically applies a vector transformation to widen the datapath, eliminating the bottleneck.

There are necessary conditions for that transformation to happen:

  • Provide the __attribute__((reqd_work_group_size(X, Y, Z))), where X, Y, and Z are positive constants. X*Y*Z is the maximum vectorization factor SDAccel will infer; it is therefore important to have a sufficient number of work-items to saturate the memory bandwidth.
  • Provide the __attribute__((vec_type_hint(int))), where int is the main type used for computation / memory transfer. The closer the type is to the maximum width of the memory interface, the less vectorization is necessary. This means that small types will vectorize a lot (char64) and large types will vectorize less (float16). In both cases, the target is the maximum bitwidth of the interface (usually 512 bits).

There are optional parameters that are highly recommended:

  • Provide __attribute__((xcl_zero_global_work_offset)), which specifies that you do not use the global work offset parameter available at run time. This gives the compiler valuable information with regard to the alignment of the work-groups, which in turn usually propagates to the alignment of the memory accesses.

Limitations

Since vectorization of the memory accesses is only truly beneficial when the underlying loop (the work-item loop) is partially unrolled by the vectorization factor, SDAccel automatically requests unrolling of that loop and, if possible, reshaping of the local arrays in which the data is stored. This usually behaves nicely, but can interact poorly with other optimizations in rare situations.

For example:

  • For partitioned arrays, when the partition factor is not divisible by the unrolling/vectorization factor:
    • The resulting access requires a lot of multiplexers and creates a difficult problem for the scheduler (it might severely increase memory usage and compilation time). Xilinx recommends that you use partitioning factors that are powers of two (as the vectorization factor will always be a power of two).
  • If the loop being vectorized has an unrelated resource constraint, the scheduler reports that the initiation interval (II) is not met.
    • This is not necessarily correlated with a loss of performance (the unrolled loop usually still performs better), since the II is computed on the unrolled loop, which processes a multiple of the elements per iteration.
    • The scheduler informs you of the possible resource constraints; resolving them will further improve the performance.
    • Note that a common occurrence is that a local array does not get automatically reshaped (usually because a later section of the code accesses it in a non-vectorizable way).