In this lab, you will write vector code to gain a better understanding of how data-parallel code maps to vector-style processors, and to practice optimizing vector code for a given implementation.
The programming model is best explained by using a simple application. We use a simple dot multiplication of two arrays. The equivalent RISC-V code is shown as well.
void DotSerial(int x[], int y[], int output[], int N) {
for (int i = 0; i < N; i++) {
output[i] = x[i] * y[i];
}
}
Scalar Code
# a0: &x[0], a1: &y[0], a2: &result[0], a4, N
# t1 = 0: loop index i
loop:
# load x[i] and y[i]
lw a4,0(a0)
lw a3,0(a1)
# multiplication
mul a4,a4,a3
# store word
sw a4,0(a2)
# Bump pointers
addi a0,a0,4
addi a1,a1,4
addi a2,a2,4
addi t1, t1, 1
bne t1,a4,loop
Limitations
You can see that there is only useful instruction performing the actual multiplication and 8 other instructions loading data and manipulating the loop. For the entire loop we need 8*n instructions. How do we reduce the number of instructions for performing this computation? Answer: How do you increase operational intensity? i.e., get instructions to process more data than a single element
It appears that you are calculating the output one at a time; however if you unrolled and wrote down the loop manually. You can see that each of the loop iterations are independent of each other. output[0] requires only x[0] and y[0], output[1] requires x[1] and y[1]. How can we exploit the observation to work on multiple elements in parallel
1
2
3
4
5
6
7
8
9
10
11
12
13
14
for(int i = 0; i < N; i=i+VLEN) {
output[i+0] = x[i+0] * y[i+0];
output[i+1] = x[i+1] * y[i+1];
output[i+2] = x[i+2] * y[i+2];
output[i+3] = x[i+3] * y[i+3];
...
output[i+VLEN-1] = x[i+VLEN-1] * y[i+VLEN-1];
}
// Rewriting as vector
for(int i = 0; i < N; i=i+VLEN) {
start = i;
end = i+VLEN-1
output[start:end] = x[start:end] * y[start:end];
}
Definition :: In a vector processor we implement an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only.
Key idea : Vector processing is based on the key idea that we can improve efficiency by processing multiple elements of a one-dimensional array in parallel. The core instructions work on multiple data elements
The core idea in vector processing is to use a single instruction multiple elements. Code below shows a hypothetical
RISC-V style vector ISA. In this example we cut the number of loop iterations to NVLEN. For instance, when N is 16 we need only two iterations in the vector code while baseline scalar code require 16 iterations, a 8× speedup. There are a number of vector instructions (they are prefixed with v
). They work on vector registers which include multiple scalar elements. Values are loaded into the vector registers from consecutive memory locations starting from the starting address. vmul
works on all elements of the vector in parallel. Figure illustrates the difference between scalar dot and vector dot.
# v0-31 vector registers. For improving the density of mul.
# a0-4, t1 scalar registers for holding array pointers and loop control
# VLEN = 8.
# VectorDot. t0 = VLEN
loop:
vle32.v v8, (a2) # load x[I,i+VLEN]
vle32.v v16, (a3) # load x[I,i+VLEN]
vmul.vv v24,v8,v16 # vector-vector multiplication
vse32.v v24,(a1) # store res[i:i+VLEN]
# Bump pointers
slli t1,t0,2
add a2, a2,t1
add a3,a3,t1
add a1,a1,t1
# Bump loop by VLEN
sub a0,a0,t0
bnez a0, loop
Instead of programming at the assembly level we are going to be interacting with vector programming at the C level using intrinsics
Intrinsic (definition) These are special functions and types that we have implemented that you can use within a C program to activate vector operations and vector registers without requiring to deal with assembly-level syntax and finite 32 number of register.
We will now illustrate the usage of some of these intrinsics using the Dot example. To refer to the overall list of operations and their semantics see here
make Dot.bin
./Dot.bin
Dot.h | Implementation of serial and vector dot |
main.cpp | Invokes the dot functions |
dataset.h | Inputs to the dot program |
// implementation of relu using instrinsics
void DotVector(int x[], int y[], int output[], int N) {
// Create vector registers for processing
__cs295_vec_int x_v, y_v;
__cs295_vec_int result_v;
// Check the loop bound; it increments in VLEN segments
// It considers the array a VLEN elements at a time.
for (int i = 0; i < N; i += VLEN) {
// Activate VLEN lanes
__cs295_mask maskAll = _cs295_init_ones(VLEN);
// Both X and Y pointers move VLEN at a time.
// Load VLEN elements from X; values from contiguous memory
// addresses
_cs295_vload_int(x_v, x + i, maskAll); // x = values[i];
// Load VLEN elements from Y; values from contiguous memory
// addresses.
_cs295_vload_int(y_v, y + i, maskAll); // x = values[i];
// Activate maskAll lanes.
_cs295_vmult_int(result_v, x_v, y_v, maskAll);
// Write results back to memory
_cs295_vstore_int(output + i, result_v, maskAll);
printf("Iteration %d processing %d to %d \n", i, i, i+VLEN);
cs295Logger.printLog();
cs295Logger.clearLog();
}
printf("Done Vector");
}
The vector instrinsics are prefixed with _cs295_
and vector types are prefixed with __cs295
.
__cs295_vec_int
: The implementation of the vector registers of either. Each vector register holds a VLEN number of elements. These registers can be passed as parameters to the vector functions. We declare 3 registers x_v and y_v for holding input array elements.
WARNING: Like RISC-V instructions vector computation operations can only operate on vector registers.
Not array elements. You have to first load the array elements into vector register
__cs295_vec_int x_v, y_v;
__cs295_vec_int result_v;
__cs295_mask
: Masks are bool vectors of width VLEN which specific whether a particular lane or index is active in an operation. We illustrate with vmult belowIf the mask is a 1 the lane performs the operation, if its a 0 the lane is in active and result is not modified.
for element = 0 to VLEN
if mask[element] == 1 then
result[element] = x_v[element] * y_v[element]
else
// DO NOTHING
Since we are processing a VLEN worth of elements in each iteration the loop progresses a VLEN at at time. You may want to think about what if N is not a multiple of VLEN
. Here in this example you can assume N is a multiple of VLEN hence there are no boundary conditions to deal with. We create a mask that activates all the vector lanes since we want to process all 8 elements in parallel. _cs295_init_ones
creates a mask with the first (0..VLEN) lanes set to 1. Note that maskAll is just a variable name. You can use any name for the mask as long as its type of __cs295_mask.
for (int i = 0; i < N; i += VLEN) {
__cs295_mask maskAll = _cs295_init_ones(VLEN);
We load the array elements into the vector register.
x_v
: vector register to load data into
x+i
: ptr from which to start VLEN integers
maskAll
: activate all lanes
// Load VLEN elements from X; values from contiguous memory
// addresses
_cs295_vload_int(x_v, x + i, maskAll); // x = values[i];
When i = 0 x+i = &a[0]. x_v will hold elements x[0:8] in the first iteration i = 8 x+i = &a[8]. In the second iteration when i = 8 we load 8 elements starting at &a[8] x_v will hold elements x[8:15] in the first iteration
vload_int(v,ptr,mask)
for element = 0 to VLEN
if mask[element] == 1 then
v[element] = *(ptr + element)
else
// DO NOTHING
Finally we process all the VLEN (8) elements in parallel.
// Activate maskAll lanes.
_cs295_vmult_int(result_v, x_v, y_v, maskAll);
Finally we have added support displaying what’s happening in the vector lanes. This helps with debuggin as well to see if a particular element is being processed. The log is cumulative so we clear it after each iteration to ensure we are only displaying on a single iterations’s worth of log.
cs295Logger.printLog();
cs295Logger.clearLog();
For instance the processing proceeds as follows. Each instruction now processes 8 elements. The * indicates an element that is processed ********
indicates 8 elements are being processed and active in each iteration.
Iteration 0 processing 0 to 8
***************** Printing Vector Unit Execution Log *****************
Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
vload | ********
vload | ********
vmul | ********
vstore | ********
Iteration 8 processing 8 to 16
***************** Printing Vector Unit Execution Log *****************
Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
vload | ********
vload | ********
vmul | ********
vstore | ********
Each of the vector types __cs295_vec_int and __cs295_mask are structs of type where T is either int or bool (for masks).
struct __cs295_vec {
T value[VLEN];
};
# When gdbing you can display as such to display the contents of individual lanes.
p/x x_v.value[0]
Now let’s consider when N is not a multiple of VLEN. This example is shown in Dot-Error/dataset.h
. The size of the vector is 12. Now lets run the same code as previously.
make Dot-Error.bin
./Dot-Error.bin
Iteration 0 processing 0 to 8
***************** Printing Vector Unit Execution Log *****************
Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
vload | ********
vload | ********
vmult | ********
vstore | ********
Iteration 8 processing 8 to 16
***************** Printing Vector Unit Execution Log *****************
Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
vload | ********
vload | ********
vmult | ********
vstore | ********
Done VectorYou have written to out of bound value!
Wrong calculation at value[12]!
values x,y = [X=-3 Y=-8][X=3 Y=-3][X=6 Y=9][X=3 Y=-7][X=7 Y=1][X=4 Y=0][X=3 Y=-3][X=5 Y=-10][X=7 Y=-2][X=2 Y=-7][X=5 Y=10][X=-2 Y=2]
output = 24 -9 54 -21 7 0 -9 -50 -14 -14 50 -4
gold = 24 -9 54 -21 7 0 -9 -50 -14 -14 50 -4
We have written out of bounds to element 12. The results for 0-11 still seem to match the golden values (expected results). Gold refers to the expected results (its defined in dataset.h).
Why did this happen?
The answer lies in the logs. The size of the array is 12. In the first iteration we processed elements 0-8 and in the second iteration we processed 8-16. All the lanes are active in the second iteration (see the ****). This is incorrect since we only want to process 8-12. The vstore wrote an additional 4 values result[12-16] overflowing which it should not have. Typucally this would have resulted in a segfault. But we add additional padding in the datasets to detect and warn the user about the error.
In scalar code typically the loop will terminate at 12; however in this case here since we are processing in segments of VLEN (8 elements at a time). We have to detect this and deactivate if VLEN exceeds array size.
cd $REPO
make Dot-Fix.bin
./Dot-Fix.bin
The fix is pretty simple in each iteration we first check how many elements are left to be processed. In the final iteration N-i < VLEN
will hold and we calculate the number of lanes that have to be active. We then activate only width amount of lanes.
for (int i = 0; i < N; i += VLEN) {
int width;
// Check if final iteration.
if (N - i < VLEN) {
width = N - i; // Find number of elements left to process
} else {
width = VLEN; // Otherwise activate all lanes
}
// Activate VLEN lanes
__cs295_mask maskAll = _cs295_init_ones(width);
The log now reveals the execution. We deactivate the non-required lanes 12-16 in the second iteration (See the __).
Dot Vector
Iteration 0 processing 0 to 8
***************** Printing Vector Unit Execution Log *****************
Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
vload | ********
vload | ********
vmult | ********
vstore | ********
Iteration 8 processing 8 to 16
***************** Printing Vector Unit Execution Log *****************
Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
vload | ****____
vload | ****____
vmult | ****____
vstore | ****____
Done VectorResults matched with answer!
We extend the notion of masking lanes further. We examine the the input data sets to find that many of the inputs could be 0. When either x is zero or y is zero it is not necessary to perform the multiplication.
Dot-MZ/dataset.h
int input_X[SIZE] = {0, 3, 0, 3, 7, 4, 3, 5, 7, 2, 5, -2, -5, -8, 0, -9};
int input_Y[SIZE] = {-8, 0, 0, -7, 1, 0, -3, -10, -2, -7, 10, 2, 6, -2, -2, 6};
void DotSerial(int x[], int y[], int output[], int N) {
for (int i = 0; i < N; i++) {
if (x[i] != 0 && y[i] != 0) {
output[i] = x[i] * y[i];
}
}
}
How do we take advantage of this operation.
__cs295_mask maskAll = _cs295_init_ones(VLEN);
_cs295_vload_int(x_v, x + i, maskAll); // x = values[i];
__cs295_mask mask_x_zeros;
_cs295_veq_int(mask_x_zeros, x_v, zeros, maskAll);
// Load VLEN elements from Y; values from contiguous memory
// Is y[i] == 0 ?
_cs295_vload_int(y_v, y + i, maskAll); // x = values[i];
__cs295_mask mask_y_zeros;
_cs295_veq_int(mask_y_zeros, y_v, zeros,
maskAll); // x = values[i];
// Zero lanes
__cs295_mask mask_x_or_y_zeros = _cs295_mask_or(mask_x_zeros, mask_y_zeros);
// Activate non-zero lanes
__cs295_mask mask_nz = _cs295_mask_not(mask_x_or_y_zeros);
// Multiply non-zero values
_cs295_vmult_int(result_v, x_v, y_v, mask_nz);
// Write results back to memory
_cs295_vstore_int(output + i, result_v, mask_nz);
__cs295_mask maskAll = _cs295_init_ones(VLEN);
// Load VLEN elements from X; values from contiguous memory
_cs295_vload_int(x_v, x + i, maskAll); // x = values[i];
__cs295_mask mask_x_zeros;
// Is x[i] == 0 ?
_cs295_veq_int(mask_x_zeros, x_v, zeros, maskAll);
maskAll
activates all lanes so that all elements of the array can be read. Note that we need to read every array
element before we can figure out which ones are zeros._cs295_veq_int
to compare the read elements x_v
againt a vector with all zeros
(this is another vector register we have initialized to all zeros). Note that the result of this operation itself is another mask. It also uses a mask to determine on which elements the operation should be applied. In this case we need to check all elements of x to see if its a zero. So we activate all lanes maskAll._cs295_veq_int
```python
veq_int(result_mask,v1,v2,mask)
for element = 0 to VLEN
if mask[element] == 1 then
if (v1[element] == v2[element])
result_mask[element] = 1;
else
result_mask[element] = 0;
```bash
make Dot-MZ.bin
./Dot-MZ.bin