Grace Hopper developed the first compiler for a computer programming language.

Lab 9

Lab 9
Check yourself

In this lab, you will write vector code to gain a better understanding of how data-parallel code maps to vector-style processors, and to practice optimizing vector code for a given implementation.

Check yourself

Have you gone through Week 9 videos?
What is vector code trying to optimize?
What is a vector ? (a one-dimensional array data structure)

The programming model is best explained by using a simple application. We use a simple dot multiplication of two arrays. The equivalent RISC-V code is shown as well.

void DotSerial(int x[], int y[], int output[], int N) {
  for (int i = 0; i < N; i++) {
    output[i] = x[i] * y[i];
  }
}

Scalar Code

# a0: &x[0], a1: &y[0], a2: &result[0], a4, N
# t1 = 0: loop index i
loop:
    # load x[i] and y[i]
    lw      a4,0(a0)
    lw      a3,0(a1)
    # multiplication
    mul     a4,a4,a3
    # store word
    sw      a4,0(a2)
    # Bump pointers
    addi    a0,a0,4
    addi    a1,a1,4
    addi    a2,a2,4
    addi    t1, t1, 1
    bne     t1,a4,loop

Limitations

You can see that there is only useful instruction performing the actual multiplication and 8 other instructions loading data and manipulating the loop. For the entire loop we need 8*n instructions. How do we reduce the number of instructions for performing this computation? Answer: How do you increase operational intensity? i.e., get instructions to process more data than a single element
It appears that you are calculating the output one at a time; however if you unrolled and wrote down the loop manually. You can see that each of the loop iterations are independent of each other. output[0] requires only x[0] and y[0], output[1] requires x[1] and y[1]. How can we exploit the observation to work on multiple elements in parallel

1
2
3
4
5
6
7
8
9
10
11
12
13
14
for(int i = 0; i < N; i=i+VLEN) {
    output[i+0] = x[i+0] * y[i+0];
    output[i+1] = x[i+1] * y[i+1];
    output[i+2] = x[i+2] * y[i+2];
    output[i+3] = x[i+3] * y[i+3];
    ...
    output[i+VLEN-1] = x[i+VLEN-1] * y[i+VLEN-1];
}
// Rewriting as vector
for(int i = 0; i < N; i=i+VLEN) {
  start = i;
  end =  i+VLEN-1
    output[start:end] = x[start:end] * y[start:end];
}

Definition :: In a vector processor we implement an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only.

Key idea : Vector processing is based on the key idea that we can improve efficiency by processing multiple elements of a one-dimensional array in parallel. The core instructions work on multiple data elements

The core idea in vector processing is to use a single instruction multiple elements. Code below shows a hypothetical RISC-V style vector ISA. In this example we cut the number of loop iterations to $\frac{N}{VLEN}$ . For instance, when N is 16 we need only two iterations in the vector code while baseline scalar code require 16 iterations, a 8 $\times$ speedup. There are a number of vector instructions (they are prefixed with v). They work on vector registers which include multiple scalar elements. Values are loaded into the vector registers from consecutive memory locations starting from the starting address. vmul works on all elements of the vector in parallel. Figure illustrates the difference between scalar dot and vector dot.

# v0-31 vector registers. For improving the density of mul.
# a0-4, t1 scalar registers for holding array pointers and loop control
# VLEN = 8.
# VectorDot. t0 = VLEN
loop: 
vle32.v v8, (a2)   # load x[I,i+VLEN]
vle32.v v16, (a3)  # load x[I,i+VLEN]
vmul.vv v24,v8,v16 # vector-vector multiplication
vse32.v v24,(a1)   # store res[i:i+VLEN]
# Bump pointers
slli t1,t0,2 
add  a2, a2,t1 
add  a3,a3,t1
add  a1,a1,t1 
# Bump loop by VLEN
sub  a0,a0,t0
bnez a0, loop

Instead of programming at the assembly level we are going to be interacting with vector programming at the C level using intrinsics

Intrinsic (definition) These are special functions and types that we have implemented that you can use within a C program to activate vector operations and vector registers without requiring to deal with assembly-level syntax and finite 32 number of register.

With intrinsics you can create an arbtirary number of vector registers in your program and invoke as many operations as defined in our vector library.

We will now illustrate the usage of some of these intrinsics using the Dot example. To refer to the overall list of operations and their semantics see here

1. Baseline Dot

make Dot.bin
./Dot.bin


Dot.h	Implementation of serial and vector dot
main.cpp	Invokes the dot functions
dataset.h	Inputs to the dot program

// implementation of relu using instrinsics
void DotVector(int x[], int y[], int output[], int N) {

  // Create vector registers for processing
  __cs295_vec_int x_v, y_v;
  __cs295_vec_int result_v;

  // Check the loop bound; it increments in VLEN segments
  // It considers the array a VLEN elements at a time.
  for (int i = 0; i < N; i += VLEN) {

    // Activate VLEN lanes
    __cs295_mask maskAll = _cs295_init_ones(VLEN);

    // Both X and Y pointers move VLEN at a time.

    // Load VLEN elements from X; values from contiguous memory
    // addresses
    _cs295_vload_int(x_v, x + i, maskAll);  // x = values[i];

    // Load VLEN elements from Y; values from contiguous memory
    // addresses.
    _cs295_vload_int(y_v, y + i, maskAll);  // x = values[i];

    // Activate maskAll lanes.
    _cs295_vmult_int(result_v, x_v, y_v, maskAll);

    // Write results back to memory
    _cs295_vstore_int(output + i, result_v, maskAll);

    printf("Iteration %d processing %d to %d \n", i, i, i+VLEN);
    cs295Logger.printLog();
    cs295Logger.clearLog();
  }
  printf("Done Vector");
}

The vector instrinsics are prefixed with _cs295_ and vector types are prefixed with __cs295.

Vector types

__cs295_vec_int : The implementation of the vector registers of either. Each vector register holds a VLEN number of elements. These registers can be passed as parameters to the vector functions. We declare 3 registers x_v and y_v for holding input array elements.
- WARNING: Like RISC-V instructions vector computation operations can only operate on vector registers.
- Not array elements. You have to first load the array elements into vector register

__cs295_vec_int x_v, y_v;
__cs295_vec_int result_v;

__cs295_mask: Masks are bool vectors of width VLEN which specific whether a particular lane or index is active in an operation. We illustrate with vmult below

If the mask is a 1 the lane performs the operation, if its a 0 the lane is in active and result is not modified.

for element = 0 to VLEN
   if mask[element] == 1 then
     result[element] = x_v[element] * y_v[element]
  else 
    // DO NOTHING

Loop control

Since we are processing a VLEN worth of elements in each iteration the loop progresses a VLEN at at time. You may want to think about what if N is not a multiple of VLEN. Here in this example you can assume N is a multiple of VLEN hence there are no boundary conditions to deal with. We create a mask that activates all the vector lanes since we want to process all 8 elements in parallel. _cs295_init_ones creates a mask with the first (0..VLEN) lanes set to 1. Note that maskAll is just a variable name. You can use any name for the mask as long as its type of __cs295_mask.

  for (int i = 0; i < N; i += VLEN) {
          __cs295_mask maskAll = _cs295_init_ones(VLEN);

Vector memory ops

We load the array elements into the vector register. x_v : vector register to load data into x+i : ptr from which to start VLEN integers maskAll: activate all lanes

  // Load VLEN elements from X; values from contiguous memory
    // addresses
    _cs295_vload_int(x_v, x + i, maskAll);  // x = values[i];

When i = 0 x+i = &a[0]. x_v will hold elements x[0:8] in the first iteration i = 8 x+i = &a[8]. In the second iteration when i = 8 we load 8 elements starting at &a[8] x_v will hold elements x[8:15] in the first iteration

vload_int(v,ptr,mask)
for element = 0 to VLEN
   if mask[element] == 1 then
     v[element] = *(ptr + element)
  else 
    // DO NOTHING

Vector computation and store

Finally we process all the VLEN (8) elements in parallel.

    // Activate maskAll lanes.
    _cs295_vmult_int(result_v, x_v, y_v, maskAll);

Log and Debugging

Finally we have added support displaying what’s happening in the vector lanes. This helps with debuggin as well to see if a particular element is being processed. The log is cumulative so we clear it after each iteration to ensure we are only displaying on a single iterations’s worth of log.

    cs295Logger.printLog();
    cs295Logger.clearLog();

For instance the processing proceeds as follows. Each instruction now processes 8 elements. The * indicates an element that is processed ******** indicates 8 elements are being processed and active in each iteration.

Iteration 0 processing 0 to 8
***************** Printing Vector Unit Execution Log *****************
 Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
       vload | ********
       vload | ********
       vmul  | ********
      vstore | ********
Iteration 8 processing 8 to 16
***************** Printing Vector Unit Execution Log *****************
 Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
       vload | ********
       vload | ********
       vmul  | ********
      vstore | ********

Each of the vector types __cs295_vec_int and __cs295_mask are structs of type where T is either int or bool (for masks).

struct __cs295_vec {
  T value[VLEN];
};
# When gdbing you can display as such to display the contents of individual lanes. 
p/x x_v.value[0]

2. Dot Error

Now let’s consider when N is not a multiple of VLEN. This example is shown in Dot-Error/dataset.h. The size of the vector is 12. Now lets run the same code as previously.

make Dot-Error.bin
./Dot-Error.bin

Iteration 0 processing 0 to 8
***************** Printing Vector Unit Execution Log *****************
 Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
       vload | ********
       vload | ********
       vmult | ********
      vstore | ********
Iteration 8 processing 8 to 16
***************** Printing Vector Unit Execution Log *****************
 Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
       vload | ********
       vload | ********
       vmult | ********
      vstore | ********
Done VectorYou have written to out of bound value!
Wrong calculation at value[12]!
values x,y  = [X=-3 Y=-8][X=3 Y=-3][X=6 Y=9][X=3 Y=-7][X=7 Y=1][X=4 Y=0][X=3 Y=-3][X=5 Y=-10][X=7 Y=-2][X=2 Y=-7][X=5 Y=10][X=-2 Y=2]
output =  24 -9  54 -21  7  0 -9 -50 -14 -14  50 -4
gold   =  24 -9  54 -21  7  0 -9 -50 -14 -14  50 -4

We have written out of bounds to element 12. The results for 0-11 still seem to match the golden values (expected results). Gold refers to the expected results (its defined in dataset.h).

Why did this happen?

The answer lies in the logs. The size of the array is 12. In the first iteration we processed elements 0-8 and in the second iteration we processed 8-16. All the lanes are active in the second iteration (see the ****). This is incorrect since we only want to process 8-12. The vstore wrote an additional 4 values result[12-16] overflowing which it should not have. Typucally this would have resulted in a segfault. But we add additional padding in the datasets to detect and warn the user about the error.

3. Dot Fix . Support for array lengths not multiple of VLEN

In scalar code typically the loop will terminate at 12; however in this case here since we are processing in segments of VLEN (8 elements at a time). We have to detect this and deactivate if VLEN exceeds array size.

cd $REPO
make Dot-Fix.bin
./Dot-Fix.bin

The fix is pretty simple in each iteration we first check how many elements are left to be processed. In the final iteration N-i < VLEN will hold and we calculate the number of lanes that have to be active. We then activate only width amount of lanes.

  for (int i = 0; i < N; i += VLEN) {
    
    int width;    
    // Check if final iteration. 
    if (N - i < VLEN) {
      width = N - i; // Find number of elements left to process
    } else {
      width = VLEN; // Otherwise activate all lanes
    }

    // Activate VLEN lanes
    __cs295_mask maskAll = _cs295_init_ones(width);

The log now reveals the execution. We deactivate the non-required lanes 12-16 in the second iteration (See the __).

Dot Vector
Iteration 0 processing 0 to 8
***************** Printing Vector Unit Execution Log *****************
 Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
       vload | ********
       vload | ********
       vmult | ********
      vstore | ********
Iteration 8 processing 8 to 16
***************** Printing Vector Unit Execution Log *****************
 Instruction | Vector Lane Occupancy ('*' for active, '_' for inactive)
------------- --------------------------------------------------------
       vload | ****____
       vload | ****____
       vmult | ****____
      vstore | ****____
Done VectorResults matched with answer!

4. Dot MZ . Optimizing zeros away

We extend the notion of masking lanes further. We examine the the input data sets to find that many of the inputs could be 0. When either x is zero or y is zero it is not necessary to perform the multiplication.

Dot-MZ/dataset.h
int input_X[SIZE] = {0, 3, 0, 3, 7, 4, 3, 5, 7, 2, 5, -2, -5, -8, 0, -9};
int input_Y[SIZE] = {-8, 0, 0, -7, 1, 0, -3, -10, -2, -7, 10, 2, 6, -2, -2, 6};

void DotSerial(int x[], int y[], int output[], int N) {
  for (int i = 0; i < N; i++) {
    if (x[i] != 0 && y[i] != 0) {
    output[i] = x[i] * y[i];
    }
  }
}

How do we take advantage of this operation.

We check with elements are 0 in X or Y and combine it into a mask.
We then use this new mask to control which elements we multiply.
Note that this only reduces the number of multiplications performed.
We still need to load all elements to check if they are 0 in the first place

__cs295_mask maskAll = _cs295_init_ones(VLEN);

_cs295_vload_int(x_v, x + i, maskAll);  // x = values[i];
__cs295_mask mask_x_zeros;
_cs295_veq_int(mask_x_zeros, x_v, zeros, maskAll);

// Load VLEN elements from Y; values from contiguous memory
// Is y[i] == 0 ?

_cs295_vload_int(y_v, y + i, maskAll);  // x = values[i];
__cs295_mask mask_y_zeros;
_cs295_veq_int(mask_y_zeros, y_v, zeros,
               maskAll);  // x = values[i];

// Zero lanes
__cs295_mask mask_x_or_y_zeros = _cs295_mask_or(mask_x_zeros, mask_y_zeros);

// Activate non-zero lanes
__cs295_mask mask_nz = _cs295_mask_not(mask_x_or_y_zeros);

// Multiply non-zero values
_cs295_vmult_int(result_v, x_v, y_v, mask_nz);

// Write results back to memory
_cs295_vstore_int(output + i, result_v, mask_nz);

Read array elements

__cs295_mask maskAll = _cs295_init_ones(VLEN);
// Load VLEN elements from X; values from contiguous memory
_cs295_vload_int(x_v, x + i, maskAll);  // x = values[i];
__cs295_mask mask_x_zeros;
// Is x[i] == 0 ?
_cs295_veq_int(mask_x_zeros, x_v, zeros, maskAll);

maskAll activates all lanes so that all elements of the array can be read. Note that we need to read every array element before we can figure out which ones are zeros.
We then use a logical operation _cs295_veq_int to compare the read elements x_v againt a vector with all zeros (this is another vector register we have initialized to all zeros). Note that the result of this operation itself is another mask. It also uses a mask to determine on which elements the operation should be applied. In this case we need to check all elements of x to see if its a zero. So we activate all lanes maskAll.
The result mask, mask_x_zeros is only the lanes in which x_v == 0.
Similarly we calculate mask_y_zeros is a mask in which only lanes in which y_v == 0 are active.

_cs295_veq_int
```python
veq_int(result_mask,v1,v2,mask)
for element = 0 to VLEN
   if mask[element] == 1 then
     if (v1[element] == v2[element])
      result_mask[element] = 1;
     else 
      result_mask[element] = 0;

```bash
make Dot-MZ.bin
./Dot-MZ.bin