Project #1: OpenMP Filter

Due date: None. We will make a solution available for self-assessment and a study guide on Friday February 22.

In this project, you will parallelize a filtering function that we implemented and parallelized in our previous research; the original experiments were run in the Burns lab in 2012. You will use OpenMP and evaluate the speedup realized on multicore processors. The code operates on a data array D that contains labels, which might be superpixels in an image or a segmentation of a scene, and a filter list F that enumerates the labels of interest. The goal of the computation is to output a new data array D’ that contains the elements of D that are listed in F, with zeros elsewhere. For example, if D = [3, 1, 4, 1, 5] and F = [1, 5], then D’ = [0, 1, 0, 1, 5].

I am giving you two serial implementations in filter.c that you may use as a reference. Here is the inner loop of one of them:

/* start with an empty *all zeros* output array */
output_array = (unsigned int*) malloc ( DATA_LEN * sizeof(unsigned int));
memset ( output_array, 0, DATA_LEN * sizeof(unsigned int) );

/* for all elements in the data */
for (int x=0; x<DATA_LEN; x++) {
  /* for all elements in the filter */
  for (int y=0; y<FILTER_LEN; y++) {
    /* if the data element matches the filter */
    if (input_array[x] == filter_list[y]) {
      /* include it in the output */
      output_array[x] = input_array[x];
    }
  }
}
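The second serial implementation reverses the loop nest, putting the filter in the outer loop. Here is a sketch of that filter-first order; the function name and signature below are my own, and the actual serialFilterFirst in filter.c may differ in its details:

```c
#include <string.h>

/* filter-first order: for each filter label, scan the whole data array */
void filter_first(const unsigned int *input_array, unsigned int *output_array,
                  int data_len, const unsigned int *filter_list,
                  int filter_len) {
  /* start with an empty *all zeros* output array */
  memset(output_array, 0, data_len * sizeof(unsigned int));

  /* for all elements in the filter */
  for (int y = 0; y < filter_len; y++) {
    /* for all elements in the data */
    for (int x = 0; x < data_len; x++) {
      /* if the data element matches the filter */
      if (input_array[x] == filter_list[y]) {
        output_array[x] = input_array[x];
      }
    }
  }
}
```

The two orders compute identical results; they differ only in their memory access patterns, which is exactly what the Loop Efficiency experiments below ask you to measure.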

Experimental Evaluation

All performance evaluations should be run on the Amazon high-CPU 8-core instances (c5.4xlarge), which currently cost about $0.68/hour. I strongly recommend that you do development and testing on some combination of your personal computer and the free-tier micro instances, and then launch a large instance only for performance evaluation. Each experimental data point should be the median of 20 runs. This is good experimental practice (see http://en.wikipedia.org/wiki/Confidence_interval). If you experience serious outliers in performance, you may use your judgement to discard them as not representative of system behavior. This happens on AWS. Discarding outliers is not good experimental practice, but it may be necessary.

We have provided a simple BASH script that allows you to run your code multiple times with a single command. Download the script from the parallel website onto your EC2 instance and make sure it is executable:

wget http://parallel.cs.jhu.edu/assets/code/runloop.sh
chmod +x runloop.sh

You can then run any program you want multiple times by placing ./runloop.sh in front of your normal shell command. E.g.:

./runloop.sh ./filter 

You can change the number of iterations on line 4 of the runloop.sh script.

Screen

You may also find it useful to run your experiments in a screen session. screen is a useful Linux utility that runs interactive shells in their own processes, independent of your terminal connection. By using screen, you can disconnect from and reconnect to the EC2 instance without losing your work.

To start a named screen session (or reattach to it if it already exists), use the -R flag:

screen -R parallel

To exit out of a screen and return to your main terminal (“detach”), type ctrl + a, release, then hit the d key. To see a list of running screens:

screen -list 

To reattach, use -r:

screen -r parallel 

Note that some keyboard shortcuts are different in a screen than they are in a normal shell.

A Note on Compilation

Modern C/C++ compilers include a large number of code optimization flags. Enabling these optimizations can and will change the structure of your code as written. Many parts of this assignment depend on your code operating more or less as written. Therefore, you need to be careful to compile explicitly without optimizations:

gcc -o filter -O0 -fopenmp filter.c 

Additionally, note that gcc will not be installed by default on your Amazon instance. To install gcc, g++, and a few other handy tools for compilation, run the following command:

sudo apt-get install build-essential 

What to do?

Loop Efficiency

  1. Evaluate the performance of the functions serialFilterFirst and serialDataFirst that loop over the arrays in different orders as a function of the filter length. You should select a constant value for the data length that allows each experiment to complete in a reasonable, but non-trivial amount of time. You should vary the filter length as 1,2,4,8,16,32,64,…(at least to 256).

    a. Plot the runtime as a function of filter size (don’t forget to label axes and units).

    b. Normalize the runtime to the filter length and plot the normalized runtime, i.e. runtime per filter element (the inverse of the number of operations per second).

    c. Is it more efficient to have the data or the filter in the outer loop? Why?

    d. How does the relative performance scale with the filter size? Explain the trend.

Loop Parallelism

  1. (10 pts) Implement loop parallel versions of the code. Run these on filters of length 256.

    a. Create a function parallelFilterFirst that is the same as serialFilterFirst except that it has the #pragma omp parallel for around the outer loop.

    b. Create a function parallelDataFirst that is the same as serialDataFirst except that it has the #pragma omp parallel for around the outer loop.
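As a concrete illustration of part 1b, here is a sketch of the data-first parallel variant. The snake_case name and signature are my own (the assignment's function is parallelDataFirst); the only change from the serial version is the pragma on the outer loop:

```c
#include <string.h>

void parallel_data_first(const unsigned int *input_array,
                         unsigned int *output_array, int data_len,
                         const unsigned int *filter_list, int filter_len) {
  memset(output_array, 0, data_len * sizeof(unsigned int));
  /* each iteration x writes only output_array[x], so threads never collide */
  #pragma omp parallel for
  for (int x = 0; x < data_len; x++) {
    for (int y = 0; y < filter_len; y++) {
      if (input_array[x] == filter_list[y]) {
        output_array[x] = input_array[x];
      }
    }
  }
}
```

Compile with -fopenmp as shown in the compilation note; without that flag the pragma is ignored and the function runs serially, but it still computes the correct result.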

  2. (20 pts) Speedup

    a. Generate speedup plots for 1,2,4,8, and 16 threads. You will have to specify the number of threads to run using omp_set_num_threads().

    b. Describe the results. What type of speedup was realized? When did speedup tail off? Why? What hardware did this run on and how did that influence your result?
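One way to collect the data points for the speedup plots is to set the thread count with omp_set_num_threads() and time each run with omp_get_wtime(). The helper and the stand-in workload below are my own sketch, not part of filter.c; the fallback definitions just let the sketch compile without -fopenmp:

```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* fallbacks so this sketch also compiles without -fopenmp */
#include <time.h>
static void omp_set_num_threads(int n) { (void)n; }
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

/* stand-in workload; replace with a call to your filter function */
static volatile unsigned long sink;
static void demo_work(void) {
  unsigned long s = 0;
  #pragma omp parallel for reduction(+:s)
  for (long i = 0; i < 1000000; i++) s += (unsigned long)i;
  sink = s;
}

/* time one invocation of fn using nthreads OpenMP threads */
double time_with_threads(int nthreads, void (*fn)(void)) {
  omp_set_num_threads(nthreads);
  double start = omp_get_wtime();
  fn();
  double elapsed = omp_get_wtime() - start;
  printf("%d threads: %f s\n", nthreads, elapsed);
  return elapsed;
}
```

Speedup on n threads is then the one-thread time divided by the n-thread time, with each time taken as the median of 20 runs per the instructions above.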

  3. (20 pts) Parallel performance.

    a. Which version of the parallel code was faster? Why?

    b. For the faster version, estimate the value of p in Amdahl’s law based on your results for 1 to 8 threads.

    c. To what do you attribute any non-ideal speedup: startup costs, interference, skew, etc.? Explain your answer.
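For the Amdahl’s law estimate, recall that speedup on N threads with parallel fraction p is S(N) = 1 / ((1 - p) + p/N); solving for p from a measured speedup gives p = (1/S - 1) / (1/N - 1). A small helper for checking your estimate (my own, not part of filter.c):

```c
/* invert Amdahl's law: given a measured speedup S on N threads,
   return the implied parallel fraction p = (1/S - 1) / (1/N - 1) */
double amdahl_p(double speedup, int nthreads) {
  return (1.0 / speedup - 1.0) / (1.0 / (double)nthreads - 1.0);
}
```

As a sanity check, perfect speedup (S = 8 on 8 threads) implies p = 1, and no speedup at all implies p = 0. In practice, compute p at each thread count from 1 to 8 and check whether a single value of p explains the whole curve.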

Loop Unrolling

  1. (30 pts) In this part, you will experiment with unrolling loops by a factor of eight. There are four variants based on the loop iteration order and which loop you unroll.
    • parallelFilterFirstUnrolledData should unroll the inner loop of parallelFilterFirst
    • parallelFilterFirstUnrolledFilter should unroll the outer loop of parallelFilterFirst
    • parallelDataFirstUnrolledFilter should unroll the inner loop of parallelDataFirst
    • parallelDataFirstUnrolledData should unroll the outer loop of parallelDataFirst

Make sure to run checkData to verify that your loop unrolling is correct. You will notice that you need a minimum filter length of 8 to match the degree of loop unrolling.

Compare the relative performance of these functions on a filter of length 256.
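As one concrete example of the four variants, here is a sketch of unrolling the inner (filter) loop of the data-first code by a factor of eight. The name is my own, and the sketch assumes the filter length is a multiple of 8, which the minimum filter length of 8 guarantees for the power-of-two lengths used in this assignment:

```c
#include <string.h>

void data_first_unrolled_filter(const unsigned int *input_array,
                                unsigned int *output_array, int data_len,
                                const unsigned int *filter_list,
                                int filter_len) {
  memset(output_array, 0, data_len * sizeof(unsigned int));
  #pragma omp parallel for
  for (int x = 0; x < data_len; x++) {
    /* the inner loop advances by 8 and the body tests 8 filter entries,
       eliminating 7 of every 8 loop-condition checks and increments */
    for (int y = 0; y < filter_len; y += 8) {
      if (input_array[x] == filter_list[y])     output_array[x] = input_array[x];
      if (input_array[x] == filter_list[y + 1]) output_array[x] = input_array[x];
      if (input_array[x] == filter_list[y + 2]) output_array[x] = input_array[x];
      if (input_array[x] == filter_list[y + 3]) output_array[x] = input_array[x];
      if (input_array[x] == filter_list[y + 4]) output_array[x] = input_array[x];
      if (input_array[x] == filter_list[y + 5]) output_array[x] = input_array[x];
      if (input_array[x] == filter_list[y + 6]) output_array[x] = input_array[x];
      if (input_array[x] == filter_list[y + 7]) output_array[x] = input_array[x];
    }
  }
}
```

Run checkData against the serial output to confirm that each unrolled version still computes the same result.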

  • What is the benefit of loop unrolling? What overhead does it eliminate?
  • Can you correlate the performance improvement with the fact that you unrolled the loop 8 times? Perhaps by instruction counting.
  • Can you conclude whether it is better to unroll the inner loop or outer loop?
  • Can you conclude whether it is better to unroll the data loop or the filter loop?

TERMINATE YOUR EC2 INSTANCES