Assignment #2: OpenMP Filter

Due date: Tuesday 14 March 2017, 11:59:00 pm

The project parallelizes a filtering function that we have been working on in our research using OpenMP and evaluates the speedup realized on multicore processors. The core concept is that you have a data array D that contains labels. These may be superpixels in an image or a segmentation of a scene. You also have a filter list F that enumerates the labels of interest. The goal of the computation is to output a new data array D’ that contains only the elements of D that are listed in F, zero otherwise.

I am giving you two serial implementations that you may use as a reference: filter.c. Here is the inner loop of one of them:

/* start with an empty *all zeros* output array */
output_array = (unsigned int*) malloc ( DATA_LEN * sizeof(unsigned int));
memset ( output_array, 0, DATA_LEN );

/* for all elements in the data */
for (int x=0; x<DATA_LEN; x++) {
  /* for all elements in the filter */
  for (int y=0; y<FILTER_LEN; y++) {
    /* it the data element matches the filter */
    if (input_array[x] == filter_list[y]) {
      /* include it in the output */
      output_array[x] = input_array[x];
    }
  }
}

Experimental Evaluation

All performance evaluations should be run on the Amazon High-CPU 8-core instances (c4.2xlarge) which currently cost about $0.40/hour. I recommend that you do development and testing on some combination of you personal computer and the free-tier micro instances and then launch a large instance for performance evaluation. All experimental data points should represent the median of 20 runs. This is good experimental practice (see http://en.wikipedia.org/wiki/Confidence_interval for the motivation.) If you are experiencing serious outliers in performance you may use your judgement to discard them as not representing system behavior. This happens on AWS. Discarding outliers is not good experimental practice, but may be necessary.

We have provided a simple BASH script that allows you to run your code multiple times with a single command. Download the script from the parallel website onto your EC2 instance and make sure it is executible:

wget http://parallel.cs.jhu.edu/assets/code/runloop.sh
chmod +x runloop.sh

You can then run any program you want multiple times by placing ./runloop in front of your normal shell command. E.g.:

./runloop ./filter 

You can change the number of iterations on line 4 of the runloop.sh script.

Screen

You may also find it useful to run your experiments in a screen. screen is a useful Linux utility that allows you to run interactive shells in their own process. By using screen, you can disconnect and reconnect to the EC2 instance without losing your work.

To start a screen session, use the -R flag:

screen -R parallel

To exit out of a screen and return to your main terminal (“detach”), type ctrl + a, release, then hit the d key. To see a list of running screens:

screen -list 

To reattach, use -r:

screen -r parallel 

Note that some keyboard shortcuts are different in a screen than they are in a normal shell.

A Note on Compilation

Modern C/C++ compilers include a large number of code optimization flags. Enabling these optimizations can and will change the structure of your code as written. Many parts of this assignment depend on your code operating more or less as written. Therefore, you need to be careful to compile explicitly without optimizations:

gcc -o filter -O0 -fopenmp filter.c 

Additionally, note that gcc will not be installed by default on your Amazon instance. To install gcc, g++, and a few other handy tools for compilation, run the following command:

sudo apt-get install build-essential 

What to do?

Loop Efficiency

  1. (20 pts) Evaluate the performance of the functions serialFilterFirst and serialDataFirst that loop over the arrays in different orders as a function of the filter length. You should select a constant value for the data length that allows each experiment to complete in a reasonable, but non-trivial amount of time. You should vary the filter length as 1,2,4,8,16,32,64,…(at least to 1024).

    a. Plot the runtime as a function of filter size (don’t forget to label axes and units).

    b. Normalize the runtime to the filter length and plot the normalized runtime, i.e. number of operations per second.

    c. Is it more efficient to have the data or the filter in the outer loop? Why?

    d. How does the relative performance scale with the filter size? Explain the trend.

Loop Parallelism

  1. (10 pts) Implement loop parallel versions of the code. Run these on filters of length 512. a. Create a function parallelFilterFirst that is the same as b. serialFilterFirst except that it has the #pragma omp parallel for around the outer loop.

    b. Create a function parallelDataFirst that is the same as serialDataFirst except that it has the #pragma omp parallel for around the outer loop.

  2. (20 pts) Speedup

    a. Generate speedup plots for 1,2,4,8, and 16 threads. You will have to specify the number of threads to run using omp_set_num_threads().

    b. Describe the results. What type of speedup was realized? When did speedup tail off? Why? What hardware did this run on and how did that influence your result?

  3. (20) Parallel performance.

    a. Which version of the parallel code was faster? Why?

    b. For the faster, estimate the value of p in Amdahl’s law based on your results for 1 to 8 threads.

    c. To what do you attribute any non-ideal speedup? To startup costs, interference, skew, etc? Explain your answer.

An Optimized Version

  1. (30 pts) Consider the following code optimizations. For each, implement the optimized code and describe your result. For optimizations that improve run time, explain why. Similarly, provide an explanation if an optimization is ineffective (or makes performance worse). The following optimization resource may help with your implementation and explanations: http://www.akira.ruc.dk/~keld/teaching/IPDC_f10/Slides/pdf4x/4_Performance.4x.pdf. This is the most important part of the assignment. Please dig deeply.

    a. Loop Unrolling

    b. Custom Scheduling (e.g. Dynamic, Static with user specified block size)

What to turn in?

You will submit your code to Gradescope in two separate assignments. Submit your writeup under the Project 2 Writeup assignment, and your code under the Project 2 Code assignment.

Write-up: A typed (do not submit photos/scanned handwritten work) PDF file should be turned in that includes answers to all of the questions identified by number and sub-part. This should also include graphs for the requested experiments. Answer all parts of all questions. Graphs should include legends, axis labels, and units.

Code: Your code should be a single file named filter.c that includes both serial functions and both parallel functions. Note that the autograder for this assignment simply makes sure your code compiles.

TERMINATE YOUR EC2 INSTANCES