Project #1: OpenMP Filter

Due date: September 24, 2021. 05:00:00 pm.

The project parallelizes a filtering function that we implemented and parallelized in our previous research. These are experiments that were run in the Burns lab in 2012. You will use OpenMP and evaluate the speedup realized on multicore processors. In this code, you have a data array D that contains labels. These may be superpixels in an image or a segmentation of a scene. You also have a filter list F that enumerates the labels of interest. The goal of the computation is to output a new data array D’ that contains only the elements of D that are listed in F, and zero otherwise.

I am giving you two serial implementations in filter.c that you may use as a reference. Here is the inner loop of one of them:

/* for all elements in the data */
for (int x=0; x<DATA_LEN; x++) {
  /* for all elements in the filter */
  for (int y=0; y<FILTER_LEN; y++) {
    /* if the data element matches the filter */
    if (input_array[x] == filter_list[y]) {
      /* include it in the output */
      output_array[x] = input_array[x];
    }
  }
}

Experimental Evaluation

You should run all performance experiments on a computer with at least 4 physical cores, i.e. if you are on an Intel architecture with hyperthreading, you need 8 hardware threads. You should also run experiments on a machine that is not being shared with other workloads. This ensures that resource measurements are accurate. For example, it is not appropriate to run tests on the ugrad machines.

To check how many physical cores are on your machine, you can use the following commands:

  • Linux or WSL: cat /proc/cpuinfo | grep "core id"
  • macOS: sysctl -a | grep machdep.cpu | grep core_count

If you do not have access to a machine with at least 4 cores, you may use one of the undergraduate machines or a cloud computing resource. All of AWS, GCE, Azure, Digital Ocean, etc. have inexpensive machines available. GCE and AWS have student promotions to gain access to free credits. In prior years, I required students to use AWS. I had said:

“The preferred implementation would run experiments on Amazon High-CPU 16 vCPU. I am recommending the c5a.4xlarge instances, which promise at least 8 physical cores to implement 16 vCPUs. This costs about $1.20/hour.”

However, I want to make this easier and more convenient. Use the tools with which you are most comfortable. If you want to get started with AWS, we have a simple assignment that serves as a quick-start guide: project0.1.html. If you are using a cloud instance, I strongly recommend that you do development and testing on some combination of your personal computer and the free-tier micro instances, and then run the big machine only for benchmarking.

When benchmarking runs, I would typically ask you to take a suitable number of runs to get narrow confidence intervals and plot charts with 95% confidence intervals (see http://en.wikipedia.org/wiki/Confidence_interval). This is good experimental methodology in computer science. If you are running on the cloud or on a ugrad machine, you may get performance outliers from resource sharing. For ugrad machines, you can run htop to make sure that no other processes are using a substantial amount of processing. In the cloud, you may also get non-repeatable results when you land on a different machine environment at different times. In most cases, you can inspect results to make sure that performance is stable over multiple runs and then report means. If you cannot get stable results over multiple runs, you can run the experiments many times and report the median or mode. This essentially throws out outliers. It is not good statistical practice, but it is occasionally necessary and occasionally used in cloud computing to deal with noise.

A Note on Compilation

Modern C/C++ compilers include a large number of code optimization flags. Enabling these optimizations can and will change the structure of your code as written. Many parts of this assignment depend on your code operating more or less as written. Therefore, you need to be careful to compile explicitly without optimizations:

gcc -o filter -O0 -fopenmp filter.c 

What to do?

All parts (a, b, c…) of numbered items (1, 2, 3…) require a written response that should be turned in. Answers should be typeset. Graphs should be produced using the tool of your choice and embedded in the document. Please turn in a single PDF.

Answers do not need to be long. The expectation is that one to three sentences should suffice for each part of each question. Answers that include extraneous or irrelevant details will have points deducted even if they contain the correct answer as a subset.

A Note on Ethics

For teams/partners working together, it is OK to use the same experimental data from the same code. You must interpret and present it independently. Specifically, you cannot generate one set of graphs and use it in submissions from multiple people.

Loop Efficiency

  1. Evaluate the performance of the functions serialFilterFirst and serialDataFirst that loop over the arrays in different orders as a function of the filter length. You should select a constant value for the data length that allows each experiment to complete in a reasonable, but non-trivial, amount of time, e.g. >100 ms on the shortest experiment and less than 60 s on the longest experiment. You should vary the filter length as 1, 2, 4, 8, 16, 32, 64, …, 256.

    a. Plot the runtime as a function of filter length (don’t forget to label axes and units).

    b. Does the runtime improve or degrade as the filter grows in size? Why? Hint: A filter of size 8 does twice as much work as a filter of size 4. So, divide the performance by the filter length to get a sense of efficiency.

    c. In what cases, if any, is serialFilterFirst faster? Explain why. Hint: This will be a function of the loop sizes and caching effects in the memory system.

    d. In what cases, if any, is serialDataFirst faster? Explain why.

Loop Parallelism

Implement loop parallel versions of the code. Run these on filters of length 256.

  • Populate the function parallelFilterFirst that is the same as serialFilterFirst except that it has the #pragma omp parallel for around the outer loop.
  • Populate the function parallelDataFirst that is the same as serialDataFirst except that it has the #pragma omp parallel for around the outer loop.
  1. Speedup: do this for both parallelFilterFirst and parallelDataFirst.

    a. Generate speedup plots for 1, 2, 4, 8, and 16 threads. You will have to specify the number of threads to run using omp_set_num_threads() or the OMP_NUM_THREADS environment variable.

    b. At what level of parallelism did speedup stop increasing? Explain.

    c. Estimate the Amdahl number p for both functions. Show your work.

    d. Which function is faster? Why?

    e. Which function is more scalable? Why?

    f. To what factors do you attribute the loss of parallel efficiency, i.e. why is this code not embarrassingly parallel?

Loop Unrolling

In this part, you will experiment with unrolling loops by a factor of eight. There are four variants based on the loop iteration order and which loop you unroll. Implement the following:

  • parallelFilterFirstUnrolledData should unroll the inner loop of parallelFilterFirst
  • parallelFilterFirstUnrolledFilter should unroll the outer loop of parallelFilterFirst
  • parallelDataFirstUnrolledFilter should unroll the inner loop of parallelDataFirst
  • parallelDataFirstUnrolledData should unroll the outer loop of parallelDataFirst

Make sure to run checkData to verify that your loop unrolling is correct. You will notice that you need a minimum filter length of 8 to match the degree of loop unrolling.
  1. Compare the relative performance of these functions on a filter of length 256, using the largest data size from part 2 and running with 8 threads.

    a. Is it more efficient to unroll the outer loop or the inner loop? Explain why.

    b. What operations/instructions are eliminated by loop unrolling? How many instructions per loop iteration?

    c. Which combination of loop unrolling and iteration order is the most efficient? Explain why. If this is a different loop order than was found in parts 1 or 2, explain the difference.

Submission Instructions

This assignment should be submitted on Gradescope. The signup code for Gradescope is ERDNJD. Answers should be typed and submitted as a PDF. Handwritten assignments will NOT be accepted. There are a total of 48 late hours for the duration of the semester for all of the assignments combined. Anything exceeding that will incur a penalty.