Assignment #3: Measuring Parallel Performance

Due date: Thursday 6 April 2017, 11:59:00 pm

In this assignment, you will write parallel programs using threads in Java. Follow the guidelines for creating concrete Runnable objects in Appendix C of Patterns for Parallel Programming. There will be few choices in the design of an appropriate parallel algorithm. The point of this exercise is to explore scaling and speedup on multi-core processors.

The majority of your grade (~60%) will be derived from your answers to the questions. These should be concise, thoughtful, and specific. You will turn in both a PDF document that answers the questions below and your source code.

Part 0: What You Need to Know

Amazon EC2

You will run your programs on Amazon EC2 using a c4.4xlarge, High-CPU Extra Large Instance. These machines have 62 EC2 Compute Units on 16 virtual cores and will allow you to explore scaling and speedup to that scale in multicore systems.

You should develop your code on your personal computer (or a t2.micro free tier instance) and only use EC2 for scalability testing.

vCPUs: See this excellent blog post for a description of vCPUs. The following quote is particularly revealing…

It turns out that a vCPU does not guarantee to you anything more than a single CPU thread.

What this means is a vCPU could be an entire CPU core… or you could be getting a single thread on a hyper-threaded core that you’re sharing with someone else (or, probably much more likely, with yourself, e.g. one Xeon core with HT enabled counts as two vCPUs). Depending on the workload you’re running, you might not see the performance you would expect from doubling the number of vCPUs.


Submit code to the coding assignment on Gradescope and the writeup to the writeup assignment there.

PDF Document: A PDF document that answers the analysis questions listed below. Include your name, the date, and your section number. This document must be a PDF. Google Docs is typically quite good at converting other file formats to PDF.

Source Code: Upload two Java files to Gradescope. Your coin flipper should be named, and your brute force DES key cracking program should be named

Compilation: The easiest way to compile your code is to use javac:

java CoinFlip

Part 1: Parallel Coin Flipping

Write a java program that flips coins in parallel. The program can be invoked from the shell (command line) and should take two arguments: the number of threads to start and the number of coin flips. The program should output the number of heads (or tails), the number of total coin flips, and the elapsed (wall clock) time to run the program. For example, my implementation does the following:

<randal@rio:/Users/randal/class/420/P1.MergeSort> java CoinFlip
Usage: CoinFlip #threads #iterations
<randal@rio:/Users/randal/class/420/P1.MergeSort> java CoinFlip 1 10000000
5000728 heads in 10000000 coin tosses.
Elapsed time: 525ms
<randal@rio:/Users/randal/class/420/P1.MergeSort> java CoinFlip 2 10000000
5001027 heads in 10000000 coin tosses.
Elapsed time: 267ms

For this experiment you will need to use some java functions. Specifically, System.currentTimeMillis() and Random.generate() and Random.nextInt(). The program I give you in part 2 has examples. You will also need join() all your child threads from the parent thread. This is the most basic form of synchronization.

Random(): It’s very important to not share your Random() number generator between threads. Instead, create Random() number generators in the constructor of your object. You will have to work out why in first question below…

Also, if you create a Random number generator (Random rand = new Random();) in your run method, you will get strange behavior. Make sure you have a single Random object for each thread, and make sure that object is created somewhere besides the run method (like the constructor).

Part 1: Coin Flipping Analysis

For reasonable parameters, measure the speed-up of this program from 1 thread to twice as many threads as cores (vCPUs in AWS). Your experiment should take seconds (so that it’s a significant number of trials) but not hours. For example, for 16 cores, you might run the program for 1,2,4,…,16,32 threads at 1000000000 coin tosses for speedup. (Don’t worry about +/- rounding errors for number of flips per thread.)

  1. Factors Against Parallelism

    a. Below we have provided the source for a common implementation of the next method of the Random object. The next method is always used to generate random objects, regardless of type (float, int, bool, etc). The nextInt method, for example, calls next with a bit value of 32.

     protected synchronized int next(int bits)
         seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
         return (int) (seed >>> (48 - bits));

    Given the source code above, explain why sharing a single Random object across all threads might be problematic. Be sure to cite any specific factors against parallelism that apply in your explanation.

    b. Suppose you created a single Random object shared across each of your threads. What type of speedup (e.g. linear) would you expect to see? Why?

  2. Speedup

    a. Produce a plot that shows the speedup of your program. Label all axes and include units.

    b. Algorithm (true) speedup measures the scaling performance of the algorithm as a function of processing elements. In this case, from 1..8 processing elements, characterize the algorithmic speedup. If it is sub-linear, describe the potential sources of loss.

    c. Why does the speedup not continue to increase past the number of cores? Does it degrade? Why?

  3. Design and run an experiment that measures the startup costs of this code.

    a. Describe your experiment. Why does it measure startup?

    b. Estimate startup cost. Justify your answer.

    c. Assuming that the startup costs are the serial portion of the code and the remaining time is the parallel portion of the code, what speedup would you expect to realize on 100 threads? 500 threads? 1000 threads? (Use Amdahl’s law.)

Part 2: Brute Force a DES Key

In this step, you will implement a parallel/threaded Java program that attempts to brute force a DES key. Because we do not have reasonable computing power to attack 56 bit keys, we are going to attack smaller keys. I also wanted to keep it simple. So, rather than giving you an already encrypted message, we are going to have your program generate the ciphertext and then brute force the decryption.

The attack is a combination brute force + known plaintext attack. This is quite common. If you the attacker know part of the message, you can attempt to decrypt with all possible keys and check to see if the decrypted output matches the know plaintext.

The assignment starts with my serial implementation of a brute force attack:

Parallelize this program so that it will search for a specified number of keys using a specified number of threads. My implementation does the following:

<randal@rio:/Users/randal/class/420/P1.MergeSort> java BruteForceDES
Usage: BruteForceDES #threads key_size_in_bits
<randal@rio:/Users/randal/class/420/P1.MergeSort> java BruteForceDES 2 20
Generated secret key 798711
Thread 0 Searched key number 0 at 0 milliseconds.
Thread 1 Searched key number 600000 at 1803 milliseconds.
Thread 0 Searched key number 100000 at 2204 milliseconds.
Thread 1 Searched key number 700000 at 3384 milliseconds.
Thread 0 Searched key number 200000 at 3783 milliseconds.
Thread 1 found decrypt key 798711 producing message: Johns Hopkins afraid of the big bad wolf?
Thread 1 Searched key number 800000 at 4891 milliseconds.
Thread 0 Searched key number 300000 at 5292 milliseconds.
Thread 1 Searched key number 900000 at 6426 milliseconds.
Thread 0 Searched key number 400000 at 6801 milliseconds.
Thread 1 Searched key number 1000000 at 7925 milliseconds.
Thread 0 Searched key number 500000 at 8309 milliseconds.
Final elapsed time: 9045

You should make sure that your threads have no shared state, variables, or objects. For example, make sure to create a separate SealedDES object for decryption in each thread. Do not use a single encryption/decryption object shared among all threads.

You should only encrypt the plaintext once and distribute the encrypted object to all your threads. This is similar to the building of the filter / data arrays in project 1 (which were never included in the timing for your experiments).

Part 2: Brute Force a DES Key Analysis

  1. For reasonable parameters and for however many cores you have on the system, measure the weak scaling of this program. (Note: There will be a lecture on weak scaling on March 27th).

    a. Produce a scaling chart and interpret/describe the results. Please use charts with correctly labelled axes and ticks. Is the scaling constant?

    b. Extrapolating from your scaling analysis, how long would it take to brute force a 56 bit DES key on a machine with 64 cores? Explain your answer.