Final Exam (EN 601.[3,4,6]20 Parallel Programming)

Due Date: Tuesday December 21, 2021 11:59 am EST (earlier submissions are welcome). Submit a PDF of your typeset answers to Gradescope. Handwritten answers will not be accepted.

This is an open book and open Internet untimed exam. The questions are designed to require thoughtful answers that are not obvious. You are required to turn in answers that are solely your own, prepared individually, and representative of your own thinking and understanding about the problem. Some questions may require additional research beyond the materials provided in class. You are welcome to use online or printed resources that are public, static, and citeable. This includes Web pages, journal and conference articles, and books. Any resources from which you draw data or use concepts must be cited in the answer at the location the concept is used. It is not adequate to put a list of citations at the end of your answer or at the end of the exam. You may not consult any human or interactive resources. You may not discuss ideas with friends or show or exchange data or partial answers. You may not post questions to Stack Overflow or use online answering services, such as Chegg.

If you have a question about the exam or need clarification, you may submit a private question to instructors on Piazza. Please do not post anything visible to the class. Please do not send email.

Please be careful to respond to all parts of the questions that are asked. Each question mark indicates that a response is required. Look for directives such as Explain, Define, or Give and follow these instructions. Please keep your answers as brief as possible. Answers that include extraneous facts and irrelevant details will lose credit even if a correct answer is included in the response.

  1. (8 points) Write PySpark code to identify all Web pages that have no incoming hyperlinks, i.e. no other Web pages link to them. You are given an RDD of key/value pairs that describe the linking structure of a Web site: the key is a page's URL and the value is a list of the hyperlinks contained in that page.
outlinks = [['a.com/a',['b.edu/b','d.org/d']],
            ['b.edu/b',['a.com/a','d.org/d']],
            ['c.com/c',['a.com/a','b.edu/b','d.org/d']],
            ['d.org/d',['a.com/a','b.edu/b']],
            ['e.edu/e',['a.com/a']]]
outlinksrdd = sc.parallelize(outlinks)

# TODO (your answer code goes here)

noinlinksrdd.collect()

This should return:

 ['e.edu/e', 'c.com/c']

Your implementation should use Spark transformations so that all intermediate results between outlinksrdd and noinlinksrdd are RDDs. Specifically, you should not collect() intermediate results. This is a programming problem. You should write PySpark code that runs correctly. It should also be concise and efficient.
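
For reference (and not as an answer to the question above), here is a minimal sketch of the expected style on toy data, assuming the usual SparkContext sc: every intermediate result is an RDD and collect() appears only once, as the final action.

nums = sc.parallelize(range(10))
evens = nums.filter(lambda n: n % 2 == 0)    # intermediate result: an RDD
squares = evens.map(lambda n: n * n)         # still an RDD; nothing computed yet
squares.distinct().collect()                 # the single action at the end

Because the transformations are lazy, Spark pipelines them and materializes data only at the final collect().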

  2. (5 points) Answer the following based on Figure 1 in Calore et al., “ThunderX2 Performance and Energy-Efficiency for HPC Workloads,” Computation, 2020. Note that this is a log-log roofline plot; the roofline bound is restated after the subquestions for reference.

    1. Which kernel, “Dirac” or “LBM (Collide)”, is closer to the empirical upper bound on performance on the ThunderX2? Justify your answer quantitatively, i.e., estimate values from the chart that support your conclusion.

    2. Which architecture (Skylake, ThunderX2, or Haswell) is best suited for the LBM (Collide) kernel? Justify your answer. Note: Multiple answers are reasonable, but they have different justifications. Give only one.
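
As a reminder, the bound shown in a roofline plot is

    P(I) = \min(P_{\mathrm{peak}},\; I \times B_{\mathrm{mem}})

where I is the arithmetic (operational) intensity in flop/byte, P_peak is the peak compute throughput, and B_mem is the sustained memory bandwidth. On a log-log plot the bandwidth term appears as a sloped roof and the compute term as a flat roof; a kernel's vertical distance below the roof at its intensity indicates how close it is to the empirical upper bound.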

  3. (5 points) In Project 3b, we used Dask compute() on multiple dataframes so that the Dask scheduler shares intermediate results among multiple computations. The following snippet is faster than calling compute() on df1 and df2 independently.

df1 = df.groupby(['Origin','TailNum']).DepDelay.mean()
df2 = df.groupby(['Origin','TailNum']).DepDelay.max()
dd.compute(df1, df2)

This is because Dask computes df.groupby(['Origin','TailNum']).DepDelay only once.
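
For concreteness, a self-contained toy version of this pattern (hypothetical rows; column names follow the flight-delay data used in the project) could be:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'Origin':   ['BWI', 'BWI', 'JFK', 'JFK'],
                    'TailNum':  ['N1',  'N1',  'N2',  'N2'],
                    'DepDelay': [5.0,   15.0,  0.0,   30.0]})
df = dd.from_pandas(pdf, npartitions=2)

df1 = df.groupby(['Origin', 'TailNum']).DepDelay.mean()
df2 = df.groupby(['Origin', 'TailNum']).DepDelay.max()

# A single compute() call lets the scheduler evaluate the shared
# grouped DepDelay intermediate once and reuse it for both aggregations.
mean_result, max_result = dd.compute(df1, df2)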

However, with Spark DataFrames, one must call collect() on each DataFrame separately.

df1 = df.groupBy("Origin", "TailNum").mean("DepDelay")
df2 = df.groupBy("Origin", "TailNum").max("DepDelay")
df1.collect()
df2.collect()

Note: This is not a programming problem. Your answer does not need to compile and run.

  (Question 3, continued) Answer the following subquestions:

    1. How can you encourage Spark to compute the shared intermediate result (the DepDelay values grouped by Origin and TailNum) only once?

    2. Rewrite the PySpark snippet to implement your solution.

  4. (8 points) On deadlock.

    1. Give a real-world example of a deadlock. The example should not be a computer program.

    2. In your example, what are the processes?

    3. In your example, what are the resources?

    4. How does your example meet the “no preemption” criterion for deadlock?

  5. (1 point) What parallel computing technology, architecture, or programming model excites you? What would you have liked to learn that we didn’t cover? Use no more than three sentences.