Lectures
Lectures are designed for synchronous delivery. The recorded version is not expected to be an adequate substitute for attending.
Recorded lectures are available on Panopto through Blackboard. Log in to Blackboard; lectures are named LecXX.[Topic].[Date].mp4. Although Panopto is inconvenient, it is the only way to restrict access to lectures to enrolled students.
Jupyter notebooks for lectures are available on GitHub (https://github.com/randalburns/pplectures2021). You are encouraged to clone this repo, pull before each lecture, and run the notebooks.
Projects
- Project 1: OpenMP Filter
- Homework 1.5: All Possible Regressions in Python (due October 1, 2021 5:00 pm EDT)
- complete the notebook in pplectures2021/homework/HW1.5.ipynb
- in this way you will have the dataset and the environment
- Project 2: Java BlockingQueue (due October 15, 2021 5:00 pm EDT)
- Project 3: dask notebooks (due November 2, 2021 5:00 pm EDT)
- Project 4: k-means in Spark
- Project 5: Ray Deadlock (due December 6, 2021 5:00 pm EDT)
- submissions may be turned in as late as December 8, 2021, 5:00 pm with no late penalty
Midterm
- Due October 22, 2021, 5:00 pm EDT. Submit as a PDF.
- The midterm is an open-Internet, take-home exam. It will be distributed on Wednesday, October 20, 2021 at 5:30 pm.
- Any updates, errata, or corrections will be posted on this webpage. I will announce changes on Piazza.
Final
- The Final Exam was released on December 13, 2021 at 2:00 pm.
- DUE Tuesday, December 21, 2021, 11:59 am.
- Early submissions are strongly encouraged.
Late Hours
A total of 48 late hours are permitted per semester to use as needed for projects. Late hours will be rounded up to the nearest hour, e.g. a project submitted 2.5 hours late will count as using 3 late hours.
Course Schedule
(30 August) Introduction to Parallel Programming
Syllabus review. Academic ethics. Parallelism in modern computer architectures. Performance beyond computational complexity. An introduction to course tools for lectures and homework: conda/python/github/jupyter.
- Reading:
- Mattson, Patterns for Parallel Programming, Chapter 1.
(30 August) A First Parallel Program
Parallelization with joblib in Python. The Global Interpreter Lock. Python packages for data science. Performance timing.
- Reading:
- Matloff, Parallel Computing for Data Science, Chapter 1.
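For reference, a minimal sketch of the kind of joblib parallelism covered in this lecture (not taken from the lecture notebooks): process-based workers sidestep the Global Interpreter Lock for CPU-bound Python work, and simple wall-clock timing shows the speedup. The function and task sizes are made up for illustration.

```python
# A minimal joblib sketch: run an embarrassingly parallel loop serially and
# then with 4 worker processes, timing both.
import math
import time
from joblib import Parallel, delayed

def slow_sqrt_sum(n):
    # CPU-bound pure-Python work; threads would be serialized by the GIL,
    # so joblib's default process-based (loky) backend is used instead.
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    tasks = [2_000_000] * 8

    start = time.perf_counter()
    serial = [slow_sqrt_sum(n) for n in tasks]
    print(f"serial:   {time.perf_counter() - start:.2f} s")

    start = time.perf_counter()
    parallel = Parallel(n_jobs=4)(delayed(slow_sqrt_sum)(n) for n in tasks)
    print(f"parallel: {time.perf_counter() - start:.2f} s")
```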
(6 September) Amdahl’s Law, Strong Scaling, and Parallel Efficiency
Amdahl’s law is the fundamental principle behind strong scaling in parallel computing. Strong scaling is the process of solving a problem of fixed size faster with parallel resources.
- Reading:
- Mattson, Patterns for Parallel Programming, Ch. 2.4-2.6
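A small worked example of Amdahl's law (not course-provided code): with serial fraction f, the speedup on p processors is bounded by S(p) = 1 / (f + (1 - f)/p). The 5% serial fraction below is an arbitrary illustration.

```python
# Amdahl's law: upper bound on speedup for serial fraction f on p processors.
def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

for p in (2, 4, 8, 16, 1024):
    s = amdahl_speedup(0.05, p)          # assume 5% of the work is serial
    print(f"p={p:5d}  speedup={s:6.2f}  parallel efficiency={s / p:.2%}")
# Even with unlimited processors the speedup approaches 1/f = 20x,
# and parallel efficiency falls as processors are added.
```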
(8 September) OpenMP
Lecture 4: An introduction to parallelizing serial programs with compiler directives. Serial equivalence and loop-parallel constructs.
- Reading:
- Mattson, Patterns for Parallel Programming, Appendix A.
- Reference Materials:
- LLNL Tutorial (ignore Fortran stuff): https://computing.llnl.gov/tutorials/openMP/
- Specification (it’s actually really useful): http://www.openmp.org/mp-documents/spec30.pdf
(15 September) Cache Hierarchy
Lecture 5: Memory hierarchy and latency. Caching concepts: size, lines, associativity, and inclusion/exclusion. Caching microbenchmarks.
- Reading:
- Dongarra et al. Accurate Cache and TLB Characterization Using Hardware Counters. https://link.springer.com/content/pdf/10.1007/978-3-540-24688-6_57.pdf
- Please read to comprehend Figures 1, 2, and 3. Figures 4, 5, and 6 and the second microbenchmark (papi cacheBlock) are less important and difficult to understand.
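For a rough feel for the microbenchmarking idea, here is a small, hedged Python/NumPy sketch (the lecture's microbenchmarks use hardware counters and compiled code, so this only approximates the effect): reading a fixed number of elements at growing strides makes each access touch a new cache line once the stride exceeds the 64-byte line size.

```python
# Stride microbenchmark sketch: same number of element reads at each stride,
# so rising time per element reflects cache-line and memory behavior.
import time
import numpy as np

N_ACCESSES = 1 << 20                       # elements read at every stride
for stride in (1, 2, 4, 8, 16, 32):
    x = np.ones(N_ACCESSES * stride, dtype=np.float64)
    x[::stride].sum()                      # warm-up pass
    t0 = time.perf_counter()
    x[::stride].sum()                      # strided view, no copy
    dt = time.perf_counter() - t0
    print(f"stride {stride:3d} ({stride * 8:4d} bytes): "
          f"{dt / N_ACCESSES * 1e9:6.2f} ns/element")
```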
(20 September) Loop Optimization
Lecture 6: Loop Optimizations
- Reading:
- Performance tutorial (this is good!): http://www.akira.ruc.dk/~keld/teaching/IPDC_f10/Slides/pdf4x/4_Performance.4x.pdf
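A minimal illustration of why loop order matters (a NumPy sketch, not from the lecture materials): traversing a row-major 2D array along rows gives unit-stride, cache-friendly access, while traversing it along columns strides across the whole row length on every element. Array sizes are arbitrary.

```python
# Loop-interchange sketch: row-major vs. column-major traversal of a C-ordered array.
import time
import numpy as np

a = np.random.rand(4000, 4000)             # C (row-major) order: rows are contiguous

t0 = time.perf_counter()
total = 0.0
for i in range(a.shape[0]):
    total += a[i, :].sum()                 # unit-stride access within each row
row_time = time.perf_counter() - t0

t0 = time.perf_counter()
total = 0.0
for j in range(a.shape[1]):
    total += a[:, j].sum()                 # strided access: one element per cache line
col_time = time.perf_counter() - t0

print(f"row-major traversal:    {row_time:.3f} s")
print(f"column-major traversal: {col_time:.3f} s")
```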
(22 September) Moore’s Law and Factors Against Parallelism
Lecture 7: Startup, interference, and skew.
- Reading:
- Chapter 2, Patterns for Parallel Programming
(27 September) Vector Processing and Processor Intrinsics
Guest lecturer: Brian Wheatman
(29 September) JIT Compilation, Moore’s Law, Parallel Efficiency
Lecture 9: A potpourri of stuff that I have not gotten to.
(4 October) Processes, Threads, and Java Threads
Lecture 10a: Java Threads
(6 October) Java Concurrency Control
Asynchrony, waiting on threads, volatile variables, and synchronized functions.
- Reading:
- Appendix C: Patterns for Parallel Programming
(11 October) Mutual Exclusion
Lecture 12: Critical sections and fast mutual exclusion.
- Reading:
- Chapters 1 and 2-2.6: Herlihy and Shavit
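The lecture's examples are in Java, but the critical-section idea carries over directly; a minimal Python analogue (illustrative only) guards a shared counter with a lock so that only one thread at a time executes the read-modify-write.

```python
# Critical-section sketch in Python: without the lock, the read-modify-write
# on `counter` can interleave across threads and the final count is not guaranteed.
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:                 # critical section: one thread at a time
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                     # 400000 with the lock in place
```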
NO MATERIAL past this point is on the midterm
(13 October) Dask
Lecture 13: Dask Arrays. Data parallel and declarative programming. Execution graphs and lazy evaluation.
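A minimal dask.array sketch of the lazy, declarative style discussed here (not from the lecture notebooks; array sizes and chunking are arbitrary): each expression extends a task graph, and nothing executes until .compute().

```python
# Dask array sketch: chunked array, lazy expression graph, explicit compute().
import dask.array as da

x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))   # 10x10 chunks
y = (x + x.T).mean(axis=0)     # still lazy: a graph of per-chunk tasks
print(y)                        # prints the symbolic array, not values
result = y.compute()            # executes the graph in parallel
print(result.shape)             # (20000,)
```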
(18 October) Dask Dataframes
Lecture 14: Parallel Pandas. Slicing and Aggregation. Indexing.
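A minimal dask.dataframe sketch of parallel pandas-style slicing and aggregation (the CSV path and column names below are hypothetical): operations are lazy and run one task per partition when computed.

```python
# Dask dataframe sketch: pandas-like filtering, groupby, and aggregation
# evaluated lazily across partitions.
import dask.dataframe as dd

df = dd.read_csv("data/trips-*.csv")          # hypothetical files; one partition per file
fast = df[df.duration < 600]                  # filtering is lazy
by_day = fast.groupby("day").duration.mean()  # aggregation is lazy too
print(by_day.compute())                       # triggers the parallel execution
```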
(20 October) Introduction to Map/Reduce
Lecture 15: The Google Parallel computing environment, functional programming concepts applied to large-scale parallelism, text processing.
- Reading:
- J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004
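To make the map/shuffle/reduce phases concrete, here is a toy word count in plain Python (illustrative only; real MapReduce distributes each phase across a cluster and sorts by key during the shuffle).

```python
# Toy word count showing the three MapReduce phases in one process.
from collections import defaultdict

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: emit (word, 1) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group intermediate values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)   # {'the': 3, 'quick': 2, 'brown': 1, ...}
```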
(25 October) Hadoop!
Lecture 16: Hadoop! programming, the WordCount tutorial, and the Hadoop! toolchain.
(11 November) Triangle Counting in Hadoop!
Lecture 17: Friends-of-friends running example. The M/R sorting guarantee and combiners.
(1 November) Introduction to Spark
Lecture 18: Spark and Resilient Distributed Datasets.
- Reading:
- Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI, 2012
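A minimal PySpark sketch of the RDD model (the input file name is a placeholder): transformations are lazy and build a lineage graph, so lost partitions can be recomputed; an action such as take() triggers the job.

```python
# PySpark RDD word count: lazy transformations, one action at the end.
from pyspark import SparkContext

sc = SparkContext("local[4]", "wordcount")      # 4 local worker threads
counts = (sc.textFile("input.txt")              # placeholder input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))   # still lazy: no job has run yet
print(counts.take(10))                          # take() is an action; it runs the job
sc.stop()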
(3 November) Roofline
Lecture 19: The roofline performance model and off-chip bandwidth.
- Reading: Understand operational intensity and the memory-limited and processing-limited portions of the chart. This will be on the final, as described in class!
- Williams et al. Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures, CACM, 52(4), 2009.
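A back-of-the-envelope roofline calculation (the peak compute and bandwidth numbers below are made-up machine parameters): attainable performance is the minimum of peak compute and bandwidth times operational intensity.

```python
# Roofline sketch: attainable GFLOP/s as a function of operational intensity.
PEAK_GFLOPS = 500.0      # assumed peak floating-point rate
PEAK_GBS = 50.0          # assumed peak off-chip bandwidth (GB/s)

def attainable_gflops(oi):
    """oi = operational intensity, in flops per byte moved from DRAM."""
    return min(PEAK_GFLOPS, PEAK_GBS * oi)

for oi in (0.25, 1.0, 4.0, 10.0, 16.0):
    print(f"OI = {oi:5.2f} flop/byte -> {attainable_gflops(oi):7.1f} GFLOP/s")
# The ridge point is PEAK_GFLOPS / PEAK_GBS = 10 flop/byte: kernels below it
# are memory-bound, kernels above it are compute-bound.
```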
(8 November) Ray: Task Programming with Remote Functions
Lecture 20: Remote functions, distributed objects, distributed memory management
- Reading: P. Moritz et al. Ray: A Distributed Framework for Emerging AI Applications. OSDI, 2018.
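A minimal Ray sketch of remote functions (illustrative, assuming a local Ray installation): @ray.remote turns a function into a task that runs on a worker and immediately returns an object reference, and ray.get() blocks to retrieve the results.

```python
# Ray remote-function sketch: launch tasks asynchronously, then gather results.
import ray

ray.init()                       # starts a local Ray cluster

@ray.remote
def square(x):
    return x * x

futures = [square.remote(i) for i in range(8)]   # 8 tasks launched, none blocking
print(ray.get(futures))          # blocks until all results arrive: [0, 1, 4, ...]
ray.shutdown()
```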
(15 November) BSP, Barriers, and Ray Actors
Lecture 21: Bulk synchronous parallel, barrier synchronization, stateful distributed objects, service centers, and ray.get() as a synchronization primitive.
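A minimal Ray actor sketch (illustrative only): an actor holds state across method calls, and calling ray.get() on a batch of futures acts as a synchronization point, much like the barrier at the end of a BSP superstep.

```python
# Ray actor sketch: stateful remote objects plus ray.get() as a barrier.
import ray

ray.init()

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0
    def add(self, x):
        self.value += x
        return self.value

counters = [Counter.remote() for _ in range(4)]
# One "superstep": every actor does its work; ray.get() waits for all of them.
results = ray.get([c.add.remote(10) for c in counters])
print(results)                   # [10, 10, 10, 10]
ray.shutdown()
```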
(17 November) MPI, Deadlock, and Flynn's Taxonomy
Lecture 22a/b/c
- Reading:
- MPI Tutorial, Lawrence Livermore National Lab
- Mattson, Appendix B, Patterns for Parallel Programming.
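The classic MPI deadlock pattern can be sketched with mpi4py (illustrative only; run with something like mpirun -n 2): if both ranks post a blocking send before any receive and the messages are too large to buffer, each rank waits on the other forever. Ordering the calls so one rank receives first breaks the cycle.

```python
# mpi4py sketch of a safe two-rank exchange; the deadlocking variant is noted below.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank
data = np.ones(1_000_000, dtype=np.float64)
recv = np.empty_like(data)

if rank == 0:
    comm.Send(data, dest=other)      # safe ordering: rank 0 sends first...
    comm.Recv(recv, source=other)
else:
    comm.Recv(recv, source=other)    # ...while rank 1 receives first
    comm.Send(data, dest=other)
# Deadlock variant: both ranks call Send() then Recv() with messages too large
# for the MPI implementation to buffer internally.
```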
(29 November) GPU architecture
Lecture 23: The evolution of GPU computing, from graphics pipeline to GPGPU to CUDA. GPU hardware.
- Cool blog post about GPUs in deep learning. https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
(1 December) The Google TPU
Lecture 24
- Reading:
- Jouppi et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. ISCA, 2017.
(6 December) Top 500 Supercomputers
Lecture 25: slides