Parallel Computing for Data Science Fall 2023 (EN 601.420/620)

Syllabus in standard CS/JHU/ABET format. The material on this page mirrors that information.

Schedule

MW 4:30 pm - 5:30 pm, Hodson Hall 210. Zoom link with password can be found in Canvas.

Note to students trying to enroll

All course enrollment prior to the first day of classes 2023 will be conducted through SIS. For students eligible to enroll through SIS, requests for permission to enroll based on exceptional circumstances or requests to be promoted from the wait list will not be granted. Emails requesting such permissions will not receive replies.

Informal note on registration

In prior years, we have had the experience that all interested students have gained entry into the class eventually. There tends to be some melt of enrolled students in the second week of classes. If you are interested in taking the course, please email the instructor letting them know your intent. Then attend lectures and keep up. It will likely work out. No guarantees.

Course Description

This course studies parallelism in data science, drawing examples from data analytics, statistical programming, and machine learning. It focuses mostly on the Python programming ecosystem, but we will use C/C++ to accelerate Python, and Java to explore shared-memory threading. It explores parallelism at all levels, including instruction-level parallelism (pipelining and vectorization), shared-memory multicore, and distributed computing. Concepts from computer architecture and operating systems will be developed in support of parallelism, including Moore’s law, the memory hierarchy, caching, processes/threads, and concurrency control. The course will cover modern data-parallel programming frameworks, including Dask, Spark, Hadoop!, and Ray. The course will not cover GPU deep-learning frameworks or CUDA. The course is suitable for second-year undergraduate CS majors and graduate students from other science and engineering disciplines who have prior programming experience. [Systems]

Prerequisites:

  • Intermediate Programming (EN 601.120 or the equivalent)
  • Data Structures (EN 601.226 or the equivalent)
  • Computer Systems Fundamentals (EN 601.333 or the equivalent)

Course topics include:

  • Amdahl’s law and strong scaling
  • Data science in Python: dataframes, numpy, scipy
  • Machine learning in Python: scikit-learn
  • Instruction-level parallelism
  • Multicore architectures
  • Shared-memory parallelism and programming with threads
  • Memory hierarchy and caching
  • Synchronization and concurrency control
  • Gustafson’s law and weak scaling
  • Data-parallel distributed computing: Dask, Spark
  • Distributed actor programming: Ray
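
As a rough illustration of the two scaling laws named in the topic list, here is a minimal Python sketch. The function names and the 90% parallel fraction are illustrative assumptions, not course material. Amdahl's law holds the problem size fixed (strong scaling); Gustafson's law grows the problem with the number of workers (weak scaling), with the parallel fraction measured on the parallel run.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Strong scaling: fixed problem size, more workers.

    speedup = 1 / ((1 - f) + f / p)
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction: float, workers: int) -> float:
    """Weak scaling: problem size grows with the number of workers.

    scaled speedup = (1 - f) + f * p
    """
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * workers


if __name__ == "__main__":
    # Hypothetical program in which 90% of the work can run in parallel.
    for workers in (1, 2, 4, 8, 16):
        print(
            workers,
            round(amdahl_speedup(0.9, workers), 2),
            round(gustafson_speedup(0.9, workers), 2),
        )
```

Under Amdahl's law the speedup for this example can never exceed 1 / (1 - 0.9) = 10, no matter how many workers are added, whereas the Gustafson scaled speedup keeps growing linearly with the worker count.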

Comments on the 2023 Edition

This year we are deploying an e-book that is a complete description of everything that you are expected to know during the class. The book will link to external sources that you will be expected to read, use and understand. This will be explicit.

The e-book will include all of the exercises (programming homework). The intent this year is to have more (7-9), smaller exercises to reduce barriers to entry.

The changes that were made last year are still relevant.

Comments on the 2022 Edition

This course replaces Parallel Programming as it was taught from 2013–2021. It shares the same course number and you cannot receive credit for both. The new syllabus changes the focus of the course:

  • Examples will be drawn from machine learning and data science as much as possible.
  • The course will not cover GPU programming, GPUs, or machine learning frameworks, such as TensorFlow, Keras, and PyTorch.
  • Supercomputing and scientific/numerical applications will be deemphasized.
  • The course adds material on instruction-level parallelism, including pipelining and vectorization.

Most importantly, the term “Programming” has been replaced with “Computing”, which reflects the natural evolution of the field toward the architecture and systems aspects of the subject matter and away from programming patterns and parallel design.

Students are responsible for all material and announcements on this course Web page and Piazza.

Academic Conduct

The guidelines of Johns Hopkins’ undergraduate ethics policy and graduate student conduct policy apply to all activities associated with this course. Additionally, students are subject to the Computer Science Academic Integrity Code.

In addition, the following are specific ethics guidelines for this course.

Students are encouraged to consult with each other and even collaborate on all programming activities. This means that students may look at each other’s code, program in teams, and even help each other debug. Any code that was written together must have a citation (in code comments) that indicates who developed the code. Any code excerpted from outside sources must have a citation to the source (in code comments). You are also welcome to consult and use ChatGPT, GitHub Copilot, or other AI tools. Again, you must cite how the code was generated.

You must be able to understand and explain all code that you turn in. If you take code from another source, such as an AI tool, you must understand what the code does and how the code fulfills the requirements of the assignment. As part of evaluating work, the course staff may choose to have a meeting to review your program and have you demonstrate your understanding of the code.

For programming that is done in pairs or teams, all team members must be listed as collaborators in comments in the source code AND in any submitted documents (PDFs and notebooks). You should also specify the nature and scope of the collaboration: shared programming, discussion, debugging, or consulting. Failure to state a collaboration and its scope is an ethics violation and will be treated as such.

Each assignment involves questions that analyze the assignment and connect the program to course concepts. The answers to these questions must be prepared independently by each student and must be work that is solely their own. You cannot use AI tools to generate your answer. You are welcome to use AI tools to format, proofread, and clarify your answers. If you choose to do so, you should indicate what tool was used and include a copy of the input text and output text from the model.

Lecture Recordings and Attendance

All lectures will be available for synchronous delivery on Zoom. All lectures will be recorded and will be available on Panopto within 24 hours. The course is designed for in-person delivery. Course activities count toward your grade and cannot be performed asynchronously. Attendance is not required.

Course Staff

Instructor

Randal Burns, randal@jhu.edu, http://www.cs.jhu.edu/~randal/

  • Office Hours: Monday 6-7, Friday 12-1 (zoom https://wse.zoom.us/j/96988259048)

Teaching and Course Assistants

TA office hours will be held in Malone 207

Brian Wheatman, wheatman(at)cs.jhu.edu

  • Office Hours: Wednesday 6-7

Ariel Lubonja, ariel(at)cs.jhu.edu

  • Office Hours: Wednesday 10-11:30 (zoom https://JHUBlueJays.zoom.us/j/7936487340)

Meghana Madhyastha, mmadhya1(at)jhu.edu

  • Office Hours: Tuesday 2:30-3:30

Brian Choi, bchoi11(at)cs.jhu.edu

  • Office Hours: Thursday 2:30-3:30

Course Goals

Specific Outcomes for this course are that students:

  • Take a computational task and construct an implementation that maximizes parallelism.
  • Analyze and instrument an implementation of a computer program for its speedup, scaleup, and parallel efficiency.
  • Reason about the loss of parallel efficiency and attribute that loss to factors, including startup costs, interference, and skew.
  • Work with a diverse set of programming tools for different parallel environments, including cloud computing, high-performance computing, multicore, and GPU accelerators.
  • Analyze how locality, latency, and coherency in the memory hierarchy influence parallel efficiency and improve program design based on the properties of memory.
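
The metrics named in the outcomes above can be computed from measured run times. The following is a minimal sketch using the common textbook definitions; the function names and the example timings are hypothetical, and the scaleup definition (time for the base problem on one worker divided by time for a p-times larger problem on p workers) is an assumption about which definition is intended.

```python
def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Ratio of serial run time to parallel run time for the same problem."""
    return serial_seconds / parallel_seconds


def parallel_efficiency(
    serial_seconds: float, parallel_seconds: float, workers: int
) -> float:
    """Speedup divided by the number of workers; 1.0 is ideal."""
    return speedup(serial_seconds, parallel_seconds) / workers


def scaleup(base_seconds: float, scaled_seconds: float) -> float:
    """Time for the base problem on one worker divided by the time for a
    p-times larger problem on p workers; 1.0 is ideal (assumed definition)."""
    return base_seconds / scaled_seconds


if __name__ == "__main__":
    # Hypothetical measurements: 12 s serially, 4 s on 4 workers.
    print(speedup(12.0, 4.0))                 # 3.0
    print(parallel_efficiency(12.0, 4.0, 4))  # 0.75
    # Hypothetical weak-scaling run: 4x the data on 4 workers takes 15 s.
    print(scaleup(12.0, 15.0))                # 0.8
```

An efficiency below 1.0, as in this example, is the loss that the course asks students to attribute to factors such as startup costs, interference, and skew.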

This course will address the following CSAB ABET Criterion 3 Student Outcomes. Graduates of the program will have an ability to:

  1. Analyze a complex computing problem and to apply principles of computing and other relevant disciplines to identify solutions.
  2. Design, implement, and evaluate a computing-based solution to meet a given set of computing requirements in the context of the program’s discipline.
  3. Apply computer science theory and software development fundamentals to produce computing-based solutions.

Grading

The course will include six to nine programming activities that span one to two weeks of course time. Activities will be graded for completion of the assignment. Activities that are incomplete or do not fulfill the stated objectives may be resubmitted with permission from the instructor. The goal of the activities is for the student to gain skills with the algorithms, programming tools, and principles presented in the class. Answers that are incorrect or programs that do not meet the assignment objectives will either (1) be marked as incorrect to provide feedback to the student or (2) be returned to the student for resubmission. Every student will have the opportunity to receive all credit for all activities. Activities make up 40% of the course grade.

The course has two exams: a midterm and a final. Each exam counts for 30% of the course grade.

The final letter grades do not depend solely on the achievement of a target score over all assignments and exams. Grades will be determined based on the achievement of learning goals. The course staff will determine a map of total scores to grades at the end of the semester. This policy lets instructors account for variance in exam scores, specifically when the exam scores are lower than intended or expected by the instructors. Grades will start with the following guidelines:

  • 93.3% or more -> A
  • 90% - 93.3% -> A-
  • 86.6% - 90% -> B+
  • 83.3% - 86.6% -> B
  • 80% - 83.3% -> B-
  • less than 80% -> TBD based on evidence of learning

The instructors may choose to move the grade boundaries down, e.g., move the A- threshold from 90% to 87%, based on how the course realized learning goals. We will not move the thresholds up.
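
For students who want to estimate where they stand, the weights and starting guidelines above can be sketched in Python as follows. This is an illustration only, not the official grading procedure: the function names are hypothetical, the A threshold is taken to be 93.3% (the upper bound of the A- band), boundary scores are assumed to round up to the higher grade, and the final map from scores to letters is set by the course staff at the end of the semester.

```python
# Starting guidelines only; the staff may lower (never raise) these thresholds.
GUIDELINE_THRESHOLDS = [
    (93.3, "A"),
    (90.0, "A-"),
    (86.6, "B+"),
    (83.3, "B"),
    (80.0, "B-"),
]


def course_score(activities: float, midterm: float, final: float) -> float:
    """Weighted total in percent: 40% activities, 30% midterm, 30% final."""
    return 0.40 * activities + 0.30 * midterm + 0.30 * final


def guideline_grade(score: float) -> str:
    """Map a percent score to the starting guideline letter grade.

    Assumption: a score exactly on a boundary receives the higher grade.
    """
    for threshold, letter in GUIDELINE_THRESHOLDS:
        if score >= threshold:
            return letter
    return "TBD based on evidence of learning"


if __name__ == "__main__":
    # Hypothetical student: full activity credit, 90% midterm, 80% final.
    total = course_score(100.0, 90.0, 80.0)
    print(round(total, 1), guideline_grade(total))
```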

Textbooks used in Course

The course does not follow a textbook. Lectures will refer to specific material from the following books. For the O’Reilly Books, you must first access one through the library proxy. After that, the links will work.

Mattson, T. G., B. A. Sanders, and D. L. Massingill. Patterns for Parallel Programming. Addison-Wesley, 2004. This text is available online to Hopkins students.

Matloff, N. Parallel Computing for Data Science. CRC Press, 2015. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.467.9918&rep=rep1&type=pdf

Herlihy, M. and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, 2008. This text is available online to Hopkins students.

To access online books:

  • Log in to the Pulse VPN (if outside of JHU).
  • Navigate to library.jhu.edu and search for the title.
  • Follow the online access link shown in the search result.