Parallel Computing for Data Science Fall 2022 (EN 601.420/620)

Syllabus in standard CS/JHU/ABET format. The material on this page mirrors that information.

Note to students trying to enroll

All course enrollment prior to the first day of classes in fall 2022 will be conducted through SIS. For students eligible to enroll through SIS, requests for permission to enroll based on exceptional circumstances and requests to be promoted from the wait list will not be granted. Emails requesting such permissions will not receive replies.

Course Description

This course studies parallelism in data science, drawing examples from data analytics, statistical programming, and machine learning. It focuses mostly on the Python programming ecosystem but will use C/C++ to accelerate Python and Java to explore shared-memory threading. It explores parallelism at all levels, including instruction-level parallelism (pipelining and vectorization), shared-memory multicore, and distributed computing. Concepts from computer architecture and operating systems will be developed in support of parallelism, including Moore’s law, the memory hierarchy, caching, processes/threads, and concurrency control. The course will cover modern data-parallel programming frameworks, including Dask, Spark, Hadoop, and Ray. The course will not cover GPU deep-learning frameworks or CUDA. The course is suitable for second-year undergraduate CS majors and for graduate students from other science and engineering disciplines who have prior programming experience. [Systems]
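As a taste of the data-parallel style the course teaches, here is a minimal sketch using Dask, one of the frameworks named above. The `square` task is purely illustrative; `dask.delayed` and `dask.compute` are standard Dask APIs.

```python
# A minimal, illustrative sketch: dask.delayed builds a graph of
# independent tasks, and dask.compute executes it across local cores.
import dask

@dask.delayed
def square(x):
    return x * x

tasks = [square(i) for i in range(8)]   # eight independent tasks
print(dask.compute(*tasks))             # (0, 1, 4, 9, 16, 25, 36, 49)
```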

Prerequisites:

  • Intermediate Programming (EN 601.120 or the equivalent)
  • Data Structures (EN 601.226 or the equivalent)
  • Computer Systems Fundamentals (EN 601.333 or the equivalent)

Course topics include:

  • Amdahl’s law and strong scaling (both scaling laws are sketched just after this list)
  • Data science in Python: dataframes, numpy, scipy
  • Machine learning in Python: scikit-learn
  • Instruction-level parallelism
  • Multicore architectures
  • Shared-memory parallelism and programming with threads
  • Memory hierarchy and caching
  • Synchronization and concurrency control
  • Gustafson’s law and weak scaling
  • Data-parallel distributed computing: Dask, Spark
  • Distributed actor programming: Ray
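As a preview of the two scaling laws above, here is a minimal sketch in Python, where `p` is the parallelizable fraction of the work and `n` is the number of workers:

```python
# Amdahl's law (strong scaling): the problem size is fixed, so the
# serial fraction (1 - p) bounds the achievable speedup at 1/(1 - p).
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Gustafson's law (weak scaling): the problem grows with n, so the
# scaled speedup keeps increasing as workers are added.
def gustafson_speedup(p: float, n: int) -> float:
    return (1.0 - p) + p * n

# With 95% parallelizable work and 8 workers:
print(amdahl_speedup(0.95, 8))     # ~5.93, capped at 20 as n grows
print(gustafson_speedup(0.95, 8))  # 7.65
```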

Comments on the 2022 Edition

This course replaces Parallel Programming as it was taught from 2013–2021. It shares the same course number, and you cannot receive credit for both.
The new syllabus changes the focus of the course:

  • Examples will be drawn from machine learning and data science as much as possible.
  • The course will not cover GPU programming, GPUs, or machine learning frameworks, such as TensorFlow, Keras, and PyTorch.
  • Supercomputing and scientific/numerical applications will be deemphasized.
  • The course adds material on instruction-level parallelism, including pipelining and vectorization (see the sketch after this list).
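To preview that new material, here is a minimal sketch (assuming NumPy is available) contrasting a scalar Python loop with a vectorized expression that compiled code can map onto SIMD units:

```python
# A minimal, illustrative sketch of vectorization with NumPy.
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Scalar loop: one multiply-add per interpreted Python iteration.
y_loop = [2.0 * v + 1.0 for v in x]

# Vectorized: one array expression evaluated in compiled code, which
# can exploit SIMD instructions and the cache hierarchy.
y_vec = 2.0 * x + 1.0

assert np.allclose(y_loop, y_vec)
```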

Most importantly, the term “Programming” has been replaced with “Computing,” reflecting the natural evolution of the field toward the architecture and systems aspects of the subject matter and away from programming patterns and parallel design.

Students are responsible for all material and announcements on this course Web page and Piazza.

Academic Conduct

The guidelines of Johns Hopkins’ undergraduate ethics policy and graduate student conduct policy apply to all activities associated with this course. Additionally, students are subject to the Computer Science Academic Integrity Code.

In addition, the specific ethics guidelines for this course are as follows. Students are encouraged to consult with each other and even collaborate on all programming assignments. This means that students may look at each other’s code, pair program, and even help each other debug. Any code that was written together must have a citation (in code comments) that indicates who developed the code. Any code excerpted from outside sources must have a citation to the source (in code comments where the code is used). Each assignment involves questions that analyze the assignment and connect the program to course concepts. The answers to these questions must be prepared independently by each student and must be work that is solely their own.

For any homework that is done in pairs or teams, all team members must be listed as collaborators in comments in the source code AND in any submitted documents (PDFs and notebooks). You should also specify the nature and scope of the collaboration: shared programming, discussion, debugging, or consulting. Failure to state a collaboration and its scope is an ethics violation and will be treated as such.
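For example, a header comment along these lines would satisfy both requirements (the names, scope, and source URL here are purely hypothetical):

```python
# Collaborators: Ada Lovelace <ada@jhu.edu>, Alan Turing <alan@jhu.edu>
# Scope: pair-programmed the partitioner; debugged the reducer together.
#
# The chunking logic below is adapted from an outside source; cite the
# real URL where the code is used (this one is a placeholder):
# https://example.com/parallel-chunking (accessed 2022-09-01)
```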

Schedule

MW 4:30 pm - 5:30 pm, via Zoom (details TBA). The Zoom link and password can be found on Blackboard.

Course Staff

Instructor

Randal Burns, randal@jhu.edu, http://www.cs.jhu.edu/~randal/

  • Office Hours:
    • TBA
    • by appointment
Teaching Assistants

TBA

Course Assistants

TBA

Course Goals

Specific Outcomes for this course are that students:

  • Take a computational task and construct an implementation that maximizes parallelism.
  • Analyze and instrument an implementation of a computer program for its speedup, scaleup, and parallel efficiency (a worked example follows this list).
  • Reason about the loss of parallel efficiency and attribute that loss to factors, including startup costs, interference, and skew.
  • Work with a diverse set of programming tools for different parallel environments, including cloud computing, high-performance computing, multicore, and GPU accelerators.
  • Analyze how locality, latency, and coherency in the memory hierarchy influence parallel efficiency and improve program design based on the properties of memory.
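To make the second outcome concrete, here is a minimal sketch (with hypothetical timings) of computing speedup and parallel efficiency:

```python
# Hypothetical measurements; real values would come from instrumenting
# the program, e.g., with time.perf_counter() around the parallel region.
t1 = 8.0   # serial runtime in seconds (hypothetical)
tn = 2.5   # runtime on n workers in seconds (hypothetical)
n = 4      # number of workers

speedup = t1 / tn           # 3.2x over the serial run
efficiency = speedup / n    # 0.8, i.e., 80% parallel efficiency
print(f"speedup={speedup:.2f}x, efficiency={efficiency:.0%}")
```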

This course will address the following CSAB ABET Criterion 3 Student Outcomes. Graduates of the program will have an ability to:

  1. Analyze a complex computing problem and to apply principles of computing and other relevant disciplines to identify solutions.
  2. Design, implement, and evaluate a computing-based solution to meet a given set of computing requirements in the context of the program’s discipline.
  3. Apply computer science theory and software development fundamentals to produce computing-based solutions.

Grading

Course grades will be drawn from:

  • 40% midterm and final examinations
  • 40% programming projects
  • 20% short programming assignments

Course staff will factor in class participation and evidence of learning trajectory. These factors will contribute no more than a one-step (+/-) change to a letter grade, e.g., B+ -> A- or B+ -> B. There is no curve. Each student earns a grade that reflects their individual learning and performance in the class.

Requests for Regrades

A student should only request a regrade of an assessment if a technical error was made in grading. In this case, the student must clearly document the technical error associated with a specific problem and submit a written request for a regrade. Once an assessment grade is released, you will have two weeks to submit a regrade request. All regrade requests must be submitted through Gradescope, and only written requests will be considered. Please do not make regrade requests in office hours.

Textbooks used in Course

The course does not follow a textbook. Lectures will refer to specific material from the following books. For the O’Reilly books, you must first access one through the library proxy; after that, the links will work.

Mattson, T. G., B. A. Sanders, and D. L. Massingill. Patterns for Parallel Programming. Addison-Wesley, 2004. This text is available online to Hopkins students https://learning.oreilly.com/library/view/patterns-for-parallel/0321228111/

Matloff, N. Parallel Computing for Data Science. CRC Press, 2015. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.467.9918&rep=rep1&type=pdf

Herlihy, M. and N. Shavit. The Art of Multiprocessor Programming. Morgan-Kaufmann, 2008. This text is available online to Hopkins students https://learning.oreilly.com/library/view/the-art-of/9780123973375/

Midterm Exam: TBA

Final Exam: TBA