Parallel Computing for Data Science Fall 2022 (EN 601.420/620)
The syllabus is also available in standard CS/JHU/ABET format; the material on this page mirrors that information.
Note to students trying to enroll
All course enrollment prior to the first day of classes in Fall 2022 will be conducted through SIS. For students eligible to enroll through SIS, requests for permission to enroll based on exceptional circumstances, and requests to be promoted from the wait list, will not be granted. Emails requesting such permissions will not receive replies.
Course Description
This course studies parallelism in data science, drawing examples from data analytics, statistical programming, and machine learning. It focuses mostly on the Python programming ecosystem but will use C/C++ to accelerate Python and Java to explore shared-memory threading. It explores parallelism at all levels, including instruction-level parallelism (pipelining and vectorization), shared-memory multicore, and distributed computing. Concepts from computer architecture and operating systems will be developed in support of parallelism, including Moore’s law, the memory hierarchy, caching, processes/threads, and concurrency control. The course will cover modern data-parallel programming frameworks, including Dask, Spark, Hadoop, and Ray. The course will not cover GPU deep-learning frameworks nor CUDA. The course is suitable for second-year undergraduate CS majors and for graduate students from other science and engineering disciplines who have prior programming experience. [Systems]
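The vectorization theme can be previewed with a minimal sketch (assuming NumPy is installed): the same dot product written as an interpreted Python loop and as a single call into NumPy's compiled kernels, which can exploit SIMD instructions.

```python
import numpy as np

def dot_loop(a, b):
    """Dot product as an interpreted Python loop: one bytecode dispatch per element."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

a = np.arange(1_000_000, dtype=np.float64)
b = np.ones(1_000_000)

# np.dot executes in compiled C/BLAS code; on typical hardware it is
# orders of magnitude faster than the interpreted loop above.
assert np.isclose(dot_loop(a, b), np.dot(a, b))
```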
Prerequisites:
- Intermediate Programming (EN 601.120 or the equivalent)
- Data Structures (EN 601.226 or the equivalent)
- Computer Systems Fundamentals (EN 601.333 or the equivalent)
Course topics include:
- Amdahl’s law and strong scaling
- Data science in Python: dataframes, numpy, scipy
- Machine learning in Python: scikit-learn
- Instruction-level parallelism
- Multicore architectures
- Shared-memory parallelism and programming with threads
- Memory hierarchy and caching
- Synchronization and concurrency control
- Gustafson’s law and weak scaling
- Data-parallel distributed computing: dask, spark
- Distributed actor programming: ray
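As a preview of the scaling topics, a minimal sketch of Amdahl's law (strong scaling) and Gustafson's law (weak scaling), where `p` is the parallelizable fraction of the work and `n` is the number of workers:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup of a fixed-size problem with parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    """Gustafson's law: scaled speedup when the problem size grows with n."""
    return (1.0 - p) + p * n

# A program that is 90% parallelizable never exceeds 10x speedup under
# Amdahl's law, no matter how many workers are added.
print(amdahl_speedup(0.9, 16))      # ≈ 6.4
print(amdahl_speedup(0.9, 10**9))   # approaches, but never reaches, 10
print(gustafson_speedup(0.9, 16))   # ≈ 14.5
```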
Comments on the 2022 Edition
This course replaces Parallel Programming as it was taught from 2013–2021. It shares the same course number, and you cannot receive credit for both.
The new syllabus changes the focus of the course:
- Examples will be drawn from machine learning and data science as much as possible.
- The course will not cover GPU programming, GPUs, or machine learning frameworks, such as TensorFlow, Keras, and PyTorch.
- Supercomputing and scientific/numerical applications will be deemphasized.
- The course adds material on instruction level parallelism, including pipelining and vectorization.
Most importantly, the term “Programming” has been replaced with “Computing,” which reflects the natural evolution of the field toward the architecture and systems aspects of the subject matter and away from programming patterns and parallel design.
Links
Students are responsible for all material and announcements on this course Web page and Piazza.
Academic Conduct
The guidelines of Johns Hopkins’ undergraduate ethics policy and graduate student conduct policy apply to all activities associated with this course. Additionally, students are subject to the Computer Science Academic Integrity Code.
In addition, the specific ethics guidelines for this course are: Students are encouraged to consult with each other and even collaborate on all programming assignments. This means that students may look at each other’s code, pair program, and even help each other debug. Any code that was written together must have a citation (in code comments) that indicates who developed the code. Any code excerpted from outside sources must have a citation to the source (in code comments where the code is used). Each assignment involves questions that analyze the assignment and connect the program to course concepts. The answers to these questions must be prepared independently by each student and must be work that is solely their own.
For any homework that is done in pairs or teams, all team members must be listed as collaborators in comments in the source code AND in any submitted documents (PDFs and notebooks). You should also specify the nature and scope of the collaboration, e.g., shared programming, discussion, debugging, or consulting. Failure to state a collaboration and its scope is an ethics violation and will be treated as such.
Schedule
MW 4:30 pm - 5:30 pm, zoom meeting ID 921 6887 9293. Zoom link with password can be found in Canvas.
Lecture Recordings and Attendance
All lectures will be available for synchronous delivery on Zoom. All lectures will be recorded and will be available on Panopto within 24 hours. The course is designed for in-person delivery. Course activities count toward your grade and cannot be performed asynchronously. Attendance is not required.
Course Staff
Instructor
Randal Burns, randal@jhu.edu, http://www.cs.jhu.edu/~randal/
- Office Hours:
- 1-2 pm Mondays, Malone 160.
- 6:30-8 pm Thursdays, https://wse.zoom.us/j/93813332726.
Teaching and Course Assistants
Ariel Lubonja, ariel(at)cs.jhu.edu
- Office Hours:
- Wednesday 9-10 am, https://wse.zoom.us/j/9788061713
- Friday 9-10am, Malone 207.
Brian Wheatman, wheatmann(at)cs.jhu.edu
- Office Hours:
- Tuesdays 3-4 pm, Malone 207.
Meghana Madhyastha, mmadhya1(at)jhu.edu
- Office Hours:
- Wednesday 3:15-4:15 pm, Malone 207.
Course Goals
Specific Outcomes for this course are that students:
- Take a computational task and construct an implementation that maximizes parallelism.
- Analyze and instrument an implementation of a computer program for its speedup, scaleup, and parallel efficiency.
- Reason about the loss of parallel efficiency and attribute that loss to factors, including startup costs, interference, and skew.
- Work with a diverse set of programming tools for different parallel environments, including cloud computing, high-performance computing, and multicore.
- Analyze how locality, latency, and coherency in the memory hierarchy influence parallel efficiency and improve program design based on the properties of memory.
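The measurement outcomes above can be sketched in a few lines. This is a hedged example, not a course assignment: `work` is a hypothetical CPU-bound stand-in for a real analysis kernel, and the worker count is an arbitrary choice.

```python
import time
from multiprocessing import Pool

def work(n):
    # Hypothetical CPU-bound task standing in for a real analysis kernel.
    return sum(i * i for i in range(n))

def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

if __name__ == "__main__":
    tasks = [200_000] * 8
    workers = 4

    serial, t_serial = timed(lambda: [work(n) for n in tasks])
    with Pool(workers) as pool:
        parallel, t_parallel = timed(lambda: pool.map(work, tasks))

    assert serial == parallel  # parallelism must not change the answer
    speedup = t_serial / t_parallel
    efficiency = speedup / workers  # 1.0 would be perfect scaling
    print(f"speedup={speedup:.2f}x, efficiency={efficiency:.2%}")
```

Efficiency below 1.0 is then attributed to factors such as startup costs (forking the pool), interference, and skew among the tasks.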
This course will address the following CSAB ABET Criterion 3 Student Outcomes. Graduates of the program will have an ability to:
- Analyze a complex computing problem and to apply principles of computing and other relevant disciplines to identify solutions.
- Design, implement, and evaluate a computing-based solution to meet a given set of computing requirements in the context of the program’s discipline.
- Apply computer science theory and software development fundamentals to produce computing-based solutions.
Grading
The course includes six programming activities that each span one to two weeks of course time. Activities will be graded for completion of the assignment. Activities that are incomplete or do not fulfill the stated objectives may be resubmitted with permission from the instructor. The goal of the activities is for students to gain skills with the algorithms, programming tools, and principles presented in class. Credit for an assignment does not depend on providing correct answers to each question. Answers that are incorrect, or programs that do not meet the assignment objectives, will either (1) be marked as incorrect to provide feedback to the student or (2) be returned to the student for resubmission. Every student will have the opportunity to receive all credit for all activities. Activities make up 30% of the course grade.
There are three exams: two midterms and a final. Each exam counts for 20% of the course grade.
The remaining 10% of the course grade is based on course participation and completion of in-course exercises. Attendance is not required, but credit for this portion of the grade can only be accumulated during the course.
The final letter grades do not depend solely on the achievement of a target score over all assignments and exams. Grades will be determined based on the achievement of learning goals. The course staff will determine a map of total scores to grades at the end of the semester. This policy lets instructors account for variance in exam scores, specifically when the exam scores are lower than intended or expected by the instructors. Grades will start with the following guidelines:
- 93.3% or more -> A
- 90% - 93.3% -> A-
- 86.6% - 90% -> B+
- 83.3% - 86.6% -> B
- 80% - 83.3% -> B-
- less than 80% -> TBD based on evidence of learning
The instructors may choose to move the grade boundaries down, e.g., moving the A- threshold from 90% to 87%, based on how well the course realized its learning goals. We will not move the thresholds up.
Textbooks used in Course
The course does not follow a textbook. Lectures will refer to specific material from the following books. For the O’Reilly books, you must first access one through the library proxy; after that, the links will work.
Mattson, T. G., B. A. Sanders, and D. L. Massingill. Patterns for Parallel Programming. Addison-Wesley, 2004. This text is available online to Hopkins students.
N. Matloff, Parallel Computing for Data Science. CRC Press, 2015. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.467.9918&rep=rep1&type=pdf
Herlihy, M. and N. Shavit. The Art of Multiprocessor Programming. Morgan-Kaufmann, 2008. This text is available online to Hopkins students.
To access online books:
- Log in to the Pulse VPN (if outside of JHU).
- Navigate to library.jhu.edu and search for the title.
- You will see an online access link.