Assignment #4: Three’s Company

Due date: Monday 24 April 2017, 11:59:00 pm

This assignment is a social network version of the triangle counting problem, which has become an archetypal problem in graph analysis. The blog post has a condemnation of Map/Reduce for the task; it also has a very inefficient serial implementation. Your program will output all triangles (which is different from a count of triangles) in a single map/reduce program.

Identify threesomes of users that are mutual friends in a social network, i.e. 100 is friends with 200, 200 is friends with 300, and 300 is friends with 100.
The output should enumerate the mutual friends for each user and avoid duplicate entries, i.e. the trio of users 100, 200, and 300 will contribute the output:

100 200 300
200 100 300   
300 100 200

NOTE: The last two elements of each output line are sorted as integers in ascending order. As such, we would never get the output 100 300 200.
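For intuition only, here is a minimal serial sketch that enumerates triangles in exactly this output format. It is not a valid submission: it loads every friend list into a single in-memory dictionary, which the constraints below rule out. The directory name friends.simple and the one-line-per-file format are assumptions based on the data description later in this handout.

#!/usr/bin/env python
# Naive serial baseline, for intuition only -- NOT a valid submission,
# since it reads every friend list into a single in-memory dictionary.
import glob
import sys

def load_friend_lists(input_dir):
    # Each input file holds one line: "userid friend1 friend2 ..."
    friends = {}
    for path in glob.glob(input_dir + "/*"):
        with open(path) as f:
            for line in f:
                ids = [int(tok) for tok in line.split()]
                if ids:
                    friends[ids[0]] = set(ids[1:])
    return friends

def print_triangles(friends):
    for u, flist in friends.items():
        for v in flist:
            for w in flist:
                # v and w are both friends of u; they complete a triangle if
                # they are also friends of each other. The v < w test enforces
                # the integer sort on the last two fields.
                if v < w and w in friends.get(v, set()):
                    print("%d %d %d" % (u, v, w))

if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "friends.simple"
    print_triangles(load_friend_lists(directory))

With n friend lists each of length O(n), this triple loop does O(n^3) work, which is roughly the naive serial baseline the writeup questions ask you to compare against.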

This needs to be implemented in the Map/Reduce paradigm, using no auxiliary data structures and no shared data. The only files used are the inputs to the map stage. The solution demonstrates one potential parallelism tradeoff: it expands the data, sending more data over the network than is intuitively necessary and examining the expanded data in the reducers. This expansion of the data avoids random I/O (e.g., the reading of multiple friend lists by the mappers and reducers) and allows for a high degree of parallelism.

The input to the Map/Reduce program is a set of friend lists. Each file in the input contains the id of a user followed by a list of her/his friends. All identifiers are integer values separated by spaces, and each user's entire friend list appears on a single line. For example, the file:

117 2149 84 57 6048

is the list of user 117’s friends, consisting of users 2149, 84, 57, and 6048. You may assume that the input has two properties (a quick way to sanity-check them is sketched after this list):

  • Symmetry: If 100 is a friend of 200 then 200 is a friend of 100
  • No Duplicates: Each friend appears in a list at most once.
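If you want to convince yourself that a downloaded copy of the data really satisfies these two properties, a quick local check along the following lines is enough. The local directory name friends.simple is an assumption; pass a different directory as the first argument if yours is named otherwise.

#!/usr/bin/env python
# Sanity-check the Symmetry and No Duplicates properties on a local copy
# of the friend lists.
import glob
import sys

input_dir = sys.argv[1] if len(sys.argv) > 1 else "friends.simple"
friends = {}
for path in glob.glob(input_dir + "/*"):
    with open(path) as f:
        for line in f:
            ids = [int(tok) for tok in line.split()]
            if not ids:
                continue
            user, flist = ids[0], ids[1:]
            # No Duplicates: each friend appears at most once in a list
            assert len(flist) == len(set(flist)), "duplicate friend for %d" % user
            friends[user] = set(flist)

# Symmetry: if u lists v as a friend, then v must list u
for u, flist in friends.items():
    for v in flist:
        assert u in friends.get(v, set()), "%d lists %d but not vice versa" % (u, v)
print("input properties hold")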

Auxiliary data structures

In this case I mean that you should not have data structures that are asymptotically larger than the friend list of a single user. For instance, you should not need to read the contents of two files (e.g., those of users 100 and 200) before emitting key/value pairs for the reducers. A single user's key/value pairs should be generated from that user's own friend list alone.

Data

We provide two input data sets. The first, friends.simple, is a small set of files for testing and debugging. The larger data set, friends1000, covers 1000 user ids numbered 0-999 for larger-scale testing. The inputs to the project are available on Amazon’s S3, to be used with Amazon’s Elastic Map/Reduce, at s3://friends.simple/ and s3://friends1000/. These can be downloaded via s3cmd or the AWS command line tools:

s3cmd get s3://friends.simple/*

Setting up s3cmd is documented here.

AWS command line tools are found and documented here.

Development Virtual Machine

To assist you in developing your code, we are providing a VirtualBox virtual machine running Ubuntu 14.04 with Hadoop preconfigured.

You can download the VM here.

To import the VM into VirtualBox, go to File and then Import Appliance.

The default user is parallel, and the password is project3. Note that the VM only includes Ubuntu 14.04 server edition (no graphical user interface). To access the VM, you may want to configure port forwarding. In VirtualBox, select the VM (parallel-pr3) and click on “Settings”. Select “Network”, click “Advanced”, and then “Port Forwarding”. I forwarded port 2222 on my Mac (the host) to port 22 on the VM, like so:

[Screenshot of the Port Forwarding settings]

Once port 22 is forwarded, you can SSH to the VM, e.g.

ssh -p 2222 parallel@localhost

and transfer files using scp via scp -P 2222 <arg1> <arg2>

Hadoop is installed in the hadoop directory in the parallel user’s home directory. The hadoop/bin directory has been added to your path, so commands like hadoop and hdfs will work.

Step 1: Getting a Streaming Version to Work

Write a mapper program and a reducer program in your favorite scripting language, like Python. My mapper is called fof.mapper.py and my reducer is called fof.reducer.py. Make sure the scripts are executable. For instance, I start my Python scripts with the shebang line:

#!/usr/bin/env python

then I chmod +x fof.*.py to make them executable.
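If you have not written Hadoop streaming scripts before, the placeholder pair below shows the plumbing the two scripts share: the mapper reads raw input lines on stdin and emits tab-separated key/value lines on stdout, and the reducer reads those lines back on its stdin, sorted by key. The friend-counting “logic” here is only a stand-in to exercise the pipeline; replace it with your own FoF key/value design.

#!/usr/bin/env python
# fof.mapper.py -- placeholder logic that only exercises the streaming
# plumbing: it emits "userid<TAB>number_of_friends" for each friend list.
import sys

for line in sys.stdin:
    tokens = line.split()
    if not tokens:
        continue
    user, friends = tokens[0], tokens[1:]
    print("%s\t%d" % (user, len(friends)))

and a matching reducer:

#!/usr/bin/env python
# fof.reducer.py -- placeholder reducer for the mapper above. Hadoop sorts the
# mapper output by key, so all values for one key arrive on consecutive lines
# of stdin; itertools.groupby collects them per key.
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    values = [kv[1] for kv in group]
    print("%s\t%s" % (key, ",".join(values)))

With these placeholders in place, the command pipeline below should already run end to end.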

The neat thing about streaming is that you can test a serial version of your code using the following command sequence from the shell:

cat simple.input/* | ./fof.mapper.py | sort | ./fof.reducer.py

If this produces the correct output, you can run the scripts as a Hadoop streaming job; the following steps walk through it.

Start Hadoop! services

Run the following command to get Hadoop! running:

~/hadoop/sbin/start-dfs.sh

Hadoop expects your files to be in HDFS. You can find a listing of all HDFS commands here or by typing hadoop fs.

Organize files in HDFS

You must move your files from the local filesystem (Linux) into HDFS.

hadoop fs -mkdir /simple.input
hadoop fs -copyFromLocal simple.input/* /simple.input/

You can verify your files are in HDFS using: hadoop fs -ls /simple.input

Run your streaming job in Hadoop!

This process is surprisingly easy (or at least we hope it is).

hadoop jar hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -mapper ./fof.mapper.py -reducer ./fof.reducer.py -input /simple.input -output /simple.output

You can check your output using the following: hadoop fs -cat /simple.output/part*

Debugging

  • Research where to find the error logs for Hadoop! and use them – they are very useful! TAs/CAs are not expected to help you debug your code.
  • Read the messages printed to stdout and stderr; they are very verbose, but also very informative.

Streaming on Amazon’s Elastic Map/Reduce

You will need to upload your mapper and reducer programs to a bucket in Amazon’s S3 (Simple Storage Service).

Login to the AWS Management Console for Elastic Map Reduce and create a streaming job flow.
I recommend that you experiment with the small input set first and configure your job to use only a single small Map/Reduce instance; that way, each failed attempt only costs $0.10 and you find out quickly whether you have configured your job correctly.

Step 2: Java Implementation

The streaming Map/Reduce takes some liberties with typing, in that the data are not really separated into keys and values. In this step, you will discover how sorting actually works in Map/Reduce. Reimplement your algorithm in Java. The examples in class and the Wordcount 1.0 example provide some guidelines as to how the toolchain works; I also had to configure the Java environment, as shown below.

My compilation and execution process looked something like:

First, set up the Java environment:

export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar

Then

hadoop com.sun.tools.javac.Main FoF.java
jar cf FoF.jar FoF*.class
hadoop jar ./FoF.jar FoF /simple.input /simple.output

Note that the path to your JAR file is a local filesystem path on the node where you run the command, whereas the input and output paths again refer to HDFS.

This should produce output similar to your streaming version, although not necessarily identical. NOTE: do not reuse your Reducer as a combiner class the way the wordcount example does; leave the combiner class undefined.

Step 3: Custom JAR on Amazon’s Elastic Map/Reduce

Having gotten the Java implementation of your Three’s Company program running, you can now execute your code on the AWS Map Reduce service.

First, create an S3 bucket (folder) to hold your JAR file and your output results. The bucket name should be <yourJHED>-pp2017-<RandomString>, e.g., Vin Diesel’s could be vdiesel1-pp2017-FastAndFurious. Click on the bucket and then click Upload to add your JAR file from earlier. You may want to set up logging for your bucket to track problems or issues by clicking on Properties, enabling logging, and specifying a directory in your bucket to log to.

From the AWS dashboard, go to the IAM (Identity and Access Management) service. Click on Roles and then Create New Role. Give the role a name, select Amazon EC2 from the list (it should be the first option), and then check the boxes next to the following policies:

AmazonElasticMapReduceRole
AmazonS3FullAccess
AmazonElasticMapReduceforEC2Role
AmazonElasticMapReduceFullAccess
AmazonElasticMapReduceforAutoScalingRole

This creates a custom role that can run EMR and access the S3 service from within the map/reduce cluster you will be running soon.

Next, go to the EMR service from the dashboard. First, we need to create a cluster on which you will run your custom JAR job. Again, experiment with the small input and only one or two instances in order to save money.

Follow these steps for your cluster:

  • Name the cluster
  • Specify the S3 folder that you created for logging (e.g., s3://<yourbucketname>/logs) - this will let you find out why your job failed
  • Leave the default software configuration (Amazon, emr-5.4.0, Core Hadoop)
  • Specify the hardware configuration as needed (start with 1-2 small instances, move to 4 c4.xlarge later)
  • Choose your EC2 key pair that you created from previous assignments
  • Choose Custom Permissions, using EMR_DefaultRole for the EMR role and the IAM role you created for the EC2 instance profile
  • Click create cluster

Next, the cluster will begin booting up and provisioning resources. As it is doing this, you can add MR jobs by clicking the Add Step button. You should choose Custom JAR job, give it a name, then specify the JAR location and runtime arguments. My JAR location and arguments looked like this:

Jar location: s3://rbjhu/fof.jar
Jar arguments: FoF s3://friends.simple/ s3://rbjhu/output/friends-simple/

Having gotten everything working, run the same job on the large input using four c4.xlarge (compute-optimized) instances. To conserve your credits, do not do this until you are certain of your implementation.

Please store the output in the S3 bucket as follows:

<s3bucketname>/output/friends-simple/

<s3bucketname>/output/friends-1000/

Where output, friends-simple and friends-1000 are all directories.

References

I would also recommend following the Hadoop! tutorial at http://developer.yahoo.com/hadoop/tutorial/, particularly Module 3. If you don’t do this now, no problem; you will do it later when you start working with Hadoop! in Java.

Submission Instructions

Writeup

Submit the following to Gradescope Project 4 Writeup:

Please prepare a PDF document that provides the following items and answers the following questions:

  1. Describe your map/reduce algorithm for solving the three’s company problem.

    a. Describe the operation of the mapper and reducer. How does this combination solve the three’s company problem?

    b. What is the potential parallelism? How many mappers does your implementation allow for? Reducers?

  2. On combiners

    a. Why do we leave the combiner class undefined?

  3. Let’s gently analyze the total work done by the algorithm.

    a. How many messages do your mappers output? Assuming the Hadoop runtime has to sort these messages, how much total work does the algorithm do (in Big-O notation)? You should assume that there are n friend lists, each of length O(n).

    b. How does this compare with the performance of the serial implementations linked here? Describe the tradeoff between complexity and parallelism. (Note: it is more reasonable to compare your implementation with the naive O(n^3) algorithm than with the optimized O(n^2.37) algorithm.)

Code

Submit the following to Gradescope Project 4 Code:

FoF.java
fof.mapper.py (or fof.mapper.pl etc ..)
fof.reducer.py (or fof.reducer.pl etc ..)
s3bucket.txt

The file s3bucket.txt should contain one line with your bucket name only, e.g., s3://fof.output. Reminder: this bucket and all its contents must be PUBLIC to all. You will be penalized if it is not. Sharing your bucket names with your classmates is a violation of Hopkins ethics policies.

Results

The FoF triples within each of your output files should be formatted as follows:

100<space>200<space>300<newline>
200<space>100<space>300<newline>
300<space>100<space>200<newline>

So the real output (without markup) should look like this:

100 200 300
200 100 300
300 100 200

REMINDER: The last two elements of each output line are sorted as integers in ascending order. As such, we would never get the output 100 300 200.

If you format this incorrectly and the grading scripts cannot parse your output, you will lose points.
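To catch formatting problems before submitting, a quick local check along these lines may help; the script name check_format.py is hypothetical, and it expects the concatenated part files either as its first argument or on stdin.

#!/usr/bin/env python
# check_format.py -- verify that each output line has exactly three integer
# fields and that the last two are in ascending order.
import sys

stream = open(sys.argv[1]) if len(sys.argv) > 1 else sys.stdin
ok = True
for lineno, line in enumerate(stream, 1):
    fields = line.split()
    if len(fields) != 3 or not all(f.isdigit() for f in fields):
        print("line %d: expected three integers: %r" % (lineno, line))
        ok = False
    elif int(fields[1]) > int(fields[2]):
        print("line %d: last two fields not sorted: %r" % (lineno, line))
        ok = False
print("format OK" if ok else "format problems found")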

Notes

Keep track of your AWS usage. The $40 voucher should be more than sufficient for all course assignments. I did this project for <$2. If you expect to exceed the amount in your voucher, please consult the instructor or Head TA. You are responsible for the charges incurred on AWS.