Project Prerequisites

Brigitta Sipőcz edited this page Feb 26, 2020 · 3 revisions

This project involves two pieces of software: Jupyter and Apache Spark. The steps below will help you become familiar with both and give you a taste of what the development process in this project will look like.

Install and use required software

  1. Download and install Jupyter on your system.
  2. Download and install Apache Spark on your system and ensure you can interface with it from within Jupyter. You can build from the source code, download a pre-built binary to your system, or install pyspark using the Python package managers conda or pip.
  3. Within a Jupyter notebook, import pyspark and estimate the value of Pi. This is a canonical example of parallel computation in Spark, and you should be able to find many examples online.
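
As a rough illustration of step 3, the sketch below uses the common Monte Carlo approach: throw random points at the unit square and count how many land inside the quarter circle. It assumes pyspark is installed and that a local master (`local[*]`) is acceptable; the function names `throw_dart`, `estimate_pi`, and `run_local` are made up for this example and are not part of Spark.

```python
import random
from operator import add


def throw_dart(_):
    """Return 1 if a random point in the unit square falls inside the quarter circle."""
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0


def estimate_pi(sc, samples=1_000_000, partitions=4):
    """Estimate Pi on an existing SparkContext `sc` by sampling in parallel."""
    hits = sc.parallelize(range(samples), partitions).map(throw_dart).reduce(add)
    return 4.0 * hits / samples


def run_local():
    """Start a local Spark session, print the estimate, and shut Spark down again."""
    # Imported here so the helpers above can be loaded even without Spark installed.
    from pyspark.sql import SparkSession

    # Assumption: a local master using all available cores is good enough for a test run.
    spark = SparkSession.builder.master("local[*]").appName("PiEstimate").getOrCreate()
    try:
        print("Pi is roughly", estimate_pi(spark.sparkContext))
    finally:
        spark.stop()
```

Calling `run_local()` in a notebook cell should print an estimate close to 3.14. The result varies slightly from run to run because the sampling is random.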

Make a JupyterLab widget that displays the Spark UI in the Lab interface

  1. Follow the JupyterLab documentation (or other examples / documentation) to learn how to build your own JupyterLab Extension.
  2. Alter your extension to show a preview of the Spark UI. (Hint: if you are running locally, the UI can usually be accessed at localhost:4040.)
  3. (Bonus) Multiple Spark clusters can run simultaneously on one machine. If the default port 4040 is already in use, the Spark driver will try to bind to the next ports in sequence, e.g. 4041, 4042, etc. Make your widget interactive so that it can accommodate this behavior. For example, you can add a text input field that changes which Spark UI is shown.
  4. (Bonus) Building on the previous step, create a widget that detects all Spark clusters running on your system and exposes a selector (e.g. a dropdown menu) to choose between them. The selection doesn't need to do anything; you simply need to show users which Spark clusters are available to them. Feel free to come up with any solution you like, including working server-side with Python. You may also submit only a description of how you might solve this, without actually implementing your solution.
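
One possible server-side starting point for the last bonus step is sketched below. It probes a range of ports beginning at 4040 and asks each one for the application list that the Spark UI serves under `/api/v1/applications`. The function names `app_names` and `find_spark_uis`, the scan width of 16 ports, and the half-second timeout are assumptions made for this example, not project requirements.

```python
import http.client
import json
import urllib.request


def app_names(payload):
    """Return application names from a decoded /api/v1/applications reply, or None."""
    if not isinstance(payload, list):
        return None  # some other service answered with JSON that is not an app list
    return [app.get("name") for app in payload if isinstance(app, dict)]


def find_spark_uis(host="localhost", first_port=4040, max_ports=16, timeout=0.5):
    """Probe consecutive ports and return one entry per port that looks like a Spark UI."""
    found = []
    for port in range(first_port, first_port + max_ports):
        base = f"http://{host}:{port}"
        try:
            with urllib.request.urlopen(f"{base}/api/v1/applications", timeout=timeout) as resp:
                payload = json.load(resp)
        except (OSError, ValueError, http.client.HTTPException):
            continue  # nothing listening, not HTTP, or not JSON: skip this port
        names = app_names(payload)
        if names is not None:
            found.append({"port": port, "url": base, "apps": names})
    return found
```

The returned list could feed a dropdown menu in the widget, for instance by exposing it through a small Jupyter server endpoint. Other approaches, such as inspecting running processes, are equally valid.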

Share your work with us

  1. Create a pull request from your fork to merge your work into this repository on a new branch named <your-name>-gsoc-prereqs so that we can review your submission.