A machine learning project developed in PySpark. The model is trained on a dataset of anonymized presences registered through the Younicam mobile application in the University of Camerino's buildings, and it predicts the number of people in a room during a given time interval.
TPOT is used in the model training phase to find the best combination of ML model and hyperparameters.
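The prediction target described above (people per room per time interval) can be illustrated with a minimal plain-Python sketch. It is not the project's PySpark code: the `(room, timestamp)` record layout, the function names and the sample values are all hypothetical, chosen only to show how presences might be bucketed into intervals and counted.

```python
from collections import Counter
from datetime import datetime


def interval_start(timestamp, interval_minutes):
    """Floor a timestamp to the start of its interval within the day."""
    minutes = timestamp.hour * 60 + timestamp.minute
    floored = minutes - minutes % interval_minutes
    return timestamp.replace(hour=floored // 60, minute=floored % 60,
                             second=0, microsecond=0)


def count_presences(records, interval_minutes=60):
    """Count presences per (room, interval start).

    `records` is an iterable of (room, timestamp) pairs; this layout is
    hypothetical and not the actual dataset schema.
    """
    counts = Counter()
    for room, timestamp in records:
        counts[(room, interval_start(timestamp, interval_minutes))] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    ("Room A", datetime(2022, 3, 1, 9, 5)),
    ("Room A", datetime(2022, 3, 1, 9, 40)),
    ("Room A", datetime(2022, 3, 1, 10, 15)),
    ("Room B", datetime(2022, 3, 1, 9, 59)),
]
counts = count_presences(sample, interval_minutes=60)
```

Each key of `counts` is a `(room, interval start)` pair and each value is the number of presences in that interval, which is the kind of quantity the model is trained to predict.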
In your home directory, find the shell startup file, named .bash_profile, .bashrc or .zshrc. The name may differ depending on the operating system, shell or version. Open the file and paste the lines below, adjusting /opt/spark to the directory where Spark is installed:
export SPARK_HOME="/opt/spark"
export PATH="$SPARK_HOME/bin:$PATH"
If you want Jupyter Notebook to open when launching PySpark, also add the variables below:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
After reloading the shell (or opening a new terminal), you can launch PySpark from any directory with the command below:
pyspark
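If the `pyspark` command is not found, the environment variables may not have been picked up by the shell. The following is a small, hypothetical helper (not part of the project) that checks whether `SPARK_HOME` is set and whether its `bin` directory is on `PATH`, using only the Python standard library.

```python
import os
import shutil


def check_spark_env(env=None):
    """Return a list of problems found in the Spark-related environment variables."""
    env = os.environ if env is None else env
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    else:
        bin_dir = os.path.join(spark_home, "bin")
        path_entries = env.get("PATH", "").split(os.pathsep)
        if bin_dir not in path_entries:
            problems.append(f"{bin_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_spark_env():
        print("warning:", problem)
    # shutil.which reports where the shell would find the launcher, if anywhere.
    print("pyspark launcher:", shutil.which("pyspark") or "not found")
```

An empty list from `check_spark_env()` means both variables look consistent with the exports above.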
To install the project dependencies, run the following command:
pip install -r requirements.txt
Note that the TPOT pipeline needs some additional dependencies, which are listed in the TPOT installation docs.
Launch PySpark as described above and navigate to the project directory to execute the notebooks.
If Jupyter Notebook doesn't open automatically with PySpark, open it with the command below:
jupyter notebook /path/to/notebook
The TPOT pipeline notebook was used to find the best combination of ML model and hyperparameters. It outputs a .py pipeline that runs the selected ML model with its configuration. We used the returned pipeline inside the Model Training notebook to perform additional operations around the training (e.g. saving intermediate datasets, evaluation).
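The search-and-export step described above can be sketched as follows. This is a minimal sketch, not the project's notebook code: it assumes the classic TPOT 0.x API (`TPOTRegressor`, `fit`, `export`), assumes the task is treated as a regression, and uses made-up search settings and a hypothetical output file name.

```python
# Assumed search settings; the values used in the project's notebook are not
# stated in this README.
TPOT_SETTINGS = {
    "generations": 5,
    "population_size": 20,
    "scoring": "neg_mean_absolute_error",
    "cv": 5,
    "random_state": 42,
    "verbosity": 2,
}


def search_pipeline(X_train, y_train, output_file="tpot_pipeline.py"):
    """Run a TPOT search and export the best pipeline found as a .py file."""
    # Imported lazily so the settings above can be inspected without TPOT installed.
    from tpot import TPOTRegressor

    tpot = TPOTRegressor(**TPOT_SETTINGS)
    tpot.fit(X_train, y_train)
    # Writes a standalone scikit-learn script for the best pipeline found.
    tpot.export(output_file)
    return tpot
```

TPOT works on in-memory arrays or pandas objects, so a Spark DataFrame would first need to be converted, for example with `toPandas()`. The exported script can then be adapted inside a training notebook.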
The repository has the following folder structure:
- data : contains the original dataset plus some intermediate transformations in JSON format
- notebooks : contains all the notebooks used during experimentation: one for the collection and preparation phases, one for the training and evaluation phases, one for visualizing the predictions and one for executing the TPOT pipeline
- predictions : contains the final prediction results in CSV format
- Yuri Paoloni - yuripaoloni
- Matteo Leonesi - MatteoLeonesi