Authors: Yannik Suhre, Jan Faulstich, Skyler MacGowan, Sebastian Sydow, Jacob Umland
🚴 This repository shows how to predict the demand for bikes available for rent through Washington D.C.'s Capital Bikeshare.
- Bikerus
- Check it out!
- Introduction for reproducibility
- TL;DR
- Data acquisition
- Data visualization
- Data Preprocessing
- Data modelling
- Deployment and live predictions
You can check out our live demo here. If you want deeper insight into how we did things, you can read our paper, which is also available within this repo.
💡 The text that follows outlines how to fully reproduce the findings of this repository.
You must first install and then run this repository. To do so, we recommend that you use Docker and Visual Studio Code, as this corresponds with our methodology. Alternatively, you can also use Anaconda.
This section explains how to use Docker for an easy install
To use Visual Studio Code and Docker, please follow the steps outlined below.
1. Download the repository and open the folder in Visual Studio Code (VS Code). If you are new to VS Code, please install the Remote Development extension.
2. In the bottom left corner of your open VS Code window, two signs should appear. Click on them; doing so should cause a list to open.
3. In this list, click the `Remote-Containers: Reopen in Container` entry. Note: if this is the first time you are doing this, it can take some time, as Docker will create your image with all the necessary requirements.
4. Following the completion of step 3, you have your file editor on the left side and can click through the files. If you want to execute a file, just click the play button in the top right corner; doing so will execute the Python script.
This section explains how to use Anaconda with this repo
Should you prefer to use Anaconda (Miniconda was not tested), start by downloading the repository. Then open the Anaconda prompt and navigate to the downloaded git repository Bikerus (one can navigate within the Anaconda prompt using normal command line commands: `cd <your-path-to-bikerus>`. Should you have any spaces within your path, use quotation marks around it. Also, should you have to change your hard drive, use `\<your-harddrive-letter>`. In total that would look like: `cd \<your-harddrive-letter> "<your-path-to-bikerus>"`). Once you have navigated there with your prompt, create a new Python environment:
conda create --name bikerus python=3.8
Next, activate this environment:
conda activate bikerus
Now, we use `pip` to install the necessary packages from the `requirements.txt`:
pip install -r requirements.txt
This installs all the necessary packages within your Anaconda environment. Now you can execute every script by itself without worrying about packages and versions.
🐳 This paragraph is only applicable if you are using Docker with VS Code
Once your Docker container is running inside VS Code, you can just enter the following:
./execute_all_scripts.sh
This will execute all scripts in the correct order, so you don't have to run them individually. Should you use Anaconda, you have to run them individually, since the Anaconda prompt cannot execute shell scripts.
💾 This paragraph will explain how you can obtain the data used
In order to obtain the data used by this project, please clone this repository and then execute the `0_pipeline_data_getting_compression.py` file. This file will:

- Download the files from the web
- Extract them into a folder within the parent directory called `data/raw`
- Load these raw datasets and convert them into a compressed file in `data/interim` (for the sake of convenience we left the raw data there, should you want to make modifications thereto).
🗺️ This section shows how the data visualizations can be created
In order to reproduce the map with the bike share rental stations, you have to execute the file `0.1_pipeline_bike_station_viz.py` within the `python` folder. This will create a folder `images` within the parent directory. Once you enter this folder there should be an `.html` file, which contains this map.
This paragraph will show how the NAs are imputed
In order to impute your own missing values, please execute the script named `1_pipeline_impute_NAs.py`. This will create a folder named `preprocessed` within the `data` folder. In this folder you can find the final version of the Bike Rental data.
💾 This paragraph describes how the further preprocessing works
Based on the data resulting from imputing NAs, further preprocessing is done by executing the script `2_pipeline_preprocessing`: unnecessary data features are dropped, the data is transformed to the correct data types, and the continuous variables are normalized. This script will create a file for the preprocessed data in the folder `data`, as well as another file storing the actual (non-normalized) minimum and maximum values for the target variable.
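For orientation, the rescaling presumably follows a plain min-max scheme; the sketch below is only an illustration under that assumption, and the target column name (`cnt`) as well as the helper name are placeholders rather than code from `2_pipeline_preprocessing`.

```python
import pandas as pd

def min_max_normalize(df: pd.DataFrame, target: str = "cnt"):
    """Rescale the target column to [0, 1] and return the original min/max,
    so normalized predictions can later be mapped back to actual counts:
    prediction = prediction_norm * (t_max - t_min) + t_min."""
    t_min, t_max = df[target].min(), df[target].max()
    df[target] = (df[target] - t_min) / (t_max - t_min)
    return df, t_min, t_max
```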
This paragraph will explain how you can partition the used data into training and testing sets. Additionally, it explains how the training set can be partitioned into a training set as well as a cross-validation set (for additional testing) if `GridSearchCV` is not used. `GridSearchCV` uses (Stratified)KFold as a cross-validation splitting strategy (see parameter `cv`). The following explanation describes a variation of KFold which returns the first k folds as the training set and the (k + 1)th fold as the testing set.
Steps for creating training and testing sets:

- Import the data. Use `df = decompress_pickle(<path>.pbz2)` for importing.
- Call the function `train_test_split_ts`. The function takes two arguments: the first one is the data (type: DataFrame); the second is the size of the training set (type: float). The size of the training set must be greater than 0 and smaller than 1. The function returns the sets for `X_train`, `Y_train`, `X_test` and `Y_test`. `X_train` and `X_test` include all columns except for the target column; `Y_train` and `Y_test` only include the target column (field to predict). `X_train` and `Y_train` are used for training, including determining the samples for cross validation. `X_test` and `Y_test` are only used for the (final) testing. The files are exported to `./data/partitioned/`. (See the usage sketch after this list.)
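A minimal usage sketch, assuming `decompress_pickle` and `train_test_split_ts` can be imported from the repository's helper scripts (the import path and the 0.8 training-set size are assumptions, not necessarily the values used in the paper):

```python
# Hypothetical import path; adjust it to wherever the helper functions live in this repo.
from helpers import decompress_pickle, train_test_split_ts

# Load the preprocessed data (path taken from the RandomForestRegressor section below).
df = decompress_pickle("./data/preprocessed/BikeRental_preprocessed.pbz2")

# Split into train and test sets; 0.8 is an example training-set size between 0 and 1.
X_train, Y_train, X_test, Y_test = train_test_split_ts(df, 0.8)
```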
Steps for creating training and testing sets for cross validation (if `GridSearchCV` is not used):

- Import the data: `X_train`, `Y_train`. Use `df = decompress_pickle(<path>.pbz2)` for importing.
- Call the function `get_sample_for_cv`. The function takes six arguments; two of them are optional (refer to the steps for creating a horizontal bar diagram to visualize the train-test splits). `n_splits`: determines the number of splits that will be used for cross validation; it must be an integer greater than 1. `fold`: determines the current fold (subsample) of the train and test set for cross validation; it must be an integer greater than 0 and not greater than the number of splits. `X_train` and `Y_train`: the data used for training, including determining the samples for cross validation; `X_train` includes all columns except for the target column, `Y_train` only includes the target column (field to predict).

The function returns the sets `X_train_current` and `Y_train_current` as the current fold/sub-sample. Additionally, it returns `X_test_cv_current` and `Y_test_cv_current` for cross validation.
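A usage sketch for a single fold, reusing `X_train` and `Y_train` from the split above; passing the arguments as keywords and the example values 5 and 1 are assumptions, so check the function signature in the repository:

```python
# Request fold 1 of 5 time-series-aware splits from the training data.
X_train_current, Y_train_current, X_test_cv_current, Y_test_cv_current = get_sample_for_cv(
    n_splits=5,
    fold=1,
    X_train=X_train,
    Y_train=Y_train,
)
```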
Steps for creating a horizontal bar diagram to visualize the train-test splits:

The function `get_sample_for_cv` can create a horizontal bar diagram to visualize the train-test splits. The function only creates the bar diagram if `X_test` is passed as a parameter and if the parameter `vis == True`.

- `X_test`: `X_test` is needed to visualize the final round of testing with `X_test` and `Y_test`, which we created at the beginning with the function `train_test_split_ts`. To create the horizontal bar diagram, `X_test` must be passed to the function.
- `vis`: `vis` is used as the decision variable for the creation of the diagram. It is initialized as `False`, so the horizontal bar diagram will not be created by default. To create the horizontal bar diagram, pass `True` as the last parameter when calling the function. The figure is saved in the path `./data/partitioned/`.
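To also create the diagram, the same call as above can be extended with the two optional arguments (again assuming keyword arguments are accepted):

```python
# X_test and vis=True additionally trigger the horizontal bar diagram,
# which is saved to ./data/partitioned/.
get_sample_for_cv(
    n_splits=5,
    fold=1,
    X_train=X_train,
    Y_train=Y_train,
    X_test=X_test,
    vis=True,
)
```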
In order to load or train all models (with the exception of fastai - see below for why) with the given train and test split, execute the script `4_models.py`. This will:

- Load the given models or train them
- Save them to the local drive
- Create two dataframes:
  - One dataframe with all given predictions (normalized and unnormalized)
  - Another with the given $R^2$ values
- Save the aforementioned dataframes
This paragraph explains how the CatBoost regressor is used

- Run the `Grid_Search_Catboost-param.ipynb` notebook to comprehend our CatBoost settings. The best parameters of the CatBoostRegressor for this dataset are `depth = 6`, `learning_rate = 0.1` and `iterations = 1000`.
- Open the `catboost_skript_ts.py` script. Check the calculated parameters against the parameters in the `CatBoostRegressor`. Afterwards, run the `catboost_skript_ts.py` script to create the CatBoost model based on these parameters and the BikeRental dataset. Additionally, the script saves the state of the CatBoost model to a file in the bikerus folder named `Catboost_model`. (See the fitting sketch after this list.)
- Last but not least, open the `load_catboost.py` script. This script loads the previously saved CatBoost model. Additionally, there is also a test dataset for 1 January 2013, 0:00. If you run the script, the model will predict the bike rentals for this specific hour based on the test dataset. Since we fed the model with normalized data, it returns a normalized count value.
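As a minimal sketch (not the full `catboost_skript_ts.py`), fitting a CatBoostRegressor with the parameters above could look as follows, reusing the train/test splits from the partitioning section; the `verbose=False` flag is only there to keep the output quiet:

```python
from catboost import CatBoostRegressor

# Fit the regressor with the tuned parameters from the grid search notebook.
model = CatBoostRegressor(depth=6, learning_rate=0.1, iterations=1000, verbose=False)
model.fit(X_train, Y_train)

# Persist the fitted model; "Catboost_model" matches the file name mentioned above.
model.save_model("Catboost_model")

# Predictions are on the normalized scale, since the model was trained on normalized data.
normalized_predictions = model.predict(X_test)
```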
⚠️ In order to try this one, you have to install fastai within an Anaconda environment, since the pip version is really hard to install; thus it cannot be installed within a container. Please follow the fastai link above for a more detailed explanation of how to set up fastai in your local environment. You have to uncomment the function `fastai_neural_net_regression` within the script `model_creation`, as well as the line where fastai is imported. If you want to run the script `4_models.py` with `fastai_neural_net_regression`, make sure to also uncomment that when you are in an Anaconda environment with fastai.
The following will explain how to use FastAI for a regression task
FastAI is a framework developed for fast and accessible artificial intelligence. Since its second version it can deal with structured tabular data, using neural nets as a regressor.
🌲 This paragraph explains how the RandomForestRegressor is used.
- Open the `random_forstest.py` script and run it.
- The following steps are performed within the script:
  - The script loads the preprocessed data using `decompress_pickle("./data/preprocessed/BikeRental_preprocessed.pbz2")`.
  - The column `'datetime'` needs to be dropped, because the RandomForestRegressor cannot handle its type.
  - The train and test samples are created using the function `train_test_split_ts`.
  - Here, `GridSearchCV` is not used. Following the explanation about cross-validation iterators in scikit-learn (chapter 3.1.2.), if one knows that the samples have been generated using a time-dependent process, it is safer to use a time-series-aware cross-validation scheme. Therefore, cross validation is performed by applying the function `get_sample_for_cv` to also consider the time-series character for cross validation. Here, 5 folds are created. The different hyperparameter combinations are applied to the folds through nested for loops. The `Pseudo-R^2` is calculated for each fold and the respective hyperparameter combination. At the end, the mean of each hyperparameter combination across the five folds is calculated. The hyperparameter combination with the highest mean is returned. Under consideration of the trade-off between a high `Pseudo-R^2` and the model's robustness, the hyperparameters `max_depth = 11`, `n_estimators = 300`, `max_features = 10` and `max_leaf_nodes = 80` were chosen (see the training sketch after this list).
  - The RandomForestRegressor is trained with the best hyperparameters and the `R^2` and `Pseudo-R^2` are calculated.
  - The model is saved using `joblib.dump(RForreg, "./RandomForest_Model/" + str(filename))`.
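A minimal sketch of the final training step with the chosen hyperparameters; the file name passed to `joblib.dump` and the fixed `random_state` are assumptions for illustration, since the script determines these itself:

```python
import joblib
from sklearn.ensemble import RandomForestRegressor

# Train the regressor with the hyperparameters selected via the time-series cross validation.
RForreg = RandomForestRegressor(
    max_depth=11,
    n_estimators=300,
    max_features=10,
    max_leaf_nodes=80,
    random_state=0,  # assumption: any fixed seed for reproducibility
)
RForreg.fit(X_train, Y_train.values.ravel())

# Persist the model; "RandomForest_Model.sav" is a hypothetical file name.
joblib.dump(RForreg, "./RandomForest_Model/RandomForest_Model.sav")
```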
🕸️ This paragraph explains how the MLPRegressor is used.
- After having finished all preprocessing steps, run `4_models.py` in order to run the models.
- After execution, you can find the saved multilayer perceptron model, its optimal hyperparameters and $R^2$ values, as well as the predicted dataframe in `NN_MLP_files` in the `models` folder.
This paragraph explains how to start up a Flask app, which deploys the models and makes live predictions
Basically, all you have to do is run `app.py`. In VS Code with the Docker backend, just click the play button in the top right corner. In Anaconda, run `app.py` from the top level of the bikerus folder:
python flask/app.py
Go to the given webpage (most likely 127.0.0.1:5000) and enjoy predicting!
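Once the app is running, you can also verify from Python that it answers; the exact prediction routes and payload depend on `app.py` and are not documented here, so this only checks the landing page:

```python
import requests

# Simple check against the default Flask development server.
response = requests.get("http://127.0.0.1:5000")
print(response.status_code)  # 200 means the app is up and serving the page
```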