This repository was created to help people get started with Machine Learning and to share some tips & tricks for programming with Python.
Table of contents:
(1) data collection
--> (2) data exploration
--> (3) data preprocessing
--> (4) train model
--> (5) evaluate model
--> (6) repeat steps 3, 4, and 5 until your model is usable
==> (7) create prototype
--> (8) implement more features/fix bugs
Often, you will get specific data for a project and train your model on it; however, this data may be insufficient or very messy. In that case, you can use data with a similar structure (e.g. from Kaggle) to train your model and later fine-tune it on your specific data.
- Kaggle.com
- client data
The better you know your dataset, the easier it is for you to understand why your model makes its predictions the way it does. Furthermore, if you know the deficits of your data, you can do something about them (a small example follows the list below).
- NaN cells
- duplicates
- balance of target class
- look into specific rows (sometimes it reveals dependencies between columns)
Tip: You can use libraries that make it easier and faster to explore your data. My personal choice is bamboolib.
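A minimal exploration sketch with pandas could look like this (the file path and the column name "target" are only placeholders, not part of this repo):

```python
import pandas as pd

# load the dataset (placeholder path)
df = pd.read_csv("data/1_raw/dataset.csv")

df.info()                             # column types and non-null counts
print(df.isna().sum())                # NaN cells per column
print(df.duplicated().sum())          # number of duplicate rows
print(df["target"].value_counts())    # balance of the target class ("target" is a placeholder name)
print(df.sample(5))                   # look into specific rows
```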
After step 2, you know the deficits of your data and you can take action. You also have to convert text to vectors and encode categorical features so that your model can work with them (a minimal sketch follows the list below).
- delete/fill NaN cells
- handle duplicates
- upsample or downsample data
- convert text to vectors
- encode/scale/normalize features
- feature selection to speed up your models
- split dataset into train and test data
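As a rough sketch of a few of these steps with pandas and scikit-learn (the column names "age", "category", and "target" are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/1_raw/dataset.csv")          # placeholder path

df = df.drop_duplicates()                            # handle duplicates
df["age"] = df["age"].fillna(df["age"].median())     # fill NaN cells (hypothetical column)
df = pd.get_dummies(df, columns=["category"])        # encode a categorical feature

X = df.drop(columns=["target"])                      # features
y = df["target"]                                     # target

# split into train and test data before scaling to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```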
There are two main types of models you will probably use: classifiers and regressors (a short example follows the list).
- classifier (features --> classes)
- special case: two classes (binary classification) --> there are models especially for this
- e.g.: Is there a cat, a dog, or a horse in the photo?
- regressor (features --> values)
- one can also use regressors for classification --> in some cases this can be helpful
- e.g.: Based on the profits of the last years, what will be the profit of this year? (sales forecasting)
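To illustrate the difference, here is a small sketch with scikit-learn (reusing the X_train/X_test split from the sketch above; y_train_reg is an assumed variable holding continuous target values):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# classifier: features --> classes (e.g. cat / dog / horse)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
predicted_classes = clf.predict(X_test)

# regressor: features --> continuous values (e.g. this year's profit)
reg = RandomForestRegressor(random_state=42)
reg.fit(X_train, y_train_reg)        # y_train_reg: assumed continuous targets
predicted_values = reg.predict(X_test)
```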
- look at different metrics like f1-score, r2-score, recall, precision, ...
- classification_report and confusion_matrix are helpful to evaluate classifiers (see the example below)
- try different preprocessing steps --> different encoders, scalers, vectorizers, feature selections, or normalizers
- hyperparameter tuning of the model
- try different models --> you can do this manually yourself or use helpful libraries like TPOT, which will do the above steps for you (code for the TPOT classifier/regressor is in the TPOT library section)
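A small evaluation sketch for a classifier (using clf and the test split from the sketches above):

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))        # precision, recall, f1-score per class
print(confusion_matrix(y_test, y_pred))             # which classes get mixed up with each other
print(f1_score(y_test, y_pred, average="macro"))    # one aggregated score over all classes
```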
Now that you have a preprocessing pipeline for your data and a model with good performance, you can bring your code into a production-ready form. This means refactoring your code into classes, functions, and separate scripts. In the end, you want a workflow of running script1 --> script2 --> script3 so that you have
raw data --> preprocessed data --> train and save model --> deploy model
For example, you could create the following scripts:
- data_prep.py --> takes the raw data and returns the preprocessed data
- model.py --> class of model with train and predict function (can contain several models that are called to generate the output)
- train.py --> takes the preprocessed data, trains the model on it, and saves the model (e.g. with the pickle library); a minimal sketch of such a script follows this list
- deploy.py --> takes the saved models and deploys them in the cloud (e.g. Azure, AWS, ...)
- consume.py --> takes data, sends it as a request to the deployment endpoint in the cloud, and returns the prediction of the model
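As a rough sketch of what such a train.py could look like (the paths, the RandomForestClassifier, and the "target" column are assumptions for illustration, not the actual project code):

```python
# train.py - minimal sketch: load preprocessed data, train the model, and save it
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data/2_processed/dataset.csv")    # placeholder path to the preprocessed data
X, y = df.drop(columns=["target"]), df["target"]    # "target" is an assumed column name

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# save the trained model so that deploy.py can pick it up later
with open("scripts/models/artifacts/model.pkl", "wb") as f:
    pickle.dump(model, f)
```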
Your first prototype is now in production, but this is not the end. There may be some bugs in the code that have to be fixed or conflicts with other operating systems. Furthermore, you can now start to find new practical features to implement and create prototype 2/3/... For testing, you can use Jupyter notebooks again and afterwards implement the new features in your script structure.
- implement new features
- fix bugs/solve problems and conflicts
Anaconda is a collection of useful Python packages like scikit-learn or pandas that you will use very often while doing machine learning (so you save the time of running pip install for every single package you need). Furthermore, Anaconda can be used for managing your virtual environments and running Jupyter notebooks.
If you have never heard of notebooks and virtual environments, you should get to know Jupyter Notebook first before diving into the more advanced stuff. Just open the Anaconda Navigator after installing Anaconda and launch Jupyter Notebook. Try it out a little, and once you have a rough overview, come back and learn how to set up your working environment in a more beneficial way :-)
You use the base environment when you run commands in the terminal or run Python scripts (if you do not change it). This means that you install (e.g. with pip) all libraries in this environment. That only works up to a certain point, because different libraries need different versions of their sub-packages, which can produce conflicts. Some of these conflicts can be solved, others cannot. In the worst case, you cannot use pip anymore and there will be a ton of errors while executing your code. The solution is virtual environments. A virtual environment is a separate environment in which you can install packages, and the different environments do not interact with each other. So, you can have a different environment for every project (this brings some advantages that will be mentioned later).
First of all, you have to create an environment
conda create --name new_env
Second, you have to activate it ((base) should change to (new_env) in your terminal)
conda activate new_env
Virtual environments can take up quite a lot of storage on your computer, which is why you should save an environment to a .yaml file and delete it when you will not use it for a while. Furthermore, others can then run your projects without having to install all the different libraries you used (which can cost time and nerves). So, when you save your environment and put it next to the rest of your code, anyone can create this environment from the .yaml file and start working with all the libraries.
To save the environment, you have to activate it first and then run the following command. The file will be saved in your current working directory.
conda env export > conda.yaml
To create an environment from a .yaml file, run the following command. The name of the environment is the one inside the .yaml file.
conda env create -f conda.yaml
To delete an environment you no longer need:
conda remove --name new_env --all
It is good to have an overview of all the environments to see which ones are not needed anymore.
conda info --envs
or, with the same result:
conda env list
I recommend not working in the base environment and always activating a different one. The base environment contains all the standard libraries one could need (the packages installed with Anaconda) without any conflicts, and what I like to do is clone it. That way you have, for example, a new experimental environment that you can use for testing non-project-related stuff. You normally do not do this for project-related work, because the .yaml files of a project should be minimal.
conda create --name new_env --clone env_you_want_to_clone
Now that we know how to use virtual environments, we can start with notebooks.
There are two ways to launch jupyter notebook:
- open the Anaconda Navigator
  - select the environment you want to use in the upper-left corner (default: base)
  - click launch on Jupyter Notebook (it will start a localhost)
- open the terminal
  - activate the environment you want to use
  - run the following command in the terminal (it will start a localhost)
jupyter notebook
Jupyter Notebook is a nice program, but there are extensions that can make your life way easier. (You should probably first get used to the normal notebooks, the basic shortcuts, and so on, but after that you should use these extensions right away.)
I would recommend creating a new virtual environment, e.g. exten, for the extensions by cloning the base environment, because the extensions library tends to have conflicts with other bigger libraries. You can still use the extensions in notebooks running with other environments (I am not sure if this is true for all the extensions, but it is for most); you just cannot edit the currently activated extensions from those other environments, so you have to start Jupyter Notebook with exten. To get to know these nbextensions, I recommend reading this article. It also contains other helpful libraries.
To install the extensions in a new environment, copy and run the following commands in the terminal:
conda create --name exten --clone base
conda activate exten
pip install jupyter_contrib_nbextensions
pip install jupyter_nbextensions_configurator
jupyter contrib nbextension install --user
jupyter nbextensions_configurator enable --user
Nbextensions offers a lot of different extensions and all of them are useful in some way, but to get started I would recommend the following ones (I do not list the ones enabled by default here):
- Autopep8 - this extension can automatically change your code to the PEP 8 coding standards in your notebook
- Collapsible Headings - this extension allows you to minimize header blocks, which makes it easier to work with big notebooks
- ExecuteTime - this extension times the execution of each code cell, so you do not have to use %%time
- Hinterland - this extension enables auto-completion, which makes programming way faster
- Initialization cells - this extension allows you to mark cells as initialization cells, which means they are run when you load the notebook. You can, for example, directly load libraries or datasets you always need (more of a quality-of-life upgrade)
- isort formatter - this extension can sort your library imports alphabetically, grouped by module import (makes the library imports more readable)
- Scratchpad - this extension enables an expandable cell for quick testing, like the current state of a variable (otherwise you always end up with cells that are unnecessary for the program and make the notebook less readable)
- ScrollDown - this extension automatically scrolls down when you have a long output (quality-of-life upgrade)
- Snippets Menu - this extension is the best of all. It allows you to save code snippets in a given format (file will be inserted into the repo soon) and insert them into your code. This can speed up your coding and saves you from searching for the same snippet (e.g. reading from .txt files) for the thousandth time
- Split Cells Notebook - this extension allows you to put two cells next to each other. This is useful for comparing graphics or outputs
- Table of Contents (2) - this extension gives you a table of contents with the Markdown cell headers as topics. This is useful for big notebooks
There are some things that are good to know about scripts.
Normally, you have a general structure in your script like:
library imports
functions/class
if __name__ == "__main__":
... (some code that is executed if the script is run directly. It will not be executed if you import functions/classes from this script into another)
It is good to start functions and classes with a docstring that describes the parameters and output of a function, in addition to the normal comments, because this makes it way easier to read your code. There are some conventions for how to do this. Furthermore, there is also a huge list of other conventions for coding with Python, called PEP 8. It is not bad to know some of these, but in the end, you do not have to know all of them; as long as your code is clean, readable, and understandable, you should be fine.
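Putting both points together, a hypothetical script could look like this (the file name, function name, and path are made up for illustration):

```python
# example_script.py - hypothetical example of the general script structure
import pandas as pd


def prepare_data(path: str) -> pd.DataFrame:
    """Load the raw data and return the preprocessed data.

    Parameters:
        path: path to the raw .csv file
    Returns:
        preprocessed pandas DataFrame
    """
    df = pd.read_csv(path)
    return df.drop_duplicates()


if __name__ == "__main__":
    # only executed when the script is run directly, not when it is imported
    data = prepare_data("data/1_raw/dataset.csv")
    print(data.head())
```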
I mentioned in the previous part that it is important to have readable code, but you do not have to do this all by yourself. There are some helpful libraries that will support you. You can find a list of libraries and commands in my Code-Testing repo.
How you structure your folders is your choice, but in the end, it has to be understandable for other users, which means it should not be too messy. I personally like the following structure:
/project name
/data
/1_raw
/2_processed
/notebooks
/<name>.ipynb
/scripts
/data_prep
/data_prep.py
/utils.py (sometimes if I want to separate the preprocessing function when I need a lot of preprocessing)
/deployment
/deploy.py
/consume.py
/score.py (script for the scoring endpoint)
/models
/model.py (to separate the model class from the train and save script)
/train_and_save.py
/utils.py (if I need some functions that would make the train_and_save script overcrowded)
/artifacts
/model.pkl (saved model)
/<name>.pkl (if I also need to save an encoder, scaler, ...)
The advantage of having the same structure in every project is that others can easily run your projects with always the same workflow.
(here: data_prep.py --> train_and_save.py --> deploy.py --> consume.py)
The TPOT library will save you a lot of work. You just have to give the TPOT classifier or TPOT regressor your data, and it will automatically try different combinations of preprocessing, models, and hyperparameter tuning. See the links to their website, where they explain in more detail what exactly they do, and to their GitHub repository.
NOTE: currently, you cannot use TPOT on MacBooks with an M1 chip (hopefully, this will be fixed soon)
Run the following commands in the terminal:
pip install deap update_checker tqdm stopit xgboost
pip install tpot
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
train_size=0.75, test_size=0.25)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
random_state=42, verbosity=2)
pipeline_optimizer.fit(X_train, y_train)
print(pipeline_optimizer.score(X_test, y_test))
pipeline_optimizer.export('tpot_exported_pipeline.py')
from tpot import TPOTRegressor
from sklearn.datasets import load_boston
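# note: load_boston was removed in scikit-learn 1.2, so this example needs an older scikit-learn version or a different regression dataset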
from sklearn.model_selection import train_test_split
housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
train_size=0.75, test_size=0.25, random_state=42)
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')
- Cheatsheets for different topics to get an overview
- A list of helpful articles for learning about ML and improving your Python coding that I extend now and then