---
title: "Practical Guide to Building Your First Predictive Machine Learning Model"
datePublished: Sat Sep 30 2023 13:05:18 GMT+0000 (Coordinated Universal Time)
cuid: cln61pr9s000b09ji200433e0
slug: practical-guide-to-building-your-first-predictive-machine-learning-model
tags: data, data-science, machine-learning, data-analysis, machine-learning-models
---

In this blog, I will demystify the intricate process behind building a predictive machine learning model. Whether you are a newcomer to the world of data science or someone looking to reinforce their foundation, this blog post is tailored just for you. In this practical guide, I will break down each step into bite-sized pieces, using simple words so that everyone reading can easily follow along. The case study is the Titanic dataset. This classic example will help us navigate through the essentials: importing the required libraries, data cleaning, exploratory data analysis, feature engineering, feature selection, hyperparameter tuning, and assessing model performance. Now, grab your virtual boarding pass and let's sail on this exciting journey towards creating your very own (potentially first) machine learning model!


Prerequisites

  • A Kaggle account - go ahead and create one if you don't have one.

  • An Integrated Development Environment (IDE), e.g. VS Code or any other one you are comfortable with.

  • Python


Context - From the Dataset

The Titanic dataset, named after the historic shipwreck, provides a foundational dataset for data science and machine learning studies. It contains information about the passengers who were aboard the RMS Titanic during its ill-fated maiden voyage in 1912. This dataset includes details like age, gender, class, fare and survival status. By using the Titanic dataset for our case study, we get to explore essential data science and machine learning concepts in a practical context.

💡
To enjoy this project even further, I'd recommend watching the movie first!

Installing Packages

  • Download and install Python via this link after choosing your Operating System.

  • Download and install Jupyter Notebook via this link - [Optional if you already have an IDE like VS Code].


Downloading the Dataset

The Titanic dataset is available on Kaggle via the links below.

💡
It is recommended to create a folder for this project for easy file reading and management.

Understanding the Dataset

The training dataset has 12 columns, 10 of which describe passenger attributes that can be tweaked further for feature engineering. Understanding the information these columns store is paramount to building an effective machine learning model for predicting whether a passenger survived or not.

| Column | Description |
| --- | --- |
| PassengerId | Unique identifier of a passenger |
| Name | Passenger name (includes titles like Miss, Mrs, Major, Rev, etc.) |
| Sex | Passenger's sex (male or female) |
| Survived | A dummy variable that specifies whether a passenger survived or not - 1 for Survived and 0 for Died |
| Pclass | Ticket class, with 1 for Upper Class, 2 for Middle Class, and 3 for Lower Class |
| Age | Age of the passenger (fractional if the age is less than 1) |
| SibSp | Number of siblings/spouses travelling with the passenger |
| Parch | Number of parents/children travelling with the passenger |
| Ticket | Ticket number |
| Fare | Passenger fare |
| Cabin | Cabin number |
| Embarked | Port of embarkation (C for Cherbourg, Q for Queenstown, and S for Southampton) |

Prepping...

Before we start cleaning, analysing or building any model, we need to install the popular Python packages required for our project: NumPy, pandas, Matplotlib, Seaborn, and scikit-learn (sklearn). You can install all of them using pip in your local terminal.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1695276735781/fa1b1521-638a-4d98-a978-9fb96bed00a9.png align="left")
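If you would rather install everything from inside a notebook, a minimal sketch (assuming a standard Jupyter/IPython setup) looks like this:

```python
# The %pip magic installs packages into the environment the notebook is running in.
# In a plain terminal, drop the leading % and run the same command with pip.
%pip install numpy pandas matplotlib seaborn scikit-learn
```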

Once this is done, giddy up for your first full-fledged data-centric prediction project!

Loading the Dataset

  • Open up your IDE - it is recommended to have one where you can run your code in cell blocks; Jupyter Notebook/Lab, VS Code with .ipynb extension, and Google Colab are popular examples.

  • Once this is done, import the packages you installed using pip.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
  • You can then proceed to load the datasets.

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

💡
If you have stored the files in the same folder as your notebook, the above code will work. However, if your data files are stored elsewhere on your local computer, you need to provide the path to their location.

  • To quickly inspect what the data looks like, we look at the first 'n' rows.

train.head() # gives the first 5 rows; you can pass a different n otherwise

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1695357966516/0c87311a-1149-488e-9b33-5f290311d4c7.png align="center")

We have successfully loaded the datasets required for our machine learning project. The train.csv file will be used to train the machine learning model(s) we choose, and test.csv will be used to assess the performance of the trained models on unseen/new scenarios. In essence, we want to see how well our model predicts the survival of passengers it was not trained on.
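As a quick, optional sanity check, you can confirm the sizes of the two files and verify that only the training file carries the Survived column (the Kaggle version of the dataset has 891 training rows and 418 test rows):

```python
# quick sanity check on the two files we just loaded
print(train.shape)  # rows and columns in the training data
print(test.shape)   # the test file has one column fewer
print(set(train.columns) - set(test.columns))  # should print {'Survived'}
```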


Cleaning the Dataset

Data cleaning stands as a critical cornerstone in every data project. The performance of your machine learning model hinges significantly on the quality and cleanliness of your data. Therefore, it becomes imperative to meticulously undertake the necessary steps to ensure your data is free from imperfections and ready for analysis. For the Titanic dataset, we apply a set of widely recognized techniques ranging from removing duplicates to addressing missing values.

Removing Duplicates

print(train.shape) # gives the no of rows, the no of columns in the data
train.drop_duplicates(inplace=True) # dropping duplicates 
print(train.shape) # checking to see if any rows were actually dropped

Checking for Missing Values

print(train.isnull().sum()) # returns the no of missing values for all columns

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1696064110209/0d2bc895-1c1f-4f28-b80f-b07345a73ca0.png align="left")

From the image above, we see that 177, 687 and 2 entries of the Age, Cabin, and Embarked columns have nan (missing) values respectively.

# checking the percentage of missing values in the Age column
print(train.isnull().sum()["Age"]/ len(train)) # about 19.9%

# checking the percentage of missing values in the Cabin column
print(train.isnull().sum()["Cabin"]/len(train)) # about 77.1%
💡
The output shows that about 77% of the Cabin entries are nan values. However, this does not mean the column is useless: cabin numbers (labelled A to G, plus T) were recorded mainly for first-class passengers, so a missing cabin is informative in itself. We can therefore fill the nan values with 'No Cabin' instead of dropping the column.

train['Cabin'].fillna('No Cabin', inplace=True)
train.Cabin.value_counts()

Dropping "nan" Values

train.dropna(subset=['Embarked'], inplace=True)

Filling in the Age Column

We could have dropped all the rows with missing age values in our dataset, but as you might have guessed, we would be missing out on a significant chunk of possible insights from those 177 rows. Since we will not be dropping these rows, we have to fill them with some values. One popular way is by imputing the mean or median of the ages. So, mean or median? The distribution of the Age column helps us answer this question.

plt.figure(figsize=(10, 5))
sns.histplot(train['Age'], kde=True)  # distplot is deprecated in recent seaborn versions
plt.title('Distribution of Age column')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

We get a plot like the one below:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1696065730103/6a8efa3a-4cba-4cc2-a6da-66e24859887f.png align="center")

Seeing that the distribution is a little bit skewed to the right, we use the median for the imputation as it is more robust to outliers and not heavily influenced by very low or high age values like the mean.

train['Age'].fillna(train['Age'].median(), inplace=True)

Column Description

A quick description of the numeric columns can also provide statistical insight into how the data in our train dataset is distributed.

train.describe()

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1696064686629/f87d1344-ad5a-431c-90ad-48ebf5fbb0b1.png align="center")

This returns a quick analysis of the numeric columns with information on count (number of non-missing values), the mean, standard deviation, minimum, maximum, and 25th, 50th, and 75th percentile in the data.

💡
Compare and contrast this with train.info()
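For reference, here is a small sketch of that comparison; info() reports each column's dtype and non-null count, which complements the statistical summary from describe():

```python
# describe() summarises the numeric columns statistically,
# while info() lists every column with its dtype and non-null count
train.info()
```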

Exploratory Data Analysis (EDA)

This step involves visually and statistically summarizing your data, identifying outliers, understanding the distribution of variables, and forming initial hypotheses. EDA is a crucial step before diving into modelling, as it helps you better grasp the characteristics of your data and make informed decisions on how to proceed.

train.Sex.value_counts().plot(kind='bar', alpha=1)
train.Embarked.value_counts().plot(kind='bar')
train.Pclass.value_counts().plot(kind='bar')

Execute each line in the code block above in a separate code cell in your workbook and see the plots you get.

💡
If you remember from the movie, in the scenes where the ship was sinking, women and children were saved first. That begs the question: what was the survival rate of men compared to women on the Titanic? A simple EDA can help us see this.

males = train[train['Sex'] == "male"]
females = train[train['Sex'] == "female"]

# computing survival rates for both
male_survival_rate = males["Survived"].mean()
female_survival_rate = females["Survived"].mean()

male_survival_counts = males["Survived"].value_counts()
female_survival_counts = females["Survived"].value_counts()

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

axes[0].bar(["Males", "Females"], [male_survival_rate, female_survival_rate], color=["blue", "red"])
axes[0].set_xlabel("Gender")
axes[0].set_ylabel("Survival Rate")
axes[0].set_title("Males vs Females Survival Rates")

bar_positions = np.arange(2)
bar_width = 0.4
# align the counts so that, for both genders, the first value is 'Died' and the second is 'Survived'
died_counts = [male_survival_counts.get(0, 0), female_survival_counts.get(0, 0)]
survived_counts = [male_survival_counts.get(1, 0), female_survival_counts.get(1, 0)]
axes[1].bar(bar_positions - bar_width/2, died_counts, bar_width, label='Died')
axes[1].bar(bar_positions + bar_width/2, survived_counts, bar_width, label='Survived')
axes[1].set_title('Survival Count Comparison: Males vs Females')
axes[1].set_xticks(bar_positions)
axes[1].set_xticklabels(['Males', 'Females'])
axes[1].set_xlabel("Gender")
axes[1].legend()
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()
💡
You can keep exploring the data with EDA to see what additional insights you get. What else do you remember from the movie and would like to see in the data?
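For example, one question worth checking is whether ticket class mattered. A small sketch of that analysis, using only columns we have already seen, could look like this:

```python
# survival rate per ticket class - upper-class passengers fared noticeably better
class_survival = train.groupby('Pclass')['Survived'].mean()
print(class_survival)

class_survival.plot(kind='bar')
plt.title('Survival Rate by Ticket Class')
plt.xlabel('Ticket Class')
plt.ylabel('Survival Rate')
plt.show()
```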

Feature Engineering

Feature engineering has to do with creating new, informative features from existing data to enhance the performance of your machine-learning model. It involves selecting, transforming, or combining variables in a way that captures meaningful patterns and relationships. Effective feature engineering can significantly impact the predictive power of your model, ultimately leading to better results and insights.

Create a Copy of your Data

train_copy = train.copy()

Creating a Title Column

title_list = ['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev', 'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess', 'Don', 'Jonkheer']

import string
def look_for_title(name, title_list):
    # looping through our title list
    for title in title_list:
        if title in name:
            return title
    # if we don't find any title in the name
    return np.nan
💡
The title_list contains all possible titles for all passengers on the ship. You might be questioning the rationale behind creating this feature. It's possible that a person's title could influence their chances of survival, although we don't have definitive proof of that at this stage. Nevertheless, at this point, it remains a relevant feature to construct in our analysis.
# creating the title column
train_copy['Title'] = train_copy['Name'].map(lambda x: look_for_title(x, title_list))

Categorizing the Title List to Any of [A Single Man (Master), A Single Woman (Miss), A Married Man (Mr), A Married Woman (Mrs) ]

def replace_title(data):
    title = data['Title']
    if title in ['Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
        return 'Mr'
    elif title in ['Countess', 'Mme']:
        return 'Mrs'
    elif title in ['Mlle', 'Ms']:
        return 'Miss'
    elif title == 'Dr':
        if data['Sex'] == 'male':
            return 'Mr'
        else:
            return 'Mrs'
    else:
        return title

train_copy['Title'] = train_copy.apply(replace_title, axis=1)
train_copy['Title'].value_counts()
💡
To read more on this, please visit this website.

Creating a Deck Column

On a ship, cabins are grouped into decks, and the deck is usually indicated by the first letter of the cabin number. For instance, cabin number C85 belongs to deck C.

# creating decks from the first letter of the cabin column
train_copy['Deck'] = train_copy['Cabin'].map(lambda x: x[0])
# changing N to No Cabin
train_copy['Deck'] = train_copy['Deck'].replace('N', 'No Cabin')

Creating a Column for Family Size

# creating a column for family size
train_copy['Family_Size'] = train_copy['SibSp'] + train_copy['Parch'] + 1  # with the addition of the person themselves

Creating a Column for Fare per Person

# creating a column for fare per person
train_copy['Fare_Per_Person'] = train_copy['Fare'] / train_copy['Family_Size']

Preparing our Data for Modelling

Normalization (Scaling Features)

Scaling features is a critical data preprocessing step in machine learning. It involves transforming numerical variables to a standard scale, typically between 0 and 1 or with a mean of 0 and a standard deviation of 1. This process ensures that features with different scales do not unduly influence the model's performance, allowing it to give equal weight to each feature during training.

# standardizing the fare, age, sibsp, parch, family_size and fare_per_person columns
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_copy2 = train_copy.copy() # creating another copy of our dataset
train_copy2[['Fare', 'Age', 'SibSp', 'Parch', 'Family_Size', 'Fare_Per_Person']] = scaler.fit_transform(train_copy2[['Fare', 'Age', 'SibSp', 'Parch', 'Family_Size', 'Fare_Per_Person']])
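As a quick, optional check, the scaled columns should now have a mean of roughly 0 and a standard deviation of roughly 1:

```python
# after StandardScaler, each scaled column should have mean ~0 and std ~1
scaled_cols = ['Fare', 'Age', 'SibSp', 'Parch', 'Family_Size', 'Fare_Per_Person']
print(train_copy2[scaled_cols].describe().loc[['mean', 'std']].round(3))
```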

Encoding Categorical Variables

Encoding categorical variables is the process of converting non-numeric data, such as categories or labels, into numerical format, allowing machine learning algorithms to process and learn from this information. Common techniques include one-hot encoding and label encoding, each suitable for different types of categorical data.

# one hot encoding for the sex column
encoded_train = pd.get_dummies(train_copy2, columns=['Sex'], prefix=['is'])
# one hot encoding for the embarked column
encoded_train = pd.get_dummies(encoded_train, columns=['Embarked'], prefix=['Embarked'])
# one hot encoding for the title column
encoded_train = pd.get_dummies(encoded_train, columns=['Title'], prefix=['Title'])
# label encoding for the deck column because it is ordinal
encoded_train['Deck'] = encoded_train['Deck'].map({'No Cabin': 0, 'A':8, 'B':7, 'C':6, 'D':5, 'E':4, 'F':3, 'G':2, 'T':1})

Creating a Copy of our Encoded Data

# creating a copy of the encoded train dataset
encoded_train2 = encoded_train.copy()

Dropping Columns Not Needed for Training

# dropping the columns not needed for training
encoded_train2.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

Assessing the Correlations Between Columns

# checking for the correlation between the columns
plt.figure(figsize=(15, 10))
sns.heatmap(encoded_train2.corr(), annot=True, cmap='coolwarm')
plt.show()

We get a map like the one below:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1696075292606/677b3a93-64ab-4736-a28e-566ce0611504.png align="center")

💡
A heatmap of the correlation matrix tells you how positively or negatively correlated the columns are with each other. If two columns are highly correlated, they are most likely carrying the same information. Hence, analysing the correlation matrix can help inform the features you choose for training.
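One practical way to use the matrix, sketched below, is to rank how strongly each column correlates with the target; this is only a rough guide, not a substitute for experimenting with feature sets:

```python
# correlation of every feature with the target, strongest (by absolute value) first
target_corr = encoded_train2.corr()['Survived'].drop('Survived')
print(target_corr.sort_values(key=abs, ascending=False))
```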

Training the Model

Training the model is the phase where we feed our prepared dataset into the chosen machine learning algorithm. During this process, the algorithm learns to recognize patterns and relationships within the data, adjusting its parameters to make predictions or classifications. The goal is to find the best-fitting model that generalizes well to unseen data, which is achieved through iterative training, evaluation, and fine-tuning of the model's hyperparameters.

Oftentimes, one has to experiment with a series of machine learning models and then choose the one that works best based on the preferred performance metrics. In this case, we want a model that gives us the highest accuracy score.

💡
For this blog, I have used Random Forest Classifier but I have also worked with different models like Logistic Regression, Support Vector Machines, Decision Trees, and Gradient Boosting. You can find the GitHub repository here.
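If you want to compare a few of those candidates yourself before settling on one, here is a minimal cross-validation sketch. It assumes the encoded_train2 dataframe built above; the feature list is purely illustrative, so feel free to swap in your own selection:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# an illustrative feature list - not the definitive choice
candidate_features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Deck',
                      'Family_Size', 'is_female', 'is_male']
X = encoded_train2[candidate_features]
y = encoded_train2['Survived']

# 5-fold cross-validation gives a more honest accuracy estimate than scoring on the training data
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=1),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```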

Random Forest Classifier

req_features = ['Pclass', 'Title_Master', 'Title_Mrs', 'Title_Mr', 'is_female', 'is_male', 'Family_Size', 'SibSp', 'Age', 'Parch']

# random forest classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1)

# training the model
rf.fit(encoded_train2[req_features], encoded_train2['Survived'])

# assessing the accuracy of trained model on the training data
print(rf.score(encoded_train2[req_features], encoded_train2['Survived']))
💡
We select only a subset of features from our dataset, rather than all of them, so that our model does not overfit the training data and then perform badly on unseen data. With this, we are able to build a more generalisable algorithm that works favourably well on both training and test data. There is no hard rule for the number of features required to build your model; one usually has to experiment with different combinations of features or use predefined methods like feature importance to select the best set possible.
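One such predefined method is the feature importance scores exposed by the fitted random forest; a quick sketch of inspecting them:

```python
# importance scores from the fitted random forest, largest first
importances = pd.Series(rf.feature_importances_, index=req_features)
print(importances.sort_values(ascending=False))
```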

We have trained our first machine-learning model with the Random Forest algorithm. Now, we move on to check the performance on the testing data.

Preparing the Test Data

test_copy = test.copy()
# applying the same transformations to the test data as we did to the train data
test_copy['Title'] = test_copy['Name'].map(lambda x: look_for_title(x, title_list))
test_copy['Title'] = test_copy.apply(replace_title, axis=1)
test_copy['Cabin'].fillna('No Cabin', inplace=True)
test_copy['Deck'] = test_copy['Cabin'].map(lambda x: x[0])
test_copy['Deck'] = test_copy['Deck'].replace('N', 'No Cabin')
test_copy['Family_Size'] = test_copy['SibSp'] + test_copy['Parch'] + 1  # with the addition of the person themselves
test_copy['Fare_Per_Person'] = test_copy['Fare'] / test_copy['Family_Size']
test_copy['Embarked'].fillna(test_copy['Embarked'].mode()[0], inplace=True)
test_copy['Age'].fillna(test_copy['Age'].median(), inplace=True)
test_copy[['Fare', 'Age', 'SibSp', 'Parch', 'Family_Size', 'Fare_Per_Person']] = scaler.transform(test_copy[['Fare', 'Age', 'SibSp', 'Parch', 'Family_Size', 'Fare_Per_Person']]) # use transform (not fit_transform) so the test data is scaled with the statistics learned from the training data
# dealing with categorical columns on our test data
# one hot encoding for the sex column
test_copy = pd.get_dummies(test_copy, columns=["Sex"], prefix=["is"])
test_copy = pd.get_dummies(test_copy, columns=["Embarked"], prefix=["Embarked"])
test_copy = pd.get_dummies(test_copy, columns=["Title"], prefix=["Title"])
# label encoding for the deck column because it is ordinal
test_copy['Deck'] = test_copy['Deck'].map({'No Cabin': 0, 'A':8, 'B':7, 'C':6, 'D':5, 'E':4, 'F':3, 'G':2, 'T':1})
# replacing the nan value in the fare column with the mean
test_copy['Fare'].fillna(test_copy['Fare'].mean(), inplace=True)

# replacing the nan value in the fare per person column with the mean
test_copy['Fare_Per_Person'].fillna(test_copy['Fare_Per_Person'].mean(), inplace=True)
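Before predicting, it is worth confirming that every feature the model was trained on exists in the transformed test data; if a title or deck category happened to be absent from test.csv, its one-hot column would be missing and the prediction step would fail. A small defensive sketch (optional for this particular dataset):

```python
# make sure every training feature exists in the test data;
# any one-hot column missing from the test set is added and filled with 0
for col in req_features:
    if col not in test_copy.columns:
        test_copy[col] = 0
```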

Testing our Model

rf_predictions = rf.predict(test_copy[req_features])

# creating a dataframe with the predictions
submission = pd.DataFrame({'PassengerId': test_copy['PassengerId'], 'Survived': rf_predictions})
submission.to_csv('submission.csv', index=False)
print("Submission file successfully saved!")

Submitting on Kaggle

Running the code above will have generated a CSV file with two columns: PassengerId and Survived. You can go ahead and submit your predictions on the website. You will get a score showing how well your algorithm predicted the survival of passengers on the Titanic.

💡
Keep tuning the hyperparameters, trying different features for the model training, extracting features, experimenting with other models and seeing how well your scores improve!
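As a starting point for that hyperparameter tuning, here is a minimal grid search sketch over the random forest, using the same features and training data as before; the grid values are illustrative rather than recommended settings:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# a small, illustrative parameter grid - widen it once you see which direction helps
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'min_samples_leaf': [1, 3, 5],
}

grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(encoded_train2[req_features], encoded_train2['Survived'])

print(grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.3f}")
```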

Conclusion

In conclusion, our journey through building a machine learning model using the Titanic dataset has covered essential data science steps, from cleaning to feature engineering and model training. These concepts provide a foundation for beginners in data science and machine learning.

Thank you for engaging!