title | datePublished | cuid | slug | cover | tags |
---|---|---|---|---|---|
Practical Guide to Building Your First Predictive Machine Learning Model | Sat Sep 30 2023 13:05:18 GMT+0000 (Coordinated Universal Time) | cln61pr9s000b09ji200433e0 | practical-guide-to-building-your-first-predictive-machine-learning-model | | data, data-science, machine-learning, data-analysis, machine-learning-models |
In this blog, I will be demystifying the process behind building a predictive machine learning model. Whether you are a newcomer to the world of data science or someone looking to reinforce their foundation, this blog post is tailored just for you. In this practical guide, I will break down each step into bite-sized pieces, using simple words so that everyone reading can easily follow along. The case study to be used is the Titanic dataset. This classic example will help us navigate through the essentials: importing required libraries, data cleaning, exploratory data analysis, feature engineering, feature selection, hyperparameter tuning, and assessing model performance. Now, grab your virtual boarding pass and let's sail on this exciting journey towards creating your very own (potentially first) machine learning model!
- A Kaggle account - go ahead and create one if you don't have one.
- An Integrated Development Environment (IDE), e.g. VS Code or any other one you are comfortable with.
- Python
The Titanic dataset, named after the historic shipwreck, provides a foundational dataset for data science and machine learning studies. It contains information about the passengers who were aboard the RMS Titanic during its ill-fated maiden voyage in 1912. This dataset includes details like age, gender, class, fare and survival status. By using the Titanic dataset for our case study, we get to explore essential data science and machine learning concepts in a practical context.
- Download and install Python via this link after choosing your Operating System.
- Download and install Jupyter Notebook via this link - [Optional if you already have an IDE like VS Code].
The Titanic dataset is available on Kaggle via the links below.
The training dataset has 12 columns; aside from the PassengerId identifier and the Survived target, the remaining 10 can be tweaked further for feature engineering. Understanding the information these columns store is paramount to building an effective machine learning model for predicting whether a passenger survived or not.
Columns | Description |
---|---|
PassengerId | Unique identifier of a passenger |
Name | Passenger name (includes titles like Miss, Mrs, Major, Rev, etc.) |
Sex | Sex of the passenger (male or female) |
Survived | A dummy variable that specifies whether a passenger survived or not - 1 for Survived and 0 for Died |
Pclass | Ticket class with 1 for Upper Class, 2 for Middle Class, and 3 for Lower Class |
Age | Age of the passenger (fractional if the age is less than 1) |
SibSp | Number of siblings/spouses travelling with the passenger |
Parch | Number of parents/children travelling with the passenger |
Ticket | Ticket number |
Fare | Passenger fare |
Cabin | Cabin number |
Embarked | Port of embarkation (C for Cherbourg, Q for Queenstown, and S for Southampton) |
Before we start cleaning, analysing or building any model, we need to install the popular Python packages required for this project: NumPy, pandas, Matplotlib, Seaborn, and scikit-learn. You can install all of them using pip in your local terminal.
![](https://cdn.hashnode.com/res/hashnode/image/upload/v1695276735781/fa1b1521-638a-4d98-a978-9fb96bed00a9.png align="left")
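In case the screenshot above does not render for you, the command is typically something like `pip install numpy pandas matplotlib seaborn scikit-learn`.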
Once this is done, giddy up for your first full-fledged data-centric prediction project!
- Open up your IDE - it is recommended to have one where you can run your code in cell blocks; Jupyter Notebook/Lab, VS Code with the .ipynb extension, and Google Colab are popular examples.
- Once this is done, import the packages you installed using pip.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
- You can then proceed to load the datasets.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
- To quickly inspect what the data looks like, we look at the first 'n' rows.
train.head() # returns the first 5 rows by default; pass a number n to get a different count
![](https://cdn.hashnode.com/res/hashnode/image/upload/v1695357966516/0c87311a-1149-488e-9b33-5f290311d4c7.png align="center")
We have successfully loaded the data sets required for our machine-learning project. The train.csv data file will be used to train the machine learning model(s) we choose and the test.csv will be used to assess the performance of our trained models on unseen/new scenarios. In essence, we want to see how well our model predicts the survival rate of new passengers who were not in our original dataset.
Data cleaning stands as a critical cornerstone in every data project. The performance of your machine learning model hinges significantly on the quality and cleanliness of your data. Therefore, it becomes imperative to meticulously undertake the necessary steps to ensure your data is free from imperfections and ready for analysis. For the Titanic dataset, we apply a set of widely recognized techniques ranging from removing duplicates to addressing missing values.
print(train.shape) # gives the no of rows, the no of columns in the data
train.drop_duplicates(inplace=True) # dropping duplicates
print(train.shape) # checking to see if any rows were actually dropped
print(train.isnull().sum()) # returns the no of missing values for all columns
![](https://cdn.hashnode.com/res/hashnode/image/upload/v1696064110209/0d2bc895-1c1f-4f28-b80f-b07345a73ca0.png align="left")
From the image above, we see that 177, 687 and 2 entries of the Age, Cabin, and Embarked columns have nan (missing) values respectively.
# checking the percentage of missing values in the Age column
print(train.isnull().sum()["Age"]/ len(train)) # about 19.9%
# checking the percentage of missing values in the Cabin column
print(train.isnull().sum()["Cabin"]/len(train)) # about 77.1%
# about 77% of the Cabin values are missing, so instead of dropping the column we fill the gaps with a 'No Cabin' placeholder
train['Cabin'].fillna('No Cabin', inplace=True)
train.Cabin.value_counts()
# only 2 Embarked values are missing, so we simply drop those two rows
train.dropna(subset=['Embarked'], inplace=True)
We could have dropped all the rows with missing age values, but as you might have guessed, we would lose 177 rows and the insights they carry. Since we will not be dropping these rows, we have to fill them with some value. One popular approach is to impute the mean or the median of the ages. So, mean or median? The distribution of the Age column helps us answer this question.
plt.figure(figsize=(10, 5))
sns.histplot(train['Age'], kde=True)  # distplot is deprecated/removed in recent seaborn versions
plt.title('Distribution of Age column')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
We get a plot like the one below:
![](https://cdn.hashnode.com/res/hashnode/image/upload/v1696065730103/6a8efa3a-4cba-4cc2-a6da-66e24859887f.png align="center")
Seeing that the distribution is a little bit skewed to the right, we use the median for the imputation as it is more robust to outliers and not heavily influenced by very low or high age values like the mean.
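For a quick numeric confirmation of that right skew before imputing, pandas exposes a skew() method (an optional check, not part of the original flow):

```python
# a positive value indicates a right-skewed (long right tail) distribution
print(train['Age'].skew())
```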
# imputing the missing ages with the median
train['Age'].fillna(train['Age'].median(), inplace=True)
A quick description of the numeric columns also provides statistical insight into how the data in our train dataset is distributed.
train.describe()
![](https://cdn.hashnode.com/res/hashnode/image/upload/v1696064686629/f87d1344-ad5a-431c-90ad-48ebf5fbb0b1.png align="center")
This returns a quick analysis of the numeric columns with information on count (number of non-missing values), the mean, standard deviation, minimum, maximum, and 25th, 50th, and 75th percentile in the data.
This step involves visually and statistically summarizing your data, identifying outliers, understanding the distribution of variables, and forming initial hypotheses. EDA is a crucial step before diving into modelling, as it helps you better grasp the characteristics of your data and make informed decisions on how to proceed.
train.Sex.value_counts().plot(kind='bar', alpha=1)
train.Embarked.value_counts().plot(kind='bar')
train.Pclass.value_counts().plot(kind='bar')
Execute each line in the code block above in a separate code cell in your notebook and see the plots you get.
males = train[train['Sex'] == "male"]
females = train[train['Sex'] == "female"]
# computing survival rates for both
male_survival_rate = males["Survived"].mean()
female_survival_rate = females["Survived"].mean()
male_survival_counts = males["Survived"].value_counts()
female_survival_counts = females["Survived"].value_counts()
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
axes[0].bar(["Males", "Females"], [male_survival_rate, female_survival_rate], color=["blue", "red"])
axes[0].set_xlabel("Gender")
axes[0].set_ylabel("Survival Rate")
axes[0].set_title("Males vs Females Survival Rates")
bar_positions = np.arange(2)
bar_width = 0.4
# grouped bars: for each gender, one bar for passengers who died (0) and one for those who survived (1)
axes[1].bar(bar_positions - bar_width/2, [male_survival_counts[0], female_survival_counts[0]], bar_width, label='Died')
axes[1].bar(bar_positions + bar_width/2, [male_survival_counts[1], female_survival_counts[1]], bar_width, label='Survived')
axes[1].set_title('Survival Count Comparison: Males vs Females')
axes[1].set_xticks(bar_positions)
axes[1].set_xticklabels(['Males', 'Females'])
axes[1].set_xlabel("Gender")
axes[1].legend()
axes[1].set_ylabel('Count')
plt.tight_layout()
plt.show()
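If you prefer a more compact, table-style view of the same comparison, a one-line groupby gives the survival rate per gender; this is just an optional alternative to the plots above:

```python
# mean of the 0/1 Survived column per gender = survival rate
print(train.groupby('Sex')['Survived'].mean())
```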
Feature engineering has to do with creating new, informative features from existing data to enhance the performance of your machine-learning model. It involves selecting, transforming, or combining variables in a way that captures meaningful patterns and relationships. Effective feature engineering can significantly impact the predictive power of your model, ultimately leading to better results and insights.
train_copy = train.copy()
title_list = ['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev', 'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess', 'Don', 'Jonkheer']
import string
def look_for_title(name, title_list):
    # looping through our title list
    for title in title_list:
        if title in name:
            return title
    # if we don't find any title in the name
    return np.nan

# creating the title column
train_copy['Title'] = train_copy['Name'].map(lambda x: look_for_title(x, title_list))
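As an optional sanity check, you can count how many names did not match any title in the list (those would end up as NaN and could be inspected by hand):

```python
# number of passengers whose name matched none of the titles
print(train_copy['Title'].isnull().sum())
```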
Next, we categorize each title into one of four groups: Master (a single man), Miss (a single woman), Mr (a married man), or Mrs (a married woman).
def replace_title(data):
    title = data['Title']
    if title in ['Don', 'Major', 'Capt', 'Jonkheer', 'Rev', 'Col']:
        return 'Mr'
    elif title in ['Countess', 'Mme']:
        return 'Mrs'
    elif title in ['Mlle', 'Ms']:
        return 'Miss'
    elif title == 'Dr':
        if data['Sex'] == 'male':
            return 'Mr'
        else:
            return 'Mrs'
    else:
        return title
# applying the mapping to the existing Title column (no need to re-copy train here, which would erase the Title column we just created)
train_copy['Title'] = train_copy.apply(replace_title, axis=1)
train_copy['Title'].value_counts()
On a ship, cabins sit on decks, and the deck is usually indicated by the first letter of the cabin number. For instance, cabin C85 is on deck C.
# creating decks from the first letter of the cabin column
train_copy['Deck'] = train_copy['Cabin'].map(lambda x: x[0])
# rows without a cabin were filled with 'No Cabin', whose first letter is 'N', so we map that back to 'No Cabin'
train_copy['Deck'] = train_copy['Deck'].replace('N', 'No Cabin')
# creating a column for family size
train_copy['Family_Size'] = train_copy['SibSp'] + train_copy['Parch'] + 1 # with the addition of the person themselves
# creating a column for fare per person
train_copy['Fare_Per_Person'] = train_copy['Fare'] / train_copy['Family_Size']
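A quick optional peek at the newly engineered columns helps confirm they look sensible before moving on:

```python
train_copy[['Title', 'Deck', 'Family_Size', 'Fare_Per_Person']].head()
```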
Scaling features is a critical data preprocessing step in machine learning. It involves transforming numerical variables to a standard scale, typically between 0 and 1 or with a mean of 0 and a standard deviation of 1. This process ensures that features with different scales do not unduly influence the model's performance, allowing it to give equal weight to each feature during training.
# standardizing the fare, age, sibsp, parch, family_size and fare_per_person columns
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_copy2 = train_copy.copy() # creating another copy of our dataset
train_copy2[['Fare', 'Age', 'SibSp', 'Parch', 'Family_Size', 'Fare_Per_Person']] = scaler.fit_transform(train_copy2[['Fare', 'Age', 'SibSp', 'Parch', 'Family_Size', 'Fare_Per_Person']])
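To verify the scaling did what we expect, the standardized columns should now have a mean close to 0 and a standard deviation close to 1 (an optional check):

```python
# means should be ~0 and standard deviations ~1 after standardization
print(train_copy2[['Fare', 'Age', 'Family_Size']].describe().loc[['mean', 'std']])
```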
Encoding categorical variables is the process of converting non-numeric data, such as categories or labels, into numerical format, allowing machine learning algorithms to process and learn from this information. Common techniques include one-hot encoding and label encoding, each suitable for different types of categorical data.
# one hot encoding for the sex column
encoded_train = pd.get_dummies(train_copy2, columns=['Sex'], prefix=['is'])
# one hot encoding for the embarked column
encoded_train = pd.get_dummies(encoded_train, columns=['Embarked'], prefix=['Embarked'])
# one hot encoding for the title column
encoded_train = pd.get_dummies(encoded_train, columns=['Title'], prefix=['Title'])
# label encoding for the deck column because it is ordinal
encoded_train['Deck'] = encoded_train['Deck'].map({'No Cabin': 0, 'A':8, 'B':7, 'C':6, 'D':5, 'E':4, 'F':3, 'G':2, 'T':1})
# creating a copy of the encoded train dataset
encoded_train2 = encoded_train.copy()
# dropping the columns not needed for training
encoded_train2.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
# checking for the correlation between the columns
plt.figure(figsize=(15, 10))
sns.heatmap(encoded_train2.corr(), annot=True, cmap='coolwarm')
plt.show()
We get a map like the one below:
![](https://cdn.hashnode.com/res/hashnode/image/upload/v1696075292606/677b3a93-64ab-4736-a28e-566ce0611504.png align="center")
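One optional way to shortlist candidate features from this heatmap is to look directly at each column's correlation with the Survived target, sorted from strongest negative to strongest positive:

```python
# correlation of every encoded/numeric column with the target
print(encoded_train2.corr()['Survived'].sort_values())
```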
Training the model is the phase where we feed our prepared dataset into the chosen machine learning algorithm. During this process, the algorithm learns to recognize patterns and relationships within the data, adjusting its parameters to make predictions or classifications. The goal is to find the best-fitting model that generalizes well to unseen data, which is achieved through iterative training, evaluation, and fine-tuning of the model's hyperparameters.
Oftentimes, one has to experiment with a series of machine learning models and then choose the one that works best based on the preferred performance metrics. In this case, we want a model that gives us the highest accuracy score.
req_features = ['Pclass', 'Title_Master', 'Title_Mrs', 'Title_Mr', 'is_female', 'is_male', 'Family_Size', 'SibSp', 'Age', 'Parch']
# random forest classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1)
# training the model
rf.fit(encoded_train2[req_features], encoded_train2['Survived'])
# assessing the accuracy of trained model on the training data
print(rf.score(encoded_train2[req_features], encoded_train2['Survived']))
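The score above is computed on the same rows the model was trained on, so it tends to be optimistic. A quick, optional way to get a more realistic estimate is k-fold cross-validation, sketched below (not part of the original walkthrough):

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, repeat
cv_scores = cross_val_score(rf, encoded_train2[req_features], encoded_train2['Survived'], cv=5)
print(cv_scores.mean())
```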
We have trained our first machine-learning model with the Random Forest algorithm. Now, we move on to check the performance on the testing data.
test_copy = test.copy()
# applying the same transformations to the test data as we did to the train data
test_copy['Title'] = test_copy['Name'].map(lambda x: look_for_title(x, title_list))
test_copy['Title'] = test_copy.apply(replace_title, axis=1)
test_copy['Cabin'].fillna('No Cabin', inplace=True)
test_copy['Deck'] = test_copy['Cabin'].map(lambda x: x[0])
test_copy['Deck'] = test_copy['Deck'].replace('N', 'No Cabin')
test_copy['Family_Size'] = test_copy['SibSp'] + test_copy['Parch'] + 1 # with the addition of the person themselves
test_copy['Fare_Per_Person'] = test_copy['Fare'] / test_copy['Family_Size']
test_copy['Embarked'].fillna(test_copy['Embarked'].mode()[0], inplace=True)
test_copy['Age'].fillna(test_copy['Age'].median(), inplace=True)
# reusing the scaler fitted on the training data: transform only, so the test data is scaled with the training statistics
test_copy[['Fare', 'Age', 'SibSp', 'Parch', 'Family_Size', 'Fare_Per_Person']] = scaler.transform(test_copy[['Fare', 'Age', 'SibSp', 'Parch', 'Family_Size', 'Fare_Per_Person']])
# dealing with categorical columns on our test data
# one hot encoding for the sex column
test_copy = pd.get_dummies(test_copy, columns=["Sex"], prefix=["is"])
test_copy = pd.get_dummies(test_copy, columns=["Embarked"], prefix=["Embarked"])
test_copy = pd.get_dummies(test_copy, columns=["Title"], prefix=["Title"])
# label encoding for the deck column because it is ordinal
test_copy['Deck'] = test_copy['Deck'].map({'No Cabin': 0, 'A':8, 'B':7, 'C':6, 'D':5, 'E':4, 'F':3, 'G':2, 'T':1})
# replacing the nan value in the fare column with the mean
test_copy['Fare'].fillna(test_copy['Fare'].mean(), inplace=True)
# replacing the nan value in the fare per person column with the mean
test_copy['Fare_Per_Person'].fillna(test_copy['Fare_Per_Person'].mean(), inplace=True)
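# optional sanity check (not in the original walkthrough): if a category is missing from the test set,
# get_dummies will not create its column, so confirm every expected feature exists before predicting
missing_cols = [col for col in req_features if col not in test_copy.columns]
print("Missing feature columns:", missing_cols)  # ideally an empty list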
rf_predictions = rf.predict(test_copy[req_features])
# creating a dataframe with the predictions
submission = pd.DataFrame({'PassengerId': test_copy['PassengerId'], 'Survived': rf_predictions})
submission.to_csv('submission.csv', index=False)
print("Submission file successfully saved!")
Running the code above would have generated a CSV file with two columns: PassengerId and Survived. You can go ahead and submit your predictions on Kaggle. You will get a score showing how well your model predicted the survival of the passengers in the test set.
In conclusion, our journey through building a machine learning model using the Titanic dataset has covered essential data science steps, from cleaning to feature engineering and model training. These concepts provide a foundation for beginners in data science and machine learning.
Thank you for engaging!