Welcome to Day 18 of the 30 Days of Data Science series! Today, we explore the basics of Machine Learning and an essential library for implementing ML models in Python: Scikit-learn. This session will set the foundation for understanding ML concepts and applying them in practice.
- What is Machine Learning?
- Types of Machine Learning: Supervised, Unsupervised, and Reinforcement Learning.
- Introduction to Scikit-learn, a machine learning library in Python.
- Example: Linear Regression using Scikit-learn.
Machine Learning (ML) is a subset of artificial intelligence (AI) that enables systems to learn and improve from data without being explicitly programmed.
- Data: ML algorithms are trained using historical data.
- Model: A mathematical representation of the problem, used to make predictions or decisions.
- Training: The process of feeding data into the model to learn patterns.
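The three ideas above can be seen in a few lines. The following is a minimal sketch using NumPy only; the toy data (hours studied versus exam score) and all variable names are made up for illustration.

```python
import numpy as np

# Data: hypothetical historical observations (hours studied -> exam score).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 54.0, 56.0, 58.0, 60.0])

# Model: a straight line, score = slope * hours + intercept.
# Training: least squares picks the slope and intercept that best fit the data.
slope, intercept = np.polyfit(hours, scores, deg=1)

# Use the trained model to predict the score for an unseen input (6 hours).
predicted = slope * 6.0 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted:.2f}")
```

Because the toy data lie exactly on a line, the fit recovers a slope of 2 and an intercept of 50. Real data is noisy, so the fitted line would only approximate it.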
- Supervised Learning:
  - Input data (features) and output labels (target) are provided.
  - Goal: Learn a mapping from input to output.
  - Examples: Regression, Classification.
- Unsupervised Learning:
  - Only input data is provided, no output labels.
  - Goal: Discover hidden patterns or groupings.
  - Examples: Clustering, Dimensionality Reduction.
- Reinforcement Learning:
  - Agents learn by interacting with the environment and receiving feedback (rewards or penalties).
  - Examples: Game playing, Robotics.
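To make the difference between the first two types concrete, here is a small sketch that runs a supervised and an unsupervised model on the same synthetic data. The data, the cluster centers, and the choice of LogisticRegression and KMeans are assumptions made for illustration. Reinforcement learning is not covered here, since Scikit-learn does not target it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data: 150 points in 3 well-separated groups (illustrative only).
X, y = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)

# Supervised: the model sees both the features X and the labels y.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X, y)
print("Predicted labels:", classifier.predict(X[:5]))

# Unsupervised: the model sees only X and has to find the groups itself.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print("Cluster ids:     ", cluster_ids[:5])
```

The cluster ids from K-Means are arbitrary numbers, so they need not match the original label values even when the grouping is the same.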
Scikit-learn is a Python library for implementing machine learning algorithms. It provides simple and efficient tools for predictive data analysis.
- Built-in algorithms for supervised and unsupervised learning.
- Tools for model evaluation, preprocessing, and pipeline creation.
- Compatible with other Python libraries like NumPy and pandas.
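As a rough illustration of how these pieces fit together, the sketch below chains a preprocessing step and a model into a pipeline and evaluates it with cross-validation. The choice of StandardScaler, LogisticRegression, and 5 folds is an assumption for the example, not a recommendation from this series.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# return_X_y=True gives NumPy arrays directly: features X and labels y.
X, y = load_iris(return_X_y=True)

# Preprocessing and model chained into a single estimator.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model evaluation: 5-fold cross-validation returns one accuracy per fold.
scores = cross_val_score(pipeline, X, y, cv=5)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

Putting the scaler inside the pipeline means it is refit on the training portion of each fold, so no information from the held-out fold leaks into preprocessing.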
Before using Scikit-learn, ensure it is installed in your environment. Use the following command:
pip install scikit-learn
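One simple way to check that the installation worked is to import the package and print its version. Note that the package is installed as scikit-learn but imported as sklearn.

```python
import sklearn

# Prints the installed version string, confirming that the import works.
print(sklearn.__version__)
```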
- Loading a Dataset: Scikit-learn comes with several built-in datasets.
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.keys())  # Output: keys such as 'data', 'target', etc.
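Before going further, it can help to look at what the loaded object contains. The sketch below inspects the Iris dataset using attributes that load_iris returns.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement.
print(iris.data.shape)    # (150, 4)
# Target vector: one integer class label per flower.
print(iris.target.shape)  # (150,)
# Human-readable names for the columns and the classes.
print(iris.feature_names)
print(iris.target_names)
```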
- Splitting Data: Use train_test_split to divide data into training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
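For classification data, a common variation is a stratified split, which keeps the class proportions the same in both subsets. The stratify argument below is an addition to the call shown above, offered as a sketch and not as part of the original walkthrough.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=iris.target keeps the class proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(Counter(y_test.tolist()))     # 10 samples of each class
```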
- Training a Model: Fit a model using the training data.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
- Making Predictions: Use the trained model to make predictions.
predictions = clf.predict(X_test)
print(predictions)
- Evaluating a Model: Measure accuracy or other metrics.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
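Accuracy is a single number, so it can hide which classes the model confuses. As one example of the "other metrics" mentioned above, the sketch below repeats the workflow end to end and prints a confusion matrix and a per-class report. Fixing random_state on the classifier is an addition here, made only so the run is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# random_state is fixed here only to make the run repeatable.
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, predictions)
print(matrix)

# Precision, recall, and F1-score for each class.
print(classification_report(y_test, predictions, target_names=iris.target_names))
```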
Let’s build a Linear Regression model to predict house prices.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load the dataset
data = fetch_california_housing()
X, y = data.data, data.target
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse}")
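Output Example (approximate; the target is the median house value in units of $100,000, so the error is in those units squared):
Mean Squared Error: 0.556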
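MSE alone can be hard to interpret because it is in squared units. The sketch below shows two common companions, RMSE and R², along with the fitted coefficients. It assumes the load_diabetes dataset in place of the housing data, only because that dataset ships with Scikit-learn and needs no download; the same calls apply to the housing example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumption: load_diabetes is swapped in for the housing data because it
# is bundled with Scikit-learn and needs no download.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# RMSE is the square root of MSE, so it is in the same units as the target.
rmse = np.sqrt(mean_squared_error(y_test, predictions))
# R^2 is the share of target variance explained by the model (1.0 is perfect).
r2 = r2_score(y_test, predictions)

print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
# One coefficient per input feature.
print("Coefficients:", model.coef_)
```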
- Use the load_wine dataset from Scikit-learn and train a Decision Tree Classifier.
- Build a K-Means clustering model on synthetic data using Scikit-learn.
- Experiment with different test sizes in the train_test_split function and observe the impact on performance.
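As a possible starting point for the last exercise, the sketch below loops over a few test sizes and records the accuracy for each. The particular sizes, the wine dataset, and the decision tree are assumptions chosen to tie in with the first exercise; feel free to swap in your own.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

results = {}
for test_size in (0.1, 0.2, 0.3, 0.4):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    tree = DecisionTreeClassifier(random_state=42)
    tree.fit(X_train, y_train)
    # score() returns accuracy on the held-out test set.
    results[test_size] = tree.score(X_test, y_test)

for test_size, accuracy in results.items():
    print(f"test_size={test_size}: accuracy={accuracy:.3f}")
```

A single split per test size is noisy, so repeating the loop with several random_state values gives a fairer picture of the trend.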
- Machine Learning enables systems to learn from data and make predictions.
- Scikit-learn simplifies the implementation of ML algorithms with its tools and datasets.
- Linear Regression is a simple but powerful algorithm for understanding supervised learning.