A study record of Coursera's Machine Learning course by Andrew Ng, with some added practices to reinforce the learning.
-
Introduction
Machine Learning definition
: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Supervised learning
: "Right answer given" e.g. Regression, Classification...
Unsupervised learning
: "No right answer given" e.g. Clustering, Cocktail party algorithm...
-
Linear Regression with One Variable
Model representation
Cost function
Gradient Descent
-
Linear Algebra Review
-
Python Practice for Simple Linear Regression
PREDICTING HOUSE PRICES
We have the following dataset:
| Entry No. | Square_Feet | Price |
| --- | --- | --- |
| 1 | 150 | 6450 |
| 2 | 200 | 7450 |
| 3 | 250 | 8450 |
| 4 | 300 | 9450 |
| 5 | 350 | 11450 |
| 6 | 400 | 15450 |
| 7 | 600 | 18450 |

With linear regression, we have to find a linear relationship within the data so we can estimate θ0 and θ1. Our hypothesis equation looks like this:

hθ(x) = θ0 + θ1 * x
Where:
- hθ(x) is the predicted price for a particular square_feet value (meaning price is a linear function of square_feet)
- θ0 is a constant
- θ1 is the regression coefficient
Coding:
- See the week1 Python code, which uses data-mining packages such as NumPy, SciPy, Pandas, Matplotlib, and Scikit-Learn to implement it; a brief sketch follows below.
- Script Output:
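As a rough sketch of the approach (not the actual week1 script; the variable names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Dataset from the table above
square_feet = np.array([[150], [200], [250], [300], [350], [400], [600]])
price = np.array([6450, 7450, 8450, 9450, 11450, 15450, 18450])

# Fit the hypothesis h(x) = theta0 + theta1 * x
model = LinearRegression()
model.fit(square_feet, price)

print("theta0 (intercept):", model.intercept_)
print("theta1 (coefficient):", model.coef_[0])

# Predict the price of a hypothetical 700 square-foot house
print("Predicted price for 700 sq ft:", model.predict([[700]])[0])
```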
PREDICTING WHICH TV SHOW WILL HAVE MORE VIEWERS NEXT WEEK
The Flash and Arrow are popular American television series. It is interesting to ask which one will ultimately win the ratings war, so let's write a program that predicts which TV show will have more viewers next week.
We have the following dataset:
| FLASH_EPISODE | FLASH_US_VIEWERS | ARROW_EPISODE | ARROW_US_VIEWERS |
| --- | --- | --- | --- |
| 1 | 4.83 | 1 | 2.84 |
| 2 | 4.27 | 2 | 2.32 |
| 3 | 3.59 | 3 | 2.55 |
| 4 | 3.53 | 4 | 2.49 |
| 5 | 3.46 | 5 | 2.73 |
| 6 | 3.73 | 6 | 2.6 |
| 7 | 3.47 | 7 | 2.64 |
| 8 | 4.34 | 8 | 3.92 |
| 9 | 4.66 | 9 | 3.06 |

Steps to solving this problem:
- First we have to convert our data to X_parameters and Y_parameters, but here we have two sets of them. So let's name them flash_x_parameter, flash_y_parameter, arrow_x_parameter, and arrow_y_parameter.
- Then we have to fit our data to two different linear regression models: one for Flash and the other for Arrow.
- Then we have to predict the number of viewers for the next episode of both TV shows.
- Then we can compare the results and guess which show will have more viewers.
Coding:
- See the week1 Python code, which uses Pandas and Scikit-Learn to implement it; a brief sketch follows the output below.
- Script Output: The Flash TV show will have more viewers next week.
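As a rough sketch of those steps (again, variable names are illustrative and the actual week1 script may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Episode numbers and US viewers (in millions) from the table above
flash_x = np.arange(1, 10).reshape(-1, 1)
flash_y = np.array([4.83, 4.27, 3.59, 3.53, 3.46, 3.73, 3.47, 4.34, 4.66])
arrow_x = np.arange(1, 10).reshape(-1, 1)
arrow_y = np.array([2.84, 2.32, 2.55, 2.49, 2.73, 2.6, 2.64, 3.92, 3.06])

# Fit one linear regression model per show
flash_model = LinearRegression().fit(flash_x, flash_y)
arrow_model = LinearRegression().fit(arrow_x, arrow_y)

# Predict viewers for the next (10th) episode and compare
flash_next = flash_model.predict([[10]])[0]
arrow_next = arrow_model.predict([[10]])[0]
winner = "The Flash" if flash_next > arrow_next else "Arrow"
print(winner, "TV show will have more viewers next week.")
```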
-
Linear Regression with Multiple Variables
Multiple features
Gradient Descent for multiple variables
Feature scaling
: Make sure features are on a similar scale.
Learning rate
: Making sure gradient descent is working correctly.
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; may not converge.
Features and Polynomial Regression
-
Computing Parameters Analytically
Normal equation
: Method to solve for θ analytically.
Normal equation noninvertibility
-
Octave Tutorial
Basic operation
Moving Data Around
Computing on Data
Plotting Data
Control statements: for, while, if
Vectorization
-
Octave Practice for Linear Regression
In this practice, we will implement linear regression and see it work on data. See the related exercises and scripts, which are based on the official Coursera exercise. Everything we did in the exercises is summarized in the sections below:
Linear regression with one variable
In this part of the exercise, you will implement linear regression with one variable to predict profits for a food truck. Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from the cities.
1. Plotting the Data
Before starting on any task, it is often useful to understand the data by visualizing it. For this dataset, you can use a scatter plot to visualize the data, since it has only two properties to plot (profit and population).
2. Gradient Descent
In this part, you will fit the linear regression parameters θ to our dataset using gradient descent.
The objective of linear regression is to minimize the cost function
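Written out in the course's notation, this squared-error cost is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$$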
where the hypothesis hθ(x) is given by the linear model
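$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1$$

i.e. a straight line, since there is only one input variable here.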
Recall that the parameters of your model are the θj values. These are the values you will adjust to minimize cost J(θ). One way to do this is to use the batch gradient descent algorithm. In batch gradient descent, each iteration performs the update
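Written out, the update (applied simultaneously to every θj) is:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$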
With each step of gradient descent, your parameters θj come closer to the optimal values that will achieve the lowest cost J(θ).
After computing the cost function J(θ) and the parameters θ, you can plot the linear fit, as in the picture below:
To understand the cost function J(θ) better, you will now plot the cost over a 2-dimensional grid of θ0 and θ1 values.
Surface:
Contour:
Linear regression with multiple variables
In this part, you will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices.
1. Feature Normalization
By looking at the dataset values, note that house sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, first performing feature scaling can make gradient descent converge much more quickly.
- Subtract the mean value of each feature from the dataset.
- After subtracting the mean, additionally scale (divide) the feature values by their respective standard deviations, as written out in the formula below.
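Written as a formula, with μj the mean and σj the standard deviation of feature j over the training set:

$$x_j := \frac{x_j - \mu_j}{\sigma_j}$$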
2. Gradient Descent
Previously, you implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged. We can also try out different learning rates for the dataset and find a learning rate that converges quickly. If you pick a learning rate within a good range, your plot will look similar to the figure below:
3. Normal Equations
In the lecture videos, you learned that the closed-form solution to linear regression is
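$$\theta = \left( X^T X \right)^{-1} X^T y$$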
Using this formula does not require any feature scaling, and you will get an exact solution in one calculation: there is no "loop until convergence" like in gradient descent.
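The exercise computes this in Octave; the same one-step calculation can be sketched in NumPy (the function name here is illustrative):

```python
import numpy as np

def normal_equation(X, y):
    # theta = pinv(X' * X) * X' * y
    # X: (m, n+1) design matrix with a leading column of ones, y: (m,) target vector
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```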
-
Classification and Representation
Classification:
Email: Spam / Not Spam?, Online Transactions: Fraudulent (Yes/No)?
Hypothesis Representation
Decision Boundary
-
Logistic Regression Model
Cost Function
Simplified Cost Function and Gradient Descent
Advanced Optimization
-
Multiclass Classification
Multiclass Classification: One-vs-all
-
Regularization
The Problem of Overfitting
Cost Function
Regularized Linear Regression
Regularized Logistic Regression
-
Octave Practice for Logistic Regression
In this exercise, you will implement logistic regression and apply it to two different datasets. See the related exercises and scripts, which are based on the official Coursera exercise. Everything we did in the exercises is summarized in the sections below:
Logistic Regression
In this part of the exercise, you will build a logistic regression model to predict whether a student gets admitted into a university. Suppose that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on two exams and the admissions decision. Your task is to build a classification model that estimates an applicant's probability of admission based on the scores from those two exams.
1. Visualizing the data
Before starting to implement any learning algorithm, it is always good to visualize the data if possible. It can help you get more familiar with the data distribution.
2. Sigmoid function
Before you start with the actual cost function, recall that the logistic regression hypothesis is defined as:
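$$h_\theta(x) = g\left(\theta^T x\right)$$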
where function g is the sigmoid function. The sigmoid function is defined as:
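$$g(z) = \frac{1}{1 + e^{-z}}$$

It maps any real-valued input into the interval (0, 1), which is what allows hθ(x) to be interpreted as a probability.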
3. Cost function and gradient
Recall that the cost function in logistic regression is
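$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta\left(x^{(i)}\right)\right) - \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$$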
and the gradient of the cost is a vector of the same length as θ where the jth element (for j = 0, 1, ..., n) is defined as follows:
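$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$

Note that while this gradient looks identical to the linear regression gradient, the two are different because here hθ(x) is the sigmoid of θᵀx.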
4. Learning parameters using fminunc
In the previous assignment, you found the optimal parameters of a linear regression model by implementing gradient descent. You wrote a cost function and calculated its gradient, then took a gradient descent step accordingly. This time, instead of taking gradient descent steps, you will use an Octave/MATLAB built-in function called fminunc.
Octave/MATLAB's fminunc is an optimization solver that finds the minimum of an unconstrained function. For logistic regression, you want to optimize the cost function J(θ) with parameters θ.
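The exercise itself calls Octave's fminunc; for readers following along in Python, scipy.optimize.minimize plays a comparable role. A minimal sketch under that assumption (the helper names here are illustrative, not part of the official exercise code):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    # Unregularized logistic regression cost J(theta) and its gradient.
    m = y.size
    h = sigmoid(X @ theta)
    cost = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return cost, grad

# X: (m, n+1) design matrix with a column of ones, y: (m,) labels in {0, 1}
# initial_theta = np.zeros(X.shape[1])
# result = minimize(cost_and_grad, initial_theta, args=(X, y), jac=True, method="BFGS")
# theta = result.x  # optimized parameters, analogous to fminunc's output
```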
5. Plot the decision boundary
This final θ value will then be used to plot the decision boundary on the training data, resulting in a figure similar to the picture below:
Regularized logistic regression
In this part of the exercise, you will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly. Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.
1. Visualizing the data
Before starting to implement any learning algorithm, it is always good to visualize the data if possible. It can help you get more familiar with the data distribution.
It shows that our dataset cannot be separated into positive and negative examples by a straight line through the plot. Therefore, a straightforward application of logistic regression will not perform well on this dataset since logistic regression will only be able to find a linear decision boundary.
2. Feature mapping
One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of x1 and x2 up to the sixth power.
As a result of this mapping, our vector of two features (the scores on two QA tests) has been transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimension feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2-dimensional plot.
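The official exercise provides an Octave helper for this mapping; a NumPy sketch of the same idea (the function name is illustrative) could look like:

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    # Map two feature vectors to all polynomial terms of x1 and x2 up to `degree`,
    # including the constant term: 1, x1, x2, x1^2, x1*x2, x2^2, ..., x2^6.
    # For degree 6 this yields 28 columns.
    terms = [np.ones_like(x1)]
    for d in range(1, degree + 1):
        for j in range(d + 1):
            terms.append((x1 ** (d - j)) * (x2 ** j))
    return np.column_stack(terms)
```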
3. Cost function and gradient
Now you will implement code to compute the cost function and gradient for regularized logistic regression.
Recall that the regularized cost function in logistic regression is:
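$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta\left(x^{(i)}\right)\right) - \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

Here λ is the regularization parameter.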
Note that you should not regularize the parameter θ0. In Octave/MATLAB, recall that indexing starts from 1, hence, you should not be regularizing the theta(1) parameter (which corresponds to θ0) in the code. The gradient of the cost function is a vector where the jth element is defined as follows:
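$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \qquad \text{for } j = 0$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \left( \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \qquad \text{for } j \geq 1$$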
4. Learning parameters using fminunc
Similar to the previous parts, you will use fminunc to learn the optimal parameters θ.
5. Plot the decision boundary
This final θ value will then be used to plot the decision boundary on the training data, resulting in a figure similar to the picture below:
-
Motivations
Non-linear Hypothesis
Neurons and the Brain
-
Neural Networks
Model Representation
-
Application
Examples and Intuitions
MultiClass classification
-
Octave Practice for Multi-class Classification and Neural Networks
In this exercise, you will implement one-vs-all logistic regression and neural networks to recognize hand-written digits. See the related exercises and scripts, which are based on the official Coursera exercise. Everything we did in the exercises is summarized in the sections below:
Multi-class Classification
For this exercise, you will use logistic regression and neural networks to recognize handwritten digits (from 0 to 9). Automated handwritten digit recognition is widely used today - from recognizing zip codes (postal codes) on mail envelopes to recognizing amounts written on bank checks.
1. Dataset and Visualizing the data
You're given a dataset that contains 5000 training examples of handwritten digits. Each training example is a 20 pixel by 20 pixel grayscale image of the digit. Each pixel is represented by a floating point number indicating the grayscale intensity at that location. The 20 by 20 grid of pixels is "unrolled" into a 400-dimensional vector. Each of these training examples becomes a single row in our data matrix X. This gives us a 5000 by 400 matrix X where every row is a training example for a handwritten digit image.
We will begin by visualizing a subset of the training set: select 100 rows from X and pass those rows to the displayData function. After you run the related code, you should see an image like the one below:
2. Vectorizing Logistic Regression
You will be using multiple one-vs-all logistic regression models to build a multi-class classifier. Since there are 10 classes, you will need to train 10 separate logistic regression classifiers. To make this training efficient, it is important to ensure that your code is well vectorized.
Recall that for regularized logistic regression, the cost function is defined as
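This is the same regularized cost as in the logistic regression exercise above, with λ again being the regularization parameter:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta\left(x^{(i)}\right)\right) - \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$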
Note that you should not be regularizing θ0 which is used for the bias term. Correspondingly, the partial derivative of regularized logistic regression cost for θj is defined as
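$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \qquad \text{for } j = 0$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \left( \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \qquad \text{for } j \geq 1$$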
3. One-vs-all Classification and Prediction
In this part of the exercise, you will implement one-vs-all classification by training multiple regularized logistic regression classifiers. After training your one-vs-all classifier, you can now use it to predict the digit contained in a given image. For each input, you should compute the "probability" that it belongs to each class using the trained logistic regression classifiers.
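The exercise implements this prediction step in Octave; purely as a reference, an equivalent NumPy sketch (with illustrative names) might look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(all_theta, X):
    # all_theta: (num_labels, n+1), one row of trained parameters per class.
    # X: (m, n) examples; a bias column of ones is added here.
    m = X.shape[0]
    X = np.hstack([np.ones((m, 1)), X])
    probs = sigmoid(X @ all_theta.T)      # (m, num_labels) class "probabilities"
    # The course's digit labels run 1..10 (with 10 standing for the digit 0),
    # so the 0-based argmax index is shifted by one to match that convention.
    return np.argmax(probs, axis=1) + 1
```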
You should see that the training set accuracy is about 94.9% (i.e., it classifies 94.9% of the examples in the training set correctly).
Neural Networks
In the previous part of this exercise, you implemented multi-class logistic regression to recognize handwritten digits. However, logistic regression cannot form more complex hypotheses as it is only a linear classifier. In this part of the exercise, you will implement a neural network to recognize handwritten digits using the same training set as before. The neural network will be able to represent complex models that form non-linear hypotheses. For this week, you will be using parameters from a neural network that we have already trained. Your goal is to implement the feedforward propagation algorithm to use our weights for prediction.
1. Model representation
Our neural network is shown in the picture below. It has 3 layers: an input layer, a hidden layer, and an output layer. Recall that our inputs are pixel values of digit images. Since the images are of size 20x20, this gives us 400 input layer units (excluding the extra bias unit which always outputs +1).
2. Feedforward Propagation and Prediction
Now you will implement feedforward propagation for the neural network.
You should implement the feedforward computation that computes hθ(x(i)) for every example i and returns the associated predictions. Similar to the one-vs-all classification strategy, the prediction from the neural network will be the label that has the largest output (hθ(x))k.
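The exercise does this in Octave with the provided Theta1 and Theta2 weight matrices; a NumPy sketch of the same feedforward computation, assuming those shapes, might look like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(Theta1, Theta2, X):
    # Feedforward for the 3-layer network: input (400 units) -> hidden -> output.
    # Theta1: (hidden_units, 401), Theta2: (num_labels, hidden_units + 1), X: (m, 400).
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])     # add the bias unit to the input layer
    a2 = sigmoid(a1 @ Theta1.T)              # hidden layer activations
    a2 = np.hstack([np.ones((m, 1)), a2])    # add the bias unit to the hidden layer
    a3 = sigmoid(a2 @ Theta2.T)              # output layer, h_theta(x) for each class
    # The course's labels run 1..10 (10 stands for the digit 0), so the 0-based
    # argmax index is shifted by one to match that convention.
    return np.argmax(a3, axis=1) + 1
```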
Once you are done, you should see that the accuracy is about 97.5%.