Cleaning up uncommitted updates
Fady Bishara committed Sep 20, 2024
1 parent 813ca65 commit ab35ecf
Showing 8 changed files with 191 additions and 139 deletions.
Binary file added docs/figures/y_preds.gif
113 changes: 113 additions & 0 deletions docs/linreg/closed_form.md
@@ -0,0 +1,113 @@


# Closed-form solution

First, let us generalize the loss function just a little bit more by adding a ==regularization== term to it -- see this [Wikipedia article](https://en.wikipedia.org/wiki/Regularization_(mathematics)#Regularization_in_machine_learning) for a very brief introduction to the topic.

??? question "Why would we want to regularize the model?"
In this simple case with a one-dimensional function (i.e., with one dependent variable), regularization does not actually help; this is easy to show explicitly :construction:.
However, when the model becomes more complex and has many parameters, the regularization term penalizes complexity and, thus, overfitting.

With this additional term, the loss function becomes,
$$
\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \left(y_i - t_i\right)^2 + \lambda\,\Omega(w,b)\,,
$$
where $\lambda$ is a coefficient that controls the relative size (i.e., effect) of the regularization term $\Omega$, which is a function of the parameters $w$ and $b$.
Two well-established (and well-motivated) choices for the regularization term are:

<div class="center-table" style="font-size:14pt" markdown>

| Type | Form of $\mathbf{\Omega}$ | Used in |
| :--------: | :-----------: | -------------------------------------- |
| $L_2-$norm | $w^2 + b^2$ | [Ridge regression](https://en.wikipedia.org/wiki/Ridge_regression) |
| $L_1-$norm | $\lvert w\rvert + \lvert b\rvert$ | [LASSO](https://en.wikipedia.org/wiki/Lasso_&lpar;statistics&rpar;) |

</div>
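
For concreteness, here is a minimal NumPy sketch of this regularized loss for the linear model $y_i = w x_i + b$; the function name and the `reg` switch are illustrative choices, not part of any library.

```py
import numpy as np

def regularized_loss(w: float, b: float, x: np.ndarray, t: np.ndarray,
                     lam: float = 0.0, reg: str = "l2") -> float:
    """Mean squared error plus a regularization penalty (illustrative sketch)."""
    y = w * x + b                   # model predictions y_i
    mse = np.mean((y - t) ** 2)     # (1/N) sum_i (y_i - t_i)^2
    if reg == "l2":                 # Ridge-style penalty
        omega = w ** 2 + b ** 2
    else:                           # "l1": LASSO-style penalty
        omega = abs(w) + abs(b)
    return mse + lam * omega
```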

In this slightly more complicated but still simple case, and with a bit of algebra, we can find a closed-form solution for the parameters $w$ and $b$. The loss function here is strictly convex and thus has a unique minimum. With ==$L_2$== regularization, the solution can be found by solving the linear set of equations $\nabla\mathcal{L}=\vec{0}$ and is given by,

\begin{equation*}
\begin{split}
w &= \frac{\mathrm{cov}(X, T) + \lambda\langle XT\rangle}{
\mathrm{var}\,X + \lambda\langle X^2\rangle + \lambda(1+\lambda)
}\,,\\
b &= \frac{1}{1+\lambda}\left[\langle{T}\rangle - w\,\langle{X}\rangle\right]\,.
\end{split}
\end{equation*}

The capital letters denote the full vectors of features, $x$, and targets, $t$, in the dataset. The operators $\mathrm{cov}$ and $\mathrm{var}$ are the covariance and variance, respectively. They can be computed with the [NumPy](https://numpy.org/) functions `#!numpy numpy.cov` and `#!numpy numpy.var`. Finally, quantities enclosed in angled brackets, as in $\langle{X}\rangle$, or written with an overline, as in $\overline{X}$, denote the average (`#!numpy numpy.mean`) over the features in the dataset, and similarly for the targets and for the product of the features with the targets.
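
The closed-form expressions above translate into a short NumPy sketch; the function name is arbitrary, and the averages are computed with `numpy.mean` so that they match the angled-bracket notation.

```py
import numpy as np

def closed_form_fit(x: np.ndarray, t: np.ndarray, lam: float = 0.0) -> tuple[float, float]:
    """Closed-form solution with L2 regularization, following the expressions above."""
    x_mean, t_mean = np.mean(x), np.mean(t)
    xt_mean, x2_mean = np.mean(x * t), np.mean(x ** 2)
    cov_xt = xt_mean - x_mean * t_mean      # cov(X, T)
    var_x = x2_mean - x_mean ** 2           # var X

    w = (cov_xt + lam * xt_mean) / (var_x + lam * x2_mean + lam * (1.0 + lam))
    b = (t_mean - w * x_mean) / (1.0 + lam)
    return w, b
```

Setting `lam = 0` recovers the unregularized result, $w = \mathrm{cov}(X, T)/\mathrm{var}\,X$ and $b = \langle{T}\rangle - w\,\langle{X}\rangle$.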


## Estimating the error on the fitted parameters

!!! danger "This section is techincal"

One can get a lower bound on the variance of the fitted parameters (or, to be more technically correct, the variance of the _estimators_). This is known as the [Cramér-Rao](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound) bound, which states that this variance is bounded from below by the inverse of the Fisher information, i.e.,

\begin{equation*}
\mathrm{var}(\hat{\boldsymbol\theta})\geq I(\boldsymbol\theta)^{-1}\,,
\end{equation*}

where $I(\boldsymbol\theta)$ is the [Fisher information matrix](https://en.wikipedia.org/wiki/Fisher_information#Matrix_form) which is proportional to the Hessian of the loss function

\begin{equation*}
\left[I(\boldsymbol\theta)\right]_{ij} = -N\,\mathbb{E}\left[
\frac{\partial^2}{\partial\theta_i\partial\theta_j} \log\,f(\boldsymbol{X};\boldsymbol\theta)\Biggr|\boldsymbol\theta
\right]\,,
\end{equation*}

where $f$ is the likelihood function (1) which is related to our loss function via
{ .annotate }

1. I'm slightly confused here :confounded:; this (taken from Wikipedia) is not a likelihood but looks rather like a probability...

\begin{equation*}
-\log f = \frac{\mathcal{L}}{2S^2}\,.
\end{equation*}

Note that the (unbiased) _sample variance_, $S^2$, appears here if we assume a Gaussian likelihood. So, we can write $I_{ij}$ as

\begin{equation*}
\left[I(\boldsymbol\theta)\right]_{ij} = \frac{1}{2S^2}\,\frac{\partial^2\mathcal{L}}{\partial\theta_i\partial\theta_j} = \frac{1}{2S^2}\,H_{ij}\,,
\end{equation*}

where $H_{ij}$ is a matrix element of the Hessian matrix. Thus, the inverse of the Fisher information matrix is (dropping the explicit dependence on the arguments for simplicity)

\begin{equation*}
I^{-1} = 2S^2\,H^{-1} = \frac{2}{N-1}\sum_i^N(y_i-t_i)^2\,H^{-1}\,.
\end{equation*}

Denoting the estimators for $w$ and $b$ by $\hat{w}$ and $\hat{b}$, their (co)variances are thus,

!!! success "(Co)variances of the estimators"

\begin{equation*}
\begin{split}
\mathrm{var}\,\hat{w} &= \frac{1}{N-1}\sum_i^N (y_i-t_i)^2\times
\frac{1}{\mathrm{var}\,X}\left(\mathrm{var}\,X + \overline{X}^2\right)\,,\\
\mathrm{var}\,\hat{b} &= \frac{1}{N-1}\sum_i^N (y_i-t_i)^2\times
\frac{1}{\mathrm{var}\,X}\,,\\
\mathrm{cov}(\hat{w}, \hat{b}) &= \frac{-1}{N-1}\sum_i^N (y_i-t_i)^2\times
\frac{\overline{X}}{\mathrm{var}\,X}\,.
\end{split}
\end{equation*}
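
The boxed expressions can be transcribed directly into NumPy. The sketch below assumes the population convention for $\mathrm{var}\,X$ (the `numpy.var` default) and takes the fitted parameters $\hat{w}$ and $\hat{b}$ as inputs.

```py
import numpy as np

def estimator_covariances(x: np.ndarray, t: np.ndarray,
                          w_hat: float, b_hat: float) -> tuple[float, float, float]:
    """(Co)variances of the estimators, transcribed from the expressions above."""
    y = w_hat * x + b_hat                      # model predictions
    s2 = np.sum((y - t) ** 2) / (len(t) - 1)   # (1/(N-1)) sum_i (y_i - t_i)^2
    x_mean = np.mean(x)
    var_x = np.var(x)                          # population variance (ddof = 0)

    var_w = s2 * (var_x + x_mean ** 2) / var_x
    var_b = s2 / var_x
    cov_wb = -s2 * x_mean / var_x
    return var_w, var_b, cov_wb
```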

### Propagating the errors to the model

Let's keep using the hatted parameters (estimated from the data) to distinguish them from the true parameters. Our model is then
$$
y = \hat{w}x + \hat{b}.
$$

Propagating the uncertainties to $y$ (see this [Wikipedia article](https://en.wikipedia.org/wiki/Propagation_of_uncertainty#Example_formulae) for example), we have
$$
(\delta y)^2 = (\mathrm{var}\,\hat{w})\,x^2 + \mathrm{var}\,\hat{b} + 2x\,\mathrm{cov}(\hat{w}, \hat{b})\,.
$$
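
Using the (co)variances from the sketch above, the band on the model prediction can be evaluated as follows; the 95% interval is taken as roughly $\pm 1.96\,\delta y$, which assumes Gaussian errors.

```py
import numpy as np

def prediction_band(x: np.ndarray, w_hat: float, b_hat: float,
                    var_w: float, var_b: float, cov_wb: float):
    """Propagate the parameter uncertainties to the model prediction (sketch)."""
    y = w_hat * x + b_hat
    dy2 = var_w * x ** 2 + var_b + 2.0 * x * cov_wb   # (delta y)^2 from the formula above
    dy = np.sqrt(np.clip(dy2, 0.0, None))             # guard against negative round-off
    return y - 1.96 * dy, y + 1.96 * dy               # approximate 95% confidence band
```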


<figure markdown="span">
![Error band](../figures/error_band.png){ width="600" }
<figcaption>
The band shows the 95% confidence interval on the fit by propagating the uncertainty on the estimated parameters $\hat{w}$ and $\hat{b}$ as described in the main text.
</figcaption>
</figure>
90 changes: 1 addition & 89 deletions docs/linear_regression.md → docs/linreg/index.md
@@ -71,7 +71,7 @@ Things are getting a bit technical here and there is a lot to unpack so let's look
1. Okay, so where did the "more-than-one-variable" thing happen here? Recall that we want to optimize the loss function with respect to the parameters and our model has two of those!

<figure markdown="span">
![Image title](./figures/tangents.png){ width="600" }
![Image title](../figures/tangents.png){ width="600" }
<figcaption>The gray segments are the tangents to the blue curve at each red point. The slopes of the tangent lines are the values of the first derivative of the curve at each point.</figcaption>
</figure>

@@ -182,91 +182,3 @@ And here are a few things you should _definitely_ do:
- Plot the model parameters as a function of training steps
- Plot the data (`#!python plt.scatter`) and the fitted model


!!! info "Exact solution"

In this (simple) case, and with a bit of algebra, we can find a closed form solution for the parameters $w$ and $b$. The loss function here is strictly convex and thus has a unique minimum. The solution can be found by solving the linear set of equations $\nabla\mathcal{L}=\vec{0}$ and is given by,

\begin{equation*}
w = \frac{\mathrm{cov}(X, T)}{
\mathrm{var}\,X
}\,,\qquad
b = \overline{T} - w\,\overline{X}\,.
\end{equation*}

The capital letters mean the full vector of features, $x$, and targets, $t$, in the dataset. The operators $\mathrm{cov}$ and $\mathrm{var}$ are the covariance and variance respectively. They can be computed with the `numpy` functions `cov` and `var`. Finally, an overline as in $\overline{X}$ means the average (`numpy.mean`) over the features in the datasets and similarly for the targets.


## Estimating the error on the fitted parameters

!!! danger "This section is techincal"

One can get a lower bound on the variance of the fitted parameters (or, to be more technically correct, the variance of the _estimators_). This is known as the [Cramér-Rao](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound) bound which says this variance is bounded from below by the inverse of the Fisher information, i.e.,

\begin{equation*}
\mathrm{var}(\hat{\boldsymbol\theta})\geq I(\boldsymbol\theta)^{-1}\,,
\end{equation*}

where $I(\boldsymbol\theta)$ is the [Fisher information matrix](https://en.wikipedia.org/wiki/Fisher_information#Matrix_form) which is proportional to the Hessian of the loss function

\begin{equation*}
\left[I(\boldsymbol\theta)\right]_{ij} = -N\,\mathbb{E}\left[
\frac{\partial^2}{\partial\theta_i\partial\theta_j} \log\,f(\boldsymbol{X};\boldsymbol\theta)\Biggr|\boldsymbol\theta
\right]\,,
\end{equation*}

where $f$ is the likelihood function (1) which is related to our loss function via
{ .annotate }

1. I'm slightly confused here :confounded:, this (taken from Wikipedia) is not a likelihood but looks rather like a probability...

\begin{equation*}
-\log f = \frac{\mathcal{L}}{2S^2}\,.
\end{equation*}

Note that the (unbiased) _sample variance_, $S^2$, appears here if we assume a Gaussian likelihood. So, we can write $I_{ij}$ as

\begin{equation*}
\left[I(\boldsymbol\theta)\right]_{ij} = \frac{1}{2S^2}\,\frac{\partial^2\mathcal{L}}{\partial\theta_i\partial\theta_j} = \frac{1}{2S^2}\,H_{ij}\,,
\end{equation*}

where $H_{ij}$ is a matrix element of the Hessian matrix. Thus, the inverse of the Fisher information matrix is (dropping the explicit dependence on the arguments for simplicity)

\begin{equation*}
I^{-1} = 2S^2\,H^{-1} = \frac{2}{N-1}\sum_i^N(y_i-t_i)^2\,H^{-1}
\end{equation*}

Denoting the estimators for $w$ and $b$ by $\hat{w}$ and $\hat{b}$, their (co)variances are thus,

!!! success "(Co)variances of the estimators"

\begin{equation*}
\begin{split}
\mathrm{var}\,\hat{w} &= \frac{1}{N-1}\sum_i^N (y_i-t_i)^2\times
\frac{1}{\mathrm{var}\,X}\left(\mathrm{var}\,X + \overline{X}^2\right)\,,\\
\mathrm{var}\,\hat{b} &= \frac{1}{N-1}\sum_i^N (y_i-t_i)^2\times
\frac{1}{\mathrm{var}\,X}\,,\\
\mathrm{cov}(\hat{w}, \hat{b}) &= \frac{-1}{N-1}\sum_i^N (y_i-t_i)^2\times
\frac{\overline{X}}{\mathrm{var}\,X}\,.
\end{split}
\end{equation*}

### Propagating the errors to the model

Let's keep using the hatted parameters (estimated from the data) to distinguish them from the true parameters. Our model is then
$$
y = \hat{w}x + \hat{b}.
$$

Propagating the uncertainties to $y$, we have
$$
(\delta y)^2 = (\mathrm{var}\,\hat{w})\,x^2 + \mathrm{var}\,\hat{b} + 2x\,\mathrm{cov}(\hat{w}, \hat{b})\,.
$$


<figure markdown="span">
![Error band](./figures/error_band.png){ width="600" }
<figcaption>
The band shows the 95% confidence interval on the fit by propagating the uncertainty on the estimated parameters $\hat{w}$ and $\hat{b}$ as described in the main text.
</figcaption>
</figure>
12 changes: 12 additions & 0 deletions docs/nnets/index.md
@@ -0,0 +1,12 @@

# Neural networks

In this tutorial, we are going to implement a fully-connected feed-forward neural network from scratch. We will then use this neural network to classify the handwritten digits in the [MNIST dataset](https://yann.lecun.com/exdb/mnist/) [LeCun, Cortes, and Burges].

The following resources provide a nice basic introduction to neural networks:

- ???
- ???

We'll start by coding the neural network. Once that is done, we'll turn our attention to the MNIST dataset; in particular, we'll get familiar with it, plot a few of the digits, figure out how to transform the images into inputs for our neural network, etc.

44 changes: 44 additions & 0 deletions docs/nnets/neural_network_testing.md
@@ -0,0 +1,44 @@
---
status: new
---

# A simple dataset for testing

The (U.S.) National Institute of Standards and Technology (NIST) has a nice collection of datasets for non-linear regression along with 'certified' fit parameters.

The page that lists these datasets is [itl.nist.gov/div898/strd/nls/nls_main.shtml](https://www.itl.nist.gov/div898/strd/nls/nls_main.shtml).
You can choose whichever model you like, but to be concrete, I will take the `Chwirut2` [dataset](https://www.itl.nist.gov/div898/strd/nls/data/chwirut2.shtml), which is exponentially distributed and described by a three-parameter model.

Let's also implement the model using the [NIST certified parameters](https://www.itl.nist.gov/div898/strd/nls/data/LINKS/v-chwirut2.shtml). The model is
$$
f(x; \beta) + \epsilon = \frac{\exp(-\beta_1 x)}{\beta_2 + \beta_3 x} + \epsilon\,,
$$
with $\beta_1$, $\beta_2$, and $\beta_3$ given by
```
Certified Certified
Parameter Estimate Std. Dev. of Est.
beta(1) 1.6657666537E-01 3.8303286810E-02
beta(2) 5.1653291286E-03 6.6621605126E-04
beta(3) 1.2150007096E-02 1.5304234767E-03
```

Here is the Python implementation of this model with the best-fit parameters.
```py
import numpy as np

def fcert(x: np.ndarray) -> np.ndarray:
    """Chwirut2 model evaluated at the NIST certified best-fit parameters."""
    beta1 = 1.6657666537E-01
    beta2 = 5.1653291286E-03
    beta3 = 1.2150007096E-02

    return np.exp(-beta1*x) / (beta2 + beta3*x)
```
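
For instance, one could evaluate the certified model on a grid and overlay it on the data; the grid range below is only an illustrative choice, not taken from the NIST page.

```py
import numpy as np
import matplotlib.pyplot as plt

x_grid = np.linspace(0.5, 6.0, 200)   # illustrative range; adjust to the actual data
plt.plot(x_grid, fcert(x_grid), label="NIST certified fit")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.legend()
plt.show()
```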

!!! danger "Normalizing the dataset"

For many reasons, neural networks **do** care about the normalization of the data. In particular, when using the `sigmoid` activation function, which has a range of $[0,1]$, normalizing the data will save a lot of unnecessary frustration.
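
A simple way to bring the data into the sigmoid's range is a min-max rescaling; here is a sketch, with hypothetical array names for the raw inputs and targets.

```py
import numpy as np

def minmax_scale(a: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Rescale an array to [0, 1] and return the constants needed to undo the scaling."""
    a_min, a_max = a.min(), a.max()
    return (a - a_min) / (a_max - a_min), a_min, a_max

# x_raw, y_raw are hypothetical arrays holding the Chwirut2 inputs and targets
# x_scaled, x_min, x_max = minmax_scale(x_raw)
# y_scaled, y_min, y_max = minmax_scale(y_raw)
```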

<figure markdown="span">
![Image title](../figures/y_preds.gif){ width="600" }
<figcaption>Network prediction during training.</figcaption>
</figure>

50 changes: 1 addition & 49 deletions docs/neural_networks.md → docs/nnets/neural_networks.md
Expand Up @@ -2,19 +2,7 @@
status: new
---

# Neural networks

In this tutorial, we are going to implement a fully-connected feed-forward neural network from scratch. We will then use this neural network to classify the handwritten digits in [MNIST dataset](https://yann.lecun.com/exdb/mnist/) [LeCun, Cortes, and Burges].

The following resources provide a nice basic introduction to neural networks:

- ???
- ???

We'll start by coding the neural network. Once done, we'll turn our attention to the MNIST dataset, in particular, we'll try to get familiar with it, plot a few of the digits, figure out how to transform the images into an input for our neural network, etc.


## The `Network` `#!py class`
# The `Network` `#!py class`

To efficiently and cleanly write the necessary code, it is very useful to create a `#!python class`. To read about classes in Python see [realpython.com/python-classes](https://realpython.com/python-classes/) (reading the [Getting Started With Python Classes](https://realpython.com/python-classes/#getting-started-with-python-classes) section will at least get you familiar with the basic concepts).
The class definition and the initialization code (`#!python __init__` method) as well as the definition of the required methods are given just below. Your task is to understand the relevant algorithms and fill in the corresponding methods. Let's go!
@@ -146,39 +134,3 @@ class Network:
...
return None
```

## A simple dataset for testing

The (U.S.) National Institute of Standards and Technology (NIST) has a nice collection of datasets for non-linear regression along with 'certified' fit parameters.

The page that lists these datasets is [itl.nist.gov/div898/strd/nls/nls_main.shtml](https://www.itl.nist.gov/div898/strd/nls/nls_main.shtml).
You can choose whichever model you like. But to be concrete here, I will take the `Chwirut2` [dataset](https://www.itl.nist.gov/div898/strd/nls/data/chwirut2.shtml) which is exponential-distributed and described by a 3-parameter model.

Let's also implement the model using the [NIST certified parameters](https://www.itl.nist.gov/div898/strd/nls/data/LINKS/v-chwirut2.shtml). The model is
$$
f(x; \beta) + \epsilon = \frac{\exp(-\beta_1 x)}{\beta_2 + \beta_3 x} + \epsilon\,,
$$
with $\beta_1$, $\beta_2$, and $\beta_3$ given by
```
Certified Certified
Parameter Estimate Std. Dev. of Est.
beta(1) 1.6657666537E-01 3.8303286810E-02
beta(2) 5.1653291286E-03 6.6621605126E-04
beta(3) 1.2150007096E-02 1.5304234767E-03
```

Here is the Python implementation of this model with the best-fit parameters.
```py
def fcert(x: np.ndarray) -> float:

    beta1 = 1.6657666537E-01
    beta2 = 5.1653291286E-03
    beta3 = 1.2150007096E-02

    return np.exp(-beta1*x) / (beta2 + beta3*x)
```

!!! danger "Normalizing the dataset"

For many reasons, neural networks **do** care about the normalization of the data. In particular, when using the `sigmoid` activation function which has a range $\in [0,1]$, this would save a lot of unnecessary frustration.

8 changes: 8 additions & 0 deletions docs/stylesheets/extra.css
@@ -2,3 +2,11 @@
.md-typeset details {
font-size: 100%;
}

.center-table {
text-align: center;
}
.md-typeset .center-table :is(td,th):not([align]) {
text-align: initial; /* Reset alignment for table cells */
font-size: 15px;
}
13 changes: 12 additions & 1 deletion mkdocs.yml
@@ -30,11 +30,21 @@ theme:
  features:
    - content.code.select
    - content.code.copy
    - navigation.indexes
    - navigation.tabs
    - navigation.tabs.sticky
    - navigation.footer

nav:
  - Home: index.md
  - Linear Regression:
    - linreg/index.md
    - Closed-form solution: linreg/closed_form.md
  - Warm-up: linear_regression.md
  - Neural networks: neural_networks.md
  - Neural Networks:
    - nnets/index.md
    - The neural network class: nnets/neural_networks.md
    - Testing with a simple dataset: nnets/neural_network_testing.md

plugins:
- search
@@ -43,6 +53,7 @@ markdown_extensions:
  - admonition
  - attr_list
  - footnotes
  - tables
  - md_in_html
  - pymdownx.arithmatex:
      generic: true
