Lauriemaynard 2022 #14

Merged (11 commits) on Mar 16, 2022
104 changes: 71 additions & 33 deletions book-en/02-introduction.Rmd
@@ -48,7 +48,7 @@
$$\mathbb{C}{\rm ov} (\epsilon_i, \epsilon_j) = 0,\ i \neq j$$
3\. And, the residuals are normal:

$$\boldsymbol{\varepsilon} | \mathbf{X} \sim \mathcal{N} \left( \mathbf{0}, \sigma^2_\epsilon \mathbf{I} \right)$$

The estimation of general linear models, as in $\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 X$, assumes that the data are generated following these assumptions.

```{r echo = FALSE, fig.width = 8, fig.height = 8}
# Set the coefficients:
@@ -76,8 +76,63 @@
legend(x = 0, y = 25,
```


## An example with general linear models

Let's simulate 250 observations that satisfy our assumptions: $\epsilon_i \sim \mathcal{N}(0, 2^2),\ i = 1, ..., 250$.


```{r}
nSamples <- 250
ID <- factor(1:nSamples)

# Draw the predictor from a uniform distribution between 0 and 50
PredVar <- runif(nSamples,
                 min = 0,
                 max = 50)

# Build the response: slope of 2, plus normal errors (mean 0, sd 2)
simNormData <- data.frame(
  ID = ID,
  PredVar = PredVar,
  RespVar = 2 * PredVar +
    rnorm(nSamples,
          mean = 0,
          sd = 2)
)

# We have learned how to use lm()
lm.simNormData <- lm(RespVar ~ PredVar,
                     data = simNormData)
```

```{r}
# Arrange the four diagnostic plots of lm() in a 2 x 2 grid
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(lm.simNormData)
```


1. The residuals vs. fitted plot allows one to check the assumptions of linearity and homoscedasticity;

2. The QQ plot allows the comparison of the residuals to "ideal" normal observations;

3. The scale-location plot (square root of the standardized residuals vs. fitted values) is useful for checking the assumption of homoscedasticity;

4. The residuals vs. leverage plot shows Cook's distance, a measure of the influence of each observation on the regression coefficients, which helps identify influential observations.

Residuals are $Y-\widehat{Y}$, or the observed value minus the predicted value.

Outliers are observations $Y$ with large residuals, i.e. the observed value $Y$ for a point $X$ is very different from the one predicted by the regression model $\widehat{Y}$.

A leverage point is defined as an observation $Y$ that has a value of $x$ that is far away from the mean of $x$.

An influential observation is defined as an observation $Y$ that changes the slope of the line $\beta_1$. Thus, influential points have a large influence on the fit of the model. One method to find influential points is to compare the fit of the model with and without each observation.
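
As a rough illustration with the simulated model from above (`lm.simNormData`), base `R` already computes these quantities; the cut-off used for Cook's distance below is only a common rule of thumb, not a strict threshold:

```{r}
# Residuals: observed values minus fitted (predicted) values
head(simNormData$RespVar - fitted(lm.simNormData))

# Leverage of each observation (hat values)
head(hatvalues(lm.simNormData))

# Cook's distance: influence of each observation on the coefficients
cooks.d <- cooks.distance(lm.simNormData)
which(cooks.d > 4 / nSamples) # flag potentially influential observations
```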

## Example with real data

Let us use our prior knowledge of general linear models to explore the relationship between variables within the *Oribatid mite dataset*.

Let us begin by loading this dataset into `R`:
@@ -89,13 +144,12 @@
mites <- read.csv('data/mites.csv',
                  stringsAsFactors = TRUE)
```

The dataset that you just loaded is a subset of the classic [Oribatid mites (Acari, Oribatei)](http://adn.biol.umontreal.ca/~numericalecology/data/oribates.html) dataset, which has been used in numerous texts (e.g. Borcard, Gillet & Legendre, *Numerical Ecology with R*), and which is available in the `vegan` library.

The Oribatid mite dataset has **70 observations with moss and mite samples** collected at the [Station de Biologie des Laurentides of the Université de Montréal](https://goo.gl/maps/PxN1Q7KUPnUt92Eu5), within the municipality of Saint-Hippolyte, Québec (Canada). Each sample includes **5** variables of **environmental measurements** and the abundance of *Galumna* sp. for each site.

We can peek into the structure and the first six rows of the dataset using the `head()` and `str()` functions:
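
A minimal sketch of that check:

```{r, eval = FALSE}
head(mites) # first six rows
str(mites)  # structure: variable names and types
```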

@@ -161,6 +215,8 @@
$\text{Abundance} = f(\text{Substrate}, \epsilon)$
</div>
</div>

Can we see a relationship between *Galumna* and any of the five environmental variables?

Let us attempt to be more specific and ask **whether *Galumna*'s community values (abundance, occurrence and relative frequency) vary as a function of water content**.

We can begin by representing all three response variables against the predictor:
@@ -189,7 +245,7 @@
plot(prop ~ WatrCont,
Indeed, `Galumna` seems to vary negatively as a function of `WatrCont`, *i.e.*
*Galumna* sp. seems to prefer drier sites.

We can go a step further and fit general linear models to test whether `Galumna`, `pa`, or `prop` vary as a function of `WatrCont`.

```{r, eval = -c(2, 5, 8)}
# Fit the models
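
# A sketch of the three fits; lm.abund is reused below, while the names
# lm.pa and lm.prop are assumed here for the two other models
lm.abund <- lm(Galumna ~ WatrCont, data = mites)
summary(lm.abund)

lm.pa <- lm(pa ~ WatrCont, data = mites)
summary(lm.pa)

lm.prop <- lm(prop ~ WatrCont, data = mites)
summary(lm.prop)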
@@ -233,20 +289,14 @@

We can see a strong and significant relationship between water content and each of the three response variables!

Nevertheless, we cannot forget the most important step: verifying whether the assumptions of these general linear models have been violated.

Let us begin by validating these models to make sure that the assumptions of linear models are respected, starting with the abundance model.

```{r, echo = TRUE, eval = TRUE, fig.width = 7, fig.height = 7, fig.align = "center"}
# Plot the abundance model
plot(Galumna ~ WatrCont, data = mites)
abline(lm.abund)
```

The model does not fit well. It predicts negative abundance values when
`WatrCont` exceeds 600, which does not make any sense. Also, the model
does poorly at predicting high abundance values at low values of `WatrCont`.

We can also check the model diagnostic plots:
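
A minimal sketch of that check, assuming the abundance model is stored as `lm.abund` (as above):

```{r, eval = FALSE}
# Display the four diagnostic plots in a 2 x 2 grid
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(lm.abund)
```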
@@ -294,22 +344,10 @@
Let's take a step back here and review the assumptions of linear models, and where they come from:

$$Y_i = \beta_0 + \beta_1X_i + \varepsilon_i$$

where:

- $Y_i$ is the value of the response variable
- $\beta_0$ is the *unknown coefficient* **intercept**
- $\beta_1$ is the *unknown coefficient* **slope**
- $X_i$ is the value of the explanatory variable
- $\varepsilon_i$ is the model residual, drawn from a normal distribution with a mean of zero and a constant variance.


The last entry $\varepsilon_i$ is important. This is where assumptions of
normality and homoscedasticity originate. In linear models, **the residuals $\varepsilon_i$ (the distance between each observation and the regression line) can be predicted by drawing random values from a normal distribution.**
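
With the simulated model from above, a quick (informal) way to see this is to compare the observed residuals with random draws from a normal distribution with the same standard deviation:

```{r, eval = FALSE}
# Observed residuals of the fitted model
hist(residuals(lm.simNormData), main = "Model residuals")

# Random values drawn from a normal distribution with the same sd
hist(rnorm(nSamples, mean = 0, sd = sigma(lm.simNormData)),
     main = "Random normal draws")
```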

Recall that all normal distributions have two parameters, $\mu$ (the mean of the distribution) and $\sigma^2$ (the variance of the distribution). In a linear model, $\mu$ changes based on values of $X$ (the predictor
variable), but $\sigma^2$ has the same value for all values of $Y$. Our simple linear model can also be written as follows:

$$Y_i \sim N(\mu = \beta_0 + \beta_1 X_i, \sigma^2)$$
@@ -380,7 +418,7 @@
This tells us that randomly drawn $Y$ values when water content is $300$ should follow a normal distribution with parameters $\mu = 3.44 + (-0.006 \times 300)$ and $\sigma^2 = 1.51$. At a water content of 400, residuals $\varepsilon$ should follow a normal distribution with parameters $\mu = 3.44 + (-0.006 \times 400) = 1.02$ and $\sigma^2 = 1.51$, etc. Each $Y$ value is modeled using a normal distribution with a mean that depends on $X_i$, but with a variance that is constant ($\sigma^2 = 1.51$) across all $X_i$ values. Graphically, this looks like this:

![](images/modelPredic.png){width="400"}

The four normal distributions on this graph represent the probability of observing a given *Galumna* abundance for 4 different water content values.
@@ -401,9 +439,9 @@
Very often, data will not "behave" and will violate the assumptions we have seen.

We have been told to **transform** our data using logarithmic, square-root, and cosine transformations to get around these problems. Unfortunately, transformations do not always work and come with a few drawbacks:

**1.** They change the response variable (!), making interpretation challenging;\
**2.** They may not simultaneously improve linearity and homogeneity of variance;\
**3.** The boundaries of the sample space change.

For instance, our simple linear model:

9 changes: 3 additions & 6 deletions book-en/03-distributions.Rmd
@@ -153,15 +153,12 @@
equation:

$$Y_i \sim Poisson(\lambda = \beta_0 + \beta_1X_i)$$

Notice that $\lambda$ varies as a function of $X$ (water content), meaning that the residual variance will also vary with $X$. This means that we just relaxed the homogeneity of variance assumption! Also, predicted values will now be integers instead of fractions because they will all be drawn from Poisson distributions with different $\lambda$ values. The model will never predict negative values because Poisson distributions have strictly positive ranges. By simply switching the distribution of error terms ($\varepsilon$) from normal to Poisson, we solved most of the problems of our abundance linear model. This new model is *almost* a Poisson generalized linear model, which basically looks like this:

```{r echo=FALSE, fig.align="center", fig.width=7, fig.height = 5}
48 changes: 35 additions & 13 deletions book-en/04-glm.Rmd
@@ -1,36 +1,58 @@
# Generalized linear models

To avoid any bias from general linear models, we need to specify two things:

1\. **a statistical distribution for the residuals of the model**;

2\. **a link function** for the predicted values of the model.

For a linear regression with a continuous response variable distributed normally, the predicted values of $Y$ are explained by this equation:

$$\mu = \beta x$$

where:\
- $\mu$ is the predicted value of the response variable\
- $x$ is the model matrix (*i.e.* the predictor variables)\
- $\beta$ is the vector of parameters estimated from the data (*i.e.* the intercept and the slope).

$\beta x$ is called the **linear predictor**. In mathematical terms, it is the matrix product of the model matrix $x$ and the vector of estimated parameters $\beta$.
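
To make this concrete, the linear predictor can be computed by hand from a fitted model; here is a rough sketch using the mite abundance regression (the object name `m` is only illustrative, and this assumes the `mites` data are loaded):

```{r, eval = FALSE}
m <- lm(Galumna ~ WatrCont, data = mites)
X <- model.matrix(m) # model matrix x (intercept and predictor)
b <- coef(m)         # estimated parameters beta
head(X %*% b)        # matrix product beta*x = the linear predictor
head(fitted(m))      # identical for a normally distributed response
```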

For a linear regression with a continuous response variable distributed normally, the linear predictor $\beta x$ is equivalent to the expected values of the response variable. *When the response variable is not normally distributed, this statement is not true.* In this situation, we need to transform the predicted values $\mu$ with a **link function** to obtain a linear relationship with the linear predictor:

$$g(\mu_i) = \eta_i$$

where $g(\mu)$ is the link function for the predicted values. Therefore, it removes the restriction on the residuals.

The $\mu / (1-\mu)$ ratio represents the odds of an event $Y$ occurring. It transforms our predicted values to a scale from 0 to `+Inf`. For instance, if the probability of failing this course is 0.6, the odds of failing are $0.6/(1 - 0.6) = 1.5$. This indicates that the probability of failing the class is 1.5 times higher than the probability of passing it (which is $1.5 \times 0.4 = 0.6$).

To complete our model, we also need a **variance** function, which describes how the variance $\text{var}(Y_i)$ depends on the mean:

$$\text{var}(Y_i) = \phi V(\mu)$$

where the **dispersion parameter** $\phi$ is constant.

With the linear predictor, our generalized linear model would be:


$$\underbrace{g(\mu)}_{Link~~function} = \underbrace{\beta_0 + \beta_1X_1~+~...~+~\beta_pX_p}_{Linear~component}$$
A **general linear model** with $\epsilon \sim N(0, \sigma^2)$ and $Y_i \sim N(\mu_i, \sigma^2)$ has:

1. an **identity** link function:

$$g(\mu_i) = \mu_i$$

2. and a variance function:

$$V(\mu_i) = 1$$

which results in:

$$\underbrace{g(\mu)}_{Link~~function} = \underbrace{\beta_0 + \beta_1X_1~+~...~+~\beta_pX_p}_{Linear~component}$$

*This is equivalent to the normal linear model that we have learned before!*

When our response variable is binary, the link function is a *logit* function:

$$\eta_i = \text{logit}(\mu_i) = \log(\frac{\mu_i}{1-\mu_i})$$

The log transformation allows the values to be distributed from `-Inf` to `+Inf`. We can now link the predicted values with the linear predictor $\beta x$. This is the reason why we still qualify this model as **linear**, even though the relationship is not a "straight line".
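
A quick numerical illustration of this back-and-forth, using the built-in logit (`qlogis()`) and inverse-logit (`plogis()`) functions:

```{r, eval = FALSE}
p <- 0.6          # probability of failing the course
p / (1 - p)       # odds: 1.5
qlogis(p)         # logit (log odds): log(1.5), about 0.41
plogis(qlogis(p)) # the inverse logit recovers the probability: 0.6
```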

In `R`, we can fit generalized linear models using the `glm()` function, which is similar to the `lm()` function, with the `family` argument taking the names of the **link function** (inside `link`) and the **variance function**:

```{r, eval = FALSE, echo = TRUE}
# This is what the glm() function syntax looks like (don't run this)
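
# A generic sketch of the call; 'response', 'predictor' and 'dataset' are
# placeholders, and gaussian(link = "identity") is just one possible family
glm(response ~ predictor,
    family = gaussian(link = "identity"),
    data = dataset)
```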