glmmbernoulli.qmd

---
fig-width: 4
fig-height: 3
fig-dpi: 150
fig-format: png
engine: julia
julia:
  exeflags: ["--project"]
---

# Generalized linear mixed models for binary responses {#sec-glmmbinomial}

\newcommand\bbb{{\mathbf{b}}}
\newcommand\bbX{{\mathbf{X}}}
\newcommand\bbx{{\mathbf{x}}}
\newcommand\bby{{\mathbf{y}}}
\newcommand\bbZ{{\mathbf{Z}}}
\newcommand\bbbeta{{\boldsymbol{\beta}}}
\newcommand\bbeta{{\boldsymbol{\eta}}}
\newcommand\bbmu{{\boldsymbol{\mu}}}
\newcommand\bbtheta{{\boldsymbol{\theta}}}
\newcommand\mcN{{\mathcal{N}}}
\newcommand\mcB{{\mathcal{B}}}
\newcommand\mcY{{\mathcal{Y}}}

Attach the packages to be used in this chapter

```{julia}
#| code-fold: true
#| output: false
#| label: packages06
using AlgebraOfGraphics
using CairoMakie
using DataFrames         # only used for `describe`
using EmbraceUncertainty: dataset
using FreqTables
using MixedModels
using MixedModelsMakie
using StatsBase
using StatsModels
```

and defined some constants, if not already defined.

```{julia}
#| code-fold: true
#| output: false
#| label: constants06
@isdefined(contrasts) || const contrasts = Dict{Symbol,Any}()
@isdefined(nAGQ) || const nAGQ = 9
@isdefined(progress) || const progress = false
```

In this chapter we consider mixed-effects models for data sets in which the response is *binary*, representing  yes/no or true/false or correct/incorrect responses.

Because the response must be one of only two possible values we adapt our models to predict the probability of the positive response.
As for linear models and linear mixed-effects models, the mean response, $\bbmu$, is determined by a *linear predictor*,
$$
\bbeta=\bbX\bbbeta+\bbZ\bbb
$$ {#eq-linearpred}
depending on the fixed-effects parameters, $\bbbeta$, the random effects, $\bbb$, and the model matrices, $\bbX$ and $\bbZ$.
For a linear model the mean response, $\bbmu$, is equal to the linear predictor, $\bbeta$.
But for a generalized linear model $\bbeta$ determines $\bbmu$ according to a *link function*, $g$.
For historical reasons it is the function taking an element of $\bbmu$ to the corresponding element of $\bbeta$ that is called the link.
The transformation in the opposite direction, from $\bbeta$ to $\bbmu$, is called the *inverse link*.

As in previous chapters, we will begin with an example to help illustrate these ideas.

## Artificial contraception use in regions of Bangladesh {#sec-contraception}

One of the test data sets from the Center for Multilevel Modelling, University of Bristol is derived from the 1989 Bangladesh Fertility Survey, [@huq:cleland:1990].
The data are a subsample of 1934 women selected from 60 of the 64 political districts or *zila*, available as the `contra` data set in the `MixedModels` package.

```{julia}
contra = Table(dataset(:contra))
```

with summary

```{julia}
#| code-fold: true
let df = DataFrame(contra)
  describe(df, :mean, :min, :median, :max, :nunique, :eltype)
end
```

The response of interest is `use`, whether the woman chooses to use artificial contraception, which is a *binary* response with only two possible values, `N` and `Y`.
The covariates include the district (`dist`) in which the woman resides, the number of live children she currently has (`livch`, coded as `0`, `1`, `2`, and `3+`), her `age`, and `urban`, also coded as `N` and `Y`, indicating rural or urban.

Note that the mean of these `age` values is close to zero but not exactly zero.
This occurs when the values have been centered about the sample mean then rounded.
In this case it appears that the ages were recorded as a whole number of years and the mean was rounded to two decimal places ending in `.56`.
Thus, all the negative values end in `.56` and the positive values end in `.44`.
The mean of these rounded, centered values is not exactly zero because of the rounding after centering.

Centering the ages allows for meaningful interpretation of intercepts and fixed effects for other covariates, because they refer to an age within the range of the observed ages.
Regretably, the information on what the centering age (i.e. the original mean age) was does not seem to be available.

### Plotting the binary response {#sec-plottingbinary}

Producing informative graphical displays of a binary response as it relates to covariates is somewhat more challenging that the corresponding plots for responses on a continuous scale.
If we were to plot the 1934 responses as 0/1 values versus, for example, the woman's centered age, we would end up with a rather uninformative plot, because all the points would fall on one of two horizontal lines, at $y=0$ and $y=1$.

One approach to illustrating the structure of the data more effectively is to add *scatterplot smoother* lines to the plot,
as in @fig-contradata,

```{julia}
#| code-fold: true
#| fig-cap: Smoothed relative frequency of contraception use versus centered age for women in the 1989 Bangladesh Fertility Survey
#| label: fig-contradata
#| warning: false
draw(
  data(contra) *
  mapping(
    :age => "Centered age (yr)",
    :use => ==("Y") => "Frequency of contraceptive use";
    col=:urban => renamer(["N" => "Rural", "Y" => "Urban"]),
    color=:livch,
  ) *
  smooth();
  figure=(; size=(650, 400)),
)
```

to show the trend in the response with respect to the covariate.
Once we have the smoother lines in such a plot we can omit the data points themselves, as we did here, because they add very little information.

The first thing to notice about the plot is that the proportion of women using contraception is not linear in age, which, on reflection, makes sense.
A woman in the middle of this age range (probably corresponding to an age around 25) is more likely to use artificial contraception than is a girl in her early teens or a woman in her mid-forties.
We also see that women in an urban setting are more likely to use contraception than those in a rural setting and that women with no live children are less likely to use contraception than are women who already have children.
There do not seem to be strong differences between women who have 1, 2 or 3 or more children compared to the differences between women with children and those without children.

Interestingly, the quadratic pattern with respect to age does not seem to have been noticed in other analyses of these data.
Comparisons of model fits through different software systems, as provided by the Center for Multilevel Modelling, incorporate only a linear term in age, even though the pattern is clearly nonlinear.
The lesson here is similar to what we have seen in other examples; careful plotting of the data should, whenever possible, precede attempts to fit models to the data.

### Initial GLMM fit to the contraception data {#sec-contraglmm}

Fitting a generalized linear mixed model (GLMM) is very similar to fitting a linear mixed model (LMM).
We call the `fit` function with the first three arguments being the model type, `MixedModel`, then a formula specifying the response, fixed-effects terms and random-effects terms, then a data table.
For GLMMs we also specify a fourth positional argument which is a distribution - in this case `Bernoulli()`.

Establishing the contrasts and fitting a preliminary model with random effects for `dist`, main effects for `livch`, `age`, `age^2`, and `urban`, plus interaction terms for `age & urban` and `age^2 & urban` can be done as

```{julia}
contrasts[:livch] = EffectsCoding(; base="0")
contrasts[:urban] = HelmertCoding()

com01 =
  let d = contra,
    ds = Bernoulli(),
    f = @formula(use ~
      1 + livch + (age + abs2(age)) * urban + (1 | dist))

  fit(MixedModel, f, d, ds; contrasts, nAGQ, progress)
end
```

Notice that in the formula language defined by the [StatsModels](https://github.com/JuliaStats/StatsModels.jl) package, an interaction term is written with the `&` operator.
*Crossing* of terms, which generates main effects and interactions, is written with the `*` operator (as in the formula language in [R](https://r-project.org)).
An interaction of a numeric variable with itself is performed by multiplication so the coefficient labelled `age & age` in the table is the quadratic term in `age`.
(Notice that, in this formula language, `age * age`, which may easily be interpreted as `age^2`, expands to `age + age^2`.)
Thus, what is written in the formula as a three-way interaction `age * age * urban` becomes an `urban` contrast plus linear and quadratic terms for `age` and each of their interactions with the `urban` contrast, which will be coded as `-1` for `N` and `+1` for `Y`.

A fifth positional argument can be used to specify the link function, described in @sec-glmmlink, but in most cases the canonical link function for the distribution is used.
In the case of the Bernoulli distribution the canonical link is the *logit* link.

As for LMMs, the named argument `contrasts` specifies the contrasts to apply to some of the covariates in a key-value dictionary.
Another named argument, `nAGQ`, specifies the number of quadrature points to use in an *adaptive Gauss-Hermite quadrature* rule for evaluating the deviance (see @sec-aGHQ for details).
A small, odd number, such as `nAGQ=9` defined in the first code block of this chapter, is the preferred choice.

The interpretation of the coefficients in this model is somewhat different from the linear mixed models coefficients that we examined previously, but many of the model-building steps are similar.
A rough assessment of the utility of a particular term in the fixed-effects part of the model can be obtained from examining the estimates of the coefficients associated with it and their standard errors, which are the basis for the `z` (z-statistic) and `p` (p-value) columns in the table.
However, these p-values are even more approximate than those provided for LMMs.
To perform a more accurate test of whether a particular term is useful we omit it from the model, refit and compare the reduced model fit to the original according to the change in deviance.

We will examine the terms in the model first and discuss the interpretation of the coefficients in @sec-glmmlink.

### Model building for the contra data {#sec-contramodelbuilding}

We noted that @fig-contradata shows similar patterns for women with children, whether they have one, two, or three or more children.
We have set the contrasts for the `livch` factor to be offsets relative to the reference level, in this case women who do not have any live children.
Although the coefficients labeled `livch: 1`, `livch: 2`, and `livch: 3+` are all large relative to their standard errors, they are reasonably close to each other.
This confirms our earlier impression that the main distinction is between women with children and those without and, for those who do have children, the number of children is not an important distinction.

After incorporating a new variable `ch` --- an indicator of whether the woman
has any children --- in the data,

```{julia}
contrasts[:ch] = HelmertCoding();
contra = Table(contra; ch=contra.livch .!= "0")
```

```{julia}
#| code-fold: true
let df = DataFrame(contra)
  describe(df, :mean, :min, :median, :max, :nunique, :eltype)
end
```

we fit a reduced model.

```{julia}
com02 =
  let d = contra,
    ds = Bernoulli(),
    f = @formula(use ~ 1 + ch + age * age * urban + (1 | dist))

    fit(MixedModel, f, d, ds; contrasts, nAGQ, progress)
  end
```

Comparing this model to the previous model

```{julia}
MixedModels.likelihoodratiotest(com02, com01)
```

indicates that the reduced model is adequate.

Apparently neither the second-order interaction `age & urban` nor the third-order interaction `age & age & urban` is significant and we fit a model without these terms.

```{julia}
com03 =
  let f = @formula(use ~ 1 + urban + ch + age * age + (1 | dist)),
    d = contra,
    ds = Bernoulli()

    fit(MixedModel, f, d, ds; contrasts, nAGQ, progress)
  end
```

A likelihood ratio test

```{julia}
MixedModels.likelihoodratiotest(com03, com02)
```

indicates that these terms can safely be eliminated.

A plot of the smoothed observed proportions versus centered age by `urban` and `ch`, @fig-contradata2,

```{julia}
#| code-fold: true
#| fig-cap: Smoothed relative frequency of contraception use versus centered age for women in the 1989 Bangladesh Fertility Survey.  The livch factor has been collapsed to children/nochildren.
#| label: fig-contradata2
#| warning: false
draw(
  data(contra) *
  mapping(
    :age => "Centered age (yr)",
    :use => ==("Y") => "Frequency of contraceptive use";
    col=:urban => renamer(["N" => "Rural", "Y" => "Urban"]),
    color=:ch => "Children",
  ) *
  smooth();
  figure=(; size=(650, 400)),
)
```

indicates that all four groups have a quadratic trend with respect to age but the location of the peak proportion is shifted for those without children relative to those with children.
Incorporating an interaction of `age` and `ch` allows for such a shift.

```{julia}
com04 =
  let f =
      @formula(use ~ 1 + urban + ch * age + abs2(age) + (1 | dist)),
    d = contra,
    ds = Bernoulli()

    fit(MixedModel, f, d, ds; contrasts, nAGQ, progress)
  end
```

Comparing this fitted model to the previous one

```{julia}
MixedModels.likelihoodratiotest(com03, com04)
```

confirms the usefulness of this term.

A series of such model fits led to a model with random effects for the combinations of `dist` and `urban`, because differences between urban and rural women in the same district were comparable to differences between districts, even after accounting for an effect of `urban` at the fixed-effects (or population) level.

```{julia}
com05 =
  let f = @formula(use ~
      1 + urban + ch * age + abs2(age) + (1 | dist & urban)),
    d = contra,
    ds = Bernoulli()

    fit(MixedModel, f, d, ds; contrasts, nAGQ, progress)
  end
```

In more detail,

```{julia}
#| code-fold: true
print(com05)
```

Notice that, although there are 60 distinct districts, there are only 102 distinct combinations of `dist` and `urban` represented in the data.
In 15 of the 60 districts there are no rural women in the sample and in 3 districts there are no urban women in the sample, as shown in a frequency table

```{julia}
#| code-fold: true
freqtable(contra, :urban, :dist)
```

## Link functions and interpreting coefficients {#sec-glmmlink}

To this point the only difference we have encountered between fitting generalized linear mixed models (GLMMs) and linear mixed models (LMMs) is the need to specify the distribution family in a call to `fit`.
The formula specification is identical and the assessment of the significance of terms using likelihood ratio tests is similar.
This is intentional.
We have emphasized the use of likelihood ratio tests on terms, whether fixed-effects or random-effects terms, exactly so the approach will be general.

However, the interpretation of the coefficient estimates in the different types of models is different.
In a linear mixed model the conditional mean (or "expected value") of the response given the random effects is simply the value of the *linear predictor*, $\bbmu=\bbeta=\bbX\bbbeta+\bbZ\bbb$ .
That is, if we assume that we know the values of the fixed-effects parameters, $\bbbeta$, and the random effects, $\bbb$, then the expected response for a particular combination of covariate values is a linear combination of these coefficients where the particular linear combination is determined by values of the covariates for that observation.
Individual coefficients can be interpreted as slopes of the fitted response with respect to a numeric covariate or as shifts between levels of a categorical covariate.

It is worthwhile emphasizing this relationship, and how it is used to form predictions from the model, by illustrating the process on a grid of covariate values.
To this end, we will take a brief excursion into creating grids of covariate values and evaluating linear predictors.

### Creating grids of covariate values {#sec-covariategrid}

Suppose that we wish to plot the "population" linear predictor values from model `com05`.
That is, we will consider only the fixed-effects terms in the model formula and ignore the random effects.
We want to create a plot like @fig-contradata2 by evaluating the linear predictor for a range of `age` values for each of the combinations of `ch` and `urban`.

First we construct a table of covariate values, `newdata`, containing the *Cartesian product* of values of these covariates, created using a [generator expression](https://docs.julialang.org/en/v1/manual/arrays/#Generator-Expressions) returning [NamedTuples](https://docs.julialang.org/en/v1/base/base/#Core.NamedTuple), which is the standard row-wise representation of tabular data.

```{julia}
newdata = Table(
  (; age, ch, urban) for age in -10:3:20, ch in [false, true],
  urban in ["N", "Y"]
)
```

Next we isolate the fixed-effects terms from the formula for model `com05` by selecting its `rhs` property (the right-hand side of the formula) then filtering out any random-effects terms in the expression.

```{julia}
feform = filter(
  t -> !isa(t, MixedModels.AbstractReTerm),
  com05.formula.rhs,
);
```

::: {.callout-note}
Add extractors for different parts of the formula to MixedModels.jl
:::

Finally, we evaluate the model columns for this reduced formula.
These are returned as an array of matrices, in this case containing only one matrix.

```{julia}
newX = only(StatsModels.modelcols(feform, newdata))
```

The predicted linear predictor, $\bbeta$, for the `newdata` table and the fixed-effects estimate for model `com05` is the product of this model matrix and the `fixef` coefficients.

```{julia}
newdata = Table(newdata; η=newX * fixef(com05))
```

Plotting the result, @fig-com05pred,

```{julia}
#| code-fold: true
#| fig-cap: Linear predictor versus centered age from model com05
#| label: fig-com05pred
#| warning: false
draw(
  data(newdata) *
  mapping(
    :age => "Centered age (yr)",
    :η => "Linear predictor value from model com05";
    col=:urban => renamer(["N" => "Rural", "Y" => "Urban"]),
    color=:ch => "Children",
  ) *
  smooth();
  figure=(; size=(650, 400)),
)
```

shows that these curves follow the general trends of the data plot.
However, the vertical axis is not on a probability scale.
Indeed most of the values of the linear predictor are negative.

This brings us to the topic of link functions.

### The logit link function for binary responses {#sec-logitlink}

The probability model for a binary response is the Bernoulli distribution, which is a very simple probability distribution in that its "support" --- the set of possible values of the random variable --- is just `0` or `1`.
If the probability of the response `1` is $p$ then the probability of `0` must be $1-p$.
It is easy to establish that the expected value is also $p$.
For consistency across distribution families we write this expected response as $\mu$ instead of $p$.
We should, however, keep in mind that, for this distribution, $\mu$ corresponds to a probability and hence must satisfy
$0\le\mu\le 1$.

In general we don't want to have restrictions on the values of the linear predictor so we equate the linear predictor to a function of $\mu$ that has an unrestricted range.
In the case of the Bernoulli distribution with the canonical link function we equate the linear predictor to the *log odds* or *logit* of the positive response.
That is
$$
\eta = \log\left(\frac{\mu}{1-\mu}\right) .
$$ {#eq-logodds}

To understand why this is called the "log odds" recall that $\mu$ corresponds to a probability in $[0,1]$.
The corresponding odds ratio, $\frac{\mu}{1-\mu}$, is in $[0,\infty)$ and the logarithm of the odds ratio, $\mathrm{logit}(\mu)$, is in $(-\infty, \infty)$.

The inverse of the logit link function,
$$
\mu = \frac{1}{1+\exp(-\eta)} ,
$$ {#eq-logitinv}
is called the *logistic* function and is shown in @fig-logitinv.

```{julia}
#| code-fold: true
#| fig-cap: 'The logistic function, which is the inverse to the logit link function.'
#| label: fig-logitinv
#| warning: false
let
  fig = Figure(; size=(650, 300))
  ax = Axis(fig[1, 1]; xlabel="η", ylabel="μ")
  lines!(ax, -5.5 .. 5.5, η -> inv(1 + exp(-η)))
  fig
end
```

The inverse link takes a value on the unrestricted range, $(-\infty,\infty)$, and maps it to the probability range, $[0,1]$.
It happens this function is also the cumulative distribution function for the standard logistic distribution, available in [Distributions.jl](https://github.com/JuliaStats/Distributions.jl) as `cdf(Logistic(), η)`.
In some presentations the relationship between the logit link and the logistic distribution is emphasized but that often leads to questions of why we should focus on the logistic distribution.
Also, it is not clear how this approach would generalize to other distributions such as the Poisson or the Gamma distributions.

### Canonical link functions {#sec-canonicallink}

A way of deriving the logit link that does generalize to a class of common distributions, in what is called the *exponential family*, is to consider the logarithm of the probability function (for discrete distributions) or the probability density function (for continuous distributions).
The probability function for the Bernoulli distribution is $\mu$ for $y=1$ and $1-\mu$ for $y=0$.
If we write this in a somewhat contrived way as $\mu^y+(1-\mu)^{1-y}$ for $y\in\{0,1\}$ then the logarithm of the probability function becomes
$$
\log\left(\mu^y+(1-\mu)^{1-y}\right) = \log(1-\mu) +
y\,\log\left(\frac{\mu}{1-\mu}\right) .
$$ {#eq-BernoulliProb}
Notice that the logit link function is the multiple of $y$ in the last term.

The characteristic of distributions in the exponential family is that the logarithm of the probability mass function or probability density function, whichever is appropriate, can be expressed as a sum of up to three terms: one that involves $y$ only, one that involves the parameters only, and the product of $y$ and a function of the parameters.
This function is the canonical link for the distribution.

In the case of the Poisson distribution the probability function is $\frac{e^{-\mu}\mu^y}{y!}$ for $y\in\{0,1,2,\dots\}$ so the log probability function is
$$
-\log(y!)-\mu+y\log(\mu) .
$$ {#eq-PoissonProb}
and the canonical link function is $\log(\mu)$.

### Interpreting coefficient estimates {#sec-GLMMcoefficients}

Returning to the interpretation of the estimated coefficients in model `com05` we apply exactly the same interpretation as for a linear mixed model but taking into account that slopes or differences in levels are with respect to the logit or log-odds function.
If we wish to express results in the probability scale then we should apply the logistic function to whatever combination of coefficients is of interest to us.

```{julia}
logistic(η) = inv(one(η) + exp(-η))
newdata = Table(newdata; μ=logistic.(newdata.η))
```

producing the population predictions on the probability scale, as shown in @fig-com05predmu.

```{julia}
#| code-fold: true
#| fig-cap: Predicted probability of contraception use versus centered age from model com05.
#| label: fig-com05predmu
#| warning: false
draw(
  data(newdata) *
  mapping(
    :age => "Centered age (yr)",
    :μ => "Probability of contraceptive use";
    col=:urban => renamer(["N" => "Rural", "Y" => "Urban"]),
    color=:ch => "Children",
  ) *
  smooth();
  figure=(; size=(650, 400)),
)
```

On the probability scale we can compare the predictions to the observed frequencies shown with scatterplot smoother lines in @fig-contradata2.

Consider the predictions on both the linear predictor and probability (or expected value) scales for women with centered ages of 2.0.

```{julia}
filter(r -> r.age == 2, newdata)
```

The predicted probability of woman with centered age of 2, with children, living in an urban environment using artificial contraception is about 2/3, which is reasonably close to the smoothed frequency for that combination of covariates in @fig-contradata2.

Similarly, a woman of centered age of 2 without children living in a rural environment has a predicted probability of using artificial contraception of a little less than 20%, which also corresponds to the smoother line for that combination (blue line in the left panel) in @fig-contradata2.

## Interpretation of random effects {#sec-glmmrandomeff}

We should also be aware that the random effects are defined on the linear predictor scale and not on the probability scale.
A normal probability plot of the conditional modes of the random effects for model `com05`, @fig-com05qqcaterpillar

```{julia}
#| code-fold: true
#| fig-cap: Caterpillar plot of the conditional modes of the random-effects for model `com05`
#| label: fig-com05qqcaterpillar
#| warning: false
qqcaterpillar!(Figure(; size=(650, 450)), com05)
```

shows that the smallest random effects are approximately -1 and the largest are approximately 1.

The numerical values and the identifier of the combination of `dist` and `urban` for these extreme values can be obtained from the first few rows and the last few rows of the sorted random-effects table.

```{julia}
srtdre = let retbl = only(raneftables(com05))
  retbl[sortperm(last.(retbl))]
end
first(srtdre, 6)
```

```{julia}
last(srtdre, 6)
```

The largest random effect is for rural settings in `D34`.
There were 26 women in the sample from rural `D34`

```{julia}
D34N = filter(r -> (r.dist == "D34") & (r.urban == "N"), contra)
```

of whom 20 used contraception, an unusually large proportion for a rural setting.

But this happens when you have relatively small survey sizes - we expect considerable variation.

Also there is considerable variability in the lengths of the prediction intervals in @fig-com05qqcaterpillar because the data are unbalanced with respect to district and rural/urban districts.

Consider the cross-tabulation of counts of interviewees by district and urban/rural status presented at the end of @sec-contramodelbuilding.
The data contains responses from 54 rural women in district `D01` but only 21 rural women from `D11`.
Thus the bottom line in @fig-com05qqcaterpillar, for `("D21", "N")`, and
based on 54 responses, is shorter than the line second from the bottom,
for `("D11", "N")` and based on 21 women only.

### Conversion of random effects to relative odds

The exponential of the random effect is the relative odds of a woman in a particular urban/district combination using artificial birth control compared to her counterpart (same age, same with/without children status, same urban/rural status) in a typical district.
The thus, the relative odds of a rural woman in district `D01` using artificial contraception relative to the general population of women her age is

```{julia}
exp(last(first(srtdre)))
```

or about 40%.

*This page was rendered from git revision {{< git-rev short=true >}}.*