% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/contr_one_hot.R
\name{contr_one_hot}
\alias{contr_one_hot}
\title{Contrast function for one-hot encodings}
\usage{
contr_one_hot(n, contrasts = TRUE, sparse = FALSE)
}
\arguments{
\item{n}{A character vector of factor levels (of length >= 1) or the number
of unique levels (>= 1).}
\item{contrasts}{This argument is for backwards compatibility and only the
default of \code{TRUE} is supported.}
\item{sparse}{This argument is for backwards compatibility and only the
default of \code{FALSE} is supported.}
}
\value{
A diagonal matrix that is \code{n}-by-\code{n}, with one row and one column
per factor level.
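
As a rough sketch of what a direct call looks like, assuming the package that
exports \code{contr_one_hot()} is attached (\code{parsnip} is used below as an
assumption, and the level names are made up for illustration):

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Assumption: contr_one_hot() is exported by parsnip
library(parsnip)

# Pass the levels themselves: one row and one column per level
by_levels <- contr_one_hot(c("low", "mid", "high"))
dim(by_levels)

# Or pass only the number of levels
by_count <- contr_one_hot(3)
dim(by_count)
}\if{html}{\out{</div>}}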
}
\description{
This contrast function produces a model matrix with indicator columns for
each level of each factor.
}
\details{
By default, \code{model.matrix()} generates binary indicator variables for
factor predictors. When the formula does not remove the intercept, an
incomplete set of indicators is created; no indicator is made for the
first level of the factor.

For example, \code{species} and \code{island} both have three levels but
\code{model.matrix()} creates two indicator variables for each:
\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(dplyr)
library(modeldata)
data(penguins)
levels(penguins$species)
}\if{html}{\out{</div>}}
\if{html}{\out{<div class="sourceCode">}}\preformatted{## [1] "Adelie" "Chinstrap" "Gentoo"
}\if{html}{\out{</div>}}
\if{html}{\out{<div class="sourceCode r">}}\preformatted{levels(penguins$island)
}\if{html}{\out{</div>}}
\if{html}{\out{<div class="sourceCode">}}\preformatted{## [1] "Biscoe" "Dream" "Torgersen"
}\if{html}{\out{</div>}}
\if{html}{\out{<div class="sourceCode r">}}\preformatted{model.matrix(~ species + island, data = penguins) \%>\%
colnames()
}\if{html}{\out{</div>}}
\if{html}{\out{<div class="sourceCode">}}\preformatted{## [1] "(Intercept)" "speciesChinstrap" "speciesGentoo" "islandDream"
## [5] "islandTorgersen"
}\if{html}{\out{</div>}}
For a formula with no intercept, the first factor is expanded to
indicators for \emph{all} factor levels, while every other factor is
expanded to indicators for all but its first level (as above):
\if{html}{\out{<div class="sourceCode r">}}\preformatted{model.matrix(~ 0 + species + island, data = penguins) \%>\%
colnames()
}\if{html}{\out{</div>}}
\if{html}{\out{<div class="sourceCode">}}\preformatted{## [1] "speciesAdelie" "speciesChinstrap" "speciesGentoo" "islandDream"
## [5] "islandTorgersen"
}\if{html}{\out{</div>}}
For inference, this hybrid encoding can be problematic.
To generate all indicators, use this contrast:
\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Switch out the contrast method
old_contr <- options("contrasts")$contrasts
new_contr <- old_contr
new_contr["unordered"] <- "contr_one_hot"
options(contrasts = new_contr)
model.matrix(~ species + island, data = penguins) \%>\%
colnames()
}\if{html}{\out{</div>}}
\if{html}{\out{<div class="sourceCode">}}\preformatted{## [1] "(Intercept)" "speciesAdelie" "speciesChinstrap" "speciesGentoo"
## [5] "islandBiscoe" "islandDream" "islandTorgersen"
}\if{html}{\out{</div>}}
\if{html}{\out{<div class="sourceCode r">}}\preformatted{options(contrasts = old_contr)
}\if{html}{\out{</div>}}
Removing the intercept here does not affect the factor encodings.
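
A minimal sketch of that last point, assuming the \code{parsnip} and
\code{modeldata} packages are attached (the use of \code{parsnip} as the source
of \code{contr_one_hot()} is an assumption). It sets the contrast, builds the
model matrix with and without the intercept, and then restores the previous
option. If the statement above holds, only the intercept column should differ
between the two sets of column names:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Assumption: contr_one_hot() is exported by parsnip
library(parsnip)
library(modeldata)
data(penguins)

# options() returns the previous settings, which are restored below
old_opts <- options(
  contrasts = c(unordered = "contr_one_hot", ordered = "contr.poly")
)

with_int <- colnames(model.matrix(~ species + island, data = penguins))
no_int <- colnames(model.matrix(~ 0 + species + island, data = penguins))

options(old_opts)

# Columns present with the intercept but absent without it
setdiff(with_int, no_int)
}\if{html}{\out{</div>}}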
}