| title | tags | authors | affiliations | date | bibliography |
|---|---|---|---|---|---|
| The vtreat R package: a statistically sound data processor for predictive modeling | | | | 9 February 2018 | paper.bib |

When applying statistical methods or machine learning techniques to
real-world data, common data issues can cause modeling to fail. The
vtreat package (@vtreat) is an R data frame processor that prepares
messy real-world data for predictive modeling in a reproducible and
statistically sound manner.
The package's objective is to produce clean data frames that preserve
the original information and are safe for model training and model
application. Vtreat does this by collecting statistics from training
data to produce a treatment plan. Vtreat then uses this treatment plan
to process subsequent data frames prior to both model training and
model application. The processed data frame is guaranteed to be purely
numeric, with no missing or NaN values, and no string or categorical
values.
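
The design-then-apply workflow described above can be sketched as follows. This is a minimal illustration, not a recommended modeling setup: the toy data frames and their column names (`x_num`, `x_cat`, `y`) are hypothetical, and the example assumes the vtreat functions `designTreatmentsN()` (treatment design for a numeric outcome) and `prepare()` as documented in the package.

```r
library(vtreat)

# Toy training data (hypothetical): a numeric column with a missing value,
# a categorical column, and a numeric outcome y.
d_train <- data.frame(
  x_num = c(1, 2, NA, 4, 5, 6, 7, 8),
  x_cat = c("a", "a", "b", "b", "c", "c", "a", "b"),
  y     = c(1.1, 1.9, 3.2, 4.1, 4.8, 6.3, 6.9, 8.2),
  stringsAsFactors = FALSE
)

# Later data with a missing value and a level ("z") not seen in training.
d_new <- data.frame(
  x_num = c(3, NA),
  x_cat = c("a", "z"),
  stringsAsFactors = FALSE
)

# Step 1: collect statistics from the training data to build a treatment plan.
plan <- designTreatmentsN(d_train, c("x_num", "x_cat"), "y", verbose = FALSE)

# Step 2: apply the same plan to subsequent data frames.
treated_new <- prepare(plan, d_new)

# Check the stated guarantees: numeric columns only, no missing values.
print(sapply(treated_new, is.numeric))
print(anyNA(treated_new))
```

The same `plan` object would be reused for every later data frame, so that training and application data pass through identical transformations.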

Vtreat serves as a powerful alternative to R's native model.matrix
construct. The goals of the package differ from those of training
harness systems such as caret (@caret) and unsupervised ad-hoc
processing systems such as recipes (@recipes).

In particular, vtreat emphasizes safe but y-aware (supervised) pre-processing of data for predictive modeling tasks. It automates:
- Treatment of missing values through safe replacement plus an indicator column.
- Explicit coding of categorical variable levels as indicator variables.
- Robust handling of novel categorical levels (values seen during test or application, but not seen during training).
- Supervised re-coding of categorical variables with very large numbers of levels, using an approach similar to that described by @appliedmr.
- Cross validation to mitigate overfit and undesirable supervision bias.
- Optional significance-based and cross-validated variable selection.
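
The cross-validation and variable-selection items in the list above can be sketched as follows. This is an illustration under stated assumptions: the synthetic data, the column names (`zip`, `noise`, `y`), and the 0.05 threshold are all hypothetical choices made for the example, and it assumes the vtreat function `mkCrossFrameNExperiment()` and the `scoreFrame` columns `varName` and `sig` as documented in the package.

```r
library(vtreat)

set.seed(2018)  # cross-validation fold assignment is randomized

# Hypothetical data: a categorical variable with many levels, a pure-noise
# column, and a numeric outcome that depends on the categorical level.
n <- 200
level_effect <- rnorm(50)
idx <- sample.int(50, n, replace = TRUE)
d <- data.frame(
  zip   = paste0("z", idx),
  noise = rnorm(n),
  y     = level_effect[idx] + rnorm(n, sd = 0.5),
  stringsAsFactors = FALSE
)

# Build the treatment plan together with a cross-validated treated copy of
# the training data, so that y-aware re-codings are not scored on the same
# rows used to fit them.
cfe <- mkCrossFrameNExperiment(d, c("zip", "noise"), "y", verbose = FALSE)
plan <- cfe$treatments
train_treated <- cfe$crossFrame

# The score frame lists each derived variable with an estimated significance.
print(plan$scoreFrame[, c("varName", "sig")])

# Optional significance-based selection (the threshold is illustrative only).
keep <- plan$scoreFrame$varName[plan$scoreFrame$sig < 0.05]
print(keep)
```

A model would then be fit on `train_treated`, while later data would be processed with `prepare(plan, ...)`.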

Vtreat is careful to automate only domain-agnostic data cleaning steps
that are common to many applications. This intentionally leaves
domain-specific processing to the researcher and their own appropriate
tools.

The use of vtreat avoids the perils of ad-hoc data treatment and
provides a reproducible, documented, and citable data treatment
procedure.

For more details and further discussion, please see our expository article @vtreatX and the package's online documentation.