title

Summary

When applying statistical methods or applying machine learning techniques to real world data, there are common data issues that can cause modeling to fail. The vtreat package (@vtreat) is an R data frame processor that prepares messy real world data for predictive modeling in a reproducible and statistically sound manner.

The package's objective is to produce clean data frames that preserve the original information, and are safe for model training and model application. Vtreat does this by collecting statistics from training data in order to produce a treatment plan. Vtreat then uses this treatment plan to process subsequent data frames prior to both model training and model application. The processed data frame is guaranteed to be purely numeric, with no missing or NaN values, and no string or categorical values. Vtreat serves as a powerful alternative to R's native model.matrix construct. The goals of the package differ from those of training harness systems such as caret (@caret) and unsupervised ad-hoc processing systems such as recipes (@recipes).

In particular vtreat emphasizes safe but y-aware (supervised) pre-processing of data for predictive modeling tasks. It automates:

Treatment of missing values through safe replacement plus indicator column.
Explicit coding of categorical variable levels as indicator variables.
Robust handling of novel categorical levels (values seen during test or application, but not seen during training).
Supervised re-coding of categorical variables with very large numbers of levels, using an approach similar to that described by @appliedmr.
Cross validation to mitigate overfit and undesirable supervision bias.
Optional significance-based and cross-validated variable selection.

Vtreat is careful to automate only domain-agnostic data cleaning steps that are to common to many applications. This intentionally leaves domain-specific processing to the researcher and their own appropriate tools.

The use of vtreat avoids the perils of ad-hoc data treatment, and provides a reproducible, documented, and citable data treatment procedure.

For more details and further discussion, please see our expository article @vtreatX and the package online documentation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

paper.md

paper.md

Summary

References

Files

paper.md

Latest commit

History

paper.md

File metadata and controls

Summary

References