---
title: 'The vtreat R package: a statistically sound data processor for predictive modeling'
tags:
  - R
  - data science
  - predictive modeling
  - classification
  - regression
  - data preparation
  - significance
  - dimensionality reduction
  - reproducible research
  - cross-validation
authors:
  - name: John Mount
    orcid: 0000-0002-3696-2012
    affiliation: 1
  - name: Nina Zumel
    orcid: 0000-0001-8831-0190
    affiliation: 1
affiliations:
  - name: Win-Vector, LLC
    index: 1
date: 9 February 2018
bibliography: paper.bib
---

Summary

When applying statistical methods or machine learning techniques to real-world data, common data issues can cause modeling to fail. The vtreat package (@vtreat) is an R data frame processor that prepares messy real-world data for predictive modeling in a reproducible and statistically sound manner.

The package's objective is to produce clean data frames that preserve the original information, and are safe for model training and model application. Vtreat does this by collecting statistics from training data in order to produce a treatment plan. Vtreat then uses this treatment plan to process subsequent data frames prior to both model training and model application. The processed data frame is guaranteed to be purely numeric, with no missing or NaN values, and no string or categorical values. Vtreat serves as a powerful alternative to R's native model.matrix construct. The goals of the package differ from those of training harness systems such as caret (@caret) and unsupervised ad-hoc processing systems such as recipes (@recipes).
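The design-then-prepare workflow described above can be sketched as follows. This is a minimal illustration on a made-up toy data set (the data frames `dTrain` and `dApp` and their values are hypothetical), assuming the vtreat package is installed; it uses `designTreatmentsN()` to build a treatment plan for a numeric outcome and `prepare()` to apply that plan to new data.

```r
library(vtreat)  # assumes vtreat is installed from CRAN

# Hypothetical training data: a categorical column and a numeric column,
# both containing missing values, plus a numeric outcome y.
dTrain <- data.frame(
  x = c("a", "a", "a", "a", "b", "b", NA, NA),
  z = c(1, 2, 3, 4, 5, NA, 7, NA),
  y = c(0, 0, 0, 1, 0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical application data: no outcome column, a level ("c") never
# seen in training, and missing values.
dApp <- data.frame(
  x = c("a", "b", "c", NA),
  z = c(10, 20, 30, NA),
  stringsAsFactors = FALSE
)

# Step 1: collect statistics from the training data into a treatment plan.
plan <- designTreatmentsN(dTrain, varlist = c("x", "z"),
                          outcomename = "y", verbose = FALSE)

# Step 2: use the plan to process a later data frame.
dAppTreated <- prepare(plan, dApp)

# Check the properties the text describes: numeric columns, no missing values.
all_numeric <- all(vapply(dAppTreated, is.numeric, logical(1)))
no_missing <- !anyNA(dAppTreated)
print(colnames(dAppTreated))
```

The names of the derived columns depend on the vtreat version, so the sketch checks only the structural properties of the result rather than specific column names.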

In particular, vtreat emphasizes safe but y-aware (supervised) pre-processing of data for predictive modeling tasks. It automates:

  • Treatment of missing values through safe replacement plus an indicator column.
  • Explicit coding of categorical variable levels as indicator variables.
  • Robust handling of novel categorical levels (values seen during test or application, but not seen during training).
  • Supervised re-coding of categorical variables with very large numbers of levels, using an approach similar to that described by @appliedmr.
  • Cross-validation to mitigate overfitting and undesirable supervision bias.
  • Optional significance-based and cross-validated variable selection.
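To make the first three items concrete, here is a hand-rolled base R sketch of the underlying ideas. It is not vtreat's implementation: the helpers `fit_plan()` and `apply_plan()`, the use of the training mean as the replacement value, and the column-name suffixes are all assumptions chosen for illustration.

```r
# Learn everything from training data only: a replacement value for the
# numeric column and the set of known levels for the categorical column.
fit_plan <- function(train, num_col, cat_col) {
  v <- as.character(train[[cat_col]])
  list(
    num_col = num_col,
    cat_col = cat_col,
    fill = mean(train[[num_col]], na.rm = TRUE),
    levels = sort(unique(v[!is.na(v)]))
  )
}

# Apply the learned plan to any later data frame.
apply_plan <- function(plan, d) {
  x <- d[[plan$num_col]]
  out <- data.frame(
    clean = ifelse(is.na(x), plan$fill, x),  # safe replacement
    is_bad = as.numeric(is.na(x))            # indicator of missingness
  )
  names(out) <- paste0(plan$num_col, c("_clean", "_isBAD"))
  v <- as.character(d[[plan$cat_col]])
  for (lev in plan$levels) {
    # One 0/1 indicator per level seen in training; a novel or missing
    # level matches none of them and is coded as all zeros.
    out[[paste0(plan$cat_col, "_lev_", lev)]] <- as.numeric(!is.na(v) & v == lev)
  }
  out
}

train <- data.frame(z = c(1, 2, NA, 4), x = c("a", "b", "a", NA),
                    stringsAsFactors = FALSE)
app <- data.frame(z = c(NA, 10), x = c("c", "a"),
                  stringsAsFactors = FALSE)

plan <- fit_plan(train, "z", "x")
treated <- apply_plan(plan, app)
print(treated)
```

In this sketch the application-time level `"c"` was never seen in training, so its row gets zeros in every level indicator instead of causing an error, and the missing `z` value is replaced while `z_isBAD` records that a replacement happened.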

Vtreat is careful to automate only domain-agnostic data cleaning steps that are common to many applications. This intentionally leaves domain-specific processing to the researcher and their own appropriate tools.

The use of vtreat avoids the perils of ad-hoc data treatment, and provides a reproducible, documented, and citable data treatment procedure.

For more details and further discussion, please see our expository article @vtreatX and the package's online documentation.

References