Merge pull request #71 from mayer79/out-of-sample
Initial support of out-of-sample prediction
mayer79 authored Jul 27, 2024
2 parents 5e5afab + 814c7e8 commit d935791
Showing 12 changed files with 531 additions and 51 deletions.
3 changes: 1 addition & 2 deletions DESCRIPTION
@@ -14,8 +14,7 @@ Description: Alternative implementation of the beautiful 'MissForest'
predictive mean matching tries to raise the variance in the resulting
conditional distributions to a realistic level. This would allow,
e.g., to do multiple imputation when repeating the call to
missRanger(). A formula interface allows to control which variables
should be imputed by which.
missRanger(). Out-of-sample application is supported as well.
License: GPL (>= 2)
Depends:
R (>= 3.5.0)
1 change: 1 addition & 0 deletions NAMESPACE
@@ -1,5 +1,6 @@
# Generated by roxygen2: do not edit by hand

S3method(predict,missRanger)
S3method(print,missRanger)
S3method(summary,missRanger)
export(generateNA)
62 changes: 42 additions & 20 deletions NEWS.md
@@ -1,16 +1,36 @@
# missRanger 2.6.0

## Possibly breaking changes
### Major feature

Out-of-sample application is now possible! Thanks to [@jeandigitale](https://github.com/jeandigitale) for pushing the idea in [#58](https://github.com/mayer79/missRanger/issues/58).

This means you can run `imp <- missRanger(..., keep_forests = TRUE)` and then apply its models to new data via `predict(imp, newdata)`. The "missRanger" object can be saved and loaded as a binary file, e.g., via `saveRDS()`/`readRDS()`, for later use.

Note that out-of-sample imputation works best for rows in `newdata` with only one
missing value (actually counting only missings in variables used as covariates in random forests). We call this the "easy case". In the "hard case",
even multiple iterations (set by `iter`) can lead to unsatisfactory results.

The out-of-sample algorithm works as follows:

1. Impute univariately all relevant columns by randomly drawing values
from the original, unimputed data. This step will only impact "hard case" rows.
2. Replace univariate imputations by predictions of random forests. This is done
sequentially over variables in descending order of number of missings
(to minimize the impact of univariate imputations). Optionally, this is followed
by predictive mean matching (PMM).
3. Repeat Step 2 for "hard case" rows multiple times.
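
A minimal sketch of this workflow; `df` and `df_new` are placeholders for the training data (with missing values) and for new data to impute later:

```r
library(missRanger)

# Fit imputation models and keep the forests for later reuse
imp <- missRanger(df, pmm.k = 3, num.trees = 100, keep_forests = TRUE, seed = 1)

# Store the "missRanger" object as a binary file ...
saveRDS(imp, "imputer.rds")

# ... and apply it to new data later
imp2 <- readRDS("imputer.rds")
df_new_imputed <- predict(imp2, newdata = df_new, seed = 2)
```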

### Possibly breaking changes

- Columns of special type like date/time can't be imputed anymore. You will need to convert them to numeric before imputation.
- `pmm()` is more picky: `xtrain` and `xtest` must both be either numeric, logical, or factor (with identical levels); see the small example below.
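
A made-up call in the spirit of the stricter check, with `xtrain` and `xtest` both numeric (argument names as used by the new `predict()` method):

```r
# Both numeric: fine. Returns, for each xtest value, a donor from ytrain
# drawn among the k closest xtrain values.
pmm(xtrain = c(0.2, 0.8, 0.5), xtest = c(0.3, 0.7), ytrain = c(10, 20, 15), k = 2)

# Mixing types, e.g. a factor xtrain with a numeric xtest, now raises an error.
```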

## Minor changes in output object
### Minor changes in output object

- Add original data as `data_raw`.
- Renamed `visit_seq` to `to_impute`.

## Other changes
### Other changes

- Now requires ranger >= 0.16.0.
- More compact vignettes.
@@ -27,39 +47,41 @@
- num.threads = NULL
- save.memory = FALSE
- For variables that can't be used, more information is printed.
- If `keep_forests = TRUE`, the argument `data_only` is set to `FALSE` by default.
- "missRanger" object now stores `pmm.k`.

# missRanger 2.5.0

## Bug fixes
### Bug fixes

- Since release 2.3.0, negative formula terms were unintentionally no longer being dropped, see [#62](https://github.com/mayer79/missRanger/issues/62). This is now fixed.

## Enhancements
### Enhancements

- The vignette on multiple imputations has been revised, and a larger number of donors in predictive mean matching is being used in the example.

# missRanger 2.4.0

## Future Output API
### Future Output API

- New argument `data_only = TRUE` to control whether only the imputed data is returned (default) or an object of class "missRanger". This object contains the imputed data and information like OOB prediction errors, fixing [#28](https://github.com/mayer79/missRanger/issues/28). The value `FALSE` will later become the default in {missRanger 3.0.0}. This will be announced via a deprecation cycle.

## Enhancements
### Enhancements

- New argument `keep_forests = FALSE`. Should the random forests of the best iteration (the one that generated the final imputed data) be added to the "missRanger" object? Note that this will use a lot of memory. Only relevant if `data_only = FALSE`. This solves [#54](https://github.com/mayer79/missRanger/issues/54).

## Bug fixes
### Bug fixes

- In case the algorithm did not converge, the data of the *last* iteration was returned instead of the current one. This has been fixed.

# missRanger 2.3.0

## Major improvements
### Major improvements

- `missRanger()` now works with syntactically wrong variable names like "1bad:variable". This solves an [old issue](https://github.com/mayer79/missRanger/issues/19), recently popping up in [this new issue](https://github.com/mayer79/missRanger/issues/51).
- `missRanger()` now works with any number of features, as long as the formula is left at its default, i.e., `. ~ .`. This solves this [issue](https://github.com/mayer79/missRanger/issues/50).

## Other changes
### Other changes

- Documentation improvement.
- `ranger()` is now called via the x/y interface, not the formula interface anymore.
@@ -71,13 +93,13 @@

# missRanger 2.2.0

## Less dependencies
### Less dependencies

- Removed {mice} from "suggested" packages.
- Removed {dplyr} from "suggested" packages.
- Removed {survival} from "suggested" packages.

## Maintenance
### Maintenance

- Adding Github pages.
- Introduction of Github actions.
@@ -92,36 +114,36 @@ Maintenance release,

# missRanger 2.1.4 (not on CRAN)

## Minor changes
### Minor changes

- Now using progress bar instead of "." to show progress (when verbose = 1).

# missRanger 2.1.2 and 2.1.3

## Maintenance update
### Maintenance update

- Fixing failing unit tests.

# missRanger 2.1.1

## Minor changes
### Minor changes

- Allow the use of "mtry" as suggested by Thomas Lumley. Recommended values are NULL (default), 1 or a function of the number of covariables m, e.g. `mtry = function(m) max(1, m %/% 3)`. Keep in mind that `missRanger()` might use a growing set of covariables in the first iteration of the process, so passing `mtry = 2` might result in an error.

## Documentation
### Documentation

- Improved help pages.
- Split long vignette into three shorter ones.

## Other
### Other

- Added unit tests.

# missRanger 2.1.0

This is a summary of all changes since version 1.x.x.

## Major changes
### Major changes
* `missRanger` now also imputes and uses logical variables, character variables and further variables of mode numeric like dates and times.

* Added formula interface to specify which variables to impute (those on the left-hand side) and those used to do so (those on the right-hand side). Here are some (pseudo) examples:
@@ -142,7 +164,7 @@

* In PMM mode, `missRanger` relies on OOB predictions. The smaller the value of `num.trees`, the higher the risk of missing OOB predictions, which caused an error in PMM. Now, `pmm` allows for missing values in `xtrain` or `ytrain`. Thus, the algorithm will even work with `num.trees = 1`. This will be useful to impute large data sets with PMM.

## Minor changes
### Minor changes

* The function `imputeUnivariate` has received a `seed` argument.

@@ -152,6 +174,6 @@ This is a summary of all changes since version 1.x.x.

* If `verbose` is not 0, then `missRanger` will show which variables will be imputed in which order and which variables will be used for imputation.

## Minor bug fix
### Minor bug fix

* The argument `returnOOB` now effectively controls whether out-of-bag errors are attached as attribute "oob" to the resulting data frame. Previously, the attribute was always attached.
190 changes: 190 additions & 0 deletions R/methods.R
@@ -41,3 +41,193 @@ summary.missRanger <- function(object, ...) {
invisible(object)
}


#' Predict Method
#'
#' @description
#' Impute missing values on `newdata` based on an object of class "missRanger".
#'
#' For multivariate imputation, use `missRanger(..., keep_forests = TRUE)`.
#' For univariate imputation, no forests are required.
#' This can be enforced by `predict(..., iter = 0)` or via `missRanger(. ~ 1, ...)`.
#'
#' Note that out-of-sample imputation works best for rows in `newdata` with only one
#' missing value (actually counting only missings in variables used as covariates
#' in random forests). We call this the "easy case". In the "hard case",
#' even multiple iterations (set by `iter`) can lead to unsatisfactory results.
#'
#' @details
#' The out-of-sample algorithm works as follows:
#' 1. Impute univariately all relevant columns by randomly drawing values
#' from the original, unimputed data. This step will only impact "hard case" rows.
#' 2. Replace univariate imputations by predictions of random forests. This is done
#' sequentially over variables in descending order of number of missings
#' (to minimize the impact of univariate imputations). Optionally, this is followed
#' by predictive mean matching (PMM).
#' 3. Repeat Step 2 for "hard case" rows multiple times.
#'
#' @param object 'missRanger' object.
#' @param newdata A `data.frame` with missing values to impute.
#' @param pmm.k Number of candidate predictions of the original dataset
#' for predictive mean matching (PMM). By default the same value as during fitting.
#' @param iter Number of iterations for "hard case" rows. 0 for univariate imputation.
#' @param seed Integer seed used for initial univariate imputation and PMM.
#' @param verbose Should info be printed? (1 = yes/default, 0 for no).
#' @param ... Currently not used.
#' @export
#' @examples
#' iris2 <- generateNA(iris, seed = 20, p = c(Sepal.Length = 0.2, Species = 0.1))
#' imp <- missRanger(iris2, pmm.k = 5, num.trees = 100, keep_forests = TRUE, seed = 2)
#' predict(imp, head(iris2), seed = 3)
predict.missRanger <- function(
object, newdata, pmm.k = object$pmm.k, iter = 4L, seed = NULL, verbose = 1L, ...
) {
stopifnot(
"'newdata' should be a data.frame!" = is.data.frame(newdata),
"'newdata' should have at least one row!" = nrow(newdata) >= 1L,
"'iter' should not be negative!" = iter >= 0L,
"'pmm.k' should not be negative!" = pmm.k >= 0L
)
data_raw <- object$data_raw

# WHICH VARIABLES TO IMPUTE?

# (a) Only those in newdata
to_impute <- intersect(object$to_impute, colnames(newdata))

# (b) Only those with missings, and in decreasing order
# to minimize impact of univariate imputations
to_fill <- is.na(newdata[, to_impute, drop = FALSE])
m <- sort(colSums(to_fill), decreasing = TRUE)
to_impute <- names(m[m > 0])
to_fill <- to_fill[, to_impute, drop = FALSE]

if (length(to_impute) == 0L) {
return(newdata)
}

# CHECK VARIABLES USED TO IMPUTE

impute_by <- object$impute_by
if (!all(impute_by %in% colnames(newdata))) {
stop(
"Variables not present in 'newdata': ",
paste(setdiff(impute_by, colnames(newdata)), collapse = ", ")
)
}

# We currently don't do multivariate imputation if variable not to be imputed
# has missing values
only_impute_by <- setdiff(impute_by, to_impute)
if (length(only_impute_by) > 0L && anyNA(newdata[, only_impute_by])) {
stop(
"Missing values in ", paste(only_impute_by, collapse = ", "), " not allowed."
)
}

# CONSISTENCY CHECKS WITH 'data_raw'

for (v in union(to_impute, impute_by)) {
v_new <- newdata[[v]]
v_orig <- data_raw[[v]]

if (all(is.na(v_new))) {
next # NA can be of wrong class!
}
# class() distinguishes numeric, integer, logical, factor, character, Date, ...
# - variables in to_impute are numeric, integer, logical, factor, or character
# - variables in impute_by can also be of *mode* numeric, which includes Dates
if (!identical(class(v_new), class(v_orig))) {
stop("Inconsistency between 'newdata' and original data in variable ", v)
}

# Factor inconsistencies are not okay in 'to_impute'
if (
v %in% to_impute && is.factor(v_new) && !identical(levels(v_new), levels(v_orig))
) {
if (all(levels(v_new) %in% levels(v_orig))) {
newdata[[v]] <- factor(v_new, levels(v_orig), ordered = is.ordered(v_orig))
if (verbose >= 1L) {
message("\nExtending factor levels of '", v, "' to those in original data")
}
} else {
stop("New factor levels seen in variable to impute: ", v)
}
}
}

if (!is.null(seed)) {
set.seed(seed)
}

# UNIVARIATE IMPUTATION
# has no effect for "easy case" rows, but is not very expensive

for (v in to_impute) {
bad <- to_fill[, v]
v_orig <- data_raw[[v]]
donors <- sample(v_orig[!is.na(v_orig)], size = sum(bad), replace = TRUE)
if (all(bad)) {
# Handles e.g. case when original is factor, but newdata has all NA of numeric type
newdata[[v]] <- donors
} else {
newdata[[v]][bad] <- donors
}
}

if (length(impute_by) == 0L || iter < 1L) {
if (verbose >= 1L) {
message("\nOnly univariate imputations done")
}
return(newdata)
}

# MULTIVARIATE IMPUTATION

if (is.null(object$forests)) {
stop("No random forests in 'object'. Use missRanger(, keep_forests = TRUE).")
}

# Do we have a random forest for all variables with missings?
# This can fire only if the first iteration in missRanger() was the best, and only
# for at most one variable.
forests_missing <- setdiff(to_impute, names(object$forests))
if (verbose >= 1L && length(forests_missing) > 0L) {
message(
"\nNo random forest for ", forests_missing,
". Univariate imputation done for this variable."
)
}
to_impute <- setdiff(to_impute, forests_missing)

# Do we have rows of "hard case"? If no, a single iteration is sufficient.
easy <- rowSums(to_fill[, intersect(to_impute, impute_by), drop = FALSE]) <= 1L
if (all(easy)) {
iter <- 1L
}

for (j in seq_len(iter)) {
for (v in to_impute) {
y <- newdata[[v]]
pred <- stats::predict(object$forests[[v]], newdata[to_fill[, v], ])$predictions
if (pmm.k >= 1) {
xtrain <- object$forests[[v]]$predictions
ytrain <- data_raw[[v]]
if (anyNA(ytrain)) {
ytrain <- ytrain[!is.na(ytrain)] # To align with OOB predictions
}
pred <- pmm(xtrain = xtrain, xtest = pred, ytrain = ytrain, k = pmm.k)
} else if (is.logical(y)) {
pred <- as.logical(pred)
} else if (is.character(y)) {
pred <- as.character(pred)
}
newdata[[v]][to_fill[, v]] <- pred
}
if (j == 1L) {
to_fill <- to_fill & !easy
}
}
return(newdata)
}
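
# --- Usage sketch (editor's illustration, not part of the package source) ---
# Refits the model from the roxygen example above and applies it out of sample;
# `new_rows` is a hypothetical data frame with fresh missing values.
iris2 <- generateNA(iris, seed = 20, p = c(Sepal.Length = 0.2, Species = 0.1))
imp <- missRanger(iris2, pmm.k = 5, num.trees = 100, keep_forests = TRUE, seed = 2)
new_rows <- generateNA(head(iris, 3), p = 0.3, seed = 1)
predict(imp, new_rows)            # multivariate: uses the stored forests
predict(imp, new_rows, iter = 0)  # univariate imputation only, no forests needed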
