Merge pull request #71 from mayer79/out-of-sample
Initial support of out-of-sample prediction
mayer79 authored Jul 27, 2024
2 parents 5e5afab + 814c7e8 commit d935791
Showing 12 changed files with 531 additions and 51 deletions.
3 changes: 1 addition & 2 deletions DESCRIPTION
@@ -14,8 +14,7 @@ Description: Alternative implementation of the beautiful 'MissForest'
predictive mean matching tries to raise the variance in the resulting
conditional distributions to a realistic level. This would allow,
e.g., to do multiple imputation when repeating the call to
missRanger(). A formula interface allows to control which variables
should be imputed by which.
missRanger(). Out-of-sample application is supported as well.
License: GPL (>= 2)
Depends:
R (>= 3.5.0)
1 change: 1 addition & 0 deletions NAMESPACE
@@ -1,5 +1,6 @@
# Generated by roxygen2: do not edit by hand

S3method(predict,missRanger)
S3method(print,missRanger)
S3method(summary,missRanger)
export(generateNA)
62 changes: 42 additions & 20 deletions NEWS.md
@@ -1,16 +1,36 @@
# missRanger 2.6.0

## Possibly breaking changes
### Major feature

Out-of-sample application is now possible! Thanks to [@jeandigitale](https://github.com/jeandigitale) for pushing the idea in [#58](https://github.com/mayer79/missRanger/issues/58).

This means you can run `imp <- missRanger(..., keep_forests = TRUE)` and then apply its models to new data via `predict(imp, newdata)`. The "missRanger" object can be saved and loaded as a binary file, e.g., via `saveRDS()`/`readRDS()`, for later use.

Note that out-of-sample imputation works best for rows in `newdata` with only one
missing value (actually counting only missings in variables used as covariates in random forests). We call this the "easy case". In the "hard case",
even multiple iterations (set by `iter`) can lead to unsatisfactory results.

The out-of-sample algorithm works as follows:

1. Impute univariately all relevant columns by randomly drawing values
from the original, unimputed data. This step will only impact "hard case" rows.
2. Replace univariate imputations by predictions of random forests. This is done
sequentially over variables in descending order of number of missings
(to minimize the impact of univariate imputations). Optionally, this is followed
by predictive mean matching (PMM).
3. Repeat Step 2 for "hard case" rows multiple times.
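
A minimal sketch of this workflow; `df` and `df_new` are placeholders for the training data (with missing values) and for new data to impute later:

```r
library(missRanger)

# Fit imputation models and keep the forests for later reuse
imp <- missRanger(df, pmm.k = 3, num.trees = 100, keep_forests = TRUE, seed = 1)

# Store the "missRanger" object as a binary file ...
saveRDS(imp, "imputer.rds")

# ... and apply it to new data later
imp2 <- readRDS("imputer.rds")
df_new_imputed <- predict(imp2, newdata = df_new, seed = 2)
```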

### Possibly breaking changes

- Columns of special type like date/time can't be imputed anymore. You will need to convert them to numeric before imputation.
- `pmm()` is more picky: `xtrain` and `xtest` must both be either numeric, logical, or factor (with identical levels); see the small example below.
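
A made-up call in the spirit of the stricter check, with `xtrain` and `xtest` both numeric (argument names as used by the new `predict()` method):

```r
# Both numeric: fine. Returns, for each xtest value, a donor from ytrain
# drawn among the k closest xtrain values.
pmm(xtrain = c(0.2, 0.8, 0.5), xtest = c(0.3, 0.7), ytrain = c(10, 20, 15), k = 2)

# Mixing types, e.g. a factor xtrain with a numeric xtest, now raises an error.
```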

## Minor changes in output object
### Minor changes in output object

- Add original data as `data_raw`.
- Renamed `visit_seq` to `to_impute`.

## Other changes
### Other changes

- Now requires ranger >= 0.16.0.
- More compact vignettes.
@@ -27,39 +47,41 @@
- num.threads = NULL
- save.memory = FALSE
- For variables that can't be used, more information is printed.
- If `keep_forests = TRUE`, the argument `data_only` is set to `FALSE` by default.
- "missRanger" object now stores `pmm.k`.

# missRanger 2.5.0

## Bug fixes
### Bug fixes

- Since release 2.3.0, negative formula terms were unintentionally no longer being dropped, see [#62](https://github.com/mayer79/missRanger/issues/62). This is now fixed.

## Enhancements
### Enhancements

- The vignette on multiple imputations has been revised, and a larger number of donors in predictive mean matching is being used in the example.

# missRanger 2.4.0

## Future Output API
### Future Output API

- New argument `data_only = TRUE` to control whether only the imputed data is returned (default) or an object of class "missRanger". This object contains the imputed data and information like OOB prediction errors, fixing [#28](https://github.com/mayer79/missRanger/issues/28). The value `FALSE` will later become the default in {missRanger 3.0.0}. This will be announced via a deprecation cycle.

## Enhancements
### Enhancements

- New argument `keep_forests = FALSE`. Should the random forests of the best iteration (the one that generated the final imputed data) be added to the "missRanger" object? Note that this will use a lot of memory. Only relevant if `data_only = FALSE`. This solves [#54](https://github.com/mayer79/missRanger/issues/54).

## Bug fixes
### Bug fixes

- In case the algorithm did not converge, the data of the *last* iteration was returned instead of the current one. This has been fixed.

# missRanger 2.3.0

## Major improvements
### Major improvements

- `missRanger()` now works with syntactically wrong variable names like "1bad:variable". This solves an [old issue](https://github.com/mayer79/missRanger/issues/19), recently popping up in [this new issue](https://github.com/mayer79/missRanger/issues/51).
- `missRanger()` now works with any number of features, as long as the formula is left at its default, i.e., `. ~ .`. This solves this [issue](https://github.com/mayer79/missRanger/issues/50).

## Other changes
### Other changes

- Documentation improvement.
- `ranger()` is now called via the x/y interface, not the formula interface anymore.
@@ -71,13 +93,13 @@

# missRanger 2.2.0

## Less dependencies
### Less dependencies

- Removed {mice} from "suggested" packages.
- Removed {dplyr} from "suggested" packages.
- Removed {survival} from "suggested" packages.

## Maintenance
### Maintenance

- Adding Github pages.
- Introduction of Github actions.
@@ -92,36 +114,36 @@ Maintenance release,

# missRanger 2.1.4 (not on CRAN)

## Minor changes
### Minor changes

- Now using progress bar instead of "." to show progress (when verbose = 1).

# missRanger 2.1.2 and 2.1.3

## Maintenance update
### Maintenance update

- Fixing failing unit tests.

# missRanger 2.1.1

## Minor changes
### Minor changes

- Allow the use of "mtry" as suggested by Thomas Lumley. Recommended values are NULL (default), 1 or a function of the number of covariables m, e.g. `mtry = function(m) max(1, m %/% 3)`. Keep in mind that `missRanger()` might use a growing set of covariables in the first iteration of the process, so passing `mtry = 2` might result in an error.

## Documentation
### Documentation

- Improved help pages.
- Split long vignette into three shorter ones.

## Other
### Other

- Added unit tests.

# missRanger 2.1.0

This is a summary of all changes since version 1.x.x.

## Major changes
### Major changes
* `missRanger` now also imputes and uses logical variables, character variables and further variables of mode numeric like dates and times.

* Added formula interface to specify which variables to impute (those on the left-hand side) and those used to do so (those on the right-hand side). Here are some (pseudo) examples:
@@ -142,7 +164,7 @@

* In PMM mode, `missRanger` relies on OOB predictions. The smaller the value of `num.trees`, the higher the risk of missing OOB predictions, which caused an error in PMM. Now, `pmm` allows for missing values in `xtrain` or `ytrain`. Thus, the algorithm will even work with `num.trees = 1`. This will be useful to impute large data sets with PMM.

## Minor changes
### Minor changes

* The function `imputeUnivariate` has received a `seed` argument.

@@ -152,6 +174,6 @@ This is a summary of all changes since version 1.x.x.

* If `verbose` is not 0, then `missRanger` will show which variables will be imputed in which order and which variables will be used for imputation.

## Minor bug fix
### Minor bug fix

* The argument `returnOOB` now effectively controls whether out-of-bag errors are attached as attribute "oob" to the resulting data frame. Previously, the attribute was always attached.
190 changes: 190 additions & 0 deletions R/methods.R
@@ -41,3 +41,193 @@ summary.missRanger <- function(object, ...) {
invisible(object)
}


#' Predict Method
#'
#' @description
#' Impute missing values on `newdata` based on an object of class "missRanger".
#'
#' For multivariate imputation, use `missRanger(..., keep_forests = TRUE)`.
#' For univariate imputation, no forests are required.
#' This can be enforced by `predict(..., iter = 0)` or via `missRanger(. ~ 1, ...)`.
#'
#' Note that out-of-sample imputation works best for rows in `newdata` with only one
#' missing value (actually counting only missings in variables used as covariates
#' in random forests). We call this the "easy case". In the "hard case",
#' even multiple iterations (set by `iter`) can lead to unsatisfactory results.
#'
#' @details
#' The out-of-sample algorithm works as follows:
#' 1. Impute univariately all relevant columns by randomly drawing values
#' from the original, unimputed data. This step will only impact "hard case" rows.
#' 2. Replace univariate imputations by predictions of random forests. This is done
#' sequentially over variables in descending order of number of missings
#' (to minimize the impact of univariate imputations). Optionally, this is followed
#' by predictive mean matching (PMM).
#' 3. Repeat Step 2 for "hard case" rows multiple times.
#'
#' @param object 'missRanger' object.
#' @param newdata A `data.frame` with missing values to impute.
#' @param pmm.k Number of candidate predictions of the original dataset
#' for predictive mean matching (PMM). By default the same value as during fitting.
#' @param iter Number of iterations for "hard case" rows. 0 for univariate imputation.
#' @param seed Integer seed used for initial univariate imputation and PMM.
#' @param verbose Should info be printed? (1 = yes/default, 0 for no).
#' @param ... Currently not used.
#' @export
#' @examples
#' iris2 <- generateNA(iris, seed = 20, p = c(Sepal.Length = 0.2, Species = 0.1))
#' imp <- missRanger(iris2, pmm.k = 5, num.trees = 100, keep_forests = TRUE, seed = 2)
#' predict(imp, head(iris2), seed = 3)
predict.missRanger <- function(
object, newdata, pmm.k = object$pmm.k, iter = 4L, seed = NULL, verbose = 1L, ...
) {
stopifnot(
"'newdata' should be a data.frame!" = is.data.frame(newdata),
"'newdata' should have at least one row!" = nrow(newdata) >= 1L,
"'iter' should not be negative!" = iter >= 0L,
"'pmm.k' should not be negative!" = pmm.k >= 0L
)
data_raw <- object$data_raw

# WHICH VARIABLES TO IMPUTE?

# (a) Only those in newdata
to_impute <- intersect(object$to_impute, colnames(newdata))

# (b) Only those with missings, and in decreasing order
# to minimize impact of univariate imputations
to_fill <- is.na(newdata[, to_impute, drop = FALSE])
m <- sort(colSums(to_fill), decreasing = TRUE)
to_impute <- names(m[m > 0])
to_fill <- to_fill[, to_impute, drop = FALSE]

if (length(to_impute) == 0L) {
return(newdata)
}

# CHECK VARIABLES USED TO IMPUTE

impute_by <- object$impute_by
if (!all(impute_by %in% colnames(newdata))) {
stop(
"Variables not present in 'newdata': ",
paste(setdiff(impute_by, colnames(newdata)), collapse = ", ")
)
}

# We currently don't do multivariate imputation if variable not to be imputed
# has missing values
only_impute_by <- setdiff(impute_by, to_impute)
if (length(only_impute_by) > 0L && anyNA(newdata[, only_impute_by])) {
stop(
"Missing values in ", paste(only_impute_by, collapse = ", "), " not allowed."
)
}

# CONSISTENCY CHECKS WITH 'data_raw'

for (v in union(to_impute, impute_by)) {
v_new <- newdata[[v]]
v_orig <- data_raw[[v]]

if (all(is.na(v_new))) {
next # NA can be of wrong class!
}
# class() distinguishes numeric, integer, logical, factor, character, Date, ...
# - variables in to_impute are numeric, integer, logical, factor, or character
# - variables in impute_by can also be of *mode* numeric, which includes Dates
if (!identical(class(v_new), class(v_orig))) {
stop("Inconsistency between 'newdata' and original data in variable ", v)
}

# Factor inconsistencies are not okay in 'to_impute'
if (
v %in% to_impute && is.factor(v_new) && !identical(levels(v_new), levels(v_orig))
) {
if (all(levels(v_new) %in% levels(v_orig))) {
newdata[[v]] <- factor(v_new, levels(v_orig), ordered = is.ordered(v_orig))
if (verbose >= 1L) {
message("\nExtending factor levels of '", v, "' to those in original data")
}
} else {
stop("New factor levels seen in variable to impute: ", v)
}
}
}

if (!is.null(seed)) {
set.seed(seed)
}

# UNIVARIATE IMPUTATION
# has no effect for "easy case" rows, but is not very expensive

for (v in to_impute) {
bad <- to_fill[, v]
v_orig <- data_raw[[v]]
donors <- sample(v_orig[!is.na(v_orig)], size = sum(bad), replace = TRUE)
if (all(bad)) {
# Handles e.g. case when original is factor, but newdata has all NA of numeric type
newdata[[v]] <- donors
} else {
newdata[[v]][bad] <- donors
}
}

if (length(impute_by) == 0L || iter < 1L) {
if (verbose >= 1L) {
message("\nOnly univariate imputations done")
}
return(newdata)
}

# MULTIVARIATE IMPUTATION

if (is.null(object$forests)) {
stop("No random forests in 'object'. Use missRanger(, keep_forests = TRUE).")
}

# Do we have a random forest for all variables with missings?
# This can fire only if the first iteration in missRanger() was the best, and only
# for at most one variable.
forests_missing <- setdiff(to_impute, names(object$forests))
if (verbose >= 1L && length(forests_missing) > 0L) {
message(
"\nNo random forest for ", forests_missing,
". Univariate imputation done for this variable."
)
}
to_impute <- setdiff(to_impute, forests_missing)

# Do we have rows of "hard case"? If no, a single iteration is sufficient.
easy <- rowSums(to_fill[, intersect(to_impute, impute_by), drop = FALSE]) <= 1L
if (all(easy)) {
iter <- 1L
}

for (j in seq_len(iter)) {
for (v in to_impute) {
y <- newdata[[v]]
pred <- stats::predict(object$forests[[v]], newdata[to_fill[, v], ])$predictions
if (pmm.k >= 1) {
xtrain <- object$forests[[v]]$predictions
ytrain <- data_raw[[v]]
if (anyNA(ytrain)) {
ytrain <- ytrain[!is.na(ytrain)] # To align with OOB predictions
}
pred <- pmm(xtrain = xtrain, xtest = pred, ytrain = ytrain, k = pmm.k)
} else if (is.logical(y)) {
pred <- as.logical(pred)
} else if (is.character(y)) {
pred <- as.character(pred)
}
newdata[[v]][to_fill[, v]] <- pred
}
if (j == 1L) {
to_fill <- to_fill & !easy
}
}
return(newdata)
}
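
# --- Usage sketch (editor's illustration, not part of the package source) ---
# Refits the model from the roxygen example above and applies it out of sample;
# `new_rows` is a hypothetical data frame with fresh missing values.
iris2 <- generateNA(iris, seed = 20, p = c(Sepal.Length = 0.2, Species = 0.1))
imp <- missRanger(iris2, pmm.k = 5, num.trees = 100, keep_forests = TRUE, seed = 2)
new_rows <- generateNA(head(iris, 3), p = 0.3, seed = 1)
predict(imp, new_rows)            # multivariate: uses the stored forests
predict(imp, new_rows, iter = 0)  # univariate imputation only, no forests needed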
