man/model_spec.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/model_object_docs.R
\name{model_spec}
\alias{model_spec}
\title{Model Specifications}
\description{
The parsnip package splits the process of fitting models into two steps:
\enumerate{
\item Specify how a model will be fit using a \emph{model specification}
\item Fit a model using the model specification
}

This approach differs from many other model interfaces in R, like \code{lm()},
where both the specification of the model and the fitting happen in one
function call. Splitting the process into two steps allows users to
iteratively define model specifications throughout the model development
process.

This intermediate object that defines how the model will be fit is called
a \emph{model specification} and has class \code{model_spec}. Model type functions,
like \code{\link[=linear_reg]{linear_reg()}} or \code{\link[=boost_tree]{boost_tree()}}, return \code{model_spec} objects.

Fitted model objects, resulting from passing a \code{model_spec} to
\link[=fit.model_spec]{fit()} or \link[=fit_xy.model_spec]{fit_xy()}, have
class \code{model_fit}, and contain the original \code{model_spec} objects inside
them. See \link[=model_fit]{?model_fit} for more on that object type, and
\link[=extract_spec_parsnip.model_fit]{?extract_spec_parsnip} to
extract \code{model_spec}s from \code{model_fit}s.
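
As a minimal sketch of the two steps, assuming the \code{"lm"} engine and the
built-in \code{mtcars} data (both chosen here only for illustration):

\preformatted{
 library(parsnip)

 # Step 1: specify the model; no data are involved yet
 spec <- set_engine(linear_reg(), "lm")
 inherits(spec, "model_spec")

 # Step 2: fit the specification to data
 lm_fit <- fit(spec, mpg ~ wt, data = mtcars)
 inherits(lm_fit, "model_fit")

 # Recover the specification stored inside the fitted object
 inherits(extract_spec_parsnip(lm_fit), "model_spec")
}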
}
\details{
An object with class \code{"model_spec"} is a container for
information about a model that will be fit.

The main elements of the object are:
\itemize{
\item \code{args}: A vector of the main arguments for the model. The
names of these arguments may be different from their
counterparts in the underlying model function. For example, for a
\code{glmnet} model, the argument for the amount of the penalty
is called "penalty" instead of "lambda" to make it more general
and usable across different types of models (and to not be
specific to a particular model function). The elements of \code{args}
can be marked for tuning with \code{tune()} when used with the
\href{https://tune.tidymodels.org/}{tune package}. For more information
see \url{https://www.tidymodels.org/start/tuning/}. If left to their
defaults (\code{NULL}), the
arguments will use the underlying model function's default values.
As discussed below, the arguments in \code{args} are captured as
quosures and are not immediately executed.
\item \code{...}: Optional model-function-specific
parameters. As with \code{args}, these will be quosures and can be
marked for tuning with \code{tune()}.
\item \code{mode}: The type of model, such as "regression" or
"classification". Other modes will be added once the package
adds more functionality.
\item \code{method}: This is a slot that is filled in later by the
model's constructor function. It generally contains lists of
information that are used to create the fit and prediction code
as well as required packages and similar data.
\item \code{engine}: This character string declares exactly what
software will be used. It can be a package name or a technology
type.
}
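
The elements above can be inspected directly on a specification. The
following is a sketch only; it assumes that the \code{"glmnet"} engine can be
set without fitting a model, and the penalty value is arbitrary:

\preformatted{
 library(parsnip)

 spec <- set_engine(linear_reg(penalty = 0.1), "glmnet")

 # Main arguments, under parsnip's names rather than glmnet's
 names(spec$args)

 # Mode and engine are stored as character strings
 spec$mode
 spec$engine

 # Each main argument is held as a quosure
 rlang::is_quosure(spec$args$penalty)
}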

This class and structure are the basis for how parsnip
stores model objects prior to seeing the data.
}
\section{Argument Details}{


An important detail to understand when creating model
specifications is that they are intended to be functionally
independent of the data. While it is true that some tuning
parameters are \emph{data dependent}, the model specification does
not interact with the data at all.

Most R functions immediately evaluate their
arguments. For example, when calling \code{mean(dat_vec)}, the object
\code{dat_vec} is immediately evaluated inside of the function.

parsnip model functions do not do this. For example, using

\preformatted{
 rand_forest(mtry = ncol(mtcars) - 1)
}

\strong{does not} execute \code{ncol(mtcars) - 1} when creating the specification.
This can be seen in the output:

\preformatted{
 > rand_forest(mtry = ncol(mtcars) - 1)
 Random Forest Model Specification (unknown)

 Main Arguments:
   mtry = ncol(mtcars) - 1
}

The model functions save the argument \emph{expressions} and their
associated environments (a.k.a. a quosure) to be evaluated later
when either \code{\link[=fit.model_spec]{fit.model_spec()}} or \code{\link[=fit_xy.model_spec]{fit_xy.model_spec()}} is
called with the actual data.
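
The stored quosure can be examined by hand. This sketch evaluates it
manually with \code{rlang::eval_tidy()}, which is used here only to imitate
the delayed evaluation and is not necessarily the internal mechanism:

\preformatted{
 library(parsnip)

 spec <- rand_forest(mtry = ncol(mtcars) - 1)

 # The argument is stored as a quosure, not as a number
 rlang::is_quosure(spec$args$mtry)

 # Evaluating it manually yields the value that would be used
 rlang::eval_tidy(spec$args$mtry)
}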

The consequence of this strategy is that any data required to
get the parameter values must be available when the model is
fit. The two main ways that this can fail are:

\enumerate{
\item The data have been modified between the creation of the
model specification and when the model fit function is invoked.

\item The model specification is saved and loaded into a new
session where those same data objects do not exist.
}
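
The first failure mode can be reproduced without fitting anything. In this
sketch, \code{dat} is a hypothetical working copy of \code{mtcars}, and the
quosure is again evaluated by hand for illustration:

\preformatted{
 library(parsnip)

 dat <- mtcars
 spec <- rand_forest(mtry = ncol(dat) - 1)

 # Modify the data after the specification was created
 dat <- dat[, 1:5]

 # The stored expression now sees the modified object
 rlang::eval_tidy(spec$args$mtry)
}

The specification itself is unchanged, but the value it produces follows
whatever \code{dat} refers to at evaluation time.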

The best way to avoid these issues is to not reference any data
objects in the global environment but to use data descriptors
such as \code{.cols()}. Another way of writing the previous
specification is

\preformatted{
 rand_forest(mtry = .cols() - 1)
}

This is not dependent on any specific data object and
is evaluated immediately before the model fitting process begins.
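
A sketch of a descriptor in use, assuming the ranger package is installed
and using a regression on \code{mtcars} purely as an illustration:

\preformatted{
 library(parsnip)

 # No data object is referenced when the specification is created
 spec <- set_engine(
   rand_forest(mtry = .cols() - 1, mode = "regression"),
   "ranger"
 )

 # .cols() is resolved from the data supplied to fit()
 rf_fit <- fit(spec, mpg ~ ., data = mtcars)
}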

A less advantageous approach to solving this issue is to use
quasiquotation. This inserts the actual R object into the
model specification and is most reasonable when the data
object is small. For example, using

\preformatted{
 rand_forest(mtry = ncol(!!mtcars) - 1)
}

would work (and be reproducible between sessions) but embeds
the entire mtcars data set into the \code{mtry} expression:

\preformatted{
 > rand_forest(mtry = ncol(!!mtcars) - 1)
 Random Forest Model Specification (unknown)

 Main Arguments:
   mtry = ncol(structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, <snip>
}

However, if an object holding only the number of columns is
spliced in instead, the embedded value is small:

\preformatted{
 > mtry_val <- ncol(mtcars) - 1
 > mtry_val
 [1] 10
 > rand_forest(mtry = !!mtry_val)
 Random Forest Model Specification (unknown)

 Main Arguments:
   mtry = 10
}

More information on quosures and quasiquotation can be found at
\url{https://adv-r.hadley.nz/quasiquotation.html}.
}