We currently have MuyGPyS.gp.MultivariateMuyGPS (I'll call this MMuyGPS) as our supported solution for multivariate (multiple response variable) models. Of course, we can also use MuyGPyS.gp.MuyGPS for a multiple response variable model. The distinction is probably not clear to users. In this discussion I will outline the details of these two multivariate MuyGPs implementations and discuss possible future plans for supporting this functionality. I would really appreciate feedback from contributors, users, and stakeholders so that we can make sure that we are doing this right.
===== What is the difference? =====
Currently, MuyGPS with multiple response variables simply learns a single set of kernel hyperparameters on data $X \in \mathbb{R}^{n \times f}$ with responses $Y \in \mathbb{R}^{n \times d}$ and predicts the posterior mean on unobserved data $\mathbf{z} \in \mathbb{R}^{f}$ with $k$ nearest neighbor indices $N \subseteq \{1, \dots, n\}$ as
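$$
\begin{aligned}
\widehat{Y}(\mathbf{z} \mid X_{N, :}) &= K(\mathbf{z}, X_{N, :}) K(X_{N, :}, X_{N, :})^{-1} Y_{N, :}, \\
Var \left ( \widehat{Y}(\mathbf{z} \mid X_{N, :}) \right ) &= K(\mathbf{z}, \mathbf{z}) - K(\mathbf{z}, X_{N, :}) K(X_{N, :}, X_{N, :})^{-1} K(X_{N, :}, \mathbf{z}).
\end{aligned}
$$

Here $K(\mathbf{z}, X_{N, :}) \in \mathbb{R}^{1 \times k}$ and $K(X_{N, :}, X_{N, :}) \in \mathbb{R}^{k \times k}$ are the cross-covariance and in-sample covariance induced by the single shared set of kernel hyperparameters.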
I will refer to this as the "elementwise" view henceforward. I am also ignoring the SigmaSq scaling parameters here and throughout, since they are not relevant to the discussion and are easy to introduce to the math. Here $\widehat{Y}(\mathbf{z} \mid X_{N, :}) \in \mathbb{R}^{1 \times d}$ is a $d$-vector specifying the posterior mean for each response dimension, and $Var \left ( \widehat{Y}(\mathbf{z} \mid X_{N, :}) \right ) \in \mathbb{R}$ is a single scalar posterior variance shared by each response dimension.
The elementwise view is equivalent to solving the following system, where $\mathbf{y} \in \mathbb{R}^{nd}$ is the flattened version of $Y$, $M = \{d(i - 1) + j \mid i \in N, j \in \{1, \dots, d\} \}$ collects the indices within $\mathbf{y}$ of the $d$ responses of each neighbor of $\mathbf{z}$, and $I_d$ is the $d \times d$ identity matrix:
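$$
\begin{aligned}
\widetilde{Y}(\mathbf{z} \mid X_{N, :}) =
& \begin{bmatrix}
K(\mathbf{z}, X_{N_1, :}) * I_d & \cdots & K(\mathbf{z}, X_{N_k, :}) * I_d
\end{bmatrix} \\
& \begin{bmatrix}
K(X_{N_1, :}, X_{N_1, :}) * I_d & \cdots & K(X_{N_1, :}, X_{N_k, :}) * I_d \\
\vdots & \ddots & \vdots \\
K(X_{N_1, :}, X_{N_k, :}) * I_d & \cdots & K(X_{N_k, :}, X_{N_k, :}) * I_d
\end{bmatrix}^{-1}
\mathbf{y}_M, \\
Var \left ( \widetilde{Y}(\mathbf{z} \mid X_{N, :}) \right ) = K(\mathbf{z}, \mathbf{z}) * I_d -
& \begin{bmatrix}
K(\mathbf{z}, X_{N_1, :}) * I_d & \cdots & K(\mathbf{z}, X_{N_k, :}) * I_d
\end{bmatrix} \\
& \begin{bmatrix}
K(X_{N_1, :}, X_{N_1, :}) * I_d & \cdots & K(X_{N_1, :}, X_{N_k, :}) * I_d \\
\vdots & \ddots & \vdots \\
K(X_{N_1, :}, X_{N_k, :}) * I_d & \cdots & K(X_{N_k, :}, X_{N_k, :}) * I_d
\end{bmatrix}^{-1} \\
& \begin{bmatrix}
K(\mathbf{z}, X_{N_1, :}) * I_d & \cdots & K(\mathbf{z}, X_{N_k, :}) * I_d
\end{bmatrix}^\top.
\end{aligned}
$$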
I'll refer to this view of the math as "blockwise" henceforward. In the blockwise view, $\widetilde{Y}(\mathbf{z} \mid X_{N, :}) \in \mathbb{R}^{d \times 1}$ is the same as $\widehat{Y}(\mathbf{z} \mid X_{N, :})^\top$, the transpose of the elementwise posterior mean. $Var \left ( \widetilde{Y}(\mathbf{z} \mid X_{N, :}) \right ) \in \mathbb{R}^{d \times d}$ is a diagonal posterior variance matrix, equal to $Var \left ( \widehat{Y}(\mathbf{z} \mid X_{N, :}) \right ) * I_d$.
This picture gets a little more complicated for MMuyGPS. Here we have a separate set of hyperparameters $\theta_i$ associated with each response dimension $i$. Define a diagonal matrix $Q(\cdot, \cdot)$ as
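$$
Q(\mathbf{x}, \mathbf{x}') =
\begin{bmatrix}
K_{\theta_1}(\mathbf{x}, \mathbf{x}') & & \\
& \ddots & \\
& & K_{\theta_d}(\mathbf{x}, \mathbf{x}')
\end{bmatrix},
$$

where $K_{\theta_i}$ is the kernel parameterized by $\theta_i$. MMuyGPS transforms the blockwise view into

$$
\begin{aligned}
\breve{Y}(\mathbf{z} \mid X_{N, :}) =
& \begin{bmatrix}
Q(\mathbf{z}, X_{N_1, :}) & \cdots & Q(\mathbf{z}, X_{N_k, :})
\end{bmatrix} \\
& \begin{bmatrix}
Q(X_{N_1, :}, X_{N_1, :}) & \cdots & Q(X_{N_1, :}, X_{N_k, :}) \\
\vdots & \ddots & \vdots \\
Q(X_{N_1, :}, X_{N_k, :}) & \cdots & Q(X_{N_k, :}, X_{N_k, :})
\end{bmatrix}^{-1}
\mathbf{y}_M, \\
Var \left ( \breve{Y}(\mathbf{z} \mid X_{N, :}) \right ) = Q(\mathbf{z}, \mathbf{z}) -
& \begin{bmatrix}
Q(\mathbf{z}, X_{N_1, :}) & \cdots & Q(\mathbf{z}, X_{N_k, :})
\end{bmatrix} \\
& \begin{bmatrix}
Q(X_{N_1, :}, X_{N_1, :}) & \cdots & Q(X_{N_1, :}, X_{N_k, :}) \\
\vdots & \ddots & \vdots \\
Q(X_{N_1, :}, X_{N_k, :}) & \cdots & Q(X_{N_k, :}, X_{N_k, :})
\end{bmatrix}^{-1} \\
& \begin{bmatrix}
Q(\mathbf{z}, X_{N_1, :}) & \cdots & Q(\mathbf{z}, X_{N_k, :})
\end{bmatrix}^\top.
\end{aligned}
$$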
In the MMuyGPS blockwise view, $\breve{Y}(\mathbf{z} \mid X_{N, :}) \in \mathbb{R}^{d \times 1}$ gives the posterior mean for the response variables, but unlike in the prior views, each dimension $i$ is controlled by a distinct set of kernel hyperparameters $\theta_i$. Similarly, $Var \left ( \breve{Y}(\mathbf{z} \mid X_{N, :}) \right ) \in \mathbb{R}^{d \times d}$ is a diagonal posterior variance matrix, where each diagonal element is independently controlled by the corresponding kernel hyperparameters $\theta_i$.
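To make the structure concrete, here is a small NumPy sketch (not MuyGPyS code; the toy RBF kernel, the array shapes, and the variable names are my own) verifying that the blockwise solve with diagonal $Q$ blocks reduces to $d$ independent single-response solves:

```python
import numpy as np

# Toy check (not MuyGPyS code): with diagonal Q blocks, the kd x kd blockwise
# solve decomposes into d independent k x k solves, one per response dimension.
rng = np.random.default_rng(0)
k, d, f = 5, 3, 4                       # neighbors, responses, features
ell = np.array([0.5, 1.0, 2.0])         # one length scale (theta_i) per response

def rbf(A, B, length_scale):
    """Toy RBF kernel standing in for a per-dimension kernel evaluation."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * length_scale**2))

X_nn = rng.normal(size=(k, f))          # features of the k nearest neighbors
z = rng.normal(size=(1, f))             # the prediction location
Y_nn = rng.normal(size=(k, d))          # responses of the k nearest neighbors

Kin = np.stack([rbf(X_nn, X_nn, ls) + 1e-8 * np.eye(k) for ls in ell])  # (d, k, k)
Kcross = np.stack([rbf(z, X_nn, ls)[0] for ls in ell])                  # (d, k)

# blockwise view: assemble the kd x kd matrix of diagonal Q blocks and y_M
big = np.zeros((k * d, k * d))
for i in range(k):
    for j in range(k):
        big[i * d:(i + 1) * d, j * d:(j + 1) * d] = np.diag(Kin[:, i, j])
cross = np.hstack([np.diag(Kcross[:, i]) for i in range(k)])            # (d, kd)
y_M = Y_nn.reshape(-1)                                                  # (kd,)
blockwise_mean = cross @ np.linalg.solve(big, y_M)

# elementwise view: d separate k x k solves, each with its own hyperparameters
independent_mean = np.array(
    [Kcross[i] @ np.linalg.solve(Kin[i], Y_nn[:, i]) for i in range(d)]
)
assert np.allclose(blockwise_mean, independent_mean)
```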
===== What is wrong with MMuyGPS? =====
The key problem with MMuyGPS is that $Var \left ( \breve{Y}(\mathbf{z} \mid X_{N, :}) \right )$ is a diagonal matrix: it asserts that the off-diagonal covariances between the response variables are zero, and therefore that the responses are jointly independent of one another. We have postulated that we can instead report $Var \left ( \breve{Y}(\mathbf{z} \mid X_{N, :}) \right ) C$, where $C \in \mathbb{R}^{d \times d}$ is an empirical covariance or empirical correlation matrix drawn from the response observations in the training data. This approach certainly has brevity going for it, but I am curious as to whether we can instead learn a version of the blockwise view of the inference problem where the inner blocks (the $Q$ matrices) are not themselves diagonal.
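For illustration, here is a hedged NumPy sketch of that postulate (not MuyGPyS code; the function name and array shapes are my own):

```python
import numpy as np

# Sketch (not MuyGPyS code): combine the diagonal MMuyGPS posterior variance
# with an empirical correlation matrix C estimated from the training responses.
def correlation_scaled_variance(diag_variances, train_responses):
    """
    diag_variances  : (d,) per-dimension posterior variances from MMuyGPS
    train_responses : (n, d) training response matrix Y
    """
    C = np.corrcoef(train_responses, rowvar=False)  # (d, d) empirical correlation
    # the literal product proposed above: Var(Y) C
    dense = np.diag(diag_variances) @ C
    # a symmetric alternative that is guaranteed to be a valid covariance
    s = np.sqrt(diag_variances)
    symmetric = s[:, None] * C * s[None, :]
    return dense, symmetric
```

The literal product is generally asymmetric; the symmetric form is one way to guarantee that the reported matrix remains positive semi-definite.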
===== A dense blockwise view? =====
What would a dense blockwise view look like? Firstly, there would need to be more sets of hyperparameters. I believe that we would need $\theta_{i,j}$ for each $1 \leq i \leq j \leq d$. There might be additional constraints on these hyperparameters that I have not considered. We could then solve a similar MMuyGPS blockwise view where we replace $Q(\cdot, \cdot)$ with $D(\cdot, \cdot)$, defined as
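$$
D(\mathbf{x}, \mathbf{x}') =
\begin{bmatrix}
K_{\theta_{1,1}}(\mathbf{x}, \mathbf{x}') & \cdots & K_{\theta_{1,d}}(\mathbf{x}, \mathbf{x}') \\
\vdots & \ddots & \vdots \\
K_{\theta_{1,d}}(\mathbf{x}, \mathbf{x}') & \cdots & K_{\theta_{d,d}}(\mathbf{x}, \mathbf{x}')
\end{bmatrix},
$$

where the $(i, j)$ and $(j, i)$ entries share the hyperparameters $\theta_{i,j}$, so that $D$ is symmetric.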
This would allow us to create fully covariant posterior means and covariances, at an $O(d^2)$ increase in memory overhead, an $O(d^2)$ increase in the computation associated with evaluating kernels, and an $O(d^3)$ increase in the computation associated with performing solves.
In addition to significant overhead, this also introduces the serious problem of learning all of the $\theta_{i,j}$. MMuyGPS makes this simple: the posterior means and variances that we feed into our loss functions split completely into per-dimension means and variances under the joint independence assumption, so we effectively train $d$ separate MuyGPS models that do not interact. However, this dense proposal is not separable and so requires a different training procedure. There are likely identifiability issues that make training such a model difficult or impossible. I would very much like input and discussion on this topic.
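To illustrate why the current training separates, here is a tiny sketch (a generic Gaussian-style objective in plain NumPy, not one of our loss functions; the names are my own) in which a diagonal posterior variance makes the batch loss a sum of $d$ terms that each touch only one $\theta_i$:

```python
import numpy as np

# Illustrative only (not a MuyGPyS loss): with diagonal posterior variances,
# a Gaussian-style batch objective splits into d independent terms.
def per_dimension_losses(means, variances, targets):
    """means, variances, targets : (b, d) arrays over a batch of b predictions."""
    per_point = 0.5 * (np.log(variances) + (targets - means) ** 2 / variances)
    # column i depends only on dimension i's model, so summing over the batch
    # leaves d separable objectives that can be minimized independently
    return per_point.sum(axis=0)  # shape (d,)
```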
===== The future of multivariate models =====
Whatever form the math takes, recent software engineering improvements to the library suggest that we should reconsider how we define multivariate models. Indeed, it is tempting to consider function composition solutions that would fold the functionality currently supported by MMuyGPS into MuyGPS so that, based upon the initialization parameters, we can create a future MuyGPS object that is functionally equivalent to the present MMuyGPS class. This would decrease development and testing overhead significantly and ensure that new features are easier to incorporate into generic models.
However, there are several problems that must be solved to make this possible.
We would need to generalize KernelFn to support multiple separate, simultaneous KernelFn instances in a way that coalesces them in the __call__ and get_optim_params methods (a hypothetical sketch of one possible shape for this follows this list of problems). These changes would need to percolate into the downstream Matern and RBF classes in a natural way.
We would also need to generalize how PosteriorMean and PosteriorVariance work for MMuyGPS-style workflows. We don't want to literally evaluate the blockwise view of the math, so the implementation would need to reproduce the current behavior of independent per-dimension solves. It is not immediately obvious to me how to do this cleanly, and it will likely require much discussion. We will also need the get_opt_fn functions to be able to reason about different noise parameters for each response variable.
We will need to be very careful with the hooks into optimization throughout.
There are probably other issues that don't immediately come to mind.
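As a starting point for the first problem above, here is a hypothetical sketch of a coalescing wrapper. It is not a proposed MuyGPyS API: the MultiKernel name, the assumption that each wrapped kernel maps a pairwise difference tensor to a kernel tensor via __call__, and the assumption that get_optim_params returns a dict are all mine.

```python
import numpy as np

# Hypothetical sketch (not the MuyGPyS API): hold one kernel per response
# dimension and coalesce their __call__ and get_optim_params outputs.
class MultiKernel:
    def __init__(self, kernels):
        self.kernels = list(kernels)  # one kernel object per response dimension

    def __call__(self, diffs):
        # stack the d per-dimension kernel tensors along a trailing axis
        return np.stack([kern(diffs) for kern in self.kernels], axis=-1)

    def get_optim_params(self):
        # namespace each kernel's free hyperparameters by its response index
        # so that the optimizer sees one flat parameter collection
        params = {}
        for i, kern in enumerate(self.kernels):
            for name, value in kern.get_optim_params().items():
                params[f"{name}_{i}"] = value
        return params
```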
If we address these issues and fold the current MMuyGPS behavior into MuyGPS, then we will be able to do something similar with any non-jointly-independent models that we can come up with.