We currently have MuyGPyS.gp.MultivariateMuyGPS (I'll call this MMuyGPS) as our supported solution for multivariate (multiple response variable) models. Of course, we can also use MuyGPyS.gp.MuyGPS for a multiple response variable model. The distinction is probably not clear to users. In this discussion I will outline the details of these two multivariate MuyGPs implementations and discuss possible future plans for supporting this functionality. I would really appreciate feedback from contributors, users, and stakeholders so that we can make sure that we are doing this right.
===== What is the difference? =====
Currently, MuyGPS with multiple response variables simply learns a single set of kernel hyperparameters on data $X \in \mathbb{R}^{n \times f}$ with responses $Y \in \mathbb{R}^{n \times d}$ and predicts the posterior mean on unobserved data $\mathbf{z} \in \mathbb{R}^{f}$ with $k$ nearest neighbor indices $N \subseteq \{1, \dots, n\}$ as
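$$
\begin{aligned}
\widehat{Y}(\mathbf{z} \mid X_{N, :}) &= K(\mathbf{z}, X_{N, :}) K(X_{N, :}, X_{N, :})^{-1} Y_{N, :}, \\
Var \left ( \widehat{Y}(\mathbf{z} \mid X_{N, :}) \right ) &= K(\mathbf{z}, \mathbf{z}) - K(\mathbf{z}, X_{N, :}) K(X_{N, :}, X_{N, :})^{-1} K(X_{N, :}, \mathbf{z}).
\end{aligned}
$$

Here $K(\mathbf{z}, X_{N, :}) \in \mathbb{R}^{1 \times k}$ and $K(X_{N, :}, X_{N, :}) \in \mathbb{R}^{k \times k}$ are the cross-covariance and in-sample covariance induced by the single shared set of kernel hyperparameters.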
I will refer to this as the "elementwise" view henceforward. I am also ignoring the SigmaSq scaling parameters here and throughout, since they are not relevant to the discussion and are easy to introduce to the math. Here $\widehat{Y}(\mathbf{z} \mid X_{N, :}) \in \mathbb{R}^{1 \times d}$ is a $d$-vector specifying the posterior mean for each response dimension, and $Var \left ( \widehat{Y}(\mathbf{z} \mid X_{N, :}) \right ) \in \mathbb{R}$ is a single scalar posterior variance shared by each response dimension.
The elementwise view is equivalent to solving the following system, where $\mathbf{y} \in \mathbb{R}^{nd}$ is the flattened version of $Y$, $M = \{d(i - 1) + j \mid i \in N, j \in \{1, \dots, d\} \}$ collects the indices within $\mathbf{y}$ of the $d$ responses of each neighbor of $\mathbf{z}$, and $I_d$ is the $d \times d$ identity matrix:
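$$
\begin{aligned}
\widetilde{Y}(\mathbf{z} \mid X_{N, :}) =
& \begin{bmatrix}
K(\mathbf{z}, X_{N_1, :}) * I_d & \cdots & K(\mathbf{z}, X_{N_k, :}) * I_d
\end{bmatrix} \\
& \begin{bmatrix}
K(X_{N_1, :}, X_{N_1, :}) * I_d & \cdots & K(X_{N_1, :}, X_{N_k, :}) * I_d \\
\vdots & \ddots & \vdots \\
K(X_{N_1, :}, X_{N_k, :}) * I_d & \cdots & K(X_{N_k, :}, X_{N_k, :}) * I_d
\end{bmatrix}^{-1}
\mathbf{y}_M, \\
Var \left ( \widetilde{Y}(\mathbf{z} \mid X_{N, :}) \right ) = K(\mathbf{z}, \mathbf{z}) * I_d -
& \begin{bmatrix}
K(\mathbf{z}, X_{N_1, :}) * I_d & \cdots & K(\mathbf{z}, X_{N_k, :}) * I_d
\end{bmatrix} \\
& \begin{bmatrix}
K(X_{N_1, :}, X_{N_1, :}) * I_d & \cdots & K(X_{N_1, :}, X_{N_k, :}) * I_d \\
\vdots & \ddots & \vdots \\
K(X_{N_1, :}, X_{N_k, :}) * I_d & \cdots & K(X_{N_k, :}, X_{N_k, :}) * I_d
\end{bmatrix}^{-1} \\
& \begin{bmatrix}
K(\mathbf{z}, X_{N_1, :}) * I_d & \cdots & K(\mathbf{z}, X_{N_k, :}) * I_d
\end{bmatrix}^\top.
\end{aligned}
$$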
I'll refer to this view of the math as "blockwise" henceforward. In the blockwise view, $\widetilde{Y}(\mathbf{z} \mid X_{N, :}) \in \mathbb{R}^{d \times 1}$ is the same as $\widehat{Y}(\mathbf{z} \mid X_{N, :})^\top$, the transpose of the elementwise posterior mean. $Var \left ( \widetilde{Y}(\mathbf{z} \mid X_{N, :}) \right ) \in \mathbb{R}^{d \times d}$ is a diagonal posterior variance matrix, equal to $Var \left ( \widehat{Y}(\mathbf{z} \mid X_{N, :}) \right ) * I_d$.
This picture gets a little more complicated for MMuyGPS. Here we have a separate set of hyperparameters $\theta_i$ associated with each response dimension $i$. Define a diagonal matrix $Q(\cdot, \cdot)$ as
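$$
Q(\mathbf{x}, \mathbf{x}') =
\begin{bmatrix}
K_{\theta_1}(\mathbf{x}, \mathbf{x}') & & \\
& \ddots & \\
& & K_{\theta_d}(\mathbf{x}, \mathbf{x}')
\end{bmatrix},
$$

where $K_{\theta_i}$ is the kernel parameterized by $\theta_i$. MMuyGPS transforms the blockwise view into

$$
\begin{aligned}
\breve{Y}(\mathbf{z} \mid X_{N, :}) =
& \begin{bmatrix}
Q(\mathbf{z}, X_{N_1, :}) & \cdots & Q(\mathbf{z}, X_{N_k, :})
\end{bmatrix} \\
& \begin{bmatrix}
Q(X_{N_1, :}, X_{N_1, :}) & \cdots & Q(X_{N_1, :}, X_{N_k, :}) \\
\vdots & \ddots & \vdots \\
Q(X_{N_1, :}, X_{N_k, :}) & \cdots & Q(X_{N_k, :}, X_{N_k, :})
\end{bmatrix}^{-1}
\mathbf{y}_M, \\
Var \left ( \breve{Y}(\mathbf{z} \mid X_{N, :}) \right ) = Q(\mathbf{z}, \mathbf{z}) -
& \begin{bmatrix}
Q(\mathbf{z}, X_{N_1, :}) & \cdots & Q(\mathbf{z}, X_{N_k, :})
\end{bmatrix} \\
& \begin{bmatrix}
Q(X_{N_1, :}, X_{N_1, :}) & \cdots & Q(X_{N_1, :}, X_{N_k, :}) \\
\vdots & \ddots & \vdots \\
Q(X_{N_1, :}, X_{N_k, :}) & \cdots & Q(X_{N_k, :}, X_{N_k, :})
\end{bmatrix}^{-1} \\
& \begin{bmatrix}
Q(\mathbf{z}, X_{N_1, :}) & \cdots & Q(\mathbf{z}, X_{N_k, :})
\end{bmatrix}^\top.
\end{aligned}
$$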
In the MMuyGPS blockwise view, $\breve{Y}(\mathbf{z} \mid X_{N, :}) \in \mathbb{R}^{d \times 1}$ gives the posterior mean for the response variables, but unlike in the prior views, each dimension $i$ is controlled by a distinct set of kernel hyperparameters $\theta_i$. Similarly, $Var \left ( \breve{Y}(\mathbf{z} \mid X_{N, :}) \right ) \in \mathbb{R}^{d \times d}$ is a diagonal posterior variance matrix, where each diagonal element is independently controlled by the corresponding kernel hyperparameters $\theta_i$.
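To make the structure concrete, here is a small NumPy sketch (not MuyGPyS code; the toy RBF kernel, the array shapes, and the variable names are my own) verifying that the blockwise solve with diagonal $Q$ blocks reduces to $d$ independent single-response solves:

```python
import numpy as np

# Toy check (not MuyGPyS code): with diagonal Q blocks, the kd x kd blockwise
# solve decomposes into d independent k x k solves, one per response dimension.
rng = np.random.default_rng(0)
k, d, f = 5, 3, 4                       # neighbors, responses, features
ell = np.array([0.5, 1.0, 2.0])         # one length scale (theta_i) per response

def rbf(A, B, length_scale):
    """Toy RBF kernel standing in for a per-dimension kernel evaluation."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * length_scale**2))

X_nn = rng.normal(size=(k, f))          # features of the k nearest neighbors
z = rng.normal(size=(1, f))             # the prediction location
Y_nn = rng.normal(size=(k, d))          # responses of the k nearest neighbors

Kin = np.stack([rbf(X_nn, X_nn, ls) + 1e-8 * np.eye(k) for ls in ell])  # (d, k, k)
Kcross = np.stack([rbf(z, X_nn, ls)[0] for ls in ell])                  # (d, k)

# blockwise view: assemble the kd x kd matrix of diagonal Q blocks and y_M
big = np.zeros((k * d, k * d))
for i in range(k):
    for j in range(k):
        big[i * d:(i + 1) * d, j * d:(j + 1) * d] = np.diag(Kin[:, i, j])
cross = np.hstack([np.diag(Kcross[:, i]) for i in range(k)])            # (d, kd)
y_M = Y_nn.reshape(-1)                                                  # (kd,)
blockwise_mean = cross @ np.linalg.solve(big, y_M)

# elementwise view: d separate k x k solves, each with its own hyperparameters
independent_mean = np.array(
    [Kcross[i] @ np.linalg.solve(Kin[i], Y_nn[:, i]) for i in range(d)]
)
assert np.allclose(blockwise_mean, independent_mean)
```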
===== What is wrong with MMuyGPS? =====
The key problem with MMuyGPS is that $Var \left ( \breve{Y}(\mathbf{z} \mid X_{N, :}) \right )$ is a diagonal matrix: it asserts that the off-diagonal covariances between the response variables are zero, and therefore that the responses are jointly independent of one another. We have postulated that we can instead report $Var \left ( \breve{Y}(\mathbf{z} \mid X_{N, :}) \right ) C$, where $C \in \mathbb{R}^{d \times d}$ is an empirical covariance or empirical correlation matrix drawn from the response observations in the training data. This approach certainly has brevity going for it, but I am curious as to whether we can instead learn a version of the blockwise view of the inference problem where the inner blocks (the $Q$ matrices) are not themselves diagonal.
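For illustration, here is a hedged NumPy sketch of that postulate (not MuyGPyS code; the function name and array shapes are my own):

```python
import numpy as np

# Sketch (not MuyGPyS code): combine the diagonal MMuyGPS posterior variance
# with an empirical correlation matrix C estimated from the training responses.
def correlation_scaled_variance(diag_variances, train_responses):
    """
    diag_variances  : (d,) per-dimension posterior variances from MMuyGPS
    train_responses : (n, d) training response matrix Y
    """
    C = np.corrcoef(train_responses, rowvar=False)  # (d, d) empirical correlation
    # the literal product proposed above: Var(Y) C
    dense = np.diag(diag_variances) @ C
    # a symmetric alternative that is guaranteed to be a valid covariance
    s = np.sqrt(diag_variances)
    symmetric = s[:, None] * C * s[None, :]
    return dense, symmetric
```

The literal product is generally asymmetric; the symmetric form is one way to guarantee that the reported matrix remains positive semi-definite.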
===== A dense blockwise view? =====
What would a dense blockwise view look like? Firstly, there would need to be more sets of hyperparameters. I believe that we would need $\theta_{i,j}$ for each $1 \leq i \leq j \leq d$. There might be additional constraints on these hyperparameters that I have not considered. We could then solve a similar MMuyGPS blockwise view where we replace $Q(\cdot, \cdot)$ with $D(\cdot, \cdot)$, defined as
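$$
D(\mathbf{x}, \mathbf{x}') =
\begin{bmatrix}
K_{\theta_{1,1}}(\mathbf{x}, \mathbf{x}') & \cdots & K_{\theta_{1,d}}(\mathbf{x}, \mathbf{x}') \\
\vdots & \ddots & \vdots \\
K_{\theta_{1,d}}(\mathbf{x}, \mathbf{x}') & \cdots & K_{\theta_{d,d}}(\mathbf{x}, \mathbf{x}')
\end{bmatrix},
$$

where the $(i, j)$ and $(j, i)$ entries share the hyperparameters $\theta_{i,j}$, so that $D$ is symmetric.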
This would allow us to create fully covariant posterior means and covariances, at an $O(d^2)$ increase in memory overhead, an $O(d^2)$ increase in the computation associated with evaluating kernels, and an $O(d^3)$ increase in the computation associated with performing solves.
In addition to significant overhead, this also introduces the serious problem of learning all of the $\theta_{i,j}$. MMuyGPS makes this simple: the posterior means and variances that we feed into our loss functions split completely into per-dimension means and variances under the joint independence assumption, so we effectively train $d$ separate MuyGPS models that do not interact. However, this dense proposal is not separable and so requires a different training procedure. There are likely identifiability issues that make training such a model difficult or impossible. I would very much like input and discussion on this topic.
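To illustrate why the current training separates, here is a tiny sketch (a generic Gaussian-style objective in plain NumPy, not one of our loss functions; the names are my own) in which a diagonal posterior variance makes the batch loss a sum of $d$ terms that each touch only one $\theta_i$:

```python
import numpy as np

# Illustrative only (not a MuyGPyS loss): with diagonal posterior variances,
# a Gaussian-style batch objective splits into d independent terms.
def per_dimension_losses(means, variances, targets):
    """means, variances, targets : (b, d) arrays over a batch of b predictions."""
    per_point = 0.5 * (np.log(variances) + (targets - means) ** 2 / variances)
    # column i depends only on dimension i's model, so summing over the batch
    # leaves d separable objectives that can be minimized independently
    return per_point.sum(axis=0)  # shape (d,)
```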
===== The future of multivariate models =====
Whatever form the math takes, recent software engineering improvements to the library suggest that we should reconsider how we define multivariate models. Indeed, it is tempting to consider function composition solutions that would fold the functionality currently supported by MMuyGPS into MuyGPS so that, based upon the initialization parameters, we can create a future MuyGPS object that is functionally equivalent to the present MMuyGPS class. This would decrease development and testing overhead significantly and ensure that new features are easier to incorporate into generic models.
However, there are several problems that must be solved to make this possible.
We would need to generalize KernelFn to support multiple separate, simultaneous KernelFn instances in a way that coalesces them in the __call__ and get_optim_params methods (a hypothetical sketch of one possible shape for this follows this list of problems). These changes would need to percolate into the downstream Matern and RBF classes in a natural way.
We would also need to generalize how PosteriorMean and PosteriorVariance work for MMuyGPS-style workflows. We don't want to literally evaluate the blockwise view of the math, so the implementation would need to reproduce the current behavior of independent per-dimension solves. It is not immediately obvious to me how to do this cleanly, and it will likely require much discussion. We will also need the get_opt_fn functions to be able to reason about different noise parameters for each response variable.
We will need to be very careful with the hooks into optimization throughout.
There are probably other issues that don't immediately come to mind.
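As a starting point for the first problem above, here is a hypothetical sketch of a coalescing wrapper. It is not a proposed MuyGPyS API: the MultiKernel name, the assumption that each wrapped kernel maps a pairwise difference tensor to a kernel tensor via __call__, and the assumption that get_optim_params returns a dict are all mine.

```python
import numpy as np

# Hypothetical sketch (not the MuyGPyS API): hold one kernel per response
# dimension and coalesce their __call__ and get_optim_params outputs.
class MultiKernel:
    def __init__(self, kernels):
        self.kernels = list(kernels)  # one kernel object per response dimension

    def __call__(self, diffs):
        # stack the d per-dimension kernel tensors along a trailing axis
        return np.stack([kern(diffs) for kern in self.kernels], axis=-1)

    def get_optim_params(self):
        # namespace each kernel's free hyperparameters by its response index
        # so that the optimizer sees one flat parameter collection
        params = {}
        for i, kern in enumerate(self.kernels):
            for name, value in kern.get_optim_params().items():
                params[f"{name}_{i}"] = value
        return params
```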
If we address these issues and fold the current MMuyGPS behavior into MuyGPS, then we will be able to do something similar with any non-jointly-independent models that we can come up with.