As you probably know, uncertainty about a value can be decomposed into aleatoric and epistemic uncertainty, where
- Aleatoric uncertainty corresponds to the irreducible stochasticity in the value, such as its dependency on a coin toss.
- Epistemic uncertainty represents the uncertainty that could be reduced to zero if we had infinite data about our problem.
But what does this mean mathematically? It turns out this decomposition can be derived by analyzing the entropy of the predictive posterior.
Imagine we have a dataset $D$ of data points and a model with parameters $\theta \sim p(\theta \mid D)$ that, for an input $x$, predicts a distribution over the target $y$. The predictive posterior is obtained by marginalizing over the parameters:

$$p(y \mid x, D) = \int p(y \mid x, \theta)\, p(\theta \mid D)\, d\theta.$$

Its entropy $H[y \mid x, D]$ measures the total uncertainty of the prediction.
To decompose this uncertainty we will use two properties of entropy:
- $H[A, B] = H[A] + H[B] - I[A, B]$, which provides a decomposition of the entropy of the joint in terms of the entropies of the marginals and the mutual information between the variables of interest.
- $H[A \mid B] = H[A, B] - H[B]$, meaning that conditional entropy is the difference between our uncertainty about the joint and about the marginal.
Both properties can be easily derived from the definition of entropy.
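For completeness, here is a quick sketch of both derivations from the definition $H[X] = -\mathbb{E}\,\log p(x)$:

```latex
\begin{align*}
H[A, B] &= -\mathbb{E}_{p(a, b)} \log p(a, b)
         = -\mathbb{E}_{p(a, b)} \log \left( p(a)\, p(b) \cdot \frac{p(a, b)}{p(a)\, p(b)} \right) \\
        &= H[A] + H[B] - \mathbb{E}_{p(a, b)} \log \frac{p(a, b)}{p(a)\, p(b)}
         = H[A] + H[B] - I[A, B], \\[4pt]
H[A \mid B] &= -\mathbb{E}_{p(a, b)} \log p(a \mid b)
             = -\mathbb{E}_{p(a, b)} \log \frac{p(a, b)}{p(b)}
             = H[A, B] - H[B].
\end{align*}
```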
Applying the first property to the joint posterior over $y$ and $\theta$, we get

$$H[y, \theta \mid x, D] = H[y \mid x, D] + H[\theta \mid D] - I[y, \theta \mid x, D].$$
We can rearrange the terms to express the entropy of the predictive posterior:

$$H[y \mid x, D] = H[y, \theta \mid x, D] - H[\theta \mid D] + I[y, \theta \mid x, D],$$
which, using the second property, can be transformed into

$$H[y \mid x, D] = H[y \mid \theta, x, D] + I[y, \theta \mid x, D].$$
Let's analyze the terms we've got here. Using the definition of conditional entropy, the first term can be expressed as

$$H[y \mid \theta, x, D] = \mathbb{E}_{\theta \sim p(\theta \mid D)}\, H[y \mid x, \theta].$$

So here, given a model $\theta$, the entropy $H[y \mid x, \theta]$ captures only the noise that remains in $y$ once the model is fixed. Averaging it over the posterior $p(\theta \mid D)$ gives the aleatoric uncertainty, which no amount of additional data can reduce.
As for the second term,

$$I[y, \theta \mid x, D] = H[\theta \mid D] - H[\theta \mid y, x, D],$$

where we decomposed mutual information into a difference of entropies. Note that this term is zero only when adding the observation $(x, y)$ to the dataset would not reduce our uncertainty about the parameters $\theta$, i.e. when there is nothing left to learn about the model. This is exactly the epistemic uncertainty.
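To make this concrete, consider a Bayesian coin: the more flips we have observed, the less the next outcome can tell us about the coin's bias, so the mutual information between them shrinks toward zero. A minimal numerical sketch, assuming a Beta posterior over the bias and a grid discretization (the function names here are illustrative, not from any library):

```python
import numpy as np

def bernoulli_entropy(p):
    """Entropy (in nats) of a coin with heads probability p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def epistemic_about_coin(heads, tails, n_grid=10001):
    """I[y, theta]: information the next flip y carries about the bias theta,
    under a Beta(heads + 1, tails + 1) posterior discretized on a grid."""
    theta = np.linspace(0.0, 1.0, n_grid)[1:-1]  # interior grid points
    # Log-space posterior weights for numerical stability at large counts.
    log_w = heads * np.log(theta) + tails * np.log(1.0 - theta)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    total = bernoulli_entropy(np.dot(w, theta))      # H[y]
    aleatoric = np.dot(w, bernoulli_entropy(theta))  # E_theta H[y | theta]
    return total - aleatoric                         # I[y, theta]
```

Under a uniform prior (`heads=0, tails=0`) the next flip is quite informative about the bias, while after thousands of observed flips `epistemic_about_coin` is close to zero even though the flip itself stays perfectly random: the remaining uncertainty is almost purely aleatoric.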
One way to compute epistemic uncertainty, if we have access to samples $\theta_1, \dots, \theta_N \sim p(\theta \mid D)$, is to use the symmetric form $I[y, \theta \mid x, D] = H[y \mid x, D] - H[y \mid \theta, x, D]$ and approximate both terms with Monte Carlo:

$$I[y, \theta \mid x, D] \approx H\left[\frac{1}{N}\sum_{i=1}^{N} p(y \mid x, \theta_i)\right] - \frac{1}{N}\sum_{i=1}^{N} H\big[p(y \mid x, \theta_i)\big],$$
i.e. to subtract the average prediction entropy of the model ensemble elements from the total prediction entropy of the whole ensemble. In practice it's usually not feasible to sample from the model posterior, so multiple models trained with different random seeds are used to approximate samples. If the model is stochastic (e.g. it uses dropout or epinets), another option is to use the same model with different inference random seeds.
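This recipe can be sketched in a few lines of NumPy, assuming a classification model whose ensemble members each output a categorical distribution (the function names are my own, not from any library):

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy (in nats) of categorical distributions along `axis`."""
    p = np.clip(p, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=axis)

def uncertainty_decomposition(member_probs):
    """Split total predictive uncertainty into aleatoric and epistemic parts.

    member_probs: (n_members, n_classes) array; row i is p(y | x, theta_i)
    for one ensemble member.
    Returns (total, aleatoric, epistemic), where
      total     = H[ mean_i p(y | x, theta_i) ]   ~ H[y | x, D]
      aleatoric = mean_i H[ p(y | x, theta_i) ]   ~ E_theta H[y | x, theta]
      epistemic = total - aleatoric               ~ I[y, theta | x, D]
    """
    total = entropy(member_probs.mean(axis=0))
    aleatoric = entropy(member_probs).mean()
    return total, aleatoric, total - aleatoric

# Members agree: all uncertainty is aleatoric.
agree = np.array([[0.5, 0.5], [0.5, 0.5]])
# Members disagree confidently: uncertainty is almost purely epistemic.
disagree = np.array([[0.99, 0.01], [0.01, 0.99]])
```

In the first toy ensemble the members produce identical uniform predictions, so the total uncertainty of $\ln 2$ is entirely aleatoric and the epistemic term vanishes. In the second, the averaged prediction is also uniform (total $\ln 2$), but each member is nearly certain, so almost all of the uncertainty is epistemic: the disagreement between members is exactly what more data could resolve.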