TensorFlow implementation of the mixture of softmaxes (MoS) technique described in the paper Breaking the Softmax Bottleneck: A High-Rank RNN Language Model (Yang et al., 2017).
See https://github.com/zihangdai/mos for an implementation using PyTorch.
In natural language processing, how well a network can approximate the true probability distribution over appropriate responses depends on how expressively it can represent those probabilities.
The problem with using a single softmax is that, when applied to the logits (the raw outputs of the network), it can only represent a limited family of distributions. This limitation shows up as a low rank of the matrix of log-probabilities one constructs from the logits (one row per context, one column per output token): its rank is bounded by the hidden-state dimension, which is far smaller than the vocabulary. This low rank encourages the network to fit generic responses to each input.
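As a rough sketch of the rank argument (the symbols N, d, and V below are chosen here for illustration and are not from this repository): if H stacks the N context hidden states of dimension d and W is the output embedding of a V-word vocabulary, a single softmax writes the log-probabilities, up to a per-row normalization constant, as

```latex
\underbrace{\log P}_{N \times V} \;\approx\; H W^{\top},
\qquad H \in \mathbb{R}^{N \times d},\; W \in \mathbb{R}^{V \times d}
\;\Rightarrow\; \operatorname{rank}(\log P) \lesssim d \ll V
```

so no matter how rich the data is, the model can only express log-probability matrices whose rank is roughly the hidden dimension (the row-wise normalization can raise it by at most one).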
Ideally this matrix should be high-rank, which gives the model the expressiveness to condition its responses more finely on the input. The mixture of softmaxes accomplishes this by replacing the single softmax with a weighted mixture of K softmaxes, each computed from its own projection of the hidden state, so the resulting log-probability matrix is no longer rank-bounded by the hidden dimension.
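Below is a minimal TensorFlow 2 sketch of such a mixture-of-softmaxes output layer. It is illustrative only: the class name MixtureOfSoftmaxes and the parameters vocab_size and n_components are assumptions made here, not names taken from this repository or from the paper's code.

```python
import tensorflow as tf


class MixtureOfSoftmaxes(tf.keras.layers.Layer):
    """Illustrative mixture-of-softmaxes output layer (not this repo's exact code)."""

    def __init__(self, vocab_size, n_components=5, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.n_components = n_components

    def build(self, input_shape):
        d = int(input_shape[-1])
        # Mixture weights pi_k: one softmax over K logits per context.
        self.prior = tf.keras.layers.Dense(self.n_components)
        # Each of the K components gets its own tanh projection of the hidden state.
        self.latent = tf.keras.layers.Dense(self.n_components * d, activation="tanh")
        # A single vocabulary projection shared across all components.
        self.decoder = tf.keras.layers.Dense(self.vocab_size)

    def call(self, hidden):
        # hidden: [batch, d], the final hidden state for each context.
        batch = tf.shape(hidden)[0]
        # [batch, K] mixture weights that sum to one per example.
        pi = tf.nn.softmax(self.prior(hidden), axis=-1)
        # [batch, K, d] component-specific hidden states.
        h_k = tf.reshape(self.latent(hidden), [batch, self.n_components, -1])
        # [batch, K, V]: an independent softmax for every component.
        component_probs = tf.nn.softmax(self.decoder(h_k), axis=-1)
        # [batch, V]: weighted sum of the K softmaxes, taken in probability space,
        # so the result is no longer a single low-rank softmax.
        return tf.reduce_sum(pi[:, :, tf.newaxis] * component_probs, axis=1)


# Hypothetical usage: probabilities over a 10k-word vocabulary from a 256-dim state.
mos = MixtureOfSoftmaxes(vocab_size=10000, n_components=5)
probs = mos(tf.random.normal([32, 256]))  # shape [32, 10000]; each row sums to 1
```

Since the mixture is formed in probability space rather than logit space, training would take the log of these mixture probabilities for the cross-entropy loss instead of using a logits-based loss.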