diff --git a/aitk/_version.py b/aitk/_version.py
index b211de5..9e2806c 100644
--- a/aitk/_version.py
+++ b/aitk/_version.py
@@ -8,5 +8,5 @@
#
# **************************************************************
-version_info = (2, 1, 0)
+version_info = (3, 0, 0)
__version__ = ".".join(map(str, version_info))
diff --git a/aitk/keras/README.md b/aitk/keras/README.md
deleted file mode 100644
index 0521ade..0000000
--- a/aitk/keras/README.md
+++ /dev/null
@@ -1,93 +0,0 @@
-# Neural network models
-This module implements building-blocks for larger neural network models in the
-Keras-style. This module does _not_ implement a general autograd system, in order
-to emphasize conceptual understanding over flexibility.
-
-1. **Activations**. Common activation nonlinearities. Includes:
- - Rectified linear units (ReLU) ([Hahnloser et al., 2000](http://invibe.net/biblio_database_dyva/woda/data/att/6525.file.pdf))
- - Leaky rectified linear units
- ([Maas, Hannun, & Ng, 2013](https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf))
- - Exponential linear units (ELU) ([Clevert, Unterthiner, & Hochreiter, 2016](http://arxiv.org/abs/1511.07289))
- - Scaled exponential linear units ([Klambauer, Unterthiner, & Mayr, 2017](https://arxiv.org/pdf/1706.02515.pdf))
- - Softplus units
- - Hard sigmoid units
- - Exponential units
- - Hyperbolic tangent (tanh)
- - Logistic sigmoid
- - Affine
-
-2. **Losses**. Common loss functions. Includes:
- - Squared error
- - Categorical cross entropy
- - VAE Bernoulli loss ([Kingma & Welling, 2014](https://arxiv.org/abs/1312.6114))
- - Wasserstein loss with gradient penalty ([Gulrajani et al., 2017](https://arxiv.org/pdf/1704.00028.pdf))
- - Noise contrastive estimation (NCE) loss ([Gutmann & Hyvärinen, 2010](https://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann10AISTATS.pdf); [Mnih & Teh, 2012](https://www.cs.toronto.edu/~amnih/papers/ncelm.pdf))
-
-3. **Wrappers**. Layer wrappers. Includes:
- - Dropout ([Srivastava, et al., 2014](http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf))
-
-4. **Layers**. Common layers / layer-wise operations that can be composed to
- create larger neural networks. Includes:
- - Fully-connected
- - Sparse evolutionary ([Mocanu et al., 2018](https://www.nature.com/articles/s41467-018-04316-3))
- - Dot-product attention ([Luong, Pham, & Manning, 2015](https://arxiv.org/pdf/1508.04025.pdf); [Vaswani et al., 2017](https://arxiv.org/pdf/1706.03762.pdf))
- - 1D and 2D convolution (with stride, padding, and dilation) ([van den Oord et al., 2016](https://arxiv.org/pdf/1609.03499.pdf); [Yu & Koltun, 2016](https://arxiv.org/pdf/1511.07122.pdf))
- - 2D "deconvolution" (with stride and padding) ([Zeiler et al., 2010](https://www.matthewzeiler.com/mattzeiler/deconvolutionalnetworks.pdf))
- - Restricted Boltzmann machines (with CD-_n_ training) ([Smolensky, 1986](http://stanford.edu/~jlmcc/papers/PDP/Volume%201/Chap6_PDP86.pdf); [Carreira-Perpiñán & Hinton, 2005](http://www.cs.toronto.edu/~fritz/absps/cdmiguel.pdf))
- - Elementwise multiplication
- - Embedding
- - Summation
- - Flattening
- - Softmax
- - Max & average pooling
- - 1D and 2D batch normalization ([Ioffe & Szegedy, 2015](http://proceedings.mlr.press/v37/ioffe15.pdf))
- - 1D and 2D layer normalization ([Ba, Kiros, & Hinton, 2016](https://arxiv.org/pdf/1607.06450.pdf))
- - Recurrent ([Elman, 1990](https://crl.ucsd.edu/~elman/Papers/fsit.pdf))
- - Long short-term memory (LSTM) ([Hochreiter & Schmidhuber, 1997](http://www.bioinf.jku.at/publications/older/2604.pdf))
-
-5. **Optimizers**. Common modifications to stochastic gradient descent.
- Includes:
- - SGD with momentum ([Rumelhart, Hinton, & Williams, 1986](https://www.cs.princeton.edu/courses/archive/spring18/cos495/res/backprop_old.pdf))
- - AdaGrad ([Duchi, Hazan, & Singer, 2011](http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf))
- - RMSProp ([Tieleman & Hinton, 2012](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf))
- - Adam ([Kingma & Ba, 2015](https://arxiv.org/pdf/1412.6980v8.pdf))
-
-6. **Learning Rate Schedulers**. Common learning rate decay schedules.
- - Constant
- - Exponential decay
- - Noam/Transformer scheduler ([Vaswani et al., 2017](https://arxiv.org/pdf/1706.03762.pdf))
- - King/Dlib scheduler ([King, 2018](http://blog.dlib.net/2018/02/automatic-learning-rate-scheduling-that.html))
-
-7. **Initializers**. Common weight initialization strategies.
- - Glorot/Xavier uniform and normal ([Glorot & Bengio, 2010](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf))
- - He/Kaiming uniform and normal ([He et al., 2015](https://arxiv.org/pdf/1502.01852v1.pdf))
- - Standard normal
- - Truncated normal
-
-8. **Modules**. Common multi-layer blocks that appear across many deep networks.
- Includes:
- - Bidirectional LSTMs ([Schuster & Paliwal, 1997](https://pdfs.semanticscholar.org/4b80/89bc9b49f84de43acc2eb8900035f7d492b2.pdf))
- - ResNet-style "identity" (i.e., `same`-convolution) residual blocks ([He et al., 2015](https://arxiv.org/pdf/1512.03385.pdf))
- - ResNet-style "convolutional" (i.e., parametric) residual blocks ([He et al., 2015](https://arxiv.org/pdf/1512.03385.pdf))
- - WaveNet-style residual block with dilated causal convolutions ([van den Oord et al., 2016](https://arxiv.org/pdf/1609.03499.pdf))
- - Transformer-style multi-headed dot-product attention ([Vaswani et al., 2017](https://arxiv.org/pdf/1706.03762.pdf))
-
-9. **Models**. Well-known network architectures. Includes:
- - `vae.py`: Bernoulli variational autoencoder ([Kingma & Welling, 2014](https://arxiv.org/abs/1312.6114))
- - `wgan_gp.py`: Wasserstein generative adversarial network with gradient
- penalty ([Gulrajani et al., 2017](https://arxiv.org/pdf/1704.00028.pdf);
-[Goodfellow et al., 2014](https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf))
- - `w2v.py`: word2vec model with CBOW and skip-gram architectures and
- training via noise contrastive estimation ([Mikolov et al., 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf))
-
-10. **Utils**. Common helper functions, primarily for dealing with CNNs.
- Includes:
- - `im2col`
- - `col2im`
- - `conv1D`
- - `conv2D`
- - `dilate`
- - `deconv2D`
- - `minibatch`
- - Various weight initialization utilities
- - Various padding and convolution arithmetic utilities
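
# Illustrative sketch (not part of the diff above): the classical SGD-with-momentum
# update listed under "Optimizers" in the deleted README, written in plain NumPy.
# The learning-rate and momentum values are arbitrary choices for the example.
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    # velocity accumulates an exponentially decaying sum of past gradients
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w, v = np.zeros(3), np.zeros(3)
w, v = sgd_momentum_step(w, np.array([1.0, -2.0, 0.5]), v)
# first step is just -lr * grad: [-0.01, 0.02, -0.005]
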
diff --git a/aitk/keras/__init__.py b/aitk/keras/__init__.py
deleted file mode 100644
index 9accc14..0000000
--- a/aitk/keras/__init__.py
+++ /dev/null
@@ -1,30 +0,0 @@
-# -*- coding: utf-8 -*-
-# **************************************************************
-# aitk.keras: A Python Keras model API
-#
-# Copyright (c) 2021 AITK Developers
-#
-# https://github.com/ArtificialIntelligenceToolkit/aitk.keras
-#
-# **************************************************************
-
-"""A module of basic building blcoks for constructing neural networks"""
-from . import utils
-from . import losses
-from . import activations
-from . import schedulers
-from . import optimizers
-from . import wrappers
-from . import layers
-from . import initializers
-from . import modules
-from . import models
-from . import datasets
-
-import sys
-import numpy
-
-# Create a fake module "backend" that is really numpy
-backend = numpy
-backend.image_data_format = lambda: 'channels_last'
-sys.modules["aitk.keras.backend"] = backend
diff --git a/aitk/keras/activations/README.md b/aitk/keras/activations/README.md
deleted file mode 100644
index 6287b59..0000000
--- a/aitk/keras/activations/README.md
+++ /dev/null
@@ -1,20 +0,0 @@
-# Activation Functions
-The `activations` module implements several common activation functions:
-
-- Rectified linear units (ReLU) ([Hahnloser et al., 2000](http://invibe.net/biblio_database_dyva/woda/data/att/6525.file.pdf))
-- Leaky rectified linear units
- ([Maas, Hannun, & Ng, 2013](https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf))
-- Exponential linear units ([Clevert, Unterthiner, & Hochreiter, 2016](https://arxiv.org/pdf/1511.07289.pdf))
-- Scaled exponential linear units ([Klambauer, Unterthiner, & Mayr, 2017](https://arxiv.org/pdf/1706.02515.pdf))
-- Softplus units
-- Hard sigmoid units
-- Exponential units
-- Hyperbolic tangent (tanh)
-- Logistic sigmoid
-- Affine
-
-
-## Plots
-
-
-
diff --git a/aitk/keras/activations/__init__.py b/aitk/keras/activations/__init__.py
deleted file mode 100644
index 8ba160e..0000000
--- a/aitk/keras/activations/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .activations import *
diff --git a/aitk/keras/activations/activations.py b/aitk/keras/activations/activations.py
deleted file mode 100644
index f2c1949..0000000
--- a/aitk/keras/activations/activations.py
+++ /dev/null
@@ -1,627 +0,0 @@
-"""A collection of activation function objects for building neural networks"""
-
-from abc import ABC, abstractmethod
-
-import numpy as np
-
-
-class ActivationBase(ABC):
- def __init__(self, **kwargs):
- """Initialize the ActivationBase object"""
- super().__init__()
-
- def __call__(self, z):
- """Apply the activation function to an input"""
- if z.ndim == 1:
- z = z.reshape(1, -1)
- return self.fn(z)
-
- @abstractmethod
- def fn(self, z):
- """Apply the activation function to an input"""
- raise NotImplementedError
-
- @abstractmethod
- def grad(self, x, **kwargs):
- """Compute the gradient of the activation function wrt the input"""
- raise NotImplementedError
-
-
-class Sigmoid(ActivationBase):
- def __init__(self):
- """A logistic sigmoid activation function."""
- super().__init__()
-
- def __str__(self):
- """Return a string representation of the activation function"""
- return "Sigmoid"
-
- def fn(self, z):
- r"""
- Evaluate the logistic sigmoid, :math:`\sigma`, on the elements of input `z`.
-
- .. math::
-
- \sigma(x_i) = \frac{1}{1 + e^{-x_i}}
- """
- return 1 / (1 + np.exp(-z))
-
- def grad(self, x):
- r"""
- Evaluate the first derivative of the logistic sigmoid on the elements of `x`.
-
- .. math::
-
- \frac{\partial \sigma}{\partial x_i} = \sigma(x_i) (1 - \sigma(x_i))
- """
- fn_x = self.fn(x)
- return fn_x * (1 - fn_x)
-
- def grad2(self, x):
- r"""
- Evaluate the second derivative of the logistic sigmoid on the elements of `x`.
-
- .. math::
-
- \frac{\partial^2 \sigma}{\partial x_i^2} =
- \frac{\partial \sigma}{\partial x_i} (1 - 2 \sigma(x_i))
- """
- fn_x = self.fn(x)
- return fn_x * (1 - fn_x) * (1 - 2 * fn_x)
-
-
-class ReLU(ActivationBase):
- """
- A rectified linear activation function.
-
- Notes
- -----
- "ReLU units can be fragile during training and can "die". For example, a
- large gradient flowing through a ReLU neuron could cause the weights to
- update in such a way that the neuron will never activate on any datapoint
- again. If this happens, then the gradient flowing through the unit will
- forever be zero from that point on. That is, the ReLU units can
- irreversibly die during training since they can get knocked off the data
- manifold.
-
- For example, you may find that as much as 40% of your network can be "dead"
- (i.e. neurons that never activate across the entire training dataset) if
- the learning rate is set too high. With a proper setting of the learning
- rate this is less frequently an issue." [*]_
-
- References
- ----------
- .. [*] Karpathy, A. "CS231n: Convolutional neural networks for visual recognition".
- """
-
- def __init__(self):
- super().__init__()
-
- def __str__(self):
- """Return a string representation of the activation function"""
- return "ReLU"
-
- def fn(self, z):
- r"""
- Evaluate the ReLU function on the elements of input `z`.
-
- .. math::
-
- \text{ReLU}(z_i)
- &= z_i \ \ \ \ &&\text{if }z_i > 0 \\
- &= 0 \ \ \ \ &&\text{otherwise}
- """
- return np.clip(z, 0, np.inf)
-
- def grad(self, x):
- r"""
- Evaluate the first derivative of the ReLU function on the elements of input `x`.
-
- .. math::
-
- \frac{\partial \text{ReLU}}{\partial x_i}
- &= 1 \ \ \ \ &&\text{if }x_i > 0 \\
- &= 0 \ \ \ \ &&\text{otherwise}
- """
- return (x > 0).astype(int)
-
- def grad2(self, x):
- r"""
- Evaluate the second derivative of the ReLU function on the elements of
- input `x`.
-
- .. math::
-
- \frac{\partial^2 \text{ReLU}}{\partial x_i^2} = 0
- """
- return np.zeros_like(x)
-
-
-class LeakyReLU(ActivationBase):
- """
- 'Leaky' version of a rectified linear unit (ReLU).
-
- Notes
- -----
- Leaky ReLUs [*]_ are designed to address the "dying ReLU" problem (zero gradient
- for negative inputs) by allowing a small non-zero gradient when `x` is negative.
-
- Parameters
- ----------
- alpha: float
- Activation slope when x < 0. Default is 0.3.
-
- References
- ----------
- .. [*] Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). "Rectifier
- nonlinearities improve neural network acoustic models". *Proceedings of
- the 30th International Conference on Machine Learning, 30*.
- """
-
- def __init__(self, alpha=0.3):
- self.alpha = alpha
- super().__init__()
-
- def __str__(self):
- """Return a string representation of the activation function"""
- return "Leaky ReLU(alpha={})".format(self.alpha)
-
- def fn(self, z):
- r"""
- Evaluate the leaky ReLU function on the elements of input `z`.
-
- .. math::
-
- \text{LeakyReLU}(z_i)
- &= z_i \ \ \ \ &&\text{if } z_i > 0 \\
- &= \alpha z_i \ \ \ \ &&\text{otherwise}
- """
- _z = z.copy()
- _z[z < 0] = _z[z < 0] * self.alpha
- return _z
-
- def grad(self, x):
- r"""
- Evaluate the first derivative of the leaky ReLU function on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial \text{LeakyReLU}}{\partial x_i}
- &= 1 \ \ \ \ &&\text{if }x_i > 0 \\
- &= \alpha \ \ \ \ &&\text{otherwise}
- """
- out = np.ones_like(x)
- out[x < 0] *= self.alpha
- return out
-
- def grad2(self, x):
- r"""
- Evaluate the second derivative of the leaky ReLU function on the
- elements of input `x`.
-
- .. math::
-
- \frac{\partial^2 \text{LeakyReLU}}{\partial x_i^2} = 0
- """
- return np.zeros_like(x)
-
-
-class Tanh(ActivationBase):
- def __init__(self):
- """A hyperbolic tangent activation function."""
- super().__init__()
-
- def __str__(self):
- """Return a string representation of the activation function"""
- return "Tanh"
-
- def fn(self, z):
- """Compute the tanh function on the elements of input `z`."""
- return np.tanh(z)
-
- def grad(self, x):
- r"""
- Evaluate the first derivative of the tanh function on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial \tanh}{\partial x_i} = 1 - \tanh(x)^2
- """
- return 1 - np.tanh(x) ** 2
-
- def grad2(self, x):
- r"""
- Evaluate the second derivative of the tanh function on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial^2 \tanh}{\partial x_i^2} =
- -2 \tanh(x) \left(\frac{\partial \tanh}{\partial x_i}\right)
- """
- tanh_x = np.tanh(x)
- return -2 * tanh_x * (1 - tanh_x ** 2)
-
-
-class Affine(ActivationBase):
- def __init__(self, slope=1, intercept=0):
- """
- An affine activation function.
-
- Parameters
- ----------
- slope: float
- Activation slope. Default is 1.
- intercept: float
- Intercept/offset term. Default is 0.
- """
- self.slope = slope
- self.intercept = intercept
- super().__init__()
-
- def __str__(self):
- """Return a string representation of the activation function"""
- return "Affine(slope={}, intercept={})".format(self.slope, self.intercept)
-
- def fn(self, z):
- r"""
- Evaluate the Affine activation on the elements of input `z`.
-
- .. math::
-
- \text{Affine}(z_i) = \text{slope} \times z_i + \text{intercept}
- """
- return self.slope * z + self.intercept
-
- def grad(self, x):
- r"""
- Evaluate the first derivative of the Affine activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial \text{Affine}}{\partial x_i} = \text{slope}
- """
- return self.slope * np.ones_like(x)
-
- def grad2(self, x):
- r"""
- Evaluate the second derivative of the Affine activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial^2 \text{Affine}}{\partial x_i^2} = 0
- """
- return np.zeros_like(x)
-
-
-class Identity(Affine):
- def __init__(self):
- """
- Identity activation function.
-
- Notes
- -----
- :class:`Identity` is just syntactic sugar for :class:`Affine` with
- slope = 1 and intercept = 0.
- """
- super().__init__(slope=1, intercept=0)
-
- def __str__(self):
- """Return a string representation of the activation function"""
- return "Identity"
-
-
-class ELU(ActivationBase):
- def __init__(self, alpha=1.0):
- r"""
- An exponential linear unit (ELU).
-
- Notes
- -----
- ELUs are intended to address the fact that ReLUs are strictly nonnegative
- and thus have an average activation > 0, increasing the chances of internal
- covariate shift and slowing down learning. ELU units address this by (1)
- allowing negative values when :math:`x < 0`, which (2) are bounded by a value
- :math:`-\alpha`. Similar to :class:`LeakyReLU`, the negative activation
- values help to push the average unit activation towards 0. Unlike
- :class:`LeakyReLU`, however, the boundedness of the negative activation
- allows for greater robustness in the face of large negative values,
- allowing the function to avoid conveying the *degree* of "absence"
- (negative activation) in the input. [*]_
-
- Parameters
- ----------
- alpha : float
- Slope of negative segment. Default is 1.
-
- References
- ----------
- .. [*] Clevert, D. A., Unterthiner, T., Hochreiter, S. (2016). "Fast
- and accurate deep network learning by exponential linear units
- (ELUs)". *4th International Conference on Learning
- Representations*.
- """
- self.alpha = alpha
- super().__init__()
-
- def __str__(self):
- """Return a string representation of the activation function"""
- return "ELU(alpha={})".format(self.alpha)
-
- def fn(self, z):
- r"""
- Evaluate the ELU activation on the elements of input `z`.
-
- .. math::
-
- \text{ELU}(z_i)
- &= z_i \ \ \ \ &&\text{if }z_i > 0 \\
- &= \alpha (e^{z_i} - 1) \ \ \ \ &&\text{otherwise}
- """
- # z if z > 0 else alpha * (e^z - 1)
- return np.where(z > 0, z, self.alpha * (np.exp(z) - 1))
-
- def grad(self, x):
- r"""
- Evaluate the first derivative of the ELU activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial \text{ELU}}{\partial x_i}
- &= 1 \ \ \ \ &&\text{if } x_i > 0 \\
- &= \alpha e^{x_i} \ \ \ \ &&\text{otherwise}
- """
- # 1 if x > 0 else alpha * e^(z)
- return np.where(x > 0, np.ones_like(x), self.alpha * np.exp(x))
-
- def grad2(self, x):
- r"""
- Evaluate the second derivative of the ELU activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial^2 \text{ELU}}{\partial x_i^2}
- &= 0 \ \ \ \ &&\text{if } x_i > 0 \\
- &= \alpha e^{x_i} \ \ \ \ &&\text{otherwise}
- """
- # 0 if x > 0 else alpha * e^(z)
- return np.where(x >= 0, np.zeros_like(x), self.alpha * np.exp(x))
-
-
-class Exponential(ActivationBase):
- def __init__(self):
- """An exponential (base e) activation function"""
- super().__init__()
-
- def __str__(self):
- """Return a string representation of the activation function"""
- return "Exponential"
-
- def fn(self, z):
- r"""
- Evaluate the activation function
-
- .. math::
- \text{Exponential}(z_i) = e^{z_i}
- """
- return np.exp(z)
-
- def grad(self, x):
- r"""
- Evaluate the first derivative of the exponential activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial \text{Exponential}}{\partial x_i} = e^{x_i}
- """
- return np.exp(x)
-
- def grad2(self, x):
- r"""
- Evaluate the second derivative of the exponential activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial^2 \text{Exponential}}{\partial x_i^2} = e^{x_i}
- """
- return np.exp(x)
-
-
-class SELU(ActivationBase):
- r"""
- A scaled exponential linear unit (SELU).
-
- Notes
- -----
- SELU units, when used in conjunction with proper weight initialization and
- regularization techniques, encourage neuron activations to converge to
- zero-mean and unit variance without explicit use of e.g., batchnorm.
-
- For SELU units, the :math:`\alpha` and :math:`\text{scale}` values are
- constants chosen so that the mean and variance of the inputs are preserved
- between consecutive layers. As such the authors propose weights be
- initialized using Lecun-Normal initialization: :math:`w_{ij} \sim
- \mathcal{N}(0, 1 / \text{fan_in})`, and to use the dropout variant
- :math:`\alpha`-dropout during regularization. [*]_
-
- See the reference for more information (especially the appendix ;-) ).
-
- References
- ----------
- .. [*] Klambauer, G., Unterthiner, T., Mayr, A., & Hochreiter, S. (2017).
- "Self-normalizing neural networks." *Advances in Neural Information
- Processing Systems, 30.*
- """
-
- def __init__(self):
- self.alpha = 1.6732632423543772848170429916717
- self.scale = 1.0507009873554804934193349852946
- self.elu = ELU(alpha=self.alpha)
- super().__init__()
-
- def __str__(self):
- """Return a string representation of the activation function"""
- return "SELU"
-
- def fn(self, z):
- r"""
- Evaluate the SELU activation on the elements of input `z`.
-
- .. math::
-
- \text{SELU}(z_i) = \text{scale} \times \text{ELU}(z_i, \alpha)
-
- which is simply
-
- .. math::
-
- \text{SELU}(z_i)
- &= \text{scale} \times z_i \ \ \ \ &&\text{if }z_i > 0 \\
- &= \text{scale} \times \alpha (e^{z_i} - 1) \ \ \ \ &&\text{otherwise}
- """
- return self.scale * self.elu.fn(z)
-
- def grad(self, x):
- r"""
- Evaluate the first derivative of the SELU activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial \text{SELU}}{\partial x_i}
- &= \text{scale} \ \ \ \ &&\text{if } x_i > 0 \\
- &= \text{scale} \times \alpha e^{x_i} \ \ \ \ &&\text{otherwise}
- """
- return np.where(
- x >= 0, np.ones_like(x) * self.scale, np.exp(x) * self.alpha * self.scale,
- )
-
- def grad2(self, x):
- r"""
- Evaluate the second derivative of the SELU activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial^2 \text{SELU}}{\partial x_i^2}
- &= 0 \ \ \ \ &&\text{if } x_i > 0 \\
- &= \text{scale} \times \alpha e^{x_i} \ \ \ \ &&\text{otherwise}
- """
- return np.where(x > 0, np.zeros_like(x), np.exp(x) * self.alpha * self.scale)
-
-
-class HardSigmoid(ActivationBase):
- def __init__(self):
- """
- A "hard" sigmoid activation function.
-
- Notes
- -----
- The hard sigmoid is a piecewise linear approximation of the logistic
- sigmoid that is cheaper to compute.
- """
- super().__init__()
-
- def __str__(self):
- """Return a string representation of the activation function"""
- return "Hard Sigmoid"
-
- def fn(self, z):
- r"""
- Evaluate the hard sigmoid activation on the elements of input `z`.
-
- .. math::
-
- \text{HardSigmoid}(z_i)
- &= 0 \ \ \ \ &&\text{if }z_i < -2.5 \\
- &= 0.2 z_i + 0.5 \ \ \ \ &&\text{if }-2.5 \leq z_i \leq 2.5 \\
- &= 1 \ \ \ \ &&\text{if }z_i > 2.5
- """
- return np.clip((0.2 * z) + 0.5, 0.0, 1.0)
-
- def grad(self, x):
- r"""
- Evaluate the first derivative of the hard sigmoid activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial \text{HardSigmoid}}{\partial x_i}
- &= 0.2 \ \ \ \ &&\text{if } -2.5 \leq x_i \leq 2.5\\
- &= 0 \ \ \ \ &&\text{otherwise}
- """
- return np.where((x >= -2.5) & (x <= 2.5), 0.2, 0)
-
- def grad2(self, x):
- r"""
- Evaluate the second derivative of the hard sigmoid activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial^2 \text{HardSigmoid}}{\partial x_i^2} = 0
- """
- return np.zeros_like(x)
-
-
-class SoftPlus(ActivationBase):
- def __init__(self):
- """
- A softplus activation function.
-
- Notes
- -----
- In contrast to :class:`ReLU`, the softplus activation is differentiable
- everywhere (including 0). It is, however, more expensive to
- compute.
-
- The derivative of the softplus activation is the logistic sigmoid.
- """
- super().__init__()
-
- def __str__(self):
- """Return a string representation of the activation function"""
- return "SoftPlus"
-
- def fn(self, z):
- r"""
- Evaluate the softplus activation on the elements of input `z`.
-
- .. math::
-
- \text{SoftPlus}(z_i) = \log(1 + e^{z_i})
- """
- return np.log(np.exp(z) + 1)
-
- def grad(self, x):
- r"""
- Evaluate the first derivative of the softplus activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial \text{SoftPlus}}{\partial x_i} = \frac{e^{x_i}}{1 + e^{x_i}}
- """
- exp_x = np.exp(x)
- return exp_x / (exp_x + 1)
-
- def grad2(self, x):
- r"""
- Evaluate the second derivative of the softplus activation on the elements
- of input `x`.
-
- .. math::
-
- \frac{\partial^2 \text{SoftPlus}}{\partial x_i^2} =
- \frac{e^{x_i}}{(1 + e^{x_i})^2}
- """
- exp_x = np.exp(x)
- return exp_x / ((exp_x + 1) ** 2)
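
# Illustrative sketch (not part of the diff above): checking the Sigmoid.grad
# formula from the deleted activations.py against a central finite difference.
# Pure NumPy; nothing from aitk is imported.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                      # sigma(x) * (1 - sigma(x))

x = np.linspace(-3.0, 3.0, 7)
eps = 1e-5
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)
assert np.allclose(sigmoid_grad(x), numeric, atol=1e-7)
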
diff --git a/aitk/keras/activations/img/plot.png b/aitk/keras/activations/img/plot.png
deleted file mode 100644
index e77a10f..0000000
Binary files a/aitk/keras/activations/img/plot.png and /dev/null differ
diff --git a/aitk/keras/callbacks.py b/aitk/keras/callbacks.py
deleted file mode 100644
index 574c222..0000000
--- a/aitk/keras/callbacks.py
+++ /dev/null
@@ -1,225 +0,0 @@
-# -*- coding: utf-8 -*-
-# **************************************************************
-# aitk.keras: A Python Keras model API
-#
-# Copyright (c) 2021 AITK Developers
-#
-# https://github.com/ArtificialIntelligenceToolkit/aitk.keras
-#
-# **************************************************************
-
-class Callback:
- def __init__(self):
- self.validation_data = None
- self.model = None
-
- def set_params(self, params):
- self.params = params
-
- def set_model(self, model):
- self.model = model
-
- def on_batch_begin(self, batch, logs=None):
- """A backwards compatibility alias for `on_train_batch_begin`."""
-
- def on_batch_end(self, batch, logs=None):
- """A backwards compatibility alias for `on_train_batch_end`."""
-
- def on_epoch_begin(self, epoch, logs=None):
- """Called at the start of an epoch.
-
- Subclasses should override for any actions to run. This
- function should only be called during TRAIN mode.
-
- Args:
- epoch: Integer, index of epoch.
- logs: Dict. Currently no data is passed to this argument for
- this method but that may change in the future.
- """
-
- def on_epoch_end(self, epoch, logs=None):
- """Called at the end of an epoch.
-
- Subclasses should override for any actions to run. This function
- should only be called during TRAIN mode.
-
- Args:
- epoch: Integer, index of epoch.
- logs: Dict, metric results for this training epoch, and for the
- validation epoch if validation is performed. Validation result keys
- are prefixed with `val_`. For training epoch, the values of the
- `Model`'s metrics are returned. Example : `{'loss': 0.2, 'accuracy':
- 0.7}`.
- """
-
- def on_train_batch_begin(self, batch, logs=None):
- """Called at the beginning of a training batch in `fit` methods.
-
- Subclasses should override for any actions to run.
-
- Note that if the `steps_per_execution` argument to `compile` in
- `tf.keras.Model` is set to `N`, this method will only be called every `N`
- batches.
-
- Args:
- batch: Integer, index of batch within the current epoch.
- logs: Dict. Currently no data is passed to this argument for this method
- but that may change in the future.
- """
- # For backwards compatibility.
- self.on_batch_begin(batch, logs=logs)
-
- def on_train_batch_end(self, batch, logs=None):
- """Called at the end of a training batch in `fit` methods.
-
- Subclasses should override for any actions to run.
-
- Note that if the `steps_per_execution` argument to `compile` in
- `tf.keras.Model` is set to `N`, this method will only be called every `N`
- batches.
-
- Args:
- batch: Integer, index of batch within the current epoch.
- logs: Dict. Aggregated metric results up until this batch.
- """
- # For backwards compatibility.
- self.on_batch_end(batch, logs=logs)
-
-
- def on_test_batch_begin(self, batch, logs=None):
- """Called at the beginning of a batch in `evaluate` methods.
-
- Also called at the beginning of a validation batch in the `fit`
- methods, if validation data is provided.
-
- Subclasses should override for any actions to run.
-
- Note that if the `steps_per_execution` argument to `compile` in
- `tf.keras.Model` is set to `N`, this method will only be called every `N`
- batches.
-
- Args:
- batch: Integer, index of batch within the current epoch.
- logs: Dict. Currently no data is passed to this argument for this method
- but that may change in the future.
- """
-
- def on_test_batch_end(self, batch, logs=None):
- """Called at the end of a batch in `evaluate` methods.
-
- Also called at the end of a validation batch in the `fit`
- methods, if validation data is provided.
-
- Subclasses should override for any actions to run.
-
- Note that if the `steps_per_execution` argument to `compile` in
- `tf.keras.Model` is set to `N`, this method will only be called every `N`
- batches.
-
- Args:
- batch: Integer, index of batch within the current epoch.
- logs: Dict. Aggregated metric results up until this batch.
- """
-
- def on_predict_batch_begin(self, batch, logs=None):
- """Called at the beginning of a batch in `predict` methods.
-
- Subclasses should override for any actions to run.
-
- Note that if the `steps_per_execution` argument to `compile` in
- `tf.keras.Model` is set to `N`, this method will only be called every `N`
- batches.
-
- Args:
- batch: Integer, index of batch within the current epoch.
- logs: Dict. Currently no data is passed to this argument for this method
- but that may change in the future.
- """
-
- def on_predict_batch_end(self, batch, logs=None):
- """Called at the end of a batch in `predict` methods.
-
- Subclasses should override for any actions to run.
-
- Note that if the `steps_per_execution` argument to `compile` in
- `tf.keras.Model` is set to `N`, this method will only be called every `N`
- batches.
-
- Args:
- batch: Integer, index of batch within the current epoch.
- logs: Dict. Aggregated metric results up until this batch.
- """
-
- def on_train_begin(self, logs=None):
- """Called at the beginning of training.
-
- Subclasses should override for any actions to run.
-
- Args:
- logs: Dict. Currently no data is passed to this argument for this method
- but that may change in the future.
- """
-
- def on_train_end(self, logs=None):
- """Called at the end of training.
-
- Subclasses should override for any actions to run.
-
- Args:
- logs: Dict. Currently the output of the last call to `on_epoch_end()`
- is passed to this argument for this method but that may change in
- the future.
- """
-
- def on_test_begin(self, logs=None):
- """Called at the beginning of evaluation or validation.
-
- Subclasses should override for any actions to run.
-
- Args:
- logs: Dict. Currently no data is passed to this argument for this method
- but that may change in the future.
- """
-
- def on_test_end(self, logs=None):
- """Called at the end of evaluation or validation.
-
- Subclasses should override for any actions to run.
-
- Args:
- logs: Dict. Currently the output of the last call to
- `on_test_batch_end()` is passed to this argument for this method
- but that may change in the future.
- """
-
- def on_predict_begin(self, logs=None):
- """Called at the beginning of prediction.
-
- Subclasses should override for any actions to run.
-
- Args:
- logs: Dict. Currently no data is passed to this argument for this method
- but that may change in the future.
- """
-
- def on_predict_end(self, logs=None):
- """Called at the end of prediction.
-
- Subclasses should override for any actions to run.
-
- Args:
- logs: Dict. Currently no data is passed to this argument for this method
- but that may change in the future.
- """
-
-class History(Callback):
- def __init__(self):
- super().__init__()
- self.history = {}
-
- def on_epoch_end(self, epoch, logs=None):
- if logs:
- for metric in logs:
- if metric not in self.history:
- self.history[metric] = []
- self.history[metric].append(logs[metric])
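
# Illustrative sketch (not part of the diff above): how a training loop drives the
# epoch hooks of the deleted History callback. The History class is re-stated here
# so the sketch is self-contained; `toy_fit` is a hypothetical stand-in for Model.fit.
class History:
    def __init__(self):
        self.history = {}

    def on_epoch_end(self, epoch, logs=None):
        if logs:
            for metric, value in logs.items():
                self.history.setdefault(metric, []).append(value)

def toy_fit(epochs, callbacks):
    for epoch in range(epochs):
        logs = {"loss": 1.0 / (epoch + 1)}    # stand-in metric values
        for cb in callbacks:
            cb.on_epoch_end(epoch, logs)

history = History()
toy_fit(epochs=3, callbacks=[history])
print(history.history)                        # {'loss': [1.0, 0.5, 0.333...]}
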
diff --git a/aitk/keras/datasets/BUILD b/aitk/keras/datasets/BUILD
deleted file mode 100644
index af31da0..0000000
--- a/aitk/keras/datasets/BUILD
+++ /dev/null
@@ -1,38 +0,0 @@
-# Description:
-# Contains the Keras datasets package (internal TensorFlow version).
-
-package(
- default_visibility = [
- "//keras:__subpackages__",
- ],
- licenses = ["notice"],
-)
-
-filegroup(
- name = "all_py_srcs",
- srcs = glob(["*.py"]),
- visibility = ["//keras/google/private_tf_api_test:__pkg__"],
-)
-
-py_library(
- name = "datasets",
- srcs = [
- "__init__.py",
- "boston_housing.py",
- "cifar.py",
- "cifar10.py",
- "cifar100.py",
- "fashion_mnist.py",
- "imdb.py",
- "mnist.py",
- "reuters.py",
- ],
- srcs_version = "PY3",
- visibility = ["//visibility:public"],
- deps = [
- "//:expect_numpy_installed",
- "//:expect_tensorflow_installed",
- "//keras:backend",
- "//keras/utils:engine_utils",
- ],
-)
diff --git a/aitk/keras/datasets/__init__.py b/aitk/keras/datasets/__init__.py
deleted file mode 100644
index 098bf7b..0000000
--- a/aitk/keras/datasets/__init__.py
+++ /dev/null
@@ -1,2 +0,0 @@
-"""Small NumPy datasets for debugging/testing."""
-
diff --git a/aitk/keras/datasets/boston_housing.py b/aitk/keras/datasets/boston_housing.py
deleted file mode 100644
index 0ac42bd..0000000
--- a/aitk/keras/datasets/boston_housing.py
+++ /dev/null
@@ -1,74 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Boston housing price regression dataset."""
-
-import numpy as np
-
-from .utils import get_file, get_file_async
-
-
-def load_data(path='boston_housing.npz', test_split=0.2, seed=113):
- """Loads the Boston Housing dataset.
-
- This is a dataset taken from the StatLib library which is maintained at
- Carnegie Mellon University.
-
- Samples contain 13 attributes of houses at different locations around the
- Boston suburbs in the late 1970s. Targets are the median values of
- the houses at a location (in k$).
-
- The attributes themselves are defined in the
- [StatLib website](http://lib.stat.cmu.edu/datasets/boston).
-
- Args:
- path: path where to cache the dataset locally
- (relative to `~/.keras/datasets`).
- test_split: fraction of the data to reserve as test set.
- seed: Random seed for shuffling the data
- before computing the test split.
-
- Returns:
- Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
-
- **x_train, x_test**: numpy arrays with shape `(num_samples, 13)`
- containing either the training samples (for x_train),
- or test samples (for x_test).
-
- **y_train, y_test**: numpy arrays of shape `(num_samples,)` containing the
- target scalars. The targets are float scalars typically between 10 and
- 50 that represent the home prices in k$.
- """
- assert 0 <= test_split < 1
- origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
- path = get_file(
- path,
- origin=origin_folder + 'boston_housing.npz',
- file_hash=
- 'f553886a1f8d56431e820c5b82552d9d95cfcb96d1e678153f8839538947dff5')
- with np.load(path, allow_pickle=True) as f: # pylint: disable=unexpected-keyword-arg
- x = f['x']
- y = f['y']
-
- rng = np.random.RandomState(seed)
- indices = np.arange(len(x))
- rng.shuffle(indices)
- x = x[indices]
- y = y[indices]
-
- x_train = np.array(x[:int(len(x) * (1 - test_split))])
- y_train = np.array(y[:int(len(x) * (1 - test_split))])
- x_test = np.array(x[int(len(x) * (1 - test_split)):])
- y_test = np.array(y[int(len(x) * (1 - test_split)):])
- return (x_train, y_train), (x_test, y_test)
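
# Illustrative sketch (not part of the diff above): the shuffle-and-split logic used
# by the deleted boston_housing.load_data, reproduced on a toy array.
import numpy as np

x = np.arange(20).reshape(10, 2)
y = np.arange(10)
test_split, seed = 0.2, 113

rng = np.random.RandomState(seed)
indices = np.arange(len(x))
rng.shuffle(indices)
x, y = x[indices], y[indices]

split = int(len(x) * (1 - test_split))
(x_train, y_train), (x_test, y_test) = (x[:split], y[:split]), (x[split:], y[split:])
assert x_test.shape == (2, 2)                 # 20% of 10 samples held out
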
diff --git a/aitk/keras/datasets/cifar.py b/aitk/keras/datasets/cifar.py
deleted file mode 100644
index af4f44b..0000000
--- a/aitk/keras/datasets/cifar.py
+++ /dev/null
@@ -1,42 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Utilities common to CIFAR10 and CIFAR100 datasets."""
-
-import _pickle as cPickle
-
-
-def load_batch(fpath, label_key='labels'):
- """Internal utility for parsing CIFAR data.
-
- Args:
- fpath: path the file to parse.
- label_key: key for label data in the retrieve
- dictionary.
-
- Returns:
- A tuple `(data, labels)`.
- """
- with open(fpath, 'rb') as f:
- d = cPickle.load(f, encoding='bytes')
- # decode utf8
- d_decoded = {}
- for k, v in d.items():
- d_decoded[k.decode('utf8')] = v
- d = d_decoded
- data = d['data']
- labels = d[label_key]
-
- data = data.reshape(data.shape[0], 3, 32, 32)
- return data, labels
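
# Illustrative sketch (not part of the diff above): the byte-key decoding step in
# the deleted cifar.load_batch. CIFAR pickle files store dict keys as bytes, so the
# loader re-keys the dictionary with str keys before indexing it.
d = {b"data": [[0, 1, 2]], b"labels": [1]}    # stand-in for cPickle.load(...) output
d_decoded = {k.decode("utf8"): v for k, v in d.items()}
assert sorted(d_decoded) == ["data", "labels"]
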
diff --git a/aitk/keras/datasets/cifar10.py b/aitk/keras/datasets/cifar10.py
deleted file mode 100644
index bd4af25..0000000
--- a/aitk/keras/datasets/cifar10.py
+++ /dev/null
@@ -1,107 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""CIFAR10 small images classification dataset."""
-
-import os
-
-import numpy as np
-
-from ..backend import image_data_format
-from .cifar import load_batch
-from .utils import get_file
-
-
-def load_data():
- """Loads the CIFAR10 dataset.
-
- This is a dataset of 50,000 32x32 color training images and 10,000 test
- images, labeled over 10 categories. See more info at the
- [CIFAR homepage](https://www.cs.toronto.edu/~kriz/cifar.html).
-
- The classes are:
-
- | Label | Description |
- |:-----:|-------------|
- | 0 | airplane |
- | 1 | automobile |
- | 2 | bird |
- | 3 | cat |
- | 4 | deer |
- | 5 | dog |
- | 6 | frog |
- | 7 | horse |
- | 8 | ship |
- | 9 | truck |
-
- Returns:
- Tuple of NumPy arrays: `(x_train, y_train), (x_test, y_test)`.
-
- **x_train**: uint8 NumPy array of RGB image data with shape
- `(50000, 32, 32, 3)`, containing the training data. Pixel values range
- from 0 to 255.
-
- **y_train**: uint8 NumPy array of labels (integers in range 0-9)
- with shape `(50000, 1)` for the training data.
-
- **x_test**: uint8 NumPy array of RGB image data with shape
- `(10000, 32, 32, 3)`, containing the test data. Pixel values range
- from 0 to 255.
-
- **y_test**: uint8 NumPy array of labels (integers in range 0-9)
- with shape `(10000, 1)` for the test data.
-
- Example:
-
- ```python
- (x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
- assert x_train.shape == (50000, 32, 32, 3)
- assert x_test.shape == (10000, 32, 32, 3)
- assert y_train.shape == (50000, 1)
- assert y_test.shape == (10000, 1)
- ```
- """
- dirname = 'cifar-10-batches-py'
- origin = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
- path = get_file(
- dirname,
- origin=origin,
- untar=True,
- file_hash=
- '6d958be074577803d12ecdefd02955f39262c83c16fe9348329d7fe0b5c001ce')
-
- num_train_samples = 50000
-
- x_train = np.empty((num_train_samples, 3, 32, 32), dtype='uint8')
- y_train = np.empty((num_train_samples,), dtype='uint8')
-
- for i in range(1, 6):
- fpath = os.path.join(path, 'data_batch_' + str(i))
- (x_train[(i - 1) * 10000:i * 10000, :, :, :],
- y_train[(i - 1) * 10000:i * 10000]) = load_batch(fpath)
-
- fpath = os.path.join(path, 'test_batch')
- x_test, y_test = load_batch(fpath)
-
- y_train = np.reshape(y_train, (len(y_train), 1))
- y_test = np.reshape(y_test, (len(y_test), 1))
-
- if image_data_format() == 'channels_last':
- x_train = x_train.transpose(0, 2, 3, 1)
- x_test = x_test.transpose(0, 2, 3, 1)
-
- x_test = x_test.astype(x_train.dtype)
- y_test = y_test.astype(y_train.dtype)
-
- return (x_train, y_train), (x_test, y_test)
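
# Illustrative sketch (not part of the diff above): the channel-order conversion at
# the end of the deleted cifar10.load_data when the backend reports 'channels_last'.
import numpy as np

batch = np.zeros((4, 3, 32, 32), dtype="uint8")   # (N, C, H, W) as stored on disk
channels_last = batch.transpose(0, 2, 3, 1)       # -> (N, H, W, C)
assert channels_last.shape == (4, 32, 32, 3)
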
diff --git a/aitk/keras/datasets/cifar100.py b/aitk/keras/datasets/cifar100.py
deleted file mode 100644
index 59bfee0..0000000
--- a/aitk/keras/datasets/cifar100.py
+++ /dev/null
@@ -1,92 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""CIFAR100 small images classification dataset."""
-
-import os
-
-import numpy as np
-
-from ..backend import image_data_format
-from .cifar import load_batch
-from .utils import get_file
-
-
-def load_data(label_mode='fine'):
- """Loads the CIFAR100 dataset.
-
- This is a dataset of 50,000 32x32 color training images and
- 10,000 test images, labeled over 100 fine-grained classes that are
- grouped into 20 coarse-grained classes. See more info at the
- [CIFAR homepage](https://www.cs.toronto.edu/~kriz/cifar.html).
-
- Args:
- label_mode: one of "fine", "coarse". If it is "fine" the category labels
- are the fine-grained labels, if it is "coarse" the output labels are the
- coarse-grained superclasses.
-
- Returns:
- Tuple of NumPy arrays: `(x_train, y_train), (x_test, y_test)`.
-
- **x_train**: uint8 NumPy array of RGB image data with shape
- `(50000, 32, 32, 3)`, containing the training data. Pixel values range
- from 0 to 255.
-
- **y_train**: uint8 NumPy array of labels (integers in range 0-99)
- with shape `(50000, 1)` for the training data.
-
- **x_test**: uint8 NumPy array of RGB image data with shape
- `(10000, 32, 32, 3)`, containing the test data. Pixel values range
- from 0 to 255.
-
- **y_test**: uint8 NumPy array of labels (integers in range 0-99)
- with shape `(10000, 1)` for the test data.
-
- Example:
-
- ```python
- (x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()
- assert x_train.shape == (50000, 32, 32, 3)
- assert x_test.shape == (10000, 32, 32, 3)
- assert y_train.shape == (50000, 1)
- assert y_test.shape == (10000, 1)
- ```
- """
- if label_mode not in ['fine', 'coarse']:
- raise ValueError('`label_mode` must be one of `"fine"`, `"coarse"`. '
- f'Received: label_mode={label_mode}.')
-
- dirname = 'cifar-100-python'
- origin = 'https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz'
- path = get_file(
- dirname,
- origin=origin,
- untar=True,
- file_hash=
- '85cd44d02ba6437773c5bbd22e183051d648de2e7d6b014e1ef29b855ba677a7')
-
- fpath = os.path.join(path, 'train')
- x_train, y_train = load_batch(fpath, label_key=label_mode + '_labels')
-
- fpath = os.path.join(path, 'test')
- x_test, y_test = load_batch(fpath, label_key=label_mode + '_labels')
-
- y_train = np.reshape(y_train, (len(y_train), 1))
- y_test = np.reshape(y_test, (len(y_test), 1))
-
- if image_data_format() == 'channels_last':
- x_train = x_train.transpose(0, 2, 3, 1)
- x_test = x_test.transpose(0, 2, 3, 1)
-
- return (x_train, y_train), (x_test, y_test)
diff --git a/aitk/keras/datasets/fashion_mnist.py b/aitk/keras/datasets/fashion_mnist.py
deleted file mode 100644
index 31bf238..0000000
--- a/aitk/keras/datasets/fashion_mnist.py
+++ /dev/null
@@ -1,103 +0,0 @@
-# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Fashion-MNIST dataset."""
-
-import gzip
-import os
-
-import numpy as np
-
-from .utils import get_file
-
-
-def load_data():
- """Loads the Fashion-MNIST dataset.
-
- This is a dataset of 60,000 28x28 grayscale images of 10 fashion categories,
- along with a test set of 10,000 images. This dataset can be used as
- a drop-in replacement for MNIST.
-
- The classes are:
-
- | Label | Description |
- |:-----:|-------------|
- | 0 | T-shirt/top |
- | 1 | Trouser |
- | 2 | Pullover |
- | 3 | Dress |
- | 4 | Coat |
- | 5 | Sandal |
- | 6 | Shirt |
- | 7 | Sneaker |
- | 8 | Bag |
- | 9 | Ankle boot |
-
- Returns:
- Tuple of NumPy arrays: `(x_train, y_train), (x_test, y_test)`.
-
- **x_train**: uint8 NumPy array of grayscale image data with shapes
- `(60000, 28, 28)`, containing the training data.
-
- **y_train**: uint8 NumPy array of labels (integers in range 0-9)
- with shape `(60000,)` for the training data.
-
- **x_test**: uint8 NumPy array of grayscale image data with shapes
- (10000, 28, 28), containing the test data.
-
- **y_test**: uint8 NumPy array of labels (integers in range 0-9)
- with shape `(10000,)` for the test data.
-
- Example:
-
- ```python
- (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
- assert x_train.shape == (60000, 28, 28)
- assert x_test.shape == (10000, 28, 28)
- assert y_train.shape == (60000,)
- assert y_test.shape == (10000,)
- ```
-
- License:
- The copyright for Fashion-MNIST is held by Zalando SE.
- Fashion-MNIST is licensed under the [MIT license](
- https://github.com/zalandoresearch/fashion-mnist/blob/master/LICENSE).
-
- """
- dirname = os.path.join('datasets', 'fashion-mnist')
- base = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
- files = [
- 'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz',
- 't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz'
- ]
-
- paths = []
- for fname in files:
- paths.append(get_file(fname, origin=base + fname, cache_subdir=dirname))
-
- with gzip.open(paths[0], 'rb') as lbpath:
- y_train = np.frombuffer(lbpath.read(), np.uint8, offset=8)
-
- with gzip.open(paths[1], 'rb') as imgpath:
- x_train = np.frombuffer(
- imgpath.read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)
-
- with gzip.open(paths[2], 'rb') as lbpath:
- y_test = np.frombuffer(lbpath.read(), np.uint8, offset=8)
-
- with gzip.open(paths[3], 'rb') as imgpath:
- x_test = np.frombuffer(
- imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)
-
- return (x_train, y_train), (x_test, y_test)
diff --git a/aitk/keras/datasets/imdb.py b/aitk/keras/datasets/imdb.py
deleted file mode 100644
index 1074cd2..0000000
--- a/aitk/keras/datasets/imdb.py
+++ /dev/null
@@ -1,184 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""IMDB sentiment classification dataset."""
-
-import json
-
-from .utils import get_file
-
-import numpy as np
-from keras_preprocessing import sequence
-_remove_long_seq = sequence._remove_long_seq
-
-def load_data(path='imdb.npz',
- num_words=None,
- skip_top=0,
- maxlen=None,
- seed=113,
- start_char=1,
- oov_char=2,
- index_from=3,
- **kwargs):
- """Loads the [IMDB dataset](https://ai.stanford.edu/~amaas/data/sentiment/).
-
- This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment
- (positive/negative). Reviews have been preprocessed, and each review is
- encoded as a list of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset,
- so that for instance the integer "3" encodes the 3rd most frequent word in
- the data. This allows for quick filtering operations such as:
- "only consider the top 10,000 most
- common words, but eliminate the top 20 most common words".
-
- As a convention, "0" does not stand for a specific word, but instead is used
- to encode any unknown word.
-
- Args:
- path: where to cache the data (relative to `~/.keras/dataset`).
- num_words: integer or None. Words are
- ranked by how often they occur (in the training set) and only
- the `num_words` most frequent words are kept. Any less frequent word
- will appear as `oov_char` value in the sequence data. If None,
- all words are kept. Defaults to None, so all words are kept.
- skip_top: skip the top N most frequently occurring words
- (which may not be informative). These words will appear as
- `oov_char` value in the dataset. Defaults to 0, so no words are
- skipped.
- maxlen: int or None. Maximum sequence length.
- Any longer sequence will be truncated. Defaults to None, which
- means no truncation.
- seed: int. Seed for reproducible data shuffling.
- start_char: int. The start of a sequence will be marked with this
- character. Defaults to 1 because 0 is usually the padding character.
- oov_char: int. The out-of-vocabulary character.
- Words that were cut out because of the `num_words` or
- `skip_top` limits will be replaced with this character.
- index_from: int. Index actual words with this index and higher.
- **kwargs: Used for backwards compatibility.
-
- Returns:
- Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
-
- **x_train, x_test**: lists of sequences, which are lists of indexes
- (integers). If the `num_words` argument was specified, the maximum
- possible index value is `num_words - 1`. If the `maxlen` argument was
- specified, the largest possible sequence length is `maxlen`.
-
- **y_train, y_test**: lists of integer labels (1 or 0).
-
- Raises:
- ValueError: in case `maxlen` is so low
- that no input sequence could be kept.
-
- Note that the 'out of vocabulary' character is only used for
- words that were present in the training set but are not included
- because they're not making the `num_words` cut here.
- Words that were not seen in the training set but are in the test set
- have simply been skipped.
- """
- # Legacy support
- if 'nb_words' in kwargs:
- print('The `nb_words` argument in `load_data` '
- 'has been renamed `num_words`.')
- num_words = kwargs.pop('nb_words')
- if kwargs:
- raise TypeError(f'Unrecognized keyword arguments: {str(kwargs)}.')
-
- origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
- path = get_file(
- path,
- origin=origin_folder + 'imdb.npz',
- file_hash=
- '69664113be75683a8fe16e3ed0ab59fda8886cb3cd7ada244f7d9544e4676b9f')
- with np.load(path, allow_pickle=True) as f: # pylint: disable=unexpected-keyword-arg
- x_train, labels_train = f['x_train'], f['y_train']
- x_test, labels_test = f['x_test'], f['y_test']
-
- rng = np.random.RandomState(seed)
- indices = np.arange(len(x_train))
- rng.shuffle(indices)
- x_train = x_train[indices]
- labels_train = labels_train[indices]
-
- indices = np.arange(len(x_test))
- rng.shuffle(indices)
- x_test = x_test[indices]
- labels_test = labels_test[indices]
-
- if start_char is not None:
- x_train = [[start_char] + [w + index_from for w in x] for x in x_train]
- x_test = [[start_char] + [w + index_from for w in x] for x in x_test]
- elif index_from:
- x_train = [[w + index_from for w in x] for x in x_train]
- x_test = [[w + index_from for w in x] for x in x_test]
-
- if maxlen:
- x_train, labels_train = _remove_long_seq(maxlen, x_train, labels_train)
- x_test, labels_test = _remove_long_seq(maxlen, x_test, labels_test)
- if not x_train or not x_test:
- raise ValueError('After filtering for sequences shorter than maxlen='
- f'{str(maxlen)}, no sequence was kept. Increase maxlen.')
-
- xs = x_train + x_test
- labels = np.concatenate([labels_train, labels_test])
-
- if not num_words:
- num_words = max(max(x) for x in xs)
-
- # by convention, use 2 as OOV word
- # reserve 'index_from' (=3 by default) characters:
- # 0 (padding), 1 (start), 2 (OOV)
- if oov_char is not None:
- xs = [
- [w if (skip_top <= w < num_words) else oov_char for w in x] for x in xs
- ]
- else:
- xs = [[w for w in x if skip_top <= w < num_words] for x in xs]
-
- idx = len(x_train)
- x_train, y_train = np.array(xs[:idx], dtype='object'), labels[:idx]
- x_test, y_test = np.array(xs[idx:], dtype='object'), labels[idx:]
- return (x_train, y_train), (x_test, y_test)
-
-
-def get_word_index(path='imdb_word_index.json'):
- """Retrieves a dict mapping words to their index in the IMDB dataset.
-
- Args:
- path: where to cache the data (relative to `~/.keras/dataset`).
-
- Returns:
- The word index dictionary. Keys are word strings, values are their index.
-
- Example:
-
- ```python
- # Retrieve the training sequences.
- (x_train, _), _ = keras.datasets.imdb.load_data()
- # Retrieve the word index file mapping words to indices
- word_index = keras.datasets.imdb.get_word_index()
- # Reverse the word index to obtain a dict mapping indices to words
- inverted_word_index = dict((i, word) for (word, i) in word_index.items())
- # Decode the first sequence in the dataset
- decoded_sequence = " ".join(inverted_word_index[i] for i in x_train[0])
- ```
- """
- origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
- path = get_file(
- path,
- origin=origin_folder + 'imdb_word_index.json',
- file_hash='bfafd718b763782e994055a2d397834f')
- with open(path) as f:
- return json.load(f)
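
# Illustrative sketch (not part of the diff above): the index shifting and
# out-of-vocabulary substitution applied inside the deleted imdb.load_data,
# run on a single toy review.
start_char, oov_char, index_from = 1, 2, 3
num_words, skip_top = 10, 0

raw = [[5, 1, 40]]                                              # word ids as stored in imdb.npz
shifted = [[start_char] + [w + index_from for w in x] for x in raw]
clipped = [[w if skip_top <= w < num_words else oov_char for w in x] for x in shifted]
assert clipped == [[1, 8, 4, 2]]                                # 40 + 3 >= num_words -> oov_char
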
diff --git a/aitk/keras/datasets/mnist.py b/aitk/keras/datasets/mnist.py
deleted file mode 100644
index 69de521..0000000
--- a/aitk/keras/datasets/mnist.py
+++ /dev/null
@@ -1,152 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""MNIST handwritten digits dataset."""
-
-import numpy as np
-import os
-
-from .utils import get_file, get_file_async
-
-origin_folders = [
- ('https://storage.googleapis.com/tensorflow/tf-keras-datasets/', '731c5ac602752760c8e48fbffcf8c3b850d9dc2a2aedcf2cc48468fc17b673d1'),
- ("https://raw.githubusercontent.com/ArtificialIntelligenceToolkit/datasets/master/mnist/", None),
-]
-
-def load_data(path='mnist.npz'):
- """Loads the MNIST dataset.
-
- This is a dataset of 60,000 28x28 grayscale images of the 10 digits,
- along with a test set of 10,000 images.
- More info can be found at the
- [MNIST homepage](http://yann.lecun.com/exdb/mnist/).
-
- Args:
- path: path where to cache the dataset locally
- (relative to `~/.keras/datasets`).
-
- Returns:
- Tuple of NumPy arrays: `(x_train, y_train), (x_test, y_test)`.
-
- **x_train**: uint8 NumPy array of grayscale image data with shapes
- `(60000, 28, 28)`, containing the training data. Pixel values range
- from 0 to 255.
-
- **y_train**: uint8 NumPy array of digit labels (integers in range 0-9)
- with shape `(60000,)` for the training data.
-
- **x_test**: uint8 NumPy array of grayscale image data with shapes
- (10000, 28, 28), containing the test data. Pixel values range
- from 0 to 255.
-
- **y_test**: uint8 NumPy array of digit labels (integers in range 0-9)
- with shape `(10000,)` for the test data.
-
- Example:
-
- ```python
- (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
- assert x_train.shape == (60000, 28, 28)
- assert x_test.shape == (10000, 28, 28)
- assert y_train.shape == (60000,)
- assert y_test.shape == (10000,)
- ```
-
- License:
- Yann LeCun and Corinna Cortes hold the copyright of MNIST dataset,
- which is a derivative work from original NIST datasets.
- MNIST dataset is made available under the terms of the
- [Creative Commons Attribution-Share Alike 3.0 license.](
- https://creativecommons.org/licenses/by-sa/3.0/)
- """
- for origin_folder, file_hash in origin_folders:
- download_path = None
- try:
- download_path = get_file(
- path,
- origin=origin_folder + 'mnist.npz',
- file_hash=file_hash)
- except Exception:
- print("Failed dataset download; trying another URL...")
- continue
-
- if download_path and os.path.isfile(download_path):
- with np.load(download_path, allow_pickle=True) as f:
- x_train, y_train = f['x_train'], f['y_train']
- x_test, y_test = f['x_test'], f['y_test']
- return (x_train, y_train), (x_test, y_test)
-
-
-async def load_data_async(path='mnist.npz'):
- """Loads the MNIST dataset.
-
- This is a dataset of 60,000 28x28 grayscale images of the 10 digits,
- along with a test set of 10,000 images.
- More info can be found at the
- [MNIST homepage](http://yann.lecun.com/exdb/mnist/).
-
- Args:
- path: path where to cache the dataset locally
- (relative to `~/.keras/datasets`).
-
- Returns:
- Tuple of NumPy arrays: `(x_train, y_train), (x_test, y_test)`.
-
- **x_train**: uint8 NumPy array of grayscale image data with shapes
- `(60000, 28, 28)`, containing the training data. Pixel values range
- from 0 to 255.
-
- **y_train**: uint8 NumPy array of digit labels (integers in range 0-9)
- with shape `(60000,)` for the training data.
-
- **x_test**: uint8 NumPy array of grayscale image data with shapes
- (10000, 28, 28), containing the test data. Pixel values range
- from 0 to 255.
-
- **y_test**: uint8 NumPy array of digit labels (integers in range 0-9)
- with shape `(10000,)` for the test data.
-
- Example:
-
- ```python
- (x_train, y_train), (x_test, y_test) = await keras.datasets.mnist.load_data_async()
- assert x_train.shape == (60000, 28, 28)
- assert x_test.shape == (10000, 28, 28)
- assert y_train.shape == (60000,)
- assert y_test.shape == (10000,)
- ```
-
- License:
- Yann LeCun and Corinna Cortes hold the copyright of MNIST dataset,
- which is a derivative work from original NIST datasets.
- MNIST dataset is made available under the terms of the
- [Creative Commons Attribution-Share Alike 3.0 license.](
- https://creativecommons.org/licenses/by-sa/3.0/)
- """
- for origin_folder, file_hash in origin_folders:
- download_path = None
- if not os.path.isfile(path):
- try:
- download_path = await get_file_async(origin_folder, path)
- except Exception:
- print("Failed dataset download; trying another URL...")
- continue
- else:
- download_path = path
-
- if download_path and os.path.isfile(download_path):
- with np.load(download_path, allow_pickle=True) as f:
- x_train, y_train = f['x_train'], f['y_train']
- x_test, y_test = f['x_test'], f['y_test']
- return (x_train, y_train), (x_test, y_test)
diff --git a/aitk/keras/datasets/reuters.py b/aitk/keras/datasets/reuters.py
deleted file mode 100644
index a649a75..0000000
--- a/aitk/keras/datasets/reuters.py
+++ /dev/null
@@ -1,163 +0,0 @@
-# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-"""Reuters topic classification dataset."""
-
-import json
-
-import numpy as np
-
-from keras_preprocessing import sequence
-_remove_long_seq = sequence._remove_long_seq
-
-from .utils import get_file
-
-
-def load_data(path='reuters.npz',
- num_words=None,
- skip_top=0,
- maxlen=None,
- test_split=0.2,
- seed=113,
- start_char=1,
- oov_char=2,
- index_from=3,
- **kwargs):
- """Loads the Reuters newswire classification dataset.
-
- This is a dataset of 11,228 newswires from Reuters, labeled over 46 topics.
-
- This was originally generated by parsing and preprocessing the classic
- Reuters-21578 dataset, but the preprocessing code is no longer packaged
- with Keras. See this
- [github discussion](https://github.com/keras-team/keras/issues/12072)
- for more info.
-
- Each newswire is encoded as a list of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset,
- so that for instance the integer "3" encodes the 3rd most frequent word in
- the data. This allows for quick filtering operations such as:
- "only consider the top 10,000 most
- common words, but eliminate the top 20 most common words".
-
- As a convention, "0" does not stand for a specific word, but instead is used
- to encode any unknown word.
-
- Args:
- path: where to cache the data (relative to `~/.keras/dataset`).
- num_words: integer or None. Words are
- ranked by how often they occur (in the training set) and only
- the `num_words` most frequent words are kept. Any less frequent word
- will appear as `oov_char` value in the sequence data. If None,
- all words are kept. Defaults to None, so all words are kept.
- skip_top: skip the top N most frequently occurring words
- (which may not be informative). These words will appear as
- `oov_char` value in the dataset. Defaults to 0, so no words are
- skipped.
- maxlen: int or None. Maximum sequence length.
- Any longer sequence will be truncated. Defaults to None, which
- means no truncation.
- test_split: Float between 0 and 1. Fraction of the dataset to be used
- as test data. Defaults to 0.2, meaning 20% of the dataset is used as
- test data.
- seed: int. Seed for reproducible data shuffling.
- start_char: int. The start of a sequence will be marked with this
- character. Defaults to 1 because 0 is usually the padding character.
- oov_char: int. The out-of-vocabulary character.
- Words that were cut out because of the `num_words` or
- `skip_top` limits will be replaced with this character.
- index_from: int. Index actual words with this index and higher.
- **kwargs: Used for backwards compatibility.
-
- Returns:
- Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
-
- **x_train, x_test**: lists of sequences, which are lists of indexes
- (integers). If the num_words argument was specified, the maximum
- possible index value is `num_words - 1`. If the `maxlen` argument was
- specified, the largest possible sequence length is `maxlen`.
-
- **y_train, y_test**: lists of integer topic labels (0 to 45).
-
- Note: The 'out of vocabulary' character is only used for
- words that were present in the training set but are not included
- because they're not making the `num_words` cut here.
- Words that were not seen in the training set but are in the test set
- have simply been skipped.
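-
- Example (a minimal usage sketch using the module-level functions in this file):
-
- ```python
- # Keep the 10,000 most frequent words; the 20 most frequent words (and any
- # word outside the top 10,000) are replaced by the OOV character.
- (x_train, y_train), (x_test, y_test) = load_data(num_words=10000, skip_top=20)
- word_index = get_word_index()
- ```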
- """
- # Legacy support
- if 'nb_words' in kwargs:
- print('The `nb_words` argument in `load_data` '
- 'has been renamed `num_words`.')
- num_words = kwargs.pop('nb_words')
- if kwargs:
- raise TypeError(f'Unrecognized keyword arguments: {str(kwargs)}')
-
- origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
- path = get_file(
- path,
- origin=origin_folder + 'reuters.npz',
- file_hash=
- 'd6586e694ee56d7a4e65172e12b3e987c03096cb01eab99753921ef915959916')
- with np.load(path, allow_pickle=True) as f: # pylint: disable=unexpected-keyword-arg
- xs, labels = f['x'], f['y']
-
- rng = np.random.RandomState(seed)
- indices = np.arange(len(xs))
- rng.shuffle(indices)
- xs = xs[indices]
- labels = labels[indices]
-
- if start_char is not None:
- xs = [[start_char] + [w + index_from for w in x] for x in xs]
- elif index_from:
- xs = [[w + index_from for w in x] for x in xs]
-
- if maxlen:
- xs, labels = _remove_long_seq(maxlen, xs, labels)
-
- if not num_words:
- num_words = max(max(x) for x in xs)
-
- # by convention, use 2 as OOV word
- # reserve 'index_from' (=3 by default) characters:
- # 0 (padding), 1 (start), 2 (OOV)
- if oov_char is not None:
- xs = [[w if skip_top <= w < num_words else oov_char for w in x] for x in xs]
- else:
- xs = [[w for w in x if skip_top <= w < num_words] for x in xs]
-
- idx = int(len(xs) * (1 - test_split))
- x_train, y_train = np.array(xs[:idx], dtype='object'), np.array(labels[:idx])
- x_test, y_test = np.array(xs[idx:], dtype='object'), np.array(labels[idx:])
-
- return (x_train, y_train), (x_test, y_test)
-
-
-def get_word_index(path='reuters_word_index.json'):
- """Retrieves a dict mapping words to their index in the Reuters dataset.
-
- Args:
- path: where to cache the data (relative to `~/.keras/dataset`).
-
- Returns:
- The word index dictionary. Keys are word strings, values are their index.
- """
- origin_folder = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
- path = get_file(
- path,
- origin=origin_folder + 'reuters_word_index.json',
- file_hash='4d44cc38712099c9e383dc6e5f11a921')
- with open(path) as f:
- return json.load(f)
diff --git a/aitk/keras/datasets/utils.py b/aitk/keras/datasets/utils.py
deleted file mode 100644
index 41bbb37..0000000
--- a/aitk/keras/datasets/utils.py
+++ /dev/null
@@ -1,871 +0,0 @@
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-
-"""Utilities for file download and caching."""
-
-from abc import abstractmethod
-from contextlib import closing
-import functools
-import hashlib
-import multiprocessing.dummy
-import os
-import io
-import pathlib
-import queue
-import random
-import shutil
-import tarfile
-import threading
-import time
-import typing
-import urllib
-import weakref
-import zipfile
-
-from six.moves.urllib.parse import urlsplit
-import numpy as np
-from six.moves.urllib.request import urlopen
-from urllib.request import urlretrieve
-
-async def get_file_async(origin_folder, file_name):
- # In-browser download path: `js.fetch` is the JavaScript fetch() API
- # exposed to Python by pyodide, so this coroutine only works when the
- # package is running in that environment.
- try:
- print("Downloading data from %s" % (origin_folder + file_name))
- import js
- response = await js.fetch(origin_folder + file_name)
- fp = io.BytesIO((await response.arrayBuffer()).to_py())
- bytes = fp.read()
- with open(file_name, "wb") as fp:
- fp.write(bytes)
- except Exception:
- print("Could not load dataset")
- return
- return file_name
-
-def path_to_string(path):
- """Convert `PathLike` objects to their string representation.
-
- If given a non-string typed path object, converts it to its string
- representation.
-
- If the object passed to `path` is not a `PathLike` object, it is
- returned unchanged. This allows e.g. file objects to pass through
- this function untouched.
-
- Args:
- path: `PathLike` object that represents a path
-
- Returns:
- The string representation of `path` if it is a `PathLike` object;
- otherwise `path`, returned unchanged.
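-
- For example (a sketch): `path_to_string(pathlib.Path("data") / "mnist.npz")`
- returns `'data/mnist.npz'` on POSIX systems, while a non-path object such
- as an open file is returned unchanged.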
- """
- if isinstance(path, os.PathLike):
- return os.fspath(path)
- return path
-
-def _extract_archive(file_path, path='.', archive_format='auto'):
- """Extracts an archive if it matches tar, tar.gz, tar.bz, or zip formats.
-
- Args:
- file_path: path to the archive file
- path: path to extract the archive file
- archive_format: Archive format to try for extracting the file.
- Options are 'auto', 'tar', 'zip', and None.
- 'tar' includes tar, tar.gz, and tar.bz files.
- The default 'auto' is ['tar', 'zip'].
- None or an empty list will return no matches found.
-
- Returns:
- True if a match was found and an archive extraction was completed,
- False otherwise.
- """
- if archive_format is None:
- return False
- if archive_format == 'auto':
- archive_format = ['tar', 'zip']
- if isinstance(archive_format, str):
- archive_format = [archive_format]
-
- file_path = path_to_string(file_path)
- path = path_to_string(path)
-
- for archive_type in archive_format:
- if archive_type == 'tar':
- open_fn = tarfile.open
- is_match_fn = tarfile.is_tarfile
- if archive_type == 'zip':
- open_fn = zipfile.ZipFile
- is_match_fn = zipfile.is_zipfile
-
- if is_match_fn(file_path):
- with open_fn(file_path) as archive:
- try:
- archive.extractall(path)
- except (tarfile.TarError, RuntimeError, KeyboardInterrupt):
- if os.path.exists(path):
- if os.path.isfile(path):
- os.remove(path)
- else:
- shutil.rmtree(path)
- raise
- return True
- return False
-
-
-def get_file(fname=None,
- origin=None,
- untar=False,
- md5_hash=None,
- file_hash=None,
- cache_subdir='datasets',
- hash_algorithm='auto',
- extract=False,
- archive_format='auto',
- cache_dir=None):
- """Downloads a file from a URL if it not already in the cache.
-
- By default the file at the url `origin` is downloaded to the
- cache_dir `~/.keras`, placed in the cache_subdir `datasets`,
- and given the filename `fname`. The final location of a file
- `example.txt` would therefore be `~/.keras/datasets/example.txt`.
-
- Files in tar, tar.gz, tar.bz, and zip formats can also be extracted.
- Passing a hash will verify the file after download. The command line
- programs `shasum` and `sha256sum` can compute the hash.
-
- Example:
-
- ```python
- path_to_downloaded_file = tf.keras.utils.get_file(
- "flower_photos",
- "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz",
- untar=True)
- ```
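-
- A hash can also be passed to verify the download; a minimal sketch (the
- URL and sha256 value below are the MNIST entries used elsewhere in this
- package):
-
- ```python
- path = get_file(
- "mnist.npz",
- origin="https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz",
- file_hash="731c5ac602752760c8e48fbffcf8c3b850d9dc2a2aedcf2cc48468fc17b673d1")
- ```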
-
- Args:
- fname: Name of the file. If an absolute path `/path/to/file.txt` is
- specified the file will be saved at that location. If `None`, the
- name of the file at `origin` will be used.
- origin: Original URL of the file.
- untar: Deprecated in favor of `extract` argument.
- boolean, whether the file should be decompressed
- md5_hash: Deprecated in favor of `file_hash` argument.
- md5 hash of the file for verification
- file_hash: The expected hash string of the file after download.
- The sha256 and md5 hash algorithms are both supported.
- cache_subdir: Subdirectory under the Keras cache dir where the file is
- saved. If an absolute path `/path/to/folder` is
- specified the file will be saved at that location.
- hash_algorithm: Select the hash algorithm to verify the file.
- options are `'md5'`, `'sha256'`, and `'auto'`.
- The default 'auto' detects the hash algorithm in use.
- extract: True tries extracting the file as an Archive, like tar or zip.
- archive_format: Archive format to try for extracting the file.
- Options are `'auto'`, `'tar'`, `'zip'`, and `None`.
- `'tar'` includes tar, tar.gz, and tar.bz files.
- The default `'auto'` corresponds to `['tar', 'zip']`.
- None or an empty list will return no matches found.
- cache_dir: Location to store cached files, when None it
- defaults to the default directory `~/.keras/`.
-
- Returns:
- Path to the downloaded file
- """
- if origin is None:
- raise ValueError('Please specify the "origin" argument (URL of the file '
- 'to download).')
-
- if cache_dir is None:
- cache_dir = os.path.join(os.path.expanduser('~'), '.keras')
- if md5_hash is not None and file_hash is None:
- file_hash = md5_hash
- hash_algorithm = 'md5'
- datadir_base = os.path.expanduser(cache_dir)
- if not os.access(datadir_base, os.W_OK):
- datadir_base = os.path.join('/tmp', '.keras')
- datadir = os.path.join(datadir_base, cache_subdir)
- _makedirs_exist_ok(datadir)
-
- fname = path_to_string(fname)
- if not fname:
- fname = os.path.basename(urlsplit(origin).path)
- if not fname:
- raise ValueError(
- f"Can't parse the file name from the origin provided: '{origin}'."
- "Please specify the `fname` as the input param.")
-
- if untar:
- if fname.endswith('.tar.gz'):
- fname = pathlib.Path(fname)
- # The 2 `.with_suffix()` are because of `.tar.gz` as pathlib
- # considers it as 2 suffixes.
- fname = fname.with_suffix('').with_suffix('')
- fname = str(fname)
- untar_fpath = os.path.join(datadir, fname)
- fpath = untar_fpath + '.tar.gz'
- else:
- fpath = os.path.join(datadir, fname)
-
- download = False
- if os.path.exists(fpath):
- # File found; verify integrity if a hash was provided.
- if file_hash is not None:
- if not validate_file(fpath, file_hash, algorithm=hash_algorithm):
- print('A local file was found, but it seems to be '
- 'incomplete or outdated because the ' + hash_algorithm +
- ' file hash does not match the original value of ' + file_hash +
- ' so we will re-download the data.')
- download = True
- else:
- download = True
-
- if download:
- print('Downloading data from', origin)
-
- error_msg = 'URL fetch failure on {}: {} -- {}'
- try:
- try:
- urlretrieve(origin, fpath)
- except urllib.error.HTTPError as e:
- raise Exception(error_msg.format(origin, e.code, e.msg))
- except urllib.error.URLError as e:
- raise Exception(error_msg.format(origin, e.errno, e.reason))
- except (Exception, KeyboardInterrupt) as e:
- if os.path.exists(fpath):
- os.remove(fpath)
- raise
-
- if untar:
- if not os.path.exists(untar_fpath):
- _extract_archive(fpath, datadir, archive_format='tar')
- return untar_fpath
-
- if extract:
- _extract_archive(fpath, datadir, archive_format)
-
- return fpath
-
-
-def _makedirs_exist_ok(datadir):
- os.makedirs(datadir, exist_ok=True) # pylint: disable=unexpected-keyword-arg
-
-
-def _resolve_hasher(algorithm, file_hash=None):
- """Returns hash algorithm as hashlib function."""
- if algorithm == 'sha256':
- return hashlib.sha256()
-
- if algorithm == 'auto' and file_hash is not None and len(file_hash) == 64:
- return hashlib.sha256()
-
- # This is used only for legacy purposes.
- return hashlib.md5()
-
-
-def _hash_file(fpath, algorithm='sha256', chunk_size=65535):
- """Calculates a file sha256 or md5 hash.
-
- Example:
-
- ```python
- _hash_file('/path/to/file.zip')
- 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
- ```
-
- Args:
- fpath: path to the file being validated
- algorithm: hash algorithm, one of `'auto'`, `'sha256'`, or `'md5'`.
- The default `'auto'` detects the hash algorithm in use.
- chunk_size: Bytes to read at a time, important for large files.
-
- Returns:
- The file hash
- """
- if isinstance(algorithm, str):
- hasher = _resolve_hasher(algorithm)
- else:
- hasher = algorithm
-
- with open(fpath, 'rb') as fpath_file:
- for chunk in iter(lambda: fpath_file.read(chunk_size), b''):
- hasher.update(chunk)
-
- return hasher.hexdigest()
-
-
-def validate_file(fpath, file_hash, algorithm='auto', chunk_size=65535):
- """Validates a file against a sha256 or md5 hash.
-
- Args:
- fpath: path to the file being validated
- file_hash: The expected hash string of the file.
- The sha256 and md5 hash algorithms are both supported.
- algorithm: Hash algorithm, one of 'auto', 'sha256', or 'md5'.
- The default 'auto' detects the hash algorithm in use.
- chunk_size: Bytes to read at a time, important for large files.
-
- Returns:
- Whether the file is valid
- """
- hasher = _resolve_hasher(algorithm, file_hash)
-
- return str(_hash_file(fpath, hasher, chunk_size)) == str(file_hash)
-
-
-class ThreadsafeIter:
- """Wrap an iterator with a lock and propagate exceptions to all threads."""
-
- def __init__(self, it):
- self.it = it
- self.lock = threading.Lock()
-
- # After a generator throws an exception all subsequent next() calls raise a
- # StopIteration Exception. This, however, presents an issue when mixing
- # generators and threading because it means the order of retrieval need not
- # match the order in which the generator was called. This can make it appear
- # that a generator exited normally when in fact the terminating exception is
- # just in a different thread. In order to provide thread safety, once
- # self.it has thrown an exception we continue to throw the same exception.
- self._exception = None
-
- def __iter__(self):
- return self
-
- def next(self):
- return self.__next__()
-
- def __next__(self):
- with self.lock:
- if self._exception:
- raise self._exception # pylint: disable=raising-bad-type
-
- try:
- return next(self.it)
- except Exception as e:
- self._exception = e
- raise
-
-
-def threadsafe_generator(f):
-
- @functools.wraps(f)
- def g(*a, **kw):
- return ThreadsafeIter(f(*a, **kw))
-
- return g
-
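-# Example (sketch): decorating a generator function lets several threads pull
-# items from it without interleaving calls into the underlying generator:
-#
-#     @threadsafe_generator
-#     def batch_stream(data, batch_size):
-#         for i in range(0, len(data), batch_size):
-#             yield data[i:i + batch_size]
-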
-
-class Sequence:
- """Base object for fitting to a sequence of data, such as a dataset.
-
- Every `Sequence` must implement the `__getitem__` and the `__len__` methods.
- If you want to modify your dataset between epochs you may implement
- `on_epoch_end`.
- The method `__getitem__` should return a complete batch.
-
- Notes:
-
- `Sequence` is a safer way to do multiprocessing. This structure guarantees
- that the network will only train once on each sample per epoch, which is
- not the case with generators.
-
- Examples:
-
- ```python
- from skimage.io import imread
- from skimage.transform import resize
- import numpy as np
- import math
-
- # Here, `x_set` is list of path to the images
- # and `y_set` are the associated classes.
-
- class CIFAR10Sequence(Sequence):
-
- def __init__(self, x_set, y_set, batch_size):
- self.x, self.y = x_set, y_set
- self.batch_size = batch_size
-
- def __len__(self):
- return math.ceil(len(self.x) / self.batch_size)
-
- def __getitem__(self, idx):
- batch_x = self.x[idx * self.batch_size:(idx + 1) *
- self.batch_size]
- batch_y = self.y[idx * self.batch_size:(idx + 1) *
- self.batch_size]
-
- return np.array([
- resize(imread(file_name), (200, 200))
- for file_name in batch_x]), np.array(batch_y)
- ```
- """
-
- @abstractmethod
- def __getitem__(self, index):
- """Gets batch at position `index`.
-
- Args:
- index: position of the batch in the Sequence.
-
- Returns:
- A batch
- """
- raise NotImplementedError
-
- @abstractmethod
- def __len__(self):
- """Number of batch in the Sequence.
-
- Returns:
- The number of batches in the Sequence.
- """
- raise NotImplementedError
-
- def on_epoch_end(self):
- """Method called at the end of every epoch.
- """
- pass
-
- def __iter__(self):
- """Create a generator that iterate over the Sequence."""
- for item in (self[i] for i in range(len(self))):
- yield item
-
-
-def iter_sequence_infinite(seq):
- """Iterates indefinitely over a Sequence.
-
- Args:
- seq: `Sequence` instance.
-
- Yields:
- Batches of data from the `Sequence`.
- """
- while True:
- for item in seq:
- yield item
-
-
-# Global variables to be shared across processes
-_SHARED_SEQUENCES = {}
-# We use a Value to provide a unique id to different processes.
-_SEQUENCE_COUNTER = None
-
-
-# Because multiprocessing pools are inherently unsafe, starting from a clean
-# state can be essential to avoiding deadlocks. In order to accomplish this, we
-# need to be able to check on the status of Pools that we create.
-_DATA_POOLS = weakref.WeakSet()
-_WORKER_ID_QUEUE = None # Only created if needed.
-_WORKER_IDS = set()
-_FORCE_THREADPOOL = False
-_FORCE_THREADPOOL_LOCK = threading.RLock()
-
-
-def dont_use_multiprocessing_pool(f):
- @functools.wraps(f)
- def wrapped(*args, **kwargs):
- with _FORCE_THREADPOOL_LOCK:
- global _FORCE_THREADPOOL
- old_force_threadpool, _FORCE_THREADPOOL = _FORCE_THREADPOOL, True
- out = f(*args, **kwargs)
- _FORCE_THREADPOOL = old_force_threadpool
- return out
- return wrapped
-
-
-def get_pool_class(use_multiprocessing):
- global _FORCE_THREADPOOL
- if not use_multiprocessing or _FORCE_THREADPOOL:
- return multiprocessing.dummy.Pool # ThreadPool
- return multiprocessing.Pool
-
-
-def get_worker_id_queue():
- """Lazily create the queue to track worker ids."""
- global _WORKER_ID_QUEUE
- if _WORKER_ID_QUEUE is None:
- _WORKER_ID_QUEUE = multiprocessing.Queue()
- return _WORKER_ID_QUEUE
-
-
-def init_pool(seqs):
- global _SHARED_SEQUENCES
- _SHARED_SEQUENCES = seqs
-
-
-def get_index(uid, i):
- """Get the value from the Sequence `uid` at index `i`.
-
- To allow multiple Sequences to be used at the same time, we use `uid` to
- get a specific one. A single Sequence would cause the validation to
- overwrite the training Sequence.
-
- Args:
- uid: int, Sequence identifier
- i: index
-
- Returns:
- The value at index `i`.
- """
- return _SHARED_SEQUENCES[uid][i]
-
-
-class SequenceEnqueuer:
- """Base class to enqueue inputs.
-
- The task of an Enqueuer is to use parallelism to speed up preprocessing.
- This is done with processes or threads.
-
- Example:
-
- ```python
- enqueuer = SequenceEnqueuer(...)
- enqueuer.start()
- datas = enqueuer.get()
- for data in datas:
- # Use the inputs; training, evaluating, predicting.
- # ... stop sometime.
- enqueuer.stop()
- ```
-
- The `enqueuer.get()` should be an infinite stream of data.
- """
-
- def __init__(self, sequence,
- use_multiprocessing=False):
- self.sequence = sequence
- self.use_multiprocessing = use_multiprocessing
-
- global _SEQUENCE_COUNTER
- if _SEQUENCE_COUNTER is None:
- try:
- _SEQUENCE_COUNTER = multiprocessing.Value('i', 0)
- except OSError:
- # In this case the OS does not allow us to use
- # multiprocessing. We resort to an int
- # for enqueuer indexing.
- _SEQUENCE_COUNTER = 0
-
- if isinstance(_SEQUENCE_COUNTER, int):
- self.uid = _SEQUENCE_COUNTER
- _SEQUENCE_COUNTER += 1
- else:
- # Doing Multiprocessing.Value += x is not process-safe.
- with _SEQUENCE_COUNTER.get_lock():
- self.uid = _SEQUENCE_COUNTER.value
- _SEQUENCE_COUNTER.value += 1
-
- self.workers = 0
- self.executor_fn = None
- self.queue = None
- self.run_thread = None
- self.stop_signal = None
-
- def is_running(self):
- return self.stop_signal is not None and not self.stop_signal.is_set()
-
- def start(self, workers=1, max_queue_size=10):
- """Starts the handler's workers.
-
- Args:
- workers: Number of workers.
- max_queue_size: queue size
- (when full, workers could block on `put()`)
- """
- if self.use_multiprocessing:
- self.executor_fn = self._get_executor_init(workers)
- else:
- # We do not need the init since it's threads.
- self.executor_fn = lambda _: get_pool_class(False)(workers)
- self.workers = workers
- self.queue = queue.Queue(max_queue_size)
- self.stop_signal = threading.Event()
- self.run_thread = threading.Thread(target=self._run)
- self.run_thread.daemon = True
- self.run_thread.start()
-
- def _send_sequence(self):
- """Sends current Iterable to all workers."""
- # For new processes that may spawn
- _SHARED_SEQUENCES[self.uid] = self.sequence
-
- def stop(self, timeout=None):
- """Stops running threads and wait for them to exit, if necessary.
-
- Should be called by the same thread which called `start()`.
-
- Args:
- timeout: maximum time to wait on `thread.join()`
- """
- self.stop_signal.set()
- with self.queue.mutex:
- self.queue.queue.clear()
- self.queue.unfinished_tasks = 0
- self.queue.not_full.notify()
- self.run_thread.join(timeout)
- _SHARED_SEQUENCES[self.uid] = None
-
- def __del__(self):
- if self.is_running():
- self.stop()
-
- @abstractmethod
- def _run(self):
- """Submits request to the executor and queue the `Future` objects."""
- raise NotImplementedError
-
- @abstractmethod
- def _get_executor_init(self, workers):
- """Gets the Pool initializer for multiprocessing.
-
- Args:
- workers: Number of workers.
-
- Returns:
- Function, a Function to initialize the pool
- """
- raise NotImplementedError
-
- @abstractmethod
- def get(self):
- """Creates a generator to extract data from the queue.
-
- Skip the data if it is `None`.
-
- Yields:
- Tuples `(inputs, targets)`
- or `(inputs, targets, sample_weights)`.
- """
- raise NotImplementedError
-
-
-class OrderedEnqueuer(SequenceEnqueuer):
- """Builds a Enqueuer from a Sequence.
-
- Args:
- sequence: A `tf.keras.utils.data_utils.Sequence` object.
- use_multiprocessing: use multiprocessing if True, otherwise threading
- shuffle: whether to shuffle the data at the beginning of each epoch
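-
-  Example (a minimal sketch; `seq` is any `Sequence` instance):
-
-  ```python
-  enqueuer = OrderedEnqueuer(seq, use_multiprocessing=False, shuffle=True)
-  enqueuer.start(workers=2, max_queue_size=10)
-  batches = enqueuer.get()   # generator yielding batches in order
-  first_batch = next(batches)
-  enqueuer.stop()
-  ```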
- """
-
- def __init__(self, sequence, use_multiprocessing=False, shuffle=False):
- super(OrderedEnqueuer, self).__init__(sequence, use_multiprocessing)
- self.shuffle = shuffle
-
- def _get_executor_init(self, workers):
- """Gets the Pool initializer for multiprocessing.
-
- Args:
- workers: Number of workers.
-
- Returns:
- Function, a Function to initialize the pool
- """
- def pool_fn(seqs):
- pool = get_pool_class(True)(
- workers, initializer=init_pool_generator,
- initargs=(seqs, None, get_worker_id_queue()))
- _DATA_POOLS.add(pool)
- return pool
-
- return pool_fn
-
- def _wait_queue(self):
- """Wait for the queue to be empty."""
- while True:
- time.sleep(0.1)
- if self.queue.unfinished_tasks == 0 or self.stop_signal.is_set():
- return
-
- def _run(self):
- """Submits request to the executor and queue the `Future` objects."""
- sequence = list(range(len(self.sequence)))
- self._send_sequence() # Share the initial sequence
- while True:
- if self.shuffle:
- random.shuffle(sequence)
-
- with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
- for i in sequence:
- if self.stop_signal.is_set():
- return
-
- self.queue.put(
- executor.apply_async(get_index, (self.uid, i)), block=True)
-
- # Done with the current epoch, waiting for the final batches
- self._wait_queue()
-
- if self.stop_signal.is_set():
- # We're done
- return
-
- # Call the internal on epoch end.
- self.sequence.on_epoch_end()
- self._send_sequence() # Update the pool
-
- def get(self):
- """Creates a generator to extract data from the queue.
-
- Skip the data if it is `None`.
-
- Yields:
- The next element in the queue, i.e. a tuple
- `(inputs, targets)` or
- `(inputs, targets, sample_weights)`.
- """
- while self.is_running():
- try:
- inputs = self.queue.get(block=True, timeout=5).get()
- if self.is_running():
- self.queue.task_done()
- if inputs is not None:
- yield inputs
- except queue.Empty:
- pass
- except Exception as e: # pylint: disable=broad-except
- self.stop()
- raise e
-
-
-def init_pool_generator(gens, random_seed=None, id_queue=None):
- """Initializer function for pool workers.
-
- Args:
- gens: State which should be made available to worker processes.
- random_seed: An optional value with which to seed child processes.
- id_queue: A multiprocessing Queue of worker ids. This is used to indicate
- that a worker process was created by Keras and can be terminated using
- the cleanup_all_keras_forkpools utility.
- """
- global _SHARED_SEQUENCES
- _SHARED_SEQUENCES = gens
-
- worker_proc = multiprocessing.current_process()
-
- # name isn't used for anything, but setting a more descriptive name is helpful
- # when diagnosing orphaned processes.
- worker_proc.name = 'Keras_worker_{}'.format(worker_proc.name)
-
- if random_seed is not None:
- np.random.seed(random_seed + worker_proc.ident)
-
- if id_queue is not None:
- # If a worker dies during init, the pool will just create a replacement.
- id_queue.put(worker_proc.ident, block=True, timeout=0.1)
-
-
-def next_sample(uid):
- """Gets the next value from the generator `uid`.
-
- To allow multiple generators to be used at the same time, we use `uid` to
- get a specific one. A single generator would cause the validation to
- overwrite the training generator.
-
- Args:
- uid: int, generator identifier
-
- Returns:
- The next value of generator `uid`.
- """
- return next(_SHARED_SEQUENCES[uid])
-
-
-class GeneratorEnqueuer(SequenceEnqueuer):
- """Builds a queue out of a data generator.
-
- The provided generator can be finite, in which case the class will throw
- a `StopIteration` exception.
-
- Args:
- generator: a generator function which yields data
- use_multiprocessing: use multiprocessing if True, otherwise threading
- random_seed: Initial seed for workers,
- will be incremented by one for each worker.
- """
-
- def __init__(self, generator,
- use_multiprocessing=False,
- random_seed=None):
- super(GeneratorEnqueuer, self).__init__(generator, use_multiprocessing)
- self.random_seed = random_seed
-
- def _get_executor_init(self, workers):
- """Gets the Pool initializer for multiprocessing.
-
- Args:
- workers: Number of workers.
-
- Returns:
- A Function to initialize the pool
- """
- def pool_fn(seqs):
- pool = get_pool_class(True)(
- workers, initializer=init_pool_generator,
- initargs=(seqs, self.random_seed, get_worker_id_queue()))
- _DATA_POOLS.add(pool)
- return pool
- return pool_fn
-
- def _run(self):
- """Submits request to the executor and queue the `Future` objects."""
- self._send_sequence() # Share the initial generator
- with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
- while True:
- if self.stop_signal.is_set():
- return
-
- self.queue.put(
- executor.apply_async(next_sample, (self.uid,)), block=True)
-
- def get(self):
- """Creates a generator to extract data from the queue.
-
- Skip the data if it is `None`.
-
- Yields:
- The next element in the queue, i.e. a tuple
- `(inputs, targets)` or
- `(inputs, targets, sample_weights)`.
- """
- try:
- while self.is_running():
- inputs = self.queue.get(block=True).get()
- self.queue.task_done()
- if inputs is not None:
- yield inputs
- except StopIteration:
- # Special case for finite generators
- last_ones = []
- while self.queue.qsize() > 0:
- last_ones.append(self.queue.get(block=True))
- # Wait for them to complete
- for f in last_ones:
- f.wait()
- # Keep the good ones
- last_ones = [future.get() for future in last_ones if future.successful()]
- for inputs in last_ones:
- if inputs is not None:
- yield inputs
- except Exception as e: # pylint: disable=broad-except
- self.stop()
- if 'generator already executing' in str(e):
- raise RuntimeError(
- 'Your generator is NOT thread-safe. '
- 'Keras requires a thread-safe generator when '
- '`use_multiprocessing=False, workers > 1`. ')
- raise e
diff --git a/aitk/keras/initializers/README.md b/aitk/keras/initializers/README.md
deleted file mode 100644
index ebbe2f0..0000000
--- a/aitk/keras/initializers/README.md
+++ /dev/null
@@ -1,4 +0,0 @@
-# Initializers
-The `initializers.py` module contains objects for constructing optimizers,
-activation functions, weight initializers, and learning rate schedulers from
-strings or parameter dictionaries.
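-
-For example, a minimal sketch using the string forms accepted by the
-initializer classes in this module:
-
-```python
-from aitk.keras.initializers import ActivationInitializer, OptimizerInitializer
-
-act_fn = ActivationInitializer("relu")()   # -> a ReLU activation instance
-optimizer = OptimizerInitializer("sgd")()  # -> an SGD optimizer with defaults
-```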
diff --git a/aitk/keras/initializers/__init__.py b/aitk/keras/initializers/__init__.py
deleted file mode 100644
index 91c82ab..0000000
--- a/aitk/keras/initializers/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .initializers import *
diff --git a/aitk/keras/initializers/initializers.py b/aitk/keras/initializers/initializers.py
deleted file mode 100644
index a828fda..0000000
--- a/aitk/keras/initializers/initializers.py
+++ /dev/null
@@ -1,264 +0,0 @@
-import re
-from functools import partial
-from ast import literal_eval as eval
-
-import numpy as np
-
-from ..optimizers import OptimizerBase, SGD, AdaGrad, RMSProp, Adam
-from ..activations import ActivationBase, Affine, ReLU, Tanh, Sigmoid, LeakyReLU
-from ..schedulers import (
- SchedulerBase,
- ConstantScheduler,
- ExponentialScheduler,
- NoamScheduler,
- KingScheduler,
-)
-
-from ..utils import (
- he_normal,
- he_uniform,
- glorot_normal,
- glorot_uniform,
- truncated_normal,
-)
-
-
-class ActivationInitializer(object):
- def __init__(self, param=None):
- """
- A class for initializing activation functions. Valid inputs are:
- (a) __str__ representations of `ActivationBase` instances
- (b) `ActivationBase` instances
-
- If `param` is `None`, return the identity function: f(X) = X
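-
- For example (a sketch):
-
- ActivationInitializer("relu")()                   # -> a ReLU instance
- ActivationInitializer("leaky relu(alpha=0.3)")()  # -> LeakyReLU with alpha=0.3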
- """
- self.param = param
-
- def __call__(self):
- param = self.param
- if param is None:
- act = Affine(slope=1, intercept=0)
- elif isinstance(param, ActivationBase):
- act = param.copy()
- elif isinstance(param, str):
- act = self.init_from_str(param)
- else:
- raise ValueError("Unknown activation: {}".format(param))
- return act
-
- def init_from_str(self, act_str):
- act_str = act_str.lower()
- if act_str == "relu":
- act_fn = ReLU()
- elif act_str == "tanh":
- act_fn = Tanh()
- elif act_str == "sigmoid":
- act_fn = Sigmoid()
- elif "affine" in act_str:
- r = r"affine\(slope=(.*), intercept=(.*)\)"
- slope, intercept = re.match(r, act_str).groups()
- act_fn = Affine(float(slope), float(intercept))
- elif "leaky relu" in act_str:
- r = r"leaky relu\(alpha=(.*)\)"
- alpha = re.match(r, act_str).groups()[0]
- act_fn = LeakyReLU(float(alpha))
- else:
- raise ValueError("Unknown activation: {}".format(act_str))
- return act_fn
-
-
-class SchedulerInitializer(object):
- def __init__(self, param=None, lr=None):
- """
- A class for initializing learning rate schedulers. Valid inputs are:
- (a) __str__ representations of `SchedulerBase` instances
- (b) `SchedulerBase` instances
- (c) Parameter dicts (e.g., as produced via the `summary` method in
- `LayerBase` instances)
-
- If `param` is `None`, return the ConstantScheduler with learning rate
- equal to `lr`.
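-
- For example (a sketch):
-
- SchedulerInitializer(lr=0.01)()  # -> ConstantScheduler(0.01)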
- """
- if all([lr is None, param is None]):
- raise ValueError("lr and param cannot both be `None`")
-
- self.lr = lr
- self.param = param
-
- def __call__(self):
- param = self.param
- if param is None:
- scheduler = ConstantScheduler(self.lr)
- elif isinstance(param, SchedulerBase):
- scheduler = param.copy()
- elif isinstance(param, str):
- scheduler = self.init_from_str()
- elif isinstance(param, dict):
- scheduler = self.init_from_dict()
- return scheduler
-
- def init_from_str(self):
- r = r"([a-zA-Z]*)=([^,)]*)"
- sch_str = self.param.lower()
- kwargs = dict([(i, eval(j)) for (i, j) in re.findall(r, sch_str)])
-
- if "constant" in sch_str:
- scheduler = ConstantScheduler(**kwargs)
- elif "exponential" in sch_str:
- scheduler = ExponentialScheduler(**kwargs)
- elif "noam" in sch_str:
- scheduler = NoamScheduler(**kwargs)
- elif "king" in sch_str:
- scheduler = KingScheduler(**kwargs)
- else:
- raise NotImplementedError("{}".format(sch_str))
- return scheduler
-
- def init_from_dict(self):
- S = self.param
- sc = S["hyperparameters"] if "hyperparameters" in S else None
-
- if sc is None:
- raise ValueError("Must have `hyperparameters` key: {}".format(S))
-
- if sc and sc["id"] == "ConstantScheduler":
- scheduler = ConstantScheduler()
- elif sc and sc["id"] == "ExponentialScheduler":
- scheduler = ExponentialScheduler()
- elif sc and sc["id"] == "NoamScheduler":
- scheduler = NoamScheduler()
- elif sc:
- raise NotImplementedError("{}".format(sc["id"]))
- scheduler.set_params(sc)
- return scheduler
-
-
-class OptimizerInitializer(object):
- def __init__(self, param=None):
- """
- A class for initializing optimizers. Valid inputs are:
- (a) __str__ representations of `OptimizerBase` instances
- (b) `OptimizerBase` instances
- (c) Parameter dicts (e.g., as produced via the `summary` method in
- `LayerBase` instances)
-
- If `param` is `None`, return the SGD optimizer with default parameters.
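-
- For example (a sketch):
-
- OptimizerInitializer()()        # -> SGD with default parameters
- OptimizerInitializer("adam")()  # -> Adam with default parameters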
- """
- self.param = param
-
- def __call__(self):
- param = self.param
- if param is None:
- opt = SGD()
- elif isinstance(param, OptimizerBase):
- opt = param.copy()
- elif isinstance(param, str):
- opt = self.init_from_str()
- elif isinstance(param, dict):
- opt = self.init_from_dict()
- return opt
-
- def init_from_str(self):
- r = r"([a-zA-Z]*)=([^,)]*)"
- opt_str = self.param.lower()
- kwargs = dict([(i, eval(j)) for (i, j) in re.findall(r, opt_str)])
- if "sgd" in opt_str:
- optimizer = SGD(**kwargs)
- elif "adagrad" in opt_str:
- optimizer = AdaGrad(**kwargs)
- elif "rmsprop" in opt_str:
- optimizer = RMSProp(**kwargs)
- elif "adam" in opt_str:
- optimizer = Adam(**kwargs)
- else:
- raise NotImplementedError("{}".format(opt_str))
- return optimizer
-
- def init_from_dict(self):
- O = self.param
- cc = O["cache"] if "cache" in O else None
- op = O["hyperparameters"] if "hyperparameters" in O else None
-
- if op is None:
- raise ValueError("Must have `hyperparemeters` key: {}".format(O))
-
- if op and op["id"] == "SGD":
- optimizer = SGD()
- elif op and op["id"] == "RMSProp":
- optimizer = RMSProp()
- elif op and op["id"] == "AdaGrad":
- optimizer = AdaGrad()
- elif op and op["id"] == "Adam":
- optimizer = Adam()
- elif op:
- raise NotImplementedError("{}".format(op["id"]))
- optimizer.set_params(op, cc)
- return optimizer
-
-
-class WeightInitializer(object):
- def __init__(self, act_fn_str, mode="glorot_uniform"):
- """
- A factory for weight initializers.
-
- Parameters
- ----------
- act_fn_str : str
- The string representation for the layer activation function
- mode : str (default: 'glorot_uniform')
- The weight initialization strategy. Valid entries are {"he_normal",
- "he_uniform", "glorot_normal", glorot_uniform", "std_normal",
- "trunc_normal"}
- """
- if mode not in [
- "he_normal",
- "he_uniform",
- "glorot_normal",
- "glorot_uniform",
- "std_normal",
- "trunc_normal",
- ]:
- raise ValueError("Unrecognize initialization mode: {}".format(mode))
-
- self.mode = mode
- self.act_fn = act_fn_str
-
- if mode == "glorot_uniform":
- self._fn = glorot_uniform
- elif mode == "glorot_normal":
- self._fn = glorot_normal
- elif mode == "he_uniform":
- self._fn = he_uniform
- elif mode == "he_normal":
- self._fn = he_normal
- elif mode == "std_normal":
- self._fn = np.random.randn
- elif mode == "trunc_normal":
- self._fn = partial(truncated_normal, mean=0, std=1)
-
- def __call__(self, weight_shape):
- if "glorot" in self.mode:
- gain = self._calc_glorot_gain()
- W = self._fn(weight_shape, gain)
- elif self.mode == "std_normal":
- W = self._fn(*weight_shape)
- else:
- W = self._fn(weight_shape)
- return W
-
- def _calc_glorot_gain(self):
- """
- Values from:
- https://pytorch.org/docs/stable/nn.html?#torch.nn.init.calculate_gain
- """
- gain = 1.0
- act_str = self.act_fn.lower()
- if act_str == "tanh":
- gain = 5.0 / 3.0
- elif act_str == "relu":
- gain = np.sqrt(2)
- elif "leaky relu" in act_str:
- r = r"leaky relu\(alpha=(.*)\)"
- alpha = re.match(r, act_str).groups()[0]
- gain = np.sqrt(2 / (1 + float(alpha) ** 2))  # leaky ReLU gain
- return gain
diff --git a/aitk/keras/layers/README.md b/aitk/keras/layers/README.md
deleted file mode 100644
index 81e888c..0000000
--- a/aitk/keras/layers/README.md
+++ /dev/null
@@ -1,20 +0,0 @@
-# Layers
-The `layers.py` module implements common layers / layer-wise operations that can
-be composed to create larger neural networks. It includes:
-
-- Fully-connected layers
-- Sparse evolutionary layers ([Mocanu et al., 2018](https://www.nature.com/articles/s41467-018-04316-3))
-- Dot-product attention layers ([Luong, Pham, & Manning, 2015](https://arxiv.org/pdf/1508.04025.pdf); [Vaswani et al., 2017](https://arxiv.org/pdf/1706.03762.pdf))
-- 1D and 2D convolution (with stride, padding, and dilation) layers ([van den Oord et al., 2016](https://arxiv.org/pdf/1609.03499.pdf); [Yu & Koltun, 2016](https://arxiv.org/pdf/1511.07122.pdf))
-- 2D "deconvolution" (with stride and padding) layers ([Zeiler et al., 2010](https://www.matthewzeiler.com/mattzeiler/deconvolutionalnetworks.pdf))
-- Restricted Boltzmann machines (with CD-_n_ training) ([Smolensky, 1996](http://stanford.edu/~jlmcc/papers/PDP/Volume%201/Chap6_PDP86.pdf); [Carreira-Perpiñán & Hinton, 2005](http://www.cs.toronto.edu/~fritz/absps/cdmiguel.pdf))
-- Elementwise multiplication operation
-- Summation operation
-- Flattening operation
-- Embedding layer
-- Softmax layer
-- Max & average pooling layer
-- 1D and 2D batch normalization layers ([Ioffe & Szegedy, 2015](http://proceedings.mlr.press/v37/ioffe15.pdf))
-- 1D and 2D layer normalization layers ([Ba, Kiros, & Hinton, 2016](https://arxiv.org/pdf/1607.06450.pdf))
-- Recurrent layers ([Elman, 1990](https://crl.ucsd.edu/~elman/Papers/fsit.pdf))
-- Long short-term memory (LSTM) layers ([Hochreiter & Schmidhuber, 1997](http://www.bioinf.jku.at/publications/older/2604.pdf))
diff --git a/aitk/keras/layers/__init__.py b/aitk/keras/layers/__init__.py
deleted file mode 100644
index 790b4fa..0000000
--- a/aitk/keras/layers/__init__.py
+++ /dev/null
@@ -1,4324 +0,0 @@
-# -*- coding: utf-8 -*-
-# **************************************************************
-# aitk.keras: A Python Keras model API
-#
-# Copyright (c) 2021 AITK Developers
-#
-# https://github.com/ArtificialIntelligenceToolkit/aitk.keras
-#
-# **************************************************************
-
-"""A collection of composable layer objects for building neural networks"""
-from abc import ABC, abstractmethod
-
-import numpy as np
-
-from ..wrappers import init_wrappers, Dropout
-
-from ..initializers import (
- WeightInitializer,
- OptimizerInitializer,
- ActivationInitializer,
-)
-
-from ..utils import (
- pad1D,
- pad2D,
- conv1D,
- conv2D,
- im2col,
- col2im,
- dilate,
- deconv2D_naive,
- calc_pad_dims_2D,
-)
-
-class Activation():
- def __init__(self, activation):
- self.activation = activation
-
-NAME_CACHE = {}
-
-class LayerBase(ABC):
- def __init__(self, name=None):
- """An abstract base class inherited by all neural network layers"""
- self.X = []
- self.act_fn = None
- self.trainable = True
- self.name = self.make_name(name)
- self.optimizer = None
- self.default_kernel_optimizer = "glorot_uniform"
-
- self.gradients = {}
- self.parameters = {}
- self.derived_variables = {}
- self.input_layers = []
- self.output_layers = []
-
- super().__init__()
-
- def __call__(self, input_layer):
- if isinstance(input_layer, (list, tuple)):
- for layer in input_layer:
- layer.output_layers.append(self)
- self.input_layers.append(layer)
- else:
- input_layer.output_layers.append(self)
- self.input_layers.append(input_layer)
- return self
-
- def __str__(self):
- return f"<{self.__class__.__name__}(name='{self.name}')>"
-
- def make_name(self, name):
- if name is None:
- class_name = self.__class__.__name__.lower()
- count = NAME_CACHE.get(class_name, 0)
- if count == 0:
- new_name = class_name
- else:
- new_name = "%s_%s" % (class_name, count)
- NAME_CACHE[class_name] = count + 1
- return new_name
- else:
- return name
-
- def set_optimizer(self, optimizer=None):
- optimizer = optimizer or self.default_kernel_optimizer
- self.optimizer = OptimizerInitializer(optimizer)()
-
- def has_trainable_params(self):
- return self.parameters != {}
-
- @abstractmethod
- def _init_params(self, **kwargs):
- raise NotImplementedError
-
- @abstractmethod
- def forward(self, z, **kwargs):
- """Perform a forward pass through the layer"""
- raise NotImplementedError
-
- @abstractmethod
- def backward(self, out, **kwargs):
- """Perform a backward pass through the layer"""
- raise NotImplementedError
-
- def freeze(self):
- """
- Freeze the layer parameters at their current values so they can no
- longer be updated.
- """
- self.trainable = False
-
- def unfreeze(self):
- """Unfreeze the layer parameters so they can be updated."""
- self.trainable = True
-
- def flush_gradients(self):
- """Erase all the layer's derived variables and gradients."""
- assert self.trainable, "Layer is frozen"
- self.X = []
- for k, v in self.derived_variables.items():
- self.derived_variables[k] = []
-
- for k, v in self.gradients.items():
- self.gradients[k] = np.zeros_like(v)
-
- def update(self, cur_loss=None):
- """
- Update the layer parameters using the accrued gradients and layer
- optimizer. Flush all gradients once the update is complete.
- """
- assert self.trainable, "Layer is frozen"
- self.optimizer.step()
- for k, v in self.gradients.items():
- if k in self.parameters:
- self.parameters[k] = self.optimizer(self.parameters[k], v, k, cur_loss)
- self.flush_gradients()
-
- def set_params(self, summary_dict):
- """
- Set the layer parameters from a dictionary of values.
-
- Parameters
- ----------
- summary_dict : dict
- A dictionary of layer parameters and hyperparameters. If a required
- parameter or hyperparameter is not included within `summary_dict`,
- this method will use the value in the current layer's
- :meth:`summary` method.
-
- Returns
- -------
- layer : :doc:`Layer ` object
- The newly-initialized layer.
- """
- layer, sd = self, summary_dict
-
- # collapse `parameters` and `hyperparameters` nested dicts into a single
- # merged dictionary
- flatten_keys = ["parameters", "hyperparameters"]
- for k in flatten_keys:
- if k in sd:
- entry = sd[k]
- sd.update(entry)
- del sd[k]
-
- for k, v in sd.items():
- if k in self.parameters:
- layer.parameters[k] = v
- if k in self.hyperparameters:
- if k == "act_fn":
- layer.act_fn = ActivationInitializer(v)()
- elif k == "optimizer":
- layer.optimizer = OptimizerInitializer(sd[k])()
- elif k == "wrappers":
- layer = init_wrappers(layer, sd[k])
- elif k not in ["wrappers", "optimizer"]:
- setattr(layer, k, v)
- return layer
-
- def get_weights(self):
- # Returns pointers to weight matrices, in order:
- return [self.parameters[key] for key in self.parameters]
-
- def set_weights(self, weights, copy=True):
- # Ordered set of parameters:
- for i, key in enumerate(self.parameters):
- if copy:
- self.parameters[key] = weights[i].copy()
- else:
- self.parameters[key] = weights[i]
- self.weights_set = True
-
- def summary(self):
- """Return a dict of the layer parameters, hyperparameters, and ID."""
- return {
- "layer": self.hyperparameters["layer"],
- "parameters": self.parameters,
- "hyperparameters": self.hyperparameters,
- }
-
-
-class Input(LayerBase):
- def __init__(self, input_shape, batch_size=None, name=None):
- super().__init__(name=name)
- self.n_out = input_shape
- self.trainable = False
-
- def forward(self, z, **kwargs):
- """Perform a forward pass through the layer"""
- return z
-
- def backward(self, out, **kwargs):
- """Perform a backward pass through the layer"""
- raise NotImplementedError
-
- def _init_params(self, **kwargs):
- raise NotImplementedError
-
-InputLayer = Input
-
-class DotProductAttention(LayerBase):
- def __init__(self, scale=True, dropout_p=0, kernel_initializer="glorot_uniform", name=None):
- r"""
- A single "attention head" layer using a dot-product for the scoring function.
-
- Notes
- -----
- The equations for a dot product attention layer are:
-
- .. math::
-
- \mathbf{Z} &= \mathbf{K Q}^\\top \ \ \ \ &&\text{if scale = False} \\
- &= \mathbf{K Q}^\top / \sqrt{d_k} \ \ \ \ &&\text{if scale = True} \\
- \mathbf{Y} &= \text{dropout}(\text{softmax}(\mathbf{Z})) \mathbf{V}
-
- Parameters
- ----------
- scale : bool
- Whether to scale the key-query dot product by the square root
- of the key/query vector dimensionality before applying the Softmax.
- This is useful, since the scale of dot product will otherwise
- increase as query / key dimensions grow. Default is True.
- dropout_p : float in [0, 1)
- The dropout probability during training, applied to the output of
- the softmax. If 0, no dropout is applied. Default is 0.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- Unused.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.kernel_initializer = kernel_initializer
- self.scale = scale
- self.dropout_p = dropout_p
- self._init_params()
-
- def _init_params(self):
- self.softmax = Dropout(Softmax(), self.dropout_p)
- smdv = self.softmax.derived_variables
- self.derived_variables = {
- "attention_weights": [],
- "dropout_mask": smdv["wrappers"][0]["dropout_mask"],
- }
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "DotProductAttention",
- "kernel_initializer": self.kernel_initializer,
- "scale": self.scale,
- "dropout_p": self.dropout_p,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def freeze(self):
- """
- Freeze the layer parameters at their current values so they can no
- longer be updated.
- """
- self.trainable = False
- self.softmax.freeze()
-
- def unfreeze(self):
- """Unfreeze the layer parameters so they can be updated."""
- self.trainable = True
- self.softmax.unfreeze()
-
- def forward(self, Q, K, V, retain_derived=True):
- r"""
- Compute the attention-weighted output of a collection of keys, values,
- and queries.
-
- Notes
- -----
- In the most abstract (i.e., hand-wave-y) sense:
-
- - Query vectors ask questions
- - Key vectors advertise their relevancy to questions
- - Value vectors give possible answers to questions
- - The dot product between Key and Query vectors provides scores for
- each of the `n_ex` different Value vectors
-
- For a single query and `n` key-value pairs, dot-product attention (with
- scaling) is::
-
- w0 = dropout(softmax( (query @ key[0]) / sqrt(d_k) ))
- w1 = dropout(softmax( (query @ key[1]) / sqrt(d_k) ))
- ...
- wn = dropout(softmax( (query @ key[n]) / sqrt(d_k) ))
-
- y = np.array([w0, ..., wn]) @ values
- (1 × n_ex) (n_ex × d_v)
-
- In words, keys and queries are combined via dot-product to produce a
- score, which is then passed through a softmax to produce a weight on
- each value vector in Values. We elementwise multiply each value vector
- by its weight, and then take the elementwise sum of each weighted value
- vector to get the :math:`1 \times d_v` output for the current example.
-
- In vectorized form,
-
- .. math::
-
- \mathbf{Y} = \text{dropout}(
- \text{softmax}(\mathbf{KQ}^\top / \sqrt{d_k})
- ) \mathbf{V}
-
- Parameters
- ----------
- Q : :py:class:`ndarray <numpy.ndarray>` of shape `(n_ex, *, d_k)`
- A set of `n_ex` query vectors packed into a single matrix.
- Optional middle dimensions can be used to specify, e.g., the number
- of parallel attention heads.
- K : :py:class:`ndarray <numpy.ndarray>` of shape `(n_ex, *, d_k)`
- A set of `n_ex` key vectors packed into a single matrix. Optional
- middle dimensions can be used to specify, e.g., the number of
- parallel attention heads.
- V : :py:class:`ndarray <numpy.ndarray>` of shape `(n_ex, *, d_v)`
- A set of `n_ex` value vectors packed into a single matrix. Optional
- middle dimensions can be used to specify, e.g., the number of
- parallel attention heads.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray <numpy.ndarray>` of shape `(n_ex, *, d_v)`
- The attention-weighted output values
- """
- Y, weights = self._fwd(Q, K, V)
-
- if retain_derived:
- self.X.append((Q, K, V))
- self.derived_variables["attention_weights"].append(weights)
-
- return Y
-
- def _fwd(self, Q, K, V):
- """Actual computation of forward pass"""
- scale = 1 / np.sqrt(Q.shape[-1]) if self.scale else 1
- scores = Q @ K.swapaxes(-2, -1) * scale # attention scores
- weights = self.softmax.forward(scores) # attention weights
- Y = weights @ V
- return Y, weights
-
- def backward(self, dLdy, retain_grads=True):
- r"""
- Backprop from layer outputs to inputs.
-
- Parameters
- ----------
- dLdy : :py:class:`ndarray <numpy.ndarray>` of shape `(n_ex, *, d_v)`
- The gradient of the loss wrt. the layer output `Y`
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dQ : :py:class:`ndarray <numpy.ndarray>` of shape `(n_ex, *, d_k)` or list of arrays
- The gradient of the loss wrt. the layer query matrix/matrices `Q`.
- dK : :py:class:`ndarray <numpy.ndarray>` of shape `(n_ex, *, d_k)` or list of arrays
- The gradient of the loss wrt. the layer key matrix/matrices `K`.
- dV : :py:class:`ndarray <numpy.ndarray>` of shape `(n_ex, *, d_v)` or list of arrays
- The gradient of the loss wrt. the layer value matrix/matrices `V`.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- dQ, dK, dV = [], [], []
- weights = self.derived_variables["attention_weights"]
- for dy, (q, k, v), w in zip(dLdy, self.X, weights):
- dq, dk, dv = self._bwd(dy, q, k, v, w)
- dQ.append(dq)
- dK.append(dk)
- dV.append(dv)
-
- if len(self.X) == 1:
- dQ, dK, dV = dQ[0], dK[0], dV[0]
-
- return dQ, dK, dV
-
- def _bwd(self, dy, q, k, v, weights):
- """Actual computation of the gradient of the loss wrt. q, k, and v"""
- d_k = k.shape[-1]
- scale = 1 / np.sqrt(d_k) if self.scale else 1
-
- dV = weights.swapaxes(-2, -1) @ dy
- dWeights = dy @ v.swapaxes(-2, -1)
- dScores = self.softmax.backward(dWeights)
- dQ = dScores @ k * scale
- dK = dScores.swapaxes(-2, -1) @ q * scale
- return dQ, dK, dV
-
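-
-# A minimal NumPy sketch (added for illustration; not part of the original
-# module) of the scaled dot-product attention computed by the layer above,
-# assuming 2D `Q`, `K`, `V` inputs and ignoring dropout:
-def _scaled_dot_product_attention_sketch(Q, K, V):
-    """Return softmax(Q K^T / sqrt(d_k)) V, mirroring `_fwd` above."""
-    d_k = Q.shape[-1]
-    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (n_q, n_k) similarity scores
-    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
-    weights = np.exp(scores)
-    weights /= weights.sum(axis=-1, keepdims=True)  # each row of weights sums to 1
-    return weights @ V                              # attention-weighted values
-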
-
-class RBM(LayerBase):
- def __init__(self, n_out, K=1, kernel_initializer="glorot_uniform", name=None):
- """
- A Restricted Boltzmann machine with Bernoulli visible and hidden units.
-
- Parameters
- ----------
- n_out : int
- The number of output dimensions/units.
- K : int
- The number of contrastive divergence steps to run before computing
- a single gradient update. Default is 1.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.K = K # CD-K
- self.kernel_initializer = kernel_initializer
- self.n_in = None
- self.n_out = n_out
- self.is_initialized = False
- self.weights_set = False
- self.act_fn_V = ActivationInitializer("Sigmoid")()
- self.act_fn_H = ActivationInitializer("Sigmoid")()
- self.parameters = {"W": None, "b_in": None, "b_out": None}
-
-        # parameters are initialized lazily in `forward`, once `n_in` is known
-
- def _init_params(self):
- if not self.weights_set:
- b_in = np.zeros((1, self.n_in))
- b_out = np.zeros((1, self.n_out))
- init_weights = WeightInitializer(str(self.act_fn_V), mode=self.kernel_initializer)
- W = init_weights((self.n_in, self.n_out))
- else:
- W, b_in, b_out = self.get_weights()
-
- self.parameters = {"W": W, "b_in": b_in, "b_out": b_out}
- self.gradients = {
- "W": np.zeros_like(W),
- "b_in": np.zeros_like(b_in),
- "b_out": np.zeros_like(b_out),
- }
-
- self.derived_variables = {
- "V": None,
- "p_H": None,
- "p_V_prime": None,
- "p_H_prime": None,
- "positive_grad": None,
- "negative_grad": None,
- }
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "RBM",
- "K": self.K,
- "n_in": self.n_in,
- "n_out": self.n_out,
- "kernel_initializer": self.kernel_initializer,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameterse,
- },
- }
-
- def CD_update(self, X):
- """
- Perform a single contrastive divergence-`k` training update using the
- visible inputs `X` as a starting point for the Gibbs sampler.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Layer input, representing the `n_in`-dimensional features for a
- minibatch of `n_ex` examples. Each feature in X should ideally be
- binary-valued, although it is possible to also train on real-valued
- features ranging between (0, 1) (e.g., grayscale images).
- """
- self.forward(X)
- self.backward()
-
- def forward(self, V, K=None, retain_derived=True):
- """
- Perform the CD-`k` "forward pass" of visible inputs into hidden units
- and back.
-
- Notes
- -----
- This implementation follows [1]_'s recommendations for the RBM forward
- pass:
-
- - Use real-valued probabilities for both the data and the visible
- unit reconstructions.
- - Only the final update of the hidden units should use the actual
- probabilities -- all others should be sampled binary states.
- - When collecting the pairwise statistics for learning weights or
- the individual statistics for learning biases, use the
- probabilities, not the binary states.
-
- References
- ----------
- .. [1] Hinton, G. (2010). "A practical guide to training restricted
- Boltzmann machines". *UTML TR 2010-003*
-
- Parameters
- ----------
- V : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Visible input, representing the `n_in`-dimensional features for a
- minibatch of `n_ex` examples. Each feature in V should ideally be
- binary-valued, although it is possible to also train on real-valued
- features ranging between (0, 1) (e.g., grayscale images).
- K : int
-            The number of contrastive divergence steps to run before computing
-            the gradient update. If None, use ``self.K``. Default is None.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
- """
- if not self.is_initialized:
- self.n_in = V.shape[1]
- self._init_params()
-
- # override self.K if necessary
- K = self.K if K is None else K
-
- W = self.parameters["W"]
- b_in = self.parameters["b_in"]
- b_out = self.parameters["b_out"]
-
- # compute hidden unit probabilities
- Z_H = V @ W + b_out
- p_H = self.act_fn_H.fn(Z_H)
-
- # sample hidden states (stochastic binary values)
- H = np.random.rand(*p_H.shape) <= p_H
- H = H.astype(float)
-
- # always use probabilities when computing gradients
- positive_grad = V.T @ p_H
-
- # perform CD-k
- # TODO: use persistent CD-k
- # https://www.cs.toronto.edu/~tijmen/pcd/pcd.pdf
- H_prime = H.copy()
- for k in range(K):
- # resample v' given h (H_prime is binary for all but final step)
- Z_V_prime = H_prime @ W.T + b_in
- p_V_prime = self.act_fn_V.fn(Z_V_prime)
-
- # don't resample visual units - always use raw probabilities!
- V_prime = p_V_prime
-
- # compute p(h' | v')
- Z_H_prime = V_prime @ W + b_out
- p_H_prime = self.act_fn_H.fn(Z_H_prime)
-
- # if this is the final iteration of CD, keep hidden state
- # probabilities (don't sample)
- H_prime = p_H_prime
-            if k != K - 1:
- H_prime = np.random.rand(*p_H_prime.shape) <= p_H_prime
- H_prime = H_prime.astype(float)
-
- negative_grad = p_V_prime.T @ p_H_prime
-
- if retain_derived:
- self.derived_variables["V"] = V
- self.derived_variables["p_H"] = p_H
- self.derived_variables["p_V_prime"] = p_V_prime
- self.derived_variables["p_H_prime"] = p_H_prime
- self.derived_variables["positive_grad"] = positive_grad
- self.derived_variables["negative_grad"] = negative_grad
-
- def backward(self, retain_grads=True, *args):
- """
- Perform a gradient update on the layer parameters via the contrastive
- divergence equations.
-
- Parameters
- ----------
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
- """
- V = self.derived_variables["V"]
- p_H = self.derived_variables["p_H"]
- p_V_prime = self.derived_variables["p_V_prime"]
- p_H_prime = self.derived_variables["p_H_prime"]
- positive_grad = self.derived_variables["positive_grad"]
- negative_grad = self.derived_variables["negative_grad"]
-
- if retain_grads:
- self.gradients["b_in"] = V - p_V_prime
- self.gradients["b_out"] = p_H - p_H_prime
- self.gradients["W"] = positive_grad - negative_grad
-
- def reconstruct(self, X, n_steps=10, return_prob=False):
- """
- Reconstruct an input `X` by running the trained Gibbs sampler for
- `n_steps`-worth of CD-`k`.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Layer input, representing the `n_in`-dimensional features for a
- minibatch of `n_ex` examples. Each feature in `X` should ideally be
- binary-valued, although it is possible to also train on real-valued
- features ranging between (0, 1) (e.g., grayscale images). If `X` has
- missing values, it may be sufficient to mark them with random
- entries and allow the reconstruction to impute them.
- n_steps : int
- The number of Gibbs sampling steps to perform when generating the
- reconstruction. Default is 10.
- return_prob : bool
- Whether to return the real-valued feature probabilities for the
- reconstruction or the binary samples. Default is False.
-
- Returns
- -------
- V : :py:class:`ndarray ` of shape `(n_ex, in_ch)`
- The reconstruction (or feature probabilities if `return_prob` is
- true) of the visual input `X` after running the Gibbs sampler for
- `n_steps`.
- """
- self.forward(X, K=n_steps)
- p_V_prime = self.derived_variables["p_V_prime"]
-
- # ignore the gradients produced during this reconstruction
- self.flush_gradients()
-
- # sample V_prime reconstruction if return_prob is False
- V = p_V_prime
- if not return_prob:
- V = (np.random.rand(*p_V_prime.shape) <= p_V_prime).astype(float)
- return V
-
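-
-# Illustrative usage sketch (an assumption about typical usage, not taken from
-# the original docs): run one CD-1 update on random binary data, then draw a
-# reconstruction. Applying the accumulated gradients is assumed to happen
-# elsewhere via the layer's optimizer.
-def _rbm_usage_sketch():
-    X = (np.random.rand(32, 64) > 0.5).astype(float)  # 32 examples, 64 visible units
-    rbm = RBM(n_out=16, K=1)
-    rbm.CD_update(X)                       # positive/negative phase; fills rbm.gradients
-    recon = rbm.reconstruct(X, n_steps=5)  # Gibbs-sample a reconstruction of X
-    return recon
-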
-
-#######################################################################
-# Layer Ops #
-#######################################################################
-
-
-class Add(LayerBase):
- def __init__(self, act_fn=None, name=None):
- """
- An "addition" layer that returns the sum of its inputs, passed through
- an optional nonlinearity.
-
- Parameters
- ----------
- act_fn : str, :doc:`Activation ` object, or None
- The element-wise output nonlinearity used in computing the final
- output. If None, use the identity function :math:`f(x) = x`.
- Default is None.
- """ # noqa: E501
- super().__init__(name=name)
- self.act_fn = ActivationInitializer(act_fn)()
- self._init_params()
-
- def _init_params(self):
- self.derived_variables = {"sum": []}
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "Sum",
- "act_fn": str(self.act_fn),
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- r"""
- Compute the layer output on a single minibatch.
-
- Parameters
- ----------
- X : list of length `n_inputs`
- A list of tensors, all of the same shape.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, *)`
- The sum over the `n_ex` examples.
- """
- out = X[0].copy()
- for i in range(1, len(X)):
- out += X[i]
- if retain_derived:
- self.X.append(X)
- self.derived_variables["sum"].append(out)
- return self.act_fn(out)
-
- def backward(self, dLdY, retain_grads=True):
- r"""
- Backprop from layer outputs to inputs.
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape `(n_ex, *)`
- The gradient of the loss wrt. the layer output `Y`.
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : list of length `n_inputs`
- The gradient of the loss wrt. each input in `X`.
- """
- if not isinstance(dLdY, list):
- dLdY = [dLdY]
-
- X = self.X
- _sum = self.derived_variables["sum"]
- grads = [self._bwd(dy, x, ss) for dy, x, ss in zip(dLdY, X, _sum)]
- return grads[0] if len(X) == 1 else grads
-
- def _bwd(self, dLdY, X, _sum):
- """Actual computation of gradient of the loss wrt. each input"""
- grads = [dLdY * self.act_fn.grad(_sum) for _ in X]
- return grads
-
-
-class Multiply(LayerBase):
- def __init__(self, act_fn=None, name=None):
- """
- A multiplication layer that returns the *elementwise* product of its
- inputs, passed through an optional nonlinearity.
-
- Parameters
- ----------
- act_fn : str, :doc:`Activation ` object, or None
- The element-wise output nonlinearity used in computing the final
- output. If None, use the identity function :math:`f(x) = x`.
- Default is None.
- """ # noqa: E501
- super().__init__(name=name)
- self.act_fn = ActivationInitializer(act_fn)()
- self._init_params()
-
- def _init_params(self):
- self.derived_variables = {"product": []}
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "Multiply",
- "act_fn": str(self.act_fn),
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- r"""
- Compute the layer output on a single minibatch.
-
- Parameters
- ----------
- X : list of length `n_inputs`
- A list of tensors, all of the same shape.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, *)`
- The product over the `n_ex` examples.
- """ # noqa: E501
- out = X[0].copy()
- for i in range(1, len(X)):
- out *= X[i]
- if retain_derived:
- self.X.append(X)
- self.derived_variables["product"].append(out)
- return self.act_fn(out)
-
- def backward(self, dLdY, retain_grads=True):
- r"""
- Backprop from layer outputs to inputs.
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape `(n_ex, *)`
- The gradient of the loss wrt. the layer output `Y`.
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : list of length `n_inputs`
- The gradient of the loss wrt. each input in `X`.
- """
- if not isinstance(dLdY, list):
- dLdY = [dLdY]
-
- X = self.X
- _prod = self.derived_variables["product"]
- grads = [self._bwd(dy, x, pr) for dy, x, pr in zip(dLdY, X, _prod)]
- return grads[0] if len(X) == 1 else grads
-
- def _bwd(self, dLdY, X, prod):
- """Actual computation of gradient of loss wrt. each input"""
- grads = [dLdY * self.act_fn.grad(prod)] * len(X)
- for i, x in enumerate(X):
- grads = [g * x if j != i else g for j, g in enumerate(grads)]
- return grads
-
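-
-# A small illustrative sketch (not from the original module): both layer ops
-# above take a *list* of same-shaped arrays and reduce them elementwise before
-# applying the (optional) activation, which defaults to the identity.
-def _add_multiply_sketch():
-    a = np.array([[1.0, 2.0], [3.0, 4.0]])
-    b = np.array([[10.0, 20.0], [30.0, 40.0]])
-    summed = Add().forward([a, b])        # elementwise sum -> [[11, 22], [33, 44]]
-    product = Multiply().forward([a, b])  # elementwise product -> [[10, 40], [90, 160]]
-    return summed, product
-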
-
-class Flatten(LayerBase):
- def __init__(self, keep_dim="first", name=None):
- """
- Flatten a multidimensional input into a 2D matrix.
-
- Parameters
- ----------
- keep_dim : {'first', 'last', -1}
- The dimension of the original input to retain. Typically used for
-            retaining the minibatch dimension. If -1, flatten all dimensions.
- Default is 'first'.
- """ # noqa: E501
- super().__init__(name=name)
- self.n_out = 0
- self.n_in = []
-
- self.keep_dim = keep_dim
- self._init_params()
-
- def _init_params(self):
- self.X = []
- self.gradients = {}
- self.parameters = {}
- self.derived_variables = {"in_dims": []}
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "Flatten",
- "keep_dim": self.keep_dim,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- r"""
- Compute the layer output on a single minibatch.
-
- Parameters
- ----------
- X : :py:class:`ndarray `
- Input volume to flatten.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(*out_dims)`
- Flattened output. If `keep_dim` is `'first'`, `X` is reshaped to
-            ``(X.shape[0], -1)``, otherwise ``(-1, X.shape[-1])``.
- """
- self.n_in = X.shape
- if retain_derived:
- self.derived_variables["in_dims"].append(X.shape)
- if self.keep_dim == -1:
- return X.flatten().reshape(1, -1)
- rs = (X.shape[0], -1) if self.keep_dim == "first" else (-1, X.shape[-1])
- self.n_out = rs
- return X.reshape(*rs)
-
- def backward(self, dLdy, retain_grads=True):
- r"""
- Backprop from layer outputs to inputs.
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape `(*out_dims)`
- The gradient of the loss wrt. the layer output `Y`.
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape `(*in_dims)` or list of arrays
- The gradient of the loss wrt. the layer input(s) `X`.
- """ # noqa: E501
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
- in_dims = self.derived_variables["in_dims"]
- out = [dy.reshape(*dims) for dy, dims in zip(dLdy, in_dims)]
- return out[0] if len(dLdy) == 1 else out
-
-
-class Concatenate(LayerBase):
- def __init__(self, name=None):
- """
- Concatenate a list of input layers into one.
- """ # noqa: E501
- super().__init__(name=name)
- self.n_out = 0
- self.n_in = []
-
- self._init_params()
-
- def _init_params(self):
- self.X = []
- self.gradients = {}
- self.parameters = {}
- self.derived_variables = {}
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "Concatenate",
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- r"""
- Compute the layer output on a single minibatch.
-
- Parameters
- ----------
-        X : list of :py:class:`ndarray `
-            A list of input tensors to concatenate along their last dimension.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
-        Y : :py:class:`ndarray `
-            The inputs concatenated along their last dimension.
- """
- result = np.concatenate(X, -1)
- self.n_out = result.shape[1:]
- self.n_in = [layer.n_out for layer in self.input_layers]
- return result
-
- def backward(self, dLdy, retain_grads=True):
- r"""
- Backprop from layer outputs to inputs.
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape `(*out_dims)`
- The gradient of the loss wrt. the layer output `Y`.
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
-        dX : :py:class:`ndarray ` of shape `(*out_dims)`
-            The gradient of the loss wrt. the layer output, returned unchanged;
-            any per-input splitting is left to the caller.
- """ # noqa: E501
- return dLdy
-
-
-#######################################################################
-# Normalization Layers #
-#######################################################################
-
-
-class BatchNorm2D(LayerBase):
- def __init__(self, momentum=0.9, epsilon=1e-5, name=None):
- """
- A batch normalization layer for two-dimensional inputs with an
- additional channel dimension.
-
- Notes
- -----
-        BatchNorm is an attempt to address the problem of internal covariate
- shift (ICS) during training by normalizing layer inputs.
-
- ICS refers to the change in the distribution of layer inputs during
- training as a result of the changing parameters of the previous
- layer(s). ICS can make it difficult to train models with saturating
- nonlinearities, and in general can slow training by requiring a lower
- learning rate.
-
- Equations [train]::
-
- Y = scaler * norm(X) + intercept
- norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)
-
- Equations [test]::
-
- Y = scaler * running_norm(X) + intercept
- running_norm(X) = (X - running_mean) / sqrt(running_var + epsilon)
-
- In contrast to :class:`LayerNorm2D`, the BatchNorm layer calculates
- the mean and var across the *batch* rather than the output features.
- This has two disadvantages:
-
- 1. It is highly affected by batch size: smaller mini-batch sizes
- increase the variance of the estimates for the global mean and
- variance.
-
- 2. It is difficult to apply in RNNs -- one must fit a separate
- BatchNorm layer for *each* time-step.
-
- Parameters
- ----------
- momentum : float
- The momentum term for the running mean/running std calculations.
- The closer this is to 1, the less weight will be given to the
- mean/std of the current batch (i.e., higher smoothing). Default is
- 0.9.
- epsilon : float
- A small smoothing constant to use during computation of ``norm(X)``
- to avoid divide-by-zero errors. Default is 1e-5.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.in_ch = None
- self.out_ch = None
- self.epsilon = epsilon
- self.momentum = momentum
- self.parameters = {
- "scaler": None,
- "intercept": None,
- "running_var": None,
- "running_mean": None,
- }
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self):
- scaler = np.random.rand(self.in_ch)
- intercept = np.zeros(self.in_ch)
-
- # init running mean and std at 0 and 1, respectively
- running_mean = np.zeros(self.in_ch)
- running_var = np.ones(self.in_ch)
-
- self.parameters = {
- "scaler": scaler,
- "intercept": intercept,
- "running_var": running_var,
- "running_mean": running_mean,
- }
-
- self.gradients = {
- "scaler": np.zeros_like(scaler),
- "intercept": np.zeros_like(intercept),
- }
-
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "BatchNorm2D",
- "act_fn": None,
- "in_ch": self.in_ch,
- "out_ch": self.out_ch,
- "epsilon": self.epsilon,
- "momentum": self.momentum,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def reset_running_stats(self):
- """Reset the running mean and variance estimates to 0 and 1."""
- assert self.trainable, "Layer is frozen"
- self.parameters["running_mean"] = np.zeros(self.in_ch)
- self.parameters["running_var"] = np.ones(self.in_ch)
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output on a single minibatch.
-
- Notes
- -----
- Equations [train]::
-
- Y = scaler * norm(X) + intercept
- norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)
-
- Equations [test]::
-
- Y = scaler * running_norm(X) + intercept
- running_norm(X) = (X - running_mean) / sqrt(running_var + epsilon)
-
- In contrast to :class:`LayerNorm2D`, the BatchNorm layer calculates the
- mean and var across the *batch* rather than the output features.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- Input volume containing the `in_rows` x `in_cols`-dimensional
- features for a minibatch of `n_ex` examples.
- retain_derived : bool
-            Whether to use the current input to adjust the running mean and
-            running_var computations. Setting this to False is the same as
-            freezing the layer for the current input. Default is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- Layer output for each of the `n_ex` examples.
- """ # noqa: E501
- if not self.is_initialized:
- self.in_ch = self.out_ch = X.shape[3]
- self._init_params()
-
- ep = self.hyperparameters["epsilon"]
- mm = self.hyperparameters["momentum"]
- rm = self.parameters["running_mean"]
- rv = self.parameters["running_var"]
-
- scaler = self.parameters["scaler"]
- intercept = self.parameters["intercept"]
-
- # if the layer is frozen, use our running mean/std values rather
- # than the mean/std values for the new batch
- X_mean = self.parameters["running_mean"]
- X_var = self.parameters["running_var"]
-
- if self.trainable and retain_derived:
- X_mean, X_var = X.mean(axis=(0, 1, 2)), X.var(axis=(0, 1, 2)) # , ddof=1)
- self.parameters["running_mean"] = mm * rm + (1.0 - mm) * X_mean
- self.parameters["running_var"] = mm * rv + (1.0 - mm) * X_var
-
- if retain_derived:
- self.X.append(X)
-
- N = (X - X_mean) / np.sqrt(X_var + ep)
- y = scaler * N + intercept
- return y
-
- def backward(self, dLdy, retain_grads=True):
- """
- Backprop from layer outputs to inputs.
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The gradient of the loss wrt. the layer output `Y`.
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The gradient of the loss wrt. the layer input `X`.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- dX = []
- X = self.X
- for dy, x in zip(dLdy, X):
- dx, dScaler, dIntercept = self._bwd(dy, x)
- dX.append(dx)
-
- if retain_grads:
- self.gradients["scaler"] += dScaler
- self.gradients["intercept"] += dIntercept
-
- return dX[0] if len(X) == 1 else dX
-
- def _bwd(self, dLdy, X):
- """Computation of gradient of loss wrt. X, scaler, and intercept"""
- scaler = self.parameters["scaler"]
- ep = self.hyperparameters["epsilon"]
-
- # reshape to 2D, retaining channel dim
- X_shape = X.shape
- X = np.reshape(X, (-1, X.shape[3]))
- dLdy = np.reshape(dLdy, (-1, dLdy.shape[3]))
-
- # apply 1D batchnorm backward pass on reshaped array
- n_ex, in_ch = X.shape
- X_mean, X_var = X.mean(axis=0), X.var(axis=0) # , ddof=1)
-
- N = (X - X_mean) / np.sqrt(X_var + ep)
- dIntercept = dLdy.sum(axis=0)
- dScaler = np.sum(dLdy * N, axis=0)
-
- dN = dLdy * scaler
- dX = (n_ex * dN - dN.sum(axis=0) - N * (dN * N).sum(axis=0)) / (
- n_ex * np.sqrt(X_var + ep)
- )
-
- return np.reshape(dX, X_shape), dScaler, dIntercept
-
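-
-# A minimal NumPy sketch (illustration only, not part of the module) of the
-# train-time BatchNorm2D equations above: statistics are taken over the batch
-# and spatial axes, leaving one mean and variance per channel.
-def _batchnorm2d_equations_sketch(X, scaler, intercept, epsilon=1e-5):
-    """`X` has NHWC shape `(n_ex, in_rows, in_cols, in_ch)`."""
-    X_mean = X.mean(axis=(0, 1, 2))              # (in_ch,)
-    X_var = X.var(axis=(0, 1, 2))                # (in_ch,)
-    N = (X - X_mean) / np.sqrt(X_var + epsilon)  # normalized activations
-    return scaler * N + intercept                # learned affine rescaling
-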
-
-class BatchNorm1D(LayerBase):
- def __init__(self, momentum=0.9, epsilon=1e-5, name=None):
- """
- A batch normalization layer for 1D inputs.
-
- Notes
- -----
-        BatchNorm is an attempt to address the problem of internal covariate
- shift (ICS) during training by normalizing layer inputs.
-
- ICS refers to the change in the distribution of layer inputs during
- training as a result of the changing parameters of the previous
- layer(s). ICS can make it difficult to train models with saturating
- nonlinearities, and in general can slow training by requiring a lower
- learning rate.
-
- Equations [train]::
-
- Y = scaler * norm(X) + intercept
- norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)
-
- Equations [test]::
-
- Y = scaler * running_norm(X) + intercept
- running_norm(X) = (X - running_mean) / sqrt(running_var + epsilon)
-
- In contrast to :class:`LayerNorm1D`, the BatchNorm layer calculates
- the mean and var across the *batch* rather than the output features.
- This has two disadvantages:
-
- 1. It is highly affected by batch size: smaller mini-batch sizes
- increase the variance of the estimates for the global mean and
- variance.
-
- 2. It is difficult to apply in RNNs -- one must fit a separate
- BatchNorm layer for *each* time-step.
-
- Parameters
- ----------
- momentum : float
- The momentum term for the running mean/running std calculations.
- The closer this is to 1, the less weight will be given to the
- mean/std of the current batch (i.e., higher smoothing). Default is
- 0.9.
- epsilon : float
- A small smoothing constant to use during computation of ``norm(X)``
- to avoid divide-by-zero errors. Default is 1e-5.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.n_in = None
- self.n_out = None
- self.epsilon = epsilon
- self.momentum = momentum
- self.parameters = {
- "scaler": None,
- "intercept": None,
- "running_var": None,
- "running_mean": None,
- }
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self):
- scaler = np.random.rand(self.n_in)
- intercept = np.zeros(self.n_in)
-
- # init running mean and std at 0 and 1, respectively
- running_mean = np.zeros(self.n_in)
- running_var = np.ones(self.n_in)
-
- self.parameters = {
- "scaler": scaler,
- "intercept": intercept,
- "running_mean": running_mean,
- "running_var": running_var,
- }
-
- self.gradients = {
- "scaler": np.zeros_like(scaler),
- "intercept": np.zeros_like(intercept),
- }
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "BatchNorm1D",
- "act_fn": None,
- "n_in": self.n_in,
- "n_out": self.n_out,
- "epsilon": self.epsilon,
- "momentum": self.momentum,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def reset_running_stats(self):
- """Reset the running mean and variance estimates to 0 and 1."""
- assert self.trainable, "Layer is frozen"
- self.parameters["running_mean"] = np.zeros(self.n_in)
- self.parameters["running_var"] = np.ones(self.n_in)
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output on a single minibatch.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Layer input, representing the `n_in`-dimensional features for a
- minibatch of `n_ex` examples.
- retain_derived : bool
-            Whether to use the current input to adjust the running mean and
-            running_var computations. Setting this to False is the same as
-            freezing the layer for the current input. Default is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Layer output for each of the `n_ex` examples
- """
- if not self.is_initialized:
- self.n_in = self.n_out = X.shape[1]
- self._init_params()
-
- ep = self.hyperparameters["epsilon"]
- mm = self.hyperparameters["momentum"]
- rm = self.parameters["running_mean"]
- rv = self.parameters["running_var"]
-
- scaler = self.parameters["scaler"]
- intercept = self.parameters["intercept"]
-
- # if the layer is frozen, use our running mean/std values rather
- # than the mean/std values for the new batch
- X_mean = self.parameters["running_mean"]
- X_var = self.parameters["running_var"]
-
- if self.trainable and retain_derived:
- X_mean, X_var = X.mean(axis=0), X.var(axis=0) # , ddof=1)
- self.parameters["running_mean"] = mm * rm + (1.0 - mm) * X_mean
- self.parameters["running_var"] = mm * rv + (1.0 - mm) * X_var
-
- if retain_derived:
- self.X.append(X)
-
- N = (X - X_mean) / np.sqrt(X_var + ep)
- y = scaler * N + intercept
- return y
-
- def backward(self, dLdy, retain_grads=True):
- """
- Backprop from layer outputs to inputs.
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- The gradient of the loss wrt. the layer output `Y`.
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- The gradient of the loss wrt. the layer input `X`.
- """
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- dX = []
- X = self.X
- for dy, x in zip(dLdy, X):
- dx, dScaler, dIntercept = self._bwd(dy, x)
- dX.append(dx)
-
- if retain_grads:
- self.gradients["scaler"] += dScaler
- self.gradients["intercept"] += dIntercept
-
- return dX[0] if len(X) == 1 else dX
-
- def _bwd(self, dLdy, X):
- """Computation of gradient of loss wrt X, scaler, and intercept"""
- scaler = self.parameters["scaler"]
- ep = self.hyperparameters["epsilon"]
-
- n_ex, n_in = X.shape
- X_mean, X_var = X.mean(axis=0), X.var(axis=0) # , ddof=1)
-
- N = (X - X_mean) / np.sqrt(X_var + ep)
- dIntercept = dLdy.sum(axis=0)
- dScaler = np.sum(dLdy * N, axis=0)
-
- dN = dLdy * scaler
- dX = (n_ex * dN - dN.sum(axis=0) - N * (dN * N).sum(axis=0)) / (
- n_ex * np.sqrt(X_var + ep)
- )
-
- return dX, dScaler, dIntercept
-
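-
-# Illustrative usage sketch (assumed behaviour, not from the original docs):
-# each training-mode call to `forward` nudges the running statistics toward the
-# statistics of the current minibatch.
-def _batchnorm1d_usage_sketch():
-    X = np.random.rand(8, 3)
-    bn = BatchNorm1D(momentum=0.9)
-    Y = bn.forward(X)  # normalize with the batch statistics
-    # starting from the zero/one init, one update gives 0.1 * X.mean(axis=0)
-    running_mean = bn.parameters["running_mean"]
-    return Y, running_mean
-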
-
-class LayerNorm2D(LayerBase):
- def __init__(self, epsilon=1e-5, name=None):
- """
- A layer normalization layer for 2D inputs with an additional channel
- dimension.
-
- Notes
- -----
- In contrast to :class:`BatchNorm2D`, the LayerNorm layer calculates the
- mean and variance across *features* rather than examples in the batch
- ensuring that the mean and variance estimates are independent of batch
- size and permitting straightforward application in RNNs.
-
- Equations [train & test]::
-
- Y = scaler * norm(X) + intercept
- norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)
-
- Also in contrast to :class:`BatchNorm2D`, `scaler` and `intercept` are applied
- *elementwise* to ``norm(X)``.
-
- Parameters
- ----------
- epsilon : float
- A small smoothing constant to use during computation of ``norm(X)``
- to avoid divide-by-zero errors. Default is 1e-5.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.in_ch = None
- self.out_ch = None
- self.epsilon = epsilon
- self.parameters = {"scaler": None, "intercept": None}
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self, X_shape):
- n_ex, in_rows, in_cols, in_ch = X_shape
-
- scaler = np.random.rand(in_rows, in_cols, in_ch)
- intercept = np.zeros((in_rows, in_cols, in_ch))
-
- self.parameters = {"scaler": scaler, "intercept": intercept}
-
- self.gradients = {
- "scaler": np.zeros_like(scaler),
- "intercept": np.zeros_like(intercept),
- }
-
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "LayerNorm2D",
- "act_fn": None,
- "in_ch": self.in_ch,
- "out_ch": self.out_ch,
- "epsilon": self.epsilon,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output on a single minibatch.
-
- Notes
- -----
- Equations [train & test]::
-
- Y = scaler * norm(X) + intercept
- norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- Input volume containing the `in_rows` by `in_cols`-dimensional
- features for a minibatch of `n_ex` examples.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- Layer output for each of the `n_ex` examples.
- """ # noqa: E501
- if not self.is_initialized:
- self.in_ch = self.out_ch = X.shape[3]
- self._init_params(X.shape)
-
- scaler = self.parameters["scaler"]
- ep = self.hyperparameters["epsilon"]
- intercept = self.parameters["intercept"]
-
- if retain_derived:
- self.X.append(X)
-
- X_var = X.var(axis=(1, 2, 3), keepdims=True)
- X_mean = X.mean(axis=(1, 2, 3), keepdims=True)
- lnorm = (X - X_mean) / np.sqrt(X_var + ep)
- y = scaler * lnorm + intercept
- return y
-
- def backward(self, dLdy, retain_grads=True):
- """
- Backprop from layer outputs to inputs.
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The gradient of the loss wrt. the layer output `Y`.
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The gradient of the loss wrt. the layer input `X`.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- dX = []
- X = self.X
- for dy, x in zip(dLdy, X):
- dx, dScaler, dIntercept = self._bwd(dy, x)
- dX.append(dx)
-
- if retain_grads:
- self.gradients["scaler"] += dScaler
- self.gradients["intercept"] += dIntercept
-
- return dX[0] if len(X) == 1 else dX
-
- def _bwd(self, dy, X):
- """Computation of gradient of the loss wrt X, scaler, intercept"""
- scaler = self.parameters["scaler"]
- ep = self.hyperparameters["epsilon"]
-
- X_mean = X.mean(axis=(1, 2, 3), keepdims=True)
- X_var = X.var(axis=(1, 2, 3), keepdims=True)
- lnorm = (X - X_mean) / np.sqrt(X_var + ep)
-
- dLnorm = dy * scaler
- dIntercept = dy.sum(axis=0)
- dScaler = np.sum(dy * lnorm, axis=0)
-
- n_in = np.prod(X.shape[1:])
- lnorm = lnorm.reshape(-1, n_in)
- dLnorm = dLnorm.reshape(lnorm.shape)
- X_var = X_var.reshape(X_var.shape[:2])
-
- dX = (
- n_in * dLnorm
- - dLnorm.sum(axis=1, keepdims=True)
- - lnorm * (dLnorm * lnorm).sum(axis=1, keepdims=True)
- ) / (n_in * np.sqrt(X_var + ep))
-
- # reshape X gradients back to proper dimensions
- return np.reshape(dX, X.shape), dScaler, dIntercept
-
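-
-# A short sketch (an assumption based on the equations above, not original
-# code): because LayerNorm2D computes its statistics per example, an example's
-# output does not depend on the rest of the batch.
-def _layernorm2d_usage_sketch():
-    X = np.random.rand(4, 5, 5, 3)  # (n_ex, in_rows, in_cols, in_ch)
-    ln = LayerNorm2D()
-    Y_full = ln.forward(X)          # statistics computed per example ...
-    Y_first = ln.forward(X[:1])     # ... so a batch of one gives the same output
-    return np.allclose(Y_full[:1], Y_first)  # True
-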
-
-class LayerNorm1D(LayerBase):
- def __init__(self, epsilon=1e-5, name=None):
- """
- A layer normalization layer for 1D inputs.
-
- Notes
- -----
- In contrast to :class:`BatchNorm1D`, the LayerNorm layer calculates the
- mean and variance across *features* rather than examples in the batch
- ensuring that the mean and variance estimates are independent of batch
- size and permitting straightforward application in RNNs.
-
- Equations [train & test]::
-
- Y = scaler * norm(X) + intercept
- norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)
-
- Also in contrast to :class:`BatchNorm1D`, `scaler` and `intercept` are applied
- *elementwise* to ``norm(X)``.
-
- Parameters
- ----------
- epsilon : float
- A small smoothing constant to use during computation of ``norm(X)``
- to avoid divide-by-zero errors. Default is 1e-5.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.n_in = None
- self.n_out = None
- self.epsilon = epsilon
- self.parameters = {"scaler": None, "intercept": None}
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self):
- scaler = np.random.rand(self.n_in)
- intercept = np.zeros(self.n_in)
-
- self.parameters = {"scaler": scaler, "intercept": intercept}
-
- self.gradients = {
- "scaler": np.zeros_like(scaler),
- "intercept": np.zeros_like(intercept),
- }
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "LayerNorm1D",
- "act_fn": None,
- "n_in": self.n_in,
- "n_out": self.n_out,
- "epsilon": self.epsilon,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output on a single minibatch.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Layer input, representing the `n_in`-dimensional features for a
- minibatch of `n_ex` examples.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Layer output for each of the `n_ex` examples.
- """
- if not self.is_initialized:
- self.n_in = self.n_out = X.shape[1]
- self._init_params()
-
- scaler = self.parameters["scaler"]
- ep = self.hyperparameters["epsilon"]
- intercept = self.parameters["intercept"]
-
- if retain_derived:
- self.X.append(X)
-
- X_mean, X_var = X.mean(axis=1, keepdims=True), X.var(axis=1, keepdims=True)
- lnorm = (X - X_mean) / np.sqrt(X_var + ep)
- y = scaler * lnorm + intercept
- return y
-
- def backward(self, dLdy, retain_grads=True):
- """
- Backprop from layer outputs to inputs.
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- The gradient of the loss wrt. the layer output `Y`.
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- The gradient of the loss wrt. the layer input `X`.
- """
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- dX = []
- X = self.X
- for dy, x in zip(dLdy, X):
- dx, dScaler, dIntercept = self._bwd(dy, x)
- dX.append(dx)
-
- if retain_grads:
- self.gradients["scaler"] += dScaler
- self.gradients["intercept"] += dIntercept
-
- return dX[0] if len(X) == 1 else dX
-
- def _bwd(self, dLdy, X):
- """Computation of gradient of the loss wrt X, scaler, intercept"""
- scaler = self.parameters["scaler"]
- ep = self.hyperparameters["epsilon"]
-
- n_ex, n_in = X.shape
- X_mean, X_var = X.mean(axis=1, keepdims=True), X.var(axis=1, keepdims=True)
-
- lnorm = (X - X_mean) / np.sqrt(X_var + ep)
- dIntercept = dLdy.sum(axis=0)
- dScaler = np.sum(dLdy * lnorm, axis=0)
-
- dLnorm = dLdy * scaler
- dX = (
- n_in * dLnorm
- - dLnorm.sum(axis=1, keepdims=True)
- - lnorm * (dLnorm * lnorm).sum(axis=1, keepdims=True)
- ) / (n_in * np.sqrt(X_var + ep))
-
- return dX, dScaler, dIntercept
-
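-
-# Minimal NumPy sketch (illustration only, not part of the module) of the
-# LayerNorm1D equations above: the mean and variance are taken across the
-# `n_in` features of each individual example.
-def _layernorm1d_equations_sketch(X, scaler, intercept, epsilon=1e-5):
-    X_mean = X.mean(axis=1, keepdims=True)           # (n_ex, 1)
-    X_var = X.var(axis=1, keepdims=True)             # (n_ex, 1)
-    lnorm = (X - X_mean) / np.sqrt(X_var + epsilon)  # per-example normalization
-    return scaler * lnorm + intercept                # elementwise affine rescaling
-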
-
-#######################################################################
-# MLP Layers #
-#######################################################################
-
-
-class Embedding(LayerBase):
- def __init__(
- self, n_out, vocab_size, pool=None, kernel_initializer="glorot_uniform", name=None
- ):
- """
- An embedding layer.
-
- Notes
- -----
- Equations::
-
- Y = W[x]
-
- NB. This layer must be the first in a neural network as the gradients
- do not get passed back through to the inputs.
-
- Parameters
- ----------
- n_out : int
- The dimensionality of the embeddings
- vocab_size : int
- The total number of items in the vocabulary. All integer indices
- are expected to range between 0 and `vocab_size - 1`.
- pool : {'sum', 'mean', None}
- If not None, apply this function to the collection of `n_in`
- encodings in each example to produce a single, pooled embedding.
- Default is None.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- """ # noqa: E501
- super().__init__(name=name)
- fstr = "'pool' must be either 'sum', 'mean', or None but got '{}'"
- assert pool in ["sum", "mean", None], fstr.format(pool)
-
- self.kernel_initializer = kernel_initializer
- self.pool = pool
- self.n_out = n_out
- self.vocab_size = vocab_size
- self.parameters = {"W": None}
- self.is_initialized = False
- self.weights_set = False
- self._init_params()
-
- def _init_params(self):
- if not self.weights_set:
- init_weights = WeightInitializer("Affine(slope=1, intercept=0)", mode=self.kernel_initializer)
- W = init_weights((self.vocab_size, self.n_out))
- else:
- W = self.get_weights()
-
- self.parameters = {"W": W}
- self.derived_variables = {}
- self.gradients = {"W": np.zeros_like(W)}
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "Embedding",
- "kernel_initializer": self.kernel_initializer,
- "pool": self.pool,
- "n_out": self.n_out,
- "vocab_size": self.vocab_size,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def lookup(self, ids):
- """
- Return the embeddings associated with the IDs in `ids`.
-
- Parameters
- ----------
-        ids : :py:class:`ndarray ` of shape (`M`,)
- An array of `M` IDs to retrieve embeddings for.
-
- Returns
- -------
- embeddings : :py:class:`ndarray ` of shape (`M`, `n_out`)
- The embedding vectors for each of the `M` IDs.
- """
- return self.parameters["W"][ids]
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output on a single minibatch.
-
- Notes
- -----
- Equations:
- Y = W[x]
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in)` or list of length `n_ex`
- Layer input, representing a minibatch of `n_ex` examples. If
- ``self.pool`` is None, each example must consist of exactly `n_in`
- integer token IDs. Otherwise, `X` can be a ragged array, with each
- example consisting of a variable number of token IDs.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through with regard to this input.
- Default is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, n_in, n_out)`
- Embeddings for each coordinate of each of the `n_ex` examples
- """ # noqa: E501
- # if X is a ragged array
- if isinstance(X, list) and not issubclass(X[0].dtype.type, np.integer):
- fstr = "Input to Embedding layer must be an array of integers, got '{}'"
- raise TypeError(fstr.format(X[0].dtype.type))
-
- # otherwise
- if isinstance(X, np.ndarray) and not issubclass(X.dtype.type, np.integer):
- fstr = "Input to Embedding layer must be an array of integers, got '{}'"
- raise TypeError(fstr.format(X.dtype.type))
-
- Y = self._fwd(X)
- if retain_derived:
- self.X.append(X)
- return Y
-
- def _fwd(self, X):
- """Actual computation of forward pass"""
- W = self.parameters["W"]
- if self.pool is None:
- emb = W[X]
- elif self.pool == "sum":
- emb = np.array([W[x].sum(axis=0) for x in X])[:, None, :]
- elif self.pool == "mean":
- emb = np.array([W[x].mean(axis=0) for x in X])[:, None, :]
- return emb
-
- def backward(self, dLdy, retain_grads=True):
- """
- Backprop from layer outputs to embedding weights.
-
- Notes
- -----
- Because the items in `X` are interpreted as indices, we cannot compute
- the gradient of the layer output wrt. `X`.
-
- Parameters
- ----------
- dLdy : :py:class:`ndarray ` of shape `(n_ex, n_in, n_out)` or list of arrays
- The gradient(s) of the loss wrt. the layer output(s)
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- for dy, x in zip(dLdy, self.X):
- dw = self._bwd(dy, x)
-
- if retain_grads:
- self.gradients["W"] += dw
-
- def _bwd(self, dLdy, X):
- """Actual computation of gradient of the loss wrt. W"""
- dW = np.zeros_like(self.parameters["W"])
- dLdy = dLdy.reshape(-1, self.n_out)
-
- if self.pool is None:
- for ix, v_id in enumerate(X.flatten()):
- dW[v_id] += dLdy[ix]
- elif self.pool == "sum":
- for ix, v_ids in enumerate(X):
- dW[v_ids] += dLdy[ix]
- elif self.pool == "mean":
- for ix, v_ids in enumerate(X):
- dW[v_ids] += dLdy[ix] / len(v_ids)
- return dW
-
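-
-# Illustrative usage sketch (assumed, not taken from the original docs):
-# without pooling, every integer ID in the input is replaced by its
-# `n_out`-dimensional embedding vector.
-def _embedding_usage_sketch():
-    ids = np.array([[1, 2, 3], [4, 5, 6]])  # (n_ex=2, n_in=3) integer token IDs
-    emb = Embedding(n_out=8, vocab_size=100)
-    Y = emb.forward(ids)                # shape (2, 3, 8)
-    single = emb.lookup(np.array([7]))  # shape (1, 8): the embedding for ID 7
-    return Y, single
-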
-
-class Dense(LayerBase):
- def __init__(self, n_out, activation=None, kernel_initializer="glorot_uniform", name=None):
- r"""
- A fully-connected (dense) layer.
-
- Notes
- -----
- A fully connected layer computes the function
-
- .. math::
-
- \mathbf{Y} = f( \mathbf{WX} + \mathbf{b} )
-
- where `f` is the activation nonlinearity, **W** and **b** are
- parameters of the layer, and **X** is the minibatch of input examples.
-
- Parameters
- ----------
- n_out : int
- The dimensionality of the layer output
-        activation : str, :doc:`Activation ` object, or None
- The element-wise output nonlinearity used in computing `Y`. If None,
- use the identity function :math:`f(X) = X`. Default is None.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.kernel_initializer = kernel_initializer
- self.n_in = None
- self.n_out = n_out
- self.act_fn = ActivationInitializer(activation)()
- self.parameters = {"W": None, "b": None}
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self):
- if not self.weights_set:
- init_weights = WeightInitializer(str(self.act_fn), mode=self.kernel_initializer)
- W = init_weights((self.n_in, self.n_out))
- b = np.zeros((1, self.n_out))
- else:
- W, b = self.get_weights()
-
- self.parameters = {"W": W, "b": b}
- self.derived_variables = {"Z": []}
- self.gradients = {"W": np.zeros_like(W), "b": np.zeros_like(b)}
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "Dense",
- "kernel_initializer": self.kernel_initializer,
- "n_in": self.n_in,
- "n_out": self.n_out,
- "act_fn": str(self.act_fn),
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output on a single minibatch.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Layer input, representing the `n_in`-dimensional features for a
- minibatch of `n_ex` examples.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, n_out)`
- Layer output for each of the `n_ex` examples.
- """
- if not self.is_initialized:
- self.n_in = X.shape[1]
- self._init_params()
-
- Y, Z = self._fwd(X)
-
- if retain_derived:
- self.X.append(X)
- self.derived_variables["Z"].append(Z)
-
- return Y
-
- def _fwd(self, X):
- """Actual computation of forward pass"""
- W = self.parameters["W"]
- b = self.parameters["b"]
-
- Z = X @ W + b
- Y = self.act_fn(Z)
- return Y, Z
-
- def backward(self, dLdy, retain_grads=True):
- """
- Backprop from layer outputs to inputs.
-
- Parameters
- ----------
- dLdy : :py:class:`ndarray ` of shape `(n_ex, n_out)` or list of arrays
- The gradient(s) of the loss wrt. the layer output(s).
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dLdX : :py:class:`ndarray ` of shape `(n_ex, n_in)` or list of arrays
- The gradient of the loss wrt. the layer input(s) `X`.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- dX = []
- X = self.X
- for dy, x in zip(dLdy, X):
- dx, dw, db = self._bwd(dy, x)
- dX.append(dx)
-
- if retain_grads:
- self.gradients["W"] += dw
- self.gradients["b"] += db
-
- return dX[0] if len(X) == 1 else dX
-
- def _bwd(self, dLdy, X):
- """Actual computation of gradient of the loss wrt. X, W, and b"""
- W = self.parameters["W"]
- b = self.parameters["b"]
-
- Z = X @ W + b
- dZ = dLdy * self.act_fn.grad(Z)
-
- dX = dZ @ W.T
- dW = X.T @ dZ
- dB = dZ.sum(axis=0) # don't keep dimensions
- return dX, dW, dB
-
- def _bwd2(self, dLdy, X, dLdy_bwd):
- """Compute second derivatives / deriv. of loss wrt. dX, dW, and db"""
- W = self.parameters["W"]
- b = self.parameters["b"]
-
- dZ = self.act_fn.grad(X @ W + b)
- ddZ = self.act_fn.grad2(X @ W + b)
-
- ddX = dLdy @ W * dZ
- ddW = dLdy.T @ (dLdy_bwd * dZ)
- ddB = np.sum(dLdy @ W * dLdy_bwd * ddZ, axis=0, keepdims=True)
- return ddX, ddW, ddB
-
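-
-# Illustrative usage sketch (assumed, not part of the original docs): a Dense
-# layer with the default identity activation, run forward and backward once.
-def _dense_usage_sketch():
-    X = np.random.rand(5, 3)              # (n_ex, n_in)
-    dense = Dense(n_out=4)                # W and b are initialized lazily on the first forward
-    Y = dense.forward(X)                  # (n_ex, n_out) = X @ W + b
-    dX = dense.backward(np.ones_like(Y))  # also accumulates dense.gradients["W"] and ["b"]
-    return Y, dX
-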
-
-class Softmax(LayerBase):
- def __init__(self, dim=-1, name=None):
- r"""
- A softmax nonlinearity layer.
-
- Notes
- -----
- This is implemented as a layer rather than an activation primarily
- because it requires retaining the layer input in order to compute the
- softmax gradients properly. In other words, in contrast to other
- simple activations, the softmax function and its gradient are not
- computed elementwise, and thus are more easily expressed as a layer.
-
- The softmax function computes:
-
- .. math::
-
- y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}
-
- where :math:`x_i` is the `i` th element of input example **x**.
-
- Parameters
- ----------
-        dim : int
- The dimension in `X` along which the softmax will be computed.
- Default is -1.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.dim = dim
- self.n_in = None
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self):
- self.gradients = {}
- self.parameters = {}
- self.derived_variables = {}
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "SoftmaxLayer",
- "n_in": self.n_in,
- "n_out": self.n_in,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output on a single minibatch.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Layer input, representing the `n_in`-dimensional features for a
- minibatch of `n_ex` examples.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, n_out)`
- Layer output for each of the `n_ex` examples.
- """
- if not self.is_initialized:
- self.n_in = X.shape[1]
- self._init_params()
-
- Y = self._fwd(X)
-
- if retain_derived:
- self.X.append(X)
-
- return Y
-
- def _fwd(self, X):
- """Actual computation of softmax forward pass"""
- # center data to avoid overflow
- e_X = np.exp(X - np.max(X, axis=self.dim, keepdims=True))
- return e_X / e_X.sum(axis=self.dim, keepdims=True)
-
- def backward(self, dLdy, retain_grads=True):
- """
- Backprop from layer outputs to inputs.
-
- Parameters
- ----------
- dLdy : :py:class:`ndarray ` of shape `(n_ex, n_out)` or list of arrays
- The gradient(s) of the loss wrt. the layer output(s).
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dLdX : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- The gradient of the loss wrt. the layer input `X`.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- dX = []
- X = self.X
- for dy, x in zip(dLdy, X):
- dx = self._bwd(dy, x)
- dX.append(dx)
-
- return dX[0] if len(X) == 1 else dX
-
- def _bwd(self, dLdy, X):
- """
- Actual computation of the gradient of the loss wrt. the input X.
-
- The Jacobian, J, of the softmax for input x = [x1, ..., xn] is:
- J[i, j] =
- softmax(x_i) * (1 - softmax(x_j)) if i = j
- -softmax(x_i) * softmax(x_j) if i != j
- where
-            x_n is input example n (i.e., the n'th row in X)
- """
- dX = []
- for dy, x in zip(dLdy, X):
- dxi = []
- for dyi, xi in zip(*np.atleast_2d(dy, x)):
- yi = self._fwd(xi.reshape(1, -1)).reshape(-1, 1)
- dyidxi = np.diagflat(yi) - yi @ yi.T # jacobian wrt. input sample xi
- dxi.append(dyi @ dyidxi)
- dX.append(dxi)
- return np.array(dX).reshape(*X.shape)
-
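-
-# Illustrative sketch (assumed behaviour, not from the original docs): the
-# Softmax layer exponentiates and normalizes along `dim`, so each output row
-# sums to 1 and is unchanged by adding a constant to the corresponding input.
-def _softmax_usage_sketch():
-    X = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
-    sm = Softmax()
-    Y = sm.forward(X)                  # rows sum to 1
-    Y_shifted = sm.forward(X + 100.0)  # identical: softmax(x + c) == softmax(x)
-    return np.allclose(Y, Y_shifted), Y.sum(axis=-1)  # (True, [1., 1.])
-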
-
-class SparseEvolution(LayerBase):
- def __init__(
- self,
- n_out,
- zeta=0.3,
- epsilon=20,
- act_fn=None,
- kernel_initializer="glorot_uniform",
- name=None,
- ):
- r"""
- A sparse Erdos-Renyi layer with evolutionary rewiring via the sparse
- evolutionary training (SET) algorithm.
-
- Notes
- -----
- .. math::
-
- Y = f( (\mathbf{W} \odot \mathbf{W}_{mask}) \mathbf{X} + \mathbf{b} )
-
- where :math:`\odot` is the elementwise multiplication operation, `f` is
- the layer activation function, and :math:`\mathbf{W}_{mask}` is an
- evolved binary mask.
-
- Parameters
- ----------
- n_out : int
- The dimensionality of the layer output
- zeta : float
- Proportion of the positive and negative weights closest to zero to
- drop after each training update. Default is 0.3.
- epsilon : float
- Layer sparsity parameter. Default is 20.
- act_fn : str, :doc:`Activation ` object, or None
- The element-wise output nonlinearity used in computing `Y`. If None,
- use the identity function :math:`f(X) = X`. Default is None.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.kernel_initializer = kernel_initializer
- self.n_in = None
- self.zeta = zeta
- self.n_out = n_out
- self.epsilon = epsilon
- self.act_fn = ActivationInitializer(act_fn)()
- self.parameters = {"W": None, "b": None}
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self):
- if not self.weights_set:
- init_weights = WeightInitializer(str(self.act_fn), mode=self.kernel_initializer)
- W = init_weights((self.n_in, self.n_out))
- b = np.zeros((1, self.n_out))
- # convert a fully connected base layer into a sparse layer
- n_in, n_out = W.shape
- p = (self.epsilon * (n_in + n_out)) / (n_in * n_out)
-            mask = np.random.binomial(1, p, size=W.shape)
- else:
- W, b, mask = self.get_weights()
-
- self.derived_variables = {"Z": []}
- self.parameters = {"W": W, "b": b, "W_mask": mask}
- self.gradients = {"W": np.zeros_like(W), "b": np.zeros_like(b)}
- self.is_initialized = True
- self.weights_set = True
-
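-    # Illustrative aside (added note, not original code): with the Erdos-Renyi
-    # mask above, the expected fraction of retained connections is
-    #     p = epsilon * (n_in + n_out) / (n_in * n_out)
-    # e.g. for the default epsilon=20 and a 784 x 128 weight matrix,
-    # p = 20 * (784 + 128) / (784 * 128) ~= 0.18, so roughly 18% of the weights
-    # start out active.
-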
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "SparseEvolutionary",
- "kernel_initializer": self.kernel_initializer,
- "zeta": self.zeta,
- "n_in": self.n_in,
- "n_out": self.n_out,
- "epsilon": self.epsilon,
- "act_fn": str(self.act_fn),
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output on a single minibatch.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Layer input, representing the `n_in`-dimensional features for a
- minibatch of `n_ex` examples.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, n_out)`
- Layer output for each of the `n_ex` examples.
- """
- if not self.is_initialized:
- self.n_in = X.shape[1]
- self._init_params()
-
- Y, Z = self._fwd(X)
-
- if retain_derived:
- self.X.append(X)
- self.derived_variables["Z"].append(Z)
-
- return Y
-
- def _fwd(self, X):
- """Actual computation of forward pass"""
- W = self.parameters["W"]
- b = self.parameters["b"]
- W_mask = self.parameters["W_mask"]
-
- Z = X @ (W * W_mask) + b
- Y = self.act_fn(Z)
- return Y, Z
-
- def backward(self, dLdy, retain_grads=True):
- """
- Backprop from layer outputs to inputs
-
- Parameters
- ----------
- dLdy : :py:class:`ndarray ` of shape `(n_ex, n_out)` or list of arrays
- The gradient(s) of the loss wrt. the layer output(s).
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dLdX : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- The gradient of the loss wrt. the layer input `X`.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- dX = []
- X = self.X
- for dy, x in zip(dLdy, X):
- dx, dw, db = self._bwd(dy, x)
- dX.append(dx)
-
- if retain_grads:
- self.gradients["W"] += dw
- self.gradients["b"] += db
-
- return dX[0] if len(X) == 1 else dX
-
- def _bwd(self, dLdy, X):
- """Actual computation of gradient of the loss wrt. X, W, and b"""
- W = self.parameters["W"]
- b = self.parameters["b"]
- W_sparse = W * self.parameters["W_mask"]
-
- Z = X @ W_sparse + b
- dZ = dLdy * self.act_fn.grad(Z)
-
- dX = dZ @ W_sparse.T
- dW = X.T @ dZ
- dB = dZ.sum(axis=0, keepdims=True)
- return dX, dW, dB
-
- def _bwd2(self, dLdy, X, dLdy_bwd):
- """Compute second derivatives / deriv. of loss wrt. dX, dW, and db"""
- W = self.parameters["W"]
- b = self.parameters["b"]
- W_sparse = W * self.parameters["W_mask"]
-
- dZ = self.act_fn.grad(X @ W_sparse + b)
- ddZ = self.act_fn.grad2(X @ W_sparse + b)
-
- ddX = dLdy @ W * dZ
- ddW = dLdy.T @ (dLdy_bwd * dZ)
- ddB = np.sum(dLdy @ W_sparse * dLdy_bwd * ddZ, axis=0, keepdims=True)
- return ddX, ddW, ddB
-
- def update(self):
- """
- Update parameters using current gradients and evolve network
- connections via SET.
- """
- assert self.trainable, "Layer is frozen"
- for k, v in self.gradients.items():
- if k in self.parameters:
- self.parameters[k] = self.optimizer(self.parameters[k], v, k)
- self.flush_gradients()
- self._evolve_connections()
-
- def _evolve_connections(self):
- assert self.trainable, "Layer is frozen"
- W = self.parameters["W"]
- W_mask = self.parameters["W_mask"]
- W_flat = (W * W_mask).reshape(-1)
-
- k = int(np.prod(W.shape) * self.zeta)
-
- (p_ix,) = np.where(W_flat > 0)
- (n_ix,) = np.where(W_flat < 0)
-
- # remove the k largest negative and k smallest positive weights
- k_smallest_p = p_ix[np.argsort(W_flat[p_ix])][:k]
- k_largest_n = n_ix[np.argsort(W_flat[n_ix])][-k:]
- n_rewired = len(k_smallest_p) + len(k_largest_n)
-
- self.mask = np.ones_like(W_flat)
- self.mask[k_largest_n] = 0
- self.mask[k_smallest_p] = 0
-
- (zero_ixs,) = np.where(self.mask == 0)
-
- # resample new connections and update mask
- np.random.shuffle(zero_ixs)
- self.mask[zero_ixs[:n_rewired]] = 1
- self.mask = self.mask.reshape(*W.shape)
-
-
-#######################################################################
-# Convolutional Layers #
-#######################################################################
-
-
-class Conv1D(LayerBase):
- def __init__(
- self,
- out_ch,
- kernel_width,
- pad=0,
- stride=1,
- dilation=0,
- act_fn=None,
- kernel_initializer="glorot_uniform",
- name=None,
- ):
- """
- Apply a one-dimensional convolution kernel over an input volume.
-
- Notes
- -----
- Equations::
-
- out = act_fn(pad(X) * W + b)
- out_dim = floor(1 + (n_rows_in + pad_left + pad_right - kernel_width) / stride)
-
- where '`*`' denotes the cross-correlation operation with stride `s` and dilation `d`.
-
- Parameters
- ----------
- out_ch : int
- The number of filters/kernels to compute in the current layer
- kernel_width : int
- The width of a single 1D filter/kernel in the current layer
- act_fn : str, :doc:`Activation ` object, or None
- The activation function for computing ``Y[t]``. If None, use the
- identity function :math:`f(x) = x` by default. Default is None.
- pad : int, tuple, or {'same', 'causal'}
- The number of rows/columns to zero-pad the input with. If `'same'`,
- calculate padding to ensure the output length matches the input
- length. If `'causal'`, compute padding such that the output both has
- the same length as the input AND ``output[t]`` does not depend on
- ``input[t + 1:]``. Default is 0.
- stride : int
- The stride/hop of the convolution kernels as they move over the
- input volume. Default is 1.
- dilation : int
- Number of pixels inserted between kernel elements. The effective
- kernel width after dilation is ``kernel_width * (d + 1) - d``.
- Default is 0.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.pad = pad
- self.kernel_initializer = kernel_initializer
- self.in_ch = None
- self.out_ch = out_ch
- self.stride = stride
- self.dilation = dilation
- self.kernel_width = kernel_width
- self.act_fn = ActivationInitializer(act_fn)()
- self.parameters = {"W": None, "b": None}
- self.is_initialized = False
- self.weights_set = False
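The `out_dim` formula in the docstring above, extended with the effective kernel width under dilation, can be checked with a few lines of arithmetic. A sketch assuming a symmetric integer pad (the layer also accepts `'same'` and `'causal'`):

def conv1d_out_len(l_in, kernel_width, pad, stride, dilation=0):
    # effective kernel width once `dilation` zeros are inserted between taps
    eff = kernel_width * (dilation + 1) - dilation
    return (l_in + 2 * pad - eff) // stride + 1

print(conv1d_out_len(l_in=100, kernel_width=5, pad=2, stride=1))  # 100, i.e. 'same'
print(conv1d_out_len(l_in=100, kernel_width=5, pad=0, stride=2))  # 48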
-
- def _init_params(self):
- if not self.weights_set:
- init_weights = WeightInitializer(str(self.act_fn), mode=self.kernel_initializer)
- W = init_weights((self.kernel_width, self.in_ch, self.out_ch))
- b = np.zeros((1, 1, self.out_ch))
- else:
- W, b = self.get_weights()
-
- self.parameters = {"W": W, "b": b}
- self.gradients = {"W": np.zeros_like(W), "b": np.zeros_like(b)}
- self.derived_variables = {"Z": [], "out_rows": [], "out_cols": []}
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "Conv1D",
- "pad": self.pad,
- "kernel_initializer": self.kernel_initializer,
- "in_ch": self.in_ch,
- "out_ch": self.out_ch,
- "stride": self.stride,
- "dilation": self.dilation,
- "act_fn": str(self.act_fn),
- "kernel_width": self.kernel_width,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output given input volume `X`.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, l_in, in_ch)`
- The input volume consisting of `n_ex` examples, each of length
- `l_in` and with `in_ch` input channels
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, l_out, out_ch)`
- The layer output.
- """
- if not self.is_initialized:
- self.in_ch = X.shape[2]
- self._init_params()
-
- W = self.parameters["W"]
- b = self.parameters["b"]
-
- n_ex, l_in, in_ch = X.shape
- s, p, d = self.stride, self.pad, self.dilation
-
- # pad the input and perform the forward convolution
- Z = conv1D(X, W, s, p, d) + b
- Y = self.act_fn(Z)
-
- if retain_derived:
- self.X.append(X)
- self.derived_variables["Z"].append(Z)
- self.derived_variables["out_rows"].append(Z.shape[1])
- self.derived_variables["out_cols"].append(Z.shape[2])
-
- return Y
-
- def backward(self, dLdy, retain_grads=True):
- """
- Compute the gradient of the loss with respect to the layer parameters.
-
- Notes
- -----
- Relies on :meth:`~numpy_ml.neural_nets.utils.im2col` and
- :meth:`~numpy_ml.neural_nets.utils.col2im` to vectorize the
- gradient calculation. See the private method :meth:`_backward_naive`
- for a more straightforward implementation.
-
- Parameters
- ----------
- dLdy : :py:class:`ndarray ` of shape `(n_ex, l_out, out_ch)` or list of arrays
- The gradient(s) of the loss with respect to the layer output(s).
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape `(n_ex, l_in, in_ch)`
- The gradient of the loss with respect to the layer input volume.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- X = self.X
- Z = self.derived_variables["Z"]
-
- dX = []
- for dy, x, z in zip(dLdy, X, Z):
- dx, dw, db = self._bwd(dy, x, z)
- dX.append(dx)
-
- if retain_grads:
- self.gradients["W"] += dw
- self.gradients["b"] += db
-
- return dX[0] if len(X) == 1 else dX
-
- def _bwd(self, dLdy, X, Z):
- """Actual computation of gradient of the loss wrt. X, W, and b"""
- W = self.parameters["W"]
-
- # add a row dimension to X, W, and dZ to permit us to use im2col/col2im
- X2D = np.expand_dims(X, axis=1)
- W2D = np.expand_dims(W, axis=0)
- dLdZ = np.expand_dims(dLdy * self.act_fn.grad(Z), axis=1)
-
- d = self.dilation
- fr, fc, in_ch, out_ch = W2D.shape
- n_ex, l_out, out_ch = dLdy.shape
- fr, fc, s = 1, self.kernel_width, self.stride
-
- # use pad1D here in order to correctly handle self.pad = 'causal',
- # which isn't defined for pad2D
- _, p = pad1D(X, self.pad, self.kernel_width, s, d)
- p2D = (0, 0, p[0], p[1])
-
- # columnize W, X, and dLdy
- dLdZ_col = dLdZ.transpose(3, 1, 2, 0).reshape(out_ch, -1)
- W_col = W2D.transpose(3, 2, 0, 1).reshape(out_ch, -1).T
- X_col, _ = im2col(X2D, W2D.shape, p2D, s, d)
-
- # compute gradients via matrix multiplication and reshape
- dB = dLdZ_col.sum(axis=1).reshape(1, 1, -1)
- dW = (dLdZ_col @ X_col.T).reshape(out_ch, in_ch, fr, fc).transpose(2, 3, 1, 0)
-
- # reshape columnized dX back into the same format as the input volume
- dX_col = W_col @ dLdZ_col
- dX = col2im(dX_col, X2D.shape, W2D.shape, p2D, s, d).transpose(0, 2, 3, 1)
-
- return np.squeeze(dX, axis=1), np.squeeze(dW, axis=0), dB
-
- def _backward_naive(self, dLdy, retain_grads=True):
- """
- A slower (ie., non-vectorized) but more straightforward implementation
- of the gradient computations for a 1D conv layer.
-
- Parameters
- ----------
- dLdy : :py:class:`ndarray ` of shape `(n_ex, l_out, out_ch)` or list of arrays
- The gradient(s) of the loss with respect to the layer output(s).
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape `(n_ex, l_in, in_ch)`
- The gradient of the loss with respect to the layer input volume.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- W = self.parameters["W"]
- b = self.parameters["b"]
- Zs = self.derived_variables["Z"]
-
- Xs, d = self.X, self.dilation
- fw, s, p = self.kernel_width, self.stride, self.pad
-
- dXs = []
- for X, Z, dy in zip(Xs, Zs, dLdy):
- n_ex, l_out, out_ch = dy.shape
- X_pad, (pr1, pr2) = pad1D(X, p, self.kernel_width, s, d)
-
- dX = np.zeros_like(X_pad)
- dZ = dy * self.act_fn.grad(Z)
-
- dW, dB = np.zeros_like(W), np.zeros_like(b)
- for m in range(n_ex):
- for i in range(l_out):
- for c in range(out_ch):
- # compute window boundaries w. stride and dilation
- i0, i1 = i * s, (i * s) + fw * (d + 1) - d
-
- wc = W[:, :, c]
- kernel = dZ[m, i, c]
- window = X_pad[m, i0 : i1 : (d + 1), :]
-
- dB[:, :, c] += kernel
- dW[:, :, c] += window * kernel
- dX[m, i0 : i1 : (d + 1), :] += wc * kernel
-
- if retain_grads:
- self.gradients["W"] += dW
- self.gradients["b"] += dB
-
- pr2 = None if pr2 == 0 else -pr2
- dXs.append(dX[:, pr1:pr2, :])
- return dXs[0] if len(Xs) == 1 else dXs
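The usual way to validate the vectorized `_bwd` against `_backward_naive` (or against the loss directly) is a centered finite-difference check. A generic, hypothetical helper sketch, where `f` maps an array to a scalar loss and `grad` is the analytic gradient being tested:

import numpy as np

def grad_check(f, x, grad, eps=1e-6):
    """Return the max relative error between `grad` and a finite-difference
    estimate of df/dx. `x` is modified in place and restored."""
    num = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + eps
        fp = f(x)
        x[ix] = old - eps
        fm = f(x)
        x[ix] = old
        num[ix] = (fp - fm) / (2 * eps)
        it.iternext()
    return np.max(np.abs(num - grad) / (np.abs(num) + np.abs(grad) + 1e-12))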
-
-
-class Conv2D(LayerBase):
- def __init__(
- self,
- out_ch,
- kernel_shape,
- pad=0,
- stride=1,
- dilation=0,
- act_fn=None,
- kernel_initializer="glorot_uniform",
- name=None,
- ):
- """
- Apply a two-dimensional convolution kernel over an input volume.
-
- Notes
- -----
- Equations::
-
- out = act_fn(pad(X) * W + b)
- n_rows_out = floor(1 + (n_rows_in + pad_left + pad_right - filter_rows) / stride)
- n_cols_out = floor(1 + (n_cols_in + pad_top + pad_bottom - filter_cols) / stride)
-
- where `'*'` denotes the cross-correlation operation with stride `s` and
- dilation `d`.
-
- Parameters
- ----------
- out_ch : int
- The number of filters/kernels to compute in the current layer
- kernel_shape : 2-tuple
- The dimension of a single 2D filter/kernel in the current layer
- act_fn : str, :doc:`Activation ` object, or None
- The activation function for computing ``Y[t]``. If None, use the
- identity function :math:`f(X) = X` by default. Default is None.
- pad : int, tuple, or 'same'
- The number of rows/columns to zero-pad the input with. Default is
- 0.
- stride : int
- The stride/hop of the convolution kernels as they move over the
- input volume. Default is 1.
- dilation : int
- Number of pixels inserted between kernel elements. Effective kernel
- shape after dilation is: ``[kernel_rows * (d + 1) - d, kernel_cols
- * (d + 1) - d]``. Default is 0.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.pad = pad
- self.kernel_initializer = kernel_initializer
- self.in_ch = None
- self.out_ch = out_ch
- self.stride = stride
- self.dilation = dilation
- self.kernel_shape = kernel_shape
- self.act_fn = ActivationInitializer(act_fn)()
- self.parameters = {"W": None, "b": None}
- self.is_initialized = False
- self.weights_set = False
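The same output-shape arithmetic as in the 1D sketch earlier, applied per spatial axis (again assuming a symmetric integer pad):

def conv2d_out_shape(in_rows, in_cols, kernel_shape, pad, stride, dilation=0):
    fr, fc = kernel_shape
    eff_r = fr * (dilation + 1) - dilation
    eff_c = fc * (dilation + 1) - dilation
    return ((in_rows + 2 * pad - eff_r) // stride + 1,
            (in_cols + 2 * pad - eff_c) // stride + 1)

print(conv2d_out_shape(32, 32, (3, 3), pad=1, stride=1))  # (32, 32)
print(conv2d_out_shape(32, 32, (3, 3), pad=0, stride=2))  # (15, 15)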
-
- def _init_params(self):
- fr, fc = self.kernel_shape
- if not self.weights_set:
- init_weights = WeightInitializer(str(self.act_fn), mode=self.kernel_initializer)
- W = init_weights((fr, fc, self.in_ch, self.out_ch))
- b = np.zeros((1, 1, 1, self.out_ch))
- else:
- W, b = self.get_weights()
-
- self.parameters = {"W": W, "b": b}
- self.gradients = {"W": np.zeros_like(W), "b": np.zeros_like(b)}
- self.derived_variables = {"Z": [], "out_rows": [], "out_cols": []}
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "Conv2D",
- "pad": self.pad,
- "kernel_initializer": self.kernel_initializer,
- "in_ch": self.in_ch,
- "out_ch": self.out_ch,
- "stride": self.stride,
- "dilation": self.dilation,
- "act_fn": str(self.act_fn),
- "kernel_shape": self.kernel_shape,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output given input volume `X`.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The input volume consisting of `n_ex` examples, each with dimension
- (`in_rows`, `in_cols`, `in_ch`).
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, out_rows, out_cols, out_ch)`
- The layer output.
- """ # noqa: E501
- if not self.is_initialized:
- self.in_ch = X.shape[3]
- self._init_params()
-
- W = self.parameters["W"]
- b = self.parameters["b"]
-
- n_ex, in_rows, in_cols, in_ch = X.shape
- s, p, d = self.stride, self.pad, self.dilation
-
- # pad the input and perform the forward convolution
- Z = conv2D(X, W, s, p, d) + b
- Y = self.act_fn(Z)
-
- if retain_derived:
- self.X.append(X)
- self.derived_variables["Z"].append(Z)
- self.derived_variables["out_rows"].append(Z.shape[1])
- self.derived_variables["out_cols"].append(Z.shape[2])
-
- return Y
-
- def backward(self, dLdy, retain_grads=True):
- """
- Compute the gradient of the loss with respect to the layer parameters.
-
- Notes
- -----
- Relies on :meth:`~numpy_ml.neural_nets.utils.im2col` and
- :meth:`~numpy_ml.neural_nets.utils.col2im` to vectorize the
- gradient calculation.
-
- See the private method :meth:`_backward_naive` for a more straightforward
- implementation.
-
- Parameters
- ----------
- dLdy : :py:class:`ndarray ` of shape `(n_ex, out_rows,
- out_cols, out_ch)` or list of arrays
- The gradient(s) of the loss with respect to the layer output(s).
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The gradient of the loss with respect to the layer input volume.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- dX = []
- X = self.X
- Z = self.derived_variables["Z"]
-
- for dy, x, z in zip(dLdy, X, Z):
- dx, dw, db = self._bwd(dy, x, z)
- dX.append(dx)
-
- if retain_grads:
- self.gradients["W"] += dw
- self.gradients["b"] += db
-
- return dX[0] if len(X) == 1 else dX
-
- def _bwd(self, dLdy, X, Z):
- """Actual computation of gradient of the loss wrt. X, W, and b"""
- W = self.parameters["W"]
-
- d = self.dilation
- fr, fc, in_ch, out_ch = W.shape
- n_ex, out_rows, out_cols, out_ch = dLdy.shape
- (fr, fc), s, p = self.kernel_shape, self.stride, self.pad
-
- # columnize W, X, and dLdy
- dLdZ = dLdy * self.act_fn.grad(Z)
- dLdZ_col = dLdZ.transpose(3, 1, 2, 0).reshape(out_ch, -1)
- W_col = W.transpose(3, 2, 0, 1).reshape(out_ch, -1).T
- X_col, p = im2col(X, W.shape, p, s, d)
-
- # compute gradients via matrix multiplication and reshape
- dB = dLdZ_col.sum(axis=1).reshape(1, 1, 1, -1)
- dW = (dLdZ_col @ X_col.T).reshape(out_ch, in_ch, fr, fc).transpose(2, 3, 1, 0)
-
- # reshape columnized dX back into the same format as the input volume
- dX_col = W_col @ dLdZ_col
- dX = col2im(dX_col, X.shape, W.shape, p, s, d).transpose(0, 2, 3, 1)
-
- return dX, dW, dB
-
- def _backward_naive(self, dLdy, retain_grads=True):
- """
- A slower (ie., non-vectorized) but more straightforward implementation
- of the gradient computations for a 2D conv layer.
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape `(n_ex, out_rows, out_cols, out_ch)`
- The gradient of the loss with respect to the layer output.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The gradient of the loss with respect to the layer input volume.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdy, list):
- dLdy = [dLdy]
-
- W = self.parameters["W"]
- b = self.parameters["b"]
- Zs = self.derived_variables["Z"]
-
- Xs, d = self.X, self.dilation
- (fr, fc), s, p = self.kernel_shape, self.stride, self.pad
-
- dXs = []
- for X, Z, dy in zip(Xs, Zs, dLdy):
- n_ex, out_rows, out_cols, out_ch = dy.shape
- X_pad, (pr1, pr2, pc1, pc2) = pad2D(X, p, self.kernel_shape, s, d)
-
- dZ = dy * self.act_fn.grad(Z)
-
- dX = np.zeros_like(X_pad)
- dW, dB = np.zeros_like(W), np.zeros_like(b)
- for m in range(n_ex):
- for i in range(out_rows):
- for j in range(out_cols):
- for c in range(out_ch):
- # compute window boundaries w. stride and dilation
- i0, i1 = i * s, (i * s) + fr * (d + 1) - d
- j0, j1 = j * s, (j * s) + fc * (d + 1) - d
-
- wc = W[:, :, :, c]
- kernel = dZ[m, i, j, c]
- window = X_pad[m, i0 : i1 : (d + 1), j0 : j1 : (d + 1), :]
-
- dB[:, :, :, c] += kernel
- dW[:, :, :, c] += window * kernel
- dX[m, i0 : i1 : (d + 1), j0 : j1 : (d + 1), :] += (
- wc * kernel
- )
-
- if retain_grads:
- self.gradients["W"] += dW
- self.gradients["b"] += dB
-
- pr2 = None if pr2 == 0 else -pr2
- pc2 = None if pc2 == 0 else -pc2
- dXs.append(dX[:, pr1:pr2, pc1:pc2, :])
- return dXs[0] if len(Xs) == 1 else dXs
-
-
-class Pool2D(LayerBase):
- def __init__(self, kernel_shape, stride=1, pad=0, mode="max", name=None):
- """
- A single two-dimensional pooling layer.
-
- Parameters
- ----------
- kernel_shape : 2-tuple
- The dimension of a single 2D filter/kernel in the current layer
- stride : int
- The stride/hop of the convolution kernels as they move over the
- input volume. Default is 1.
- pad : int, tuple, or 'same'
- The number of rows/columns of 0's to pad the input. Default is 0.
- mode : {"max", "average"}
- The pooling function to apply.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.pad = pad
- self.mode = mode
- self.in_ch = None
- self.out_ch = None
- self.stride = stride
- self.kernel_shape = kernel_shape
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self):
- self.derived_variables = {"out_rows": [], "out_cols": []}
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "Pool2D",
- "act_fn": None,
- "pad": self.pad,
- "mode": self.mode,
- "in_ch": self.in_ch,
- "out_ch": self.out_ch,
- "stride": self.stride,
- "kernel_shape": self.kernel_shape,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output given input volume `X`.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The input volume consisting of `n_ex` examples, each with dimension
- (`in_rows`,`in_cols`, `in_ch`)
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, out_rows, out_cols, out_ch)`
- The layer output.
- """ # noqa: E501
- if not self.is_initialized:
- self.in_ch = self.out_ch = X.shape[3]
- self._init_params()
-
- n_ex, in_rows, in_cols, nc_in = X.shape
- (fr, fc), s, p = self.kernel_shape, self.stride, self.pad
- X_pad, (pr1, pr2, pc1, pc2) = pad2D(X, p, self.kernel_shape, s)
-
- out_rows = np.floor(1 + (in_rows + pr1 + pr2 - fr) / s).astype(int)
- out_cols = np.floor(1 + (in_cols + pc1 + pc2 - fc) / s).astype(int)
-
- if self.mode == "max":
- pool_fn = np.max
- elif self.mode == "average":
- pool_fn = np.mean
-
- Y = np.zeros((n_ex, out_rows, out_cols, self.out_ch))
- for m in range(n_ex):
- for i in range(out_rows):
- for j in range(out_cols):
- for c in range(self.out_ch):
- # calculate window boundaries, incorporating stride
- i0, i1 = i * s, (i * s) + fr
- j0, j1 = j * s, (j * s) + fc
-
- xi = X_pad[m, i0:i1, j0:j1, c]
- Y[m, i, j, c] = pool_fn(xi)
-
- if retain_derived:
- self.X.append(X)
- self.derived_variables["out_rows"].append(out_rows)
- self.derived_variables["out_cols"].append(out_cols)
-
- return Y
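When the stride equals the kernel size and there is no padding, 2D max pooling reduces to a reshape followed by a max, which is a handy independent check of the loop-based forward pass above. A standalone sketch with illustrative shapes:

import numpy as np

X = np.random.randn(2, 8, 8, 3)   # (n_ex, in_rows, in_cols, in_ch)
fr = fc = s = 2
n, r, c, ch = X.shape

# group each 2x2 window onto its own axes, then reduce over those axes
Y = X.reshape(n, r // fr, fr, c // fc, fc, ch).max(axis=(2, 4))
print(Y.shape)                    # (2, 4, 4, 3)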
-
- def backward(self, dLdY, retain_grads=True):
- """
- Backprop from layer outputs to inputs
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The gradient of the loss wrt. the layer output `Y`.
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The gradient of the loss wrt. the layer input `X`.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdY, list):
- dLdY = [dLdY]
-
- Xs = self.X
- out_rows = self.derived_variables["out_rows"]
- out_cols = self.derived_variables["out_cols"]
-
- (fr, fc), s, p = self.kernel_shape, self.stride, self.pad
-
- dXs = []
- for X, dy, out_row, out_col in zip(Xs, dLdY, out_rows, out_cols):
- n_ex, in_rows, in_cols, nc_in = X.shape
- X_pad, (pr1, pr2, pc1, pc2) = pad2D(X, p, self.kernel_shape, s)
-
- dX = np.zeros_like(X_pad)
- for m in range(n_ex):
- for i in range(out_row):
- for j in range(out_col):
- for c in range(self.out_ch):
- # calculate window boundaries, incorporating stride
- i0, i1 = i * s, (i * s) + fr
- j0, j1 = j * s, (j * s) + fc
-
- if self.mode == "max":
- xi = X[m, i0:i1, j0:j1, c]
-
- # enforce that the mask can only consist of a
- # single `True` entry, even if multiple entries in
- # xi are equal to max(xi)
- mask = np.zeros_like(xi).astype(bool)
- x, y = np.argwhere(xi == np.max(xi))[0]
- mask[x, y] = True
-
- dX[m, i0:i1, j0:j1, c] += mask * dy[m, i, j, c]
- elif self.mode == "average":
- frame = np.ones((fr, fc)) * dy[m, i, j, c]
- dX[m, i0:i1, j0:j1, c] += frame / np.prod((fr, fc))
-
- pr2 = None if pr2 == 0 else -pr2
- pc2 = None if pc2 == 0 else -pc2
- dXs.append(dX[:, pr1:pr2, pc1:pc2, :])
- return dXs[0] if len(Xs) == 1 else dXs
-
-
-class Deconv2D(LayerBase):
- def __init__(
- self,
- out_ch,
- kernel_shape,
- pad=0,
- stride=1,
- act_fn=None,
- kernel_initializer="glorot_uniform",
- name=None,
- ):
- """
- Apply a two-dimensional "deconvolution" to an input volume.
-
- Notes
- -----
- The term "deconvolution" in this context does not correspond with the
- deconvolution operation in mathematics. More accurately, this layer is
- computing a transposed convolution / fractionally-strided convolution.
-
- Parameters
- ----------
- out_ch : int
- The number of filters/kernels to compute in the current layer
- kernel_shape : 2-tuple
- The dimension of a single 2D filter/kernel in the current layer
- act_fn : str, :doc:`Activation ` object, or None
- The activation function for computing ``Y[t]``. If None, use
- :class:`~numpy_ml.neural_nets.activations.Affine`
- activations by default. Default is None.
- pad : int, tuple, or 'same'
- The number of rows/columns to zero-pad the input with. Default is 0.
- stride : int
- The stride/hop of the convolution kernels as they move over the
- input volume. Default is 1.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.pad = pad
- self.kernel_initializer = kernel_initializer
- self.in_ch = None
- self.stride = stride
- self.out_ch = out_ch
- self.kernel_shape = kernel_shape
- self.act_fn = ActivationInitializer(act_fn)()
- self.parameters = {"W": None, "b": None}
- self.is_initialized = False
- self.weights_set = False
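For a transposed convolution with symmetric integer padding, the output size follows the standard relation out = stride * (in - 1) + kernel - 2 * pad. A quick sketch (the layer's `'same'` padding mode is not covered here):

def deconv2d_out_dim(in_dim, kernel, pad, stride):
    return stride * (in_dim - 1) + kernel - 2 * pad

print(deconv2d_out_dim(in_dim=4, kernel=3, pad=0, stride=2))  # 9
print(deconv2d_out_dim(in_dim=7, kernel=3, pad=1, stride=1))  # 7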
-
- def _init_params(self):
- fr, fc = self.kernel_shape
- if not self.weights_set:
- init_weights = WeightInitializer(str(self.act_fn), mode=self.kernel_initializer)
- W = init_weights((fr, fc, self.in_ch, self.out_ch))
- b = np.zeros((1, 1, 1, self.out_ch))
- else:
- W, b = self.get_weights()
-
- self.parameters = {"W": W, "b": b}
- self.gradients = {"W": np.zeros_like(W), "b": np.zeros_like(b)}
- self.derived_variables = {"Z": [], "out_rows": [], "out_cols": []}
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "Deconv2D",
- "pad": self.pad,
- "kernel_initializer": self.kernel_initializer,
- "in_ch": self.in_ch,
- "out_ch": self.out_ch,
- "stride": self.stride,
- "act_fn": str(self.act_fn),
- "kernel_shape": self.kernel_shape,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output given input volume `X`.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The input volume consisting of `n_ex` examples, each with dimension
- (`in_rows`, `in_cols`, `in_ch`).
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, out_rows, out_cols, out_ch)`
- The layer output.
- """ # noqa: E501
- if not self.is_initialized:
- self.in_ch = X.shape[3]
- self._init_params()
-
- W = self.parameters["W"]
- b = self.parameters["b"]
-
- s, p = self.stride, self.pad
- n_ex, in_rows, in_cols, in_ch = X.shape
-
- # pad the input and perform the forward deconvolution
- Z = deconv2D_naive(X, W, s, p, 0) + b
- Y = self.act_fn(Z)
-
- if retain_derived:
- self.X.append(X)
- self.derived_variables["Z"].append(Z)
- self.derived_variables["out_rows"].append(Z.shape[1])
- self.derived_variables["out_cols"].append(Z.shape[2])
-
- return Y
-
- def backward(self, dLdY, retain_grads=True):
- """
- Compute the gradient of the loss with respect to the layer parameters.
-
- Notes
- -----
- Relies on :meth:`~numpy_ml.neural_nets.utils.im2col` and
- :meth:`~numpy_ml.neural_nets.utils.col2im` to vectorize the
- gradient calculations.
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape (`n_ex, out_rows, out_cols, out_ch`)
- The gradient of the loss with respect to the layer output.
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape (`n_ex, in_rows, in_cols, in_ch`)
- The gradient of the loss with respect to the layer input volume.
- """ # noqa: E501
- assert self.trainable, "Layer is frozen"
- if not isinstance(dLdY, list):
- dLdY = [dLdY]
-
- dX = []
- X, Z = self.X, self.derived_variables["Z"]
-
- for dy, x, z in zip(dLdY, X, Z):
- dx, dw, db = self._bwd(dy, x, z)
- dX.append(dx)
-
- if retain_grads:
- self.gradients["W"] += dw
- self.gradients["b"] += db
-
- return dX[0] if len(X) == 1 else dX
-
- def _bwd(self, dLdY, X, Z):
- """Actual computation of gradient of the loss wrt. X, W, and b"""
- W = np.rot90(self.parameters["W"], 2)
-
- s = self.stride
- if self.stride > 1:
- X = dilate(X, s - 1)
- s = 1
-
- fr, fc, in_ch, out_ch = W.shape
- (fr, fc), p = self.kernel_shape, self.pad
- n_ex, out_rows, out_cols, out_ch = dLdY.shape
-
- # pad X the first time
- X_pad, p = pad2D(X, p, W.shape[:2], s)
- n_ex, in_rows, in_cols, in_ch = X_pad.shape
- pr1, pr2, pc1, pc2 = p
-
- # compute additional padding to produce the deconvolution
- out_rows = s * (in_rows - 1) - pr1 - pr2 + fr
- out_cols = s * (in_cols - 1) - pc1 - pc2 + fc
- out_dim = (out_rows, out_cols)
-
- # add additional "deconvolution" padding
- _p = calc_pad_dims_2D(X_pad.shape, out_dim, W.shape[:2], s, 0)
- X_pad, _ = pad2D(X_pad, _p, W.shape[:2], s)
-
- # columnize W, X, and dLdY
- dLdZ = dLdY * self.act_fn.grad(Z)
- dLdZ, _ = pad2D(dLdZ, p, W.shape[:2], s)
-
- dLdZ_col = dLdZ.transpose(3, 1, 2, 0).reshape(out_ch, -1)
- W_col = W.transpose(3, 2, 0, 1).reshape(out_ch, -1)
- X_col, _ = im2col(X_pad, W.shape, 0, s, 0)
-
- # compute gradients via matrix multiplication and reshape
- dB = dLdZ_col.sum(axis=1).reshape(1, 1, 1, -1)
- dW = (dLdZ_col @ X_col.T).reshape(out_ch, in_ch, fr, fc).transpose(2, 3, 1, 0)
- dW = np.rot90(dW, 2)
-
- # reshape columnized dX back into the same format as the input volume
- dX_col = W_col.T @ dLdZ_col
-
- total_pad = tuple(i + j for i, j in zip(p, _p))
- dX = col2im(dX_col, X.shape, W.shape, total_pad, s, 0).transpose(0, 2, 3, 1)
- dX = dX[:, :: self.stride, :: self.stride, :]
-
- return dX, dW, dB
-
-
-#######################################################################
-# Recurrent Layers #
-#######################################################################
-
-
-class RNNCell(LayerBase):
- def __init__(self, n_out, act_fn="Tanh", kernel_initializer="glorot_uniform", name=None):
- r"""
- A single step of a vanilla (Elman) RNN.
-
- Notes
- -----
- At timestep `t`, the vanilla RNN cell computes
-
- .. math::
-
- \mathbf{Z}^{(t)} &=
- \mathbf{W}_{ax} \mathbf{X}^{(t)} + \mathbf{b}_{ax} +
- \mathbf{W}_{aa} \mathbf{A}^{(t-1)} + \mathbf{b}_{aa} \\
- \mathbf{A}^{(t)} &= f(\mathbf{Z}^{(t)})
-
- where
-
- - :math:`\mathbf{X}^{(t)}` is the input at time `t`
- - :math:`\mathbf{A}^{(t)}` is the hidden state at timestep `t`
- - `f` is the layer activation function
- - :math:`\mathbf{W}_{ax}` and :math:`\mathbf{b}_{ax}` are the weights
- and bias for the input to hidden layer
- - :math:`\mathbf{W}_{aa}` and :math:`\mathbf{b}_{aa}` are the weights
- and biases for the hidden to hidden layer
-
- Parameters
- ----------
- n_out : int
- The dimension of a single hidden state / output on a given timestep
- act_fn : str, :doc:`Activation ` object, or None
- The activation function for computing ``A[t]``. Default is `'Tanh'`.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.kernel_initializer = kernel_initializer
- self.n_in = None
- self.n_out = n_out
- self.n_timesteps = None
- self.act_fn = ActivationInitializer(act_fn)()
- self.parameters = {"Waa": None, "Wax": None, "ba": None, "bx": None}
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self):
- self.X = []
- if not self.weights_set:
- init_weights = WeightInitializer(str(self.act_fn), mode=self.kernel_initializer)
- Wax = init_weights((self.n_in, self.n_out))
- Waa = init_weights((self.n_out, self.n_out))
- ba = np.zeros((self.n_out, 1))
- bx = np.zeros((self.n_out, 1))
- else:
- Waa, ba, Wax, bx = self.get_weights()
-
- self.parameters = {"Waa": Waa, "ba": ba, "Wax": Wax, "bx": bx}
-
- self.gradients = {
- "Waa": np.zeros_like(Waa),
- "Wax": np.zeros_like(Wax),
- "ba": np.zeros_like(ba),
- "bx": np.zeros_like(bx),
- }
-
- self.derived_variables = {
- "A": [],
- "Z": [],
- "n_timesteps": 0,
- "current_step": 0,
- "dLdA_accumulator": None,
- }
-
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "RNNCell",
- "kernel_initializer": self.kernel_initializer,
- "n_in": self.n_in,
- "n_out": self.n_out,
- "act_fn": str(self.act_fn),
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, Xt):
- """
- Compute the network output for a single timestep.
-
- Parameters
- ----------
- Xt : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Input at timestep `t` consisting of `n_ex` examples each of
- dimensionality `n_in`.
-
- Returns
- -------
- At: :py:class:`ndarray ` of shape `(n_ex, n_out)`
- The value of the hidden state at timestep `t` for each of the
- `n_ex` examples.
- """
- if not self.is_initialized:
- self.n_in = Xt.shape[1]
- self._init_params()
-
- # increment timestep
- self.derived_variables["n_timesteps"] += 1
- self.derived_variables["current_step"] += 1
-
- # Retrieve parameters
- ba = self.parameters["ba"]
- bx = self.parameters["bx"]
- Wax = self.parameters["Wax"]
- Waa = self.parameters["Waa"]
-
- # initialize the hidden state to zero
- As = self.derived_variables["A"]
- if len(As) == 0:
- n_ex, n_in = Xt.shape
- A0 = np.zeros((n_ex, self.n_out))
- As.append(A0)
-
- # compute next hidden state
- Zt = As[-1] @ Waa + ba.T + Xt @ Wax + bx.T
- At = self.act_fn(Zt)
-
- self.derived_variables["Z"].append(Zt)
- self.derived_variables["A"].append(At)
-
- # store intermediate variables
- self.X.append(Xt)
- return At
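Unrolled over time, the recurrence computed by `forward` above is just a loop over A_t = f(A_{t-1} @ Waa + X_t @ Wax + b). A standalone NumPy sketch with tanh and a single bias standing in for ba + bx; the shapes follow the layer's `(n_ex, n_in, n_t)` convention and the weights here are arbitrary:

import numpy as np

n_ex, n_in, n_out, n_t = 4, 3, 5, 7
X = np.random.randn(n_ex, n_in, n_t)
Wax = np.random.randn(n_in, n_out) * 0.1
Waa = np.random.randn(n_out, n_out) * 0.1
b = np.zeros((1, n_out))

A = np.zeros((n_ex, n_out))       # A_0, the initial hidden state
for t in range(n_t):
    A = np.tanh(A @ Waa + X[:, :, t] @ Wax + b)
print(A.shape)                    # (4, 5), the final hidden state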
-
- def backward(self, dLdAt):
- """
- Backprop for a single timestep.
-
- Parameters
- ----------
- dLdAt : :py:class:`ndarray ` of shape `(n_ex, n_out)`
- The gradient of the loss wrt. the layer outputs (ie., hidden
- states) at timestep `t`.
-
- Returns
- -------
- dLdXt : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- The gradient of the loss wrt. the layer inputs at timestep `t`.
- """
- assert self.trainable, "Layer is frozen"
-
- # decrement current step
- self.derived_variables["current_step"] -= 1
-
- # extract context variables
- Zs = self.derived_variables["Z"]
- As = self.derived_variables["A"]
- t = self.derived_variables["current_step"]
- dA_acc = self.derived_variables["dLdA_accumulator"]
-
- # initialize accumulator
- if dA_acc is None:
- dA_acc = np.zeros_like(As[0])
-
- # get network weights for gradient calcs
- Wax = self.parameters["Wax"]
- Waa = self.parameters["Waa"]
-
- # compute gradient components at timestep t
- dA = dLdAt + dA_acc
- dZ = self.act_fn.grad(Zs[t]) * dA
- dXt = dZ @ Wax.T
-
- # update parameter gradients with signal from current step
- self.gradients["Waa"] += As[t].T @ dZ
- self.gradients["Wax"] += self.X[t].T @ dZ
- self.gradients["ba"] += dZ.sum(axis=0, keepdims=True).T
- self.gradients["bx"] += dZ.sum(axis=0, keepdims=True).T
-
- # update accumulator variable for hidden state
- self.derived_variables["dLdA_accumulator"] = dZ @ Waa.T
- return dXt
-
- def flush_gradients(self):
- """Erase all the layer's derived variables and gradients."""
- assert self.trainable, "Layer is frozen"
-
- self.X = []
- for k, v in self.derived_variables.items():
- self.derived_variables[k] = []
-
- self.derived_variables["n_timesteps"] = 0
- self.derived_variables["current_step"] = 0
-
- # reset parameter gradients to 0
- for k, v in self.parameters.items():
- self.gradients[k] = np.zeros_like(v)
-
-
-class LSTMCell(LayerBase):
- def __init__(
- self,
- n_out,
- act_fn="Tanh",
- gate_fn="Sigmoid",
- kernel_initializer="glorot_uniform",
- name=None,
- ):
- """
- A single step of a long short-term memory (LSTM) RNN.
-
- Notes
- -----
- Notation:
-
- - ``Z[t]`` is the input to each of the gates at timestep `t`
- - ``A[t]`` is the value of the hidden state at timestep `t`
- - ``Cc[t]`` is the value of the *candidate* cell/memory state at timestep `t`
- - ``C[t]`` is the value of the *final* cell/memory state at timestep `t`
- - ``Gf[t]`` is the output of the forget gate at timestep `t`
- - ``Gu[t]`` is the output of the update gate at timestep `t`
- - ``Go[t]`` is the output of the output gate at timestep `t`
-
- Equations::
-
- Z[t] = stack([A[t-1], X[t]])
- Gf[t] = gate_fn(Wf @ Z[t] + bf)
- Gu[t] = gate_fn(Wu @ Z[t] + bu)
- Go[t] = gate_fn(Wo @ Z[t] + bo)
- Cc[t] = act_fn(Wc @ Z[t] + bc)
- C[t] = Gf[t] * C[t-1] + Gu[t] * Cc[t]
- A[t] = Go[t] * act_fn(C[t])
-
- where `@` indicates dot/matrix product, and '*' indicates elementwise
- multiplication.
-
- Parameters
- ----------
- n_out : int
- The dimension of a single hidden state / output on a given timestep.
- act_fn : str, :doc:`Activation ` object, or None
- The activation function for computing ``A[t]``. Default is
- `'Tanh'`.
- gate_fn : str, :doc:`Activation ` object, or None
- The gate function for computing the update, forget, and output
- gates. Default is `'Sigmoid'`.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.kernel_initializer = kernel_initializer
- self.n_in = None
- self.n_out = n_out
- self.n_timesteps = None
- self.act_fn = ActivationInitializer(act_fn)()
- self.gate_fn = ActivationInitializer(gate_fn)()
- self.parameters = {
- "Wf": None,
- "Wu": None,
- "Wc": None,
- "Wo": None,
- "bf": None,
- "bu": None,
- "bc": None,
- "bo": None,
- }
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self):
- self.X = []
- if not self.weights_set:
- init_weights_gate = WeightInitializer(str(self.gate_fn), mode=self.kernel_initializer)
- init_weights_act = WeightInitializer(str(self.act_fn), mode=self.kernel_initializer)
-
- Wf = init_weights_gate((self.n_in + self.n_out, self.n_out))
- Wu = init_weights_gate((self.n_in + self.n_out, self.n_out))
- Wc = init_weights_act((self.n_in + self.n_out, self.n_out))
- Wo = init_weights_gate((self.n_in + self.n_out, self.n_out))
-
- bf = np.zeros((1, self.n_out))
- bu = np.zeros((1, self.n_out))
- bc = np.zeros((1, self.n_out))
- bo = np.zeros((1, self.n_out))
- else:
- Wf, bf, Wu, bu, Wc, bc, Wo, bo = self.get_weights()
-
- self.parameters = {
- "Wf": Wf,
- "bf": bf,
- "Wu": Wu,
- "bu": bu,
- "Wc": Wc,
- "bc": bc,
- "Wo": Wo,
- "bo": bo,
- }
-
- self.gradients = {
- "Wf": np.zeros_like(Wf),
- "Wu": np.zeros_like(Wu),
- "Wc": np.zeros_like(Wc),
- "Wo": np.zeros_like(Wo),
- "bf": np.zeros_like(bf),
- "bu": np.zeros_like(bu),
- "bc": np.zeros_like(bc),
- "bo": np.zeros_like(bo),
- }
-
- self.derived_variables = {
- "C": [],
- "A": [],
- "Gf": [],
- "Gu": [],
- "Go": [],
- "Gc": [],
- "Cc": [],
- "n_timesteps": 0,
- "current_step": 0,
- "dLdA_accumulator": None,
- "dLdC_accumulator": None,
- }
-
- self.is_initialized = True
- self.weights_set = True
-
- def _get_params(self):
- Wf = self.parameters["Wf"]
- Wu = self.parameters["Wu"]
- Wc = self.parameters["Wc"]
- Wo = self.parameters["Wo"]
- bf = self.parameters["bf"]
- bu = self.parameters["bu"]
- bc = self.parameters["bc"]
- bo = self.parameters["bo"]
- return Wf, Wu, Wc, Wo, bf, bu, bc, bo
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "LSTMCell",
- "kernel_initializer": self.kernel_initializer,
- "n_in": self.n_in,
- "n_out": self.n_out,
- "act_fn": str(self.act_fn),
- "gate_fn": str(self.gate_fn),
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def forward(self, Xt):
- """
- Compute the layer output for a single timestep.
-
- Parameters
- ----------
- Xt : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Input at timestep t consisting of `n_ex` examples each of
- dimensionality `n_in`.
-
- Returns
- -------
- At: :py:class:`ndarray ` of shape `(n_ex, n_out)`
- The value of the hidden state at timestep `t` for each of the `n_ex`
- examples.
- Ct: :py:class:`ndarray ` of shape `(n_ex, n_out)`
- The value of the cell/memory state at timestep `t` for each of the
- `n_ex` examples.
- """
- if not self.is_initialized:
- self.n_in = Xt.shape[1]
- self._init_params()
-
- Wf, Wu, Wc, Wo, bf, bu, bc, bo = self._get_params()
-
- self.derived_variables["n_timesteps"] += 1
- self.derived_variables["current_step"] += 1
-
- if len(self.derived_variables["A"]) == 0:
- n_ex, n_in = Xt.shape
- init = np.zeros((n_ex, self.n_out))
- self.derived_variables["A"].append(init)
- self.derived_variables["C"].append(init)
-
- A_prev = self.derived_variables["A"][-1]
- C_prev = self.derived_variables["C"][-1]
-
- # concatenate A_prev and Xt to create Zt
- Zt = np.hstack([A_prev, Xt])
-
- Gft = self.gate_fn(Zt @ Wf + bf)
- Gut = self.gate_fn(Zt @ Wu + bu)
- Got = self.gate_fn(Zt @ Wo + bo)
- Cct = self.act_fn(Zt @ Wc + bc)
- Ct = Gft * C_prev + Gut * Cct
- At = Got * self.act_fn(Ct)
-
- # bookkeeping
- self.X.append(Xt)
- self.derived_variables["A"].append(At)
- self.derived_variables["C"].append(Ct)
- self.derived_variables["Gf"].append(Gft)
- self.derived_variables["Gu"].append(Gut)
- self.derived_variables["Go"].append(Got)
- self.derived_variables["Cc"].append(Cct)
- return At, Ct
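A minimal standalone version of the single step computed in `forward` above, with sigmoid gates and tanh activations; the weight names mirror the layer's parameters but their values here are arbitrary:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_ex, n_in, n_out = 2, 4, 3
Xt = np.random.randn(n_ex, n_in)
A_prev = np.zeros((n_ex, n_out))
C_prev = np.zeros((n_ex, n_out))
Wf, Wu, Wc, Wo = (np.random.randn(n_in + n_out, n_out) * 0.1 for _ in range(4))
bf = bu = bc = bo = np.zeros((1, n_out))

Zt = np.hstack([A_prev, Xt])
Gf, Gu, Go = sigmoid(Zt @ Wf + bf), sigmoid(Zt @ Wu + bu), sigmoid(Zt @ Wo + bo)
Cc = np.tanh(Zt @ Wc + bc)
Ct = Gf * C_prev + Gu * Cc        # new cell state
At = Go * np.tanh(Ct)             # new hidden state
print(At.shape, Ct.shape)         # (2, 3) (2, 3)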
-
- def backward(self, dLdAt):
- """
- Backprop for a single timestep.
-
- Parameters
- ----------
- dLdAt : :py:class:`ndarray ` of shape `(n_ex, n_out)`
- The gradient of the loss wrt. the layer outputs (ie., hidden
- states) at timestep `t`.
-
- Returns
- -------
- dLdXt : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- The gradient of the loss wrt. the layer inputs at timestep `t`.
- """
- assert self.trainable, "Layer is frozen"
-
- Wf, Wu, Wc, Wo, bf, bu, bc, bo = self._get_params()
-
- self.derived_variables["current_step"] -= 1
- t = self.derived_variables["current_step"]
-
- Got = self.derived_variables["Go"][t]
- Gft = self.derived_variables["Gf"][t]
- Gut = self.derived_variables["Gu"][t]
- Cct = self.derived_variables["Cc"][t]
- At = self.derived_variables["A"][t + 1]
- Ct = self.derived_variables["C"][t + 1]
- C_prev = self.derived_variables["C"][t]
- A_prev = self.derived_variables["A"][t]
-
- Xt = self.X[t]
- Zt = np.hstack([A_prev, Xt])
-
- dA_acc = self.derived_variables["dLdA_accumulator"]
- dC_acc = self.derived_variables["dLdC_accumulator"]
-
- # initialize accumulators
- if dA_acc is None:
- dA_acc = np.zeros_like(At)
-
- if dC_acc is None:
- dC_acc = np.zeros_like(Ct)
-
- # Gradient calculations
- # ---------------------
-
- dA = dLdAt + dA_acc
- dC = dC_acc + dA * Got * self.act_fn.grad(Ct)
-
- # compute the input to the gate functions at timestep t
- _Go = Zt @ Wo + bo
- _Gf = Zt @ Wf + bf
- _Gu = Zt @ Wu + bu
- _Gc = Zt @ Wc + bc
-
- # compute gradients wrt the *input* to each gate
- dGot = dA * self.act_fn(Ct) * self.gate_fn.grad(_Go)
- dCct = dC * Gut * self.act_fn.grad(_Gc)
- dGut = dC * Cct * self.gate_fn.grad(_Gu)
- dGft = dC * C_prev * self.gate_fn.grad(_Gf)
-
- dZ = dGft @ Wf.T + dGut @ Wu.T + dCct @ Wc.T + dGot @ Wo.T
- dXt = dZ[:, self.n_out :]
-
- self.gradients["Wc"] += Zt.T @ dCct
- self.gradients["Wu"] += Zt.T @ dGut
- self.gradients["Wf"] += Zt.T @ dGft
- self.gradients["Wo"] += Zt.T @ dGot
- self.gradients["bo"] += dGot.sum(axis=0, keepdims=True)
- self.gradients["bu"] += dGut.sum(axis=0, keepdims=True)
- self.gradients["bf"] += dGft.sum(axis=0, keepdims=True)
- self.gradients["bc"] += dCct.sum(axis=0, keepdims=True)
-
- self.derived_variables["dLdA_accumulator"] = dZ[:, : self.n_out]
- self.derived_variables["dLdC_accumulator"] = Gft * dC
- return dXt
-
- def flush_gradients(self):
- """Erase all the layer's derived variables and gradients."""
- assert self.trainable, "Layer is frozen"
-
- self.X = []
- for k, v in self.derived_variables.items():
- self.derived_variables[k] = []
-
- self.derived_variables["n_timesteps"] = 0
- self.derived_variables["current_step"] = 0
-
- # reset parameter gradients to 0
- for k, v in self.parameters.items():
- self.gradients[k] = np.zeros_like(v)
-
-
-class RNN(LayerBase):
- def __init__(self, n_out, act_fn="Tanh", kernel_initializer="glorot_uniform", name=None):
- """
- A single vanilla (Elman)-RNN layer.
-
- Parameters
- ----------
- n_out : int
- The dimension of a single hidden state / output on a given
- timestep.
- act_fn : str, :doc:`Activation ` object, or None
- The activation function for computing ``A[t]``. Default is
- `'Tanh'`.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.kernel_initializer = kernel_initializer
- self.n_in = None
- self.n_out = n_out
- self.n_timesteps = None
- self.act_fn = ActivationInitializer(act_fn)()
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self):
- self.cell = RNNCell(
- n_in=self.n_in,
- n_out=self.n_out,
- act_fn=self.act_fn,
- kernel_initializer=self.kernel_initializer,
- )
- self.cell.set_optimizer() # FIXME
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "RNN",
- "kernel_initializer": self.kernel_initializer,
- "n_in": self.n_in,
- "n_out": self.n_out,
- "act_fn": str(self.act_fn),
- "optimizer": self.cell.hyperparameters["optimizer"],
- }
-
- def forward(self, X):
- """
- Run a forward pass across all timesteps in the input.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in, n_t)`
- Input consisting of `n_ex` examples each of dimensionality `n_in`
- and extending for `n_t` timesteps.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, n_out, n_t)`
- The value of the hidden state for each of the `n_ex` examples
- across each of the `n_t` timesteps.
- """
- if not self.is_initialized:
- self.n_in = X.shape[1]
- self._init_params()
-
- Y = []
- n_ex, n_in, n_t = X.shape
- for t in range(n_t):
- yt = self.cell.forward(X[:, :, t])
- Y.append(yt)
- return np.dstack(Y)
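An illustrative usage sketch for the wrapper above, assuming the classes in this module are importable and the cell's optimizer wiring (see the FIXME in `_init_params`) is in place:

import numpy as np

rnn = RNN(n_out=8)
X = np.random.randn(16, 4, 3)          # 16 examples, 4 features, 3 timesteps
Y = rnn.forward(X)                     # expected shape: (16, 8, 3)
dX = rnn.backward(np.ones_like(Y))     # expected shape: (16, 4, 3)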
-
- def backward(self, dLdA):
- """
- Run a backward pass across all timesteps in the input.
-
- Parameters
- ----------
- dLdA : :py:class:`ndarray ` of shape `(n_ex, n_out, n_t)`
- The gradient of the loss with respect to the layer output for each
- of the `n_ex` examples across all `n_t` timesteps.
-
- Returns
- -------
- dLdX : :py:class:`ndarray ` of shape `(n_ex, n_in, n_t)`
- The gradient of the loss with respect to the layer input for each
- of the `n_ex` examples across all `n_t` timesteps.
- """
- assert self.cell.trainable, "Layer is frozen"
- dLdX = []
- n_ex, n_out, n_t = dLdA.shape
- for t in reversed(range(n_t)):
- dLdXt = self.cell.backward(dLdA[:, :, t])
- dLdX.insert(0, dLdXt)
- dLdX = np.dstack(dLdX)
- return dLdX
-
- @property
- def derived_variables(self):
- """
- Return a dictionary containing any intermediate variables computed
- during the forward / backward passes.
- """
- return self.cell.derived_variables
-
- @property
- def gradients(self):
- """
- Return a dictionary of the gradients computed during the backward
- pass
- """
- return self.cell.gradients
-
- @property
- def parameters(self):
- """Return a dictionary of the current layer parameters"""
- return self.cell.parameters
-
- def set_params(self, summary_dict):
- """
- Set the layer parameters from a dictionary of values.
-
- Parameters
- ----------
- summary_dict : dict
- A dictionary of layer parameters and hyperparameters. If a required
- parameter or hyperparameter is not included within `summary_dict`,
- this method will use the value in the current layer's
- :meth:`summary` method.
-
- Returns
- -------
- layer : :doc:`Layer ` object
- The newly-initialized layer.
- """
- self = super().set_params(summary_dict)
- return self.cell.set_params(summary_dict)
-
- def freeze(self):
- """
- Freeze the layer parameters at their current values so they can no
- longer be updated.
- """
- self.cell.freeze()
-
- def unfreeze(self):
- """Unfreeze the layer parameters so they can be updated."""
- self.cell.unfreeze()
-
- def flush_gradients(self):
- """Erase all the layer's derived variables and gradients."""
- self.cell.flush_gradients()
-
- def update(self):
- """
- Update the layer parameters using the accrued gradients and layer
- optimizer. Flush all gradients once the update is complete.
- """
- self.cell.update()
- self.flush_gradients()
-
-
-class LSTM(LayerBase):
- def __init__(
- self,
- n_out,
- act_fn="Tanh",
- gate_fn="Sigmoid",
- kernel_initializer="glorot_uniform",
- name=None,
- ):
- """
- A single long short-term memory (LSTM) RNN layer.
-
- Parameters
- ----------
- n_out : int
- The dimension of a single hidden state / output on a given timestep.
- act_fn : str, :doc:`Activation ` object, or None
- The activation function for computing ``A[t]``. Default is `'Tanh'`.
- gate_fn : str, :doc:`Activation ` object, or None
- The gate function for computing the update, forget, and output
- gates. Default is `'Sigmoid'`.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is `'glorot_uniform'`.
- """ # noqa: E501
- super().__init__(name=name)
-
- self.kernel_initializer = kernel_initializer
- self.n_in = None
- self.n_out = n_out
- self.n_timesteps = None
- self.act_fn = ActivationInitializer(act_fn)()
- self.gate_fn = ActivationInitializer(gate_fn)()
- self.is_initialized = False
- self.weights_set = False
-
- def _init_params(self):
- self.cell = LSTMCell(
- n_in=self.n_in,
- n_out=self.n_out,
- act_fn=self.act_fn,
- gate_fn=self.gate_fn,
- kernel_initializer=self.kernel_initializer,
- )
- ## FIXME: does LSTMCell need optimizer?
- self.is_initialized = True
- self.weights_set = True
-
- @property
- def hyperparameters(self):
- """Return a dictionary containing the layer hyperparameters."""
- return {
- "layer": "LSTM",
- "kernel_initializer": self.kernel_initializer,
- "n_in": self.n_in,
- "n_out": self.n_out,
- "act_fn": str(self.act_fn),
- "gate_fn": str(self.gate_fn),
- "optimizer": self.cell.hyperparameters["optimizer"],
- }
-
- def forward(self, X):
- """
- Run a forward pass across all timesteps in the input.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in, n_t)`
- Input consisting of `n_ex` examples each of dimensionality `n_in`
- and extending for `n_t` timesteps.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, n_out, n_t)`
- The value of the hidden state for each of the `n_ex` examples
- across each of the `n_t` timesteps.
- """
- if not self.is_initialized:
- self.n_in = X.shape[1]
- self._init_params()
-
- Y = []
- n_ex, n_in, n_t = X.shape
- for t in range(n_t):
- yt, _ = self.cell.forward(X[:, :, t])
- Y.append(yt)
- return np.dstack(Y)
-
- def backward(self, dLdA):
- """
- Run a backward pass across all timesteps in the input.
-
- Parameters
- ----------
- dLdA : :py:class:`ndarray ` of shape `(n_ex, n_out, n_t)`
- The gradient of the loss with respect to the layer output for each
- of the `n_ex` examples across all `n_t` timesteps.
-
- Returns
- -------
- dLdX : :py:class:`ndarray ` of shape (`n_ex`, `n_in`, `n_t`)
- The gradient of the loss with respect to the layer input for each
- of the `n_ex` examples across all `n_t` timesteps.
- """ # noqa: E501
- assert self.cell.trainable, "Layer is frozen"
- dLdX = []
- n_ex, n_out, n_t = dLdA.shape
- for t in reversed(range(n_t)):
- dLdXt = self.cell.backward(dLdA[:, :, t])
- dLdX.insert(0, dLdXt)
- dLdX = np.dstack(dLdX)
- return dLdX
-
- @property
- def derived_variables(self):
- """
- Return a dictionary containing any intermediate variables computed
- during the forward / backward passes.
- """
- return self.cell.derived_variables
-
- @property
- def gradients(self):
- """
- Return a dictionary of the gradients computed during the backward
- pass
- """
- return self.cell.gradients
-
- @property
- def parameters(self):
- """Return a dictionary of the current layer parameters"""
- return self.cell.parameters
-
- def freeze(self):
- """
- Freeze the layer parameters at their current values so they can no
- longer be updated.
- """
- self.cell.freeze()
-
- def unfreeze(self):
- """Unfreeze the layer parameters so they can be updated."""
- self.cell.unfreeze()
-
- def set_params(self, summary_dict):
- """
- Set the layer parameters from a dictionary of values.
-
- Parameters
- ----------
- summary_dict : dict
- A dictionary of layer parameters and hyperparameters. If a required
- parameter or hyperparameter is not included within `summary_dict`,
- this method will use the value in the current layer's
- :meth:`summary` method.
-
- Returns
- -------
- layer : :doc:`Layer ` object
- The newly-initialized layer.
- """
- self = super().set_params(summary_dict)
- return self.cell.set_params(summary_dict)
-
- def flush_gradients(self):
- """Erase all the layer's derived variables and gradients."""
- self.cell.flush_gradients()
-
- def update(self):
- """
- Update the layer parameters using the accrued gradients and layer
- optimizer. Flush all gradients once the update is complete.
- """
- self.cell.update()
- self.flush_gradients()
diff --git a/aitk/keras/losses/README.md b/aitk/keras/losses/README.md
deleted file mode 100644
index 59e1008..0000000
--- a/aitk/keras/losses/README.md
+++ /dev/null
@@ -1,10 +0,0 @@
-# Losses
-
-The `losses.py` module implements several common loss functions, including:
-
-- Squared error
-- Cross-entropy
-- Variational lower-bound for binary VAE ([Kingma & Welling, 2014](https://arxiv.org/abs/1312.6114))
-- WGAN-GP loss for generator and critic ([Gulrajani et al., 2017](https://arxiv.org/pdf/1704.00028.pdf))
-- Noise contrastive estimation (NCE) loss ([Gutmann &
- Hyvärinen, 2010](https://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann10AISTATS.pdf); [Mnih & Teh, 2012](https://www.cs.toronto.edu/~amnih/papers/ncelm.pdf))
diff --git a/aitk/keras/losses/__init__.py b/aitk/keras/losses/__init__.py
deleted file mode 100644
index 908ff51..0000000
--- a/aitk/keras/losses/__init__.py
+++ /dev/null
@@ -1,8 +0,0 @@
-"""
-Common neural network loss functions.
-
-This module implements loss objects that can be used during neural network
-training.
-"""
-
-from .losses import *
diff --git a/aitk/keras/losses/losses.py b/aitk/keras/losses/losses.py
deleted file mode 100644
index 23f7fc8..0000000
--- a/aitk/keras/losses/losses.py
+++ /dev/null
@@ -1,946 +0,0 @@
-from abc import ABC, abstractmethod
-
-import numpy as np
-
-from ..numpy_ml_utils.testing import is_binary, is_stochastic
-from ..initializers import (
- WeightInitializer,
- ActivationInitializer,
- OptimizerInitializer,
-)
-
-
-class ObjectiveBase(ABC):
- def __init__(self):
- super().__init__()
- self.name = "base_loss"
-
- @abstractmethod
- def loss(self, y_true, y_pred):
- pass
-
- @abstractmethod
- def grad(self, y_true, y_pred, **kwargs):
- pass
-
-
-class MeanSquaredError(ObjectiveBase):
- def __init__(self):
- super().__init__()
- self.name = "mean_squared_error"
-
- def loss(self, y, y_pred):
- squared_error = np.square(y_pred - y)
- mse = np.mean(squared_error)
- return mse
-
- def __call__(self, y, y_pred):
- return self.loss(y, y_pred)
-
- def grad(self, y, y_pred):
- return 2 * (y_pred - y)
-
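-# A minimal usage sketch for `MeanSquaredError` (illustrative values only;
-# note that `grad` returns the unaveraged elementwise gradient):
-#
-#   mse = MeanSquaredError()
-#   y, y_pred = np.array([[1.0, 0.0]]), np.array([[0.9, 0.2]])
-#   mse(y, y_pred)       # mean([0.01, 0.04]) = 0.025
-#   mse.grad(y, y_pred)  # 2 * (y_pred - y) ~= array([[-0.2,  0.4]])
-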
-class SquaredError(ObjectiveBase):
- def __init__(self):
- r"""
- A squared-error / `L2` loss.
-
- Notes
- -----
- For real-valued target **y** and predictions :math:`\hat{\mathbf{y}}`, the
- squared error is
-
- .. math::
- \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})
- = 0.5 ||\hat{\mathbf{y}} - \mathbf{y}||_2^2
- """
- super().__init__()
- self.name = "squared_error"
-
- def __call__(self, y, y_pred):
- return self.loss(y, y_pred)
-
- def __str__(self):
- return "SquaredError"
-
- @staticmethod
- def loss(y, y_pred):
- """
- Compute the squared error between `y` and `y_pred`.
-
- Parameters
- ----------
- y : :py:class:`ndarray ` of shape (n, m)
- Ground truth values for each of `n` examples
- y_pred : :py:class:`ndarray ` of shape (n, m)
- Predictions for the `n` examples in the batch.
-
- Returns
- -------
- loss : float
- The sum of the squared error across dimensions and examples.
- """
- return 0.5 * np.linalg.norm(y_pred - y) ** 2
-
- @staticmethod
- def grad(y, y_pred, z, act_fn):
- r"""
- Gradient of the squared error loss with respect to the pre-nonlinearity
- input, `z`.
-
- Notes
- -----
- The current method computes the gradient :math:`\\frac{\partial
- \mathcal{L}}{\partial \mathbf{z}}`, where
-
- .. math::
-
- \mathcal{L}(\mathbf{z})
- &= \\text{squared_error}(\mathbf{y}, g(\mathbf{z})) \\\\
- g(\mathbf{z})
- &= \\text{act_fn}(\mathbf{z})
-
- The gradient with respect to :math:`\mathbf{z}` is then
-
- .. math::
-
- \\frac{\partial \mathcal{L}}{\partial \mathbf{z}}
- = (g(\mathbf{z}) - \mathbf{y}) \left(
- \\frac{\partial g}{\partial \mathbf{z}} \\right)
-
- Parameters
- ----------
- y : :py:class:`ndarray ` of shape (n, m)
- Ground truth values for each of `n` examples.
- y_pred : :py:class:`ndarray ` of shape (n, m)
- Predictions for the `n` examples in the batch.
- act_fn : :doc:`Activation ` object
- The activation function for the output layer of the network.
-
- Returns
- -------
- grad : :py:class:`ndarray ` of shape (n, m)
- The gradient of the squared error loss with respect to `z`.
- """
- return (y_pred - y) * act_fn.grad(z)
-
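-# Worked sketch for `SquaredError.grad`, assuming the identity-like `Affine`
-# activation from ..activations: `act_fn.grad(z)` is 1 everywhere, so the
-# gradient reduces to `y_pred - y`:
-#
-#   act = Affine(slope=1, intercept=0)
-#   z = np.array([[0.4, -0.2]])
-#   SquaredError.grad(np.zeros((1, 2)), act(z), z, act)  # array([[ 0.4, -0.2]])
-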
-
-class CrossEntropy(ObjectiveBase):
- def __init__(self):
- r"""
- A cross-entropy loss.
-
- Notes
- -----
- For a one-hot target **y** and predicted class probabilities
- :math:`\hat{\mathbf{y}}`, the cross entropy is
-
- .. math::
- \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})
- = -\sum_i y_i \log \hat{y}_i
- """
- super().__init__()
- self.name = "cross_entropy"
-
- def __call__(self, y, y_pred):
- return self.loss(y, y_pred)
-
- def __str__(self):
- return "CrossEntropy"
-
- @staticmethod
- def loss(y, y_pred):
- """
- Compute the cross-entropy (log) loss.
-
- Notes
- -----
- This method returns the sum (not the average!) of the losses for each
- sample.
-
- Parameters
- ----------
- y : :py:class:`ndarray ` of shape (n, m)
- Class labels (one-hot with `m` possible classes) for each of `n`
- examples.
- y_pred : :py:class:`ndarray ` of shape (n, m)
- Probabilities of each of `m` classes for the `n` examples in the
- batch.
-
- Returns
- -------
- loss : float
- The sum of the cross-entropy across classes and examples.
- """
- is_binary(y)
- is_stochastic(y_pred)
-
- # prevent taking the log of 0
- eps = np.finfo(float).eps
-
- # each example is associated with a single class; sum the negative log
- # probability of the correct label over all samples in the batch.
- # observe that we are taking advantage of the fact that y is one-hot
- # encoded
- cross_entropy = -np.sum(y * np.log(y_pred + eps))
- return cross_entropy
-
- @staticmethod
- def grad(y, y_pred):
- r"""
- Compute the gradient of the cross entropy loss with regard to the
- softmax input, `z`.
-
- Notes
- -----
- The gradient for this method goes through both the cross-entropy loss
- AND the softmax non-linearity to return :math:`\\frac{\partial
- \mathcal{L}}{\partial \mathbf{z}}` (rather than :math:`\\frac{\partial
- \mathcal{L}}{\partial \\text{softmax}(\mathbf{z})}`).
-
- In particular, let:
-
- .. math::
-
- \mathcal{L}(\mathbf{z})
- = \\text{cross_entropy}(\\text{softmax}(\mathbf{z})).
-
- The current method computes:
-
- .. math::
-
- \\frac{\partial \mathcal{L}}{\partial \mathbf{z}}
- &= \\text{softmax}(\mathbf{z}) - \mathbf{y} \\\\
- &= \hat{\mathbf{y}} - \mathbf{y}
-
- Parameters
- ----------
- y : :py:class:`ndarray ` of shape `(n, m)`
- A one-hot encoding of the true class labels. Each row constitutes a
- training example, and each column is a different class.
- y_pred: :py:class:`ndarray ` of shape `(n, m)`
- The network predictions for the probability of each of `m` class
- labels on each of `n` examples in a batch.
-
- Returns
- -------
- grad : :py:class:`ndarray ` of shape (n, m)
- The gradient of the cross-entropy loss with respect to the *input*
- to the softmax function.
- """
- is_binary(y)
- is_stochastic(y_pred)
-
- # derivative of xe wrt z is y_pred - y_true, hence we can just
- # subtract 1 from the probability of the correct class labels
- grad = y_pred - y
-
- # [optional] scale the gradients by the number of examples in the batch
- # n, m = y.shape
- # grad /= n
- return grad
-
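-# Usage sketch for `CrossEntropy` with a one-hot target and a stochastic
-# prediction vector (values are illustrative):
-#
-#   ce = CrossEntropy()
-#   y = np.array([[0.0, 1.0, 0.0]])
-#   y_pred = np.array([[0.2, 0.5, 0.3]])
-#   ce(y, y_pred)       # -log(0.5) ~= 0.693
-#   ce.grad(y, y_pred)  # y_pred - y = array([[ 0.2, -0.5,  0.3]])
-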
-
-class VAELoss(ObjectiveBase):
- def __init__(self):
- r"""
- The variational lower bound for a variational autoencoder with Bernoulli
- units.
-
- Notes
- -----
- The VLB is the sum of the binary cross entropy between the true input and
- the predicted output (the "reconstruction loss") and the KL divergence
- between the learned variational distribution :math:`q` and the prior,
- :math:`p`, assumed to be a unit Gaussian.
-
- .. math::
-
- \\text{VAELoss} =
- \\text{cross_entropy}(\mathbf{y}, \hat{\mathbf{y}})
- + \\mathbb{KL}[q \ || \ p]
-
- where :math:`\mathbb{KL}[q \ || \ p]` is the Kullback-Leibler
- divergence between the distributions :math:`q` and :math:`p`.
-
- References
- ----------
- .. [1] Kingma, D. P. & Welling, M. (2014). "Auto-encoding variational Bayes".
- *arXiv preprint arXiv:1312.6114.* https://arxiv.org/pdf/1312.6114.pdf
- """
- super().__init__()
- self.name = "vae_loss"
-
- def __call__(self, y, y_pred, t_mean, t_log_var):
- return self.loss(y, y_pred, t_mean, t_log_var)
-
- def __str__(self):
- return "VAELoss"
-
- @staticmethod
- def loss(y, y_pred, t_mean, t_log_var):
- r"""
- Variational lower bound for a Bernoulli VAE.
-
- Parameters
- ----------
- y : :py:class:`ndarray ` of shape `(n_ex, N)`
- The original images.
- y_pred : :py:class:`ndarray ` of shape `(n_ex, N)`
- The VAE reconstruction of the images.
- t_mean: :py:class:`ndarray ` of shape `(n_ex, T)`
- Mean of the variational distribution :math:`q(t \mid x)`.
- t_log_var: :py:class:`ndarray ` of shape `(n_ex, T)`
- Log of the variance vector of the variational distribution
- :math:`q(t \mid x)`.
-
- Returns
- -------
- loss : float
- The VLB, averaged across the batch.
- """
- # prevent nan on log(0)
- eps = np.finfo(float).eps
- y_pred = np.clip(y_pred, eps, 1 - eps)
-
- # reconstruction loss: binary cross-entropy
- rec_loss = -np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred), axis=1)
-
- # KL divergence between the variational distribution q and the prior p,
- # a unit gaussian
- kl_loss = -0.5 * np.sum(1 + t_log_var - t_mean ** 2 - np.exp(t_log_var), axis=1)
- loss = np.mean(kl_loss + rec_loss)
- return loss
-
- @staticmethod
- def grad(y, y_pred, t_mean, t_log_var):
- """
- Compute the gradient of the VLB with regard to the network parameters.
-
- Parameters
- ----------
- y : :py:class:`ndarray ` of shape `(n_ex, N)`
- The original images.
- y_pred : :py:class:`ndarray ` of shape `(n_ex, N)`
- The VAE reconstruction of the images.
- t_mean: :py:class:`ndarray ` of shape `(n_ex, T)`
- Mean of the variational distribution :math:`q(t | x)`.
- t_log_var: :py:class:`ndarray ` of shape `(n_ex, T)`
- Log of the variance vector of the variational distribution
- :math:`q(t | x)`.
-
- Returns
- -------
- dY_pred : :py:class:`ndarray ` of shape `(n_ex, N)`
- The gradient of the VLB with regard to `y_pred`.
- dLogVar : :py:class:`ndarray ` of shape `(n_ex, T)`
- The gradient of the VLB with regard to `t_log_var`.
- dMean : :py:class:`ndarray ` of shape `(n_ex, T)`
- The gradient of the VLB with regard to `t_mean`.
- """
- N = y.shape[0]
- eps = np.finfo(float).eps
- y_pred = np.clip(y_pred, eps, 1 - eps)
-
- dY_pred = -y / (N * y_pred) - (y - 1) / (N - N * y_pred)
- dLogVar = (np.exp(t_log_var) - 1) / (2 * N)
- dMean = t_mean / N
- return dY_pred, dLogVar, dMean
-
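-# Sanity-check sketch for `VAELoss.loss`: with a perfect reconstruction and a
-# posterior that matches the unit Gaussian prior (t_mean = t_log_var = 0),
-# both the KL term and the reconstruction term vanish (up to the clipping
-# epsilon):
-#
-#   y = y_pred = np.array([[1.0, 0.0, 1.0]])
-#   t_mean = t_log_var = np.zeros((1, 2))
-#   VAELoss.loss(y, y_pred, t_mean, t_log_var)  # ~= 0.0
-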
-
-class WGAN_GPLoss(ObjectiveBase):
- def __init__(self, lambda_=10):
- r"""
- The loss function for a Wasserstein GAN [*]_ [*]_ with gradient penalty.
-
- Notes
- -----
- Assuming an optimal critic, minimizing this quantity wrt. the generator
- parameters corresponds to minimizing the Wasserstein-1 (earth-mover)
- distance between the fake and real data distributions.
-
- The formula for the WGAN-GP critic loss is
-
- .. math::
-
- \\text{WGANLoss}
- &= \sum_{x \in X_{real}} p(x) D(x)
- - \sum_{x' \in X_{fake}} p(x') D(x') \\\\
- \\text{WGANLossGP}
- &= \\text{WGANLoss} + \lambda
- (||\\nabla_{X_{interp}} D(X_{interp})||_2 - 1)^2
-
- where
-
- .. math::
-
- X_{fake} &= \\text{Generator}(\mathbf{z}) \\\\
- X_{interp} &= \\alpha X_{real} + (1 - \\alpha) X_{fake} \\\\
-
- and
-
- .. math::
-
- \mathbf{z} &\sim \mathcal{N}(0, \mathbb{1}) \\\\
- \\alpha &\sim \\text{Uniform}(0, 1)
-
- References
- ----------
- .. [*] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., &
- Courville, A. (2017) "Improved training of Wasserstein GANs"
- *Advances in Neural Information Processing Systems, 31*: 5769-5779.
- .. [*] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B.,
- Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014)
- "Generative adversarial nets" *Advances in Neural Information
- Processing Systems, 27*: 2672-2680.
-
- Parameters
- ----------
- lambda_ : float
- The gradient penalty coefficient. Default is 10.
- """
- self.lambda_ = lambda_
- super().__init__()
- self.name = "wgan_gp_loss"
-
- def __call__(self, Y_fake, module, Y_real=None, gradInterp=None):
- """
- Computes the generator and critic loss using the WGAN-GP value
- function.
-
- Parameters
- ----------
- Y_fake : :py:class:`ndarray ` of shape `(n_ex,)`
- The output of the critic for `X_fake`.
- module : {'C', 'G'}
- Whether to calculate the loss for the critic ('C') or the generator
- ('G'). If calculating loss for the critic, `Y_real` and
- `gradInterp` must not be None.
- Y_real : :py:class:`ndarray ` of shape `(n_ex,)`, or None
- The output of the critic for `X_real`. Default is None.
- gradInterp : :py:class:`ndarray ` of shape `(n_ex, n_feats)`, or None
- The gradient of the critic output for `X_interp` wrt. `X_interp`.
- Default is None.
-
- Returns
- -------
- loss : float
- Depending on the setting for `module`, either the critic or
- generator loss, averaged over examples in the minibatch.
- """
- return self.loss(Y_fake, module, Y_real=Y_real, gradInterp=gradInterp)
-
- def __str__(self):
- return "WGANLossGP(lambda_={})".format(self.lambda_)
-
- def loss(self, Y_fake, module, Y_real=None, gradInterp=None):
- """
- Computes the generator and critic loss using the WGAN-GP value
- function.
-
- Parameters
- ----------
- Y_fake : :py:class:`ndarray ` of shape (n_ex,)
- The output of the critic for `X_fake`.
- module : {'C', 'G'}
- Whether to calculate the loss for the critic ('C') or the generator
- ('G'). If calculating loss for the critic, `Y_real` and
- `gradInterp` must not be None.
- Y_real : :py:class:`ndarray ` of shape `(n_ex,)` or None
- The output of the critic for `X_real`. Default is None.
- gradInterp : :py:class:`ndarray ` of shape `(n_ex, n_feats)` or None
- The gradient of the critic output for `X_interp` wrt. `X_interp`.
- Default is None.
-
- Returns
- -------
- loss : float
- Depending on the setting for `module`, either the critic or
- generator loss, averaged over examples in the minibatch.
- """
- # calc critic loss including gradient penalty
- if module == "C":
- X_interp_norm = np.linalg.norm(gradInterp, axis=1, keepdims=True)
- gradient_penalty = (X_interp_norm - 1) ** 2
- loss = (
- Y_fake.mean() - Y_real.mean() + self.lambda_ * gradient_penalty.mean()
- )
-
- # calc generator loss
- elif module == "G":
- loss = -Y_fake.mean()
-
- else:
- raise ValueError("Unrecognized module: {}".format(module))
-
- return loss
-
- def grad(self, Y_fake, module, Y_real=None, gradInterp=None):
- """
- Computes the gradient of the generator or critic loss with regard to
- its inputs.
-
- Parameters
- ----------
- Y_fake : :py:class:`ndarray ` of shape `(n_ex,)`
- The output of the critic for `X_fake`.
- module : {'C', 'G'}
- Whether to calculate the gradient for the critic loss ('C') or the
- generator loss ('G'). If calculating grads for the critic, `Y_real`
- and `gradInterp` must not be None.
- Y_real : :py:class:`ndarray ` of shape `(n_ex,)` or None
- The output of the critic for `X_real`. Default is None.
- gradInterp : :py:class:`ndarray ` of shape `(n_ex, n_feats)` or None
- The gradient of the critic output on `X_interp` wrt. `X_interp`.
- Default is None.
-
- Returns
- -------
- grads : tuple
- If `module` == 'C', returns a 3-tuple containing the gradient of
- the critic loss with regard to (`Y_fake`, `Y_real`, `gradInterp`).
- If `module` == 'G', returns the gradient of the generator with
- regard to `Y_fake`.
- """
- eps = np.finfo(float).eps
- n_ex_fake = Y_fake.shape[0]
-
- # calc gradient of the critic loss
- if module == "C":
- n_ex_real = Y_real.shape[0]
-
- dY_fake = -1 / n_ex_fake * np.ones_like(Y_fake)
- dY_real = 1 / n_ex_real * np.ones_like(Y_real)
-
- # differentiate through gradient penalty
- X_interp_norm = np.linalg.norm(gradInterp, axis=1, keepdims=True) + eps
-
- dGradInterp = (
- (2 / n_ex_fake)
- * self.lambda_
- * (X_interp_norm - 1)
- * (gradInterp / X_interp_norm)
- )
- grad = (dY_fake, dY_real, dGradInterp)
-
- # calc gradient of the generator loss
- elif module == "G":
- grad = -1 / n_ex_fake * np.ones_like(Y_fake)
-
- else:
- raise ValueError("Unrecognized module: {}".format(module))
- return grad
-
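-# Sketch of the two modes of `WGAN_GPLoss` (illustrative values; `gradInterp`
-# stands in for the critic gradient at the interpolated points):
-#
-#   wgan = WGAN_GPLoss(lambda_=10)
-#   Y_fake, Y_real = np.array([0.2, -0.1]), np.array([1.0, 0.8])
-#   gradInterp = np.ones((2, 4))  # per-row L2 norm is 2
-#   wgan(Y_fake, "C", Y_real=Y_real, gradInterp=gradInterp)
-#   # = mean(Y_fake) - mean(Y_real) + 10 * mean((2 - 1)**2) = 9.15
-#   wgan(Y_fake, "G")  # = -mean(Y_fake) = -0.05
-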
-
-class NCELoss(ObjectiveBase):
- """
- """
-
- def __init__(
- self,
- n_classes,
- noise_sampler,
- num_negative_samples,
- optimizer=None,
- init="glorot_uniform",
- subtract_log_label_prob=True,
- ):
- r"""
- A noise contrastive estimation (NCE) loss function.
-
- Notes
- -----
- Noise contrastive estimation is a candidate sampling method often
- used to reduce the computational challenge of training a softmax
- layer on problems with a large number of output classes. It proceeds by
- training a logistic regression model to discriminate between samples
- from the true data distribution and samples from an artificial noise
- distribution.
-
- It can be shown that as the ratio of negative samples to data samples
- goes to infinity, the gradient of the NCE loss converges to the
- original softmax gradient.
-
- For input data **X**, target labels `targets`, loss parameters **W** and
- **b**, and noise samples `noise` sampled from the noise distribution `Q`,
- the NCE loss is
-
- .. math::
-
- \\text{NCE}(X, targets) =
- \\text{cross_entropy}(\mathbf{y}_{targets}, \hat{\mathbf{y}}_{targets}) +
- \\text{cross_entropy}(\mathbf{y}_{noise}, \hat{\mathbf{y}}_{noise})
-
- where
-
- .. math::
-
- \hat{\mathbf{y}}_{targets}
- &= \sigma(\mathbf{W}[targets] \mathbf{X} + \mathbf{b}[targets] - \log Q(targets)) \\\\
- \hat{\mathbf{y}}_{noise}
- &= \sigma(\mathbf{W}[noise] \mathbf{X} + \mathbf{b}[noise] - \log Q(noise))
-
- In the above equations, :math:`\sigma` is the logistic sigmoid
- function, and :math:`Q(x)` corresponds to the probability of the values
- in `x` under `Q`.
-
- References
- ----------
- .. [1] Gutmann, M. & Hyvarinen, A. (2010). Noise-contrastive
- estimation: A new estimation principle for unnormalized statistical
- models. *AISTATS, 13*: 297-304.
- .. [2] Mnih, A. & Teh, Y. W. (2012). A fast and simple algorithm for
- training neural probabilistic language models. *ICML, 29*: 1751-1758.
-
- Parameters
- ----------
- n_classes : int
- The total number of output classes in the model.
- noise_sampler : :class:`~numpy_ml.utils.data_structures.DiscreteSampler` instance
- The negative sampler. Defines a distribution over all classes in
- the dataset.
- num_negative_samples : int
- The number of negative samples to draw for each target / batch of
- targets.
- init : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is 'glorot_uniform'.
- optimizer : str, :doc:`Optimizer ` object, or None
- The optimization strategy to use when performing gradient updates
- within the :meth:`update` method. If None, use the :class:`SGD
- ` optimizer with
- default parameters. Default is None.
- subtract_log_label_prob : bool
- Whether to subtract the log of the probability of each label under
- the noise distribution from its respective logit. Set to False for
- negative sampling, True for NCE. Default is True.
-
- Attributes
- ----------
- gradients : dict
- The accumulated parameter gradients.
- parameters: dict
- The loss parameter values.
- hyperparameters: dict
- The loss hyperparameter values.
- derived_variables: dict
- Useful intermediate values computed during the loss computation.
- """
- super().__init__()
- self.name = "nce_loss"
-
- self.init = init
- self.n_in = None
- self.trainable = True
- self.n_classes = n_classes
- self.noise_sampler = noise_sampler
- self.num_negative_samples = num_negative_samples
- self.act_fn = ActivationInitializer("Sigmoid")()
- self.optimizer = OptimizerInitializer(optimizer)()
- self.subtract_log_label_prob = subtract_log_label_prob
-
- self.is_initialized = False
-
- def _init_params(self):
- init_weights = WeightInitializer(str(self.act_fn), mode=self.init)
-
- self.X = []
- b = np.zeros((1, self.n_classes))
- W = init_weights((self.n_classes, self.n_in))
-
- self.parameters = {"W": W, "b": b}
-
- self.gradients = {"W": np.zeros_like(W), "b": np.zeros_like(b)}
-
- self.derived_variables = {
- "y_pred": [],
- "target": [],
- "true_w": [],
- "true_b": [],
- "sampled_b": [],
- "sampled_w": [],
- "out_labels": [],
- "target_logits": [],
- "noise_samples": [],
- "noise_logits": [],
- }
-
- self.is_initialized = True
-
- @property
- def hyperparameters(self):
- return {
- "id": "NCELoss",
- "n_in": self.n_in,
- "init": self.init,
- "n_classes": self.n_classes,
- "noise_sampler": self.noise_sampler,
- "num_negative_samples": self.num_negative_samples,
- "subtract_log_label_prob": self.subtract_log_label_prob,
- "optimizer": {
- "cache": self.optimizer.cache,
- "hyperparameters": self.optimizer.hyperparameters,
- },
- }
-
- def __call__(self, target, X, neg_samples=None, retain_derived=True):
- return self.loss(target, X, neg_samples, retain_derived)
-
- def __str__(self):
- keys = [
- "{}={}".format(k, v)
- for k, v in self.hyperparameters.items()
- if k not in ["id", "optimizer"]
- ] + ["optimizer={}".format(self.optimizer)]
- return "NCELoss({})".format(", ".join(keys))
-
- def freeze(self):
- """
- Freeze the loss parameters at their current values so they can no
- longer be updated.
- """
- self.trainable = False
-
- def unfreeze(self):
- """Unfreeze the layer parameters so they can be updated."""
- self.trainable = True
-
- def flush_gradients(self):
- """Erase all the layer's derived variables and gradients."""
- assert self.trainable, "NCELoss is frozen"
- self.X = []
- for k, v in self.derived_variables.items():
- self.derived_variables[k] = []
-
- for k, v in self.gradients.items():
- self.gradients[k] = np.zeros_like(v)
-
- def update(self, cur_loss=None):
- """
- Update the loss parameters using the accrued gradients and optimizer.
- Flush all gradients once the update is complete.
- """
- assert self.trainable, "NCELoss is frozen"
- self.optimizer.step()
- for k, v in self.gradients.items():
- if k in self.parameters:
- self.parameters[k] = self.optimizer(self.parameters[k], v, k, cur_loss)
- self.flush_gradients()
-
- def loss(self, target, X, neg_samples=None, retain_derived=True):
- """
- Compute the NCE loss for a collection of inputs and associated targets.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_c, n_in)`
- Layer input. A minibatch of `n_ex` examples, where each example is
- an `n_c` by `n_in` matrix (e.g., the matrix of `n_c` context
- embeddings, each of dimensionality `n_in`, for a CBOW model).
- target : :py:class:`ndarray ` of shape `(n_ex,)`
- Integer indices of the target class(es) for each example in the
- minibatch (e.g., the target word id for an example in a CBOW model).
- neg_samples : :py:class:`ndarray ` of shape (`num_negative_samples`,) or None
- An optional array of negative samples to use during the loss
- calculation. These will be used instead of samples drawn from
- ``self.noise_sampler``. Default is None.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through with regard to this input.
- Default is True.
-
- Returns
- -------
- loss : float
- The NCE loss summed over the minibatch and samples.
- y_pred : :py:class:`ndarray ` of shape (`n_ex`, `n_c`)
- The network predictions for the conditional probability of each
- target given each context: entry (`i`, `j`) gives the predicted
- probability of target `i` under context vector `j`.
- """
- if not self.is_initialized:
- self.n_in = X.shape[-1]
- self._init_params()
-
- loss, Z_target, Z_neg, y_pred, y_true, noise_samples = self._loss(
- X, target, neg_samples
- )
-
- # cache derived variables for gradient calculation
- if retain_derived:
- self.X.append(X)
-
- self.derived_variables["y_pred"].append(y_pred)
- self.derived_variables["target"].append(target)
- self.derived_variables["out_labels"].append(y_true)
- self.derived_variables["target_logits"].append(Z_target)
- self.derived_variables["noise_samples"].append(noise_samples)
- self.derived_variables["noise_logits"].append(Z_neg)
-
- return loss, np.squeeze(y_pred[..., :1], -1)
-
- def _loss(self, X, target, neg_samples):
- """Actual computation of NCE loss"""
- fstr = "X must have shape (n_ex, n_c, n_in), but got {} dims instead"
- assert X.ndim == 3, fstr.format(X.ndim)
-
- W = self.parameters["W"]
- b = self.parameters["b"]
-
- # sample negative samples from the noise distribution
- if neg_samples is None:
- neg_samples = self.noise_sampler(self.num_negative_samples)
- assert len(neg_samples) == self.num_negative_samples
-
- # get the probability of the negative sample class and the target
- # class under the noise distribution
- p_neg_samples = self.noise_sampler.probs[neg_samples]
- p_target = np.atleast_2d(self.noise_sampler.probs[target])
-
- # save the noise samples for debugging
- noise_samples = (neg_samples, p_target, p_neg_samples)
-
- # compute the logit for the negative samples and target
- Z_target = X @ W[target].T + b[0, target]
- Z_neg = X @ W[neg_samples].T + b[0, neg_samples]
-
- # subtract the log probability of each label under the noise dist
- if self.subtract_log_label_prob:
- n, m = Z_target.shape[0], Z_neg.shape[0]
- Z_target[range(n), ...] -= np.log(p_target)
- Z_neg[range(m), ...] -= np.log(p_neg_samples)
-
- # only retain the probability of the target under its associated
- # minibatch example
- aa, _, cc = Z_target.shape
- Z_target = Z_target[range(aa), :, range(cc)][..., None]
-
- # p_target = (n_ex, n_c, 1)
- # p_neg = (n_ex, n_c, n_samples)
- pred_p_target = self.act_fn(Z_target)
- pred_p_neg = self.act_fn(Z_neg)
-
- # if we're in evaluation mode, ignore the negative samples - just
- # return the binary cross entropy on the targets
- y_pred = pred_p_target
- if self.trainable:
- # (n_ex, n_c, 1 + n_samples) (target is first column)
- y_pred = np.concatenate((y_pred, pred_p_neg), axis=-1)
-
- n_targets = 1
- y_true = np.zeros_like(y_pred)
- y_true[..., :n_targets] = 1
-
- # binary cross entropy
- eps = np.finfo(float).eps
- np.clip(y_pred, eps, 1 - eps, y_pred)
- loss = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
- return loss, Z_target, Z_neg, y_pred, y_true, noise_samples
-
- def grad(self, retain_grads=True, update_params=True):
- """
- Compute the gradient of the NCE loss with regard to the inputs,
- weights, and biases.
-
- Parameters
- ----------
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
- update_params : bool
- Whether to perform a single step of gradient descent on the layer
- weights and bias using the calculated gradients. If `retain_grads`
- is False, this option is ignored and the parameter gradients are
- not updated. Default is True.
-
- Returns
- -------
- dLdX : :py:class:`ndarray ` of shape (`n_ex`, `n_in`) or list of arrays
- The gradient of the loss with regard to the layer input(s) `X`.
- """
- assert self.trainable, "NCE loss is frozen"
-
- dX = []
- for input_idx, x in enumerate(self.X):
- dx, dw, db = self._grad(x, input_idx)
- dX.append(dx)
-
- if retain_grads:
- self.gradients["W"] += dw
- self.gradients["b"] += db
-
- dX = dX[0] if len(self.X) == 1 else dX
-
- if retain_grads and update_params:
- self.update()
-
- return dX
-
- def _grad(self, X, input_idx):
- """Actual computation of gradient wrt. loss weights + input"""
- W, b = self.parameters["W"], self.parameters["b"]
-
- y_pred = self.derived_variables["y_pred"][input_idx]
- target = self.derived_variables["target"][input_idx]
- y_true = self.derived_variables["out_labels"][input_idx]
- Z_neg = self.derived_variables["noise_logits"][input_idx]
- Z_target = self.derived_variables["target_logits"][input_idx]
- neg_samples = self.derived_variables["noise_samples"][input_idx][0]
-
- # the number of target classes per minibatch example
- n_targets = 1
-
- # calculate the grad of the binary cross entropy wrt. the network
- # predictions
- preds, classes = y_pred.flatten(), y_true.flatten()
-
- dLdp_real = ((1 - classes) / (1 - preds)) - (classes / preds)
- dLdp_real = dLdp_real.reshape(*y_pred.shape)
-
- # partition the gradients into target and negative sample portions
- dLdy_pred_target = dLdp_real[..., :n_targets]
- dLdy_pred_neg = dLdp_real[..., n_targets:]
-
- # compute gradients of the loss wrt the data and noise logits
- dLdZ_target = dLdy_pred_target * self.act_fn.grad(Z_target)
- dLdZ_neg = dLdy_pred_neg * self.act_fn.grad(Z_neg)
-
- # compute param gradients on target + negative samples
- dB_neg = dLdZ_neg.sum(axis=(0, 1))
- dB_target = dLdZ_target.sum(axis=(1, 2))
-
- dW_neg = (dLdZ_neg.transpose(0, 2, 1) @ X).sum(axis=0)
- dW_target = (dLdZ_target.transpose(0, 2, 1) @ X).sum(axis=1)
-
- # TODO: can this be done with np.einsum instead?
- dX_target = np.vstack(
- [dLdZ_target[[ix]] @ W[[t]] for ix, t in enumerate(target)]
- )
- dX_neg = dLdZ_neg @ W[neg_samples]
-
- hits = list(set(target).intersection(set(neg_samples)))
- hit_ixs = [np.where(target == h)[0] for h in hits]
-
- # adjust param gradients if there's an accidental hit
- if len(hits) != 0:
- hit_ixs = np.concatenate(hit_ixs)
- target = np.delete(target, hit_ixs)
- dB_target = np.delete(dB_target, hit_ixs)
- dW_target = np.delete(dW_target, hit_ixs, 0)
-
- dX = dX_target + dX_neg
-
- # use np.add.at to ensure that repeated indices in the target (or
- # possibly in neg_samples if sampling is done with replacement) are
- # properly accounted for
- dB = np.zeros_like(b).flatten()
- np.add.at(dB, target, dB_target)
- np.add.at(dB, neg_samples, dB_neg)
- dB = dB.reshape(*b.shape)
-
- dW = np.zeros_like(W)
- np.add.at(dW, target, dW_target)
- np.add.at(dW, neg_samples, dW_neg)
-
- return dX, dW, dB
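-
-# Usage sketch for `NCELoss` (all names below are placeholders: `sampler` is
-# assumed to be a DiscreteSampler-style object exposing `probs` and drawing
-# samples when called; `X` has shape (n_ex, n_c, n_in) and `target` has
-# shape (n_ex,)):
-#
-#   nce = NCELoss(n_classes=vocab_size, noise_sampler=sampler,
-#                 num_negative_samples=5)
-#   loss, y_pred = nce(target, X)
-#   dLdX = nce.grad(retain_grads=True, update_params=True)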
diff --git a/aitk/keras/metrics.py b/aitk/keras/metrics.py
deleted file mode 100644
index 4bcf51c..0000000
--- a/aitk/keras/metrics.py
+++ /dev/null
@@ -1,71 +0,0 @@
-# -*- coding: utf-8 -*-
-# **************************************************************
-# aitk.keras: A Python Keras model API
-#
-# Copyright (c) 2021 AITK Developers
-#
-# https://github.com/ArtificialIntelligenceToolkit/aitk.keras
-#
-# **************************************************************
-
-"""
-Metrics can be computed as a stateless function:
-
-metric(targets, outputs)
-
-or as a stateful subclass of Metric.
-"""
-
-import numpy as np
-from abc import ABC, abstractmethod
-
-class Metric(ABC):
- def __init__(self, name):
- super().__init__()
- self.name = name
-
- @abstractmethod
- def reset_state(self):
- raise NotImplementedError
-
- @abstractmethod
- def update_state(self, targets, outputs):
- raise NotImplementedError
-
- @abstractmethod
- def result(self):
- raise NotImplementedError
-
- def __str__(self):
- return self.name
-
-class ToleranceAccuracy(Metric):
- def __init__(self, tolerance):
- super().__init__("tolerance_accuracy")
- self.tolerance = tolerance
- self.reset_state()
-
- def reset_state(self):
- self.accurate = 0
- self.total = 0
-
- def update_state(self, targets, outputs):
- results = np.all(
- np.less_equal(np.abs(targets - outputs),
- self.tolerance), axis=-1)
- self.accurate += sum(results)
- self.total += len(results)
-
- def result(self):
- return self.accurate / self.total
-
-def tolerance_accuracy(targets, outputs):
- return np.mean(
- np.all(
- np.less_equal(np.abs(targets - outputs),
- tolerance_accuracy.tolerance),
- axis=-1),
- axis=-1,
- )
-# Needs the tolerance from somewhere:
-tolerance_accuracy.tolerance = 0.1
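-
-# Usage sketch (illustrative values): the stateful form accumulates across
-# batches, the stateless function scores a single batch.
-#
-#   acc = ToleranceAccuracy(tolerance=0.1)
-#   acc.update_state(np.array([[0.0, 1.0]]), np.array([[0.05, 0.98]]))
-#   acc.result()  # 1.0
-#
-#   tolerance_accuracy(np.array([[0.0]]), np.array([[0.3]]))  # 0.0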
diff --git a/aitk/keras/models/README.md b/aitk/keras/models/README.md
deleted file mode 100644
index 1a15ce7..0000000
--- a/aitk/keras/models/README.md
+++ /dev/null
@@ -1,10 +0,0 @@
-# Models
-
-The models module implements popular full neural networks. It includes:
-
-- `vae.py`: A Bernoulli variational autoencoder ([Kingma & Welling, 2014](https://arxiv.org/abs/1312.6114))
-- `wgan_gp.py`: A Wasserstein generative adversarial network with gradient
- penalty ([Gulrajani et al., 2017](https://arxiv.org/pdf/1704.00028.pdf);
-[Goodfellow et al., 2014](https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf))
-- `w2v.py`: word2vec model with CBOW and skip-gram architectures and
- training via noise contrastive estimation ([Mikolov et al., 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf))
diff --git a/aitk/keras/models/__init__.py b/aitk/keras/models/__init__.py
deleted file mode 100644
index af5d12c..0000000
--- a/aitk/keras/models/__init__.py
+++ /dev/null
@@ -1,540 +0,0 @@
-# -*- coding: utf-8 -*-
-# **************************************************************
-# aitk.keras: A Python Keras model API
-#
-# Copyright (c) 2021 AITK Developers
-#
-# https://github.com/ArtificialIntelligenceToolkit/aitk.keras
-#
-# **************************************************************
-
-from ..layers import Input, Activation, Concatenate
-from ..losses import MeanSquaredError, CrossEntropy
-from ..initializers import OptimizerInitializer
-from ..callbacks import History
-from ..utils import topological_sort
-
-import numpy as np
-import time
-import math
-import numbers
-import functools
-import operator
-from collections import defaultdict
-
-LOSS_FUNCTIONS = {
- "mse": MeanSquaredError,
- "mean_squared_error": MeanSquaredError,
- "crossentropy": CrossEntropy,
- # FIXME: add more error functions
-}
-
-NAME_CACHE = {}
-
-def get_metric_name(metric):
- if hasattr(metric, "name"):
- return metric.name
- elif hasattr(metric, "__name__"):
- return metric.__name__
- else:
- return str(metric)
-
-
-class Model():
- def __init__(self, inputs=None, outputs=None, name=None):
- self.stop_training = False
- self.built = False
- self.sequential = False
- self.history = History()
- self.name = self.make_name(name)
- self.layers = []
- self.layer_map = {}
- self._input_layers = None
- self._output_layers = None
- self.step = 0
- # Build a model graph from inputs to outputs:
- if inputs is not None and outputs is not None:
- if not isinstance(outputs, (list, tuple)):
- outputs = [outputs]
- queue = [] if inputs is None else inputs
- if not isinstance(queue, (list, tuple)):
- queue = [queue]
- while len(queue) > 0:
- layer = queue.pop(0)
- if layer not in self.layers:
- if layer.name in self.layer_map:
- raise AttributeError("duplicate layer name: '%s'" % layer.name)
- self.layers.append(layer)
- self.layer_map[layer.name] = layer
- if layer in outputs:
- # Make sure no more layers:
- layer.output_layers = []
- else:
- queue.extend(layer.output_layers)
- self.sequential = self.is_sequential()
- self.build()
-
- def is_sequential(self):
- return ((len(self.get_input_layers()) == 1) and
- (len(self.get_output_layers()) == 1) and
- (not any([isinstance(layer, Concatenate)
- for layer in self.layers])))
-
- def get_input_layers(self):
- if self._input_layers is None:
- return [layer for layer in self.layers if len(layer.input_layers) == 0]
- else:
- return self._input_layers
-
- def get_output_layers(self):
- if self._output_layers is None:
- return [layer for layer in self.layers if len(layer.output_layers) == 0]
- else:
- return self._output_layers
-
- def connect(self, in_layer, out_layer):
- """
- Connect first layer to second layer.
- """
- if in_layer not in out_layer.input_layers:
- out_layer.input_layers.append(in_layer)
- if out_layer not in in_layer.output_layers:
- in_layer.output_layers.append(out_layer)
-
- def make_name(self, name):
- if name is None:
- class_name = self.__class__.__name__.lower()
- count = NAME_CACHE.get(class_name, 0)
- if count == 0:
- new_name = class_name
- else:
- new_name = "%s_%s" % (class_name, count)
- NAME_CACHE[class_name] = count + 1
- return new_name
- else:
- return name
-
- def summary(self):
- if not self.built:
- print(f'Model: "{self.name}" (unbuilt)')
- else:
- print(f'Model: "{self.name}"')
- print('_' * 65)
- print("Layer (type) Output Shape Param #")
- print("=" * 65)
- total_parameters = 0
- # FIXME: sum up other, non-trainable params
- other_params = 0
- for i, layer in enumerate(topological_sort(self.get_input_layers())):
- layer_name = ("%s (%s)" % (layer.name, layer.__class__.__name__))[:25]
- output_shape = (None, layer.n_out) if isinstance(layer.n_out, numbers.Number) else layer.n_out
- if self.built:
- parameters = sum([np.prod(item.shape) for item in layer.parameters.values() if item is not None])
- total_parameters += parameters
- print(f"{layer_name:25s} {str(output_shape)[:15]:>15s} {parameters:>20,}")
- else:
- print(f"{layer_name:25s} {str(output_shape)[:15]:>15s} {'(unbuilt)':>20}")
- if i != len(self.layers) - 1:
- print("_" * 65)
- print("=" * 65)
- if self.built:
- print(f"Total params: {total_parameters:,}")
- print(f"Trainable params: {total_parameters + other_params:,}")
- print(f"Non-trainable params: {other_params:,}")
- print("_" * 65)
-
- def build(self):
- self._input_layers = [layer for layer in self.layers if len(layer.input_layers) == 0]
- self._output_layers = [layer for layer in self.layers if len(layer.output_layers) == 0]
- for layer in self.layers:
- if not isinstance(layer, Input):
- self.is_initialized = False
- # now, let's force the layers to initialize:
- inputs = self.build_inputs()
- self.predict(inputs)
- self.built = True
-
- def compile(self, optimizer, loss, metrics=None):
- for layer in self.layers:
- if not isinstance(layer, Input):
- self.is_initialized = False
- layer.optimizer = OptimizerInitializer(optimizer)()
- loss_function = LOSS_FUNCTIONS[loss]
- self.loss_function = loss_function()
- self.metrics = metrics if metrics is not None else []
- self.build()
-
- def get_layer_output_shape(self, layer, n=1):
- """
- Get the shape of the layer with a dataset
- size of n.
- """
- if isinstance(layer.n_out, numbers.Number):
- shape = (n, layer.n_out)
- else:
- shape = tuple([n] + list(layer.n_out))
- return shape
-
- def get_layer_output_array(self, layer):
- """
- Get an output array of a layer (dataset, n = 1).
- """
- shape = self.get_layer_output_shape(layer)
- output = np.ndarray(shape)
- return output
-
- def build_inputs(self):
- """
- Build a dataset of dummy inputs.
- """
- if self.sequential:
- inputs = self.get_layer_output_array(self.layers[0])
- else:
- if len(self.get_input_layers()) > 1:
- inputs = [self.get_layer_output_array(input)
- for input in self._input_layers]
- else:
- inputs = self.get_layer_output_array(self._input_layers[0])
- return inputs
-
- def get_weights(self, flat=False):
- """
- Get the weights from the model.
- """
- array = []
- if flat:
- for layer in self.layers:
- if layer.has_trainable_params():
- for weight in layer.get_weights():
- if isinstance(weight, numbers.Number):
- array.append(weight)
- else:
- array.extend(weight.flatten())
- else:
- for layer in self.layers:
- if layer.has_trainable_params():
- array.extend(layer.get_weights())
- return array
-
- def copy_weights(self, model):
- """
- Copy the weights from another model by layer name.
- """
- for layer in model.layers:
- weights = layer.get_weights()
- self.layer_map[layer.name].set_weights(weights)
-
- def get_weights_by_name(self):
- """
- Return a dictionary mapping layer names to their weights.
- """
- return {layer.name: layer.get_weights() for layer in self.layers}
-
- def set_weights(self, weights):
- """
- Set the weights in a network.
-
- Args:
- weights: a list of pairs of weights and biases for each layer,
- or a single (flat) array of values
- """
- if len(weights) > 0 and isinstance(weights[0], numbers.Number):
- # Flat
- current = 0
- for layer in self.layers:
- if layer.has_trainable_params():
- orig = layer.get_weights()
- new_weights = []
- for item in orig:
- if isinstance(item, numbers.Number):
- total = 1
- new_weights.append(item)
- else:
- total = functools.reduce(operator.mul, item.shape, 1)
- w = np.array(weights[current:current + total], dtype=float)
- new_weights.append(w.reshape(item.shape))
- current += total
- layer.set_weights(new_weights)
- else:
- i = 0
- for layer in self.layers:
- if layer.has_trainable_params():
- orig = layer.get_weights()
- count = len(orig)
- layer.set_weights(weights[i:i+count])
- i += count
-
- def format_time(self, seconds):
- """
- Format time for easy human reading.
- """
- if seconds > 1:
- return f"{seconds:.0f}s"
- elif seconds * 1000 > 1:
- return f"{seconds * 1000:.0f}ms"
- else:
- return f"{seconds * 1000000:.0f}µs"
-
- def fit(self, inputs, targets, batch_size=32, epochs=1, verbose="auto", callbacks=None,
- initial_epoch=0, shuffle=True):
- """
- The training loop for all models.
- """
- self.history = History()
- self.stop_training = False
- verbose = 1 if verbose == "auto" else verbose
- callbacks = [] if callbacks is None else callbacks
- callbacks.append(self.history)
- inputs = np.array(inputs, dtype=float)
- targets = np.array(targets, dtype=float)
- self.flush_gradients()
- for callback in callbacks:
- callback.set_model(self)
- callback.on_train_begin()
- for epoch in range(initial_epoch, epochs):
- if self.stop_training:
- break
- epoch_metric_values = {}
- for metric in self.metrics:
- if hasattr(metric, "reset_state"):
- metric.reset_state()
- else:
- epoch_metric_values[get_metric_name(metric)] = 0
-
- for callback in callbacks:
- callback.on_epoch_begin(epoch)
-
- loss = 0
- total_batches = math.ceil(self.get_length_of_inputs(inputs) / batch_size)
- if verbose:
- print(f"Epoch {epoch+1}/{epochs}")
- for batch, length, batch_data in self.enumerate_batches(inputs, targets, batch_size, shuffle):
- start_time = time.monotonic()
- batch_loss, batch_metric_values = self.train_batch(batch_data, batch, length, batch_size, callbacks)
- loss += batch_loss
- for metric in batch_metric_values:
- # FIXME: Need to account for uneven batch sizes?
- epoch_metric_values[metric] += batch_metric_values[metric]
- end_time = time.monotonic()
- self.step += length
- if verbose:
- logs = {}
- ftime = self.format_time((end_time - start_time) / length)
- for metric in self.metrics:
- if hasattr(metric, "result"):
- logs[metric.name] = metric.result()
- else:
- if get_metric_name(metric) in batch_metric_values:
- logs[get_metric_name(metric)] = batch_metric_values[get_metric_name(metric)]
- metrics = " - ".join(["%s: %.4f" % (metric, logs[metric]) for metric in batch_metric_values])
- if metrics:
- metrics = " - " + metrics
- # ideally update output here
- logs = {
- "loss": loss,
- }
- for metric in self.metrics:
- if hasattr(metric, "result"):
- logs[metric.name] = metric.result()
- else:
- if get_metric_name(metric) in epoch_metric_values:
- logs[get_metric_name(metric)] = epoch_metric_values[get_metric_name(metric)] / total_batches
- if verbose:
- metrics = " - ".join(["%s: %.4f" % (metric, logs[metric]) for metric in logs])
- if metrics:
- metrics = " - " + metrics
- # Until we have live progress output, print a summary line using the last computed step time and metrics
- print(f"{batch + 1}/{total_batches} [==============================] - {end_time - start_time:.0f}s {ftime}/step{metrics}")
- for callback in callbacks:
- callback.on_epoch_end(
- epoch,
- logs
- )
- if self.stop_training:
- print("Training stopped early.")
- for callback in callbacks:
- callback.on_train_end()
- return self.history
-
- def flush_gradients(self):
- for layer in self.layers:
- if layer.has_trainable_params():
- layer.flush_gradients()
-
- def enumerate_batches(self, inputs, targets, batch_size, shuffle):
- indexes = np.arange(self.get_length_of_inputs(inputs))
- if shuffle:
- # In place shuffle
- np.random.shuffle(indexes)
- current_row = 0
- batch = 0
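- # walk the (possibly shuffled) index array in strides of batch_size;
- # the final batch may be smaller (e.g. 10 rows with batch_size=4
- # yields batches of 4, 4, and 2)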
- while current_row < self.get_length_of_inputs(inputs):
- batch_inputs = self.get_batch_inputs(
- inputs, indexes, current_row, batch_size)
- batch_targets = self.get_batch_targets(
- targets, indexes, current_row, batch_size)
- current_row += batch_size
- yield batch, self.get_length_of_inputs(batch_inputs), (batch_inputs, batch_targets)
- batch += 1
-
- def get_length_of_inputs(self, inputs):
- if len(self.get_input_layers()) == 1:
- return len(inputs)
- else:
- return len(inputs[0])
-
- def get_batch_inputs(self, inputs, indexes, current_row, batch_size):
- batch_indexes = indexes[current_row:current_row + batch_size]
- if len(self.get_input_layers()) == 1:
- return inputs[batch_indexes]
- else:
- return [np.array(inputs[i][batch_indexes])
- for i in range(len(self.get_input_layers()))]
-
- def get_batch_targets(self, targets, indexes, current_row, batch_size):
- batch_indexes = indexes[current_row:current_row + batch_size]
- if self.sequential:
- # Numpy, one bank:
- return targets[batch_indexes]
- else:
- return [np.array(targets[i][batch_indexes])
- for i in range(len(self.get_output_layers()))]
-
- def train_batch(self, dataset, batch, length, batch_size, callbacks):
- """
- dataset = (inputs, targets)
- batch = batch number (eg, step)
- length = the actual size of the batch
- batch_size = desired size of batch
- """
- inputs, targets = dataset
- # If the size of this batch is less than desired, scale it?
- #scale = length / batch_size
- scale = 1.0
- # Use predict to forward the activations, saving
- # needed information:
- outputs = self.predict(inputs, True)
- # Compute the derivative with respect
- # to this batch of the dataset:
- batch_loss = 0
- batch_metric_values = defaultdict(int)
- for callback in callbacks:
- callback.on_train_batch_begin(batch)
- results = 0
- # FIXME: If batch_size is different from others? Scale it?
- if self.sequential:
- dY_pred = self.loss_function.grad(
- targets,
- outputs,
- )
- queue = [(self.get_output_layers()[0], dY_pred)]
- while len(queue) > 0:
- layer, dY_pred = queue.pop(0)
- if not isinstance(layer, Input):
- dY_pred = layer.backward(dY_pred)
- for input_layer in layer.input_layers:
- queue.append((input_layer, dY_pred))
-
- batch_loss = self.loss_function(targets, outputs) * scale
- for metric in self.metrics:
- if hasattr(metric, "update_state"):
- metric.update_state(targets, outputs)
- else:
- batch_metric_values[get_metric_name(metric)] = metric(targets, outputs)
- else:
- for out_n in range(len(self.get_output_layers())):
- dY_pred = self.loss_function.grad(
- targets[out_n],
- outputs[out_n],
- ) * scale
- queue = [(self.get_output_layers()[out_n], dY_pred)]
- while len(queue) > 0:
- layer, dY_pred = queue.pop(0)
- if not isinstance(layer, Input):
- dY_pred = layer.backward(dY_pred)
- for input_layer in layer.input_layers:
- queue.append((input_layer, dY_pred))
-
- batch_loss += self.loss_function(targets[out_n], outputs[out_n]) * scale
- for metric in self.metrics:
- if hasattr(metric, "update_state"):
- metric.update_state(targets[out_n], outputs[out_n])
- else:
- batch_metric_values[get_metric_name(metric)] += metric(targets[out_n], outputs[out_n])
-
- for callback in callbacks:
- logs = {"batch_loss": batch_loss}
- logs.update(batch_metric_values)
- callback.on_train_batch_end(batch, logs)
- self.update(batch_loss)
- return batch_loss, batch_metric_values
-
- def update(self, batch_loss):
- """
- Update the weights based on the batch_loss.
- The weight deltas were computed in train_batch().
- """
- # FIXME? Need to pass the batch_loss to just the layers
- # responsible for this loss (eg, in case of multiple
- # output layers)
- # FIXME: layers need to be able to accumulate delta changes
- for layer in self.layers:
- if not isinstance(layer, Input):
- layer.update(batch_loss)
-
- def predict(self, inputs, retain_derived=False):
- inputs = np.array(inputs, dtype=float)
- results = []
- # First, load the outputs of the input layers:
- if self.sequential:
- outputs = {self._input_layers[0].name: inputs}
- else:
- if len(self._input_layers) > 1:
- outputs = {self._input_layers[i].name: input for i, input in enumerate(inputs)}
- else:
- outputs = {self._input_layers[0].name: inputs}
-
- # Propagate in topological order:
- for layer in topological_sort(self.get_input_layers()):
- if not isinstance(layer, Input):
- inputs = [outputs[in_layer.name] for in_layer in layer.input_layers]
- if len(inputs) == 1:
- outputs[layer.name] = layer.forward(inputs[0], retain_derived=retain_derived)
- else:
- outputs[layer.name] = layer.forward(inputs, retain_derived=retain_derived)
-
- for layer in self.get_output_layers():
- results.append(outputs[layer.name])
- if self.sequential:
- return results[0]
- else:
- return results
-
-class Sequential(Model):
- def __init__(self, layers=None, name="sequential"):
- super().__init__(name=name)
- self.sequential = True
- if layers is not None:
- for layer in layers:
- self.add(layer)
- self.build()
-
- def add(self, layer):
- if layer.name in self.layer_map:
- raise AttributeError("duplicate layer name: '%s'" % layer.name)
- self.layer_map[layer.name] = layer
- if len(self.layers) == 0:
- if isinstance(layer, Input):
- self.layers.append(layer)
- else:
- input_layer = Input(input_shape=layer.input_shape)
- self.connect(input_layer, layer)
- self.layers.append(input_layer)
- self.layers.append(layer)
- elif isinstance(layer, Activation):
- self.layers[-1].act_fn = layer.activation
- else:
- input_layer = self.layers[-1]
- self.connect(input_layer, layer)
- self.layers.append(layer)
- self.build()
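-
-# A minimal end-to-end sketch. The `Dense` layer, its keyword arguments, and
-# the training arrays `X` and `Y` are illustrative assumptions about the rest
-# of the package, not definitions from this file:
-#
-#   model = Sequential([Dense(8, activation="tanh", input_shape=(2,)),
-#                       Dense(1)])
-#   model.compile(optimizer="sgd", loss="mse")
-#   history = model.fit(X, Y, batch_size=4, epochs=10, verbose=0)
-#   predictions = model.predict(X)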
diff --git a/aitk/keras/models/vae.py b/aitk/keras/models/vae.py
deleted file mode 100644
index e136355..0000000
--- a/aitk/keras/models/vae.py
+++ /dev/null
@@ -1,453 +0,0 @@
-from time import time
-from collections import OrderedDict
-
-import numpy as np
-
-from ..losses import VAELoss
-from ..utils import minibatch
-from ..activations import ReLU, Affine, Sigmoid
-from ..layers import Conv2D, Pool2D, Flatten, FullyConnected
-
-
-class BernoulliVAE(object):
- def __init__(
- self,
- T=5,
- latent_dim=256,
- enc_conv1_pad=0,
- enc_conv2_pad=0,
- enc_conv1_out_ch=32,
- enc_conv2_out_ch=64,
- enc_conv1_stride=1,
- enc_pool1_stride=2,
- enc_conv2_stride=1,
- enc_pool2_stride=1,
- enc_conv1_kernel_shape=(5, 5),
- enc_pool1_kernel_shape=(2, 2),
- enc_conv2_kernel_shape=(5, 5),
- enc_pool2_kernel_shape=(2, 2),
- optimizer="RMSProp(lr=0.0001)",
- init="glorot_uniform",
- ):
- """
- A variational autoencoder (VAE) with 2D convolutional encoder and Bernoulli
- input and output units.
-
- Notes
- -----
- The VAE architecture is
-
- .. code-block:: text
-
- |-- t_mean ----|
- X -> [Encoder] -| |--> [Sampler] -> [Decoder] -> X_recon
- |-- t_log_var -|
-
- where ``[Encoder]`` is
-
- .. code-block:: text
-
- Conv1 -> ReLU -> MaxPool1 -> Conv2 -> ReLU ->
- MaxPool2 -> Flatten -> FC1 -> ReLU -> FC2
-
- ``[Decoder]`` is
-
- .. code-block:: text
-
- FC1 -> FC2 -> Sigmoid
-
- and ``[Sampler]`` draws a sample from the distribution
-
- .. math::
-
- \mathcal{N}(\\text{t_mean}, \exp \left\{\\text{t_log_var}\\right\} I)
-
- using the reparameterization trick.
-
- Parameters
- ----------
- T : int
- The dimension of the variational parameter `t`. Default is 5.
- enc_conv1_pad : int
- The padding for the first convolutional layer of the encoder. Default is 0.
- enc_conv1_stride : int
- The stride for the first convolutional layer of the encoder. Default is 1.
- enc_conv1_out_ch : int
- The number of output channels for the first convolutional layer of
- the encoder. Default is 32.
- enc_conv1_kernel_shape : tuple
- The number of rows and columns in each filter of the first
- convolutional layer of the encoder. Default is (5, 5).
- enc_pool1_kernel_shape : tuple
- The number of rows and columns in the receptive field of the first
- max pool layer of the encoder. Default is (2, 2).
- enc_pool1_stride : int
- The stride for the first MaxPool layer of the encoder. Default is
- 2.
- enc_conv2_pad : int
- The padding for the second convolutional layer of the encoder.
- Default is 0.
- enc_conv2_out_ch : int
- The number of output channels for the second convolutional layer of
- the encoder. Default is 64.
- enc_conv2_kernel_shape : tuple
- The number of rows and columns in each filter of the second
- convolutional layer of the encoder. Default is (5, 5).
- enc_conv2_stride : int
- The stride for the second convolutional layer of the encoder.
- Default is 1.
- enc_pool2_stride : int
- The stride for the second MaxPool layer of the encoder. Default is
- 1.
- enc_pool2_kernel_shape : tuple
- The number of rows and columns in the receptive field of the second
- max pool layer of the encoder. Default is (2, 2).
- latent_dim : int
- The dimension of the output for the first FC layer of the encoder.
- Default is 256.
- optimizer : str or :doc:`Optimizer ` object or None
- The optimization strategy to use when performing gradient updates.
- If None, use the :class:`~numpy_ml.neural_nets.optimizers.SGD`
- optimizer with default parameters. Default is "RMSProp(lr=0.0001)".
- init : str
- The weight initialization strategy. Valid entries are
- {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform',
- 'std_normal', 'trunc_normal'}. Default is 'glorot_uniform'.
- """
- self.T = T
- self.init = init
- self.loss = VAELoss()
- self.optimizer = optimizer
- self.latent_dim = latent_dim
- self.enc_conv1_pad = enc_conv1_pad
- self.enc_conv2_pad = enc_conv2_pad
- self.enc_conv1_stride = enc_conv1_stride
- self.enc_conv1_out_ch = enc_conv1_out_ch
- self.enc_pool1_stride = enc_pool1_stride
- self.enc_conv2_out_ch = enc_conv2_out_ch
- self.enc_conv2_stride = enc_conv2_stride
- self.enc_pool2_stride = enc_pool2_stride
- self.enc_conv2_kernel_shape = enc_conv2_kernel_shape
- self.enc_pool2_kernel_shape = enc_pool2_kernel_shape
- self.enc_conv1_kernel_shape = enc_conv1_kernel_shape
- self.enc_pool1_kernel_shape = enc_pool1_kernel_shape
-
- self._init_params()
-
- def _init_params(self):
- self._dv = {}
- self._build_encoder()
- self._build_decoder()
-
- def _build_encoder(self):
- """
- CNN encoder
-
- Conv1 -> ReLU -> MaxPool1 -> Conv2 -> ReLU -> MaxPool2 ->
- Flatten -> FC1 -> ReLU -> FC2
- """
- self.encoder = OrderedDict()
- self.encoder["Conv1"] = Conv2D(
- act_fn=ReLU(),
- init=self.init,
- pad=self.enc_conv1_pad,
- optimizer=self.optimizer,
- out_ch=self.enc_conv1_out_ch,
- stride=self.enc_conv1_stride,
- kernel_shape=self.enc_conv1_kernel_shape,
- )
- self.encoder["Pool1"] = Pool2D(
- mode="max",
- optimizer=self.optimizer,
- stride=self.enc_pool1_stride,
- kernel_shape=self.enc_pool1_kernel_shape,
- )
- self.encoder["Conv2"] = Conv2D(
- act_fn=ReLU(),
- init=self.init,
- pad=self.enc_conv2_pad,
- optimizer=self.optimizer,
- out_ch=self.enc_conv2_out_ch,
- stride=self.enc_conv2_stride,
- kernel_shape=self.enc_conv2_kernel_shape,
- )
- self.encoder["Pool2"] = Pool2D(
- mode="max",
- optimizer=self.optimizer,
- stride=self.enc_pool2_stride,
- kernel_shape=self.enc_pool2_kernel_shape,
- )
- self.encoder["Flatten3"] = Flatten(optimizer=self.optimizer)
- self.encoder["FC4"] = FullyConnected(
- n_out=self.latent_dim, act_fn=ReLU(), optimizer=self.optimizer
- )
- self.encoder["FC5"] = FullyConnected(
- n_out=self.T * 2,
- optimizer=self.optimizer,
- act_fn=Affine(slope=1, intercept=0),
- init=self.init,
- )
-
- def _build_decoder(self):
- """
- MLP decoder
-
- FC1 -> ReLU -> FC2 -> Sigmoid
- """
- self.decoder = OrderedDict()
- self.decoder["FC1"] = FullyConnected(
- act_fn=ReLU(),
- init=self.init,
- n_out=self.latent_dim,
- optimizer=self.optimizer,
- )
- # NB. `n_out` is dependent on the dimensionality of X. we use a
- # placeholder for now, and update it within the `forward` method
- self.decoder["FC2"] = FullyConnected(
- n_out=None, act_fn=Sigmoid(), optimizer=self.optimizer, init=self.init
- )
-
- @property
- def parameters(self):
- return {
- "components": {
- "encoder": {k: v.parameters for k, v in self.encoder.items()},
- "decoder": {k: v.parameters for k, v in self.decoder.items()},
- }
- }
-
- @property
- def hyperparameters(self):
- return {
- "layer": "BernoulliVAE",
- "T": self.T,
- "init": self.init,
- "loss": str(self.loss),
- "optimizer": self.optimizer,
- "latent_dim": self.latent_dim,
- "enc_conv1_pad": self.enc_conv1_pad,
- "enc_conv2_pad": self.enc_conv2_pad,
- "enc_conv1_in_ch": self.enc_conv1_in_ch,
- "enc_conv1_stride": self.enc_conv1_stride,
- "enc_conv1_out_ch": self.enc_conv1_out_ch,
- "enc_pool1_stride": self.enc_pool1_stride,
- "enc_conv2_out_ch": self.enc_conv2_out_ch,
- "enc_conv2_stride": self.enc_conv2_stride,
- "enc_pool2_stride": self.enc_pool2_stride,
- "enc_conv2_kernel_shape": self.enc_conv2_kernel_shape,
- "enc_pool2_kernel_shape": self.enc_pool2_kernel_shape,
- "enc_conv1_kernel_shape": self.enc_conv1_kernel_shape,
- "enc_pool1_kernel_shape": self.enc_pool1_kernel_shape,
- "encoder_ids": list(self.encoder.keys()),
- "decoder_ids": list(self.decoder.keys()),
- "components": {
- "encoder": {k: v.hyperparameters for k, v in self.encoder.items()},
- "decoder": {k: v.hyperparameters for k, v in self.decoder.items()},
- },
- }
-
- @property
- def derived_variables(self):
- dv = {
- "noise": None,
- "t_mean": None,
- "t_log_var": None,
- "dDecoder_FC1_in": None,
- "dDecoder_t_mean": None,
- "dEncoder_FC5_out": None,
- "dDecoder_FC1_out": None,
- "dEncoder_FC4_out": None,
- "dEncoder_Pool2_out": None,
- "dEncoder_Conv2_out": None,
- "dEncoder_Pool1_out": None,
- "dEncoder_Conv1_out": None,
- "dDecoder_t_log_var": None,
- "dEncoder_Flatten3_out": None,
- "components": {
- "encoder": {k: v.derived_variables for k, v in self.encoder.items()},
- "decoder": {k: v.derived_variables for k, v in self.decoder.items()},
- },
- }
- dv.update(self._dv)
- return dv
-
- @property
- def gradients(self):
- return {
- "components": {
- "encoder": {k: v.gradients for k, v in self.encoder.items()},
- "decoder": {k: v.gradients for k, v in self.decoder.items()},
- }
- }
-
- def _sample(self, t_mean, t_log_var):
- """
- Returns a sample from the distribution
-
- q(t | x) = N(t_mean, diag(exp(t_log_var)))
-
- using the reparameterization trick.
-
- Parameters
- ----------
- t_mean : :py:class:`ndarray ` of shape `(n_ex, latent_dim)`
- Mean of the desired distribution.
- t_log_var : :py:class:`ndarray ` of shape `(n_ex, latent_dim)`
- Log variance vector of the desired distribution.
-
- Returns
- -------
- samples: :py:class:`ndarray ` of shape `(n_ex, latent_dim)`
- """
- noise = np.random.normal(loc=0.0, scale=1.0, size=t_mean.shape)
- samples = noise * np.exp(t_log_var) + t_mean
- # save sampled noise for backward pass
- self._dv["noise"] = noise
- return samples
-
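# A minimal NumPy sketch of the reparameterization trick used in `_sample`
# above (shapes are illustrative assumptions). It mirrors this file's
# convention of scaling the noise by exp(t_log_var); treating `t_log_var`
# strictly as a log-variance would scale by exp(0.5 * t_log_var) instead.
import numpy as np

t_mean = np.zeros((4, 2))                    # (n_ex, latent_dim)
t_log_var = np.zeros((4, 2))
noise = np.random.normal(size=t_mean.shape)  # eps ~ N(0, 1), independent of the parameters
t = noise * np.exp(t_log_var) + t_mean       # sample is differentiable wrt t_mean and t_log_var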
- def forward(self, X_train):
- """VAE forward pass"""
- if self.decoder["FC2"].n_out is None:
- fc2 = self.decoder["FC2"]
- self.decoder["FC2"] = fc2.set_params({"n_out": self.N})
-
-        # X_train contains image volumes of shape (n_ex, in_rows, in_cols, in_ch);
-        # each example is flattened to a row vector when computing the loss
-        n_ex, in_rows, in_cols, in_ch = X_train.shape
-
- # encode the training batch to estimate the mean and variance of the
- # variational distribution
- out = X_train
- for k, v in self.encoder.items():
- out = v.forward(out)
-
- # extract the mean and log variance of the variational distribution
- # q(t | x) from the encoder output
- t_mean = out[:, : self.T]
- t_log_var = out[:, self.T :]
-
-        # sample t from q(t | x) using the reparameterization trick
- t = self._sample(t_mean, t_log_var)
-
- # pass the sampled latent value, t, through the decoder
- # to generate the average reconstruction
- X_recon = t
- for k, v in self.decoder.items():
- X_recon = v.forward(X_recon)
-
- self._dv["t_mean"] = t_mean
- self._dv["t_log_var"] = t_log_var
- return X_recon
-
- def backward(self, X_train, X_recon):
- """VAE backward pass"""
- n_ex = X_train.shape[0]
- D, E = self.decoder, self.encoder
- noise = self.derived_variables["noise"]
- t_mean = self.derived_variables["t_mean"]
- t_log_var = self.derived_variables["t_log_var"]
-
- # compute gradients through the VAE loss
- dY_pred, dLogVar, dMean = self.loss.grad(
- X_train.reshape(n_ex, -1), X_recon, t_mean, t_log_var
- )
-
- # backprop through the decoder
- dDecoder_FC1_out = D["FC2"].backward(dY_pred)
- dDecoder_FC1_in = D["FC1"].backward(dDecoder_FC1_out)
-
- # backprop through the sampler
- dDecoder_t_log_var = dDecoder_FC1_in * (noise * np.exp(t_log_var))
- dDecoder_t_mean = dDecoder_FC1_in
-
- # backprop through the encoder
- dEncoder_FC5_out = np.hstack(
- [dDecoder_t_mean + dMean, dDecoder_t_log_var + dLogVar]
- )
- dEncoder_FC4_out = E["FC5"].backward(dEncoder_FC5_out)
- dEncoder_Flatten3_out = E["FC4"].backward(dEncoder_FC4_out)
- dEncoder_Pool2_out = E["Flatten3"].backward(dEncoder_Flatten3_out)
- dEncoder_Conv2_out = E["Pool2"].backward(dEncoder_Pool2_out)
- dEncoder_Pool1_out = E["Conv2"].backward(dEncoder_Conv2_out)
- dEncoder_Conv1_out = E["Pool1"].backward(dEncoder_Pool1_out)
- dX = E["Conv1"].backward(dEncoder_Conv1_out)
-
- self._dv["dDecoder_t_mean"] = dDecoder_t_mean
- self._dv["dDecoder_FC1_in"] = dDecoder_FC1_in
- self._dv["dDecoder_FC1_out"] = dDecoder_FC1_out
- self._dv["dEncoder_FC5_out"] = dEncoder_FC5_out
- self._dv["dEncoder_FC4_out"] = dEncoder_FC4_out
- self._dv["dDecoder_t_log_var"] = dDecoder_t_log_var
- self._dv["dEncoder_Pool2_out"] = dEncoder_Pool2_out
- self._dv["dEncoder_Conv2_out"] = dEncoder_Conv2_out
- self._dv["dEncoder_Pool1_out"] = dEncoder_Pool1_out
- self._dv["dEncoder_Conv1_out"] = dEncoder_Conv1_out
- self._dv["dEncoder_Flatten3_out"] = dEncoder_Flatten3_out
- return dX
-
- def update(self, cur_loss=None):
- """Perform gradient updates"""
- for k, v in reversed(list(self.decoder.items())):
- v.update(cur_loss)
- for k, v in reversed(list(self.encoder.items())):
- v.update(cur_loss)
- self.flush_gradients()
-
- def flush_gradients(self):
- """Reset parameter gradients after update"""
- for k, v in self.decoder.items():
- v.flush_gradients()
- for k, v in self.encoder.items():
- v.flush_gradients()
-
- def fit(self, X_train, n_epochs=20, batchsize=128, verbose=True):
- """
- Fit the VAE to a training dataset.
-
- Parameters
- ----------
- X_train : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The input volume
- n_epochs : int
- The maximum number of training epochs to run. Default is 20.
- batchsize : int
- The desired number of examples in each training batch. Default is 128.
- verbose : bool
- Print batch information during training. Default is True.
- """
- self.verbose = verbose
- self.n_epochs = n_epochs
- self.batchsize = batchsize
-
- _, self.in_rows, self.in_cols, self.in_ch = X_train.shape
- self.N = self.in_rows * self.in_cols * self.in_ch
-
- prev_loss = np.inf
- for i in range(n_epochs):
- loss, estart = 0.0, time()
- batch_generator, nb = minibatch(X_train, batchsize, shuffle=True)
-
- # TODO: parallelize inner loop
- for j, b_ix in enumerate(batch_generator):
- bsize, bstart = len(b_ix), time()
-
- X_batch = X_train[b_ix]
- X_batch_col = X_train[b_ix].reshape(bsize, -1)
-
- X_recon = self.forward(X_batch)
- t_mean = self.derived_variables["t_mean"]
- t_log_var = self.derived_variables["t_log_var"]
-
- self.backward(X_batch, X_recon)
- batch_loss = self.loss(X_batch_col, X_recon, t_mean, t_log_var)
- loss += batch_loss
-
- self.update(batch_loss)
-
- if self.verbose:
- fstr = "\t[Batch {}/{}] Train loss: {:.3f} ({:.1f}s/batch)"
- print(fstr.format(j + 1, nb, batch_loss, time() - bstart))
-
- loss /= nb
- fstr = "[Epoch {}] Avg. loss: {:.3f} Delta: {:.3f} ({:.2f}m/epoch)"
- print(fstr.format(i + 1, loss, prev_loss - loss, (time() - estart) / 60.0))
- prev_loss = loss
diff --git a/aitk/keras/models/w2v.py b/aitk/keras/models/w2v.py
deleted file mode 100644
index b14ae74..0000000
--- a/aitk/keras/models/w2v.py
+++ /dev/null
@@ -1,451 +0,0 @@
-from time import time
-
-import numpy as np
-
-from ..layers import Embedding
-from ..losses import NCELoss
-
-from ..preprocessing.nlp import Vocabulary, tokenize_words
-from ..numpy_ml_utils.data_structures import DiscreteSampler
-
-
-class Word2Vec(object):
- def __init__(
- self,
- context_len=5,
- min_count=None,
- skip_gram=False,
- max_tokens=None,
- embedding_dim=300,
- filter_stopwords=True,
- noise_dist_power=0.75,
- kernel_initializer="glorot_uniform",
- num_negative_samples=64,
- optimizer="SGD(lr=0.1)",
- ):
- """
- A word2vec model supporting both continuous bag of words (CBOW) and
- skip-gram architectures, with training via noise contrastive
- estimation.
-
- Parameters
- ----------
- context_len : int
- The number of words to the left and right of the current word to
- use as context during training. Larger values result in more
- training examples and thus can lead to higher accuracy at the
- expense of additional training time. Default is 5.
- min_count : int or None
- Minimum number of times a token must occur in order to be included
- in vocab. If None, include all tokens from `corpus_fp` in vocab.
- Default is None.
- skip_gram : bool
- Whether to train the skip-gram or CBOW model. The skip-gram model
-            is trained to predict the surrounding context words,
-            ``words[i - context:i]`` and ``words[i + 1:i + 1 + context]``,
-            given the target word ``words[i]`` as input. Default is False.
- max_tokens : int or None
- Only add the first `max_tokens` most frequent tokens that occur
- more than `min_count` to the vocabulary. If None, add all tokens
-            that occur more than `min_count`. Default is None.
- embedding_dim : int
- The number of dimensions in the final word embeddings. Default is
- 300.
- filter_stopwords : bool
- Whether to remove stopwords before encoding the words in the
- corpus. Default is True.
- noise_dist_power : float
- The power the unigram count is raised to when computing the noise
- distribution for negative sampling. A value of 0 corresponds to a
- uniform distribution over tokens, and a value of 1 corresponds to a
- distribution proportional to the token unigram counts. Default is
- 0.75.
- kernel_initializer : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is 'glorot_uniform'.
-        num_negative_samples : int
-            The number of negative samples to draw from the noise distribution
-            for each positive training sample. If 0, use the hierarchical
-            softmax formulation of the model instead. Default is 64.
- optimizer : str, :doc:`Optimizer ` object, or None
- The optimization strategy to use when performing gradient updates
- within the `update` method. If None, use the
- :class:`~numpy_ml.neural_nets.optimizers.SGD` optimizer with
-            default parameters. Default is 'SGD(lr=0.1)'.
-
- Attributes
- ----------
- parameters : dict
- hyperparameters : dict
- derived_variables : dict
- gradients : dict
-
- Notes
- -----
-        The word2vec model is outlined in [1].
-
- CBOW architecture::
-
- w_{t-R} ----|
- w_{t-R+1} ----|
- ... --> Average --> Embedding layer --> [NCE Layer / HSoftmax] --> P(w_{t} | w_{...})
- w_{t+R-1} ----|
- w_{t+R} ----|
-
- Skip-gram architecture::
-
- |--> P(w_{t-R} | w_{t})
- |--> P(w_{t-R+1} | w_{t})
- w_{t} --> Embedding layer --> [NCE Layer / HSoftmax] --| ...
- |--> P(w_{t+R-1} | w_{t})
- |--> P(w_{t+R} | w_{t})
-
- where :math:`w_{i}` is the one-hot representation of the word at position
- `i` within a sentence in the corpus and `R` is the length of the context
- window on either side of the target word.
-
- References
- ----------
- .. [1] Mikolov et al. (2013). "Distributed representations of words
- and phrases and their compositionality," Proceedings of the 26th
- International Conference on Neural Information Processing Systems.
- https://arxiv.org/pdf/1310.4546.pdf
- """
- self.kernel_initializer = kernel_initializer
- self.optimizer = optimizer
- self.skip_gram = skip_gram
- self.min_count = min_count
- self.max_tokens = max_tokens
- self.context_len = context_len
- self.embedding_dim = embedding_dim
- self.filter_stopwords = filter_stopwords
- self.noise_dist_power = noise_dist_power
- self.num_negative_samples = num_negative_samples
-        self.special_chars = set(["<unk>", "<eol>", "<bol>"])
-
- def _init_params(self):
- self._dv = {}
- self._build_noise_distribution()
-
- self.embeddings = Embedding(
- kernel_initializer=self.kernel_initializer,
- vocab_size=self.vocab_size,
- n_out=self.embedding_dim,
- optimizer=self.optimizer,
- pool=None if self.skip_gram else "mean",
- )
-
- self.loss = NCELoss(
- kernel_initializer=self.kernel_initializer,
- optimizer=self.optimizer,
- n_classes=self.vocab_size,
- subtract_log_label_prob=False,
- noise_sampler=self._noise_sampler,
- num_negative_samples=self.num_negative_samples,
- )
-
- @property
- def parameters(self):
- """Model parameters"""
- param = {"components": {"embeddings": {}, "loss": {}}}
- if hasattr(self, "embeddings"):
- param["components"] = {
- "embeddings": self.embeddings.parameters,
- "loss": self.loss.parameters,
- }
- return param
-
- @property
- def hyperparameters(self):
- """Model hyperparameters"""
- hp = {
- "layer": "Word2Vec",
- "kernel_initializer": self.kernel_initializer,
- "skip_gram": self.skip_gram,
- "optimizer": self.optimizer,
- "max_tokens": self.max_tokens,
- "context_len": self.context_len,
- "embedding_dim": self.embedding_dim,
- "noise_dist_power": self.noise_dist_power,
- "filter_stopwords": self.filter_stopwords,
- "num_negative_samples": self.num_negative_samples,
- "vocab_size": self.vocab_size if hasattr(self, "vocab_size") else None,
- "components": {"embeddings": {}, "loss": {}},
- }
-
- if hasattr(self, "embeddings"):
- hp["components"] = {
- "embeddings": self.embeddings.hyperparameters,
- "loss": self.loss.hyperparameters,
- }
- return hp
-
- @property
- def derived_variables(self):
- """Variables computed during model operation"""
- dv = {"components": {"embeddings": {}, "loss": {}}}
- dv.update(self._dv)
-
- if hasattr(self, "embeddings"):
- dv["components"] = {
- "embeddings": self.embeddings.derived_variables,
- "loss": self.loss.derived_variables,
- }
- return dv
-
- @property
- def gradients(self):
- """Model parameter gradients"""
- grad = {"components": {"embeddings": {}, "loss": {}}}
- if hasattr(self, "embeddings"):
- grad["components"] = {
- "embeddings": self.embeddings.gradients,
- "loss": self.loss.gradients,
- }
- return grad
-
- def forward(self, X, targets, retain_derived=True):
- """
- Evaluate the network on a single minibatch.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in)`
- Layer input, representing a minibatch of `n_ex` examples, each
- consisting of `n_in` integer word indices
- targets : :py:class:`ndarray ` of shape `(n_ex,)`
- Target word index for each example in the minibatch.
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If `False`, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- True.
-
- Returns
- -------
- loss : float
- The loss associated with the current minibatch
- y_pred : :py:class:`ndarray ` of shape `(n_ex,)`
- The conditional probabilities of the words in `targets` given the
- corresponding example / context in `X`.
- """
- X_emb = self.embeddings.forward(X, retain_derived=True)
- loss, y_pred = self.loss.loss(X_emb, targets.flatten(), retain_derived=True)
- return loss, y_pred
-
- def backward(self):
- """
- Compute the gradient of the loss wrt the current network parameters.
- """
- dX_emb = self.loss.grad(retain_grads=True, update_params=False)
- self.embeddings.backward(dX_emb)
-
- def update(self, cur_loss=None):
- """Perform gradient updates"""
- self.loss.update(cur_loss)
- self.embeddings.update(cur_loss)
- self.flush_gradients()
-
- def flush_gradients(self):
- """Reset parameter gradients after update"""
- self.loss.flush_gradients()
- self.embeddings.flush_gradients()
-
- def get_embedding(self, word_ids):
- """
- Retrieve the embeddings for a collection of word IDs.
-
- Parameters
- ----------
- word_ids : :py:class:`ndarray ` of shape `(M,)`
- An array of word IDs to retrieve embeddings for.
-
- Returns
- -------
- embeddings : :py:class:`ndarray ` of shape `(M, n_out)`
- The embedding vectors for each of the `M` word IDs.
- """
- if isinstance(word_ids, list):
- word_ids = np.array(word_ids)
- return self.embeddings.lookup(word_ids)
-
- def _build_noise_distribution(self):
- """
- Construct the noise distribution for use during negative sampling.
-
- For a word ``w`` in the corpus, the noise distribution is::
-
- P_n(w) = Count(w) ** noise_dist_power / Z
-
- where ``Z`` is a normalizing constant, and `noise_dist_power` is a
- hyperparameter of the model. Mikolov et al. report best performance
- using a `noise_dist_power` of 0.75.
- """
- if not hasattr(self, "vocab"):
- raise ValueError("Must call `fit` before constructing noise distribution")
-
- probs = np.zeros(len(self.vocab))
- power = self.hyperparameters["noise_dist_power"]
-
- for ix, token in enumerate(self.vocab):
- count = token.count
- probs[ix] = count ** power
-
- probs /= np.sum(probs)
- self._noise_sampler = DiscreteSampler(probs, log=False, with_replacement=False)
-
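# A minimal sketch of the unigram^power noise distribution described in the
# docstring above: P_n(w) = Count(w) ** noise_dist_power / Z. The counts below
# are made-up values for illustration.
import numpy as np

counts = np.array([100.0, 10.0, 1.0])  # hypothetical unigram counts
probs = counts ** 0.75                 # noise_dist_power = 0.75 flattens raw frequencies
probs /= probs.sum()                   # Z is the normalizing constant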
- def _train_epoch(self, corpus_fps, encoding):
- total_loss = 0
- batch_generator = self.minibatcher(corpus_fps, encoding)
- for ix, (X, target) in enumerate(batch_generator):
- loss = self._train_batch(X, target)
- total_loss += loss
- if self.verbose:
- smooth_loss = 0.99 * smooth_loss + 0.01 * loss if ix > 0 else loss
- fstr = "[Batch {}] Loss: {:.5f} | Smoothed Loss: {:.5f}"
- print(fstr.format(ix + 1, loss, smooth_loss))
- return total_loss / (ix + 1)
-
- def _train_batch(self, X, target):
- loss, _ = self.forward(X, target)
- self.backward()
- self.update(loss)
- return loss
-
- def minibatcher(self, corpus_fps, encoding):
- """
- A minibatch generator for skip-gram and CBOW models.
-
- Parameters
- ----------
- corpus_fps : str or list of strs
- The filepath / list of filepaths to the document(s) to be encoded.
- Each document is expected to be encoded as newline-separated
- string of text, with adjacent tokens separated by a whitespace
- character.
- encoding : str
- Specifies the text encoding for corpus. This value is passed
- directly to Python's `open` builtin. Common entries are either
- 'utf-8' (no header byte), or 'utf-8-sig' (header byte).
-
- Yields
- ------
- X : list of length `batchsize` or :py:class:`ndarray ` of shape (`batchsize`, `n_in`)
- The context IDs for a minibatch of `batchsize` examples. If
- ``self.skip_gram`` is False, `X` will be a ragged list consisting
- of `batchsize` variable-length lists. If ``self.skip_gram`` is
- `True`, all sublists will be of the same length (`n_in`) and `X`
- will be returned as a :py:class:`ndarray ` of shape (`batchsize`, `n_in`).
- target : :py:class:`ndarray ` of shape (`batchsize`, 1)
- The target IDs associated with each example in `X`
- """
- batchsize = self.batchsize
- X_mb, target_mb, mb_ready = [], [], False
-
- for d_ix, doc_fp in enumerate(corpus_fps):
- with open(doc_fp, "r", encoding=encoding) as doc:
- for line in doc:
- words = tokenize_words(
- line, lowercase=True, filter_stopwords=self.filter_stopwords
- )
- word_ixs = self.vocab.words_to_indices(
- self.vocab.filter(words, unk=False)
- )
- for word_loc, word in enumerate(word_ixs):
- # since more distant words are usually less related to
- # the target word, we downweight them by sampling from
- # them less frequently during training.
- R = np.random.randint(1, self.context_len)
- left = word_ixs[max(word_loc - R, 0) : word_loc]
- right = word_ixs[word_loc + 1 : word_loc + 1 + R]
- context = left + right
-
- if len(context) == 0:
- continue
-
- # in the skip-gram architecture we use each of the
- # surrounding context to predict `word` / avoid
- # predicting negative samples
- if self.skip_gram:
- X_mb.extend([word] * len(context))
- target_mb.extend(context)
- mb_ready = len(target_mb) >= batchsize
-
- # in the CBOW architecture we use the average of the
- # context embeddings to predict the target `word` / avoid
- # predicting the negative samples
- else:
- context = np.array(context)
- X_mb.append(context) # X_mb will be a ragged array
- target_mb.append(word)
- mb_ready = len(X_mb) == batchsize
-
- if mb_ready:
- mb_ready = False
- X_batch, target_batch = X_mb.copy(), target_mb.copy()
- X_mb, target_mb = [], []
- if self.skip_gram:
- X_batch = np.array(X_batch)[:, None]
- target_batch = np.array(target_batch)[:, None]
- yield X_batch, target_batch
-
- # if we've reached the end of our final document and there are
- # remaining examples, yield the stragglers as a partial minibatch
- if len(X_mb) > 0:
- if self.skip_gram:
- X_mb = np.array(X_mb)[:, None]
- target_mb = np.array(target_mb)[:, None]
- yield X_mb, target_mb
-
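# A toy walk-through of how `minibatcher` builds training pairs from a token
# window (the word indices below are made up). With target position 2 and R = 2:
word_ixs = [11, 7, 42, 9, 3]
word_loc, R = 2, 2
left = word_ixs[max(word_loc - R, 0):word_loc]
right = word_ixs[word_loc + 1:word_loc + 1 + R]
context = left + right                                    # [11, 7, 9, 3]

# skip-gram: one (target word, context word) pair per context word
skip_gram_pairs = [(word_ixs[word_loc], c) for c in context]

# CBOW: a single example whose input is the full context and whose label is 42
cbow_example = (context, word_ixs[word_loc])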
- def fit(
- self, corpus_fps, encoding="utf-8-sig", n_epochs=20, batchsize=128, verbose=True
- ):
- """
-        Learn word2vec embeddings from the documents in `corpus_fps`.
-
- Parameters
- ----------
- corpus_fps : str or list of strs
- The filepath / list of filepaths to the document(s) to be encoded.
- Each document is expected to be encoded as newline-separated
- string of text, with adjacent tokens separated by a whitespace
- character.
- encoding : str
- Specifies the text encoding for corpus. Common entries are either
- 'utf-8' (no header byte), or 'utf-8-sig' (header byte). Default
- value is 'utf-8-sig'.
- n_epochs : int
- The maximum number of training epochs to run. Default is 20.
- batchsize : int
- The desired number of examples in each training batch. Default is
- 128.
- verbose : bool
- Print batch information during training. Default is True.
- """
- self.verbose = verbose
- self.n_epochs = n_epochs
- self.batchsize = batchsize
-
- self.vocab = Vocabulary(
- lowercase=True,
- min_count=self.min_count,
- max_tokens=self.max_tokens,
- filter_stopwords=self.filter_stopwords,
- )
- self.vocab.fit(corpus_fps, encoding=encoding)
- self.vocab_size = len(self.vocab)
-
- # ignore special characters when training the model
- for sp in self.special_chars:
- self.vocab.counts[sp] = 0
-
- # now that we know our vocabulary size, we can initialize the embeddings
- self._init_params()
-
- prev_loss = np.inf
- for i in range(n_epochs):
- loss, estart = 0.0, time()
- loss = self._train_epoch(corpus_fps, encoding)
-
- fstr = "[Epoch {}] Avg. loss: {:.3f} Delta: {:.3f} ({:.2f}m/epoch)"
- print(fstr.format(i + 1, loss, prev_loss - loss, (time() - estart) / 60.0))
- prev_loss = loss
diff --git a/aitk/keras/models/wgan_gp.py b/aitk/keras/models/wgan_gp.py
deleted file mode 100644
index a48e194..0000000
--- a/aitk/keras/models/wgan_gp.py
+++ /dev/null
@@ -1,528 +0,0 @@
-from time import time
-from collections import OrderedDict
-
-import numpy as np
-
-from ..utils import minibatch
-from ..layers import Dense
-from ..losses import WGAN_GPLoss
-
-
-class WGAN_GP(object):
- """
- A Wasserstein generative adversarial network (WGAN) architecture with
- gradient penalty (GP).
-
- Notes
- -----
-    In contrast to a regular WGAN, WGAN-GP uses a gradient penalty on the
-    critic rather than weight clipping to encourage the 1-Lipschitz
-    constraint:
-
-    .. math::
-
-        | \\text{Critic}(\mathbf{x}_1) - \\text{Critic}(\mathbf{x}_2) |
-            \leq |\mathbf{x}_1 - \mathbf{x}_2 | \ \ \ \ \\forall \mathbf{x}_1, \mathbf{x}_2
-
-    In other words, the critic must have input gradients with a norm of at
-    most 1 under the :math:`\mathbf{X}_{real}` and :math:`\mathbf{X}_{fake}`
-    data distributions.
-
-    To enforce this constraint, WGAN-GP penalizes the model if the critic's
-    input gradient norm moves away from a target norm of 1. See
- :class:`~numpy_ml.neural_nets.losses.WGAN_GPLoss` for more details.
-
- In contrast to a standard WGAN, WGAN-GP avoids using BatchNorm in the
- critic, as correlation between samples in a batch can impact the stability
- of the gradient penalty.
-
-    WGAN-GP architecture:
-
- .. code-block:: text
-
- X_real ------------------------|
- >---> [Critic] --> Y_out
- Z --> [Generator] --> X_fake --|
-
- where ``[Generator]`` is
-
- .. code-block:: text
-
- FC1 -> ReLU -> FC2 -> ReLU -> FC3 -> ReLU -> FC4
-
- and ``[Critic]`` is
-
- .. code-block:: text
-
- FC1 -> ReLU -> FC2 -> ReLU -> FC3 -> ReLU -> FC4
-
- and
-
- .. math::
-
- Z \sim \mathcal{N}(0, 1)
- """
-
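# A minimal NumPy sketch of the gradient-penalty term this class delegates to
# WGAN_GPLoss: lambda_ * E[(||d Critic(x_interp) / d x_interp||_2 - 1)^2],
# evaluated at points interpolated between real and generated samples. The
# shapes and lambda_ value are illustrative assumptions.
import numpy as np

lambda_ = 10.0
gradInterp = np.random.randn(128, 64)       # stand-in for the critic's input gradient
norms = np.linalg.norm(gradInterp, axis=1)  # per-example gradient norm
gradient_penalty = lambda_ * np.mean((norms - 1.0) ** 2)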
- def __init__(
- self,
- g_hidden=512,
- kernel_initializer="he_uniform",
- optimizer="RMSProp(lr=0.0001)",
- debug=False,
- ):
- """
- Wasserstein generative adversarial network with gradient penalty.
-
- Parameters
- ----------
- g_hidden : int
- The number of units in the critic and generator hidden layers.
- Default is 512.
- kernel_initializer : str
- The weight initialization strategy. Valid entries are
- {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform',
- 'std_normal', 'trunc_normal'}. Default is "he_uniform".
- optimizer : str or :doc:`Optimizer ` object or None
- The optimization strategy to use when performing gradient updates.
- If None, use the :class:`~numpy_ml.neural_nets.optimizers.SGD`
- optimizer with default parameters. Default is "RMSProp(lr=0.0001)".
- debug : bool
- Whether to store additional intermediate output within
- ``self.derived_variables``. Default is False.
- """
- self.kernel_initializer = kernel_initializer
- self.debug = debug
- self.g_hidden = g_hidden
- self.optimizer = optimizer
-
- self.lambda_ = None
- self.n_steps = None
- self.batchsize = None
-
- self.is_initialized = False
-
- def _init_params(self):
- self._dv = {}
- self._gr = {}
- self._build_critic()
- self._build_generator()
- self.is_initialized = True
-
- def _build_generator(self):
- """
- FC1 -> ReLU -> FC2 -> ReLU -> FC3 -> ReLU -> FC4
- """
- self.generator = OrderedDict()
- self.generator["FC1"] = Dense(
- self.g_hidden, act_fn="ReLU", optimizer=self.optimizer, kernel_initializer=self.kernel_initializer
- )
- self.generator["FC2"] = Dense(
- self.g_hidden, act_fn="ReLU", optimizer=self.optimizer, kernel_initializer=self.kernel_initializer
- )
- self.generator["FC3"] = Dense(
- self.g_hidden, act_fn="ReLU", optimizer=self.optimizer, kernel_initializer=self.kernel_initializer
- )
- self.generator["FC4"] = Dense(
- self.n_feats,
- act_fn="Affine(slope=1, intercept=0)",
- optimizer=self.optimizer,
- kernel_initializer=self.kernel_initializer,
- )
-
- def _build_critic(self):
- """
- FC1 -> ReLU -> FC2 -> ReLU -> FC3 -> ReLU -> FC4
- """
- self.critic = OrderedDict()
- self.critic["FC1"] = Dense(
- self.g_hidden, act_fn="ReLU", optimizer=self.optimizer, kernel_initializer=self.kernel_initializer
- )
- self.critic["FC2"] = Dense(
- self.g_hidden, act_fn="ReLU", optimizer=self.optimizer, kernel_initializer=self.kernel_initializer
- )
- self.critic["FC3"] = Dense(
- self.g_hidden, act_fn="ReLU", optimizer=self.optimizer, kernel_initializer=self.kernel_initializer
- )
- self.critic["FC4"] = Dense(
- 1,
- act_fn="Affine(slope=1, intercept=0)",
- optimizer=self.optimizer,
- kernel_initializer=self.kernel_initializer,
- )
-
- @property
- def hyperparameters(self):
- return {
- "kernel_initializer": self.kernel_initializer,
- "lambda_": self.lambda_,
- "g_hidden": self.g_hidden,
- "n_steps": self.n_steps,
- "optimizer": self.optimizer,
- "batchsize": self.batchsize,
- "c_updates_per_epoch": self.c_updates_per_epoch,
- "components": {
- "critic": {k: v.hyperparameters for k, v in self.critic.items()},
- "generator": {k: v.hyperparameters for k, v in self.generator.items()},
- },
- }
-
- @property
- def parameters(self):
- return {
- "components": {
- "critic": {k: v.parameters for k, v in self.critic.items()},
- "generator": {k: v.parameters for k, v in self.generator.items()},
- }
- }
-
- @property
- def derived_variables(self):
- C = self.critic.items()
- G = self.generator.items()
- dv = {
- "components": {
- "critic": {k: v.derived_variables for k, v in C},
- "generator": {k: v.derived_variables for k, v in G},
- }
- }
- dv.update(self._dv)
- return dv
-
- @property
- def gradients(self):
- grads = {
- "dC_Y_fake": None,
- "dC_Y_real": None,
- "dG_Y_fake": None,
- "dC_gradInterp": None,
- "components": {
- "critic": {k: v.gradients for k, v in self.critic.items()},
- "generator": {k: v.gradients for k, v in self.generator.items()},
- },
- }
- grads.update(self._gr)
- return grads
-
- def forward(self, X, module, retain_derived=True):
- """
- Perform the forward pass for either the generator or the critic.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(batchsize, \*)`
- Input data
- module : {'C' or 'G'}
- Whether to perform the forward pass for the critic ('C') or for the
- generator ('G').
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- out : :py:class:`ndarray ` of shape `(batchsize, \*)`
- The output of the final layer of the module.
- Xs : dict
- A dictionary with layer ids as keys and values corresponding to the
- input to each intermediate layer during the forward pass. Useful
- during debugging.
- """
- if module == "G":
- mod = self.generator
- elif module == "C":
- mod = self.critic
- else:
- raise ValueError("Unrecognized module name: {}".format(module))
-
- Xs = {}
- out, rd = X, retain_derived
- for k, v in mod.items():
- Xs[k] = out
- out = v.forward(out, retain_derived=rd)
- return out, Xs
-
- def backward(self, grad, module, retain_grads=True):
- """
- Perform the backward pass for either the generator or the critic.
-
- Parameters
- ----------
- grad : :py:class:`ndarray ` of shape `(batchsize, \*)` or list of arrays
- Gradient of the loss with respect to module output(s).
- module : {'C' or 'G'}
- Whether to perform the backward pass for the critic ('C') or for the
- generator ('G').
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is True.
-
- Returns
- -------
- out : :py:class:`ndarray ` of shape `(batchsize, \*)`
- The gradient of the loss with respect to the module input.
- dXs : dict
- A dictionary with layer ids as keys and values corresponding to the
- input to each intermediate layer during the backward pass. Useful
- during debugging.
- """
- if module == "G":
- mod = self.generator
- elif module == "C":
- mod = self.critic
- else:
- raise ValueError("Unrecognized module name: {}".format(module))
-
- dXs = {}
- out, rg = grad, retain_grads
- for k, v in reversed(list(mod.items())):
- dXs[k] = out
- out = v.backward(out, retain_grads=rg)
- return out, dXs
-
- def _dGradInterp(self, dLdGradInterp, dYi_outs):
- """
- Compute the gradient penalty's contribution to the critic loss and
- update the parameter gradients accordingly.
-
- Parameters
- ----------
- dLdGradInterp : :py:class:`ndarray ` of shape `(batchsize, critic_in_dim)`
- Gradient of `Y_interp` with respect to `X_interp`.
- dYi_outs : dict
- The intermediate outputs generated during the backward pass when
- computing `dLdGradInterp`.
- """
- dy = dLdGradInterp
- for k, v in self.critic.items():
- X = v.X[-1] # layer input during forward pass
- dy, dW, dB = v._bwd2(dy, X, dYi_outs[k][2])
- self.critic[k].gradients["W"] += dW
- self.critic[k].gradients["b"] += dB
-
- def update_critic(self, X_real):
- """
- Compute parameter gradients for the critic on a single minibatch.
-
- Parameters
- ----------
- X_real : :py:class:`ndarray ` of shape `(batchsize, n_feats)`
- Input data.
-
- Returns
- -------
- C_loss : float
- The critic loss on the current data.
- """
- self.flush_gradients("C")
-
- n_ex = X_real.shape[0]
- noise = np.random.randn(*X_real.shape)
-
- # generate and score the real and fake data
- X_fake, Xf_outs = self.forward(noise, "G")
- Y_real, Yr_outs = self.forward(X_real, "C")
- Y_fake, Yf_outs = self.forward(X_fake, "C")
-
- # sample a random point on the linear interpolation between real and
- # fake data and compute its score
- alpha = np.random.rand(n_ex, 1)
- X_interp = alpha * X_real + (1 - alpha) * X_fake
- Y_interp, Yi_outs = self.forward(X_interp, "C")
-
- # compute the gradient of Y_interp wrt. X_interp
- # Note that we don't save intermediate gradients here since this is not
- # the real backward pass
- dLdy = [0, 0, np.ones_like(Y_interp)]
- (_, _, gradInterp), dYi_outs = self.backward(dLdy, "C", retain_grads=False)
-
- # calculate critic loss and differentiate with respect to each term
- C_loss = self.loss(Y_fake, "C", Y_real, gradInterp)
- dY_real, dY_fake, dGrad_interp = self.loss.grad(Y_fake, "C", Y_real, gradInterp)
-
- # compute `dY_real` and `dY_fake` contributions to critic loss, update
- # param gradients accordingly
- self.backward([dY_real, dY_fake, 0], "C")
-
- # compute `gradInterp`'s contribution to the critic loss, updating
- # param gradients accordingly
- self._dGradInterp(dGrad_interp, dYi_outs)
-
- # cache intermediate vars for the generator update
- self._dv["alpha"] = alpha
- self._dv["Y_fake"] = Y_fake
-
- # log additional intermediate values for debugging
- if self.debug:
- self._dv["G_fwd_X_fake"] = {}
- self._dv["C_fwd_Y_real"] = {}
- self._dv["C_fwd_Y_fake"] = {}
- self._dv["C_fwd_Y_interp"] = {}
-
- N = len(self.critic.keys())
- N2 = len(self.generator.keys())
-
- for i in range(N2):
- self._dv["G_fwd_X_fake"]["FC" + str(i)] = Xf_outs["FC" + str(i + 1)]
-
- for i in range(N):
- self._dv["C_fwd_Y_real"]["FC" + str(i)] = Yr_outs["FC" + str(i + 1)]
- self._dv["C_fwd_Y_fake"]["FC" + str(i)] = Yf_outs["FC" + str(i + 1)]
- self._dv["C_fwd_Y_interp"]["FC" + str(i)] = Yi_outs["FC" + str(i + 1)]
-
- self._dv["C_fwd_Y_real"]["FC" + str(N)] = Y_real
- self._dv["C_fwd_Y_fake"]["FC" + str(N)] = Y_fake
- self._dv["G_fwd_X_fake"]["FC" + str(N2)] = X_fake
- self._dv["C_fwd_Y_interp"]["FC" + str(N)] = Y_interp
- self._dv["C_dY_interp_wrt"] = {k: v[2] for k, v in dYi_outs.items()}
-
- self._dv["noise"] = noise
- self._dv["X_fake"] = X_fake
- self._dv["X_real"] = X_real
- self._dv["Y_real"] = Y_real
- self._dv["Y_fake"] = Y_fake
- self._dv["C_loss"] = C_loss
- self._dv["dY_real"] = dY_real
- self._dv["dC_Y_fake"] = dY_fake
- self._dv["X_interp"] = X_interp
- self._dv["Y_interp"] = Y_interp
- self._dv["gradInterp"] = gradInterp
- self._dv["dGrad_interp"] = dGrad_interp
-
- return C_loss
-
- def update_generator(self, X_shape):
- """
- Compute parameter gradients for the generator on a single minibatch.
-
- Parameters
- ----------
- X_shape : tuple of `(batchsize, n_feats)`
- Shape for the input batch.
-
- Returns
- -------
- G_loss : float
- The generator loss on the fake data (generated during the critic
- update)
- """
- self.flush_gradients("G")
- Y_fake = self.derived_variables["Y_fake"]
-
- n_ex, _ = Y_fake.shape
- G_loss = -Y_fake.mean()
- dG_loss = -np.ones_like(Y_fake) / n_ex
- self.backward(dG_loss, "G")
-
- if self.debug:
- self._dv["G_loss"] = G_loss
- self._dv["dG_Y_fake"] = dG_loss
-
- return G_loss
-
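# Why `update_generator` uses dG_loss = -ones / n_ex: with G_loss = -mean(Y_fake),
# the derivative with respect to each critic score is -1 / n_ex. A quick check
# with placeholder scores:
import numpy as np

Y_fake = np.random.randn(128, 1)
n_ex = Y_fake.shape[0]
G_loss = -Y_fake.mean()
dG_loss = -np.ones_like(Y_fake) / n_ex  # gradient of -mean(Y_fake) wrt Y_fake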
- def flush_gradients(self, module):
- """Reset parameter gradients to 0 after an update."""
- if module == "G":
- mod = self.generator
- elif module == "C":
- mod = self.critic
- else:
- raise ValueError("Unrecognized module name: {}".format(module))
-
- for k, v in mod.items():
- v.flush_gradients()
-
- def update(self, module, module_loss=None):
- """Perform gradient updates and flush gradients upon completion"""
- if module == "G":
- mod = self.generator
- elif module == "C":
- mod = self.critic
- else:
- raise ValueError("Unrecognized module name: {}".format(module))
-
- for k, v in reversed(list(mod.items())):
- v.update(module_loss)
- self.flush_gradients(module)
-
- def fit(
- self,
- X_real,
- lambda_,
- n_steps=1000,
- batchsize=128,
- c_updates_per_epoch=5,
- verbose=True,
- ):
- """
- Fit WGAN_GP on a training dataset.
-
- Parameters
- ----------
- X_real : :py:class:`ndarray ` of shape `(n_ex, n_feats)`
- Training dataset
- lambda_ : float
- Gradient penalty coefficient for the critic loss
- n_steps : int
- The maximum number of generator updates to perform. Default is
- 1000.
- batchsize : int
- Number of examples to use in each training minibatch. Default is
- 128.
- c_updates_per_epoch : int
- The number of critic updates to perform at each generator update.
- verbose : bool
- Print loss values after each update. If False, only print loss
- every 100 steps. Default is True.
- """
- self.lambda_ = lambda_
- self.verbose = verbose
- self.n_steps = n_steps
- self.batchsize = batchsize
- self.c_updates_per_epoch = c_updates_per_epoch
-
- # adjust output of the generator to match the dimensionality of X
- if not self.is_initialized:
- self.n_feats = X_real.shape[1]
- self._init_params()
-
- # (re-)initialize loss
- prev_C, prev_G = np.inf, np.inf
- self.loss = WGAN_GPLoss(lambda_=self.lambda_)
-
- # training loop
- NC, NG = self.c_updates_per_epoch, self.n_steps
- for i in range(NG):
- estart = time()
- batch_generator, _ = minibatch(X_real, batchsize, shuffle=False)
-
- for j, b_ix in zip(range(NC), batch_generator):
- bstart = time()
- X_batch = X_real[b_ix]
- C_loss = self.update_critic(X_batch)
-
- # for testing, don't perform gradient update so we can inspect each grad
- if not self.debug:
- self.update("C", C_loss)
-
- if self.verbose:
- fstr = "\t[Critic batch {}] Critic loss: {:.3f} {:.3f}∆ ({:.1f}s/batch)"
- print(fstr.format(j + 1, C_loss, prev_C - C_loss, time() - bstart))
- prev_C = C_loss
-
- # generator update
- G_loss = self.update_generator(X_batch.shape)
-
- # for testing, don't perform gradient update so we can inspect each grad
- if not self.debug:
- self.update("G", G_loss)
-
- if i % 99 == 0:
- fstr = "[Epoch {}] Gen. loss: {:.3f} Critic loss: {:.3f}"
- print(fstr.format(i + 1, G_loss, C_loss))
-
- elif self.verbose:
- fstr = "[Epoch {}] Gen. loss: {:.3f} {:.3f}∆ ({:.1f}s/epoch)"
- print(fstr.format(i + 1, G_loss, prev_G - G_loss, time() - estart))
- prev_G = G_loss
diff --git a/aitk/keras/modules/README.md b/aitk/keras/modules/README.md
deleted file mode 100644
index 8590b6b..0000000
--- a/aitk/keras/modules/README.md
+++ /dev/null
@@ -1,10 +0,0 @@
-# Modules
-
-The `modules.py` module implements common multi-layer blocks that appear across
-many modern deep networks. It includes:
-
-- Bidirectional LSTMs ([Schuster & Paliwal, 1997](https://pdfs.semanticscholar.org/4b80/89bc9b49f84de43acc2eb8900035f7d492b2.pdf))
-- ResNet-style "identity" (i.e., `same`-convolution) residual blocks ([He et al., 2015](https://arxiv.org/pdf/1512.03385.pdf))
-- ResNet-style "convolutional" (i.e., parametric) residual blocks ([He et al., 2015](https://arxiv.org/pdf/1512.03385.pdf))
-- WaveNet-style residual block with dilated causal convolutions ([van den Oord et al., 2016](https://arxiv.org/pdf/1609.03499.pdf))
-- Transformer-style multi-headed dot-product attention ([Vaswani et al., 2017](https://arxiv.org/pdf/1706.03762.pdf))
diff --git a/aitk/keras/modules/__init__.py b/aitk/keras/modules/__init__.py
deleted file mode 100644
index 270dceb..0000000
--- a/aitk/keras/modules/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .modules import *
diff --git a/aitk/keras/modules/modules.py b/aitk/keras/modules/modules.py
deleted file mode 100644
index cc31ea7..0000000
--- a/aitk/keras/modules/modules.py
+++ /dev/null
@@ -1,1427 +0,0 @@
-from abc import ABC, abstractmethod
-
-import re
-import numpy as np
-
-from ..wrappers import Dropout
-from ..utils import calc_pad_dims_2D
-from ..activations import Tanh, Sigmoid, ReLU, LeakyReLU, Affine
-from ..layers import (
- DotProductAttention,
- Dense,
- BatchNorm2D,
- Conv1D,
- Conv2D,
- Multiply,
- LSTMCell,
- Add,
-)
-
-
-class ModuleBase(ABC):
- def __init__(self):
- self.X = None
- self.trainable = True
-
- super().__init__()
-
- @abstractmethod
- def _init_params(self, **kwargs):
- raise NotImplementedError
-
- @abstractmethod
- def forward(self, z, **kwargs):
- raise NotImplementedError
-
- @abstractmethod
- def backward(self, out, **kwargs):
- raise NotImplementedError
-
- @property
- def components(self):
- comps = []
- for c in self.hyperparameters["component_ids"]:
- if hasattr(self, c):
- comps.append(getattr(self, c))
- return comps
-
- def freeze(self):
- self.trainable = False
- for c in self.components:
- c.freeze()
-
- def unfreeze(self):
- self.trainable = True
- for c in self.components:
- c.unfreeze()
-
- def update(self, cur_loss=None):
- assert self.trainable, "Layer is frozen"
- for c in self.components:
- c.update(cur_loss)
- self.flush_gradients()
-
- def flush_gradients(self):
- assert self.trainable, "Layer is frozen"
-
- self.X = []
- self._dv = {}
- for c in self.components:
- for k, v in c.derived_variables.items():
- c.derived_variables[k] = None
-
- for k, v in c.gradients.items():
- c.gradients[k] = np.zeros_like(v)
-
- def set_params(self, summary_dict):
- cids = self.hyperparameters["component_ids"]
- for k, v in summary_dict["parameters"].items():
- if k == "components":
- for c, cd in summary_dict["parameters"][k].items():
- if c in cids:
- getattr(self, c).set_params(cd)
-
- elif k in self.parameters:
- self.parameters[k] = v
-
- for k, v in summary_dict["hyperparameters"].items():
- if k == "components":
- for c, cd in summary_dict["hyperparameters"][k].items():
- if c in cids:
- getattr(self, c).set_params(cd)
-
- if k in self.hyperparameters:
- if k == "act_fn" and v == "ReLU":
- self.hyperparameters[k] = ReLU()
-                    elif k == "act_fn" and v == "Sigmoid":
-                        self.hyperparameters[k] = Sigmoid()
-                    elif k == "act_fn" and v == "Tanh":
-                        self.hyperparameters[k] = Tanh()
-                    elif k == "act_fn" and "Affine" in v:
-                        r = r"Affine\(slope=(.*), intercept=(.*)\)"
-                        slope, intercept = re.match(r, v).groups()
-                        self.hyperparameters[k] = Affine(float(slope), float(intercept))
-                    elif k == "act_fn" and "Leaky ReLU" in v:
-                        r = r"Leaky ReLU\(alpha=(.*)\)"
-                        alpha = re.match(r, v).groups()[0]
-                        self.hyperparameters[k] = LeakyReLU(float(alpha))
- else:
- self.hyperparameters[k] = v
-
- def summary(self):
- return {
- "parameters": self.parameters,
- "layer": self.hyperparameters["layer"],
- "hyperparameters": self.hyperparameters,
- }
-
-
-class WavenetResidualModule(ModuleBase):
- def __init__(
- self,
- ch_residual,
- ch_dilation,
- dilation,
- kernel_width,
- optimizer=None,
- init="glorot_uniform",
- ):
- """
- A WaveNet-like residual block with causal dilated convolutions.
-
- .. code-block:: text
-
-            *Skip path in* >-------------------------------------------- + ---> *Skip path out*
-                              Causal      |--> Tanh --|                  |
-            *Main    |--> Dilated Conv1D -|           * --> 1x1 Conv1D --|
-            path >---|                    |--> Sigm --|                  |
-            in*      |-------------------------------------------------> + ---> *Main path out*
-                                  *Residual path*
-
- On the final block, the output of the skip path is further processed to
- produce the network predictions.
-
- References
- ----------
- .. [1] van den Oord et al. (2016). "Wavenet: a generative model for raw
- audio". https://arxiv.org/pdf/1609.03499.pdf
-
- Parameters
- ----------
- ch_residual : int
- The number of output channels for the 1x1
- :class:`~numpy_ml.neural_nets.layers.Conv1D` layer in the main path.
- ch_dilation : int
- The number of output channels for the causal dilated
- :class:`~numpy_ml.neural_nets.layers.Conv1D` layer in the main path.
- dilation : int
- The dilation rate for the causal dilated
- :class:`~numpy_ml.neural_nets.layers.Conv1D` layer in the main path.
- kernel_width : int
- The width of the causal dilated
- :class:`~numpy_ml.neural_nets.layers.Conv1D` kernel in the main
- path.
- init : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is 'glorot_uniform'.
- optimizer : str or :doc:`Optimizer ` object or None
- The optimization strategy to use when performing gradient updates
- within the :meth:`update` method. If None, use the
- :class:`~numpy_ml.neural_nets.optimizers.SGD` optimizer with default
- parameters. Default is None.
- """
- super().__init__()
-
- self.init = init
- self.dilation = dilation
- self.optimizer = optimizer
- self.ch_residual = ch_residual
- self.ch_dilation = ch_dilation
- self.kernel_width = kernel_width
-
- self._init_params()
-
- def _init_params(self):
- self._dv = {}
-
- self.conv_dilation = Conv1D(
- stride=1,
- pad="causal",
- init=self.init,
-            kernel_width=self.kernel_width,
- dilation=self.dilation,
- out_ch=self.ch_dilation,
- optimizer=self.optimizer,
- act_fn=Affine(slope=1, intercept=0),
- )
-
- self.tanh = Tanh()
- self.sigm = Sigmoid()
- self.multiply_gate = Multiply(act_fn=Affine(slope=1, intercept=0))
-
- self.conv_1x1 = Conv1D(
- stride=1,
- pad="same",
- dilation=0,
- init=self.init,
- kernel_width=1,
- out_ch=self.ch_residual,
- optimizer=self.optimizer,
- act_fn=Affine(slope=1, intercept=0),
- )
-
- self.add_residual = Add(act_fn=Affine(slope=1, intercept=0))
- self.add_skip = Add(act_fn=Affine(slope=1, intercept=0))
-
- @property
- def parameters(self):
- """A dictionary of the module parameters."""
- return {
- "components": {
- "conv_1x1": self.conv_1x1.parameters,
- "add_skip": self.add_skip.parameters,
- "add_residual": self.add_residual.parameters,
- "conv_dilation": self.conv_dilation.parameters,
- "multiply_gate": self.multiply_gate.parameters,
- }
- }
-
- @property
- def hyperparameters(self):
- """A dictionary of the module hyperparameters"""
- return {
- "layer": "WavenetResidualModule",
- "init": self.init,
- "dilation": self.dilation,
- "optimizer": self.optimizer,
- "ch_residual": self.ch_residual,
- "ch_dilation": self.ch_dilation,
- "kernel_width": self.kernel_width,
- "component_ids": [
- "conv_1x1",
- "add_skip",
- "add_residual",
- "conv_dilation",
- "multiply_gate",
- ],
- "components": {
- "conv_1x1": self.conv_1x1.hyperparameters,
- "add_skip": self.add_skip.hyperparameters,
- "add_residual": self.add_residual.hyperparameters,
- "conv_dilation": self.conv_dilation.hyperparameters,
- "multiply_gate": self.multiply_gate.hyperparameters,
- },
- }
-
- @property
- def derived_variables(self):
- """A dictionary of intermediate values computed during the
- forward/backward passes."""
- dv = {
- "conv_1x1_out": None,
- "conv_dilation_out": None,
- "multiply_gate_out": None,
- "components": {
- "conv_1x1": self.conv_1x1.derived_variables,
- "add_skip": self.add_skip.derived_variables,
- "add_residual": self.add_residual.derived_variables,
- "conv_dilation": self.conv_dilation.derived_variables,
- "multiply_gate": self.multiply_gate.derived_variables,
- },
- }
- dv.update(self._dv)
- return dv
-
- @property
- def gradients(self):
- """A dictionary of the module parameter gradients."""
- return {
- "components": {
- "conv_1x1": self.conv_1x1.gradients,
- "add_skip": self.add_skip.gradients,
- "add_residual": self.add_residual.gradients,
- "conv_dilation": self.conv_dilation.gradients,
- "multiply_gate": self.multiply_gate.gradients,
- }
- }
-
- def forward(self, X_main, X_skip=None):
- """
- Compute the module output on a single minibatch.
-
- Parameters
- ----------
- X_main : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The input volume consisting of `n_ex` examples, each with dimension
- (`in_rows`, `in_cols`, `in_ch`).
- X_skip : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`, or None
- The output of the preceding skip-connection if this is not the
- first module in the network.
-
- Returns
- -------
- Y_main : :py:class:`ndarray ` of shape `(n_ex, out_rows, out_cols, out_ch)`
- The output of the main pathway.
- Y_skip : :py:class:`ndarray ` of shape `(n_ex, out_rows, out_cols, out_ch)`
- The output of the skip-connection pathway.
- """
- self.X_main, self.X_skip = X_main, X_skip
- conv_dilation_out = self.conv_dilation.forward(X_main)
-
- tanh_gate = self.tanh.fn(conv_dilation_out)
- sigm_gate = self.sigm.fn(conv_dilation_out)
-
- multiply_gate_out = self.multiply_gate.forward([tanh_gate, sigm_gate])
- conv_1x1_out = self.conv_1x1.forward(multiply_gate_out)
-
- # if this is the first wavenet block, initialize the "previous" skip
- # connection sum to 0
- self.X_skip = np.zeros_like(conv_1x1_out) if X_skip is None else X_skip
-
-        Y_skip = self.add_skip.forward([self.X_skip, conv_1x1_out])
- Y_main = self.add_residual.forward([X_main, conv_1x1_out])
-
- self._dv["tanh_out"] = tanh_gate
- self._dv["sigm_out"] = sigm_gate
- self._dv["conv_dilation_out"] = conv_dilation_out
- self._dv["multiply_gate_out"] = multiply_gate_out
- self._dv["conv_1x1_out"] = conv_1x1_out
- return Y_main, Y_skip
-
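# The gated activation applied in `forward` above, shown in isolation:
# z = tanh(conv_out) * sigmoid(conv_out). The array below is a stand-in for a
# real dilated Conv1D output.
import numpy as np

conv_dilation_out = np.random.randn(2, 8, 16)    # (n_ex, l_out, ch_dilation)
sigm = 1.0 / (1.0 + np.exp(-conv_dilation_out))  # logistic sigmoid gate
gated = np.tanh(conv_dilation_out) * sigm        # elementwise gating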
- def backward(self, dY_skip, dY_main=None):
- dX_skip, dConv_1x1_out = self.add_skip.backward(dY_skip)
-
- # if this is the last wavenet block, dY_main will be None. if not,
- # calculate the error contribution from dY_main and add it to the
- # contribution from the skip path
- dX_main = np.zeros_like(self.X_main)
- if dY_main is not None:
- dX_main, dConv_1x1_main = self.add_residual.backward(dY_main)
- dConv_1x1_out += dConv_1x1_main
-
- dMultiply_out = self.conv_1x1.backward(dConv_1x1_out)
- dTanh_out, dSigm_out = self.multiply_gate.backward(dMultiply_out)
-
- conv_dilation_out = self.derived_variables["conv_dilation_out"]
- dTanh_in = dTanh_out * self.tanh.grad(conv_dilation_out)
- dSigm_in = dSigm_out * self.sigm.grad(conv_dilation_out)
- dDilation_out = dTanh_in + dSigm_in
-
- conv_back = self.conv_dilation.backward(dDilation_out)
- dX_main += conv_back
-
- self._dv["dLdTanh"] = dTanh_out
- self._dv["dLdSigmoid"] = dSigm_out
- self._dv["dLdConv_1x1"] = dConv_1x1_out
- self._dv["dLdMultiply"] = dMultiply_out
- self._dv["dLdConv_dilation"] = dDilation_out
- return dX_main, dX_skip
-
-
-class SkipConnectionIdentityModule(ModuleBase):
- def __init__(
- self,
- out_ch,
- kernel_shape1,
- kernel_shape2,
- stride1=1,
- stride2=1,
- act_fn=None,
- epsilon=1e-5,
- momentum=0.9,
- optimizer=None,
- init="glorot_uniform",
- ):
- """
- A ResNet-like "identity" shortcut module.
-
- Notes
- -----
- The identity module enforces `same` padding during each convolution to
- ensure module output has same dims as its input.
-
- .. code-block:: text
-
- X -> Conv2D -> Act_fn -> BatchNorm2D -> Conv2D -> BatchNorm2D -> + -> Act_fn
- \______________________________________________________________/
-
- References
- ----------
- .. [1] He et al. (2015). "Deep residual learning for image
- recognition." https://arxiv.org/pdf/1512.03385.pdf
-
- Parameters
- ----------
- out_ch : int
- The number of filters/kernels to compute in the first convolutional
- layer.
- kernel_shape1 : 2-tuple
- The dimension of a single 2D filter/kernel in the first
- convolutional layer.
- kernel_shape2 : 2-tuple
- The dimension of a single 2D filter/kernel in the second
- convolutional layer.
- stride1 : int
- The stride/hop of the convolution kernels in the first
- convolutional layer. Default is 1.
- stride2 : int
- The stride/hop of the convolution kernels in the second
- convolutional layer. Default is 1.
- act_fn : :doc:`Activation ` object or None
- The activation function for computing Y[t]. If None, use the
- identity :math:`f(x) = x` by default. Default is None.
- epsilon : float
- A small smoothing constant to use during
- :class:`~numpy_ml.neural_nets.layers.BatchNorm2D` computation to
- avoid divide-by-zero errors. Default is 1e-5.
- momentum : float
- The momentum term for the running mean/running std calculations in
- the :class:`~numpy_ml.neural_nets.layers.BatchNorm2D` layers. The
- closer this is to 1, the less weight will be given to the mean/std
- of the current batch (i.e., higher smoothing). Default is 0.9.
- optimizer : str or :doc:`Optimizer ` object or None
- The optimization strategy to use when performing gradient updates
- within the :meth:`update` method. If None, use the
- :class:`~numpy_ml.neural_nets.optimizers.SGD` optimizer with
- default parameters. Default is None.
- init : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is 'glorot_uniform'.
- """
- super().__init__()
-
- self.init = init
- self.in_ch = None
- self.out_ch = out_ch
- self.epsilon = epsilon
- self.stride1 = stride1
- self.stride2 = stride2
- self.optimizer = optimizer
- self.momentum = momentum
- self.kernel_shape1 = kernel_shape1
- self.kernel_shape2 = kernel_shape2
- self.act_fn = Affine(slope=1, intercept=0) if act_fn is None else act_fn
-
- self._init_params()
-
- def _init_params(self):
- self._dv = {}
-
- self.conv1 = Conv2D(
- pad="same",
- init=self.init,
- out_ch=self.out_ch,
- act_fn=self.act_fn,
- stride=self.stride1,
- optimizer=self.optimizer,
- kernel_shape=self.kernel_shape1,
- )
- # we can't initialize `conv2` without X's dimensions; see `forward`
- # for further details
- self.batchnorm1 = BatchNorm2D(epsilon=self.epsilon, momentum=self.momentum)
- self.batchnorm2 = BatchNorm2D(epsilon=self.epsilon, momentum=self.momentum)
- self.add3 = Add(self.act_fn)
-
- def _init_conv2(self):
- self.conv2 = Conv2D(
- pad="same",
- init=self.init,
- out_ch=self.in_ch,
- stride=self.stride2,
- optimizer=self.optimizer,
- kernel_shape=self.kernel_shape2,
- act_fn=Affine(slope=1, intercept=0),
- )
-
- @property
- def parameters(self):
- """A dictionary of the module parameters."""
- return {
- "components": {
- "add3": self.add3.parameters,
- "conv1": self.conv1.parameters,
- "conv2": self.conv2.parameters,
- "batchnorm1": self.batchnorm1.parameters,
- "batchnorm2": self.batchnorm2.parameters,
- }
- }
-
- @property
- def hyperparameters(self):
- """A dictionary of the module hyperparameters."""
- return {
- "layer": "SkipConnectionIdentityModule",
- "init": self.init,
- "in_ch": self.in_ch,
- "out_ch": self.out_ch,
- "epsilon": self.epsilon,
- "stride1": self.stride1,
- "stride2": self.stride2,
- "momentum": self.momentum,
- "optimizer": self.optimizer,
- "act_fn": str(self.act_fn),
- "kernel_shape1": self.kernel_shape1,
- "kernel_shape2": self.kernel_shape2,
- "component_ids": ["conv1", "batchnorm1", "conv2", "batchnorm2", "add3"],
- "components": {
- "add3": self.add3.hyperparameters,
- "conv1": self.conv1.hyperparameters,
- "conv2": self.conv2.hyperparameters,
- "batchnorm1": self.batchnorm1.hyperparameters,
- "batchnorm2": self.batchnorm2.hyperparameters,
- },
- }
-
- @property
- def derived_variables(self):
- """A dictionary of intermediate values computed during the
- forward/backward passes."""
- dv = {
- "conv1_out": None,
- "conv2_out": None,
- "batchnorm1_out": None,
- "batchnorm2_out": None,
- "components": {
- "add3": self.add3.derived_variables,
- "conv1": self.conv1.derived_variables,
- "conv2": self.conv2.derived_variables,
- "batchnorm1": self.batchnorm1.derived_variables,
- "batchnorm2": self.batchnorm2.derived_variables,
- },
- }
- dv.update(self._dv)
- return dv
-
- @property
- def gradients(self):
- """A dictionary of the accumulated module parameter gradients."""
- return {
- "components": {
- "add3": self.add3.gradients,
- "conv1": self.conv1.gradients,
- "conv2": self.conv2.gradients,
- "batchnorm1": self.batchnorm1.gradients,
- "batchnorm2": self.batchnorm2.gradients,
- }
- }
-
- def forward(self, X, retain_derived=True):
- """
- Compute the module output given input volume `X`.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape (n_ex, in_rows, in_cols, in_ch)
- The input volume consisting of `n_ex` examples, each with dimension
- (`in_rows`, `in_cols`, `in_ch`).
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, this suggests the layer
- will not be expected to backprop through wrt. this input. Default
- is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape (n_ex, out_rows, out_cols, out_ch)
- The module output volume.
- """
- if not hasattr(self, "conv2"):
- self.in_ch = X.shape[3]
- self._init_conv2()
-
- conv1_out = self.conv1.forward(X, retain_derived)
- bn1_out = self.batchnorm1.forward(conv1_out, retain_derived)
- conv2_out = self.conv2.forward(bn1_out, retain_derived)
- bn2_out = self.batchnorm2.forward(conv2_out, retain_derived)
- Y = self.add3.forward([X, bn2_out], retain_derived)
-
- if retain_derived:
- self._dv["conv1_out"] = conv1_out
- self._dv["conv2_out"] = conv2_out
- self._dv["batchnorm1_out"] = bn1_out
- self._dv["batchnorm2_out"] = bn2_out
- return Y
-
- def backward(self, dLdY, retain_grads=True):
- """
- Compute the gradient of the loss with respect to the layer parameters.
-
- Parameters
- ----------
- dLdy : :py:class:`ndarray ` of shape (`n_ex, out_rows, out_cols, out_ch`) or list of arrays
- The gradient(s) of the loss with respect to the module output(s).
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape (n_ex, in_rows, in_cols, in_ch)
- The gradient of the loss with respect to the module input volume.
- """
- dX, dBn2_out = self.add3.backward(dLdY, retain_grads)
- dConv2_out = self.batchnorm2.backward(dBn2_out, retain_grads)
- dBn1_out = self.conv2.backward(dConv2_out, retain_grads)
- dConv1_out = self.batchnorm1.backward(dBn1_out, retain_grads)
- dX += self.conv1.backward(dConv1_out, retain_grads)
-
- self._dv["dLdAdd3_X"] = dX
- self._dv["dLdBn2"] = dBn2_out
- self._dv["dLdBn1"] = dBn1_out
- self._dv["dLdConv2"] = dConv2_out
- self._dv["dLdConv1"] = dConv1_out
- return dX
-
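# The identity shortcut in miniature: the module computes act_fn(X + F(X)),
# where F is the conv/batchnorm stack and X passes through unchanged. F below
# is a placeholder function, not the module's real layers.
import numpy as np

X = np.random.randn(2, 8, 8, 3)   # (n_ex, in_rows, in_cols, in_ch)
F = lambda x: 0.1 * x             # stand-in for Conv2D -> BatchNorm2D -> Conv2D -> BatchNorm2D
Y = np.maximum(X + F(X), 0)       # residual sum followed by a ReLU-style act_fn
assert Y.shape == X.shape         # "same" padding keeps output dims equal to the input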
-
-class SkipConnectionConvModule(ModuleBase):
- def __init__(
- self,
- out_ch1,
- out_ch2,
- kernel_shape1,
- kernel_shape2,
- kernel_shape_skip,
- pad1=0,
- pad2=0,
- stride1=1,
- stride2=1,
- act_fn=None,
- epsilon=1e-5,
- momentum=0.9,
- stride_skip=1,
- optimizer=None,
- init="glorot_uniform",
- ):
- """
- A ResNet-like "convolution" shortcut module.
-
- Notes
- -----
- In contrast to :class:`SkipConnectionIdentityModule`, the additional
- `conv2d_skip` and `batchnorm_skip` layers in the shortcut path allow
- adjusting the dimensions of `X` to match the output of the main set of
- convolutions.
-
- .. code-block:: text
-
- X -> Conv2D -> Act_fn -> BatchNorm2D -> Conv2D -> BatchNorm2D -> + -> Act_fn
- \_____________________ Conv2D -> Batchnorm2D __________________/
-
- References
- ----------
- .. [1] He et al. (2015). "Deep residual learning for image
- recognition." https://arxiv.org/pdf/1512.03385.pdf
-
- Parameters
- ----------
- out_ch1 : int
- The number of filters/kernels to compute in the first convolutional
- layer.
- out_ch2 : int
- The number of filters/kernels to compute in the second
- convolutional layer.
- kernel_shape1 : 2-tuple
- The dimension of a single 2D filter/kernel in the first
- convolutional layer.
- kernel_shape2 : 2-tuple
- The dimension of a single 2D filter/kernel in the second
- convolutional layer.
- kernel_shape_skip : 2-tuple
- The dimension of a single 2D filter/kernel in the "skip"
- convolutional layer.
- stride1 : int
- The stride/hop of the convolution kernels in the first
- convolutional layer. Default is 1.
- stride2 : int
- The stride/hop of the convolution kernels in the second
- convolutional layer. Default is 1.
- stride_skip : int
- The stride/hop of the convolution kernels in the "skip"
- convolutional layer. Default is 1.
- pad1 : int, tuple, or 'same'
- The number of rows/columns of 0's to pad the input to the first
- convolutional layer with. Default is 0.
- pad2 : int, tuple, or 'same'
- The number of rows/columns of 0's to pad the input to the second
- convolutional layer with. Default is 0.
- act_fn : :doc:`Activation ` object or None
- The activation function for computing ``Y[t]``. If None, use the
- identity :math:`f(x) = x` by default. Default is None.
- epsilon : float
- A small smoothing constant to use during
- :class:`~numpy_ml.neural_nets.layers.BatchNorm2D` computation to
- avoid divide-by-zero errors. Default is 1e-5.
- momentum : float
- The momentum term for the running mean/running std calculations in
- the :class:`~numpy_ml.neural_nets.layers.BatchNorm2D` layers. The
- closer this is to 1, the less weight will be given to the mean/std
- of the current batch (i.e., higher smoothing). Default is 0.9.
- init : str
- The weight initialization strategy. Valid entries are
- {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}.
- optimizer : str or :doc:`Optimizer ` object
- The optimization strategy to use when performing gradient updates
- within the :class:`update` method. If None, use the
- :class:`~numpy_ml.neural_nets.optimizers.SGD` optimizer with
- default parameters. Default is None.
- """
- super().__init__()
-
- self.init = init
- self.pad1 = pad1
- self.pad2 = pad2
- self.in_ch = None
- self.out_ch1 = out_ch1
- self.out_ch2 = out_ch2
- self.epsilon = epsilon
- self.stride1 = stride1
- self.stride2 = stride2
- self.momentum = momentum
- self.optimizer = optimizer
- self.stride_skip = stride_skip
- self.kernel_shape1 = kernel_shape1
- self.kernel_shape2 = kernel_shape2
- self.kernel_shape_skip = kernel_shape_skip
- self.act_fn = Affine(slope=1, intercept=0) if act_fn is None else act_fn
-
- self._init_params()
-
- def _init_params(self, X=None):
- self._dv = {}
- self.conv1 = Conv2D(
- pad=self.pad1,
- init=self.init,
- act_fn=self.act_fn,
- out_ch=self.out_ch1,
- stride=self.stride1,
- optimizer=self.optimizer,
- kernel_shape=self.kernel_shape1,
- )
- self.conv2 = Conv2D(
- pad=self.pad2,
- init=self.init,
- out_ch=self.out_ch2,
- stride=self.stride2,
- optimizer=self.optimizer,
- kernel_shape=self.kernel_shape2,
- act_fn=Affine(slope=1, intercept=0),
- )
- # we can't initialize `conv_skip` without X's dimensions; see `forward`
- # for further details
- self.batchnorm1 = BatchNorm2D(epsilon=self.epsilon, momentum=self.momentum)
- self.batchnorm2 = BatchNorm2D(epsilon=self.epsilon, momentum=self.momentum)
- self.batchnorm_skip = BatchNorm2D(epsilon=self.epsilon, momentum=self.momentum)
- self.add3 = Add(self.act_fn)
-
- def _calc_skip_padding(self, X):
- pads = []
- for p in [self.pad1, self.pad2]:
- if isinstance(p, int):
- pads.append((p, p, p, p))
- elif isinstance(p, tuple) and len(p) == 2:
- pads.append((p[0], p[0], p[1], p[1]))
- self.pad1, self.pad2 = pads
-
- # compute the dimensions of the convolution1 output
- s1 = self.stride1
- fr1, fc1 = self.kernel_shape1
- _, in_rows, in_cols, _ = X.shape
- pr11, pr12, pc11, pc12 = self.pad1
-
- out_rows1 = np.floor(1 + (in_rows + pr11 + pr12 - fr1) / s1).astype(int)
- out_cols1 = np.floor(1 + (in_cols + pc11 + pc12 - fc1) / s1).astype(int)
-
- # compute the dimensions of the convolution2 output
- s2 = self.stride2
- fr2, fc2 = self.kernel_shape2
- pr21, pr22, pc21, pc22 = self.pad2
-
- out_rows2 = np.floor(1 + (out_rows1 + pr21 + pr22 - fr2) / s2).astype(int)
- out_cols2 = np.floor(1 + (out_cols1 + pc21 + pc22 - fc2) / s2).astype(int)
-
- # finally, compute the appropriate padding dims for the skip convolution
- desired_dims = (out_rows2, out_cols2)
- self.pad_skip = calc_pad_dims_2D(
- X.shape,
- desired_dims,
- stride=self.stride_skip,
- kernel_shape=self.kernel_shape_skip,
- )
-
- def _init_conv_skip(self, X):
- self._calc_skip_padding(X)
- self.conv_skip = Conv2D(
- init=self.init,
- pad=self.pad_skip,
- out_ch=self.out_ch2,
- stride=self.stride_skip,
- kernel_shape=self.kernel_shape_skip,
- act_fn=Affine(slope=1, intercept=0),
- optimizer=self.optimizer,
- )
-
- @property
- def parameters(self):
- """A dictionary of the module parameters."""
- return {
- "components": {
- "add3": self.add3.parameters,
- "conv1": self.conv1.parameters,
- "conv2": self.conv2.parameters,
- "conv_skip": self.conv_skip.parameters
- if hasattr(self, "conv_skip")
- else None,
- "batchnorm1": self.batchnorm1.parameters,
- "batchnorm2": self.batchnorm2.parameters,
- "batchnorm_skip": self.batchnorm_skip.parameters,
- }
- }
-
- @property
- def hyperparameters(self):
- """A dictionary of the module hyperparameters."""
- return {
- "layer": "SkipConnectionConvModule",
- "init": self.init,
- "pad1": self.pad1,
- "pad2": self.pad2,
- "in_ch": self.in_ch,
- "out_ch1": self.out_ch1,
- "out_ch2": self.out_ch2,
- "epsilon": self.epsilon,
- "stride1": self.stride1,
- "stride2": self.stride2,
- "momentum": self.momentum,
- "act_fn": str(self.act_fn),
- "stride_skip": self.stride_skip,
- "kernel_shape1": self.kernel_shape1,
- "kernel_shape2": self.kernel_shape2,
- "kernel_shape_skip": self.kernel_shape_skip,
- "pad_skip": self.pad_skip if hasattr(self, "pad_skip") else None,
- "component_ids": [
- "add3",
- "conv1",
- "conv2",
- "conv_skip",
- "batchnorm1",
- "batchnorm2",
- "batchnorm_skip",
- ],
- "components": {
- "add3": self.add3.hyperparameters,
- "conv1": self.conv1.hyperparameters,
- "conv2": self.conv2.hyperparameters,
- "conv_skip": self.conv_skip.hyperparameters
- if hasattr(self, "conv_skip")
- else None,
- "batchnorm1": self.batchnorm1.hyperparameters,
- "batchnorm2": self.batchnorm2.hyperparameters,
- "batchnorm_skip": self.batchnorm_skip.hyperparameters,
- },
- }
-
- @property
- def derived_variables(self):
- """A dictionary of intermediate values computed during the
- forward/backward passes."""
- dv = {
- "conv1_out": None,
- "conv2_out": None,
- "conv_skip_out": None,
- "batchnorm1_out": None,
- "batchnorm2_out": None,
- "batchnorm_skip_out": None,
- "components": {
- "add3": self.add3.derived_variables,
- "conv1": self.conv1.derived_variables,
- "conv2": self.conv2.derived_variables,
- "conv_skip": self.conv_skip.derived_variables
- if hasattr(self, "conv_skip")
- else None,
- "batchnorm1": self.batchnorm1.derived_variables,
- "batchnorm2": self.batchnorm2.derived_variables,
- "batchnorm_skip": self.batchnorm_skip.derived_variables,
- },
- }
- dv.update(self._dv)
- return dv
-
- @property
- def gradients(self):
- """A dictionary of the accumulated module parameter gradients."""
- return {
- "components": {
- "add3": self.add3.gradients,
- "conv1": self.conv1.gradients,
- "conv2": self.conv2.gradients,
- "conv_skip": self.conv_skip.gradients
- if hasattr(self, "conv_skip")
- else None,
- "batchnorm1": self.batchnorm1.gradients,
- "batchnorm2": self.batchnorm2.gradients,
- "batchnorm_skip": self.batchnorm_skip.gradients,
- }
- }
-
- def forward(self, X, retain_derived=True):
- """
- Compute the layer output given input volume `X`.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The input volume consisting of `n_ex` examples, each with dimension
- (`in_rows`, `in_cols`, `in_ch`).
- retain_derived : bool
- Whether to retain the variables calculated during the forward pass
- for use later during backprop. If False, assume that the module will
- not need to backprop with respect to this input. Default is True.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, out_rows, out_cols, out_ch)`
- The module output volume.
- """
- # now that we have the input dims for X we can initialize the proper
- # padding in the `conv_skip` layer
- if not hasattr(self, "conv_skip"):
- self._init_conv_skip(X)
- self.in_ch = X.shape[3]
-
- conv1_out = self.conv1.forward(X, retain_derived)
- bn1_out = self.batchnorm1.forward(conv1_out, retain_derived)
- conv2_out = self.conv2.forward(bn1_out, retain_derived)
- bn2_out = self.batchnorm2.forward(conv2_out, retain_derived)
- conv_skip_out = self.conv_skip.forward(X, retain_derived)
- bn_skip_out = self.batchnorm_skip.forward(conv_skip_out, retain_derived)
- Y = self.add3.forward([bn_skip_out, bn2_out], retain_derived)
-
- if retain_derived:
- self._dv["conv1_out"] = conv1_out
- self._dv["conv2_out"] = conv2_out
- self._dv["batchnorm1_out"] = bn1_out
- self._dv["batchnorm2_out"] = bn2_out
- self._dv["conv_skip_out"] = conv_skip_out
- self._dv["batchnorm_skip_out"] = bn_skip_out
- return Y
-
- def backward(self, dLdY, retain_grads=True):
- """
- Compute the gradient of the loss with respect to the module parameters.
-
- Parameters
- ----------
- dLdY : :py:class:`ndarray ` of shape `(n_ex, out_rows, out_cols, out_ch)`
- or list of arrays
- The gradient(s) of the loss with respect to the module output(s).
- retain_grads : bool
- Whether to include the intermediate parameter gradients computed
- during the backward pass in the final parameter update. Default is
- True.
-
- Returns
- -------
- dX : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_ch)`
- The gradient of the loss with respect to the module input volume.
- """
- dBnskip_out, dBn2_out = self.add3.backward(dLdY, retain_grads)
- dConvskip_out = self.batchnorm_skip.backward(dBnskip_out, retain_grads)
- dX = self.conv_skip.backward(dConvskip_out, retain_grads)
-
- dConv2_out = self.batchnorm2.backward(dBn2_out, retain_grads)
- dBn1_out = self.conv2.backward(dConv2_out, retain_grads)
- dConv1_out = self.batchnorm1.backward(dBn1_out, retain_grads)
- dX += self.conv1.backward(dConv1_out, retain_grads)
-
- if retain_grads:
- self._dv["dLdAdd3_X"] = dX
- self._dv["dLdBn1"] = dBn1_out
- self._dv["dLdBn2"] = dBn2_out
- self._dv["dLdConv1"] = dConv1_out
- self._dv["dLdConv2"] = dConv2_out
- self._dv["dLdBnSkip"] = dBnskip_out
- self._dv["dLdConvSkip"] = dConvskip_out
- return dX
-
-
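-def _demo_skip_connection_conv():
- """Editor's sketch, not part of the original source: a minimal usage
- example for ``SkipConnectionConvModule``, assuming the layer classes and
- ``np`` imported at the top of this module are available. Inputs follow
- the NHWC layout used throughout; shapes and hyperparameters here are
- illustrative only."""
- module = SkipConnectionConvModule(
- out_ch1=4,
- out_ch2=8,
- kernel_shape1=(3, 3),
- kernel_shape2=(3, 3),
- kernel_shape_skip=(1, 1),
- pad1=1,
- pad2=1,
- )
- X = np.random.randn(2, 16, 16, 3) # (n_ex, in_rows, in_cols, in_ch)
- Y = module.forward(X) # `conv_skip` is built lazily on the first call
- dX = module.backward(np.ones_like(Y)) # gradient wrt the input volume
- return Y.shape, dX.shape
-
-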
-class BidirectionalLSTM(ModuleBase):
- def __init__(
- self,
- n_out,
- act_fn=None,
- gate_fn=None,
- merge_mode="concat",
- init="glorot_uniform",
- optimizer=None,
- ):
- """
- A single bidirectional long short-term memory (LSTM) layer.
-
- Parameters
- ----------
- n_out : int
- The dimension of a single hidden state / output on a given timestep
- act_fn : :doc:`Activation ` object or None
- The activation function for computing ``A[t]``. If not specified,
- use :class:`~numpy_ml.neural_nets.activations.Tanh` by default.
- gate_fn : :doc:`Activation ` object or None
- The gate function for computing the update, forget, and output
- gates. If not specified, use
- :class:`~numpy_ml.neural_nets.activations.Sigmoid` by default.
- merge_mode : {"sum", "multiply", "concat", "average"}
- Mode by which outputs of the forward and backward LSTMs will be
- combined. Default is 'concat'.
- optimizer : str or :doc:`Optimizer ` object or None
- The optimization strategy to use when performing gradient updates
- within the `update` method. If None, use the
- :class:`~numpy_ml.neural_nets.optimizers.SGD` optimizer with
- default parameters. Default is None.
- init : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is 'glorot_uniform'.
- """
- super().__init__()
-
- self.init = init
- self.n_in = None
- self.n_out = n_out
- self.optimizer = optimizer
- self.merge_mode = merge_mode
- self.act_fn = Tanh() if act_fn is None else act_fn
- self.gate_fn = Sigmoid() if gate_fn is None else gate_fn
- self._init_params()
-
- def _init_params(self):
- self.cell_fwd = LSTMCell(
- init=self.init,
- n_out=self.n_out,
- act_fn=self.act_fn,
- gate_fn=self.gate_fn,
- optimizer=self.optimizer,
- )
- self.cell_bwd = LSTMCell(
- init=self.init,
- n_out=self.n_out,
- act_fn=self.act_fn,
- gate_fn=self.gate_fn,
- optimizer=self.optimizer,
- )
-
- def forward(self, X):
- """
- Run a forward pass across all timesteps in the input.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, n_in, n_t)`
- Input consisting of `n_ex` examples each of dimensionality `n_in`
- and extending for `n_t` timesteps.
-
- Returns
- -------
- Y : :py:class:`ndarray ` of shape `(n_ex, n_out, n_t)`
- The value of the hidden state for each of the `n_ex` examples
- across each of the `n_t` timesteps.
- """
- Y_fwd, Y_bwd, Y = [], [], []
- n_ex, self.n_in, n_t = X.shape
-
- # forward LSTM
- for t in range(n_t):
- yt, ct = self.cell_fwd.forward(X[:, :, t])
- Y_fwd.append(yt)
-
- # backward LSTM
- for t in reversed(range(n_t)):
- yt, ct = self.cell_bwd.forward(X[:, :, t])
- Y_bwd.insert(0, yt)
-
- # merge forward and backward states
- for t in range(n_t):
- if self.merge_mode == "concat":
- Y.append(np.concatenate([Y_fwd[t], Y_bwd[t]], axis=1))
- elif self.merge_mode == "sum":
- Y.append(Y_fwd[t] + Y_bwd[t])
- elif self.merge_mode == "average":
- Y.append((Y_fwd[t] + Y_bwd[t]) / 2)
- elif self.merge_mode == "multiply":
- Y.append(Y_fwd[t] * Y_bwd[t])
-
- self.Y_fwd, self.Y_bwd = Y_fwd, Y_bwd
- return np.dstack(Y)
-
- def backward(self, dLdA):
- """
- Run a backward pass across all timesteps in the input.
-
- Parameters
- ----------
- dLdA : :py:class:`ndarray ` of shape `(n_ex, n_out, n_t)`
- The gradient of the loss with respect to the layer output for each
- of the `n_ex` examples across all `n_t` timesteps.
-
- Returns
- -------
- dLdX : :py:class:`ndarray ` of shape `(n_ex, n_in, n_t)`
- The gradient of the loss with respect to the layer input for each of
- the `n_ex` examples across each of the `n_t` timesteps.
- """
- assert self.trainable, "Layer is frozen"
-
- n_ex, n_out, n_t = dLdA.shape
- dLdX_f, dLdX_b, dLdX = [], [], []
-
- # forward LSTM
- for t in reversed(range(n_t)):
- if self.merge_mode == "concat":
- dLdXt_f = self.cell_fwd.backward(dLdA[:, : self.n_out, t])
- elif self.merge_mode == "sum":
- dLdXt_f = self.cell_fwd.backward(dLdA[:, :, t])
- elif self.merge_mode == "multiply":
- dLdXt_f = self.cell_fwd.backward(dLdA[:, :, t] * self.Y_bwd[t])
- elif self.merge_mode == "average":
- dLdXt_f = self.cell_fwd.backward(dLdA[:, :, t] * 0.5)
- dLdX_f.insert(0, dLdXt_f)
-
- # backward LSTM
- for t in range(n_t):
- if self.merge_mode == "concat":
- dLdXt_b = self.cell_bwd.backward(dLdA[:, self.n_out :, t])
- elif self.merge_mode == "sum":
- dLdXt_b = self.cell_bwd.backward(dLdA[:, :, t])
- elif self.merge_mode == "multiply":
- dLdXt_b = self.cell_bwd.backward(dLdA[:, :, t] * self.Y_fwd[t])
- elif self.merge_mode == "average":
- dLdXt_b = self.cell_bwd.backward(dLdA[:, :, t] * 0.5)
- dLdX_b.append(dLdXt_b)
-
- for t in range(n_t):
- dLdX.append(dLdX_f[t] + dLdX_b[t])
-
- return np.dstack(dLdX)
-
- @property
- def derived_variables(self):
- """A dictionary of intermediate values computed during the
- forward/backward passes."""
- return {
- "components": {
- "cell_fwd": self.cell_fwd.derived_variables,
- "cell_bwd": self.cell_bwd.derived_variables,
- }
- }
-
- @property
- def gradients(self):
- """A dictionary of the accumulated module parameter gradients."""
- return {
- "components": {
- "cell_fwd": self.cell_fwd.gradients,
- "cell_bwd": self.cell_bwd.gradients,
- }
- }
-
- @property
- def parameters(self):
- """A dictionary of the module parameters."""
- return {
- "components": {
- "cell_fwd": self.cell_fwd.parameters,
- "cell_bwd": self.cell_bwd.parameters,
- }
- }
-
- @property
- def hyperparameters(self):
- """A dictionary of the module hyperparameters."""
- return {
- "layer": "BidirectionalLSTM",
- "init": self.init,
- "n_in": self.n_in,
- "n_out": self.n_out,
- "act_fn": str(self.act_fn),
- "optimizer": self.optimizer,
- "merge_mode": self.merge_mode,
- "component_ids": ["cell_fwd", "cell_bwd"],
- "components": {
- "cell_fwd": self.cell_fwd.hyperparameters,
- "cell_bwd": self.cell_bwd.hyperparameters,
- },
- }
-
-
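-def _demo_bidirectional_lstm():
- """Editor's sketch, not part of the original source: run a
- ``BidirectionalLSTM`` over a random batch, assuming ``np`` is imported in
- this module. With ``merge_mode="concat"`` the output dimension is
- ``2 * n_out``."""
- lstm = BidirectionalLSTM(n_out=5, merge_mode="concat")
- X = np.random.randn(4, 10, 7) # (n_ex, n_in, n_t)
- Y = lstm.forward(X) # (n_ex, 2 * n_out, n_t)
- dLdX = lstm.backward(np.ones_like(Y)) # same shape as X
- return Y.shape, dLdX.shape
-
-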
-class MultiHeadedAttentionModule(ModuleBase):
- def __init__(self, n_heads=8, dropout_p=0, init="glorot_uniform", optimizer=None):
- """
- A multi-headed attention module.
-
- Notes
- -----
- Multi-head attention allows a model to jointly attend to information from
- different representation subspaces at different positions. With a
- single head, this information would get averaged away when the
- attention weights are combined with the values.
-
- .. math::
-
- \\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V})
- = [\\text{head}_1; ...; \\text{head}_h] \\mathbf{W}^{(O)}
-
- where
-
- .. math::
-
- \\text{head}_i = \\text{SDP_attention}(
- \mathbf{Q W}_i^{(Q)}, \mathbf{K W}_i^{(K)}, \mathbf{V W}_i^{(V)})
-
- and the projection weights are parameter matrices:
-
- .. math::
-
- \mathbf{W}_i^{(Q)} &\in
- \mathbb{R}^{(\\text{kqv_dim} \ \\times \ \\text{latent_dim})} \\\\
- \mathbf{W}_i^{(K)} &\in
- \mathbb{R}^{(\\text{kqv_dim} \ \\times \ \\text{latent_dim})} \\\\
- \mathbf{W}_i^{(V)} &\in
- \mathbb{R}^{(\\text{kqv_dim} \ \\times \ \\text{latent_dim})} \\\\
- \mathbf{W}^{(O)} &\in
- \mathbb{R}^{(\\text{n_heads} \cdot \\text{latent_dim} \ \\times \ \\text{kqv_dim})}
-
- Importantly, the current module explicitly assumes that
-
- .. math::
-
- \\text{kqv_dim} = \\text{dim(query)} = \\text{dim(keys)} = \\text{dim(values)}
-
- and that
-
- .. math::
-
- \\text{latent_dim} = \\text{kqv_dim / n_heads}
-
- **[MH Attention Head h]**:
-
- .. code-block:: text
-
- K --> W_h^(K) ------\\
- V --> W_h^(V) ------- > DP_Attention --> head_h
- Q --> W_h^(Q) ------/
-
- The full **[MultiHeadedAttentionModule]** then becomes
-
- .. code-block:: text
-
- -----------------
- K --> | [Attn Head 1] | --> head_1 --\\
- V --> | [Attn Head 2] | --> head_2 --\\
- Q --> | ... | ... --> Concat --> W^(O) --> MH_out
- | [Attn Head Z] | --> head_Z --/
- -----------------
-
- Due to the reduced dimension of each head, the total computational cost
- is similar to that of a single attention head with full (i.e., kqv_dim)
- dimensionality.
-
- Parameters
- ----------
- n_heads : int
- The number of simultaneous attention heads to use. Note that the
- larger `n_heads`, the smaller the dimensionality of any single
- head, since ``latent_dim = kqv_dim / n_heads``. Default is 8.
- dropout_p : float in [0, 1)
- The dropout probability during training, applied to the output of
- the softmax in each dot-product attention head. If 0, no dropout is
- applied. Default is 0.
- init : {'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'}
- The weight initialization strategy. Default is 'glorot_uniform'.
- optimizer : str, :doc:`Optimizer ` object, or None
- The optimization strategy to use when performing gradient updates
- within the :meth:`update` method. If None, use the
- :class:`~numpy_ml.neural_nets.optimizers.SGD` optimizer with default
- parameters. Default is None.
- """
- super().__init__()
-
- self.init = init
- self.kqv_dim = None
- self.projections = {}
- self.n_heads = n_heads
- self.optimizer = optimizer
- self.dropout_p = dropout_p
- self.is_initialized = False
-
- def _init_params(self):
- self._dv = {}
-
- # assume dim(keys) = dim(query) = dim(values)
- assert self.kqv_dim % self.n_heads == 0
- self.latent_dim = self.kqv_dim // self.n_heads
-
- self.attention = DotProductAttention(scale=True, dropout_p=self.dropout_p)
- self.projections = {
- k: Dropout(
- FullyConnected(
- init=self.init,
- n_out=self.kqv_dim,
- optimizer=self.optimizer,
- act_fn="Affine(slope=1, intercept=0)",
- ),
- self.dropout_p,
- )
- for k in ["Q", "K", "V", "O"]
- }
-
- self.is_initialized = True
-
- def forward(self, Q, K, V):
- if not self.is_initialized:
- self.kqv_dim = Q.shape[-1]
- self._init_params()
-
- # project queries, keys, and values into the `latent_dim`-dimensional subspace
- n_ex = Q.shape[0]
- for k, x in zip(["Q", "K", "V"], [Q, K, V]):
- proj = self.projections[k].forward(x)
- proj = proj.reshape(n_ex, -1, self.n_heads, self.latent_dim).swapaxes(1, 2)
- self._dv["{}_proj".format(k)] = proj
-
- dv = self.derived_variables
- Q_proj, K_proj, V_proj = dv["Q_proj"], dv["K_proj"], dv["V_proj"]
-
- # apply scaled dot-product attention to the projected vectors
- attn = self.attention
- attn_out = attn.forward(Q_proj, K_proj, V_proj)
- self._dv["attention_weights"] = attn.derived_variables["attention_weights"]
-
- # concatenate the different heads using `reshape` to create an
- # `kqv_dim`-dim vector
- attn_out = attn_out.swapaxes(1, 2).reshape(n_ex, self.kqv_dim)
- self._dv["attention_out"] = attn_out.reshape(n_ex, -1, self.kqv_dim)
-
- # apply the final output projection
- Y = self.projections["O"].forward(attn_out)
- Y = Y.reshape(n_ex, -1, self.kqv_dim)
- return Y
-
- def backward(self, dLdy):
- n_ex = dLdy.shape[0]
- dLdy = dLdy.reshape(n_ex, self.kqv_dim)
- dLdX = self.projections["O"].backward(dLdy)
- dLdX = dLdX.reshape(n_ex, self.n_heads, -1, self.latent_dim)
-
- dLdQ_proj, dLdK_proj, dLdV_proj = self.attention.backward(dLdX)
-
- self._dv["dQ_proj"] = dLdQ_proj
- self._dv["dK_proj"] = dLdK_proj
- self._dv["dV_proj"] = dLdV_proj
-
- dLdQ_proj = dLdQ_proj.reshape(n_ex, self.kqv_dim)
- dLdK_proj = dLdK_proj.reshape(n_ex, self.kqv_dim)
- dLdV_proj = dLdV_proj.reshape(n_ex, self.kqv_dim)
-
- dLdQ = self.projections["Q"].backward(dLdQ_proj)
- dLdK = self.projections["K"].backward(dLdK_proj)
- dLdV = self.projections["V"].backward(dLdV_proj)
- return dLdQ, dLdK, dLdV
-
- @property
- def derived_variables(self):
- """A dictionary of intermediate values computed during the
- forward/backward passes."""
- dv = {
- "Q_proj": None,
- "K_proj": None,
- "V_proj": None,
- "components": {
- "Q": self.projections["Q"].derived_variables,
- "K": self.projections["K"].derived_variables,
- "V": self.projections["V"].derived_variables,
- "O": self.projections["O"].derived_variables,
- "attention": self.attention.derived_variables,
- },
- }
- dv.update(self._dv)
- return dv
-
- @property
- def gradients(self):
- """A dictionary of the accumulated module parameter gradients."""
- return {
- "components": {
- "Q": self.projections["Q"].gradients,
- "K": self.projections["K"].gradients,
- "V": self.projections["V"].gradients,
- "O": self.projections["O"].gradients,
- "attention": self.attention.gradients,
- }
- }
-
- @property
- def parameters(self):
- """A dictionary of the module parameters."""
- return {
- "components": {
- "Q": self.projections["Q"].parameters,
- "K": self.projections["K"].parameters,
- "V": self.projections["V"].parameters,
- "O": self.projections["O"].parameters,
- "attention": self.attention.parameters,
- }
- }
-
- @property
- def hyperparameters(self):
- """A dictionary of the module hyperparameters."""
- return {
- "layer": "MultiHeadedAttentionModule",
- "init": self.init,
- "kqv_dim": self.kqv_dim,
- "latent_dim": self.latent_dim,
- "n_heads": self.n_heads,
- "dropout_p": self.dropout_p,
- "component_ids": ["attention", "Q", "K", "V", "O"],
- "components": {
- "Q": self.projections["Q"].hyperparameters,
- "K": self.projections["K"].hyperparameters,
- "V": self.projections["V"].hyperparameters,
- "O": self.projections["O"].hyperparameters,
- "attention": self.attention.hyperparameters,
- },
- }
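-
-
-def _demo_multi_headed_attention():
- """Editor's sketch, not part of the original source: self-attention over a
- random batch, assuming ``np`` is imported in this module. ``kqv_dim`` is
- read from the last axis of the queries and must be divisible by
- ``n_heads``."""
- mha = MultiHeadedAttentionModule(n_heads=4, dropout_p=0)
- X = np.random.randn(3, 16) # (n_ex, kqv_dim)
- Y = mha.forward(X, X, X) # (n_ex, 1, kqv_dim)
- dLdQ, dLdK, dLdV = mha.backward(np.ones_like(Y))
- return Y.shape, dLdQ.shape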
diff --git a/aitk/keras/numpy_ml_utils/README.md b/aitk/keras/numpy_ml_utils/README.md
deleted file mode 100644
index a50b58b..0000000
--- a/aitk/keras/numpy_ml_utils/README.md
+++ /dev/null
@@ -1,38 +0,0 @@
-# Utilities
-
-The utilities module implements a number of useful functions and objects that
-power other ML algorithms across the repo.
-
-- `data_structures.py` implements a few useful data structures
- - A max- and min-heap ordered priority queue
- - A [ball tree](https://en.wikipedia.org/wiki/Ball_tree) with the KNS1 algorithm ([Omohundro, 1989](http://ftp.icsi.berkeley.edu/ftp/pub/techreports/1989/tr-89-063.pdf); [Moore & Gray, 2006](http://people.ee.duke.edu/~lcarin/liu06a.pdf))
- - A discrete sampler implementing Vose's algorithm for the [alias method](https://en.wikipedia.org/wiki/Alias_method) ([Walker, 1977](https://dl.acm.org/citation.cfm?id=355749); [Vose, 1991](https://pdfs.semanticscholar.org/f65b/cde1fcf82e05388b31de80cba10bf65acc07.pdf))
-
-- `kernels.py` implements several general-purpose similarity kernels
- - Linear kernel
- - Polynomial kernel
- - Radial basis function kernel
-
-- `distance_metrics.py` implements common distance metrics
- - Euclidean (L2) distance
- - Manhattan (L1) distance
- - Chebyshev (L-infinity) distance
- - Minkowski-p distance
- - Hamming distance
-
-- `graphs.py` implements simple data structures and algorithms for graph
- processing.
- - Undirected + directed graph objects allowing for probabilistic edge weights
- - Graph generators (Erdos-Renyi, random DAGs)
- - Topological sorting for DAGs
- - Cycle detection
- - Simple path-finding
-
-- `windows.py` implements several common windowing functions
- - Hann
- - Hamming
- - Blackman-Harris
- - Generalized cosine
-
-- `testing.py` implements helper functions that prove useful when writing unit
- tests, including data generators and various assert statements
diff --git a/aitk/keras/numpy_ml_utils/__init__.py b/aitk/keras/numpy_ml_utils/__init__.py
deleted file mode 100644
index c90b4df..0000000
--- a/aitk/keras/numpy_ml_utils/__init__.py
+++ /dev/null
@@ -1,6 +0,0 @@
-from . import testing
-from . import data_structures
-from . import distance_metrics
-from . import kernels
-from . import windows
-from . import graphs
diff --git a/aitk/keras/numpy_ml_utils/data_structures.py b/aitk/keras/numpy_ml_utils/data_structures.py
deleted file mode 100644
index 4a1ea31..0000000
--- a/aitk/keras/numpy_ml_utils/data_structures.py
+++ /dev/null
@@ -1,522 +0,0 @@
-import heapq
-from copy import copy, deepcopy
-from collections.abc import Hashable
-
-import numpy as np
-
-from .distance_metrics import euclidean
-
-#######################################################################
-# Priority Queue #
-#######################################################################
-
-
-class PQNode(object):
- def __init__(self, key, val, priority, entry_id, **kwargs):
- """A generic node object for holding entries in :class:`PriorityQueue`"""
- self.key = key
- self.val = val
- self.entry_id = entry_id
- self.priority = priority
-
- def __repr__(self):
- fstr = "PQNode(key={}, val={}, priority={}, entry_id={})"
- return fstr.format(self.key, self.val, self.priority, self.entry_id)
-
- def to_dict(self):
- """Return a dictionary representation of the node's contents"""
- d = self.__dict__
- d["id"] = "PQNode"
- return d
-
- def __gt__(self, other):
- if not isinstance(other, PQNode):
- return NotImplemented
- if self.priority == other.priority:
- return self.entry_id > other.entry_id
- return self.priority > other.priority
-
- def __ge__(self, other):
- if not isinstance(other, PQNode):
- return NotImplemented
- return self.priority >= other.priority
-
- def __lt__(self, other):
- if not isinstance(other, PQNode):
- return NotImplemented
- if self.priority == other.priority:
- return self.entry_id < other.entry_id
- return self.priority < other.priority
-
- def __le__(self, other):
- if not isinstance(other, PQNode):
- return NotImplemented
- return self.priority <= other.priority
-
-
-class PriorityQueue:
- def __init__(self, capacity, heap_order="max"):
- """
- A priority queue implementation using a binary heap.
-
- Notes
- -----
- A priority queue is a data structure useful for storing the top
- `capacity` largest or smallest elements in a collection of values. As a
- result of using a binary heap, ``PriorityQueue`` offers `O(log N)`
- :meth:`push` and :meth:`pop` operations.
-
- Parameters
- ----------
- capacity: int
- The maximum number of items that can be held in the queue.
- heap_order: {"max", "min"}
- Whether the priority queue should retain the items with the
- `capacity` smallest (`heap_order` = 'min') or `capacity` largest
- (`heap_order` = 'max') priorities.
- """
- assert heap_order in ["max", "min"], "heap_order must be either 'max' or 'min'"
- self.capacity = capacity
- self.heap_order = heap_order
-
- self._pq = []
- self._count = 0
- self._entry_counter = 0
-
- def __repr__(self):
- fstr = "PriorityQueue(capacity={}, heap_order={}) with {} items"
- return fstr.format(self.capacity, self.heap_order, self._count)
-
- def __len__(self):
- return self._count
-
- def __iter__(self):
- return iter(self._pq)
-
- def push(self, key, priority, val=None):
- """
- Add a new (key, value) pair with priority `priority` to the queue.
-
- Notes
- -----
- If the queue is at capacity and `priority` exceeds the priority of the
- item with the largest/smallest priority currently in the queue, replace
- the current queue item with (`key`, `val`).
-
- Parameters
- ----------
- key : hashable object
- The key to insert into the queue.
- priority : comparable
- The priority for the `key`, `val` pair.
- val : object
- The value associated with `key`. Default is None.
- """
- if self.heap_order == "max":
- priority = -1 * priority
-
- item = PQNode(key=key, val=val, priority=priority, entry_id=self._entry_counter)
- heapq.heappush(self._pq, item)
-
- self._count += 1
- self._entry_counter += 1
-
- while self._count > self.capacity:
- self.pop()
-
- def pop(self):
- """
- Remove the item with the largest/smallest (depending on
- ``self.heap_order``) priority from the queue and return it.
-
- Notes
- -----
- In contrast to :meth:`peek`, this operation is `O(log N)`.
-
- Returns
- -------
- item : :class:`PQNode` instance or None
- Item with the largest/smallest priority, depending on
- ``self.heap_order``.
- """
- item = heapq.heappop(self._pq).to_dict()
- if self.heap_order == "max":
- item["priority"] = -1 * item["priority"]
- self._count -= 1
- return item
-
- def peek(self):
- """
- Return the item with the largest/smallest (depending on
- ``self.heap_order``) priority *without* removing it from the queue.
-
- Notes
- -----
- In contrast to :meth:`pop`, this operation is O(1).
-
- Returns
- -------
- item : :class:`PQNode` instance or None
- Item with the largest/smallest priority, depending on
- ``self.heap_order``.
- """
- item = None
- if self._count > 0:
- item = copy(self._pq[0].to_dict())
- if self.heap_order == "max":
- item["priority"] = -1 * item["priority"]
- return item
-
-
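-def _demo_priority_queue():
- """Editor's sketch, not part of the original source: mirror the
- k-nearest-neighbour usage below. A "max" queue of capacity 3 keeps the 3
- entries with the smallest priorities and pops the largest of them first."""
- pq = PriorityQueue(capacity=3, heap_order="max")
- for key, priority in [("a", 5.0), ("b", 1.0), ("c", 4.0), ("d", 2.0)]:
- pq.push(key, priority)
- return pq.pop()["key"] # "c" -- retained priorities are {1.0, 2.0, 4.0}
-
-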
-#######################################################################
-# Ball Tree #
-#######################################################################
-
-
-class BallTreeNode:
- def __init__(self, centroid=None, X=None, y=None):
- self.left = None
- self.right = None
- self.radius = None
- self.is_leaf = False
-
- self.data = X
- self.targets = y
- self.centroid = centroid
-
- def __repr__(self):
- fstr = "BallTreeNode(centroid={}, is_leaf={})"
- return fstr.format(self.centroid, self.is_leaf)
-
- def to_dict(self):
- d = self.__dict__
- d["id"] = "BallTreeNode"
- return d
-
-
-class BallTree:
- def __init__(self, leaf_size=40, metric=None):
- """
- A ball tree data structure.
-
- Notes
- -----
- A ball tree is a binary tree in which every node defines a
- `D`-dimensional hypersphere ("ball") containing a subset of the points
- to be searched. Each internal node of the tree partitions the data
- points into two disjoint sets which are associated with different
- balls. While the balls themselves may intersect, each point is assigned
- to one or the other ball in the partition according to its distance
- from the ball's center. Each leaf node in the tree defines a ball and
- enumerates all data points inside that ball.
-
- Parameters
- ----------
- leaf_size : int
- The maximum number of datapoints at each leaf. Default is 40.
- metric : :doc:`Distance metric ` or None
- The distance metric to use for computing nearest neighbors. If
- None, use the :func:`~numpy_ml.utils.distance_metrics.euclidean`
- metric. Default is None.
-
- References
- ----------
- .. [1] Omohundro, S. M. (1989). "Five balltree construction algorithms". *ICSI
- Technical Report TR-89-063*.
- .. [2] Liu, T., Moore, A., & Gray A. (2006). "New algorithms for efficient
- high-dimensional nonparametric classification". *J. Mach. Learn. Res.,
- 7*, 1135-1158.
- """
- self.root = None
- self.leaf_size = leaf_size
- self.metric = metric if metric is not None else euclidean
-
- def fit(self, X, y=None):
- """
- Build a ball tree recursively using the O(M log N) `k`-d construction
- algorithm.
-
- Notes
- -----
- Recursively divides data into nodes defined by a centroid `C` and radius
- `r` such that each point below the node lies within the hyper-sphere
- defined by `C` and `r`.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(N, M)`
- An array of `N` examples each with `M` features.
- y : :py:class:`ndarray ` of shape `(N, *)` or None
- An array of target values / labels associated with the entries in
- `X`. Default is None.
- """
- centroid, left_X, left_y, right_X, right_y = self._split(X, y)
- self.root = BallTreeNode(centroid=centroid)
- self.root.radius = np.max([self.metric(centroid, x) for x in X])
- self.root.left = self._build_tree(left_X, left_y)
- self.root.right = self._build_tree(right_X, right_y)
-
- def _build_tree(self, X, y):
- centroid, left_X, left_y, right_X, right_y = self._split(X, y)
-
- if X.shape[0] <= self.leaf_size:
- leaf = BallTreeNode(centroid=centroid, X=X, y=y)
- leaf.radius = np.max([self.metric(centroid, x) for x in X])
- leaf.is_leaf = True
- return leaf
-
- node = BallTreeNode(centroid=centroid)
- node.radius = np.max([self.metric(centroid, x) for x in X])
- node.left = self._build_tree(left_X, left_y)
- node.right = self._build_tree(right_X, right_y)
- return node
-
- def _split(self, X, y=None):
- # find the dimension with greatest variance
- split_dim = np.argmax(np.var(X, axis=0))
-
- # sort X and y along split_dim
- sort_ixs = np.argsort(X[:, split_dim])
- X, y = X[sort_ixs], y[sort_ixs] if y is not None else None
-
- # divide at median value of split_dim
- med_ix = X.shape[0] // 2
- centroid = X[med_ix] # , split_dim
-
- # split data into two halves at the centroid (median always appears on
- # the right split)
- left_X, left_y = X[:med_ix], y[:med_ix] if y is not None else None
- right_X, right_y = X[med_ix:], y[med_ix:] if y is not None else None
- return centroid, left_X, left_y, right_X, right_y
-
- def nearest_neighbors(self, k, x):
- """
- Find the `k` nearest neighbors in the ball tree to a query vector `x`
- using the KNS1 algorithm.
-
- Parameters
- ----------
- k : int
- The number of closest points in `X` to return
- x : :py:class:`ndarray ` of shape `(1, M)`
- The query vector.
-
- Returns
- -------
- nearest : list of :class:`PQNode` s of length `k`
- List of the `k` points in `X` to closest to the query vector. The
- ``key`` attribute of each :class:`PQNode` contains the point itself, the
- ``val`` attribute contains its target, and the ``distance``
- attribute contains its distance to the query vector.
- """
- # maintain a max-first priority queue with priority = distance to x
- PQ = PriorityQueue(capacity=k, heap_order="max")
- nearest = self._knn(k, x, PQ, self.root)
- for n in nearest:
- n.distance = self.metric(x, n.key)
- return nearest
-
- def _knn(self, k, x, PQ, root):
- dist = self.metric
- dist_to_ball = dist(x, root.centroid) - root.radius
- dist_to_farthest_neighbor = dist(x, PQ.peek()["key"]) if len(PQ) > 0 else np.inf
-
- if dist_to_ball >= dist_to_farthest_neighbor and len(PQ) == k:
- return PQ
- if root.is_leaf:
- targets = [None] * len(root.data) if root.targets is None else root.targets
- for point, target in zip(root.data, targets):
- dist_to_x = dist(x, point)
- # `PriorityQueue` automatically evicts the farthest of the current
- # neighbors once it exceeds its capacity of `k`
- PQ.push(key=point, val=target, priority=dist_to_x)
- else:
- l_closest = dist(x, root.left.centroid) < dist(x, root.right.centroid)
- PQ = self._knn(k, x, PQ, root.left if l_closest else root.right)
- PQ = self._knn(k, x, PQ, root.right if l_closest else root.left)
- return PQ
-
-
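-def _demo_ball_tree():
- """Editor's sketch, not part of the original source: fit a ball tree on
- random 2D points and query the 5 nearest neighbours of the origin."""
- X = np.random.uniform(-1, 1, size=(200, 2))
- tree = BallTree(leaf_size=10)
- tree.fit(X)
- neighbors = tree.nearest_neighbors(k=5, x=np.zeros(2))
- return [n.distance for n in neighbors]
-
-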
-#######################################################################
-# Multinomial Sampler #
-#######################################################################
-
-
-class DiscreteSampler:
- def __init__(self, probs, log=False, with_replacement=True):
- """
- Sample from an arbitrary multinomial PMF over the first `N` nonnegative
- integers using Vose's algorithm for the alias method.
-
- Notes
- -----
- Vose's algorithm takes `O(n)` time to initialize, requires `O(n)` memory,
- and generates samples in constant time.
-
- References
- ----------
- .. [1] Walker, A. J. (1977) "An efficient method for generating discrete
- random variables with general distributions". *ACM Transactions on
- Mathematical Software, 3(3)*, 253-256.
-
- .. [2] Vose, M. D. (1991) "A linear algorithm for generating random numbers
- with a given distribution". *IEEE Trans. Softw. Eng., 9*, 972-974.
-
- .. [3] Schwarz, K (2011) "Darts, dice, and coins: sampling from a discrete
- distribution". http://www.keithschwarz.com/darts-dice-coins/
-
- Parameters
- ----------
- probs : :py:class:`ndarray ` of length `(N,)`
- A list of probabilities of the `N` outcomes in the sample space.
- `probs[i]` returns the probability of outcome `i`.
- log : bool
- Whether the probabilities in `probs` are in logspace. Default is
- False.
- with_replacement : bool
- Whether to generate samples with or without replacement. Default is
- True.
- """
- if not isinstance(probs, np.ndarray):
- probs = np.array(probs)
-
- self.log = log
- self.N = len(probs)
- self.probs = probs
- self.with_replacement = with_replacement
-
- alias = np.zeros(self.N)
- prob = np.zeros(self.N)
- scaled_probs = self.probs + np.log(self.N) if log else self.probs * self.N
-
- selector = scaled_probs < 0 if log else scaled_probs < 1
- small, large = np.where(selector)[0].tolist(), np.where(~selector)[0].tolist()
-
- while len(small) and len(large):
- l, g = small.pop(), large.pop()
-
- alias[l] = g
- prob[l] = scaled_probs[l]
-
- if log:
- pg = np.log(np.exp(scaled_probs[g]) + np.exp(scaled_probs[l]) - 1)
- else:
- pg = scaled_probs[g] + scaled_probs[l] - 1
-
- scaled_probs[g] = pg
- to_small = pg < 0 if log else pg < 1
- if to_small:
- small.append(g)
- else:
- large.append(g)
-
- while len(large):
- prob[large.pop()] = 0 if log else 1
-
- while len(small):
- prob[small.pop()] = 0 if log else 1
-
- self.prob_table = prob
- self.alias_table = alias
-
- def __call__(self, n_samples=1):
- """
- Generate random draws from the `probs` distribution over integers in
- [0, N).
-
- Parameters
- ----------
- n_samples: int
- The number of samples to generate. Default is 1.
-
- Returns
- -------
- sample : :py:class:`ndarray ` of shape `(n_samples,)`
- A collection of draws from the distribution defined by `probs`.
- Each sample is an int in the range `[0, N)`.
- """
- return self.sample(n_samples)
-
- def sample(self, n_samples=1):
- """
- Generate random draws from the `probs` distribution over integers in
- [0, N).
-
- Parameters
- ----------
- n_samples: int
- The number of samples to generate. Default is 1.
-
- Returns
- -------
- sample : :py:class:`ndarray ` of shape `(n_samples,)`
- A collection of draws from the distribution defined by `probs`.
- Each sample is an int in the range `[0, N)`.
- """
- ixs = np.random.randint(0, self.N, n_samples)
- p = np.exp(self.prob_table[ixs]) if self.log else self.prob_table[ixs]
- flips = np.random.binomial(1, p)
- samples = [ix if f else self.alias_table[ix] for ix, f in zip(ixs, flips)]
-
- # do recursive rejection sampling to sample without replacement
- if not self.with_replacement:
- unique = list(set(samples))
- while len(samples) != len(unique):
- n_new = len(samples) - len(unique)
- samples = unique + self.sample(n_new).tolist()
- unique = list(set(samples))
-
- return np.array(samples, dtype=int)
-
-
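-def _demo_discrete_sampler():
- """Editor's sketch, not part of the original source: draw samples from a
- small multinomial PMF via the alias method."""
- sampler = DiscreteSampler([0.1, 0.2, 0.3, 0.4])
- draws = sampler(n_samples=1000) # ints in [0, 4)
- return np.bincount(draws, minlength=4) / 1000.0 # roughly [0.1, 0.2, 0.3, 0.4]
-
-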
-#######################################################################
-# Dict #
-#######################################################################
-
-
-class Dict(dict):
- def __init__(self, encoder=None):
- """
- A dictionary subclass which returns the key value if it is not in the
- dict.
-
- Parameters
- ----------
- encoder : function or None
- A function which is applied to a key before adding / retrieving it
- from the dictionary. If None, the function defaults to the
- identity. Default is None.
- """
- super(Dict, self).__init__()
- self._encoder = encoder
- self._id_max = 0
-
- def __setitem__(self, key, value):
- if self._encoder is not None:
- key = self._encoder(key)
- elif not isinstance(key, Hashable):
- key = tuple(key)
- super(Dict, self).__setitem__(key, value)
-
- def _encode_key(self, key):
- D = super(Dict, self)
- enc_key = self._encoder(key)
- if D.__contains__(enc_key):
- val = D.__getitem__(enc_key)
- else:
- val = self._id_max
- D.__setitem__(enc_key, val)
- self._id_max += 1
- return val
-
- def __getitem__(self, key):
- self._key = deepcopy(key)
- if self._encoder is not None:
- return self._encode_key(key)
- elif not isinstance(key, Hashable):
- key = tuple(key)
- return super(Dict, self).__getitem__(key)
-
- def __missing__(self, key):
- return self._key
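-
-
-def _demo_dict():
- """Editor's sketch, not part of the original source: with no encoder,
- missing keys are echoed back as their own values."""
- d = Dict()
- d["present"] = 1
- return d["present"], d["absent"] # (1, "absent")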
diff --git a/aitk/keras/numpy_ml_utils/distance_metrics.py b/aitk/keras/numpy_ml_utils/distance_metrics.py
deleted file mode 100644
index 8c51e6c..0000000
--- a/aitk/keras/numpy_ml_utils/distance_metrics.py
+++ /dev/null
@@ -1,132 +0,0 @@
-import numpy as np
-
-
-def euclidean(x, y):
- """
- Compute the Euclidean (`L2`) distance between two real vectors
-
- Notes
- -----
- The Euclidean distance between two vectors **x** and **y** is
-
- .. math::
-
- d(\mathbf{x}, \mathbf{y}) = \sqrt{ \sum_i (x_i - y_i)^2 }
-
- Parameters
- ----------
- x,y : :py:class:`ndarray ` s of shape `(N,)`
- The two vectors to compute the distance between
-
- Returns
- -------
- d : float
- The L2 distance between **x** and **y**.
- """
- return np.sqrt(np.sum((x - y) ** 2))
-
-
-def manhattan(x, y):
- """
- Compute the Manhattan (`L1`) distance between two real vectors
-
- Notes
- -----
- The Manhattan distance between two vectors **x** and **y** is
-
- .. math::
-
- d(\mathbf{x}, \mathbf{y}) = \sum_i |x_i - y_i|
-
- Parameters
- ----------
- x,y : :py:class:`ndarray ` s of shape `(N,)`
- The two vectors to compute the distance between
-
- Returns
- -------
- d : float
- The L1 distance between **x** and **y**.
- """
- return np.sum(np.abs(x - y))
-
-
-def chebyshev(x, y):
- """
- Compute the Chebyshev (:math:`L_\infty`) distance between two real vectors
-
- Notes
- -----
- The Chebyshev distance between two vectors **x** and **y** is
-
- .. math::
-
- d(\mathbf{x}, \mathbf{y}) = \max_i |x_i - y_i|
-
- Parameters
- ----------
- x,y : :py:class:`ndarray ` s of shape `(N,)`
- The two vectors to compute the distance between
-
- Returns
- -------
- d : float
- The Chebyshev distance between **x** and **y**.
- """
- return np.max(np.abs(x - y))
-
-
-def minkowski(x, y, p):
- """
- Compute the Minkowski-`p` distance between two real vectors.
-
- Notes
- -----
- The Minkowski-`p` distance between two vectors **x** and **y** is
-
- .. math::
-
- d(\mathbf{x}, \mathbf{y}) = \left( \sum_i |x_i - y_i|^p \\right)^{1/p}
-
- Parameters
- ----------
- x,y : :py:class:`ndarray ` s of shape `(N,)`
- The two vectors to compute the distance between
- p : float > 1
- The parameter of the distance function. When `p = 1`, this is the `L1`
- distance, and when `p=2`, this is the `L2` distance. For `p < 1`,
- Minkowski-`p` does not satisfy the triangle inequality and hence is not
- a valid distance metric.
-
- Returns
- -------
- d : float
- The Minkowski-`p` distance between **x** and **y**.
- """
- return np.sum(np.abs(x - y) ** p) ** (1 / p)
-
-
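-def _demo_minkowski():
- """Editor's sketch, not part of the original source: Minkowski-p reduces
- to the Manhattan distance at p=1 and to the Euclidean distance at p=2."""
- x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
- assert np.isclose(minkowski(x, y, p=1), manhattan(x, y)) # both 7.0
- assert np.isclose(minkowski(x, y, p=2), euclidean(x, y)) # both 5.0
-
-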
-def hamming(x, y):
- """
- Compute the Hamming distance between two integer-valued vectors.
-
- Notes
- -----
- The Hamming distance between two vectors **x** and **y** is
-
- .. math::
-
- d(\mathbf{x}, \mathbf{y}) = \\frac{1}{N} \sum_i \mathbb{1}_{x_i \\neq y_i}
-
- Parameters
- ----------
- x,y : :py:class:`ndarray ` s of shape `(N,)`
- The two vectors to compute the distance between. Both vectors should be
- integer-valued.
-
- Returns
- -------
- d : float
- The Hamming distance between **x** and **y**.
- """
- return np.sum(x != y) / len(x)
diff --git a/aitk/keras/numpy_ml_utils/graphs.py b/aitk/keras/numpy_ml_utils/graphs.py
deleted file mode 100644
index c65f5f3..0000000
--- a/aitk/keras/numpy_ml_utils/graphs.py
+++ /dev/null
@@ -1,363 +0,0 @@
-from abc import ABC, abstractmethod
-from collections import defaultdict
-from itertools import combinations, permutations
-
-import numpy as np
-
-#######################################################################
-# Graph Components #
-#######################################################################
-
-
-class Edge(object):
- def __init__(self, fr, to, w=None):
- """
- A generic directed edge object.
-
- Parameters
- ----------
- fr: int
- The id of the vertex the edge goes from
- to: int
- The id of the vertex the edge goes to
- w: float, :class:`Object` instance, or None
- The edge weight, if applicable. If weight is an arbitrary Object it
- must have a method called 'sample' which takes no arguments and
- returns a random sample from the weight distribution. If `w` is
- None, no weight is assumed. Default is None.
- """
- self.fr = fr
- self.to = to
- self._w = w
-
- def __repr__(self):
- return "{} -> {}, weight: {}".format(self.fr, self.to, self._w)
-
- @property
- def weight(self):
- return self._w.sample() if hasattr(self._w, "sample") else self._w
-
- def reverse(self):
- """Reverse the edge direction"""
- return Edge(self.to, self.fr, self._w)
-
-
-#######################################################################
-# Graph Types #
-#######################################################################
-
-
-class Graph(ABC):
- def __init__(self, V, E):
- self._I2V = {i: v for i, v in zip(range(len(V)), V)}
- self._V2I = {v: i for i, v in zip(range(len(V)), V)}
- self._G = {i: set() for i in range(len(V))}
- self._V = V
- self._E = E
-
- self._build_adjacency_list()
-
- def __getitem__(self, v_i):
- return self.get_neighbors(v_i)
-
- def get_index(self, v):
- """Get the internal index for a given vetex"""
- return self._V2I[v]
-
- def get_vertex(self, v_i):
- """Get the original vertex from a given internal index"""
- return self._I2V[v_i]
-
- @property
- def vertices(self):
- return self._V
-
- @property
- def indices(self):
- return list(range(len(self.vertices)))
-
- @property
- def edges(self):
- return self._E
-
- def get_neighbors(self, v_i):
- """
- Return the internal indices of the vertices reachable from the vertex
- with index `v_i`.
- """
- return [self._V2I[e.to] for e in self._G[v_i]]
-
- def to_matrix(self):
- """Return an adjacency matrix representation of the graph"""
- adj_mat = np.zeros((len(self._V), len(self._V)))
- for e in self.edges:
- fr, to = self._V2I[e.fr], self._V2I[e.to]
- adj_mat[fr, to] = 1 if e.weight is None else e.weight
- return adj_mat
-
- def to_adj_dict(self):
- """Return an adjacency dictionary representation of the graph"""
- adj_dict = defaultdict(lambda: list())
- for e in self.edges:
- adj_dict[e.fr].append(e)
- return adj_dict
-
- def path_exists(self, s_i, e_i):
- """
- Check whether a path exists from vertex index `s_i` to `e_i`.
-
- Parameters
- ----------
- s_i: Int
- The internal index of the start vertex
- e_i: Int
- The internal index of the end vertex
-
- Returns
- -------
- path_exists : Boolean
- Whether or not a valid path exists between `s_i` and `e_i`.
- """
- queue = [(s_i, [s_i])]
- while len(queue):
- c_i, path = queue.pop(0)
- nbrs_not_on_path = set(self.get_neighbors(c_i)) - set(path)
-
- for n_i in nbrs_not_on_path:
- queue.append((n_i, path + [n_i]))
- if n_i == e_i:
- return True
- return False
-
- def all_paths(self, s_i, e_i):
- """
- Find all simple paths between `s_i` and `e_i` in the graph.
-
- Notes
- -----
- Uses breadth-first search. Ignores all paths with repeated vertices.
-
- Parameters
- ----------
- s_i: Int
- The internal index of the start vertex
- e_i: Int
- The internal index of the end vertex
-
- Returns
- -------
- complete_paths : list of lists
- A list of all paths from `s_i` to `e_i`. Each path is represented
- as a list of internal vertex indices.
- """
- complete_paths = []
- queue = [(s_i, [s_i])]
-
- while len(queue):
- c_i, path = queue.pop(0)
- nbrs_not_on_path = set(self.get_neighbors(c_i)) - set(path)
-
- for n_i in nbrs_not_on_path:
- if n_i == e_i:
- complete_paths.append(path + [n_i])
- else:
- queue.append((n_i, path + [n_i]))
-
- return complete_paths
-
- @abstractmethod
- def _build_adjacency_list(self):
- pass
-
-
-class DiGraph(Graph):
- def __init__(self, V, E):
- """
- A generic directed graph object.
-
- Parameters
- ----------
- V : list
- A list of vertex IDs.
- E : list of :class:`Edge ` objects
- A list of directed edges connecting pairs of vertices in ``V``.
- """
- super().__init__(V, E)
- self.is_directed = True
- self._topological_ordering = []
-
- def _build_adjacency_list(self):
- """Encode directed graph as an adjancency list"""
- # assumes no parallel edges
- for e in self.edges:
- fr_i = self._V2I[e.fr]
- self._G[fr_i].add(e)
-
- def reverse(self):
- """Reverse the direction of all edges in the graph"""
- return DiGraph(self.vertices, [e.reverse() for e in self.edges])
-
- def topological_ordering(self):
- """
- Returns a (non-unique) topological sort / linearization of the nodes
- IFF the graph is acyclic, otherwise returns None.
-
- Notes
- -----
- A topological sort is an ordering on the nodes in `G` such that for every
- directed edge :math:`u \\rightarrow v` in the graph, `u` appears before
- `v` in the ordering. The topological ordering is produced by ordering
- the nodes in `G` by their DFS "last visit time," from greatest to
- smallest.
-
- This implementation follows a recursive, DFS-based approach [1]_ which
- may break if the graph is very large. For an iterative version, see
- Kahn's algorithm [2]_.
-
- References
- ----------
- .. [1] Tarjan, R. (1976), Edge-disjoint spanning trees and depth-first
- search, *Acta Informatica, 6 (2)*: 171–185.
- .. [2] Kahn, A. (1962), Topological sorting of large networks,
- *Communications of the ACM, 5 (11)*: 558–562.
-
- Returns
- -------
- ordering : list or None
- A topological ordering of the vertex indices if the graph is a DAG,
- otherwise None.
- """
- ordering = []
- visited = set()
-
- def dfs(v_i, path=None):
- """A simple DFS helper routine"""
- path = set([v_i]) if path is None else path
- for nbr_i in self.get_neighbors(v_i):
- if nbr_i in path:
- return True # cycle detected!
- elif nbr_i not in visited:
- visited.add(nbr_i)
- path.add(nbr_i)
- is_cyclic = dfs(nbr_i, path)
- if is_cyclic:
- return True
-
- # insert to the beginning of the ordering
- ordering.insert(0, v_i)
- path -= set([v_i])
- return False
-
- for s_i in self.indices:
- if s_i not in visited:
- visited.add(s_i)
- is_cyclic = dfs(s_i)
-
- if is_cyclic:
- return None
-
- return ordering
-
- def is_acyclic(self):
- """Check whether the graph contains cycles"""
- return self.topological_ordering() is not None
-
-
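-def _demo_topological_ordering():
- """Editor's sketch, not part of the original source: topologically sort a
- small DAG; every edge's source appears before its target in the result."""
- V = ["a", "b", "c", "d"]
- E = [Edge("a", "b"), Edge("a", "c"), Edge("b", "d"), Edge("c", "d")]
- G = DiGraph(V, E)
- order = G.topological_ordering() # e.g. [0, 2, 1, 3] (internal indices)
- return [G.get_vertex(i) for i in order]
-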
-class UndirectedGraph(Graph):
- def __init__(self, V, E):
- """
- A generic undirected graph object.
-
- Parameters
- ----------
- V : list
- A list of vertex IDs.
- E : list of :class:`Edge ` objects
- A list of edges connecting pairs of vertices in ``V``. For any edge
- connecting vertex `u` to vertex `v`, :class:`UndirectedGraph
- ` will assume that there
- exists a corresponding edge connecting `v` to `u`, even if this is
- not present in `E`.
- """
- super().__init__(V, E)
- self.is_directed = False
-
- def _build_adjacency_list(self):
- """Encode undirected, unweighted graph as an adjancency list"""
- # assumes no parallel edges
- # each edge appears twice as (u,v) and (v,u)
- for e in self.edges:
- fr_i = self._V2I[e.fr]
- to_i = self._V2I[e.to]
-
- self._G[fr_i].add(e)
- self._G[to_i].add(e.reverse())
-
-
-#######################################################################
-# Graph Generators #
-#######################################################################
-
-
-def random_unweighted_graph(n_vertices, edge_prob=0.5, directed=False):
- """
- Generate an unweighted Erdős-Rényi random graph [*]_.
-
- References
- ----------
- .. [*] Erdős, P. and Rényi, A. (1959). On Random Graphs, *Publ. Math. 6*, 290.
-
- Parameters
- ----------
- n_vertices : int
- The number of vertices in the graph.
- edge_prob : float in [0, 1]
- The probability of forming an edge between two vertices. Default is
- 0.5.
- directed : bool
- Whether the edges in the graph should be directed. Default is False.
-
- Returns
- -------
- G : :class:`Graph` instance
- The resulting random graph.
- """
- vertices = list(range(n_vertices))
- candidates = permutations(vertices, 2) if directed else combinations(vertices, 2)
-
- edges = []
- for (fr, to) in candidates:
- if np.random.rand() <= edge_prob:
- edges.append(Edge(fr, to))
-
- return DiGraph(vertices, edges) if directed else UndirectedGraph(vertices, edges)
-
-
-def random_DAG(n_vertices, edge_prob=0.5):
- """
- Create a 'random' unweighted directed acyclic graph by pruning all the
- backward connections from a random graph.
-
- Parameters
- ----------
- n_vertices : int
- The number of vertices in the graph.
- edge_prob : float in [0, 1]
- The probability of forming an edge between two vertices in the
- underlying random graph, before edge pruning. Default is 0.5.
-
- Returns
- -------
- G : :class:`Graph` instance
- The resulting DAG.
- """
- G = random_unweighted_graph(n_vertices, edge_prob, directed=True)
-
- # prune edges to remove backwards connections between vertices
- G = DiGraph(G.vertices, [e for e in G.edges if e.fr < e.to])
-
- # if we pruned away all the edges, generate a new graph
- while not len(G.edges):
- G = random_unweighted_graph(n_vertices, edge_prob, directed=True)
- G = DiGraph(G.vertices, [e for e in G.edges if e.fr < e.to])
- return G
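-
-
-def _demo_random_dag():
- """Editor's sketch, not part of the original source: a generated DAG
- always admits a topological ordering."""
- G = random_DAG(n_vertices=6, edge_prob=0.5)
- assert G.is_acyclic()
- return G.topological_ordering()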
diff --git a/aitk/keras/numpy_ml_utils/kernels.py b/aitk/keras/numpy_ml_utils/kernels.py
deleted file mode 100644
index f346d61..0000000
--- a/aitk/keras/numpy_ml_utils/kernels.py
+++ /dev/null
@@ -1,344 +0,0 @@
-import re
-from abc import ABC, abstractmethod
-
-import numpy as np
-
-
-class KernelBase(ABC):
- def __init__(self):
- super().__init__()
- self.parameters = {}
- self.hyperparameters = {}
-
- @abstractmethod
- def _kernel(self, X, Y):
- raise NotImplementedError
-
- def __call__(self, X, Y=None):
- """Refer to documentation for the `_kernel` method"""
- return self._kernel(X, Y)
-
- def __str__(self):
- P, H = self.parameters, self.hyperparameters
- p_str = ", ".join(["{}={}".format(k, v) for k, v in P.items()])
- return "{}({})".format(H["id"], p_str)
-
- def summary(self):
- """Return the dictionary of model parameters, hyperparameters, and ID"""
- return {
- "id": self.hyperparameters["id"],
- "parameters": self.parameters,
- "hyperparameters": self.hyperparameters,
- }
-
- def set_params(self, summary_dict):
- """
- Set the model parameters and hyperparameters using the settings in
- `summary_dict`.
-
- Parameters
- ----------
- summary_dict : dict
- A dictionary with keys 'parameters' and 'hyperparameters',
- structured as would be returned by the :meth:`summary` method. If
- a particular (hyper)parameter is not included in this dict, the
- current value will be used.
-
- Returns
- -------
- new_kernel : :doc:`Kernel ` instance
- A kernel with parameters and hyperparameters adjusted to those
- specified in `summary_dict`.
- """
- kr, sd = self, summary_dict
-
- # collapse `parameters` and `hyperparameters` nested dicts into a single
- # merged dictionary
- flatten_keys = ["parameters", "hyperparameters"]
- for k in flatten_keys:
- if k in sd:
- entry = sd[k]
- sd.update(entry)
- del sd[k]
-
- for k, v in sd.items():
- if k in self.parameters:
- kr.parameters[k] = v
- if k in self.hyperparameters:
- kr.hyperparameters[k] = v
- return kr
-
-
-class LinearKernel(KernelBase):
- def __init__(self, c0=0):
- """
- The linear (i.e., dot-product) kernel.
-
- Notes
- -----
- For input vectors :math:`\mathbf{x}` and :math:`\mathbf{y}`, the linear
- kernel is:
-
- .. math::
-
- k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\\top \mathbf{y} + c_0
-
- Parameters
- ----------
- c0 : float
- An "inhomogeneity" parameter. When `c0` = 0, the kernel is said to be
- homogenous. Default is 1.
- """
- super().__init__()
- self.hyperparameters = {"id": "LinearKernel"}
- self.parameters = {"c0": c0}
-
- def _kernel(self, X, Y=None):
- """
- Compute the linear kernel (i.e., dot-product) between all pairs of rows in
- `X` and `Y`.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(N, C)`
- Collection of `N` input vectors
- Y : :py:class:`ndarray ` of shape `(M, C)` or None
- Collection of `M` input vectors. If None, assume `Y` = `X`.
- Default is None.
-
- Returns
- -------
- out : :py:class:`ndarray ` of shape `(N, M)`
- Similarity between `X` and `Y`, where index (`i`, `j`) gives
- :math:`k(x_i, y_j)`.
- """
- X, Y = kernel_checks(X, Y)
- return X @ Y.T + self.parameters["c0"]
-
-
-class PolynomialKernel(KernelBase):
- def __init__(self, d=3, gamma=None, c0=1):
- """
- The degree-`d` polynomial kernel.
-
- Notes
- -----
- For input vectors :math:`\mathbf{x}` and :math:`\mathbf{y}`, the polynomial
- kernel is:
-
- .. math::
-
- k(\mathbf{x}, \mathbf{y}) = (\gamma \mathbf{x}^\\top \mathbf{y} + c_0)^d
-
- In contrast to the linear kernel, the polynomial kernel also computes
- similarities *across* dimensions of the **x** and **y** vectors,
- allowing it to account for interactions between features. As an
- instance of the dot product family of kernels, the polynomial kernel is
- invariant to a rotation of the coordinates about the origin, but *not*
- to translations.
-
- Parameters
- ----------
- d : int
- Degree of the polynomial kernel. Default is 3.
- gamma : float or None
- A scaling parameter for the dot product between `x` and `y`,
- determining the amount of smoothing/resolution of the kernel.
- Larger values result in greater smoothing. If None, defaults to 1 /
- `C`. Sometimes referred to as the kernel bandwidth. Default is
- None.
- c0 : float
- Parameter trading off the influence of higher-order versus lower-order
- terms in the polynomial. If `c0` = 0, the kernel is said to be
- homogeneous. Default is 1.
- """
- super().__init__()
- self.hyperparameters = {"id": "PolynomialKernel"}
- self.parameters = {"d": d, "c0": c0, "gamma": gamma}
-
- def _kernel(self, X, Y=None):
- """
- Compute the degree-`d` polynomial kernel between all pairs of rows in `X`
- and `Y`.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(N, C)`
- Collection of `N` input vectors
- Y : :py:class:`ndarray ` of shape `(M, C)` or None
- Collection of `M` input vectors. If None, assume `Y = X`. Default
- is None.
-
- Returns
- -------
- out : :py:class:`ndarray ` of shape `(N, M)`
- Similarity between `X` and `Y` where index (`i`, `j`) gives
- :math:`k(x_i, y_j)` (i.e., the kernel's Gram-matrix).
- """
- P = self.parameters
- X, Y = kernel_checks(X, Y)
- gamma = 1 / X.shape[1] if P["gamma"] is None else P["gamma"]
- return (gamma * (X @ Y.T) + P["c0"]) ** P["d"]
-
-
-class RBFKernel(KernelBase):
- def __init__(self, sigma=None):
- """
- Radial basis function (RBF) / squared exponential kernel.
-
- Notes
- -----
- For input vectors :math:`\mathbf{x}` and :math:`\mathbf{y}`, the radial
- basis function kernel is:
-
- .. math::
-
- k(\mathbf{x}, \mathbf{y}) = \exp \left\{ -0.5
- \left\lVert \\frac{\mathbf{x} -
- \mathbf{y}}{\sigma} \\right\\rVert_2^2 \\right\}
-
- The RBF kernel decreases with distance and ranges from zero (in the
- limit) to one (when **x** = **y**). Notably, the implied feature space
- of the kernel has an infinite number of dimensions.
-
- Parameters
- ----------
- sigma : float or array of shape `(C,)` or None
- A scaling parameter for the vectors **x** and **y**, producing an
- isotropic kernel if a float, or an anisotropic kernel if an array of
- length `C`. Larger values result in greater smoothing. If None,
- defaults to :math:`\sqrt{C / 2}`. Sometimes
- referred to as the kernel 'bandwidth'. Default is None.
- """
- super().__init__()
- self.hyperparameters = {"id": "RBFKernel"}
- self.parameters = {"sigma": sigma}
-
- def _kernel(self, X, Y=None):
- """
- Computes the radial basis function (RBF) kernel between all pairs of
- rows in `X` and `Y`.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(N, C)`
- Collection of `N` input vectors, each with dimension `C`.
- Y : :py:class:`ndarray ` of shape `(M, C)`
- Collection of `M` input vectors. If None, assume `Y` = `X`. Default
- is None.
-
- Returns
- -------
- out : :py:class:`ndarray ` of shape `(N, M)`
- Similarity between `X` and `Y` where index (i, j) gives :math:`k(x_i, y_j)`.
- """
- P = self.parameters
- X, Y = kernel_checks(X, Y)
- sigma = np.sqrt(X.shape[1] / 2) if P["sigma"] is None else P["sigma"]
- return np.exp(-0.5 * pairwise_l2_distances(X / sigma, Y / sigma) ** 2)
-
-
-class KernelInitializer(object):
- def __init__(self, param=None):
- """
- A class for initializing kernel functions. Valid inputs are:
- (a) __str__ representations of `KernelBase` instances
- (b) `KernelBase` instances
- (c) Parameter dicts (e.g., as produced via the :meth:`summary` method in
- `KernelBase` instances)
-
- If `param` is None, return `LinearKernel`.
- """
- self.param = param
-
- def __call__(self):
- param = self.param
- if param is None:
- kernel = LinearKernel()
- elif isinstance(param, KernelBase):
- kernel = param
- elif isinstance(param, str):
- kernel = self.init_from_str()
- elif isinstance(param, dict):
- kernel = self.init_from_dict()
- return kernel
-
- def init_from_str(self):
- r = r"([a-zA-Z0-9]*)=([^,)]*)"
- kr_str = self.param.lower()
- kwargs = dict([(i, eval(j)) for (i, j) in re.findall(r, self.param)])
-
- if "linear" in kr_str:
- kernel = LinearKernel(**kwargs)
- elif "polynomial" in kr_str:
- kernel = PolynomialKernel(**kwargs)
- elif "rbf" in kr_str:
- kernel = RBFKernel(**kwargs)
- else:
- raise NotImplementedError("{}".format(kr_str))
- return kernel
-
- def init_from_dict(self):
- S = self.param
- sc = S["hyperparameters"] if "hyperparameters" in S else None
-
- if sc is None:
- raise ValueError("Must have `hyperparameters` key: {}".format(S))
-
- if sc and sc["id"] == "LinearKernel":
- scheduler = LinearKernel().set_params(S)
- elif sc and sc["id"] == "PolynomialKernel":
- scheduler = PolynomialKernel().set_params(S)
- elif sc and sc["id"] == "RBFKernel":
- scheduler = RBFKernel().set_params(S)
- elif sc:
- raise NotImplementedError("{}".format(sc["id"]))
- return scheduler
-
-
-def kernel_checks(X, Y):
- X = X.reshape(-1, 1) if X.ndim == 1 else X
- Y = X if Y is None else Y
- Y = Y.reshape(-1, 1) if Y.ndim == 1 else Y
-
- assert X.ndim == 2, "X must have 2 dimensions, but got {}".format(X.ndim)
- assert Y.ndim == 2, "Y must have 2 dimensions, but got {}".format(Y.ndim)
- assert X.shape[1] == Y.shape[1], "X and Y must have the same number of columns"
- return X, Y
-
-
-def pairwise_l2_distances(X, Y):
- """
- A fast, vectorized way to compute pairwise l2 distances between rows in `X`
- and `Y`.
-
- Notes
- -----
- An entry of the pairwise Euclidean distance matrix for two vectors is
-
- .. math::
-
- d[i, j] &= \sqrt{(x_i - y_j) @ (x_i - y_j)} \\\\
- &= \sqrt{\sum (x_i - y_j)^2} \\\\
- &= \sqrt{\sum (x_i)^2 - 2 x_i y_j + (y_j)^2}
-
- The code below computes the third line using numpy broadcasting
- to avoid any for loops.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(N, C)`
- Collection of `N` input vectors
- Y : :py:class:`ndarray ` of shape `(M, C)`
- Collection of `M` input vectors. If None, assume `Y` = `X`. Default is
- None.
-
- Returns
- -------
- dists : :py:class:`ndarray ` of shape `(N, M)`
- Pairwise distance matrix. Entry (i, j) contains the `L2` distance between
- :math:`x_i` and :math:`y_j`.
- """
- D = -2 * X @ Y.T + np.sum(Y ** 2, axis=1) + np.sum(X ** 2, axis=1)[:, np.newaxis]
- D[D < 0] = 0 # clip any value less than 0 (a result of numerical imprecision)
- return np.sqrt(D)
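# --- Illustration (added commentary; a hedged usage sketch, not part of the original module) ---
# Shows how the kernel classes above could be used to build Gram matrices,
# assuming LinearKernel, PolynomialKernel, RBFKernel, and KernelInitializer
# (all defined above) are in scope. The sample data is made up for demonstration.
def _example_gram_matrices():
    rng = np.random.RandomState(0)
    X, Y = rng.randn(5, 3), rng.randn(4, 3)  # 5 and 4 examples, 3 features each

    for kernel in (LinearKernel(), PolynomialKernel(d=2), RBFKernel(sigma=1.0)):
        K = kernel(X, Y)     # cross Gram matrix, shape (5, 4)
        K_self = kernel(X)   # Y=None -> compare X against itself, shape (5, 5)
        assert K.shape == (5, 4) and K_self.shape == (5, 5)

    # the string form accepted by KernelInitializer mirrors each kernel's __str__ output
    assert isinstance(KernelInitializer("rbf(sigma=1.0)")(), RBFKernel)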
diff --git a/aitk/keras/numpy_ml_utils/testing.py b/aitk/keras/numpy_ml_utils/testing.py
deleted file mode 100644
index 67f3111..0000000
--- a/aitk/keras/numpy_ml_utils/testing.py
+++ /dev/null
@@ -1,150 +0,0 @@
-"""Utilities for writing unit tests"""
-import numbers
-import numpy as np
-
-MSG_CACHE = set()
-
-def warn_once(msg):
- if msg not in MSG_CACHE:
- print(msg)
- MSG_CACHE.add(msg)
-
-#######################################################################
-# Assertions #
-#######################################################################
-
-
-def is_symmetric(X):
- """Check that an array `X` is symmetric along its main diagonal"""
- return np.allclose(X, X.T)
-
-
-def is_symmetric_positive_definite(X):
- """Check that a matrix `X` is a symmetric and positive-definite."""
- if is_symmetric(X):
- try:
- # if matrix is symmetric, check whether the Cholesky decomposition
- # (defined only for symmetric/Hermitian positive definite matrices)
- # exists
- np.linalg.cholesky(X)
- return True
- except np.linalg.LinAlgError:
- return False
- return False
-
-
-def is_stochastic(X):
- """True if `X` contains probabilities that sum to 1 along the columns"""
- msg = "Array should be stochastic along the columns"
- assert len(X[X < 0]) == len(X[X > 1]) == 0, msg
- if not np.allclose(np.sum(X, axis=1), np.ones(X.shape[0])):
- warn_once("WARNING: %s; are you using the correct activation function?" % msg)
- return True
-
-
-def is_number(a):
- """Check that a value `a` is numeric"""
- return isinstance(a, numbers.Number)
-
-
-def is_one_hot(x):
- """Return True if array `x` is a binary array with a single 1"""
- msg = "Matrix should be one-hot binary"
- assert np.array_equal(x, x.astype(bool)), msg
- assert np.allclose(np.sum(x, axis=1), np.ones(x.shape[0])), msg
- return True
-
-
-def is_binary(x):
- """Return True if array `x` consists only of binary values"""
- msg = "Matrix must be binary"
- assert np.array_equal(x, x.astype(bool)), msg
- return True
-
-
-#######################################################################
-# Data Generators #
-#######################################################################
-
-
-def random_one_hot_matrix(n_examples, n_classes):
- """Create a random one-hot matrix of shape (`n_examples`, `n_classes`)"""
- X = np.eye(n_classes)
- X = X[np.random.choice(n_classes, n_examples)]
- return X
-
-
-def random_stochastic_matrix(n_examples, n_classes):
- """Create a random stochastic matrix of shape (`n_examples`, `n_classes`)"""
- X = np.random.rand(n_examples, n_classes)
- X /= X.sum(axis=1, keepdims=True)
- return X
-
-
-def random_tensor(shape, standardize=False):
- """
- Create a random real-valued tensor of shape `shape`. If `standardize` is
- True, ensure each column has mean 0 and std 1.
- """
- offset = np.random.randint(-300, 300, shape)
- X = np.random.rand(*shape) + offset
-
- if standardize:
- eps = np.finfo(float).eps
- X = (X - X.mean(axis=0)) / (X.std(axis=0) + eps)
- return X
-
-
-def random_binary_tensor(shape, sparsity=0.5):
- """
- Create a random binary tensor of shape `shape`. `sparsity` is a value
- between 0 and 1 controlling the ratio of 0s to 1s in the output tensor.
- """
- return (np.random.rand(*shape) >= (1 - sparsity)).astype(float)
-
-
-def random_paragraph(n_words, vocab=None):
- """
- Generate a random paragraph consisting of `n_words` words. If `vocab` is
- not None, words will be drawn at random from this list. Otherwise, words
- will be sampled uniformly from a collection of 26 Latin words.
- """
- if vocab is None:
- vocab = [
- "at",
- "stet",
- "accusam",
- "aliquyam",
- "clita",
- "lorem",
- "ipsum",
- "dolor",
- "dolore",
- "dolores",
- "sit",
- "amet",
- "consetetur",
- "sadipscing",
- "elitr",
- "sed",
- "diam",
- "nonumy",
- "eirmod",
- "duo",
- "ea",
- "eos",
- "erat",
- "est",
- "et",
- "gubergren",
- ]
- return [np.random.choice(vocab) for _ in range(n_words)]
-
-
-#######################################################################
-# Custom Warnings #
-#######################################################################
-
-
-class DependencyWarning(RuntimeWarning):
- pass
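# --- Illustration (added commentary; a hedged usage sketch, not part of the original module) ---
# Shows how the data generators and assertion helpers above are intended to
# compose, assuming the functions defined in this file are in scope. The helper
# name is illustrative only.
def _example_testing_utils():
    np.random.seed(0)
    onehot = random_one_hot_matrix(n_examples=4, n_classes=3)
    stoch = random_stochastic_matrix(n_examples=4, n_classes=3)

    assert is_one_hot(onehot)    # each row is binary with a single 1
    assert is_stochastic(stoch)  # each row sums to 1
    assert is_binary(random_binary_tensor((2, 2), sparsity=0.25))
    assert is_number(3.14) and not is_number("3.14")
    assert len(random_paragraph(5)) == 5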
diff --git a/aitk/keras/numpy_ml_utils/windows.py b/aitk/keras/numpy_ml_utils/windows.py
deleted file mode 100644
index cd3132f..0000000
--- a/aitk/keras/numpy_ml_utils/windows.py
+++ /dev/null
@@ -1,156 +0,0 @@
-import numpy as np
-
-
-def blackman_harris(window_len, symmetric=False):
- """
- The Blackman-Harris window.
-
- Notes
- -----
- The Blackman-Harris window is an instance of the more general class of
- cosine-sum windows where `K=3`. Additional coefficients extend the Hamming
- window to further minimize the magnitude of the nearest side-lobe in the
- frequency response.
-
- .. math::
- \\text{bh}(n) = a_0 - a_1 \cos\left(\\frac{2 \pi n}{N}\\right) +
- a_2 \cos\left(\\frac{4 \pi n }{N}\\right) -
- a_3 \cos\left(\\frac{6 \pi n}{N}\\right)
-
- where `N` = `window_len` - 1, :math:`a_0` = 0.35875, :math:`a_1` = 0.48829,
- :math:`a_2` = 0.14128, and :math:`a_3` = 0.01168.
-
- Parameters
- ----------
- window_len : int
- The length of the window in samples. Should be equal to the
- `frame_width` if applying to a windowed signal.
- symmetric : bool
- If False, create a 'periodic' window that can be used with an FFT /
- in spectral analysis. If True, generate a symmetric window that can be
- used in, e.g., filter design. Default is False.
-
- Returns
- -------
- window : :py:class:`ndarray ` of shape `(window_len,)`
- The window
- """
- return generalized_cosine(
- window_len, [0.35875, 0.48829, 0.14128, 0.01168], symmetric
- )
-
-
-def hamming(window_len, symmetric=False):
- """
- The Hamming window.
-
- Notes
- -----
- The Hamming window is an instance of the more general class of cosine-sum
- windows where `K=1` and :math:`a_0 = 0.54`. Coefficients selected to
- minimize the magnitude of the nearest side-lobe in the frequency response.
-
- .. math::
-
- \\text{hamming}(n) = 0.54 -
- 0.46 \cos\left(\\frac{2 \pi n}{\\text{window_len} - 1}\\right)
-
- Parameters
- ----------
- window_len : int
- The length of the window in samples. Should be equal to the
- `frame_width` if applying to a windowed signal.
- symmetric : bool
- If False, create a 'periodic' window that can be used with an FFT /
- in spectral analysis. If True, generate a symmetric window that can be
- used in, e.g., filter design. Default is False.
-
- Returns
- -------
- window : :py:class:`ndarray ` of shape `(window_len,)`
- The window
- """
- return generalized_cosine(window_len, [0.54, 1 - 0.54], symmetric)
-
-
-def hann(window_len, symmetric=False):
- """
- The Hann window.
-
- Notes
- -----
- The Hann window is an instance of the more general class of cosine-sum
- windows where `K=1` and :math:`a_0` = 0.5. Unlike the Hamming window, the
- end points of the Hann window touch zero.
-
- .. math::
-
- \\text{hann}(n) = 0.5 - 0.5 \cos\left(\\frac{2 \pi n}{\\text{window_len} - 1}\\right)
-
- Parameters
- ----------
- window_len : int
- The length of the window in samples. Should be equal to the
- `frame_width` if applying to a windowed signal.
- symmetric : bool
- If False, create a 'periodic' window that can be used with an FFT /
- in spectral analysis. If True, generate a symmetric window that can be
- used in, e.g., filter design. Default is False.
-
- Returns
- -------
- window : :py:class:`ndarray ` of shape `(window_len,)`
- The window
- """
- return generalized_cosine(window_len, [0.5, 0.5], symmetric)
-
-
-def generalized_cosine(window_len, coefs, symmetric=False):
- """
- The generalized cosine family of window functions.
-
- Notes
- -----
- The generalized cosine window is a simple weighted sum of cosine terms.
-
- For :math:`n \in \{0, \ldots, \\text{window_len} \}`:
-
- .. math::
-
- \\text{GCW}(n) = \sum_{k=0}^K (-1)^k a_k \cos\left(\\frac{2 \pi k n}{\\text{window_len}}\\right)
-
- Parameters
- ----------
- window_len : int
- The length of the window in samples. Should be equal to the
- `frame_width` if applying to a windowed signal.
- coefs: list of floats
- The :math:`a_k` coefficient values
- symmetric : bool
- If False, create a 'periodic' window that can be used with an FFT /
- in spectral analysis. If True, generate a symmetric window that can be
- used in, e.g., filter design. Default is False.
-
- Returns
- -------
- window : :py:class:`ndarray ` of shape `(window_len,)`
- The window
- """
- window_len += 1 if not symmetric else 0
- entries = np.linspace(-np.pi, np.pi, window_len) # (-1)^k * 2pi*n / window_len
- window = np.sum([ak * np.cos(k * entries) for k, ak in enumerate(coefs)], axis=0)
- return window[:-1] if not symmetric else window
-
-
-class WindowInitializer:
- def __call__(self, window):
- if window == "hamming":
- return hamming
- elif window == "blackman_harris":
- return blackman_harris
- elif window == "hann":
- return hann
- elif window == "generalized_cosine":
- return generalized_cosine
- else:
- raise NotImplementedError("{}".format(window))
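# --- Illustration (added commentary; a hedged verification sketch, not part of the original module) ---
# A quick numerical sanity check: with symmetric=True, the Hann and Hamming
# windows above should agree with NumPy's reference implementations. This
# assumes the functions defined in this file are in scope; the helper name is
# illustrative only.
def _example_window_check(window_len=16):
    assert np.allclose(hann(window_len, symmetric=True), np.hanning(window_len))
    assert np.allclose(hamming(window_len, symmetric=True), np.hamming(window_len))
    assert callable(WindowInitializer()("blackman_harris"))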
diff --git a/aitk/keras/optimizers/README.md b/aitk/keras/optimizers/README.md
deleted file mode 100644
index fa815cb..0000000
--- a/aitk/keras/optimizers/README.md
+++ /dev/null
@@ -1,8 +0,0 @@
-# Optimizers
-
-The `optimizers.py` module implements common modifications to stochastic gradient descent. It includes:
-
-- SGD with momentum ([Rumelhart, Hinton, & Williams, 1986](https://www.cs.princeton.edu/courses/archive/spring18/cos495/res/backprop_old.pdf))
-- AdaGrad ([Duchi, Hazan, & Singer, 2011](http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf))
-- RMSProp ([Tieleman & Hinton, 2012](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf))
-- Adam ([Kingma & Ba, 2015](https://arxiv.org/pdf/1412.6980v8.pdf))
diff --git a/aitk/keras/optimizers/__init__.py b/aitk/keras/optimizers/__init__.py
deleted file mode 100644
index acd7379..0000000
--- a/aitk/keras/optimizers/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .optimizers import *
diff --git a/aitk/keras/optimizers/optimizers.py b/aitk/keras/optimizers/optimizers.py
deleted file mode 100644
index 6651e64..0000000
--- a/aitk/keras/optimizers/optimizers.py
+++ /dev/null
@@ -1,498 +0,0 @@
-from copy import deepcopy
-from abc import ABC, abstractmethod
-
-import numpy as np
-from numpy.linalg import norm
-
-
-class OptimizerBase(ABC):
- def __init__(self, learning_rate, scheduler=None):
- """
- An abstract base class for all Optimizer objects.
-
- This should never be used directly.
- """
- from ..initializers import SchedulerInitializer
-
- self.cache = {}
- self.cur_step = 0
- self.hyperparameters = {}
- self.lr_scheduler = SchedulerInitializer(scheduler, lr=learning_rate)()
-
- def __call__(self, param, param_grad, param_name, cur_loss=None):
- return self.update(param, param_grad, param_name, cur_loss)
-
- def step(self):
- """Increment the optimizer step counter by 1"""
- self.cur_step += 1
-
- def reset_step(self):
- """Reset the step counter to zero"""
- self.cur_step = 0
-
- def copy(self):
- """Return a copy of the optimizer object"""
- return deepcopy(self)
-
- def set_params(self, hparam_dict=None, cache_dict=None):
- """Set the parameters of the optimizer object from a dictionary"""
- from ..initializers import SchedulerInitializer
-
- if hparam_dict is not None:
- for k, v in hparam_dict.items():
- if k in self.hyperparameters:
- self.hyperparameters[k] = v
- if k == "lr_scheduler":
- self.lr_scheduler = SchedulerInitializer(v, lr=None)()
-
- if cache_dict is not None:
- for k, v in cache_dict.items():
- if k in self.cache:
- self.cache[k] = v
-
- @abstractmethod
- def update(self, param, param_grad, param_name, cur_loss=None):
- raise NotImplementedError
-
-
-class SGD(OptimizerBase):
- def __init__(
- self, learning_rate=0.01, momentum=0.0, clip_norm=None, lr_scheduler=None, **kwargs
- ):
- """
- A stochastic gradient descent optimizer.
-
- Notes
- -----
- For model parameters :math:`\\theta`, averaged parameter gradients
- :math:`\\nabla_{\\theta} \mathcal{L}`, and learning rate :math:`\eta`,
- the SGD update at timestep `t` is
-
- .. math::
-
- \\text{update}^{(t)}
- &= \\text{momentum} \cdot \\text{update}^{(t-1)} + \eta^{(t)} \\nabla_{\\theta} \mathcal{L}\\\\
- \\theta^{(t+1)}
- &\leftarrow \\theta^{(t)} - \\text{update}^{(t)}
-
- Parameters
- ----------
- learning_rate : float
- Learning rate for SGD. If scheduler is not None, this is used as
- the starting learning rate. Default is 0.01.
- momentum : float in range [0, 1]
- The fraction of the previous update to add to the current update.
- If 0, no momentum is applied. Default is 0.
- clip_norm : float
- If not None, all param gradients are scaled to have maximum l2 norm of
- `clip_norm` before computing update. Default is None.
- lr_scheduler : str, :doc:`Scheduler ` object, or None
- The learning rate scheduler. If None, use a constant learning
- rate equal to `learning_rate`. Default is None.
- """
- if "lr" in kwargs:
- learning_rate = kwargs["lr"]
- print("UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.")
-
- super().__init__(learning_rate, lr_scheduler)
-
- self.hyperparameters = {
- "id": "SGD",
- "learning_rate": learning_rate,
- "momentum": momentum,
- "clip_norm": clip_norm,
- "lr_scheduler": str(self.lr_scheduler),
- }
-
- def __str__(self):
- H = self.hyperparameters
- learning_rate, mm, cn, sc = H["learning_rate"], H["momentum"], H["clip_norm"], H["lr_scheduler"]
- return "SGD(learning_rate={}, momentum={}, clip_norm={}, lr_scheduler={})".format(
- learning_rate, mm, cn, sc
- )
-
- def update(self, param, param_grad, param_name, cur_loss=None):
- """
- Compute the SGD update for a given parameter
-
- Parameters
- ----------
- param : :py:class:`ndarray ` of shape (n, m)
- The value of the parameter to be updated.
- param_grad : :py:class:`ndarray ` of shape (n, m)
- The gradient of the loss function with respect to `param_name`.
- param_name : str
- The name of the parameter.
- cur_loss : float
- The training or validation loss for the current minibatch. Used for
- learning rate scheduling e.g., by
- :class:`~numpy_ml.neural_nets.schedulers.KingScheduler`.
- Default is None.
-
- Returns
- -------
- updated_params : :py:class:`ndarray ` of shape (n, m)
- The value of `param` after applying the momentum update.
- """
- C = self.cache
- H = self.hyperparameters
- momentum, clip_norm = H["momentum"], H["clip_norm"]
- learning_rate = self.lr_scheduler(self.cur_step, cur_loss)
-
- if param_name not in C:
- C[param_name] = np.zeros_like(param_grad)
-
- # scale gradient to avoid explosion
- t = np.inf if clip_norm is None else clip_norm
- if norm(param_grad) > t:
- param_grad = param_grad * t / norm(param_grad)
-
- update = momentum * C[param_name] + learning_rate * param_grad
- self.cache[param_name] = update
- return param - update
-
-
-#######################################################################
-# Adaptive Gradient Methods #
-#######################################################################
-
-
-class AdaGrad(OptimizerBase):
- def __init__(self, learning_rate=0.01, eps=1e-7, clip_norm=None, lr_scheduler=None, **kwargs):
- """
- An AdaGrad optimizer.
-
- Notes
- -----
- Weights that receive large gradients will have their effective learning
- rate reduced, while weights that receive small or infrequent updates
- will have their effective learning rate increased.
-
- Equations::
-
- cache[t] = cache[t-1] + grad[t] ** 2
- update[t] = learning_rate * grad[t] / (np.sqrt(cache[t]) + eps)
- param[t+1] = param[t] - update[t]
-
- Note that the ``**`` and ``/`` operations are elementwise.
-
- "A downside of Adagrad ... is that the monotonic learning rate usually
- proves too aggressive and stops learning too early." [1]
-
- References
- ----------
- .. [1] Karpathy, A. "CS231n: Convolutional neural networks for visual
- recognition" https://cs231n.github.io/neural-networks-3/
-
- Parameters
- ----------
- learning_rate : float
- Global learning rate
- eps : float
- Smoothing term to avoid divide-by-zero errors in the update calc.
- Default is 1e-7.
- clip_norm : float or None
- If not None, all param gradients are scaled to have maximum `L2` norm of
- `clip_norm` before computing update. Default is None.
- lr_scheduler : str or :doc:`Scheduler ` object or None
- The learning rate scheduler. If None, use a constant learning
- rate equal to `learning_rate`. Default is None.
- """
- if "lr" in kwargs:
- learning_rate = kwargs["lr"]
- print("UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.")
-
- super().__init__(learning_rate, lr_scheduler)
-
- self.cache = {}
- self.hyperparameters = {
- "id": "AdaGrad",
- "learning_rate": learning_rate,
- "eps": eps,
- "clip_norm": clip_norm,
- "lr_scheduler": str(self.lr_scheduler),
- }
-
- def __str__(self):
- H = self.hyperparameters
- learning_rate, eps, cn, sc = H["learning_rate"], H["eps"], H["clip_norm"], H["lr_scheduler"]
- return "AdaGrad(learning_rate={}, eps={}, clip_norm={}, lr_scheduler={})".format(
- learning_rate, eps, cn, sc
- )
-
- def update(self, param, param_grad, param_name, cur_loss=None):
- """
- Compute the AdaGrad update for a given parameter.
-
- Notes
- -----
- Adjusts the learning rate of each weight based on the magnitudes of its
- gradients (big gradient -> small learning_rate, small gradient -> big learning_rate).
-
- Parameters
- ----------
- param : :py:class:`ndarray ` of shape (n, m)
- The value of the parameter to be updated
- param_grad : :py:class:`ndarray ` of shape (n, m)
- The gradient of the loss function with respect to `param_name`
- param_name : str
- The name of the parameter
- cur_loss : float or None
- The training or validation loss for the current minibatch. Used for
- learning rate scheduling e.g., by
- :class:`~numpy_ml.neural_nets.schedulers.KingScheduler`.
- Default is None.
-
- Returns
- -------
- updated_params : :py:class:`ndarray ` of shape (n, m)
- The value of `param` after applying the AdaGrad update
- """
- C = self.cache
- H = self.hyperparameters
- eps, clip_norm = H["eps"], H["clip_norm"]
- learning_rate = self.lr_scheduler(self.cur_step, cur_loss)
-
- if param_name not in C:
- C[param_name] = np.zeros_like(param_grad)
-
- # scale gradient to avoid explosion
- t = np.inf if clip_norm is None else clip_norm
- if norm(param_grad) > t:
- param_grad = param_grad * t / norm(param_grad)
-
- C[param_name] += param_grad ** 2
- update = learning_rate * param_grad / (np.sqrt(C[param_name]) + eps)
- self.cache = C
- return param - update
-
-
-class RMSProp(OptimizerBase):
- def __init__(
- self, learning_rate=0.001, decay=0.9, eps=1e-7, clip_norm=None, lr_scheduler=None, **kwargs
- ):
- """
- RMSProp optimizer.
-
- Notes
- -----
- RMSProp was proposed as a refinement of :class:`AdaGrad` to reduce its
- aggressive, monotonically decreasing learning rate.
-
- RMSProp uses a *decaying average* of the previous squared gradients
- (second moment) rather than just the immediately preceding squared
- gradient for its `previous_update` value.
-
- Equations::
-
- cache[t] = decay * cache[t-1] + (1 - decay) * grad[t] ** 2
- update[t] = learning_rate * grad[t] / (np.sqrt(cache[t]) + eps)
- param[t+1] = param[t] - update[t]
-
- Note that the ``**`` and ``/`` operations are elementwise.
-
- Parameters
- ----------
- learning_rate : float
- Learning rate for update. Default is 0.001.
- decay : float in [0, 1]
- Rate of decay for the moving average. Typical values are [0.9,
- 0.99, 0.999]. Default is 0.9.
- eps : float
- Constant term to avoid divide-by-zero errors during the update calc. Default is 1e-7.
- clip_norm : float or None
- If not None, all param gradients are scaled to have maximum l2 norm of
- `clip_norm` before computing update. Default is None.
- lr_scheduler : str or :doc:`Scheduler ` object or None
- The learning rate scheduler. If None, use a constant learning
- rate equal to `learning_rate`. Default is None.
- """
- if "lr" in kwargs:
- learning_rate = kwargs["lr"]
- print("UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.")
-
- super().__init__(learning_rate, lr_scheduler)
-
- self.cache = {}
- self.hyperparameters = {
- "id": "RMSProp",
- "learning_rate": learning_rate,
- "eps": eps,
- "decay": decay,
- "clip_norm": clip_norm,
- "lr_scheduler": str(self.lr_scheduler),
- }
-
- def __str__(self):
- H = self.hyperparameters
- sc = H["lr_scheduler"]
- learning_rate, eps, dc, cn = H["learning_rate"], H["eps"], H["decay"], H["clip_norm"]
- return "RMSProp(learning_rate={}, eps={}, decay={}, clip_norm={}, lr_scheduler={})".format(
- learning_rate, eps, dc, cn, sc
- )
-
- def update(self, param, param_grad, param_name, cur_loss=None):
- """
- Compute the RMSProp update for a given parameter.
-
- Parameters
- ----------
- param : :py:class:`ndarray ` of shape (n, m)
- The value of the parameter to be updated
- param_grad : :py:class:`ndarray ` of shape (n, m)
- The gradient of the loss function with respect to `param_name`
- param_name : str
- The name of the parameter
- cur_loss : float or None
- The training or validation loss for the current minibatch. Used for
- learning rate scheduling e.g., by
- :class:`~numpy_ml.neural_nets.schedulers.KingScheduler`.
- Default is None.
-
- Returns
- -------
- updated_params : :py:class:`ndarray ` of shape (n, m)
- The value of `param` after applying the RMSProp update.
- """
- C = self.cache
- H = self.hyperparameters
- eps, decay, clip_norm = H["eps"], H["decay"], H["clip_norm"]
- learning_rate = self.lr_scheduler(self.cur_step, cur_loss)
-
- if param_name not in C:
- C[param_name] = np.zeros_like(param_grad)
-
- # scale gradient to avoid explosion
- t = np.inf if clip_norm is None else clip_norm
- if norm(param_grad) > t:
- param_grad = param_grad * t / norm(param_grad)
-
- C[param_name] = decay * C[param_name] + (1 - decay) * param_grad ** 2
- update = learning_rate * param_grad / (np.sqrt(C[param_name]) + eps)
- self.cache = C
- return param - update
-
-
-class Adam(OptimizerBase):
- def __init__(
- self,
- learning_rate=0.001,
- decay1=0.9,
- decay2=0.999,
- eps=1e-7,
- clip_norm=None,
- lr_scheduler=None,
- **kwargs
- ):
- """
- Adam (adaptive moment estimation) optimization algorithm.
-
- Notes
- -----
- Designed to combine the advantages of :class:`AdaGrad`, which works
- well with sparse gradients, and :class:`RMSProp`, which works well in
- online and non-stationary settings.
-
- Parameters
- ----------
- learning_rate : float
- Learning rate for update. This parameter is ignored if using
- :class:`~numpy_ml.neural_nets.schedulers.NoamScheduler`.
- Default is 0.001.
- decay1 : float
- The rate of decay to use for the running estimate of the first
- moment (mean) of the gradient. Default is 0.9.
- decay2 : float
- The rate of decay to use for the running estimate of the second
- moment (variance) of the gradient. Default is 0.999.
- eps : float
- Constant term to avoid divide-by-zero errors during the update
- calc. Default is 1e-7.
- clip_norm : float
- If not None, all param gradients are scaled to have maximum l2 norm of
- `clip_norm` before computing update. Default is None.
- lr_scheduler : str, or :doc:`Scheduler ` object, or None
- The learning rate scheduler. If None, use a constant learning rate
- equal to `learning_rate`. Default is None.
- """
- if "lr" in kwargs:
- learning_rate = kwargs["lr"]
- print("UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.")
-
- super().__init__(learning_rate, lr_scheduler)
-
- self.cache = {}
- self.hyperparameters = {
- "id": "Adam",
- "learning_rate": learning_rate,
- "eps": eps,
- "decay1": decay1,
- "decay2": decay2,
- "clip_norm": clip_norm,
- "lr_scheduler": str(self.lr_scheduler),
- }
-
- def __str__(self):
- H = self.hyperparameters
- learning_rate, d1, d2 = H["learning_rate"], H["decay1"], H["decay2"]
- eps, cn, sc = H["eps"], H["clip_norm"], H["lr_scheduler"]
- return "Adam(learning_rate={}, decay1={}, decay2={}, eps={}, clip_norm={}, lr_scheduler={})".format(
- learning_rate, d1, d2, eps, cn, sc
- )
-
- def update(self, param, param_grad, param_name, cur_loss=None):
- """
- Compute the Adam update for a given parameter.
-
- Parameters
- ----------
- param : :py:class:`ndarray ` of shape (n, m)
- The value of the parameter to be updated.
- param_grad : :py:class:`ndarray ` of shape (n, m)
- The gradient of the loss function with respect to `param_name`.
- param_name : str
- The name of the parameter.
- cur_loss : float
- The training or validation loss for the current minibatch. Used for
- learning rate scheduling e.g., by
- :class:`~numpy_ml.neural_nets.schedulers.KingScheduler`. Default is
- None.
-
- Returns
- -------
- updated_params : :py:class:`ndarray ` of shape (n, m)
- The value of `param` after applying the Adam update.
- """
- C = self.cache
- H = self.hyperparameters
- d1, d2 = H["decay1"], H["decay2"]
- eps, clip_norm = H["eps"], H["clip_norm"]
- learning_rate = self.lr_scheduler(self.cur_step, cur_loss)
-
- if param_name not in C:
- C[param_name] = {
- "t": 0,
- "mean": np.zeros_like(param_grad),
- "var": np.zeros_like(param_grad),
- }
-
- # scale gradient to avoid explosion
- t = np.inf if clip_norm is None else clip_norm
- if norm(param_grad) > t:
- param_grad = param_grad * t / norm(param_grad)
-
- t = C[param_name]["t"] + 1
- var = C[param_name]["var"]
- mean = C[param_name]["mean"]
-
- # update cache
- C[param_name]["t"] = t
- C[param_name]["var"] = d2 * var + (1 - d2) * param_grad ** 2
- C[param_name]["mean"] = d1 * mean + (1 - d1) * param_grad
- self.cache = C
-
- # calc unbiased moment estimates and Adam update
- v_hat = C[param_name]["var"] / (1 - d2 ** t)
- m_hat = C[param_name]["mean"] / (1 - d1 ** t)
- update = learning_rate * m_hat / (np.sqrt(v_hat) + eps)
- return param - update
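# --- Illustration (added commentary; a hedged usage sketch, not part of the original module) ---
# Minimizes the toy quadratic loss 0.5 * ||w - w_target||^2 (gradient: w - w_target)
# with the optimizers above. Assumes the classes in this file and their default
# constant learning-rate scheduler are importable and behave as documented; the
# helper name and learning rates below are illustrative choices.
def _example_optimizer_loop(n_steps=500):
    w_target = np.array([1.0, -2.0, 3.0])
    for opt in (SGD(learning_rate=0.1, momentum=0.9), Adam(learning_rate=0.1)):
        w = np.zeros_like(w_target)
        for _ in range(n_steps):
            grad = w - w_target            # gradient of the toy loss at w
            w = opt.update(w, grad, "w")   # one update for the parameter named "w"
            opt.step()                     # advance the optimizer's step counter
        # with these settings w should end up close to w_target, though the exact
        # values depend on the learning rates and number of steps chosen here
        print(opt, np.round(w, 3))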
diff --git a/aitk/keras/preprocessing/README.md b/aitk/keras/preprocessing/README.md
deleted file mode 100644
index b0f90d7..0000000
--- a/aitk/keras/preprocessing/README.md
+++ /dev/null
@@ -1,24 +0,0 @@
-# Preprocessing
-The preprocessing module implements common data preprocessing routines.
-
-- `nlp.py`: Routines and objects for handling text data.
- - n-gram generators
- - Word and character tokenization
- - Punctuation and stop-word removal
- - Vocabulary / unigram count objects
- - [Huffman tree](https://en.wikipedia.org/wiki/Huffman_coding) encoding / decoding
- - Term frequency-inverse document frequency ([tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) encoding
-
-- `dsp.py`: Routines for handling audio and image data.
- - Signal windowing
- - Signal autocorrelation
- - Discrete Fourier transform
- - Discrete cosine transform (type II)
- - Signal resampling via (bi-)linear interpolation and nearest neighbor
- - Mel-frequency cepstral coefficients (MFCCs) ([Mermelstein, 1976](https://files.eric.ed.gov/fulltext/ED128870.pdf#page=93); [Davis & Mermelstein, 1980](https://pdfs.semanticscholar.org/24b8/7a58511919cc867a71f0b58328694dd494b3.pdf))
-
-- `general.py`: General data preprocessing objects and functions.
- - Feature hashing ([Moody, 1989](http://papers.nips.cc/paper/175-fast-learning-in-multi-resolution-hierarchies.pdf))
- - Mini-batch generators
- - One-hot encoding / decoding
- - Feature standardization
diff --git a/aitk/keras/preprocessing/__init__.py b/aitk/keras/preprocessing/__init__.py
deleted file mode 100644
index 021db2c..0000000
--- a/aitk/keras/preprocessing/__init__.py
+++ /dev/null
@@ -1,3 +0,0 @@
-from . import general
-from . import nlp
-from . import dsp
diff --git a/aitk/keras/preprocessing/dsp.py b/aitk/keras/preprocessing/dsp.py
deleted file mode 100644
index 77f3c40..0000000
--- a/aitk/keras/preprocessing/dsp.py
+++ /dev/null
@@ -1,848 +0,0 @@
-import numpy as np
-from numpy.lib.stride_tricks import as_strided
-
-from ..utils.windows import WindowInitializer
-
-#######################################################################
-# Signal Resampling #
-#######################################################################
-
-
-def batch_resample(X, new_dim, mode="bilinear"):
- """
- Resample each image (or similar grid-based 2D signal) in a batch to
- `new_dim` using the specified resampling strategy.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(n_ex, in_rows, in_cols, in_channels)`
- An input image volume
- new_dim : 2-tuple of `(out_rows, out_cols)`
- The dimension to resample each image to
- mode : {'bilinear', 'neighbor'}
- The resampling strategy to employ. Default is 'bilinear'.
-
- Returns
- -------
- resampled : :py:class:`ndarray ` of shape `(n_ex, out_rows, out_cols, in_channels)`
- The resampled image volume.
- """
- if mode == "bilinear":
- interpolate = bilinear_interpolate
- elif mode == "neighbor":
- interpolate = nn_interpolate_2D
- else:
- raise NotImplementedError("Unrecognized resampling mode: {}".format(mode))
-
- out_rows, out_cols = new_dim
- n_ex, in_rows, in_cols, n_in = X.shape
-
- # compute coordinates to resample
- x = np.tile(np.linspace(0, in_cols - 2, out_cols), out_rows)
- y = np.repeat(np.linspace(0, in_rows - 2, out_rows), out_cols)
-
- # resample each image
- resampled = []
- for i in range(n_ex):
- r = interpolate(X[i, ...], x, y)
- r = r.reshape(out_rows, out_cols, n_in)
- resampled.append(r)
- return np.stack(resampled)  # stack along a new batch axis -> (n_ex, out_rows, out_cols, in_channels)
-
-
-def nn_interpolate_2D(X, x, y):
- """
- Estimate the pixel values at the coordinates (x, y) in `X` using a
- nearest neighbor interpolation strategy.
-
- Notes
- -----
- Assumes the current entries in `X` reflect equally-spaced samples from a 2D
- integer grid.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(in_rows, in_cols, in_channels)`
- An input image sampled along a grid of `in_rows` by `in_cols`.
- x : list of length `k`
- A list of x-coordinates for the samples we wish to generate
- y : list of length `k`
- A list of y-coordinates for the samples we wish to generate
-
- Returns
- -------
- samples : :py:class:`ndarray ` of shape `(k, in_channels)`
- The samples for each (x,y) coordinate computed via nearest neighbor
- interpolation
- """
- nx, ny = np.around(x), np.around(y)
- nx = np.clip(nx, 0, X.shape[1] - 1).astype(int)
- ny = np.clip(ny, 0, X.shape[0] - 1).astype(int)
- return X[ny, nx, :]
-
-
-def nn_interpolate_1D(X, t):
- """
- Estimate the signal values at `X[t]` using a nearest neighbor
- interpolation strategy.
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(in_length, in_channels)`
- An input signal sampled along an integer grid of length `in_length`
- t : list of length `k`
- A list of coordinates for the samples we wish to generate
-
- Returns
- -------
- samples : :py:class:`ndarray ` of shape `(k, in_channels)`
- The samples for each coordinate in `t`, computed via nearest neighbor
- interpolation
- """
- nt = np.clip(np.around(t), 0, X.shape[0] - 1).astype(int)
- return X[nt, :]
-
-
-def bilinear_interpolate(X, x, y):
- """
- Estimate the pixel values at the coordinates (x, y) in `X` via bilinear
- interpolation.
-
- Notes
- -----
- Assumes the current entries in X reflect equally-spaced
- samples from a 2D integer grid.
-
- Modified from https://bit.ly/2NMb1Dr
-
- Parameters
- ----------
- X : :py:class:`ndarray ` of shape `(in_rows, in_cols, in_channels)`
- An input image sampled along a grid of `in_rows` by `in_cols`.
- x : list of length `k`
- A list of x-coordinates for the samples we wish to generate
- y : list of length `k`
- A list of y-coordinates for the samples we wish to generate
-
- Returns
- -------
- samples : :py:class:`ndarray ` of shape `(k, in_channels)`
- The samples for each (x,y) coordinate computed via bilinear
- interpolation
- """
- x0 = np.floor(x).astype(int)
- y0 = np.floor(y).astype(int)
- x1 = x0 + 1
- y1 = y0 + 1
-
- x0 = np.clip(x0, 0, X.shape[1] - 1)
- y0 = np.clip(y0, 0, X.shape[0] - 1)
- x1 = np.clip(x1, 0, X.shape[1] - 1)
- y1 = np.clip(y1, 0, X.shape[0] - 1)
-
- Ia = X[y0, x0, :].T
- Ib = X[y1, x0, :].T
- Ic = X[y0, x1, :].T
- Id = X[y1, x1, :].T
-
- wa = (x1 - x) * (y1 - y)
- wb = (x1 - x) * (y - y0)
- wc = (x - x0) * (y1 - y)
- wd = (x - x0) * (y - y0)
-
- return (Ia * wa).T + (Ib * wb).T + (Ic * wc).T + (Id * wd).T
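# --- Illustration (added commentary; a hedged usage sketch, not part of the original module) ---
# Demonstrates the two interpolators above on a tiny 2x2 single-channel image,
# assuming the functions defined in this file are in scope. At integer interior
# coordinates bilinear interpolation reproduces the original pixels; at the
# midpoint it averages the four neighbours. The helper name is illustrative only.
def _example_interpolation():
    img = np.arange(4, dtype=float).reshape(2, 2, 1)   # [[0, 1], [2, 3]]
    xs, ys = np.array([0.0, 0.5]), np.array([0.0, 0.5])

    bilin = bilinear_interpolate(img, xs, ys).ravel()  # [0.0, 1.5]
    near = nn_interpolate_2D(img, xs, ys).ravel()      # nearest-pixel values
    assert np.allclose(bilin, [0.0, 1.5])
    return bilin, near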
-
-
-#######################################################################
-# Fourier Decomposition #
-#######################################################################
-
-
-def DCT(frame, orthonormal=True):
- """
- A naive :math:`O(N^2)` implementation of the 1D discrete cosine transform-II
- (DCT-II).
-
- Notes
- -----
- For a signal :math:`\mathbf{x} = [x_1, \ldots, x_N]` consisting of `N`
- samples, the `k` th DCT coefficient, :math:`c_k`, is
-
- .. math::
-
- c_k = 2 \sum_{n=0}^{N-1} x_n \cos(\pi k (2 n + 1) / (2 N))
-
- where `k` ranges from :math:`0, \ldots, N-1`.
-
- The DCT is highly similar to the DFT -- whereas in a DFT the basis
- functions are sinusoids, in a DCT they are restricted solely to cosines. A
- signal's DCT representation tends to have more of its energy concentrated
- in a smaller number of coefficients when compared to the DFT, and is thus
- commonly used for signal compression. [1]
-
- .. [1] Smoother signals can be accurately approximated using fewer DFT / DCT
- coefficients, resulting in a higher compression ratio. The DCT naturally
- yields a continuous extension at the signal boundaries due to its use of
- even basis functions (cosine). This in turn produces a smoother
- extension in comparison to DFT or DCT approximations, resulting in a
- higher compression.
-
- Parameters
- ----------
- frame : :py:class:`ndarray ` of shape `(N,)`
- A signal frame consisting of N samples
- orthonormal : bool
- Scale to ensure the coefficient vector is orthonormal. Default is True.
-
- Returns
- -------
- dct : :py:class:`ndarray ` of shape `(N,)`
- The discrete cosine transform of the samples in `frame`.
- """
- N = len(frame)
- out = np.zeros_like(frame)
- for k in range(N):
- for (n, xn) in enumerate(frame):
- out[k] += xn * np.cos(np.pi * k * (2 * n + 1) / (2 * N))
- scale = np.sqrt(1 / (4 * N)) if k == 0 else np.sqrt(1 / (2 * N))
- out[k] *= 2 * scale if orthonormal else 2
- return out
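# --- Illustration (added commentary; a hedged verification sketch, not part of the original module) ---
# With orthonormal=True the naive DCT above should match SciPy's DCT-II under
# 'ortho' normalization. This assumes SciPy is available and that the DCT
# function defined above is in scope; it is a sanity check, not part of the API.
def _example_dct_check(n=32):
    from scipy.fft import dct as scipy_dct

    frame = np.random.RandomState(0).randn(n)
    assert np.allclose(DCT(frame, orthonormal=True), scipy_dct(frame, type=2, norm="ortho"))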
-
-
-def __DCT2(frame):
- """Currently broken"""
- N = len(frame) # window length
-
- k = np.arange(N, dtype=float)
- F = k.reshape(1, -1) * k.reshape(-1, 1)
- K = np.divide(F, k, out=np.zeros_like(F), where=F != 0)
-
- FC = np.cos(F * np.pi / N + K * np.pi / 2 * N)
- return 2 * (FC @ frame)
-
-
-def DFT(frame, positive_only=True):
- """
- A naive :math:`O(N^2)` implementation of the 1D discrete Fourier transform (DFT).
-
- Notes
- -----
- The Fourier transform decomposes a signal into a linear combination of
- sinusoids (i.e., basis elements in the space of continuous periodic
- functions). For a sequence :math:`\mathbf{x} = [x_1, \ldots, x_N]` of N
- evenly spaced samples, the `k` th DFT coefficient is given by:
-
- .. math::
-
- c_k = \sum_{n=0}^{N-1} x_n \exp(-2 \pi i k n / N)
-
- where `i` is the imaginary unit, `k` is an index ranging from `0, ..., N-1`,
- and :math:`c_k` is the complex coefficient representing the phase
- (imaginary part) and amplitude (real part) of the `k` th sinusoid in the
- DFT spectrum. The frequency of the `k` th sinusoid is :math:`(k 2 \pi / N)`
- radians per sample.
-
- When applied to a real-valued input, the negative frequency terms are the
- complex conjugates of the positive-frequency terms and the overall spectrum
- is symmetric (excluding the first index, which contains the zero-frequency
- / intercept term).
-
- Parameters
- ----------
- frame : :py:class:`ndarray