Skip to content

uwdata/fast-kde

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Fast Gaussian Kernel Density Estimation

Fast Gaussian kernel density estimation in 1D or 2D.

This package provides accurate, linear-time O(N + K) estimation using Deriche's approximation and is based on the IEEE VIS 2021 Short Paper Fast & Accurate Gaussian Kernel Density Estimation. (For the benchmarks in that paper, see github.com/uwdata/fast-kde-benchmarks.)

Citation

If you use or build on this package in academic work, please use this citation:

@inproceedings{2021-Heer-FastKDE,
 title = {Fast \& Accurate Gaussian Kernel Density Estimation},
 author = {Jeffrey Heer},
 booktitle = {IEEE VIS Short Papers},
 year = {2021},
 url = {http://idl.cs.washington.edu/papers/fast-kde},
}

Usage Examples

For interactive examples, see our Fast KDE Observable notebook.

1D Density Estimation

import { density1d } from 'fast-kde';

const data = [
  {u: 1, v: 1}, {u: 1, v: 2}, {u: 5, v: 4},
  {u: 5, v: 3}, {u: 6, v: 2}, {u: 8, v: 7}
];

// 1d density estimation, with automatic bandwidth and extent
// resulting estimator d1 is an object and also an iterable
let d1 = density1d([1, 1, 5, 5, 6, 8]);

// 1d density estimation, with accessor function
d1 = density1d(data, { x: d => d.u });

// 1d density estimation, with property key accessor
d1 = density1d(data, { x: 'u' });

// 1d density estimation, with given bandwidth and extent
d1 = density1d(data, { x: 'u', bandwidth: 1, extent: [0, 10] });

// efficiently update bandwidth on estimator (extent remains unchanged)
d1.bandwidth(0.5)

// generate array of {x, y} sample points
let p1 = Array.from(d1);

// generate array of {a, b} sample points (instead of {x, y})
p1 = [...d1.points('a', 'b')]

// retrieve internal sample grid array of density estimates
// these values represent the total probability mass in each bin
// they are not (yet) scaled to probability density function estimates
let g1 = d1.grid();

2D Density Estimation

import { density2d } from 'fast-kde';
import { interpolatePiYG } from 'd3-scale-chromatic';

const data = [
  {u: 1, v: 1}, {u: 1, v: 2}, {u: 5, v: 4},
  {u: 5, v: 3}, {u: 6, v: 2}, {u: 8, v: 7}
];

// 2d density estimation, with automatic bandwidth and extent
let d2 = density2d(data, { x: 'u', y: 'v' });

// 2d density estimation, with bandwidth and extent shared across (x, y)
d2 = density2d(data, { x: 'u', y: 'v', bandwidth: 1, extent: [0, 10] });

// 2d density estimation, with bandwidth and extent that differ across (x, y)
d2 = density2d(data, { x: 'u', y: 'v', bandwidth: [1, 0.5], extent: [[0, 10], [1, 9]] });

// 2d density estimation, with customized x- and y-bin counts
d2 = density2d(data, { x: 'u', y: 'v', bins: [256, 256] });

// generate array of {x, y, z} sample points
let p2 = [...d2];

// generate array of {a, b, v} sample points (instead of {x, y, z})
p2 = [...d2.points('a', 'b', 'v')]

// HTML canvas element with a bins[0] x bins[1] heatmap image
let h2 = d2.heatmap();

// HTML canvas heatmap with custom interpolator from d3-scale-chromatic
h2 = d2.heatmap({ color: interpolatePiYG })

// retrieve internal sample grid array of density estimates
// these values represent the total probability mass in each bin
// they are not (yet) scaled to probability density function estimates
let g2 = d2.grid();

Build Instructions

All code is written as ESM modules, and uses the "type": "module" Node.js setting. To build a bundle (ESM module or minified UMD):

  1. Run yarn to install dependencies.
  2. Run yarn build to build the bundles.

Compiled bundles will be written to the dist directory.

API Documentation

1D Density Estimation

# kde.density1d(data[, options])

Creates a new 1D density estimator for the input data. Returns an estimator object that includes the methods listed below, and also provides an iterator over resulting density points.

  • data: An array of input data values for which to perform density estimation. The array values may be numbers or objects.
  • options: Options for configuring density estimation.
    • x: An accessor function for input values. By default this is the identity function, corresponding to data as an array of numbers. If the x option is not function-valued, it will be treated as a key to look up on entries of the input data.
    • weight: An accessor function for weights. By default all input points are given the same weight. The weight values should sum to one for the output to be a proper probability density estimate. If the weight option is not function-valued, it will be treated as a key to look up on entries of the input data.
    • bandwidth: The kernel bandwidth (standard deviation) to use. If unspecified, the bandwidth is automatically calculated using the nrd heuristic and the adjust option.
    • adjust: A fractional value by which to scale (adjust) an automatically calculated bandwidth. For example, an adjust value of 0.5 will result in half the automatically-determined bandwidth. This option is ignored if the bandwidth option is specified.
    • extent: The extent over which to compute kernel density estimation as a two-element array. Note that input data values outside the extent are ignored, potentially resulting in inaccurate densities relative to the full data. If unspecified, the extent is automatically calculated based on the input data extent and the pad option.
    • pad: The amount (in kernel bandwidths) by which to extend an automatically-calculated extent. The default value is 3, capturing 99% of the density from the most extreme points. Set this value to 0 to trim the density estimate to the minimum and maximum observed data points. This option is ignored if the extent option is provided.
    • bins: The number of bins to use for the internal grid. The default is 512 bins. The returned density estimate will include a total of bins equally-spaced sample points over the extent.

Example

// perform 1D estimation with bandwidth = 1 over domain [0, 10]
// returns an iterator over [ { x, y }, ... ] points
kde.density1d([1, 2, 5, 5, 6, 9], { bandwidth: 1, extent: [0, 10] })

# density1d.grid()

Returns the internal grid array of total accumulated density values per bin. To instead produce an array of objects containing coordinate values and probability density function estimates, use density1d.points().

# density1d.points([x, y])

Returns an iterator over objects containing a sample point (x) and density value (y).

  • x: The property name for the sample point (default "x").
  • y: The property name for the estimated density value (default "y").

# density1d.bandwidth([bandwidth])

Get or set the bandwidth (standard deviation) of the Gaussian kernel. Setting the bandwidth will update the estimator efficiently without re-performing binning. The extent will remain unchanged, even if previously determined automatically.

# density1d.extent()

Get the calculated extent of density estimation as a [min, max] extent array. This method does not support setting the extent to a new value, as this requires re-binning the input data.


2D Density Estimation

# kde.density2d(data[, options])

Creates a new 2D density estimator for the input data. Returns an estimator object that includes the methods listed below, and also provides an iterator over resulting density points.

  • data: An array of input data values for which to perform density estimation.
  • options: Options for configuring density estimation.
    • x: An accessor function for x-dimension input values. The default accessor retrieves index 0. If the x option is not function-valued, it will be treated as a key to look up on entries of the input data.
    • y: An accessor function for y-dimension input values. The default accessor retrieves index 1. If the y option is not function-valued, it will be treated as a key to look up on entries of the input data.
    • weight: An accessor function for weights. By default all input points are given the same weight. The weight values should sum to one for the output to be a proper probability density estimate. If the weight option is not function-valued, it will be treated as a key to look up on entries of the input data.
    • bandwidth: The kernel bandwidths (standard deviations) to use. If array-valued, specifies the x- and y-bandwidths separately. If number-valued, sets both the x- and y-bandwidths to the same value. If unspecified, the bandwidths are automatically calculated per-dimension using the nrd heuristic and the adjust option.
    • adjust: A fractional value by which to scale (adjust) an automatically calculated bandwidth. For example, an adjust value of 0.5 will result in half the automatically-determined bandwidth. This option is ignored if the bandwidth option is specified.
    • extent: The extent over which to compute kernel density estimation along both the x- and y-dimensions. If an array of arrays is provided, specifies the x- and y-extents separately. If a single two-number array is provided, sets both x- and y-extents to the same value. Note that input data values outside the extent are ignored, potentially resulting in inaccurate densities relative to the full data. If unspecified, the extent is automatically calculated based on the input data extent and the pad option.
    • pad: The amount (in kernel bandwidths) by which to extend an automatically-calculated extent. The default value is 3, capturing 99% of the density from the most extreme points. Set this value to 0 to trim the density estimate to the minimum and maximum observed data points. This option is ignored if the extent option is provided.
    • bins: The number of bins to use for the internal grid. The default is [256, 256] bins. If array-valued, specifies the x- and y-bins separately. If number-valued, sets both x- and y-bins to the same value. The returned density estimate will include a total of bins[0] * bins[1] equally-spaced sample points over the extent.

Example

// perform 2D estimation with bandwidths [1, 1] over extent [[0, 10], [0, 10]]
// use default grid size ([256, 256])
// returns an iterator over [ { x, y, z }, ... ] points
const data = [[1, 1], [1, 2], [5, 4], [5, 3], [6, 2], [8, 7]];
kde.density2d(data, { bandwidth: 1, extent: [0, 10] })
// perform 2D estimation with different bandwidths and extent for x and y
// returns an iterator over [ { x, y, z }, ... ] points
const data = [[1, 1], [1, 2], [5, 4], [5, 3], [6, 2], [8, 7]];
kde.density2d(data, { bandwidth: [1, 0.5], extent: [[1, 9], [1, 8]] })

# density2d.grid()

Returns the internal grid array of total accumulated density values per bin. To instead produce an array of objects containing coordinate values and probability density function estimates, use density2d.points().

# density2d.points([x, y, z])

Returns an iterator over objects containing sample points (x, y) and density value (z).

  • x: The property name for the x-dimension sample point (default "x").
  • y: The property name for the y-dimension sample point (default "y").
  • z: The property name for the estimated density value (default "z").

# density2d.bandwidth([bandwidth])

Get or set the bandwidths (standard deviations) of the Gaussian kernel. If array-valued, specifies the x- and y-bandwidths separately. If number-valued, sets both x- and y-bandwidths to the same value. Setting the bandwidth will update the estimator efficiently without re-performing binning. The extent will remain unchanged, even if previously determined automatically.

# density2d.extent()

Get the calculated extent of density estimation. Returns an array containing the x- and y-dimension extents: [[xmin, xmax], [ymin, ymax]]. This method does not support setting the extent to a new value, as this requires re-binning the input data.

# density2d.heatmap([options])

Generate a heatmap image of the 2D density. Returns an HTML canvas element.

  • options: Options for heatmap image generation.
    • color: A color function that maps density values (normalized to the domain [0, 1]) to either valid CSS color strings or to RGB color objects with r, g, b properties (in the range 0-255) and an optional opacity property (a fractional value between 0 and 1). If CSS color strings are used, the d3-color library must also be loaded.
    • clamp: Sets the range of density values to a given [min, max] array. Values below the minimum or above the maximum will be clamped to the provided values. Values within the clamped range are then normalized to the domain [0, 1].
    • canvas: A canvas element to draw into. If unspecified, a new canvas instance is created with dimensions matching the density estimator bins option.
    • maxColors: The maximum number of colors (default 256) to use for heatmap generation. This number determines the size of the backing pre-computed palette, and thus the number of times the color function is invoked.

Utility Methods

# kde.nrd(data, accessor)

Calculates a suggested bandwidth for a set of numeric data values, using Scott's normal reference distribution (NRD) heuristic.

# kde.opacityMap(r, g, b)

Returns a color map function (compatible with the heatmap color option) that ramps the opacity for a fixed set of r, g, b values in the range 0-255.