Clustering

Clustering is the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other, by some chosen measure, than to objects in other clusters.

http://spark.apache.org/docs/latest/mllib-clustering.html

K-Means

K-Means clustering aims to partition n observations into k clusters so that each observation belongs to the cluster with the nearest mean. That mean serves as the prototype of the cluster.

method: KMeans
model: KMeansModel
ruby: clustering/kmeans.rb

data = [
  DenseVector.new([0.0,0.0]),
  DenseVector.new([1.0,1.0]),
  DenseVector.new([9.0,8.0]),
  DenseVector.new([8.0,9.0])
]

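# $sc is assumed to be an already started SparkContext.
# Train a model with k = 2 clusters.
model = KMeans.train($sc.parallelize(data), 2, max_iterations: 10,
                     runs: 30, initialization_mode: "random")

# Points that lie close together should land in the same cluster.
model.predict([0.0, 0.0]) == model.predict([1.0, 1.0])
# => true
model.predict([8.0, 9.0]) == model.predict([9.0, 8.0])
# => true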
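
The nearest-mean idea described above can be illustrated without Spark. The following is a minimal plain-Ruby sketch of Lloyd's algorithm, the usual iterative procedure for K-Means. It is not how ruby-spark or Spark implements KMeans.train. The helper names (squared_distance, nearest_centroid, mean_point, kmeans_sketch) are hypothetical, and the initial centroids are passed in by hand rather than chosen at random, so the result is deterministic.

```ruby
# Plain-Ruby sketch of Lloyd's algorithm; it does not use Spark or ruby-spark.

# Squared Euclidean distance between two points given as arrays of floats.
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Index of the centroid closest to the point.
def nearest_centroid(point, centroids)
  centroids.each_index.min_by { |i| squared_distance(point, centroids[i]) }
end

# Component-wise mean of a non-empty list of points.
def mean_point(points)
  points.transpose.map { |coords| coords.sum / coords.size }
end

# Alternates the assignment step and the update step until the centroids
# stop moving or max_iterations is reached.
def kmeans_sketch(points, initial_centroids, max_iterations: 10)
  centroids = initial_centroids
  max_iterations.times do
    # Assignment step: group the points by their nearest centroid.
    groups = points.group_by { |p| nearest_centroid(p, centroids) }
    # Update step: move each centroid to the mean of its group.
    # A centroid that lost all of its points is kept where it is.
    updated = centroids.each_with_index.map do |c, i|
      groups.key?(i) ? mean_point(groups[i]) : c
    end
    break if updated == centroids
    centroids = updated
  end
  centroids
end

points = [[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]]
centroids = kmeans_sketch(points, [points[0], points[2]])
# => [[0.5, 0.5], [8.5, 8.5]]

nearest_centroid([0.0, 0.0], centroids) == nearest_centroid([1.0, 1.0], centroids)
# => true
```

The sketch uses the same four points as the example above, written as plain arrays instead of DenseVector objects. With random initialization, as in the KMeans.train call, the cluster indices may come out in a different order, which is why the example compares predictions instead of checking for specific labels.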

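Gaussian mixture

A Gaussian mixture model describes the data as drawn from k Gaussian components, each with its own mean, covariance, and weight. The model is fitted with the expectation-maximization algorithm.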

method: GaussianMixture
model: GaussianMixtureModel
ruby: clustering/gaussian_mixture.rb

data = [
  DenseVector.new([-0.1, -0.05]),
  DenseVector.new([-0.01, -0.1]),
  DenseVector.new([0.9, 0.8]),
  DenseVector.new([0.75, 0.935]),
  DenseVector.new([-0.83, -0.68]),
  DenseVector.new([-0.91, -0.76])
]

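# $sc is assumed to be an already started SparkContext.
# Train a mixture with k = 3 components.
model = GaussianMixture.train($sc.parallelize(data), 3, convergence_tol: 0.0001,
                              max_iterations: 50, seed: 10)

# Predict for every point; collect brings the results back to the driver as an array.
labels = model.predict($sc.parallelize(data)).collect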
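
To give a feel for what a prediction from a mixture model means, here is a minimal plain-Ruby sketch that assigns a point to the most probable of several Gaussian components. It does not use Spark, and it is an assumption that model.predict above returns hard labels of this kind. The Component struct, the helper names, and the component parameters are all hypothetical: they were picked by hand to sit near the three groups in the example data, and they are not the output of GaussianMixture.train. The sketch also assumes diagonal covariance matrices to keep the density simple.

```ruby
# Plain-Ruby sketch; the component parameters below are hypothetical,
# not the result of GaussianMixture.train.
Component = Struct.new(:weight, :mean, :variance)

# Density at the point of a Gaussian with a diagonal covariance matrix.
# variance holds the per-dimension variances.
def diagonal_gaussian_pdf(point, mean, variance)
  point.each_index.reduce(1.0) do |density, i|
    norm = 1.0 / Math.sqrt(2.0 * Math::PI * variance[i])
    density * norm * Math.exp(-((point[i] - mean[i])**2) / (2.0 * variance[i]))
  end
end

# Posterior probability of each component for the point (soft assignment).
def responsibilities(point, components)
  weighted = components.map do |c|
    c.weight * diagonal_gaussian_pdf(point, c.mean, c.variance)
  end
  total = weighted.sum
  weighted.map { |w| w / total }
end

# Index of the most probable component (hard assignment).
def most_likely_component(point, components)
  responsibilities(point, components).each_with_index.max_by { |prob, _| prob }.last
end

# Hand-picked, hypothetical components with equal weights.
components = [
  Component.new(1.0 / 3, [0.0, 0.0],   [0.05, 0.05]),
  Component.new(1.0 / 3, [0.8, 0.9],   [0.05, 0.05]),
  Component.new(1.0 / 3, [-0.9, -0.7], [0.05, 0.05])
]

most_likely_component([-0.1, -0.05], components)
# => 0
most_likely_component([0.9, 0.8], components)
# => 1
most_likely_component([-0.83, -0.68], components)
# => 2
```

Unlike K-Means, a mixture model also yields a probability for every component, which is what responsibilities returns. The hard label is just the index of the largest of those probabilities.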
