K-means MapReduce implementation

In this work k-means clustering algorithm is implemented using MapReduce (Hadoop version 2.8) framework.

To run the program, shell script run.sh should be executed. It requires path to jar file and its input parameters which are:

input - path to data file
state - path to file that contains clusters
number - number of reducers
output - output directory
delta - threshold convergence (acceptable difference between 2 subsequent centroids)
max - maximum number of iterations
distance - similairty measure (currently only Euclidean distance is supported)

Workflow

The figure below denotes one iteration of MapReduce program.

First, Centroids and Context (Configuration) are loaded into the Distributed Cache. This is done by overriding setup function in the Mapper and Reducer class. Afterwards, the input data file is split and each data point is processed by one of the map functions (in Map process). The function writes key-value pairs <Centroid, Point>, where the Centroid is the closest one to the Point. Next, Combiner is used in order to decrease the number of local writings. In this phase data points that are on the same machine are summed up and the number of those data points is recorded, Point.number variable. Now, for the optimization reasons output values are automatically shuffled and sorted by Centroids. The Reducer performs the same procedure as the Combiner, but it also checks whether centroids converged; comparing the difference between old and new centroids with delta input parameter. If a centroid converge, then the global Counter remains unchanged, otherwise, it is incremented.

After the one iteration is done, new centroids are saved and the program checks two conditions, if the program reached the maximum number of iterations or if the Counter value is unchanged. If one of these two conditions is satisfied, then the program is finished, otherwise, the whole MapReduce process is run again with the updated centroids.

Examples

One of the use-cases of k-means algorithm is the color quantization process, reducing the number of distinct colors of an image. (Far better algorithms for this purpose are available)

Numerical (RGB) values of images (Fig. 1) are saved as input data (Fig. 2), and clusters are randomly initialized.

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
Resources		Resources
data_prep_scripts		data_prep_scripts
executable_jar		executable_jar
figures		figures
src		src
.gitignore		.gitignore
README.md		README.md
run.sh		run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

K-means MapReduce implementation

Workflow

Examples

Original Images

RGB values of original and modified images

After 10 iterations with 10 clusters, RBG values are represented in Fig. 3. It can be noted that a couple of centroids have vanished.

Modified images for a different number of centroids

Modified images for a different number of iterations and 10 centroids

About

Releases

Packages

Contributors 2

Languages

markomih/kmeans_mapreduce

Folders and files

Latest commit

History

Repository files navigation

K-means MapReduce implementation

Workflow

Examples

Original Images

RGB values of original and modified images

After 10 iterations with 10 clusters, RBG values are represented in Fig. 3. It can be noted that a couple of centroids have vanished.

Modified images for a different number of centroids

Modified images for a different number of iterations and 10 centroids

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages