In this work k-means clustering algorithm is implemented using MapReduce (Hadoop version 2.8) framework.
To run the program, shell script run.sh
should be executed. It requires path to jar file and its input parameters which are:
input
- path to data filestate
- path to file that contains clustersnumber
- number of reducersoutput
- output directorydelta
- threshold convergence (acceptable difference between 2 subsequent centroids)max
- maximum number of iterationsdistance
- similairty measure (currently only Euclidean distance is supported)
The figure below denotes one iteration of MapReduce program.
First, Centroids and Context (Configuration) are loaded into the Distributed Cache. This is done by overriding setup function in the Mapper and Reducer class. Afterwards, the input data file is split and each data point is processed by one of the map functions (in Map process). The function writes key-value pairs <Centroid, Point>, where the Centroid is the closest one to the Point. Next, Combiner is used in order to decrease the number of local writings. In this phase data points that are on the same machine are summed up and the number of those data points is recorded, Point.number variable. Now, for the optimization reasons output values are automatically shuffled and sorted by Centroids. The Reducer performs the same procedure as the Combiner, but it also checks whether centroids converged; comparing the difference between old and new centroids with delta input parameter. If a centroid converge, then the global Counter remains unchanged, otherwise, it is incremented.
After the one iteration is done, new centroids are saved and the program checks two conditions, if the program reached the maximum number of iterations or if the Counter value is unchanged. If one of these two conditions is satisfied, then the program is finished, otherwise, the whole MapReduce process is run again with the updated centroids.
One of the use-cases of k-means algorithm is the color quantization process, reducing the number of distinct colors of an image. (Far better algorithms for this purpose are available)
Numerical (RGB) values of images (Fig. 1) are saved as input data (Fig. 2), and clusters are randomly initialized.