Function for calculating Density-Based Clustering Validation metric

This function is optimized with numba and multithreading (Not python library) and does not store the distance matrix in memory but computes it inplace using lazy computation approach

Related research: "Density-Based Clustering Validation". Davoud Moulavi, Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Arthur Zimek, Jörg Sander. https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf

A little bit about implementation

We use Prim's algorithm to find mst
So far, only the square of Euclid's norm is available

However, it's easy to add the norm you need to this function: wrap the function to compute the norm in an njit decorator with 'signature_for_norms' from src.config like this:
```
    @numba.njit(signature_for_norms, cache=True)
    def your_norm(x, y):
        ...
```
You can control the number of threads in use via numba.set_num_threads()

Speed

Tests were run on the make_blobs dataset from sklearn.datasets

On average, 15 times faster than realizations from https://github.com/FelSiq/DBCV

Sample size == 10^4

Vector length	Num. of clusters	Av. time of 10 runs for "old" DBCV sec.	Av. time of 10 runs for "new" DBCV sec.
6	2	26.86	1.23
6	4	11.33	0.60
6	8	5.47	0.34
6	10	4.33	0.28
8	2	26.53	1.41
8	4	11.41	0.69
8	8	5.23	0.37
8	10	4.42	0.34

Sample size == 10^3

Vector length	Num. of clusters	Av. time of 10 runs for "old" DBCV sec.	Av. time of 10 runs for "new" DBCV sec.
6	2	0.209	0.014
6	4	0.147	0.012
6	8	0.101	0.008
6	10	0.102	0.008
8	2	0.221	0.018
8	4	0.132	0.010
8	8	0.102	0.008
8	10	0.098	0.008

Future plans

Currently the bottleneck of the function is the mst build, which is done in a single thread. It is planned to switch to Boruvka's algorithm, which can be executed in parallel
There are also plans to add support for execution on gpu

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
src		src
tests		tests
.gitignore		.gitignore
DBCV.py		DBCV.py
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Function for calculating Density-Based Clustering Validation metric

A little bit about implementation

Speed

Sample size == 10^4

Sample size == 10^3

Future plans

About

Languages

boyfws/DBCV

Folders and files

Latest commit

History

Repository files navigation

Function for calculating Density-Based Clustering Validation metric

A little bit about implementation

Speed

Sample size == 10^4

Sample size == 10^3

Future plans

About

Topics

Resources

Stars

Watchers

Forks

Languages