This function is optimized with numba and multithreading (Not python library) and does not store the distance matrix in memory but computes it inplace using lazy computation approach
Related research: "Density-Based Clustering Validation". Davoud Moulavi, Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Arthur Zimek, Jörg Sander. https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
-
We use Prim's algorithm to find mst
-
So far, only the square of Euclid's norm is available
However, it's easy to add the norm you need to this function: wrap the function to compute the norm in an njit decorator with 'signature_for_norms' from src.config like this:
@numba.njit(signature_for_norms, cache=True) def your_norm(x, y): ...
-
You can control the number of threads in use via numba.set_num_threads()
Tests were run on the make_blobs dataset from sklearn.datasets
On average, 15 times faster than realizations from https://github.com/FelSiq/DBCV
Vector length | Num. of clusters | Av. time of 10 runs for "old" DBCV sec. | Av. time of 10 runs for "new" DBCV sec. |
---|---|---|---|
6 | 2 | 26.86 | 1.23 |
6 | 4 | 11.33 | 0.60 |
6 | 8 | 5.47 | 0.34 |
6 | 10 | 4.33 | 0.28 |
8 | 2 | 26.53 | 1.41 |
8 | 4 | 11.41 | 0.69 |
8 | 8 | 5.23 | 0.37 |
8 | 10 | 4.42 | 0.34 |
Vector length | Num. of clusters | Av. time of 10 runs for "old" DBCV sec. | Av. time of 10 runs for "new" DBCV sec. |
---|---|---|---|
6 | 2 | 0.209 | 0.014 |
6 | 4 | 0.147 | 0.012 |
6 | 8 | 0.101 | 0.008 |
6 | 10 | 0.102 | 0.008 |
8 | 2 | 0.221 | 0.018 |
8 | 4 | 0.132 | 0.010 |
8 | 8 | 0.102 | 0.008 |
8 | 10 | 0.098 | 0.008 |
- Currently the bottleneck of the function is the mst build, which is done in a single thread. It is planned to switch to Boruvka's algorithm, which can be executed in parallel
- There are also plans to add support for execution on gpu