ParallelKMeans.jl Package
Motivation
It's actually a funny story that led to the development of this package. What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimization tips. Long story short, the Julia community is an amazing one! Andrey offered his help and together we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind-blowing, so we decided to tidy up the implementation and share it with the world as a maintained Julia package.
Say hello to ParallelKMeans!
This package aims to utilize the speed of Julia and parallelization (both CPU & GPU) to offer an extremely fast implementation of the K-Means clustering algorithm and its variants via a friendly interface for practitioners.
In short, we hope this package will eventually mature as the "one-stop-shop" for everything K-Means on CPUs and GPUs.
K-Means Algorithm Implementation Notes
Since Julia is a column-major language, the input (design matrix) expected by the package must be in the following format:

- Design matrix X of size n×m: the i-th column of X (`X[:, i]`) is a single data point in n-dimensional space.
- Thus, the rows of the design matrix represent the feature space, with the columns representing all the training samples in this feature space.
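For example, if your data is stored sample-per-row (as is typical when loading CSV files or DataFrames), it must be transposed first. A minimal sketch, using hypothetical random data:

using ParallelKMeans

data = rand(1_000, 4)   # hypothetical data: 1,000 samples × 4 features (row-major)
X = collect(data')      # transpose so each column is one data point (4×1_000)
result = kmeans(X, 3)   # cluster into 3 groups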
One of the pitfalls of the K-Means algorithm is that it can fall into a local minimum. This implementation inherits this problem, like every implementation does. As a result, it is useful in practice to restart it several times and keep the best result.
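A minimal sketch of such a restart loop, assuming `X` is a hypothetical design matrix, keeping the run with the lowest total cost:

using ParallelKMeans

X = rand(4, 1_000)   # hypothetical 4×1_000 design matrix

# run K-Means several times and keep the result with the lowest objective
runs = [kmeans(X, 3) for _ in 1:10]
best = runs[argmin([r.totalcost for r in runs])]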
Installation
If you are using Julia in the recommended Juno IDE, the number of threads is already set to the number of available CPU cores, so multithreading is enabled out of the box. For other IDEs, the number of threads must be set in your environment before launching the Julia REPL from the command line.
TIP: One needs to navigate or point to the Julia executable file to be able to launch it in the command line. Enable multithreading on Mac/Linux systems via:
export JULIA_NUM_THREADS=n # where n is the number of threads/cores
For Windows systems:
set JULIA_NUM_THREADS=n # where n is the number of threads/cores
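Once inside the REPL, you can verify that the setting took effect:

julia> Threads.nthreads()  # should report the value of n set above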
You can grab the latest stable version of this package from the Julia registry by simply running:
NB: Don't forget to invoke Julia's package manager with ]
pkg> add ParallelKMeans
The few (and selected) brave ones can grab the current experimental features by simply adding the master branch to your development environment after invoking the package manager with ]:
pkg> add ParallelKMeans#master
You are good to go with bleeding edge features and breakages!
To revert to a stable version, you can simply run:
pkg> free ParallelKMeans
Features
- Lightning-fast implementation of the K-Means clustering algorithm even on a single thread in native Julia.
- Support for a multi-threaded implementation of the K-Means clustering algorithm.
- K-Means++ initialization for faster and better convergence.
- Implementation of available classic and contemporary variants of the K-Means algorithm.
Pending Features
- [X] Implementation of the Hamerly algorithm.
- [X] Interface for inclusion in Alan Turing Institute's MLJModels.
- [X] Full implementation of triangle-inequality-based acceleration from Elkan 2003, "Using the Triangle Inequality to Accelerate K-Means".
- [X] Implementation of Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup.
- [X] Implementation of Coresets.
- [X] Support for weighted K-means.
- [X] Support for MLJ's random number generation hyperparameter.
- [ ] Support for other distance metrics supported by Distances.jl.
- [ ] Implementation of Geometric methods to accelerate k-means algorithm.
- [ ] Native support for tabular data inputs outside of MLJModels' interface.
- [ ] Refactoring and finalization of API design.
- [ ] GPU support.
- [ ] Distributed calculations support.
- [ ] Optimization of code base.
- [ ] Improved documentation.
- [ ] More benchmark tests.
How To Use
Taking advantage of Julia's brilliant multiple dispatch system, the package offers users a very easy-to-use API.
using ParallelKMeans

# X is assumed to be a d×n design matrix; here a hypothetical random one
X = rand(5, 1_000)

# Uses all available CPU cores by default
multi_results = kmeans(X, 3; max_iters=300)

# Use only 1 core of CPU
results = kmeans(X, 3; n_threads=1, max_iters=300)
The main design goal is to offer all available variations of the K-Means algorithm to end users as composable elements. By default, Lloyd's implementation is used, but users can specify different variations of the K-Means clustering algorithm via this interface:
some_results = kmeans([algo], input_matrix, k; kwargs)
# example
r = kmeans(Lloyd(), X, 3) # same result as the default
# r contains all the learned artifacts that can be accessed as:
r.centers # cluster centers (d x k)
r.assignments # label assignments (n)
r.totalcost # total cost (i.e. objective)
r.iterations # number of elapsed iterations
r.converged # whether the procedure converged
Supported KMeans algorithm variations and recommended use cases
- Lloyd() - Default algorithm, but only recommended for very small matrices (switch to n_threads = 1 to avoid overhead).
- Hamerly() - Hamerly is good for a moderate number of clusters (< 50?) and moderate dimensions (< 100?).
- Elkan() - Recommended for high dimensional data.
- Yinyang() - Recommended for large dimensions and/or large number of clusters.
- Coreset() - Recommended for very fast clustering of very large datasets, when extreme accuracy is not important.
- Geometric() - (Coming soon)
- MiniBatch() - (Coming soon)
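To see which variant suits your data, it is easy to try the currently available ones side by side. A minimal sketch, using a hypothetical random design matrix:

using ParallelKMeans

X = rand(50, 10_000)   # hypothetical 50-dimensional data, 10,000 points

# compare the available variants on the same data
for alg in (Lloyd(), Hamerly(), Elkan(), Yinyang())
    r = kmeans(alg, X, 10; verbose = false)
    println(typeof(alg), ": cost = ", r.totalcost, ", iterations = ", r.iterations)
end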
Practical Usage Examples
Some of the common usage examples of this package are as follows:
Clustering With A Desired Number Of Groups
using ParallelKMeans, RDatasets, Plots
# load the data
iris = dataset("datasets", "iris");
# features to use for clustering
features = collect(Matrix(iris[:, 1:4])');
# various artifacts can be accessed from the result, e.g. assigned labels, cost value, etc.
result = kmeans(features, 3);
# plot with the point color mapped to the assigned cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
color=:lightrainbow, legend=false)
Elbow Method For The Selection Of The Optimal Number Of Clusters
using ParallelKMeans

# X is assumed to be a d×n design matrix; here a hypothetical random one
X = rand(5, 10_000)

# Single-threaded implementation of Lloyd's algorithm
b = [ParallelKMeans.kmeans(X, i; n_threads=1, tol=1e-6, max_iters=300, verbose=false).totalcost for i = 2:10]

# Multi-threaded implementation of Lloyd's algorithm (default)
c = [ParallelKMeans.kmeans(X, i; tol=1e-6, max_iters=300, verbose=false).totalcost for i = 2:10]
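The resulting costs can then be plotted against the candidate number of clusters to locate the "elbow"; a sketch assuming the Plots package is available:

using Plots

# look for the k at which the cost curve flattens out
plot(2:10, c, xlabel="number of clusters k", ylabel="total cost",
     marker=:circle, legend=false)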
Benchmarks
Currently, this package is benchmarked against similar implementations in both Python and Julia. All reproducible benchmarks can be found in the ParallelKMeans/extras directory. More tests in various languages are planned beyond the initial release version (0.1.0).
Note: All benchmark tests are made on the same computer to help eliminate any bias.
| PC Name | CPU | RAM |
|---|---|---|
| iMac (Retina 5K, 27-inch, 2019) | 3 GHz 6-Core Intel Core i5 | 8 GB 2667 MHz DDR4 |
Currently, the benchmark speed tests are based on the search for the optimal number of clusters using the Elbow Method, since this is a practical use case for most practitioners employing the K-Means algorithm.
Benchmark Results
| 1 million samples (secs) | 100k samples (secs) | 10k samples (secs) | 1k samples (secs) | package | language |
|---|---|---|---|---|---|
| 538.53100 | 33.15700 | 0.74238 | 0.01710 | Clustering.jl | Julia |
| 220.35700 | 20.93600 | 0.82430 | 0.02639 | mlpack | C++ Wrapper |
| 20.55400 | 2.91300 | 0.17559 | 0.00609 | Lloyd | Julia |
| 11.51800 | 0.96637 | 0.09990 | 0.00635 | Hamerly | Julia |
| 14.01900 | 1.13100 | 0.07912 | 0.00646 | Elkan | Julia |
| 9.97000 | 1.14600 | 0.10834 | 0.00704 | Yinyang | Julia |
| 1,430.00000 | 146.00000 | 5.77000 | 0.34400 | Sklearn KMeans | Python |
| 30.10000 | 3.75000 | 0.61300 | 0.20100 | Sklearn MiniBatchKMeans | Python |
| 218.20000 | 15.51000 | 0.73370 | 0.01947 | Knor | R |
Release History
- 0.1.0 Initial release.
- 0.1.1 Added interface for MLJ.
- 0.1.2 Added `Elkan` algorithm.
- 0.1.3 Faster & optimized execution.
- 0.1.4 Bug fixes.
- 0.1.5 Added `Yinyang` algorithm.
- 0.1.6 Added support for weighted k-means; added `Coreset` algorithm; improved support for different types of the design matrix.
- 0.1.7 Added `Yinyang` and `Coreset` support in the MLJ interface; added `weights` support in MLJ; added RNG seed support in the MLJ interface and through all algorithms; added metric support.
- 0.1.8 Minor cleanup.
- 0.1.9 Added Travis support for Julia 1.5.
Contributing
Ultimately, we see this package as potentially the one-stop-shop for everything related to the K-Means algorithm and its speed-up variants. We are open to new implementations and ideas from anyone interested in this project.
Detailed contribution guidelines will be added in upcoming releases.
<!-- TODO: Contribution Guidelines -->
ParallelKMeans.AbstractKMeansAlg
ParallelKMeans.ClusteringResult
ParallelKMeans.Coreset
ParallelKMeans.Elkan
ParallelKMeans.Hamerly
ParallelKMeans.KmeansResult
ParallelKMeans.Lloyd
ParallelKMeans.Yinyang
MLJModelInterface.fit
ParallelKMeans.chunk_colwise
ParallelKMeans.chunk_initialize
ParallelKMeans.chunk_update_bounds
ParallelKMeans.chunk_update_bounds
ParallelKMeans.chunk_update_centroids
ParallelKMeans.create_containers
ParallelKMeans.distance
ParallelKMeans.distance
ParallelKMeans.double_argmax
ParallelKMeans.kmeans
ParallelKMeans.kmeans!
ParallelKMeans.move_centers
ParallelKMeans.point_all_centers!
ParallelKMeans.point_all_centers!
ParallelKMeans.smart_init
ParallelKMeans.splitter
ParallelKMeans.sum_of_squares
ParallelKMeans.update_containers
ParallelKMeans.@parallelize
ParallelKMeans.AbstractKMeansAlg — Type

AbstractKMeansAlg
Abstract base type inherited by all sub-KMeans algorithms.
ParallelKMeans.ClusteringResult — Type

ClusteringResult

Base type for the output of a clustering algorithm.
ParallelKMeans.Coreset — Type

Coreset()
Coreset algorithm implementation, based on "Lucic, Mario & Bachem, Olivier & Krause, Andreas. (2015). Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures."
`Coreset` supports the following arguments:

- `m`: default 100, subsample size
- `alg`: default `Lloyd()`, algorithm used to cluster the subsample

It can be used directly in the `kmeans` function:
X = rand(30, 100_000) # 100_000 random points in 30 dimensions
# 3 clusters, Coreset algorithm with default Lloyd algorithm and 100 subsamples
kmeans(Coreset(), X, 3)
# 3 clusters, Coreset algorithm with Hamerly algorithm and 500 subsamples
kmeans(Coreset(m = 500, alg = Hamerly()), X, 3)
kmeans(Coreset(500, Hamerly()), X, 3)
# alternatively short form can be used for defining subsample size or algorithm only
kmeans(Coreset(500), X, 3) # sample of the size 500, Lloyd clustering algorithm
kmeans(Coreset(Hamerly()), X, 3) # sample of the size 100, Hamerly clustering algorithm
ParallelKMeans.Elkan — Type

Elkan()
Elkan algorithm implementation, based on "Charles Elkan. 2003. Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning (ICML’03). AAAI Press, 147–153."
This algorithm provides much faster convergence than the Lloyd algorithm, especially for high dimensional data. It can be used directly in the `kmeans` function:
X = rand(30, 100_000) # 100_000 random points in 30 dimensions
kmeans(Elkan(), X, 3) # 3 clusters, Elkan algorithm
ParallelKMeans.Hamerly — Type

Hamerly()
Hamerly algorithm implementation, based on "Hamerly, Greg. (2010). Making k-means Even Faster. Proceedings of the 2010 SIAM International Conference on Data Mining. 130-140. 10.1137/1.9781611972801.12."
This algorithm provides much faster convergence than the Lloyd algorithm with a relatively small increase in memory footprint. It is especially suitable for low- to medium-dimensional input data.

It can be used directly in the `kmeans` function:
X = rand(30, 100_000) # 100_000 random points in 30 dimensions
kmeans(Hamerly(), X, 3) # 3 clusters, Hamerly algorithm
ParallelKMeans.KmeansResult — Type

KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult

The output of `kmeans` and `kmeans!`.

Type parameters:

- `C<:AbstractMatrix{<:AbstractFloat}`: type of the `centers` matrix
- `D<:Real`: type of the assignment cost
- `WC<:Real`: type of the cluster weight

C is the type of the centers, an (abstract) matrix of size (d x k). D is the type of pairwise distance computation from points to cluster centers. WC is the type of cluster weights, either Int (in the case where points are unweighted) or eltype(weights) (in the case where points are weighted).
ParallelKMeans.Lloyd — Type

Lloyd <: AbstractKMeansAlg
Basic algorithm for k-means calculation.
ParallelKMeans.Yinyang — Type

Yinyang()
Yinyang algorithm implementation, based on "Yufei Ding et al. 2015. Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup. Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015"
Generally it outperforms the `Hamerly` algorithm and takes roughly the same time as the `Elkan` algorithm, with much lower memory consumption.

`Yinyang` supports the following arguments:

- `auto`: `Bool`, indicates whether to perform automated or manual grouping
- `group_size`: `Int`, estimate of the average number of clusters per group. Lower numbers correspond to higher calculation speed and higher memory consumption, and vice versa.

It can be used directly in the `kmeans` function:
X = rand(30, 100_000) # 100_000 random points in 30 dimensions
# 3 clusters, Yinyang algorithm, with default group_size of 7
kmeans(Yinyang(), X, 3)
# Following are equivalent
# 3 clusters, Yinyang algorithm with 10 group_size
kmeans(Yinyang(group_size = 10), X, 3)
kmeans(Yinyang(10), X, 3)
# One group with the size of the number of points
kmeans(Yinyang(auto = false), X, 3)
kmeans(Yinyang(false), X, 3)
# Chinese writing can be used
kmeans(阴阳(), X, 3)
MLJModelInterface.fit — Method

Fit the specified ParallelKMeans model constructed by the user.
See also the [package documentation](https://pydatablog.github.io/ParallelKMeans.jl/stable).
ParallelKMeans.chunk_colwise — Method

chunk_colwise!(target, x, y, i, weights, r, idx)
Utility function for the calculation of the weighted distance between points `x` and the centroid vector `y[:, i]`. The UnitRange argument `r` selects the subarray of the original design matrix `x` that is going to be processed.
ParallelKMeans.chunk_initialize — Method

chunk_initialize(alg::Hamerly, containers, centroids, X, weights, metric, r, idx)

Initial calculation of all bounds and point labeling.
ParallelKMeans.chunk_update_bounds — Method

chunk_update_bounds(alg::Hamerly, containers, r1, r2, pr1, pr2, metric::Euclidean, r, idx)

Updates the upper and lower bounds of each point's distance to the centers, with regard to the centers' movement, when the metric is Euclidean.
ParallelKMeans.chunk_update_bounds — Method

chunk_update_bounds(alg::Hamerly, containers, r1, r2, pr1, pr2, metric::Metric, r, idx)

Updates the upper and lower bounds of each point's distance to the centers, with regard to the centers' movement, for a generic metric.
ParallelKMeans.chunk_update_centroids — Method

chunk_update_centroids(alg::Hamerly, containers, centroids, X, weights, metric, r, idx)

A detailed description of this function can be found in the original paper. It iterates through all points and tries to skip some calculations using the known upper and lower bounds of distances from points to centers. If it fails to skip, then it falls back to the generic `point_all_centers!` function.
ParallelKMeans.create_containers — Method

create_containers(::Lloyd, k, nrow, ncol, n_threads)

Internal function for the creation of all necessary intermediate structures:

- `centroids_new` - container which holds new positions of centroids
- `centroids_cnt` - container which holds the number of points for each centroid
- `labels` - vector which holds labels of the corresponding points
ParallelKMeans.distance — Method

distance(metric, X1, X2, i1, i2)
Allocationless calculation of distance between vectors X1[:, i1] and X2[:, i2] defined by the supplied distance metric.
ParallelKMeans.distance — Method

distance(X1, X2, i1, i2)

Allocationless calculation of the squared Euclidean distance between the vectors X1[:, i1] and X2[:, i2].
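A quick sketch of calling this internal helper directly (it is not exported, and the data here is hypothetical):

using ParallelKMeans

X = rand(3, 10)   # hypothetical 3×10 matrix

# squared Euclidean distance between columns 1 and 2 of X, without allocations
d = ParallelKMeans.distance(X, X, 1, 2)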
ParallelKMeans.double_argmax — Method

double_argmax(p)

Finds the maximum and the next-largest arguments.
ParallelKMeans.kmeans! — Function

kmeans!(alg::AbstractKMeansAlg, containers, design_matrix, k; n_threads = nthreads(), k_init="k-means++", max_iters=300, tol=1e-6, verbose=true)

Mutable version of the `kmeans` function. Definitions of the arguments and results can be found in `kmeans`.

The argument `containers` represents algorithm-specific containers, such as labels, intermediate centroids, and so on, which are used during calculations.
ParallelKMeans.kmeans — Method

kmeans([alg::AbstractKMeansAlg,] design_matrix, k; n_threads = nthreads(), k_init="k-means++", max_iters=300, tol=1e-6, verbose=true, rng = Random.GLOBAL_RNG)

This main function employs the K-means algorithm to cluster all examples in the training data (design_matrix) into k groups using either the `k-means++` or random initialisation technique for selecting the initial centroids.

At the end of the number of iterations specified (max_iters), convergence is achieved if the difference between the current and last cost objective is less than the tolerance level (tol). An error is thrown if convergence fails.
Arguments:

- `alg` defines one of the algorithms used to calculate `k-means`. This argument can be omitted; by default the Lloyd algorithm is used.
- `n_threads` defines the number of threads used for calculations; by default it is equal to `Threads.nthreads()`, which is defined by the `JULIA_NUM_THREADS` environment variable. For small design matrices it makes sense to set this argument to 1 in order to avoid the overhead of thread generation.
- `k_init` is one of the algorithms used for initialization. By default the `k-means++` algorithm is used; alternatively one can use `rand` to choose random points for init.
- `max_iters` is the maximum number of iterations.
- `tol` defines the tolerance for early stopping.
- `verbose` is the verbosity level. Details of operations can be either printed or not by setting verbose accordingly.
A `KmeansResult` structure representing the labels, centroids, and sum of squares is returned.
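A sketch exercising the documented keyword arguments together (the data here is hypothetical):

using ParallelKMeans, Random

X = rand(5, 1_000)   # hypothetical 5×1_000 design matrix

# Hamerly variant, 2 threads, seeded RNG for reproducibility
r = kmeans(Hamerly(), X, 4; n_threads = 2, max_iters = 500,
           tol = 1e-8, verbose = false, rng = MersenneTwister(42))

r.converged   # whether the procedure converged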
ParallelKMeans.move_centers — Method

move_centers(::Hamerly, containers, centroids, metric)

Calculates the new positions of the centers and the distance they have moved. Results are stored in `centroids` and `p` respectively.
ParallelKMeans.point_all_centers! — Method

point_all_centers!(containers, centroids, X, i, metric)
Calculates new labels and upper and lower bounds for all points.
ParallelKMeans.point_all_centers! — Method

point_all_centers!(containers, centroids, X, i)
Calculates new labels and upper and lower bounds for all points.
ParallelKMeans.smart_init — Function

smart_init(X, k; init="k-means++")

This function handles the random initialisation of the centroids from the design matrix (X) and the desired number of groups (k) that a user supplies.

The `k-means++` algorithm is used by default, with normal random selection of centroids from X used if any other string is supplied.

A named tuple representing the centroids and indices respectively is returned.
ParallelKMeans.splitter — Method

splitter(n, k)

Internal utility function; splits the 1:n sequence into k chunks of approximately the same size.
ParallelKMeans.sum_of_squares — Method

sum_of_squares(x, labels, centre, k)

This function computes the total sum of squares based on the assignments (labels), the design matrix (x), the centroids (centre), and the number of desired groups (k).

A Float type representing the computed metric is returned.
ParallelKMeans.update_containers — Method

update_containers(::Hamerly, containers, centroids, n_threads, metric)
Calculates minimum distances from centers to each other.
ParallelKMeans.@parallelize — Macro

@parallelize(n_threads, ncol, f)

Parallelizes the function f and runs it over n_threads threads. The function should satisfy the following conditions:

- It should not return any values.
- It should accept two parameters at the end of its argument list. The first accepted parameter is `range`, which defines the chunk used in calculations. The second parameter is `idx`, which defines the id of the container where results can be stored.

The `ncol` argument defines the range 1:ncol, which is sliced into `n_threads` chunks.