ParallelKMeans.jl Package

Motivation

It's actually a funny story that led to the development of this package. What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimization tips. Long story short, the Julia community is an amazing one! Andrey offered his help, and together we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind-blowing, so we decided to tidy up the implementation and share it with the world as a maintained Julia package.

Say hello to ParallelKMeans!

This package aims to utilize the speed of Julia and parallelization (both CPU & GPU) to offer an extremely fast implementation of the K-Means clustering algorithm and its variants via a friendly interface for practitioners.

In short, we hope this package will eventually mature as the "one-stop-shop" for everything K-Means on CPUs and GPUs.

K-Means Algorithm Implementation Notes

Since Julia is a column-major language, the input (design matrix) expected by the package must be in the following format (a short sketch follows the list below):

  • Design matrix X of size n×m, the i-th column of X (X[:, i]) is a single data point in n-dimensional space.
  • Thus, the rows of the design matrix represent the feature space with the columns representing all the training samples in this feature space.
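For example, a minimal sketch of preparing such a matrix (the sizes here are purely illustrative):

# 4 features (rows) x 150 samples (columns); each column is one data point
X = rand(4, 150)

# if your data is stored row-wise (samples x features), transpose it first
table = rand(150, 4)
X = permutedims(table)   # equivalent to collect(table')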

One of the pitfalls of the K-Means algorithm is that it can get stuck in a local minimum. This implementation inherits that problem, as every implementation does. As a result, it is useful in practice to restart it several times with different initializations and keep the best result.
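For example, a minimal sketch of such restarts, using the kmeans interface and the rng keyword described later in this document (the data and seeds are illustrative):

using ParallelKMeans, Random

X = rand(10, 1_000)   # hypothetical design matrix: 10 features, 1_000 points

# run several restarts with different seeds and keep the lowest-cost solution
runs = [kmeans(X, 3; rng = MersenneTwister(seed)) for seed in 1:5]
best = runs[argmin([r.totalcost for r in runs])]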

Installation

If you are using Julia in the recommended Juno IDE, the number of threads is already set to the number of available CPU cores, so multithreading is enabled out of the box. For other IDEs, the number of threads must be set via an environment variable before launching the Julia REPL from the command line.

TIP: You may need to navigate or point to the Julia executable to be able to launch it from the command line. Enable multithreading on Mac/Linux systems via:

export JULIA_NUM_THREADS=n  # where n is the number of threads/cores

For Windows systems:

set JULIA_NUM_THREADS=n  # where n is the number of threads/cores
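You can confirm the setting from within a running Julia session; Threads.nthreads() reports the number of threads the session was started with:

julia> Threads.nthreads()
4   # example output; matches the value of JULIA_NUM_THREADS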

You can grab the latest stable version of this package from the Julia registries by simply running:

NB: Don't forget to invoke Julia's package manager with ]

pkg> add ParallelKMeans

The few (and selected) brave ones can grab the current experimental features by adding the master branch to your development environment after invoking the package manager with ]:

pkg> add ParallelKMeans#master

You are good to go with bleeding edge features and breakages!

To revert to a stable version, you can simply run:

pkg> free ParallelKMeans

Features

  • Lightning-fast implementation of the K-Means clustering algorithm, even on a single thread, in native Julia.
  • Support for a multi-threaded implementation of the K-Means clustering algorithm.
  • K-means++ initialization for faster and better convergence.
  • Implementations of available classic and contemporary variants of the K-Means algorithm.

Pending Features

How To Use

Taking advantage of Julia's brilliant multiple dispatch system, the package exposes a very easy-to-use API.

using ParallelKMeans

# Uses all available Julia threads by default
multi_results = kmeans(X, 3; max_iters=300)

# Use only 1 core of CPU
results = kmeans(X, 3; n_threads=1, max_iters=300)

The main design goal is to offer all available variations of the K-Means algorithm to end users as composable elements. By default, Lloyd's implementation is used, but users can specify different variations of the K-Means clustering algorithm via this interface (see also the sketch after the algorithm list below):

some_results = kmeans([algo], input_matrix, k; kwargs)

# example
r = kmeans(Lloyd(), X, 3)  # same result as the default
# r contains all the learned artifacts that can be accessed as;
r.centers               # cluster centers (d x k)
r.assignments           # label assignments (n)
r.totalcost             # total cost (i.e. objective)
r.iterations            # number of elapsed iterations
r.converged             # whether the procedure converged
  • Lloyd() - Default algorithm but only recommended for very small matrices (switch to n_threads = 1 to avoid overhead).
  • Hamerly() - Good for a moderate number of clusters (< 50?) and moderate dimensions (< 100?).
  • Elkan() - Recommended for high dimensional data.
  • Yinyang() - Recommended for large dimensions and/or large number of clusters.
  • Coreset() - Recommended for very fast clustering of very large datasets, when extreme accuracy is not important.
  • MiniBatch() - Recommended for extremely large datasets, when extreme accuracy is not important. (Experimental implementation.)
  • Geometric() - (Coming soon)
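For instance, a minimal sketch of picking a variant based on the recommendations above (the data shapes and cluster counts are hypothetical):

using ParallelKMeans

X_highdim = rand(500, 10_000)    # many features per point: Elkan is recommended
r_elkan = kmeans(Elkan(), X_highdim, 10)

X_huge = rand(10, 1_000_000)     # very many points: MiniBatch trades some accuracy for speed
r_mb = kmeans(MiniBatch(100), X_huge, 10)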

Practical Usage Examples

Some of the common usage examples of this package are as follows:

Clustering With A Desired Number Of Groups

using ParallelKMeans, RDatasets, Plots

# load the data
iris = dataset("datasets", "iris");

# features to use for clustering
features = collect(Matrix(iris[:, 1:4])');

# various artifacts can be accessed from the result, e.g. assigned labels, cost value, etc.
result = kmeans(features, 3);

# plot with the point color mapped to the assigned cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
        color=:lightrainbow, legend=false)

(Scatter plot of iris petal length vs. petal width, colored by assigned cluster.)

Elbow Method For The Selection Of The Optimal Number Of Clusters

using ParallelKMeans

# Single Thread Implementation of Lloyd's Algorithm
b = [ParallelKMeans.kmeans(X, i; n_threads=1, tol=1e-6, max_iters=300, verbose=false).totalcost for i = 2:10]

# Multi-threaded Implementation of Lloyd's Algorithm by default
c = [ParallelKMeans.kmeans(X, i; tol=1e-6, max_iters=300, verbose=false).totalcost for i = 2:10]
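The resulting cost values can then be plotted against the number of clusters to locate the "elbow". A minimal sketch, assuming Plots is loaded and c was computed as above:

using Plots

ks = 2:10
plot(ks, c, marker = :circle, xlabel = "number of clusters k",
     ylabel = "total cost", legend = false)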

Benchmarks

Currently, this package is benchmarked against similar implementations in Python, R, C++ (via wrappers), and Julia. All reproducible benchmarks can be found in the ParallelKMeans/extras directory. More tests in various languages are planned beyond the initial release version (0.1.0).

Note: All benchmark tests are made on the same computer to help eliminate any bias.

| PC Name | CPU | RAM |
| --- | --- | --- |
| iMac (Retina 5K 27-inch 2019) | 3 GHz 6-Core Intel Core i5 | 24 GB 2667 MHz DDR4 |

Currently, the benchmark speed tests are based on the search for the optimal number of clusters using the Elbow Method, since this is a practical use case for most practitioners employing the K-Means algorithm.

Benchmark Results

(Benchmark results plot: benchmark_image.png. The underlying timings are tabulated below.)

| 1 million sample (secs) | 100k sample (secs) | 10k sample (secs) | 1k sample (secs) | package | language | process |
| --- | --- | --- | --- | --- | --- | --- |
| 282.7 | 15.27 | 0.7324 | 0.01682 | Knor | R | full scan |
| 854 | 87 | 6.11 | 0.000719 | Sklearn KMeans | Python | full scan |
| 11.2 | 1.41 | 0.000317 | 0.000141 | Sklearn MiniBatch Kmeans | Python | stochastic |
| 254.481 | 18.517 | 0.000794956 | 0.000031211 | Mlpack | C++ Wrapper | full scan |
| 653.178 | 45.468 | 0.000824115 | 0.000017301 | Clustering.jl | Julia | full scan |
| 19.955 | 2.758 | 0.000166957 | 0.000009206 | ParallelKMeans Lloyd | Julia | full scan |
| 11.234 | 1.654 | 0.000109074 | 0.000012819 | ParallelKMeans Hamerly | Julia | full scan |
| 19.394 | 1.436 | 0.000109262 | 0.000013726 | ParallelKMeans Elkan | Julia | full scan |
| 14.080 | 0.000972914 | 0.000095325 | 0.000009802 | ParallelKMeans YingYang | Julia | full scan |

Release History

  • 0.1.0 Initial release.
  • 0.1.1 Added interface for MLJ.
  • 0.1.2 Added Elkan algorithm.
  • 0.1.3 Faster & optimized execution.
  • 0.1.4 Bug fixes.
  • 0.1.5 Added Yinyang algorithm.
  • 0.1.6 Added support for weighted k-means; Added Coreset algorithm; improved support for different types of the design matrix.
  • 0.1.7 Added Yinyang and Coreset support in MLJ interface; added weights support in MLJ; added RNG seed support in MLJ interface and through all algorithms; added metric support.
  • 0.1.8 Minor cleanup.
  • 0.1.9 Added travis support for Julia 1.5.
  • 0.2.0 Updated MLJ Interface.
  • 0.2.1 Initial Mini-batch implementation.
  • 0.2.2 Updated MLJInterface.
  • 1.0.0 Stable public release.

Contributing

Ultimately, we see this package as potentially the one-stop-shop for everything related to the K-Means algorithm and its speed-up variants. We are open to new implementations and ideas from anyone interested in this project. This project adopts the ColPrac community guidelines.

ParallelKMeans.Coreset - Type
Coreset()

Coreset algorithm implementation, based on "Lucic, Mario & Bachem, Olivier & Krause, Andreas. (2015). Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures."

Coreset supports the following arguments:

  • m: default 100, subsample size
  • alg: default Lloyd(), algorithm used to cluster the subsample

It can be used directly in the kmeans function:

X = rand(30, 100_000)   # 100_000 random points in 30 dimensions

# 3 clusters, Coreset algorithm with default Lloyd algorithm and 100 subsamples
kmeans(Coreset(), X, 3)

# 3 clusters, Coreset algorithm with Hamerly algorithm and 500 subsamples
kmeans(Coreset(m = 500, alg = Hamerly()), X, 3)
kmeans(Coreset(500, Hamerly()), X, 3)

# alternatively short form can be used for defining subsample size or algorithm only
kmeans(Coreset(500), X, 3) # sample of the size 500, Lloyd clustering algorithm
kmeans(Coreset(Hamerly()), X, 3) # sample of the size 100, Hamerly clustering algorithm
ParallelKMeans.Elkan - Type
Elkan()

Elkan algorithm implementation, based on "Charles Elkan. 2003. Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning (ICML’03). AAAI Press, 147–153."

This algorithm provides much faster convergence than the Lloyd algorithm, especially for high-dimensional data. It can be used directly in the kmeans function:

X = rand(30, 100_000)   # 100_000 random points in 30 dimensions

kmeans(Elkan(), X, 3) # 3 clusters, Elkan algorithm
ParallelKMeans.Hamerly - Type
Hamerly()

Hamerly algorithm implementation, based on "Hamerly, Greg. (2010). Making k-means Even Faster. Proceedings of the 2010 SIAM International Conference on Data Mining. 130-140. 10.1137/1.9781611972801.12."

This algorithm provides much faster convergence than the Lloyd algorithm with a relatively small increase in memory footprint. It is especially suitable for low- to medium-dimensional input data.

It can be used directly in the kmeans function:

X = rand(30, 100_000)   # 100_000 random points in 30 dimensions

kmeans(Hamerly(), X, 3) # 3 clusters, Hamerly algorithm
ParallelKMeans.KMeans - Type
ParallelKMeans model constructed by the user.
See also the [package documentation](https://pydatablog.github.io/ParallelKMeans.jl/stable).
ParallelKMeans.KmeansResult - Type
KmeansResult{C,D<:Real,WC<:Real} <: ClusteringResult

The output of kmeans and kmeans!.

Type parameters

  • C<:AbstractMatrix{<:AbstractFloat}: type of the centers matrix
  • D<:Real: type of the assignment cost
  • WC<:Real: type of the cluster weight

C is the type of centers, an (abstract) matrix of size (d x k).

D is the type of pairwise distance computation from points to cluster centers.

WC is the type of cluster weights, either Int (in the case where points are unweighted) or eltype(weights) (in the case where points are weighted).
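A minimal sketch of inspecting these type parameters on a result (the matrix is illustrative; the field names follow the result fields shown earlier in this document):

using ParallelKMeans

X = rand(5, 1_000)
r = kmeans(X, 3)

r isa ParallelKMeans.KmeansResult   # true
typeof(r.centers)                   # the C parameter, e.g. Matrix{Float64}
typeof(r.totalcost)                 # a Real, e.g. Float64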

ParallelKMeans.MiniBatch - Type
MiniBatch(b::Int)

`b` represents the size of the batch which should be sampled.

Mini-batch k-means algorithm implementation, based on Sculley et al. 2007.

It can be used directly in the kmeans function:
X = rand(30, 100_000)  # 100_000 random points in 30 dimensions

kmeans(MiniBatch(100), X, 3)  # 3 clusters, MiniBatch algorithm with 100 batch samples at each iteration
ParallelKMeans.Yinyang - Type
Yinyang()

Yinyang algorithm implementation, based on "Yufei Ding et al. 2015. Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup. Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015"

Generally it outperforms the Hamerly algorithm and has roughly the same runtime as the Elkan algorithm, with much lower memory consumption.

Yinyang supports the following arguments:

  • auto: Bool, indicates whether to perform automated or manual grouping
  • group_size: Int, an estimate of the average number of clusters per group. Lower numbers correspond to higher calculation speed and higher memory consumption, and vice versa.

It can be used directly in the kmeans function:

X = rand(30, 100_000)   # 100_000 random points in 30 dimensions

# 3 clusters, Yinyang algorithm, with the default group_size of 7
kmeans(Yinyang(), X, 3)

# Following are equivalent
# 3 clusters, Yinyang algorithm with 10 group_size
kmeans(Yinyang(group_size = 10), X, 3)
kmeans(Yinyang(10), X, 3)

# One group with the size of the number of points
kmeans(Yinyang(auto = false), X, 3)
kmeans(Yinyang(false), X, 3)

# Chinese writing can be used
kmeans(阴阳(), X, 3)
MLJModelInterface.fit - Method
Fit the specified ParallelKMeans model constructed by the user.

See also the [package documentation](https://pydatablog.github.io/ParallelKMeans.jl/stable).
ParallelKMeans.chunk_colwise - Method
chunk_colwise!(target, x, y, i, weights, r, idx)

Utility function for the calculation of the weighted distance between the points of x and the centroid vector y[:, i]. The UnitRange argument r selects the subarray of the original design matrix x that is going to be processed.

ParallelKMeans.chunk_update_bounds - Method
chunk_update_bounds(alg::Hamerly, containers, r1, r2, pr1, pr2, metric::Euclidean, r, idx)

Updates the upper and lower bounds of each point's distance to the centers, with regard to the centers' movement, when the metric is Euclidean.

ParallelKMeans.chunk_update_bounds - Method
chunk_update_bounds(alg::Hamerly, containers, r1, r2, pr1, pr2, metric::Metric, r, idx)

Updates the upper and lower bounds of each point's distance to the centers, with regard to the centers' movement, for a general metric.

ParallelKMeans.chunk_update_centroids - Method
chunk_update_centroids(alg::Hamerly, containers, centroids, X, weights, metric, r, idx)

A detailed description of this function can be found in the original paper. It iterates through all points and tries to skip some calculations using the known upper and lower bounds of the distances from a point to the centers. If it fails to skip, it falls back to the generic point_all_centers! function.

ParallelKMeans.create_containers - Method
create_containers(::Lloyd, k, nrow, ncol, n_threads)

Internal function for the creation of all necessary intermediate structures.

  • centroids_new - container which holds new positions of centroids
  • centroids_cnt - container which holds number of points for each centroid
  • labels - vector which holds labels of corresponding points
ParallelKMeans.create_containers - Method
create_containers(::MiniBatch, k, nrow, ncol, n_threads)

Internal function for the creation of all necessary intermediate structures.

  • centroids_new - container which holds new positions of centroids
  • labels - vector which holds labels of corresponding points
  • sum_of_squares - vector which holds the sum of squares values for each thread
  • batch_rand_idx - vector which holds the selected batch indices
ParallelKMeans.distance - Method
distance(metric, X1, X2, i1, i2)

Allocation-free calculation of the distance between vectors X1[:, i1] and X2[:, i2], as defined by the supplied distance metric.

ParallelKMeans.distance - Method
distance(X1, X2, i1, i2)

Allocation-free calculation of the squared Euclidean distance between vectors X1[:, i1] and X2[:, i2].

ParallelKMeans.kmeans! - Function
kmeans!(alg::AbstractKMeansAlg, containers, design_matrix, k; n_threads = nthreads(), k_init="k-means++", max_iters=300, tol=1e-6, verbose=true)

Mutating version of the kmeans function. Definitions of the arguments and results can be found in kmeans.

The containers argument represents algorithm-specific containers, such as labels, intermediate centroids and so on, which are used during calculations.

ParallelKMeans.kmeans - Method
kmeans([alg::AbstractKMeansAlg,] design_matrix, k; n_threads = nthreads(),
k_init="k-means++", max_iters=300, tol=1e-6, verbose=true, rng = Random.GLOBAL_RNG)

This main function employs the K-means algorithm to cluster all examples in the training data (design_matrix) into k groups using either the k-means++ or random initialisation technique for selecting the initial centroids.

Before the specified maximum number of iterations (max_iters) is reached, convergence is achieved if the difference between the current and previous cost objective is less than the tolerance level (tol). An error is thrown if convergence fails.

Arguments:

  • alg defines one of the algorithms used to calculate k-means. This argument can be omitted; by default the Lloyd algorithm is used.
  • n_threads defines the number of threads used for calculations. By default it is equal to Threads.nthreads(), which is defined by the JULIA_NUM_THREADS environment variable. For small design matrices it makes sense to set this argument to 1 in order to avoid the overhead of thread generation.
  • k_init is one of the algorithms used for initialization. By default the k-means++ algorithm is used; alternatively one can use rand to choose random points for initialization.
  • max_iters is the maximum number of iterations.
  • tol defines the tolerance for early stopping.
  • verbose is the verbosity level. Details of operations can be either printed or not by setting verbose accordingly.

A KmeansResult structure representing labels, centroids, and sum_squares is returned.
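A minimal sketch exercising these keyword arguments (the design matrix and parameter values are illustrative only):

using ParallelKMeans, Random

X = rand(10, 5_000)   # hypothetical design matrix: 10 features, 5_000 points

r = kmeans(Hamerly(), X, 4;
           n_threads = 2,
           k_init = "k-means++",
           max_iters = 500,
           tol = 1e-8,
           verbose = false,
           rng = MersenneTwister(2020))

r.converged, r.iterations, r.totalcost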

ParallelKMeans.move_centers - Method
move_centers(::Hamerly, containers, centroids, metric)

Calculates the new positions of the centers and the distances they have moved. Results are stored in centroids and p, respectively.

ParallelKMeans.smart_init - Function
smart_init(X, k; init="k-means++")

This function handles the random initialisation of the centroids from the design matrix (X) and desired groups (k) that a user supplies.

The k-means++ algorithm is used by default; normal random selection of centroids from X is used if any other string is supplied.

A named tuple representing the centroids and their indices, respectively, is returned.

ParallelKMeans.sum_of_squares - Method
sum_of_squares(x, labels, centre, k)

This function computes the total sum of squares based on the assignments (labels), the design matrix (x), the centroids (centre), and the number of desired groups (k).

A Float type representing the computed metric is returned.

ParallelKMeans.@parallelize - Macro
@parallelize(n_threads, ncol, f)

Parallelizes the function f and runs it over n_threads threads. The function should satisfy the following conditions:

  1. It should not return any values.
  2. It should accept two parameters at the end of its argument list. The first accepted parameter is range, which defines the chunk used in calculations. The second parameter is idx, which defines the id of the container where results can be stored.

The ncol argument defines the range 1:ncol, which is sliced into n_threads chunks.
