by Russell Burdt

Using clustering algorithms to identify common patterns in unsupervised data can be a difficult concept to apply in practice, because there are no explicit methods to assess the accuracy of the results. An accuracy calculation requires predicted data labels and true data labels, and with unsupervised data there are no true data labels. From a Python and scikit-learn point of view, there is no `y_true` to use with the `sklearn.metrics.accuracy_score` function.

```
In [12]: sklearn.metrics.accuracy_score?
Signature: sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
```
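
As a minimal illustration of that signature, `accuracy_score` is just the fraction of predicted labels that match the true labels:
```
from sklearn.metrics import accuracy_score

# 3 of 4 predictions match the true labels -> 0.75
accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])
```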

This blog post briefly explores clustering algorithms applied to problems where there are true data labels, i.e. with class data in supervised learning problems. In these cases, results of the clustering algorithm can be compared directly to class data via the `sklearn.metrics.accuracy_score` function. Specifically, this blog post finds that:
- Accuracy of a clustering algorithm can exceed accuracy of a classification algorithm applied to the same supervised learning problem
- Clustering algorithms can be more accurate when the number of clusters exceeds the number of unique classes in the supervised dataset
- Clustering algorithms are susceptible to overfitting, especially `sklearn.cluster.MiniBatchKMeans`
Clustering algorithms are explained in the scikit-learn documentation by visualizing their application to artificial data in two dimensions. The techniques and results described in this blog post have helped me better understand how clustering algorithms apply to multi-dimensional, real-world datasets. Though not a topic of this blog post, others ([1] and [2]) have demonstrated clustering algorithms as a pre-processing step that improves the accuracy of a classification algorithm.

Datasets from the `sklearn.datasets` module and the UC Irvine Machine Learning Repository are used. The table below describes parameters of each dataset, and provides the accuracy of several classification algorithms applied to each dataset. The classification algorithms all use default hyperparameters at initialization, and `sklearn.model_selection.cross_val_score` with `cv=5` to get accuracy results.

The results in the table represent a baseline that is used for comparison to the accuracy of clustering algorithms applied to the same datasets later in this blog post. The datasets were chosen to represent a range of real-world, multi-dimensional datasets with varying classification accuracy.

The four rightmost columns give the 5-fold cross-validation accuracy with each default classifier.

Dataset | # of features | # of instances | # of unique classes (`n_classes`) | `sklearn.naive_bayes.GaussianNB` | `sklearn.ensemble.RandomForestClassifier` | `sklearn.neighbors.KNeighborsClassifier` | `sklearn.linear_model.LogisticRegression` |
---|---|---|---|---|---|---|---|
Banknote Authentication | 4 | 1372 | 2 | 83.8% | 99.1% | 100.0% | 98.8% |
Adult | 14 | 32561 | 2 | 79.5% | 84.8% | 77.7% | 79.2% |
Wireless Localization | 7 | 2000 | 4 | 98.1% | 97.0% | 97.7% | 96.7% |
Ionosphere | 34 | 351 | 2 | 86.6% | 93.2% | 82.6% | 85.5% |
Iris | 4 | 150 | 3 | 95.3% | 96.7% | 97.3% | 96.0% |

Data in the table above (for the Ionosphere dataset) were created using the code below. The code represents the steps to load the dataset and calculate the cross-validation accuracy for the list of scikit-learn classifier objects in the table. The `datasets` module used in the code was written to support a very small library of standard datasets that load into Python in a consistent structure, and is available on github.
```
import datasets  # custom module described above, available on github
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# load classification data
data = datasets.supervised_ionosphere()
X, y = data['X'].values, data['y'].values

# loop over a list of classifiers
classifiers = [GaussianNB, RandomForestClassifier, KNeighborsClassifier, LogisticRegression]
for classifier in classifiers:
    # initialize the algo, run the cross-validation function, print the results
    algo = classifier()
    accuracy = cross_val_score(algo, X, y, cv=5).mean()
    print('{} accuracy, {:.1f}%'.format(classifier.__name__, 100 * accuracy))
```

The `KMeans` clustering algorithm can be applied to each of the baseline datasets by initializing the algorithm with `n_clusters`, and then fitting and predicting on the training data of the supervised dataset.
```
from sklearn.cluster import KMeans

# fit the clustering algorithm and get a cluster index for each instance
model = KMeans(n_clusters=2)
ycluster = model.fit_predict(X)
```

At this point (with `ycluster` data created by the code above), results of the clustering algorithm can be compared to the class data of the supervised dataset. This comparison can get complicated, however, due to the number of ways the clustered data can be assigned to line up with the class data.

In the Ionosphere dataset, for example, the class data are strings of 'good' or 'bad', describing how features of radar data affect a measurement quality. Results of a clustering algorithm, on the other hand, never represent anything physical; they are cluster index assignments and nothing more. So the clustering results need to be translated to the units of the class data, and one way to do this translation is to pick the assignment that maximizes accuracy when the translated clustering results are compared directly to the class data. The next two sections describe exactly how to do that; otherwise skip to the results.

In the Ionosphere dataset, `n_classes = 2`, and if the number of clusters is the same (`n_clusters = 2`), then the possible cluster index assignments of 0 or 1 can line up with the unique class data of 'good' or 'bad' in only two ways: cluster 0 maps to 'good' and cluster 1 maps to 'bad', or the reverse. An example for the Ionosphere dataset where `n_clusters = 4` would include many more combinations, because the four cluster indices must be partitioned into two groups, one for each class. The `algorithm_u_permutations` function in the code below is used to get that list of combinations, and can be found on github. The rest of the code below calculates the clustering accuracy for each combination, and picks out the maximum value at the end.
```
import numpy as np
from sklearn.metrics import accuracy_score

# all ways to partition the cluster indices into n_classes groups
groups = algorithm_u_permutations(ns=np.unique(ycluster), m=np.unique(y).size)
group_accuracy = []
for group in groups:
    # translate cluster indices to class indices for this grouping
    # (assumes class data y are integer-encoded as 0 .. n_classes - 1)
    tmp = np.full(y.size, np.nan)
    for idx, items in enumerate(group):
        tmp[np.in1d(ycluster, items)] = idx
    group_accuracy.append(accuracy_score(y, tmp))
# the clustering accuracy is the best over all groupings
cluster_accuracy = max(group_accuracy)
```

Cases where `n_clusters = n_classes` and where `n_clusters = 2 * n_classes` were considered, in order to demonstrate that increasing the number of clusters improves clustering accuracy (an improvement that eventually flattens, consistent with the elbow method).

In some cases the clustering accuracy exceeds the accuracy of a classification algorithm applied to the same dataset, and this is without any kind of additional optimization. The way in which clustering accuracy increases with the number of clusters represents an insight into the raw data structure that only unsupervised learning methods can offer.

Accuracy of `sklearn.cluster.KMeans` for different `n_clusters`:

Dataset | # of unique classes (`n_classes`) | `n_clusters` (`= n_classes`) | clustering accuracy | `n_clusters` (`= 2 * n_classes`) | clustering accuracy |
---|---|---|---|---|---|
Banknote Authentication | 2 | 2 | 61.2% | 4 | 85.9% |
Adult | 2 | 2 | 62.0% | 4 | 73.6% |
Wireless Localization | 4 | 4 | 70.7% | 8 | 72.4% |
Ionosphere | 2 | 2 | 71.2% | 4 | 84.9% |
Iris | 3 | 3 | 89.3% | 6 | 90.7% |
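
A rough way to see the flattening mentioned above is the classic elbow curve. The sketch below (assuming `X` is already loaded as in the earlier code) sweeps `n_clusters` and prints the `KMeans` `inertia_` attribute, whose decrease flattens past the elbow:
```
from sklearn.cluster import KMeans

# sweep n_clusters and record inertia; the curve flattens past the "elbow"
for n in range(2, 11):
    model = KMeans(n_clusters=n).fit(X)
    print('n_clusters={}, inertia={:.0f}'.format(n, model.inertia_))
```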

The `sklearn.cluster.KMeans` algorithm produces different results when the initial random state of the algorithm is not controlled. These differences with initial random state can be thought of as a measure of the degree to which overfitting is occurring.

For example, the clustering accuracy of a `sklearn.cluster.KMeans` algorithm applied to the Ionosphere dataset, for the case of `n_clusters = 2 * n_classes`, is actually \(84.7 \pm 0.4\%\) on average when the algorithm is run 50 times, where the error represents \(\pm 1\) standard deviation. The variability in results captures the overfitting that occurs each time the algorithm is run from a different initial random state.
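
That experiment can be sketched as below, assuming `X` and `y` are loaded as before and the grouping code from the previous section is wrapped in a hypothetical helper `max_cluster_accuracy(y, ycluster)` (not a scikit-learn function):
```
import numpy as np
from sklearn.cluster import KMeans

# repeat the fit from 50 uncontrolled random states and collect accuracies
accuracies = []
for _ in range(50):
    ycluster = KMeans(n_clusters=4).fit_predict(X)
    accuracies.append(max_cluster_accuracy(y, ycluster))
print('{:.1f} +/- {:.1f}%'.format(100 * np.mean(accuracies), 100 * np.std(accuracies)))
```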
For the same Ionosphere dataset, and the same procedure except that `sklearn.cluster.MiniBatchKMeans` is used instead of `sklearn.cluster.KMeans`, the results become \(82.0 \pm 3.6\%\). Those results indicate the `MiniBatchKMeans` algorithm is more prone to overfitting than the `KMeans` algorithm. This result gave me a deeper insight into the difference between those algorithms than reading the scikit-learn documentation alone.
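
Reproducing that comparison only requires swapping the estimator in the sketch above:
```
from sklearn.cluster import MiniBatchKMeans

# identical procedure to the KMeans sketch above, different estimator
accuracies = [max_cluster_accuracy(y, MiniBatchKMeans(n_clusters=4).fit_predict(X))
              for _ in range(50)]
```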