by Russell Burdt

Using clustering algorithms to identify common patterns in unsupervised data can be a difficult concept to apply in practice, because there are no explicit methods to assess accuracy of the results. An accuracy calculation requires predicted data labels and true data labels, and with unsupervised data there are no true data labels. So this blog post explores clustering algorithms applied to supervised datasets (where there are true data labels) and specifically finds that

- Accuracy of a clustering algorithm can exceed accuracy of a classification algorithm applied to the same supervised data
- Clustering algorithms can be more accurate when the number of clusters exceeds the number of unique classes in supervised data
- Clustering algorithms are also susceptible to overfitting

Clustering algorithms are explained in the scikit-learn documentation by visualizing their application to artificial data in two dimensions, which may not provide sufficient insight to apply a clustering algorithm in practice on multi-dimensional real-world data. This blog post attempts to bridge that gap by exploring the accuracy of clustering algorithms applied to standard real-world datasets. Though not a topic of this blog post, others [1, 2] have demonstrated clustering algorithms as a pre-processing step for supervised learning.

Five standard supervised datasets from the `sklearn.datasets` module and the UC Irvine Machine Learning Repository are used as baselines. The table below describes parameters of each dataset, and provides the cross-validation accuracies (min — max range) of several classification algorithms applied to each dataset. The classification algorithms all use default hyperparameters at initialization, and `sklearn.model_selection.cross_val_score` with `cv=4` to get accuracy results (see the code below the table to create data for one classification algorithm applied to one dataset).
The last three columns report 4-fold cross-validation accuracy as a min — max percentage.

Dataset | # of features | # of instances | # of unique classes (`n_classes`) | `sklearn.naive_bayes.` | `sklearn.ensemble.` | `sklearn.linear_model.`
---|---|---|---|---|---|---
Banknote Authentication | 4 | 1372 | 2 | 79.9 — 85.4 | 98.8 — 99.7 | 98.3 — 99.4
Adult | 14 | 32561 | 2 | 79.3 — 80.0 | 84.6 — 85.2 | 79.1 — 80.9
Wireless Localization | 7 | 2000 | 4 | 96.6 — 99.0 | 96.0 — 98.6 | 96.2 — 97.4
Ionosphere | 34 | 351 | 2 | 83.9 — 91.0 | 84.1 — 95.4 | 76.1 — 90.8
Iris | 4 | 150 | 3 | 91.7 — 100.0 | 94.4 — 100.0 | 86.1 — 100.0

```
# datasets is a Python module on github:
# https://github.com/russellburdt/pyrb/blob/master/datasets.py
import datasets
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
# load classification data, ionosphere data from:
# https://archive.ics.uci.edu/ml/datasets/Ionosphere
data = datasets.supervised_ionosphere()
X, y = data['X'].values, data['y'].values
# initialize and run the model
model = GaussianNB()
acc = cross_val_score(model, X, y, cv=4, scoring='accuracy')
```
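The same pattern runs end-to-end without the external Ionosphere download; a minimal self-contained sketch, substituting the Iris data bundled with scikit-learn:

```python
# self-contained variant of the snippet above, using the bundled Iris
# data in place of the external Ionosphere download
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB()
acc = cross_val_score(model, X, y, cv=4, scoring='accuracy')
# report the min -- max range over the 4 folds, as in the table
print(f'{100 * acc.min():.1f} -- {100 * acc.max():.1f}')
```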

The code below manually cross-validates the `sklearn.cluster.KMeans` algorithm applied to the Ionosphere dataset, with accuracy for each fold measured as the best-case accuracy over all alignments of cluster indices to class labels.
```
# datasets is a Python module on github:
# https://github.com/russellburdt/pyrb/blob/master/datasets.py
import datasets
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
# load classification data, ionosphere data from:
# https://archive.ics.uci.edu/ml/datasets/Ionosphere
data = datasets.supervised_ionosphere()
X, y = data['X'].values, data['y'].values
# initialize clustering algorithm
kmeans = KMeans(n_clusters=np.unique(y).size)
# manually run cross-validation
kfold = KFold(n_splits=4, shuffle=True)
acc = []
for idx_train, idx_test in kfold.split(X):
    kmeans.fit(X[idx_train, :])
    # np.bool is removed in modern numpy; use the builtin bool
    y_pred = kmeans.predict(X[idx_test, :]).astype(bool)
    y_true = y[idx_test].astype(bool)
    # append best case accuracy over both alignments of the two cluster indices
    acc.append(max(accuracy_score(y_true, y_pred), accuracy_score(y_true, ~y_pred)))
```

Measuring best-case accuracy requires considering every possible alignment of cluster indices to class labels. For example, 2 unique classes with `n_clusters = 3` would include every way of splitting the 3 cluster indices between the 2 classes. The `knuth` module includes an `algorithm_u_permutations` function that can be used to create a generator for those possibilities.
```
# knuth is a Python module on github:
# https://github.com/russellburdt/pyrb/blob/master/knuth.py
from knuth import algorithm_u_permutations
for x in algorithm_u_permutations([0, 1, 2], 2):
    print(x)
# ([0, 1], [2])
# ([2], [0, 1])
# ([0], [1, 2])
# ([1, 2], [0])
# ([0, 2], [1])
# ([1], [0, 2])
```
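The same best-case search can be written without the external `knuth` module; a minimal sketch that enumerates every mapping of cluster index to class label with `itertools.product` (a superset of the partitions above, since it also allows mappings that leave a class empty), fit here on the Iris data bundled with scikit-learn:

```python
# best-case clustering accuracy by brute force: score every possible
# assignment of cluster index -> class label and keep the maximum
from itertools import product

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

best = 0.0
for mapping in product(classes, repeat=3):
    # mapping[c] is the class label assigned to cluster index c
    y_pred = np.array([mapping[c] for c in labels])
    best = max(best, accuracy_score(y, y_pred))
print(best)
```

Brute force is fine at this scale (`n_classes ** n_clusters` mappings); for many clusters the partition generator above prunes the search space.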

The table below extends the classification results with the `sklearn.cluster.KMeans` algorithm for cases where the number of clusters is set to the number of unique classes and to 2x the number of unique classes.
The last five columns report 4-fold cross-validation accuracy as a min — max percentage.

Dataset | # of unique classes (`n_classes`) | `sklearn.naive_bayes.` | `sklearn.ensemble.` | `sklearn.linear_model.` | `KMeans`, `n_clusters = n_classes` | `KMeans`, `n_clusters = 2 * n_classes`
---|---|---|---|---|---|---
Banknote Authentication | 2 | 79.9 — 85.4 | 98.8 — 99.7 | 98.3 — 99.4 | 59.5 — 61.8 | 85.4 — 86.6
Adult | 2 | 79.3 — 80.0 | 84.6 — 85.2 | 79.1 — 80.9 | 61.4 — 62.5 | 73.0 — 74.4
Wireless Localization | 4 | 96.6 — 99.0 | 96.0 — 98.6 | 96.2 — 97.4 | 68.2 — 73.4 | 72.0 — 73.2
Ionosphere | 2 | 83.9 — 91.0 | 84.1 — 95.4 | 76.1 — 90.8 | 64.8 — 79.5 | 80.4 — 89.7
Iris | 3 | 91.7 — 100.0 | 94.4 — 100.0 | 86.1 — 100.0 | 84.2 — 91.9 | 86.5 — 94.7
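The trend in the last two columns can be checked on the Iris data bundled with scikit-learn; a simplified sketch (fitting on the full dataset rather than cross-validating, and measuring best-case accuracy by brute force over every cluster-to-class mapping — whether doubling `n_clusters` helps here depends on the k-means solution found):

```python
# compare best-case KMeans accuracy at n_clusters = n_classes vs
# n_clusters = 2 * n_classes on the Iris data
from itertools import product

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

def best_case_accuracy(n_clusters):
    # fit on the full dataset, then score every assignment of
    # cluster index -> class label and keep the maximum
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X).labels_
    return max(
        accuracy_score(y, np.array([m[c] for c in labels]))
        for m in product(classes, repeat=n_clusters))

acc3 = best_case_accuracy(3)   # n_clusters = n_classes
acc6 = best_case_accuracy(6)   # n_clusters = 2 * n_classes
print(acc3, acc6)
```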