by Russell Burdt

Using clustering algorithms to identify common patterns in unsupervised data can be a difficult concept to apply in practice, because there are no explicit methods to assess the accuracy of the results. An accuracy calculation requires predicted data labels and true data labels, and with unsupervised data there are no true data labels. From a Python and scikit-learn point of view, there is no `y_true` to use with the `sklearn.metrics.accuracy_score` function.

```
In [12]: sklearn.metrics.accuracy_score?
Signature: sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
```
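
As a minimal illustration of that signature, `accuracy_score` is just the fraction of predicted labels that match the true labels:
```
from sklearn.metrics import accuracy_score

# 3 of 4 predictions match the true labels -> 0.75
accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])
```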

This blog post briefly explores clustering algorithms applied to problems where there are true data labels, i.e. with class data in supervised learning problems. In these cases, results of the clustering algorithm can be compared directly to class data via the `sklearn.metrics.accuracy_score` function. Specifically, this blog post finds that:
- Accuracy of a clustering algorithm can exceed accuracy of a classification algorithm applied to the same supervised learning problem
- Clustering algorithms can be more accurate when the number of clusters exceeds the number of unique classes in the supervised dataset
- Clustering algorithms are susceptible to overfitting, especially `sklearn.cluster.MiniBatchKMeans`
Clustering algorithms are explained in the scikit-learn documentation by visualizing their application to artificial data in two dimensions. The techniques and results described in this blog post have helped me better understand how clustering algorithms apply to multi-dimensional, real-world datasets. Though not a topic of this blog post, others ([1] and [2]) have demonstrated clustering algorithms as a pre-processing step that improves the accuracy of a classification algorithm.

Datasets from the `sklearn.datasets` module and the UC Irvine Machine Learning Repository are used. The table below describes parameters of each dataset, and provides the accuracy of several classification algorithms applied to each dataset. The classification algorithms all use default hyperparameters at initialization, and `sklearn.model_selection.cross_val_score` with `cv=5` to get accuracy results.

The results in the table represent a baseline that is used for comparison to the accuracy of clustering algorithms applied to the same datasets later in this blog post. The datasets were chosen to represent a range of real-world, multi-dimensional datasets with varying classification accuracy.

The four rightmost columns give the 5-fold cross-validation accuracy with each default classifier.

Dataset | # of features | # of instances | # of unique classes (`n_classes`) | `sklearn.naive_bayes.GaussianNB` | `sklearn.ensemble.RandomForestClassifier` | `sklearn.neighbors.KNeighborsClassifier` | `sklearn.linear_model.LogisticRegression` |
---|---|---|---|---|---|---|---|
Banknote Authentication | 4 | 1372 | 2 | 83.8% | 99.1% | 100.0% | 98.8% |
Adult | 14 | 32561 | 2 | 79.5% | 84.8% | 77.7% | 79.2% |
Wireless Localization | 7 | 2000 | 4 | 98.1% | 97.0% | 97.7% | 96.7% |
Ionosphere | 34 | 351 | 2 | 86.6% | 93.2% | 82.6% | 85.5% |
Iris | 4 | 150 | 3 | 95.3% | 96.7% | 97.3% | 96.0% |

Data in the table above (for the Ionosphere dataset) were created using the code below. The code represents the steps to load the dataset and calculate the cross-validation accuracy for the list of scikit-learn classifier objects in the table. The `datasets` module used in the code was written to support a very small library of standard datasets that load into Python in a consistent structure, and is available on github.
```
import datasets  # custom module described above, available on github
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# load classification data
data = datasets.supervised_ionosphere()
X, y = data['X'].values, data['y'].values

# loop over a list of classifiers
classifiers = [GaussianNB, RandomForestClassifier, KNeighborsClassifier, LogisticRegression]
for classifier in classifiers:
    # initialize the algo, run the cross-validation function, print the results
    algo = classifier()
    accuracy = cross_val_score(algo, X, y, cv=5).mean()
    print('{} accuracy, {:.1f}%'.format(classifier.__name__, 100 * accuracy))
```

The `KMeans` clustering algorithm can be applied to each of the baseline datasets by initializing the algorithm with `n_clusters`, and then fitting and predicting on the training data of the supervised dataset.
```
from sklearn.cluster import KMeans

# fit the clustering algorithm and get a cluster index for each instance
model = KMeans(n_clusters=2)
ycluster = model.fit_predict(X)
```

At this point (with `ycluster` data created by the code above), results of the clustering algorithm can be compared to the class data of the supervised dataset. This comparison can get complicated, however, due to the number of ways the clustered data can be assigned to line up with the class data.

In the Ionosphere dataset, for example, the class data are strings of 'good' or 'bad', describing how features of radar data affect a measurement quality. Results of a clustering algorithm, on the other hand, never represent anything physical; they are cluster index assignments and nothing more. So the clustering results need to be translated to the units of the class data, and one way to do this translation is to pick the assignment that maximizes accuracy when the translated clustering results are compared directly to the class data. The next two sections describe exactly how to do that; otherwise skip to the results.

In the Ionosphere dataset, `n_classes = 2`, and if the number of clusters is the same (`n_clusters = 2`), then the possible cluster index assignments of 0 or 1 can line up with the unique class data of 'good' or 'bad' in only two ways: cluster 0 maps to 'good' and cluster 1 maps to 'bad', or the reverse. An example for the Ionosphere dataset where `n_clusters = 4` would include many more combinations, because the four cluster indices must be partitioned into two groups, one for each class. The `algorithm_u_permutations` function in the code below is used to get that list of combinations, and can be found on github. The rest of the code below calculates the clustering accuracy for each combination, and picks out the maximum value at the end.
```
import numpy as np
from sklearn.metrics import accuracy_score

# all ways to partition the cluster indices into n_classes groups
groups = algorithm_u_permutations(ns=np.unique(ycluster), m=np.unique(y).size)
group_accuracy = []
for group in groups:
    # translate cluster indices to class indices for this grouping
    # (assumes class data y are integer-encoded as 0 .. n_classes - 1)
    tmp = np.full(y.size, np.nan)
    for idx, items in enumerate(group):
        tmp[np.in1d(ycluster, items)] = idx
    group_accuracy.append(accuracy_score(y, tmp))
# the clustering accuracy is the best over all groupings
cluster_accuracy = max(group_accuracy)
```

Cases where `n_clusters = n_classes` and where `n_clusters = 2 * n_classes` were considered, in order to demonstrate that increasing the number of clusters improves clustering accuracy (an improvement that eventually flattens, consistent with the elbow method).

In some cases the clustering accuracy exceeds the accuracy of a classification algorithm applied to the same dataset, and this is without any kind of additional optimization. The way in which clustering accuracy increases with the number of clusters represents an insight into the raw data structure that only unsupervised learning methods can offer.

Accuracy of `sklearn.cluster.KMeans` for different `n_clusters`:

Dataset | # of unique classes (`n_classes`) | `n_clusters` (`= n_classes`) | clustering accuracy | `n_clusters` (`= 2 * n_classes`) | clustering accuracy |
---|---|---|---|---|---|
Banknote Authentication | 2 | 2 | 61.2% | 4 | 85.9% |
Adult | 2 | 2 | 62.0% | 4 | 73.6% |
Wireless Localization | 4 | 4 | 70.7% | 8 | 72.4% |
Ionosphere | 2 | 2 | 71.2% | 4 | 84.9% |
Iris | 3 | 3 | 89.3% | 6 | 90.7% |
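
A rough way to see the flattening mentioned above is the classic elbow curve. The sketch below (assuming `X` is already loaded as in the earlier code) sweeps `n_clusters` and prints the `KMeans` `inertia_` attribute, whose decrease flattens past the elbow:
```
from sklearn.cluster import KMeans

# sweep n_clusters and record inertia; the curve flattens past the "elbow"
for n in range(2, 11):
    model = KMeans(n_clusters=n).fit(X)
    print('n_clusters={}, inertia={:.0f}'.format(n, model.inertia_))
```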

The `sklearn.cluster.KMeans` algorithm produces different results when the initial random state of the algorithm is not controlled. These differences with initial random state can be thought of as a measure of the degree to which overfitting is occurring.

For example, the clustering accuracy of a `sklearn.cluster.KMeans` algorithm applied to the Ionosphere dataset, for the case of `n_clusters = 2 * n_classes`, is actually \(84.7 \pm 0.4\%\) on average when the algorithm is run 50 times, where the error represents \(\pm 1\) standard deviation. The variability in results captures the overfitting that occurs each time the algorithm is run from a different initial random state.
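
That experiment can be sketched as below, assuming `X` and `y` are loaded as before and the grouping code from the previous section is wrapped in a hypothetical helper `max_cluster_accuracy(y, ycluster)` (not a scikit-learn function):
```
import numpy as np
from sklearn.cluster import KMeans

# repeat the fit from 50 uncontrolled random states and collect accuracies
accuracies = []
for _ in range(50):
    ycluster = KMeans(n_clusters=4).fit_predict(X)
    accuracies.append(max_cluster_accuracy(y, ycluster))
print('{:.1f} +/- {:.1f}%'.format(100 * np.mean(accuracies), 100 * np.std(accuracies)))
```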
For the same Ionosphere dataset, and the same procedure except that `sklearn.cluster.MiniBatchKMeans` is used instead of `sklearn.cluster.KMeans`, the results become \(82.0 \pm 3.6\%\). Those results indicate the `MiniBatchKMeans` algorithm is more prone to overfitting than the `KMeans` algorithm. This result gave me a deeper insight into the difference between those algorithms than reading the scikit-learn documentation alone.
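
Reproducing that comparison only requires swapping the estimator in the sketch above:
```
from sklearn.cluster import MiniBatchKMeans

# identical procedure to the KMeans sketch above, different estimator
accuracies = [max_cluster_accuracy(y, MiniBatchKMeans(n_clusters=4).fit_predict(X))
              for _ in range(50)]
```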