by Russell Burdt

Using clustering algorithms to identify common patterns in unsupervised data can be a difficult concept to apply in practice, because there are no explicit methods to assess accuracy of the results. An accuracy calculation requires predicted data labels and true data labels, and with unsupervised data there are no true data labels. So this blog post explores clustering algorithms applied to supervised datasets (where there are true data labels) and specifically finds that

- Accuracy of a clustering algorithm can exceed accuracy of a classification algorithm applied to the same supervised data
- Clustering algorithms can be more accurate when the number of clusters exceeds the number of unique classes in supervised data
- Clustering algorithms are also susceptible to overfitting

Clustering algorithms are explained in the scikit-learn documentation by visualizing their application to artificial data in two dimensions, which may not provide sufficient insight to apply a clustering algorithm in practice on multi-dimensional real-world data. This blog post attempts to bridge that gap by exploring the accuracy of clustering algorithms applied to standard real-world datasets. Though not a topic of this blog post, others [1, 2] have demonstrated clustering algorithms as a pre-processing step for supervised learning.

Five standard supervised datasets from the `sklearn.datasets` module and the UC Irvine Machine Learning Repository are used as baselines. The table below describes parameters of each dataset, and provides the cross-validation accuracies (min — max range) of several classification algorithms applied to each dataset. The classification algorithms all use default hyperparameters at initialization, and `sklearn.model_selection.cross_val_score` with `cv=4` to get accuracy results (see the code below the table to create data for one classification algorithm applied to one dataset).
The last three columns report 4-fold cross-validation accuracy as a min — max percentage.

Dataset | # of features | # of instances | # of unique classes (`n_classes`) | `sklearn.naive_bayes.` | `sklearn.ensemble.` | `sklearn.linear_model.`
---|---|---|---|---|---|---
Banknote Authentication | 4 | 1372 | 2 | 79.9 — 85.4 | 98.8 — 99.7 | 98.3 — 99.4
Adult | 14 | 32561 | 2 | 79.3 — 80.0 | 84.6 — 85.2 | 79.1 — 80.9
Wireless Localization | 7 | 2000 | 4 | 96.6 — 99.0 | 96.0 — 98.6 | 96.2 — 97.4
Ionosphere | 34 | 351 | 2 | 83.9 — 91.0 | 84.1 — 95.4 | 76.1 — 90.8
Iris | 4 | 150 | 3 | 91.7 — 100.0 | 94.4 — 100.0 | 86.1 — 100.0

```
# datasets is a Python module on github:
# https://github.com/russellburdt/pyrb/blob/master/datasets.py
import datasets
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
# load classification data, ionosphere data from:
# https://archive.ics.uci.edu/ml/datasets/Ionosphere
data = datasets.supervised_ionosphere()
X, y = data['X'].values, data['y'].values
# initialize and run the model
model = GaussianNB()
acc = cross_val_score(model, X, y, cv=4, scoring='accuracy')
```
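The same pattern runs end-to-end without the external Ionosphere download; a minimal self-contained sketch, substituting the Iris data bundled with scikit-learn:

```python
# self-contained variant of the snippet above, using the bundled Iris
# data in place of the external Ionosphere download
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB()
acc = cross_val_score(model, X, y, cv=4, scoring='accuracy')
# report the min -- max range over the 4 folds, as in the table
print(f'{100 * acc.min():.1f} -- {100 * acc.max():.1f}')
```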

The code below manually cross-validates the `sklearn.cluster.KMeans` algorithm applied to the Ionosphere dataset, with accuracy for each fold measured as the best-case accuracy over all alignments of cluster indices to class labels.
```
# datasets is a Python module on github:
# https://github.com/russellburdt/pyrb/blob/master/datasets.py
import datasets
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
# load classification data, ionosphere data from:
# https://archive.ics.uci.edu/ml/datasets/Ionosphere
data = datasets.supervised_ionosphere()
X, y = data['X'].values, data['y'].values
# initialize clustering algorithm
kmeans = KMeans(n_clusters=np.unique(y).size)
# manually run cross-validation
kfold = KFold(n_splits=4, shuffle=True)
acc = []
for idx_train, idx_test in kfold.split(X):
    kmeans.fit(X[idx_train, :])
    # np.bool is removed in modern numpy; use the builtin bool
    y_pred = kmeans.predict(X[idx_test, :]).astype(bool)
    y_true = y[idx_test].astype(bool)
    # append best case accuracy over both alignments of the two cluster indices
    acc.append(max(accuracy_score(y_true, y_pred), accuracy_score(y_true, ~y_pred)))
```

Measuring best-case accuracy requires considering every possible alignment of cluster indices to class labels. For example, 2 unique classes with `n_clusters = 3` would include every way of splitting the 3 cluster indices between the 2 classes. The `knuth` module includes an `algorithm_u_permutations` function that can be used to create a generator for those possibilities.
```
# knuth is a Python module on github:
# https://github.com/russellburdt/pyrb/blob/master/knuth.py
from knuth import algorithm_u_permutations
for x in algorithm_u_permutations([0, 1, 2], 2):
    print(x)
# ([0, 1], [2])
# ([2], [0, 1])
# ([0], [1, 2])
# ([1, 2], [0])
# ([0, 2], [1])
# ([1], [0, 2])
```
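The same best-case search can be written without the external `knuth` module; a minimal sketch that enumerates every mapping of cluster index to class label with `itertools.product` (a superset of the partitions above, since it also allows mappings that leave a class empty), fit here on the Iris data bundled with scikit-learn:

```python
# best-case clustering accuracy by brute force: score every possible
# assignment of cluster index -> class label and keep the maximum
from itertools import product

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

best = 0.0
for mapping in product(classes, repeat=3):
    # mapping[c] is the class label assigned to cluster index c
    y_pred = np.array([mapping[c] for c in labels])
    best = max(best, accuracy_score(y, y_pred))
print(best)
```

Brute force is fine at this scale (`n_classes ** n_clusters` mappings); for many clusters the partition generator above prunes the search space.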

The table below extends the classification results with the `sklearn.cluster.KMeans` algorithm for cases where the number of clusters is set to the number of unique classes and to 2x the number of unique classes.
The last five columns report 4-fold cross-validation accuracy as a min — max percentage.

Dataset | # of unique classes (`n_classes`) | `sklearn.naive_bayes.` | `sklearn.ensemble.` | `sklearn.linear_model.` | `KMeans`, `n_clusters = n_classes` | `KMeans`, `n_clusters = 2 * n_classes`
---|---|---|---|---|---|---
Banknote Authentication | 2 | 79.9 — 85.4 | 98.8 — 99.7 | 98.3 — 99.4 | 59.5 — 61.8 | 85.4 — 86.6
Adult | 2 | 79.3 — 80.0 | 84.6 — 85.2 | 79.1 — 80.9 | 61.4 — 62.5 | 73.0 — 74.4
Wireless Localization | 4 | 96.6 — 99.0 | 96.0 — 98.6 | 96.2 — 97.4 | 68.2 — 73.4 | 72.0 — 73.2
Ionosphere | 2 | 83.9 — 91.0 | 84.1 — 95.4 | 76.1 — 90.8 | 64.8 — 79.5 | 80.4 — 89.7
Iris | 3 | 91.7 — 100.0 | 94.4 — 100.0 | 86.1 — 100.0 | 84.2 — 91.9 | 86.5 — 94.7
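The trend in the last two columns can be checked on the Iris data bundled with scikit-learn; a simplified sketch (fitting on the full dataset rather than cross-validating, and measuring best-case accuracy by brute force over every cluster-to-class mapping — whether doubling `n_clusters` helps here depends on the k-means solution found):

```python
# compare best-case KMeans accuracy at n_clusters = n_classes vs
# n_clusters = 2 * n_classes on the Iris data
from itertools import product

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

def best_case_accuracy(n_clusters):
    # fit on the full dataset, then score every assignment of
    # cluster index -> class label and keep the maximum
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X).labels_
    return max(
        accuracy_score(y, np.array([m[c] for c in labels]))
        for m in product(classes, repeat=n_clusters))

acc3 = best_case_accuracy(3)   # n_clusters = n_classes
acc6 = best_case_accuracy(6)   # n_clusters = 2 * n_classes
print(acc3, acc6)
```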