K-Means Clustering Notebook Methodology
This post explains the K-Means clustering methodology used in my Colab notebook, which applies the same clustering process to three datasets: Iris, Wine, and WeatherAUS.
Methodology
The notebook follows a consistent unsupervised learning workflow. Each dataset is loaded from CSV, inspected briefly, and then converted into a numeric feature matrix suitable for K-Means. Labels and outcome columns are removed before clustering because K-Means should group records by feature similarity, not by being given the answer in advance.
The most important preprocessing step is standardisation. K-Means uses Euclidean distance, so variables with larger numeric ranges can dominate the result if they are left unscaled. The notebook therefore uses median imputation for missing numeric values and StandardScaler to put all selected features onto a comparable scale.
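The preprocessing described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual helper: the function name `build_scaled_matrix`, the toy column names, and the toy values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_scaled_matrix(df, feature_cols):
    """Median-impute missing values, then standardise each feature column.

    Hypothetical helper: the notebook's own function may differ.
    """
    preprocess = make_pipeline(
        SimpleImputer(strategy="median"),  # fill NaN with the column median
        StandardScaler(),                  # zero mean, unit variance per column
    )
    return preprocess.fit_transform(df[feature_cols])


# Toy data (made up for illustration): two features on very different scales.
toy = pd.DataFrame({
    "pressure_hpa": [1010.0, 1020.0, np.nan, 1000.0],
    "rain_mm": [0.0, 2.5, 10.0, np.nan],
})
X_scaled = build_scaled_matrix(toy, ["pressure_hpa", "rain_mm"])
print(X_scaled.round(2))
```

After this step both columns have mean 0 and unit variance, so neither dominates the Euclidean distance simply because of its units.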
For each dataset, different values of K are explored using:
- SSE / inertia, shown through elbow plots.
- Silhouette scores, used to assess how well separated the clusters are.
- PCA scatter plots, used to visualise high-dimensional clusters in two dimensions.
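The model uses k-means++ initialisation, repeated initialisation with n_init=12, a fixed random seed, and a maximum of 300 iterations. Repeating the initialisation and keeping the run with the lowest SSE reduces sensitivity to a poor starting point, and the fixed seed makes the results reproducible, while each run still reflects the iterative nature of K-Means.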
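A sweep over K with these settings might look like the sketch below. The seed value 42, the helper name `sweep_k`, and the synthetic `make_blobs` data are assumptions for illustration; the post states only that the seed is fixed, and the notebook works on the real datasets.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

SEED = 42  # assumed value; the post only says the seed is fixed


def sweep_k(X, k_values, seed=SEED):
    """Fit K-Means for each K and collect SSE (inertia) and silhouette score."""
    rows = []
    for k in k_values:
        model = KMeans(
            n_clusters=k,
            init="k-means++",
            n_init=12,       # keep the best of 12 initialisations
            max_iter=300,
            random_state=seed,
        )
        labels = model.fit_predict(X)
        rows.append({
            "k": k,
            "sse": model.inertia_,
            "silhouette": silhouette_score(X, labels),
        })
    return rows


# Synthetic stand-in data, not one of the notebook's datasets.
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=SEED)
X = StandardScaler().fit_transform(X_raw)

results = sweep_k(X, range(2, 7))
for row in results:
    print(f"K={row['k']}  SSE={row['sse']:.1f}  silhouette={row['silhouette']:.3f}")
```

The SSE values feed the elbow plot, while the silhouette scores give a second opinion that does not automatically improve as K grows.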
Iris and Wine
For the Iris dataset, the notebook removes the species label and clusters the four numeric flower measurements: sepal length, sepal width, petal length, and petal width. The required model uses K = 3, which matches the three known species. The clusters are then compared with the true species labels using crosstabs, Adjusted Rand Index, Normalised Mutual Information, and best mapped cluster-label accuracy.
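One way to compute these comparison metrics is sketched below. It assumes scikit-learn's bundled copy of Iris as a stand-in for the notebook's CSV, a seed of 42, and a hypothetical helper `best_mapped_accuracy` that uses the Hungarian algorithm to find the one-to-one cluster-to-class mapping with the most matches. The notebook may compute the mapping differently.

```python
import pandas as pd
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.preprocessing import StandardScaler


def best_mapped_accuracy(y_true, cluster_labels):
    """Accuracy under the one-to-one cluster-to-class mapping with most matches."""
    table = pd.crosstab(
        pd.Series(y_true, name="class"),
        pd.Series(cluster_labels, name="cluster"),
    ).to_numpy()
    # Hungarian algorithm: pick one cluster per class to maximise agreement.
    class_idx, cluster_idx = linear_sum_assignment(table, maximize=True)
    return table[class_idx, cluster_idx].sum() / len(y_true)


iris = load_iris()  # stand-in for the notebook's CSV load
X = StandardScaler().fit_transform(iris.data)
clusters = KMeans(
    n_clusters=3, init="k-means++", n_init=12, max_iter=300, random_state=42
).fit_predict(X)

species = iris.target_names[iris.target]
print(pd.crosstab(species, clusters, rownames=["species"], colnames=["cluster"]))

ari = adjusted_rand_score(iris.target, clusters)
nmi = normalized_mutual_info_score(iris.target, clusters)
acc = best_mapped_accuracy(iris.target, clusters)
print(f"ARI={ari:.3f}  NMI={nmi:.3f}  best-mapped accuracy={acc:.3f}")
```

The mapping step matters because cluster IDs are arbitrary: cluster 0 has no inherent connection to class 0, so raw accuracy would be meaningless without it.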
The result shows that K-Means separates setosa very clearly, but there is some overlap between versicolor and virginica. This is a useful learning point: the algorithm can identify structure, but it does not understand botanical categories. It only uses distances between numeric measurements.
For the Wine dataset, the notebook removes the Wine class label and clusters thirteen chemical measurements. Again, K = 3 is used so the unsupervised clusters can be compared with the three known wine classes. The Wine result is stronger than Iris, with a best mapped cluster-label accuracy of 0.966. This suggests that the chemical variables contain a clearer grouping structure for K-Means.
WeatherAUS
The WeatherAUS dataset is more challenging. It contains numeric weather measurements, categorical fields, dates, locations, and rain labels. To keep the method appropriate for K-Means, the notebook removes RainTomorrow, removes the categorical RainToday field, and uses twelve numeric weather variables including temperature, rainfall, wind speed, humidity, pressure, and time-of-day temperature readings.
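The feature selection step could be sketched as below. The post does not list the twelve variables, so `FEATURE_COLS` here is a hypothetical three-column subset, and the tiny frame is made-up data that only mimics the mix of column types; `select_features` is likewise an illustrative helper, not the notebook's.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame mimicking the mix of column types described above.
weather = pd.DataFrame({
    "Date": ["2010-01-01", "2010-01-02", "2010-01-03"],
    "Location": ["Sydney", "Perth", "Darwin"],
    "MinTemp": [18.2, np.nan, 25.1],
    "Rainfall": [0.0, 4.2, 12.6],
    "Humidity3pm": [55.0, 40.0, np.nan],
    "RainToday": ["No", "Yes", "Yes"],
    "RainTomorrow": ["Yes", "No", "Yes"],
})

# Hypothetical subset for illustration; not the notebook's twelve variables.
FEATURE_COLS = ["MinTemp", "Rainfall", "Humidity3pm"]
EXCLUDED = ["RainTomorrow", "RainToday"]


def select_features(df, feature_cols, excluded):
    """Return an explicit numeric feature frame, refusing label-like columns."""
    overlap = set(feature_cols) & set(excluded)
    if overlap:
        raise ValueError(f"Excluded columns requested as features: {sorted(overlap)}")
    non_numeric = [c for c in feature_cols
                   if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        raise TypeError(f"Non-numeric feature columns: {non_numeric}")
    return df[feature_cols].copy()


features = select_features(weather, FEATURE_COLS, EXCLUDED)
print(features.dtypes)
```

Keeping the feature list explicit, rather than selecting every numeric column automatically, makes it harder for a label or an ID column to slip into the distance calculation unnoticed.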
The notebook tests K = 2 to K = 6, then visualises each result using PCA. A K = 3 model is inspected in more detail. The cluster summaries suggest broad weather profiles, such as warmer conditions, wetter and windier conditions, and cooler higher-pressure conditions. The optional comparison with RainTomorrow is useful, but it should not be treated as a supervised prediction task. K-Means has not been trained to predict rain; it has only grouped records by similarity.
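The inspection of a K = 3 model can be illustrated with the sketch below. Everything in it is synthetic: the three columns, the random values, and the random RainTomorrow labels are assumptions that stand in for the real WeatherAUS data, so the printed numbers carry no meaning beyond showing the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for a few numeric weather variables (not real data).
df = pd.DataFrame({
    "MaxTemp": rng.normal(23, 7, n),
    "Rainfall": rng.exponential(2.0, n),
    "Pressure3pm": rng.normal(1015, 7, n),
})
# Synthetic label, held out of clustering and used only for description.
rain_tomorrow = pd.Series(
    rng.choice(["No", "Yes"], size=n, p=[0.8, 0.2]), name="RainTomorrow"
)

X = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(
    n_clusters=3, init="k-means++", n_init=12, max_iter=300, random_state=42
).fit_predict(X)

# Cluster profiles in original units, which are easier to read than z-scores.
profiles = df.groupby("cluster").mean().round(2)
print(profiles)

# Two principal components of the scaled features, for a 2D scatter plot.
coords = PCA(n_components=2, random_state=42).fit_transform(X)

# Descriptive comparison only: share of each label within each cluster.
rain_share = pd.crosstab(df["cluster"], rain_tomorrow, normalize="index")
print(rain_share.round(3))
```

Reading the crosstab row by row shows whether some clusters happen to contain more rainy next days, which is a description of the groups and not a prediction.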
Learning reflection
This activity highlights the applicability and limitations of different datasets for machine learning. Iris and Wine are clean, compact datasets with known labels, so they are useful for testing whether unsupervised clusters align with expected classes. WeatherAUS is more realistic because it includes missing values, mixed data types, and a much larger number of observations. This makes preprocessing decisions more important and also makes the clusters harder to explain.
On limitations, clustering can create categories that look authoritative even when the groupings are only mathematical. If this method were used on people, such as customers, patients, students, or employees, the selected features, value of K, missing-data treatment, and interpretation of clusters would need to be documented carefully. Poorly explained clusters could lead to unfair segmentation, biased decisions, or misleading claims about groups.
Professionally, the notebook also reflects a development-team mindset. The workflow is structured into reusable helper functions for scaling, plotting metrics, visualising PCA results, and comparing labels. This makes the analysis easier for another team member to review, rerun, or extend in a virtual environment. Reproducibility is supported through fixed random seeds, explicit feature lists, and clear separation between preprocessing, modelling, visualisation, and interpretation.
Summary
The notebook demonstrates that K-Means is a practical method for exploratory clustering when the data is numeric, scaled, and suitable for distance-based grouping. It works especially well when clusters are compact and well separated, as seen in parts of the Iris and Wine analysis. However, the WeatherAUS task shows that real-world data requires more careful preprocessing and more cautious interpretation.
