
[🇧🇷 Português] [🇺🇸 English]





Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester, 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva (Doctor in Mathematics)
















🎶 Prelude Suite no. 1 (J. S. Bach) - Sound Design Remix

[Video: Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4]

📺 For better resolution, watch the video on YouTube.



Tip

This repository is a review of the Statistics course from the undergraduate program Humanistic AI and Data Science at PUC-SP.

Access Data Mining Main Repository







K-Means Clustering Repository Presentation

This repository contains the full presentation and step-by-step application of the K-Means clustering algorithm. The goal is to demonstrate the process from data preprocessing, through model evaluation, to the final conclusion about the optimal number of clusters, based on an included PDF presentation. This provides a comprehensive, practical example of unsupervised clustering for educational and analytical purposes.



K-Means is a popular unsupervised machine learning algorithm used for clustering data. Its primary purpose is to partition a dataset into a pre-specified number of distinct, non-overlapping groups called "clusters." The "K" in K-Means refers to the number of clusters the user wants to identify.

The algorithm works by grouping data points that are similar to each other based on a distance metric, usually Euclidean distance. Each cluster is represented by its centroid, which is the mean position of all points within that cluster. K-Means iteratively adjusts the centroids and reassigns points to clusters until the clusters are stable or a set number of iterations is reached.
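To make that loop concrete, here is a minimal NumPy sketch of a single assignment-and-update iteration (illustrative only; variable names are assumptions, and the actual analysis in this repository uses scikit-learn):

import numpy as np

def kmeans_step(X, centroids):
    # Assign each point to its nearest centroid (Euclidean distance)...
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # ...then move each centroid to the mean of its assigned points.
    # (Assumes no cluster ends up empty; real implementations handle that case.)
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids

Repeating this step until the assignments stop changing is exactly the stabilization criterion described above.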



K-Means is an example of a "hard" clustering algorithm because each data point belongs to exactly one cluster. It is an iterative centroid-based clustering method that aims to minimize the within-cluster variance (sum of squared distances from points to their cluster centroid).
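Formally, with clusters C_1, ..., C_K and centroids mu_k, the objective being minimized (reported by scikit-learn as the "inertia" used later in this README) is:

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
$$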

Because it is unsupervised learning, it does not require labeled data.



When to Use K-Means

  • When you have a dataset without labels and want to discover natural groupings based on feature similarities.
  • When clusters are expected to be spherical or roughly equally sized, as K-Means works best in these cases.
  • When you know or can estimate the number of clusters (K) in advance.
  • When computational efficiency is important, as K-Means is relatively fast and scalable to large datasets.
  • For applications like market segmentation, image compression, document clustering, and pattern recognition.



When Not to Use K-Means

  • If the clusters in the data are non-spherical, overlapping, or have very different sizes or densities, K-Means may not perform well.
  • When the number of clusters K is not known and is difficult to estimate.
  • When the data contains many outliers, since K-Means is sensitive to outliers, which can distort the centroids.
  • For categorical or non-numeric data without proper encoding or different distance metrics.
  • When clusters have complex shapes that cannot be approximated well by centroids.

In these cases, other clustering methods such as DBSCAN, hierarchical clustering, or Gaussian mixture models might be more appropriate.
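For reference, a minimal scikit-learn sketch of those alternatives (the parameters shown are illustrative, not tuned values, and X is a placeholder feature matrix):

import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 2)  # placeholder data; replace with your own features

db_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)           # density-based, finds arbitrary shapes
agg_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)   # hierarchical merging
gmm_labels = GaussianMixture(n_components=4, random_state=42).fit_predict(X)  # soft, elliptical clusters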



The original dataset consisted of multiple columns, but only “Column1” and “Column2” were used for the analysis. The column "Unnamed: 0", which was merely an index without analytical value, was dropped. The final dataset contains 2 columns and 9,308 rows.
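A hedged sketch of this step (the file name data.csv is an assumption; the presentation does not name the source file):

import pandas as pd

df = pd.read_csv('data.csv')             # file name assumed for illustration
df = df.drop(columns=['Unnamed: 0'])     # drop the index-like column with no analytical value
df = df[['Column1', 'Column2']]          # keep the two analysis columns
print(df.shape)                          # expected: (9308, 2)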



A plot of the original data was constructed to explore its behavior visually. The visual inspection suggested the data was suitable for clustering using the K-Means model. The initial hypothesis was the presence of 4 to 6 groups. However, this estimate was to be confirmed later through the elbow method and silhouette analysis.
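A minimal sketch of such an exploratory plot, assuming the df built above:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(df['Column1'], df['Column2'], s=5, alpha=0.5)
plt.title('Original Data')
plt.xlabel('Column 1')
plt.ylabel('Column 2')
plt.show()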







There were 2 missing values per column (9,306 non-null values out of 9,308). Since the K-Means algorithm cannot handle missing values, these were imputed using the mean of each respective column to enable modeling.
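A quick check that would surface those counts (a sketch; the exact inspection code is not shown in the presentation):

print(df.isnull().sum())   # expected: 2 missing values per column
print(df.shape)            # expected: (9308, 2)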




df['Column1'] = df['Column1'].fillna(df['Column1'].mean())  # fill NaNs with the column mean
df['Column2'] = df['Column2'].fillna(df['Column2'].mean())  # fill NaNs with the column mean



Duplicate values were checked in each column to avoid redundant data points in clustering.

Duplicates in “Column1” and “Column2” were inspected separately, as sketched below.
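A minimal sketch of that inspection (the exact listing code is not reproduced here):

print(df['Column1'].duplicated().sum())   # repeated values in Column1
print(df['Column2'].duplicated().sum())   # repeated values in Column2

Duplicates were then removed, keeping the first occurrence of each value: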



df = df.drop_duplicates(subset='Column1', keep='first')



This resulted in 9,299 rows and 2 columns ready for further processing.



K-Means clustering is sensitive to the scale of features since it relies on distance calculations. To avoid magnitude bias, the data was normalized to a [0,1] scale using MinMaxScaler.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
standard_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

The normalized dataset had minimum values of 0 and maximum values of 1 for both columns, confirming correct scaling.
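That confirmation can be reproduced with a quick summary of the standard_df built above:

print(standard_df.min())   # expected: 0.0 for both columns
print(standard_df.max())   # expected: 1.0 for both columns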




The Elbow Method

The Elbow Method analyzes the total within-cluster sum of squares (inertia) for different values of K (number of clusters). The goal is to identify the "elbow" point where the inertia reduction rate sharply declines, indicating an optimal K.

The script runs KMeans clustering for K from 2 to 10 and stores the inertia values:



from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

inertia_values = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    inertia_values.append(kmeans.fit(standard_df).inertia_)

plt.figure(figsize=(10, 6))
sns.lineplot(x=range(2, 11), y=inertia_values, marker='o')

plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_visible(False)

plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')

plt.axvline(x=3, color='#D86565', linestyle='--')  # candidate K=3
plt.axvline(x=5, color='#D86565', linestyle='--')  # candidate K=5

plt.show()



  • There is a sharp drop in inertia from K=2 to K=3.
  • The decrease continues, but less steeply, until K=5, after which the curve flattens.
  • The plot suggests two potential "elbows", at K=3 and K=5, indicating ambiguity in choosing between these two values based on the elbow method alone.



Silhouette Score Evaluation

The silhouette score is a metric that evaluates cluster quality by assessing how similar each point is to its own cluster compared to other clusters. Scores range from -1 to 1, where a high positive score indicates well-separated, coherent clusters.
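For a single point i, let a(i) be the mean distance to the other points in its own cluster and b(i) the mean distance to the points of the nearest other cluster; the silhouette coefficient is then:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}
$$

The reported score is the average of s(i) over all points.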

The silhouette scores were calculated for K = 3, 4, and 5:



from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd

k_values = [3, 4, 5]
scores = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=43)
    labels = kmeans.fit_predict(standard_df)
    scores.append(silhouette_score(standard_df, labels))

pd.DataFrame({'K': k_values, 'Silhouette Score': scores})



K Silhouette Score
3 0.667
4 0.700
5 0.671

The silhouette score clearly favors K=4, showing the best balance of cluster cohesion and separation among the tested values.



Scatter plots of the clusters for K=3, K=4, and K=5 were generated, with the centroids marked, for intuitive visual evaluation.



import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 1, figsize=(14, 20))
for ax, k in zip(axes, [3, 4, 5]):
    kmeans = KMeans(n_clusters=k, random_state=43)
    kmeans.fit(standard_df)

    data_with_clusters = standard_df.copy()
    data_with_clusters['Cluster'] = kmeans.labels_

    sns.scatterplot(data=data_with_clusters, x='Column1', y='Column2', hue='Cluster', palette='Set2', legend='full', ax=ax)
    sns.scatterplot(x=kmeans.cluster_centers_[:, 0], y=kmeans.cluster_centers_[:, 1], s=150, color='black', marker='X', label='Centroids', ax=ax)

    ax.set_title(f'K = {k}')
    ax.set_xlabel('Column 1')
    ax.set_ylabel('Column 2')
    ax.legend(loc='upper left')

    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()






  • K=3: The clustering appears reasonable overall, but the large cluster on the left (orange) is extensive and seems to contain two distinct “clouds” of points that could be separated better.

  • K=5: This splits the large orange cluster from K=3 into two smaller clusters (orange and light green). This captures some substructure, but the separation among the other clusters is not as clean.

  • K=4: Appears the most natural and interpretable. It clearly separates the four main visible groups in the data, and each cluster is dense and well defined. For example, it distinguishes the upper-middle group (pink) from the right group (teal), which K=3 does not.



Descriptive statistics were computed per cluster to characterize their properties quantitatively:


Cluster Count Mean Column1 Mean Column2
0 1329 8.19 6.10
1 5311 -4.53 -4.98
2 1331 8.93 -8.13
3 1328 0.35 9.58

These statistics help understand the spatial distribution and size of each cluster in feature space.
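A hedged sketch of how such a summary could be produced, assuming a final K=4 model fit on the normalized data and means computed on the cleaned, unscaled df so they stay interpretable:

from sklearn.cluster import KMeans

final_km = KMeans(n_clusters=4, random_state=43).fit(standard_df)

summary = df.assign(Cluster=final_km.labels_).groupby('Cluster').agg(
    Count=('Column1', 'size'),
    Mean_Column1=('Column1', 'mean'),
    Mean_Column2=('Column2', 'mean'))
print(summary)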



  • The Elbow Method was inconclusive, pointing to K=3 or K=5 as possible choices without a definitive answer.
  • The silhouette score provided a clear quantitative preference for K=4, with the highest score of 0.700.
  • Visual inspection of the cluster plots confirmed that K=4 produced the most intuitive and clearly separated clusters, aligning well with the spatial structure of the data.
  • Therefore, the decision to proceed with K-Means using K=4 clusters is well supported, combining strong evaluation metrics with interpretable visual results.
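With K=4 settled, new observations can be assigned to clusters using the fitted objects from the sketch above (final_km and the MinMaxScaler); the points below are hypothetical:

new_raw = [[8.0, 6.0], [-4.5, -5.0]]      # hypothetical observations on the original scale
new_scaled = scaler.transform(new_raw)    # reuse the scaler fitted during preprocessing
print(final_km.predict(new_scaled))       # cluster index for each observation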



This repository and its accompanying presentation PDF provide a full, practical example of using the K-Means clustering algorithm, starting from data cleaning and preprocessing, through determination of the optimal number of clusters by multiple methods, and ending with qualitative and quantitative validation of the results.














🛸 My Contacts Hub





────────────── 🔭⋆ ──────────────

➣➢➤ Back to Top

Copyright 2025 Quantum Software Development. Code released under the MIT License.
