This repository contains a clustering analysis of the Mall Customers dataset. The main work is in code/clustering_project.ipynb, where the data is explored, preprocessed, and segmented using unsupervised learning.
The project uses dataset/Mall_Customers.csv, which contains 200 mall customers with the following attributes:
- CustomerID
- Genre
- Age
- Annual Income (k$)
- Spending Score (1-100)
The dataset has no missing values.
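As a rough illustration of the load-and-inspect step, the sketch below reads the CSV with pandas and checks the column names and the absence of missing values. The helper name `load_customers` and the constant `EXPECTED_COLUMNS` are hypothetical and are not taken from the notebook.

```python
from pathlib import Path

import pandas as pd

# Column names as listed in this README.
EXPECTED_COLUMNS = [
    "CustomerID",
    "Genre",
    "Age",
    "Annual Income (k$)",
    "Spending Score (1-100)",
]


def load_customers(csv_path):
    """Read the CSV and verify the expected columns and the absence of missing values.

    Hypothetical helper for illustration; the notebook may load the data differently.
    """
    df = pd.read_csv(Path(csv_path))
    missing_columns = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing_columns:
        raise ValueError(f"Missing expected columns: {missing_columns}")
    if df.isna().any().any():
        raise ValueError("Dataset contains missing values")
    return df
```

From the repository root, a call such as `load_customers("dataset/Mall_Customers.csv")` should return a DataFrame with one row per customer.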
The objective is to identify meaningful customer groups based on income and spending behavior, then compare two clustering approaches:
- K-Means clustering
- Agglomerative hierarchical clustering
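One possible way to compare the two approaches is to fit both on the same points and measure how closely their labelings agree. The sketch below does this with the adjusted Rand index. It runs on synthetic stand-in data rather than the real dataset, and the Ward linkage, the seeds, and the choice of agreement metric are assumptions, not the notebook's confirmed settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data: five well-separated 2-D blobs (NOT the real dataset).
rng = np.random.default_rng(0)
centers = np.array([[-2, -2], [-2, 2], [0, 0], [2, -2], [2, 2]], dtype=float)
X = np.vstack([c + 0.2 * rng.standard_normal((40, 2)) for c in centers])

# Fit both algorithms with the same number of clusters.
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
# Ward linkage is an assumption; the notebook may use another linkage.
agglo_labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X)

# The adjusted Rand index is 1.0 when two partitions match up to label renaming.
agreement = adjusted_rand_score(kmeans_labels, agglo_labels)
print(f"Adjusted Rand index between the two labelings: {agreement:.3f}")
```

A value near 1 means both methods recover essentially the same groups. Lower values indicate points that the two methods assign differently.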
The notebook uses Annual Income (k$) and Spending Score (1-100) as the main features for clustering after scaling.
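A minimal sketch of the scaling step, assuming standardization with scikit-learn's `StandardScaler`. The README does not state which scaler the notebook uses, and `scale_features` is a hypothetical helper name.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

FEATURES = ["Annual Income (k$)", "Spending Score (1-100)"]


def scale_features(df):
    """Return the two clustering features standardized to zero mean and unit variance.

    Assumes StandardScaler; the notebook may use a different scaler.
    """
    scaled = StandardScaler().fit_transform(df[FEATURES])
    return pd.DataFrame(scaled, columns=FEATURES, index=df.index)
```

Scaling matters here because both K-Means and Ward-style hierarchical clustering rely on Euclidean distances, so a feature with a larger numeric range would otherwise dominate.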
The analysis follows these steps:
- Load and inspect the dataset.
- Explore feature distributions and pairwise relationships.
- Remove CustomerID, which does not help with behavioral clustering.
- Encode Genre and scale the numeric features.
- Estimate the best number of clusters with inertia and silhouette analysis.
- Fit K-Means with the selected K.
- Compare K-Means with hierarchical clustering.
- Visualize the final segments and summarize the customer profiles.
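The cluster-count estimation step in the list above can be sketched as follows: fit K-Means for several candidate values of K and record the inertia (for an elbow plot) and the silhouette score for each. The helper `evaluate_k_range`, the candidate range, and the synthetic data are illustrative assumptions, not the notebook's code or data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def evaluate_k_range(X, k_values, random_state=0):
    """Return {k: (inertia, silhouette)} for each candidate number of clusters."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results


# Synthetic stand-in data: five well-separated 2-D blobs (NOT the real dataset).
rng = np.random.default_rng(0)
centers = np.array([[-2, -2], [-2, 2], [0, 0], [2, -2], [2, 2]], dtype=float)
X = np.vstack([c + 0.2 * rng.standard_normal((40, 2)) for c in centers])

scores = evaluate_k_range(X, range(2, 9))
# Pick the K with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k][1])
for k, (inertia, silhouette) in scores.items():
    print(f"K={k}: inertia={inertia:.1f}, silhouette={silhouette:.3f}")
print("K with the highest silhouette score:", best_k)
```

Inertia always tends to decrease as K grows, so it is read for an elbow rather than a minimum. The silhouette score gives a second criterion that can peak at a specific K.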
The notebook selects K = 5 as the optimal number of clusters. The final groups are interpreted as customer segments such as average shoppers, premium shoppers, impulsive shoppers, careful spenders, and sensible shoppers.
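To show how such profiles might be derived, the sketch below fits K-Means with five clusters on scaled features and reports per-cluster means in the original units. The helper `summarize_segments`, the use of `StandardScaler`, and the synthetic demo data are assumptions. Mapping each cluster to a name such as "premium shoppers" remains a manual interpretation step.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["Annual Income (k$)", "Spending Score (1-100)"]


def summarize_segments(df, n_clusters=5, random_state=0):
    """Cluster on scaled features and return per-cluster means in original units."""
    X = StandardScaler().fit_transform(df[FEATURES])
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=random_state
    ).fit_predict(X)
    labeled = df[FEATURES].assign(cluster=labels)
    summary = labeled.groupby("cluster").mean()
    summary["size"] = labeled.groupby("cluster").size()
    return summary


# Synthetic stand-in data with the README's column names (NOT the real dataset).
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]], dtype=float)
points = np.vstack([c + 3.0 * rng.standard_normal((40, 2)) for c in centers])
demo = pd.DataFrame(points, columns=FEATURES)

print(summarize_segments(demo).round(1))
```

Reading the table row by row (for example, high income with a high spending score) is what supports labels like those listed above.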
- code/: notebooks for the clustering analysis
- dataset/: Mall Customers CSV file
- paper/: space for report material
The notebook was developed with Python and common data science libraries, including:
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- scipy
- Open code/clustering_project.ipynb in Jupyter or VS Code.
- Make sure the relative dataset path points to dataset/Mall_Customers.csv.
- Run the notebook cells from top to bottom.
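If the relative path is a source of trouble, one option is to locate the CSV by walking up from the current working directory. The helper below, `find_dataset`, is a hypothetical convenience and is not part of the notebook.

```python
from pathlib import Path


def find_dataset(start=None, relative="dataset/Mall_Customers.csv"):
    """Search `start` and its parent directories for `relative`; return the first match."""
    start = Path(start or Path.cwd()).resolve()
    for base in [start, *start.parents]:
        candidate = base / relative
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"Could not find {relative} from {start}")
```

Called from inside code/, this would resolve to the CSV under the repository root, and the result can be passed straight to `pandas.read_csv`.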
The notebook also includes a comparison against the scikit-learn clustering documentation and a reference to the Kaggle page for the dataset (https://www.kaggle.com/datasets/abdallahwagih/mall-customers-segmentation/data).