This project performs customer segmentation using K-means clustering on an online retail dataset. The goal is to identify different customer segments based on their purchasing behavior.
The dataset contains transactional data from an online retail store, including fields such as InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.
- pandas: Data manipulation and analysis
- matplotlib: Data visualization
- seaborn: Statistical data visualization
- scikit-learn: Machine learning algorithms
- openpyxl: Excel file handling
Install the dependencies using the following command:
pip install pandas matplotlib seaborn scikit-learn openpyxl
- Data Preprocessing: Handles missing values and removes duplicates.
- Feature Engineering: Creates RFM (Recency, Frequency, Monetary) features.
- Outlier Detection: Uses the interquartile range (IQR) to detect and remove outliers.
- Scaling: Standardizes features for clustering.
- K-means Clustering: Implements K-means for customer segmentation.
- Visualization: Plots clusters to interpret results.
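
As an illustration of the feature engineering and outlier detection features above, here is a minimal sketch of how RFM features and an IQR filter could be computed with pandas. It is not the project's actual code: the helper names `compute_rfm` and `remove_iqr_outliers`, the derived `TotalPrice` column, the snapshot date (one day after the last invoice), and the 1.5 multiplier are all assumptions, and the sample rows are made up. Only the column names come from the dataset description.

```python
import pandas as pd

# Tiny synthetic sample that reuses the dataset's column names; the values are made up.
transactions = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "B1", "C1", "C2", "C3"],
    "Quantity": [2, 1, 5, 10, 1, 3, 2],
    "UnitPrice": [3.0, 10.0, 2.0, 1.5, 20.0, 4.0, 5.0],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-01-05", "2011-03-10",
        "2011-02-01", "2011-01-20", "2011-02-15", "2011-03-30",
    ]),
    "CustomerID": [1, 1, 1, 2, 3, 3, 3],
})


def compute_rfm(df):
    """Hypothetical helper: one row per customer with Recency, Frequency, Monetary."""
    df = df.assign(TotalPrice=df["Quantity"] * df["UnitPrice"])
    # Assumption: recency is measured from the day after the last invoice in the data.
    snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)
    return df.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("TotalPrice", "sum"),
    )


def remove_iqr_outliers(frame, columns, k=1.5):
    """Hypothetical helper: keep rows within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]


rfm = compute_rfm(transactions)
rfm_clean = remove_iqr_outliers(rfm, ["Recency", "Frequency", "Monetary"])
print(rfm_clean)
```

Filtering on all three columns at once drops a customer who is extreme on any single metric. Whether that is appropriate depends on how many high-value customers the analysis can afford to lose.
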
- Import Libraries: Load required Python libraries.
- Load Dataset: Import data from an Excel file.
- Data Cleaning: Handle missing values and duplicates.
- Feature Engineering: Calculate RFM metrics for customers.
- Outlier Removal: Detect and remove outliers using the IQR method.
- Scaling: Standardize features for clustering.
- K-means Clustering: Determine the optimal number of clusters using the Elbow method and fit the model.
- Visualization: Plot cluster distributions and interpret results.
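
The scaling and clustering steps above can be sketched as follows. This is an illustrative example on synthetic data, not the project's actual notebook: the three made-up customer groups, the range of `k` values tried, `random_state=42`, and the final choice of three clusters are all assumptions. In practice the number of clusters would be read off an Elbow plot of the inertia values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an RFM table: three made-up customer groups,
# columns are (Recency, Frequency, Monetary).
rng = np.random.default_rng(0)
centers = np.array([[10, 2, 50], [100, 10, 500], [300, 1, 20]], dtype=float)
rfm_values = np.vstack(
    [c + rng.normal(scale=[3, 0.5, 5], size=(30, 3)) for c in centers]
)

# Standardize so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(rfm_values)

# Elbow method: record the within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.2f}")

# Assumption: the elbow is at k=3 for this synthetic data.
chosen_k = 3
labels = KMeans(n_clusters=chosen_k, n_init=10, random_state=42).fit_predict(scaled)
```

Plotting `inertias` against `k` with matplotlib gives the usual Elbow curve. The point where the curve flattens is a heuristic, so it is worth checking the resulting segments for business sense as well.
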
The output includes visualizations of clusters and insights into customer segmentation based on purchasing behavior.
- Implementing hierarchical clustering and DBSCAN for comparison.
- Automating the assignment of segment labels to new customer data.
- Adding dashboards for interactive visualization.
- Online retail mining paper: https://link.springer.com/article/10.1057/dbm.2012.17
- YouTube video: https://www.youtube.com/watch?v=afPJeQuVeuY&t=5235s