Iris Dataset Clustering and Visualization

Introduction

This project showcases the application of K-Means clustering and various visualization techniques on the famous Iris dataset. The Iris dataset consists of measurements of 150 iris flowers from three different species: Iris-setosa, Iris-versicolor, and Iris-virginica. The objective of this project is to cluster the dataset into groups based on the features and provide 2D and 3D visualizations to better understand the data distribution.

Key Features:

  • Data Preprocessing: Standardizing the dataset for better clustering.
  • K-Means Clustering: Clustering the data into 3 groups (representing the three species).
  • 2D and 3D Visualizations: Visualizing the dataset using both 2D and interactive 3D scatter plots.
  • Technologies: Python, Scikit-Learn, Seaborn, Matplotlib, Plotly.

Libraries Used:

  • Pandas: For data manipulation.
  • NumPy: For numerical operations.
  • Scikit-Learn: For K-Means clustering and data scaling.
  • Seaborn & Matplotlib: For 2D visualizations.
  • Plotly: For interactive 3D visualizations.

Author

Durvesh Sunil Baharwal
Artificial Intelligence and Data Science Engineer

  • GitHub: Durveshbaharwal
  • LinkedIn: Durvesh Baharwal

Project Goals:

  • Gain insight into the Iris dataset by visualizing the relationships between features.
  • Demonstrate clustering techniques using K-Means.
  • Provide interactive visualizations to enhance data understanding.

Dataset

The Iris dataset can be found on the UCI Machine Learning Repository.

Step 1: Import Libraries

To start, we import all the necessary libraries for data manipulation, clustering, and visualization.

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
import plotly.express as px

Step 2: Load the Dataset

We load the Iris dataset from a CSV file, drop the unnecessary Id column, and display the first few rows of the dataset.

In [3]:
# Load the dataset from a CSV file
iris_data = pd.read_csv('Iris.csv')

# Drop the 'Id' column as it's not useful for analysis
iris_data = iris_data.drop(columns=['Id'])

# View the first few rows of the dataset
print(iris_data.head())
   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0            5.1           3.5            1.4           0.2  Iris-setosa
1            4.9           3.0            1.4           0.2  Iris-setosa
2            4.7           3.2            1.3           0.2  Iris-setosa
3            4.6           3.1            1.5           0.2  Iris-setosa
4            5.0           3.6            1.4           0.2  Iris-setosa

The Iris dataset consists of the following columns:

  1. Id: A unique identifier for each row (dropped in the cell above, so it does not appear in the printed output).
  2. SepalLengthCm: The length of the sepal in cm.
  3. SepalWidthCm: The width of the sepal in cm.
  4. PetalLengthCm: The length of the petal in cm.
  5. PetalWidthCm: The width of the petal in cm.
  6. Species: The species of the iris flower (Iris-setosa, Iris-versicolor, Iris-virginica).
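
If Iris.csv is not at hand, a frame with the same column layout can be approximated from the copy of Iris bundled with scikit-learn. This is only a sketch: the name iris_alt is hypothetical, the column names are chosen here to mirror the CSV, and the bundled copy may differ from the CSV in a few values, so printed results can deviate slightly from those shown in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the copy of Iris that ships with scikit-learn as pandas objects
bunch = load_iris(as_frame=True)
iris_alt = bunch.data.copy()  # hypothetical name, stands in for iris_data

# Rename the columns to mirror the CSV naming used in this notebook
iris_alt.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

# Map the integer targets (0, 1, 2) to CSV-style species strings
iris_alt['Species'] = ['Iris-' + bunch.target_names[t] for t in bunch.target]

print(iris_alt.shape)
print(iris_alt['Species'].unique())
```

There is no Id column to drop in this variant, because the bundled data never had one.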

Project Overview

We will go step-by-step with the following tasks:

Data Preprocessing:

Handle any missing values (if present). Standardize/normalize the data.

K-Means Clustering:

Apply the K-means clustering algorithm and determine clusters.

2D Visualizations:

Visualize the dataset in two dimensions using various combinations of the features.

3D Visualization:

Extend the visualizations to 3D. Include interactive 3D plots.

Step 3: Data Preprocessing

Before applying clustering, we check for missing values and standardize the feature data. K-Means is distance-based, so putting all features on a comparable scale keeps any single feature from dominating the result.

1. Check for Missing Values:

In [4]:
# Check for missing values
print(iris_data.isnull().sum())
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

There are no missing values in the dataset, so we can proceed with data preprocessing by normalizing the numerical features.

2. Normalize the Feature Data:

We'll standardize the feature columns to zero mean and unit variance so that they are on the same scale. After that, we will proceed with the K-means clustering algorithm.

In [5]:
# Separate features and target variable
features = iris_data.drop(columns=['Species'])
species = iris_data['Species']

# Standardize the feature data
scaler = StandardScaler()
normalized_features = scaler.fit_transform(features)

# Convert the normalized features back to a DataFrame
normalized_features_df = pd.DataFrame(normalized_features, columns=features.columns)

# Print the normalized data
print(normalized_features_df.head())
   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
0      -0.900681      1.032057      -1.341272     -1.312977
1      -1.143017     -0.124958      -1.341272     -1.312977
2      -1.385353      0.337848      -1.398138     -1.312977
3      -1.506521      0.106445      -1.284407     -1.312977
4      -1.021849      1.263460      -1.341272     -1.312977
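
As a sanity check on what StandardScaler does, the same values can be reproduced by hand. The sketch below uses scikit-learn's bundled Iris features as a stand-in for the features frame (an assumption, so the numbers may differ marginally from the CSV-based output above) and relies on the fact that StandardScaler divides by the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # stand-in for the `features` DataFrame

# z-score by hand: subtract each column's mean, divide by its population std
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # np.std defaults to ddof=0

scaled = StandardScaler().fit_transform(X)

# Both routes should agree up to floating-point error
print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```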

The features have been successfully standardized. Next, let's move on to K-Means Clustering. We'll apply the K-means algorithm and cluster the Iris dataset into 3 clusters (since we know there are 3 species).
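
The number of clusters is fixed at 3 here because the species count is known. When no such prior knowledge exists, a common heuristic is to compare the inertia (within-cluster sum of squares) and the silhouette score over a range of k. The following is a minimal sketch of that check, again using scikit-learn's bundled Iris data as an assumed stand-in; the range 2 to 6 is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_std)
    # inertia_: sum of squared distances of samples to their closest centroid
    results[k] = (km.inertia_, silhouette_score(X_std, km.labels_))

for k, (inertia, sil) in results.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
```

Such heuristics need not point to k=3 just because three species exist; they describe the geometry of the features, not the labels.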

Step 4: Apply K-Means Clustering

We apply the K-Means clustering algorithm with 3 clusters (since we know the Iris dataset has three species) and assign the cluster labels to the dataset.

In [6]:
# Apply K-Means clustering with 3 clusters (since there are 3 species).
# n_init is set explicitly so the result does not depend on the library's default.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(normalized_features)

# Add the cluster labels to the original dataframe
iris_data['Cluster'] = kmeans.labels_

# Show the first few rows with the cluster labels
print(iris_data.head())
   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species  Cluster
0            5.1           3.5            1.4           0.2  Iris-setosa        1
1            4.9           3.0            1.4           0.2  Iris-setosa        1
2            4.7           3.2            1.3           0.2  Iris-setosa        1
3            4.6           3.1            1.5           0.2  Iris-setosa        1
4            5.0           3.6            1.4           0.2  Iris-setosa        1

The K-Means algorithm has grouped the data purely by feature similarity and assigned each flower a cluster number (0, 1, or 2) without using the Species column. The cluster numbers are arbitrary labels, so they do not correspond to species in any fixed order, and the clusters need not coincide exactly with the species.
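
How closely the clusters follow the species can be checked with a contingency table and the adjusted Rand index. This is a sketch under assumptions: it rebuilds the pipeline on scikit-learn's bundled Iris data rather than on Iris.csv, so cluster numbering and counts may differ from the notebook's own run.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)
species = pd.Series(iris.target_names[iris.target], name='Species')

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_std)

# Rows: true species, columns: cluster numbers, cells: flower counts
table = pd.crosstab(species, pd.Series(labels, name='Cluster'))
print(table)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(species, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

The adjusted Rand index is invariant to how the clusters are numbered, which makes it suitable for comparing arbitrary cluster labels against species names.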

Step 5: 2D Visualizations

1. Pairplot using Seaborn:

A pairplot helps visualize the relationships between the features and their clusters.

In [7]:
# Plot a pairplot to visualize the distribution of features by cluster
sns.pairplot(iris_data, hue='Cluster', diag_kind='kde', palette='Set1')
plt.show()
[Figure: Seaborn pairplot of the four features, colored by cluster]

  1. Clusters Distinction: The data points are divided into three clusters (0, 1, and 2), drawn in red, blue, and green, respectively (the first three colors of the Set1 palette). The clusters are distinguishable by their density distributions across different feature combinations.

  2. Sepal Length & Sepal Width: Cluster 1 (blue), which contains the Iris-setosa rows printed in Step 4, combines comparatively low Sepal Length with high Sepal Width, which separates it clearly from clusters 0 (red) and 2 (green). Cluster 0 tends to have smaller Sepal Width values, while cluster 2 falls in a middle range.

  3. Petal Length & Petal Width: A strong positive relationship between Petal Length and Petal Width is visible across the dataset. Cluster 1 has the smallest Petal Length and Petal Width values (the standardized petal values of its rows in Step 3 are well below zero), showing a clear distinction from the other clusters.

  4. Distribution Overlap: The Kernel Density Estimate (KDE) plots on the diagonal show that the distribution of certain features, like Sepal Width, overlaps between clusters 0 and 2, indicating less distinction in those features.
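
Readings taken from a pairplot can be cross-checked numerically by summarizing each cluster. The sketch below computes per-cluster feature means and cluster sizes; it assumes scikit-learn's bundled Iris data and reuses the notebook's column names, so the cluster numbering may not match the plots above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

cols = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
df = pd.DataFrame(load_iris().data, columns=cols)

# Cluster on standardized features, as in Step 4
X_std = StandardScaler().fit_transform(df[cols])
df['Cluster'] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_std)

# Mean of each original (unscaled) feature per cluster, in cm
cluster_means = df.groupby('Cluster')[cols].mean().round(2)
cluster_sizes = df['Cluster'].value_counts().sort_index()

print(cluster_means)
print(cluster_sizes)
```

Reporting the means in the original centimeter units keeps the table directly comparable with the axes of the pairplot.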

2. Scatter Plot of Sepal Length vs Sepal Width:

A simple 2D scatter plot to visualize two features.

In [8]:
plt.figure(figsize=(8,6))
sns.scatterplot(x=normalized_features_df['SepalLengthCm'], 
                y=normalized_features_df['SepalWidthCm'], 
                hue=iris_data['Cluster'], 
                palette='Set1')
plt.title('2D Scatter Plot: Sepal Length vs Sepal Width')
plt.show()
[Figure: 2D scatter plot of standardized Sepal Length vs Sepal Width, colored by cluster]

  1. Sepal Length vs. Sepal Width: Cluster 1 (blue) has higher Sepal Width values, clearly distinguishing it from the other two clusters. Cluster 0 (red) is more concentrated towards the lower Sepal Width values, while cluster 2 (green) lies in between but is more spread out, showing more variability in Sepal Width.

  2. Separation: There is a noticeable gap between clusters 0 and 1 along the Sepal Width axis, making this feature useful for distinguishing them. However, some overlap occurs between clusters 0 and 2 along the Sepal Length axis.

  3. Distinct Clusters: Features like Petal Length, Petal Width, and Sepal Width provide good separation between clusters, particularly for cluster 1.

  4. Overlapping Features: Some overlap is observed between clusters 0 and 2, particularly in Sepal Length and Sepal Width, making them harder to distinguish using these features alone.

  5. Correlations: Petal features (length and width) show a stronger relationship across all clusters, which could be useful for predictive modeling or further analysis.

3. Correlation Heatmap:

In [13]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the feature columns from scikit-learn's bundled copy of the Iris dataset
from sklearn.datasets import load_iris
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Generate correlation matrix
correlation_matrix = df.corr()

# Create heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.show()
[Figure: heatmap of the feature correlation matrix]

Key Insights:

  1. Strong Positive Correlation Between Petal Length and Petal Width (0.96): The most significant positive correlation is between petal length and petal width with a value of 0.96. This indicates that as petal length increases, petal width tends to increase with it in a nearly linear fashion. These two features are therefore closely related and carry largely overlapping information for clustering or classification algorithms.

  2. High Positive Correlation Between Sepal Length and Petal Dimensions: Sepal length has strong positive correlations with both petal length (0.87) and petal width (0.82). This means that longer sepals are associated with longer and wider petals. This correlation might indicate that these features are related to the overall size of the flower.

  3. Moderate Negative Correlation Between Sepal Width and Petal Features: Sepal width has a negative correlation with both petal length (-0.43) and petal width (-0.37). This suggests that as sepal width increases, petal dimensions tend to decrease slightly. This weaker negative relationship implies that sepal width behaves differently compared to the other features, making it an interesting dimension to consider in clustering.

  4. Low Correlation Between Sepal Width and Sepal Length (-0.12): Sepal width and sepal length are almost uncorrelated with a value of -0.12. This means that changes in sepal width do not predict changes in sepal length. This lack of correlation suggests that these two features can vary independently of each other in the dataset.
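
The values in the heatmap are Pearson correlation coefficients. As an illustration, the petal length and petal width entry can be recomputed from the definition; the sketch assumes the same scikit-learn copy of the data as the heatmap cell, and the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
petal_length, petal_width = X[:, 2], X[:, 3]

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means
dx = petal_length - petal_length.mean()
dy = petal_width - petal_width.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Reference value from NumPy's correlation matrix
r_numpy = np.corrcoef(petal_length, petal_width)[0, 1]

print(round(float(r_manual), 2), round(float(r_numpy), 2))
```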

Summary:

Petal dimensions (length and width) are strongly correlated with each other and also with sepal length (0.87 and 0.82), making them highly influential in determining flower size and possibly the flower species.

Sepal width stands out as a feature with lower correlations to other dimensions, making it less predictable based on other features, potentially offering more unique information in classification models.

These correlations can be leveraged in machine learning models to improve clustering or classification by focusing on the features with the highest predictive power (petal length and width).

Step 6: 3D Visualizations

1. 3D Scatter Plot using Matplotlib:

Visualize the data in 3D to capture the relationships between three features.

In [9]:
# 3D Scatter Plot using Matplotlib
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Plotting the points
ax.scatter(normalized_features_df['SepalLengthCm'], 
           normalized_features_df['SepalWidthCm'], 
           normalized_features_df['PetalLengthCm'], 
           c=iris_data['Cluster'], cmap='Set1')

ax.set_xlabel('Sepal Length')
ax.set_ylabel('Sepal Width')
ax.set_zlabel('Petal Length')
plt.title('3D Scatter Plot')
plt.show()
[Figure: Matplotlib 3D scatter plot of standardized Sepal Length, Sepal Width, and Petal Length, colored by cluster]

  1. Axes: X-axis represents Sepal Length. Y-axis represents Sepal Width. Z-axis represents Petal Length.

  2. Clustering: The data points are colored by cluster label. Because the integer labels 0, 1, and 2 are spread across the full range of the nine-color Set1 colormap, the three clusters appear as red, orange, and gray rather than as the red, blue, and green of the Seaborn plots. K-Means assigns every point to exactly one cluster, so differently colored points lying close together indicate clusters that border or overlap in these three features, not mixed membership.

  3. Distribution: There's a clear separation between the clusters along the Sepal Length and Sepal Width axes. The Petal Length shows more varied distribution but also helps in differentiating between groups.

Insights: This 3D scatter plot helps in understanding that the clusters are relatively well-separated in terms of Sepal Length and Sepal Width. However, there is still some overlap, especially along the Petal Length axis, which indicates these features might not be the only discriminators for the dataset.
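
The red, orange, and gray colors in the Matplotlib plot follow from how a colormap is applied to numeric values: the labels are first normalized to the range 0 to 1 and then looked up in the nine-color Set1 map. The sketch below reproduces that lookup and shows one possible way to get the same three colors as the Seaborn plots, by indexing the palette directly. The variable names are illustrative.

```python
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import Normalize, to_hex

labels = np.array([0, 1, 2])
set1 = plt.get_cmap('Set1')  # qualitative colormap with 9 colors

# What scatter(c=labels, cmap='Set1') does: scale labels to [0, 1], then look up
norm = Normalize(vmin=labels.min(), vmax=labels.max())
default_colors = [to_hex(set1(norm(v))) for v in labels]

# Alternative: take the first three palette entries, one per cluster label
discrete_colors = [to_hex(set1.colors[i]) for i in labels]

print(default_colors)   # colors produced by the colormap lookup
print(discrete_colors)  # colors matching the Seaborn 'Set1' palette order
```

A list built like discrete_colors, with one entry per data point, could be passed to ax.scatter through the c argument in place of cmap.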

2. Interactive 3D Plot using Plotly:

Use Plotly to create an interactive 3D scatter plot.

In [10]:
# 3D Scatter Plot using Plotly for interactive visualization
fig = px.scatter_3d(iris_data, 
                    x='SepalLengthCm', 
                    y='SepalWidthCm', 
                    z='PetalLengthCm', 
                    color='Cluster',
                    labels={'Cluster': 'Cluster Group'},
                    title='3D Interactive Scatter Plot')
fig.show()

  1. Axes: X-axis: Sepal Length (Cm). Y-axis: Sepal Width (Cm). Z-axis: Petal Length (Cm).

  2. Cluster Grouping: The colors represent the cluster groups (0, 1, 2). Because the Cluster column is numeric, Plotly applies a continuous color scale running from dark blue/purple to yellow, so the three clusters appear as blue, pink, and yellow. The clusters are visually distinct, separating along different axes, primarily Sepal Length and Sepal Width.

  3. Distribution: The Sepal Length and Sepal Width axes help in the visual separation of the clusters, while Petal Length contributes to the variance within the groups. The interactive nature of the plot allows rotation to better observe how the clusters are distributed in 3D space.

Insights:

The clusters appear well-separated, with clear boundaries between them based on Sepal Length and Sepal Width. The yellow and pink clusters are more tightly packed compared to the blue, which indicates stronger similarity within those groups. It shows that the variables (Sepal Length, Sepal Width, Petal Length) used are fairly effective in distinguishing between the clusters, but fine-tuning or additional features might be needed for more precise classification.

Conclusion

Taken together, these visualizations suggest that petal length and petal width are the most informative features for distinguishing between groups of Iris flowers. While sepal measurements provide some information, they are less effective on their own. The 3D visualizations helped in understanding the interactions between features, while the heatmap provided a statistical basis for these findings.

By incorporating various types of visualizations, we improved our understanding of the relationships between features and of how the K-Means clusters relate to them. This approach highlights the power of exploratory data analysis (EDA) and demonstrates how combining visual and statistical insights can help in interpreting clustering results in machine learning applications.

This project is a simple yet powerful demonstration of how clustering techniques and visualizations can be used to analyze datasets effectively. The use of K-Means clustering and 3D visualizations provides a clear understanding of the data distribution and cluster formation.

Feel free to check out my other projects on GitHub and connect with me on LinkedIn.