
07. Clustering

K-Means, DBSCAN, Hierarchical Clustering, Silhouette Score


Learning Objectives

After completing this tutorial, you will be able to:

  • Understand K-Means algorithm operation and objective function (Inertia)
  • Select optimal K using Elbow Method and Silhouette Score
  • Understand K-Means limitations and necessity of K-Means++ initialization
  • Compare various clustering algorithms: DBSCAN, Hierarchical Clustering
  • Apply practical Customer Segmentation

Key Concepts

1. What is Clustering?

An unsupervised learning technique that groups similar data points together, discovering structure in the data without labels.

Type         | Algorithm     | Characteristics
-------------|---------------|---------------------------------
Partitioning | K-Means       | Fast, k must be specified
Density      | DBSCAN        | Outlier detection, k not needed
Hierarchical | Agglomerative | Dendrogram visualization

2. K-Means Algorithm Principle

K-Means is the most basic and widely used clustering algorithm that groups data into K clusters.

Operation Process:

  1. Random initialization of K centroids
  2. Assign each data point to nearest centroid
  3. Calculate new centroid for each cluster (mean)
  4. Repeat steps 2-3 until centroids don't change

Objective Function (Inertia):

J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} ||x_i - \mu_k||^2

  • r_{ik}: 1 if data point i belongs to cluster k, else 0
  • \mu_k: Centroid of cluster k
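
The four steps above map directly to code. Below is a minimal from-scratch NumPy sketch of the loop, for intuition only (it assumes X_scaled is a 2-D array of standardized features and ignores edge cases such as empty clusters):

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: random initialization - pick K data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia (the objective J above)
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

In practice you use scikit-learn's optimized implementation: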
from sklearn.cluster import KMeans
 
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
centroids = kmeans.cluster_centers_
 
print(f'Inertia: {kmeans.inertia_:.2f}')
⚠️ Scaling Required! K-Means is distance-based, so features must be normalized with StandardScaler.

K-Means Limitations

  • k must be specified in advance
  • Only finds spherical clusters well
  • Sensitive to outliers
  • Results vary depending on initialization

3. Finding Optimal K

Elbow Method

Plot inertia against K and find the point where its rate of decrease bends sharply (the "elbow").

import matplotlib.pyplot as plt

inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=10)
plt.axvline(x=4, color='red', linestyle='--', label='Optimal K')  # elbow location for this dataset
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-cluster sum of squares)')
plt.title('Elbow Method')
plt.legend()
plt.show()

Optimal K Selection Methods Comparison

Method           | Description
-----------------|-------------------------------------------------------------
Elbow Method     | Point where the inertia decrease rate bends
Silhouette Score | Intra-cluster cohesion vs. inter-cluster separation
Gap Statistic    | Compare actual data against a uniform reference distribution
Domain Knowledge | Based on business requirements

Practical Tip: Use Elbow and Silhouette Score together for comprehensive judgment. Sometimes business requirements may be more important.


4. Silhouette Score

A metric evaluating whether each sample is assigned to the correct cluster.

s(i) = (b(i) - a(i)) / max(a(i), b(i))
  • a(i): Average distance within same cluster (cohesion)
  • b(i): Average distance to nearest other cluster (separation)
  • Range: [-1, 1], higher is better (closer to 1 means better cluster separation)
from sklearn.metrics import silhouette_score, silhouette_samples
 
# Overall average score
score = silhouette_score(X_scaled, labels)
print(f'Silhouette Score: {score:.4f}')
 
# Per-sample scores (for visualization)
sample_scores = silhouette_samples(X_scaled, labels)
 
# Compare Silhouette Score by K
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    sil_scores.append(silhouette_score(X_scaled, labels))
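
From this comparison, the usual choice is the K with the highest average silhouette score:

# Pick the K (from the range tested above) with the highest silhouette score
best_k = max(range(2, 11), key=lambda k: sil_scores[k - 2])
print(f'Best K by silhouette: {best_k} (score: {max(sil_scores):.4f})')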

5. K-Means++ Initialization

Initial centroid selection significantly impacts K-Means performance. K-Means++ provides better initialization.

import numpy as np

# Compare initialization methods: 10 independent single-init runs each
for init_method in ['random', 'k-means++']:
    inertias_trial = []
    for _ in range(10):
        km = KMeans(n_clusters=4, init=init_method, n_init=1, random_state=None)
        km.fit(X)
        inertias_trial.append(km.inertia_)
    print(f'{init_method}: Mean={np.mean(inertias_trial):.2f}, Std={np.std(inertias_trial):.2f}')

K-Means++ is the default: scikit-learn's KMeans uses init='k-means++' unless you specify otherwise, so no extra configuration is needed. It is more stable than random initialization and typically reaches lower inertia.
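
The idea behind k-means++ seeding: pick the first centroid uniformly at random, then pick each subsequent centroid with probability proportional to the squared distance from each point to its nearest already-chosen centroid, so the initial centroids end up spread apart. A minimal sketch of that seeding step (assuming X_scaled is a NumPy array; scikit-learn's actual implementation adds refinements such as trying several candidates per step):

import numpy as np

def kmeans_pp_init(X, k, seed=42):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)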


6. DBSCAN

Density-based clustering: Recognizes dense regions as clusters and treats low-density areas as noise.

Parameter   | Description
------------|-----------------------------------------------------------
eps         | Neighborhood distance threshold (radius)
min_samples | Minimum number of neighbors for a point to be a core point
from sklearn.cluster import DBSCAN
 
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
 
# Noise points: label = -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f'Number of clusters: {n_clusters}, Noise count: {n_noise}')

DBSCAN Advantages

  • K specification not required (automatically determines cluster count)
  • Automatic outlier detection (label = -1)
  • Detects non-spherical clusters (crescent, concentric circles, etc.)
⚠️ K-Means Limitations: K-Means only finds spherical clusters well. DBSCAN is much more effective for crescent-shaped (moons) or concentric-circle data.
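
A quick way to see this is on scikit-learn's two-moons toy data, where K-Means cuts straight across the crescents while DBSCAN follows them (eps=0.3 here is tuned for this particular dataset and scaling):

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN

# Two crescent-shaped clusters
X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)

# K-Means imposes a straight boundary; DBSCAN recovers each moon as one cluster
print('DBSCAN clusters:', len(set(db_labels)) - (1 if -1 in db_labels else 0))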


7. Hierarchical Clustering

Forms clusters in a tree structure that can be visualized as a dendrogram.

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
 
# Clustering with sklearn
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X_scaled)
 
# Dendrogram (scipy)
Z = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

Linkage Methods

Method   | Description
---------|----------------------------------------------------------------
ward     | Minimize within-cluster variance (default, most commonly used)
complete | Maximum pairwise distance (complete linkage)
average  | Average pairwise distance
single   | Minimum pairwise distance (single linkage)
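
One practical advantage of the hierarchy is that you do not have to fix the number of clusters up front: you can cut the tree at a distance threshold instead. A short sketch with the same ward linkage (the threshold of 10 is an illustrative value and depends on your data's scale):

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import fcluster

# Cut by distance instead of specifying n_clusters (sklearn)
agg_dist = AgglomerativeClustering(n_clusters=None, distance_threshold=10, linkage='ward')
labels_dist = agg_dist.fit_predict(X_scaled)
print('Clusters found:', agg_dist.n_clusters_)

# Equivalent cut on the scipy linkage matrix Z computed above
labels_scipy = fcluster(Z, t=10, criterion='distance')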

8. Algorithm Comparison

The notebook compares K-Means, DBSCAN, and Hierarchical Clustering on various datasets (Blobs, Moons, Circles).

Situation                                 | Recommended Algorithm
------------------------------------------|----------------------
k is known, spherical clusters            | K-Means
Outliers exist, non-spherical clusters    | DBSCAN
Need to understand hierarchical structure | Agglomerative
Large-scale data (100K+ samples)          | Mini-Batch K-Means

9. Mini-Batch K-Means (Large-Scale Data)

Mini-Batch K-Means is much faster for large-scale data.

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Large-scale synthetic data
X_large, _ = make_blobs(n_samples=100000, centers=10, random_state=42)

# Mini-Batch K-Means: updates centroids using small random batches
mbkm = MiniBatchKMeans(n_clusters=10, random_state=42, batch_size=1000, n_init=3)
mbkm.fit(X_large)
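
To see the speed difference, a small timing comparison against full K-Means on the same data can look like this (exact numbers depend on your machine):

import time
from sklearn.cluster import KMeans

for name, model in [('KMeans', KMeans(n_clusters=10, random_state=42, n_init=3)),
                    ('MiniBatchKMeans', MiniBatchKMeans(n_clusters=10, random_state=42,
                                                        batch_size=1000, n_init=3))]:
    start = time.perf_counter()
    model.fit(X_large)
    print(f'{name}: {time.perf_counter() - start:.2f}s, inertia={model.inertia_:.2f}')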

Code Summary

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
 
# Scaling required!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
 
# K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
km_labels = kmeans.fit_predict(X_scaled)
 
# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
db_labels = dbscan.fit_predict(X_scaled)
 
# Evaluation
print(f"K-Means Silhouette: {silhouette_score(X_scaled, km_labels):.4f}")
 
# DBSCAN excludes noise (-1) from evaluation
mask = db_labels != -1
if mask.sum() > 1:
    print(f"DBSCAN Silhouette: {silhouette_score(X_scaled[mask], db_labels[mask]):.4f}")

Practical Example: Customer Segmentation

The notebook covers customer segmentation practice using Mall Customers data.

# Customer data clustering
X_customers = customers[['Annual_Income', 'Spending_Score']].values
X_scaled = StandardScaler().fit_transform(X_customers)
 
# Find optimal K then cluster
km_final = KMeans(n_clusters=5, random_state=42, n_init=10)
customers['Cluster'] = km_final.fit_predict(X_scaled)
 
# Analyze cluster profiles
cluster_profile = customers.groupby('Cluster').agg({
    'CustomerID': 'count',
    'Age': 'mean',
    'Annual_Income': 'mean',
    'Spending_Score': 'mean'
})

Cluster Interpretation Example (a labeling sketch follows the list):

  • High income + High spending: VIP customers, provide premium services
  • High income + Low spending: Potential customers, marketing targets
  • Low income + High spending: Discount promotions effective
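
The cluster profile table from the code above can be turned into such labels programmatically. A rough sketch, assuming the Annual_Income and Spending_Score columns from the Mall Customers data; using the overall mean as the high/low threshold is an illustrative choice:

# Label each cluster by comparing its mean income/spending to the overall mean
income_mid = customers['Annual_Income'].mean()
spending_mid = customers['Spending_Score'].mean()

def describe(row):
    income = 'High income' if row['Annual_Income'] >= income_mid else 'Low income'
    spending = 'High spending' if row['Spending_Score'] >= spending_mid else 'Low spending'
    return f'{income} + {spending}'

cluster_profile['Segment'] = cluster_profile.apply(describe, axis=1)
print(cluster_profile[['Annual_Income', 'Spending_Score', 'Segment']])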

Interview Questions Preview

  1. What are K-Means limitations and alternatives?
  2. How do you select eps and min_samples for DBSCAN?
  3. What is the meaning and interpretation of Silhouette Score?
  4. Why is K-Means++ better than regular K-Means?
  5. How do you improve clustering performance for large-scale data?

Check out more interview questions at Premium Interviews.


Practice Notebook

The notebook additionally covers:

  • Step-by-step visualization of K-Means operation (centroid movement per iteration)
  • Silhouette Plot for per-cluster quality visualization
  • Algorithm comparison on various datasets (Blobs, Moons, Circles)
  • 3D visualization for customer segmentation analysis
  • Mini-Batch K-Means performance benchmark

Previous: 06. Feature Engineering | Next: 08. Dimensionality Reduction