07. Clustering
K-Means, DBSCAN, Hierarchical Clustering, Silhouette Score
Learning Objectives
After completing this tutorial, you will be able to:
- Understand K-Means algorithm operation and objective function (Inertia)
- Select optimal K using Elbow Method and Silhouette Score
- Understand K-Means limitations and necessity of K-Means++ initialization
- Compare various clustering algorithms: DBSCAN, Hierarchical Clustering
- Apply clustering to a practical customer segmentation problem
Key Concepts
1. What is Clustering?
A technique for grouping similar data through unsupervised learning. Discovers data structure without labels.
| Type | Algorithm | Characteristics |
|---|---|---|
| Partitioning | K-Means | Fast, k must be specified |
| Density | DBSCAN | Outlier detection, k not needed |
| Hierarchical | Agglomerative | Dendrogram visualization |
2. K-Means Algorithm Principle
K-Means is the most basic and widely used clustering algorithm that groups data into K clusters.
Operation Process:
1. Randomly initialize K centroids
2. Assign each data point to its nearest centroid
3. Recompute each cluster's centroid as the mean of its assigned points
4. Repeat steps 2-3 until the centroids no longer change
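To make the loop concrete, here is a minimal from-scratch sketch of these four steps in NumPy (an illustration only; the helper name and the assumption that no cluster becomes empty are mine, not the notebook's):

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids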
Objective Function (Inertia):

$$J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \, \lVert x_i - \mu_k \rVert^2$$

- $r_{ik}$: 1 if data point $i$ belongs to cluster $k$, else 0
- $\mu_k$: centroid of cluster $k$
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
centroids = kmeans.cluster_centers_
print(f'Inertia: {kmeans.inertia_:.2f}')

Scaling Required! K-Means is distance-based, so features must be normalized with StandardScaler before fitting.
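If you want scaling and clustering bundled so the same transformation is always applied, a Pipeline is convenient (a small sketch; X is assumed to be the raw feature matrix):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Scaling and clustering in one object: fit_predict scales X, then clusters it
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, random_state=42, n_init=10))
labels = pipe.fit_predict(X)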
K-Means Limitations
- k must be specified in advance
- Only finds spherical clusters well
- Sensitive to outliers
- Results vary depending on initialization
3. Finding Optimal K
Elbow Method
Find the point where Inertia decrease rate bends (elbow).
inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=10)
plt.axvline(x=4, color='red', linestyle='--', label='Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-cluster sum of squares)')
plt.title('Elbow Method')
plt.legend()
plt.show()

Optimal K Selection Methods Comparison
| Method | Description |
|---|---|
| Elbow Method | Point where Inertia decrease rate bends |
| Silhouette Score | Intra-cluster cohesion vs Inter-cluster separation |
| Gap Statistic | Compare actual data vs uniform distribution |
| Domain Knowledge | Based on business requirements |
Practical Tip: Use the Elbow Method and Silhouette Score together and judge them as a whole; sometimes business requirements matter more than either metric.
4. Silhouette Score
A metric that evaluates how well each sample fits its own cluster compared with the nearest other cluster.
s(i) = (b(i) - a(i)) / max(a(i), b(i))
- a(i): average distance to the other points in the same cluster (cohesion)
- b(i): average distance to the points in the nearest other cluster (separation)
- Range: [-1, 1]; higher is better (closer to 1 means better-separated clusters)
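To see where the formula comes from, a(i) and b(i) can be computed by hand for a single sample and compared with the library value below (a sketch; X_scaled and labels come from the K-Means fit above):

import numpy as np

i = 0  # index of the sample to inspect
same = (labels == labels[i])
same[i] = False  # exclude the point itself

# a(i): mean distance to the other points in the same cluster
a_i = np.linalg.norm(X_scaled[same] - X_scaled[i], axis=1).mean()

# b(i): smallest mean distance to the points of any other cluster
b_i = min(
    np.linalg.norm(X_scaled[labels == c] - X_scaled[i], axis=1).mean()
    for c in set(labels) if c != labels[i]
)

s_i = (b_i - a_i) / max(a_i, b_i)
print(f's({i}) = {s_i:.4f}')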
from sklearn.metrics import silhouette_score, silhouette_samples
# Overall average score
score = silhouette_score(X_scaled, labels)
print(f'Silhouette Score: {score:.4f}')
# Per-sample scores (for visualization)
sample_scores = silhouette_samples(X_scaled, labels)
# Compare Silhouette Score by K
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    sil_scores.append(silhouette_score(X_scaled, labels))

5. K-Means++ Initialization
Initial centroid selection significantly impacts K-Means performance. K-Means++ provides better initialization.
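The core idea of k-means++ is to spread the initial centroids out: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen. A minimal sketch of that seeding step (the helper name is made up for illustration; scikit-learn's real implementation additionally runs several local candidate trials per step):

import numpy as np

def kmeans_pp_init(X, k, seed=42):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Sample the next centroid with probability proportional to d2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)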
import numpy as np

# Compare initialization methods
for init_method in ['random', 'k-means++']:
    inertias_trial = []
    for _ in range(10):
        km = KMeans(n_clusters=4, init=init_method, n_init=1, random_state=None)
        km.fit(X)
        inertias_trial.append(km.inertia_)
    print(f'{init_method}: Mean={np.mean(inertias_trial):.2f}, Std={np.std(inertias_trial):.2f}')

K-Means++ is the default. scikit-learn's KMeans uses init='k-means++' by default, so no separate configuration is needed. It's more stable and achieves lower Inertia than random initialization.
6. DBSCAN
Density-based clustering: Recognizes dense regions as clusters and treats low-density areas as noise.
| Parameter | Description |
|---|---|
| eps | Neighborhood distance threshold (radius) |
| min_samples | Minimum number of neighbors (within eps) for a core point |
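A common heuristic for choosing eps is the k-distance plot: sort every point's distance to its k-th nearest neighbor (k is usually set to min_samples) and look for the "knee" where the curve bends sharply. A supplementary sketch, not the notebook's code (X_scaled is assumed):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 5  # typically the same value as min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)  # column 0 is the point itself (distance 0)

# Distance to the k-th nearest entry, sorted ascending; the knee suggests eps
k_dist = np.sort(distances[:, -1])
plt.plot(k_dist)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {k}th nearest neighbor')
plt.title('k-distance plot for choosing eps')
plt.show()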
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
# Noise points: label = -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f'Number of clusters: {n_clusters}, Noise count: {n_noise}')

DBSCAN Advantages
- K specification not required (automatically determines cluster count)
- Automatic outlier detection (label = -1)
- Detects non-spherical clusters (crescent, concentric circles, etc.)
K-Means Limitations: K-Means only finds spherical clusters well. DBSCAN is much more effective for crescent-shaped (moons) or concentric circles data.
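A quick way to see this difference (a sketch on synthetic data, not the notebook's exact comparison; the eps value is hand-picked for illustration):

from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons with known ground-truth grouping
X_moons, y_true = make_moons(n_samples=500, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)

# Agreement with the true grouping (1.0 = perfect); DBSCAN typically scores far higher here
print(f'K-Means ARI: {adjusted_rand_score(y_true, km_labels):.2f}')
print(f'DBSCAN  ARI: {adjusted_rand_score(y_true, db_labels):.2f}')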
7. Hierarchical Clustering
Builds clusters bottom-up into a tree structure that can be visualized as a dendrogram.
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# Clustering with sklearn
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X_scaled)
# Dendrogram (scipy)
Z = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

Linkage Methods
| Method | Description |
|---|---|
| ward | Minimizes within-cluster variance (default, most commonly used) |
| complete | Maximum distance (complete linkage) |
| average | Average distance |
| single | Minimum distance (single linkage) |
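The dendrogram can also be cut into flat cluster labels with scipy's fcluster, either at a distance threshold or for a fixed cluster count (a sketch using the linkage matrix Z computed above; the threshold of 10 is an arbitrary example value):

from scipy.cluster.hierarchy import fcluster

# Cut the tree at distance 10: every merge above this height is split apart
flat_labels = fcluster(Z, t=10, criterion='distance')
print('Clusters found:', len(set(flat_labels)))

# Or request a fixed number of clusters directly
flat_labels_3 = fcluster(Z, t=3, criterion='maxclust')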
8. Algorithm Comparison
The notebook compares K-Means, DBSCAN, and Hierarchical Clustering on various datasets (Blobs, Moons, Circles).
| Situation | Recommended Algorithm |
|---|---|
| k is known, spherical clusters | K-Means |
| Outliers exist, non-spherical clusters | DBSCAN |
| Need to understand hierarchical structure | Agglomerative |
| Large-scale data (100K+ samples) | Mini-batch K-Means |
9. Mini-Batch K-Means (Large-Scale Data)
Mini-Batch K-Means updates the centroids using small random batches of samples, so it is much faster than standard K-Means on large-scale data, at the cost of slightly higher Inertia.
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
# Large-scale data
X_large, _ = make_blobs(n_samples=100000, centers=10, random_state=42)
# Mini-Batch K-Means
mbkm = MiniBatchKMeans(n_clusters=10, random_state=42, batch_size=1000, n_init=3)
mbkm.fit(X_large)

Code Summary
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Scaling required!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
km_labels = kmeans.fit_predict(X_scaled)
# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
db_labels = dbscan.fit_predict(X_scaled)
# Evaluation
print(f"K-Means Silhouette: {silhouette_score(X_scaled, km_labels):.4f}")
# DBSCAN excludes noise (-1) from evaluation
mask = db_labels != -1
if mask.sum() > 1 and len(set(db_labels[mask])) > 1:
    print(f"DBSCAN Silhouette: {silhouette_score(X_scaled[mask], db_labels[mask]):.4f}")

Practical Example: Customer Segmentation
The notebook covers customer segmentation practice using Mall Customers data.
# Customer data clustering
X_customers = customers[['Annual_Income', 'Spending_Score']].values
X_scaled = StandardScaler().fit_transform(X_customers)
# Find optimal K then cluster
km_final = KMeans(n_clusters=5, random_state=42, n_init=10)
customers['Cluster'] = km_final.fit_predict(X_scaled)
# Analyze cluster profiles
cluster_profile = customers.groupby('Cluster').agg({
    'CustomerID': 'count',
    'Age': 'mean',
    'Annual_Income': 'mean',
    'Spending_Score': 'mean'
})

Cluster Interpretation Example:
- High income + High spending: VIP customers, provide premium services
- High income + Low spending: Potential customers, marketing targets
- Low income + High spending: Discount promotions effective
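The interpretation above can be semi-automated by comparing each cluster's mean to the overall mean (a sketch; the High/Low naming scheme is illustrative, not from the notebook):

# Label each cluster relative to the overall income and spending means
overall = customers[['Annual_Income', 'Spending_Score']].mean()

def describe(row):
    income = 'High income' if row['Annual_Income'] > overall['Annual_Income'] else 'Low income'
    spend = 'High spending' if row['Spending_Score'] > overall['Spending_Score'] else 'Low spending'
    return f'{income} / {spend}'

cluster_profile['Segment'] = cluster_profile.apply(describe, axis=1)
print(cluster_profile[['Annual_Income', 'Spending_Score', 'Segment']])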
Interview Questions Preview
- What are K-Means limitations and alternatives?
- How do you select eps and min_samples for DBSCAN?
- What is the meaning and interpretation of Silhouette Score?
- Why is K-Means++ better than regular K-Means?
- How do you improve clustering performance for large-scale data?
Check out more interview questions at Premium Interviews.
Practice Notebook
The notebook additionally covers:
- Step-by-step visualization of K-Means operation (centroid movement per iteration)
- Silhouette Plot for per-cluster quality visualization
- Algorithm comparison on various datasets (Blobs, Moons, Circles)
- 3D visualization for customer segmentation analysis
- Mini-Batch K-Means performance benchmark