Visualizing Hierarchical Clustering With Python

Hierarchical clustering is an unsupervised learning algorithm that groups similar objects into clusters. It lets you uncover structure in a dataset that would be hard to spot by inspecting the raw values. Today we will learn how to implement it in Python and visualize the results.

Necessary Libraries

To start off, we need two libraries: scikit-learn for the hierarchical clustering model and Matplotlib for visualization. (NumPy will also be imported later to generate the dummy data.)

from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

Dummy Data

We will be creating some dummy data for this project. For simplicity's sake, we will stick to two features only.

import numpy as np

np.random.seed(1)  # for consistency

# Create 3 clusters of data
x = np.concatenate([1.5 * np.random.randn(100, 2),
                    0.5 * np.random.randn(50, 2) + [2, 2],
                    np.random.randn(100, 2) + [4, -3]], axis=0)
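
If you want to confirm what we just built, the array has 250 samples (100 + 50 + 100) and two columns, one per feature:

# Sanity check: 250 rows (points), 2 columns (features)
print(x.shape)  # (250, 2)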

Agglomerative Clustering Model

Next, we will create our model, specifying how many clusters we want: in our case, three.

# Fit the clustering model to the data; this assigns a cluster label to each point
model = AgglomerativeClustering(n_clusters=3)
model = model.fit(x)
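
Once the model is fitted, the cluster assignment for each point is stored in its labels_ attribute, which we will use below to color the plot. A quick way to peek at it:

# labels_ holds one cluster index (0, 1, or 2) per row of x
print(model.labels_.shape)  # (250,)
print(model.labels_[:10])   # assignments of the first ten points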

Visualization

Finally, we will visualize our clusters.

plt.figure(figsize=(10, 7))
plt.scatter(x[:, 0], x[:, 1], c=model.labels_)
plt.title("Hierarchical Clustering")
plt.show()

This code produces a scatter plot in which each data point is colored according to its cluster. In the figure, we can see that the algorithm has recovered the three groups into which our data was divided.
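
A simple numerical check of that impression is to count how many points landed in each cluster; since the data was generated as groups of 100, 50, and 100 points, the counts should be close to that split (a few points near the cluster boundaries may be assigned differently):

# Number of points assigned to each of the three clusters
print(np.bincount(model.labels_))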

That's all for this post. By playing with the data and the model's parameters, you can explore hierarchical clustering further. Keep in mind that real-world datasets often have many features, not just two, so they can't be visualized as easily.
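
As a starting point for that exploration, here is a minimal sketch, assuming the data x from above is still in memory; the linkage choice and the distance threshold of 5.0 are only illustrative values, not recommendations:

from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Same data, but with complete linkage instead of the default ward linkage
model_complete = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(x)

# Alternatively, let the algorithm pick the number of clusters by cutting the
# hierarchy at a distance threshold instead of fixing n_clusters
model_threshold = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0).fit(x)
print(model_threshold.n_clusters_)  # number of clusters found at this threshold

plt.figure(figsize=(10, 7))
plt.scatter(x[:, 0], x[:, 1], c=model_complete.labels_)
plt.title("Hierarchical Clustering (complete linkage)")
plt.show()

Ward linkage merges clusters to minimize within-cluster variance, while complete linkage merges based on the maximum pairwise distance, so the two can produce noticeably different groupings on the same data.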