Hierarchical Clustering is an algorithm that groups similar objects into clusters. Because it builds a hierarchy of clusters rather than a single flat partition, it lets you examine the structure of a dataset at different levels of granularity. Today we will learn how to implement it in Python and visualize the results.
To start off, we need two libraries: scikit-learn for the Hierarchical Clustering model, and Matplotlib for visualization.
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
We will create some dummy data for this project. For simplicity's sake, we will stick to two features only, so the points can be plotted directly.
import numpy as np

np.random.seed(1)  # for consistency

# Create 3 clusters of data
x = np.concatenate([
    1.5 * np.random.randn(100, 2),
    0.5 * np.random.randn(50, 2) + [2, 2],
    np.random.randn(100, 2) + [4, -3],
], axis=0)
Next, we create our model, specifying how many clusters we want, in our case 3. By default, AgglomerativeClustering uses Ward linkage, which at each step merges the pair of clusters that least increases the total within-cluster variance.
model = AgglomerativeClustering(n_clusters=3)
model = model.fit(x)
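The fitted model stores each point's cluster assignment in model.labels_. As a quick sanity check (a minimal sketch, not part of the original walkthrough), we can count how many points ended up in each cluster:

# labels_ holds one integer cluster index (0, 1, or 2) per row of x;
# bincount tallies how many points were assigned to each cluster.
print(np.bincount(model.labels_))

With the seed above, the counts should roughly match the 100/50/100 split used to generate the data, though they need not match it exactly.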
Finally, we will visualize our clusters.
plt.figure(figsize=(10, 7))
plt.scatter(x[:, 0], x[:, 1], c=model.labels_)
plt.title("Hierarchical Clustering")
plt.show()
This code produces a scatter plot with each data point colored according to its assigned cluster. In the figure, we can see that the algorithm has recovered the three groups our data was generated from.
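Since the model is hierarchical, it can also be instructive to inspect the full merge tree. scikit-learn does not ship a one-line dendrogram plot, but SciPy does; here is a minimal sketch, assuming SciPy is installed and reusing the x generated above:

from scipy.cluster.hierarchy import dendrogram, linkage

# Build a Ward-linkage merge tree over the same data; the dendrogram
# shows which points/clusters were merged, and at what distance.
Z = linkage(x, method="ward")
plt.figure(figsize=(10, 7))
dendrogram(Z, no_labels=True)  # 250 leaves, so hide the tick labels
plt.title("Dendrogram (Ward linkage)")
plt.show()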
That's all for this post. By manipulating the data and the parameters, you can explore hierarchical clustering further; one parameter worth experimenting with is shown in the sketch below. Keep in mind that real-world datasets often have many features, not just two, so they can't be visualized as easily.
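For instance, the linkage parameter of AgglomerativeClustering controls how the distance between two clusters is measured when deciding which pair to merge; scikit-learn accepts "ward" (the default), "complete", "average", and "single". A minimal sketch of such an experiment, reusing the imports and the x from above:

# Fit the same data with three different linkage criteria and
# plot the resulting partitions side by side.
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, link in zip(axes, ["ward", "complete", "single"]):
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(x)
    ax.scatter(x[:, 0], x[:, 1], c=labels)
    ax.set_title(f"linkage={link}")
plt.show()

On well-separated blobs like ours, the criteria tend to agree; on elongated or noisy data they can differ substantially, with single linkage in particular prone to chaining.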