Understanding And Implementing K-Means Clustering In Python

Introduction

The K-Means algorithm is one of the simplest and most commonly used unsupervised learning algorithms for solving clustering problems. The goal of K-Means clustering is to partition data into a certain number (K) of distinct clusters such that each data point belongs to the cluster with the nearest mean.

Today, we will implement the K-Means algorithm in Python and learn how to visualize the clusters of an example dataset.

Algorithm

Here is a brief summary of the K-Means algorithm:

  1. Randomly initialize K cluster centroids.
  2. For each point, compute its distance from each of the K centroids and assign the point to the nearest centroid's cluster.
  3. Update each centroid to be the mean of the points in its cluster.
  4. Repeat steps 2 and 3 until convergence (i.e., centroids do not change significantly).
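The assignment and update steps above can be sketched in a few lines of vectorized NumPy. The point coordinates and the choice of two centroids here are illustrative, not part of the dataset we use later:

```python
import numpy as np

# Illustrative points and K=2 starting centroids
points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0]])
centroids = np.array([[1.0, 2.0], [5.0, 8.0]])

# Step 2: distance of every point to every centroid, then nearest-centroid labels
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = np.argmin(distances, axis=1)

# Step 3: move each centroid to the mean of the points assigned to it
new_centroids = np.array(
    [points[labels == k].mean(axis=0) for k in range(len(centroids))]
)

print(labels)         # cluster index of each point
print(new_centroids)  # updated centroid positions
```

Repeating these two steps until the centroids stop moving is the whole algorithm.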

Code Implementation

Now let's see how we can implement this in Python. We'll use the numpy and matplotlib.pyplot libraries for computation and visualization respectively.

First, we will import the required libraries:

import numpy as np
import matplotlib.pyplot as plt

Let's now create our sample data:

# Sample 2D data
data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

Next, we implement the K-Means class:

class K_Means:
    def __init__(self, k=2, tol=0.001, max_iter=300):
        self.k = k
        self.tol = tol
        self.max_iter = max_iter

    def fit(self, data):
        # Initialize the centroids with the first 'k' points in the dataset
        self.centroids = {}
        for i in range(self.k):
            self.centroids[i] = data[i]

        # Begin iterations
        for _ in range(self.max_iter):
            self.classes = {}
            for i in range(self.k):
                self.classes[i] = []

            # Find the distance between each point and the centroids;
            # assign the point to the nearest centroid's cluster
            for features in data:
                distances = [np.linalg.norm(features - self.centroids[centroid])
                             for centroid in self.centroids]
                classification = distances.index(min(distances))
                self.classes[classification].append(features)

            previous = dict(self.centroids)

            # Average each cluster's datapoints to re-calculate the centroids
            for classification in self.classes:
                self.centroids[classification] = np.average(self.classes[classification], axis=0)

            # The result is optimal once no centroid moves by more than
            # our tolerance (as a percentage of its previous position)
            is_optimal = True
            for centroid in self.centroids:
                original_centroid = previous[centroid]
                curr = self.centroids[centroid]
                if np.sum(np.abs((curr - original_centroid) / original_centroid * 100.0)) > self.tol:
                    is_optimal = False

            # Break out of the main loop if the centroids have converged
            if is_optimal:
                break

Let's run this on our data:

kmeans = K_Means(k=2)
kmeans.fit(data)

We can also visualize the clusters:

colors = 10 * ["r", "g", "c", "b", "k"]

# Mark each centroid with an "x"
for centroid in kmeans.centroids:
    plt.scatter(kmeans.centroids[centroid][0], kmeans.centroids[centroid][1],
                s=130, marker="x")

# Plot each cluster's points in a distinct color
for classification in kmeans.classes:
    color = colors[classification]
    for features in kmeans.classes[classification]:
        plt.scatter(features[0], features[1], color=color, s=30)

plt.show()

The code above plots the data points with a distinct color for each cluster and marks each centroid with an "x".

And that's how you implement a simple K-Means algorithm from scratch in Python and partition a dataset into a specified number of clusters!

Of course, in most real-life situations, you don't have to implement the algorithm from scratch. Python libraries such as Scikit-Learn provide ready-to-use implementations of K-Means and many other machine learning algorithms.
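As a quick sketch, here is how the same clustering could be done with Scikit-Learn's KMeans (assuming scikit-learn is installed); the sample data is the same as above:

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# n_init=10 restarts the algorithm from several random initializations
# and keeps the best result; random_state makes the run reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(kmeans.cluster_centers_)  # centroid coordinates
print(kmeans.labels_)           # cluster index of each point
```

Unlike our from-scratch version, Scikit-Learn handles random initialization, restarts, and convergence checks for you.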