Exploring Mahalanobis Distance For Outlier Detection

Introduction

Mahalanobis distance is a widely used measure for detecting outliers in multivariate datasets. It measures how far a data point lies from a distribution, taking into account the correlations within the dataset. This property makes it well-suited for finding outlying data points in high-dimensional datasets where simple Euclidean distance can be misleading. In this blog, we will explain the concept of Mahalanobis distance and demonstrate its application in outlier detection using Python.

What is Mahalanobis Distance?

Imagine you have a multivariate dataset with features that are correlated. Euclidean distance, the most commonly used distance metric, does not take into account the correlations between features. Mahalanobis distance, on the other hand, accounts for these correlations, so the resulting distance reflects the actual shape of the data distribution.
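To make this concrete, here is a minimal sketch (the specific points and covariance are chosen purely for illustration). Both points are equally far from the origin in Euclidean terms, but the point that cuts against the positive correlation is much farther in Mahalanobis terms:

import numpy as np
from scipy.spatial import distance

# Illustrative values: a distribution centered at the origin whose two
# features are positively correlated.
mean = np.zeros(2)
inv_cov = np.linalg.inv(np.array([[1.0, 0.5], [0.5, 1.0]]))

along = np.array([1.0, 1.0])     # lies along the correlation direction
against = np.array([1.0, -1.0])  # cuts against the correlation

print(np.linalg.norm(along), np.linalg.norm(against))  # both ~1.414
print(distance.mahalanobis(along, mean, inv_cov))      # ~1.155
print(distance.mahalanobis(against, mean, inv_cov))    # 2.0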

The formula for calculating Mahalanobis distance for a point X from a distribution with mean μ and covariance matrix Σ is:

D<sup>2</sup> = (X - μ)Σ<sup>-1</sup>(X - μ)<sup>T</sup>

where T denotes the transpose operation and X and μ are treated as row vectors. Note that this expression gives the squared distance D<sup>2</sup>; the scipy function we use below returns the unsquared distance D.
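As a quick worked example, here is the formula evaluated directly with numpy (the point, mean, and covariance below are illustrative values, not drawn from a real dataset):

import numpy as np

x = np.array([2.0, 1.0])
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])

diff = x - mu
d_squared = diff @ np.linalg.inv(sigma) @ diff  # (X - mu) Sigma^-1 (X - mu)^T
print(d_squared, np.sqrt(d_squared))            # prints (approximately) 4.0 and 2.0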

Outlier Detection using Mahalanobis Distance

To demonstrate the utility of Mahalanobis distance in outlier detection, we'll use a sample dataset and the following algorithm:

  1. Calculate the mean and covariance matrix of the dataset.
  2. Compute Mahalanobis distance for each data point.
  3. Determine a threshold for detecting outliers.
  4. Flag data points with Mahalanobis distance greater than the threshold as outliers.

Implementation using Python

We will use the numpy and scipy libraries to implement the above algorithm. Let's start by generating a synthetic dataset:

import numpy as np

# Create random data with correlation
np.random.seed(0)
mean = [0, 0]
cov = [[1, 0.5], [0.5, 1]]
data = np.random.multivariate_normal(mean, cov, 100)

Now, let's calculate the mean and covariance matrix of the dataset:

mean = np.mean(data, axis=0)
cov = np.cov(data, rowvar=False)

Next, we'll compute the Mahalanobis distance for each data point using the scipy.spatial.distance.mahalanobis function, which takes a point, the mean, and the inverse of the covariance matrix:

from scipy.spatial import distance

inv_cov = np.linalg.inv(cov)  # the function expects the inverse covariance matrix
mahalanobis_dist = np.array([distance.mahalanobis(x, mean, inv_cov) for x in data])

We can determine the threshold for outlier detection using a chi-squared distribution: for multivariate normal data, the squared Mahalanobis distance follows a chi-squared distribution with degrees of freedom equal to the number of dimensions. Suppose we pick an alpha level of 0.01. Since our distances are unsquared, we take the square root of the chi-squared quantile:

from scipy.stats import chi2

# Squared Mahalanobis distances of normal data follow a chi-squared
# distribution, so take the square root of the quantile to get a cutoff
# on the (unsquared) distance.
threshold = np.sqrt(chi2.ppf(1 - 0.01, df=2))  # df = 2, the number of dimensions

Now, we can flag outliers in the data by comparing each Mahalanobis distance with the threshold:

outliers = data[mahalanobis_dist > threshold]
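As a quick sanity check (exact counts vary with the random seed), we can report how many points were flagged. With an alpha of 0.01 and 100 points drawn from the model distribution itself, we would expect roughly one false positive:

print(f"Flagged {len(outliers)} of {len(data)} points as outliers")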

That's it! We have successfully detected outliers in a multivariate dataset using Mahalanobis distance.

Conclusion

In this blog post, we introduced Mahalanobis distance and demonstrated its use for detecting outliers in a multivariate dataset. We showed how to calculate the mean and covariance matrix, compute the Mahalanobis distance for each point, derive a chi-squared threshold, and flag outliers. This technique is especially useful for datasets with correlated features, where Euclidean distance can be misleading.