Machine learning involves handling huge amounts of multi-dimensional data. Exploring and visualizing such high-dimensional data can be challenging, especially when you need to reduce its dimensionality while preserving the significant properties of the original data. A common solution to this dilemma is dimensionality reduction, and in this post we will consider one specific technique, t-Distributed Stochastic Neighbor Embedding (t-SNE), used for visualizing high-dimensional data.
t-SNE is a machine learning algorithm for visualization developed by Laurens van der Maaten and Geoffrey Hinton (the "Godfather of Deep Learning") in 2008. It is a non-linear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions.
Essentially, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, while dissimilar points have an extremely low probability of being picked.
Then, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map.
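To make the two steps above concrete, here is a deliberately simplified sketch of the objective t-SNE minimizes. Note the assumptions: a single fixed Gaussian bandwidth `sigma` is used for the high-dimensional similarities (real t-SNE tunes a per-point bandwidth to match a target perplexity), and no gradient descent is performed; the function only evaluates the KL divergence for a given pair of embeddings.

```python
import numpy as np

def tsne_kl_sketch(x_high, x_low, sigma=1.0):
    """Illustrative (non-optimized) evaluation of the KL divergence
    that t-SNE minimizes. `sigma` is a fixed bandwidth here, an
    assumption for simplicity; real t-SNE adapts it per point."""

    def sq_dists(x):
        # Pairwise squared Euclidean distances
        diff = x[:, None, :] - x[None, :, :]
        return (diff ** 2).sum(-1)

    # High-dimensional similarities: Gaussian kernel, normalized over all pairs
    p = np.exp(-sq_dists(x_high) / (2 * sigma ** 2))
    np.fill_diagonal(p, 0.0)
    p /= p.sum()

    # Low-dimensional similarities: Student-t kernel with one degree of
    # freedom (the heavy tails are what the "t" in t-SNE refers to)
    q = 1.0 / (1.0 + sq_dists(x_low))
    np.fill_diagonal(q, 0.0)
    q /= q.sum()

    # KL(P || Q), summed over pairs with nonzero p
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())
```

Minimizing this quantity with respect to the low-dimensional coordinates is what pulls similar points together and pushes dissimilar points apart in the map.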
Let's implement t-SNE using Scikit-learn library on the Iris dataset.
```python
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load the Iris dataset (150 samples, 4 features)
iris = load_iris()
x = iris['data']
y = iris['target']

# Reduce the 4-dimensional data to 2 dimensions
tsne = TSNE(n_components=2, random_state=0)
x_2d = tsne.fit_transform(x)

# Plot the embedding, colored by class label
scatter = plt.scatter(x_2d[:, 0], x_2d[:, 1], c=y)
plt.legend(*scatter.legend_elements())
plt.show()
```
In this Python script, the load_iris() function loads the Iris dataset. We then initialize the t-SNE model with the desired number of components (dimensions) set to two, and a random state for reproducibility. After fitting this model to our data, we use Matplotlib to visualize the newly transformed 2D data.
Despite being a powerful tool for data visualization, t-SNE can be tricky to interpret (distances and cluster sizes in the embedding are not always meaningful) and computationally demanding for very large datasets. Use it with a clear understanding of its underlying principles and limitations.
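One common way to mitigate the computational cost on larger datasets is to first compress the data with PCA and then run t-SNE on the compressed representation. The sketch below uses Iris again for continuity; the component counts are illustrative choices, not prescriptions (Iris has only 4 features, so the PCA step here is purely demonstrative, whereas on high-dimensional data one might reduce to around 50 components first).

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

iris = load_iris()

# Step 1: compress with PCA (illustrative on Iris; on genuinely
# high-dimensional data this step is what saves time)
x_reduced = PCA(n_components=3, random_state=0).fit_transform(iris['data'])

# Step 2: run t-SNE on the compressed data
x_2d = TSNE(n_components=2, random_state=0).fit_transform(x_reduced)

print(x_2d.shape)  # (150, 2)
```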
In the next tutorial, we will dive deep into another interesting Machine Learning topic. Stay tuned!