Classification datasets are usually assumed to be balanced, i.e., the classes in the target variable are distributed roughly equally. In real-world scenarios, however, we often encounter imbalanced datasets, where one class heavily outnumbers the others. Such a task is called an imbalanced classification problem.
Let’s understand this with a classic example: fraud detection. Fraudulent transactions are rare compared to normal transactions, so the dataset is heavily imbalanced. If we train a machine learning model on this data as-is, it will likely learn to favor the majority class while largely ignoring the minority class, simply because it sees too few minority examples. As the sketch below shows, such a model can even look deceptively accurate.
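To see why plain accuracy is misleading here, consider a minimal sketch (with a hypothetical 1% fraud rate, using scikit-learn's DummyClassifier) in which a "model" that never predicts fraud still scores about 99% accuracy:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: ~1% fraud (class 1), ~99% normal (class 0)
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)
X = rng.normal(size=(10_000, 5))  # dummy features; their values don't matter here

# A baseline that always predicts the majority class
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2%}")      # ~99%, yet useless
print(f"Fraud recall: {recall_score(y, y_pred):.2%}")    # 0% -- every fraud missed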
There are three methods typically used to balance classes (the first two are sketched right after this list):

1. Undersampling the majority class, i.e., discarding majority examples until the classes are roughly even.
2. Oversampling the minority class, i.e., duplicating or resampling minority examples.
3. Generating synthetic minority samples, e.g., with SMOTE (Synthetic Minority Over-sampling Technique), which interpolates new examples between existing minority-class neighbors.
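As a minimal sketch of the first two approaches, here is how imbalanced-learn's RandomUnderSampler and RandomOverSampler behave on the same breast cancer dataset used later in this post:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
print(f"Original: {Counter(y)}")

# Undersample the majority class down to the minority count
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(f"Undersampled: {Counter(y_under)}")

# Oversample the minority class up to the majority count (by duplication)
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print(f"Oversampled: {Counter(y_over)}")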
In this blog, let's focus on implementing the SMOTE technique using Python and the imbalanced-learn library.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import load_breast_cancer

# Load the dataset
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Class distribution before applying SMOTE
counter = Counter(y)
print(f'Before SMOTE: {counter}')

# Apply SMOTE (random_state fixed for reproducibility)
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)

# Class distribution after applying SMOTE
counter = Counter(y_smote)
print(f'After SMOTE: {counter}')
Before applying SMOTE, the dataset is imbalanced: the breast cancer dataset contains 212 samples of class 0 (malignant) and 357 of class 1 (benign). After applying SMOTE, both classes have 357 samples, so a machine learning model can now be trained on a balanced dataset.
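One important caveat the snippet above glosses over: in practice, SMOTE should be applied only to the training data, never to the test set, or synthetic samples will leak into the evaluation. A minimal sketch using imbalanced-learn's Pipeline keeps the resampling inside each cross-validation fold (the model choice here, LogisticRegression with standard scaling, is just an illustrative assumption):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# SMOTE runs inside each training fold only, so the test folds stay untouched
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
print(f"Mean F1 across folds: {scores.mean():.3f}")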
When dealing with imbalanced data, these methods can help improve model performance, but keep in mind that there is no foolproof way to decide in advance which one will work best for your dataset. That judgment requires a solid understanding of your data and the problem you're working on; one pragmatic option is simply to benchmark the candidates, as sketched below.
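Under the same assumptions as the pipeline above, a short comparison loop can put each resampling strategy through cross-validation and let the scores guide the choice:

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

samplers = {
    "no resampling": None,
    "undersampling": RandomUnderSampler(random_state=42),
    "oversampling": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
}

for name, sampler in samplers.items():
    # Build the pipeline, inserting the sampler only when one is given
    steps = [("scale", StandardScaler())]
    if sampler is not None:
        steps.append(("sample", sampler))
    steps.append(("model", LogisticRegression(max_iter=1000)))
    scores = cross_val_score(Pipeline(steps), X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")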
Stay tuned for more such interesting topics on data science and machine learning!