Exploring Grid Search For Hyperparameter Tuning In Random Forest Classification

Introduction

In this blog post, we will explore Grid Search for hyperparameter tuning of a Random Forest Classification model in Python. Grid Search evaluates every combination of values in a grid of hyperparameters that you specify and keeps the combination with the best cross-validated score. We will use the scikit-learn library, which provides both RandomForestClassifier and GridSearchCV.
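To make the idea concrete before using GridSearchCV, here is a minimal hand-rolled sketch of what a grid search does. The two-parameter grid and the variable names are illustrative assumptions, not part of the walkthrough below.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Hypothetical, deliberately tiny grid (assumption for illustration only).
n_estimators_options = [10, 50]
max_depth_options = [None, 3]

# Score every combination with 5-fold cross-validation.
scores = {}
for n_estimators, max_depth in product(n_estimators_options, max_depth_options):
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=42
    )
    scores[(n_estimators, max_depth)] = cross_val_score(model, X, y, cv=5).mean()

# Keep the combination with the highest mean cross-validated accuracy.
best_combo = max(scores, key=scores.get)
print("Best (n_estimators, max_depth):", best_combo)
print("Mean CV accuracy:", scores[best_combo])
```

GridSearchCV automates this loop and adds parallelism, bookkeeping of the results, and a final refit of the best model.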

Importing necessary libraries

For this demonstration, we need to import the necessary libraries:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

Loading and preprocessing the data

We will use the iris dataset for this example. It contains 150 flowers, each described by four measurements (sepal length, sepal width, petal length, and petal width) and labeled with one of three iris species.

iris = load_iris()
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['species'] = iris.target

After loading the data, we will split it into training and test sets using the train_test_split function from scikit-learn.

X = data.iloc[:, 0:4]
y = data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
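As a quick sanity check on the split, the sketch below prints the resulting shapes and the class counts in the training set. The stratify=y variant is an optional addition not used in this post; it keeps the three species in equal proportion in both sets.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="species")

# Same split as in the post: 30% of the 150 rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Training class counts:\n", y_train.value_counts().sort_index())

# Optional variant (an assumption, not in the original post): stratified split.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print("Stratified training class counts:\n", y_train_s.value_counts().sort_index())
```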

Creating a baseline Random Forest Classifier

To have a reference point for judging whether Grid Search actually helps, we first train a baseline Random Forest Classifier with default hyperparameters and measure its accuracy on the test set.

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Baseline accuracy:", accuracy_score(y_test, y_pred))
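A single test-set score can be noisy on a dataset this small. As a complement, the following sketch estimates the baseline with 5-fold cross-validation on the training data, which is the same kind of score GridSearchCV reports. The use of cross_val_score here is an addition for illustration, not a step from the original walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Default hyperparameters, scored on 5 folds of the training set only.
baseline = RandomForestClassifier(random_state=42)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=5)

print("Fold accuracies:", cv_scores)
print("Mean CV accuracy: %.3f (std %.3f)" % (cv_scores.mean(), cv_scores.std()))
```

Comparing this mean with grid_search.best_score_ later gives a like-for-like view of what tuning changed.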

Implementing Grid Search

Now, let's implement Grid Search to find the best hyperparameter combination for our Random Forest Classifier.

param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 5, 10],
}

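# Exhaustive, 5-fold cross-validated search over every combination in param_grid.
# n_jobs=-1 uses all available CPU cores; verbose=2 prints progress for each fit.
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=5,
    verbose=2,
    n_jobs=-1,
)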
grid_search.fit(X_train, y_train)

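Here, param_grid is a dictionary that maps each hyperparameter name to the list of values we would like to test. In this case, we are testing combinations of n_estimators, max_depth, min_samples_split, and min_samples_leaf. With four values for each of the four hyperparameters, the grid contains 4 × 4 × 4 × 4 = 256 combinations, and with cv=5 each one is trained and scored five times, for 1,280 fits in total. GridSearchCV is a class rather than a function: calling fit runs the cross-validated search and, by default, refits the best combination on the whole training set.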
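Because the cost of a grid search grows multiplicatively with each added value, it can help to count the candidates before calling fit. The sketch below uses scikit-learn's ParameterGrid for that; the variable names n_candidates, n_folds, and total_fits are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 5, 10],
}

# ParameterGrid enumerates every combination that GridSearchCV would try.
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
total_fits = n_candidates * n_folds

print("Candidates:", n_candidates)
print("Cross-validation fits:", total_fits)
```

If the count is too large for the time available, trimming the value lists or switching to RandomizedSearchCV are common ways to reduce it.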

Evaluating the best model

Now, let us evaluate how the Random Forest Classifier with the best hyperparameters found by the search performs on the held-out test set.

best_clf = grid_search.best_estimator_
y_pred = best_clf.predict(X_test)
print("Best accuracy:", accuracy_score(y_test, y_pred))
print("Best hyperparameters:", grid_search.best_params_)
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
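Beyond the single best result, the fitted search object records a score for every candidate in cv_results_. The sketch below loads that into a DataFrame and ranks the candidates. To keep it quick it uses a reduced, hypothetical grid (small_grid) rather than the full one above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Reduced grid for illustration only (assumption, not the post's grid).
small_grid = {'n_estimators': [10, 50], 'max_depth': [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)
search.fit(X_train, y_train)

# One row per candidate, ordered from best to worst mean CV score.
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
top = results[columns].sort_values('rank_test_score')
print(top.to_string(index=False))

print("Best mean CV accuracy:", search.best_score_)
print("Test accuracy of refit model:", search.score(X_test, y_test))
```

Looking at std_test_score alongside the mean shows whether the top candidates are meaningfully different or within noise of each other.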

Conclusion

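In this blog post, we explored Grid Search for hyperparameter tuning of a Random Forest Classification model. Comparing the baseline accuracy with the tuned model's accuracy on the same test set shows how much the search actually helped; on a small, well-separated dataset like iris, the default hyperparameters are often already strong, so the gain may be modest. The same technique applies to other scikit-learn estimators: pass GridSearchCV a different estimator and a parameter grid that matches it.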
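As one illustration of reusing the technique with another model, here is a minimal sketch that swaps in a support vector classifier. The choice of SVC and the values in svc_grid are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hypothetical grid for SVC: the keys must be valid SVC constructor arguments.
svc_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
svc_search = GridSearchCV(SVC(), svc_grid, cv=5)
svc_search.fit(X_train, y_train)

print("Best SVC hyperparameters:", svc_search.best_params_)
print("Test accuracy:", svc_search.score(X_test, y_test))
```

Only the estimator and the grid change; the fitting and evaluation code stays the same.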