In this blog post, we will explore Grid Search for hyperparameter tuning of a Random Forest classification model in Python. Grid Search is a popular technique that exhaustively evaluates every combination of hyperparameter values from a predefined grid and keeps the best one. We will use the scikit-learn library to implement the Random Forest Classifier and GridSearchCV.
For this demonstration, we need to import the necessary libraries:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
We will use the iris dataset for this example, which includes the sepal length, sepal width, petal length, petal width, and the species of iris flowers.
iris = load_iris()
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['species'] = iris.target
After loading the data, we will split it into training and test sets using the train_test_split function from scikit-learn.
X = data.iloc[:, 0:4]
y = data['species']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
To judge whether Grid Search actually helps, we first train a baseline Random Forest Classifier with default hyperparameters and record its accuracy on the test set.
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Baseline accuracy:", accuracy_score(y_test, y_pred))
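A single train/test split on only 150 samples can give a noisy accuracy figure. As a rough sketch, the baseline could also be scored with 5-fold cross-validation on the training set, which makes it easier to compare with the cross-validated scores a grid search reports. The names `baseline` and `cv_scores` are illustrative and do not appear in the original code.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Recreate the same split used in the post so this block runs on its own.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Default-hyperparameter forest, scored with 5-fold CV on the training set only.
baseline = RandomForestClassifier(random_state=42)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=5, scoring="accuracy")

print("Per-fold accuracy:", cv_scores)
print("Mean CV accuracy: %.3f (+/- %.3f)" % (cv_scores.mean(), cv_scores.std()))
```

The test set is left untouched here, so it can still serve as a final check after tuning.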
Now, let's implement Grid Search to find the best hyperparameter combination for our Random Forest Classifier.
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 5, 10],
}
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=5,
    verbose=2,
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)
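Here, param_grid is a dictionary that maps each hyperparameter name to the list of values we would like to test. In this case, we are testing different combinations of n_estimators, max_depth, min_samples_split, and min_samples_leaf. The GridSearchCV class performs a cross-validated grid search: with cv=5 it scores every combination using 5-fold cross-validation on the training data and, by default, refits the best combination on the whole training set. Setting n_jobs=-1 runs the fits in parallel on all available CPU cores.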
Now, let us evaluate how the tuned Random Forest Classifier, refit with the best hyperparameters, performs on the test set.
best_clf = grid_search.best_estimator_
y_pred = best_clf.predict(X_test)
print("Best accuracy:", accuracy_score(y_test, y_pred))
print("Best hyperparameters:", grid_search.best_params_)
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
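Beyond the single best combination, a fitted search also stores the cross-validated score of every candidate in its `cv_results_` attribute. The following is a minimal sketch of how those results could be inspected. To keep it quick to run it uses a deliberately smaller grid (`small_grid`, an assumption for this example) than the one in the post, and the names `search` and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Recreate the same split used in the post so this block runs on its own.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Deliberately small grid (an assumption) so the example finishes quickly.
small_grid = {"n_estimators": [10, 50], "max_depth": [None, 10]}

search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)
search.fit(X_train, y_train)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to sort and read.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))

# best_score_ is the mean cross-validated accuracy of the winning candidate.
print("Best mean CV accuracy:", search.best_score_)
# score() uses the refitted best estimator on data the search never saw.
print("Held-out test accuracy:", search.score(X_test, y_test))
```

Looking at the full table shows how close the candidates are to each other, which helps decide whether the winning combination is meaningfully better or only ahead by noise.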
In this blog post, we explored the Grid Search technique for hyperparameter tuning in a Random Forest Classification model. Comparing the tuned model's test accuracy with the baseline shows whether the search was worth its cost; on a small, easily separable dataset such as iris, the difference may be slight. The same technique can be applied to other machine learning models using scikit-learn.
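As a rough illustration of that last point, the sketch below applies the same pattern to a support vector classifier. Only the estimator and the grid change. The grid values and the names `svc_grid` and `svc_search` are hypothetical choices for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Recreate the same split used in the post so this block runs on its own.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hypothetical grid for an SVM: regularization strength and kernel type.
svc_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Same GridSearchCV usage as before, with a different estimator.
svc_search = GridSearchCV(SVC(), svc_grid, cv=5)
svc_search.fit(X_train, y_train)

print("Best hyperparameters:", svc_search.best_params_)
print("Best mean CV accuracy:", svc_search.best_score_)
print("Test accuracy:", svc_search.score(X_test, y_test))
```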