Understanding Decision Trees For Machine Learning

Introduction

Decision Trees are among the simplest and most popular machine learning algorithms. Their interpretability, where each decision can be traced and logically explained, makes them readily accepted even by people without much statistical background.

In this blog post, we will demonstrate a simple Python implementation of a Decision Tree Classifier using the well-known Iris dataset.

Decision Trees

Decision Trees are possibly one of the most basic types of machine learning algorithms. They belong to the class of supervised learning algorithms and can be used to solve both classification and regression problems.

The purpose of a decision tree is to split your dataset so that, at the end, you have groups that are as homogeneous as possible.

How does a Decision Tree work?

Making strategic splits heavily affects a tree’s accuracy, and the decision criteria differ for classification and regression trees. Decision trees use several algorithms to decide how to split a node into two or more sub-nodes; each split aims to increase the homogeneity of the resulting sub-nodes. The most popular split criteria include:

  • Gini Index
  • Chi-Square
  • Information Gain
  • Reduction in Variance

For simplicity, we will not dive deep into these algorithms; instead, we will understand how decision trees work through a Python implementation.
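To give some intuition for the first criterion in the list above, here is a minimal sketch of computing the Gini impurity of a group of labels by hand (the helper function `gini` is our own illustration, not part of any library):

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A perfectly homogeneous group has impurity 0;
# an evenly mixed two-class group has impurity 0.5
left = np.array([0, 0, 0, 0])   # all one class
right = np.array([0, 1, 1, 0])  # 50/50 mix
print(gini(left))   # 0.0
print(gini(right))  # 0.5
```

A decision tree effectively searches for the split whose resulting sub-nodes have the lowest combined impurity.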

Python Implementation

Let's implement a basic Decision Tree Classifier using the Scikit-Learn library. We will use the Iris dataset for this example.

First, we will import necessary libraries.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

Then, we load our dataset and split it into training and test sets.

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Now, let's create the Decision Tree model, train it, make predictions, and evaluate performance.

# Create a Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train the Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict on the test dataset
y_pred = clf.predict(X_test)

# Model accuracy
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
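As an optional extra step beyond the code above, Scikit-Learn's export_text function can print the learned split rules, which illustrates the interpretability we highlighted in the introduction. A small sketch (here fitting a fresh, shallow tree on the full dataset just to keep the printout short):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree keeps the printed rules easy to read
clf = DecisionTreeClassifier(max_depth=2, random_state=1).fit(iris.data, iris.target)

# Each root-to-leaf path is a human-readable if/else rule
print(export_text(clf, feature_names=list(iris.feature_names)))
```

Every prediction the model makes can be traced through one of these printed paths.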

We have built a Decision Tree Classifier without specifying any hyperparameters. Since the Iris dataset is relatively simple, even this default model can achieve reasonable accuracy.

Conclusion

Decision Trees, being one of the simpler machine learning algorithms, have several uses and can offer a good starting point for more complex topics. Not only can they solve complex tasks, but their output is also very interpretable. These features make Decision Trees invaluable, both in standalone work and as a component of ensemble methods.

Indeed, learning Decision Trees thoroughly is a great stepping stone toward more complex algorithms.