Machine Learning With Naive Bayes Classifier

In today's blog, we are going to delve into a topic that blends statistics with technology: Machine Learning. More specifically, we are going to discuss and implement a very popular algorithm in Machine Learning, the Naive Bayes Classifier.

What Is a Naive Bayes Classifier?

Naive Bayes is a supervised learning algorithm based on Bayes' theorem. Despite its simplicity, it can often outperform more sophisticated classification methods.

The algorithm is called "naive" because it assumes that, given the class label, the occurrence of a certain feature is independent of the occurrence of every other feature.
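Concretely, for a class y and features x_1, ..., x_n, Bayes' theorem combined with this conditional-independence assumption gives

P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

and the classifier simply picks the class y that maximizes the right-hand side.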

Naive Bayes Classifier in Python

Let's look at a simple example of how to implement a Naive Bayes Classifier in Python. We will create a spam filter for emails, using the scikit-learn library, which has built-in classes for the common Naive Bayes variants.

First, we install and import the necessary libraries:

!pip install numpy
!pip install scikit-learn

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

Assume we have a set of emails and their labels:

emails = [
    'Hey, Can we meet tomorrow?',
    'Up to 20% discount on parking, exclusive offer just for you',
    'Are you available tomorrow?',
    'Win a million dollars!!',
    'Met Mr. XYZ today, was good'
]
labels = [0, 1, 0, 1, 0]  # 0 - Not Spam, 1 - Spam

Now, let's convert our email text into a matrix of token counts:

# Learn the vocabulary and encode each email as a vector of word counts
cv = CountVectorizer()
features = cv.fit_transform(emails)
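If you are curious what the vectorizer produced, you can peek at the learned vocabulary and the count matrix. This is just an illustrative check, and note that get_feature_names_out requires a reasonably recent scikit-learn (older versions used get_feature_names):

print(cv.get_feature_names_out())  # the words the vectorizer learned
print(features.toarray())          # one row of word counts per email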

Once our data is ready, we can split it into training and test sets and train our Naive Bayes Classifier:

# Hold out 20% of the data for testing
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

model = MultinomialNB()
model.fit(features_train, labels_train)
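To get a rough sense of the fit, we can score the model on the held-out split. With only five emails the test set is a single message, so treat this as a smoke test rather than a real evaluation:

from sklearn.metrics import accuracy_score

predictions = model.predict(features_test)
print(accuracy_score(labels_test, predictions))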

That's it! We can now use this model to predict whether a new email is spam or not.
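For example, a new message can be passed through the same vectorizer (transform, not fit_transform, so the vocabulary stays fixed) and then through the model. The email text here is made up purely for illustration:

# A hypothetical incoming email
new_email = ['Exclusive offer, win a discount today!']
new_features = cv.transform(new_email)
print(model.predict(new_features))  # 1 - Spam, 0 - Not Spam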

Conclusion

Naive Bayes is a fast, computationally inexpensive algorithm that is easy to implement. It has proven effective on many real-world problems, spam filtering among them. Although its assumption of feature independence rarely holds exactly, the classifier still performs well in practice.