In machine learning, the Support Vector Machine (SVM) is a popular supervised learning model for classification and regression tasks. With a suitable kernel, an SVM can separate classes that are linearly separable as well as classes that are not, which makes it a versatile choice.
In this blog post, we will explore how an SVM can be used for text classification specifically. If you have a large collection of text documents and want to sort them into predefined classes, an SVM is a solid model to reach for.
For this tutorial, we need a Python environment with the Scikit-learn library installed. If you don't have Scikit-learn installed, you can install it using pip:
pip install -U scikit-learn
For the purpose of this tutorial, let's use the classic '20 Newsgroups' dataset from Scikit-learn to classify text articles into 20 different categories.
Start by importing the necessary Python libraries.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.metrics import classification_report
Next, load the '20 Newsgroups' dataset and split it into training and test datasets.
newsgroups_dataset = datasets.fetch_20newsgroups(subset='all', shuffle=True)
X_train, X_test, y_train, y_test = train_test_split(
    newsgroups_dataset.data,
    newsgroups_dataset.target,
    test_size=0.2,
    random_state=42,
)
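Since machine learning models cannot work with raw text, we need to convert the text documents into feature vectors. Here we use TfidfVectorizer from Scikit-learn, which weights each term by how often it occurs in a document and how rare it is across the corpus. The vectorizer is fitted on the training set only, and the vocabulary it learns is then reused to transform the test set.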
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
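To see what the vectorizer does on a small scale, here is a minimal sketch. The two training sentences and the test sentence are made up for illustration and are not part of the 20 Newsgroups data. It shows that the vocabulary comes from the training documents only, so words that appear only in the test document are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the vectorizer's behavior.
toy_train_docs = [
    "the rocket reached orbit",
    "the goalie stopped the puck",
]
toy_test_docs = ["the rocket stopped near the asteroid"]

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training docs.
toy_train_matrix = toy_vectorizer.fit_transform(toy_train_docs)
# transform reuses that vocabulary; unseen words ("near", "asteroid") are dropped.
toy_test_matrix = toy_vectorizer.transform(toy_test_docs)

print(sorted(toy_vectorizer.vocabulary_))  # terms learned from the training docs
print(toy_train_matrix.shape)              # (number of docs, vocabulary size)
print(toy_test_matrix.shape)               # same number of columns as training
```

Both matrices are sparse, which matters on the real dataset, where the vocabulary is far larger than any single document.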
Now, create an SVM model and train it on the training data.
model = svm.SVC(kernel='linear')
model.fit(X_train_vectorized, y_train)
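Training SVC can be slow on a corpus of this size, because its fit time grows at least quadratically with the number of samples. If that becomes a problem, one option (a swap-in that this post does not otherwise use) is Scikit-learn's LinearSVC, which is designed for linear SVMs on large, sparse inputs. It optimizes a slightly different objective than SVC(kernel='linear'), so results may differ a little. The sketch below uses a made-up six-document corpus just to show the API.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus with two clearly separated topics.
toy_docs = [
    "rocket launch orbit",
    "satellite orbit launch",
    "rocket satellite mission",
    "hockey puck goalie",
    "goalie ice hockey",
    "puck ice rink",
]
toy_labels = ["space", "space", "space", "hockey", "hockey", "hockey"]

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_docs)

# LinearSVC is a linear SVM; C is the same regularization knob as in SVC.
linear_model = LinearSVC(C=1.0)
linear_model.fit(toy_features, toy_labels)

new_docs = ["orbit launch rocket", "ice hockey goalie"]
toy_predictions = linear_model.predict(toy_vectorizer.transform(new_docs))
print(list(toy_predictions))
```

In the tutorial code, the equivalent change would be to replace svm.SVC(kernel='linear') with svm.LinearSVC() and keep the rest as is.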
Finally, predict labels for the test data with the trained model and evaluate its performance.
y_pred = model.predict(X_test_vectorized)
print(classification_report(y_test, y_pred, target_names=newsgroups_dataset.target_names))
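The report lists precision, recall and F1-score per class. To make those numbers easier to read, here is a small sketch on hand-written labels. The label lists and the class names class_a, class_b and class_c are invented for illustration. It prints the confusion matrix alongside the same metrics, so you can check them by hand.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels for six documents in three classes.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]
toy_class_names = ["class_a", "class_b", "class_c"]

# Rows are true classes, columns are predicted classes.
toy_confusion = confusion_matrix(y_true_toy, y_pred_toy)
print(toy_confusion)

# Fraction of documents whose predicted label matches the true label.
toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(toy_accuracy)

# output_dict=True returns the report as a nested dict instead of a string.
toy_report = classification_report(
    y_true_toy, y_pred_toy, target_names=toy_class_names, output_dict=True
)
print(toy_report["class_b"])
```

Here class_b has perfect recall, since both of its documents were found, but lower precision, because one class_a document was also labeled class_b.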
In this post, we walked through the entire process of using an SVM for text classification. While an SVM is already a strong classifier, tuning hyperparameters such as C, the kernel and gamma can improve its performance further. Note that gamma only affects non-linear kernels such as RBF, so it has no effect on the linear kernel used above.
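One common way to tune these hyperparameters is a cross-validated grid search. The following is a minimal sketch, not a recommended configuration: the eight-document corpus, the step names tfidf and svm, and the candidate values in the grid are all assumptions chosen to keep the example small. Putting the vectorizer inside a Pipeline means it is refitted on each training fold, so no information leaks from the validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: four documents per class.
toy_docs = [
    "rocket launch orbit",
    "satellite orbit launch",
    "rocket satellite mission",
    "mission launch rocket",
    "hockey puck goalie",
    "goalie ice hockey",
    "puck ice rink",
    "rink hockey puck",
]
toy_labels = ["space"] * 4 + ["hockey"] * 4

# Step names ("tfidf", "svm") are arbitrary; they prefix the grid keys below.
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# gamma is only searched for the RBF kernel, since the linear kernel ignores it.
toy_param_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.1, 1, 10]},
    {"svm__kernel": ["rbf"], "svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.1, 1]},
]

# cv=2 only because the toy corpus is tiny; larger data would use more folds.
toy_search = GridSearchCV(toy_pipeline, toy_param_grid, cv=2, scoring="accuracy")
toy_search.fit(toy_docs, toy_labels)

print(toy_search.best_params_)
print(toy_search.best_score_)
```

On the full 20 Newsgroups data, the same pattern applies with the raw training texts and labels passed to fit, though each extra grid value multiplies the training time.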