Applying Topic Modeling For Text Classification Using Latent Dirichlet Allocation

Introduction to Topic Modeling

Topic modeling is an unsupervised machine learning technique that automatically discovers the topics present in a collection of texts and assigns documents to them. Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling techniques and works well on large datasets.

In this blog post, we will explore LDA for topic modeling and apply it to text classification using Python's Gensim library.

Dependencies Installation

Before we start, ensure you have the following libraries installed in your Python environment:

pip install gensim
pip install nltk
pip install scikit-learn

Dataset Preparation

We will be using the 20newsgroups dataset for this example. First, import the necessary libraries and load the 20newsgroups dataset.

from sklearn.datasets import fetch_20newsgroups
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import nltk

nltk.download('stopwords')

# Load dataset
newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
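
As a quick sanity check, we can look at what we just loaded; the count in the comment below is what subset='all' typically yields, though it may vary slightly across scikit-learn versions.

# Quick sanity check on the loaded corpus
print(len(newsgroups_data.data))          # number of documents (~18,846 for subset='all')
print(newsgroups_data.target_names[:3])   # a few of the 20 category names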

Now, we preprocess the dataset by lowercasing the text, tokenizing it, and removing stopwords.

tokenizer = RegexpTokenizer(r'\w+')
en_stop = set(stopwords.words('english'))

def preprocess_data(text):
    # Lowercase and tokenize on word characters
    tokens = tokenizer.tokenize(text.lower())
    # Drop English stopwords
    stopped_tokens = [i for i in tokens if i not in en_stop]
    return stopped_tokens

texts = [preprocess_data(text) for text in newsgroups_data.data]
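
To see what the preprocessing does, we can run it on a short example sentence (the sentence below is our own illustration, not part of the dataset):

# Example: lowercase, tokenize, and drop stopwords
print(preprocess_data("The quick brown fox jumps over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']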

Building the LDA Model

First, we create a dictionary (id2word) and a bag-of-words representation of our dataset. Then, we train the LDA model using the Gensim library.

# Create a dictionary (id2word)
id2word = Dictionary(texts)

# Filter out words that occur in fewer than 20 documents
id2word.filter_extremes(no_below=20)

# Create a bag-of-words representation of our dataset
corpus = [id2word.doc2bow(text) for text in texts]

# Train the LDA model
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=20, random_state=42)
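
To get a rough sense of how interpretable the learned topics are, we can score them with Gensim's CoherenceModel. This evaluation step is our own addition to the walkthrough, and c_v is just one of several coherence measures it supports.

from gensim.models import CoherenceModel

# Score topic interpretability with the c_v coherence measure
coherence_model = CoherenceModel(model=lda_model, texts=texts,
                                 dictionary=id2word, coherence='c_v')
print(f"Coherence score: {coherence_model.get_coherence():.3f}")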

Analyzing the Results

We can then print the top 10 words for each topic to explore the clusters the model has found.

# With formatted=False, show_topics returns (topic_id, [(word, probability), ...])
for topic_num, topic in lda_model.show_topics(num_topics=20, num_words=10, formatted=False):
    print(f"Topic {topic_num}: {[word for word, _ in topic]}")
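
If the word probabilities are also of interest, show_topic returns (word, weight) pairs for a single topic; topic 0 below is an arbitrary pick for illustration.

# Inspect one topic together with its word weights
for word, weight in lda_model.show_topic(0, topn=10):
    print(f"{word}: {weight:.4f}")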

Now let's use our LDA model to classify new texts into these topics.

def classify_text(text):
    # Preprocess and convert to bag-of-words, then infer topic proportions
    preprocessed_text = preprocess_data(text)
    bow = id2word.doc2bow(preprocessed_text)
    topics = lda_model.get_document_topics(bow)
    # Pick the topic with the highest probability
    topics.sort(key=lambda x: x[1], reverse=True)
    print(f"Predicted topic: {topics[0][0]}")
    return topics[0][0]

# Sample text for classification
sample_text = "AI technologies are taking over the world with their powerful applications and impact on various industries."
predicted_topic = classify_text(sample_text)
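
A bare topic id is not very readable on its own, so a small helper (our own addition, not part of the original example) can map the prediction back to that topic's top words:

# Hypothetical helper: map a predicted topic id back to its most probable words
def topic_keywords(topic_id, topn=5):
    return [word for word, _ in lda_model.show_topic(topic_id, topn=topn)]

print(f"Topic {predicted_topic} keywords: {topic_keywords(predicted_topic)}")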

Conclusion

By combining Latent Dirichlet Allocation (LDA) with the Gensim library, we can apply topic modeling to text classification tasks. This is particularly useful for understanding the main themes and topics present in large amounts of unstructured text data. With these techniques at our disposal, we are better equipped to process and analyze text efficiently.