Topic modeling is an unsupervised machine learning technique that automatically identifies the topics present in a collection of texts and groups the texts accordingly. Latent Dirichlet Allocation (LDA) is a widely used topic modeling technique that is often applied to large text collections.
In this blog post, we will explore LDA for topic modeling and apply it to text classification with Python's Gensim library.
Before we start, ensure you have the following libraries installed in your Python environment:
pip install gensim
pip install nltk
pip install scikit-learn
We will be using the 20 Newsgroups dataset for this example, which scikit-learn can download for us. First, import the necessary libraries and load the dataset.
from sklearn.datasets import fetch_20newsgroups
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import nltk

nltk.download('stopwords')

# Load dataset
newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
Now we preprocess the dataset by lowercasing the text, tokenizing it, and removing stopwords.
tokenizer = RegexpTokenizer(r'\w+')
en_stop = set(stopwords.words('english'))

def preprocess_data(text):
    # Lowercase, split into word tokens, then drop English stopwords
    tokens = tokenizer.tokenize(text.lower())
    stopped_tokens = [token for token in tokens if token not in en_stop]
    return stopped_tokens

texts = [preprocess_data(text) for text in newsgroups_data.data]
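To see what this preprocessing does to a single sentence, here is a minimal standard-library sketch. It assumes `re.findall(r"\w+", ...)` as a stand-in for `RegexpTokenizer(r'\w+')`. It also uses a tiny hand-picked stopword set (`TOY_STOPWORDS`, a hypothetical name) in place of NLTK's English list, so the real pipeline would likely drop more words than shown here.

```python
import re

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "on"}

def preprocess_sketch(text):
    """Lowercase, tokenize on runs of word characters, drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in TOY_STOPWORDS]

example_tokens = preprocess_sketch("The GPU is faster than the CPU, and it's cheaper!")
print(example_tokens)
# ['gpu', 'faster', 'than', 'cpu', 'it', 's', 'cheaper']
```

Note that the `\w+` pattern discards punctuation and splits contractions such as "it's" into two tokens, which is one reason a stopword list matters after tokenizing.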
First, we create a dictionary (id2word) and a bag-of-words representation of our dataset. Then, we train the LDA model using the Gensim library.
# Create a dictionary (id2word)
id2word = Dictionary(texts)

# Filter out words that occur in fewer than 20 documents
# (the default no_above=0.5 also drops words found in more than half of the documents)
id2word.filter_extremes(no_below=20)

# Create a bag-of-words representation of our dataset
corpus = [id2word.doc2bow(text) for text in texts]

# Train the LDA model
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=20, random_state=42)
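For intuition about what the dictionary and the bag-of-words step produce, here is a pure-Python sketch on a toy corpus. It is not Gensim's implementation: `build_vocab` and `to_bow` are hypothetical helper names, and ids are assigned in first-seen order, which may differ from the ids Gensim assigns.

```python
from collections import Counter

def build_vocab(tokenized_docs, no_below=1):
    """Map each token that appears in at least `no_below` documents to an integer id."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if doc_freq[token] >= no_below and token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

toy_docs = [
    ["space", "nasa", "launch", "nasa"],
    ["hockey", "team", "game"],
    ["nasa", "launch", "game"],
]
toy_vocab = build_vocab(toy_docs, no_below=2)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)   # {'nasa': 0, 'launch': 1, 'game': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
```

Each document becomes a list of (token id, count) pairs, and word order is discarded. Tokens that appear in only one toy document ("space", "hockey", "team") are filtered out, mirroring the role of `no_below` on a much smaller scale.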
We can then print out the top 10 words for each topic to explore the topic clustering.
for topic_num, topic in lda_model.show_topics(num_topics=20, formatted=False, num_words=10):
    # With formatted=False, each entry in topic is a (word, probability) pair
    print(f"Topic {topic_num}: {[word for word, _ in topic]}")
Now let's use our LDA model to classify new texts by assigning each one to its most probable topic.
def classify_text(text):
    preprocessed_text = preprocess_data(text)
    bow = id2word.doc2bow(preprocessed_text)
    # List of (topic_id, probability) pairs for this document
    topics = lda_model.get_document_topics(bow)
    topics.sort(key=lambda x: x[1], reverse=True)
    print(f"Predicted topic: {topics[0][0]}")
    return topics[0][0]

# Sample text for classification
sample_text = "AI technologies are taking over the world with their powerful applications and impact on various industries."
predicted_topic = classify_text(sample_text)
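Because LDA yields a probability distribution over topics rather than a single label, it can help to check how strong the top topic is before trusting the prediction. The sketch below assumes its input is a list of (topic_id, probability) pairs like the one `get_document_topics` returns. The helper name `dominant_topic` and the `min_prob` threshold of 0.3 are hypothetical choices for illustration, not part of Gensim.

```python
def dominant_topic(topic_probs, min_prob=0.3):
    """Return (topic_id, probability) for the most likely topic,
    or None when the list is empty or the best probability is below min_prob."""
    if not topic_probs:
        return None
    best = max(topic_probs, key=lambda pair: pair[1])
    return best if best[1] >= min_prob else None

confident = dominant_topic([(3, 0.62), (7, 0.25), (11, 0.13)])
uncertain = dominant_topic([(0, 0.2), (1, 0.2), (2, 0.2), (3, 0.2), (4, 0.2)])
print(confident)  # (3, 0.62)
print(uncertain)  # None
```

In the code above, the same check could be applied as `dominant_topic(lda_model.get_document_topics(bow))`. A suitable threshold depends on the number of topics and the length of the texts, so it is worth tuning on a few known examples.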
By using Latent Dirichlet Allocation (LDA) and the Gensim library, we can apply topic modeling to text classification tasks. This is particularly useful for understanding the main themes present in large amounts of unstructured text data. Keep in mind that LDA topics are unlabeled: the predicted topic is just an index, and its meaning comes from inspecting its top words. With these techniques at our disposal, we are better equipped to process and analyze text efficiently.