Topic modelling is a fascinating area of Machine Learning that deals with identifying the major topics in a large collection of documents. It's like creating a table of contents for a massive book covering everything mentioned in the documents. One popular method for performing such topic modelling is Latent Dirichlet Allocation (LDA).
LDA is a generative probabilistic model that assumes each document is produced from a mixture of topics, where each topic emits words according to its own probability distribution. Given a collection of textual data, LDA works backwards, inferring the topic and word distributions that most plausibly generated the data.
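To make that generative story concrete, here is a toy simulation of it (plain NumPy, not Gensim code). The vocabulary and topic-word probabilities below are made-up numbers chosen purely for illustration.

# Python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
vocabulary = ["python", "learning", "library", "data"]
topic_word_probs = np.array([
    [0.5, 0.3, 0.1, 0.1],   # topic 0 favours "python"/"learning"
    [0.1, 0.1, 0.4, 0.4],   # topic 1 favours "library"/"data"
])

# Draw a per-document topic mixture from a Dirichlet prior, then sample
# each word by first picking a topic, then a word from that topic.
doc_topic_mix = rng.dirichlet(alpha=[1.0, 1.0])
words = []
for _ in range(8):
    topic = rng.choice(2, p=doc_topic_mix)
    words.append(rng.choice(vocabulary, p=topic_word_probs[topic]))
print(" ".join(words))

LDA's job is exactly the reverse of this loop: given only the words, recover plausible values for topic_word_probs and each document's topic mixture.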
We'll build a basic model using Python's Gensim library and apply it to a collection of documents.
# Python
from gensim import corpora
from gensim.models.ldamodel import LdaModel

# Collection of documents
documents = ["Machine Learning is great",
             "Python is an awesome language for Machine Learning",
             "Gensim is a useful library",
             "..."]

# Tokenize the documents.
tokenized_documents = [doc.split() for doc in documents]

# Create a Gensim dictionary from the tokenized data.
dictionary = corpora.Dictionary(tokenized_documents)

# Convert tokenized documents into a bag-of-words corpus.
corpus = [dictionary.doc2bow(doc) for doc in tokenized_documents]

# Generate LDA model
ldamodel = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)

topics = ldamodel.show_topics()
for topic in topics:
    print(topic)
The output is a set of topics, each described by a collection of keywords. Each keyword carries a weight that indicates how relevant it is to that particular topic. Since we asked for 3 topics, show_topics() lists all three, each with its most heavily weighted keywords.
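Once trained, the model can also assign topic proportions to documents it has never seen. Here is a minimal sketch, reusing the dictionary and ldamodel from above; the example sentence is made up for illustration.

# Python
# Tokenize the new document the same way as the training data.
new_doc = "Machine Learning with Python"
new_bow = dictionary.doc2bow(new_doc.split())

# get_document_topics returns (topic_id, probability) pairs for the document.
for topic_id, prob in ldamodel.get_document_topics(new_bow):
    print(topic_id, prob)

Words that never appeared in the training data are simply ignored by doc2bow, so the inferred mixture reflects only the vocabulary the model knows.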
LDA is a powerful approach for automatically making sense of large volumes of text. From analyzing customer reviews to summarizing news articles, it can be applied to an array of use cases involving unstructured text data. However, the quality of the output depends heavily on the quality of the input: thorough text pre-processing and a well-chosen number of topics can greatly improve the results. Keep in mind that LDA doesn't understand context, so if your documents rely on sarcasm, double negatives, or similar language complications, it may not perform well.
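As a starting point for both of those knobs, Gensim ships its own pre-processing helpers, and its CoherenceModel is one common yardstick for comparing topic counts. The sketch below reuses documents, tokenized_documents, dictionary, and ldamodel from the example above; in a real pipeline you would rebuild the dictionary and retrain the model on the cleaned tokens before scoring.

# Python
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models import CoherenceModel

# Lowercase, tokenize, and drop common stop words before building the dictionary.
cleaned_documents = [
    [token for token in simple_preprocess(doc) if token not in STOPWORDS]
    for doc in documents
]

# One common way to pick num_topics: train models with different topic counts
# and keep the one with the highest coherence score.
coherence = CoherenceModel(
    model=ldamodel, texts=tokenized_documents, dictionary=dictionary, coherence="c_v"
).get_coherence()
print(coherence)

On a corpus this tiny the coherence score is not meaningful, but on a realistic collection, plotting it against a range of topic counts gives a principled way to choose num_topics instead of guessing.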