Natural Language Processing (NLP) is a field of artificial intelligence that deals with the interaction between computers and human language. It has become an essential part of many communication and media analysis applications. In this blog post, we'll take on a small but interesting project: analyzing the text of Shakespeare's works with NLP to find the most common words and how they relate to one another. We'll use Python as our programming language, along with the NLTK and spaCy libraries and BeautifulSoup for parsing the downloaded file.
For our project, we'll need the complete text of Shakespeare's works, so our first step is to download it. Since Shakespeare's works are in the public domain, there are no copyright issues, and the full text is freely available from Project Gutenberg: https://www.gutenberg.org/cache/epub/100/pg100.xml
To download the file, we'll use the `requests` library and extract the text with `BeautifulSoup`.
```python
import requests
from bs4 import BeautifulSoup

url = "https://www.gutenberg.org/cache/epub/100/pg100.xml"
response = requests.get(url)

# Parse the XML and extract the raw text
# (the 'xml' parser requires the lxml package to be installed)
soup = BeautifulSoup(response.content, 'xml')
shakespeare_text = soup.get_text()
```
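One caveat: Project Gutenberg files wrap the actual text in licensing boilerplate, which would skew our word counts. Here is a minimal sketch for trimming it, assuming the standard `*** START OF ... ***` / `*** END OF ... ***` markers appear in this file (the exact wording varies between Gutenberg files, so the code guards against missing markers):

```python
# Trim the Project Gutenberg header and footer, if present.
# The exact marker wording varies, so match a common prefix.
start_marker = "*** START OF"
end_marker = "*** END OF"

start = shakespeare_text.find(start_marker)
end = shakespeare_text.find(end_marker)
if start != -1 and end != -1:
    # Skip past the start marker's own line
    start = shakespeare_text.find("\n", start) + 1
    shakespeare_text = shakespeare_text[start:end]
```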
Now that we have our text, let's start pre-processing it. Pre-processing is an important step in any NLP project: it cleans, normalizes, and tokenizes the text before analysis. For this project, we'll use the following pre-processing steps:

- Removing punctuation and lowercasing
- Tokenization
- Stopword removal
- Stemming and lemmatization
Let's import the necessary libraries:
```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import spacy

nltk.download('punkt')
nltk.download('stopwords')
```
Now let's define our pre-processing function:
```python
def preprocess(text):
    # Remove punctuation and convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation)).lower()
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Perform stemming (in practice you would usually choose either
    # stemming or lemmatization, not both, but we show both here)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]

    # Perform lemmatization; the parser and NER components are
    # disabled since we only need lemmas
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    joined = ' '.join(tokens)
    # spaCy caps input at 1,000,000 characters by default; the
    # complete works exceed that, so raise the limit accordingly
    nlp.max_length = len(joined) + 1
    doc = nlp(joined)
    tokens = [token.lemma_ for token in doc]
    return tokens
```
Let's preprocess our Shakespeare text:
```python
processed_text = preprocess(shakespeare_text)
```
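As a quick sanity check before counting anything, we can look at how many tokens we ended up with and peek at a few of them:

```python
# Quick sanity check on the preprocessing output
print(len(processed_text))    # total number of tokens
print(processed_text[:20])    # first 20 processed tokens
```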
To find the most common words in the processed text, we can use the `Counter` class from the `collections` module in Python's standard library.
```python
from collections import Counter

word_count = Counter(processed_text)
most_common_words = word_count.most_common(10)
print(most_common_words)
```
This will print the top 10 most common words in Shakespeare's works as `(word, count)` pairs. Keep in mind that, because of stemming and lemmatization, these are normalized forms rather than the exact words as they appear in the text.
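We also promised to look at how words relate to one another. One simple way to approximate that is with NLTK's bigram collocation finder, which surfaces pairs of tokens that occur together more often than chance would suggest. A minimal sketch using the `processed_text` tokens from above (note that stopword removal means "adjacent" here ignores the removed words):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Find pairs of tokens that co-occur unusually often
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(processed_text)
finder.apply_freq_filter(5)  # ignore pairs that occur fewer than 5 times
print(finder.nbest(bigram_measures.pmi, 10))
```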
To visualize the results, we can use `matplotlib`, a popular data visualization library.
```python
import matplotlib.pyplot as plt

words, counts = zip(*most_common_words)
plt.bar(words, counts)
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top 10 Most Common Words in Shakespeare Texts')
plt.show()
```
This will display a bar chart depicting the frequency of the most common words in Shakespeare's works.
In this blog post, we've analyzed the text of Shakespeare's works using NLP techniques. The pre-processing steps we took, removing punctuation, lowercasing, tokenization, stopword removal, and stemming/lemmatization, all help reduce noise and normalize the vocabulary before counting. We've also visualized the most common words in Shakespeare's works with a simple bar chart built with Python's `matplotlib` library.