Intro to natural language processing with Python
The limits of my language mean the limits of my world. - Ludwig Wittgenstein
Computers speak their own language, the binary language. Thus, they are limited in how they can interact with us humans; expanding their language and understanding our own is crucial to set them free from their boundaries.
NLP is an abbreviation for natural language processing, which encompasses a set of tools, routines, and techniques computers can use to process and understand human communication. It should not be confused with speech recognition: NLP deals with understanding the meaning of words rather than converting audio signals into those words.
If you think NLP is just a futuristic idea, you may be surprised to learn that most of us interact with it every day: when we run a Google search, when we use an online translator, and when we talk to Google Assistant or Siri. NLP is everywhere, and adding it to your own projects is now well within reach thanks to libraries such as NLTK, which hide much of the complexity.
In this article, we’ll discuss how to work with NLP using the NLTK library with Python.
Setting up the Environment
To make things easier, we will use Jupyter notebooks with Google Colab, and you can follow every step by accessing the companion source code here.
Once you have a Jupyter notebook, or any other environment of your choice, set up, make sure you install the nltk library with the following command:
!pip3 install nltk
Note that if you are not in a Jupyter notebook environment, you won’t need the ! at the beginning.
NLTK is a huge library that provides a lot of different tools to work with language. While some functions are available with the library itself, some modules require additional downloads.
punkt is a module to work with tokenization, which is the process of separating a paragraph into chunks or words, and it’s usually a first step in the process of text analysis.
Before starting, make sure you download the module:
import nltk
# Recent NLTK releases may ask for 'punkt_tab' instead; if so, use the name shown in the error message
nltk.download('punkt')
Now, let’s see it in action:
from nltk.tokenize import word_tokenize
text = "Good morning, How you doing? Are you coming tonight?"
tokenized = word_tokenize(text)
print(tokenized)
['Good', 'morning', ',', 'How', 'you', 'doing', '?', 'Are', 'you', 'coming', 'tonight', '?']
The first function, word_tokenize, splits a text into words and symbols. However, there’s more you can do with punkt, such as separating a paragraph into sentences:
from nltk.tokenize import sent_tokenize
text = "Good morning, How you doing? Are you coming tonight?"
sentences = sent_tokenize(text)
print(sentences)
['Good morning, How you doing?', 'Are you coming tonight?']
If the first example wasn’t very impressive, this one definitely is. Here we start seeing a much more intelligent method that tries to split the text into simpler meaningful chunks.
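To see why sentence splitting is harder than it looks, here is a minimal sketch of a naive splitter built on a regular expression. The function name naive_sentences is hypothetical and this is not how punkt works internally; it only illustrates the kind of mistake a trained tokenizer is designed to avoid.

```python
import re


def naive_sentences(text):
    """Split after '.', '?' or '!' whenever whitespace follows (a deliberately naive rule)."""
    return re.split(r"(?<=[.?!])\s+", text.strip())


# Works on the simple example from the article
print(naive_sentences("Good morning, How you doing? Are you coming tonight?"))
# ['Good morning, How you doing?', 'Are you coming tonight?']

# Breaks on abbreviations, because every period looks like a sentence end
print(naive_sentences("Dr. Smith arrived at 5 p.m. yesterday. He was late."))
# ['Dr.', 'Smith arrived at 5 p.m.', 'yesterday.', 'He was late.']
```

The second call shows the weakness: the rule cannot tell an abbreviation from the end of a sentence, which is the kind of ambiguity sent_tokenize is meant to handle.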
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. In certain situations it’s important to ignore such words, so having a dictionary of them can be really handy, especially when we need to deal with multiple languages. NLTK provides a module to work with stop words. Let’s download it next:
import nltk
nltk.download('stopwords')
The stop words are a simple list of words, so we can operate on them very easily, for example by writing a small routine that prints only the words that are not stop words:
from nltk.corpus import stopwords
# Use a separate name so the imported stopwords module is not overwritten
english_stopwords = stopwords.words("english")
text = ["Good", "morning", "How", "you", "doing", "Are", "you", "coming", "tonight"]
# The comparison is case-sensitive, so capitalized words such as "How" are kept
for word in text:
    if word not in english_stopwords:
        print(word)
Good
morning
How
Are
coming
tonight
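Notice that “How” and “Are” survive the filter: the NLTK list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. The STOP_WORDS set here is a tiny hypothetical stand-in so the example runs without any download; in practice you would build it from stopwords.words("english").

```python
# Hypothetical, tiny stop-word set used only for illustration;
# in a real project, build it with set(stopwords.words("english")).
STOP_WORDS = {"how", "you", "doing", "are", "the", "a", "an", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["Good", "morning", "How", "you", "doing", "Are", "you", "coming", "tonight"]
print(remove_stop_words(tokens))
# ['Good', 'morning', 'coming', 'tonight']
```

Using a set instead of a list also makes each membership test faster, which can matter on long texts.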
Since we are given a simple list of words, we can simply print it to see all of them for a particular language:
from nltk.corpus import stopwords
english_stopwords = stopwords.words("english")
print(english_stopwords)
A word stem is the base or root form of a word. For example, the word “loving” has its root in “love”, and “being” in “be”. Stemming is the process of reducing a given word to its stem. This is a complex task: words can be written in many forms, and different words follow different rules to reach their stem. Thankfully, NLTK makes it really easy for us to achieve this. Let’s see how:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["Loving", "Chocolate", "Retrieved", "Being"]
for word in words:
    print(ps.stem(word))
love
chocol
retriev
be
Note that a stem such as “chocol” does not have to be a real word. This simplification can be very helpful in search engines: it prevents different forms of the same word from being missed when they are matched against the search criteria.
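The search use case can be sketched in a few lines. The toy_stem function below is a hypothetical, deliberately crude suffix stripper, not the Porter algorithm; it is only there so the example runs without NLTK. In a real project, ps.stem from the snippet above would take its place.

```python
def toy_stem(word):
    """Crude illustrative stemmer (NOT Porter): lowercase and strip one common suffix."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


documents = ["She loved chocolate", "Loving chocolates daily", "He retrieves data"]

# Build a tiny inverted index: stem -> set of document ids containing it
index = {}
for doc_id, text in enumerate(documents):
    for token in text.split():
        index.setdefault(toy_stem(token), set()).add(doc_id)

# The query is stemmed the same way, so "loves" matches "loved" and "Loving"
query = "loves"
print(sorted(index.get(toy_stem(query), set())))
# [0, 1]
```

Because both the documents and the query go through the same normalization, the search finds documents that never contain the literal word “loves”.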
Counting how many times each word appears can be very helpful in the context of text analysis. NLTK provides us with a neat class to calculate the frequency of words in a text, called FreqDist:
import nltk
words = ["men", "teacher", "men", "woman"]
# Use a separate name so the FreqDist class itself is not shadowed
freq_dist = nltk.FreqDist(words)
for word, count in freq_dist.items():
    print(word, "---", count)
men --- 2
teacher --- 1
woman --- 1
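FreqDist builds on Python’s collections.Counter, so the same counting idea can be sketched with the standard library alone. The snippet below is such a sketch, not a replacement for FreqDist; it also shows most_common, which sorts the words from most to least frequent.

```python
from collections import Counter

words = ["men", "teacher", "men", "woman"]

# Counter maps each distinct word to the number of times it appears
counts = Counter(words)

# most_common() yields (word, count) pairs, highest count first
for word, count in counts.most_common():
    print(word, "---", count)

# Ask for just the single most frequent word
print(counts.most_common(1))
# [('men', 2)]
```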
Oftentimes we see words being used together to give a specific meaning, for example “let’s go”, “best performance”, and others. In text analysis it is important to capture these words as pairs, because seeing them together can make a big difference in the comprehension of the text.
NLTK provides a few methods to do exactly that, and we will start with bigrams, a method that extracts pairs of adjacent words:
import nltk
words = "Learning python was such an amazing experience for me"
# Name the token list "tokens" so it does not shadow the word_tokenize function
tokens = nltk.word_tokenize(words)
print(list(nltk.bigrams(tokens)))
[('Learning', 'python'), ('python', 'was'), ('was', 'such'), ('such', 'an'), ('an', 'amazing'), ('amazing', 'experience'), ('experience', 'for'), ('for', 'me')]
Similarly, we can do the same for three words. Bigrams are pairs of words that occur next to each other, and trigrams are the same idea with three words, so there is almost no difference in the code:
print(list(nltk.trigrams(tokens)))
[('Learning', 'python', 'was'), ('python', 'was', 'such'), ('was', 'such', 'an'), ('such', 'an', 'amazing'), ('an', 'amazing', 'experience'), ('amazing', 'experience', 'for'), ('experience', 'for', 'me')]
N-grams generalize the previous two methods: they are sequences of words, letters, or symbols that appear together in a phrase or document, but here you can specify how many items each sequence contains. Let’s see an example:
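The snippet for this example appears to be missing from the text. With NLTK, the call would presumably be nltk.ngrams(tokens, 4), which takes the token list and the size of each group. The plain-Python sketch below uses a hypothetical helper, sliding_ngrams, to show the same sliding-window idea without any dependency; it is an assumption about what the original code did, based on the output that follows.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (a sliding window)."""
    # zip stops at the shortest slice, so incomplete windows at the end are dropped
    return list(zip(*(tokens[i:] for i in range(n))))


words = "Learning python was such an amazing experience for me"
# A plain split is enough here because the sentence has no punctuation
tokens = words.split()

# With NLTK installed, the equivalent call would be: list(nltk.ngrams(tokens, 4))
print(sliding_ngrams(tokens, 4))
```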
[('Learning', 'python', 'was', 'such'), ('python', 'was', 'such', 'an'), ('was', 'such', 'an', 'amazing'), ('such', 'an', 'amazing', 'experience'), ('an', 'amazing', 'experience', 'for'), ('amazing', 'experience', 'for', 'me')]
Though with the presented text the results may not seem very impressive, there are many use cases where n-grams can be used effectively, for example in spam detection.
Lemmatization is very similar to stemming. The difference is that stemming only strips the prefix or suffix of a word, which sometimes produces misspelled results, while lemmatization converts the word to its real base form. Before we look at an example, we need to download the WordNet package using NLTK:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
print(lem.lemmatize("believes"))
print(lem.lemmatize("stripes"))
When you run the code, it converts each word to its base form: “believes” becomes “belief” and “stripes” becomes “stripe”. A nice feature of WordNetLemmatizer is that its lemmatize method takes an argument called pos, which stands for “part of speech”, so you can ask for the lemma of the word as a verb or as an adjective. Let’s see an example:
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
# pos="v" asks for the lemma of each word treated as a verb
print(lem.lemmatize("believes", pos="v"))
print(lem.lemmatize("stripes", pos="v"))
Notice the difference in the results?
Back in school, you probably learned to categorize words into verbs, nouns, adjectives, and so on. Today that task is pretty trivial for us humans, but if computers are to understand human language, they need to understand these concepts too: they need to differentiate between an action and a target, a verb and a noun. NLTK provides us with POS (part of speech) tagging to categorize words.
It’s super easy to work with, so let’s look at the code:
import nltk
nltk.download('averaged_perceptron_tagger')
words = "Learning python was such an amazing experience for me"
tokens = nltk.word_tokenize(words)
print(nltk.pos_tag(tokens))
[('Learning', 'VBG'), ('python', 'NN'), ('was', 'VBD'), ('such', 'JJ'), ('an', 'DT'), ('amazing', 'JJ'), ('experience', 'NN'), ('for', 'IN'), ('me', 'PRP')]
The result is a list of words, each paired with its POS tag. The tags are abbreviations, and you can find a full reference in the table below:
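As a quick aid for reading the output above, here is a small sketch that spells out the tags appearing in this particular example. The TAG_MEANINGS dictionary is a hand-written, partial lookup based on the Penn Treebank tag set and covers only these seven tags; it is not part of NLTK.

```python
# Hand-written, partial lookup (Penn Treebank tag set); only the tags seen above
TAG_MEANINGS = {
    "VBG": "verb, gerund or present participle",
    "NN": "noun, singular or mass",
    "VBD": "verb, past tense",
    "JJ": "adjective",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "PRP": "personal pronoun",
}

# The tagged output shown above, copied as a literal so no download is needed
tagged = [('Learning', 'VBG'), ('python', 'NN'), ('was', 'VBD'), ('such', 'JJ'),
          ('an', 'DT'), ('amazing', 'JJ'), ('experience', 'NN'), ('for', 'IN'),
          ('me', 'PRP')]

# Print each word next to its tag and a readable description
for word, tag in tagged:
    print(f"{word:<12}{tag:<5}{TAG_MEANINGS.get(tag, 'unknown tag')}")
```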
Named Entity Recognition
Now we start working with more powerful functions. NER (named entity recognition) is used to capture all textual mentions of named entities. A named entity can be anything from a place or a person to an organization, an amount of money, etc.
Combined with other methods, this can be extremely powerful for answering questions such as “who is the president of the USA?” directly from text sources, without having the answer in a structured format. If you google often, you will see quite a lot of this in action.
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
text = "The russian president Vladimir Putin is in the Kremlin"
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)
# ne_chunk groups the tagged tokens into a tree with labeled entity chunks
named_entities = nltk.ne_chunk(pos_tags)
print(named_entities)
(S The/DT russian/JJ president/NN (PERSON Vladimir/NNP Putin/NNP) is/VBZ in/IN the/DT (FACILITY Kremlin/NNP))
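To turn that output into something easier to use, the sketch below pulls the labeled chunks out of the printed tree with a regular expression. The extract_entities helper is hypothetical and parses the text shown above rather than the live object; with the actual tree returned by ne_chunk, iterating over its subtrees and reading each label and its leaves would be the more direct route.

```python
import re

# The printed tree from above, copied as a string so no model download is needed
chunked = ("(S The/DT russian/JJ president/NN (PERSON Vladimir/NNP Putin/NNP) "
           "is/VBZ in/IN the/DT (FACILITY Kremlin/NNP))")


def extract_entities(tree_text):
    """Return (label, entity text) pairs for every innermost labeled chunk."""
    entities = []
    # Match "(LABEL word/TAG ...)" groups that contain no nested parentheses
    for label, body in re.findall(r"\((\w+) ([^()]+)\)", tree_text):
        # Drop the "/TAG" part of each token and keep the word
        words = [token.rsplit("/", 1)[0] for token in body.split()]
        entities.append((label, " ".join(words)))
    return entities


print(extract_entities(chunked))
# [('PERSON', 'Vladimir Putin'), ('FACILITY', 'Kremlin')]
```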
See how we are now combining the functions we learned to gather more information about the text, and to give a sentence more meaning by identifying the structure and entities in it.
So far we have worked with text to break it down into smaller units that we can then use for processing. Sentiment analysis deviates from that, as it’s a process to determine the sentiment, or emotional component, of a text. It is well known, for example, for classifying positive and negative reviews of apps, movies, etc.
Though it is possible to do sentiment analysis directly with NLTK by using the functions we have already learned, it is an unnecessarily tedious process. Thankfully, Python offers TextBlob, a text processing library built on top of NLTK that handles the complications of sentiment analysis for us.
It is super easy to use, as we will demonstrate next by analysing a tweet from the 46th president of the USA, Joe Biden:
!pip3 install textblob
from textblob import TextBlob
joe_biden_tweet = "Small businesses need relief, but many were muscled out of the way by big companies last year."
joe_biden = TextBlob(joe_biden_tweet)
print(joe_biden.sentiment)
The sentiment analysis results in 2 variables, the polarity and the subjectivity. The polarity is a value ranging between -1 and 1, with -1 being very negative and +1 very positive. The subjectivity ranges between 0 and 1, and refers to the person’s opinion, emotion or even judgement. The higher the number the more subjective the text is.
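For a task like sorting reviews into positive and negative, the polarity score usually needs to be turned into a label. Here is a minimal sketch of that step. The label_polarity function and its neutral_band threshold of 0.1 are assumptions chosen for illustration, not part of TextBlob, and the scores in the loop are made-up sample values rather than results from the tweet above.

```python
def label_polarity(polarity, neutral_band=0.1):
    """Map a polarity score in [-1, 1] to a coarse label.

    neutral_band is a hypothetical threshold: scores within
    +/- neutral_band of zero are treated as neutral.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# Made-up sample scores, only to show the mapping
for score in (-0.6, 0.05, 0.8):
    print(score, label_polarity(score))
```

With TextBlob, the value passed in would be the polarity field of the sentiment result; the right width for the neutral band depends on the data and is worth tuning.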
TextBlob has multiple uses, and spelling correction is another one of them. As simple as it sounds, TextBlob can help us eliminate spelling mistakes from our text. Let’s see it in action:
from textblob import TextBlob
text = "Smalle businesses neede relief"
spelling_mistakes = TextBlob(text)
print(spelling_mistakes.correct())
Small business need relief
As expected, TextBlob identified our mistakes and corrected them in the resulting text.
NLP is a complex and fascinating world. Today we introduced a few concepts and some code, but we barely scratched the surface of what can be done. NLTK is a huge library with tons of use cases and potential, and it’s worth reading about in detail. In further articles we will continue learning about NLP: the new trends, the new algorithms, and how AI will help us build machines that can interact seamlessly with humans.
Thanks for reading!