LSTM-Based Name Generator - First Dive into NLP


Having studied computer vision techniques in depth, we now turn to Natural Language Processing (NLP). Artificial intelligence is not limited to vision and to teaching computers how humans perceive visual data.

It is meant to empower technology to cover the whole range of skills and tasks that a human can perform. One of those skills is understanding language. Language is a powerful communication tool that keeps the world running as we know it. In a nutshell, language has carried the sum of human experience since the dawn of time. Without it, humans would be unable to convey their feelings, ideas, emotions, desires, and beliefs. There could be no civilization, and perhaps no religion, without language.

This takes us to the world of Natural Language Processing (NLP). Every internet user has used an NLP program. Natural language processing is used by search engines like Google and Bing to suggest possible search requests.

When users begin typing search parameters, search engines attempt to fill in the blanks for them. Users can choose from the pre-defined criteria or type in their own query. NLP uses are not limited to search engines.

NLP is used by voice-activated devices like Siri and Alexa to process language. NLP is used by chatbots to provide more accurate replies to end-user inquiries. For better data sets, the technique may be utilized to extract essential information from unstructured data. For firms that use NLP, there are several evident benefits.

Businesses deal with a lot of unstructured, text-heavy data and require a means to handle it quickly. Natural human language makes up a substantial portion of the data produced online and kept in databases, and organizations have been unable to efficiently evaluate this data until recently. Natural language processing comes in handy here.

In this article, we follow a deep learning-based approach to solving NLP problems. We will implement a name generator based on Long Short-Term Memory (LSTM), a type of Recurrent Neural Network (RNN). The article follows this structure:

  • What is NLP?
  • What are RNNs?
  • Intro To LSTM Algorithm
  • Architecture of LSTM
  • Creating a Name Generator with LSTM
  • Conclusion


What is NLP?

NLP is an advanced form of linguistics that can be thought of as an extension of classical linguistics to computational linguistics.

Linguistics is the study of language in its whole, encompassing grammar, semantics, and phonetics. Language norms were devised and evaluated in classical linguistics. Although formal approaches for syntax and semantics have made significant progress, the most fascinating problems in natural language processing continue to defy neat mathematical formalisms.

The modern study of linguistics using the methods of computer science is known as computational linguistics. As computational tools and thinking have come to dominate most disciplines, yesterday’s linguist may be today’s computational linguist.

Statistical techniques and statistical machine learning began to supplant traditional top-down rule-based approaches to language in the 1990s, owing to their superior results, speed, and robustness. The statistical approach to natural language research currently dominates the discipline, and it may perhaps define it.

To reflect the more engineering-oriented, empirical nature of the statistical methods, computational linguistics became known as natural language processing, or NLP.

As machine learning practitioners dealing with text data, we are interested in the tools and approaches from the discipline of Natural Language Processing. Deep learning approaches show a lot of potential for difficult natural language processing problems.

Natural language processing enables computers to converse with humans in their own language and to handle other language-related tasks.

NLP allows computers to read text, hear a voice, analyze it, gauge sentiment, and identify which bits are significant, for example. Machines can now interpret more language-based data than humans can, without becoming fatigued and in a consistent and fair manner.

Automation will be critical for quickly processing text and audio data, given the vast volume of unstructured data generated every day, from medical records to social media.


What are RNNs?

A recurrent neural network (RNN) is a form of artificial neural network that is designed to operate with time series or sequence data. Ordinary feed forward neural networks are only designed to handle data items that are unrelated to one another.

However, if we have data in a sequence where one data point is dependent on the preceding data point, we must change the neural network to account for these dependencies.

RNNs feature a concept of ‘memory,’ which allows them to retain the states or information of prior inputs in order to construct the sequence’s next output. RNNs are basically developed to handle streams of data such as textual data.

The meaning of a word in a sentence depends on the context around it. This makes textual data extremely difficult, and almost impossible, to interpret with a generic feed-forward artificial neural network. This is where the memory concept of RNNs mentioned above helps: it assists in interpreting contextual data streams.

RNNs are good at recalling information from earlier in a sequence. In other neural networks, all of the inputs are treated as unrelated to one another.

In an RNN, however, all of the inputs are connected. Let us imagine you need to anticipate the next word in a phrase. In this situation, the relationship between all the preceding words aids in improved output prediction.

During training, the RNN keeps track of all of these connections. To do this, the RNN constructs networks with loops in them, allowing it to store information.

The neural network can take the sequence of input thanks to its loop structure. You will have a better understanding of it if you view the unrolled version.
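The loop can be made concrete with a minimal NumPy sketch of a plain RNN unrolled over a short sequence. The sizes, the random weights, and the names `rnn_forward`, `W_xh`, `W_hh`, and `b_h` are illustrative assumptions and are not part of the name generator built later in this article.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a plain RNN over a sequence and return every hidden state."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state (the "memory")
    states = []
    for x_t in inputs:                   # one loop iteration per time step
        # The new state depends on the current input AND the previous state
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 5   # hypothetical sizes
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

toy_inputs = rng.normal(size=(seq_len, input_dim))
hidden_states = rnn_forward(toy_inputs, W_xh, W_hh, b_h)
print(hidden_states.shape)  # (5, 4)
```

Unrolling simply means writing out this `for` loop step by step: the same weights are reused at every time step, and only the hidden state changes.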

Diagram of LSTM network

There are four basic types of RNNs. These include:

  • One to One
  • One to Many
  • Many to One
  • Many to Many

To learn more about these types, visit here .


Intro To LSTM Algorithm

An RNN applies a function to its current state that transforms it completely in order to add new information.

As a result, the entire information gets altered, i.e., there is no distinction between ‘important’ and ‘not so important’ information. LSTMs, on the other hand, use multiplications and additions to make small changes to the data. In LSTMs, information travels through a mechanism known as the cell state.

LSTMs may selectively recall or forget information in this way. There are different dependencies on the information at a specific cell state.


Architecture of LSTM

A typical LSTM network is made up of several memory blocks known as cells (the rectangles that we see in the image). The cell state and the hidden state are the two states that are passed to the following cell. The memory blocks are in charge of remembering things, and they are manipulated by three basic mechanisms known as gates. Each of these is detailed further down.

LSTM cell architecture

A forget gate is in charge of erasing data from the cell state. Information that the LSTM no longer needs, or that is of lesser value, is removed by multiplying the cell state by a filter. This is essential for optimizing the LSTM network’s performance.

x_t is the input at the current time step, and h_{t-1} is the hidden state from the previous cell. The provided inputs are multiplied by the weight matrices, and a bias is added.

This value is then passed through the sigmoid function. The sigmoid function outputs a vector with values ranging from 0 to 1, one for each number in the cell state.

The sigmoid function is in charge of determining which data should be kept and which should be discarded. When the forget gate outputs a ‘0’ for a specific value in the cell state, it signifies that the forget gate wants the cell state to fully forget that piece of information.

A ‘1’, on the other hand, indicates that the forget gate wishes to remember the complete piece of data. The cell state is multiplied by the sigmoid function’s vector output.
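The forget gate described above can be sketched in a few lines of NumPy. This is only an illustration under assumed sizes and random weights; the names `forget_gate`, `W_f`, and `b_f` are hypothetical and do not come from the Keras implementation used later.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """Return one value in (0, 1) for each entry of the cell state."""
    return sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

hidden_dim, input_dim = 3, 2               # hypothetical sizes
rng = np.random.default_rng(1)
W_f = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
b_f = np.zeros(hidden_dim)

h_prev = rng.normal(size=hidden_dim)       # hidden state from the previous cell
x_t = rng.normal(size=input_dim)           # input at the current time step
c_prev = np.array([2.0, -1.0, 0.5])        # previous cell state

f_t = forget_gate(h_prev, x_t, W_f, b_f)
c_kept = f_t * c_prev                      # element-wise: near 0 forgets, near 1 keeps
print(f_t)
print(c_kept)
```

Because every gate value lies between 0 and 1, the multiplication can only shrink an entry of the cell state, never enlarge it.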

The input gate is in charge of updating the cell state with new information. As can be seen in the picture above, adding information is a three-step procedure.

  • A sigmoid function is used to control which values should be added to the cell state. This is similar to the forget gate in that it functions as a filter for all of the data from h_{t-1} and x_t.
  • A vector is created containing all candidate values that can be added to the cell state (as determined by h_{t-1} and x_t). The tanh function, which returns values ranging from -1 to +1, is used to do this.
  • The value of the regulatory filter (the sigmoid gate) is multiplied by the created vector (the tanh output), and this useful information is then added to the cell state.
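The three steps in the list above can be sketched as follows. The weights are random and the names `input_gate_update`, `W_i`, `b_i`, `W_c`, and `b_c` are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_update(h_prev, x_t, W_i, b_i, W_c, b_c):
    """Return the filtered candidate values that get added to the cell state."""
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)       # step 1: filter with values in (0, 1)
    c_tilde = np.tanh(W_c @ concat + b_c)   # step 2: candidates in (-1, 1)
    return i_t * c_tilde                    # step 3: keep only the filtered part

hidden_dim, input_dim = 3, 2                # hypothetical sizes
rng = np.random.default_rng(2)
W_i = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
W_c = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
b_i = np.zeros(hidden_dim)
b_c = np.zeros(hidden_dim)

h_prev = rng.normal(size=hidden_dim)
x_t = rng.normal(size=input_dim)
c_after_forget = np.array([1.0, -0.5, 0.0])  # cell state after the forget gate

update = input_gate_update(h_prev, x_t, W_i, b_i, W_c, b_c)
c_t = c_after_forget + update                # new information is added, not overwritten
print(c_t)
```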

An output gate’s operation may be broken down into three parts once more:

  • The tanh function is applied to the cell state, scaling its values to the range -1 to +1 and producing a vector.
  • Using the values of h_{t-1} and x_t, a filter is built that controls which values of that vector are output. A sigmoid function is used once again in this filter.
  • The value of this regulatory filter is multiplied by the vector formed in step 1, and the result is sent out as the output as well as to the following cell’s hidden state.
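Putting the three gates together gives one complete LSTM time step. The sketch below assumes the common formulation in which all gates read the concatenation of the previous hidden state and the current input; the function `lstm_step`, the `params` dictionary, and the sizes are hypothetical, and Keras organizes its weights differently internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and cell state."""
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(p["W_f"] @ concat + p["b_f"])      # forget gate
    i_t = sigmoid(p["W_i"] @ concat + p["b_i"])      # input gate
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])  # candidate values
    c_t = f_t * c_prev + i_t * c_tilde               # updated cell state
    o_t = sigmoid(p["W_o"] @ concat + p["b_o"])      # output gate filter
    h_t = o_t * np.tanh(c_t)                         # filtered, squashed cell state
    return h_t, c_t

hidden_dim, input_dim, seq_len = 3, 2, 4             # hypothetical sizes
rng = np.random.default_rng(3)
params = {}
for gate in "fico":
    params["W_" + gate] = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
    params["b_" + gate] = np.zeros(hidden_dim)

h_t = np.zeros(hidden_dim)
c_t = np.zeros(hidden_dim)
for x_t in rng.normal(size=(seq_len, input_dim)):
    h_t, c_t = lstm_step(x_t, h_t, c_t, params)      # both states flow to the next step

print(h_t.shape, c_t.shape)  # (3,) (3,)
```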

Creating a Name Generator with LSTM

Now that we have looked at the basis of recurrent neural networks and the LSTM algorithm, we can move on to the implementation of the LSTM for the development of a name generator.

We need TensorFlow as a pre-requisite before we move on to the implementation. We can use a pip command to install TensorFlow.

pip install --upgrade tensorflow

Now, we can start to code. As always, you can access the full code in a Jupyter notebook on Google Colab.

import pandas as pd
import numpy as np
import tensorflow as tf
import time
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop
import random
import os

We import all the necessary modules first. Pandas is a library for handling and manipulating data. NumPy is a library for numerical computing. TensorFlow is a framework for developing machine learning and deep learning models.

step_length = 1 # The step length we take to get our samples from our corpus
epochs = 50 # Number of times we train on our full data
batch_size = 32 # Data samples in each training step
latent_dim = 64 # Size of our LSTM
dropout_rate = 0.2 # Regularization with dropout
model_path = os.path.realpath('./poke_gen_model.h5') # Location for the model
load_model = False # Enable loading model from disk
store_model = True # Store model to disk after training
verbosity = 1 # Print result for each epoch
gen_amount = 10 # How many names to generate

Afterwards we declare all the necessary variables that we will use in the code.

input_path = os.path.realpath('names.txt')
input_names = []
print('Reading names from file:')
with open(input_path) as f:
    for name in f:
        name = name.rstrip()
        if len(input_names) < 10:
            print(name)
        input_names.append(name)
print('...')

Here, we read the names from the text file and print the first ten as examples.

Reading names from file:
Abbas
Abbey
Abbott
Abdi
Abel
Abraham
Abrahams
Abrams
Ackary
Ackroyd

# Make it all into one long string
concat_names = '\n'.join(input_names).lower()

# Find all unique characters by using set()
chars = sorted(list(set(concat_names)))
num_chars = len(chars)

# Build translation dictionaries, 'a' -> 0, 0 -> 'a'
char2idx = dict((c, i) for i, c in enumerate(chars))
idx2char = dict((i, c) for i, c in enumerate(chars))

# Use longest name length as our sequence window
max_sequence_length = max([len(name) for name in input_names])

print('Total chars: {}'.format(num_chars))
print('Corpus length:', len(concat_names))
print('Number of names: ', len(input_names))
print('Longest name: ', max_sequence_length)

Now, we have to find the unique characters that make up all the names. This is done by joining all the names into a single long string, collecting its distinct characters with set(), and sorting them into a list.

This list is then used to create two dictionaries, one mapping characters to indexes and one mapping indexes back to characters. This converts the categorical character data into integer indexes, and this encoding can be used to train the network since models only understand numbers.
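To see what these dictionaries contain, here is the same mapping applied to a tiny made-up corpus. The two names and the `toy_` variables are assumptions chosen for illustration; the real corpus comes from names.txt.

```python
toy_names = ["abel", "abbey"]                  # hypothetical mini corpus
toy_text = "\n".join(toy_names).lower()        # "abel\nabbey"

# Distinct characters, sorted so the mapping is reproducible
toy_chars = sorted(set(toy_text))
toy_char2idx = {c: i for i, c in enumerate(toy_chars)}
toy_idx2char = {i: c for c, i in toy_char2idx.items()}

encoded = [toy_char2idx[c] for c in "abel"]
decoded = "".join(toy_idx2char[i] for i in encoded)

print(toy_chars)   # ['\n', 'a', 'b', 'e', 'l', 'y']
print(encoded)     # [1, 2, 3, 4]
print(decoded)     # abel
```

The newline character sorts first and gets index 0. It matters because it is the separator the model later learns to emit when a name is finished.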

sequences = []
next_chars = []

# Loop over our data and extract pairs of sequences and next chars
for i in range(0, len(concat_names) - max_sequence_length, step_length):
    sequences.append(concat_names[i: i + max_sequence_length])
    next_chars.append(concat_names[i + max_sequence_length])

num_sequences = len(sequences)

print('Number of sequences:', num_sequences)
print('First 10 sequences and next chars:')
for i in range(10):
    # Show newlines as spaces so each pair stays on one line
    print('X=[{}] y=[{}]'.format(sequences[i], next_chars[i]).replace('\n', ' '))

Here, we loop over the whole data and extract pairs of sequences and the character that follows each one. The code prints the first 10 sequences and their next characters. Each sequence provides the context from which the next character is predicted.
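The sliding window is easier to follow on a small example. The sketch below uses a made-up two-name string and a window of 4 characters instead of the longest name length; the `toy_` names are assumptions for illustration.

```python
toy_text = "abel\nabbey\n"   # hypothetical corpus, 11 characters
window = 4                   # stands in for max_sequence_length
step = 1                     # stands in for step_length

toy_sequences, toy_next = [], []
for i in range(0, len(toy_text) - window, step):
    toy_sequences.append(toy_text[i:i + window])   # the context
    toy_next.append(toy_text[i + window])          # the character to predict

for seq, nxt in zip(toy_sequences, toy_next):
    print(repr(seq), '->', repr(nxt))
# 'abel' -> '\n'
# 'bel\n' -> 'a'
# 'el\na' -> 'b'
# 'l\nab' -> 'b'
# '\nabb' -> 'e'
# 'abbe' -> 'y'
# 'bbey' -> '\n'
```

Note that the windows cross name boundaries, so the model also learns which characters tend to start a name after a newline.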

# One-hot encode the data (the np.bool alias was removed from NumPy; use bool)
X = np.zeros((num_sequences, max_sequence_length, num_chars), dtype=bool)
Y = np.zeros((num_sequences, num_chars), dtype=bool)

for i, sequence in enumerate(sequences):
    for j, char in enumerate(sequence):
        X[i, j, char2idx[char]] = 1
    # One target character per sequence
    Y[i, char2idx[next_chars[i]]] = 1

print('X shape: {}'.format(X.shape))
print('Y shape: {}'.format(Y.shape))

After encoding the data as X (the one-hot encoded sequences) and Y (the one-hot encoded next characters), we look at the shapes of X and Y. X is used as the training input and Y as the corresponding target, the character that follows each sequence.
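As a small illustration of the resulting shapes, the sketch below one-hot encodes two made-up sequences over a five-character vocabulary. The vocabulary, the sequences, and the names `X_toy` and `Y_toy` are assumptions for this example only.

```python
import numpy as np

vocab = ['\n', 'a', 'b', 'e', 'l']            # hypothetical vocabulary
idx = {c: i for i, c in enumerate(vocab)}

toy_seqs = ["abe", "bel"]                     # two sequences of length 3
toy_targets = ["l", "\n"]                     # the character after each sequence

X_toy = np.zeros((len(toy_seqs), 3, len(vocab)), dtype=bool)
Y_toy = np.zeros((len(toy_seqs), len(vocab)), dtype=bool)

for i, seq in enumerate(toy_seqs):
    for j, ch in enumerate(seq):
        X_toy[i, j, idx[ch]] = True           # exactly one True per time step
    Y_toy[i, idx[toy_targets[i]]] = True      # exactly one True per target

print(X_toy.shape)            # (2, 3, 5)
print(Y_toy.shape)            # (2, 5)
print(X_toy[0].astype(int))
# [[0 1 0 0 0]
#  [0 0 1 0 0]
#  [0 0 0 1 0]]
```

The three axes of X are (sequence, position in the sequence, character), which matches the `input_shape` of (sequence length, number of characters) that the LSTM layer expects for each sample.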

model = Sequential()
model.add(LSTM(latent_dim,
               input_shape=(max_sequence_length, num_chars),
               recurrent_dropout=dropout_rate))
model.add(Dense(units=num_chars, activation='softmax'))

optimizer = RMSprop(learning_rate=0.01)  # 'lr' is the old name of this argument
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer)

model.summary()

Now, we define the LSTM model using the predefined Keras LSTM layer. The Dense layer is an ordinary fully connected layer; its softmax activation turns the output into a probability for each character.
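The parameter counts reported by `model.summary()` can be checked by hand. The sketch below assumes a standard LSTM layer with biases and uses a hypothetical vocabulary of 27 characters (26 letters plus the newline); the actual `num_chars` depends on the contents of names.txt, and the helper names are made up for this example.

```python
def lstm_param_count(input_dim, units):
    # Four weight sets (forget, input, candidate, output), each with
    # input weights, recurrent weights, and one bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input/unit pair plus one bias per unit
    return input_dim * units + units

latent_dim = 64       # same value as in the settings above
num_chars = 27        # ASSUMPTION: 26 letters plus '\n'

lstm_params = lstm_param_count(num_chars, latent_dim)
dense_params = dense_param_count(latent_dim, num_chars)

print(lstm_params)                 # 23552
print(dense_params)                # 1755
print(lstm_params + dense_params)  # 25307
```

Almost all of the weights sit in the LSTM layer, which is why `latent_dim` is the main knob for model capacity here.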

if load_model:
    model.load_weights(model_path)
else:
    start = time.time()
    print('Start training for {} epochs'.format(epochs))
    history = model.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=verbosity)
    end = time.time()
    print('Finished training - time elapsed:', (end - start)/60, 'min')
if store_model:
    print('Storing model at:', model_path)
    model.save(model_path)

Here, we start the training process by calling the model's fit() method. We pass it the number of epochs, which defines how many times the model trains on the whole dataset.

# Start sequence generation from end of the input sequence
sequence = concat_names[-(max_sequence_length - 1):] + '\n'

new_names = []
print('{} new names are being generated'.format(gen_amount))

while len(new_names) < gen_amount:
    # Vectorize sequence for prediction
    x = np.zeros((1, max_sequence_length, num_chars))
    for i, char in enumerate(sequence):
        x[0, i, char2idx[char]] = 1

    # Sample next char from predicted probabilities
    probs = model.predict(x, verbose=0)[0]
    probs /= probs.sum()
    next_idx = np.random.choice(len(probs), p=probs)
    next_char = idx2char[next_idx]
    sequence = sequence[1:] + next_char

    # New line means we have a new name
    if next_char == '\n':
        gen_name = [name for name in sequence.split('\n')][1]
        
        # Never start a name with two identical chars
        if len(gen_name) > 2 and gen_name[0] == gen_name[1]:
            gen_name = gen_name[1:]
        
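        # Discard all names that are too short
        if len(gen_name) > 2:
            # Only allow new and unique names. Compare in lowercase because
            # gen_name is lowercase while the stored names are capitalized.
            known_names = {name.lower() for name in input_names + new_names}
            if gen_name not in known_names:
                new_names.append(gen_name.capitalize())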
        
        if len(new_names) % max(1, gen_amount // 10) == 0:
            print('Generated {}'.format(len(new_names)))

Now, we generate new names with the LSTM model we just trained. The number of names is set by gen_amount, which is 10 here.
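The generation loop samples the next character from the predicted probabilities instead of always taking the most likely one. The sketch below contrasts the two on a made-up distribution; the probabilities and the `toy_idx2char` mapping are assumptions, and it uses NumPy's `default_rng` for reproducibility where the loop above calls `np.random.choice` directly.

```python
import numpy as np

toy_idx2char = {0: '\n', 1: 'a', 2: 'b', 3: 'c'}   # hypothetical mapping
probs = np.array([0.1, 0.6, 0.2, 0.1])            # hypothetical model output

# Greedy choice: always the single most likely character
greedy = toy_idx2char[int(np.argmax(probs))]

# Sampling: each character is drawn in proportion to its probability
rng = np.random.default_rng(42)
samples = [toy_idx2char[int(rng.choice(len(probs), p=probs))]
           for _ in range(1000)]
counts = {ch: samples.count(ch) for ch in toy_idx2char.values()}

print(repr(greedy))  # 'a'
print(counts)
```

Greedy decoding would produce the same name every time from the same starting sequence, whereas sampling keeps the output varied while still favoring likely characters.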

print_first_n = min(10, gen_amount)
print('First {} generated names:'.format(print_first_n))
for name in new_names[:print_first_n]:
    print(name)

Finally, we display the names generated by the model.

First 10 generated names:
Zaoui
Palner
Palner
Pane
Panrett
Panm
Parner
Parrey
Parrett
Parrison

Conclusion

Natural Language Processing is one of the most essential parts of artificial intelligence and of the technological revolution it drives. Recurrent neural networks are used extensively in NLP to interpret contextual data.

LSTM is a form of recurrent neural network built for sequential data. Its memory mechanism lets it use the preceding context to predict what comes next, which in this tutorial is the next character of a name.