So far, this series of articles has covered two key aspects of Generative AI: Large Language Models and an introduction to Natural Language Processing. In this article, the author takes us on a deep dive into the complex world of Natural Language Processing with its many steps, algorithms, layers, and nuances. Read on to learn how NLP works behind the scenes to create a form of AI that is truly engaging to humans.

Natural Language Processing

Thanks to its advanced algorithms, the NLP process can, for the most part, be accomplished in just two basic steps, which are as follows:

  • Data Preprocessing: This is the first step, in which all of the datasets are examined, cleansed, and optimized. An NLP model can work very well with both Quantitative and Qualitative datasets. There are four main techniques used here, illustrated in the code sketch after this list:
    • Tokenization: This is where raw text is split into smaller units, called “tokens,” such as words or subwords, so that the model can process them individually.
    • Stop Word Removal: In this instance, common words that carry little meaning (such as “the,” “is,” and “of”) are removed, so that only the most robust and meaningful words remain. This reduces noise in the data and helps the model focus on the words that matter most to the output.
    • Lemmatization/Stemming: This is where the words in a Qualitative dataset are reduced to their most basic form. For example, the word “playing” would be reduced to its root form, “play.” One benefit is smoother processing by the NLP model, since fewer distinct word forms mean reduced processing and computing resources.
    • Part of Speech Tagging: This was reviewed in the previous article; here, individual words are categorized based upon their role in a sentence, such as a noun, verb, or adjective.
  • Data Processing: In this step the datasets (both Quantitative and Qualitative) are processed and used to generate the output. This is where the established algorithms of both Machine Learning and Rule Based Systems come into play; the latter are a series of carefully developed linguistic rules.
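
To make the preprocessing techniques concrete, here is a minimal sketch using Python’s NLTK library. It assumes NLTK is installed and that its standard data packages (tokenizer, stop word list, WordNet, and POS tagger) have been downloaded; the sample sentence is purely illustrative.

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # One-time downloads of the required NLTK data packages
    for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
        nltk.download(pkg, quiet=True)

    text = "The children are playing in the park"

    # Tokenization: split the raw text into individual word tokens
    tokens = word_tokenize(text)

    # Stop Word Removal: drop common words that carry little meaning
    filtered = [t for t in tokens if t.lower() not in stopwords.words("english")]

    # Lemmatization: reduce each remaining word to its base form
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t.lower(), pos="v") for t in filtered]

    # Part of Speech Tagging: label each token as noun, verb, adjective, etc.
    tagged = nltk.pos_tag(tokens)

    print(filtered)  # ['children', 'playing', 'park']
    print(lemmas)    # ['children', 'play', 'park']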

Natural Language Generation: A new field of NLP

A new field of NLP has started to evolve, and it is called “Natural Language Generation”, or “NLG” for short. There are three types of NLG:

  • Extractive NLG: This is where a large group of sentences is analyzed, and the most important words and phrases are extracted in order to provide a summary of that particular block of text.
  • Abstractive NLG: This technique also takes long blocks of text, but rather than copying out key phrases, it generates brand new sentences that convey the same meaning.
  • Sequence to Sequence: This is where the NLP algorithm can take one kind of input and convert it into a different kind of output, while still meeting the overall objective. For example, if an end user wanted to translate a block of English text and have it output in Arabic, this kind of algorithm would work very well. Both of these techniques are sketched in code after this list.
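
Here is a minimal sketch of Abstractive NLG and Sequence to Sequence translation using the Hugging Face transformers library. The specific model names (sshleifer/distilbart-cnn-12-6 and Helsinki-NLP/opus-mt-en-ar) are assumptions chosen for illustration; any compatible summarization or translation model would work.

    from transformers import pipeline

    long_text = (
        "Natural Language Processing allows computers to read and interpret "
        "human language. It underpins chatbots, search engines, and machine "
        "translation, and it continues to improve as models grow in size."
    )

    # Abstractive NLG: generate new sentences rather than copying phrases out
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    print(summarizer(long_text, max_length=40, min_length=10)[0]["summary_text"])

    # Sequence to Sequence: English text in, Arabic text out
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")
    print(translator("Where is the nearest train station?")[0]["translation_text"])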

Natural Language Understanding

Another closely allied area is that of Natural Language Understanding (NLU), which is technically defined as follows:

“Natural Language Understanding (NLU) is a field of computer science which analyzes what human language means, rather than simply what individual words say.”1

Here is an example of how it works:

The query: “tickets New York to Miami 25 April 8pm”

How it is broken down:

Tickets [intent to buy]

New York [location]

Miami [location]

25 April [date]

8 PM [time]

There are two distinct components to NLU, which are as follows:

  • Intent Recognition: This technique tries to ascertain the meaning or the context of the words, as illustrated in the example above.
  • Entity Recognition: This technique tries to ascertain which words in a sentence or phrase actually refer to a tangible entity (see the code sketch after this list). There are two subcategories of Entity Recognition:
    • Named Entities: These are distinct classifications, such as names of people, businesses, geographic locations, etc.
    • Numeric Entities: These are numerical based classifications, such as quantities, percentages, currencies, etc.
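
As a minimal sketch, the spaCy library performs Entity Recognition out of the box. This assumes spaCy is installed along with its small English model (en_core_web_sm); the exact labels returned for this query may vary by model version.

    import spacy

    # Load spaCy's small English pipeline (assumed installed via:
    #   python -m spacy download en_core_web_sm)
    nlp = spacy.load("en_core_web_sm")

    # Run the travel query from the example above through the pipeline
    doc = nlp("tickets New York to Miami 25 April 8pm")

    # Named Entities (e.g., GPE for the cities) and Numeric Entities
    # (e.g., DATE and TIME) are labeled by the model
    for ent in doc.ents:
        print(ent.text, ent.label_)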

One of the primary objectives of both NLU and NLG is to totally engage the end user with the AI model. In other words, the end user wants to be heard and responded to like a human being.

How Machine Learning and Deep Learning come into play

It is important to note at this point that NLP also makes heavy use of both Machine Learning and Deep Learning. Here is a simple example that demonstrates their role in NLP:

In this particular scenario, the output is produced in three distinct steps:

  • The model is trained with Input/Output (I/O) combinations, such as the following:

    (2 * 10) + (3 * 10) + (5 * 10) = 100

    This is considered to be the “Preparation and Build” step.

  • The ML algorithms then determine the above mathematical relationship as follows:

    (x1 * y) + (x2 * y) + (x3 * y) = Z

    This is considered to be the “Training and Tuning” step.

  • We can then give this trained model the inputs 2, 3, and 5 (with y = 10), and the computed output will be “100,” as denoted by the variable “Z.”

    This is considered to be the “Deploy and Manage” step and is the last phase. A minimal code sketch of the whole workflow follows.
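
As a minimal sketch of these three steps, assuming scikit-learn is available: a linear regression model is trained on a handful of I/O pairs that follow the relationship above, learns the coefficients, and then reproduces the output of 100 for the inputs 2, 3, and 5. The training rows are illustrative.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Preparation and Build: I/O pairs following x1*10 + x2*10 + x3*10 = Z
    X = np.array([[2, 3, 5], [1, 4, 2], [3, 3, 3], [0, 1, 6]])
    Z = X.sum(axis=1) * 10          # targets: 100, 70, 90, 70

    # Training and Tuning: the model learns that every coefficient is 10
    model = LinearRegression().fit(X, Z)

    # Deploy and Manage: predict on the original inputs -> approximately [100.]
    print(model.predict(np.array([[2, 3, 5]])))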

Deep Learning can be technically defined as follows:

“Deep learning models can recognize complex patterns in pictures, text, sounds, and other data to produce accurate insights and predictions. You can use deep learning methods to automate tasks that typically require human intelligence, such as describing images or transcribing a sound file into text.”2

As its name implies, Deep Learning takes AI into a much more sophisticated level of analysis, with many more layers embedded into the model as a result. As it relates to NLP, the following algorithms are the most important:

  • The Convolutional Neural Network: These are also referred to as “ConvNets” or “CNNs.” A CNN has four layers embedded into it:
    • The Convolutional Layer: This applies a set of filters across the input in order to detect local patterns and features.
    • The Rectified Linear Unit: This is also referred to as the “ReLU.” This is an activation layer that maps negative values to zero, introducing non-linearity into the model.
    • The Pooling Layer: This is where the data is downsampled and compressed, for the purposes of efficiency.
    • The Fully Connected Layer: This is where the extracted features are combined in a linear, matrix-based operation, so that patterns (such as “images”) can be recognized in datasets beyond those selected for the initial training of the NLP model.
  • Recurrent Neural Networks: These are also commonly referred to as “RNNs.” This algorithm can learn from previous inputs in a sequence, thus making it well suited for an NLP model.
  • Long Short-Term Memory Networks: These are also commonly referred to as “LSTMs.” They are made up of memory blocks, which store the more relevant information from the datasets (see the PyTorch sketch after this list). This is how it works:
      1. The extraneous data is forgotten at the “Sigmoid Layer” (the forget gate).
      2. The above is replaced with more pertinent and relevant data via the input gate.
      3. The output is then calculated from the current cell state, based on the last two steps.
  • Generative Adversarial Networks: These are also referred to as “GANs.” A GAN consists of two distinct parts:
    • The Generator: This is used to create “Fake Data,” based upon the information that it has been trained upon.
    • The Discriminator: This algorithm tries to distinguish the “Fake Data” from the real data; the feedback between the two further optimizes the model.
  • Multi-Layer Perceptrons: In this instance, there are multiple layers of both Inputs and Outputs, along with various Hidden Layers. There are also multiple Neurons, all interconnected with one another. The datasets are first ingested into the Input Layer, then subsequently processed through the Hidden Layers to yield the output.
  • Autoencoder: This is a subset of Deep Learning, and is made up of the following components (sketched in code after this list):
    • The Encoder: This is where the input data is compressed into a smaller mathematical representation.
    • The Code: This is the compressed representation itself, which captures the most essential features of the input.
    • The Decoder: This is where the original input is reconstructed from the Code, so that the model can be trained on the reconstruction error.
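
To make the LSTM idea concrete, here is a minimal PyTorch sketch of an LSTM-based text classifier. The vocabulary size, embedding and hidden dimensions, and number of output classes are illustrative assumptions, not fixed requirements.

    import torch
    import torch.nn as nn

    # A small LSTM classifier over sequences of token IDs (hypothetical sizes)
    class TextClassifier(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=64, hidden_dim=128, classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, classes)

        def forward(self, token_ids):
            x = self.embed(token_ids)   # (batch, seq) -> (batch, seq, embed)
            _, (h_n, _) = self.lstm(x)  # h_n holds the final memory-block state
            return self.fc(h_n[-1])     # classify from the last hidden state

    model = TextClassifier()
    batch = torch.randint(0, 10000, (4, 12))  # 4 sequences of 12 token IDs each
    print(model(batch).shape)                 # torch.Size([4, 2])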
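
And here is a comparable sketch of the Autoencoder’s three components, again in PyTorch; the 784-dimensional input and 32-dimensional Code are assumed sizes chosen purely for illustration.

    import torch
    import torch.nn as nn

    # Encoder -> Code -> Decoder, trained to reconstruct its own input
    class Autoencoder(nn.Module):
        def __init__(self, input_dim=784, code_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
            self.decoder = nn.Sequential(
                nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

        def forward(self, x):
            code = self.encoder(x)     # the compressed representation (the Code)
            return self.decoder(code)  # the reconstruction of the original input

    ae = Autoencoder()
    x = torch.rand(8, 784)
    loss = nn.functional.mse_loss(ae(x), x)  # reconstruction error to minimize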

Up Next: The History of Natural Language Processing

Now that you’ve learned the deep and complex details of NLP, you may be wondering about the path computer science took to achieve this. The history of NLP can be traced all the way back to 1906! In the next article in this series, author Ravi Das will take a slight detour to review the fascinating history of NLP.

Sources/References:

1. Qualtrics

2. Amazon: What is Deep Learning?
