
# Deep Learning Fundamentals to Build an Efficient Chatbot

Learn about the potential of deep learning and the role it plays in chatbots! Also, find out how neural networks can be used for speech recognition in chatbots.

##### By **Aarushi Ramesh**

###### August 20, 2019

How does a chatbot learn the user’s commands? How does it identify the intents and entities in a message? For your chatbot to converse, you need to give it data to learn from, so that it can find patterns and build a model that handles future inputs. These are the main steps in training your chatbot to understand and correctly interpret commands and questions from the user. Training teaches your chatbot to make decisions, such as matching a certain phrase to a certain action (for example, searching for a nearby restaurant). The technique used to teach a bot this kind of complicated decision-making is a very interesting concept known as deep learning.

## What is deep learning?

All of these skills require an algorithm that lets the chatbot learn to predict future commands from the user. This is where deep learning comes into play. Machine learning is a subset of AI: it is a computer’s ability to learn from data without being explicitly programmed. Deep learning, in turn, is a subset of machine learning in which the model that learns is a neural network with multiple layers. Well, what is a neural network, and how does it work? Neural networks are models, loosely inspired by the human brain, that learn to recognize patterns in data. They are especially useful for classification problems. For example, if you feed a network labeled data points, it learns the relationship between the inputs and their labels and uses it to predict the labels of new data points.

## How do neural networks work?

You can think of a neural network as a layered structure of nodes connected to one another. Every layer of the network has a certain number of nodes, and a node (or neuron) is a structure that holds a numerical value. The first layer of the network holds your inputs, and the last layer holds your outputs. In a fully connected network, every node is connected to every node in the next layer, so each node in the next layer can combine information from all of the nodes before it and extract the features that matter for deciding the final output. In effect, the network breaks the input down into features across different layers, and certain combinations of these features lead to a certain output. Each successive layer represents the input at a higher level of abstraction, which is why the layers are also described as layers of abstraction.
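To make the layered, fully connected structure concrete, here is a small Python sketch that describes a network only by its layer sizes and counts its connections. The layer sizes are hypothetical, and the sketch assumes a fully connected network, where each pair of adjacent layers contributes one connection per pair of neurons.

```python
# A hypothetical fully connected network described only by its layer sizes:
# 4 inputs, two hidden layers of 3 neurons each, and 2 outputs.
layer_sizes = [4, 3, 3, 2]


def count_connections(sizes):
    """Count the connections in a fully connected network.

    Every neuron in one layer connects to every neuron in the next layer,
    so each pair of adjacent layers contributes sizes[i] * sizes[i + 1] connections.
    """
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


print(count_connections(layer_sizes))  # 4*3 + 3*3 + 3*2 = 27
```

Each of those connections will later carry its own weight, which is why even modest networks end up with many parameters to learn.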

For a voice-based conversational chatbot, the main input is speech: the user speaks a command to the chatbot, and the chatbot in return carries out the task. So we would need to build a neural network for speech recognition. Since each layer builds on the one before it, the first layer would start with the raw audio, the next layer would pick out individual sounds, then syllables, and then letters. This entire process eventually leads to the action the chatbot is required to take.

*A neuron is denoted as a circle in the diagram. This is a basic structure of a neural network.*

Let’s first consider an example of image classification. Suppose we are using a neural network to recognize a digit from an image, and the image can show the digit 0, 1, or 2. To classify the image, every pixel would be an input to the neural network. So, in the first layer, every neuron would hold a number representing the grayscale value of its pixel: a white pixel corresponds to 1, and a black pixel corresponds to 0. Every input to the neural network is therefore a number between 0 and 1. Because there are three possible digits, the last layer has three neurons, one for each digit (0–2). The layers in between are called the hidden layers of the structure. In this example, there are two hidden layers, each capturing different groupings of pixels in the image. Processing starts from individual pixels, and by the time the signal reaches the end of the network, the whole image has been processed and the network has identified the digit.
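The input and output encoding for the digit example can be sketched in Python as follows. It assumes 8-bit grayscale pixels (values 0 to 255), so dividing by 255 maps white to 1 and black to 0 as described above. The tiny 3×3 image and the output activations are made-up values for illustration, not the output of a trained network.

```python
def image_to_inputs(pixels):
    """Flatten a 2D grid of 8-bit grayscale pixels (0-255) into inputs in [0, 1].

    White (255) maps to 1.0 and black (0) maps to 0.0, one input neuron per pixel.
    """
    return [value / 255 for row in pixels for value in row]


def predicted_digit(output_activations):
    """Return the digit (0, 1, or 2) whose output neuron is most active."""
    return max(range(len(output_activations)), key=lambda i: output_activations[i])


# Hypothetical 3x3 image: a white vertical stroke on a black background.
tiny_image = [
    [0, 255, 0],
    [0, 255, 0],
    [0, 255, 0],
]
inputs = image_to_inputs(tiny_image)
print(len(inputs))  # 9 input neurons, one per pixel

# Hypothetical activations of the three output neurons (digits 0, 1, 2).
print(predicted_digit([0.1, 0.8, 0.3]))
```

Reading the answer off the most active output neuron is a common convention for classification networks, and it is why the output layer needs exactly one neuron per possible digit.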

An image of a number is composed of several parts: a line for numbers like 7, 4, 9, or 1, or a circle for numbers like 8, 9, or 0. We can train our neural network to find these important characteristics and recognize the digit from the image. For example, the input neurons that together form a circular shape activate the neuron in the next layer that detects a circle. In general, activations in one layer lead to activations in the next layer. The function that turns a neuron’s combined inputs into its output value (its activation) is called the activation function. But how does the network know which neuron to activate in the next layer? How does it link all of the shapes and characteristics together and arrive at one number?
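Two common choices of activation function are the sigmoid and ReLU (rectified linear unit). The article does not say which one its example network uses, so this Python sketch simply shows how each one maps a neuron’s combined input to an output value.

```python
import math


def sigmoid(z):
    """Squash any real number into (0, 1): large positive z gives ~1, large negative z gives ~0."""
    return 1 / (1 + math.exp(-z))


def relu(z):
    """Pass positive values through unchanged and clip negative values to 0."""
    return max(0.0, z)


for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), relu(z))
```

The sigmoid keeps every activation between 0 and 1, which matches the idea of a neuron being more or less "active", while ReLU is cheap to compute and is a popular default for hidden layers.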

## Putting it all together

We need some way to combine the pixel inputs into lines, circles, and other characteristics. The network needs parameters that can be tweaked and adjusted so that it can process the small pixels and get a bigger picture of the number (like a circular shape or a line). These parameters are called weights: numbers assigned to the connections between two layers. Every line that connects a neuron to a neuron in the next layer is associated with a weight. The weight is a numerical factor that tells the network how much influence one neuron has over another. So, if the weight on the connection from neuron 1 in the first layer to neuron 2 in the second layer has a high magnitude, neuron 1 has a strong influence on neuron 2’s activation. The essence of training the chatbot is updating the weights repeatedly to make its predictions more accurate. Over many updates, the network learns weights that group pixels together layer by layer, building up a clearer picture of the image until it can predict the digit.

But what if all of the input neurons are inactive? Each neuron also has an additional constant, called the bias, that is added to its weighted sum. The bias shifts how easily the neuron activates, so it can still become active even when every neuron in the previous layer is inactive. It plays the same role as the y-intercept b in the line equation y = mx + b.
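Weights and biases come together in a single fully connected layer. In this minimal Python sketch, each neuron computes the weighted sum of its inputs, adds its bias, and applies ReLU (an assumed choice of activation function). The weight and bias values are hypothetical; in a real network, training would learn them.

```python
def relu(z):
    """Activation function: clip negative values to 0."""
    return max(0.0, z)


def dense_layer(inputs, weights, biases):
    """Fully connected layer: weights[j][i] is the weight from input i to neuron j.

    Each neuron's output is relu(weighted sum of inputs + bias).
    """
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical parameters for a layer with 3 inputs and 2 neurons.
weights = [
    [0.9, 0.1, 0.0],   # neuron 0 is influenced mostly by input 0
    [-0.5, 0.4, 0.8],  # neuron 1 combines all three inputs
]
biases = [0.0, 0.2]

print(dense_layer([1.0, 0.0, 1.0], weights, biases))
print(dense_layer([0.0, 0.0, 0.0], weights, biases))
```

With all inputs set to zero, neuron 0 stays inactive, but neuron 1 still outputs 0.2 because of its bias.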

## The role of NLP

But for conversational chatbots, we need to focus more on natural language processing (NLP). How can we use neural networks for speech recognition? Since the inputs to a neural network must be numbers, we need to convert the sound we receive from the user (the command to the chatbot) into numerical data points. We can do this by recording the height (amplitude) of the sound wave at regular, short time intervals. This is known as sampling. However, recognizing speech directly from these raw samples is inefficient and hard, so we preprocess the inputs to make them easier for the neural network to handle. For example, one option is to group the inputs based on their pitches and frequencies. Once we have the preprocessed inputs, we can feed them into our neural network. A type of network commonly used for NLP is the recurrent neural network (RNN), which keeps an internal state, or memory, of what it has already seen and uses it to influence future predictions. So every letter the network predicts affects the probability of the next letter. For example, if you said “Goodbye” and the network has already predicted “Goodb,” it is more likely to predict “ye” for the rest of the word because it remembers its previous predictions. This context improves the accuracy of the predictions.
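The memory idea behind an RNN can be shown with a deliberately tiny Python sketch: a scalar RNN cell whose hidden state mixes the current input with the previous state. Real RNNs work on vectors with learned weight matrices, and for speech the inputs would be preprocessed audio features for each time step; the weights here (`w_x`, `w_h`, `b`) are hypothetical values chosen for illustration.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a minimal scalar RNN cell: the new state mixes the input and the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(sequence, w_x=1.0, w_h=0.5, b=0.0):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0  # the memory starts empty
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# Same values, different order: the final states differ because the
# hidden state remembers what came earlier in the sequence.
print(run_rnn([1.0, 0.0, -1.0])[-1])
print(run_rnn([-1.0, 0.0, 1.0])[-1])
```

Both calls see the same three values, yet the final states have opposite signs because the hidden state carries information about the order in which the inputs arrived.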

## Into the future

By using these decision-making algorithms, we can feed data to our chatbots to start training them. The training data is what the bot learns from, and it then makes predictions on new, unseen data. By adjusting and updating the weights at every training iteration, we can make its predictions much more accurate. Deep learning is an essential tool for training chatbots to accurately understand, process, and respond to the user.
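The idea of updating the weights at every loop can be sketched with a single sigmoid neuron trained by gradient descent on a toy labeled dataset (logical OR). This is an assumed, simplified setup: it uses a cross-entropy loss and updates the weights after every example, whereas a real chatbot model has many layers, trains them with backpropagation, and uses far more data.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


# Tiny labeled dataset (logical OR): inputs -> target output.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]


def predict(weights, bias, inputs):
    """Forward pass of one neuron: sigmoid(weighted sum + bias)."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(data, epochs=1000, learning_rate=0.5):
    """Stochastic gradient descent: nudge the weights after every example."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in data:
            prediction = predict(weights, bias, inputs)
            # For a sigmoid output with cross-entropy loss, the gradient of the
            # loss with respect to the weighted sum is (prediction - target).
            error = prediction - target
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train(data)
for inputs, target in data:
    print(inputs, target, round(predict(weights, bias, inputs), 2))
```

After training, the prediction for [0.0, 0.0] should fall below 0.5 and the other three above 0.5, showing how repeated small weight updates turn a neuron that starts with all-zero weights into one that fits its training data.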