Deep Learning Concepts demystified. Inspired by MIT 6.S191.
Deep learning is a type of machine learning that uses artificial neural networks to learn from data. Neural networks are inspired by the human brain, and they can learn complex patterns from large amounts of data.
Deep learning has been used to achieve state-of-the-art results in a wide variety of tasks, including image classification, natural language processing, and speech recognition.
In this article series, we will see how this seemingly complex technology works by dissecting it to its core, then building it back up. We will build an understanding of deep learning and neural networks from the ground up, so that later topics in the series, such as convolutional neural networks, sequence modeling, transformers, reinforcement learning, and text-to-image generation, are easy to follow.
If this tickles your curiosity, buckle up for the greatest ride ever.
A perceptron is a type of artificial neuron that is used in deep learning. Perceptrons are inspired by the biological neurons in the human brain. They can learn to classify data by adjusting their weights and biases.
A perceptron does the basic operations of a neural network. It takes in some inputs, carries out several operations on the data, then gives an output.
Each input has an associated weight, which the perceptron uses to calculate an output. The perceptron multiplies each input by its weight and sums these products (together, this is the dot product of the input vector and the weight vector), adds a bias term, then applies some non-linearity, otherwise known as an activation function.
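As a minimal sketch of that computation in plain Python (no TensorFlow needed), the function below multiplies each input by its weight, sums the products, adds a bias, and passes the total through a sigmoid, one common activation function. The function name `perceptron` and the example numbers are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Hypothetical example values: three inputs, three weights, one bias.
output = perceptron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2)
print(output)
```

Here the weighted sum works out to 0.5, so the printed value is simply the sigmoid of 0.5.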
An activation function is a function applied to the output of a neuron to determine whether that neuron will be activated or not. A neuron being activated means its output will be used in the next layer (layers will be explained later). Activation functions also introduce non-linearity which makes it possible for neural networks to learn complex patterns from data.
Some common activation functions include:
- Sigmoid Activation: The output of this function takes values between 0 and 1.
- Tanh Activation: Similar to the Sigmoid function, but has a range of (-1, 1).
- Rectified Linear Unit (ReLU): The output is equal to the input if the input is positive, and 0 otherwise.
- Leaky ReLU: A variation of ReLU that uses a small positive slope for negative inputs instead of a flat zero, to address some shortcomings of ReLU.
In the popular deep learning framework TensorFlow, or simply TF, these functions are available as shown here:

    import tensorflow as tf

    z = tf.constant([-2.0, 0.0, 2.0])  # example values to activate

    # Sigmoid
    tf.math.sigmoid(z)
    # Tanh
    tf.math.tanh(z)
    # ReLU
    tf.nn.relu(z)
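To see what these functions actually compute, here is a rough plain-Python sketch of all four activations from the list above, including Leaky ReLU. The slope `alpha=0.01` for Leaky ReLU is an assumed, commonly used value, not something taken from TensorFlow's defaults.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))   # output in (0, 1)

def tanh(z):
    return math.tanh(z)                  # output in (-1, 1)

def relu(z):
    return max(0.0, z)                   # input if positive, else 0

def leaky_relu(z, alpha=0.01):
    # alpha is the small slope applied to negative inputs (assumed value)
    return z if z > 0 else alpha * z

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```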
Perceptron to Neural Networks.
A perceptron takes in all the inputs and gives one output. Now take a group of perceptrons, each of which takes in all the inputs and produces its own output. This group of neurons is called a layer. In the diagram below, there are two perceptrons, each taking in the three inputs and giving an output. Each neuron has its own weight vector (one weight per input) and its own bias term. A layer like this, where every input is connected to every perceptron in the layer, is called a Dense layer.
These dense layers can then be stacked together, such that the output of one layer is the input of the next layer. This creates a feed-forward neural network called a multi-layer perceptron.
Stacking these layers in TensorFlow is done with the tf.keras.Sequential model class.

    import tensorflow as tf

    n = 2                # number of perceptrons per layer (example value)
    activation = "relu"  # activation function for each layer (example value)

    # Pass a list of Dense layers to Sequential to stack them
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(n, activation=activation),
        tf.keras.layers.Dense(n, activation=activation),
    ])
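To make the idea of stacked dense layers concrete, here is a small plain-Python sketch of the forward pass: three inputs feed a layer of two neurons, whose outputs feed a layer of one neuron. The helper name `dense` and all the weights and biases are invented for illustration; a real framework does the same arithmetic with matrices.

```python
def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One dense layer: every neuron sees every input.

    weights[j] is the weight vector of neuron j, biases[j] is its bias.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(x * w for x, w in zip(inputs, neuron_weights)) + bias
        outputs.append(activation(z))
    return outputs

# Hypothetical numbers: 3 inputs -> layer of 2 neurons -> layer of 1 neuron.
x = [1.0, 2.0, 3.0]
hidden = dense(x, weights=[[0.1, 0.2, 0.3], [-0.3, 0.2, -0.1]],
               biases=[0.0, 0.1], activation=relu)
output = dense(hidden, weights=[[0.5, -0.5]], biases=[0.1], activation=relu)
print(hidden, output)
```

Note that the output of the first call is passed straight in as the input of the second, which is all that "stacking" means here.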
Training a model in essence involves finding the set of weights and biases that gives the best outcome. A loss function (also called a cost function) measures the difference between a model's predicted output and the actual expected value. This tells us how well the model is performing. The goal of training a deep learning model is to minimize the cost function.
The most common loss functions are:
- Mean Square Error (MSE): This is simply the average squared difference between the predicted and actual output. It is used for regression problems where the output can be any real number.
- Cross Entropy: Often used for classification problems, it measures the difference between the predicted probability distribution and the actual probability distribution. Categorical cross-entropy is used for multi-class classification, while binary cross-entropy is used for binary classification problems (where there are only two classes).
- Hinge Loss: This one penalizes predictions that fall on the wrong side of the decision boundary, or on the right side but within a margin of it. With labels of -1 and +1, the loss for one example is max(0, 1 - label × prediction). It is mostly used for binary classification.
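The three losses above can be sketched in a few lines of plain Python. These are simplified versions for illustration, assuming binary labels of 0 or 1 for cross-entropy and -1 or +1 for hinge loss; the function names and sample numbers are made up.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """y_true holds 0 or 1; y_pred holds the predicted probability of class 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def hinge(y_true, scores):
    """y_true holds -1 or +1; scores holds the raw model outputs."""
    return sum(max(0.0, 1.0 - t * s) for t, s in zip(y_true, scores)) / len(y_true)

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
print(hinge([1, -1], [2.0, 0.5]))
```

In the hinge example, the first prediction is confidently correct and contributes nothing, while the second is on the wrong side of the boundary and is penalized.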
The goal of training a model is to find the set of weights that results in the least loss. Initially, the model starts with random weights and passes the data forward through every layer until it reaches the output. As expected, it performs poorly, but now there's a set of outputs to use for back-propagation. Back-propagation starts at the output layer and moves backward. For each neuron in the output layer, the difference between its output and the expected output (the error) is calculated. Using the chain rule (the calculus rule for differentiating a function of a function), this error is propagated backward through the network, layer by layer, which gives the gradient of the cost function with respect to every weight.
At each layer, weights are updated in the direction opposite to the gradient of the cost function, which is the direction that reduces the error. This happens layer by layer, all the way back to the first layer. When all weights are updated, the data goes through the model again, forward this time, and hopefully the model gives a better output. This process is iterative and keeps going until the loss stops improving.
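Here is a toy sketch of that loop for the smallest possible network: a single sigmoid neuron trained with squared-error loss. The dataset (logical OR), the step size of 1.0, the number of passes, and all variable names are assumptions chosen for illustration. A real network repeats the same chain-rule step through every layer.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, bias):
    return sigmoid(sum(xi * wi for xi, wi in zip(x, weights)) + bias)

def mean_loss(data, weights, bias):
    return sum((predict(x, weights, bias) - t) ** 2 for x, t in data) / len(data)

# Toy dataset (assumed for illustration): the logical OR of two inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

random.seed(0)  # fixed seed so the run is repeatable
weights = [random.uniform(-0.5, 0.5) for _ in range(2)]  # random starting weights
bias = 0.0
step_size = 1.0  # how far to move against the gradient, chosen by hand

losses = [mean_loss(data, weights, bias)]
for _ in range(2000):
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    for x, target in data:
        pred = predict(x, weights, bias)           # forward pass
        # Backward pass, chain rule: dLoss/dw = dLoss/dpred * dpred/dz * dz/dw
        dloss_dpred = 2.0 * (pred - target)
        dpred_dz = pred * (1.0 - pred)             # derivative of the sigmoid
        delta = dloss_dpred * dpred_dz
        for i, xi in enumerate(x):
            grad_w[i] += delta * xi / len(data)    # dz/dw_i = x_i
        grad_b += delta / len(data)                # dz/db = 1
    # Update every weight in the direction opposite to its gradient.
    weights = [w - step_size * g for w, g in zip(weights, grad_w)]
    bias -= step_size * grad_b
    losses.append(mean_loss(data, weights, bias))

print("loss before:", losses[0], "loss after:", losses[-1])
```

The loss after training should come out lower than the loss before, which is the whole point of the backward pass.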
We saw that during back-propagation, weights are adjusted. The size of that adjustment is controlled by the learning rate, as expressed in the gradient descent update rule below.
Wn = W - r × G
Wn here represents the new weight. W is the current weight, r is the learning rate, and G is the gradient of the cost function with respect to that weight, which is what back-propagation calculates.
The learning rate involves a trade-off. When it's very low, training converges slowly and may end up stuck at a local minimum instead of finding the best-performing global minimum. When it's too high, the updates keep overshooting the minimum, and the weights may blow up instead of settling.
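A quick way to feel this is to minimize a simple bowl-shaped function, f(w) = w², with different learning rates. The sketch below applies the usual gradient descent rule (new weight = current weight minus learning rate times gradient) for 20 steps. The function, the starting point, and the three rates are assumptions picked for illustration, and since this bowl has only one minimum it cannot show the local-minimum problem.

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimise f(w) = w**2, whose gradient is 2*w, and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w
        w = w - learning_rate * gradient
    return w

too_low = gradient_descent(0.01)     # creeps toward 0, still far away
reasonable = gradient_descent(0.4)   # lands very close to 0
too_high = gradient_descent(1.1)     # overshoots further on every step
print(too_low, reasonable, too_high)
```

With the highest rate, each step jumps across the minimum and lands farther away than it started, so the weight grows instead of shrinking.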
You could guess a value and experiment with each guess, or you could use an adaptive learning rate optimizer. Roughly, these adapt the step size during training based on the gradients seen so far, which often means larger steps in the beginning, hence faster convergence, and progressively smaller steps near the optimal point. Such optimizers include Adam, Adadelta, Adagrad, and RMSProp. Plain SGD, by contrast, uses a fixed learning rate unless you give it a schedule. All of these are available in TensorFlow, as shown in the examples below.
    tensorflow.keras.optimizers.SGD
    tensorflow.keras.optimizers.Adam
    tensorflow.keras.optimizers.RMSprop
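To give a feel for how an adaptive optimizer shrinks its steps, here is a simplified plain-Python sketch of an Adagrad-style update on f(w) = w². It keeps a running sum of squared gradients and divides each step by the square root of that sum. This is an illustrative sketch under assumed settings (starting point, learning rate, `eps`), not TensorFlow's implementation.

```python
import math

def adagrad(start=5.0, learning_rate=1.0, steps=10, eps=1e-8):
    """Minimise f(w) = w**2 with an Adagrad-style update.

    Returns the final w and the size of every step taken.
    """
    w = start
    squared_grad_sum = 0.0
    step_sizes = []
    for _ in range(steps):
        gradient = 2.0 * w
        squared_grad_sum += gradient ** 2  # history of past gradients
        step = learning_rate * gradient / (math.sqrt(squared_grad_sum) + eps)
        w -= step
        step_sizes.append(abs(step))
    return w, step_sizes

w, step_sizes = adagrad()
print(w)
print(step_sizes)
```

The printed step sizes get smaller on every iteration, even though the learning rate itself never changes.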
When the model has had many iterations on the same data, it tends to fit that data very well but perform poorly on new data that it has never seen before. This is called over-fitting. Regularization aims to reduce over-fitting, hence improving the model's generalization to new data.
Two techniques are favorites among engineers.
1. Dropout. This technique works by randomly "dropping out" (setting to zero) a fraction of a layer's input units during training, forcing the model to learn more features from the data instead of relying too much on specific neurons. In TensorFlow, Dropout is implemented as a layer.
    tf.keras.layers.Dropout(0.5)  # 0.5 is the rate: each input unit is dropped with probability 0.5 during training
2. Early Stopping. Just don't train the model too much. Early stopping involves monitoring the performance of a model on a validation set during the training process and stopping the training early when the performance starts to deteriorate. In practice, training is halted when the validation accuracy fails to improve for a few consecutive epochs.
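Both techniques can be sketched in plain Python. The `dropout` function below zeroes each value with probability `rate` during training and scales the survivors by 1 / (1 - rate) so the expected total stays the same. The `early_stopping_epoch` function scans a list of validation losses and reports where training would stop. The function names, the `patience` parameter, the use of validation loss rather than accuracy, and the sample numbers are all assumptions for illustration.

```python
import random

def dropout(values, rate, training=True, rng=random):
    """Zero each value with probability `rate` during training and scale
    the survivors by 1 / (1 - rate). Outside training, pass values through."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop: the first epoch
    where validation loss has not improved for `patience` epochs in a row."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: trained to the end

random.seed(0)
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5))
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, training=False))
# Made-up validation losses: they improve, then start getting worse.
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8], patience=2))
```

In the last call the best loss is reached at epoch 2, and the next two epochs are both worse, so training would stop at epoch 4.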
Congrats🥳, you have made it this far. Now you know a lot about neural networks and deep learning, but most likely cannot yet train a functionally useful neural network. For that, my advice is checking out Deeplizard on their website or on YouTube. It's how I learnt, and it's pretty awesome.
In the next article of this series, I will try to demystify sequence modeling, and how neural networks can pick up information on context and sequence. Looking forward to that? Me too. Happy learning!!!