How do you build Transformers from Scratch? [A Guide]

So what the hell are Transformers?
Transformers are a type of deep learning model specifically designed for sequence-to-sequence tasks, such as language translation, text summarization, and question-answering. They were introduced by Vaswani et al. in the paper "Attention is All You Need" in 2017 and have since become the foundation for many state-of-the-art NLP models.
At the heart of a Transformer is the "self-attention mechanism." This mechanism allows the model to weigh the importance of different words in a sentence while processing it. By doing so, the model can capture dependencies and relationships between words, making it exceptionally effective at handling tasks that require an understanding of context, long-range dependencies, and relationships between words in a sentence.
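To make the idea concrete, here is a minimal sketch of scaled dot-product attention, the computation at the heart of self-attention. The function name and tensor sizes are illustrative choices for this sketch only; the full multi-head version is built step by step later in this guide.

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Score how strongly each token should attend to every other token
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Softmax turns the scores into attention weights that sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Each output vector is a weighted sum of the value vectors
    return torch.matmul(weights, V)

x = torch.randn(1, 5, 16)                    # a toy "sentence": 5 tokens, 16-dimensional embeddings
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q, K, and V all come from x
print(out.shape)                             # torch.Size([1, 5, 16])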
 
How do Transformers work?
Transformers consist of an encoder and a decoder, both composed of multiple layers. The encoder takes in the input sequence, such as a sentence in one language, and processes it using self-attention mechanisms and feedforward neural networks. The output of the encoder is a set of contextualized representations for each word in the input sequence. The decoder then uses these representations to generate the target sequence, such as a translation in another language.
The key innovation of Transformers is that they can process input sequences in parallel, making them highly efficient and reducing the training time compared to earlier sequential models. Additionally, Transformers employ a mechanism called "multi-head attention" to capture different types of relationships between words simultaneously, improving their ability to capture complex patterns in data.
 
The Whole New Deal with Transformers
The "whole new deal" with Transformers lies in their capacity to learn and represent contextual information from large datasets, which has propelled them to the forefront of NLP research. Unlike earlier models like RNNs and LSTMs, Transformers have a fixed context window for each word, which means they can capture long-range dependencies in text effectively.
Moreover, Transformers have the ability to transfer knowledge from one task to another. This transfer learning capability has led to the development of pre-trained models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). These models are pre-trained on vast amounts of text data and can be fine-tuned for specific downstream tasks. This transfer learning approach has significantly reduced the need for vast amounts of task-specific training data.
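As a quick, hedged illustration of that transfer-learning workflow (separate from the from-scratch build in this guide, and assuming the Hugging Face transformers library is installed), loading a pre-trained BERT and pointing it at a downstream classification task might look roughly like this:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Download a pre-trained BERT and attach a fresh 2-class classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# The pre-trained weights already encode general language knowledge;
# fine-tuning would now continue training on task-specific labelled data
inputs = tokenizer("Transformers transfer knowledge across tasks.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])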
 
Real-Life Applications of Transformers
Transformers have found application in a wide range of real-world scenarios. Some notable examples include:
  1. Language Translation: Transformers are used in machine translation services, such as Google Translate, to convert text from one language to another.
  2. Chatbots and Virtual Assistants: Conversational systems built on models like GPT-3 use Transformers to hold natural, context-aware conversations.
  3. Text Summarisation: Transformers are employed to generate concise summaries of long articles, making it easier for users to grasp the main points.
  4. Sentiment Analysis: They help businesses analyze customer feedback, reviews, and social media posts to understand public sentiment.
  5. Question-Answering: Transformers like BERT are used to answer questions based on a given context, which is valuable in search engines and virtual assistants.
  6. Image and Text Generation: Transformers like DALL·E are capable of generating images from textual descriptions.
The adaptability and versatility of Transformers have made them an indispensable tool in natural language understanding, and their influence continues to grow as researchers develop more sophisticated and domain-specific models, opening up new horizons for AI applications in various domains.
 
To build our own Transformer, we'll follow these steps:
 

Step 1: Import Necessary Libraries and Modules

  • Begin by importing the required libraries and initializing key modules to set up the foundational infrastructure for the Transformer model.

Step 2: Define Basic Building Blocks

  • Multi-Head Attention: Define the multi-head attention mechanism to improve the model's ability to understand complex dependencies.
  • Position-wise Feed-Forward Networks: Create the position-wise feed-forward networks for localised information processing and model adaptability.
  • Positional Encoding: Add positional encoding to help the model understand word order and sequence information.

Step 3: Build Encoder and Decoder Layers

  • Construct the Encoder and Decoder layers, which are the core building blocks of the Transformer model. The Encoder processes input data, while the Decoder generates outputs.

Step 4: Combine Encoder and Decoder Layers

  • Merge the Encoder and Decoder layers to create the complete Transformer model. This fusion is the central component of our machine learning architecture.

Step 5: Prepare Sample Data

  • Curate and prepare sample data that aligns with the model's objectives, ensuring it showcases the model's capabilities.

Step 6: Train the Model

  • Begin the model training phase, where data, design, and computation come together to unlock the potential of the Transformer model.
 
Let's get our hands dirty. We'll be using Python (with PyTorch) throughout, and we'll start off by importing the necessary libraries.
 
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy
 
Next up, we'll build the basic blocks of a Transformer, starting with the multi-head attention mechanism. This is the secret sauce that allows Transformers to understand the intricate dance of words in a sequence. It's like having a team of experts who collaborate to grasp every nuance, ensuring that the model can tackle a wide range of language understanding tasks.
 
# MultiHeadAttention class
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Linear transformations for Query (Q), Key (K), Value (V), and the output (o)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Calculate attention scores
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # Apply an optional mask to the attention scores
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        # Calculate attention probabilities using softmax
        attn_probs = torch.softmax(attn_scores, dim=-1)
        # Compute the weighted sum of values (V) using attention probabilities
        output = torch.matmul(attn_probs, V)
        return output

    def split_heads(self, x):
        # Reshape the input tensor into multiple heads
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        # Combine the attention outputs from all heads
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, Q, K, V, mask=None):
        # Apply linear transformations to the input Q, K, and V
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        # Compute multi-head self-attention
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        # Combine the multi-head attention outputs
        output = self.W_o(self.combine_heads(attn_output))
        return output


# PositionWiseFeedForward class
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        # Define two linear layers and a ReLU activation
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Pass the input through the first linear layer, ReLU activation, and the second linear layer
        return self.fc2(self.relu(self.fc1(x)))
MultiHeadAttention Class:
The MultiHeadAttention class is a crucial component of the Transformer architecture. It enables the model to capture complex relationships within input sequences. It does this by using multiple attention heads, each focusing on different aspects of the data.
  • In the constructor (__init__), we set up the number of attention heads (num_heads) and ensure that the model's dimensionality (d_model) is divisible by the number of heads. We also define linear transformation layers for Query (Q), Key (K), Value (V), and the output (o).
  • The scaled_dot_product_attention method calculates attention scores, applies an optional mask to these scores, and computes the weighted sum of values (V) based on attention probabilities.
  • The split_heads and combine_heads methods are used to reshape and combine the attention outputs from all heads.
  • In the forward method, we apply linear transformations to the input Q, K, and V, compute multi-head self-attention, and then combine the outputs.
PositionWiseFeedForward Class:
The PositionWiseFeedForward class is another essential building block of the Transformer. It processes the output of the multi-head attention layers.
  • In the constructor, we define two linear layers (fc1 and fc2) and a ReLU activation function. These layers are responsible for processing and transforming the model's representations.
  • The forward method takes an input tensor x, passes it through the first linear layer, applies the ReLU activation, and then feeds it through the second linear layer.
Together, these classes represent core elements of the Transformer architecture, allowing it to perform multi-head self-attention and position-wise feedforward transformations. These are key components for understanding and processing sequences, making the Transformer one of the most powerful models in natural language processing and beyond.
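As a quick, hypothetical sanity check that the shapes line up (the batch size, sequence length, and model dimension below are arbitrary choices, not values from the guide itself):

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)   # (batch_size, seq_length, d_model)
out = mha(x, x, x)            # self-attention: Q, K, and V are all the same tensor
print(out.shape)              # torch.Size([2, 10, 512]) -- the model dimension is preserved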
 
 
# Create a custom class for the PositionWiseFeedForward module
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        # The first linear transformation layer with input dimension 'd_model' and output dimension 'd_ff'
        self.fc1 = nn.Linear(d_model, d_ff)
        # The second linear transformation layer with input dimension 'd_ff' and output dimension 'd_model'
        self.fc2 = nn.Linear(d_ff, d_model)
        # ReLU activation function for non-linearity
        self.relu = nn.ReLU()

    def forward(self, x):
        """
        This method defines the forward pass of the PositionWiseFeedForward module.

        Args:
            x (torch.Tensor): The input tensor of shape (batch_size, seq_length, d_model).

        Returns:
            torch.Tensor: The output tensor after passing through the feedforward layers.
        """
        # Apply the first linear transformation followed by ReLU activation
        intermediate_output = self.relu(self.fc1(x))
        # Apply the second linear transformation to get the final output
        final_output = self.fc2(intermediate_output)
        return final_output
PositionWiseFeedForward Class:
In the Transformer architecture, the PositionWiseFeedForward module plays a crucial role in processing and transforming the outputs of multi-head self-attention layers. Let's break down this class and explain its components in detail.
  • The first linear layer (self.fc1) takes an input tensor with dimension d_model and projects it into a higher-dimensional space with dimension d_ff. This expansion gives the non-linearity that follows more capacity to capture complex transformations.
  • The ReLU activation function (self.relu) is applied after the first linear transformation. ReLU introduces non-linearity by replacing negative values with zero, helping the model capture complex patterns in the data.
  • The second linear layer (self.fc2) projects the intermediate output from the first layer back to the original dimension d_model. This step helps the model to retain the desired dimensionality while capturing essential features introduced by the non-linear transformation.
  • In the forward method, we perform the forward pass for this module. We apply the first linear transformation, followed by the ReLU activation. Then, the intermediate output goes through the second linear transformation to produce the final output.
This PositionWiseFeedForward module is a critical component in the Transformer's ability to capture and process complex patterns in sequences, making it a powerful tool for various natural language processing tasks like translation, summarisation, and more.
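A tiny, hypothetical usage check (arbitrary dimensions) confirms that the block expands to d_ff internally but hands back a tensor with the original d_model dimension:

ffn = PositionWiseFeedForward(d_model=512, d_ff=2048)
x = torch.randn(2, 10, 512)   # (batch_size, seq_length, d_model)
print(ffn(x).shape)           # torch.Size([2, 10, 512])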
# Custom PositionalEncoding class
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        # Create a positional encoding matrix 'pe'
        pe = torch.zeros(max_seq_length, d_model)
        # Generate positions from 0 to 'max_seq_length - 1' as a column vector
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        # Compute div_term to scale and create periodicity in positional encodings
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        # Calculate sine and cosine components for positional encodings
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Register 'pe' as a buffer so it is saved with the model's state without being a trainable parameter
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        """
        This method defines the forward pass of the PositionalEncoding module.

        Args:
            x (torch.Tensor): The input tensor of shape (batch_size, seq_length, d_model).

        Returns:
            torch.Tensor: The input tensor with positional encodings added.
        """
        # Add the positional encodings to the input tensor 'x'
        return x + self.pe[:, :x.size(1)]
PositionalEncoding Class:
In the Transformer model, positional information is crucial because the model doesn't have built-in knowledge of the order of elements in a sequence. The PositionalEncoding class is responsible for adding these positional encodings to the input data to inform the model about the position of each element in the sequence. Let's delve into this class and understand its workings.
 
  • We create a matrix pe (positional encodings) of zeros with dimensions max_seq_length by d_model.
  • We generate a column vector position containing positions from 0 to max_seq_length - 1.
  • We compute div_term, which is used to create periodicity in the positional encodings by applying an exponential function to even indices. This helps the model learn positional information effectively.
  • Next, we calculate the sine and cosine components of the positional encodings and store them in the pe matrix. This is where the magic happens: these components will inform the model about the positions of elements in the sequence.
  • To ensure these positional encodings are saved and moved with the model (without being treated as trainable parameters), we register pe as a buffer.
  • In the forward method, we add the positional encodings to the input tensor x. This step allows the model to consider the position of each element, making it aware of the order in the sequence.
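A short, hypothetical check (arbitrary dimensions) shows that the module simply adds the encodings element-wise and leaves the tensor's shape untouched:

pos_enc = PositionalEncoding(d_model=512, max_seq_length=100)
x = torch.zeros(1, 20, 512)   # 20 identical "tokens" of all zeros
encoded = pos_enc(x)
print(encoded.shape)                                 # torch.Size([1, 20, 512])
print(torch.allclose(encoded[0, 0], encoded[0, 1]))  # False -- each position now carries a distinct signature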
 
Building the Encoder and Decoder layers.
Encoder layer is responsible for processing and transforming the input data, allowing the model to capture and understand complex patterns within sequences.
 
 
# Custom EncoderLayer class
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        # Self-attention mechanism
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feedforward network
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        # Layer normalization for the self-attention and feedforward sub-layers
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout layer to prevent overfitting
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        """
        This method defines the forward pass of the EncoderLayer module.

        Args:
            x (torch.Tensor): The input tensor of shape (batch_size, seq_length, d_model).
            mask (torch.Tensor): An optional mask for handling padding in sequences.

        Returns:
            torch.Tensor: The output tensor after applying self-attention and feedforward layers.
        """
        # Self-attention mechanism: the model focuses on different parts of the input sequence
        attn_output = self.self_attn(x, x, x, mask)
        # Add residual connection and apply layer normalization
        x = self.norm1(x + self.dropout(attn_output))
        # Position-wise feedforward network: captures complex patterns in the data
        ff_output = self.feed_forward(x)
        # Add residual connection and apply layer normalization
        x = self.norm2(x + self.dropout(ff_output))
        return x
EncoderLayer Class:
The EncoderLayer is repeated multiple times in the encoder stack of the Transformer, allowing the model to process input sequences effectively. It's a key contributor to the model's ability to understand complex relationships within data, making it an essential component for tasks like language translation and text generation.
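A brief, hypothetical shape check (arbitrary dimensions, with an all-ones mask that hides nothing) illustrates that a single encoder layer maps its input to an output of the same shape:

enc_layer = EncoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
x = torch.randn(2, 10, 512)      # (batch_size, seq_length, d_model)
mask = torch.ones(2, 1, 1, 10)   # trivial padding mask: every position stays visible
print(enc_layer(x, mask).shape)  # torch.Size([2, 10, 512])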
 
Decoder Layer:
This layer is responsible for processing the target sequence, incorporating information from the source sequence, and capturing complex patterns within sequences.
# Custom DecoderLayer class
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        # Self-attention mechanism for the target sequence
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Cross-attention mechanism for the source-target interaction
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feedforward network
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        # Layer normalization for the self-attention, cross-attention, and feedforward sub-layers
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        # Dropout layer to prevent overfitting
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        """
        This method defines the forward pass of the DecoderLayer module.

        Args:
            x (torch.Tensor): The input tensor of shape (batch_size, tgt_seq_length, d_model).
            enc_output (torch.Tensor): The output from the encoder.
            src_mask (torch.Tensor): Mask for handling padding in the source sequence.
            tgt_mask (torch.Tensor): Mask for handling padding and future positions in the target sequence.

        Returns:
            torch.Tensor: The output tensor after applying self-attention, cross-attention, and feedforward layers.
        """
        # Self-attention mechanism for the target sequence
        attn_output = self.self_attn(x, x, x, tgt_mask)
        # Add residual connection and apply layer normalization
        x = self.norm1(x + self.dropout(attn_output))
        # Cross-attention mechanism for source-target interaction
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        # Add residual connection and apply layer normalization
        x = self.norm2(x + self.dropout(attn_output))
        # Position-wise feedforward network: captures complex patterns in the data
        ff_output = self.feed_forward(x)
        # Add residual connection and apply layer normalization
        x = self.norm3(x + self.dropout(ff_output))
        return x
DecoderLayer Class:
The DecoderLayer is repeated multiple times in the decoder stack of the Transformer, allowing the model to process and generate target sequences effectively. It's a key contributor to the model's ability to understand complex relationships within data, making it an essential component for tasks like language translation and text generation.
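A brief, hypothetical shape check (arbitrary dimensions, with trivial all-ones masks; the full Transformer class further below builds the real padding and look-ahead masks) shows how a decoder layer consumes both the target embeddings and the encoder output:

dec_layer = DecoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
tgt = torch.randn(2, 10, 512)        # target-side representations
enc_out = torch.randn(2, 12, 512)    # encoder output for a source sequence of length 12
src_mask = torch.ones(2, 1, 1, 12)   # trivial source mask
tgt_mask = torch.ones(2, 1, 10, 10)  # trivial target mask (no look-ahead masking here)
print(dec_layer(tgt, enc_out, src_mask, tgt_mask).shape)  # torch.Size([2, 10, 512])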
 
Preparation of the sample data and the training loop (note that this snippet instantiates the complete Transformer class, which is assembled in the final code block below):
 
# Define hyperparameters for the Transformer model
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

# Create an instance of the Transformer model
transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

# Generate random sample data for demonstration purposes
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)

# Training the model
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

# Set the model in training mode
transformer.train()

# Training loop
for epoch in range(100):
    optimizer.zero_grad()
    # Forward pass: compute the model's output
    output = transformer(src_data, tgt_data[:, :-1])
    # Calculate the loss, ignoring the padded values (0)
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))
    # Backpropagation and optimization
    loss.backward()
    optimizer.step()
    # Print the loss for monitoring
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")
Training a Transformer Model:
In this code example, we're demonstrating how to train a Transformer model using a toy dataset. In practice, you would use a larger and more meaningful dataset for tasks like language translation. Let's break down the code and understand its components.
  • We start by defining hyperparameters for the Transformer model, including vocabulary sizes, model dimensions, the number of attention heads, the number of layers, feedforward dimensions, maximum sequence length, and dropout rate.
  • Next, we create an instance of the Transformer model with the specified hyperparameters. This model is ready to be trained.
  • To demonstrate training, we generate random sample data for both the source and target sequences. In practice, you would preprocess real data and create vocabulary mappings for source and target languages.
  • We set up the training process, including defining a loss function (CrossEntropyLoss) and an optimiser (Adam) for updating the model's parameters.
  • The model is put into training mode using transformer.train(). This is important because some components of the model, like dropout layers, behave differently during training and evaluation.
  • We enter a training loop that iterates over a fixed number of epochs (100 in this case). In practice, you would use early stopping or other techniques to determine when to stop training based on validation performance.
  • Inside the loop, we perform the following steps:
      1. Reset the optimiser's gradients to zero.
      2. Perform a forward pass through the model to compute the model's output.
      3. Calculate the loss by comparing the model's predictions to the actual target sequences. We use CrossEntropyLoss and ignore padding values with ignore_index=0.
      4. Perform backpropagation to compute gradients.
      5. Update the model's parameters using the optimizer.
      6. Print the current epoch and loss for monitoring training progress.
This code demonstrates the basic training process for a Transformer model. In practice, training involves much larger datasets, extensive preprocessing, and potentially distributed computing resources. The Transformer architecture is widely used for tasks like machine translation, summarization, and more, and its training can be a computationally intensive process.
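As one possible next step (not part of the toy example above), a minimal evaluation sketch could look like the following; val_src and val_tgt are hypothetical held-out tensors with the same shapes as src_data and tgt_data:

transformer.eval()  # switch dropout (and similar layers) to evaluation behaviour
with torch.no_grad():
    val_output = transformer(val_src, val_tgt[:, :-1])
    val_loss = criterion(val_output.contiguous().view(-1, tgt_vocab_size),
                         val_tgt[:, 1:].contiguous().view(-1))
    print(f"Validation Loss: {val_loss.item()}")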
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        # Initialize the Transformer model with essential parameters.
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)
        # Create a stack of encoder and decoder layers.
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        # Prepare the final output layer and dropout for regularization.
        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        # Generate masks for the source and target sequences.
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        # Create a mask to prevent the decoder from attending to future tokens.
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        # Execute the forward pass of the Transformer model.
        # Generate source and target masks using the generate_mask method.
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        # Apply embeddings, positional encoding, and dropout to the source and target sequences.
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))
        # Initialize the encoder output with source embeddings.
        enc_output = src_embedded
        # Process the source sequence through encoder layers, updating the enc_output tensor.
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)
        # Initialize the decoder output with target embeddings.
        dec_output = tgt_embedded
        # Process the target sequence through decoder layers, utilizing enc_output and masks, and updating dec_output.
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)
        # Apply the linear projection layer to the decoder output, obtaining output logits.
        output = self.fc(dec_output)
        return output
1. The Transformer's Foundation
At the core of the Transformer is the Transformer class. This class brings together various components to build a powerful language model. Let's break it down step by step.
2. The Setup
First, we need some essential information: the size of our vocabulary, the model's dimensionality (d_model), the number of attention heads (num_heads), the number of layers (num_layers), and more. These parameters determine the model's capabilities and complexity.
3. The Embeddings
Our model begins with embeddings. We have two types: one for the source language and one for the target language. These embeddings help the model understand words by converting them into numerical vectors. The Embedding layers are like a dictionary for the model, allowing it to look up the meaning of words.
4. Positional Encoding
Words matter, but so does their position in a sentence. We add a dash of magic with positional encoding. This ensures that the model knows where words are located in a sequence. Think of it as giving each word its own unique address.
5. Stacking Layers
The Transformer isn't a one-trick pony. It has layers, lots of them! We stack multiple encoder and decoder layers on top of each other. Each layer refines the model's understanding of the input and output.
6. Masks: Ignoring the Unimportant
We don't want the model to be distracted by padding tokens or cheat by looking at future words. That's where masks come in. We generate source and target masks to tell the model which parts of the input to pay attention to.
7. The Flow of Data
With the setup complete, it's time for the data to flow. We apply embeddings to our source and target sequences, add positional encoding for context, and sprinkle some dropout for regularisation.
8. The Transformers' Dance
The encoder does its magic on the source sequence, refining its understanding. Then, the decoder follows suit. Each encoder and decoder layer is like a choreographer, teaching the model to understand nuances in the data.
9. The Grand Finale: The Output Layer
All the hard work comes down to this - the output layer. It takes the refined understanding and projects it back into the vocabulary space. The model's response or translation is crafted here.
10. The Transformer Rises!
And there you have it, a complete Transformer model. It can understand and generate human language, perform translations, and even answer questions. The Transformer is a superstar of NLP, and with these core building blocks, you now have a backstage pass to its inner workings.
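Once trained, the model can be used autoregressively. The sketch below is a minimal greedy-decoding loop, not code from the original guide: it uses token id 1 as a stand-in start-of-sequence token, whereas a real setup would reserve dedicated special tokens and stop at an end-of-sequence token.

transformer.eval()
src = torch.randint(1, src_vocab_size, (1, max_seq_length))  # one (random) source sequence
generated = torch.ones(1, 1, dtype=torch.long)               # hypothetical start-of-sequence token (id 1)
with torch.no_grad():
    for _ in range(max_seq_length - 1):
        logits = transformer(src, generated)                 # (1, current_length, tgt_vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick the most likely next token
        generated = torch.cat([generated, next_token], dim=1)
print(generated)  # the greedily decoded target token ids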
 
 
 

Atharva Joshi

Sat Sep 09 2023