The Transformer is a deep learning architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. Unlike recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, which process sequences token by token, Transformers rely entirely on self-attention mechanisms to model relationships between all tokens in a sequence in parallel. This parallelization enables faster training on large datasets and superior performance on long-sequence tasks, making Transformers the backbone of modern natural language processing (NLP) models such as BERT, GPT, and T5.
I. Core Limitation of RNNs/LSTMs Solved by Transformers
RNNs and LSTMs process sequences in a stepwise manner:
- The hidden state at time step \(t\) depends only on the input at \(t\) and the hidden state at \(t-1\).
- This sequential processing prevents parallelization (each step must wait for the previous one to finish), leading to slow training on long sequences.
- Even with LSTMs, capturing very long-term dependencies (e.g., a word in a 10,000-token document) remains challenging.
Transformers eliminate recurrence entirely. Instead, they use self-attention to compute a weighted representation of every token relative to all other tokens in the sequence—all in a single parallel operation.
II. Core Components of the Transformer Architecture
The Transformer consists of two main sub-models:
- Encoder: Processes the input sequence and generates a contextualized representation (embedding) for each token. Used in tasks like text classification and named entity recognition (NER).
- Decoder: Generates an output sequence (e.g., translated text) by attending to the encoder’s output and its own previously generated tokens. Used in tasks like machine translation and text generation.
High-Level Architecture Diagram
```plaintext
[Input Sequence] → [Embedding + Positional Encoding] → [Encoder Stack] → [Decoder Stack] → [Output Sequence]
```
1. Embedding & Positional Encoding
A. Token Embedding
Converts each token (e.g., a word or subword) into a dense vector of fixed dimension (e.g., 512). This is similar to embedding layers in RNNs/LSTMs.
B. Positional Encoding
Since Transformers have no recurrence, they lack inherent knowledge of token order. Positional encoding injects sequence order information into the embeddings by adding a unique vector to each token’s embedding based on its position in the sequence.
The positional encoding for position pos and dimension i is defined as:
\(PE_{pos, 2i} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)\)
\(PE_{pos, 2i+1} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)\)
where \(d_{\text{model}}\) = dimension of the embedding vector (e.g., 512).
Key properties:
- In the original Transformer, positional encodings are fixed (not learned); many variants (e.g., BERT) replace them with learned positional embeddings.
- The sine/cosine functions allow the model to generalize to sequence lengths longer than those seen during training.
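As a quick check of the formulas above, take \(d_{\text{model}} = 512\) (the running example). For the first sine/cosine pair (\(i = 0\)), the denominator is \(10000^{0/512} = 1\), so the encoding reduces to \(\sin(pos)\) and \(\cos(pos)\) in radians:

\[
PE_{1,0} = \sin(1) \approx 0.841, \quad PE_{1,1} = \cos(1) \approx 0.540, \qquad PE_{2,0} = \sin(2) \approx 0.909, \quad PE_{2,1} = \cos(2) \approx -0.416
\]

Higher dimension pairs use progressively lower frequencies, so every position receives a distinct pattern across all 512 dimensions.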
2. Self-Attention: The Heart of the Transformer
Self-attention is a mechanism that allows each token in a sequence to “attend” to (i.e., weigh the importance of) every other token in the same sequence. The output for each token is a weighted sum of all tokens’ embeddings, where weights reflect how relevant other tokens are to the current token.
A. Scaled Dot-Product Attention (Core Calculation)
Given three vectors for each token:
- Query (Q): Represents the current token’s “interest” in other tokens.
- Key (K): Represents other tokens’ “ability to answer” the query.
- Value (V): Represents the actual content of other tokens.
These vectors are derived by multiplying the input embeddings by three learnable weight matrices (\(W_Q, W_K, W_V\)):
\(Q = X W_Q, \quad K = X W_K, \quad V = X W_V\)
where X = input embedding matrix of shape \((\text{seq_len}, d_{\text{model}})\).
The scaled dot-product attention is computed as:
\(\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V\)
- \(d_k\) = dimension of Q and K (e.g., 64).
- The scaling factor \(\sqrt{d_k}\) keeps the dot products from growing too large; without it, the softmax saturates toward a near one-hot distribution with extremely small gradients, which slows learning.
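A minimal sketch of this formula in PyTorch (names and shapes here are illustrative; the full multi-head module appears in Section IV):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (seq_len, seq_len) similarity scores
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V                                 # weighted sum of value vectors

Q = K = V = torch.randn(5, 64)  # toy sequence: 5 tokens, d_k = 64
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([5, 64])
```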
B. Multi-Head Attention
Multi-head attention splits the query, key, and value vectors into h smaller sub-vectors (heads), computes scaled dot-product attention for each head independently, and concatenates the results. This allows the model to capture multiple types of relationships between tokens (e.g., syntactic structure and semantic meaning) simultaneously.
The steps are:
- Split \(Q, K, V\) into h heads: \(Q_i, K_i, V_i\) for \(i = 1 \dots h\).
- Compute attention for each head: \(A_i = \text{Attention}(Q_i, K_i, V_i)\).
- Concatenate all head outputs: \(A = [A_1; A_2; \dots; A_h]\).
- Apply a final linear projection: \(A W_O\) (where \(W_O\) is a learnable matrix).
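A quick shape trace of the split-and-recombine steps, using the original paper's sizes (\(d_{\text{model}} = 512\), \(h = 8\), so \(d_k = 64\)); the same logic appears as split_heads/combine_heads in Section IV:

```python
import torch

batch, seq_len, d_model, h = 2, 10, 512, 8
d_k = d_model // h  # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)  # (2, 8, 10, 64): h independent heads
recombined = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(heads.shape, recombined.shape)  # torch.Size([2, 8, 10, 64]) torch.Size([2, 10, 512])
```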
C. Types of Attention in Transformers
| Attention Type | Use Case | Inputs |
|---|---|---|
| Self-Attention (Encoder) | Model relationships within the input sequence | \(Q=K=V=\text{Encoder Input}\) |
| Masked Self-Attention (Decoder) | Prevent the decoder from attending to future tokens (for autoregressive generation) | \(Q=K=V=\text{Decoder Input}\); masks future positions to \(-\infty\) before softmax |
| Encoder-Decoder Attention (Decoder) | Allow the decoder to attend to the encoder’s output (e.g., link source and target tokens in translation) | \(Q=\text{Decoder Input}\), \(K=V=\text{Encoder Output}\) |
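For the masked self-attention row, a common way to hide future positions is a lower-triangular (causal) mask; a minimal sketch, using the same convention as the Section IV code (0 marks a blocked position, which is filled with a large negative value before the softmax):

```python
import torch

seq_len = 5
# 1 = may attend, 0 = blocked future position
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.int))
print(causal_mask)
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```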
3. Feed-Forward Network (FFN)
After multi-head attention, each token’s embedding passes through a position-wise feed-forward network—a simple two-layer fully connected network applied independently to every token (hence “position-wise”).
The FFN is defined as:
\(\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2\)
- Uses ReLU activation (non-linearity).
- Applies the same weights to all tokens (parameter sharing across positions).
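Because nn.Linear acts on the last dimension of its input, the same two weight matrices are applied to every position independently; a small sketch (sizes are the paper's defaults, \(d_{\text{model}} = 512\), \(d_{ff} = 2048\)):

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
out = ffn(x)                     # identical weights applied at every position
print(out.shape)                 # torch.Size([2, 10, 512])
```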
4. Residual Connections & Layer Normalization
Every sub-layer (multi-head attention, FFN) in the encoder/decoder is wrapped in a residual connection followed by layer normalization:
\(\text{LayerNorm}(x + \text{Sublayer}(x))\)
- Residual connections help mitigate the vanishing gradient problem in deep networks.
- Layer normalization stabilizes training by normalizing the activations of each token to have zero mean and unit variance.
5. Encoder Stack
The encoder is a stack of N identical layers (e.g., \(N=6\) in the original Transformer). Each layer contains two sub-layers:
- Multi-head self-attention (with residual connection + layer norm).
- Position-wise FFN (with residual connection + layer norm).
All encoder layers process the entire input sequence in parallel, and the output is a sequence of contextualized embeddings of shape \((\text{seq_len}, d_{\text{model}})\).
6. Decoder Stack
The decoder is also a stack of N identical layers. Each layer contains three sub-layers:
- Masked multi-head self-attention: Ensures the decoder only uses tokens generated so far (masks future tokens).
- Encoder-decoder multi-head attention: Links the decoder to the encoder’s output (critical for translation tasks).
- Position-wise FFN.
Each sub-layer is wrapped in residual connections and layer normalization. The final decoder output is passed through a linear layer followed by a softmax to generate probability distributions over the target vocabulary.
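The implementation in Section IV builds only an encoder. As a hedged illustration of the decoder side, PyTorch's built-in nn.TransformerDecoderLayer bundles the same three sub-layers listed above; the tensor sizes below are illustrative:

```python
import torch
import torch.nn as nn

d_model, n_heads, tgt_len, src_len = 512, 8, 20, 30
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

tgt = torch.randn(2, tgt_len, d_model)     # embedded target tokens generated so far
memory = torch.randn(2, src_len, d_model)  # encoder output
# Boolean causal mask: True = position may NOT be attended to (a future token)
tgt_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 20, 512])
```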
III. Key Transformer Variants & Applications
The original Transformer was designed for machine translation, but its architecture has been adapted to create state-of-the-art models for almost all NLP tasks. Below are the most influential variants:
| Model | Type | Key Innovation | Primary Use Cases |
|---|---|---|---|
| BERT (Bidirectional Encoder Representations from Transformers) | Encoder-only | Uses bidirectional self-attention (attends to left and right context) and pre-training on masked language modeling (MLM). | Text classification, NER, question answering, sentiment analysis. |
| GPT (Generative Pre-trained Transformer) | Decoder-only | Uses autoregressive masked self-attention (generates text token by token, left to right) and pre-training on causal language modeling (CLM). | Text generation, summarization, dialogue systems, code generation. |
| T5 (Text-to-Text Transfer Transformer) | Encoder-decoder | Frames all NLP tasks as “text-to-text” (e.g., translation: translate English to French: hello → bonjour). | Machine translation, summarization, question answering, text classification. |
| GPT-3/4 (Generative Pre-trained Transformer 3/4) | Decoder-only | Scaled to billions of parameters and trained on massive text corpora; supports few-shot/zero-shot learning without task-specific fine-tuning. | General-purpose NLP, creative writing, reasoning, code generation. |
| ViT (Vision Transformer) | Encoder-only | Adapts Transformers to computer vision by splitting images into fixed-size patches (treating patches as tokens). | Image classification, object detection, segmentation. |
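In practice, these variants are usually loaded from pre-trained checkpoints rather than trained from scratch. A brief sketch with the Hugging Face transformers library (this library and the checkpoint name are assumptions for illustration, not part of the original architecture description):

```python
# pip install transformers
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # encoder-only BERT

inputs = tokenizer("Transformers replace recurrence with self-attention.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768): one contextual embedding per token
```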
IV. Transformer Implementation (Python with PyTorch)
Below is a simplified implementation of a mini Transformer encoder for text classification (e.g., sentiment analysis).
1. Import Libraries
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
```
2. Positional Encoding Module
```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        # Compute positional encodings once (fixed, not learned)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # Shape: (1, max_len, d_model)
        self.register_buffer('pe', pe)  # Not a learnable parameter

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        x = x + self.pe[:, :x.size(1), :]
        return x
```
3. Multi-Head Attention Module
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # Dimension per head
        # Linear layers for Q, K, V projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # Output projection

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        # q, k, v shape: (batch_size, n_heads, seq_len, d_k)
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        # Apply mask (if provided) to hide future tokens or padding
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = F.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)  # Shape: (batch_size, n_heads, seq_len, d_k)
        return output, attn_probs

    def split_heads(self, x):
        # Split into n_heads: (batch_size, seq_len, d_model) → (batch_size, n_heads, seq_len, d_k)
        batch_size = x.size(0)
        return x.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        # Combine heads back to d_model: (batch_size, n_heads, seq_len, d_k) → (batch_size, seq_len, d_model)
        batch_size = x.size(0)
        return x.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)

    def forward(self, q, k, v, mask=None):
        # Project Q, K, V and split into heads
        q = self.split_heads(self.w_q(q))
        k = self.split_heads(self.w_k(k))
        v = self.split_heads(self.w_v(v))
        # Compute attention
        attn_output, attn_probs = self.scaled_dot_product_attention(q, k, v, mask)
        # Combine heads and project output
        output = self.w_o(self.combine_heads(attn_output))
        return output, attn_probs
```
4. Transformer Encoder Layer
```python
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention + residual + norm
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # FFN + residual + norm
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x
```
5. Transformer Encoder for Text Classification
```python
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, n_heads: int, n_layers: int,
                 d_ff: int, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        self.encoder_layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(d_model, num_classes)  # Classification head
        self.d_model = d_model

    def forward(self, src, src_mask=None):
        # Embedding + positional encoding
        x = self.embedding(src) * math.sqrt(self.d_model)  # Scale embeddings for stability
        x = self.pos_encoding(x)
        x = self.dropout(x)
        # Pass through encoder layers
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        # Pooling: use a <CLS> token (first token) or average pooling.
        # Here we average over all token positions.
        x = x.mean(dim=1)
        # Classification
        output = self.fc(x)
        return output
```
6. Test the Model
```python
# Hyperparameters
VOCAB_SIZE = 10000
D_MODEL = 128
N_HEADS = 4
N_LAYERS = 2
D_FF = 512
NUM_CLASSES = 2  # Binary classification (positive/negative)
DROPOUT = 0.1

# Initialize model
model = TransformerClassifier(VOCAB_SIZE, D_MODEL, N_HEADS, N_LAYERS, D_FF, NUM_CLASSES, DROPOUT)

# Dummy input: batch_size=8, seq_len=32
src = torch.randint(0, VOCAB_SIZE, (8, 32))
output = model(src)
print(f"Model Output Shape: {output.shape}")  # Expected: (8, 2)
```
V. Transformer vs. LSTM vs. CNN: Key Differences
| Feature | Transformer | LSTM | CNN |
|---|---|---|---|
| Sequence Processing | Parallel (all tokens at once) | Sequential (token by token) | Parallel (sliding window over tokens) |
| Long-Term Dependencies | Excellent (self-attention models all token pairs) | Good (memory cell + gates) | Poor (limited by kernel size) |
| Training Speed | Fast (parallelization; GPU-friendly) | Slow (sequential; no parallelization) | Medium (parallel but limited by window size) |
| Key Mechanism | Self-attention (weights token relationships) | Recurrence + memory cell | Convolution (local feature extraction) |
| Use Cases | Long texts, NLP, vision (ViT), speech | Short/medium sequences, time-series, speech | Image processing, short-text classification |
VI. Summary
- Transformers are deep learning architectures that replace recurrence with self-attention, enabling parallel processing of sequential data.
- Core Components: Embedding + positional encoding, multi-head self-attention, feed-forward networks, residual connections, and layer normalization.
- Key Variants: BERT (encoder-only, bidirectional), GPT (decoder-only, autoregressive), T5 (encoder-decoder, text-to-text).
- Advantages: Faster training, better long-term dependency modeling, and state-of-the-art performance on NLP/vision tasks.
- Limitations: High memory usage (self-attention has \(O(n^2)\) complexity in the sequence length n); mitigated by sparse-attention variants such as Longformer.