The Transformer is a deep learning architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. Unlike recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, which process sequences token by token, Transformers rely entirely on self-attention mechanisms to model relationships between all tokens in a sequence in parallel. This parallelization enables faster training on large datasets and superior performance on long-sequence tasks, making Transformers the backbone of modern natural language processing (NLP) models such as BERT, GPT, and T5.
I. Core Limitation of RNNs/LSTMs Solved by Transformers
RNNs and LSTMs process sequences in a stepwise manner:
- The hidden state at time step \(t\) depends only on the input at \(t\) and the hidden state at \(t-1\).
- This sequential processing prevents parallelization (each step must wait for the previous one to finish), leading to slow training on long sequences.
- Even with LSTMs, capturing very long-term dependencies (e.g., a word in a 10,000-token document) remains challenging.
Transformers eliminate recurrence entirely. Instead, they use self-attention to compute a weighted representation of every token relative to all other tokens in the sequence—all in a single parallel operation.
II. Core Components of the Transformer Architecture
The Transformer consists of two main sub-models:
- Encoder: Processes the input sequence and generates a contextualized representation (embedding) for each token. Used in tasks like text classification and named entity recognition (NER).
- Decoder: Generates an output sequence (e.g., translated text) by attending to the encoder’s output and its own previously generated tokens. Used in tasks like machine translation and text generation.
High-Level Architecture Diagram
```plaintext
[Input Sequence] → [Embedding + Positional Encoding] → [Encoder Stack] → [Decoder Stack] → [Output Sequence]
```
1. Embedding & Positional Encoding
A. Token Embedding
Converts each token (e.g., a word or subword) into a dense vector of fixed dimension (e.g., 512). This is similar to embedding layers in RNNs/LSTMs.
B. Positional Encoding
Since Transformers have no recurrence, they lack inherent knowledge of token order. Positional encoding injects sequence order information into the embeddings by adding a unique vector to each token’s embedding based on its position in the sequence.
The positional encoding for position pos and dimension i is defined as:
\(PE_{pos, 2i} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)\)
\(PE_{pos, 2i+1} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)\)
where \(d_{\text{model}}\) = dimension of the embedding vector (e.g., 512).
Key properties:
- In the original Transformer, positional encodings are fixed (not learned); many variants (e.g., BERT) replace them with learned positional embeddings.
- The sine/cosine functions allow the model to generalize to sequence lengths longer than those seen during training.
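As a quick check of the formulas above, take \(d_{\text{model}} = 512\) (the running example). For the first sine/cosine pair (\(i = 0\)), the denominator is \(10000^{0/512} = 1\), so the encoding reduces to \(\sin(pos)\) and \(\cos(pos)\) in radians:

\[
PE_{1,0} = \sin(1) \approx 0.841, \quad PE_{1,1} = \cos(1) \approx 0.540, \qquad PE_{2,0} = \sin(2) \approx 0.909, \quad PE_{2,1} = \cos(2) \approx -0.416
\]

Higher dimension pairs use progressively lower frequencies, so every position receives a distinct pattern across all 512 dimensions.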
2. Self-Attention: The Heart of the Transformer
Self-attention is a mechanism that allows each token in a sequence to “attend” to (i.e., weigh the importance of) every other token in the same sequence. The output for each token is a weighted sum of all tokens’ embeddings, where weights reflect how relevant other tokens are to the current token.
A. Scaled Dot-Product Attention (Core Calculation)
Given three vectors for each token:
- Query (Q): Represents the current token’s “interest” in other tokens.
- Key (K): Represents other tokens’ “ability to answer” the query.
- Value (V): Represents the actual content of other tokens.
These vectors are derived by multiplying the input embeddings by three learnable weight matrices (\(W_Q, W_K, W_V\)):
\(Q = X W_Q, \quad K = X W_K, \quad V = X W_V\)
where X = input embedding matrix of shape \((\text{seq_len}, d_{\text{model}})\).
The scaled dot-product attention is computed as:
\(\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V\)
- \(d_k\) = dimension of Q and K (e.g., 64).
- The scaling factor \(\sqrt{d_k}\) keeps the dot products from growing too large; without it, the softmax saturates toward a near one-hot distribution with extremely small gradients, which slows learning.
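A minimal sketch of this formula in PyTorch (names and shapes here are illustrative; the full multi-head module appears in Section IV):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (seq_len, seq_len) similarity scores
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V                                 # weighted sum of value vectors

Q = K = V = torch.randn(5, 64)  # toy sequence: 5 tokens, d_k = 64
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([5, 64])
```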
B. Multi-Head Attention
Multi-head attention splits the query, key, and value vectors into h smaller sub-vectors (heads), computes scaled dot-product attention for each head independently, and concatenates the results. This allows the model to capture multiple types of relationships between tokens (e.g., syntactic structure and semantic meaning) simultaneously.
The steps are:
- Split \(Q, K, V\) into h heads: \(Q_i, K_i, V_i\) for \(i = 1 \dots h\).
- Compute attention for each head: \(A_i = \text{Attention}(Q_i, K_i, V_i)\).
- Concatenate all head outputs: \(A = [A_1; A_2; \dots; A_h]\).
- Apply a final linear projection: \(A W_O\) (where \(W_O\) is a learnable matrix).
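A quick shape trace of the split-and-recombine steps, using the original paper's sizes (\(d_{\text{model}} = 512\), \(h = 8\), so \(d_k = 64\)); the same logic appears as split_heads/combine_heads in Section IV:

```python
import torch

batch, seq_len, d_model, h = 2, 10, 512, 8
d_k = d_model // h  # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)  # (2, 8, 10, 64): h independent heads
recombined = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(heads.shape, recombined.shape)  # torch.Size([2, 8, 10, 64]) torch.Size([2, 10, 512])
```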
C. Types of Attention in Transformers
| Attention Type | Use Case | Inputs |
|---|---|---|
| Self-Attention (Encoder) | Model relationships within the input sequence | \(Q=K=V=\text{Encoder Input}\) |
| Masked Self-Attention (Decoder) | Prevent the decoder from attending to future tokens (for autoregressive generation) | \(Q=K=V=\text{Decoder Input}\); masks future positions to \(-\infty\) before softmax |
| Encoder-Decoder Attention (Decoder) | Allow the decoder to attend to the encoder’s output (e.g., link source and target tokens in translation) | \(Q=\text{Decoder Input}\), \(K=V=\text{Encoder Output}\) |
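For the masked self-attention row, a common way to hide future positions is a lower-triangular (causal) mask; a minimal sketch, using the same convention as the Section IV code (0 marks a blocked position, which is filled with a large negative value before the softmax):

```python
import torch

seq_len = 5
# 1 = may attend, 0 = blocked future position
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.int))
print(causal_mask)
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```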
3. Feed-Forward Network (FFN)
After multi-head attention, each token’s embedding passes through a position-wise feed-forward network—a simple two-layer fully connected network applied independently to every token (hence “position-wise”).
The FFN is defined as:
\(\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2\)
- Uses ReLU activation (non-linearity).
- Applies the same weights to all tokens (parameter sharing across positions).
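Because nn.Linear acts on the last dimension of its input, the same two weight matrices are applied to every position independently; a small sketch (sizes are the paper's defaults, \(d_{\text{model}} = 512\), \(d_{ff} = 2048\)):

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
out = ffn(x)                     # identical weights applied at every position
print(out.shape)                 # torch.Size([2, 10, 512])
```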
4. Residual Connections & Layer Normalization
Every sub-layer (multi-head attention, FFN) in the encoder/decoder is wrapped in a residual connection followed by layer normalization:
\(\text{LayerNorm}(x + \text{Sublayer}(x))\)
- Residual connections help mitigate the vanishing gradient problem in deep networks.
- Layer normalization stabilizes training by normalizing the activations of each token to have zero mean and unit variance.
5. Encoder Stack
The encoder is a stack of N identical layers (e.g., \(N=6\) in the original Transformer). Each layer contains two sub-layers:
- Multi-head self-attention (with residual connection + layer norm).
- Position-wise FFN (with residual connection + layer norm).
All encoder layers process the entire input sequence in parallel, and the output is a sequence of contextualized embeddings of shape \((\text{seq_len}, d_{\text{model}})\).
6. Decoder Stack
The decoder is also a stack of N identical layers. Each layer contains three sub-layers:
- Masked multi-head self-attention: Ensures the decoder only uses tokens generated so far (masks future tokens).
- Encoder-decoder multi-head attention: Links the decoder to the encoder’s output (critical for translation tasks).
- Position-wise FFN.
Each sub-layer is wrapped in residual connections and layer normalization. The final decoder output is passed through a linear layer followed by a softmax to generate probability distributions over the target vocabulary.
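The implementation in Section IV builds only an encoder. As a hedged illustration of the decoder side, PyTorch's built-in nn.TransformerDecoderLayer bundles the same three sub-layers listed above; the tensor sizes below are illustrative:

```python
import torch
import torch.nn as nn

d_model, n_heads, tgt_len, src_len = 512, 8, 20, 30
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

tgt = torch.randn(2, tgt_len, d_model)     # embedded target tokens generated so far
memory = torch.randn(2, src_len, d_model)  # encoder output
# Boolean causal mask: True = position may NOT be attended to (a future token)
tgt_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 20, 512])
```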
III. Key Transformer Variants & Applications
The original Transformer was designed for machine translation, but its architecture has been adapted to create state-of-the-art models for almost all NLP tasks. Below are the most influential variants:
| Model | Type | Key Innovation | Primary Use Cases |
|---|---|---|---|
| BERT (Bidirectional Encoder Representations from Transformers) | Encoder-only | Uses bidirectional self-attention (attends to left and right context) and pre-training on masked language modeling (MLM). | Text classification, NER, question answering, sentiment analysis. |
| GPT (Generative Pre-trained Transformer) | Decoder-only | Uses autoregressive masked self-attention (generates text token by token, left to right) and pre-training on causal language modeling (CLM). | Text generation, summarization, dialogue systems, code generation. |
| T5 (Text-to-Text Transfer Transformer) | Encoder-decoder | Frames all NLP tasks as “text-to-text” (e.g., translation: translate English to French: hello → bonjour). | Machine translation, summarization, question answering, text classification. |
| GPT-3/4 (Generative Pre-trained Transformer 3/4) | Decoder-only | Scaled to billions of parameters and trained on massive text corpora; supports few-shot/zero-shot learning without task-specific fine-tuning. | General-purpose NLP, creative writing, reasoning, code generation. |
| ViT (Vision Transformer) | Encoder-only | Adapts Transformers to computer vision by splitting images into fixed-size patches (treating patches as tokens). | Image classification, object detection, segmentation. |
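In practice, these variants are usually loaded from pre-trained checkpoints rather than trained from scratch. A brief sketch with the Hugging Face transformers library (this library and the checkpoint name are assumptions for illustration, not part of the original architecture description):

```python
# pip install transformers
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # encoder-only BERT

inputs = tokenizer("Transformers replace recurrence with self-attention.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768): one contextual embedding per token
```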
IV. Transformer Implementation (Python with PyTorch)
Below is a simplified implementation of a mini Transformer encoder for text classification (e.g., sentiment analysis).
1. Import Libraries
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
```
2. Positional Encoding Module
```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        # Compute positional encodings once (fixed, not learned)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # Shape: (1, max_len, d_model)
        self.register_buffer('pe', pe)  # Not a learnable parameter

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        x = x + self.pe[:, :x.size(1), :]
        return x
```
3. Multi-Head Attention Module
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # Dimension per head
        # Linear layers for Q, K, V projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # Output projection

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        # q, k, v shape: (batch_size, n_heads, seq_len, d_k)
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        # Apply mask (if provided) to hide future tokens or padding
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = F.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, v)  # Shape: (batch_size, n_heads, seq_len, d_k)
        return output, attn_probs

    def split_heads(self, x):
        # Split into n_heads: (batch_size, seq_len, d_model) → (batch_size, n_heads, seq_len, d_k)
        batch_size = x.size(0)
        return x.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        # Combine heads back to d_model: (batch_size, n_heads, seq_len, d_k) → (batch_size, seq_len, d_model)
        batch_size = x.size(0)
        return x.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)

    def forward(self, q, k, v, mask=None):
        # Project Q, K, V and split into heads
        q = self.split_heads(self.w_q(q))
        k = self.split_heads(self.w_k(k))
        v = self.split_heads(self.w_v(v))
        # Compute attention
        attn_output, attn_probs = self.scaled_dot_product_attention(q, k, v, mask)
        # Combine heads and project output
        output = self.w_o(self.combine_heads(attn_output))
        return output, attn_probs
```
4. Transformer Encoder Layer
```python
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention + residual + norm
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # FFN + residual + norm
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x
```
5. Transformer Encoder for Text Classification
```python
class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, n_heads: int, n_layers: int,
                 d_ff: int, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        self.encoder_layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(d_model, num_classes)  # Classification head
        self.d_model = d_model

    def forward(self, src, src_mask=None):
        # Embedding + positional encoding
        x = self.embedding(src) * math.sqrt(self.d_model)  # Scale embeddings for stability
        x = self.pos_encoding(x)
        x = self.dropout(x)
        # Pass through encoder layers
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        # Pooling: use a <CLS> token (first token) or average pooling.
        # Here we average over all token positions.
        x = x.mean(dim=1)
        # Classification
        output = self.fc(x)
        return output
```
6. Test the Model
```python
# Hyperparameters
VOCAB_SIZE = 10000
D_MODEL = 128
N_HEADS = 4
N_LAYERS = 2
D_FF = 512
NUM_CLASSES = 2  # Binary classification (positive/negative)
DROPOUT = 0.1

# Initialize model
model = TransformerClassifier(VOCAB_SIZE, D_MODEL, N_HEADS, N_LAYERS, D_FF, NUM_CLASSES, DROPOUT)

# Dummy input: batch_size=8, seq_len=32
src = torch.randint(0, VOCAB_SIZE, (8, 32))
output = model(src)
print(f"Model Output Shape: {output.shape}")  # Expected: (8, 2)
```
V. Transformer vs. LSTM vs. CNN: Key Differences
| Feature | Transformer | LSTM | CNN |
|---|---|---|---|
| Sequence Processing | Parallel (all tokens at once) | Sequential (token by token) | Parallel (sliding window over tokens) |
| Long-Term Dependencies | Excellent (self-attention models all token pairs) | Good (memory cell + gates) | Poor (limited by kernel size) |
| Training Speed | Fast (parallelization; GPU-friendly) | Slow (sequential; no parallelization) | Medium (parallel but limited by window size) |
| Key Mechanism | Self-attention (weights token relationships) | Recurrence + memory cell | Convolution (local feature extraction) |
| Use Cases | Long texts, NLP, vision (ViT), speech | Short/medium sequences, time-series, speech | Image processing, short-text classification |
VI. Summary
- Transformers are deep learning architectures that replace recurrence with self-attention, enabling parallel processing of sequential data.
- Core Components: Embedding + positional encoding, multi-head self-attention, feed-forward networks, residual connections, and layer normalization.
- Key Variants: BERT (encoder-only, bidirectional), GPT (decoder-only, autoregressive), T5 (encoder-decoder, text-to-text).
- Advantages: Faster training, better long-term dependency modeling, and state-of-the-art performance on NLP/vision tasks.
- Limitations: High memory usage (self-attention has \(O(n^2)\) complexity in the sequence length n); mitigated by sparse-attention variants such as Longformer.