Mini-Batch Gradient Descent is a widely adopted optimization algorithm in machine learning and deep learning, serving as a balanced compromise between Batch Gradient Descent (BGD) and Stochastic Gradient Descent (SGD). It computes the gradient of the loss function using a small, fixed-size subset of the training data (called a mini-batch) for each parameter update, combining the efficiency of SGD and the stability of BGD.
This algorithm is the de facto standard for training deep neural networks, as it leverages parallel computing capabilities of GPUs and TPUs to accelerate training while maintaining relatively low gradient noise.
I. Core Principles
1. Mathematical Formulation
Let the model parameters be denoted as \(\theta\), the mini-batch size as b (typically ranging from 16 to 256), the learning rate as \(\alpha\), and the loss function for a single sample \((x_i, y_i)\) as \(L(\theta; x_i, y_i)\).
Given a mini-batch \(B = \{(x_1,y_1), (x_2,y_2), \ldots, (x_b,y_b)\}\) randomly sampled from the training dataset, the parameter update rule for Mini-Batch Gradient Descent is:
\(\theta_{t+1} = \theta_t - \alpha \cdot \frac{1}{b} \sum_{i=1}^b \nabla_{\theta} L(\theta_t; x_i, y_i)\)
Where:
- t: Current iteration number
- \(\frac{1}{b} \sum_{i=1}^b \nabla_{\theta} L(\theta_t; x_i, y_i)\): Average gradient of the loss function over the mini-batch B
- \(\nabla_{\theta} L(\theta_t; x_i, y_i)\): Gradient of the loss function with respect to \(\theta\) for the i-th sample in the mini-batch
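The update rule above can be sketched in a few lines of NumPy. The example below applies it to linear regression with a squared-error loss, where the per-sample gradient has a closed form; the dataset, batch size b, and learning rate \(\alpha\) are assumptions chosen purely for illustration:

```python
import numpy as np

# Illustrative sketch of the mini-batch update rule for linear regression
# with squared-error loss L(theta; x_i, y_i) = 0.5 * (theta @ x_i - y_i)**2.
# The per-sample gradient is x_i * (theta @ x_i - y_i), so the mini-batch
# average is Xb.T @ (Xb @ theta - yb) / b.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # 1000 samples, 3 features (assumed)
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=1000)

theta = np.zeros(3)
alpha, b = 0.1, 32                       # learning rate and mini-batch size

for step in range(500):
    idx = rng.choice(len(X), size=b, replace=False)   # sample mini-batch B
    Xb, yb = X[idx], y[idx]
    # Average gradient over the mini-batch: (1/b) * sum of per-sample gradients
    grad = Xb.T @ (Xb @ theta - yb) / b
    theta -= alpha * grad                # theta_{t+1} = theta_t - alpha * grad

print(theta)  # approaches [2.0, -1.0, 0.5]
```

Note that each update touches only b = 32 samples, yet the noisy averaged gradient still drives \(\theta\) to the true parameters.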
2. Key Parameter: Mini-Batch Size (b)
The mini-batch size is a critical hyperparameter that impacts training speed, stability, and hardware utilization:
- Small b (e.g., 16): Similar to SGD, with faster updates but higher gradient noise. Suitable for small datasets or models with limited memory.
- Medium b (e.g., 32, 64, 128): Optimal balance—low enough noise for stable convergence, high enough to leverage GPU parallelism. The default choice for most deep learning tasks.
- Large b (e.g., 512, 1024): Similar to BGD, with smoother gradients but slower updates and higher memory usage. Suitable for large-scale distributed training.
3. Comparison with BGD and SGD
| Property | Batch Gradient Descent (BGD) | Stochastic Gradient Descent (SGD) | Mini-Batch Gradient Descent |
|---|---|---|---|
| Gradient Source | Entire training dataset (m samples) | Single random sample (\(b=1\)) | Small random subset (\(1 < b < m\)) |
| Update Frequency | Low (1 update per epoch) | High (\(m\) updates per epoch) | Moderate (\(m/b\) updates per epoch) |
| Gradient Noise | Very low (exact gradient) | Very high (approximate gradient) | Moderate (averaged gradient) |
| Convergence Speed | Slow (high computation per update) | Fast (low computation per update but noisy) | Fast (balanced computation and noise) |
| GPU Compatibility | Limited (full dataset often exceeds memory) | Poor (single sample per iteration underutilizes hardware) | Excellent (parallelizes mini-batch computations) |
| Memory Requirement | High (loads entire dataset) | Low (loads one sample) | Moderate (loads mini-batch) |
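The "Update Frequency" row can be checked with quick arithmetic. The sketch below assumes a dataset size of m = 60,000 (the MNIST training-set size used later in this article):

```python
import math

# With m training samples and mini-batch size b, one epoch yields
# ceil(m / b) parameter updates. m = 60000 is an assumed dataset size.
m = 60000

for name, b in [("BGD", m), ("SGD", 1), ("Mini-Batch (b=64)", 64)]:
    updates_per_epoch = math.ceil(m / b)
    print(f"{name}: {updates_per_epoch} updates per epoch")
```

This prints 1 update per epoch for BGD, 60,000 for SGD, and 938 for a mini-batch size of 64, matching the table.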
II. Enhanced Variants of Mini-Batch Gradient Descent
Basic Mini-Batch Gradient Descent can be augmented with momentum and adaptive learning rate strategies to further improve convergence speed and stability:
1. Mini-Batch SGD with Momentum
Momentum accumulates the gradient of previous iterations to form a velocity term, which helps the algorithm accelerate in the direction of consistent gradients and dampen oscillations:
\(v_{t+1} = \gamma v_t + \frac{1}{b} \sum_{i=1}^b \nabla_{\theta} L(\theta_t; x_i, y_i)\)
\(\theta_{t+1} = \theta_t - \alpha \cdot v_{t+1}\)
Where \(\gamma\) (momentum coefficient, typically 0.9) controls the contribution of past gradients.
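A minimal sketch of the momentum recursion above, run on a 1-D quadratic loss \(L(\theta) = \frac{1}{2}\theta^2\) (whose gradient is simply \(\theta\)); the toy loss and the starting point are assumptions, and in real training the gradient term would be the mini-batch average from Section I:

```python
# Momentum update on a toy 1-D quadratic loss L(theta) = 0.5 * theta**2.
# gamma = 0.9 is the typical momentum coefficient from the text;
# alpha = 0.1 and the starting point theta = 5.0 are assumed for illustration.
theta, v = 5.0, 0.0
gamma, alpha = 0.9, 0.1

for t in range(100):
    grad = theta                 # stands in for the mini-batch gradient
    v = gamma * v + grad         # v_{t+1} = gamma * v_t + grad
    theta -= alpha * v           # theta_{t+1} = theta_t - alpha * v_{t+1}

print(abs(theta))  # converges toward the minimum at theta = 0
```

The velocity term overshoots and oscillates briefly but reaches the minimum far faster than plain gradient descent would at the same learning rate, which is exactly the acceleration effect described above.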
2. Adaptive Optimizers for Mini-Batch Training
The most popular adaptive optimizers (used with mini-batches) combine gradient averaging with per-parameter learning rate tuning:
- Adam (Adaptive Moment Estimation): Tracks the first and second moments of the gradient to adapt learning rates dynamically. The de facto optimizer for most deep learning tasks.
- RMSprop: Normalizes the gradient by the root mean square of past gradients to stabilize updates for sparse data.
III. Implementation Example (Python with PyTorch)
Below is a practical implementation of Mini-Batch Gradient Descent with the Adam optimizer, training a convolutional neural network (CNN) on the MNIST handwritten-digit classification dataset.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# 1. Define CNN Model
class MNISTCNN(nn.Module):
    def __init__(self):
        super(MNISTCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.25)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 64 * 7 * 7)
        x = self.dropout(self.relu(self.fc1(x)))
        x = self.fc2(x)
        return x

# 2. Data Preprocessing and Loader (with Mini-Batches)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Mini-batch size set to 64 (a common choice)
batch_size = 64
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True  # Shuffle data to randomize mini-batches
)
test_dataset = datasets.MNIST('./data', train=False, transform=transform)
test_loader = torch.utils.data.DataLoader(
    test_dataset, batch_size=batch_size, shuffle=False
)

# 3. Initialize Model, Loss Function, and Optimizer
model = MNISTCNN()
criterion = nn.CrossEntropyLoss()
# Adam optimizer with mini-batch gradient updates
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 4. Training Loop with Mini-Batch Gradient Descent
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
num_epochs = 5

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        # Zero gradients before each mini-batch update
        optimizer.zero_grad()
        # Forward pass
        outputs = model(data)
        loss = criterion(outputs, target)
        # Backward pass: compute gradient for the mini-batch
        loss.backward()
        # Update parameters using mini-batch gradient
        optimizer.step()
        running_loss += loss.item()
        if (batch_idx + 1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Batch [{batch_idx+1}/{len(train_loader)}], '
                  f'Loss: {running_loss/100:.4f}')
            running_loss = 0.0

# 5. Evaluate Model
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        outputs = model(data)
        _, predicted = torch.max(outputs.data, 1)
        total += target.size(0)
        correct += (predicted == target).sum().item()

print(f'Test Accuracy of the network on the 10000 test images: {100 * correct / total:.2f} %')
```
IV. Practical Best Practices
- Shuffle Training Data: Always shuffle the training dataset before creating mini-batches to avoid correlation between consecutive batches, which can lead to biased gradient estimates.
- Tune Mini-Batch Size: Start with a batch size of 32 or 64. If training is unstable, reduce the size; if GPU memory is underutilized, increase the size (up to the limits of hardware).
- Use Learning Rate Scheduling: Decay the learning rate over time (e.g., `StepLR` or `ReduceLROnPlateau` in PyTorch) to reduce oscillations and converge to a better minimum.
- Combine with Regularization: Pair Mini-Batch Gradient Descent with dropout, weight decay, or data augmentation to prevent overfitting in deep neural networks.
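The scheduling practice above can be sketched as follows. The toy linear model stands in for the CNN, and the `step_size` and `gamma` values are assumptions to be tuned per task:

```python
import torch.nn as nn
import torch.optim as optim

# Sketch of learning rate scheduling with StepLR. The model is a toy
# stand-in; step_size=2 and gamma=0.1 are assumed values for illustration.
model = nn.Linear(10, 2)
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Multiply the learning rate by 0.1 every 2 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

for epoch in range(6):
    # ... one epoch of mini-batch updates would run here ...
    optimizer.step()    # placeholder update so the scheduler call is valid
    scheduler.step()    # decay the learning rate on the epoch boundary
    print(epoch, optimizer.param_groups[0]["lr"])
```

Calling `scheduler.step()` once per epoch (after the optimizer steps) keeps the decay schedule aligned with epoch boundaries, so the learning rate drops from 1e-3 toward 1e-6 over the six epochs.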
V. Advantages and Limitations
Advantages
- Balanced Speed and Stability: Faster than BGD (more frequent updates) and more stable than SGD (less gradient noise).
- GPU Acceleration: Mini-batches are optimized for parallel processing on GPUs, drastically reducing training time for large models.
- Scalability: Works seamlessly with distributed training frameworks (e.g., PyTorch Distributed, TensorFlow Distributed) for large-scale datasets.
Limitations
- Memory Constraints: Larger mini-batches require more GPU memory, which can be a bottleneck for very deep models (e.g., transformers with billions of parameters).
- Hyperparameter Tuning: Requires tuning both the learning rate and mini-batch size, which can be time-consuming for new tasks.