Stochastic Gradient Descent (SGD) is a variant of the gradient descent optimization algorithm that updates model parameters using the gradient computed from a single randomly selected training sample in each iteration, rather than the entire dataset (Batch Gradient Descent) or a small random subset of samples (Mini-Batch Gradient Descent).
The term “stochastic” refers to the random selection of individual samples, which introduces noise into the gradient estimates but enables faster parameter updates—making SGD well-suited for large-scale machine learning and deep learning tasks.
I. Core Principles
1. Mathematical Formulation
Let the model parameters be denoted as \(\theta\), the loss function for a single sample \((x_i, y_i)\) as \(L(\theta; x_i, y_i)\), and the learning rate as \(\alpha\). The parameter update rule for SGD is:
\(\theta_{t+1} = \theta_t - \alpha \cdot \nabla_{\theta} L(\theta_t; x_i, y_i)\)
Where:
- t: Current iteration number
- \(x_i, y_i\): A randomly selected input-output pair from the training dataset
- \(\nabla_{\theta} L(\theta_t; x_i, y_i)\): Gradient of the loss function with respect to \(\theta\), computed using the single sample \((x_i, y_i)\)
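To make the update rule concrete, here is a minimal sketch of one SGD step for a hypothetical linear-regression model with squared-error loss (the synthetic data, the learning rate of 0.01, and all variable names are illustrative assumptions, not part of the formulation above):

```python
import numpy as np

# Hypothetical setup: linear model y_hat = theta . x with squared-error loss
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])       # synthetic targets

theta = np.zeros(3)                      # model parameters
alpha = 0.01                             # learning rate

# One SGD iteration: draw a single random sample and step along its gradient
i = rng.integers(len(X))
x_i, y_i = X[i], y[i]
grad = 2 * (theta @ x_i - y_i) * x_i     # gradient of (theta . x_i - y_i)^2 w.r.t. theta
theta = theta - alpha * grad
```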
2. Key Difference from Batch Gradient Descent (BGD)
| Property | Stochastic Gradient Descent (SGD) | Batch Gradient Descent (BGD) |
|---|---|---|
| Gradient Computation | Uses 1 random sample per iteration | Uses the entire training dataset per iteration |
| Update Frequency | High (one update per sample) | Low (one update per full dataset pass) |
| Gradient Noise | High (approximate gradient) | Low (exact gradient) |
| Convergence Behavior | Oscillates around the minimum | Converges smoothly to the minimum |
| Computational Cost | Low (O(1) per iteration) | High (O(m) per iteration, m = number of samples) |
| Memory Requirement | Low (only one sample loaded at a time) | High (entire dataset loaded into memory) |
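To illustrate the per-iteration cost difference in the table, the sketch below (a self-contained toy linear-regression problem with illustrative data and learning rate) contrasts one BGD update, which averages the gradient over all m samples, with one SGD update computed from a single sample:

```python
import numpy as np

# Toy linear-regression problem with squared-error loss (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # m = 1000 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])
theta, alpha = np.zeros(3), 0.01

# BGD: one update costs O(m) -- exact gradient averaged over the full dataset
grad_full = 2 * X.T @ (X @ theta - y) / len(X)
theta_bgd = theta - alpha * grad_full

# SGD: one update costs O(1) -- noisy gradient from a single random sample
i = rng.integers(len(X))
grad_one = 2 * (theta @ X[i] - y[i]) * X[i]
theta_sgd = theta - alpha * grad_one
```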
II. Convergence Characteristics
- Noise-Induced Oscillations: Random sampling of individual examples yields noisy gradient estimates, causing the parameter updates to oscillate around the optimal value \(\theta_*\). To mitigate this, a decaying learning rate is often used: \(\alpha_t = \frac{\alpha_0}{\sqrt{t}} \quad \text{or} \quad \alpha_t = \frac{\alpha_0}{1 + \lambda t}\), where \(\alpha_0\) is the initial learning rate, \(t\) is the iteration, and \(\lambda\) is a decay coefficient. A decaying learning rate reduces oscillations over time and helps the algorithm converge to the minimum (see the sketch after this list).
- Escaping Local Minima: The noise in SGD gradients can help the model escape shallow local minima and saddle points, an advantage over BGD, which may get stuck in suboptimal solutions for non-convex loss functions (common in deep learning).
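A minimal sketch of the two decay schedules above as plain Python functions (the values \(\alpha_0 = 0.1\) and \(\lambda = 0.01\) are illustrative assumptions):

```python
def sqrt_decay(alpha0: float, t: int) -> float:
    """alpha_t = alpha_0 / sqrt(t), for t >= 1."""
    return alpha0 / (t ** 0.5)

def inverse_decay(alpha0: float, t: int, lam: float) -> float:
    """alpha_t = alpha_0 / (1 + lambda * t)."""
    return alpha0 / (1 + lam * t)

# Illustrative values: alpha_0 = 0.1, lambda = 0.01
for t in [1, 10, 100, 1000]:
    print(t, sqrt_decay(0.1, t), inverse_decay(0.1, t, 0.01))
```

Both schedules shrink the step size as training progresses, which damps the oscillations described above while still allowing large steps early on.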
III. SGD with Momentum
Standard SGD suffers from slow convergence in regions with elongated loss function valleys. SGD with Momentum addresses this by adding a velocity term that accumulates gradients from previous iterations, mimicking physical inertia:
\(v_{t+1} = \gamma v_t + \nabla_{\theta} L(\theta_t; x_i, y_i)\)
\(\theta_{t+1} = \theta_t - \alpha \cdot v_{t+1}\)
Where \(\gamma\) (momentum coefficient, typically 0.9) controls the contribution of past gradients. Momentum accelerates convergence in the direction of consistent gradients and dampens oscillations.
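As a sketch of how these two update equations work together, the loop below applies SGD with momentum to a toy single-sample squared-error loss (the data, the 1000-iteration budget, and \(\alpha = 0.01\) are illustrative assumptions; \(\gamma = 0.9\) follows the typical value quoted above):

```python
import numpy as np

# Toy linear model with squared-error loss on single samples (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

theta = np.zeros(3)
v = np.zeros(3)               # velocity term (accumulated gradients)
alpha, gamma = 0.01, 0.9      # learning rate, momentum coefficient

for t in range(1000):
    i = rng.integers(len(X))
    grad = 2 * (theta @ X[i] - y[i]) * X[i]   # single-sample gradient
    v = gamma * v + grad                      # v_{t+1} = gamma * v_t + gradient
    theta = theta - alpha * v                 # theta_{t+1} = theta_t - alpha * v_{t+1}
```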
IV. Implementation Example (Python with PyTorch)
Below is a minimal implementation of SGD for training a simple neural network on the MNIST dataset:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Define a simple fully-connected neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = x.view(-1, 784)  # Flatten 28x28 images to 784-dimensional vectors
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Data preprocessing
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST dataset
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=1, shuffle=True)  # Batch size = 1 for pure SGD

# Initialize model, loss function, and SGD optimizer with momentum
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Training loop (SGD)
num_epochs = 5
model.train()
for epoch in range(num_epochs):
    running_loss = 0.0
    for i, (data, target) in enumerate(train_loader):
        # Zero the parameter gradients
        optimizer.zero_grad()
        # Forward pass
        outputs = model(data)
        loss = criterion(outputs, target)
        # Backward pass (compute gradient for the single sample)
        loss.backward()
        # Update parameters (SGD step)
        optimizer.step()
        # Accumulate loss
        running_loss += loss.item()
        # Print statistics every 1000 samples
        if (i + 1) % 1000 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {running_loss/1000:.4f}')
            running_loss = 0.0

print('Finished Training')
```
V. Advantages and Limitations
Advantages
- Computational Efficiency: SGD requires minimal computation per iteration, making it suitable for large datasets that cannot fit into memory.
- Fast Initial Convergence: SGD updates parameters frequently, leading to rapid progress in the early stages of training.
- Robustness to Local Minima: Noise in gradients helps escape shallow local minima, which is critical for training deep neural networks.
Limitations
- Noisy Gradients: The approximate gradient estimates cause oscillations, slowing down convergence near the optimal point.
- Sensitivity to Learning Rate: A fixed learning rate may lead to either slow convergence (too small) or divergence (too large).
- Poor Scalability for Parallel Computing: SGD processes one sample at a time, so it cannot exploit the parallel processing capabilities of modern GPUs as effectively as mini-batch gradient descent.
VI. Practical Tips
- Combine with Momentum: Use SGD with momentum (typically \(\gamma = 0.9\)) to reduce oscillations and accelerate convergence.
- Use Mini-Batch SGD in Practice: For most deep learning tasks, mini-batch gradient descent (a hybrid of SGD and BGD, using small batches of 16–256 samples) is preferred over pure SGD, as it balances speed and stability.
- Tune the Learning Rate: Start with a moderate initial learning rate (e.g., 0.01) and use a learning rate scheduler to decay it over time (a sketch follows below).
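As an illustration of the last tip, here is a minimal sketch using PyTorch's built-in StepLR scheduler on top of the SGD optimizer from Section IV (the stand-in model, step_size=2, and gamma=0.5 are illustrative choices, not recommendations):

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(784, 10)                             # stand-in for the SimpleNN defined above
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = StepLR(optimizer, step_size=2, gamma=0.5)  # halve the learning rate every 2 epochs

for epoch in range(5):
    # ... one epoch of the training loop from Section IV would run here ...
    scheduler.step()                                   # decay the learning rate after each epoch
    print(epoch, [group['lr'] for group in optimizer.param_groups])
```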