Stochastic Gradient Descent (SGD) is the foundational optimization algorithm for training machine learning and deep learning models. Unlike Batch Gradient Descent (which uses the entire dataset to compute gradients) or Mini-Batch Gradient Descent (which uses small batches), SGD computes the gradient of the loss function using a single random sample from the training data at each step. This makes it computationally efficient and scalable to large datasets—critical for training deep neural networks.
While modern optimizers like Adam have largely replaced vanilla SGD as the default choice, understanding SGD is essential for grasping core optimization concepts (momentum, learning rate scheduling) and for scenarios where simplicity, low memory usage, or better generalization is needed.
Core Motivation
The goal of any optimizer is to minimize the loss function \(L(\theta)\), where \(\theta\) represents the model’s parameters (weights and biases). For a dataset with N samples, the full loss (empirical risk) is:
\(L(\theta) = \frac{1}{N} \sum_{i=1}^N L_i(\theta)\)
where \(L_i(\theta)\) is the loss for the i-th sample.
Limitations of Batch Gradient Descent
Batch Gradient Descent computes the gradient using the entire dataset:
\(\nabla_\theta L(\theta) = \frac{1}{N} \sum_{i=1}^N \nabla_\theta L_i(\theta)\)
- Problem 1: Computationally expensive for large N (e.g., ImageNet has 1.4M samples).
- Problem 2: Slow convergence (one update per full pass over the dataset).
SGD Solution
SGD approximates the full gradient using a single random sample (i):
\(\nabla_\theta L(\theta) \approx \nabla_\theta L_i(\theta)\)
This approximation introduces noise (stochasticity) but:
- Reduces computation per update (O(1) instead of O(N)).
- Allows frequent parameter updates (faster progress toward the loss minimum).
- The noise can help escape local minima (a benefit for non-convex loss surfaces in deep learning).
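To make the trade-off concrete, here is a minimal NumPy sketch (a toy one-parameter linear model invented for illustration; it is separate from the examples later in this post) showing that the single-sample gradient is a noisy but, on average, unbiased estimate of the full-batch gradient:
python
# Compare the full-batch gradient with a single-sample SGD estimate (toy example)
import numpy as np

np.random.seed(0)
N = 1000
x = np.random.randn(N)
y = 3.0 * x + np.random.normal(0, 0.5, size=N)   # toy data around y = 3x
w = 0.0                                          # per-sample loss: L_i = 0.5 * (w*x_i - y_i)**2

full_grad = np.mean((w * x - y) * x)             # Batch GD gradient: averages all N samples, O(N)
i = np.random.randint(N)
single_grad = (w * x[i] - y[i]) * x[i]           # SGD gradient: one random sample, O(1), noisy

# Averaged over many random draws, the single-sample gradient matches the full gradient
draws = [(w * x[j] - y[j]) * x[j] for j in np.random.randint(N, size=5000)]
print(full_grad, single_grad, np.mean(draws))    # first and last values should be close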
How SGD Works
Basic SGD Algorithm
The core update rule for SGD is simple and iterative:
- Initialize parameters: Randomly initialize \(\theta_0\) (weights/biases).
- Shuffle the dataset: Ensure samples are processed in random order (critical for SGD).
- For each epoch:
  - For each sample i in the shuffled dataset:
    - Compute the gradient of the loss for sample i: \(g_i = \nabla_\theta L_i(\theta_t)\).
    - Update parameters by moving in the direction opposite to the gradient: \(\theta_{t+1} = \theta_t - \alpha \cdot g_i\), where \(\alpha\) (learning rate) controls the step size.
- Repeat: Until the loss converges (stops decreasing) or a maximum number of epochs is reached.
Key Note: Mini-Batch SGD (Practical SGD)
In practice, “SGD” almost always refers to Mini-Batch SGD (not single-sample SGD), which uses small batches of m samples (e.g., 32, 64, 128) to compute gradients:
\(\theta_{t+1} = \theta_t - \alpha \cdot \frac{1}{m} \sum_{i \in \text{batch}} \nabla_\theta L_i(\theta_t)\)
- Mini-Batch SGD balances the noise of single-sample SGD and the computational cost of Batch GD.
- Batch size \(m=32\) is a common default for deep learning.
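As a rough sketch of how the mini-batch loop differs from single-sample SGD (the toy data, variable names, and hyperparameter values here are illustrative only and separate from the implementations later in this post):
python
# One manual Mini-Batch SGD run on toy linear-regression data
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=200)   # toy data around y = 2x + 1

params = np.zeros(2)       # [w, b]
lr, m = 0.1, 32            # learning rate and mini-batch size (illustrative values)

for epoch in range(30):
    idx = rng.permutation(len(x))                  # reshuffle every epoch
    for start in range(0, len(x), m):
        batch = idx[start:start + m]
        xb, yb = x[batch], y[batch]
        err = params[0] * xb + params[1] - yb      # residuals for this mini-batch
        grad = np.array([np.mean(err * xb), np.mean(err)])  # gradient averaged over the batch
        params -= lr * grad                        # one mini-batch update
print(params)  # should end up close to [2.0, 1.0]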
SGD with Momentum (Critical Improvement)
Vanilla SGD suffers from oscillations (zig-zagging) around the loss minimum, especially on steep loss surfaces. Momentum fixes this by accumulating past gradients to smooth updates—like a ball rolling down a hill (momentum keeps it moving in the right direction):
\(v_t = \beta \cdot v_{t-1} + (1 - \beta) \cdot g_t \quad (\text{or } v_t = \beta \cdot v_{t-1} + g_t \text{ (simpler form)})\)
\(\theta_{t+1} = \theta_t - \alpha \cdot v_t\)
- \(v_t\): Momentum vector (accumulated gradient).
- \(\beta\): Momentum coefficient (default: 0.9)—higher values = more smoothing.
- Momentum accelerates convergence and reduces oscillations.
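To see what "accumulating past gradients" means, unroll the simpler recursion \(v_t = \beta \cdot v_{t-1} + g_t\) (with \(v_0 = 0\)):
\(v_t = g_t + \beta \cdot g_{t-1} + \beta^2 \cdot g_{t-2} + \dots + \beta^{t-1} \cdot g_1\)
Each update is an exponentially weighted sum of all past gradients, so components that consistently point in the same direction reinforce each other, while oscillating components largely cancel.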
SGD Implementation (Python: Manual + TensorFlow/Keras)
We first implement vanilla SGD and SGD with momentum manually for a regression task, then show standard usage in TensorFlow/Keras (for deep learning).
Step 1: Manual SGD Implementation
python
import numpy as np
import matplotlib.pyplot as plt
# --------------------------
# 1. Synthetic Regression Data
# --------------------------
np.random.seed(42)
x = np.linspace(-5, 5, 100)
y_true = 2 * x + 1 # True model: y = 2x + 1
y = y_true + np.random.normal(0, 1, size=x.shape) # Add noise
# Model: y_hat = w*x + b (single weight w, bias b)
def predict(x, w, b):
    return w * x + b
# MSE loss for a single sample
def sample_loss(y_hat, y):
    return 0.5 * (y_hat - y)**2  # 0.5 simplifies gradient calculation
# Gradient of loss for a single sample
def compute_sample_gradient(x_i, y_i, w, b):
    y_hat = predict(x_i, w, b)
    dw = (y_hat - y_i) * x_i  # dL/dw
    db = (y_hat - y_i)        # dL/db
    return np.array([dw, db])
# --------------------------
# 2. Vanilla SGD Optimizer
# --------------------------
class VanillaSGD:
    def __init__(self, lr=0.01):
        self.lr = lr  # Learning rate
    def update(self, params, grad):
        # Basic SGD update: theta = theta - lr * grad
        params = params - self.lr * grad
        return params
# --------------------------
# 3. SGD with Momentum
# --------------------------
class SGDWithMomentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None  # Momentum vector (initialized later)
    def update(self, params, grad):
        # Initialize momentum if first call
        if self.v is None:
            self.v = np.zeros_like(params)
        # Update momentum: v = beta*v + grad (simplified form)
        self.v = self.momentum * self.v + grad
        # Update parameters with momentum
        params = params - self.lr * self.v
        return params
# --------------------------
# 4. Train with Vanilla SGD and SGD+Momentum
# --------------------------
def train_sgd(optimizer_class, lr=0.01, momentum=None):
    # Initialize parameters (w=0, b=0)
    params = np.array([0.0, 0.0])
    if momentum is not None:
        optimizer = optimizer_class(lr=lr, momentum=momentum)
    else:
        optimizer = optimizer_class(lr=lr)
    loss_history = []
    params_history = [params.copy()]
    # Training loop (50 epochs)
    epochs = 50
    for epoch in range(epochs):
        # Shuffle data (critical for SGD)
        indices = np.random.permutation(len(x))
        x_shuffled = x[indices]
        y_shuffled = y[indices]
        epoch_loss = 0
        # Iterate over single samples (vanilla SGD)
        for x_i, y_i in zip(x_shuffled, y_shuffled):
            # Compute gradient for single sample
            grad = compute_sample_gradient(x_i, y_i, params[0], params[1])
            # Update parameters
            params = optimizer.update(params, grad)
            # Track loss
            y_hat = predict(x_i, params[0], params[1])
            epoch_loss += sample_loss(y_hat, y_i)
        # Average loss per epoch
        avg_loss = epoch_loss / len(x)
        loss_history.append(avg_loss)
        params_history.append(params.copy())
        # Print progress
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1} | Loss: {avg_loss:.4f} | w={params[0]:.4f}, b={params[1]:.4f}")
    return loss_history, params_history, params
# Train Vanilla SGD
print("=== Vanilla SGD ===")
loss_vanilla, params_vanilla, final_vanilla = train_sgd(VanillaSGD, lr=0.01)
# Train SGD with Momentum
print("\n=== SGD with Momentum ===")
loss_momentum, params_momentum, final_momentum = train_sgd(SGDWithMomentum, lr=0.01, momentum=0.9)
# --------------------------
# 5. Visualize Results
# --------------------------
# Loss curves
plt.figure(figsize=(10, 4))
plt.plot(loss_vanilla, label="Vanilla SGD (lr=0.01)")
plt.plot(loss_momentum, label="SGD + Momentum (lr=0.01, β=0.9)")
plt.xlabel("Epoch")
plt.ylabel("Average MSE Loss")
plt.title("SGD: Loss Over Time")
plt.legend()
plt.grid(True)
plt.show()
# Regression fit
plt.figure(figsize=(10, 4))
plt.scatter(x, y, label="Data (with noise)", alpha=0.6)
plt.plot(x, y_true, "r-", label="True: y=2x+1", linewidth=2)
plt.plot(x, predict(x, final_vanilla[0], final_vanilla[1]), "g--", label=f"Vanilla SGD: y={final_vanilla[0]:.2f}x+{final_vanilla[1]:.2f}")
plt.plot(x, predict(x, final_momentum[0], final_momentum[1]), "b-.", label=f"SGD+Momentum: y={final_momentum[0]:.2f}x+{final_momentum[1]:.2f}")
plt.legend()
plt.title("SGD: Regression Fit")
plt.show()
Step 2: SGD in TensorFlow/Keras (Standard Usage)
For deep learning, use Keras’s built-in SGD optimizer (which implements Mini-Batch SGD with momentum):
python
import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np
import matplotlib.pyplot as plt
# --------------------------
# 1. Load MNIST Data
# --------------------------
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)
# --------------------------
# 2. Build CNN Model
# --------------------------
model = models.Sequential([
layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
layers.MaxPooling2D((2, 2)),
layers.Flatten(),
layers.Dense(10, activation="softmax")
])
# --------------------------
# 3. Compile with SGD Optimizer
# --------------------------
# Vanilla SGD
model.compile(
optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), # No momentum
loss="sparse_categorical_crossentropy",
metrics=["accuracy"]
)
# Train (Vanilla SGD)
print("=== Training with Vanilla SGD ===")
history_vanilla = model.fit(
x_train, y_train,
epochs=10,
batch_size=64, # Mini-batch size
validation_split=0.1
)
# Evaluate Vanilla SGD on the test set before the model is rebuilt below
test_loss_vanilla, test_acc_vanilla = model.evaluate(x_test, y_test)
# Reset model and compile with SGD + Momentum
model = models.Sequential([
layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
layers.MaxPooling2D((2, 2)),
layers.Flatten(),
layers.Dense(10, activation="softmax")
])
model.compile(
optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9), # With momentum
loss="sparse_categorical_crossentropy",
metrics=["accuracy"]
)
# Train (SGD + Momentum)
print("\n=== Training with SGD + Momentum ===")
history_momentum = model.fit(
x_train, y_train,
epochs=10,
batch_size=64,
validation_split=0.1
)
# --------------------------
# 4. Evaluate and Visualize
# --------------------------
# Evaluate test accuracy (Vanilla SGD was evaluated above, before the rebuild)
test_loss_momentum, test_acc_momentum = model.evaluate(x_test, y_test)
print(f"\nTest Accuracy (Vanilla SGD): {test_acc_vanilla:.4f}")
print(f"Test Accuracy (SGD + Momentum): {test_acc_momentum:.4f}")
# Plot accuracy curves
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history_vanilla.history["accuracy"], label="Vanilla SGD (Train)")
plt.plot(history_vanilla.history["val_accuracy"], label="Vanilla SGD (Val)")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.title("Vanilla SGD Accuracy")
plt.subplot(1, 2, 2)
plt.plot(history_momentum.history["accuracy"], label="SGD+Momentum (Train)")
plt.plot(history_momentum.history["val_accuracy"], label="SGD+Momentum (Val)")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.title("SGD+Momentum Accuracy")
plt.tight_layout()
plt.show()
Key Outputs
- Manual Implementation: SGD with momentum converges faster and to a lower loss than vanilla SGD (the fitted line is closer to the true model).
- TensorFlow/Keras: SGD + momentum achieves ~97–98% test accuracy on MNIST (vs. ~95% for vanilla SGD) in the same number of epochs.
Key Hyperparameters of SGD
SGD has few hyperparameters, but tuning them is critical for performance:
| Hyperparameter | Default Value | Purpose | Tuning Tips |
|---|---|---|---|
| Learning Rate (\(\alpha\)) | 0.01 | Step size for parameter updates | Too small: slow convergence (training may stall on plateaus). Too large: unstable training (loss oscillates or diverges). Use learning rate scheduling (e.g., decay by 10x after 50 epochs). |
| Momentum (\(\beta\)) | 0.9 | Smoothing factor for gradient accumulation | 0.9 is the standard choice (balances smoothing and adaptability). Higher values (e.g., 0.95) give more smoothing, which helps with noisy gradients. |
| Batch Size (\(m\)) | 32 or 64 | Number of samples per mini-batch | Small batches (e.g., 16): more noise, more updates per epoch, often better generalization. Large batches (e.g., 256): less noise, fewer updates per epoch, more stable convergence. Powers of 2 (16, 32, 64) are conventional for GPU efficiency. |
Learning Rate Scheduling for SGD
SGD benefits greatly from learning rate decay (reducing the learning rate over time to fine-tune parameters near the loss minimum):
python
# Example: Step decay (reduce LR by 10x every 10 epochs)
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(
lambda epoch: 0.01 * (0.1 ** (epoch // 10))
)
# Add to model.fit()
model.fit(..., callbacks=[lr_scheduler])
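Alternatively, recent TensorFlow/Keras versions let you pass a schedule object directly as the optimizer's learning rate instead of using a callback. A sketch with illustrative decay settings (continuing with the tf import from the Keras example above):
python
# Exponential decay applied per optimizer step: lr = 0.01 * 0.96**(step / 1000)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,
    decay_rate=0.96
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)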
SGD vs. Adam (Critical Comparison)
SGD and Adam are the two most common optimizers—knowing when to use each is key:
| Feature | SGD (with Momentum) | Adam |
|---|---|---|
| Convergence Speed | Slow (requires more epochs) | Fast (converges in fewer epochs) |
| Generalization | Often better (gradient noise acts as implicit regularization) | Often worse (can overfit, especially on small datasets) |
| Memory Usage | Low (only stores parameters/gradients) | High (stores first/second moments for each parameter) |
| Hyperparameter Sensitivity | High (lr/momentum need careful tuning) | Low (default hyperparameters work for most tasks) |
| Sparse Data | Poor (fixed lr for all parameters) | Excellent (adaptive lr per parameter) |
| Use Cases | Small datasets, RL, edge devices, better generalization | Large datasets, NLP/CNN/Transformer, sparse data |
When to Choose SGD Over Adam
- Small datasets: SGD generalizes better (Adam may overfit).
- Reinforcement Learning: SGD is more stable for policy gradient methods (e.g., REINFORCE, PPO).
- Edge devices: Lower memory usage (critical for mobile/embedded AI).
- Research: Easier to interpret (fewer moving parts than Adam).
Common Variants of SGD
- Nesterov Accelerated Gradient (NAG): A modified momentum that "looks ahead" along the momentum direction before computing the gradient, reducing overshooting of the minimum (see the sketch after this list):
\(v_t = \beta \cdot v_{t-1} + \alpha \cdot \nabla_\theta L(\theta_t - \beta \cdot v_{t-1})\)
\(\theta_{t+1} = \theta_t - v_t\)
Implemented in Keras as:
python
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
- SGD with Weight Decay: Adds an L2-style penalty that discourages large weights and helps prevent overfitting (the SGD counterpart of AdamW's decoupled weight decay). Supported in recent TensorFlow/Keras versions:
python
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, weight_decay=1e-4)
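For intuition about NAG's look-ahead step, here is a minimal NumPy sketch on toy data shaped like the manual regression example above (data, loop length, and hyperparameter values are illustrative, not a definitive implementation):
python
import numpy as np

# Toy data around y = 2x + 1, similar in shape to the manual example above
rng = np.random.default_rng(42)
x = np.linspace(-5, 5, 100)
y = 2 * x + 1 + rng.normal(0, 1, size=x.shape)

lr, beta = 0.01, 0.9
params = np.array([0.0, 0.0])        # [w, b]
v = np.zeros_like(params)

for epoch in range(20):
    idx = rng.permutation(len(x))                         # shuffle every epoch
    for x_i, y_i in zip(x[idx], y[idx]):
        lookahead = params - beta * v                     # peek ahead along the momentum direction
        err = lookahead[0] * x_i + lookahead[1] - y_i     # residual at the look-ahead point
        grad = np.array([err * x_i, err])                 # gradient of 0.5*err**2 w.r.t. [w, b]
        v = beta * v + lr * grad                          # velocity built from the look-ahead gradient
        params = params - v                               # step with the accumulated velocity
print(params)  # should end up near [2.0, 1.0]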
Summary
- Stochastic Gradient Descent (SGD) minimizes the loss function by updating parameters using gradients from single samples (or mini-batches), making it efficient for large datasets.
- SGD with momentum is the most practical variant: it smooths updates, reduces oscillations, and accelerates convergence.
- SGD often generalizes better than Adam but requires more hyperparameter tuning (especially the learning rate).
- Use SGD for small datasets, RL, or edge devices; use Adam for large datasets, sparse data, or fast convergence.