Stochastic Gradient Descent (SGD) is the foundational optimization algorithm for training machine learning and deep learning models. Unlike Batch Gradient Descent (which uses the entire dataset to compute gradients) or Mini-Batch Gradient Descent (which uses small batches), SGD computes the gradient of the loss function using a single random sample from the training data at each step. This makes it computationally efficient and scalable to large datasets—critical for training deep neural networks.
While modern optimizers like Adam have largely replaced vanilla SGD as the default choice, understanding SGD is essential for grasping core optimization concepts (momentum, learning rate scheduling) and for scenarios where simplicity, low memory usage, or better generalization is needed.
Core Motivation
The goal of any optimizer is to minimize the loss function \(L(\theta)\), where \(\theta\) represents the model’s parameters (weights and biases). For a dataset with N samples, the full loss (empirical risk) is:
\(L(\theta) = \frac{1}{N} \sum_{i=1}^N L_i(\theta)\)
where \(L_i(\theta)\) is the loss for the i-th sample.
Limitations of Batch Gradient Descent
Batch Gradient Descent computes the gradient using the entire dataset:
\(\nabla_\theta L(\theta) = \frac{1}{N} \sum_{i=1}^N \nabla_\theta L_i(\theta)\)
- Problem 1: Computationally expensive for large N (e.g., the ImageNet-1k training set has about 1.28M images).
- Problem 2: Slow convergence (one update per full pass over the dataset).
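As a minimal sketch of the batch update above, assume a linear model y = w*x + b with a squared-error loss. The model, the toy data, and the names `full_gradient` and `batch_gradient_descent` are illustrative assumptions, not part of the original text. Note that every single parameter update touches all N samples:

```python
import numpy as np

def full_gradient(x, y, w, b):
    """Gradient of L = (1/N) * sum_i 0.5 * (w*x_i + b - y_i)^2 w.r.t. (w, b)."""
    residual = w * x + b - y        # shape (N,): one residual per sample
    dw = np.mean(residual * x)      # average of the per-sample dL_i/dw
    db = np.mean(residual)          # average of the per-sample dL_i/db
    return np.array([dw, db])

def batch_gradient_descent(x, y, lr=0.05, epochs=200):
    params = np.zeros(2)            # [w, b]
    for _ in range(epochs):
        grad = full_gradient(x, y, params[0], params[1])  # O(N) work
        params = params - lr * grad                       # one update per full pass
    return params

# Noise-free toy data generated from y = 2x + 1 (hypothetical example data)
x = np.linspace(-5, 5, 100)
y = 2 * x + 1
w, b = batch_gradient_descent(x, y)
print(f"w={w:.3f}, b={b:.3f}")
```

With 200 epochs this performs only 200 updates in total, which is the cost that the single-sample approach is meant to avoid.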
SGD Solution
SGD approximates the full gradient using a single random sample (i):
\(\nabla_\theta L(\theta) \approx \nabla_\theta L_i(\theta)\)
This approximation introduces noise (stochasticity) but:
- Reduces computation per update (O(1) instead of O(N)).
- Allows frequent parameter updates (faster progress toward the loss minimum).
- The noise can help escape local minima (a benefit for non-convex loss surfaces in deep learning).
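One way to see why a single-sample gradient is a reasonable stand-in for the full gradient: if the sample is drawn uniformly at random, its gradient equals the full gradient on average, so it is an unbiased but noisy estimate. The sketch below checks this numerically, assuming an illustrative linear model with squared-error loss (the model, data, and all variable names are assumptions made for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=1000)
y = 2 * x + 1 + rng.normal(0, 1, size=x.shape)
w, b = 0.5, -0.3  # arbitrary current parameters

# Per-sample gradients of L_i = 0.5 * (w*x_i + b - y_i)^2, shape (N, 2)
residual = w * x + b - y
per_sample_grads = np.stack([residual * x, residual], axis=1)

# Full gradient: average over all N samples (O(N) work)
full_grad = per_sample_grads.mean(axis=0)

# SGD-style estimate: gradient of one uniformly drawn sample (O(1) work)
one_sample_grad = per_sample_grads[rng.integers(len(x))]

# Averaging many independent single-sample draws recovers the full gradient
draws = rng.integers(len(x), size=20000)
mc_estimate = per_sample_grads[draws].mean(axis=0)

print("full gradient:         ", full_grad)
print("one sample's gradient: ", one_sample_grad)
print("mean of 20000 draws:   ", mc_estimate)
print("per-sample std dev:    ", per_sample_grads.std(axis=0))
```

The per-sample standard deviation is the "noise" referred to above: any individual step may point in a poor direction, but the steps are correct on average.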
How SGD Works
Basic SGD Algorithm
The core update rule for SGD is simple and iterative:
- Initialize parameters: Randomly initialize \(\theta_0\) (weights/biases).
- Shuffle the dataset: Ensure samples are processed in random order (critical for SGD).
- For each epoch:
  a. For each sample i in the shuffled dataset:
    i. Compute the gradient of the loss for sample i: \(g_i = \nabla_\theta L_i(\theta_t)\).
    ii. Update parameters by moving in the direction opposite to the gradient: \(\theta_{t+1} = \theta_t - \alpha \cdot g_i\), where \(\alpha\) (learning rate) controls the step size.
- Repeat: Until the loss converges (stops decreasing) or a maximum number of epochs is reached.
Key Note: Mini-Batch SGD (Practical SGD)
In practice, “SGD” almost always refers to Mini-Batch SGD (not single-sample SGD), which uses small batches of m samples (e.g., 32, 64, 128) to compute gradients:
\(\theta_{t+1} = \theta_t - \alpha \cdot \frac{1}{m} \sum_{i \in \text{batch}} \nabla_\theta L_i(\theta_t)\)
- Mini-Batch SGD balances the noise of single-sample SGD and the computational cost of Batch GD.
- Batch size \(m=32\) is a common default for deep learning.
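The mini-batch update can be sketched in NumPy as follows. This is a minimal illustration under assumed choices (a linear model y_hat = w*x + b, a squared-error loss, a learning rate of 0.05, and the hypothetical function name `minibatch_sgd`), not a reference implementation:

```python
import numpy as np

def minibatch_sgd(x, y, lr=0.05, batch_size=32, epochs=50, seed=0):
    """Mini-batch SGD for y_hat = w*x + b with per-sample loss 0.5*(y_hat - y)^2."""
    rng = np.random.default_rng(seed)
    params = np.zeros(2)  # [w, b]
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # the last batch may be smaller
            residual = params[0] * x[idx] + params[1] - y[idx]
            # Average the per-sample gradients over the batch (the 1/m sum)
            grad = np.array([np.mean(residual * x[idx]), np.mean(residual)])
            params = params - lr * grad
    return params

# Noisy toy data around y = 2x + 1 (hypothetical example data)
data_rng = np.random.default_rng(42)
x = np.linspace(-5, 5, 100)
y = 2 * x + 1 + data_rng.normal(0, 1, size=x.shape)
w, b = minibatch_sgd(x, y)
print(f"w={w:.2f}, b={b:.2f}")
```

Setting `batch_size=1` recovers single-sample SGD, and `batch_size=len(x)` recovers Batch Gradient Descent, so the batch size is the only thing separating the three variants.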
SGD with Momentum (Critical Improvement)
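Vanilla SGD tends to oscillate (zig-zag) on its way to the loss minimum, especially in narrow ravines where the loss surface is much steeper in one direction than in another. Momentum dampens this by accumulating past gradients to smooth the updates, like a ball rolling down a hill that keeps moving in its established direction:
\(v_t = \beta \cdot v_{t-1} + (1 - \beta) \cdot g_t \quad \text{or, in the simpler form,} \quad v_t = \beta \cdot v_{t-1} + g_t\)
\(\theta_{t+1} = \theta_t - \alpha \cdot v_t\)
- \(v_t\): Momentum vector (accumulated gradient).
- \(\beta\): Momentum coefficient (typical value: 0.9; Keras's SGD defaults to 0.0, i.e., no momentum). Higher values mean more smoothing.
- The two forms differ only in scale: for the same \(\alpha\), the simpler form takes steps roughly \(1/(1-\beta)\) times larger once the velocity has built up.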
- Momentum accelerates convergence and reduces oscillations.
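A small worked example may make the two momentum forms concrete. The sketch below feeds a hand-picked gradient sequence into each form and reports the final velocity; the function name `final_velocity` and the two gradient sequences are assumptions chosen for illustration:

```python
def final_velocity(grads, beta=0.9, ema=False):
    """Run the momentum recursion over a gradient sequence, return the last v_t."""
    v = 0.0
    for g in grads:
        if ema:
            v = beta * v + (1 - beta) * g   # v_t = beta*v_{t-1} + (1-beta)*g_t
        else:
            v = beta * v + g                # v_t = beta*v_{t-1} + g_t (simpler form)
    return v

steady = [1.0] * 100        # gradient keeps pointing the same way
zigzag = [1.0, -1.0] * 50   # gradient flips sign every step (oscillation)

# Consistent gradients accumulate: the velocity approaches g for the averaged
# form and g / (1 - beta) = 10 * g for the simpler form.
print("steady, averaged form:", final_velocity(steady, ema=True))
print("steady, simpler form: ", final_velocity(steady, ema=False))

# Sign-flipping gradients largely cancel: the velocity magnitude settles near
# 1 / (1 + beta), far below the steady case.
print("zigzag, simpler form: ", final_velocity(zigzag, ema=False))
```

This is the mechanism behind both benefits named above: directions that persist are amplified (faster convergence), while directions that keep reversing are damped (less oscillation).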
SGD Implementation (Python: Manual + TensorFlow/Keras)
We first implement vanilla SGD and SGD with momentum manually for a regression task, then show standard usage in TensorFlow/Keras (for deep learning).
Step 1: Manual SGD Implementation
python
import numpy as np
import matplotlib.pyplot as plt
# --------------------------
# 1. Synthetic Regression Data
# --------------------------
np.random.seed(42)
x = np.linspace(-5, 5, 100)
y_true = 2 * x + 1 # True model: y = 2x + 1
y = y_true + np.random.normal(0, 1, size=x.shape) # Add noise
# Model: y_hat = w*x + b (single weight w, bias b)
def predict(x, w, b):
    return w * x + b
# MSE loss for a single sample
def sample_loss(y_hat, y):
    return 0.5 * (y_hat - y)**2  # 0.5 simplifies gradient calculation
# Gradient of loss for a single sample
def compute_sample_gradient(x_i, y_i, w, b):
    y_hat = predict(x_i, w, b)
    dw = (y_hat - y_i) * x_i  # dL/dw
    db = (y_hat - y_i)  # dL/db
    return np.array([dw, db])
# --------------------------
# 2. Vanilla SGD Optimizer
# --------------------------
class VanillaSGD:
    def __init__(self, lr=0.01):
        self.lr = lr  # Learning rate
    def update(self, params, grad):
        # Basic SGD update: theta = theta - lr * grad
        params = params - self.lr * grad
        return params
# --------------------------
# 3. SGD with Momentum
# --------------------------
class SGDWithMomentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None  # Momentum vector (initialized later)
    def update(self, params, grad):
        # Initialize momentum if first call
        if self.v is None:
            self.v = np.zeros_like(params)
        # Update momentum: v = beta*v + grad (simplified form)
        self.v = self.momentum * self.v + grad
        # Update parameters with momentum
        params = params - self.lr * self.v
        return params
# --------------------------
# 4. Train with Vanilla SGD and SGD+Momentum
# --------------------------
def train_sgd(optimizer_class, lr=0.01, momentum=None):
    # Initialize parameters (w=0, b=0)
    params = np.array([0.0, 0.0])
    if momentum is not None:
        optimizer = optimizer_class(lr=lr, momentum=momentum)
    else:
        optimizer = optimizer_class(lr=lr)
    loss_history = []
    params_history = [params.copy()]
    # Training loop (50 epochs)
    epochs = 50
    for epoch in range(epochs):
        # Shuffle data (critical for SGD)
        indices = np.random.permutation(len(x))
        x_shuffled = x[indices]
        y_shuffled = y[indices]
        epoch_loss = 0
        # Iterate over single samples (vanilla SGD)
        for x_i, y_i in zip(x_shuffled, y_shuffled):
            # Compute gradient for single sample
            grad = compute_sample_gradient(x_i, y_i, params[0], params[1])
            # Update parameters
            params = optimizer.update(params, grad)
            # Track loss
            y_hat = predict(x_i, params[0], params[1])
            epoch_loss += sample_loss(y_hat, y_i)
        # Average loss per epoch
        avg_loss = epoch_loss / len(x)
        loss_history.append(avg_loss)
        params_history.append(params.copy())
        # Print progress
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1} | Loss: {avg_loss:.4f} | w={params[0]:.4f}, b={params[1]:.4f}")
    return loss_history, params_history, params
# Train Vanilla SGD
print("=== Vanilla SGD ===")
loss_vanilla, params_vanilla, final_vanilla = train_sgd(VanillaSGD, lr=0.01)
# Train SGD with Momentum
print("\n=== SGD with Momentum ===")
loss_momentum, params_momentum, final_momentum = train_sgd(SGDWithMomentum, lr=0.01, momentum=0.9)
# --------------------------
# 5. Visualize Results
# --------------------------
# Loss curves
plt.figure(figsize=(10, 4))
plt.plot(loss_vanilla, label="Vanilla SGD (lr=0.01)")
plt.plot(loss_momentum, label="SGD + Momentum (lr=0.01, β=0.9)")
plt.xlabel("Epoch")
plt.ylabel("Average MSE Loss")
plt.title("SGD: Loss Over Time")
plt.legend()
plt.grid(True)
plt.show()
# Regression fit
plt.figure(figsize=(10, 4))
plt.scatter(x, y, label="Data (with noise)", alpha=0.6)
plt.plot(x, y_true, "r-", label="True: y=2x+1", linewidth=2)
plt.plot(x, predict(x, final_vanilla[0], final_vanilla[1]), "g--", label=f"Vanilla SGD: y={final_vanilla[0]:.2f}x+{final_vanilla[1]:.2f}")
plt.plot(x, predict(x, final_momentum[0], final_momentum[1]), "b-.", label=f"SGD+Momentum: y={final_momentum[0]:.2f}x+{final_momentum[1]:.2f}")
plt.legend()
plt.title("SGD: Regression Fit")
plt.show()
Step 2: SGD in TensorFlow/Keras (Standard Usage)
For deep learning, use Keras’s built-in SGD optimizer, which applies one update per mini-batch and supports optional momentum (off by default, enabled via the momentum argument):
python
import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np
import matplotlib.pyplot as plt  # needed for the accuracy plots below
# --------------------------
# 1. Load MNIST Data
# --------------------------
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)
# --------------------------
# 2. Build CNN Model
# --------------------------
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax")
])
# --------------------------
# 3. Compile with SGD Optimizer
# --------------------------
# Vanilla SGD
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),  # No momentum
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
# Train (Vanilla SGD)
print("=== Training with Vanilla SGD ===")
history_vanilla = model.fit(
    x_train, y_train,
    epochs=10,
    batch_size=64,  # Mini-batch size
    validation_split=0.1
)
# Build a fresh model for SGD + Momentum. It gets its own name so that the
# vanilla-SGD model above is still available for evaluation afterwards.
model_momentum = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax")
])
model_momentum.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # With momentum
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
# Train (SGD + Momentum)
print("\n=== Training with SGD + Momentum ===")
history_momentum = model_momentum.fit(
    x_train, y_train,
    epochs=10,
    batch_size=64,
    validation_split=0.1
)
# --------------------------
# 4. Evaluate and Visualize
# --------------------------
# Evaluate test accuracy (model = vanilla SGD run, model_momentum = momentum run)
test_loss_vanilla, test_acc_vanilla = model.evaluate(x_test, y_test)
test_loss_momentum, test_acc_momentum = model_momentum.evaluate(x_test, y_test)
print(f"\nTest Accuracy (Vanilla SGD): {test_acc_vanilla:.4f}")
print(f"Test Accuracy (SGD + Momentum): {test_acc_momentum:.4f}")
# Plot accuracy curves
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history_vanilla.history["accuracy"], label="Vanilla SGD (Train)")
plt.plot(history_vanilla.history["val_accuracy"], label="Vanilla SGD (Val)")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.title("Vanilla SGD Accuracy")
plt.subplot(1, 2, 2)
plt.plot(history_momentum.history["accuracy"], label="SGD+Momentum (Train)")
plt.plot(history_momentum.history["val_accuracy"], label="SGD+Momentum (Val)")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.title("SGD+Momentum Accuracy")
plt.tight_layout()
plt.show()
Key Outputs
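- Manual Implementation: Both optimizers should recover parameters close to the true model (w=2, b=1). With the simpler momentum form and β=0.9, the effective step size is roughly 10x larger, so SGD with momentum typically approaches the minimum in fewer epochs, although its per-epoch loss can fluctuate more on a problem this small.
- TensorFlow/Keras: SGD + momentum is reported to reach ~97–98% test accuracy on MNIST (vs. ~95% for vanilla SGD) in the same number of epochs; exact figures vary from run to run with the random initialization.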
Key Hyperparameters of SGD
SGD has few hyperparameters, but tuning them is critical for performance:
| Hyperparameter | Typical Value | Purpose | Tuning Tips |
|---|---|---|---|
| Learning Rate (\(\alpha\)) | 0.01 | Step size for parameter updates | Too small: slow convergence (training can stall on plateaus or in poor local minima). Too large: unstable training (loss oscillates or diverges). Use learning rate scheduling (e.g., decay by 10x after 50 epochs). |
| Momentum (\(\beta\)) | 0.9 (Keras default: 0.0) | Smoothing factor for gradient accumulation | 0.9 is the standard choice (balances smoothing and adaptability). Higher values (0.95) give more smoothing (good for noisy gradients). |
| Batch Size (m) | 32/64 | Number of samples per mini-batch | Small batches (16): more gradient noise, more updates per epoch, often better generalization. Large batches (256): less noise, fewer updates per epoch, more stable convergence. Use powers of 2 (16, 32, 64) for GPU efficiency. |
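To make the batch-size trade-off in the table concrete, the sketch below counts how many parameter updates one epoch yields for several batch sizes. The helper name `updates_per_epoch` is hypothetical, and the sample count assumes the MNIST setup used earlier (60,000 training images with a 10% validation split):

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one pass (the last batch may be partial)."""
    return math.ceil(n_samples / batch_size)

n_train = 54_000  # assumed: 60,000 MNIST training images minus a 10% validation split
for m in (1, 16, 64, 256, n_train):
    print(f"batch size {m:>6d}: {updates_per_epoch(n_train, m):>6d} updates per epoch")
```

A batch size of 1 corresponds to single-sample SGD (one update per sample), while a batch size equal to the dataset size corresponds to Batch Gradient Descent (one update per epoch).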
Learning Rate Scheduling for SGD
SGD benefits greatly from learning rate decay (reducing the learning rate over time to fine-tune parameters near the loss minimum):
python
# Example: Step decay (reduce LR by 10x every 10 epochs)
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(
lambda epoch: 0.01 * (0.1 ** (epoch // 10))
)
# Add to model.fit()
model.fit(..., callbacks=[lr_scheduler])
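The step-decay rule can also be written as a standalone function and inspected before training. This is a sketch; the function name `step_decay` and its keyword parameters are illustrative assumptions, with defaults chosen to match the lambda shown above:

```python
def step_decay(epoch, *, initial_lr=0.01, drop=0.1, epochs_per_drop=10):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Inspect the schedule around each drop (epochs are 0-indexed)
for epoch in (0, 9, 10, 19, 20, 29):
    print(f"epoch {epoch:>2d}: lr = {step_decay(epoch):.6f}")
```

The learning rate stays constant within each 10-epoch block and drops by a factor of 10 at epochs 10, 20, and so on. A function like this could be passed to the scheduler callback by wrapping it in a one-argument lambda, in the same way as the inline version.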
SGD vs. Adam (Critical Comparison)
SGD and Adam are the two most common optimizers—knowing when to use each is key:
| Feature | SGD (with Momentum) | Adam |
|---|---|---|
| Convergence Speed | Slow (requires more epochs) | Fast (converges in fewer epochs) |
| Generalization | Often better in practice (gradient noise acts as a mild regularizer) | Can be worse (more prone to overfitting on small datasets) |
| Memory Usage | Low (one velocity buffer per parameter; none without momentum) | Higher (two buffers per parameter: first and second moment estimates) |
| Hyperparameter Sensitivity | High (lr/momentum need careful tuning) | Low (default hyperparameters work for most tasks) |
| Sparse Data | Poor (fixed lr for all parameters) | Excellent (adaptive lr per parameter) |
| Use Cases | Small datasets, RL, edge devices, better generalization | Large datasets, NLP/CNN/Transformer, sparse data |
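The memory row of the table can be illustrated with a rough count of optimizer state. The sketch below assumes the small CNN from the Keras example (Conv2D with 32 3x3 filters, 2x2 max pooling, then a 10-unit dense layer), float32 storage, and one extra buffer per parameter for momentum versus two for Adam. The helper names are hypothetical, and real frameworks add small bookkeeping overheads that are ignored here:

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    return kernel_h * kernel_w * in_channels * filters + filters  # weights + biases

def dense_params(inputs, units):
    return inputs * units + units  # weights + biases

# 28x28x1 input -> Conv2D(32, 3x3, no padding) -> 26x26x32
# -> MaxPooling2D(2x2) -> 13x13x32 -> Flatten -> 5408 -> Dense(10)
n_params = conv2d_params(3, 3, 1, 32) + dense_params(13 * 13 * 32, 10)

# Extra per-parameter buffers each optimizer keeps beyond weights and gradients
STATE_SLOTS = {"SGD": 0, "SGD + momentum": 1, "Adam": 2}
BYTES_PER_FLOAT32 = 4

def optimizer_state_bytes(name, n_params):
    return STATE_SLOTS[name] * n_params * BYTES_PER_FLOAT32

print(f"trainable parameters: {n_params}")
for name in STATE_SLOTS:
    print(f"{name:15s} {optimizer_state_bytes(name, n_params):>8d} bytes of optimizer state")
```

For a model this small the difference is negligible, but the ratio carries over to larger models: Adam's state is roughly twice that of SGD with momentum.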
When to Choose SGD Over Adam
- Small datasets: SGD often generalizes better (Adam may overfit).
- Reinforcement Learning: Plain SGD can be sufficient for simple policy gradient methods (e.g., REINFORCE); note that common PPO implementations default to Adam.
- Edge devices: Lower memory usage (critical for mobile/embedded AI).
- Research: Easier to interpret (fewer moving parts than Adam).
Common Variants of SGD
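- Nesterov Accelerated Gradient (NAG): A modified momentum that evaluates the gradient at the look-ahead position, which reduces overshooting of the minimum (in this formulation the learning rate is folded into the velocity):
\(v_t = \beta \cdot v_{t-1} + \alpha \cdot \nabla_\theta L(\theta_t - \beta \cdot v_{t-1})\)
\(\theta_{t+1} = \theta_t - v_t\)
Implemented in Keras as:
python
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
- SGD with Weight Decay: Shrinks the weights slightly at every step to prevent overfitting. For vanilla SGD this matches L2 regularization up to a rescaling by the learning rate, and it plays the same role for SGD that AdamW plays for Adam. The weight_decay argument is available in recent Keras versions:
python
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, weight_decay=1e-4)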
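The Nesterov update can be compared with classical momentum on a toy problem. The sketch below assumes a one-dimensional quadratic loss L(θ) = 0.5·θ² and writes both methods with the learning rate folded into the velocity, as in the NAG formula above. The function names and settings are illustrative assumptions:

```python
def grad(theta, curvature=1.0):
    """Gradient of the toy loss L(theta) = 0.5 * curvature * theta**2."""
    return curvature * theta

def run(nesterov, lr=0.1, beta=0.9, steps=300, theta0=1.0):
    theta, v, path = theta0, 0.0, [theta0]
    for _ in range(steps):
        # NAG evaluates the gradient at the look-ahead point theta - beta*v;
        # classical momentum evaluates it at the current theta.
        lookahead = theta - beta * v if nesterov else theta
        v = beta * v + lr * grad(lookahead)
        theta = theta - v
        path.append(theta)
    return path

momentum_path = run(nesterov=False)
nag_path = run(nesterov=True)

# The minimum is at theta = 0 and both runs start at theta = 1, so the most
# negative value along each path measures how far it overshoots.
print(f"classical momentum: lowest theta {min(momentum_path):.3f}, final {momentum_path[-1]:.1e}")
print(f"Nesterov:           lowest theta {min(nag_path):.3f}, final {nag_path[-1]:.1e}")
```

On this toy problem both runs converge to the minimum, and the Nesterov run swings less far past it, which is the overshoot reduction described above.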
Summary
Stochastic Gradient Descent (SGD) optimizes the loss function by updating parameters using gradients from single samples (or mini-batches), making it efficient for large datasets.
SGD with momentum is the most practical variant: it smooths updates, reduces oscillations, and accelerates convergence.
SGD often generalizes better than Adam but requires more hyperparameter tuning (especially of the learning rate).
Use SGD for small datasets, RL, or edge devices; use Adam for large datasets, sparse data, or fast convergence.