Stochastic Gradient Descent (SGD) is the foundational optimization algorithm for training machine learning and deep learning models. Unlike Batch Gradient Descent (which uses the entire dataset to compute gradients) or Mini-Batch Gradient Descent (which uses small batches), SGD computes the gradient of the loss function using a single random sample from the training data at each step. This makes it computationally efficient and scalable to large datasets—critical for training deep neural networks.
While modern optimizers like Adam have largely replaced vanilla SGD as the default choice, understanding SGD is essential for grasping core optimization concepts (momentum, learning rate scheduling) and for scenarios where simplicity, low memory usage, or better generalization is needed.
Core Motivation
The goal of any optimizer is to minimize the loss function \(L(\theta)\), where \(\theta\) represents the model’s parameters (weights and biases). For a dataset with N samples, the full loss (empirical risk) is:
\(L(\theta) = \frac{1}{N} \sum_{i=1}^N L_i(\theta)\)
where \(L_i(\theta)\) is the loss for the i-th sample.
Limitations of Batch Gradient Descent
Batch Gradient Descent computes the gradient using the entire dataset:
\(\nabla_\theta L(\theta) = \frac{1}{N} \sum_{i=1}^N \nabla_\theta L_i(\theta)\)
- Problem 1: Computationally expensive for large N (e.g., ImageNet has 1.4M samples).
- Problem 2: Slow convergence (one update per full pass over the dataset).
SGD Solution
SGD approximates the full gradient using a single random sample (i):
\(\nabla_\theta L(\theta) \approx \nabla_\theta L_i(\theta)\)
This approximation introduces noise (stochasticity) but:
- Reduces computation per update (O(1) instead of O(N)).
- Allows frequent parameter updates (faster progress toward the loss minimum).
- The noise can help escape local minima (a benefit for non-convex loss surfaces in deep learning).
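To make the trade-off concrete, here is a minimal NumPy sketch (a toy one-parameter linear model invented for illustration; it is separate from the examples later in this post) showing that the single-sample gradient is a noisy but, on average, unbiased estimate of the full-batch gradient:
python
# Compare the full-batch gradient with a single-sample SGD estimate (toy example)
import numpy as np

np.random.seed(0)
N = 1000
x = np.random.randn(N)
y = 3.0 * x + np.random.normal(0, 0.5, size=N)   # toy data around y = 3x
w = 0.0                                          # per-sample loss: L_i = 0.5 * (w*x_i - y_i)**2

full_grad = np.mean((w * x - y) * x)             # Batch GD gradient: averages all N samples, O(N)
i = np.random.randint(N)
single_grad = (w * x[i] - y[i]) * x[i]           # SGD gradient: one random sample, O(1), noisy

# Averaged over many random draws, the single-sample gradient matches the full gradient
draws = [(w * x[j] - y[j]) * x[j] for j in np.random.randint(N, size=5000)]
print(full_grad, single_grad, np.mean(draws))    # first and last values should be close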
How SGD Works
Basic SGD Algorithm
The core update rule for SGD is simple and iterative:
- Initialize parameters: Randomly initialize \(\theta_0\) (weights/biases).
- Shuffle the dataset: Ensure samples are processed in random order (critical for SGD).
- For each epoch:
  - For each sample i in the shuffled dataset:
    - Compute the gradient of the loss for sample i: \(g_i = \nabla_\theta L_i(\theta_t)\).
    - Update parameters by moving in the direction opposite to the gradient: \(\theta_{t+1} = \theta_t - \alpha \cdot g_i\), where \(\alpha\) (learning rate) controls the step size.
- Repeat: Until the loss converges (stops decreasing) or a maximum number of epochs is reached.
Key Note: Mini-Batch SGD (Practical SGD)
In practice, “SGD” almost always refers to Mini-Batch SGD (not single-sample SGD), which uses small batches of m samples (e.g., 32, 64, 128) to compute gradients:
\(\theta_{t+1} = \theta_t - \alpha \cdot \frac{1}{m} \sum_{i \in \text{batch}} \nabla_\theta L_i(\theta_t)\)
- Mini-Batch SGD balances the noise of single-sample SGD and the computational cost of Batch GD.
- Batch size \(m=32\) is a common default for deep learning.
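As a rough sketch of how the mini-batch loop differs from single-sample SGD (the toy data, variable names, and hyperparameter values here are illustrative only and separate from the implementations later in this post):
python
# One manual Mini-Batch SGD run on toy linear-regression data
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=200)   # toy data around y = 2x + 1

params = np.zeros(2)       # [w, b]
lr, m = 0.1, 32            # learning rate and mini-batch size (illustrative values)

for epoch in range(30):
    idx = rng.permutation(len(x))                  # reshuffle every epoch
    for start in range(0, len(x), m):
        batch = idx[start:start + m]
        xb, yb = x[batch], y[batch]
        err = params[0] * xb + params[1] - yb      # residuals for this mini-batch
        grad = np.array([np.mean(err * xb), np.mean(err)])  # gradient averaged over the batch
        params -= lr * grad                        # one mini-batch update
print(params)  # should end up close to [2.0, 1.0]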
SGD with Momentum (Critical Improvement)
Vanilla SGD suffers from oscillations (zig-zagging) around the loss minimum, especially on steep loss surfaces. Momentum fixes this by accumulating past gradients to smooth updates—like a ball rolling down a hill (momentum keeps it moving in the right direction):
\(v_t = \beta \cdot v_{t-1} + (1 - \beta) \cdot g_t \quad (\text{or } v_t = \beta \cdot v_{t-1} + g_t \text{ (simpler form)})\)
\(\theta_{t+1} = \theta_t - \alpha \cdot v_t\)
- \(v_t\): Momentum vector (accumulated gradient).
- \(\beta\): Momentum coefficient (default: 0.9)—higher values = more smoothing.
- Momentum accelerates convergence and reduces oscillations.
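To see what "accumulating past gradients" means, unroll the simpler recursion \(v_t = \beta \cdot v_{t-1} + g_t\) (with \(v_0 = 0\)):
\(v_t = g_t + \beta \cdot g_{t-1} + \beta^2 \cdot g_{t-2} + \dots + \beta^{t-1} \cdot g_1\)
Each update is an exponentially weighted sum of all past gradients, so components that consistently point in the same direction reinforce each other, while oscillating components largely cancel.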
SGD Implementation (Python: Manual + TensorFlow/Keras)
We first implement vanilla SGD and SGD with momentum manually for a regression task, then show standard usage in TensorFlow/Keras (for deep learning).
Step 1: Manual SGD Implementation
python
import numpy as np
import matplotlib.pyplot as plt
# --------------------------
# 1. Synthetic Regression Data
# --------------------------
np.random.seed(42)
x = np.linspace(-5, 5, 100)
y_true = 2 * x + 1 # True model: y = 2x + 1
y = y_true + np.random.normal(0, 1, size=x.shape) # Add noise
# Model: y_hat = w*x + b (single weight w, bias b)
def predict(x, w, b):
    return w * x + b
# MSE loss for a single sample
def sample_loss(y_hat, y):
    return 0.5 * (y_hat - y)**2  # 0.5 simplifies gradient calculation
# Gradient of loss for a single sample
def compute_sample_gradient(x_i, y_i, w, b):
    y_hat = predict(x_i, w, b)
    dw = (y_hat - y_i) * x_i  # dL/dw
    db = (y_hat - y_i)        # dL/db
    return np.array([dw, db])
# --------------------------
# 2. Vanilla SGD Optimizer
# --------------------------
class VanillaSGD:
    def __init__(self, lr=0.01):
        self.lr = lr  # Learning rate
    def update(self, params, grad):
        # Basic SGD update: theta = theta - lr * grad
        params = params - self.lr * grad
        return params
# --------------------------
# 3. SGD with Momentum
# --------------------------
class SGDWithMomentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None  # Momentum vector (initialized later)
    def update(self, params, grad):
        # Initialize momentum if first call
        if self.v is None:
            self.v = np.zeros_like(params)
        # Update momentum: v = beta*v + grad (simplified form)
        self.v = self.momentum * self.v + grad
        # Update parameters with momentum
        params = params - self.lr * self.v
        return params
# --------------------------
# 4. Train with Vanilla SGD and SGD+Momentum
# --------------------------
def train_sgd(optimizer_class, lr=0.01, momentum=None):
    # Initialize parameters (w=0, b=0)
    params = np.array([0.0, 0.0])
    if momentum is not None:
        optimizer = optimizer_class(lr=lr, momentum=momentum)
    else:
        optimizer = optimizer_class(lr=lr)
    loss_history = []
    params_history = [params.copy()]
    # Training loop (50 epochs)
    epochs = 50
    for epoch in range(epochs):
        # Shuffle data (critical for SGD)
        indices = np.random.permutation(len(x))
        x_shuffled = x[indices]
        y_shuffled = y[indices]
        epoch_loss = 0
        # Iterate over single samples (vanilla SGD)
        for x_i, y_i in zip(x_shuffled, y_shuffled):
            # Compute gradient for single sample
            grad = compute_sample_gradient(x_i, y_i, params[0], params[1])
            # Update parameters
            params = optimizer.update(params, grad)
            # Track loss
            y_hat = predict(x_i, params[0], params[1])
            epoch_loss += sample_loss(y_hat, y_i)
        # Average loss per epoch
        avg_loss = epoch_loss / len(x)
        loss_history.append(avg_loss)
        params_history.append(params.copy())
        # Print progress
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1} | Loss: {avg_loss:.4f} | w={params[0]:.4f}, b={params[1]:.4f}")
    return loss_history, params_history, params
# Train Vanilla SGD
print("=== Vanilla SGD ===")
loss_vanilla, params_vanilla, final_vanilla = train_sgd(VanillaSGD, lr=0.01)
# Train SGD with Momentum
print("\n=== SGD with Momentum ===")
loss_momentum, params_momentum, final_momentum = train_sgd(SGDWithMomentum, lr=0.01, momentum=0.9)
# --------------------------
# 5. Visualize Results
# --------------------------
# Loss curves
plt.figure(figsize=(10, 4))
plt.plot(loss_vanilla, label="Vanilla SGD (lr=0.01)")
plt.plot(loss_momentum, label="SGD + Momentum (lr=0.01, β=0.9)")
plt.xlabel("Epoch")
plt.ylabel("Average MSE Loss")
plt.title("SGD: Loss Over Time")
plt.legend()
plt.grid(True)
plt.show()
# Regression fit
plt.figure(figsize=(10, 4))
plt.scatter(x, y, label="Data (with noise)", alpha=0.6)
plt.plot(x, y_true, "r-", label="True: y=2x+1", linewidth=2)
plt.plot(x, predict(x, final_vanilla[0], final_vanilla[1]), "g--", label=f"Vanilla SGD: y={final_vanilla[0]:.2f}x+{final_vanilla[1]:.2f}")
plt.plot(x, predict(x, final_momentum[0], final_momentum[1]), "b-.", label=f"SGD+Momentum: y={final_momentum[0]:.2f}x+{final_momentum[1]:.2f}")
plt.legend()
plt.title("SGD: Regression Fit")
plt.show()
Step 2: SGD in TensorFlow/Keras (Standard Usage)
For deep learning, use Keras’s built-in SGD optimizer (which implements Mini-Batch SGD with momentum):
python
import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np
import matplotlib.pyplot as plt
# --------------------------
# 1. Load MNIST Data
# --------------------------
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)
# --------------------------
# 2. Build CNN Model
# --------------------------
model = models.Sequential([
layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
layers.MaxPooling2D((2, 2)),
layers.Flatten(),
layers.Dense(10, activation="softmax")
])
# --------------------------
# 3. Compile with SGD Optimizer
# --------------------------
# Vanilla SGD
model.compile(
optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), # No momentum
loss="sparse_categorical_crossentropy",
metrics=["accuracy"]
)
# Train (Vanilla SGD)
print("=== Training with Vanilla SGD ===")
history_vanilla = model.fit(
x_train, y_train,
epochs=10,
batch_size=64, # Mini-batch size
validation_split=0.1
)
# Evaluate Vanilla SGD on the test set before the model is rebuilt below
test_loss_vanilla, test_acc_vanilla = model.evaluate(x_test, y_test)
# Reset model and compile with SGD + Momentum
model = models.Sequential([
layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
layers.MaxPooling2D((2, 2)),
layers.Flatten(),
layers.Dense(10, activation="softmax")
])
model.compile(
optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9), # With momentum
loss="sparse_categorical_crossentropy",
metrics=["accuracy"]
)
# Train (SGD + Momentum)
print("\n=== Training with SGD + Momentum ===")
history_momentum = model.fit(
x_train, y_train,
epochs=10,
batch_size=64,
validation_split=0.1
)
# --------------------------
# 4. Evaluate and Visualize
# --------------------------
# Evaluate test accuracy (Vanilla SGD was evaluated above, before the rebuild)
test_loss_momentum, test_acc_momentum = model.evaluate(x_test, y_test)
print(f"\nTest Accuracy (Vanilla SGD): {test_acc_vanilla:.4f}")
print(f"Test Accuracy (SGD + Momentum): {test_acc_momentum:.4f}")
# Plot accuracy curves
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history_vanilla.history["accuracy"], label="Vanilla SGD (Train)")
plt.plot(history_vanilla.history["val_accuracy"], label="Vanilla SGD (Val)")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.title("Vanilla SGD Accuracy")
plt.subplot(1, 2, 2)
plt.plot(history_momentum.history["accuracy"], label="SGD+Momentum (Train)")
plt.plot(history_momentum.history["val_accuracy"], label="SGD+Momentum (Val)")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.title("SGD+Momentum Accuracy")
plt.tight_layout()
plt.show()
Key Outputs
- Manual Implementation: SGD with momentum converges faster and to a lower loss than vanilla SGD (the fitted line is closer to the true model).
- TensorFlow/Keras: SGD + momentum achieves ~97–98% test accuracy on MNIST (vs. ~95% for vanilla SGD) in the same number of epochs.
Key Hyperparameters of SGD
SGD has few hyperparameters, but tuning them is critical for performance:
| Hyperparameter | Default Value | Purpose | Tuning Tips |
|---|---|---|---|
| Learning Rate (\(\alpha\)) | 0.01 | Step size for parameter updates | Too small: slow convergence (training may stall on plateaus). Too large: unstable training (loss oscillates or diverges). Use learning rate scheduling (e.g., decay by 10x after 50 epochs). |
| Momentum (\(\beta\)) | 0.9 | Smoothing factor for gradient accumulation | 0.9 is the standard choice (balances smoothing and adaptability). Higher values (e.g., 0.95) give more smoothing, which helps with noisy gradients. |
| Batch Size (\(m\)) | 32 or 64 | Number of samples per mini-batch | Small batches (e.g., 16): more noise, more updates per epoch, often better generalization. Large batches (e.g., 256): less noise, fewer updates per epoch, more stable convergence. Powers of 2 (16, 32, 64) are conventional for GPU efficiency. |
Learning Rate Scheduling for SGD
SGD benefits greatly from learning rate decay (reducing the learning rate over time to fine-tune parameters near the loss minimum):
python
# Example: Step decay (reduce LR by 10x every 10 epochs)
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(
lambda epoch: 0.01 * (0.1 ** (epoch // 10))
)
# Add to model.fit()
model.fit(..., callbacks=[lr_scheduler])
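Alternatively, recent TensorFlow/Keras versions let you pass a schedule object directly as the optimizer's learning rate instead of using a callback. A sketch with illustrative decay settings (continuing with the tf import from the Keras example above):
python
# Exponential decay applied per optimizer step: lr = 0.01 * 0.96**(step / 1000)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,
    decay_rate=0.96
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)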
SGD vs. Adam (Critical Comparison)
SGD and Adam are the two most common optimizers—knowing when to use each is key:
| Feature | SGD (with Momentum) | Adam |
|---|---|---|
| Convergence Speed | Slow (requires more epochs) | Fast (converges in fewer epochs) |
| Generalization | Often better (gradient noise acts as implicit regularization) | Often worse (can overfit, especially on small datasets) |
| Memory Usage | Low (only stores parameters/gradients) | High (stores first/second moments for each parameter) |
| Hyperparameter Sensitivity | High (lr/momentum need careful tuning) | Low (default hyperparameters work for most tasks) |
| Sparse Data | Poor (fixed lr for all parameters) | Excellent (adaptive lr per parameter) |
| Use Cases | Small datasets, RL, edge devices, better generalization | Large datasets, NLP/CNN/Transformer, sparse data |
When to Choose SGD Over Adam
- Small datasets: SGD generalizes better (Adam may overfit).
- Reinforcement Learning: SGD is more stable for policy gradient methods (e.g., REINFORCE, PPO).
- Edge devices: Lower memory usage (critical for mobile/embedded AI).
- Research: Easier to interpret (fewer moving parts than Adam).
Common Variants of SGD
- Nesterov Accelerated Gradient (NAG): A modified momentum that "looks ahead" along the momentum direction before computing the gradient, reducing overshooting of the minimum (see the sketch after this list):
\(v_t = \beta \cdot v_{t-1} + \alpha \cdot \nabla_\theta L(\theta_t - \beta \cdot v_{t-1})\)
\(\theta_{t+1} = \theta_t - v_t\)
Implemented in Keras as:
python
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
- SGD with Weight Decay: Adds an L2-style penalty that discourages large weights and helps prevent overfitting (the SGD counterpart of AdamW's decoupled weight decay). Supported in recent TensorFlow/Keras versions:
python
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, weight_decay=1e-4)
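For intuition about NAG's look-ahead step, here is a minimal NumPy sketch on toy data shaped like the manual regression example above (data, loop length, and hyperparameter values are illustrative, not a definitive implementation):
python
import numpy as np

# Toy data around y = 2x + 1, similar in shape to the manual example above
rng = np.random.default_rng(42)
x = np.linspace(-5, 5, 100)
y = 2 * x + 1 + rng.normal(0, 1, size=x.shape)

lr, beta = 0.01, 0.9
params = np.array([0.0, 0.0])        # [w, b]
v = np.zeros_like(params)

for epoch in range(20):
    idx = rng.permutation(len(x))                         # shuffle every epoch
    for x_i, y_i in zip(x[idx], y[idx]):
        lookahead = params - beta * v                     # peek ahead along the momentum direction
        err = lookahead[0] * x_i + lookahead[1] - y_i     # residual at the look-ahead point
        grad = np.array([err * x_i, err])                 # gradient of 0.5*err**2 w.r.t. [w, b]
        v = beta * v + lr * grad                          # velocity built from the look-ahead gradient
        params = params - v                               # step with the accumulated velocity
print(params)  # should end up near [2.0, 1.0]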
Summary
- Stochastic Gradient Descent (SGD) minimizes the loss function by updating parameters using gradients from single samples (or mini-batches), making it efficient for large datasets.
- SGD with momentum is the most practical variant: it smooths updates, reduces oscillations, and accelerates convergence.
- SGD often generalizes better than Adam but requires more hyperparameter tuning (especially the learning rate).
- Use SGD for small datasets, RL, or edge devices; use Adam for large datasets, sparse data, or fast convergence.