Adam (Adaptive Moment Estimation) is one of the most widely used optimization algorithms for training deep neural networks. Introduced by Kingma and Ba in 2014, Adam combines the strengths of two popular optimizers:
- Momentum: Accelerates gradient descent by accumulating past gradient information (smoothing out updates).
- RMSprop: Adapts the learning rate for each parameter based on the historical variance of its gradients (improves convergence on sparse data).
Adam is adaptive (per-parameter learning rates), computationally efficient, and robust to hyperparameter choices—making it the default optimizer for most deep learning tasks (e.g., image classification, NLP, generative models).
Core Motivation
Traditional Stochastic Gradient Descent (SGD) uses a single fixed learning rate for all parameters, which has two major flaws:
- Slow convergence: SGD oscillates around the loss minimum, especially on non-convex loss surfaces.
- One-size-fits-all learning rate: Parameters with sparse gradients (e.g., word embeddings in NLP) need smaller learning rates than dense gradients, but SGD treats them equally.
Adam solves these issues by:
- Tracking first-order moments (mean) of gradients (like momentum) to smooth updates.
- Tracking second-order moments (uncentered variance) of gradients (like RMSprop) to adapt learning rates per parameter.
- Correcting for bias in the estimated moments (critical for early training steps).
How Adam Works
Key Definitions
Let:
- \(\theta_t\): Model parameters (weights/biases) at time step t.
- \(g_t = \nabla_\theta L(\theta_t)\): Gradient of the loss L with respect to \(\theta_t\) at step t.
- \(\alpha\): Learning rate (default: 0.001).
- \(\beta_1\): Exponential decay rate for first-order moment (mean) (default: 0.9).
- \(\beta_2\): Exponential decay rate for second-order moment (variance) (default: 0.999).
- \(\epsilon\): Small constant to avoid division by zero (default: \(10^{-8}\)).
Adam Algorithm Steps
Adam iteratively updates parameters using five core steps:
Step 1: Compute Gradient
Calculate the gradient of the loss with respect to parameters (via backpropagation):
\(g_t = \nabla_\theta L(\theta_t)\)
Step 2: Update First-Order Moment (Momentum)
The first moment \(m_t\) (exponentially weighted moving average of gradients) acts like momentum—it smooths out noisy gradients:
\(m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t\)
- \(m_0 = 0\) (initialization).
- \(\beta_1 = 0.9\) means each update blends 10% of the current gradient with 90% of the accumulated momentum, so noisy gradients are smoothed out.
Step 3: Update Second-Order Moment (Adaptive Learning Rate)
The second moment \(v_t\) (exponentially weighted moving average of squared gradients) captures the variance of gradients for each parameter:
\(v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2\)
- \(v_0 = 0\) (initialization).
- \(g_t^2\) is the element-wise square of the gradient.
- \(\beta_2 = 0.999\) means the variance estimate changes slowly (0.1% of the current squared gradient + 99.9% of the running estimate).
Step 4: Correct Bias in Moments
Since \(m_0 = 0\) and \(v_0 = 0\), early estimates of \(m_t\) and \(v_t\) are biased toward zero. We correct this with bias-corrected moments:
\(\hat{m}_t = \frac{m_t}{1 - \beta_1^t}\)
\(\hat{v}_t = \frac{v_t}{1 - \beta_2^t}\)
- As t increases, \(1 - \beta_1^t\) and \(1 - \beta_2^t\) approach 1, so the correction fades and \(\hat{m}_t \approx m_t\), \(\hat{v}_t \approx v_t\).
Step 5: Update Parameters
Finally, update parameters using the bias-corrected moments so that the learning rate adapts per parameter (a worked numeric example follows the bullets below):
\(\theta_{t+1} = \theta_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\)
- For parameters with high gradient variance (\(\hat{v}_t\) large), the effective learning rate is small (prevents large updates).
- For parameters with low gradient variance (\(\hat{v}_t\) small), the effective learning rate is large (speeds up convergence).
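To make the five steps concrete, here is one worked update for a single parameter, using the default hyperparameters and a first gradient of \(g_1 = 2\):
- Step 2: \(m_1 = 0.9 \cdot 0 + 0.1 \cdot 2 = 0.2\)
- Step 3: \(v_1 = 0.999 \cdot 0 + 0.001 \cdot 2^2 = 0.004\)
- Step 4: \(\hat{m}_1 = 0.2 / (1 - 0.9) = 2\), \(\hat{v}_1 = 0.004 / (1 - 0.999) = 4\)
- Step 5: \(\Delta\theta = \alpha \cdot \hat{m}_1 / (\sqrt{\hat{v}_1} + \epsilon) \approx 0.001 \cdot 2 / 2 = 0.001\)
After bias correction, the magnitude of the first update is roughly \(\alpha\) regardless of the raw gradient scale, which is one reason Adam behaves stably early in training.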
Adam Optimizer Implementation (Python: Manual vs. TensorFlow/Keras)
We first implement Adam manually for a simple regression task to illustrate the math, then show how to use it in TensorFlow/Keras (the standard approach for real-world use).
Step 1: Manual Adam Implementation
```python
import numpy as np
import matplotlib.pyplot as plt

# --------------------------
# 1. Define a simple regression task
# --------------------------
# Generate synthetic data: y = 2x + 1 + noise
np.random.seed(42)
x = np.linspace(-5, 5, 100)
y_true = 2 * x + 1
y = y_true + np.random.normal(0, 1, size=x.shape)  # Add noise

# Model: y_hat = w*x + b (weight w, bias b)
def predict(x, w, b):
    return w * x + b

# MSE loss
def mse_loss(y_hat, y):
    return np.mean((y_hat - y) ** 2)

# Gradient of the loss with respect to w and b
def compute_gradient(x, y, w, b):
    y_hat = predict(x, w, b)
    dw = 2 * np.mean((y_hat - y) * x)  # dL/dw
    db = 2 * np.mean(y_hat - y)        # dL/db
    return np.array([dw, db])

# --------------------------
# 2. Manual Adam Optimizer
# --------------------------
class AdamOptimizer:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment (initialized on first update)
        self.v = None  # Second moment (initialized on first update)
        self.t = 0     # Time step

    def update(self, params, grad):
        # Initialize moments on the first call
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1
        # Step 2: Update first moment (momentum)
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        # Step 3: Update second moment (variance)
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        # Step 4: Bias correction
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Step 5: Update parameters
        params = params - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return params

# --------------------------
# 3. Train the model with Adam
# --------------------------
# Initialize parameters (w, b)
params = np.array([0.0, 0.0])  # w=0, b=0
adam = AdamOptimizer(lr=0.1)   # Higher lr for faster convergence on this simple task
loss_history = []

# Training loop (100 iterations)
for epoch in range(100):
    # Compute gradient
    grad = compute_gradient(x, y, params[0], params[1])
    # Update parameters with Adam
    params = adam.update(params, grad)
    # Compute and store loss
    y_hat = predict(x, params[0], params[1])
    loss = mse_loss(y_hat, y)
    loss_history.append(loss)
    # Print progress
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1} | Loss: {loss:.4f} | w={params[0]:.4f}, b={params[1]:.4f}")

# Plot loss curve
plt.figure(figsize=(8, 4))
plt.plot(loss_history)
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Adam Optimizer: Loss Over Time")
plt.grid(True)
plt.show()

# Plot predicted vs. true values
plt.figure(figsize=(8, 4))
plt.scatter(x, y, label="Data (with noise)")
plt.plot(x, y_true, "r-", label="True: y=2x+1")
plt.plot(x, predict(x, params[0], params[1]), "g--", label=f"Adam Prediction: y={params[0]:.2f}x+{params[1]:.2f}")
plt.legend()
plt.title("Regression Fit with Adam")
plt.show()
```
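For a side-by-side comparison, here is a minimal vanilla gradient-descent loop on the same regression task. It reuses x, y, predict, mse_loss, compute_gradient, and loss_history from the script above, so run it in the same session; plotting the two loss curves together shows how the optimizers differ on this problem.
```python
# Vanilla gradient descent baseline (assumes the script above has been run)
gd_params = np.array([0.0, 0.0])  # w=0, b=0
gd_lr = 0.1                       # Same base step size as the Adam run
gd_loss_history = []

for epoch in range(100):
    grad = compute_gradient(x, y, gd_params[0], gd_params[1])
    gd_params = gd_params - gd_lr * grad  # Plain update: no momentum, no adaptation
    gd_loss_history.append(mse_loss(predict(x, gd_params[0], gd_params[1]), y))

# Compare loss curves
plt.figure(figsize=(8, 4))
plt.plot(loss_history, label="Adam")
plt.plot(gd_loss_history, label="Vanilla gradient descent")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.legend()
plt.title("Adam vs. Vanilla Gradient Descent")
plt.show()
```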
Step 2: Adam in TensorFlow/Keras (Standard Usage)
For real-world deep learning, you rarely need to implement Adam manually; TensorFlow/Keras provides a highly optimized version:
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Build a simple CNN for MNIST
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax")
])

# Compile the model with the Adam optimizer
model.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=0.001,  # Default learning rate
        beta_1=0.9,           # Default beta_1
        beta_2=0.999,         # Default beta_2
        epsilon=1e-8          # Small constant for numerical stability
    ),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

# Load and preprocess MNIST data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
x_train = np.expand_dims(x_train, axis=-1)  # Add channel dimension: (28, 28) -> (28, 28, 1)
x_test = np.expand_dims(x_test, axis=-1)

# Train the model
history = model.fit(
    x_train, y_train,
    epochs=5,
    batch_size=64,
    validation_split=0.1
)

# Evaluate on the test set
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test Accuracy with Adam: {test_acc:.4f}")
```
Key Outputs
- Manual Implementation: The loss decreases rapidly, and the model converges to \(w \approx 2\) and \(b \approx 1\) (matching the true values).
- TensorFlow/Keras: The CNN achieves ~98% test accuracy with Adam, outperforming vanilla SGD (which would require more epochs and tuning).
Key Hyperparameters of Adam
Adam’s default hyperparameters work well for most tasks, but you may need to tune them for edge cases:
| Hyperparameter | Default Value | Purpose | Tuning Tips |
|---|---|---|---|
| \(\alpha\) (learning rate) | 0.001 | Base learning rate | Reduce if training is unstable (loss oscillates); increase (e.g., 0.01) for simple tasks (regression, small CNNs). |
| \(\beta_1\) | 0.9 | Momentum decay | Higher values (e.g., 0.95) give more smoothing (good for noisy gradients); lower values (e.g., 0.8) adapt faster to new gradients. |
| \(\beta_2\) | 0.999 | Variance decay | Rarely tuned (0.999 works well for most cases); lower values (e.g., 0.99) adapt faster to changes in gradient variance. |
| \(\epsilon\) | \(10^{-8}\) | Numerical stability | Rarely changed; it mainly prevents division by zero. |
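As a concrete illustration of the tips above, here is one way you might configure Adam for noisy or unstable training; the specific values are illustrative assumptions, not recommendations:
```python
import tensorflow as tf

# Illustrative configuration for noisy/unstable training:
# a smaller base learning rate plus heavier gradient smoothing
cautious_adam = tf.keras.optimizers.Adam(
    learning_rate=1e-4,  # Reduced from the 1e-3 default
    beta_1=0.95,         # More smoothing of noisy gradients
    beta_2=0.999,        # Usually left at the default
    epsilon=1e-8
)
```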
Learning Rate Scheduling with Adam
To further improve convergence, you can decay the learning rate during training, either on a fixed schedule (e.g., reduce it by 10x after 50 epochs) or adaptively when the validation loss stops improving:
```python
# Example: halve the learning rate whenever val_loss has not improved for 10 epochs
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.5,
    patience=10,
    min_lr=1e-6
)

# Add to model.fit()
model.fit(..., callbacks=[lr_scheduler])
```
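For the fixed-schedule variant mentioned above (e.g., reducing the learning rate by 10x after 50 epochs), you can also pass a learning-rate schedule directly to Adam. A minimal sketch; steps_per_epoch is an assumed value that depends on your dataset size and batch size:
```python
import tensorflow as tf

steps_per_epoch = 844  # Assumption: ~54,000 training samples / batch size 64

# Multiply the learning rate by 0.1 every 50 epochs (step-based decay)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=50 * steps_per_epoch,
    decay_rate=0.1,
    staircase=True  # Decay in discrete jumps rather than continuously
)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```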
Adam vs. Other Optimizers
| Optimizer | Key Features | Pros | Cons |
|---|---|---|---|
| SGD (Vanilla) | Fixed learning rate, no momentum | Simple, low memory usage | Slow convergence, oscillates around minimum |
| SGD + Momentum | Accumulates past gradients | Faster than vanilla SGD | Still uses fixed learning rate |
| RMSprop | Adaptive learning rate (second moment) | Good for sparse gradients | No momentum (slower than Adam) |
| Adam | Momentum + adaptive learning rate | Fast convergence, robust to hyperparameters, good for sparse data | Slightly higher memory usage than SGD |
| AdamW | Adam + weight decay (L2 regularization) | Reduces overfitting | Slightly more hyperparameters to tune |
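For reference, each optimizer in the table above has a one-line Keras constructor. A quick sketch; the hyperparameter values are illustrative defaults, not tuned, and AdamW requires a reasonably recent TensorFlow release:
```python
import tensorflow as tf

sgd = tf.keras.optimizers.SGD(learning_rate=0.01)                          # Vanilla SGD
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)   # SGD + Momentum
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)                 # RMSprop
adam = tf.keras.optimizers.Adam(learning_rate=0.001)                       # Adam
adamw = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=1e-4)  # AdamW (decoupled weight decay)
```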
When to Use Adam
- Default choice for most deep learning tasks (CNNs, Transformers, GANs).
- Sparse data (e.g., NLP tasks with word embeddings).
- Large datasets/models (faster convergence than SGD).
When to Use SGD Instead
- Small datasets (SGD, especially with momentum, often generalizes better).
- Reinforcement learning (often more stable for policy-gradient methods).
- Edge devices (lower memory usage).
Common Variants of Adam
- AdamW: Adds decoupled weight decay to Adam, which regularizes more effectively than applying standard L2 regularization with Adam. It is now the default choice for large models (e.g., Transformers):
```python
optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=1e-4)
```
- AMSGrad: Fixes a theoretical convergence flaw in Adam by using the maximum of past second moments instead of their exponential average. Rarely needed (plain Adam often works as well or better in practice).
- AdaBelief: Adapts the step size based on the "belief" in the current gradient, i.e., how much it deviates from the first-moment prediction. Useful for unstable training regimes (e.g., GANs).
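In Keras, AMSGrad does not need a separate optimizer class; it is exposed as a flag on the standard Adam implementation. A minimal sketch, where model refers to the CNN compiled earlier:
```python
import tensorflow as tf

# AMSGrad variant: keeps the running maximum of past second moments
# instead of their exponential moving average
amsgrad_adam = tf.keras.optimizers.Adam(learning_rate=0.001, amsgrad=True)

model.compile(optimizer=amsgrad_adam, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```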
Summary
- The Adam optimizer combines momentum (first-order moments) and adaptive per-parameter learning rates (second-order moments) to speed up convergence and improve stability.
- It uses bias correction to fix early-stage moment estimates and adapts the learning rate per parameter (smaller for high-variance gradients, larger for low-variance ones).
- Adam is the default optimizer for most deep learning tasks; its default hyperparameters work well for nearly all use cases.
- For real-world use, prefer optimized implementations (e.g., tf.keras.optimizers.Adam) over manual code.