Adam (Adaptive Moment Estimation) is one of the most widely used optimization algorithms for training deep neural networks. Introduced by Kingma and Ba in 2014, Adam combines the strengths of two popular optimizers:
- Momentum: Accelerates gradient descent by accumulating past gradient information (smoothing out updates).
- RMSprop: Adapts the learning rate for each parameter using a running average of its squared gradients (helps when gradient magnitudes differ widely across parameters, as with sparse data).
Adam is adaptive (per-parameter learning rates), computationally efficient, and robust to hyperparameter choices—making it the default optimizer for most deep learning tasks (e.g., image classification, NLP, generative models).
Core Motivation
Traditional Stochastic Gradient Descent (SGD) uses a single fixed learning rate for all parameters, which has two major flaws:
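- Slow convergence: SGD oscillates across steep directions of the loss surface while making slow progress along shallow ones, especially on poorly conditioned, non-convex loss surfaces.
- One-size-fits-all learning rate: Parameters with sparse gradients (e.g., word embeddings in NLP) are updated rarely and benefit from larger effective learning rates than parameters with dense gradients, but SGD treats them equally.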
Adam solves these issues by:
- Tracking first-order moments (mean) of gradients (like momentum) to smooth updates.
- Tracking second-order moments (uncentered variance) of gradients (like RMSprop) to adapt learning rates per parameter.
- Correcting for bias in the estimated moments (critical for early training steps).
How Adam Works
Key Definitions
Let:
- \(\theta_t\): Model parameters (weights/biases) at time step t.
- \(g_t = \nabla_\theta L(\theta_t)\): Gradient of the loss L with respect to \(\theta_t\) at step t.
- \(\alpha\): Learning rate (default: 0.001).
- \(\beta_1\): Exponential decay rate for first-order moment (mean) (default: 0.9).
- \(\beta_2\): Exponential decay rate for second-order moment (variance) (default: 0.999).
- \(\epsilon\): Small constant to avoid division by zero (default: \(10^{-8}\)).
Adam Algorithm Steps
Adam iteratively updates parameters using five core steps:
Step 1: Compute Gradient
Calculate the gradient of the loss with respect to parameters (via backpropagation):
\(g_t = \nabla_\theta L(\theta_t)\)
Step 2: Update First-Order Moment (Momentum)
The first moment \(m_t\) (exponentially weighted moving average of gradients) acts like momentum—it smooths out noisy gradients:
\(m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t\)
- \(m_0 = 0\) (initialization).
- \(\beta_1 = 0.9\) means each update blends 10% of the current gradient with 90% of the previous moment estimate, so \(m_t\) averages over roughly the last \(1/(1 - \beta_1) = 10\) gradients.
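To make the smoothing effect concrete, here is a minimal NumPy sketch. The noisy gradient sequence and the helper name `ema_first_moment` are assumptions made for illustration; only the update rule itself comes from the formula above.

```python
import numpy as np

def ema_first_moment(grads, beta1=0.9):
    """Return m_1..m_T for a 1-D sequence of gradients, starting from m_0 = 0."""
    m = 0.0
    history = []
    for g in grads:
        m = beta1 * m + (1 - beta1) * g  # first-moment update
        history.append(m)
    return np.array(history)

# Hypothetical noisy gradients scattered around a true value of 1.0
rng = np.random.default_rng(0)
noisy_grads = 1.0 + rng.normal(0.0, 1.0, size=200)
m_history = ema_first_moment(noisy_grads)

# Skip the first 50 steps so the warm-up from m_0 = 0 does not dominate
print("std of raw gradients:", noisy_grads[50:].std())
print("std of first moment :", m_history[50:].std())
```

The moving average should fluctuate far less than the raw gradients, which is the smoothing that momentum provides.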
Step 3: Update Second-Order Moment (Adaptive Learning Rate)
The second moment \(v_t\) (exponentially weighted moving average of squared gradients) tracks the uncentered variance, i.e. the mean squared magnitude, of the gradients for each parameter:
\(v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2\)
- \(v_0 = 0\) (initialization).
- \(g_t^2\) is the element-wise square of the gradient.
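- \(\beta_2 = 0.999\) means the current squared gradient gets very little weight (0.1% of current + 99.9% of the past estimate), so \(v_t\) changes slowly and averages over roughly the last \(1/(1 - \beta_2) = 1000\) steps.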
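The element-wise nature of this update can be sketched as follows. The two-parameter gradient vector and the helper name `update_second_moment` are illustrative assumptions, not values from a real model.

```python
import numpy as np

def update_second_moment(v, grad, beta2=0.999):
    """One second-moment update; grad**2 is squared element by element."""
    return beta2 * v + (1 - beta2) * grad**2

v = np.zeros(2)
grad = np.array([10.0, 0.1])  # parameter 0 sees large gradients, parameter 1 small ones

for _ in range(3):
    v = update_second_moment(v, grad)

print("v after 3 steps:", v)
print("ratio v[0] / v[1]:", v[0] / v[1])
```

Each parameter keeps its own entry in `v`, so the ratio between the two entries reflects the ratio of the squared gradients (here 10,000). Both entries are still far below `grad**2` after three steps because `v` starts at zero.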
Step 4: Correct Bias in Moments
Since \(m_0 = 0\) and \(v_0 = 0\), early estimates of \(m_t\) and \(v_t\) are biased toward zero. We correct this with bias-corrected moments:
\(\hat{m}_t = \frac{m_t}{1 - \beta_1^t}\)
\(\hat{v}_t = \frac{v_t}{1 - \beta_2^t}\)
- As t increases, \(\beta_1^t\) and \(\beta_2^t\) shrink toward 0, so both denominators approach 1 and the correction fades out (bias vanishes).
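A small worked example shows why the correction matters in the first few steps. It assumes a constant gradient of 0.5, chosen purely for illustration, so the correct moment values are known in advance (0.5 and 0.25).

```python
beta1, beta2 = 0.9, 0.999
g = 0.5  # constant gradient (illustrative assumption)

m, v = 0.0, 0.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)  # bias-corrected first moment
    v_hat = v / (1 - beta2**t)  # bias-corrected second moment
    print(f"t={t}: m={m:.4f} m_hat={m_hat:.4f} | v={v:.6f} v_hat={v_hat:.6f}")
```

The raw estimates `m` and `v` start far below the true values, while `m_hat` and `v_hat` recover 0.5 and 0.25 at every step (up to floating-point rounding).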
Step 5: Update Parameters
Finally, update parameters using the bias-corrected moments to adapt the learning rate per parameter:
\(\theta_{t+1} = \theta_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\)
- For parameters whose recent gradients have been large in magnitude (\(\hat{v}_t\) large), the effective learning rate \(\alpha / (\sqrt{\hat{v}_t} + \epsilon)\) is small (prevents large updates).
- For parameters whose recent gradients have been small in magnitude (\(\hat{v}_t\) small), the effective learning rate is large (speeds up convergence).
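One consequence of this normalization is that the size of an Adam update depends very little on the raw gradient scale. The sketch below evaluates the very first update (t = 1) for three gradients of very different magnitude; the helper name `adam_first_step` and the gradient values are assumptions made for illustration.

```python
import numpy as np

def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update (t = 1), starting from m_0 = v_0 = 0."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad**2
    m_hat = m / (1 - beta1)  # bias correction with t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (np.sqrt(v_hat) + eps)

grad = np.array([1000.0, 1.0, -0.001])
print("SGD step :", 0.001 * grad)            # scales linearly with the gradient
print("Adam step:", adam_first_step(grad))   # about lr in magnitude for every entry
```

With plain SGD the three updates differ by six orders of magnitude, while the Adam updates all have magnitude close to the learning rate and keep only the sign of the gradient.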
Adam Optimizer Implementation (Python: Manual vs. TensorFlow/Keras)
We first implement Adam manually for a simple regression task to illustrate the math, then show how to use it in TensorFlow/Keras (the standard approach for real-world use).
Step 1: Manual Adam Implementation
python
import numpy as np
import matplotlib.pyplot as plt

# --------------------------
# 1. Define a simple regression task
# --------------------------
# Generate synthetic data: y = 2x + 1 + noise
np.random.seed(42)
x = np.linspace(-5, 5, 100)
y_true = 2 * x + 1
y = y_true + np.random.normal(0, 1, size=x.shape)  # Add noise

# Model: y_hat = w*x + b (weight w, bias b)
def predict(x, w, b):
    return w * x + b

# MSE loss
def mse_loss(y_hat, y):
    return np.mean((y_hat - y) ** 2)

# Gradient of loss with respect to w and b
def compute_gradient(x, y, w, b):
    y_hat = predict(x, w, b)
    dw = 2 * np.mean((y_hat - y) * x)  # dL/dw
    db = 2 * np.mean(y_hat - y)        # dL/db
    return np.array([dw, db])

# --------------------------
# 2. Manual Adam Optimizer
# --------------------------
class AdamOptimizer:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment (initialized later)
        self.v = None  # Second moment (initialized later)
        self.t = 0     # Time step

    def update(self, params, grad):
        # Initialize moments if first call
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1
        # Step 2: Update first moment (momentum)
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        # Step 3: Update second moment (variance)
        self.v = self.beta2 * self.v + (1 - self.beta2) * (grad ** 2)
        # Step 4: Bias correction
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Step 5: Update parameters
        params = params - self.lr * (m_hat / (np.sqrt(v_hat) + self.epsilon))
        return params

# --------------------------
# 3. Train the model with Adam
# --------------------------
# Initialize parameters (w, b)
params = np.array([0.0, 0.0])  # w=0, b=0
adam = AdamOptimizer(lr=0.1)   # Higher lr for faster convergence on this simple task
loss_history = []

# Training loop (100 iterations)
for epoch in range(100):
    # Compute gradient
    grad = compute_gradient(x, y, params[0], params[1])
    # Update parameters with Adam
    params = adam.update(params, grad)
    # Compute and store loss
    y_hat = predict(x, params[0], params[1])
    loss = mse_loss(y_hat, y)
    loss_history.append(loss)
    # Print progress
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1} | Loss: {loss:.4f} | w={params[0]:.4f}, b={params[1]:.4f}")

# Plot loss curve
plt.figure(figsize=(8, 4))
plt.plot(loss_history)
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Adam Optimizer: Loss Over Time")
plt.grid(True)
plt.show()

# Plot predicted vs. true values
plt.figure(figsize=(8, 4))
plt.scatter(x, y, label="Data (with noise)")
plt.plot(x, y_true, "r-", label="True: y=2x+1")
plt.plot(x, predict(x, params[0], params[1]), "g--", label=f"Adam Prediction: y={params[0]:.2f}x+{params[1]:.2f}")
plt.legend()
plt.title("Regression Fit with Adam")
plt.show()
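Because the data contain noise, the minimum of the MSE loss is the least-squares line for this particular sample rather than exactly w = 2, b = 1. As a sanity check, the sketch below recomputes the same synthetic data and obtains that reference solution in closed form with `np.polyfit`; the variable names `w_ls` and `b_ls` are introduced here for illustration.

```python
import numpy as np

# Recreate the same synthetic data as the manual example (same seed, same noise draw)
np.random.seed(42)
x = np.linspace(-5, 5, 100)
y = 2 * x + 1 + np.random.normal(0, 1, size=x.shape)

# Closed-form least-squares line: np.polyfit returns [slope, intercept] for degree 1
w_ls, b_ls = np.polyfit(x, y, 1)
print(f"Least-squares reference: w={w_ls:.4f}, b={b_ls:.4f}")
```

Comparing `params` from the training loop with `w_ls` and `b_ls` indicates how close Adam got to the actual minimizer of the loss.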
Step 2: Adam in TensorFlow/Keras (Standard Usage)
For real-world deep learning, you rarely need to implement Adam manually: TensorFlow/Keras provides an optimized implementation.
python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Build a simple CNN for MNIST
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax")
])

# Compile model with Adam optimizer (passed explicitly)
model.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=0.001,  # Default lr
        beta_1=0.9,           # Default beta1
        beta_2=0.999,         # Default beta2
        epsilon=1e-8          # Value from the Adam paper (Keras itself defaults to 1e-7)
    ),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

# Load MNIST data and scale pixels to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
# Add the channel dimension expected by Conv2D: (N, 28, 28) -> (N, 28, 28, 1)
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)

# Train model
history = model.fit(
    x_train, y_train,
    epochs=5,
    batch_size=64,
    validation_split=0.1
)

# Evaluate
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test Accuracy with Adam: {test_acc:.4f}")
Key Outputs
- Manual Implementation: The loss decreases rapidly, and the model converges to \(w \approx 2\) and \(b \approx 1\) (close to the true values; the noise in the data keeps the fit from matching them exactly).
- TensorFlow/Keras: A CNN of this size typically reaches around 98% test accuracy on MNIST with Adam; vanilla SGD would usually require more epochs and more tuning to match it.
Key Hyperparameters of Adam
Adam’s default hyperparameters work well for most tasks, but you may need to tune them for edge cases:
| Hyperparameter | Default Value | Purpose | Tuning Tips |
|---|---|---|---|
| \(\alpha\) (learning rate) | 0.001 | Base learning rate | Reduce if training is unstable (loss oscillates); increase (e.g., 0.01) for simple tasks (regression, small CNNs). |
| \(\beta_1\) | 0.9 | Momentum decay | Higher values (0.95) give more smoothing (good for noisy gradients); lower values (0.8) give faster adaptation to new gradients. |
| \(\beta_2\) | 0.999 | Variance decay | Rarely tuned (0.999 works well in most cases); lower values (0.99) give faster adaptation to variance changes. |
| \(\epsilon\) | \(10^{-8}\) | Numerical stability | Rarely changed (prevents division by zero). Note that Keras uses \(10^{-7}\) as its own default. |
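A rule of thumb that may help when reading the tuning tips for \(\beta_1\) and \(\beta_2\): an exponential moving average with decay rate \(\beta\) averages over roughly the last \(1/(1 - \beta)\) steps. The helper name `effective_window` below is an assumption for illustration, and the figure is an approximation rather than a hard cutoff.

```python
def effective_window(beta):
    """Approximate number of recent steps an EMA with decay `beta` averages over."""
    return 1.0 / (1.0 - beta)

settings = [
    ("beta_1 = 0.8", 0.8),
    ("beta_1 = 0.9", 0.9),
    ("beta_1 = 0.95", 0.95),
    ("beta_2 = 0.99", 0.99),
    ("beta_2 = 0.999", 0.999),
]
for name, beta in settings:
    print(f"{name}: ~{effective_window(beta):.0f} steps")
```

Under this approximation, the default \(\beta_1 = 0.9\) looks back about 10 steps and the default \(\beta_2 = 0.999\) about 1000 steps, which is why lowering \(\beta_2\) to 0.99 makes the variance estimate react roughly ten times faster.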
Learning Rate Scheduling with Adam
To further improve convergence, you can decay the learning rate over time, either on a fixed schedule (e.g., reduce by 10x after 50 epochs) or whenever the validation loss stops improving, as in this example:
python
# Example: halve the LR whenever val_loss has not improved for 10 epochs
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.5,
    patience=10,
    min_lr=1e-6
)
# Add to model.fit()
model.fit(..., callbacks=[lr_scheduler])
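For the fixed-schedule variant ("reduce by 10x after 50 epochs"), a step-decay function can be sketched in plain Python. The constants and the function name `step_decay` are assumptions chosen to match that example, not values prescribed by Adam.

```python
INITIAL_LR = 1e-3      # assumed starting learning rate
DROP_FACTOR = 0.1      # "reduce by 10x"
EPOCHS_PER_DROP = 50   # "after 50 epochs"

def step_decay(epoch, current_lr=None):
    """Learning rate for a 0-based epoch index; `current_lr` is accepted but unused."""
    return INITIAL_LR * DROP_FACTOR ** (epoch // EPOCHS_PER_DROP)

for epoch in (0, 49, 50, 99, 100):
    print(f"epoch {epoch:3d}: lr = {step_decay(epoch):.6f}")
```

A function taking the epoch index and the current learning rate is the form that `tf.keras.callbacks.LearningRateScheduler` expects, so `step_decay` could be passed to that callback in place of `ReduceLROnPlateau`.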
Adam vs. Other Optimizers
| Optimizer | Key Features | Pros | Cons |
|---|---|---|---|
| SGD (Vanilla) | Fixed learning rate, no momentum | Simple, low memory usage | Slow convergence, oscillates around minimum |
| SGD + Momentum | Accumulates past gradients | Faster than vanilla SGD | Still uses fixed learning rate |
| RMSprop | Adaptive learning rate (second moment) | Good for sparse gradients | No momentum (slower than Adam) |
| Adam | Momentum + adaptive learning rate | Fast convergence, robust to hyperparameters, good for sparse data | Slightly higher memory usage than SGD |
| AdamW | Adam + decoupled weight decay (applied directly to the weights rather than as an L2 penalty in the gradient) | Reduces overfitting | Slightly more hyperparameters to tune |
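To see what Adam combines, the first three rows of the table can be written as single-step update rules. This is a sketch using one common formulation of each rule; the function names, the momentum coefficient `mu`, and the RMSprop decay `rho` are assumptions, and libraries differ in small details.

```python
import numpy as np

def sgd_step(theta, g, lr):
    """Vanilla SGD: no optimizer state."""
    return theta - lr * g

def momentum_step(theta, g, velocity, lr, mu=0.9):
    """SGD + momentum: one state buffer (velocity) per parameter."""
    velocity = mu * velocity + g
    return theta - lr * velocity, velocity

def rmsprop_step(theta, g, sq_avg, lr, rho=0.9, eps=1e-8):
    """RMSprop: one state buffer (running average of squared gradients) per parameter."""
    sq_avg = rho * sq_avg + (1 - rho) * g**2
    return theta - lr * g / (np.sqrt(sq_avg) + eps), sq_avg

theta = np.array([1.0, 1.0])
g = np.array([100.0, 1.0])  # hypothetical gradient with very different scales
lr = 0.01

print("SGD     :", sgd_step(theta, g, lr))
print("Momentum:", momentum_step(theta, g, np.zeros(2), lr)[0])
print("RMSprop :", rmsprop_step(theta, g, np.zeros(2), lr)[0])
```

SGD and momentum move the first parameter 100 times further than the second, while RMSprop moves both by the same amount. Adam keeps both kinds of state (a momentum-style buffer and a squared-gradient buffer), which is where its extra memory cost comes from.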
When to Use Adam
- Default choice for most deep learning tasks (CNNs, Transformers, GANs).
- Sparse data (e.g., NLP tasks with word embeddings).
- Large datasets/models (faster convergence than SGD).
When to Use SGD Instead
- Small datasets (SGD, usually with momentum, often generalizes better).
- Reinforcement learning (can be more stable for some policy gradient methods).
- Edge devices (lower memory usage).
Common Variants of Adam
- AdamW: Decouples weight decay from the gradient-based update instead of adding an L2 penalty to the loss, which is more effective than standard L2 regularization with Adam. It is now the default for large models (e.g., Transformers).
python
optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=1e-4)
- AMSGrad: Fixes a theoretical flaw in Adam by using the maximum of past second moments instead of the exponential average. Rarely needed (Adam usually works as well or better in practice).
- AdaBelief: Scales each step by how far the current gradient deviates from its moving average \(m_t\) (the “belief” in the gradient) rather than by the raw squared gradient. Good for unstable training regimes (e.g., GANs).
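The difference between L2 regularization inside Adam and the decoupled weight decay of AdamW can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the function names are hypothetical, the loss gradient is set to zero to isolate the regularizer, and the AdamW step follows one common formulation in which the decay term is scaled by the learning rate.

```python
import numpy as np

def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam moment update; returns the normalized step direction and new moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def adam_l2_step(theta, grad, m, v, t, lr, wd):
    """Adam + L2: the penalty gradient wd * theta passes through the moment estimates."""
    direction, m, v = adam_direction(grad + wd * theta, m, v, t)
    return theta - lr * direction, m, v

def adamw_step(theta, grad, m, v, t, lr, wd):
    """AdamW: moments see only the loss gradient; decay is applied to the weights directly."""
    direction, m, v = adam_direction(grad, m, v, t)
    return theta - lr * direction - lr * wd * theta, m, v

theta = np.array([10.0, 0.1])  # one large weight, one small weight
zero_grad = np.zeros(2)        # no loss gradient, so only the regularizer acts
zeros = np.zeros(2)

theta_l2, _, _ = adam_l2_step(theta, zero_grad, zeros, zeros, t=1, lr=0.001, wd=1e-4)
theta_w, _, _ = adamw_step(theta, zero_grad, zeros, zeros, t=1, lr=0.001, wd=1e-4)
print("Adam + L2 shrinkage:", theta - theta_l2)
print("AdamW shrinkage    :", theta - theta_w)
```

With the L2 penalty, Adam's normalization rescales the regularization gradient, so both weights shrink by about the learning rate regardless of their size. With decoupled decay, each weight shrinks in proportion to its own magnitude, which is the behavior weight decay is meant to have.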
Summary
- Adam combines momentum (first-order moments) and adaptive learning rates (second-order moments) to speed up convergence and improve stability.
- It uses bias correction to fix early-stage moment estimates and adapts the learning rate per parameter (smaller for high-variance gradients, larger for low-variance).
- Adam is the default optimizer for most deep learning tasks, and its default hyperparameters work well for most use cases.
- For real-world use, prefer optimized implementations (e.g., tf.keras.optimizers.Adam) over manual code.