Cross-Entropy Loss is a core loss function in machine learning, primarily used for classification tasks (both binary and multi-class). It quantifies the difference between two probability distributions: the true label distribution (ground truth) and the predicted probability distribution output by the model. The goal of training is to minimize this difference, which pushes the model to make more accurate predictions.
Cross-entropy loss is mathematically derived from the concept of information entropy in information theory, and it is widely paired with models that output probabilities via a softmax (multi-class) or sigmoid (binary) activation function.
I. Mathematical Foundations
1. Entropy and Cross-Entropy
- Entropy (\(H(p)\)): Measures the uncertainty of a probability distribution p. For a discrete distribution, it is defined as \(H(p) = -\sum_{i=1}^n p_i \log(p_i)\). Higher entropy means higher uncertainty (e.g., a uniform distribution has maximum entropy).
- Cross-Entropy (\(H(p,q)\)): Measures the average number of bits needed to encode samples from distribution p using a code optimized for distribution q. The formula is \(H(p,q) = -\sum_{i=1}^n p_i \log(q_i)\), where p is the true distribution and q is the predicted distribution.
For classification tasks, the true label distribution p is a one-hot vector (e.g., for class 2 in a 3-class task: \(p = [0,1,0]\)), which simplifies the cross-entropy calculation significantly.
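For a quick, made-up numerical check: with \(p = [0, 1, 0]\) and a predicted distribution \(q = [0.2, 0.7, 0.1]\), the cross-entropy is \(H(p,q) = -(0 \cdot \log 0.2 + 1 \cdot \log 0.7 + 0 \cdot \log 0.1) = -\log 0.7 \approx 0.357\) (using the natural logarithm). Only the predicted probability assigned to the true class contributes to the loss.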
2. Binary Cross-Entropy Loss (BCE Loss)
Used for binary classification tasks (two classes: 0 and 1). The model outputs a probability \(\hat{y} \in [0,1]\) via a sigmoid function, where \(\hat{y}\) is the probability of the sample belonging to class 1.
Formula
For a single sample with true label \(y \in \{0,1\}\) and predicted probability \(\hat{y}\):
\(\text{BCE}(y, \hat{y}) = -\left[ y \log(\hat{y}) + (1-y) \log(1-\hat{y}) \right]\)
For a batch of m samples, the average BCE loss is:
\(\text{BCE}_{\text{batch}} = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]\)
Key Notes
- If \(y=1\), the loss simplifies to \(-\log(\hat{y})\): the loss increases sharply as \(\hat{y}\) approaches 0.
- If \(y=0\), the loss simplifies to \(-\log(1-\hat{y})\): the loss increases sharply as \(\hat{y}\) approaches 1.
- The sigmoid activation ensures \(\hat{y}\) is within [0,1], avoiding undefined values for \(\log\).
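As a sanity check of the formula above, here is a minimal sketch that computes BCE by hand with plain PyTorch tensor operations and compares it to nn.BCELoss; the probabilities and labels are arbitrary illustration values.

```python
import torch
import torch.nn as nn

# Arbitrary predicted probabilities and 0/1 labels, chosen for illustration
y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])
y_pred = torch.tensor([0.9, 0.2, 0.6, 0.4])

# Clamp predictions away from exactly 0 and 1 so log() stays finite
eps = 1e-7
y_pred = y_pred.clamp(eps, 1 - eps)

# Element-wise BCE from the formula above, averaged over the batch
manual_bce = -(y_true * torch.log(y_pred) + (1 - y_true) * torch.log(1 - y_pred)).mean()
builtin_bce = nn.BCELoss()(y_pred, y_true)
print(manual_bce.item(), builtin_bce.item())  # the two values agree
```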
3. Categorical Cross-Entropy Loss (CCE Loss)
Used for multi-class classification tasks (more than two classes). The model outputs a probability vector \(\hat{y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_k]\) via a softmax function, where k is the number of classes and \(\sum_{i=1}^k \hat{y}_i = 1\).
The true label y is a one-hot vector (e.g., for class c: \(y_c = 1\), and \(y_i = 0\) for \(i \neq c\)).
Formula
For a single sample with true label y (one-hot) and predicted probability vector \(\hat{y}\):
\(\text{CCE}(y, \hat{y}) = -\sum_{i=1}^k y_i \log(\hat{y}_i) = -\log(\hat{y}_c)\)
The last equality holds because all \(y_i = 0\) except for the true class c (where \(y_c = 1\)).
For a batch of m samples, the average CCE loss is:
\(\text{CCE}_{\text{batch}} = -\frac{1}{m} \sum_{j=1}^m \log(\hat{y}_{j, c_j})\)
Where \(c_j\) is the true class of the j-th sample.
Key Notes
- The softmax function normalizes the model’s raw outputs (logits) into valid probabilities, ensuring \(\sum \hat{y}_i = 1\).
- CCE loss penalizes the model more heavily when the predicted probability of the true class is low.
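The reduction to \(-\log(\hat{y}_c)\) can be verified with a small sketch (arbitrary logits and labels): picking out the true-class probability after a softmax and averaging \(-\log\) of it matches PyTorch's built-in cross-entropy.

```python
import torch
import torch.nn.functional as F

# Arbitrary logits for 2 samples and 3 classes (illustration only)
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 1.5, 0.3]])
targets = torch.tensor([0, 1])  # true class index of each sample

probs = F.softmax(logits, dim=1)                     # predicted probability vectors
true_class_probs = probs[torch.arange(2), targets]   # \hat{y}_c for each sample
manual_cce = -torch.log(true_class_probs).mean()     # average of -log(\hat{y}_c)

print(manual_cce.item())
print(F.cross_entropy(logits, targets).item())  # matches up to floating-point error
```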
4. Sparse Categorical Cross-Entropy Loss
A variant of CCE loss optimized for memory efficiency. Instead of one-hot encoded labels (mostly zeros, yet stored densely and therefore memory-intensive when k is large), it accepts integer class labels (e.g., the class index 2 instead of the one-hot vector [0,1,0]).
The formula is identical to CCE loss, but the computation avoids one-hot encoding, making it ideal for tasks with a large number of classes (e.g., ImageNet with 1000 classes).
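A minimal sketch of the equivalence (arbitrary logits and labels): passing integer class indices directly gives the same loss as building the one-hot vectors and applying the CCE formula.

```python
import torch
import torch.nn.functional as F

num_classes = 5
logits = torch.randn(4, num_classes)   # arbitrary logits for 4 samples
labels = torch.tensor([2, 0, 4, 1])    # integer class indices (sparse form)

# Sparse form: pass integer indices directly
sparse_loss = F.cross_entropy(logits, labels)

# Dense form: build one-hot vectors explicitly and apply the CCE formula
one_hot = F.one_hot(labels, num_classes).float()
dense_loss = -(one_hot * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(sparse_loss.item(), dense_loss.item())  # identical up to floating-point error
```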
II. Key Properties
| Property | Binary Cross-Entropy | Categorical Cross-Entropy |
|---|---|---|
| Task Type | Binary classification (2 classes) | Multi-class classification (\(k \ge 2\) classes) |
| Label Format | Integer (0/1) | One-hot vector |
| Activation Function | Sigmoid | Softmax |
| Loss Range | \([0, +\infty)\) | \([0, +\infty)\) |
| Gradient Behavior | Gradients are large when predictions are wrong (drives fast updates) | Gradients are large when the predicted probability of the true class is low |
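The gradient behavior in the last row follows from a well-known identity: for softmax combined with cross-entropy, the gradient of the loss with respect to the logits is \(\hat{y} - y\). The short sketch below (arbitrary values) checks this against PyTorch's autograd.

```python
import torch
import torch.nn.functional as F

# One sample, 3 classes; arbitrary logits for illustration
logits = torch.tensor([[1.0, 2.0, 0.5]], requires_grad=True)
target = torch.tensor([0])  # true class index

loss = F.cross_entropy(logits, target)
loss.backward()

probs = F.softmax(logits.detach(), dim=1)
one_hot = F.one_hot(target, num_classes=3).float()
print(logits.grad)      # gradient computed by autograd
print(probs - one_hot)  # analytic gradient (y_hat - y): matches the line above
```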
III. Implementation Example (Python with PyTorch)
PyTorch provides built-in modules for cross-entropy loss: nn.BCELoss (binary), nn.CrossEntropyLoss (multi-class, combines softmax and CCE), and nn.NLLLoss (negative log-likelihood, paired with log-softmax).
1. Binary Cross-Entropy Loss (BCE Loss)
```python
import torch
import torch.nn as nn
import torch.optim as optim

# 1. Define a binary classification model (sigmoid output)
class BinaryClassifier(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        logits = self.fc(x)
        return self.sigmoid(logits)  # Output probability in [0,1]

# 2. Initialize model, loss, optimizer
input_dim = 10
model = BinaryClassifier(input_dim)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 3. Generate dummy data (batch_size=32, input_dim=10; labels 0/1)
batch_size = 32
x = torch.randn(batch_size, input_dim)
y = torch.randint(0, 2, (batch_size, 1), dtype=torch.float32)  # BCE requires float labels

# 4. Training step
model.train()
optimizer.zero_grad()
y_pred = model(x)
loss = criterion(y_pred, y)
loss.backward()
optimizer.step()
print(f"Binary Cross-Entropy Loss: {loss.item():.4f}")
```
2. Categorical Cross-Entropy Loss (CCE Loss)
PyTorch’s nn.CrossEntropyLoss automatically applies softmax to the model’s logits, so the model should not include a softmax layer in its forward pass.
```python
# 1. Define a multi-class classification model (logits output)
class MultiClassClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)  # Output logits (no softmax)

    def forward(self, x):
        return self.fc(x)  # Output raw logits

# 2. Initialize model, loss, optimizer
input_dim = 10
num_classes = 5
model = MultiClassClassifier(input_dim, num_classes)
criterion = nn.CrossEntropyLoss()  # Combines softmax + CCE
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 3. Generate dummy data (batch_size=32, input_dim=10; integer labels 0-4)
batch_size = 32
x = torch.randn(batch_size, input_dim)
y = torch.randint(0, num_classes, (batch_size,))  # Integer labels (no one-hot)

# 4. Training step
model.train()
optimizer.zero_grad()
logits = model(x)
loss = criterion(logits, y)
loss.backward()
optimizer.step()
print(f"Categorical Cross-Entropy Loss: {loss.item():.4f}")
```
IV. Practical Best Practices
- Avoid Logarithm of Zero: Ensure the model's outputs are bounded between \(\epsilon\) and \(1-\epsilon\) (e.g., \(\epsilon = 10^{-7}\)) to prevent \(\log(0)\), which is undefined. PyTorch's nn.BCELoss and nn.CrossEntropyLoss handle this implicitly.
- Class Imbalance: For imbalanced datasets, use weighted cross-entropy loss (e.g., nn.BCELoss(weight=class_weights) or nn.CrossEntropyLoss(weight=class_weights)), where the weights are inversely proportional to class frequencies.
- Numerical Stability: When implementing cross-entropy manually, use log_softmax + NLLLoss instead of softmax + log to avoid numerical underflow (common in deep learning); see the sketch after this list.
- Pair with the Correct Activation: Use sigmoid + BCE loss for binary classification; use raw logits + nn.CrossEntropyLoss (implicit softmax) for multi-class classification.
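The numerical-stability and class-imbalance points can be illustrated with a short sketch; the logits, labels, and class weights below are arbitrary placeholder values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(8, 3)          # arbitrary logits: 8 samples, 3 classes
labels = torch.randint(0, 3, (8,))  # arbitrary integer labels

# Numerically stable manual pairing: log_softmax + NLLLoss
log_probs = F.log_softmax(logits, dim=1)
nll = nn.NLLLoss()(log_probs, labels)
ce = nn.CrossEntropyLoss()(logits, labels)
print(nll.item(), ce.item())        # the two values agree

# Weighted cross-entropy for class imbalance; these weights are illustrative
# and would normally be set inversely proportional to class frequencies.
class_weights = torch.tensor([0.2, 0.3, 0.5])
weighted_ce = nn.CrossEntropyLoss(weight=class_weights)(logits, labels)
print(weighted_ce.item())
```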
V. Advantages and Limitations
Advantages
- Theoretical Justification: Directly measures the divergence between true and predicted distributions, aligning with information theory principles.
- Efficient Gradient Updates: Gradients are proportional to the prediction error, driving faster convergence than alternative loss functions (e.g., mean squared error for classification).
- Widely Compatible: Works with all modern classification models (CNNs, Transformers, etc.).
Limitations
- Requires Probability Outputs: The model must output probabilities (via sigmoid/softmax) to ensure valid inputs for the logarithm function.
- Not for Regression: Cross-entropy loss is designed for probability distributions and should not be used for regression tasks (use MSE or MAE instead).
- Sensitive to Class Imbalance: Without weighting, the loss can be dominated by the majority class in imbalanced datasets.