Cross-Entropy Loss is a core loss function in machine learning, primarily used for classification tasks (both binary and multi-class). It quantifies the difference between two probability distributions: the true label distribution (ground truth) and the predicted probability distribution output by the model. The goal of training is to minimize this difference, which pushes the model to make more accurate predictions.
Cross-entropy loss is mathematically derived from the concept of information entropy in information theory, and it is widely paired with models that output probabilities via a softmax (multi-class) or sigmoid (binary) activation function.
I. Mathematical Foundations
1. Entropy and Cross-Entropy
- Entropy (\(H(p)\)): Measures the uncertainty of a probability distribution p. For a discrete distribution, it is defined as: \(H(p) = -\sum_{i=1}^n p_i \log(p_i)\). Higher entropy means higher uncertainty (e.g., a uniform distribution has maximum entropy).
- Cross-Entropy (\(H(p,q)\)): Measures the average code length needed to encode samples from distribution p using a code optimized for distribution q (in bits when the logarithm is base 2, in nats for the natural logarithm that most libraries use). The formula is: \(H(p,q) = -\sum_{i=1}^n p_i \log(q_i)\), where p is the true distribution and q is the predicted distribution.
For classification tasks, the true label distribution p is a one-hot vector (e.g., for class 2 in a 3-class task: \(p = [0,1,0]\)), which simplifies the cross-entropy calculation significantly.
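The two definitions above can be checked numerically. The following is a minimal sketch in plain Python, assuming natural logarithms (so values are in nats) and the usual convention that terms with \(p_i = 0\) contribute nothing; the function names `entropy` and `cross_entropy` and the example distributions are illustrative choices, not part of any library.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i * log(p_i); zero-probability terms are skipped."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i); terms with p_i = 0 are skipped."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [1 / 3, 1 / 3, 1 / 3]   # maximum uncertainty over 3 outcomes
peaked = [0.9, 0.05, 0.05]        # low uncertainty
one_hot = [0.0, 1.0, 0.0]         # true label: the second of 3 classes
predicted = [0.2, 0.7, 0.1]       # hypothetical model output

h_uniform = entropy(uniform)            # equals log(3)
h_peaked = entropy(peaked)              # smaller than h_uniform
ce = cross_entropy(one_hot, predicted)  # reduces to -log(0.7)
```

With a one-hot p, only the term for the true class survives, which is why the cross-entropy collapses to the negative log of a single predicted probability.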
2. Binary Cross-Entropy Loss (BCE Loss)
Used for binary classification tasks (two classes: 0 and 1). The model outputs a probability \(\hat{y} \in [0,1]\) via a sigmoid function, where \(\hat{y}\) is the probability of the sample belonging to class 1.
Formula
For a single sample with true label \(y \in \{0,1\}\) and predicted probability \(\hat{y}\):
\(\text{BCE}(y, \hat{y}) = -\left[ y \log(\hat{y}) + (1-y) \log(1-\hat{y}) \right]\)
For a batch of m samples, the average BCE loss is:
\(\text{BCE}_{\text{batch}} = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]\)
Key Notes
- If \(y=1\), the loss simplifies to \(-\log(\hat{y})\): the loss increases sharply as \(\hat{y}\) approaches 0.
- If \(y=0\), the loss simplifies to \(-\log(1-\hat{y})\): the loss increases sharply as \(\hat{y}\) approaches 1.
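- The sigmoid activation maps any real-valued logit into the open interval (0,1), so \(\log(\hat{y})\) and \(\log(1-\hat{y})\) are mathematically defined. In floating point, the output can still round to exactly 0 or 1 for extreme logits, so implementations usually clamp it.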
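The single-sample and batch formulas above can be sketched in plain Python. This is a minimal illustration, assuming natural logarithms and a clamping constant `eps` (an assumption added here to keep the logarithm finite); the names `bce` and `bce_batch` and the sample values are hypothetical.

```python
import math

def bce(y, y_hat, eps=1e-7):
    """Binary cross-entropy for one sample; y is 0 or 1, y_hat a probability."""
    # Clamp the prediction so log() never receives exactly 0 (eps is an assumption).
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def bce_batch(ys, y_hats):
    """Mean binary cross-entropy over a batch."""
    return sum(bce(y, p) for y, p in zip(ys, y_hats)) / len(ys)

confident_right = bce(1, 0.99)   # small loss: -log(0.99)
confident_wrong = bce(1, 0.01)   # large loss: -log(0.01)
batch_loss = bce_batch([1, 0, 1], [0.9, 0.2, 0.6])
```

Comparing `confident_right` with `confident_wrong` shows the sharp increase described in the notes: the same true label costs far more when the predicted probability is close to the wrong end of the interval.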
3. Categorical Cross-Entropy Loss (CCE Loss)
Used for multi-class classification tasks (more than two classes). The model outputs a probability vector \(\hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_k]\) via a softmax function, where k is the number of classes and \(\sum_{i=1}^k \hat{y}_i = 1\).
The true label y is a one-hot vector (e.g., for class c: \(y_c = 1\), and \(y_i = 0\) for \(i \neq c\)).
Formula
For a single sample with true label y (one-hot) and predicted probability vector \(\hat{y}\):
\(\text{CCE}(y, \hat{y}) = -\sum_{i=1}^k y_i \log(\hat{y}_i) = -\log(\hat{y}_c)\)
The last equality holds because all \(y_i = 0\) except for the true class c (where \(y_c = 1\)).
For a batch of m samples, the average CCE loss is:
\(\text{CCE}_{\text{batch}} = -\frac{1}{m} \sum_{j=1}^m \log(\hat{y}_{j, c_j})\)
Where \(c_j\) is the true class of the j-th sample.
Key Notes
- The softmax function normalizes the model’s raw outputs (logits) into valid probabilities, ensuring \(\sum \hat{y}_i = 1\).
- CCE loss penalizes the model more heavily when the predicted probability of the true class is low.
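As a rough illustration of the softmax normalization and the CCE formula, the sketch below uses plain Python with natural logarithms. The helper names `softmax` and `cce` and the logit values are assumptions made for the example; subtracting the maximum logit before exponentiating is a common stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # shift by the max logit so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cce(y_one_hot, y_hat):
    """Categorical cross-entropy for one sample with a one-hot label."""
    return -sum(y * math.log(p) for y, p in zip(y_one_hot, y_hat) if y > 0)

logits = [2.0, 1.0, 0.1]             # hypothetical raw model outputs
probs = softmax(logits)              # highest probability goes to class index 0
loss_true_0 = cce([1, 0, 0], probs)  # true class has high predicted probability
loss_true_2 = cce([0, 0, 1], probs)  # true class has low predicted probability
```

The same prediction yields a much larger loss when the true class is the one the model considered least likely, which is the penalty behavior noted above.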
4. Sparse Categorical Cross-Entropy Loss
A variant of CCE loss optimized for memory efficiency. Instead of using one-hot encoded labels (which are mostly zeros and memory-intensive for large k), it accepts integer labels (e.g., the zero-based class index 1 instead of the vector [0,1,0]).
The formula is identical to CCE loss, but the computation avoids one-hot encoding, making it ideal for tasks with a large number of classes (e.g., ImageNet with 1000 classes).
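A short sketch can confirm that the sparse and one-hot forms give the same value. It assumes zero-based integer class indices; the function names and the probability vectors are hypothetical.

```python
import math

def cce_one_hot(y_one_hot, y_hat):
    """CCE using a one-hot label vector."""
    return -sum(y * math.log(p) for y, p in zip(y_one_hot, y_hat) if y > 0)

def cce_sparse(class_index, y_hat):
    """CCE using an integer class index: just look up the true-class probability."""
    return -math.log(y_hat[class_index])

# Hypothetical predicted probabilities for a batch of two samples, three classes.
y_hats = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
labels = [0, 2]                       # sparse (integer) labels
one_hots = [[1, 0, 0], [0, 0, 1]]     # the same labels, one-hot encoded

sparse_loss = sum(cce_sparse(c, p) for c, p in zip(labels, y_hats)) / len(labels)
dense_loss = sum(cce_one_hot(y, p) for y, p in zip(one_hots, y_hats)) / len(one_hots)
```

The sparse version stores one integer per sample instead of k values and replaces the sum over classes with a single index lookup.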
II. Key Properties
| Property | Binary Cross-Entropy | Categorical Cross-Entropy |
|---|---|---|
| Task Type | Binary classification (2 classes) | Multi-class classification (\(k \ge 2\) classes) |
| Label Format | Integer (0/1) | One-hot vector |
| Activation Function | Sigmoid | Softmax |
| Loss Range | \([0, +\infty)\) | \([0, +\infty)\) |
| Gradient Behavior | Gradients are large when predictions are wrong (drives fast updates) | Gradients are large when the predicted probability of the true class is low |
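The gradient row can be made concrete. For softmax followed by cross-entropy, the gradient of the loss with respect to the logits is \(\hat{y} - y\), so it is large exactly when the predicted probability of the true class is low. The sketch below checks this against a central finite difference; the helper names, the step size `h`, and the logits are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    m = max(logits)  # shift by the max logit for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_from_logits(logits, true_class):
    """Cross-entropy of softmax(logits) against an integer true class."""
    return -math.log(softmax(logits)[true_class])

def analytic_grad(logits, true_class):
    """Closed form: dL/dz_i = softmax(z)_i - y_i, with y one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == true_class else 0.0) for i, p in enumerate(probs)]

def numeric_grad(logits, true_class, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += h
        down[i] -= h
        diff = loss_from_logits(up, true_class) - loss_from_logits(down, true_class)
        grads.append(diff / (2 * h))
    return grads

logits = [2.0, -1.0, 0.5]   # the model strongly favors class index 0
true_class = 1              # but the true class is index 1
g_analytic = analytic_grad(logits, true_class)
g_numeric = numeric_grad(logits, true_class)
```

Here the entry of `g_analytic` for the true class is close to -1, since its predicted probability is small, which is the large corrective gradient the table refers to.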
III. Implementation Example (Python with PyTorch)
PyTorch provides built-in modules for cross-entropy loss: nn.BCELoss (binary, expects probabilities), nn.CrossEntropyLoss (multi-class, expects raw logits and combines log-softmax with negative log-likelihood), and nn.NLLLoss (negative log-likelihood, paired with log-softmax).
1. Binary Cross-Entropy Loss (BCE Loss)
```python
import torch
import torch.nn as nn
import torch.optim as optim

# 1. Define a binary classification model (sigmoid output)
class BinaryClassifier(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        logits = self.fc(x)
        return self.sigmoid(logits)  # Output probability in (0, 1)

# 2. Initialize model, loss, optimizer
input_dim = 10
model = BinaryClassifier(input_dim)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 3. Generate dummy data (batch_size=32, input_dim=10; labels 0/1)
batch_size = 32
x = torch.randn(batch_size, input_dim)
y = torch.randint(0, 2, (batch_size, 1), dtype=torch.float32)  # BCE requires float labels

# 4. Training step
model.train()
optimizer.zero_grad()
y_pred = model(x)
loss = criterion(y_pred, y)
loss.backward()
optimizer.step()
print(f"Binary Cross-Entropy Loss: {loss.item():.4f}")
```
2. Categorical Cross-Entropy Loss (CCE Loss)
PyTorch’s nn.CrossEntropyLoss applies log-softmax to the model’s logits internally, so the model should not include a softmax layer in its forward pass.
```python
import torch
import torch.nn as nn
import torch.optim as optim

# 1. Define a multi-class classification model (logits output)
class MultiClassClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)  # Output logits (no softmax)

    def forward(self, x):
        return self.fc(x)  # Output raw logits

# 2. Initialize model, loss, optimizer
input_dim = 10
num_classes = 5
model = MultiClassClassifier(input_dim, num_classes)
criterion = nn.CrossEntropyLoss()  # Combines log-softmax + negative log-likelihood
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 3. Generate dummy data (batch_size=32, input_dim=10; integer labels 0-4)
batch_size = 32
x = torch.randn(batch_size, input_dim)
y = torch.randint(0, num_classes, (batch_size,))  # Integer labels (no one-hot)

# 4. Training step
model.train()
optimizer.zero_grad()
logits = model(x)
loss = criterion(logits, y)
loss.backward()
optimizer.step()
print(f"Categorical Cross-Entropy Loss: {loss.item():.4f}")
```
IV. Practical Best Practices
- Avoid Logarithm of Zero: Ensure the probabilities passed to the logarithm are bounded between \(\epsilon\) and \(1-\epsilon\) (e.g., \(\epsilon=1e-7\)) to prevent \(\log(0)\), which is undefined. PyTorch handles this internally: nn.BCELoss clamps its log outputs to be at least -100, and nn.CrossEntropyLoss works on logits through log-softmax.
- Class Imbalance: For imbalanced datasets, use weighted cross-entropy loss (e.g., nn.CrossEntropyLoss(weight=class_weights)), where weights are inversely proportional to class frequencies. For binary tasks, nn.BCEWithLogitsLoss(pos_weight=...) up-weights the positive class; the weight argument of nn.BCELoss rescales individual batch elements rather than classes.
- Numerical Stability: When implementing cross-entropy manually, use log_softmax + NLLLoss instead of softmax + log to avoid numerical underflow and overflow (common in deep learning).
- Pair with Correct Activation: Use sigmoid + BCE loss for binary classification; use logits + nn.CrossEntropyLoss (implicit softmax) for multi-class classification.
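The numerical-stability advice can be illustrated without PyTorch. The sketch below contrasts a naive softmax-then-log computation with a log-softmax built on the log-sum-exp trick; both function names and the extreme logits are assumptions chosen to trigger the failure, and this is not a description of how any particular library implements it.

```python
import math

def naive_log_softmax(logits):
    """softmax followed by log: overflows or hits log(0) for extreme logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def log_softmax(logits):
    """Stable version: log_softmax(z)_i = z_i - logsumexp(z)."""
    m = max(logits)  # shifting by the max keeps every exponent <= 0
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

logits = [1000.0, 0.0, -1000.0]  # deliberately extreme values

stable = log_softmax(logits)     # finite log-probabilities

try:
    naive_log_softmax(logits)
    naive_failed = False
except (OverflowError, ValueError):
    naive_failed = True          # exp(1000) is out of range for a float
```

The stable version never exponentiates a positive number, so the cross-entropy for any true class is simply the negated entry of `stable` and stays finite.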
V. Advantages and Limitations
Advantages
- Theoretical Justification: Directly measures the divergence between true and predicted distributions, aligning with information theory principles.
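- Efficient Gradient Updates: With a sigmoid or softmax output, the gradient of the loss with respect to the logits is \(\hat{y} - y\), i.e., proportional to the prediction error. This typically gives faster convergence than mean squared error for classification, whose gradient shrinks toward zero when the output saturates.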
- Widely Compatible: Works with all modern classification models (CNNs, Transformers, etc.).
Limitations
- Requires Probability Outputs: The model must output probabilities (via sigmoid/softmax) to ensure valid inputs for the logarithm function.
- Not for Regression: Cross-entropy loss is designed for probability distributions and should not be used for regression tasks (use MSE or MAE instead).
- Sensitive to Class Imbalance: Without weighting, the loss can be dominated by the majority class in imbalanced datasets.
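To make the class-imbalance limitation concrete, here is a minimal sketch of inverse-frequency weighting in plain Python. The weighting scheme \(w_c = N / (k \cdot n_c)\), the normalization by the sum of the applied weights, and the toy data are assumptions chosen for illustration; other conventions exist.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """w_c = N / (k * n_c): rarer classes receive larger weights."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]

def weighted_cce(labels, y_hats, weights):
    """Weighted mean of per-sample losses, normalized by the total weight."""
    total = sum(weights[c] * -math.log(p[c]) for c, p in zip(labels, y_hats))
    norm = sum(weights[c] for c in labels)
    return total / norm

# Toy imbalanced batch: three samples of class 0, one sample of class 1.
labels = [0, 0, 0, 1]
y_hats = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.6, 0.4]]

weights = inverse_frequency_weights(labels, num_classes=2)
unweighted = weighted_cce(labels, y_hats, [1.0, 1.0])
weighted = weighted_cce(labels, y_hats, weights)
```

In this toy batch the minority sample is the poorly predicted one, so the weighted loss is higher than the unweighted one: the mistake on the rare class is no longer averaged away by the three easy majority samples.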