Actor-Critic is a hybrid reinforcement learning (RL) algorithm that combines the strengths of value-based methods (e.g., Q-Learning, SARSA) and policy-based methods (e.g., REINFORCE). It addresses key limitations of both approaches:
- Value-based methods struggle with continuous or high-dimensional action spaces and cannot directly represent stochastic policies.
- Policy-based methods handle continuous actions naturally but suffer from high-variance gradient estimates, slow convergence, and high sample complexity.
Actor-Critic algorithms use two separate components:
- Actor: A policy network that learns to select actions (policy-based).
- Critic: A value network that evaluates the quality of the actor’s actions (value-based).
The critic provides feedback to the actor—telling it how good its selected action was—enabling stable, sample-efficient learning. This hybrid design makes Actor-Critic one of the most widely used RL frameworks for real-world applications (e.g., robotics, autonomous driving, game AI).
I. Core Components of Actor-Critic
1. Actor (\(\pi_\theta(a|s)\))
The actor is a parameterized policy network (with weights \(\theta\)) that maps states s to a probability distribution over actions a:
- Discrete Action Spaces: Outputs a categorical distribution (e.g., probability of selecting each action in a game).
- Continuous Action Spaces: Outputs a Gaussian distribution (mean and standard deviation for continuous actions like steering angle).
The actor’s goal is to maximize the expected cumulative reward (return) by updating its parameters \(\theta\).
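For concreteness, here is a minimal sketch of the two kinds of actor heads in PyTorch; the class names, layer sizes, and the Categorical/Gaussian parameterizations are illustrative assumptions, not tied to any specific algorithm in this article.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class DiscreteActor(nn.Module):
    """Categorical policy for discrete action spaces (sketch)."""
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, action_size),
        )

    def forward(self, state):
        logits = self.net(state)
        return Categorical(logits=logits)  # distribution over discrete actions

class GaussianActor(nn.Module):
    """Gaussian policy for continuous action spaces (sketch)."""
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(state_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, action_size),
        )
        self.log_std = nn.Parameter(torch.zeros(action_size))  # learned log std

    def forward(self, state):
        return Normal(self.mean(state), self.log_std.exp())
```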
2. Critic (\(V_\phi(s)\) or \(Q_\phi(s,a)\))
The critic is a parameterized value network (with weights \(\phi\)) that estimates the value of being in a state or taking an action in a state:
- State-Value Critic (\(V_\phi(s)\)): Estimates the expected return starting from state s and following the actor’s policy.
- Action-Value Critic (\(Q_\phi(s,a)\)): Estimates the expected return starting from state s, taking action a, and then following the actor’s policy.
The critic’s goal is to minimize the temporal difference (TD) error—the difference between its value estimate and the actual observed return.
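As a companion sketch (again with assumed class names and layer sizes), the two critic variants map a state, or a state-action pair, to a single scalar estimate:

```python
import torch
import torch.nn as nn

class StateValueCritic(nn.Module):
    """V_phi(s): maps a state to a scalar value estimate (sketch)."""
    def __init__(self, state_size, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, state):
        return self.net(state)

class ActionValueCritic(nn.Module):
    """Q_phi(s, a): maps a state-action pair to a scalar value estimate (sketch)."""
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size + action_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```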
3. Advantage Function (\(A(s,a)\))
The advantage function is the key link between the actor and critic. It measures how much better an action a is than the average action in state s:
\(A(s,a) = Q(s,a) - V(s)\)
Where:
- \(Q(s,a)\): Value of taking action a in state s.
- \(V(s)\): Expected value of state s under the current policy (the policy-weighted average over actions).
A positive advantage means the action is better than average; a negative advantage means it is worse. The actor uses the advantage function to update its policy: actions with positive advantages are made more likely, while actions with negative advantages are made less likely.
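As a toy illustration with made-up numbers: if the critic estimates \(Q(s,a) = 5\) and \(V(s) = 3\), then \(A(s,a) = 5 - 3 = 2 > 0\), so the actor increases the probability of choosing a in state s; if instead \(Q(s,a) = 1\), then \(A(s,a) = -2\) and that action is made less likely.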
II. Key Concepts & Update Rules
1. Temporal Difference (TD) Error
The critic’s update is driven by the TD error, which quantifies how well the critic’s value estimate matches the actual observed reward and next-state value:
\(\delta_t = r_{t+1} + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\)
Where:
- \(r_{t+1}\): Reward received after taking action \(a_t\) in state \(s_t\).
- \(\gamma\): Discount factor (weights future rewards).
- \(V_\phi(s_{t+1})\): Critic’s estimate of the next state’s value.
For an action-value critic (\(Q_\phi(s,a)\)), the TD error is:
\(\delta_t = r_{t+1} + \gamma Q_\phi(s_{t+1}, a_{t+1}) - Q_\phi(s_t, a_t)\)
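As a quick numerical illustration with made-up values: with \(r_{t+1} = 1\), \(\gamma = 0.99\), \(V_\phi(s_{t+1}) = 4\), and \(V_\phi(s_t) = 4.5\), the TD error is \(\delta_t = 1 + 0.99 \cdot 4 - 4.5 = 0.46\), so the critic's estimate of \(s_t\) was slightly too low and will be nudged upward.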
2. Actor Update Rule
The actor’s parameters \(\theta\) are updated to maximize the expected advantage. For a stochastic policy \(\pi_\theta(a|s)\), the policy gradient is:
\(\nabla_\theta J(\theta) \approx \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A(s_t, a_t) \right]\)
Where \(J(\theta)\) is the expected return of the actor’s policy.
The update step for the actor is:
\(\theta \leftarrow \theta + \alpha_\theta \cdot \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A(s_t, a_t)\)
Where \(\alpha_\theta\) is the actor’s learning rate.
3. Critic Update Rule
The critic’s parameters \(\phi\) are updated to minimize the squared TD error:
\(\phi \leftarrow \phi + \alpha_\phi \cdot \delta_t \cdot \nabla_\phi V_\phi(s_t)\)
Where \(\alpha_\phi\) is the critic’s learning rate.
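To make the two update rules concrete, here is a minimal sketch of a single online actor-critic step in PyTorch, using the TD error \(\delta_t\) as the advantage estimate. The function name, the assumption that `actor` returns a torch distribution and `critic` a scalar value (as in the sketches above), and the use of Adam-style optimizers are illustrative assumptions.

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, done, gamma=0.99):
    """One online update where the TD error plays both roles: it is minimized
    by the critic and used as the advantage estimate for the actor.
    `done` is a 0/1 float flag marking the end of an episode."""
    # TD target and TD error (delta_t)
    value = critic(state)
    with torch.no_grad():
        next_value = critic(next_state) * (1.0 - done)  # no bootstrap at terminal states
        td_target = reward + gamma * next_value
    td_error = td_target - value

    # Critic update: minimize the squared TD error
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: policy gradient weighted by the (detached) TD error
    log_prob = actor(state).log_prob(action)
    actor_loss = -(log_prob * td_error.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```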
4. On-Policy vs. Off-Policy Actor-Critic
Actor-Critic algorithms are typically on-policy—the actor and critic learn from the same trajectory of states/actions/rewards generated by the current policy. However, variants like Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC) extend Actor-Critic to off-policy learning, further improving sample efficiency.
III. Popular Actor-Critic Variants
1. Advantage Actor-Critic (A2C)
A2C is a simple, synchronous Actor-Critic algorithm that uses a state-value critic and the advantage function to update the actor. Key features:
- Parallel Rollouts: Runs multiple environment instances in parallel to collect diverse trajectories, reducing variance.
- Advantage Calculation: Computes the advantage as \(A(s_t,a_t) = \delta_t + \gamma \delta_{t+1} + \dots + \gamma^{T-t-1} \delta_{T-1}\), a discounted sum of TD errors (see the sketch after this list).
- Stable Updates: Uses the critic’s value estimates as a baseline to reduce the variance of policy gradients (compared to REINFORCE).
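A sketch of that advantage calculation as a backward pass over a rollout (plain NumPy; the function and variable names are assumptions for this example):

```python
import numpy as np

def discounted_td_advantages(rewards, values, next_values, dones, gamma=0.99):
    """A(s_t, a_t) as the discounted sum of TD errors over a rollout,
    which telescopes to the return minus the baseline V(s_t)."""
    deltas = [r + gamma * nv * (1.0 - d) - v
              for r, v, nv, d in zip(rewards, values, next_values, dones)]
    advantages = np.zeros(len(deltas), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(deltas))):
        # Reset the accumulation at episode boundaries
        running = deltas[t] + gamma * running * (1.0 - dones[t])
        advantages[t] = running
    return advantages
```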
2. Asynchronous Advantage Actor-Critic (A3C)
A3C is an asynchronous version of A2C that uses multiple worker threads to collect data and update the global actor/critic networks. Key features:
- Asynchronous Training: Each worker runs a copy of the actor/critic, collects data, and updates the global network independently.
- Exploration Diversity: Workers use different exploration policies, leading to more diverse data and faster convergence.
- Low Resource Usage: Does not require a replay buffer (unlike off-policy methods), making it memory-efficient.
3. Proximal Policy Optimization (PPO)
PPO is a state-of-the-art Actor-Critic algorithm that stabilizes policy updates by clipping the probability ratio in its surrogate objective. Key features:
- Clipped Surrogate Objective: Prevents large policy updates by clipping the ratio of new to old policy probabilities (see the sketch after this list):
  \(L^{CLIP}(\theta) = \mathbb{E}\left[ \min\left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]\)
  Where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) is the policy ratio and \(\epsilon\) is a clipping hyperparameter (typically 0.2).
- Sample Efficiency: PPO is more sample-efficient than A2C/A3C and does not require complex synchronization.
- Widely Used: PPO is the default algorithm for many RL benchmarks (e.g., OpenAI Gym, Atari games) due to its stability and performance.
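A minimal sketch of the clipped surrogate objective in PyTorch (the function name and the assumption that the inputs are per-timestep tensors of matching shape are illustrative):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective, negated so it can be minimized by gradient descent."""
    ratio = (new_log_probs - old_log_probs).exp()       # r_t = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```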
4. Deep Deterministic Policy Gradient (DDPG)
DDPG is an off-policy Actor-Critic algorithm designed for continuous action spaces. Key features:
- Deterministic Actor: Outputs a single deterministic action \(a = \mu_\theta(s)\) instead of a probability distribution (see the sketch after this list).
- Target Networks: Uses separate target actor/critic networks to stabilize updates (similar to DQN).
- Replay Buffer: Stores past experiences and samples them randomly for training, improving sample efficiency.
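A minimal sketch of two DDPG-specific pieces named above, the deterministic actor and the soft target-network update; the layer sizes, tanh action squashing, and the value of tau are assumptions for this example:

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """mu_theta(s): outputs a single continuous action (sketch)."""
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, action_size), nn.Tanh(),  # actions squashed to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

def soft_update(target, source, tau=0.005):
    """Slowly track the online network: theta_target <- tau*theta + (1-tau)*theta_target."""
    with torch.no_grad():
        for t_param, s_param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * s_param)
```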
IV. Actor-Critic Implementation (Python with PyTorch: A2C for CartPole)
Below is a practical implementation of Advantage Actor-Critic (A2C) for the CartPole environment (a classic RL task where the agent balances a pole on a cart).
1. Import Libraries
```python
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
from torch.distributions import Categorical
```
2. Define Actor-Critic Network
We combine the actor and critic into a single network for efficiency (they share some layers in this implementation).
```python
class ActorCritic(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(ActorCritic, self).__init__()
        # Shared layers
        self.fc1 = nn.Linear(state_size, hidden_size)
        # Actor head: outputs action logits (softmax applied in get_action)
        self.actor = nn.Linear(hidden_size, action_size)
        # Critic head: outputs state value
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return x

    def get_action(self, state):
        """Get action from actor and log probability of the action"""
        x = self.forward(state)
        action_probs = F.softmax(self.actor(x), dim=-1)
        dist = Categorical(action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob

    def get_value(self, state):
        """Get state value from critic"""
        x = self.forward(state)
        value = self.critic(x)
        return value
```
3. Define A2C Agent
```python
class A2CAgent:
    def __init__(self, state_size, action_size, hidden_size=64):
        self.gamma = 0.99  # Discount factor
        self.actor_critic = ActorCritic(state_size, action_size, hidden_size)
        self.optimizer = optim.Adam(self.actor_critic.parameters(), lr=3e-4)

    def compute_returns(self, rewards, dones):
        """Compute discounted returns from rewards and episode-termination flags"""
        returns = []
        running_return = 0
        for reward, done in zip(reversed(rewards), reversed(dones)):
            if done:
                running_return = 0
            running_return = reward + self.gamma * running_return
            returns.insert(0, running_return)
        # Convert to tensor and normalize (stabilizes training)
        returns = torch.tensor(returns, dtype=torch.float)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def train(self, log_probs, rewards, dones, values):
        """Train actor-critic network using the A2C update rule"""
        # Compute discounted returns
        returns = self.compute_returns(rewards, dones)
        # Convert lists to tensors (squeeze values to shape [T] to match returns)
        log_probs = torch.stack(log_probs)
        values = torch.stack(values).squeeze()
        # Calculate advantage: A = R - V(s) (detach so this term does not update the critic)
        advantages = returns - values.detach()
        # Actor loss: -log_prob * advantage (maximize log_prob * advantage)
        actor_loss = -(log_probs * advantages).mean()
        # Critic loss: MSE between value estimates and returns
        critic_loss = F.mse_loss(values, returns)
        # Total loss: actor loss + weighted critic loss
        total_loss = actor_loss + 0.5 * critic_loss
        # Backpropagation and optimization
        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()
        return total_loss.item()
```
4. Train the A2C Agent on CartPole
```python
# Initialize environment and agent
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = A2CAgent(state_size, action_size)

# Training parameters
n_episodes = 1000
max_t = 1000
print_every = 50
scores = []

for i_episode in range(1, n_episodes + 1):
    state, _ = env.reset()
    state = torch.tensor(state, dtype=torch.float).unsqueeze(0)
    log_probs = []
    values = []
    rewards = []
    dones = []
    score = 0
    for t in range(max_t):
        # Get action and log probability from actor
        action, log_prob = agent.actor_critic.get_action(state)
        # Get value estimate from critic
        value = agent.actor_critic.get_value(state)
        # Take action in environment (gymnasium returns terminated and truncated separately)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = torch.tensor(next_state, dtype=torch.float).unsqueeze(0)
        # Store experience
        log_probs.append(log_prob)
        values.append(value)
        rewards.append(reward)
        dones.append(done)
        score += reward
        # Update state
        state = next_state
        if done:
            break
    # Train agent on episode experience
    loss = agent.train(log_probs, rewards, dones, values)
    scores.append(score)
    # Print progress
    if i_episode % print_every == 0:
        avg_score = np.mean(scores[-print_every:])
        print(f'Episode {i_episode}\tAverage Score: {avg_score:.2f}\tLoss: {loss:.4f}')
    # Check if environment is solved (average score >= 495 over the last 100 episodes)
    if len(scores) >= 100 and np.mean(scores[-100:]) >= 495.0:
        print(f'Environment solved in {i_episode-100} episodes!')
        torch.save(agent.actor_critic.state_dict(), 'a2c_cartpole.pth')
        break

env.close()
```
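After training, a quick way to sanity-check the saved policy is to reload the checkpoint and run a greedy rollout. This evaluation snippet is a sketch, not part of the training script above; it assumes the checkpoint was saved by the solve condition, and the `render_mode` and argmax action choice are illustrative.

```python
# Evaluate the trained policy with greedy (argmax) actions -- sketch
eval_env = gym.make('CartPole-v1', render_mode='human')
policy = ActorCritic(state_size, action_size)
policy.load_state_dict(torch.load('a2c_cartpole.pth'))
policy.eval()

state, _ = eval_env.reset()
total_reward, done = 0.0, False
while not done:
    state_t = torch.tensor(state, dtype=torch.float).unsqueeze(0)
    with torch.no_grad():
        logits = policy.actor(policy(state_t))     # shared layer -> actor head
    action = torch.argmax(logits, dim=-1).item()   # greedy action for evaluation
    state, reward, terminated, truncated, _ = eval_env.step(action)
    done = terminated or truncated
    total_reward += reward
print(f'Evaluation return: {total_reward}')
eval_env.close()
```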
V. Actor-Critic vs. Other RL Algorithms
| Algorithm | Type | Key Strengths | Key Weaknesses | Best For |
|---|---|---|---|---|
| Q-Learning/SARSA | Value-based | Simple, low sample complexity | Poor for continuous action spaces | Discrete action spaces, small environments |
| REINFORCE | Policy-based | Works for continuous action spaces | High variance, slow convergence | Simple policy optimization |
| Actor-Critic (A2C/PPO) | Hybrid | Stable, sample-efficient, works for continuous spaces | More complex implementation | Real-world applications (robotics, autonomous driving) |
VI. Key Applications of Actor-Critic
- Robotics: Controlling robot arms, navigation, and manipulation (e.g., picking and placing objects).
- Autonomous Driving: Optimizing steering, acceleration, and braking in dynamic environments.
- Game AI: Beating human players in complex games (e.g., Atari, StarCraft II, Dota 2).
- Recommendation Systems: Optimizing content recommendations to maximize user engagement.
- Finance: Algorithmic trading (maximizing profit while managing risk).
VII. Summary
- Actor-Critic is a hybrid RL algorithm that combines a policy-based actor and a value-based critic.
- Actor: Learns to select actions by maximizing the expected advantage (feedback from the critic).
- Critic: Learns to evaluate state/action values by minimizing the TD error.
- Advantages: Stable training, sample efficiency, and support for both discrete and continuous action spaces.
- Popular Variants: A2C (synchronous), A3C (asynchronous), PPO (stable, clipped updates), DDPG (continuous action spaces).