Policy Gradient is a class of reinforcement learning (RL) algorithms that directly optimize a policy function \(\pi_\theta(a|s)\)—a probability distribution over actions a given state s, parameterized by \(\theta\). Unlike value-based methods (e.g., Q-Learning) that learn a value function and derive a policy from it, policy gradient methods learn the policy directly, making them particularly effective for high-dimensional or continuous action spaces.
Classic policy gradient methods are on-policy (they learn from the current policy’s experience) and, under mild conditions such as suitably decaying learning rates, converge to a local optimum of the expected cumulative reward.
Core Concepts
1. Key Definitions
First, we recap essential RL terms for policy gradients:
| Term | Definition |
|---|---|
| Policy (\(\pi_\theta(a|s)\)) | A stochastic function that outputs the probability of taking action a in state s (e.g., \(\pi_\theta(\text{right}|s) = 0.7\)). For discrete actions, it’s a categorical distribution; for continuous actions, a Gaussian distribution. |
| Return (\(G_t\)) | The cumulative discounted reward an agent receives from time step t onward: \(G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots\), where \(\gamma \in [0,1]\) is the discount factor. |
| Objective Function (\(J(\theta)\)) | The expected return of the policy \(\pi_\theta\), which we aim to maximize:\(J(\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta} [G_t | s_t = s, a_t = a]\)where \(\rho^\pi\) is the state distribution induced by policy \(\pi_\theta\). |
2. Core Idea: Gradient Ascent on the Objective
Policy gradient methods use gradient ascent to update the policy parameters \(\theta\) in the direction that increases \(J(\theta)\):
\(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\)
where \(\alpha\) is the learning rate, and \(\nabla_\theta J(\theta)\) is the policy gradient.
The key insight of policy gradients is deriving a tractable form of \(\nabla_\theta J(\theta)\). Using the likelihood ratio trick, the policy gradient can be rewritten as:
\(\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot G_t \right]\)
This is the REINFORCE formula—the simplest and most fundamental policy gradient algorithm.
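To see where this expression comes from, write the objective as an expectation over whole trajectories, \(J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]\), where \(p_\theta(\tau)\) is the probability of trajectory \(\tau\) under \(\pi_\theta\) and \(R(\tau)\) is its return. Using the identity \(\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\) (the likelihood ratio trick):
\(\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\left[ \nabla_\theta \log p_\theta(\tau) \cdot R(\tau) \right]\)
Because the environment dynamics do not depend on \(\theta\), \(\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\), which yields the per-step form above.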
3. Intuition Behind the Formula
- \(\nabla_\theta \log \pi_\theta(a|s)\): Measures how much the log-probability of action a changes with respect to \(\theta\). It acts as a “direction” for parameter updates.
- \(G_t\): The return (cumulative reward) serves as a weight for the update:
  - If \(G_t > 0\) (good action), we increase the probability of taking a in state s.
  - If \(G_t < 0\) (bad action), we decrease the probability of taking a in state s.
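In code, this update is usually implemented as a surrogate loss whose gradient equals the policy gradient: minimizing \(-\log \pi_\theta(a|s) \cdot G_t\) with a standard optimizer performs the gradient-ascent step. Here is a minimal PyTorch sketch of a single update; the toy linear policy, the 4-dimensional state, and the return value \(G = 2.5\) are made-up for illustration and are not part of the CartPole code later in this article.

```python
import torch

torch.manual_seed(0)
policy = torch.nn.Linear(4, 2)                       # toy policy: 4-dim state -> 2 action logits
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

state = torch.randn(4)                               # dummy state
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()                               # sample an action from pi(a|s)
G = 2.5                                              # pretend this return was observed

# Gradient ascent on log pi(a|s) * G is gradient descent on the negative:
loss = -dist.log_prob(action) * G
optimizer.zero_grad()
loss.backward()
optimizer.step()                                     # increases pi(action|state) since G > 0
```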
REINFORCE Algorithm (Vanilla Policy Gradient)
REINFORCE (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility) is the foundational policy gradient algorithm. It uses Monte Carlo sampling to estimate the return \(G_t\) from complete episodes.
Algorithm Steps
- Initialize Policy Parameters: Randomly initialize \(\theta\) for the policy network \(\pi_\theta(a|s)\).
- For Each Episode:
  a. Generate Trajectory: Use \(\pi_\theta\) to collect a sequence of states, actions, and rewards: \(\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \dots, s_T, a_T, r_{T+1})\).
  b. Compute Returns: For each time step t in the episode, calculate the return \(G_t = \sum_{k=t+1}^{T+1} \gamma^{k-t-1} r_k\) (see the short sketch after these steps).
  c. Update Policy Parameters: For each time step t, compute the gradient estimate \(\hat{\nabla}_\theta J(\theta) = \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\), then update \(\theta\) via gradient ascent: \(\theta \leftarrow \theta + \alpha \hat{\nabla}_\theta J(\theta)\).
- Repeat: Until the policy converges to a local maximum of \(J(\theta)\).
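A minimal sketch of step (b), referenced above: the returns for every time step can be computed with a single backward pass over the episode's rewards using the recursion \(G_t = r_{t+1} + \gamma G_{t+1}\). The reward list and \(\gamma = 0.9\) here are made-up values.

```python
gamma = 0.9
rewards = [1.0, 1.0, 1.0]            # illustrative rewards r_1, r_2, r_3

returns = []
running_return = 0.0
for r in reversed(rewards):          # accumulate from the end of the episode backwards
    running_return = r + gamma * running_return
    returns.insert(0, running_return)

# approximately [2.71, 1.9, 1.0]: G_0 = 1 + 0.9 + 0.81 = 2.71
print(returns)
```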
Key Improvements to REINFORCE
Vanilla REINFORCE has high variance (due to noisy Monte Carlo return estimates). To reduce this variance, two techniques are commonly used:
- Baseline Subtraction: Subtract a state-dependent baseline \(b(s)\) from the return \(G_t\): \(\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot (G_t - b(s)) \right]\). The baseline is typically a value function \(V^\pi(s)\) (the expected return from state s), learned alongside the policy. Subtracting \(b(s)\) leaves the expected gradient unchanged but shrinks the magnitude of the weighting term, which lowers the variance of the estimate (see the identity after this list).
- Discount Factor (\(\gamma\)): Down-weights distant future rewards, which reduces the variance of the return estimates (at the cost of a small bias toward short-term reward).
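The reason a baseline does not bias the gradient is that, for any \(b(s)\) that does not depend on the action,
\(\mathbb{E}_{a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot b(s) \right] = b(s) \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0\)
(the sum becomes an integral for continuous actions), so subtracting \(b(s)\) changes only the variance of the estimator, not its expectation.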
Policy Gradient Implementation (Python with PyTorch: CartPole)
We implement the REINFORCE algorithm with a baseline to solve the CartPole-v1 environment from Gymnasium (the maintained fork of OpenAI Gym). The goal is to balance a pole on a cart by moving left or right; the episode ends when the pole falls or the cart moves out of bounds.
Step 1: Install Dependencies
```bash
pip install torch gymnasium[classic-control] numpy matplotlib
```
Step 2: Full Implementation
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import gymnasium as gym
# Set device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# --------------------------
# 1. Policy Network Definition
# --------------------------
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(PolicyNetwork, self).__init__()
        # Policy head: outputs a probability distribution over actions (via Softmax)
        self.policy = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
        # Baseline value head: predicts the state value V(s)
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, state):
        # Return action probabilities and state value
        action_probs = self.policy(state)
        state_value = self.value(state)
        return action_probs, state_value
# --------------------------
# 2. REINFORCE Agent with Baseline
# --------------------------
class REINFORCEAgent:
    def __init__(self, state_dim, action_dim, hidden_dim=128, lr=1e-3, gamma=0.99):
        self.gamma = gamma
        self.policy_net = PolicyNetwork(state_dim, action_dim, hidden_dim).to(device)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        # Store trajectory data for each episode
        self.states = []
        self.actions = []
        self.rewards = []

    def select_action(self, state):
        # Convert state to tensor
        state = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(device)
        # Get action probabilities from the policy network
        action_probs, _ = self.policy_net(state)
        # Sample an action from the categorical distribution
        action_dist = torch.distributions.Categorical(action_probs)
        action = action_dist.sample()
        # Return the action and its log probability
        return action.item(), action_dist.log_prob(action)

    def store_transition(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

    def compute_returns(self):
        # Compute discounted returns for each time step
        returns = []
        running_return = 0
        # Iterate over rewards in reverse order
        for reward in reversed(self.rewards):
            running_return = reward + self.gamma * running_return
            returns.insert(0, running_return)
        # Normalize returns to reduce variance
        returns = torch.tensor(returns, dtype=torch.float32).to(device)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def update(self):
        # Convert trajectory data to tensors (stack states into one array first)
        states = torch.tensor(np.array(self.states), dtype=torch.float32).to(device)
        actions = torch.tensor(self.actions, dtype=torch.int64).to(device)
        returns = self.compute_returns()
        # Forward pass: get action probabilities and state values
        action_probs, state_values = self.policy_net(states)
        # Get log probabilities of the taken actions
        action_dist = torch.distributions.Categorical(action_probs)
        log_probs = action_dist.log_prob(actions)
        # Advantage = return - baseline; detach so the policy loss does not
        # backpropagate through the value head
        advantage = (returns - state_values.squeeze()).detach()
        # Policy loss: -log_prob * advantage (gradient ascent -> minimize the negative)
        policy_loss = -torch.mean(log_probs * advantage)
        # Value loss: MSE between predicted value and actual return
        value_loss = nn.MSELoss()(state_values.squeeze(), returns)
        total_loss = policy_loss + 0.5 * value_loss
        # Backpropagation and optimization
        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()
        # Reset trajectory data
        self.states = []
        self.actions = []
        self.rewards = []
        return total_loss.item()
# --------------------------
# 3. Training the Agent
# --------------------------
def train():
    # Initialize environment
    env = gym.make("CartPole-v1")
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    # Hyperparameters
    EPISODES = 1000
    LR = 1e-3
    GAMMA = 0.99
    HIDDEN_DIM = 128
    # Initialize agent
    agent = REINFORCEAgent(state_dim, action_dim, HIDDEN_DIM, LR, GAMMA)
    # Training metrics
    episode_rewards = []
    avg_rewards = []

    for episode in range(EPISODES):
        state, _ = env.reset()
        total_reward = 0
        done = False
        while not done:
            # Select action (the log probability is recomputed in update())
            action, _ = agent.select_action(state)
            # Take action in the environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Store transition
            agent.store_transition(state, action, reward)
            # Update state and total reward
            state = next_state
            total_reward += reward
        # Update policy after each episode
        loss = agent.update()
        # Record metrics
        episode_rewards.append(total_reward)
        # Calculate average reward over the last 100 episodes
        if len(episode_rewards) >= 100:
            avg_reward = np.mean(episode_rewards[-100:])
            avg_rewards.append(avg_reward)
        else:
            avg_rewards.append(np.mean(episode_rewards))
        # Print progress
        if (episode + 1) % 50 == 0:
            print(f"Episode {episode+1}/{EPISODES} | Avg Reward (last 100): {avg_rewards[-1]:.2f} | Loss: {loss:.4f}")
        # Early stopping once the average reward approaches the maximum of 500 for CartPole-v1
        if avg_rewards[-1] >= 495:
            print(f"Environment solved in {episode+1} episodes!")
            break

    # Plot training results
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(episode_rewards, label="Episode Reward")
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.title("Reward per Episode")
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(avg_rewards, label="Avg Reward (last 100)", color="orange")
    plt.xlabel("Episode")
    plt.ylabel("Avg Reward")
    plt.title("Average Reward Over Time")
    plt.legend()
    plt.tight_layout()
    plt.show()
    # Close environment
    env.close()

# Run training
if __name__ == "__main__":
    train()
```
Key Outputs
- Reward Curve: The total reward per episode increases rapidly, and the average reward over 100 episodes converges to the maximum possible value (500 for CartPole-v1).
- Loss Curve: The total loss (policy loss + value loss) is noisy and generally trends downward; for policy gradient methods, however, the reward curves are the more reliable indicator that the policy is improving.
Key Properties of Policy Gradient Methods
| Property | Description |
|---|---|
| Direct Policy Optimization | Learns \(\pi_\theta(a|s)\) directly, avoiding the need to derive a policy from a value function (critical for continuous action spaces). |
| On-Policy Learning | Requires fresh experience from the current policy—old experience becomes obsolete as the policy changes. This can be inefficient, but variants like PPO (Proximal Policy Optimization) mitigate this. |
| Stochastic Policies | Naturally learn stochastic policies, which are useful for exploration and tasks requiring randomness (e.g., rock-paper-scissors). |
| Handling Continuous Actions | Easily extended to continuous action spaces by using a Gaussian policy \(\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta^2(s))\), where \(\mu\) and \(\sigma\) are outputs of the policy network (sketched in the code below this table). |
| Convergence | Guaranteed to converge to a local maximum of the expected return \(J(\theta)\) (not necessarily global, but sufficient for most practical tasks). |
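To make the Gaussian-policy row above concrete, here is a minimal sketch of a continuous-action policy head. The `GaussianPolicy` name, layer sizes, and 3-dimensional state are illustrative assumptions, not part of the CartPole example above.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs a Normal distribution over continuous actions for a given state."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.mu = nn.Linear(hidden_dim, action_dim)            # state-dependent mean
        self.log_std = nn.Parameter(torch.zeros(action_dim))   # learned, state-independent log std

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

# Sampling and log-probabilities work exactly as in the discrete case:
policy = GaussianPolicy(state_dim=3, action_dim=1)
dist = policy(torch.randn(3))
action = dist.sample()                   # continuous action
log_prob = dist.log_prob(action).sum()   # sum over action dimensions
```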
Popular Policy Gradient Variants
To address the limitations of vanilla REINFORCE (high variance, inefficiency), researchers developed advanced policy gradient algorithms:
- Actor-Critic: Combines policy gradient (actor) with value function learning (critic). The critic estimates the advantage \(A(s,a) = Q(s,a) - V(s)\), reducing variance by replacing Monte Carlo returns with temporal difference (TD) estimates.
- Proximal Policy Optimization (PPO): The most widely used policy gradient algorithm in practice. It clips the policy update to prevent large, destabilizing changes to the policy (see the clipped objective after this list), balancing learning stability and sample efficiency.
- Trust Region Policy Optimization (TRPO): Ensures policy updates stay within a “trust region” (small change in policy) to guarantee monotonic improvement in the objective function. More mathematically rigorous than PPO but computationally more expensive.
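For reference, the clipped surrogate objective that PPO maximizes (referenced in the PPO item above) is:
\(L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta)\, \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t \right) \right], \quad r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\)
where \(\hat{A}_t\) is an advantage estimate and \(\epsilon\) (typically 0.1 to 0.2) limits how far the new policy can move from the old one in a single update.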
Policy Gradient vs. Value-Based Methods (Q-Learning)
| Feature | Policy Gradient | Q-Learning |
|---|---|---|
| Optimization Target | Directly optimizes the policy \(\pi_\theta(a|s)\) | Optimizes a value function \(Q(s,a)\), then derives a policy (\(\epsilon\)-greedy) |
| Action Spaces | Excels at continuous/high-dimensional action spaces | Best for discrete, low-dimensional action spaces |
| Policy Type | Stochastic (probabilistic actions) | Deterministic (greedy action selection after training) |
| On/Off-Policy | On-policy (requires fresh experience) | Off-policy (can reuse old experience) |
| Variance/Bias | High variance, low bias | Low variance, high bias (due to max operator in Q-update) |
| Use Case | Robotics, autonomous driving, continuous control | Grid worlds, simple games, discrete decision-making |
Real-World Applications
- Robotics: Controlling robot arms, legged robots, and drones (continuous action spaces).
- Autonomous Driving: Optimizing steering, acceleration, and braking policies.
- Game AI: Building AI agents for complex games like Dota 2, StarCraft II, and Atari games.
- Finance: Algorithmic trading (optimizing buy/sell decisions in continuous price spaces).
- Natural Language Processing: Generative models (e.g., text generation) by treating token selection as a policy optimization problem.
Summary
Policy Gradient is a class of RL algorithms that directly optimize a policy function \(\pi_\theta(a|s)\) via gradient ascent on the expected return.
The core formula (REINFORCE) uses the likelihood ratio trick to compute the policy gradient, weighted by cumulative rewards.
Advanced variants (Actor-Critic, PPO) address the high variance of vanilla REINFORCE and are widely used in practice.
These properties make policy gradient methods a natural choice for continuous action spaces and complex, real-world RL tasks.