Policy Gradient is a class of reinforcement learning (RL) algorithms that directly optimize a policy function \(\pi_\theta(a|s)\)—a probability distribution over actions a given state s, parameterized by \(\theta\). Unlike value-based methods (e.g., Q-Learning) that learn a value function and derive a policy from it, policy gradient methods learn the policy directly, making them particularly effective for high-dimensional or continuous action spaces.
Classic policy gradient methods are on-policy (they learn from the current policy’s experience) and, under mild conditions such as suitably decaying learning rates, converge to a local optimum of the expected cumulative reward.
Core Concepts
1. Key Definitions
First, we recap essential RL terms for policy gradients:
| Term | Definition |
|---|---|
| Policy (\(\pi_\theta(a|s)\)) | A stochastic function that outputs the probability of taking action a in state s (e.g., \(\pi_\theta(\text{right}|s) = 0.7\)). For discrete actions, it’s a categorical distribution; for continuous actions, a Gaussian distribution. |
| Return (\(G_t\)) | The cumulative discounted reward an agent receives from time step t onward: \(G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots\), where \(\gamma \in [0,1]\) is the discount factor. |
| Objective Function (\(J(\theta)\)) | The expected return of the policy \(\pi_\theta\), which we aim to maximize:\(J(\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta} [G_t | s_t = s, a_t = a]\)where \(\rho^\pi\) is the state distribution induced by policy \(\pi_\theta\). |
2. Core Idea: Gradient Ascent on the Objective
Policy gradient methods use gradient ascent to update the policy parameters \(\theta\) in the direction that increases \(J(\theta)\):
\(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\)
where \(\alpha\) is the learning rate, and \(\nabla_\theta J(\theta)\) is the policy gradient.
The key insight of policy gradients is deriving a tractable form of \(\nabla_\theta J(\theta)\). Using the likelihood ratio trick, the policy gradient can be rewritten as:
\(\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot G_t \right]\)
This is the REINFORCE formula—the simplest and most fundamental policy gradient algorithm.
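To see where this expression comes from, write the objective as an expectation over whole trajectories, \(J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]\), where \(p_\theta(\tau)\) is the probability of trajectory \(\tau\) under \(\pi_\theta\) and \(R(\tau)\) is its return. Using the identity \(\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\) (the likelihood ratio trick):
\(\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\left[ \nabla_\theta \log p_\theta(\tau) \cdot R(\tau) \right]\)
Because the environment dynamics do not depend on \(\theta\), \(\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\), which yields the per-step form above.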
3. Intuition Behind the Formula
- \(\nabla_\theta \log \pi_\theta(a|s)\): Measures how much the log-probability of action a changes with respect to \(\theta\). It acts as a “direction” for parameter updates.
- \(G_t\): The return (cumulative reward) serves as a weight for the update:
  - If \(G_t > 0\) (good action), we increase the probability of taking a in state s.
  - If \(G_t < 0\) (bad action), we decrease the probability of taking a in state s.
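In code, this update is usually implemented as a surrogate loss whose gradient equals the policy gradient: minimizing \(-\log \pi_\theta(a|s) \cdot G_t\) with a standard optimizer performs the gradient-ascent step. Here is a minimal PyTorch sketch of a single update; the toy linear policy, the 4-dimensional state, and the return value \(G = 2.5\) are made-up for illustration and are not part of the CartPole code later in this article.

```python
import torch

torch.manual_seed(0)
policy = torch.nn.Linear(4, 2)                       # toy policy: 4-dim state -> 2 action logits
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

state = torch.randn(4)                               # dummy state
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()                               # sample an action from pi(a|s)
G = 2.5                                              # pretend this return was observed

# Gradient ascent on log pi(a|s) * G is gradient descent on the negative:
loss = -dist.log_prob(action) * G
optimizer.zero_grad()
loss.backward()
optimizer.step()                                     # increases pi(action|state) since G > 0
```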
REINFORCE Algorithm (Vanilla Policy Gradient)
REINFORCE (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility) is the foundational policy gradient algorithm. It uses Monte Carlo sampling to estimate the return \(G_t\) from complete episodes.
Algorithm Steps
- Initialize Policy Parameters: Randomly initialize \(\theta\) for the policy network \(\pi_\theta(a|s)\).
- For Each Episode:
  a. Generate Trajectory: Use \(\pi_\theta\) to collect a sequence of states, actions, and rewards: \(\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \dots, s_T, a_T, r_{T+1})\).
  b. Compute Returns: For each time step t in the episode, calculate the return \(G_t = \sum_{k=t+1}^{T+1} \gamma^{k-t-1} r_k\) (see the short sketch after these steps).
  c. Update Policy Parameters: For each time step t, compute the gradient estimate \(\hat{\nabla}_\theta J(\theta) = \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\), then update \(\theta\) via gradient ascent: \(\theta \leftarrow \theta + \alpha \hat{\nabla}_\theta J(\theta)\).
- Repeat: Until the policy converges to a local maximum of \(J(\theta)\).
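A minimal sketch of step (b), referenced above: the returns for every time step can be computed with a single backward pass over the episode's rewards using the recursion \(G_t = r_{t+1} + \gamma G_{t+1}\). The reward list and \(\gamma = 0.9\) here are made-up values.

```python
gamma = 0.9
rewards = [1.0, 1.0, 1.0]            # illustrative rewards r_1, r_2, r_3

returns = []
running_return = 0.0
for r in reversed(rewards):          # accumulate from the end of the episode backwards
    running_return = r + gamma * running_return
    returns.insert(0, running_return)

# approximately [2.71, 1.9, 1.0]: G_0 = 1 + 0.9 + 0.81 = 2.71
print(returns)
```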
Key Improvements to REINFORCE
Vanilla REINFORCE has high variance (due to noisy Monte Carlo return estimates). To reduce this variance, two techniques are commonly used:
- Baseline Subtraction: Subtract a state-dependent baseline \(b(s)\) from the return \(G_t\): \(\nabla_\theta J(\theta) = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot (G_t - b(s)) \right]\). The baseline is typically a value function \(V^\pi(s)\) (the expected return from state s), learned alongside the policy. Subtracting \(b(s)\) leaves the expected gradient unchanged but shrinks the magnitude of the weighting term, which lowers the variance of the estimate (see the identity after this list).
- Discount Factor (\(\gamma\)): Down-weights distant future rewards, which reduces the variance of the return estimates (at the cost of a small bias toward short-term reward).
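The reason a baseline does not bias the gradient is that, for any \(b(s)\) that does not depend on the action,
\(\mathbb{E}_{a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot b(s) \right] = b(s) \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0\)
(the sum becomes an integral for continuous actions), so subtracting \(b(s)\) changes only the variance of the estimator, not its expectation.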
Policy Gradient Implementation (Python with PyTorch: CartPole)
We implement the REINFORCE algorithm with a baseline to solve the CartPole-v1 environment from Gymnasium (the maintained fork of OpenAI Gym). The goal is to balance a pole on a cart by moving left or right; the episode ends when the pole falls or the cart moves out of bounds.
Step 1: Install Dependencies
```bash
pip install torch gymnasium[classic-control] numpy matplotlib
```
Step 2: Full Implementation
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import gymnasium as gym
# Set device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# --------------------------
# 1. Policy Network Definition
# --------------------------
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(PolicyNetwork, self).__init__()
        # Policy head: outputs a probability distribution over actions (via Softmax)
        self.policy = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
        # Baseline value head: predicts the state value V(s)
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, state):
        # Return action probabilities and state value
        action_probs = self.policy(state)
        state_value = self.value(state)
        return action_probs, state_value
# --------------------------
# 2. REINFORCE Agent with Baseline
# --------------------------
class REINFORCEAgent:
    def __init__(self, state_dim, action_dim, hidden_dim=128, lr=1e-3, gamma=0.99):
        self.gamma = gamma
        self.policy_net = PolicyNetwork(state_dim, action_dim, hidden_dim).to(device)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        # Store trajectory data for each episode
        self.states = []
        self.actions = []
        self.rewards = []

    def select_action(self, state):
        # Convert state to tensor
        state = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(device)
        # Get action probabilities from the policy network
        action_probs, _ = self.policy_net(state)
        # Sample an action from the categorical distribution
        action_dist = torch.distributions.Categorical(action_probs)
        action = action_dist.sample()
        # Return the action and its log probability
        return action.item(), action_dist.log_prob(action)

    def store_transition(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

    def compute_returns(self):
        # Compute discounted returns for each time step
        returns = []
        running_return = 0
        # Iterate over rewards in reverse order
        for reward in reversed(self.rewards):
            running_return = reward + self.gamma * running_return
            returns.insert(0, running_return)
        # Normalize returns to reduce variance
        returns = torch.tensor(returns, dtype=torch.float32).to(device)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def update(self):
        # Convert trajectory data to tensors (stack states into one array first)
        states = torch.tensor(np.array(self.states), dtype=torch.float32).to(device)
        actions = torch.tensor(self.actions, dtype=torch.int64).to(device)
        returns = self.compute_returns()
        # Forward pass: get action probabilities and state values
        action_probs, state_values = self.policy_net(states)
        # Get log probabilities of the taken actions
        action_dist = torch.distributions.Categorical(action_probs)
        log_probs = action_dist.log_prob(actions)
        # Advantage = return - baseline; detach so the policy loss does not
        # backpropagate through the value head
        advantage = (returns - state_values.squeeze()).detach()
        # Policy loss: -log_prob * advantage (gradient ascent -> minimize the negative)
        policy_loss = -torch.mean(log_probs * advantage)
        # Value loss: MSE between predicted value and actual return
        value_loss = nn.MSELoss()(state_values.squeeze(), returns)
        total_loss = policy_loss + 0.5 * value_loss
        # Backpropagation and optimization
        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()
        # Reset trajectory data
        self.states = []
        self.actions = []
        self.rewards = []
        return total_loss.item()
# --------------------------
# 3. Training the Agent
# --------------------------
def train():
    # Initialize environment
    env = gym.make("CartPole-v1")
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    # Hyperparameters
    EPISODES = 1000
    LR = 1e-3
    GAMMA = 0.99
    HIDDEN_DIM = 128
    # Initialize agent
    agent = REINFORCEAgent(state_dim, action_dim, HIDDEN_DIM, LR, GAMMA)
    # Training metrics
    episode_rewards = []
    avg_rewards = []

    for episode in range(EPISODES):
        state, _ = env.reset()
        total_reward = 0
        done = False
        while not done:
            # Select action (the log probability is recomputed in update())
            action, _ = agent.select_action(state)
            # Take action in the environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Store transition
            agent.store_transition(state, action, reward)
            # Update state and total reward
            state = next_state
            total_reward += reward
        # Update policy after each episode
        loss = agent.update()
        # Record metrics
        episode_rewards.append(total_reward)
        # Calculate average reward over the last 100 episodes
        if len(episode_rewards) >= 100:
            avg_reward = np.mean(episode_rewards[-100:])
            avg_rewards.append(avg_reward)
        else:
            avg_rewards.append(np.mean(episode_rewards))
        # Print progress
        if (episode + 1) % 50 == 0:
            print(f"Episode {episode+1}/{EPISODES} | Avg Reward (last 100): {avg_rewards[-1]:.2f} | Loss: {loss:.4f}")
        # Early stopping once the average reward approaches the maximum of 500 for CartPole-v1
        if avg_rewards[-1] >= 495:
            print(f"Environment solved in {episode+1} episodes!")
            break

    # Plot training results
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(episode_rewards, label="Episode Reward")
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.title("Reward per Episode")
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(avg_rewards, label="Avg Reward (last 100)", color="orange")
    plt.xlabel("Episode")
    plt.ylabel("Avg Reward")
    plt.title("Average Reward Over Time")
    plt.legend()
    plt.tight_layout()
    plt.show()
    # Close environment
    env.close()

# Run training
if __name__ == "__main__":
    train()
```
Key Outputs
- Reward Curve: The total reward per episode increases rapidly, and the average reward over 100 episodes converges to the maximum possible value (500 for CartPole-v1).
- Loss Curve: The total loss (policy loss + value loss) is noisy and generally trends downward; for policy gradient methods, however, the reward curves are the more reliable indicator that the policy is improving.
Key Properties of Policy Gradient Methods
| Property | Description |
|---|---|
| Direct Policy Optimization | Learns \(\pi_\theta(a|s)\) directly, avoiding the need to derive a policy from a value function (critical for continuous action spaces). |
| On-Policy Learning | Requires fresh experience from the current policy—old experience becomes obsolete as the policy changes. This can be inefficient, but variants like PPO (Proximal Policy Optimization) mitigate this. |
| Stochastic Policies | Naturally learn stochastic policies, which are useful for exploration and tasks requiring randomness (e.g., rock-paper-scissors). |
| Handling Continuous Actions | Easily extended to continuous action spaces by using a Gaussian policy \(\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta^2(s))\), where \(\mu\) and \(\sigma\) are outputs of the policy network (sketched in the code below this table). |
| Convergence | Guaranteed to converge to a local maximum of the expected return \(J(\theta)\) (not necessarily global, but sufficient for most practical tasks). |
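To make the Gaussian-policy row above concrete, here is a minimal sketch of a continuous-action policy head. The `GaussianPolicy` name, layer sizes, and 3-dimensional state are illustrative assumptions, not part of the CartPole example above.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs a Normal distribution over continuous actions for a given state."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.mu = nn.Linear(hidden_dim, action_dim)            # state-dependent mean
        self.log_std = nn.Parameter(torch.zeros(action_dim))   # learned, state-independent log std

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

# Sampling and log-probabilities work exactly as in the discrete case:
policy = GaussianPolicy(state_dim=3, action_dim=1)
dist = policy(torch.randn(3))
action = dist.sample()                   # continuous action
log_prob = dist.log_prob(action).sum()   # sum over action dimensions
```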
Popular Policy Gradient Variants
To address the limitations of vanilla REINFORCE (high variance, inefficiency), researchers developed advanced policy gradient algorithms:
- Actor-Critic: Combines policy gradient (actor) with value function learning (critic). The critic estimates the advantage \(A(s,a) = Q(s,a) - V(s)\), reducing variance by replacing Monte Carlo returns with temporal difference (TD) estimates.
- Proximal Policy Optimization (PPO): The most widely used policy gradient algorithm in practice. It clips the policy update to prevent large, destabilizing changes to the policy (see the clipped objective after this list), balancing learning stability and sample efficiency.
- Trust Region Policy Optimization (TRPO): Ensures policy updates stay within a “trust region” (small change in policy) to guarantee monotonic improvement in the objective function. More mathematically rigorous than PPO but computationally more expensive.
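For reference, the clipped surrogate objective that PPO maximizes (referenced in the PPO item above) is:
\(L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta)\, \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t \right) \right], \quad r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\)
where \(\hat{A}_t\) is an advantage estimate and \(\epsilon\) (typically 0.1 to 0.2) limits how far the new policy can move from the old one in a single update.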
Policy Gradient vs. Value-Based Methods (Q-Learning)
| Feature | Policy Gradient | Q-Learning |
|---|---|---|
| Optimization Target | Directly optimizes the policy \(\pi_\theta(a|s)\) | Optimizes a value function \(Q(s,a)\), then derives a policy (\(\epsilon\)-greedy) |
| Action Spaces | Excels at continuous/high-dimensional action spaces | Best for discrete, low-dimensional action spaces |
| Policy Type | Stochastic (probabilistic actions) | Deterministic (greedy action selection after training) |
| On/Off-Policy | On-policy (requires fresh experience) | Off-policy (can reuse old experience) |
| Variance/Bias | High variance, low bias | Low variance, high bias (due to max operator in Q-update) |
| Use Case | Robotics, autonomous driving, continuous control | Grid worlds, simple games, discrete decision-making |
Real-World Applications
- Robotics: Controlling robot arms, legged robots, and drones (continuous action spaces).
- Autonomous Driving: Optimizing steering, acceleration, and braking policies.
- Game AI: Building AI agents for complex games like Dota 2, StarCraft II, and Atari games.
- Finance: Algorithmic trading (optimizing buy/sell decisions in continuous price spaces).
- Natural Language Processing: Generative models (e.g., text generation) by treating token selection as a policy optimization problem.
Summary
Policy Gradient is a class of RL algorithms that directly optimize a policy function \(\pi_\theta(a|s)\) via gradient ascent on the expected return.
The core formula (REINFORCE) uses the likelihood ratio trick to compute the policy gradient, weighted by cumulative rewards.
Advanced variants (Actor-Critic, PPO) address the high variance of vanilla REINFORCE and are widely used in practice.
These properties make policy gradient methods a natural choice for continuous action spaces and complex, real-world RL tasks.