SARSA vs Q-Learning: Key Differences in Algorithms

SARSA is an on-policy temporal difference (TD) learning algorithm for reinforcement learning (RL). It learns the action-value function of the policy the agent is currently following, and, if exploration is gradually reduced (e.g., a decaying \(\epsilon\)-greedy policy), it converges toward the optimal action-value function \(q_*(s,a)\) and the optimal policy \(\pi_*\). The name SARSA comes from the sequence of events that drives its update rule:

State (s) → Action (a) → Reward (r) → Next State (\(s'\)) → Next Action (\(a'\))

Unlike Q-Learning (an off-policy algorithm), SARSA learns the value of the policy that the agent is currently following (the behavior policy). This makes SARSA more conservative in exploration, which is often advantageous for safety-critical tasks (e.g., robot navigation, where avoiding hazards is prioritized over maximizing reward).


I. Core Principles of SARSA

1. On-Policy vs. Off-Policy Learning

A key distinction between SARSA and Q-Learning lies in their approach to policy learning:

| Property | SARSA (On-Policy) | Q-Learning (Off-Policy) |
|---|---|---|
| Target policy | Learns the value of the behavior policy (the policy the agent uses to act). | Learns the value of the optimal policy (independent of the behavior policy). |
| Update target | Uses the next action \(a'\) taken by the behavior policy in \(s'\). | Uses the maximizing action \(a' = \arg\max_{a} Q(s',a)\) (ignores the behavior policy). |
| Exploration strategy | Conservative: avoids risky actions during exploration. | Aggressive: may take risky actions to find the optimal policy. |
| Use case | Safety-critical tasks (e.g., robot obstacle avoidance). | Tasks where reward maximization is prioritized (e.g., game playing). |

2. The SARSA Update Rule

SARSA updates the action-value function \(Q(s,a)\) using the TD error: the difference between the current estimate \(Q(s,a)\) and the bootstrapped target \(r + \gamma Q(s',a')\). The update rule is:

\(Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s',a') - Q(s,a) \right]\)

Where:

  • \(\alpha \in [0,1]\): Learning rate (controls the magnitude of the update; smaller values = more stable learning).
  • \(\gamma \in [0,1]\): Discount factor (weights future rewards relative to immediate rewards).
  • r: Reward received after taking action a in state s.
  • \(s'\): The next state reached after taking action a.
  • \(a'\): The next action selected by the behavior policy (e.g., \(\epsilon\)-greedy) in state \(s'\).
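
To make the difference in update targets concrete, here is a minimal NumPy sketch of the tabular SARSA update next to the tabular Q-Learning update. This sketch is not part of the original article; the names (Q, alpha, gamma, sarsa_update, q_learning_update) are illustrative assumptions.

```python
# Illustrative sketch: tabular SARSA vs. Q-Learning updates on a NumPy Q-table.
import numpy as np

n_states, n_actions = 48, 4          # e.g., the 4x12 Cliff Walking grid
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def sarsa_update(Q, s, a, r, s_next, a_next, done):
    # SARSA: the target uses the action a_next actually chosen by the behavior policy
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, done):
    # Q-Learning: the target uses the greedy action in s_next, regardless of what the agent does
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Example: one SARSA update after observing (s=36, a=0, r=-1, s'=24, a'=1)
sarsa_update(Q, 36, 0, -1.0, 24, 1, done=False)
```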

3. Behavior Policy: \(\epsilon\)-Greedy Exploration

SARSA typically uses an \(\epsilon\)-greedy policy as its behavior policy to balance exploration and exploitation:

  • With probability \(1-\epsilon\): Exploit—select the action with the highest Q-value: \(a = \arg\max_{a} Q(s,a)\).
  • With probability \(\epsilon\): Explore—select a random action from the action space.

Over time, \(\epsilon\) is often decayed (e.g., \(\epsilon = \epsilon_{\text{start}} \times \epsilon_{\text{decay}}^t\)) to reduce exploration and converge to the optimal policy.
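
A small illustrative sketch of \(\epsilon\)-greedy selection with per-episode decay over a NumPy Q-table (all names here are assumptions, not taken from the article's code):

```python
# Illustrative sketch: epsilon-greedy action selection with per-episode epsilon decay.
import numpy as np

rng = np.random.default_rng(0)
Q = np.zeros((48, 4))                                    # assumed Q-table (states x actions)

def epsilon_greedy(Q, s, epsilon):
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))             # explore: uniform random action
    return int(np.argmax(Q[s]))                          # exploit: greedy action

epsilon, epsilon_decay, epsilon_min = 1.0, 0.995, 0.01
for episode in range(500):
    # ... interact with the environment, calling epsilon_greedy(Q, s, epsilon) each step ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)  # reduce exploration over time
```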


II. SARSA Algorithm Steps

The full SARSA algorithm for discrete state and action spaces is as follows:

  1. Initialize: Set the action-value function \(Q(s,a)\) to arbitrary values (e.g., 0 for all \(s,a\)). Set \(Q(\text{terminal state}, a) = 0\) for all a.
  2. For each episode:
     a. Initialize state: \(s \leftarrow \text{initial state of the environment}\).
     b. Select action: Choose a from s using the \(\epsilon\)-greedy policy based on Q.
     c. For each step of the episode:
        i. Take action a: Observe reward r and next state \(s'\).
        ii. Select next action: Choose \(a'\) from \(s'\) using the \(\epsilon\)-greedy policy based on Q.
        iii. Update Q-value: Apply the SARSA update rule to \(Q(s,a)\).
        iv. Update state and action: Set \(s \leftarrow s'\), \(a \leftarrow a'\).
        v. Terminate: If s is a terminal state, end the episode.
  3. Repeat: Continue until the policy converges to the optimal policy \(\pi_*\).
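
Assuming a discrete gymnasium environment, the loop above can be written as the following tabular sketch (illustrative only; none of these names come from the original post):

```python
# Illustrative tabular SARSA loop for a discrete gymnasium environment.
import gymnasium as gym
import numpy as np

env = gym.make('CliffWalking-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)

def epsilon_greedy(s):
    return int(rng.integers(Q.shape[1])) if rng.random() < epsilon else int(np.argmax(Q[s]))

for episode in range(500):
    s, _ = env.reset()
    a = epsilon_greedy(s)                                    # step 2b: initial action
    for _ in range(1000):                                    # cap steps per episode
        s_next, r, terminated, truncated, _ = env.step(a)    # step c.i: act, observe r and s'
        done = terminated or truncated
        a_next = epsilon_greedy(s_next)                      # step c.ii: choose a' with the behavior policy
        target = r if done else r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])                # step c.iii: SARSA update
        s, a = s_next, a_next                                # step c.iv: advance (s, a)
        if done:
            break
env.close()
```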

Key Difference from Q-Learning

The critical difference is in step c.ii:

  • SARSA selects \(a'\) using the behavior policy (e.g., \(\epsilon\)-greedy) and uses it in the update.
  • Q-Learning does not need the selected next action for its update; it uses \(\max_{a'} Q(s',a')\) in the target instead.

This means SARSA’s update depends on the agent’s actual next action, while Q-Learning’s update depends on the best possible next action.


III. Example: SARSA for Cliff Walking

The Cliff Walking task is a classic RL problem that highlights the difference between SARSA and Q-Learning. The environment is a 4×12 grid:

  • The agent starts at the bottom-left corner (S).
  • The goal is to reach the bottom-right corner (G).
  • A “cliff” (marked by X) runs along the bottom row between S and G. Stepping into the cliff gives a large negative reward (-100) and resets the agent to the start.
  • Each step (up/down/left/right) gives a reward of -1 (to encourage the shortest path).

SARSA vs. Q-Learning in Cliff Walking

| Algorithm | Learned Path | Rationale |
|---|---|---|
| SARSA | Takes a safe path around the top of the grid (avoids the cliff entirely). | SARSA is on-policy: during exploration it learns that stepping near the cliff is risky, since the behavior policy might select a random action into the cliff. The update rule discourages actions that could lead to danger. |
| Q-Learning | Takes a risky path along the edge of the cliff (the shortest path to G). | Q-Learning is off-policy: it ignores the risk of random exploration and optimizes for the shortest path, even if that means walking next to the cliff. |

This example demonstrates why SARSA is preferred for safety-critical tasks: it prioritizes avoiding hazards over maximizing reward.


IV. SARSA Implementation (Python with PyTorch: Cliff Walking)

Below is a practical implementation of SARSA for the Cliff Walking environment. We use a neural network to approximate the Q-value function (Deep SARSA) for scalability to high-dimensional state spaces.

1. Import Libraries

```python
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
```

2. Define Q-Network (Deep SARSA Approximator)

```python
class QNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(QNetwork, self).__init__()
        # Fully connected layers to approximate Q(s,a)
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)  # Output Q-values for all actions
```

3. Define Deep SARSA Agent

```python
class SARSAAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        
        # Hyperparameters
        self.gamma = 0.99  # Discount factor
        self.alpha = 0.001  # Learning rate
        self.epsilon = 1.0  # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995  # Decay epsilon over time
        
        # Q-Network and optimizer
        self.qnetwork = QNetwork(state_size, action_size)
        self.optimizer = optim.Adam(self.qnetwork.parameters(), lr=self.alpha)

    def act(self, state):
        """Select action using epsilon-greedy policy (behavior policy)"""
        if random.random() > self.epsilon:
            # Exploit: select action with max Q-value
            state = torch.from_numpy(state).float().unsqueeze(0)
            self.qnetwork.eval()
            with torch.no_grad():
                action_values = self.qnetwork(state)
            self.qnetwork.train()
            return np.argmax(action_values.cpu().data.numpy())
        else:
            # Explore: select random action
            return random.choice(np.arange(self.action_size))

    def learn(self, state, action, reward, next_state, next_action, done):
        """Update Q-network using SARSA update rule"""
        # Convert to tensors
        state = torch.from_numpy(state).float().unsqueeze(0)
        next_state = torch.from_numpy(next_state).float().unsqueeze(0)
        action = torch.tensor([[action]], dtype=torch.long)
        next_action = torch.tensor([[next_action]], dtype=torch.long)
        reward = torch.tensor([reward], dtype=torch.float)
        done = torch.tensor([done], dtype=torch.float)

        # Current Q-value: Q(s,a)
        current_q = self.qnetwork(state).gather(1, action)

        # Target Q-value: r + gamma * Q(s',a') (SARSA target)
        # Treat the target as a constant: no gradient flows through it (semi-gradient update)
        with torch.no_grad():
            next_q = self.qnetwork(next_state).gather(1, next_action)
        target_q = reward + (self.gamma * next_q * (1 - done))

        # Compute loss (MSE between current Q and target Q)
        loss = nn.MSELoss()(current_q, target_q)

        # Backpropagation and optimization
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Decay epsilon
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
```

4. Train the Agent on Cliff Walking

```python
# Create Cliff Walking environment (4x12 grid)
env = gym.make('CliffWalking-v0')
state_size = env.observation_space.n  # 48 states (4x12 grid)
action_size = env.action_space.n      # 4 actions (0: up, 1: right, 2: down, 3: left)

# Convert discrete state to one-hot vector (required for Q-network input)
def one_hot(state, state_size):
    vec = np.zeros(state_size)
    vec[state] = 1.0
    return vec

# Initialize agent
agent = SARSAAgent(state_size, action_size)

# Training parameters
n_episodes = 500
max_t = 1000  # Max steps per episode
print_every = 50

scores = []  # Store cumulative reward per episode

for i_episode in range(1, n_episodes + 1):
    state, _ = env.reset()
    state = one_hot(state, state_size)
    score = 0
    action = agent.act(state)  # Select initial action
    
    for t in range(max_t):
        # Take action and observe next state/reward
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # episode ends if terminated or truncated
        next_state = one_hot(next_state, state_size)
        score += reward
        
        # Select next action (critical for SARSA)
        next_action = agent.act(next_state)
        
        # Learn from experience (s, a, r, s', a')
        agent.learn(state, action, reward, next_state, next_action, done)
        
        # Update state and action for next step
        state = next_state
        action = next_action
        
        if done:
            break
    
    scores.append(score)
    
    # Print progress
    if i_episode % print_every == 0:
        avg_score = np.mean(scores[-print_every:])
        print(f'Episode {i_episode}\tAverage Score: {avg_score:.2f}\tEpsilon: {agent.epsilon:.2f}')

# Close environment
env.close()
```

Key Implementation Notes

  • One-Hot State Encoding: The Cliff Walking environment uses discrete states (0–47). We convert these to one-hot vectors to feed into the neural network.
  • Next Action Selection: The agent selects next_action before the learn step—this is the defining feature of SARSA.
  • Epsilon Decay: Over time, the agent reduces exploration (lower \(\epsilon\)) and converges to the optimal policy.
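
To check the behavior discussed in Section III, one can roll out the trained network's greedy policy (no exploration) and inspect which grid cells it visits. The sketch below reuses agent, one_hot, and state_size from the code above; it is illustrative and not part of the original post.

```python
# Minimal sketch: roll out the trained agent's greedy policy and print the visited states.
import gymnasium as gym
import torch

eval_env = gym.make('CliffWalking-v0')
state, _ = eval_env.reset()
path = [state]
done = False
while not done and len(path) < 100:
    state_vec = torch.from_numpy(one_hot(state, state_size)).float().unsqueeze(0)
    with torch.no_grad():
        action = int(agent.qnetwork(state_vec).argmax(dim=1).item())  # greedy, no epsilon
    state, reward, terminated, truncated, _ = eval_env.step(action)
    done = terminated or truncated
    path.append(state)
eval_env.close()
print("Greedy path (state index s -> row s // 12, column s % 12):", path)
```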

V. SARSA Variants

1. Expected SARSA

Expected SARSA is a variant that reduces the variance of SARSA updates by using the expected value of the next Q-value instead of the sampled next action \(a'\). The update rule is:

\(Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \mathbb{E}_{\pi}[Q(s',a')] - Q(s,a) \right]\)

Where \(\mathbb{E}_{\pi}[Q(s',a')] = \sum_{a'} \pi(a'|s') Q(s',a')\) is the expected Q-value under the behavior policy \(\pi\). Expected SARSA is more stable than standard SARSA but requires computing the expectation over all actions.
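
As a concrete illustration, the sketch below computes the Expected SARSA target for an \(\epsilon\)-greedy behavior policy over a NumPy Q-table (all names are assumptions for this sketch):

```python
# Illustrative sketch: Expected SARSA target under an epsilon-greedy behavior policy.
import numpy as np

def expected_sarsa_target(Q, r, s_next, epsilon, gamma, done):
    if done:
        return r
    n_actions = Q.shape[1]
    # epsilon-greedy probabilities: epsilon spread uniformly, the rest on the greedy action
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    return r + gamma * float(np.dot(probs, Q[s_next]))

# Example usage on a 48x4 Q-table (Cliff Walking sized)
Q = np.zeros((48, 4))
target = expected_sarsa_target(Q, r=-1.0, s_next=24, epsilon=0.1, gamma=0.99, done=False)
```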

2. Double SARSA

Double SARSA reduces estimation bias by maintaining two separate Q-estimates, \(Q_1\) and \(Q_2\):

  • The behavior policy (e.g., \(\epsilon\)-greedy) selects actions based on both estimates, for example their average.
  • Each update evaluates the next state-action pair with one estimate, \(Q_2(s',a')\), while updating the other, \(Q_1(s,a)\) (and vice versa, chosen at random), with \(a'\) still selected by the behavior policy.

Decoupling the estimate being updated from the estimate used to evaluate the next action reduces the tendency to over- or under-estimate Q-values, leading to more robust learning.
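
A minimal sketch of one Double SARSA update over two NumPy Q-tables, following the description above (names are assumptions; note that \(a'\) is still chosen by the behavior policy):

```python
# Illustrative sketch: one Double SARSA update with two Q-tables Q1 and Q2.
import numpy as np

def double_sarsa_update(Q1, Q2, s, a, r, s_next, a_next, alpha, gamma, done, rng):
    if rng.random() < 0.5:
        # Update Q1, but evaluate the next state-action pair with Q2
        target = r if done else r + gamma * Q2[s_next, a_next]
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        # Update Q2, but evaluate the next state-action pair with Q1
        target = r if done else r + gamma * Q1[s_next, a_next]
        Q2[s, a] += alpha * (target - Q2[s, a])

# Example usage
rng = np.random.default_rng(0)
Q1, Q2 = np.zeros((48, 4)), np.zeros((48, 4))
double_sarsa_update(Q1, Q2, 36, 0, -1.0, 24, 1, alpha=0.1, gamma=0.99, done=False, rng=rng)
```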


VI. Summary

  • SARSA is an on-policy TD learning algorithm that learns the action-value function of the agent’s current behavior policy.
  • Core difference from Q-Learning: SARSA’s update uses the agent’s actual next action, while Q-Learning’s uses the best possible next action.
  • Update rule: Depends on the sequence \(s \to a \to r \to s' \to a'\), where \(a'\) is selected by the behavior policy (e.g., \(\epsilon\)-greedy).
  • Key advantage: More conservative than Q-Learning, making it well suited to safety-critical tasks (e.g., robot navigation, cliff walking).
  • Variants: Expected SARSA (reduces update variance) and Double SARSA (reduces estimation bias).


