SARSA is an on-policy temporal difference (TD) learning algorithm for reinforcement learning (RL). It estimates the action-value function \(q_\pi(s,a)\) of the policy the agent is currently following and, as that policy is made progressively greedier, can approach the optimal action-value function \(q_*(s,a)\) and the optimal policy \(\pi_*\). The name SARSA comes from the sequence of events that drive its update rule:
State (\(s\)) → Action (\(a\)) → Reward (\(r\)) → Next State (\(s'\)) → Next Action (\(a'\))
Unlike Q-Learning (an off-policy algorithm), SARSA learns the value of the policy that the agent is currently following (the behavior policy). This makes SARSA more conservative in exploration, which is often advantageous for safety-critical tasks (e.g., robot navigation, where avoiding hazards is prioritized over maximizing reward).
I. Core Principles of SARSA
1. On-Policy vs. Off-Policy Learning
A key distinction between SARSA and Q-Learning lies in their approach to policy learning:
| Property | SARSA (On-Policy) | Q-Learning (Off-Policy) |
|---|---|---|
| Target Policy | Learns the value of the behavior policy (the policy the agent uses to act). | Learns the value of the optimal policy (independent of the behavior policy). |
| Update Target | Uses the next action \(a'\) actually taken by the behavior policy in \(s'\). | Uses the maximizing action \(a' = \arg\max_{a} Q(s',a)\) (ignores the behavior policy). |
| Effect of Exploration | Conservative: exploratory actions enter the update target, so actions that are risky under the behavior policy receive lower values. | Aggressive: the target assumes greedy behavior from \(s'\) onward, so the learned values ignore the cost of exploratory mistakes. |
| Use Case | Safety-critical tasks (e.g., robot obstacle avoidance). | Tasks where reward maximization is prioritized (e.g., game playing). |
2. The SARSA Update Rule
SARSA updates the action-value function \(Q(s,a)\) using the TD error, which is the difference between the TD target \(r + \gamma Q(s',a')\) and the current estimate \(Q(s,a)\). The update rule is:
\(Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s',a') - Q(s,a) \right]\)
Where:
- \(\alpha \in [0,1]\): Learning rate (controls the magnitude of the update; smaller values give slower but more stable learning).
- \(\gamma \in [0,1]\): Discount factor (weights future rewards relative to immediate rewards).
- \(r\): Reward received after taking action \(a\) in state \(s\).
- \(s'\): The next state reached after taking \(a\).
- \(a'\): The next action selected by the behavior policy (e.g., \(\epsilon\)-greedy) in state \(s'\).
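The update rule can be sketched for the tabular case as follows. This is a minimal illustration, not a reference implementation: the function name `sarsa_update`, the dictionary-based table, and the `done` flag (which drops the bootstrap term when \(s'\) is terminal) are assumptions introduced for the example.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular SARSA update to Q in place and return the TD error."""
    # No bootstrapping from a terminal next state (its value is defined as 0)
    bootstrap = 0.0 if done else Q[(s_next, a_next)]
    td_target = r + gamma * bootstrap
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

# Hypothetical two-state example: Q(s1, right) is already 2.0, everything else 0
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
delta = sarsa_update(Q, "s0", "right", -1.0, "s1", "right", alpha=0.5, gamma=0.9)
# TD target = -1 + 0.9 * 2.0 = 0.8, so Q(s0, right) moves halfway from 0 to 0.8
```

With \(\alpha = 0.5\) the estimate moves half of the TD error toward the target, which is why larger learning rates react faster but are noisier.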
3. Behavior Policy: \(\epsilon\)-Greedy Exploration
SARSA typically uses an \(\epsilon\)-greedy policy as its behavior policy to balance exploration and exploitation:
- With probability \(1-\epsilon\): Exploit—select the action with the highest Q-value: \(a = \arg\max_{a} Q(s,a)\).
- With probability \(\epsilon\): Explore—select a random action from the action space.
Over time, \(\epsilon\) is often decayed (e.g., \(\epsilon = \epsilon_{\text{start}} \times \epsilon_{\text{decay}}^t\)) to reduce exploration and converge to the optimal policy.
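A small sketch of \(\epsilon\)-greedy selection and the exponential decay schedule is shown here. The helper names, the list-of-Q-values interface, and the lower bound `eps_min` are assumptions made for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values with epsilon-greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q-value (first one wins on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(t, eps_start=1.0, eps_decay=0.995, eps_min=0.01):
    """Exponential schedule eps_start * eps_decay**t, clipped at eps_min."""
    return max(eps_min, eps_start * eps_decay ** t)

greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)  # always exploits
eps_after_100 = decayed_epsilon(100)
```

Clipping at a small positive floor keeps a little exploration alive; whether to decay per step or per episode is a design choice that depends on episode length.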
II. SARSA Algorithm Steps
The full SARSA algorithm for discrete state and action spaces is as follows:
- Initialize: Set the action-value function \(Q(s,a)\) to arbitrary values (e.g., 0 for all \(s,a\)). Set \(Q(\text{terminal state}, a) = 0\) for all \(a\).
- For each episode:
  a. Initialize state: \(s \leftarrow \text{initial state of the environment}\).
  b. Select action: Choose \(a\) from \(s\) using the \(\epsilon\)-greedy policy based on \(Q\).
  c. For each step of the episode:
     i. Take action \(a\): Observe reward \(r\) and next state \(s'\).
     ii. Select next action: Choose \(a'\) from \(s'\) using the \(\epsilon\)-greedy policy based on \(Q\).
     iii. Update Q-value: Apply the SARSA update rule to \(Q(s,a)\).
     iv. Update state and action: Set \(s \leftarrow s'\), \(a \leftarrow a'\).
     v. Terminate: If \(s\) is a terminal state, end the episode.
- Repeat: Continue until the policy converges to the optimal policy \(\pi_*\).
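The steps above can be sketched end to end on a toy problem. The five-state corridor, its reward of -1 per step, the hyperparameters, and all function names are hypothetical choices for illustration; only the loop structure follows the algorithm as listed.

```python
import random

N_STATES, GOAL = 5, 4   # corridor of states 0..4, goal at the right end (toy task)
ACTIONS = (0, 1)        # 0 = left, 1 = right

def step(s, a):
    """Deterministic corridor dynamics: reward -1 per step, episode ends at GOAL."""
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s_next, -1.0, s_next == GOAL

def choose(Q, s, epsilon, rng):
    """Epsilon-greedy action selection from the current table."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[s][a])

def train_sarsa(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=200, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]      # Q[terminal][a] stays 0
    for _ in range(episodes):
        s = 0                                       # a. initialize state
        a = choose(Q, s, epsilon, rng)              # b. select first action
        for _ in range(max_steps):                  # c. step loop
            s_next, r, done = step(s, a)            # i. act, observe r and s'
            a_next = choose(Q, s_next, epsilon, rng)  # ii. select a'
            target = r + (0.0 if done else gamma * Q[s_next][a_next])
            Q[s][a] += alpha * (target - Q[s][a])   # iii. SARSA update
            s, a = s_next, a_next                   # iv. shift state and action
            if done:                                # v. stop at the terminal state
                break
    return Q

Q = train_sarsa()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

After training, the greedy action in every non-terminal state should be "right", since that is the shortest route to the goal in this corridor.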
Key Difference from Q-Learning
The critical difference is in step c.ii:
- SARSA selects \(a'\) using the behavior policy (e.g., \(\epsilon\)-greedy).
- Q-Learning skips selecting \(a'\) and directly uses \(\max_{a'} Q(s',a')\) as the target.
This means SARSA’s update depends on the agent’s actual next action, while Q-Learning’s update depends on the best possible next action.
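To make the contrast concrete, the sketch below computes both targets for the same transition. The state name `"edge"` and its Q-values are invented numbers chosen so that one action is clearly dangerous.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """TD target using the action the behavior policy actually selected."""
    return r + gamma * Q[s_next][a_next]

def q_learning_target(Q, r, s_next, gamma):
    """TD target using the best available action, whatever the agent does next."""
    return r + gamma * max(Q[s_next])

# Hypothetical values: in state "edge", action 0 falls off a cliff, action 1 is safe
Q = {"edge": [-100.0, -5.0]}
gamma, r = 1.0, -1.0

sarsa_if_explores = sarsa_target(Q, r, "edge", a_next=0, gamma=gamma)  # -101.0
sarsa_if_greedy = sarsa_target(Q, r, "edge", a_next=1, gamma=gamma)    # -6.0
q_learning = q_learning_target(Q, r, "edge", gamma)                    # -6.0
```

Whenever the behavior policy happens to explore into the bad action, the SARSA target drops sharply, while the Q-Learning target is unaffected. Averaged over many updates, this is what pulls SARSA's estimates down for states where exploration is costly.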
III. Example: SARSA for Cliff Walking
The Cliff Walking task is a classic RL problem that highlights the difference between SARSA and Q-Learning. The environment is a 4×12 grid:
- The agent starts at the bottom-left corner (S).
- The goal is to reach the bottom-right corner (G).
- A “cliff” (marked by X) runs along the bottom row between S and G. Stepping into the cliff results in a large negative reward (-100) and reset to the start.
- Each step (up/down/left/right) gives a reward of -1 (to encourage the shortest path).
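The environment rules listed above can be written as a small transition function. This is a sketch under stated assumptions: the coordinate convention `(row, col)` with row 3 as the bottom row, the action numbering (0 up, 1 right, 2 down, 3 left), and the choice that moves off the grid leave the agent in place are all assumptions of this example.

```python
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # up, right, down, left

def cliff_step(pos, action):
    """Return (next_pos, reward, done) for one move on the 4x12 cliff grid."""
    d_row, d_col = MOVES[action]
    # Clamp to the grid so that moving into a wall leaves the agent where it is
    row = min(max(pos[0] + d_row, 0), ROWS - 1)
    col = min(max(pos[1] + d_col, 0), COLS - 1)
    if row == ROWS - 1 and 0 < col < COLS - 1:
        # Cliff cell: large penalty and reset to the start, episode continues
        return START, -100.0, False
    return (row, col), -1.0, (row, col) == GOAL

into_cliff = cliff_step(START, 1)     # stepping right from S falls into the cliff
reach_goal = cliff_step((2, 11), 2)   # stepping down from above G reaches the goal
```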
SARSA vs. Q-Learning in Cliff Walking
| Algorithm | Learned Path | Rationale |
|---|---|---|
| SARSA | Takes a safe path around the top of the grid (avoids the cliff entirely). | SARSA is on-policy: during exploration, it learns that stepping near the cliff is risky, because the behavior policy might select a random action into the cliff. The resulting low targets discourage actions that could lead to danger. |
| Q-Learning | Takes a risky path along the edge of the cliff (shortest path to G). | Q-Learning is off-policy: its target ignores the risk of random exploration and optimizes for the shortest path, even if that means walking next to the cliff. |
This example demonstrates why SARSA is often preferred for safety-critical tasks: because its value estimates include the cost of exploratory missteps, it accepts a longer path in exchange for staying away from hazards while it is still exploring.
IV. SARSA Implementation (Python with PyTorch: Cliff Walking)
Below is a practical implementation of SARSA for the Cliff Walking environment. We use a neural network to approximate the Q-value function (Deep SARSA) for scalability to high-dimensional state spaces.
1. Import Libraries
```python
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
```
2. Define Q-Network (Deep SARSA Approximator)
```python
class QNetwork(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(QNetwork, self).__init__()
        # Fully connected layers to approximate Q(s,a)
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)  # Output Q-values for all actions
```
3. Define Deep SARSA Agent
```python
class SARSAAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        # Hyperparameters
        self.gamma = 0.99           # Discount factor
        self.alpha = 0.001          # Learning rate
        self.epsilon = 1.0          # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995  # Multiplicative decay, applied once per learn() call
        # Q-Network and optimizer
        self.qnetwork = QNetwork(state_size, action_size)
        self.optimizer = optim.Adam(self.qnetwork.parameters(), lr=self.alpha)

    def act(self, state):
        """Select action using epsilon-greedy policy (behavior policy)"""
        if random.random() > self.epsilon:
            # Exploit: select action with max Q-value
            state = torch.from_numpy(state).float().unsqueeze(0)
            self.qnetwork.eval()
            with torch.no_grad():
                action_values = self.qnetwork(state)
            self.qnetwork.train()
            return int(torch.argmax(action_values, dim=1).item())
        else:
            # Explore: select random action
            return random.randrange(self.action_size)

    def learn(self, state, action, reward, next_state, next_action, done):
        """Update Q-network using SARSA update rule"""
        # Convert to tensors (batch of size 1, shapes [1, n] or [1, 1])
        state = torch.from_numpy(state).float().unsqueeze(0)
        next_state = torch.from_numpy(next_state).float().unsqueeze(0)
        action = torch.tensor([[action]], dtype=torch.long)
        next_action = torch.tensor([[next_action]], dtype=torch.long)
        reward = torch.tensor([[reward]], dtype=torch.float)
        done = torch.tensor([[float(done)]], dtype=torch.float)
        # Current Q-value: Q(s,a)
        current_q = self.qnetwork(state).gather(1, action)
        # Target Q-value: r + gamma * Q(s',a') (SARSA target).
        # Computed without gradients so the target is treated as a constant.
        with torch.no_grad():
            next_q = self.qnetwork(next_state).gather(1, next_action)
            target_q = reward + self.gamma * next_q * (1 - done)
        # Compute loss (MSE between current Q and target Q)
        loss = nn.MSELoss()(current_q, target_q)
        # Backpropagation and optimization
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # Decay epsilon
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
```
4. Train the Agent on Cliff Walking
```python
# Create Cliff Walking environment (4x12 grid)
env = gym.make('CliffWalking-v0')
state_size = int(env.observation_space.n)  # 48 states (4x12 grid)
action_size = int(env.action_space.n)      # 4 actions (0: up, 1: right, 2: down, 3: left)

# Convert discrete state to one-hot vector (required for Q-network input)
def one_hot(state, state_size):
    vec = np.zeros(state_size, dtype=np.float32)
    vec[state] = 1.0
    return vec

# Initialize agent
agent = SARSAAgent(state_size, action_size)

# Training parameters
n_episodes = 500
max_t = 1000      # Max steps per episode
print_every = 50
scores = []       # Store cumulative reward per episode

for i_episode in range(1, n_episodes + 1):
    state, _ = env.reset()
    state = one_hot(state, state_size)
    score = 0
    action = agent.act(state)  # Select initial action
    for t in range(max_t):
        # Take action and observe next state/reward
        next_state, reward, terminated, truncated, _ = env.step(action)
        next_state = one_hot(next_state, state_size)
        score += reward
        # Select next action (critical for SARSA)
        next_action = agent.act(next_state)
        # Learn from experience (s, a, r, s', a'); only true termination
        # (not a time-limit truncation) switches off bootstrapping
        agent.learn(state, action, reward, next_state, next_action, terminated)
        # Update state and action for next step
        state = next_state
        action = next_action
        if terminated or truncated:
            break
    scores.append(score)
    # Print progress
    if i_episode % print_every == 0:
        avg_score = np.mean(scores[-print_every:])
        print(f'Episode {i_episode}\tAverage Score: {avg_score:.2f}\tEpsilon: {agent.epsilon:.2f}')

# Close environment
env.close()
```
Key Implementation Notes
- One-Hot State Encoding: The Cliff Walking environment uses discrete states (0–47). We convert these to one-hot vectors to feed into the neural network.
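- Next Action Selection: The agent selects next_action before the learn step; this is the defining feature of SARSA.
- Epsilon Decay: Over time, the agent reduces exploration (lower \(\epsilon\)) and converges toward a greedy policy. Because the decay factor of 0.995 is applied on every learning step, \(\epsilon\) falls from 1.0 to its floor of 0.01 after roughly 920 steps, which is early in training.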
V. SARSA Variants
1. Expected SARSA
Expected SARSA is a variant that reduces the variance of SARSA updates by using the expected value of the next Q-value instead of the sampled next action \(a'\). The update rule is:
\(Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \mathbb{E}_{\pi}[Q(s',a')] - Q(s,a) \right]\)
Where \(\mathbb{E}_{\pi}[Q(s',a')] = \sum_{a'} \pi(a'|s') Q(s',a')\) is the expected Q-value under the behavior policy \(\pi\). Expected SARSA is more stable than standard SARSA but requires computing the expectation over all actions.
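For an \(\epsilon\)-greedy behavior policy the expectation has a simple closed form, sketched below. The function names and the example numbers are assumptions for illustration.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities pi(a|s) of an epsilon-greedy policy over q_values."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n       # every action gets the exploration share
    probs[best] += 1.0 - epsilon    # the greedy action gets the remainder
    return probs

def expected_sarsa_target(q_next, r, gamma, epsilon):
    """TD target r + gamma * sum_a' pi(a'|s') Q(s',a')."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    return r + gamma * sum(p * q for p, q in zip(probs, q_next))

# Two actions with Q(s', .) = [1.0, 3.0] and epsilon = 0.2 give pi = [0.1, 0.9]
target = expected_sarsa_target([1.0, 3.0], r=-1.0, gamma=0.5, epsilon=0.2)
# Expectation = 0.1 * 1.0 + 0.9 * 3.0 = 2.8, so target = -1 + 0.5 * 2.8 = 0.4
```

The target no longer depends on which \(a'\) happened to be sampled, which is where the variance reduction comes from.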
2. Double SARSA
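Double SARSA carries the double-estimator idea from Double Q-learning over to SARSA. It targets the overestimation bias that arises when the same value estimates are used both to choose and to evaluate the next action, and it does so by maintaining two separate Q-networks:
- One network selects the next action, e.g., \(a' = \arg\max_{a} Q_1(s',a)\).
- The other network evaluates the Q-value of that action: \(Q_2(s',a')\).
This reduces the tendency of the algorithm to overestimate Q-values, leading to more robust learning. Note that a purely greedy selection, as written above, is the Double Q-learning form; to remain on-policy, \(a'\) would instead be drawn from the behavior policy (e.g., \(\epsilon\)-greedy) and then evaluated with the other estimator.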
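The select-with-one, evaluate-with-the-other split can be sketched in tabular form. This follows the greedy selection described in the two bullets above and is only an illustration: the function name, the dictionary tables, and the numbers are assumptions, and a full implementation would also alternate which estimator gets updated.

```python
def double_estimator_update(Q1, Q2, s, a, r, s_next, alpha, gamma, done=False):
    """Update Q1[s][a] in place: Q1 selects a', Q2 evaluates it. Returns the target."""
    if done:
        target = r
    else:
        # Selection uses Q1 ...
        a_next = max(range(len(Q1[s_next])), key=lambda b: Q1[s_next][b])
        # ... evaluation uses Q2, which decouples the two sources of error
        target = r + gamma * Q2[s_next][a_next]
    Q1[s][a] += alpha * (target - Q1[s][a])
    return target

# Hypothetical tables: Q1 is over-optimistic about action 0 in state 1
Q1 = {0: [0.0, 0.0], 1: [5.0, 1.0]}
Q2 = {0: [0.0, 0.0], 1: [2.0, 4.0]}
target = double_estimator_update(Q1, Q2, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
# Q1 picks a' = 0, Q2 values it at 2.0, so target = 1 + 0.9 * 2.0 = 2.8
single_estimator_target = 1.0 + 0.9 * max(Q1[1])  # 5.5 if Q1 also did the evaluation
```

In this example the single-estimator target inherits Q1's optimistic value of 5.0, while the two-estimator target does not.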
VI. Summary
- Definition: SARSA is an on-policy TD learning algorithm that learns the action-value function of the agent’s current behavior policy.
- Update Rule: Depends on the sequence \(s \to a \to r \to s' \to a'\), where \(a'\) is selected by the behavior policy (e.g., \(\epsilon\)-greedy).
- Core Difference from Q-Learning: SARSA uses the agent’s actual next action, while Q-Learning uses the best possible next action.
- Key Advantage: More conservative than Q-Learning, making it well suited to safety-critical tasks (e.g., robot navigation, cliff walking).
- Variants: Expected SARSA (reduces variance) and Double SARSA (reduces overestimation bias).