Reinforcement Learning for Robotics
Overview
Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. In robotics, RL offers a powerful approach to learning complex behaviors that are difficult to program explicitly, such as walking gaits, manipulation skills, and adaptive behaviors. This chapter explores how RL can be applied to create intelligent, adaptive robot control systems.
Fundamentals of Reinforcement Learning
Core Components
A reinforcement learning system consists of:
- Agent: The learning entity (the robot)
- Environment: The world the agent interacts with
- State (s): Complete description of the environment
- Action (a): What the agent can do
- Reward (r): Feedback signal indicating success
- Policy (π): Strategy for selecting actions
- Value Function (V): Expected future rewards
- Model: Agent's representation of the environment
The RL Loop
State s_t → Agent (policy π) → Action a_t → Environment → Reward r_t and Next State s_{t+1} → back to Agent
- Agent observes environment state
- Agent selects action based on policy
- Action is executed in environment
- Environment transitions to new state
- Agent receives reward signal
- Process repeats (a code sketch of this loop follows)
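The loop above maps directly onto code. Here is a minimal sketch using the Gymnasium API with a placeholder random policy (any registered environment id would work in place of Pendulum-v1):

import gymnasium as gym

env = gym.make("Pendulum-v1")
state, info = env.reset(seed=0)

for _ in range(1000):
    action = env.action_space.sample()  # placeholder for the agent's policy
    next_state, reward, terminated, truncated, info = env.step(action)
    state = next_state
    if terminated or truncated:  # episode ended: start a new one
        state, info = env.reset()
env.close()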
Markov Decision Process (MDP)
RL problems are often formulated as MDPs with the tuple (S, A, P, R, γ); a toy example follows the definitions:
- S: Set of states
- A: Set of actions
- P: State transition probabilities
- R: Reward function
- γ: Discount factor (0 ≤ γ ≤ 1)
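To make the tuple concrete, here is a toy MDP written out as plain data structures (a hypothetical robot that can dock with a charger; the states, actions, and numbers are illustrative only):

# S and A
states = ["undocked", "docked"]
actions = ["move", "stay"]

# P[s][a] -> list of (next_state, probability)
P = {
    "undocked": {"move": [("docked", 0.8), ("undocked", 0.2)],
                 "stay": [("undocked", 1.0)]},
    "docked":   {"move": [("undocked", 1.0)],
                 "stay": [("docked", 1.0)]},
}

# R[s][a] -> immediate reward (illustrative values)
R = {
    "undocked": {"move": -1.0, "stay": -0.1},
    "docked":   {"move": -1.0, "stay": 1.0},
}

gamma = 0.95  # discount factor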
RL Algorithms for Robotics
Value-Based Methods
Q-Learning
Q-Learning learns an action-value function Q(s,a) that estimates the expected return from taking action a in state s and acting greedily thereafter.
# Q-Learning update rule
Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
Where:
- α is the learning rate
- γ is the discount factor
- s' is the next state (a tabular sketch of this update follows)
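A minimal tabular implementation of this update, assuming discrete, hashable states and integer actions (the function name and example call are illustrative):

import numpy as np
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update; Q maps each state to an array of action values."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

n_actions = 4  # assumed size of the discrete action set
Q = defaultdict(lambda: np.zeros(n_actions))
q_learning_step(Q, s=(0, 0), a=1, r=-1.0, s_next=(0, 1))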
Deep Q-Networks (DQN)
DQN uses a neural network to approximate the Q-function, allowing it to handle high-dimensional state spaces.
import torch
import torch.nn as nn
import numpy as np

class DQN(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Example usage for robot control: epsilon-greedy selection over
# a discrete action set indexed 0..action_size-1
def select_action(state, policy_net, epsilon, action_size):
    if np.random.random() < epsilon:
        # Explore: random action
        return np.random.randint(action_size)
    else:
        # Exploit: best known action
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = policy_net(state_tensor)
            return q_values.max(1)[1].item()
Policy-Based Methods
Policy Gradient
Policy gradient methods directly optimize the policy function π(a|s).
# REINFORCE algorithm (calculate_returns, which computes the discounted
# return G_t for each timestep, is assumed defined elsewhere)
def reinforce_update(trajectories, policy_network, optimizer):
    for trajectory in trajectories:
        states, actions, rewards = trajectory

        # Discounted return G_t for every timestep
        returns = calculate_returns(rewards)

        # Policy gradient loss: -E[log pi(a|s) * G_t]
        log_probs = policy_network.get_log_probs(states, actions)
        loss = -torch.mean(log_probs * returns)

        # Update policy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Actor-Critic Methods
Actor-critic methods combine value-based and policy-based approaches, pairing a policy network (the actor) with a value estimator (the critic), often on top of shared features.
class ActorCritic(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        # Shared feature extractor
        self.feature_extractor = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU()
        )
        # Actor (policy head)
        self.actor = nn.Linear(hidden_size, action_size)
        # Critic (value head)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, state):
        features = self.feature_extractor(state)
        action_probs = torch.softmax(self.actor(features), dim=-1)
        state_value = self.critic(features)
        return action_probs, state_value

# Advantage Actor-Critic (A2C) update
def a2c_update(trajectories, actor_critic, optimizer):
    for trajectory in trajectories:
        states, actions, rewards = trajectory

        # Single forward pass through the shared network
        action_probs, state_values = actor_critic(states)

        # Advantages: discounted returns minus the critic's value estimates
        returns = calculate_returns(rewards)
        advantages = returns - state_values.squeeze(-1)

        # Actor loss uses detached advantages; critic regresses onto the returns
        log_probs = torch.log(action_probs.gather(1, actions.unsqueeze(1))).squeeze(-1)
        actor_loss = -(log_probs * advantages.detach()).mean()
        critic_loss = advantages.pow(2).mean()

        # Update both heads with a single optimizer step
        optimizer.zero_grad()
        (actor_loss + 0.5 * critic_loss).backward()
        optimizer.step()
Advanced RL Algorithms
Deep Deterministic Policy Gradient (DDPG)
DDPG is an off-policy actor-critic algorithm built for the continuous action spaces common in robotics.
import numpy as np
import torch
import torch.nn.functional as F
import torch.optim as optim

# Hyperparameters (typical values; tune per task)
BATCH_SIZE = 128
GAMMA = 0.99          # discount factor
TAU = 1e-3            # soft update coefficient
UPDATE_EVERY = 20     # learn every N environment steps
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class DDPGAgent:
    """Sketch of a DDPG agent. Actor, Critic, ReplayBuffer, and OUNoise
    are assumed to be defined elsewhere."""

    def __init__(self, state_size, action_size, random_seed=0):
        self.state_size = state_size
        self.action_size = action_size
        self.seed = torch.manual_seed(random_seed)
        self.timestep = 0

        # Actor network (with target network)
        self.actor_local = Actor(state_size, action_size, random_seed).to(device)
        self.actor_target = Actor(state_size, action_size, random_seed).to(device)
        self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=1e-4)

        # Critic network (with target network)
        self.critic_local = Critic(state_size, action_size, random_seed).to(device)
        self.critic_target = Critic(state_size, action_size, random_seed).to(device)
        self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=1e-3)

        # Replay buffer and exploration noise (assumed classes)
        self.memory = ReplayBuffer(int(1e6), BATCH_SIZE)
        self.noise = OUNoise(action_size, random_seed)

    def step(self, state, action, reward, next_state, done):
        # Save experience in replay buffer
        self.memory.add(state, action, reward, next_state, done)

        # Learn every UPDATE_EVERY time steps, once enough samples are stored
        self.timestep += 1
        if self.timestep % UPDATE_EVERY == 0 and len(self.memory) > BATCH_SIZE:
            experiences = self.memory.sample()
            self.learn(experiences, GAMMA)

    def act(self, state, add_noise=True):
        """Returns actions for given state as per current policy."""
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.actor_local.eval()
        with torch.no_grad():
            action = self.actor_local(state).cpu().data.numpy()
        self.actor_local.train()
        if add_noise:
            action += self.noise.sample()
        return np.clip(action, -1, 1)

    def learn(self, experiences, gamma):
        """Update policy and value parameters using a batch of experience tuples."""
        states, actions, rewards, next_states, dones = experiences

        # ---------------------------- update critic ---------------------------- #
        # Get predicted next-state actions and Q values from target models
        actions_next = self.actor_target(next_states)
        Q_targets_next = self.critic_target(next_states, actions_next)
        # Compute Q targets for current states (y_i)
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))
        # Compute critic loss
        Q_expected = self.critic_local(states, actions)
        critic_loss = F.mse_loss(Q_expected, Q_targets)
        # Minimize the loss
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # ---------------------------- update actor ----------------------------- #
        # Actor loss: maximize Q by minimizing its negation
        actions_pred = self.actor_local(states)
        actor_loss = -self.critic_local(states, actions_pred).mean()
        # Minimize the loss
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # ----------------------- update target networks ------------------------ #
        self.soft_update(self.critic_local, self.critic_target, TAU)
        self.soft_update(self.actor_local, self.actor_target, TAU)

    def soft_update(self, local_model, target_model, tau):
        """Polyak averaging: theta_target = tau*theta_local + (1-tau)*theta_target."""
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
Twin Delayed DDPG (TD3)
TD3 improves on DDPG with twin critics (taking the minimum of two Q-estimates curbs overestimation), delayed actor updates, and target policy smoothing.
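A minimal sketch of TD3's target computation, assuming target networks actor_target, critic1_target, and critic2_target plus hyperparameters policy_noise and noise_clip are defined, with the same batch variables as in the DDPG example above:

with torch.no_grad():
    # Target policy smoothing: perturb the target action with clipped noise
    next_actions = actor_target(next_states)
    noise = (torch.randn_like(next_actions) * policy_noise).clamp(-noise_clip, noise_clip)
    next_actions = (next_actions + noise).clamp(-1.0, 1.0)

    # Clipped double-Q: take the minimum of the twin target critics
    q_next = torch.min(critic1_target(next_states, next_actions),
                       critic2_target(next_states, next_actions))
    q_targets = rewards + gamma * (1 - dones) * q_next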
Soft Actor-Critic (SAC)
SAC is an off-policy maximum-entropy algorithm: it maximizes reward plus a policy-entropy bonus, which encourages exploration and typically improves sample efficiency and robustness.
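A sketch of SAC's entropy-augmented critic target, assuming a stochastic actor whose sample() returns (action, log_prob), twin target critics, and an entropy temperature alpha:

with torch.no_grad():
    next_actions, next_log_probs = actor.sample(next_states)
    q_next = torch.min(critic1_target(next_states, next_actions),
                       critic2_target(next_states, next_actions))
    # Subtracting alpha * log_prob rewards keeping the policy stochastic
    q_targets = rewards + gamma * (1 - dones) * (q_next - alpha * next_log_probs)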
RL for Humanoid Robotics
Unique Challenges
High-Dimensional Action Spaces
Humanoid robots have 20+ joints, creating complex action spaces:
- Walking: 6+ leg joints coordinated for balance
- Manipulation: 7+ arm joints for dexterous tasks
- Whole-body control: All joints simultaneously
Continuous Control
Most robot tasks require continuous control signals rather than discrete actions (an action-space sketch follows this list):
- Joint position targets
- Torque commands
- Velocity profiles
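As a concrete illustration, a continuous torque action space for a hypothetical 7-joint arm can be declared as a Gymnasium Box (the joint count and torque limit are assumptions):

import numpy as np
from gymnasium import spaces

NUM_JOINTS = 7        # hypothetical arm
TORQUE_LIMIT = 40.0   # assumed per-joint torque limit (N·m)

action_space = spaces.Box(low=-TORQUE_LIMIT, high=TORQUE_LIMIT,
                          shape=(NUM_JOINTS,), dtype=np.float32)
action = action_space.sample()  # one continuous torque command per joint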
Safety Constraints
RL training must maintain robot safety:
- Joint limits
- Collision avoidance
- Balance maintenance
Real-Time Requirements
Robot control often requires high-frequency updates (100-1000 Hz).
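As a rough sketch only (Python is rarely used for hard real-time control; production loops typically run in C++ or on a real-time OS), a fixed-rate control loop might look like this, with read_sensors, policy, and send_commands as hypothetical helpers:

import time

CONTROL_HZ = 200      # assumed control rate
DT = 1.0 / CONTROL_HZ

next_tick = time.perf_counter()
while True:
    state = read_sensors()    # hypothetical sensor interface
    action = policy(state)    # inference must finish well within DT
    send_commands(action)     # hypothetical actuator interface

    # Sleep until the next tick to hold the fixed control rate
    next_tick += DT
    time.sleep(max(0.0, next_tick - time.perf_counter()))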
Applications in Humanoid Robotics
Locomotion Learning
- Walking: Learning stable walking gaits
- Running: Dynamic running patterns
- Climbing: Stair and obstacle navigation
- Balance: Recovery from disturbances
Manipulation Learning
- Grasping: Learning to grasp various objects
- Tool use: Using tools for specific tasks
- Bimanual tasks: Coordinated two-handed operations
Human-Robot Interaction
- Social behaviors: Appropriate social responses
- Collaboration: Working with humans safely
- Adaptation: Adjusting to human preferences
Simulation-to-Real Transfer
Domain Randomization
Training in simulation with randomized parameters to improve real-world transfer:
class DomainRandomizationEnv:
    """Wrapper that re-randomizes physics parameters on every reset.
    Assumes the wrapped simulator exposes set_robot_mass, set_surface_friction,
    and set_gravity hooks."""

    def __init__(self, env):
        self.env = env
        # Randomization ranges for physics parameters
        self.mass_range = (0.8, 1.2)        # ±20% mass variation (scale factor)
        self.friction_range = (0.5, 1.5)    # friction coefficient variation
        self.gravity_range = (-11.0, -8.5)  # gravity variation (m/s^2)

    def randomize_environment(self):
        """Sample and apply new physics parameters."""
        random_mass = np.random.uniform(*self.mass_range)
        random_friction = np.random.uniform(*self.friction_range)
        random_gravity = np.random.uniform(*self.gravity_range)
        # Apply randomization to the simulator
        self.env.set_robot_mass(random_mass)
        self.env.set_surface_friction(random_friction)
        self.env.set_gravity(random_gravity)

    def reset(self):
        self.randomize_environment()
        return self.env.reset()
System Identification
Fitting simulation parameters so the simulator reproduces the real robot's dynamics:
from scipy import optimize

def system_identification(robot_sim, robot_real, initial_params):
    """Learn simulation parameters that match real robot behavior.
    collect_robot_data is an assumed helper that records a response trajectory."""
    # Collect reference data from the real robot
    real_data = collect_robot_data(robot_real)

    # Loss: mean squared error between the simulated and real responses
    def parameter_loss(params):
        robot_sim.update_params(params)
        sim_response = robot_sim.get_response()
        return np.mean((sim_response - real_data) ** 2)

    # Use optimization to find the best-fitting parameters
    result = optimize.minimize(parameter_loss, initial_params)
    return result.x
Sim-to-Real Transfer Techniques
- Domain Adaptation: Adapt models to new domains
- Meta-Learning: Learn to learn quickly in new environments
- Imitation Learning: Learn from demonstrations
Sample-Efficient RL for Robotics
Prioritized Experience Replay
Focus learning on important transitions:
import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6, beta_start=0.4):
        self.capacity = capacity
        self.alpha = alpha       # how much prioritization to apply (0 = uniform)
        self.beta = beta_start   # importance-sampling correction strength
        self.buffer = []
        self.priorities = []
        self.position = 0

    def push(self, state, action, reward, next_state, done, priority=None):
        if priority is None:
            # New experiences get the current max priority so they are sampled at least once
            priority = max(self.priorities, default=1.0)
        if len(self.buffer) < self.capacity:
            self.buffer.append((state, action, reward, next_state, done))
            self.priorities.append(priority)
        else:
            self.buffer[self.position] = (state, action, reward, next_state, done)
            self.priorities[self.position] = priority
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        if len(self.buffer) == 0:
            return None

        # Sampling probabilities proportional to priority^alpha
        priorities = np.array(self.priorities)
        probabilities = priorities ** self.alpha
        probabilities /= probabilities.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probabilities)

        # Importance-sampling weights correct the bias introduced by prioritization
        total = len(self.buffer)
        weights = (total * probabilities[indices]) ** (-self.beta)
        weights /= weights.max()
        weights = weights.astype(np.float32)

        samples = [self.buffer[idx] for idx in indices]
        return samples, indices, weights

    def update_priorities(self, indices, td_errors, eps=1e-6):
        """Refresh priorities from the latest TD errors after a learning step."""
        for idx, err in zip(indices, td_errors):
            self.priorities[idx] = abs(err) + eps
Hindsight Experience Replay (HER)
Learn from failed attempts by redefining goals:
def hindsight_experience_replay(episode_transitions, reward_func, threshold=0.0):
    """Relabel transitions with goals actually achieved later in the episode.
    threshold is the reward level that counts as success (assumed task-specific)."""
    her_transitions = []
    for t, transition in enumerate(episode_transitions):
        state, action, reward, next_state, done = transition
        # Keep the original transition
        her_transitions.append((state, action, reward, next_state, done))
        # Add relabeled transitions using states reached later as alternative goals
        for future_transition in episode_transitions[t + 1:]:
            future_state = future_transition[3]  # next_state of the future transition
            # Recompute reward as if future_state had been the goal all along
            her_reward = reward_func(next_state, future_state)
            her_done = her_reward > threshold
            her_transitions.append((state, action, her_reward, next_state, her_done))
    return her_transitions
Reward Engineering for Robotics
Designing Effective Rewards
Sparse vs. Dense Rewards
- Sparse rewards: Only at task completion (hard to learn)
- Dense rewards: Frequent feedback (easier to learn)
# Example reward function for reaching a target
def reach_target_reward(robot_state, target_position, reached_threshold=0.1):
    distance = np.linalg.norm(robot_state.position - target_position)

    # Dense shaping term: negative distance encourages getting closer
    dense_reward = -distance

    # Sparse bonus at completion
    sparse_reward = 1.0 if distance < reached_threshold else 0.0

    # Weighted combination
    return 0.1 * dense_reward + 10.0 * sparse_reward
Shaping Rewards
Provide intermediate rewards to guide learning:
def walking_reward(robot_state, target_velocity):
    """Reward function for bipedal walking."""
    # Forward velocity reward (1.0 when tracking the target speed)
    forward_vel_reward = robot_state.forward_velocity / target_velocity

    # Balance reward: alignment of the torso's up axis with world up
    # (body_orientation is assumed to be the torso up vector in the world frame)
    world_up = np.array([0, 0, 1])
    balance_reward = np.dot(world_up, robot_state.body_orientation)

    # Energy efficiency: penalize large joint torques
    energy_penalty = -np.sum(np.abs(robot_state.joint_torques))

    # Small per-step time penalty to discourage dawdling
    time_penalty = -0.01

    return (0.5 * forward_vel_reward +
            0.3 * balance_reward +
            0.1 * energy_penalty +
            0.1 * time_penalty)
Avoiding Reward Hacking
Prevent agents from exploiting reward functions:
# Bad: rewards the upright pose alone, so the agent can exploit it by
# passing through upright moments without ever stabilizing
def bad_get_up_reward(robot_state):
    return 1.0 if robot_state.is_upright() else 0.0

# Good: rewards being upright AND stable (velocity thresholds are task-specific)
def good_get_up_reward(robot_state, lin_vel_threshold=0.1, ang_vel_threshold=0.1):
    is_upright = robot_state.is_upright()
    is_stable = np.linalg.norm(robot_state.linear_velocity) < lin_vel_threshold
    is_balanced = np.linalg.norm(robot_state.angular_velocity) < ang_vel_threshold
    return 1.0 if (is_upright and is_stable and is_balanced) else 0.0
Deep RL Frameworks for Robotics
Stable Baselines3
Popular library for deep RL with robotics support:
from stable_baselines3 import PPO, SAC, DDPG
from stable_baselines3.common.env_util import make_vec_env

# Create vectorized environment ('RobotEnv-v1' is a placeholder environment id)
env = make_vec_env('RobotEnv-v1', n_envs=4)

# Create and train model
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=100000)

# Save and load model
model.save("ppo_robot")
loaded_model = PPO.load("ppo_robot")
Ray RLlib
Scalable RL library for distributed training:
import ray
from ray import tune
from ray.rllib.algorithms.ppo import PPO

ray.init()

config = {
    "env": "RobotEnv-v1",   # placeholder env id registered with RLlib
    "num_workers": 4,
    "lr": 0.0003,
    "framework": "torch",
}

tune.run(
    PPO,
    stop={"episode_reward_mean": 200},
    config=config,
    checkpoint_freq=1,
    local_dir="~/ray_results",
)
Isaac Gym
NVIDIA's GPU-accelerated RL environment:
from isaacgym import gymapi
from isaacgym import gymtorch
import torch

# Create Isaac Gym simulation with default physics parameters
gym = gymapi.acquire_gym()
params = gymapi.SimParams()
sim = gym.create_sim(0, 0, gymapi.SIM_PHYSX, params)

# Create many parallel environments, each holding one robot
# (num_envs, asset_root, asset_file, and start_pose are user-supplied)
envs = []
for i in range(num_envs):
    env = gym.create_env(sim, gymapi.Vec3(0, 0, 0), gymapi.Vec3(2.5, 2.5, 2.5), 1)
    # Load the robot asset and add an actor instance to this environment
    asset = gym.load_asset(sim, asset_root, asset_file)
    gym.create_actor(env, asset, start_pose, "robot", i, 1, 0)
    envs.append(env)
Safety in RL for Robotics
Safe RL Approaches
Constrained RL
Add safety constraints to the optimization:
def constrained_policy_optimization(safety_threshold, num_episodes, learning_rate=0.01):
    """Optimize the policy subject to E[cost(state, action)] <= safety_threshold.
    Sketch of a Lagrangian relaxation; the policy and helper functions are
    assumed to be defined elsewhere."""
    lagrange_multiplier = 0.0
    for _ in range(num_episodes):
        # Collect a trajectory under the current policy
        trajectory = collect_trajectory(policy)

        # Evaluate reward and safety cost for the trajectory
        reward = calculate_reward(trajectory)
        safety_cost = calculate_safety_cost(trajectory)

        # Lagrangian objective: trade reward off against constraint violation
        constrained_objective = reward - lagrange_multiplier * (safety_cost - safety_threshold)
        update_policy(constrained_objective)

        # Dual ascent: grow the multiplier while the constraint is violated
        if safety_cost > safety_threshold:
            lagrange_multiplier += learning_rate * (safety_cost - safety_threshold)
Shielding
Prevent unsafe actions at runtime:
class SafetyShield:
    def __init__(self, robot_model, safety_constraints):
        self.robot_model = robot_model
        self.safety_constraints = safety_constraints

    def is_safe_action(self, state, action):
        """Check if an action keeps the predicted next state within constraints."""
        next_state = self.robot_model.predict_next_state(state, action)
        return self.safety_constraints.is_satisfied(next_state)

    def safe_action(self, state, proposed_action):
        """Pass a safe action through, or substitute the closest safe action."""
        if self.is_safe_action(state, proposed_action):
            return proposed_action
        # Project the unsafe proposal onto the safe action set
        return self.find_safe_action(state, proposed_action)
Safe Exploration
Methods to explore safely:
- Safe Exploration: Only explore in safe regions
- Learnable Safety: Learn safety constraints during training
- Robust RL: Train with adversarial perturbations
Practical Implementation Tips
Hyperparameter Tuning
from sklearn.model_selection import ParameterGrid

def hyperparameter_search():
    """Grid search over common RL hyperparameters."""
    param_grid = {
        'learning_rate': [1e-4, 1e-3, 1e-2],
        'batch_size': [32, 64, 128],
        'gamma': [0.9, 0.95, 0.99],
        'tau': [0.001, 0.005, 0.01]
    }

    best_score = -float('inf')
    best_params = None
    for params in ParameterGrid(param_grid):
        # train_and_evaluate (assumed defined) trains a policy and returns a scalar score
        score = train_and_evaluate(params)
        if score > best_score:
            best_score = score
            best_params = params
    return best_params
Curriculum Learning
Gradually increase task difficulty:
class Curriculum:
    def __init__(self):
        # 'threshold' is the performance needed to advance (example values)
        self.levels = [
            {'difficulty': 0.1, 'threshold': 0.8, 'tasks': ['balance_training']},
            {'difficulty': 0.3, 'threshold': 0.8, 'tasks': ['simple_walking']},
            {'difficulty': 0.6, 'threshold': 0.7, 'tasks': ['walking_with_obstacles']},
            {'difficulty': 1.0, 'threshold': 0.7, 'tasks': ['complex_navigation']}
        ]
        self.current_level = 0

    def update_level(self, performance):
        """Advance to the next level once performance clears the current threshold."""
        if performance > self.levels[self.current_level]['threshold']:
            if self.current_level < len(self.levels) - 1:
                self.current_level += 1
                self.modify_environment(self.levels[self.current_level])
Transfer Learning
Leverage pre-trained models:
def transfer_learning(source_model, target_episodes):
    """Transfer knowledge from a source task to a target task."""
    # Load the pre-trained model (load_model assumed defined elsewhere)
    model = load_model(source_model)

    # Keep the learned features; retrain only the policy head
    model.freeze_feature_extractor()
    model.unfreeze_policy_head()

    # Fine-tune on the target task
    for episode in target_episodes:
        train_step(model, episode)
    return model
Evaluation and Validation
Metrics for RL in Robotics
Performance Metrics
- Task Success Rate: Percentage of successful task completions
- Learning Efficiency: Performance improvement over time
- Sample Efficiency: Performance relative to the number of environment interactions required
- Generalization: Performance on unseen scenarios
Safety Metrics
- Collision Rate: Frequency of unsafe events
- Stability: Balance and control performance
- Robustness: Performance under perturbations (see the evaluation sketch below)
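A simple harness can estimate several of the metrics above at once. This sketch assumes a Gymnasium-style environment whose info dict exposes success and collision flags (those keys are hypothetical):

import numpy as np

def evaluate(env, policy, n_episodes=50):
    """Roll out a fixed policy and report success rate, collision rate, and return."""
    successes, collisions, returns = 0, 0, []
    for _ in range(n_episodes):
        state, info = env.reset()
        done, ep_return = False, 0.0
        while not done:
            state, reward, terminated, truncated, info = env.step(policy(state))
            ep_return += reward
            done = terminated or truncated
        successes += int(info.get("success", False))    # hypothetical info keys
        collisions += int(info.get("collision", False))
        returns.append(ep_return)
    return {"success_rate": successes / n_episodes,
            "collision_rate": collisions / n_episodes,
            "mean_return": float(np.mean(returns))}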
Testing Procedures
Simulation Testing
- Unit tests for individual components
- Integration tests for complete systems
- Stress tests with extreme conditions
- Long-duration tests for stability
Real Robot Testing
- Safety checks before deployment
- Gradual deployment with monitoring
- Emergency stop procedures
- Post-deployment analysis
Troubleshooting Common Issues
Training Instability
- Problem: Training diverges or oscillates
- Solution: Reduce learning rate, add regularization, use target networks
Sample Inefficiency
- Problem: Too many samples needed for learning
- Solution: Use better exploration, HER, prioritized replay
Sim-to-Real Gap
- Problem: Policy works in simulation but not real robot
- Solution: Domain randomization, system identification, robust RL
Safety Violations
- Problem: Unsafe behaviors during learning
- Solution: Safe exploration, shielding, constrained RL
Summary
Reinforcement Learning offers powerful approaches for creating adaptive, intelligent robot control systems. For humanoid robotics, RL can learn complex behaviors that are difficult to program explicitly, from walking gaits to manipulation skills. Success requires careful attention to reward design, safety considerations, and the unique challenges of real-world robotic systems. The combination of simulation for training and careful real-world validation enables the development of sophisticated AI brains for humanoid robots.