Reinforcement Learning for Robotics
Overview
Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. In robotics, RL offers a powerful approach to learning complex behaviors that are difficult to program explicitly, such as walking gaits, manipulation skills, and adaptive behaviors. This chapter explores how RL can be applied to create intelligent, adaptive robot control systems.
Fundamentals of Reinforcement Learning
Core Components
A reinforcement learning system consists of:
- Agent: The learning entity (the robot)
- Environment: The world the agent interacts with
- State (s): Complete description of the environment
- Action (a): What the agent can do
- Reward (r): Feedback signal indicating success
- Policy (π): Strategy for selecting actions
- Value Function (V): Expected future rewards
- Model: Agent's representation of the environment
The RL Loop
State s_t → Agent (policy π) → Action a_t → Environment → Reward r_t and Next State s_{t+1} → back to Agent
- Agent observes environment state
- Agent selects action based on policy
- Action is executed in environment
- Environment transitions to new state
- Agent receives reward signal
- Process repeats (a code sketch of this loop follows)
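The loop above maps directly onto code. Here is a minimal sketch using the Gymnasium API with a placeholder random policy (any registered environment id would work in place of Pendulum-v1):

import gymnasium as gym

env = gym.make("Pendulum-v1")
state, info = env.reset(seed=0)

for _ in range(1000):
    action = env.action_space.sample()  # placeholder for the agent's policy
    next_state, reward, terminated, truncated, info = env.step(action)
    state = next_state
    if terminated or truncated:  # episode ended: start a new one
        state, info = env.reset()
env.close()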
Markov Decision Process (MDP)
RL problems are often formulated as MDPs with the tuple (S, A, P, R, γ); a toy example follows the definitions:
- S: Set of states
- A: Set of actions
- P: State transition probabilities
- R: Reward function
- γ: Discount factor (0 ≤ γ ≤ 1)
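To make the tuple concrete, here is a toy MDP written out as plain data structures (a hypothetical robot that can dock with a charger; the states, actions, and numbers are illustrative only):

# S and A
states = ["undocked", "docked"]
actions = ["move", "stay"]

# P[s][a] -> list of (next_state, probability)
P = {
    "undocked": {"move": [("docked", 0.8), ("undocked", 0.2)],
                 "stay": [("undocked", 1.0)]},
    "docked":   {"move": [("undocked", 1.0)],
                 "stay": [("docked", 1.0)]},
}

# R[s][a] -> immediate reward (illustrative values)
R = {
    "undocked": {"move": -1.0, "stay": -0.1},
    "docked":   {"move": -1.0, "stay": 1.0},
}

gamma = 0.95  # discount factor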
RL Algorithms for Robotics
Value-Based Methods
Q-Learning
Q-Learning learns an action-value function Q(s,a) that estimates the expected return from taking action a in state s and acting greedily thereafter.
# Q-Learning update rule
Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
Where:
- α is the learning rate
- γ is the discount factor
- s' is the next state (a tabular sketch of this update follows)
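A minimal tabular implementation of this update, assuming discrete, hashable states and integer actions (the function name and example call are illustrative):

import numpy as np
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update; Q maps each state to an array of action values."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

n_actions = 4  # assumed size of the discrete action set
Q = defaultdict(lambda: np.zeros(n_actions))
q_learning_step(Q, s=(0, 0), a=1, r=-1.0, s_next=(0, 1))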
Deep Q-Networks (DQN)
DQN uses a neural network to approximate the Q-function, allowing it to handle high-dimensional state spaces.
import torch
import torch.nn as nn
import numpy as np

class DQN(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Example usage for robot control: epsilon-greedy selection over
# a discrete action set indexed 0..action_size-1
def select_action(state, policy_net, epsilon, action_size):
    if np.random.random() < epsilon:
        # Explore: random action
        return np.random.randint(action_size)
    else:
        # Exploit: best known action
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = policy_net(state_tensor)
            return q_values.max(1)[1].item()
Policy-Based Methods
Policy Gradient
Policy gradient methods directly optimize the policy function π(a|s).
# REINFORCE algorithm (calculate_returns, which computes the discounted
# return G_t for each timestep, is assumed defined elsewhere)
def reinforce_update(trajectories, policy_network, optimizer):
    for trajectory in trajectories:
        states, actions, rewards = trajectory

        # Discounted return G_t for every timestep
        returns = calculate_returns(rewards)

        # Policy gradient loss: -E[log pi(a|s) * G_t]
        log_probs = policy_network.get_log_probs(states, actions)
        loss = -torch.mean(log_probs * returns)

        # Update policy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Actor-Critic Methods
Actor-critic methods combine value-based and policy-based approaches, pairing a policy network (the actor) with a value estimator (the critic), often on top of shared features.
class ActorCritic(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        # Shared feature extractor
        self.feature_extractor = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU()
        )
        # Actor (policy head)
        self.actor = nn.Linear(hidden_size, action_size)
        # Critic (value head)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, state):
        features = self.feature_extractor(state)
        action_probs = torch.softmax(self.actor(features), dim=-1)
        state_value = self.critic(features)
        return action_probs, state_value

# Advantage Actor-Critic (A2C) update
def a2c_update(trajectories, actor_critic, optimizer):
    for trajectory in trajectories:
        states, actions, rewards = trajectory

        # Single forward pass through the shared network
        action_probs, state_values = actor_critic(states)

        # Advantages: discounted returns minus the critic's value estimates
        returns = calculate_returns(rewards)
        advantages = returns - state_values.squeeze(-1)

        # Actor loss uses detached advantages; critic regresses onto the returns
        log_probs = torch.log(action_probs.gather(1, actions.unsqueeze(1))).squeeze(-1)
        actor_loss = -(log_probs * advantages.detach()).mean()
        critic_loss = advantages.pow(2).mean()

        # Update both heads with a single optimizer step
        optimizer.zero_grad()
        (actor_loss + 0.5 * critic_loss).backward()
        optimizer.step()
Advanced RL Algorithms
Deep Deterministic Policy Gradient (DDPG)
DDPG is an off-policy actor-critic algorithm built for the continuous action spaces common in robotics.
import numpy as np
import torch
import torch.nn.functional as F
import torch.optim as optim

# Hyperparameters (typical values; tune per task)
BATCH_SIZE = 128
GAMMA = 0.99          # discount factor
TAU = 1e-3            # soft update coefficient
UPDATE_EVERY = 20     # learn every N environment steps
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class DDPGAgent:
    """Sketch of a DDPG agent. Actor, Critic, ReplayBuffer, and OUNoise
    are assumed to be defined elsewhere."""

    def __init__(self, state_size, action_size, random_seed=0):
        self.state_size = state_size
        self.action_size = action_size
        self.seed = torch.manual_seed(random_seed)
        self.timestep = 0

        # Actor network (with target network)
        self.actor_local = Actor(state_size, action_size, random_seed).to(device)
        self.actor_target = Actor(state_size, action_size, random_seed).to(device)
        self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=1e-4)

        # Critic network (with target network)
        self.critic_local = Critic(state_size, action_size, random_seed).to(device)
        self.critic_target = Critic(state_size, action_size, random_seed).to(device)
        self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=1e-3)

        # Replay buffer and exploration noise (assumed classes)
        self.memory = ReplayBuffer(int(1e6), BATCH_SIZE)
        self.noise = OUNoise(action_size, random_seed)

    def step(self, state, action, reward, next_state, done):
        # Save experience in replay buffer
        self.memory.add(state, action, reward, next_state, done)

        # Learn every UPDATE_EVERY time steps, once enough samples are stored
        self.timestep += 1
        if self.timestep % UPDATE_EVERY == 0 and len(self.memory) > BATCH_SIZE:
            experiences = self.memory.sample()
            self.learn(experiences, GAMMA)

    def act(self, state, add_noise=True):
        """Returns actions for given state as per current policy."""
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.actor_local.eval()
        with torch.no_grad():
            action = self.actor_local(state).cpu().data.numpy()
        self.actor_local.train()
        if add_noise:
            action += self.noise.sample()
        return np.clip(action, -1, 1)

    def learn(self, experiences, gamma):
        """Update policy and value parameters using a batch of experience tuples."""
        states, actions, rewards, next_states, dones = experiences

        # ---------------------------- update critic ---------------------------- #
        # Get predicted next-state actions and Q values from target models
        actions_next = self.actor_target(next_states)
        Q_targets_next = self.critic_target(next_states, actions_next)
        # Compute Q targets for current states (y_i)
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))
        # Compute critic loss
        Q_expected = self.critic_local(states, actions)
        critic_loss = F.mse_loss(Q_expected, Q_targets)
        # Minimize the loss
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # ---------------------------- update actor ----------------------------- #
        # Actor loss: maximize Q by minimizing its negation
        actions_pred = self.actor_local(states)
        actor_loss = -self.critic_local(states, actions_pred).mean()
        # Minimize the loss
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # ----------------------- update target networks ------------------------ #
        self.soft_update(self.critic_local, self.critic_target, TAU)
        self.soft_update(self.actor_local, self.actor_target, TAU)

    def soft_update(self, local_model, target_model, tau):
        """Polyak averaging: theta_target = tau*theta_local + (1-tau)*theta_target."""
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
Twin Delayed DDPG (TD3)
TD3 improves on DDPG with twin critics (taking the minimum of two Q-estimates curbs overestimation), delayed actor updates, and target policy smoothing.
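A minimal sketch of TD3's target computation, assuming target networks actor_target, critic1_target, and critic2_target plus hyperparameters policy_noise and noise_clip are defined, with the same batch variables as in the DDPG example above:

with torch.no_grad():
    # Target policy smoothing: perturb the target action with clipped noise
    next_actions = actor_target(next_states)
    noise = (torch.randn_like(next_actions) * policy_noise).clamp(-noise_clip, noise_clip)
    next_actions = (next_actions + noise).clamp(-1.0, 1.0)

    # Clipped double-Q: take the minimum of the twin target critics
    q_next = torch.min(critic1_target(next_states, next_actions),
                       critic2_target(next_states, next_actions))
    q_targets = rewards + gamma * (1 - dones) * q_next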
Soft Actor-Critic (SAC)
SAC is an off-policy maximum-entropy algorithm: it maximizes reward plus a policy-entropy bonus, which encourages exploration and typically improves sample efficiency and robustness.
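A sketch of SAC's entropy-augmented critic target, assuming a stochastic actor whose sample() returns (action, log_prob), twin target critics, and an entropy temperature alpha:

with torch.no_grad():
    next_actions, next_log_probs = actor.sample(next_states)
    q_next = torch.min(critic1_target(next_states, next_actions),
                       critic2_target(next_states, next_actions))
    # Subtracting alpha * log_prob rewards keeping the policy stochastic
    q_targets = rewards + gamma * (1 - dones) * (q_next - alpha * next_log_probs)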
RL for Humanoid Robotics
Unique Challenges
High-Dimensional Action Spaces
Humanoid robots have 20+ joints, creating complex action spaces:
- Walking: 6+ leg joints coordinated for balance
- Manipulation: 7+ arm joints for dexterous tasks
- Whole-body control: All joints simultaneously
Continuous Control
Most robot tasks require continuous control signals rather than discrete actions (an action-space sketch follows this list):
- Joint position targets
- Torque commands
- Velocity profiles
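As a concrete illustration, a continuous torque action space for a hypothetical 7-joint arm can be declared as a Gymnasium Box (the joint count and torque limit are assumptions):

import numpy as np
from gymnasium import spaces

NUM_JOINTS = 7        # hypothetical arm
TORQUE_LIMIT = 40.0   # assumed per-joint torque limit (N·m)

action_space = spaces.Box(low=-TORQUE_LIMIT, high=TORQUE_LIMIT,
                          shape=(NUM_JOINTS,), dtype=np.float32)
action = action_space.sample()  # one continuous torque command per joint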
Safety Constraints
RL training must maintain robot safety:
- Joint limits
- Collision avoidance
- Balance maintenance
Real-Time Requirements
Robot control often requires high-frequency updates (100-1000 Hz).
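As a rough sketch only (Python is rarely used for hard real-time control; production loops typically run in C++ or on a real-time OS), a fixed-rate control loop might look like this, with read_sensors, policy, and send_commands as hypothetical helpers:

import time

CONTROL_HZ = 200      # assumed control rate
DT = 1.0 / CONTROL_HZ

next_tick = time.perf_counter()
while True:
    state = read_sensors()    # hypothetical sensor interface
    action = policy(state)    # inference must finish well within DT
    send_commands(action)     # hypothetical actuator interface

    # Sleep until the next tick to hold the fixed control rate
    next_tick += DT
    time.sleep(max(0.0, next_tick - time.perf_counter()))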
Applications in Humanoid Robotics
Locomotion Learning
- Walking: Learning stable walking gaits
- Running: Dynamic running patterns
- Climbing: Stair and obstacle navigation
- Balance: Recovery from disturbances
Manipulation Learning
- Grasping: Learning to grasp various objects
- Tool use: Using tools for specific tasks
- Bimanual tasks: Coordinated two-handed operations
Human-Robot Interaction
- Social behaviors: Appropriate social responses
- Collaboration: Working with humans safely
- Adaptation: Adjusting to human preferences
Simulation-to-Real Transfer
Domain Randomization
Training in simulation with randomized parameters to improve real-world transfer:
class DomainRandomizationEnv:
    """Wrapper that re-randomizes physics parameters on every reset.
    Assumes the wrapped simulator exposes set_robot_mass, set_surface_friction,
    and set_gravity hooks."""

    def __init__(self, env):
        self.env = env
        # Randomization ranges for physics parameters
        self.mass_range = (0.8, 1.2)        # ±20% mass variation (scale factor)
        self.friction_range = (0.5, 1.5)    # friction coefficient variation
        self.gravity_range = (-11.0, -8.5)  # gravity variation (m/s^2)

    def randomize_environment(self):
        """Sample and apply new physics parameters."""
        random_mass = np.random.uniform(*self.mass_range)
        random_friction = np.random.uniform(*self.friction_range)
        random_gravity = np.random.uniform(*self.gravity_range)
        # Apply randomization to the simulator
        self.env.set_robot_mass(random_mass)
        self.env.set_surface_friction(random_friction)
        self.env.set_gravity(random_gravity)

    def reset(self):
        self.randomize_environment()
        return self.env.reset()
System Identification
Fitting simulation parameters so the simulator reproduces the real robot's dynamics:
from scipy import optimize

def system_identification(robot_sim, robot_real, initial_params):
    """Learn simulation parameters that match real robot behavior.
    collect_robot_data is an assumed helper that records a response trajectory."""
    # Collect reference data from the real robot
    real_data = collect_robot_data(robot_real)

    # Loss: mean squared error between the simulated and real responses
    def parameter_loss(params):
        robot_sim.update_params(params)
        sim_response = robot_sim.get_response()
        return np.mean((sim_response - real_data) ** 2)

    # Use optimization to find the best-fitting parameters
    result = optimize.minimize(parameter_loss, initial_params)
    return result.x
Sim-to-Real Transfer Techniques
- Domain Adaptation: Adapt models to new domains
- Meta-Learning: Learn to learn quickly in new environments
- Imitation Learning: Learn from demonstrations
Sample-Efficient RL for Robotics
Prioritized Experience Replay
Focus learning on important transitions:
import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6, beta_start=0.4):
        self.capacity = capacity
        self.alpha = alpha       # how much prioritization to apply (0 = uniform)
        self.beta = beta_start   # importance-sampling correction strength
        self.buffer = []
        self.priorities = []
        self.position = 0

    def push(self, state, action, reward, next_state, done, priority=None):
        if priority is None:
            # New experiences get the current max priority so they are sampled at least once
            priority = max(self.priorities, default=1.0)
        if len(self.buffer) < self.capacity:
            self.buffer.append((state, action, reward, next_state, done))
            self.priorities.append(priority)
        else:
            self.buffer[self.position] = (state, action, reward, next_state, done)
            self.priorities[self.position] = priority
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        if len(self.buffer) == 0:
            return None

        # Sampling probabilities proportional to priority^alpha
        priorities = np.array(self.priorities)
        probabilities = priorities ** self.alpha
        probabilities /= probabilities.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probabilities)

        # Importance-sampling weights correct the bias introduced by prioritization
        total = len(self.buffer)
        weights = (total * probabilities[indices]) ** (-self.beta)
        weights /= weights.max()
        weights = weights.astype(np.float32)

        samples = [self.buffer[idx] for idx in indices]
        return samples, indices, weights

    def update_priorities(self, indices, td_errors, eps=1e-6):
        """Refresh priorities from the latest TD errors after a learning step."""
        for idx, err in zip(indices, td_errors):
            self.priorities[idx] = abs(err) + eps
Hindsight Experience Replay (HER)
Learn from failed attempts by redefining goals:
def hindsight_experience_replay(episode_transitions, reward_func, threshold=0.0):
    """Relabel transitions with goals actually achieved later in the episode.
    threshold is the reward level that counts as success (assumed task-specific)."""
    her_transitions = []
    for t, transition in enumerate(episode_transitions):
        state, action, reward, next_state, done = transition
        # Keep the original transition
        her_transitions.append((state, action, reward, next_state, done))
        # Add relabeled transitions using states reached later as alternative goals
        for future_transition in episode_transitions[t + 1:]:
            future_state = future_transition[3]  # next_state of the future transition
            # Recompute reward as if future_state had been the goal all along
            her_reward = reward_func(next_state, future_state)
            her_done = her_reward > threshold
            her_transitions.append((state, action, her_reward, next_state, her_done))
    return her_transitions
Reward Engineering for Robotics
Designing Effective Rewards
Sparse vs. Dense Rewards
- Sparse rewards: Only at task completion (hard to learn)
- Dense rewards: Frequent feedback (easier to learn)
# Example reward function for reaching a target
def reach_target_reward(robot_state, target_position, reached_threshold=0.1):
    distance = np.linalg.norm(robot_state.position - target_position)

    # Dense shaping term: negative distance encourages getting closer
    dense_reward = -distance

    # Sparse bonus at completion
    sparse_reward = 1.0 if distance < reached_threshold else 0.0

    # Weighted combination
    return 0.1 * dense_reward + 10.0 * sparse_reward
Shaping Rewards
Provide intermediate rewards to guide learning:
def walking_reward(robot_state, target_velocity):
    """Reward function for bipedal walking."""
    # Forward velocity reward (1.0 when tracking the target speed)
    forward_vel_reward = robot_state.forward_velocity / target_velocity

    # Balance reward: alignment of the torso's up axis with world up
    # (body_orientation is assumed to be the torso up vector in the world frame)
    world_up = np.array([0, 0, 1])
    balance_reward = np.dot(world_up, robot_state.body_orientation)

    # Energy efficiency: penalize large joint torques
    energy_penalty = -np.sum(np.abs(robot_state.joint_torques))

    # Small per-step time penalty to discourage dawdling
    time_penalty = -0.01

    return (0.5 * forward_vel_reward +
            0.3 * balance_reward +
            0.1 * energy_penalty +
            0.1 * time_penalty)
Avoiding Reward Hacking
Prevent agents from exploiting reward functions:
# Bad: rewards the upright pose alone, so the agent can exploit it by
# passing through upright moments without ever stabilizing
def bad_get_up_reward(robot_state):
    return 1.0 if robot_state.is_upright() else 0.0

# Good: rewards being upright AND stable (velocity thresholds are task-specific)
def good_get_up_reward(robot_state, lin_vel_threshold=0.1, ang_vel_threshold=0.1):
    is_upright = robot_state.is_upright()
    is_stable = np.linalg.norm(robot_state.linear_velocity) < lin_vel_threshold
    is_balanced = np.linalg.norm(robot_state.angular_velocity) < ang_vel_threshold
    return 1.0 if (is_upright and is_stable and is_balanced) else 0.0
Deep RL Frameworks for Robotics
Stable Baselines3
Popular library for deep RL with robotics support:
from stable_baselines3 import PPO, SAC, DDPG
from stable_baselines3.common.env_util import make_vec_env

# Create vectorized environment ('RobotEnv-v1' is a placeholder environment id)
env = make_vec_env('RobotEnv-v1', n_envs=4)

# Create and train model
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=100000)

# Save and load model
model.save("ppo_robot")
loaded_model = PPO.load("ppo_robot")
Ray RLlib
Scalable RL library for distributed training:
import ray
from ray import tune
from ray.rllib.algorithms.ppo import PPO

ray.init()

config = {
    "env": "RobotEnv-v1",   # placeholder env id registered with RLlib
    "num_workers": 4,
    "lr": 0.0003,
    "framework": "torch",
}

tune.run(
    PPO,
    stop={"episode_reward_mean": 200},
    config=config,
    checkpoint_freq=1,
    local_dir="~/ray_results",
)
Isaac Gym
NVIDIA's GPU-accelerated RL environment:
from isaacgym import gymapi
from isaacgym import gymtorch
import torch

# Create Isaac Gym simulation with default physics parameters
gym = gymapi.acquire_gym()
params = gymapi.SimParams()
sim = gym.create_sim(0, 0, gymapi.SIM_PHYSX, params)

# Create many parallel environments, each holding one robot
# (num_envs, asset_root, asset_file, and start_pose are user-supplied)
envs = []
for i in range(num_envs):
    env = gym.create_env(sim, gymapi.Vec3(0, 0, 0), gymapi.Vec3(2.5, 2.5, 2.5), 1)
    # Load the robot asset and add an actor instance to this environment
    asset = gym.load_asset(sim, asset_root, asset_file)
    gym.create_actor(env, asset, start_pose, "robot", i, 1, 0)
    envs.append(env)
Safety in RL for Robotics
Safe RL Approaches
Constrained RL
Add safety constraints to the optimization:
def constrained_policy_optimization(safety_threshold, num_episodes, learning_rate=0.01):
    """Optimize the policy subject to E[cost(state, action)] <= safety_threshold.
    Sketch of a Lagrangian relaxation; the policy and helper functions are
    assumed to be defined elsewhere."""
    lagrange_multiplier = 0.0
    for _ in range(num_episodes):
        # Collect a trajectory under the current policy
        trajectory = collect_trajectory(policy)

        # Evaluate reward and safety cost for the trajectory
        reward = calculate_reward(trajectory)
        safety_cost = calculate_safety_cost(trajectory)

        # Lagrangian objective: trade reward off against constraint violation
        constrained_objective = reward - lagrange_multiplier * (safety_cost - safety_threshold)
        update_policy(constrained_objective)

        # Dual ascent: grow the multiplier while the constraint is violated
        if safety_cost > safety_threshold:
            lagrange_multiplier += learning_rate * (safety_cost - safety_threshold)
Shielding
Prevent unsafe actions at runtime:
class SafetyShield:
    def __init__(self, robot_model, safety_constraints):
        self.robot_model = robot_model
        self.safety_constraints = safety_constraints

    def is_safe_action(self, state, action):
        """Check if an action keeps the predicted next state within constraints."""
        next_state = self.robot_model.predict_next_state(state, action)
        return self.safety_constraints.is_satisfied(next_state)

    def safe_action(self, state, proposed_action):
        """Pass a safe action through, or substitute the closest safe action."""
        if self.is_safe_action(state, proposed_action):
            return proposed_action
        # Project the unsafe proposal onto the safe action set
        return self.find_safe_action(state, proposed_action)
Safe Exploration
Methods to explore safely:
- Safe Exploration: Only explore in safe regions
- Learnable Safety: Learn safety constraints during training
- Robust RL: Train with adversarial perturbations
Practical Implementation Tips
Hyperparameter Tuning
from sklearn.model_selection import ParameterGrid

def hyperparameter_search():
    """Grid search over common RL hyperparameters."""
    param_grid = {
        'learning_rate': [1e-4, 1e-3, 1e-2],
        'batch_size': [32, 64, 128],
        'gamma': [0.9, 0.95, 0.99],
        'tau': [0.001, 0.005, 0.01]
    }

    best_score = -float('inf')
    best_params = None
    for params in ParameterGrid(param_grid):
        # train_and_evaluate (assumed defined) trains a policy and returns a scalar score
        score = train_and_evaluate(params)
        if score > best_score:
            best_score = score
            best_params = params
    return best_params
Curriculum Learning
Gradually increase task difficulty:
class Curriculum:
    def __init__(self):
        # 'threshold' is the performance needed to advance (example values)
        self.levels = [
            {'difficulty': 0.1, 'threshold': 0.8, 'tasks': ['balance_training']},
            {'difficulty': 0.3, 'threshold': 0.8, 'tasks': ['simple_walking']},
            {'difficulty': 0.6, 'threshold': 0.7, 'tasks': ['walking_with_obstacles']},
            {'difficulty': 1.0, 'threshold': 0.7, 'tasks': ['complex_navigation']}
        ]
        self.current_level = 0

    def update_level(self, performance):
        """Advance to the next level once performance clears the current threshold."""
        if performance > self.levels[self.current_level]['threshold']:
            if self.current_level < len(self.levels) - 1:
                self.current_level += 1
                self.modify_environment(self.levels[self.current_level])
Transfer Learning
Leverage pre-trained models:
def transfer_learning(source_model, target_episodes):
    """Transfer knowledge from a source task to a target task."""
    # Load the pre-trained model (load_model assumed defined elsewhere)
    model = load_model(source_model)

    # Keep the learned features; retrain only the policy head
    model.freeze_feature_extractor()
    model.unfreeze_policy_head()

    # Fine-tune on the target task
    for episode in target_episodes:
        train_step(model, episode)
    return model
Evaluation and Validation
Metrics for RL in Robotics
Performance Metrics
- Task Success Rate: Percentage of successful task completions
- Learning Efficiency: Performance improvement over time
- Sample Efficiency: Performance relative to the number of environment interactions required
- Generalization: Performance on unseen scenarios
Safety Metrics
- Collision Rate: Frequency of unsafe events
- Stability: Balance and control performance
- Robustness: Performance under perturbations (see the evaluation sketch below)
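A simple harness can estimate several of the metrics above at once. This sketch assumes a Gymnasium-style environment whose info dict exposes success and collision flags (those keys are hypothetical):

import numpy as np

def evaluate(env, policy, n_episodes=50):
    """Roll out a fixed policy and report success rate, collision rate, and return."""
    successes, collisions, returns = 0, 0, []
    for _ in range(n_episodes):
        state, info = env.reset()
        done, ep_return = False, 0.0
        while not done:
            state, reward, terminated, truncated, info = env.step(policy(state))
            ep_return += reward
            done = terminated or truncated
        successes += int(info.get("success", False))    # hypothetical info keys
        collisions += int(info.get("collision", False))
        returns.append(ep_return)
    return {"success_rate": successes / n_episodes,
            "collision_rate": collisions / n_episodes,
            "mean_return": float(np.mean(returns))}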
Testing Procedures
Simulation Testing
- Unit tests for individual components
- Integration tests for complete systems
- Stress tests with extreme conditions
- Long-duration tests for stability
Real Robot Testing
- Safety checks before deployment
- Gradual deployment with monitoring
- Emergency stop procedures
- Post-deployment analysis
Troubleshooting Common Issues
Training Instability
- Problem: Training diverges or oscillates
- Solution: Reduce learning rate, add regularization, use target networks
Sample Inefficiency
- Problem: Too many samples needed for learning
- Solution: Use better exploration, HER, prioritized replay
Sim-to-Real Gap
- Problem: Policy works in simulation but not real robot
- Solution: Domain randomization, system identification, robust RL
Safety Violations
- Problem: Unsafe behaviors during learning
- Solution: Safe exploration, shielding, constrained RL
Summary
Reinforcement Learning offers powerful approaches for creating adaptive, intelligent robot control systems. For humanoid robotics, RL can learn complex behaviors that are difficult to program explicitly, from walking gaits to manipulation skills. Success requires careful attention to reward design, safety considerations, and the unique challenges of real-world robotic systems. The combination of simulation for training and careful real-world validation enables the development of sophisticated AI brains for humanoid robots.