Panfeng Cao

Learning to Grasp with a Simulated Hand Using Deep Reinforcement Learning

10 min read

Introduction

This project applies Deep Q-Networks (DQN) to the robotic task of object grasping. The objective is to enable a simulated robotic hand to learn how to grasp objects autonomously using visual data from a camera, processed by a DQN. In about four hours and 150,000 training steps, the gripper learned to grasp objects with steadily increasing reward, demonstrating the potential of DQN in robotic manipulation.

Deep reinforcement learning combines reinforcement learning with deep learning to create end-to-end learning models. DQN, well known for its success on tasks like Atari game control, uses a convolutional neural network (CNN) to map high-dimensional inputs to optimal actions. This project investigates using DQN to train a robotic hand to grasp objects in a simulated environment, leveraging the MuJoCo physics engine to emulate real-world conditions without needing predefined models or exact object locations.

Related Work

In the last decade, deep reinforcement learning has expanded from game environments to control systems. DQN has been adapted to continuous control problems, particularly in robotics, by discretizing the action space into a manageable set of actions. Although discretization is often criticized for increasing dimensionality, recent studies show that it can efficiently handle small control systems with few action possibilities. This approach contrasts with other reinforcement learning methods such as Double Q-learning or DDPG, which address different limitations of DQN but are generally more complex to implement for real-time learning tasks.

Methods

Algorithm and Process Flow

Our system architecture uses a camera to capture real-time frames of the gripper and objects. These frames are fed into the DQN model, which outputs Q-values for each possible action. The action with the highest Q-value is selected, performed by the gripper, and rewarded based on grasp success. The goal is for the agent to maximize cumulative discounted rewards, guiding the gripper toward more effective grasping strategies over time.
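The control loop itself is conceptually simple. The sketch below illustrates it with hypothetical `env` and `dqn` objects; their names and interfaces are illustrative only, not the project's actual code:

```python
import numpy as np

def run_episode(env, dqn, epsilon=0.1, max_steps=1000):
    """Control loop: camera frame -> Q-values -> best action -> reward."""
    frame = env.reset()                      # initial camera observation
    total_reward = 0.0
    for _ in range(max_steps):
        q_values = dqn.predict(frame)        # one Q-value per discrete action
        if np.random.rand() < epsilon:       # occasional random exploration
            action = np.random.randint(len(q_values))
        else:
            action = int(np.argmax(q_values))  # pick the highest-valued action
        frame, reward, done = env.step(action)  # gripper executes the action
        total_reward += reward
        if done:
            break
    return total_reward
```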

[Figure: System Architecture]

Preprocessing

Given the high-dimensional nature of visual input, the images from MuJoCo are preprocessed to improve efficiency and model accuracy. We use the maximum pixel values between consecutive frames to reduce noise and downsample the image to an 84x84 grayscale frame, which forms the input to the DQN.
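A minimal sketch of this preprocessing step, assuming OpenCV is available for the color conversion and resize; scaling the pixels to [0, 1] is my own assumption and is not stated in the post:

```python
import numpy as np
import cv2  # assumed available; any image library with resize would do

def preprocess(frame, prev_frame):
    """Reduce a pair of RGB MuJoCo renders to a single 84x84 grayscale frame."""
    # Element-wise maximum of two consecutive frames suppresses rendering noise.
    merged = np.maximum(frame, prev_frame)
    # Convert to grayscale using the standard luminance weights.
    gray = cv2.cvtColor(merged, cv2.COLOR_RGB2GRAY)
    # Downsample to the 84x84 input size expected by the DQN.
    resized = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return resized.astype(np.float32) / 255.0  # assumed scaling to [0, 1]
```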

Model Architecture

The CNN model comprises three convolutional layers followed by a fully connected layer and an output layer for action selection. Each convolutional layer uses rectified linear units (ReLU) as activations to capture important spatial features:

  • Layer 1: 32 filters of 8x8 with a stride of 4.
  • Layer 2: 64 filters of 4x4 with a stride of 2.
  • Layer 3: 64 filters of 3x3 with a stride of 1.

The final fully connected layer has 512 ReLU units, and the output layer represents the Q-values for each of the 37 discrete grasping actions.
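A sketch of this network in PyTorch (the post does not state which framework was used; the layer sizes follow the description above and match the classic Atari DQN architecture):

```python
import torch
import torch.nn as nn

class GraspDQN(nn.Module):
    """Three conv layers, a 512-unit fully connected layer, and 37 Q-values."""
    def __init__(self, num_actions=37, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),  # 84 -> 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),          # 20 -> 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),          # 9 -> 7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q-value per discrete grasp action
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example: a batch of one 84x84 grayscale frame -> 37 Q-values.
q = GraspDQN()(torch.zeros(1, 1, 84, 84))
print(q.shape)  # torch.Size([1, 37])
```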

Training Details and Hyperparameters

We train the agent with an ε-greedy policy to balance exploration and exploitation; ε is annealed down to 0.1 over the course of training, reaching that value after roughly 40 epochs, as reported in the results below.
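A common way to implement such a schedule is linear annealing. The sketch below assumes a start value of 1.0 and a decay horizon of 100,000 steps; neither of these is stated in the post, only the final value of 0.1 is:

```python
def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, decay_steps=100_000):
    """Linearly anneal the exploration rate from eps_start to eps_end.

    The start value and decay horizon are assumptions; the post only states
    that epsilon reaches 0.1 after about 40 epochs of training.
    """
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

# Example: exploration rate early vs. late in training.
print(epsilon_by_step(0))        # 1.0
print(epsilon_by_step(150_000))  # 0.1
```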

Reward System

The reward system encourages successful grasping while discouraging wasted energy and dropped or displaced objects. The specific rewards, sketched in code after the list, are:

  • -20: if the box moves 0.7 meters away from the boundary.
  • +4: if the gripper lifts the box more than 30% of its length off the ground.
  • -1: for every other action, to account for energy consumption.
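Putting the three cases together, the reward function might look like the sketch below; the inputs (`box_displacement`, `lift_height`, `box_length`) are hypothetical names for quantities read from the simulator:

```python
def grasp_reward(box_displacement, lift_height, box_length):
    """Reward shaping from the post, with hypothetical state inputs.

    box_displacement: distance (m) the box has moved from the boundary
    lift_height:      how far (m) the box has been lifted off the ground
    box_length:       the box's length (m), used for the 30% lift threshold
    """
    if box_displacement > 0.7:          # box pushed too far away: heavy penalty
        return -20.0
    if lift_height > 0.3 * box_length:  # successful lift: positive reward
        return 4.0
    return -1.0                         # every other step costs energy
```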

Deep Q-learning with Experience Replay

Our DQN training improves stability with experience replay and a separate target network. Replay reduces the correlation between sequential samples, and the target network prevents oscillations in the Q-value updates. At each training step, the latest experience is stored in the replay memory and a random minibatch is drawn from it to compute the Q-learning targets.
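A sketch of one such training step, reusing the `GraspDQN` module sketched earlier; the buffer size, discount factor, and Huber loss are assumed values, not taken from the post:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

# Transitions are stored as (state, action, reward, next_state, done) tuples,
# with states as 1x84x84 float tensors. Buffer size and gamma are assumptions.
replay_buffer = deque(maxlen=100_000)
GAMMA = 0.99

def train_step(policy_net, target_net, optimizer, batch_size=32):
    """One DQN update: sample uncorrelated transitions, then regress toward
    the Bellman target computed with the frozen target network."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    states = torch.stack([t[0] for t in batch])
    actions = torch.tensor([t[1] for t in batch])
    rewards = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    next_states = torch.stack([t[3] for t in batch])
    dones = torch.tensor([t[4] for t in batch], dtype=torch.float32)

    # Q(s, a) from the network being trained.
    q_sa = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bellman target computed with the separate, periodically synced target net.
    with torch.no_grad():
        target = rewards + GAMMA * (1 - dones) * target_net(next_states).max(1).values
    loss = F.smooth_l1_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The target network stays frozen during each update and is periodically synchronized with the policy network, e.g. via `target_net.load_state_dict(policy_net.state_dict())`.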

Experimental Evaluation

Training Results

Training took four hours, processing over 150,000 steps. The gripper improved from an initial average reward of -40 to +40, showing it could successfully grasp objects for extended periods. Once ε had decayed to 0.1, after about 40 epochs, the model stabilized and its reward leveled off near its maximum, as shown below:

[Figure: Training Curve]

Conclusion

Applying DQN to robotic grasping is feasible: the gripper learned to grasp the object after approximately 150,000 steps and four hours of training. That said, the observed reward level is slightly below expectations; adjusting the ε decay or experimenting with a more refined reward function may further improve the results.

About the article

I worked on this project while studying at the University of Michigan; it was my final project for the Machine Learning course.

The gripper learned to grasp successfully!