Introduction
This project applies Deep Q-Networks (DQN) to the robotic task of object grasping. The objective is to enable a simulated robotic hand to learn to grasp different objects autonomously, using visual data from a camera processed by a DQN. Over roughly 4 hours and 150,000 training steps, the gripper learned to grasp objects with steadily increasing reward, demonstrating the potential of DQN for robotic manipulation.
Deep reinforcement learning combines reinforcement learning with deep learning to create end-to-end learning models. The DQN model, well known for its success in tasks such as Atari game control, uses a convolutional neural network (CNN) to map high-dimensional inputs to optimal actions. This work investigates using DQN to train a robotic hand to grasp objects in a simulated environment, leveraging the MuJoCo physics engine to emulate real-world conditions without needing predefined object models or exact object locations.
Related Work
In the last decade, deep reinforcement learning has expanded from game environments to control systems. DQN has been adapted to continuous control problems, particularly in robotic applications, by discretizing the action space into a manageable set of actions. Although discretization is sometimes argued to increase dimensionality, recent studies show that it can efficiently handle small control systems with few action possibilities. This approach contrasts with other reinforcement learning methods such as Double Q-learning or DDPG, which address different limitations of DQN but are generally more complex to implement for real-time learning tasks.
Methods
Algorithm and Process Flow
Our system architecture uses a camera to capture real-time frames of the gripper and objects. These frames are fed into the DQN model, which outputs Q-values for each possible action. The action with the highest Q-value is selected, performed by the gripper, and rewarded based on grasp success. The goal is for the agent to maximize cumulative discounted rewards, guiding the gripper toward more effective grasping strategies over time.
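Formally, the agent maximizes the expected discounted return; in the standard DQN formulation (with discount factor γ, whose numerical value is not specified here), the return and the one-step target used to update the Q-values are:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k},
\qquad
y_t = r_t + \gamma \max_{a'} Q_{\text{target}}(s_{t+1}, a')
```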
Preprocessing
Given the high-dimensional nature of visual input, the images from MuJoCo are preprocessed to improve efficiency and model accuracy. We use the maximum pixel values between consecutive frames to reduce noise and downsample the image to an 84x84 grayscale frame, which forms the input to the DQN.
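For illustration, a minimal sketch of this preprocessing step is shown below; it assumes the camera returns RGB frames as NumPy arrays and uses OpenCV for conversion and resizing, which are implementation choices not fixed by the description above.

```python
import cv2
import numpy as np

def preprocess(frame: np.ndarray, prev_frame: np.ndarray) -> np.ndarray:
    """Return an 84x84 grayscale frame to feed into the DQN."""
    # Element-wise maximum of two consecutive frames to reduce noise.
    merged = np.maximum(frame, prev_frame)
    # Convert to grayscale and downsample to 84x84.
    gray = cv2.cvtColor(merged, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    # Scale pixel values to [0, 1] (an assumed normalization choice).
    return small.astype(np.float32) / 255.0
```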
Model Architecture
The CNN model comprises three convolutional layers followed by a fully connected layer and an output layer for action selection. Each convolutional layer uses rectified linear units (ReLU) as activations to capture important spatial features:
- Layer 1: 32 filters of 8x8 with a stride of 4.
- Layer 2: 64 filters of 4x4 with a stride of 2.
- Layer 3: 64 filters of 3x3 with a stride of 1.
The final fully connected layer has 512 ReLU units, and the output layer represents the Q-values for each of the 37 discrete grasping actions.
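The following PyTorch sketch mirrors the layer description above; the single grayscale input channel and the default weight initialization are assumptions, since only the convolutional and fully connected layer sizes are specified.

```python
import torch
import torch.nn as nn

class GraspDQN(nn.Module):
    """CNN mapping an 84x84 grayscale frame to Q-values for 37 grasping actions."""

    def __init__(self, num_actions: int = 37, in_channels: int = 1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),  # layer 1
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),           # layer 2
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),           # layer 3
            nn.ReLU(),
        )
        # With an 84x84 input, the final feature map is 64 x 7 x 7 = 3136 values.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q-value per discrete grasping action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))
```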
Training Details and Hyperparameters
We train the agent using an ε-greedy policy to balance exploration and exploitation, annealing ε down to 0.1 over the first 40 epochs of training; a sketch of the action selection is shown below.
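In this sketch, the linear annealing schedule and the starting value ε = 1.0 are assumptions; the final value of 0.1 matches the training description.

```python
import random
import torch

def select_action(model, state, num_actions, step,
                  eps_start=1.0, eps_end=0.1, anneal_steps=100_000):
    """Pick a random action with probability ε, otherwise the greedy action."""
    # Linearly anneal ε from eps_start down to eps_end (schedule assumed).
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / anneal_steps)
    if random.random() < eps:
        return random.randrange(num_actions)          # explore
    with torch.no_grad():
        q_values = model(state.unsqueeze(0))          # shape (1, num_actions)
        return int(q_values.argmax(dim=1).item())     # exploit
```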
Reward System
The reward system encourages successful grasping while discouraging energy waste and dropped objects. The specific rewards, also sketched in code after this list, are:
- -20: if the box moves 0.7 meters from the boundary.
- +4: if the gripper lifts the box over 30% of its length from the ground.
- -1: for all other actions to account for energy consumption.
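The rules above can be written as a single reward function, sketched below; the argument names (box displacement from the boundary and lift height relative to box length) are illustrative, as the text does not define exactly how these quantities are measured.

```python
def compute_reward(box_displacement_m: float,
                   lift_height_m: float,
                   box_length_m: float) -> float:
    """Reward rules from the list above; argument names are hypothetical."""
    if box_displacement_m >= 0.7:            # box moved 0.7 m from the boundary
        return -20.0
    if lift_height_m > 0.3 * box_length_m:   # box lifted over 30% of its length
        return 4.0
    return -1.0                              # per-step energy penalty
```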
Deep Q-learning with Experience Replay
Our DQN algorithm improves stability with experience replay and a separate target network. This setup reduces correlation between sequential samples and prevents oscillations in the Q-value updates. At each training step, experiences are stored in a replay memory, and minibatches are drawn from it at random to compute the Q-value targets.
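A compact sketch of the replay-and-target-network update follows; the buffer capacity, batch size, discount factor, loss, and target-network sync interval are assumptions, as these hyperparameters are not listed above.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        # States are assumed to already be torch tensors (e.g., preprocessed frames).
        return (torch.stack(states),
                torch.tensor(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))

def dqn_update(policy_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99):
    """One gradient step on a random minibatch drawn from the replay buffer."""
    if len(buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    # Q(s, a) for the actions actually taken.
    q_sa = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bootstrapped target from the separate target network (synced periodically
    # via target_net.load_state_dict(policy_net.state_dict())).
    with torch.no_grad():
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```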
Experimental Evaluation
Training Results
Training took four hours, processing over 150,000 steps. The gripper improved from an initial average reward of about -40 to about +40, showing it could successfully grasp objects and hold them for extended periods. After ε was reduced to 0.1 at around 40 epochs, the model stabilized and its average reward plateaued near its maximum.
Conclusion
Applying DQN to robotic grasping is feasible. The gripper successfully learned to grasp the object after approximately 150,000 steps and four hours of training. However, the observed reward level remained slightly below expectations; adjusting the ε decay schedule or experimenting with more refined reward functions may further improve outcomes.