Deep Q-Network
An implementation of DeepMind's Deep Q-Network (DQN) reinforcement learning paper, evaluated in OpenAI Gym
github project
- About
- From Q-Learning to DQN
- DQN-Lite
- Training and Testing
- Results: CartPole-v0
- Results: LunarLander-v2
- Results: MountainCar-v0
About
In this project I implemented a slight variation of Deep Q-Network (DQN), which I informally call DQN-Lite, and used it to learn several policies that solve three of the environments in OpenAI Gym: CartPole-v0, LunarLander-v2 and MountainCar-v0.
It turned out that the key to solving the different problems was not deviating from the network topology in the original paper (i.e., number of parameters, layers, or activation functions) but rather finding a suitable combination of training steps and ε-greedy decay schedule. I identified hyperparameter combinations that score above the solved criteria for each problem, and also found one combination that generated policies solving all three problems.
From Q-Learning to DQN
- Solving a Markov Decision Process from Q-values
- Solving a Markov decision process (MDP) means deriving an optimal policy that, for a given state, returns the action that maximizes the expected discounted utility a reflex agent following that policy will attain from that point onwards. Selecting the best action, however, requires knowing the optimal Q-values, or state-action values Q*(s, a), for the MDP. The Q-values can be derived from the MDP's model if this model is known (Eq. 1). If the model is unknown, however, learning methods must be employed.
- Q-Learning
- Q-Learning is a TD (temporal-difference) method that learns Q*(s, a) without knowing the true probability distribution over states and actions or the reward function of the MDP (i.e., without access to the true model). The algorithm replaces the sum over rewards weighted by the true model with an exponential moving average of the difference between observations acquired by acting in the environment and its current best estimate Q(s, a) at any given timestep t. With enough exploration, the algorithm's empirical estimate converges to Q*(s, a) (the update is written out below).
- Because Q-Learning does not attempt to learn the underlying MDP model, it falls in the category of model-free methods. The algorithm is also online, because it learns at every timestep as new observations are produced, and off-policy, because it learns the Q-values that derive an optimal policy even if it acts suboptimally during training (Algorithm 1).
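In standard MDP notation (transition model T, reward R, discount factor γ, learning rate α; the symbols here are the usual textbook ones, not necessarily those used in the original figures), the model-based Q-values of Eq. 1 and the temporal-difference update just described are:

```latex
% Optimal Q-values from a known model (the relationship Eq. 1 describes):
Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]

% Model-free Q-Learning: an exponential moving average of the observed samples
% r + \gamma \max_{a'} Q(s',a'), with mixing weight (learning rate) \alpha:
Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right]
```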
The classic formulation of Q-Learning, however, is tabular, meaning that Q-values are stored in a table. This has two limitations:
- Space requirements grow with the number of states and actions in a problem.
- There is no way to exploit similarities between Q-states; the algorithm only knows Q-values for Q-states it has experienced.
One way to overcome these limitations is to replace the Q-table with a function that maps from state features to Q-values. This allows computing Q-values on the fly without storing them explicitly. Also, assuming the features are good enough to capture similarities between states, the function should even be able to extrapolate to Q-states never seen before. This is referred to as approximate Q-Learning, and learning this Q-function is posed as an optimization problem (sketched below).
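As a rough illustration only (not this project's code), approximate Q-Learning with the simplest possible approximator, a linear Q-function, can be sketched as follows; `featurize` and the per-action weight matrix `w` are hypothetical placeholders:

```python
# Sketch of approximate Q-Learning with a linear Q-function, Q(s, a) = w[a] . featurize(s).
# `featurize` is a hypothetical feature extractor; `w` has shape (n_actions, n_features).
import numpy as np

def td_update(w, s, a, r, s_next, done, featurize, alpha=0.001, gamma=0.99):
    """One optimization step that nudges Q(s, a) toward the observed sample."""
    f_s, f_next = featurize(s), featurize(s_next)
    q_sa = w[a] @ f_s                                       # current estimate Q(s, a)
    target = r if done else r + gamma * np.max(w @ f_next)  # bootstrapped sample
    w[a] += alpha * (target - q_sa) * f_s                   # gradient descent step on 0.5 * (target - q_sa)**2
    return w
```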
- Deep Q-Network
- DQN is a form of approximate Q-Learning in which the Q-function is learned through convolutional neural networks. For a complete specification of the algorithm and its many hyperparameters, refer to the DQN paper. Its most prominent features are:
- A feature map φ. The paper does not describe the mapping in much detail, but one purpose for it is to average several pixel frames into a new frame, thus encoding the concept of velocity (i.e., motion blur). The CNNs are then trained on these augmented raw states.
- Two convolutional neural nets (CNNs). The convolutional layers in these networks learn features from raw pixels. The fully connected layers then learn a function that maps from this feature-based representation of states to the Q-values of all possible actions in those states.
- Experience replay. A buffer that stores a fixed number of past observations used to train the nets (via mini-batch supervised learning).
- Clipping of the training loss. The DQN paper mentions clipping the magnitude of the training loss to achieve better learning stability. OpenAI further characterizes this as the Huber loss in their Baselines DQN post. How these last pieces fit together is sketched after this list.
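A minimal sketch of how experience replay, the second (target) network, and the clipped/Huber loss combine, assuming numpy and a `target_net` callable that returns a batch of Q-value rows; this is illustrative, not the paper's actual code:

```python
# Illustrative sketch: replay buffer, bootstrapped targets from a target network, Huber loss.
import random
from collections import deque
import numpy as np

replay = deque(maxlen=50_000)  # experience replay: fixed-size buffer of past transitions

def store(s, a, r, s_next, done):
    replay.append((s, a, r, s_next, done))

def sample_targets(target_net, batch_size=32, gamma=0.99):
    """Sample a mini-batch and build regression targets y = r + gamma * max_a' Q_target(s', a')."""
    batch = random.sample(replay, batch_size)
    s, a, r, s_next, done = (np.array(x) for x in zip(*batch))
    q_next = target_net(s_next)                        # the second (periodically frozen) net evaluates s'
    y = r + gamma * (1.0 - done) * q_next.max(axis=1)  # no bootstrapping past terminal states
    return s, a, y

def huber(td_error, kappa=1.0):
    """The 'clipped' loss: quadratic for small errors, linear for large ones."""
    a = np.abs(td_error)
    quad = np.minimum(a, kappa)
    return 0.5 * quad ** 2 + kappa * (a - quad)
```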
DQN-Lite
My implementation (Algorithm 2) follows the general DQN algorithm but makes some simplifications. The key differences are:
- No feature mapping
- For this project I worked with Gym environments whose states are already feature-based. For example, CartPole states include features for position, velocity, etc. This means that states have already undergone dimensionality reduction (from raw pixels) and include temporal information.
- FFNs instead of CNNs
- Since there are no features to be learned from pixels, I removed the convolutional layers, effectively turning the neural nets into 2-layer feed-forward nets (FFNs) (Figure 1).
- Optional Loss Clipping
- For this project I used the L2 squared loss, though I implemented the algorithm in such a way that the loss can easily be swapped for any other, including the Huber loss. A sketch of the resulting network shape and pluggable loss follows.
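A minimal sketch of that shape in plain numpy (the framework, layer width, initialization, and ReLU activation here are illustrative assumptions, not necessarily what the project uses):

```python
# Illustrative 2-layer feed-forward Q-network: feature-based states in, one Q-value per action out.
import numpy as np

rng = np.random.default_rng(0)

def init_ffn(n_features, n_hidden, n_actions):
    """Weights for a 2-layer feed-forward net: features -> hidden -> Q-values."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_features, n_hidden)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_actions)),  "b2": np.zeros(n_actions),
    }

def q_values(params, states):
    """Forward pass: no convolutions, since states are already feature vectors."""
    h = np.maximum(0.0, states @ params["W1"] + params["b1"])  # hidden layer (ReLU assumed)
    return h @ params["W2"] + params["b2"]

# The training loss is pluggable: L2 squared error by default, Huber as a drop-in alternative.
l2_loss = lambda td_error: 0.5 * td_error ** 2
```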
Training and Testing
Training:
Two recurring themes in reinforcement learning are exploration vs exploitation and credit assignment. Since the Gym environments I solved in this project have well-defined rewards, I focused instead on balancing exploration vs exploitation.
Tables 1 and 2 show the fixed and variable settings I used in training.
Table 1. Fixed training settings:

| | | | | | | | | | Optimizer | Loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 77 | 512 | 500 | 2000 | 50000 | 32 | 0.001 | 0.99 | 1 | Adam (momentum = 0.99, rms = 0.999) | |
Table 2. Variable settings (one row per model):

| Model ID | | |
|---|---|---|
| 0 | 400000 | 0.1 |
| 1 | 500000 | 0.1 |
| 2 | 600000 | 0.1 |
| 3 | 400000 | 0.1 |
| 4 | 500000 | 0.1 |
| 5 | 600000 | 0.1 |
| 6 | 400000 | 0.02 |
| 7 | 500000 | 0.02 |
| 8 | 600000 | 0.02 |
| 9 | 400000 | 0.02 |
| 10 | 500000 | 0.02 |
| 11 | 600000 | 0.02 |
Plots of the training stats for each problem are included. When looking at these, keep in mind that training was done in terms of total timesteps but logging was done in terms of episodes. This has two implications:
- Episodes contain different numbers of timesteps, so ε-schedules often look parabolic in the plots even though they are linear in the number of steps (a schedule of this kind is sketched below).
- Training losses in the plots are averaged over whole episodes.
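For concreteness, a step-based linear ε-schedule of this kind looks like the sketch below; the start value, final value, and decay horizon are illustrative placeholders:

```python
def epsilon_at(step, eps_start=1.0, eps_final=0.1, decay_steps=400_000):
    """Linear anneal from eps_start to eps_final over decay_steps timesteps, then constant.
    Plotted against episodes (which vary in length), this straight line looks curved."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_final - eps_start)
```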
Testing:
OpenAI Gym defines the solved criteria for these environments as an average reward over 100 consecutive episodes. My tests consisted of three separate runs of 100 consecutive episodes each. If a policy scored lower than the required average in any of the three runs, I disqualified it, even if its 3-run average was within the solved criteria. An example of this occurred in LunarLander, where model 6 averaged above 200 across the three runs but scored 195.874 in the first 100-episode run. In the results below, the Success Model Ratio is the fraction of the 12 models that passed this stricter test, and the Best vs. General Model Improvement compares each environment's best-scoring model against the general model, i.e. the single hyperparameter combination whose policies solved all three environments. The evaluation loop is sketched below.
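A sketch of this evaluation protocol; `run_episode` is a hypothetical helper that rolls out one episode with the greedy policy in the Gym environment and returns its total reward:

```python
def evaluate(policy, run_episode, threshold, runs=3, episodes_per_run=100):
    """Three independent runs of 100 consecutive episodes; the policy only counts as
    solved if the average reward of *every* run clears the environment's threshold."""
    run_averages = []
    for _ in range(runs):
        rewards = [run_episode(policy) for _ in range(episodes_per_run)]
        run_averages.append(sum(rewards) / episodes_per_run)
    solved = all(avg >= threshold for avg in run_averages)
    return solved, run_averages, sum(run_averages) / runs
```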
Results: CartPole-v0
- Solved Criteria: avg. reward >= 195 over 100 episodes.
- Success Model Ratio: 7/12 = 0.583
- Best vs. General Model: 200.000 vs. 200.000 (relative improvement: 0.000)
- OpenAI Gym Wiki
Model ID | Avg. (1st 100 Episodes) | Avg. (2nd 100 Episodes) | Avg. (3rd 100 Episodes) | 3-Run Average |
---|---|---|---|---|
1 | 200.000 | 200.000 | 200.000 | 200.000 |
10 | 200.000 | 200.000 | 200.000 | 200.000 |
4 | 200.000 | 200.000 | 200.000 | 200.000 |
2 | 200.000 | 200.000 | 200.000 | 200.000 |
3 | 199.960 | 200.000 | 199.980 | 199.980 |
7 | 200.000 | 199.970 | 199.910 | 199.960 |
11 | 199.560 | 199.180 | 199.030 | 199.257 |
6 | 188.320 | 188.690 | 189.010 | 188.673 |
5 | 184.720 | 185.080 | 185.220 | 185.007 |
9 | 177.780 | 178.390 | 178.360 | 178.177 |
0 | 121.860 | 121.440 | 123.650 | 122.317 |
8 | 105.590 | 102.780 | 104.470 | 104.280 |
Results: LunarLander-v2
- Solved Criteria: avg. reward >= 200 over 100 episodes.
- Success Model Ratio: 4/12 = 0.333
- Best vs. General Model: 220.943 vs. 204.024 (relative improvement: 0.077)
- OpenAI Gym Wiki
Model ID | Avg. (1st 100 Episodes) | Avg. (2nd 100 Episodes) | Avg. (3rd 100 Episodes) | 3-Run Average |
---|---|---|---|---|
2 | 222.706 | 222.526 | 217.598 | 220.943 |
11 | 203.564 | 209.379 | 211.384 | 208.109 |
10 | 206.826 | 203.496 | 201.749 | 204.024 |
3 | 201.327 | 200.895 | 201.755 | 201.326 |
6 | 195.874 | 206.744 | 221.396 | 208.005 |
7 | 194.678 | 200.999 | 191.255 | 195.644 |
5 | 199.365 | 167.034 | 187.666 | 184.688 |
9 | 162.903 | 171.823 | 170.917 | 168.548 |
4 | 162.274 | 169.549 | 172.882 | 168.235 |
8 | 154.618 | 153.856 | 144.726 | 151.067 |
1 | 126.118 | 112.337 | 156.416 | 131.624 |
0 | 32.636 | -7.415 | 19.758 | 14.993 |
Results: MountainCar-v0
- Solved Criteria: avg. reward >= -110 over 100 episodes.
- Success Model Ratio: 3/12 = 0.250
- Best vs. General Model: -101.550 vs. -103.337 (relative improvement: 0.017)
- OpenAI Gym Wiki
Model ID | Avg. (1st 100 Episodes) | Avg. (2nd 100 Episodes) | Avg. (3rd 100 Episodes) | 3-Run Average |
---|---|---|---|---|
1 | -99.750 | -103.360 | -101.540 | -101.550 |
10 | -104.730 | -102.850 | -102.430 | -103.337 |
5 | -110.770 | -109.340 | -109.320 | -109.810 |
3 | -112.660 | -108.730 | -111.270 | -110.887 |
2 | -112.930 | -115.700 | -111.220 | -113.283 |
9 | -119.350 | -113.040 | -115.850 | -116.080 |
4 | -119.990 | -120.810 | -120.900 | -120.567 |
6 | -123.090 | -121.170 | -120.320 | -121.527 |
8 | -126.610 | -123.840 | -122.990 | -124.480 |
7 | -129.260 | -127.880 | -129.480 | -128.873 |
0 | -139.710 | -136.210 | -145.030 | -140.317 |
11 | -136.870 | -144.230 | -149.360 | -143.487 |