Deep Q-Network

An implementation of DeepMind's reinforcement learning paper, evaluated in OpenAI Gym

GitHub project

About

In this project I implemented a slight variation of Deep Q-Network (DQN), which I informally call DQN-Lite, and used it to learn several policies that solve three of the environments in OpenAI Gym: CartPole-v0, LunarLander-v2 and MountainCar-v0.

It turned out that the key to solving the different problems was not deviating from the network topology of the original paper (i.e.: number of parameters, layers or activation functions) but rather finding a suitable combination of training steps and ε-greedy decay schedule. I was able to identify hyperparameter combinations that score above the solved criteria for each problem, and I also found one combination that generated policies which solve all three problems.

Video 1a. CartPole-v0 policy.
Video 2a. LunarLander-v2 policy.
Video 3a. MountainCar-v0 policy.

From Q-Learning to DQN

Solving a Markov Decision Process from Q-values
Solving a Markov decision process (MDP) means deriving an optimal policy $\pi^*$ that, for a given state, returns the action that maximizes the expected discounted utility a reflex agent following $\pi^*$ will attain from that point onwards. Selecting the best action, however, requires knowing the optimal Q-values, or state-action values $Q^*(s, a)$, for the MDP. The Q-values can be derived from the MDP's model if this model is known (Eq. 1). If the model is unknown however, learning methods must be employed.

$$\pi^*(s) = \operatorname*{arg\,max}_{a} Q^*(s, a) \qquad \text{(1a)}$$

$$Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right] \qquad \text{(1b)}$$

Eq 1. (a) Optimal policy in terms of optimal Q-values. (b) Optimal Q-values in terms of the known MDP model: transition function T, reward function R and discount factor γ.
Q-Learning
Q-Learning is a TD (temporal-difference) method that learns $Q^*(s, a)$ without relying on knowing the true probability distribution over states and actions or the reward function of the MDP (i.e.: without access to the true model). The algorithm replaces the model-weighted sum over rewards in Eq. 1b with samples acquired by acting in the environment, moving its current best estimate $\hat{Q}(s, a)$ toward each sample as an exponential moving average at every timestep:

$$\hat{Q}(s, a) \leftarrow (1 - \alpha)\,\hat{Q}(s, a) + \alpha \left[ r + \gamma \max_{a'} \hat{Q}(s', a') \right]$$

With enough exploration, the algorithm's empirical estimate $\hat{Q}$ converges to $Q^*$.

Because Q-Learning does not attempt to learn the underlying MDP model, it falls in the category of model-free methods. The algorithm is also online, because it learns at every timestep as new observations are produced, and off-policy, because it learns the Q-values that derive an optimal policy even if it acts suboptimally during training (Algorithm 1).
Algorithm 1. Q-Learning an optimal policy over a finite number of timesteps.
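As a concrete illustration of Algorithm 1 (a minimal sketch of my own, not the project's code), here is tabular Q-Learning in Python, assuming a generic environment with the classic Gym `reset`/`step` interface and discrete, hashable states; the hyperparameter names and defaults are illustrative.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, num_steps=50000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal tabular Q-Learning sketch; assumes discrete, hashable states."""
    q = defaultdict(float)                      # Q-table: (state, action) -> estimate
    actions = list(range(env.action_space.n))
    state = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy exploration
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        next_state, reward, done, _ = env.step(action)
        # sample-based target: observed reward plus discounted best next estimate
        sample = reward if done else reward + gamma * max(q[(next_state, a)] for a in actions)
        # exponential moving average toward the sample (the Q-Learning update)
        q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * sample
        state = env.reset() if done else next_state
    return q
```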

The classic formulation of Q-Learning, however, is tabular, meaning that Q-values are stored in a table. This has two limitations:

  1. Space requirements grow with the number of states and actions in a problem.
  2. There is no way to exploit similarities between Q-states; the algorithm only knows Q-values for Q-states it has experienced.

One way to overcome these limitations is to replace the Q-table with a function that maps from state features to Q-values. This allows computing Q-values on the fly without storing them explicitly. Also, assuming the features are good enough to capture similarities between states, the function should even be able to extrapolate to Q-states never seen before. This is referred to as approximate Q-Learning, and learning this Q-function is posed as an optimization problem.
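To make this optimization view concrete, here is a small sketch of my own (not the project's code) of approximate Q-Learning with a linear Q-function over state features; each update is a gradient step on the squared TD error.

```python
import numpy as np

def approx_q_update(w, features, action, reward, next_features, done,
                    alpha=0.01, gamma=0.99):
    """One approximate Q-Learning step with a linear Q-function Q(s, a) = w[a] . f(s).

    w: weight matrix of shape (num_actions, num_features); features: f(s) as a 1-D array.
    """
    q_sa = w[action] @ features
    # sample-based target uses the best estimated Q-value of the next state
    best_next = 0.0 if done else np.max(w @ next_features)
    target = reward + gamma * best_next
    # gradient step on the squared TD error 0.5 * (target - Q(s, a))^2
    td_error = target - q_sa
    w[action] += alpha * td_error * features
    return w
```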

Deep Q-Network
DQN is a form of approximate Q-Learning in which the Q-function is learned through convolutional neural networks. For a complete specification of the algorithm and its many hyperparameters refer to the DQN paper. The most prominent features of the algorithm, however, are:

  • A feature map φ. The paper does not describe the mapping in much detail, but one of its purposes is to combine several recent pixel frames into a single input, thus encoding motion information such as velocity. The CNNs are then trained on these augmented raw states.
  • Two convolutional neural nets (CNNs): an online network and a periodically updated target network. The convolutional layers in these networks learn features from raw pixels. The fully connected layers then learn a function that maps from these feature-based representations of states to the Q-values of all possible actions in those states.
  • Experience replay. A buffer that stores a fixed number of past transitions, sampled to train the nets via mini-batch supervised learning (see the sketch after this list).
  • Clipping of the training loss. The DQN paper mentions clipping the magnitude of the error term in the update to achieve better learning stability. OpenAI further characterizes this as the Huber loss in their Baselines DQN post.
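The experience replay idea can be sketched as follows (an illustrative Python sketch of my own; the capacity and batch size shown are common DQN-style defaults, not necessarily this project's settings):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of past transitions for mini-batch training."""

    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)    # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random mini-batch, as in the DQN paper
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```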

DQN-Lite

My implementation (Algorithm 2) follows the general DQN algorithm but makes some simplifications. The key differences are:

No feature mapping
For this project I worked with Gym environments whose states are already feature-based. For example, CartPole states include features for position, velocity, etc. This means that the states are already low-dimensional (compared to raw pixels) and already encode temporal information, so no feature map is needed.
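A quick check (illustrative, using the older Gym API this project targets) shows that a CartPole observation is already a 4-dimensional feature vector rather than an image:

```python
import gym

env = gym.make("CartPole-v0")
obs = env.reset()                     # classic Gym API: reset() returns the observation
print(env.observation_space.shape)    # (4,)
print(obs)                            # [cart position, cart velocity, pole angle, pole angular velocity]
```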
FFNs instead of CNNs
Since there are no features to be learned from pixels, I removed the convolutional layers, effectively turning the neural nets into 2-layer feed-forward nets (FFNs) (Figure 1).
Figure 1. FFN-based Q-function. n = batch-size. f = features per state. a = number of actions for a state.
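The Q-function of Figure 1 can be sketched as below. This is a minimal PyTorch version of my own (the project's actual framework is not specified here), assuming a single hidden layer whose width of 512 mirrors the fully connected part of the original DQN architecture; the project's exact sizes may differ.

```python
import torch
import torch.nn as nn

def make_q_network(num_features, num_actions, hidden=512):
    """Maps a batch of state features (n x f) to Q-values for every action (n x a)."""
    return nn.Sequential(
        nn.Linear(num_features, hidden),
        nn.ReLU(),
        nn.Linear(hidden, num_actions),
    )

# Example with CartPole-v0: f = 4 features, a = 2 actions, n = 32 states per batch
q_net = make_q_network(num_features=4, num_actions=2)
q_values = q_net(torch.randn(32, 4))   # shape: (32, 2)
```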
Optional Loss Clipping
For this project I used the squared (L2) loss, though I implemented the algorithm in such a way that the loss can easily be replaced with any other loss, including the Huber loss.
Algorithm 2. DQN-Lite.
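As a small illustration of the pluggable loss (a sketch of my own, not a reproduction of Algorithm 2): the Huber loss is quadratic near zero and linear for large errors, which is what the error clipping in the DQN paper amounts to.

```python
import numpy as np

def l2_loss(td_errors):
    # squared (L2) loss used in this project
    return 0.5 * np.square(td_errors)

def huber_loss(td_errors, delta=1.0):
    # quadratic within [-delta, delta], linear outside -> bounded gradients
    quadratic = np.minimum(np.abs(td_errors), delta)
    linear = np.abs(td_errors) - quadratic
    return 0.5 * np.square(quadratic) + delta * linear

def batch_loss(td_errors, loss_fn=l2_loss):
    """Average loss over a mini-batch; loss_fn can be swapped freely (e.g., huber_loss)."""
    return np.mean(loss_fn(np.asarray(td_errors)))
```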

Training and Testing

Training:

Two recurring themes in reinforcement learning are exploration vs exploitation and credit assignment. Since the Gym environments I solved in this project have well-defined rewards, I focused instead on balancing exploration vs exploitation.

Tables 1 and 2 show the fixed and variable settings I used in training.

Optimizer Loss
77 512 500 2000 50000 32 0.001 0.99 1 Adam (momentum = 0.99, rms = 0.999)
Table 1. Fixed hyperparameters during training.
Model ID Training Steps Final ε
0 400000 0.1
1 500000 0.1
2 600000 0.1
3 400000 0.1
4 500000 0.1
5 600000 0.1
6 400000 0.02
7 500000 0.02
8 600000 0.02
9 400000 0.02
10 500000 0.02
11 600000 0.02
Table 2. Different combinations of total training steps and ε-annealing hyperparameters.
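For reference, a linear ε-annealing schedule of the kind varied in Table 2 can be sketched as below (an illustrative sketch; whether ε is annealed over all training steps or only a fraction of them is an assumption here):

```python
def epsilon_at(step, anneal_steps, final_epsilon, start_epsilon=1.0):
    """Linearly decay epsilon from start_epsilon to final_epsilon, then hold it."""
    fraction = min(step / float(anneal_steps), 1.0)
    return start_epsilon + fraction * (final_epsilon - start_epsilon)

# e.g. a schedule annealing to a final epsilon of 0.1 over 400,000 steps
print(epsilon_at(0, 400000, 0.1))        # 1.0
print(epsilon_at(200000, 400000, 0.1))   # 0.55
print(epsilon_at(400000, 400000, 0.1))   # 0.1
```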

Plots of the training stats for each problem are included. When looking at these keep in mind that training was done in terms of total timesteps but logging was done in terms of episodes. This has two implications:

  1. Episodes contain different numbers of timesteps, so ε-schedules often look parabolic even though they are linear in the number of steps.
  2. Training losses in the plots are averaged over whole episodes.

Testing:

OpenAI Gym defines the solved criteria for these environments as a minimum average reward over 100 consecutive episodes. My tests consisted of three separate runs of 100 consecutive episodes each. If a policy scored lower than the required average in any of the three runs, I disqualified it even if the 3-run average was within the solved criteria. An example of this occurred in LunarLander-v2, where model 6 averaged above 200 across the three runs but scored only 195.874 in the first 100-episode run.
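The test procedure can be summarized with the following sketch (illustrative; it uses the older Gym API, and `policy` stands in for the trained greedy policy):

```python
import gym
import numpy as np

def evaluate(policy, env_name="CartPole-v0", runs=3, episodes_per_run=100):
    """Three independent runs of 100 consecutive episodes; returns per-run averages
    and the 3-run average. A policy is disqualified if any single run misses the
    environment's solved threshold."""
    env = gym.make(env_name)
    run_averages = []
    for _ in range(runs):
        returns = []
        for _ in range(episodes_per_run):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                obs, reward, done, _ = env.step(policy(obs))   # act greedily
                total += reward
            returns.append(total)
        run_averages.append(float(np.mean(returns)))
    return run_averages, float(np.mean(run_averages))
```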

Results: CartPole-v0

  • Solved Criteria: avg. reward >= 195 over 100 episodes.
  • Success Model Ratio: 7/12 = 0.583
  • Best vs. General Model Improvement: 200.000 vs. 200.000 (0.0%)
  • OpenAI Gym Wiki
Video 1b. Best model (1).
Video 2b. General model (10).
Video 3b. Worst model (8).
Figure 1b. Training stats for Model 1.
Figure 2b. Training stats for Model 10.
Figure 3b. Training stats for Model 8.
Model ID Avg. (1st 100 Episodes) Avg. (2nd 100 Episodes) Avg. (3rd 100 Episodes) 3-Run Average
1+ 200.000 200.000 200.000 200.000
10* 200.000 200.000 200.000 200.000
4 200.000 200.000 200.000 200.000
2 200.000 200.000 200.000 200.000
3 199.960 200.000 199.980 199.980
7 200.000 199.970 199.910 199.960
11 199.560 199.180 199.030 199.257
6 188.320 188.690 189.010 188.673
5 184.720 185.080 185.220 185.007
9 177.780 178.390 178.360 178.177
0 121.860 121.440 123.650 122.317
8- 105.590 102.780 104.470 104.280
Table 1b. CartPole-v0 test stats. + = best model, * = general model (i.e.: resulted in a policy which solves all three problems), - = worst model.

Results: LunarLander-v2

  • Solved Criteria: avg. reward >= 200 over 100 episodes.
  • Success Model Ratio: 4/12 = 0.333
  • Best vs. General Model Improvement: 220.943 vs. 204.024 (≈7.7%)
  • OpenAI Gym Wiki
Video 1c. Best model (2).
Video 2c. General model (10).
Video 3c. Worst model (0).
Figure 1c. Training stats for Model 2.
Figure 2c. Training stats for Model 10.
Figure 3c. Training stats for Model 0.
Model ID Avg. (1st 100 Episodes) Avg. (2nd 100 Episodes) Avg. (3rd 100 Episodes) 3-Run Average
2+ 222.706 222.526 217.598 220.943
11 203.564 209.379 211.384 208.109
10* 206.826 203.496 201.749 204.024
3 201.327 200.895 201.755 201.326
6? 195.874 206.744 221.396 208.005
7 194.678 200.999 191.255 195.644
5 199.365 167.034 187.666 184.688
9 162.903 171.823 170.917 168.548
4 162.274 169.549 172.882 168.235
8 154.618 153.856 144.726 151.067
1 126.118 112.337 156.416 131.624
0- 32.636 -7.415 19.758 14.993
Table 1c. LunarLander-v2 test stats. + = best model, * = general model (i.e.: resulted in a policy which solves all three problems), - = worst model, ? = failed a run even though average falls within solved.

Results: MountainCar-v0

  • Solved Criteria: avg. reward >= -110 over 100 episodes.
  • Success Model Ratio: 3/12 = 0.250
  • Best vs. General Model Improvement: -101.550 vs. -103.337 (≈1.7%)
  • OpenAI Gym Wiki
Video 1d. Best model (1).
Video 2d. General model (10).
Video 3d. Worst model (11).
Figure 1d. Training stats for Model 1.
Figure 2d. Training stats for Model 10.
Figure 3d. Training stats for Model 11.
Model ID Avg. (1st 100 Episodes) Avg. (2nd 100 Episodes) Avg. (3rd 100 Episodes) 3-Run Average
1+ -99.750 -103.360 -101.540 -101.550
10* -104.730 -102.850 -102.430 -103.337
5 -110.770 -109.340 -109.320 -109.810
3 -112.660 -108.730 -111.270 -110.887
2 -112.930 -115.700 -111.220 -113.283
9 -119.350 -113.040 -115.850 -116.080
4 -119.990 -120.810 -120.900 -120.567
6 -123.090 -121.170 -120.320 -121.527
8 -126.610 -123.840 -122.990 -124.480
7 -129.260 -127.880 -129.480 -128.873
0 -139.710 -136.210 -145.030 -140.317
11- -136.870 -144.230 -149.360 -143.487
Table 1d. MountainCar-v0 test stats. + = best model, * = general model (i.e.: resulted in a policy which solves all three problems), - = worst model.