In this project I implemented a slight variation of Deep Q-Network (DQN), which I informally call DQN-Lite, and used it to learn several policies that solve three of the environments in OpenAI Gym: CartPole-v0, LunarLander-v2 and MountainCar-v0.
It turned out that the key in solving the different problems was not to deviate from the network topology in the original paper (i.e.: number of parameters, layers or activation functions) but rather to find a suitable combination of training steps and ε-greedy decay schedule. I was able to identify hyperparameter combinations which score above the solved criteria for each problem and also found one combination which
generated policies that solve all three problems.
Video 1a. Cartpole-v0 policy.
Video 2b. LunarLander-v2 policy
Video 3c. MountainCar-v0 policy.
From Q-Learning to DQN
Solving a Markov Decision Process from Q-values
Solving a Markov decision process (MDP) means deriving an optimal policy that for a given state, returns the action that maximizes the expected discounted utility a reflex agent following will attain from that point onwards. Selecting the best action however, requires knowing the optimal Q-values or state-action values () for the MDP. The Q-values can be derived from the MDP’s model if this model is known (Eq. 1). If the model is unknown however, learning methods must be employed.
Eq 1.
(a) Optimal policy in terms of optimal Q-values. (b) Optimal Q-values in terms of known MDP model.
Q-Learning
Q-Learning is a TD (temporal-difference) method which learns without relying on knowing the true probability distribution over states and actions nor the reward function of the MDP (i.e.: without access to the true model). The algorithm replaces the sum over weighted rewards given the true model with an exponential moving average of the difference between observations acquired by acting in the environment and its current best estimate at any given timestep . With enough exploration, the algorithm’s empirical estimate converges to .
Because Q-Learning does not attempt to learn the underlying MDP model it falls in the category of model-free methods. The algorithm is also online because it learns at every timestep, as new observations are produced and it’s off-policy, because it learns the Q-values that derive an optimal policy even if during training it acts suboptimally (Algorithm 1).
Algorithm 1.
Q-Learning an optimal policy over a finite number of timesteps.
The classic formulation of Q-Learning however is tabular, meaning that that Q-values are stored in a table. This has two limitations:
Space requirements grow with the number of states and actions in a problem.
There is no way to exploit similarities between Q-states; the algorithm only knows Q-values for Q-states it has experienced.
One way to overcome these limitations is by replacing the Q-table with a function that maps from state features to values. This allows computing Q-values on the fly without storing them explicitly. Also, assuming the features are good enough to capture similarities between states, the function should even be able to extrapolate to Q-states never before seen. This is referred to as approximate Q-Learning and the way to learn this Q-function is posed as an optimization problem.
Deep Q-Network
DQN is a form of approximate Q-learning in which the Q-function is learned thru convolutional neural networks. For a complete specification of the algorithm and its many hyperparameters refer to the DQN paper. The most prominent features of the algorithm however are:
A feature map . The paper does not describe the mapping in much detail but one purpose for it is to average several pixel frames into a new frame, thus encoding the concept of velocity (i.e.: motion blur). The CNNs are then trained on these augmented raw states.
Two convolutional neural nets (CNNs) The convolutional layers in these networks learn features from raw pixels. The fully connected layers then learn a function that maps from these feature-based representation of states to all possible actions and Q-values for those state-action pairs.
Experience replay. A buffer that stores a fixed number of past observations used to train the nets (via mini-batch supervised learning).
Clipping of the training loss. The DQN paper mentions clipping the magnitude of the training loss to achieve better learning stability. OpenAI further characterizes this as the Huber loss in their Baselines DQN post.
DQN-Lite
My implementation (Algorithm 2) follows the general DQN algorithm but makes some simplifications. The key differences are:
No feature mapping
For this project I worked with Gym environments whose states are already feature-based. For example, CartPole states include features
for position, velocity, etc. This means that states have already undergone dimensionality reduction (from raw pixels) and include temporal information.
FFNs instead of CNNs
Since there are no features to be learned from pixels, I removed the convolutional layers, effectively turning the neural nets into 2-layer feed-forward nets (FFNs) (Figure 1).
Figure 1.
FFN-based Q-function. n = batch-size. f = features per state. a = number of actions for a state.
Optional Loss Clipping
For this project I used the L2 squared loss though I implemented the algorithm in such way that the loss can be easily replaced with any loss, including the Huber loss.
Algorithm 2.
DQN-Lite.
Training and Testing
Training:
Two recurring themes in reinforcement learning are exploration vs exploitation and credit assignment. Since the Gym environments I solved in this project have well-defined rewards, I focused instead on balancing exploration vs exploitation.
Tables 1 and 2 show the fixed and variable settings I used in training.
Optimizer
Loss
77
512
500
2000
50000
32
0.001
0.99
1
Adam (momentum = 0.99, rms = 0.999)
Table 1. Fixed hyperparameters during training.
Model ID
0
400000
0.1
1
500000
0.1
2
600000
0.1
3
400000
0.1
4
500000
0.1
5
600000
0.1
6
400000
0.02
7
500000
0.02
8
600000
0.02
9
400000
0.02
10
500000
0.02
11
600000
0.02
Table 2. Different combinations of total training steps and ε-annealing hyperparameters.
Plots of the training stats for each problem are included. When looking at these keep in mind that training was done in terms of total timesteps but logging was done in terms of episodes. This has two implications:
Episodes contain different number of timesteps so ε-schedules often look parabolic even though they are linear in the number of steps.
Training losses in the plots are averaged over whole episodes.
Testing:
OpenAI Gym defines the solved criteria for these environments as an average reward over 100 consecutive episodes. My tests consisted of three separate runs of 100 consecutive episodes each. If a policy scored lower than the required average in any of the three runs, I disqualified it even if the 3-run average was within the solved criteria. An example of this occurred in LunarLander, where model 6 scored above 200 over 3 runs but scored 195.874 in the first 100-episode run.
Results: CartPole-v0
Solved Criteria: avg. reward >= 195 over 100 episodes.
Success Model Ratio: 7/12 = 0.583
Best Vs General Model Improvement: 200.000/200.000 = 0.000
Table 1c. LunarLander-v2 test stats. + = best model, * = general model (i.e.: resulted in a policy which solves all three problems), - = worst model, ? = failed a run even though average falls within solved.
Results: MountainCar-v0
Solved Criteria: avg. reward >= -110 over 100 episodes.
Success Model Ratio: 3/12 = 0.250
Best Vs General Model Improvement: -101.550/-103.337 = 0.017