
This is a draft of *Deep Q-Network*, an introductory book on Deep Q-Networks (DQNs) for readers familiar with reinforcement learning. DQN was among the first successful attempts to combine deep learning with reinforcement learning algorithms. DQNs have since become a central part of deep reinforcement learning, and they merit a book-length discussion.

The discussion of DQNs is recent: the paper that introduced them was written in 2013. As a result, most of the discussion appears only in research papers. Because each paper enhances DQNs in a different way, it is not obvious how one technique relates to another. This book aims to clarify which techniques depend on one another and which are independent.

In **Chapter 1**, we review the basic concepts of reinforcement learning, such as Markov Decision Processes (MDP), Monte Carlo methods (MC), and Temporal Difference learning (TD). That chapter is restricted to tabular methods; function approximation is introduced in Chapter 2.

**Chapter 2** focuses on the “vanilla” Deep Q-Network proposed in 2015. We first show that a naive combination of deep learning and reinforcement learning is unstable, and then introduce two techniques that stabilize learning: *experience replay* and the *target network*.

Then, in **Chapter 3**, we focus on *experience replay*. Experience replay is the technique of storing past experiences in memory and learning from them while continuing to interact with the environment. In the original DQN, experiences are sampled uniformly at random from memory. We introduce *prioritized experience replay*, which samples experiences with larger TD errors more often so that the agent learns more from each stored experience. We also introduce *combined experience replay*, which adds the most recent experience to every sampled batch to mitigate the side effects of a large replay memory.
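To make the idea concrete, here is a minimal sketch of a replay memory with uniform random sampling, as in the original DQN. The names `Transition`, `ReplayBuffer`, `push`, and `sample` are hypothetical and chosen only for illustration; the states are stand-in integers rather than real observations.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one step of interaction with the environment.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-capacity memory with uniform random sampling."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once it is full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within one batch.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Usage: store five transitions in a memory of capacity three, then sample.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=2)
```

Only the three most recent transitions survive in this example, so every sampled state is at least 2. Prioritized and combined experience replay change how `sample` picks its batch, not how transitions are stored.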

In **Chapter 4**, we focus on *target networks*. A target network stabilizes learning by holding the TD target of Q-learning fixed. In the original DQN, the target network is kept fixed and periodically updated by copying the behavior network. We introduce *soft* target updates, which replace these periodic copies with a small update at every step. We also introduce the Double Q-learning variant, in which two separate networks are used so that one selects the greedy action and the other evaluates it, rather than a single network doing both.
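The two update styles can be sketched as follows. This is an illustration under simplifying assumptions: parameters are stored as a plain dictionary of floats, and the names `hard_update`, `soft_update`, and the mixing coefficient `tau` are hypothetical. In a real implementation the values would be network weight tensors.

```python
def hard_update(target_params, online_params):
    """Periodic update: copy every behavior-network parameter into the target."""
    for name, value in online_params.items():
        target_params[name] = value


def soft_update(target_params, online_params, tau):
    """Soft update: move each target parameter a fraction tau toward the
    corresponding behavior-network parameter."""
    for name, value in online_params.items():
        target_params[name] = (1.0 - tau) * target_params[name] + tau * value


# Usage with toy scalar "weights" (assumed values, for illustration only).
online = {"w": 1.0, "b": -2.0}
target = {"w": 0.0, "b": 0.0}

soft_update(target, online, tau=0.5)
soft_result = dict(target)  # {"w": 0.5, "b": -1.0}

hard_update(target, online)
hard_result = dict(target)  # identical to the behavior-network parameters
```

With a small `tau`, the target network trails the behavior network slowly, which plays the same stabilizing role as a long interval between periodic copies.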

In **Chapter X**, we introduce a distributional perspective on DQNs. Up to this point, DQNs predict only the expected return of each state-action pair. We introduce *distributional DQN*, which uses a distributional Bellman operator and models the full value distribution with a discrete distribution. We then project the distributional Bellman update onto a parametrized quantile distribution, which leads to *quantile regression DQN* and the *implicit quantile network*.
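As a brief orientation, the distributional view can be stated in one line. The notation here is an assumption made for this sketch and does not come from the chapters themselves: $Z(s, a)$ is the random return from taking action $a$ in state $s$, $R(s, a)$ is the random reward, $\gamma$ is the discount factor, and $(S', A')$ is the random next state-action pair.

$$
Z(s, a) \overset{D}{=} R(s, a) + \gamma Z(S', A'), \qquad Q(s, a) = \mathbb{E}\left[ Z(s, a) \right]
$$

The first relation holds in distribution rather than in expectation. Taking expectations of both sides recovers the usual Bellman equation for $Q$.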

In **Chapter Y**, we turn our attention to exploration. Q-learning is an off-policy algorithm: it learns about a greedy target policy while typically following an $\varepsilon$-greedy behavior policy. An $\varepsilon$-greedy policy is a simple exploration strategy that is often sufficient, but more sophisticated methods allow more efficient exploration.
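For reference, a minimal sketch of $\varepsilon$-greedy action selection is shown below. The function name `epsilon_greedy` and the list-of-floats representation of Q-values are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon,
    otherwise pick an action with the highest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice; ties resolve to the lowest action index.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Usage: with epsilon = 0 the policy is purely greedy.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
# With epsilon = 1 every action is chosen uniformly at random.
random_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0)
```

Setting `epsilon` to 0 recovers the greedy target policy, so the same function covers both policies mentioned above.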

Here are some additional materials we hope to include in this book, but whose placement we have not yet decided.