Double DQN & Dueling DQN

Value-based Sequential Decision

Implement TODO

Double DQN

Our neural network's estimate of Qmax is inherently noisy, yet every update pushes the network toward a Q target built from that maximal (and therefore error-prone) value; it is precisely this reliance on Qmax that causes overestimation.

As a consequence, at the beginning of training we do not yet have enough information about the best action to take, so using the maximum Q value (which is noisy) to pick the best action can produce false positives. If non-optimal actions are regularly given higher Q values than the optimal action, learning becomes much harder.

The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:

  • use our DQN network to select what is the best action to take for the next state (the action with the highest Q value).
  • use our target network to calculate the target Q value of taking that action at the next state.

We therefore have two networks: the evaluation network $Q_{\text{eval}}$ (used to compute the Q estimate) and the target network $Q_{\text{target}}$ (used to compute the Q target).

Natural DQN:

$$Y_t = r_{t+1} + \gamma \max_{a} Q_{\text{target}}(s_{t+1}, a)$$

Double DQN:

$$Y_t = r_{t+1} + \gamma \, Q_{\text{target}}\big(s_{t+1}, \arg\max_{a} Q_{\text{eval}}(s_{t+1}, a)\big)$$
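A minimal sketch of the two target computations, assuming PyTorch; the networks `q_eval` (online) and `q_target` (periodically copied target) and the tensor shapes are illustrative assumptions, not taken from the original code:

```python
import torch

# Assumed: q_eval and q_target map a batch of states [B, state_dim]
# to per-action Q values [B, n_actions]. Terminal-state masking is omitted.

def natural_dqn_target(q_target, rewards, next_states, gamma=0.99):
    # Natural DQN: the target network both selects and evaluates the action.
    with torch.no_grad():
        next_q = q_target(next_states)                       # [B, n_actions]
        return rewards + gamma * next_q.max(dim=1).values    # [B]

def double_dqn_target(q_eval, q_target, rewards, next_states, gamma=0.99):
    # Double DQN: the online network selects the action ...
    with torch.no_grad():
        best_a = q_eval(next_states).argmax(dim=1, keepdim=True)     # [B, 1]
        # ... and the target network evaluates that selected action.
        next_q = q_target(next_states).gather(1, best_a).squeeze(1)  # [B]
        return rewards + gamma * next_q
```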

Figure: comparison of the cost curves of natural DQN and Double DQN.

Dueling DQN

Theory

Remember that the Q value Q(s, a) measures how good it is to be in a given state and to take a given action at that state.

So we can decompose Q(s, a) as the sum of:

  • V(s): the value of being at that state
  • A(s, a): the advantage of taking that action at that state (how much better it is to take this action compared with all other possible actions at that state).

$$Q(s, a) = V(s) + A(s, a)$$

With Dueling DQN, we want to separate the estimation of these two elements by using two streams:

  • one that estimates the state value V(s)
  • one that estimates the advantage for each action A(s, a)

And then we combine these two streams through a special aggregation layer to get an estimate of Q(s, a).
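A rough sketch of this architecture, assuming PyTorch; the layer sizes, names, and the single shared hidden layer are placeholder assumptions:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        # Shared feature layer.
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # Stream 1: state value V(s), one scalar per state.
        self.value = nn.Linear(hidden, 1)
        # Stream 2: advantage A(s, a), one value per action.
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, state):
        x = self.features(state)
        v = self.value(x)          # [B, 1]
        a = self.advantage(x)      # [B, n_actions]
        # Aggregation layer: combine V and A; subtracting the mean advantage
        # keeps the two streams identifiable (see the Implementation section).
        return v + (a - a.mean(dim=1, keepdim=True))
```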

Reason

With a normal DQN, we need to calculate the value of each action at a given state. But what is the point if the value of the state itself is bad? Why calculate the value of every action at a state when all of those actions lead to death?

By decoupling the two estimators, we can compute V(s) on its own. This is particularly useful for states whose actions do not affect the environment in any relevant way; in such states it is unnecessary to calculate the value of each action. For instance, moving right or left only matters when there is a risk of collision, and in most states the choice of action has little effect on what happens.

Implementation

The aggregation is not the naive sum given in the equation above. If we used that sum directly, we would face an identifiability issue: given Q(s, a), we cannot uniquely recover A(s, a) and V(s), which is a problem for backpropagation.

To avoid this problem, we can constrain the advantage stream. The original formulation forces the advantage of the chosen (best) action to be zero; in practice this is commonly replaced by subtracting the average advantage over all possible actions of the state, so that the advantages have zero mean.
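In equation form, the mean-subtraction variant used in the sketch above is:

$$Q(s, a) = V(s) + \Big( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \Big)$$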

Decoupling the estimation into these two streams helps us obtain much more reliable Q values for each action.