Promisingly, he showed that Q-learning would always “converge,” namely, as long as the system had the opportunity to try every action, from every state, as many times as necessary, it would always, eventually develop the perfect value function:
Brian Christian • The Alignment Problem
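The convergence guarantee above can be made concrete with a minimal tabular Q-learning sketch. Everything here is illustrative, not from the book: `env_step`, the toy two-state chain, and the hyperparameters are hypothetical choices; the update rule is the standard Q-learning backup.

```python
import random

random.seed(0)  # reproducible exploration

def q_learning(env_step, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma*max_a' Q(s',a') - Q(s,a)).
    With enough visits to every state-action pair, Q approaches the true values."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: act at random a fraction of the time, else greedily
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r, done = env_step(s, a)
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

# hypothetical 2-state toy chain: action 1 advances to the terminal
# state with reward 1; action 0 stays put with reward 0
def env_step(s, a):
    if a == 1:
        return 1, 1.0, True
    return 0, 0.0, False

Q = q_learning(env_step, n_states=2, n_actions=2)
```

After enough episodes, the learned value of the rewarding action approaches 1, illustrating the "eventually develops the perfect value function" claim on this toy problem.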
The DQN system used epsilon-greedy exploration, which involves learning about which actions produce reward by simply hitting buttons at random a certain fraction of the time.
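The epsilon-greedy rule the quote describes can be sketched in a few lines. This is a generic illustration, not DQN's actual code; the function name and tie-breaking choice are my own.

```python
import random

def epsilon_greedy(q_values, epsilon=0.05):
    """With probability epsilon, press a random button; otherwise take the
    action currently believed best (ties broken by lowest index)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

chosen = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)  # greedy: picks index 1
```

In practice epsilon is often annealed over training, starting near 1 (mostly random) and decaying toward a small constant as the value estimates improve.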
We can divide the available RL algorithms by whether the agent knows a model of the environment. Knowing the model means the agent has advance access to the state-transition probability matrix and the future rewards.
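When the model is known, the agent can plan directly from the transition matrix and rewards instead of learning by trial and error. Value iteration is the classic example; the tiny two-state MDP below is hypothetical, chosen only to make the backup concrete.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Given known transition probabilities P[s][a][s'] and rewards R[s][a],
    compute optimal state values by repeated Bellman backups."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    while True:
        V_new = [
            max(
                R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
        if max(abs(V_new[s] - V[s]) for s in range(n_states)) < tol:
            return V_new
        V = V_new

# hypothetical 2-state MDP: in state 0, action 1 pays 1 and moves to the
# absorbing state 1; every other action pays 0
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 advances
    [[0.0, 1.0], [0.0, 1.0]],  # state 1: absorbing
]
R = [[0.0, 1.0], [0.0, 0.0]]
V = value_iteration(P, R)
```

Model-free methods like Q-learning have to estimate these same values from sampled experience, since P and R are not given to them.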