- Felix Grün1,2, Muhammed Saif-ur-Rehman1, Ioannis Iossifidis1
- Department for Computer Science, Ruhr-West University of Applied Sciences, Lützowstraße 5, 46236 Bottrop, Germany
- Institut für Neuroinformatik, Ruhr-University Bochum, Universitätsstraße 150, 44801 Bochum, Germany
The relation between the activity of dopaminergic neurons and the temporal difference error in Reinforcement Learning (RL) is well known in the fields of machine learning and neuroscience. More recently, distributional RL has inspired a successful search for evidence of an equivalent neural mechanism. Distributional RL methods aim to make better use of the agent's available interactions with the environment: they learn the probability distribution of the future return, where non-distributional agents learn only the expectation of that distribution, the value. Distributional algorithms often outperform comparable non-distributional methods in learning speed and final performance, usually benchmarked on the Arcade Learning Environment (ALE). Increasingly sample-efficient distributional RL algorithms for the discrete action domain have been developed over time, varying primarily in how they parameterize their approximations of the value distribution. We transfer three of the most well-known and successful of these to the continuous action domain by extending two powerful actor-critic algorithms with distributional critics. The parameterizations are all based on quantile regression and differ crucially in how the quantiles to be predicted are selected. We investigate whether the relative performance of the methods in discrete action spaces carries over to the continuous case. To that end, we compare them empirically on the PyBullet implementations of a set of MuJoCo continuous control tasks.
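The quantile regression approach common to these parameterizations can be illustrated with the quantile Huber loss used in quantile-based distributional RL. The following is a minimal NumPy sketch, not the authors' implementation; the array shapes, the function name, and the default `kappa` threshold are illustrative assumptions.

```python
import numpy as np

def quantile_huber_loss(td_errors, taus, kappa=1.0):
    """Quantile Huber loss for quantile-regression distributional RL (sketch).

    td_errors: array of shape (N, M), pairwise differences between M target
               return samples and the N predicted quantile values.
    taus:      array of shape (N,), quantile fractions in (0, 1) at which
               the critic predicts quantile values.
    kappa:     Huber threshold; as kappa -> 0 this approaches plain
               quantile regression.
    """
    abs_err = np.abs(td_errors)
    # Huber loss: quadratic near zero, linear beyond the kappa threshold.
    huber = np.where(
        abs_err <= kappa,
        0.5 * td_errors ** 2,
        kappa * (abs_err - 0.5 * kappa),
    )
    # Asymmetric weighting: under- and over-estimation are penalized
    # according to the target quantile fraction tau.
    weight = np.abs(taus[:, None] - (td_errors < 0).astype(float))
    return (weight * huber / kappa).mean()
```

Minimizing this loss drives each prediction toward the corresponding quantile of the return distribution; the methods compared here differ mainly in how the `taus` are chosen (fixed, sampled, or proposed by a network).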
This work is supported by the Ministry of Economics, Innovation, Digitization and Energy of the State of North Rhine-Westphalia and the European Union, grants GE-2-2-023A (REXO) and IT-2-2-023 (VAFES).