DSR: Reinforcement Learning with Dynamical Skill Refinement

Abstract

Skill-based reinforcement learning (RL) is an efficient paradigm for solving sparse-reward tasks: skills are extracted from demonstration datasets, and a high-level policy is learned to select among them. Because each skill selected by the high-level policy is executed for multiple consecutive timesteps, the high-level policy is effectively learned in a temporally abstract Markov decision process (TA-MDP) built on the skills, which shortens the task horizon and reduces the exploration cost. However, these skills are usually sub-optimal because of the potentially low quality and low coverage of the datasets, which leads to sub-optimal performance on the downstream task. Refining the skills is the intuitive remedy, yet it is difficult to do so while guaranteeing performance improvement and avoiding the non-stationarity of transition dynamics caused by skill changes. To address this dilemma between sub-optimality and ineffectiveness, we propose a unified optimization objective for the entire hierarchical policy. We theoretically prove that this unified objective guarantees performance improvement in the TA-MDP, and that optimizing performance in the TA-MDP is equivalent to optimizing a performance lower bound of the entire hierarchical policy in the original MDP. Furthermore, to overcome the phenomenon of skill space collapse, we propose the dynamical skill refinement (DSR) mechanism, which gives our method its name. Experimental results empirically validate the effectiveness of our method and show its advantages over state-of-the-art (SOTA) methods.
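To make the temporal-abstraction idea in the abstract concrete, below is a minimal, self-contained sketch (not the paper's DSR implementation) of how a fixed skill horizon H turns the original MDP into a TA-MDP from the high-level policy's point of view. All names, dynamics, and dimensions here are hypothetical stand-ins for illustration only.

```python
import numpy as np

# Hypothetical illustration of skill-based RL's temporal abstraction,
# NOT the paper's method: a high-level policy picks a latent skill z,
# which is then executed for H consecutive primitive timesteps.

H = 10            # skill execution horizon (assumed fixed here)
STATE_DIM = 4
SKILL_DIM = 2     # dimensionality of the latent skill space

rng = np.random.default_rng(0)

def high_level_policy(state):
    """Select a latent skill z conditioned on the current state (stand-in)."""
    return rng.normal(size=SKILL_DIM)

def skill_policy(state, z):
    """Low-level skill: decode (state, skill) into a primitive action (stand-in)."""
    return np.tanh(state[:2] + z)

def env_step(state, action):
    """Toy environment with a sparse reward, for illustration only."""
    next_state = state + 0.1 * np.concatenate([action, action])
    reward = float(np.linalg.norm(next_state) > 3.0)  # sparse signal
    return next_state, reward

state = np.zeros(STATE_DIM)
for macro_step in range(5):          # 5 TA-MDP transitions ...
    z = high_level_policy(state)     # one high-level decision
    cum_reward = 0.0
    for t in range(H):               # ... each spanning H primitive steps
        action = skill_policy(state, z)
        state, reward = env_step(state, action)
        cum_reward += reward
    # From the high-level view, (state, z, cum_reward, next state) is a
    # single transition: the effective task horizon shrinks by a factor
    # of H, which is what reduces exploration cost in sparse-reward tasks.
    print(f"macro step {macro_step}: cumulative reward {cum_reward}")
```

The sketch also shows why refining `skill_policy` during training makes the TA-MDP's transition dynamics non-stationary from the high-level policy's perspective, which is the difficulty the paper's unified objective is designed to handle.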

Type
Publication
Frontiers of Computer Science