Watch and Match: Supercharging Imitation with Regularized Optimal Transport

Abstract

Imitation learning holds tremendous promise in learning policies efficiently for complex decision making problems. Current state-of-the-art algorithms often use inverse reinforcement learning (IRL), where given a set of expert demonstrations, an agent alternately infers a reward function and the associated optimal policy. However, such IRL approaches often require substantial online interactions for complex control problems. In this work, we present Regularized Optimal Transport (ROT), a new imitation learning algorithm that builds on recent advances in optimal transport based trajectory-matching. Our key technical insight is that adaptively combining trajectory-matching rewards with behavior cloning can significantly accelerate imitation even with only a few demonstrations. Our experiments on 20 visual control tasks across the DeepMind Control Suite, the OpenAI Robotics Suite, and the Meta-World Benchmark demonstrate an average of 7.8× faster imitation to reach 90% of expert performance compared to prior state-of-the-art methods. On real-world robotic manipulation, with just one demonstration and an hour of online training, ROT achieves an average success rate of 90.1% across 14 tasks.

Method

Regularized Optimal Transport (ROT) is a new imitation learning algorithm that adaptively combines offline behavior cloning with online trajectory-matching based rewards. This enables significantly faster imitation across a variety of simulated and real robotic tasks, while remaining compatible with high-dimensional visual observations. On our xArm robot, ROT can learn visual policies with only a single human demonstration and under an hour of online training.
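The trajectory-matching reward can be illustrated with a minimal sketch: compute a cosine-distance cost matrix between (hypothetical) encoder features of the agent's and expert's trajectories, solve an entropy-regularized optimal transport problem with Sinkhorn iterations, and assign each agent timestep the negative transported cost. The function names, the cosine cost, and the regularization value below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=100):
    """Entropy-regularized OT between uniform marginals over two trajectories.

    cost: (T, T_e) pairwise cost matrix. Returns a (T, T_e) transport plan.
    """
    T, T_e = cost.shape
    a = np.ones(T) / T            # uniform weight over agent timesteps
    b = np.ones(T_e) / T_e        # uniform weight over expert timesteps
    K = np.exp(-cost / eps)       # Gibbs kernel
    v = np.ones(T_e) / T_e
    for _ in range(n_iters):
        u = a / (K @ v)           # alternating marginal projections
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

def ot_rewards(agent_feats, expert_feats, eps=0.1):
    """Per-timestep pseudo-rewards from matching agent to expert features.

    agent_feats: (T, d), expert_feats: (T_e, d) -- assumed to come from
    some observation encoder (hypothetical here).
    """
    a = agent_feats / np.linalg.norm(agent_feats, axis=1, keepdims=True)
    e = expert_feats / np.linalg.norm(expert_feats, axis=1, keepdims=True)
    cost = 1.0 - a @ e.T                  # cosine distance, in [0, 2]
    plan = sinkhorn(cost, eps)            # optimal coupling
    # Reward is the negative cost transported out of each agent timestep;
    # closer trajectory matching yields rewards nearer zero.
    return -(plan * cost).sum(axis=1)
```

In practice the rewards would be rescaled (each row of the plan sums to roughly 1/T) and recomputed as the encoder is updated during online training.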

Our main findings can be summarized as:

  • ROT outperforms prior state-of-the-art imitation methods, reaching 90% of expert performance 7.8× faster than our strongest baselines on simulated visual control benchmarks.
  • On real-world tasks, with a single human demonstration and an hour of training, ROT achieves an average success rate of 90.1% with randomized robot initialization and image observations. This is significantly higher than behavior cloning (36.1%) and adversarial IRL (14.6%) based approaches.
  • ROT exceeds the performance of state-of-the-art RL trained with rewards, while coming close to methods that augment RL with demonstrations. Unlike standard RL methods, ROT does not require hand-specification of the reward function.
  • Ablation studies demonstrate the importance of every component in ROT, particularly the role that soft Q-filtering plays in stabilizing training and the need for OT-based rewards during online learning.
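The adaptive combination referenced above can be sketched as follows: a soft Q-filter compares the critic's value of the behavior-cloned action against the current policy's action, and the resulting weight interpolates between the BC loss and the RL (trajectory-matching reward) loss. This is a minimal illustration under assumed names; the sigmoid form of the gating and the exact loss mixing are simplifications, not the paper's precise formulation.

```python
import numpy as np

def soft_q_filter_weight(q_bc, q_pi, beta=1.0):
    """Soft Q-filter: sigmoid of the Q-value gap between the BC action
    and the current policy's action. Approaches 1 when the critic still
    prefers imitating the demonstration, 0 when the policy has surpassed it."""
    gap = np.asarray(q_bc, dtype=float) - np.asarray(q_pi, dtype=float)
    return 1.0 / (1.0 + np.exp(-beta * gap))

def combined_actor_loss(bc_loss, rl_loss, q_bc, q_pi, beta=1.0):
    """Adaptively weight behavior cloning against the reward-driven loss,
    using the mean soft Q-filter weight over a batch of states."""
    lam = float(np.mean(soft_q_filter_weight(q_bc, q_pi, beta)))
    return lam * bc_loss + (1.0 - lam) * rl_loss
```

The design intuition is that BC regularization dominates early (when the critic rates demonstrated actions higher) and decays automatically once online learning overtakes the demonstrations, without a hand-tuned schedule.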

Robot Results

We provide evaluation rollouts of ROT on a set of 14 real-world manipulation tasks. With just one demonstration and one hour of online training, ROT achieved an average success rate of 90.1% across 14 tasks. This is significantly higher than behavior cloning (36.1%) and adversarial IRL (14.6%) based approaches.





ROT performs well across varied start positions
ROT failure cases

Simulation Results

Our experiments on 20 tasks across the DeepMind Control Suite, the OpenAI Robotics Suite, and the Meta-World Benchmark demonstrate an average of 7.8× faster imitation to reach 90% of expert performance compared to prior state-of-the-art methods. Individually, to reach 90% of expert performance, ROT is on average

  • 8.7× faster on DeepMind Control tasks
  • 2.1× faster on OpenAI Robotics tasks
  • 8.9× faster on Meta-World tasks

Citation

@article{haldar2022watch,
  title={Watch and Match: Supercharging Imitation with Regularized Optimal Transport},
  author={Haldar, Siddhant and Mathur, Vaibhav and Yarats, Denis and Pinto, Lerrel},
  journal={CoRL},
  year={2022}
}