Hands-On Imitation Learning: From Behavior Cloning to Multi-Modal Imitation Learning

Reinforcement learning (RL) is a branch of machine learning concerned with learning from the guidance of scalar signals (rewards), in contrast to supervised learning, which requires full labels of the target variable.
An intuitive analogy for reinforcement learning is a school with two classes that repeatedly take the same kind of test. The first class solves the test and receives the full correct answers (supervised learning: SL). The second class solves the test and receives only a grade for each question (reinforcement learning: RL). In the first case, it is easier for the students to learn the correct answers and memorize them. In the second class, the task is harder because the students can learn only by trial and error. However, their learning will be more robust, because they know not only what is right but also which wrong answers to avoid.
To learn efficiently with RL, an accurate reward signal (the grades) must be designed, which is a difficult task, especially for real-world applications. For example, an expert human driver knows how to drive but cannot write down a reward function for the skill of 'correct driving'; the same holds for cooking or painting. This need gave rise to imitation learning (IL) methods. IL is a branch of RL concerned with learning from expert trajectories alone, without knowing the rewards. Its main application areas are robotics and autonomous driving.
In the following, we will explore the best-known IL methods in the literature, ordered from oldest to newest, as shown in the timeline picture below.

The mathematical formulations are shown along with the nomenclature of the symbols. However, theoretical derivations are kept to a minimum here; if further depth is needed, the original papers are cited in the references section at the end. The full code for recreating all the experiments is provided in the accompanying GitHub repo.
So buckle up, and let's dive into imitation learning, from behavior cloning (BC) to information-maximizing generative adversarial imitation learning (InfoGAIL).
Example Environment
The environment used in this post is represented as a 15×15 grid. The environment state is illustrated below:
- Agent: red color
- Initial agent location: blue color
- Walls: green color

The agent's goal is to reach the first row by the shortest possible path, passing through any of the three windows and ending at the location symmetric to its initial position with respect to the vertical axis through the middle of the grid. The goal location is not shown in the state grid.
So the initial position has only 15 possibilities, and the goal location changes accordingly.
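To make the setup concrete, here is a minimal sketch of how such a state grid and goal could be constructed (the exact RGB values, wall layout, and helper names are assumptions for illustration; only the 15×15 size, the color roles, and the mirrored first-row goal come from the description above):

```python
import numpy as np

GRID_SIZE = 15

# Color roles taken from the description; the exact RGB values are assumed.
RED   = (255, 0, 0)    # agent
BLUE  = (0, 0, 255)    # initial agent location
GREEN = (0, 255, 0)    # walls

def render_state(agent_pos, init_pos, wall_cells):
    """Render the 15x15 environment state as an RGB array.

    agent_pos, init_pos : (row, col) tuples
    wall_cells          : iterable of (row, col) wall locations
                          (the wall/window layout is a placeholder)
    """
    state = np.zeros((GRID_SIZE, GRID_SIZE, 3), dtype=np.uint8)
    for r, c in wall_cells:
        state[r, c] = GREEN
    state[init_pos] = BLUE
    state[agent_pos] = RED   # drawn last so the agent is always visible
    return state

def goal_of(init_pos):
    """Goal: first row, column mirrored about the grid's vertical center.
    The goal is *not* drawn into the state grid."""
    _, col = init_pos
    return (0, GRID_SIZE - 1 - col)
```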
Action Space
The action space A consists of a discrete value from 0 to 4, representing movement in the four directions plus a stop action, as illustrated below:
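As a minimal code sketch (the text only fixes the 0–4 range, the four move directions, and the stop action; the specific integer-to-direction mapping below is an assumption):

```python
from enum import IntEnum

class Action(IntEnum):
    UP = 0
    DOWN = 1
    LEFT = 2
    RIGHT = 3
    STOP = 4

# Offset applied to the agent's (row, col) position for each action.
ACTION_TO_DELTA = {
    Action.UP:    (-1, 0),
    Action.DOWN:  (1, 0),
    Action.LEFT:  (0, -1),
    Action.RIGHT: (0, 1),
    Action.STOP:  (0, 0),
}
```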

Reward Function
The ground-truth reward R(s, a) is a function of the current state and action; its value equals the displacement towards the goal produced by the action:

R(s_t, a_t) = d(s_t) − d(s_{t+1})

where d(s) is the distance between the agent's position in state s and the goal location, and s_{t+1} is the state reached after taking action a_t in state s_t.
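A minimal sketch of this reward, assuming Manhattan distance as the distance metric on the grid (the helper names are hypothetical):

```python
def manhattan(pos, goal):
    """Assumed distance metric; the text only says 'displacement towards the goal'."""
    return abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])

def reward(pos, next_pos, goal):
    """R(s, a) = d(s) - d(s'): positive when the action moves the agent
    closer to the goal, zero for a stop, negative when it moves away."""
    return manhattan(pos, goal) - manhattan(next_pos, goal)
```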