Merged
Changes from 14 commits
6 changes: 5 additions & 1 deletion config/gail_config.yaml
@@ -31,7 +31,7 @@ Pyramids:
beta: 1.0e-2
max_steps: 5.0e5
num_epoch: 3
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 0.5
steps: 10000
@@ -59,6 +59,10 @@ CrawlerStatic:
summary_freq: 3000
num_layers: 3
hidden_units: 512
behavioral_cloning:
demo_path: ./demos/ExpertCrawlerSta.demo
strength: 0.5
steps: 5000
reward_signals:
gail:
strength: 1.0
8 changes: 7 additions & 1 deletion docs/Migrating.md
@@ -1,6 +1,12 @@
# Migrating

## Migrating from ML-Agents toolkit v0.11.0 to v0.12.0
## Migrating from ML-Agents toolkit v0.12.0

### Important Changes
* Offline Behavioral Cloning has been removed. To learn from demonstrations, use the GAIL and
Behavioral Cloning features with either PPO or SAC, as sketched below. See [Imitation Learning](Training-Imitation-Learning.md) for more information.
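
A minimal sketch of the replacement setup, nested under an existing PPO or SAC trainer entry. The `behavioral_cloning` values mirror the Pyramids entry in this PR's `config/gail_config.yaml`; the GAIL strength is illustrative only:

```
    behavioral_cloning:
        demo_path: ./demos/ExpertPyramid.demo
        strength: 0.5
        steps: 10000
    reward_signals:
        gail:
            strength: 0.01
            demo_path: ./demos/ExpertPyramid.demo
```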

## Migrating from ML-Agents toolkit v0.11.0

### Important Changes
* Text actions and observations, and custom action and observation protos have been removed.
9 changes: 4 additions & 5 deletions docs/Reward-Signals.md
@@ -135,11 +135,10 @@ discriminator is trained to better distinguish between demonstrations and agent
In this way, while the agent gets better and better at mimicking the demonstrations, the
discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.

This approach, when compared to [Behavioral Cloning](Training-Behavioral-Cloning.md), requires
far fewer demonstrations to be provided. After all, we are still learning a policy that happens
to be similar to the demonstrations, not directly copying the behavior of the demonstrations. It
is especially effective when combined with an Extrinsic signal. However, the GAIL reward signal can
also be used independently to purely learn from demonstrations.
This approach learns a _policy_ that produces states and actions similar to the demonstrations,
requiring fewer demonstrations than direct cloning of the actions. In addition to learning purely
from demonstrations, the GAIL reward signal can be mixed with an extrinsic reward signal to guide
the learning process.

Using GAIL requires recorded demonstrations from your Unity environment. See the
[imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations.
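
As a rough illustration of mixing GAIL with an extrinsic signal, a trainer's `reward_signals` block might look like the sketch below. The strengths are illustrative, and the `gamma`/`encoding_size` keys are assumed typical settings rather than values taken from this change:

```
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    gail:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
        demo_path: ./demos/ExpertPyramid.demo
```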
30 changes: 0 additions & 30 deletions docs/Training-Behavioral-Cloning.md

This file was deleted.

31 changes: 15 additions & 16 deletions docs/Training-Imitation-Learning.md
@@ -19,40 +19,39 @@ imitation learning combined with reinforcement learning can dramatically
reduce the time the agent takes to solve the environment.
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids),
using 6 episodes of demonstrations can reduce training steps by a factor of more than 4.
See PreTraining + GAIL + Curiosity + RL below.
See Behavioral Cloning + GAIL + Curiosity + RL below.

Contributor: Note that you'll need to change the legend in the linked image.

Contributor Author: Changed

<p align="center">
<img src="images/mlagents-ImitationAndRL.png"
alt="Using Demonstrations with Reinforcement Learning"
width="700" border="0" />
</p>

The ML-Agents toolkit provides several ways to learn from demonstrations.
The ML-Agents toolkit provides two features that enable your agent to learn from demonstrations.
In most scenarios, you should combine these two features:

* To train using GAIL (Generative Adversarial Imitation Learning) you can add the
* GAIL (Generative Adversarial Imitation Learning) uses an adversarial approach to
reward your Agent for behaving similarly to a set of demonstrations. To use GAIL, you can add the
[GAIL reward signal](Reward-Signals.md#gail-reward-signal). GAIL can be
used with or without environment rewards, and works well when there are a limited
number of demonstrations.
* To help bootstrap reinforcement learning, you can enable
[pretraining](Training-PPO.md#optional-pretraining-using-demonstrations)
on the PPO trainer, in addition to using a small GAIL reward signal.
* To train an agent to exactly mimic demonstrations, you can use the
[Behavioral Cloning](Training-Behavioral-Cloning.md) trainer. Behavioral Cloning can be
used with demonstrations (in-editor), and learns very quickly. However, it usually is ineffective
on more complex environments without a large number of demonstrations.
* Behavioral Cloning (BC) trains the Agent's neural network to exactly mimic the actions
shown in a set of demonstrations.
[Behavioral Cloning](Training-PPO.md#optional-behavioral-cloning-using-demonstrations)
can be enabled on the PPO or SAC trainer. Behavioral Cloning tends to work best when
there are a lot of demonstrations, or in conjunction with GAIL and/or an extrinsic reward.

### How to Choose

If you want to help your agents learn (especially with environments that have sparse rewards)
using pre-recorded demonstrations, you can generally enable both GAIL and Pretraining.
using pre-recorded demonstrations, you can generally enable both GAIL and Behavioral Cloning
at low strengths in addition to having an extrinsic reward.
An example of this is provided for the Pyramids example environment under
`PyramidsLearning` in `config/gail_config.yaml`.
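
A sketch of what that combined setup can look like. The `behavioral_cloning` values mirror this PR's `config/gail_config.yaml`; the reward-signal strengths and the Curiosity signal (shown in the comparison above) are illustrative:

```
PyramidsLearning:
    trainer: ppo
    behavioral_cloning:
        demo_path: ./demos/ExpertPyramid.demo
        strength: 0.5
        steps: 10000
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
        curiosity:
            strength: 0.02
            gamma: 0.99
        gail:
            strength: 0.01
            demo_path: ./demos/ExpertPyramid.demo
```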

If you want to train purely from demonstrations, GAIL is generally the preferred approach, especially
if you have few (<10) episodes of demonstrations. An example of this is provided for the Crawler example
environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.

If you have plenty of demonstrations and/or a very simple environment, Offline Behavioral Cloning can be effective and quick. However, it cannot be combined with RL.
If you want to train purely from demonstrations, using GAIL and Behavioral Cloning _without_ an
extrinsic reward signal is the preferred approach. An example of this is provided for the Crawler
example environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.

## Recording Demonstrations

2 changes: 1 addition & 1 deletion docs/Training-ML-Agents.md
@@ -196,7 +196,7 @@ example environments are included in the provided config file.
| normalize | Whether to automatically normalize observations. | PPO, SAC |
| num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO |
| num_layers | The number of hidden layers in the neural network. | PPO, SAC, BC |
| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO, SAC |
| behavioral_cloning | Use demonstrations to bootstrap the policy neural network. See [Behavioral Cloning Using Demonstrations](Training-PPO.md#optional-behavioral-cloning-using-demonstrations). | PPO, SAC |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC, BC |
| save_replay_buffer | Saves the replay buffer when exiting training, and loads it on resume. | SAC |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
16 changes: 7 additions & 9 deletions docs/Training-PPO.md
@@ -224,24 +224,22 @@ the agent will need to remember in order to successfully complete the task.

Typical Range: `64` - `512`

## (Optional) Pretraining Using Demonstrations
## (Optional) Behavioral Cloning Using Demonstrations

In some cases, you might want to bootstrap the agent's policy using behavior recorded
from a player. This can help guide the agent towards the reward. Pretraining adds
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds
training operations that mimic a demonstration rather than attempting to maximize reward.
It is essentially equivalent to running [behavioral cloning](Training-Behavioral-Cloning.md)
in-line with PPO.

To use pretraining, add a `pretraining` section to the trainer_config. For instance:
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance:

```
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 0.5
steps: 10000
```

Below are the available hyperparameters for pretraining.
Below are the available hyperparameters for BC.

### Strength

@@ -258,10 +258,10 @@ See the [imitation learning guide](Training-Imitation-Learning.md) for more on `

### Steps

During pretraining, it is often desirable to stop using demonstrations after the agent has
During BC, it is often desirable to stop using demonstrations after the agent has
"seen" rewards, and allow it to optimize past the available demonstrations and/or generalize
outside of the provided demonstrations. `steps` corresponds to the training steps over which
pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set
BC is active. The learning rate of BC will anneal over the steps. Set

Contributor: The learning rate of the cloning behavioral cloning will anneal over the steps.

Contributor Author: Changed to the abbreviation BC

the steps to 0 for constant imitation over the entire training run.
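
For example, a `behavioral_cloning` block that keeps imitation active for the whole run might look like this sketch (the path and strength are illustrative):

```
behavioral_cloning:
    demo_path: ./demos/ExpertPyramid.demo
    strength: 0.5
    steps: 0    # 0 keeps BC at constant strength for the entire training run
```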

### (Optional) Batch Size
16 changes: 7 additions & 9 deletions docs/Training-SAC.md
@@ -239,24 +239,22 @@ default.

Default: `False`

## (Optional) Pretraining Using Demonstrations
## (Optional) Behavioral Cloning Using Demonstrations

In some cases, you might want to bootstrap the agent's policy using behavior recorded
from a player. This can help guide the agent towards the reward. Pretraining adds
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds
training operations that mimic a demonstration rather than attempting to maximize reward.
It is essentially equivalent to running [behavioral cloning](./Training-Behavioral-Cloning.md)
in-line with SAC.

To use pretraining, add a `pretraining` section to the trainer_config. For instance:
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance:

```
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 0.5
steps: 10000
```

Below are the available hyperparameters for pretraining.
Below are the available hyperparameters for BC.

Contributor: Should Behavioral Cloning be abbreviated? I think it would be better to keep it consistent in the docs. What do you think?

Contributor Author: I abbreviated it since we abbreviate PPO, SAC, and GAIL. I think the 1st mention on any particular page should be full with the abbreviation in parenthesis, then abbreviated - what do you think?

Contributor: Looks good.

### Strength

@@ -273,10 +271,10 @@ See the [imitation learning guide](Training-Imitation-Learning.md) for more on `

### Steps

During pretraining, it is often desirable to stop using demonstrations after the agent has
During BC, it is often desirable to stop using demonstrations after the agent has
"seen" rewards, and allow it to optimize past the available demonstrations and/or generalize
outside of the provided demonstrations. `steps` corresponds to the training steps over which
pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set
BC is active. The learning rate of BC will anneal over the steps. Set
the steps to 0 for constant imitation over the entire training run.

### (Optional) Batch Size
Empty file.
107 changes: 0 additions & 107 deletions ml-agents/mlagents/trainers/bc/models.py

This file was deleted.
