Merged
Changes from 14 commits
6 changes: 5 additions & 1 deletion config/gail_config.yaml
@@ -31,7 +31,7 @@ Pyramids:
beta: 1.0e-2
max_steps: 5.0e5
num_epoch: 3
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 0.5
steps: 10000
@@ -59,6 +59,10 @@ CrawlerStatic:
summary_freq: 3000
num_layers: 3
hidden_units: 512
behavioral_cloning:
demo_path: ./demos/ExpertCrawlerSta.demo
strength: 0.5
steps: 5000
reward_signals:
gail:
strength: 1.0
8 changes: 7 additions & 1 deletion docs/Migrating.md
@@ -1,6 +1,12 @@
# Migrating

## Migrating from ML-Agents toolkit v0.11.0 to v0.12.0
## Migrating from ML-Agents toolkit v0.12.0

### Important Changes
* Offline Behavioral Cloning has been removed. To learn from demonstrations, use the GAIL and
Behavioral Cloning features with either PPO or SAC, as sketched below. See [Imitation Learning](Training-Imitation-Learning.md) for more information.
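
A minimal sketch of the replacement setup, nested under an existing PPO or SAC trainer entry. The `behavioral_cloning` values mirror the Pyramids entry in this PR's `config/gail_config.yaml`; the GAIL strength is illustrative only:

```
    behavioral_cloning:
        demo_path: ./demos/ExpertPyramid.demo
        strength: 0.5
        steps: 10000
    reward_signals:
        gail:
            strength: 0.01
            demo_path: ./demos/ExpertPyramid.demo
```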

## Migrating from ML-Agents toolkit v0.11.0

### Important Changes
* Text actions and observations, and custom action and observation protos have been removed.
9 changes: 4 additions & 5 deletions docs/Reward-Signals.md
@@ -135,11 +135,10 @@ discriminator is trained to better distinguish between demonstrations and agent
In this way, while the agent gets better and better at mimicking the demonstrations, the
discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it.

This approach, when compared to [Behavioral Cloning](Training-Behavioral-Cloning.md), requires
far fewer demonstrations to be provided. After all, we are still learning a policy that happens
to be similar to the demonstrations, not directly copying the behavior of the demonstrations. It
is especially effective when combined with an Extrinsic signal. However, the GAIL reward signal can
also be used independently to purely learn from demonstrations.
This approach learns a _policy_ that produces states and actions similar to the demonstrations,
requiring fewer demonstrations than direct cloning of the actions. In addition to learning purely
from demonstrations, the GAIL reward signal can be mixed with an extrinsic reward signal to guide
the learning process.

Using GAIL requires recorded demonstrations from your Unity environment. See the
[imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations.
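
As a rough illustration of mixing GAIL with an extrinsic signal, a trainer's `reward_signals` block might look like the sketch below. The strengths are illustrative, and the `gamma`/`encoding_size` keys are assumed typical settings rather than values taken from this change:

```
reward_signals:
    extrinsic:
        strength: 1.0
        gamma: 0.99
    gail:
        strength: 0.01
        gamma: 0.99
        encoding_size: 128
        demo_path: ./demos/ExpertPyramid.demo
```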
30 changes: 0 additions & 30 deletions docs/Training-Behavioral-Cloning.md

This file was deleted.

31 changes: 15 additions & 16 deletions docs/Training-Imitation-Learning.md
@@ -19,40 +19,39 @@ imitation learning combined with reinforcement learning can dramatically
reduce the time the agent takes to solve the environment.
For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids),
using 6 episodes of demonstrations can reduce training steps by a factor of more than 4.
See PreTraining + GAIL + Curiosity + RL below.
See Behavioral Cloning + GAIL + Curiosity + RL below.

Contributor: Note that you'll need to change the legend in the linked image.

Contributor Author: Changed

<p align="center">
<img src="images/mlagents-ImitationAndRL.png"
alt="Using Demonstrations with Reinforcement Learning"
width="700" border="0" />
</p>

The ML-Agents toolkit provides several ways to learn from demonstrations.
The ML-Agents toolkit provides two features that enable your agent to learn from demonstrations.
In most scenarios, you should combine these two features:

* To train using GAIL (Generative Adversarial Imitation Learning) you can add the
* GAIL (Generative Adversarial Imitation Learning) uses an adversarial approach to
reward your Agent for behaving similarly to a set of demonstrations. To use GAIL, you can add the
[GAIL reward signal](Reward-Signals.md#gail-reward-signal). GAIL can be
used with or without environment rewards, and works well when there are a limited
number of demonstrations.
* To help bootstrap reinforcement learning, you can enable
[pretraining](Training-PPO.md#optional-pretraining-using-demonstrations)
on the PPO trainer, in addition to using a small GAIL reward signal.
* To train an agent to exactly mimic demonstrations, you can use the
[Behavioral Cloning](Training-Behavioral-Cloning.md) trainer. Behavioral Cloning can be
used with demonstrations (in-editor), and learns very quickly. However, it usually is ineffective
on more complex environments without a large number of demonstrations.
* Behavioral Cloning (BC) trains the Agent's neural network to exactly mimic the actions
shown in a set of demonstrations.
[Behavioral Cloning](Training-PPO.md#optional-behavioral-cloning-using-demonstrations)
can be enabled on the PPO or SAC trainer. Behavioral Cloning tends to work best when
there are a lot of demonstrations, or in conjunction with GAIL and/or an extrinsic reward.

### How to Choose

If you want to help your agents learn (especially with environments that have sparse rewards)
using pre-recorded demonstrations, you can generally enable both GAIL and Pretraining.
using pre-recorded demonstrations, you can generally enable both GAIL and Behavioral Cloning
at low strengths in addition to having an extrinsic reward.
An example of this is provided for the Pyramids example environment under
`PyramidsLearning` in `config/gail_config.yaml`.
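
A sketch of what that combined setup can look like. The `behavioral_cloning` values mirror this PR's `config/gail_config.yaml`; the reward-signal strengths and the Curiosity signal (shown in the comparison above) are illustrative:

```
PyramidsLearning:
    trainer: ppo
    behavioral_cloning:
        demo_path: ./demos/ExpertPyramid.demo
        strength: 0.5
        steps: 10000
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
        curiosity:
            strength: 0.02
            gamma: 0.99
        gail:
            strength: 0.01
            demo_path: ./demos/ExpertPyramid.demo
```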

If you want to train purely from demonstrations, GAIL is generally the preferred approach, especially
if you have few (<10) episodes of demonstrations. An example of this is provided for the Crawler example
environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.

If you have plenty of demonstrations and/or a very simple environment, Offline Behavioral Cloning can be effective and quick. However, it cannot be combined with RL.
If you want to train purely from demonstrations, using GAIL and Behavioral Cloning _without_ an
extrinsic reward signal is the preferred approach. An example of this is provided for the Crawler
example environment under `CrawlerStaticLearning` in `config/gail_config.yaml`.

## Recording Demonstrations

2 changes: 1 addition & 1 deletion docs/Training-ML-Agents.md
@@ -196,7 +196,7 @@ example environments are included in the provided config file.
| normalize | Whether to automatically normalize observations. | PPO, SAC |
| num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO |
| num_layers | The number of hidden layers in the neural network. | PPO, SAC, BC |
| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO, SAC |
| behavioral_cloning | Use demonstrations to bootstrap the policy neural network. See [Behavioral Cloning Using Demonstrations](Training-PPO.md#optional-behavioral-cloning-using-demonstrations). | PPO, SAC |
| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC, BC |
| save_replay_buffer | Saves the replay buffer when exiting training, and loads it on resume. | SAC |
| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC |
16 changes: 7 additions & 9 deletions docs/Training-PPO.md
@@ -224,24 +224,22 @@ the agent will need to remember in order to successfully complete the task.

Typical Range: `64` - `512`

## (Optional) Pretraining Using Demonstrations
## (Optional) Behavioral Cloning Using Demonstrations

In some cases, you might want to bootstrap the agent's policy using behavior recorded
from a player. This can help guide the agent towards the reward. Pretraining adds
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds
training operations that mimic a demonstration rather than attempting to maximize reward.
It is essentially equivalent to running [behavioral cloning](Training-Behavioral-Cloning.md)
in-line with PPO.

To use pretraining, add a `pretraining` section to the trainer_config. For instance:
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance:

```
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 0.5
steps: 10000
```

Below are the available hyperparameters for pretraining.
Below are the available hyperparameters for BC.

### Strength

@@ -258,10 +258,10 @@ See the [imitation learning guide](Training-Imitation-Learning.md) for more on `

### Steps

During pretraining, it is often desirable to stop using demonstrations after the agent has
During BC, it is often desirable to stop using demonstrations after the agent has
"seen" rewards, and allow it to optimize past the available demonstrations and/or generalize
outside of the provided demonstrations. `steps` corresponds to the training steps over which
pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set
BC is active. The learning rate of BC will anneal over the steps. Set

Contributor: The learning rate of the cloning behavioral cloning will anneal over the steps.

Contributor Author: Changed to the abbreviation BC

the steps to 0 for constant imitation over the entire training run.
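
For example, a `behavioral_cloning` block that keeps imitation active for the whole run might look like this sketch (the path and strength are illustrative):

```
behavioral_cloning:
    demo_path: ./demos/ExpertPyramid.demo
    strength: 0.5
    steps: 0    # 0 keeps BC at constant strength for the entire training run
```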

### (Optional) Batch Size
16 changes: 7 additions & 9 deletions docs/Training-SAC.md
@@ -239,24 +239,22 @@ default.

Default: `False`

## (Optional) Pretraining Using Demonstrations
## (Optional) Behavioral Cloning Using Demonstrations

In some cases, you might want to bootstrap the agent's policy using behavior recorded
from a player. This can help guide the agent towards the reward. Pretraining adds
from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds
training operations that mimic a demonstration rather than attempting to maximize reward.
It is essentially equivalent to running [behavioral cloning](./Training-Behavioral-Cloning.md)
in-line with SAC.

To use pretraining, add a `pretraining` section to the trainer_config. For instance:
To use BC, add a `behavioral_cloning` section to the trainer_config. For instance:

```
pretraining:
behavioral_cloning:
demo_path: ./demos/ExpertPyramid.demo
strength: 0.5
steps: 10000
```

Below are the available hyperparameters for pretraining.
Below are the available hyperparameters for BC.

Contributor: Should Behavioral Cloning be abbreviated? I think it would be better to keep it consistent in the docs. What do you think?

Contributor Author: I abbreviated it since we abbreviate PPO, SAC, and GAIL. I think the 1st mention on any particular page should be full with the abbreviation in parenthesis, then abbreviated - what do you think?

Contributor: Looks good.

### Strength

@@ -273,10 +271,10 @@ See the [imitation learning guide](Training-Imitation-Learning.md) for more on `

### Steps

During pretraining, it is often desirable to stop using demonstrations after the agent has
During BC, it is often desirable to stop using demonstrations after the agent has
"seen" rewards, and allow it to optimize past the available demonstrations and/or generalize
outside of the provided demonstrations. `steps` corresponds to the training steps over which
pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set
BC is active. The learning rate of BC will anneal over the steps. Set
the steps to 0 for constant imitation over the entire training run.

### (Optional) Batch Size
Empty file.
107 changes: 0 additions & 107 deletions ml-agents/mlagents/trainers/bc/models.py

This file was deleted.
