diff --git a/config/gail_config.yaml b/config/gail_config.yaml index da91c9e694..63e508ae9a 100644 --- a/config/gail_config.yaml +++ b/config/gail_config.yaml @@ -31,7 +31,7 @@ Pyramids: beta: 1.0e-2 max_steps: 5.0e5 num_epoch: 3 - pretraining: + behavioral_cloning: demo_path: ./demos/ExpertPyramid.demo strength: 0.5 steps: 10000 @@ -59,6 +59,10 @@ CrawlerStatic: summary_freq: 3000 num_layers: 3 hidden_units: 512 + behavioral_cloning: + demo_path: ./demos/ExpertCrawlerSta.demo + strength: 0.5 + steps: 5000 reward_signals: gail: strength: 1.0 diff --git a/docs/Migrating.md b/docs/Migrating.md index 5f42bb5801..3f15f48aa8 100644 --- a/docs/Migrating.md +++ b/docs/Migrating.md @@ -16,6 +16,8 @@ The versions can be found in * `reset()` on the Low-Level Python API no longer takes a `config` argument. `UnityEnvironment` no longer has a `reset_parameters` field. To modify float properties in the environment, you must use a `FloatPropertiesChannel`. For more information, refer to the [Low Level Python API documentation](Python-API.md) * The Academy no longer has a `Training Configuration` nor `Inference Configuration` field in the inspector. To modify the configuration from the Low-Level Python API, use an `EngineConfigurationChannel`. To modify it during training, use the new command line arguments `--width`, `--height`, `--quality-level`, `--time-scale` and `--target-frame-rate` in `mlagents-learn`. * The Academy no longer has a `Default Reset Parameters` field in the inspector. The Academy class no longer has a `ResetParameters`. To access shared float properties with Python, use the new `FloatProperties` field on the Academy. +* Offline Behavioral Cloning has been removed. To learn from demonstrations, use the GAIL and +Behavioral Cloning features with either PPO or SAC. See [Imitation Learning](Training-Imitation-Learning.md) for more information. ### Steps to Migrate * If you had a custom `Training Configuration` in the Academy inspector, you will need to pass your custom configuration at every training run using the new command line arguments `--width`, `--height`, `--quality-level`, `--time-scale` and `--target-frame-rate`. diff --git a/docs/Reward-Signals.md b/docs/Reward-Signals.md index 7adbcf4861..04149556c7 100644 --- a/docs/Reward-Signals.md +++ b/docs/Reward-Signals.md @@ -135,11 +135,10 @@ discriminator is trained to better distinguish between demonstrations and agent In this way, while the agent gets better and better at mimicing the demonstrations, the discriminator keeps getting stricter and stricter and the agent must try harder to "fool" it. -This approach, when compared to [Behavioral Cloning](Training-Behavioral-Cloning.md), requires -far fewer demonstrations to be provided. After all, we are still learning a policy that happens -to be similar to the demonstrations, not directly copying the behavior of the demonstrations. It -is especially effective when combined with an Extrinsic signal. However, the GAIL reward signal can -also be used independently to purely learn from demonstrations. +This approach learns a _policy_ that produces states and actions similar to the demonstrations, +requiring fewer demonstrations than direct cloning of the actions. In addition to learning purely +from demonstrations, the GAIL reward signal can be mixed with an extrinsic reward signal to guide +the learning process. Using GAIL requires recorded demonstrations from your Unity environment. 
See the [imitation learning guide](Training-Imitation-Learning.md) to learn more about recording demonstrations. diff --git a/docs/Training-Behavioral-Cloning.md b/docs/Training-Behavioral-Cloning.md deleted file mode 100644 index bdca019eae..0000000000 --- a/docs/Training-Behavioral-Cloning.md +++ /dev/null @@ -1,30 +0,0 @@ -# Training with Behavioral Cloning - -There are a variety of possible imitation learning algorithms which can -be used, the simplest one of them is Behavioral Cloning. It works by collecting -demonstrations from a teacher, and then simply uses them to directly learn a -policy, in the same way the supervised learning for image classification -or other traditional Machine Learning tasks work. - -## Offline Training - -With offline behavioral cloning, we can use demonstrations (`.demo` files) -generated using the `Demonstration Recorder` as the dataset used to train a behavior. - -1. Choose an agent you would like to learn to imitate some set of demonstrations. -2. Record a set of demonstration using the `Demonstration Recorder` (see [here](Training-Imitation-Learning.md)). - For illustrative purposes we will refer to this file as `AgentRecording.demo`. -3. Build the scene(make sure the Agent is not using its heuristic). -4. Open the `config/offline_bc_config.yaml` file. -5. Modify the `demo_path` parameter in the file to reference the path to the - demonstration file recorded in step 2. In our case this is: - `./UnitySDK/Assets/Demonstrations/AgentRecording.demo` -6. Launch `mlagent-learn`, providing `./config/offline_bc_config.yaml` - as the config parameter, and include the `--run-id` and `--train` as usual. - Provide your environment as the `--env` parameter if it has been compiled - as standalone, or omit to train in the editor. -7. (Optional) Observe training performance using TensorBoard. - -This will use the demonstration file to train a neural network driven agent -to directly imitate the actions provided in the demonstration. The environment -will launch and be used for evaluating the agent's performance during training. diff --git a/docs/Training-Imitation-Learning.md b/docs/Training-Imitation-Learning.md index b1475747b9..b31c26541e 100644 --- a/docs/Training-Imitation-Learning.md +++ b/docs/Training-Imitation-Learning.md @@ -19,7 +19,7 @@ imitation learning combined with reinforcement learning can dramatically reduce the time the agent takes to solve the environment. For instance, on the [Pyramids environment](Learning-Environment-Examples.md#pyramids), using 6 episodes of demonstrations can reduce training steps by more than 4 times. -See PreTraining + GAIL + Curiosity + RL below. +See Behavioral Cloning + GAIL + Curiosity + RL below.

-The ML-Agents toolkit provides several ways to learn from demonstrations. +The ML-Agents toolkit provides two features that enable your agent to learn from demonstrations. +In most scenarios, you should combine these two features. -* To train using GAIL (Generative Adversarial Imitation Learning) you can add the +* GAIL (Generative Adversarial Imitation Learning) uses an adversarial approach to + reward your Agent for behaving similarly to a set of demonstrations. To use GAIL, you can add the [GAIL reward signal](Reward-Signals.md#gail-reward-signal). GAIL can be used with or without environment rewards, and works well when there are a limited number of demonstrations. -* To help bootstrap reinforcement learning, you can enable - [pretraining](Training-PPO.md#optional-pretraining-using-demonstrations) - on the PPO trainer, in addition to using a small GAIL reward signal. -* To train an agent to exactly mimic demonstrations, you can use the - [Behavioral Cloning](Training-Behavioral-Cloning.md) trainer. Behavioral Cloning can be - used with demonstrations (in-editor), and learns very quickly. However, it usually is ineffective - on more complex environments without a large number of demonstrations. +* Behavioral Cloning (BC) trains the Agent's neural network to exactly mimic the actions + shown in a set of demonstrations. + [The BC feature](Training-PPO.md#optional-behavioral-cloning-using-demonstrations) + can be enabled on the PPO or SAC trainer. BC tends to work best when + there are a lot of demonstrations, or in conjunction with GAIL and/or an extrinsic reward. ### How to Choose If you want to help your agents learn (especially with environments that have sparse rewards) -using pre-recorded demonstrations, you can generally enable both GAIL and Pretraining. +using pre-recorded demonstrations, you can generally enable both GAIL and Behavioral Cloning +at low strengths in addition to having an extrinsic reward. An example of this is provided for the Pyramids example environment under `PyramidsLearning` in `config/gail_config.yaml`. -If you want to train purely from demonstrations, GAIL is generally the preferred approach, especially -if you have few (<10) episodes of demonstrations. An example of this is provided for the Crawler example -environment under `CrawlerStaticLearning` in `config/gail_config.yaml`. - -If you have plenty of demonstrations and/or a very simple environment, Offline Behavioral Cloning can be effective and quick. However, it cannot be combined with RL. +If you want to train purely from demonstrations, combining GAIL and BC _without_ an +extrinsic reward signal is the preferred approach. An example of this is provided for the Crawler +example environment under `CrawlerStaticLearning` in `config/gail_config.yaml`. ## Recording Demonstrations It is possible to record demonstrations of agent behavior from the Unity Editor, and save them as assets. These demonstrations contain information on the observations, actions, and rewards for a given agent during the recording session. -They can be managed from the Editor, as well as used for training with Offline -Behavioral Cloning and GAIL. +They can be managed from the Editor, as well as used for training with BC and GAIL. In order to record demonstrations from an agent, add the `Demonstration Recorder` component to a GameObject in the scene which contains an `Agent` component. 
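For reference, the combined setup described above (an extrinsic reward plus low-strength GAIL and BC) can be sketched as a single trainer section. This is a minimal illustration only: the section name and `demo_path` follow the `Pyramids` entry in `config/gail_config.yaml`, while the `extrinsic`/`curiosity`/`gail` strengths shown here are assumed placeholder values rather than settings taken from this change.

```yaml
Pyramids:
  trainer: ppo
  # Low-strength BC nudges the policy toward the demonstrated actions early on.
  behavioral_cloning:
    demo_path: ./demos/ExpertPyramid.demo
    strength: 0.5
    steps: 10000
  reward_signals:
    # The environment (extrinsic) reward remains the primary training signal.
    extrinsic:
      strength: 1.0
      gamma: 0.99
    # Curiosity helps with the sparse reward in Pyramids.
    curiosity:
      strength: 0.02
      gamma: 0.99
    # A small GAIL reward keeps behavior close to the demonstrations.
    gail:
      strength: 0.01
      gamma: 0.99
      demo_path: ./demos/ExpertPyramid.demo
```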
diff --git a/docs/Training-ML-Agents.md b/docs/Training-ML-Agents.md index 8c6ce458fc..823f7dc021 100644 --- a/docs/Training-ML-Agents.md +++ b/docs/Training-ML-Agents.md @@ -175,9 +175,9 @@ The training config files `config/trainer_config.yaml`, `config/sac_trainer_conf `config/gail_config.yaml` and `config/offline_bc_config.yaml` specifies the training method, the hyperparameters, and a few additional values to use when training with Proximal Policy Optimization(PPO), Soft Actor-Critic(SAC), GAIL (Generative Adversarial Imitation Learning) -with PPO, and online and offline Behavioral Cloning(BC)/Imitation. These files are divided +with PPO/SAC, and Behavioral Cloning(BC)/Imitation with PPO/SAC. These files are divided into sections. The **default** section defines the default values for all the available -training with PPO, SAC, GAIL (with PPO), and offline BC. These files are divided into sections. +training with PPO, SAC, GAIL (with PPO), and BC. These files are divided into sections. The **default** section defines the default values for all the available settings. You can also add new sections to override these defaults to train specific Behaviors. Name each of these override sections after the appropriate `Behavior Name`. Sections for the @@ -185,35 +185,34 @@ example environments are included in the provided config file. | **Setting** | **Description** | **Applies To Trainer\*** | | :------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------- | -| batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC, BC | -| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model. | BC | +| batch_size | The number of experiences in each iteration of gradient descent. | PPO, SAC | +| batches_per_epoch | In imitation learning, the number of batches of training examples to collect before training the model. | | | beta | The strength of entropy regularization. | PPO | -| demo_path | For offline imitation learning, the file path of the recorded demonstration file | (offline)BC | | buffer_size | The number of experiences to collect before updating the policy model. In SAC, the max size of the experience buffer. | PPO, SAC | | buffer_init_steps | The number of experiences to collect into the buffer before updating the policy model. | SAC | | epsilon | Influences how rapidly the policy can evolve during training. | PPO | -| hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC, BC | +| hidden_units | The number of units in the hidden layers of the neural network. | PPO, SAC | | init_entcoef | How much the agent should explore in the beginning of training. | SAC | | lambd | The regularization parameter. | PPO | -| learning_rate | The initial learning rate for gradient descent. | PPO, SAC, BC | -| max_steps | The maximum number of simulation steps to run during a training session. | PPO, SAC, BC | -| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC | +| learning_rate | The initial learning rate for gradient descent. | PPO, SAC | +| max_steps | The maximum number of simulation steps to run during a training session. 
| PPO, SAC | +| memory_size | The size of the memory an agent must keep. Used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC | | normalize | Whether to automatically normalize observations. | PPO, SAC | | num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO | -| num_layers | The number of hidden layers in the neural network. | PPO, SAC, BC | -| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO, SAC | -| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC, BC | +| num_layers | The number of hidden layers in the neural network. | PPO, SAC | +| behavioral_cloning | Use demonstrations to bootstrap the policy neural network. See [Behavioral Cloning Using Demonstrations](Training-PPO.md#optional-behavioral-cloning-using-demonstrations). | PPO, SAC | +| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Reward-Signals.md) for configuration options. | PPO, SAC | | save_replay_buffer | Saves the replay buffer when exiting training, and loads it on resume. | SAC | -| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC | -| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC, BC | +| sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC | +| summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, SAC | | tau | How aggressively to update the target network used for bootstrapping value estimation in SAC. | SAC | -| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC, (online)BC | -| trainer | The type of training to perform: "ppo", "sac", "offline_bc" or "online_bc". | PPO, SAC, BC | +| time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, SAC | +| trainer | The type of training to perform: "ppo" or "sac". | PPO, SAC | | train_interval | How often to update the agent. | SAC | | num_update | Number of mini-batches to update the agent with during each update. | SAC | -| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, SAC, BC | +| use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). 
| PPO, SAC | -\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation) +\*PPO = Proximal Policy Optimization, SAC = Soft Actor-Critic, BC = Behavioral Cloning (Imitation), GAIL = Generative Adversarial Imitation Learning For specific advice on setting hyperparameters based on the type of training you are conducting, see: diff --git a/docs/Training-PPO.md b/docs/Training-PPO.md index 0a0ec6b61d..d45dd9fa34 100644 --- a/docs/Training-PPO.md +++ b/docs/Training-PPO.md @@ -224,29 +224,27 @@ the agent will need to remember in order to successfully complete the task. Typical Range: `64` - `512` -## (Optional) Pretraining Using Demonstrations +## (Optional) Behavioral Cloning Using Demonstrations In some cases, you might want to bootstrap the agent's policy using behavior recorded -from a player. This can help guide the agent towards the reward. Pretraining adds +from a player. This can help guide the agent towards the reward. Behavioral Cloning (BC) adds training operations that mimic a demonstration rather than attempting to maximize reward. -It is essentially equivalent to running [behavioral cloning](Training-Behavioral-Cloning.md) -in-line with PPO. -To use pretraining, add a `pretraining` section to the trainer_config. For instance: +To use BC, add a `behavioral_cloning` section to the trainer_config. 
For instance: ``` - pretraining: + behavioral_cloning: demo_path: ./demos/ExpertPyramid.demo strength: 0.5 steps: 10000 ``` -Below are the available hyperparameters for pretraining. +Below are the available hyperparameters for BC. ### Strength `strength` corresponds to the learning rate of the imitation relative to the learning -rate of SAC, and roughly corresponds to how strongly we allow the behavioral cloning +rate of SAC, and roughly corresponds to how strongly we allow BC to influence the policy. Typical Range: `0.1` - `0.5` @@ -273,10 +271,10 @@ See the [imitation learning guide](Training-Imitation-Learning.md) for more on ` ### Steps -During pretraining, it is often desirable to stop using demonstrations after the agent has +During BC, it is often desirable to stop using demonstrations after the agent has "seen" rewards, and allow it to optimize past the available demonstrations and/or generalize outside of the provided demonstrations. `steps` corresponds to the training steps over which -pretraining is active. The learning rate of the pretrainer will anneal over the steps. Set +BC is active. The learning rate of BC will anneal over the steps. Set the steps to 0 for constant imitation over the entire training run. ### (Optional) Batch Size diff --git a/docs/images/mlagents-ImitationAndRL.png b/docs/images/mlagents-ImitationAndRL.png index ffa61d1b11..614f7473b9 100644 Binary files a/docs/images/mlagents-ImitationAndRL.png and b/docs/images/mlagents-ImitationAndRL.png differ diff --git a/ml-agents/mlagents/trainers/bc/__init__.py b/ml-agents/mlagents/trainers/bc/__init__.py deleted file mode 100644 index e69de29bb2..0000000000 diff --git a/ml-agents/mlagents/trainers/bc/models.py b/ml-agents/mlagents/trainers/bc/models.py deleted file mode 100644 index 7619972bac..0000000000 --- a/ml-agents/mlagents/trainers/bc/models.py +++ /dev/null @@ -1,107 +0,0 @@ -from mlagents.tf_utils import tf - -from mlagents.trainers.models import LearningModel - - -class BehavioralCloningModel(LearningModel): - def __init__( - self, - brain, - h_size=128, - lr=1e-4, - n_layers=2, - m_size=128, - normalize=False, - use_recurrent=False, - seed=0, - ): - LearningModel.__init__(self, m_size, normalize, use_recurrent, brain, seed) - num_streams = 1 - hidden_streams = self.create_observation_streams(num_streams, h_size, n_layers) - hidden = hidden_streams[0] - self.dropout_rate = tf.placeholder( - dtype=tf.float32, shape=[], name="dropout_rate" - ) - hidden_reg = tf.layers.dropout(hidden, self.dropout_rate) - if self.use_recurrent: - tf.Variable( - self.m_size, name="memory_size", trainable=False, dtype=tf.int32 - ) - self.memory_in = tf.placeholder( - shape=[None, self.m_size], dtype=tf.float32, name="recurrent_in" - ) - hidden_reg, self.memory_out = self.create_recurrent_encoder( - hidden_reg, self.memory_in, self.sequence_length - ) - self.memory_out = tf.identity(self.memory_out, name="recurrent_out") - - if brain.vector_action_space_type == "discrete": - policy_branches = [] - for size in self.act_size: - policy_branches.append( - tf.layers.dense( - hidden_reg, - size, - activation=None, - use_bias=False, - kernel_initializer=tf.initializers.variance_scaling(0.01), - ) - ) - self.action_probs = tf.concat( - [tf.nn.softmax(branch) for branch in policy_branches], - axis=1, - name="action_probs", - ) - self.action_masks = tf.placeholder( - shape=[None, sum(self.act_size)], dtype=tf.float32, name="action_masks" - ) - self.sample_action_float, _, normalized_logits = self.create_discrete_action_masking_layer( - 
tf.concat(policy_branches, axis=1), self.action_masks, self.act_size - ) - tf.identity(normalized_logits, name="action") - self.sample_action = tf.cast(self.sample_action_float, tf.int32) - self.true_action = tf.placeholder( - shape=[None, len(policy_branches)], - dtype=tf.int32, - name="teacher_action", - ) - self.action_oh = tf.concat( - [ - tf.one_hot(self.true_action[:, i], self.act_size[i]) - for i in range(len(self.act_size)) - ], - axis=1, - ) - self.loss = tf.reduce_sum( - -tf.log(self.action_probs + 1e-10) * self.action_oh - ) - self.action_percent = tf.reduce_mean( - tf.cast( - tf.equal( - tf.cast(tf.argmax(self.action_probs, axis=1), tf.int32), - self.sample_action, - ), - tf.float32, - ) - ) - else: - self.policy = tf.layers.dense( - hidden_reg, - self.act_size[0], - activation=None, - use_bias=False, - name="pre_action", - kernel_initializer=tf.initializers.variance_scaling(0.01), - ) - self.clipped_sample_action = tf.clip_by_value(self.policy, -1, 1) - self.sample_action = tf.identity(self.clipped_sample_action, name="action") - self.true_action = tf.placeholder( - shape=[None, self.act_size[0]], dtype=tf.float32, name="teacher_action" - ) - self.clipped_true_action = tf.clip_by_value(self.true_action, -1, 1) - self.loss = tf.reduce_sum( - tf.squared_difference(self.clipped_true_action, self.sample_action) - ) - - optimizer = tf.train.AdamOptimizer(learning_rate=lr) - self.update = optimizer.minimize(self.loss) diff --git a/ml-agents/mlagents/trainers/bc/offline_trainer.py b/ml-agents/mlagents/trainers/bc/offline_trainer.py deleted file mode 100644 index 011e6acd0e..0000000000 --- a/ml-agents/mlagents/trainers/bc/offline_trainer.py +++ /dev/null @@ -1,66 +0,0 @@ -# # Unity ML-Agents Toolkit -# ## ML-Agent Learning (Behavioral Cloning) -# Contains an implementation of Behavioral Cloning Algorithm - -import logging -import copy - -from mlagents.trainers.bc.trainer import BCTrainer -from mlagents.trainers.demo_loader import demo_to_buffer -from mlagents.trainers.trainer import UnityTrainerException - -logger = logging.getLogger("mlagents.trainers") - - -class OfflineBCTrainer(BCTrainer): - """The OfflineBCTrainer is an implementation of Offline Behavioral Cloning.""" - - def __init__(self, brain, trainer_parameters, training, load, seed, run_id): - """ - Responsible for collecting experiences and training PPO model. - :param trainer_parameters: The parameters for the trainer (dictionary). - :param training: Whether the trainer is set for training. - :param load: Whether the model should be loaded. 
- :param seed: The seed the model will be initialized with - :param run_id: The identifier of the current run - """ - super(OfflineBCTrainer, self).__init__( - brain, trainer_parameters, training, load, seed, run_id - ) - - self.param_keys = [ - "batch_size", - "summary_freq", - "max_steps", - "batches_per_epoch", - "use_recurrent", - "hidden_units", - "learning_rate", - "num_layers", - "sequence_length", - "memory_size", - "model_path", - "demo_path", - ] - - self.check_param_keys() - self.batches_per_epoch = trainer_parameters["batches_per_epoch"] - self.n_sequences = max( - int(trainer_parameters["batch_size"] / self.policy.sequence_length), 1 - ) - - brain_params, self.demonstration_buffer = demo_to_buffer( - trainer_parameters["demo_path"], self.policy.sequence_length - ) - - policy_brain = copy.deepcopy(brain.__dict__) - expert_brain = copy.deepcopy(brain_params.__dict__) - policy_brain.pop("brain_name") - expert_brain.pop("brain_name") - policy_brain.pop("vector_action_descriptions") - expert_brain.pop("vector_action_descriptions") - if expert_brain != policy_brain: - raise UnityTrainerException( - "The provided demonstration is not compatible with the " - "brain being used for performance evaluation." - ) diff --git a/ml-agents/mlagents/trainers/bc/policy.py b/ml-agents/mlagents/trainers/bc/policy.py deleted file mode 100644 index cbebe72dca..0000000000 --- a/ml-agents/mlagents/trainers/bc/policy.py +++ /dev/null @@ -1,97 +0,0 @@ -import logging - -import numpy as np -from mlagents.trainers.bc.models import BehavioralCloningModel -from mlagents.trainers.tf_policy import TFPolicy - -logger = logging.getLogger("mlagents.trainers") - - -class BCPolicy(TFPolicy): - def __init__(self, seed, brain, trainer_parameters, load): - """ - :param seed: Random seed. - :param brain: Assigned Brain object. - :param trainer_parameters: Defined training parameters. - :param load: Whether a pre-trained model will be loaded or a new one created. - """ - super(BCPolicy, self).__init__(seed, brain, trainer_parameters) - - with self.graph.as_default(): - with self.graph.as_default(): - self.model = BehavioralCloningModel( - h_size=int(trainer_parameters["hidden_units"]), - lr=float(trainer_parameters["learning_rate"]), - n_layers=int(trainer_parameters["num_layers"]), - m_size=self.m_size, - normalize=False, - use_recurrent=trainer_parameters["use_recurrent"], - brain=brain, - seed=seed, - ) - - if load: - self._load_graph() - else: - self._initialize_graph() - - self.inference_dict = {"action": self.model.sample_action} - self.update_dict = { - "policy_loss": self.model.loss, - "update_batch": self.model.update, - } - if self.use_recurrent: - self.inference_dict["memory_out"] = self.model.memory_out - - self.evaluate_rate = 1.0 - self.update_rate = 0.5 - - def evaluate(self, brain_info): - """ - Evaluates policy for the agent experiences provided. - :param brain_info: BrainInfo input to network. - :return: Results of evaluation. - """ - feed_dict = { - self.model.dropout_rate: self.evaluate_rate, - self.model.sequence_length: 1, - } - - feed_dict = self.fill_eval_dict(feed_dict, brain_info) - if self.use_recurrent: - feed_dict[self.model.memory_in] = self.retrieve_memories(brain_info.agents) - run_out = self._execute_model(feed_dict, self.inference_dict) - return run_out - - def update(self, mini_batch, num_sequences): - """ - Performs update on model. - :param mini_batch: Batch of experiences. - :param num_sequences: Number of sequences to process. - :return: Results of update. 
- """ - - feed_dict = { - self.model.dropout_rate: self.update_rate, - self.model.batch_size: num_sequences, - self.model.sequence_length: self.sequence_length, - } - if self.use_continuous_act: - feed_dict[self.model.true_action] = mini_batch["actions"] - else: - feed_dict[self.model.true_action] = mini_batch["actions"] - feed_dict[self.model.action_masks] = np.ones( - (num_sequences, sum(self.brain.vector_action_space_size)), - dtype=np.float32, - ) - if self.use_vec_obs: - feed_dict[self.model.vector_in] = mini_batch["vector_obs"] - for i, _ in enumerate(self.model.visual_in): - visual_obs = mini_batch["visual_obs%d" % i] - feed_dict[self.model.visual_in[i]] = visual_obs - if self.use_recurrent: - feed_dict[self.model.memory_in] = np.zeros( - [num_sequences, self.m_size], dtype=np.float32 - ) - run_out = self._execute_model(feed_dict, self.update_dict) - return run_out diff --git a/ml-agents/mlagents/trainers/bc/trainer.py b/ml-agents/mlagents/trainers/bc/trainer.py deleted file mode 100644 index 198dd3447e..0000000000 --- a/ml-agents/mlagents/trainers/bc/trainer.py +++ /dev/null @@ -1,141 +0,0 @@ -# # Unity ML-Agents Toolkit -# ## ML-Agent Learning (Behavioral Cloning) -# Contains an implementation of Behavioral Cloning Algorithm - -import logging - -import numpy as np - -from mlagents.trainers.brain import BrainInfo -from mlagents.trainers.action_info import ActionInfoOutputs -from mlagents.trainers.bc.policy import BCPolicy -from mlagents.trainers.buffer import AgentBuffer -from mlagents.trainers.agent_processor import ProcessingBuffer -from mlagents.trainers.trainer import Trainer - -logger = logging.getLogger("mlagents.trainers") - - -class BCTrainer(Trainer): - """The BCTrainer is an implementation of Behavioral Cloning.""" - - def __init__(self, brain, trainer_parameters, training, load, seed, run_id): - """ - Responsible for collecting experiences and training PPO model. - :param trainer_parameters: The parameters for the trainer (dictionary). - :param training: Whether the trainer is set for training. - :param load: Whether the model should be loaded. - :param seed: The seed the model will be initialized with - :param run_id: The identifier of the current run - """ - super(BCTrainer, self).__init__(brain, trainer_parameters, training, run_id) - self.policy = BCPolicy(seed, brain, trainer_parameters, load) - self.n_sequences = 1 - self.cumulative_rewards = {} - self.episode_steps = {} - self.stats = { - "Losses/Cloning Loss": [], - "Environment/Episode Length": [], - "Environment/Cumulative Reward": [], - } - - self.batches_per_epoch = trainer_parameters["batches_per_epoch"] - - self.demonstration_buffer = AgentBuffer() - self.evaluation_buffer = ProcessingBuffer() - - def add_experiences( - self, - curr_info: BrainInfo, - next_info: BrainInfo, - take_action_outputs: ActionInfoOutputs, - ) -> None: - """ - Adds experiences to each agent's experience history. - :param curr_info: Current BrainInfo - :param next_info: Next BrainInfo - :param take_action_outputs: The outputs of the take action method. - """ - - # Used to collect information about student performance. 
- for agent_id in curr_info.agents: - self.evaluation_buffer[agent_id].last_brain_info = curr_info - - for agent_id in next_info.agents: - stored_next_info = self.evaluation_buffer[agent_id].last_brain_info - if stored_next_info is None: - continue - else: - next_idx = next_info.agents.index(agent_id) - if agent_id not in self.cumulative_rewards: - self.cumulative_rewards[agent_id] = 0 - self.cumulative_rewards[agent_id] += next_info.rewards[next_idx] - if not next_info.local_done[next_idx]: - if agent_id not in self.episode_steps: - self.episode_steps[agent_id] = 0 - self.episode_steps[agent_id] += 1 - - def process_experiences( - self, current_info: BrainInfo, next_info: BrainInfo - ) -> None: - """ - Checks agent histories for processing condition, and processes them as necessary. - Processing involves calculating value and advantage targets for model updating step. - :param current_info: Current BrainInfo - :param next_info: Next BrainInfo - """ - for l in range(len(next_info.agents)): - if next_info.local_done[l]: - agent_id = next_info.agents[l] - self.stats["Environment/Cumulative Reward"].append( - self.cumulative_rewards.get(agent_id, 0) - ) - self.stats["Environment/Episode Length"].append( - self.episode_steps.get(agent_id, 0) - ) - self.reward_buffer.appendleft(self.cumulative_rewards.get(agent_id, 0)) - self.cumulative_rewards[agent_id] = 0 - self.episode_steps[agent_id] = 0 - - def end_episode(self): - """ - A signal that the Episode has ended. The buffer must be reset. - Get only called when the academy resets. - """ - self.evaluation_buffer.reset_local_buffers() - for agent_id in self.cumulative_rewards: - self.cumulative_rewards[agent_id] = 0 - for agent_id in self.episode_steps: - self.episode_steps[agent_id] = 0 - - def is_ready_update(self): - """ - Returns whether or not the trainer has enough elements to run update model - :return: A boolean corresponding to whether or not update_model() can be run - """ - return self.demonstration_buffer.num_experiences > self.n_sequences - - def update_policy(self): - """ - Updates the policy. - """ - self.demonstration_buffer.shuffle(self.policy.sequence_length) - batch_losses = [] - batch_size = self.n_sequences * self.policy.sequence_length - # We either divide the entire buffer into num_batches batches, or limit the number - # of batches to batches_per_epoch. - num_batches = min( - self.demonstration_buffer.num_experiences // batch_size, - self.batches_per_epoch, - ) - - for i in range(0, num_batches * batch_size, batch_size): - update_buffer = self.demonstration_buffer - mini_batch = update_buffer.make_mini_batch(i, i + batch_size) - run_out = self.policy.update(mini_batch, self.n_sequences) - loss = run_out["policy_loss"] - batch_losses.append(loss) - if len(batch_losses) > 0: - self.stats["Losses/Cloning Loss"].append(np.mean(batch_losses)) - else: - self.stats["Losses/Cloning Loss"].append(0) diff --git a/ml-agents/mlagents/trainers/components/bc/module.py b/ml-agents/mlagents/trainers/components/bc/module.py index 2b8aea254b..a25b9e3d8d 100644 --- a/ml-agents/mlagents/trainers/components/bc/module.py +++ b/ml-agents/mlagents/trainers/components/bc/module.py @@ -22,7 +22,7 @@ def __init__( samples_per_update: int = 0, ): """ - A BC trainer that can be used inline with RL, especially for pretraining. + A BC trainer that can be used inline with RL. :param policy: The policy of the learning model :param policy_learning_rate: The initial Learning Rate of the policy. Used to set an appropriate learning rate for the pretrainer. 
@@ -33,7 +33,7 @@ def __init__( :param demo_path: The path to the demonstration file. :param batch_size: The batch size to use during BC training. :param num_epoch: Number of epochs to train for during each update. - :param samples_per_update: Maximum number of samples to train on during each pretraining update. + :param samples_per_update: Maximum number of samples to train on during each BC update. """ self.policy = policy self.current_lr = policy_learning_rate * strength @@ -60,7 +60,7 @@ def __init__( @staticmethod def check_config(config_dict: Dict[str, Any]) -> None: """ - Check the pretraining config for the required keys. + Check the behavioral_cloning config for the required keys. :param config_dict: Pretraining section of trainer_config """ param_keys = ["strength", "demo_path", "steps"] diff --git a/ml-agents/mlagents/trainers/ppo/policy.py b/ml-agents/mlagents/trainers/ppo/policy.py index 7978a1b8dd..1b508e9a4a 100644 --- a/ml-agents/mlagents/trainers/ppo/policy.py +++ b/ml-agents/mlagents/trainers/ppo/policy.py @@ -52,14 +52,14 @@ def __init__( with self.graph.as_default(): self.bc_module: Optional[BCModule] = None # Create pretrainer if needed - if "pretraining" in trainer_params: - BCModule.check_config(trainer_params["pretraining"]) + if "behavioral_cloning" in trainer_params: + BCModule.check_config(trainer_params["behavioral_cloning"]) self.bc_module = BCModule( self, policy_learning_rate=trainer_params["learning_rate"], default_batch_size=trainer_params["batch_size"], - default_num_epoch=trainer_params["num_epoch"], - **trainer_params["pretraining"], + default_num_epoch=3, + **trainer_params["behavioral_cloning"], ) if load: diff --git a/ml-agents/mlagents/trainers/sac/models.py b/ml-agents/mlagents/trainers/sac/models.py index 1e2911d8e8..b1886a6f6d 100644 --- a/ml-agents/mlagents/trainers/sac/models.py +++ b/ml-agents/mlagents/trainers/sac/models.py @@ -788,7 +788,7 @@ def create_inputs_and_outputs(self): self.dones_holder = tf.placeholder( shape=[None], dtype=tf.float32, name="dones_holder" ) - # This is just a dummy to get pretraining to work. PPO has this but SAC doesn't. + # This is just a dummy to get BC to work. PPO has this but SAC doesn't. # TODO: Proper input and output specs for models self.epsilon = tf.placeholder( shape=[None, self.act_size[0]], dtype=tf.float32, name="epsilon" diff --git a/ml-agents/mlagents/trainers/sac/policy.py b/ml-agents/mlagents/trainers/sac/policy.py index c23be111ec..104ddbbee6 100644 --- a/ml-agents/mlagents/trainers/sac/policy.py +++ b/ml-agents/mlagents/trainers/sac/policy.py @@ -59,18 +59,18 @@ def __init__( with self.graph.as_default(): # Create pretrainer if needed self.bc_module: Optional[BCModule] = None - if "pretraining" in trainer_params: - BCModule.check_config(trainer_params["pretraining"]) + if "behavioral_cloning" in trainer_params: + BCModule.check_config(trainer_params["behavioral_cloning"]) self.bc_module = BCModule( self, policy_learning_rate=trainer_params["learning_rate"], default_batch_size=trainer_params["batch_size"], default_num_epoch=1, samples_per_update=trainer_params["batch_size"], - **trainer_params["pretraining"], + **trainer_params["behavioral_cloning"], ) # SAC-specific setting - we don't want to do a whole epoch each update! - if "samples_per_update" in trainer_params["pretraining"]: + if "samples_per_update" in trainer_params["behavioral_cloning"]: logger.warning( "Pretraining: Samples Per Update is not a valid setting for SAC." 
) diff --git a/ml-agents/mlagents/trainers/tests/test_barracuda_converter.py b/ml-agents/mlagents/trainers/tests/test_barracuda_converter.py index c86ab7ccea..7c22ca8d6a 100644 --- a/ml-agents/mlagents/trainers/tests/test_barracuda_converter.py +++ b/ml-agents/mlagents/trainers/tests/test_barracuda_converter.py @@ -1,10 +1,7 @@ import os -import yaml -import pytest import tempfile import mlagents.trainers.tensorflow_to_barracuda as tf2bc -from mlagents.trainers.tests.test_bc import create_bc_trainer def test_barracuda_converter(): @@ -27,29 +24,3 @@ def test_barracuda_converter(): # cleanup os.remove(tmpfile) - - -@pytest.fixture -def bc_dummy_config(): - return yaml.safe_load( - """ - hidden_units: 32 - learning_rate: 3.0e-4 - num_layers: 1 - use_recurrent: false - sequence_length: 32 - memory_size: 64 - batches_per_epoch: 1 - batch_size: 64 - summary_freq: 2000 - max_steps: 4000 - """ - ) - - -@pytest.mark.parametrize("use_lstm", [False, True], ids=["nolstm", "lstm"]) -@pytest.mark.parametrize("use_discrete", [True, False], ids=["disc", "cont"]) -def test_bc_export(bc_dummy_config, use_lstm, use_discrete): - bc_dummy_config["use_recurrent"] = use_lstm - trainer, env = create_bc_trainer(bc_dummy_config, use_discrete) - trainer.export_model() diff --git a/ml-agents/mlagents/trainers/tests/test_bc.py b/ml-agents/mlagents/trainers/tests/test_bc.py deleted file mode 100644 index 219b651ba4..0000000000 --- a/ml-agents/mlagents/trainers/tests/test_bc.py +++ /dev/null @@ -1,236 +0,0 @@ -import unittest.mock as mock -import pytest -import os - -import numpy as np -from mlagents.tf_utils import tf -import yaml - -from mlagents.trainers.bc.models import BehavioralCloningModel -import mlagents.trainers.tests.mock_brain as mb -from mlagents.trainers.bc.policy import BCPolicy -from mlagents.trainers.bc.offline_trainer import BCTrainer - -from mlagents.envs.mock_communicator import MockCommunicator -from mlagents.trainers.tests.mock_brain import make_brain_parameters -from mlagents.envs.environment import UnityEnvironment -from mlagents.trainers.brain_conversion_utils import ( - step_result_to_brain_info, - group_spec_to_brain_parameters, -) - - -@pytest.fixture -def dummy_config(): - return yaml.safe_load( - """ - hidden_units: 32 - learning_rate: 3.0e-4 - num_layers: 1 - use_recurrent: false - sequence_length: 32 - memory_size: 32 - batches_per_epoch: 100 # Force code to use all possible batches - batch_size: 32 - summary_freq: 2000 - max_steps: 4000 - """ - ) - - -def create_bc_trainer(dummy_config, is_discrete=False, use_recurrent=False): - mock_env = mock.Mock() - if is_discrete: - mock_brain = mb.create_mock_pushblock_brain() - mock_braininfo = mb.create_mock_braininfo( - num_agents=12, num_vector_observations=70 - ) - else: - mock_brain = mb.create_mock_3dball_brain() - mock_braininfo = mb.create_mock_braininfo( - num_agents=12, num_vector_observations=8 - ) - mb.setup_mock_unityenvironment(mock_env, mock_brain, mock_braininfo) - env = mock_env() - - trainer_parameters = dummy_config - trainer_parameters["summary_path"] = "tmp" - trainer_parameters["model_path"] = "tmp" - trainer_parameters["demo_path"] = ( - os.path.dirname(os.path.abspath(__file__)) + "/test.demo" - ) - trainer_parameters["use_recurrent"] = use_recurrent - trainer = BCTrainer( - mock_brain, trainer_parameters, training=True, load=False, seed=0, run_id=0 - ) - trainer.demonstration_buffer = mb.simulate_rollout(env, trainer.policy, 100) - return trainer, env - - -@pytest.mark.parametrize("use_recurrent", [True, False]) -def 
test_bc_trainer_step(dummy_config, use_recurrent): - trainer, env = create_bc_trainer(dummy_config, use_recurrent=use_recurrent) - # Test get_step - assert trainer.get_step == 0 - # Test update policy - trainer.update_policy() - assert len(trainer.stats["Losses/Cloning Loss"]) > 0 - # Test increment step - trainer.increment_step(1) - assert trainer.step == 1 - - -def test_bc_trainer_add_proc_experiences(dummy_config): - trainer, env = create_bc_trainer(dummy_config) - # Test add_experiences - returned_braininfo = env.step() - brain_name = "Ball3DBrain" - trainer.add_experiences( - returned_braininfo[brain_name], returned_braininfo[brain_name], {} - ) # Take action outputs is not used - for agent_id in returned_braininfo[brain_name].agents: - assert trainer.evaluation_buffer[agent_id].last_brain_info is not None - assert trainer.episode_steps[agent_id] > 0 - assert trainer.cumulative_rewards[agent_id] > 0 - # Test process_experiences by setting done - returned_braininfo[brain_name].local_done = 12 * [True] - trainer.process_experiences( - returned_braininfo[brain_name], returned_braininfo[brain_name] - ) - for agent_id in returned_braininfo[brain_name].agents: - assert trainer.episode_steps[agent_id] == 0 - assert trainer.cumulative_rewards[agent_id] == 0 - - -def test_bc_trainer_end_episode(dummy_config): - trainer, env = create_bc_trainer(dummy_config) - returned_braininfo = env.step() - brain_name = "Ball3DBrain" - trainer.add_experiences( - returned_braininfo[brain_name], returned_braininfo[brain_name], {} - ) # Take action outputs is not used - trainer.process_experiences( - returned_braininfo[brain_name], returned_braininfo[brain_name] - ) - # Should set everything to 0 - trainer.end_episode() - for agent_id in returned_braininfo[brain_name].agents: - assert trainer.episode_steps[agent_id] == 0 - assert trainer.cumulative_rewards[agent_id] == 0 - - -@mock.patch("mlagents.envs.environment.UnityEnvironment.executable_launcher") -@mock.patch("mlagents.envs.environment.UnityEnvironment.get_communicator") -def test_bc_policy_evaluate(mock_communicator, mock_launcher, dummy_config): - tf.reset_default_graph() - mock_communicator.return_value = MockCommunicator( - discrete_action=False, visual_inputs=0 - ) - env = UnityEnvironment(" ") - env.reset() - brain_name = env.get_agent_groups()[0] - brain_info = step_result_to_brain_info( - env.get_step_result(brain_name), env.get_agent_group_spec(brain_name) - ) - brain_params = group_spec_to_brain_parameters( - brain_name, env.get_agent_group_spec(brain_name) - ) - - trainer_parameters = dummy_config - model_path = brain_name - trainer_parameters["model_path"] = model_path - trainer_parameters["keep_checkpoints"] = 3 - policy = BCPolicy(0, brain_params, trainer_parameters, False) - run_out = policy.evaluate(brain_info) - assert run_out["action"].shape == (3, 2) - - env.close() - - -def test_cc_bc_model(): - tf.reset_default_graph() - with tf.Session() as sess: - with tf.variable_scope("FakeGraphScope"): - model = BehavioralCloningModel( - make_brain_parameters(discrete_action=False, visual_inputs=0) - ) - init = tf.global_variables_initializer() - sess.run(init) - - run_list = [model.sample_action, model.policy] - feed_dict = { - model.batch_size: 2, - model.sequence_length: 1, - model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]), - } - sess.run(run_list, feed_dict=feed_dict) - # env.close() - - -def test_dc_bc_model(): - tf.reset_default_graph() - with tf.Session() as sess: - with tf.variable_scope("FakeGraphScope"): - model = 
BehavioralCloningModel( - make_brain_parameters(discrete_action=True, visual_inputs=0) - ) - init = tf.global_variables_initializer() - sess.run(init) - - run_list = [model.sample_action, model.action_probs] - feed_dict = { - model.batch_size: 2, - model.dropout_rate: 1.0, - model.sequence_length: 1, - model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]), - model.action_masks: np.ones([2, 2], dtype=np.float32), - } - sess.run(run_list, feed_dict=feed_dict) - - -def test_visual_dc_bc_model(): - tf.reset_default_graph() - with tf.Session() as sess: - with tf.variable_scope("FakeGraphScope"): - model = BehavioralCloningModel( - make_brain_parameters(discrete_action=True, visual_inputs=2) - ) - init = tf.global_variables_initializer() - sess.run(init) - - run_list = [model.sample_action, model.action_probs] - feed_dict = { - model.batch_size: 2, - model.dropout_rate: 1.0, - model.sequence_length: 1, - model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]), - model.visual_in[0]: np.ones([2, 40, 30, 3], dtype=np.float32), - model.visual_in[1]: np.ones([2, 40, 30, 3], dtype=np.float32), - model.action_masks: np.ones([2, 2], dtype=np.float32), - } - sess.run(run_list, feed_dict=feed_dict) - - -def test_visual_cc_bc_model(): - tf.reset_default_graph() - with tf.Session() as sess: - with tf.variable_scope("FakeGraphScope"): - model = BehavioralCloningModel( - make_brain_parameters(discrete_action=False, visual_inputs=2) - ) - init = tf.global_variables_initializer() - sess.run(init) - - run_list = [model.sample_action, model.policy] - feed_dict = { - model.batch_size: 2, - model.sequence_length: 1, - model.vector_in: np.array([[1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5]]), - model.visual_in[0]: np.ones([2, 40, 30, 3], dtype=np.float32), - model.visual_in[1]: np.ones([2, 40, 30, 3], dtype=np.float32), - } - sess.run(run_list, feed_dict=feed_dict) - - -if __name__ == "__main__": - pytest.main() diff --git a/ml-agents/mlagents/trainers/tests/test_bcmodule.py b/ml-agents/mlagents/trainers/tests/test_bcmodule.py index 3a26fd9f56..71071c9233 100644 --- a/ml-agents/mlagents/trainers/tests/test_bcmodule.py +++ b/ml-agents/mlagents/trainers/tests/test_bcmodule.py @@ -30,7 +30,7 @@ def ppo_dummy_config(): summary_freq: 1000 use_recurrent: false memory_size: 8 - pretraining: + behavioral_cloning: demo_path: ./demos/ExpertPyramid.demo strength: 1.0 steps: 10000000 @@ -64,7 +64,7 @@ def sac_dummy_config(): tau: 0.005 use_recurrent: false vis_encode_type: simple - pretraining: + behavioral_cloning: demo_path: ./demos/ExpertPyramid.demo strength: 1.0 steps: 10000000 @@ -87,7 +87,7 @@ def create_policy_with_bc_mock( trainer_config["model_path"] = model_path trainer_config["keep_checkpoints"] = 3 trainer_config["use_recurrent"] = use_rnn - trainer_config["pretraining"]["demo_path"] = ( + trainer_config["behavioral_cloning"]["demo_path"] = ( os.path.dirname(os.path.abspath(__file__)) + "/" + demo_file ) @@ -108,12 +108,12 @@ def test_bcmodule_defaults(mock_env): env, policy = create_policy_with_bc_mock( mock_env, mock_brain, trainer_config, False, "test.demo" ) - assert policy.bc_module.num_epoch == trainer_config["num_epoch"] + assert policy.bc_module.num_epoch == 3 assert policy.bc_module.batch_size == trainer_config["batch_size"] env.close() # Assign strange values and see if it overrides properly - trainer_config["pretraining"]["num_epoch"] = 100 - trainer_config["pretraining"]["batch_size"] = 10000 + trainer_config["behavioral_cloning"]["num_epoch"] = 100 + 
trainer_config["behavioral_cloning"]["batch_size"] = 10000 env, policy = create_policy_with_bc_mock( mock_env, mock_brain, trainer_config, False, "test.demo" ) @@ -145,7 +145,7 @@ def test_bcmodule_update(mock_env, trainer_config): @mock.patch("mlagents.envs.environment.UnityEnvironment") def test_bcmodule_constant_lr_update(mock_env, trainer_config): mock_brain = mb.create_mock_3dball_brain() - trainer_config["pretraining"]["steps"] = 0 + trainer_config["behavioral_cloning"]["steps"] = 0 env, policy = create_policy_with_bc_mock( mock_env, mock_brain, trainer_config, False, "test.demo" ) diff --git a/ml-agents/mlagents/trainers/tests/test_reward_signals.py b/ml-agents/mlagents/trainers/tests/test_reward_signals.py index 460ae57972..3b7639dc2f 100644 --- a/ml-agents/mlagents/trainers/tests/test_reward_signals.py +++ b/ml-agents/mlagents/trainers/tests/test_reward_signals.py @@ -58,7 +58,7 @@ def sac_dummy_config(): tau: 0.005 use_recurrent: false vis_encode_type: simple - pretraining: + behavioral_cloning: demo_path: ./demos/ExpertPyramid.demo strength: 1.0 steps: 10000000 diff --git a/ml-agents/mlagents/trainers/tests/test_trainer_util.py b/ml-agents/mlagents/trainers/tests/test_trainer_util.py index c763080d45..03fd4f397c 100644 --- a/ml-agents/mlagents/trainers/tests/test_trainer_util.py +++ b/ml-agents/mlagents/trainers/tests/test_trainer_util.py @@ -1,6 +1,5 @@ import pytest import yaml -import os import io from unittest.mock import patch @@ -8,7 +7,6 @@ from mlagents.trainers.trainer_util import load_config, _load_config from mlagents.trainers.trainer_metrics import TrainerMetrics from mlagents.trainers.ppo.trainer import PPOTrainer -from mlagents.trainers.bc.offline_trainer import OfflineBCTrainer from mlagents.envs.exception import UnityEnvironmentException @@ -43,42 +41,8 @@ def dummy_config(): @pytest.fixture -def dummy_offline_bc_config(): - return yaml.safe_load( - """ - default: - trainer: offline_bc - demo_path: """ - + os.path.dirname(os.path.abspath(__file__)) - + """/test.demo - batches_per_epoch: 16 - batch_size: 32 - beta: 5.0e-3 - buffer_size: 512 - epsilon: 0.2 - gamma: 0.99 - hidden_units: 128 - lambd: 0.95 - learning_rate: 3.0e-4 - max_steps: 5.0e4 - normalize: true - num_epoch: 5 - num_layers: 2 - time_horizon: 64 - sequence_length: 64 - summary_freq: 1000 - use_recurrent: false - memory_size: 8 - use_curiosity: false - curiosity_strength: 0.0 - curiosity_enc_size: 1 - """ - ) - - -@pytest.fixture -def dummy_offline_bc_config_with_override(): - base = dummy_offline_bc_config() +def dummy_config_with_override(): + base = dummy_config() base["testbrain"] = {} base["testbrain"]["normalize"] = False return base @@ -122,8 +86,9 @@ def test_initialize_trainer_parameters_override_defaults(BrainParametersMock): train_model = True load_model = False seed = 11 + expected_reward_buff_cap = 1 - base_config = dummy_offline_bc_config_with_override() + base_config = dummy_config_with_override() expected_config = base_config["default"] expected_config["summary_path"] = summaries_dir + f"/{run_id}_testbrain" expected_config["model_path"] = model_path + "/testbrain" @@ -136,15 +101,28 @@ def test_initialize_trainer_parameters_override_defaults(BrainParametersMock): BrainParametersMock.return_value.brain_name = "testbrain" external_brains = {"testbrain": brain_params_mock} - def mock_constructor(self, brain, trainer_parameters, training, load, seed, run_id): + def mock_constructor( + self, + brain, + reward_buff_cap, + trainer_parameters, + training, + load, + seed, + run_id, + 
multi_gpu, + ): + self.trainer_metrics = TrainerMetrics("", "") assert brain == brain_params_mock assert trainer_parameters == expected_config + assert reward_buff_cap == expected_reward_buff_cap assert training == train_model assert load == load_model assert seed == seed assert run_id == run_id + assert multi_gpu == multi_gpu - with patch.object(OfflineBCTrainer, "__init__", mock_constructor): + with patch.object(PPOTrainer, "__init__", mock_constructor): trainer_factory = trainer_util.TrainerFactory( trainer_config=base_config, summaries_dir=summaries_dir, @@ -159,7 +137,7 @@ def mock_constructor(self, brain, trainer_parameters, training, load, seed, run_ for _, brain_parameters in external_brains.items(): trainers["testbrain"] = trainer_factory.generate(brain_parameters) assert "testbrain" in trainers - assert isinstance(trainers["testbrain"], OfflineBCTrainer) + assert isinstance(trainers["testbrain"], PPOTrainer) @patch("mlagents.trainers.brain.BrainParameters") diff --git a/ml-agents/mlagents/trainers/trainer_util.py b/ml-agents/mlagents/trainers/trainer_util.py index 06dac52512..6850d201c3 100644 --- a/ml-agents/mlagents/trainers/trainer_util.py +++ b/ml-agents/mlagents/trainers/trainer_util.py @@ -3,11 +3,10 @@ from mlagents.trainers.meta_curriculum import MetaCurriculum from mlagents.envs.exception import UnityEnvironmentException -from mlagents.trainers.trainer import Trainer +from mlagents.trainers.trainer import Trainer, UnityTrainerException from mlagents.trainers.brain import BrainParameters from mlagents.trainers.ppo.trainer import PPOTrainer from mlagents.trainers.sac.trainer import SACTrainer -from mlagents.trainers.bc.offline_trainer import OfflineBCTrainer class TrainerFactory: @@ -98,8 +97,10 @@ def initialize_trainer( trainer = None if trainer_parameters["trainer"] == "offline_bc": - trainer = OfflineBCTrainer( - brain_parameters, trainer_parameters, train_model, load_model, seed, run_id + raise UnityTrainerException( + "The offline_bc trainer has been removed. To train with demonstrations, " + "please use a PPO or SAC trainer with the GAIL Reward Signal and/or the " + "Behavioral Cloning feature enabled." ) elif trainer_parameters["trainer"] == "ppo": trainer = PPOTrainer(
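Because this change renames the `pretraining` section to `behavioral_cloning` (see the Migrating.md entry above), existing custom trainer configs need the same rename. A minimal sketch of a one-off helper is shown below; it assumes PyYAML is installed and that the config follows the usual layout (a `default` section plus per-Behavior sections), and the script itself is hypothetical rather than part of this change.

```python
# Hypothetical one-off helper: rename the old "pretraining" section to
# "behavioral_cloning" in an existing trainer config file, in place.
import yaml


def migrate_pretraining_sections(config_path: str) -> None:
    with open(config_path) as f:
        config = yaml.safe_load(f) or {}
    for behavior_name, settings in config.items():
        # Top-level entries ("default" or a Behavior Name) map to dicts of settings.
        if isinstance(settings, dict) and "pretraining" in settings:
            settings["behavioral_cloning"] = settings.pop("pretraining")
    with open(config_path, "w") as f:
        yaml.safe_dump(config, f, default_flow_style=False)


if __name__ == "__main__":
    migrate_pretraining_sections("config/trainer_config.yaml")
```

Note that `yaml.safe_dump` drops comments and reorders keys, so for a small config a quick hand edit is just as reasonable.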